Gerrit/Operations
Restarting
Restarting Gerrit is a last resort. We used to have to restart it often due to a misunderstanding of some of its behavior as well as a nasty memory leak. As of February 2021, a restart should not be conducted without a thorough review of the current behavior and without taking traces first. Those traces are a great help in identifying a potential bug or a configuration setting that needs tuning.
If, after all investigations, you are out of ideas or really have no other option, you can restart Gerrit through systemd:
sudo systemctl restart gerrit
The service takes a few seconds to come back, during which any end-user operation will error out (some Puppet catalogues, CI, developers).
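Because the service needs a few seconds to come back, it can help to poll until Gerrit answers again rather than guess. The following is a minimal sketch, not an established procedure: wait_until_up is a hypothetical helper, and the probe shown in the trailing comment (Gerrit's REST config/server/version endpoint under the /r/ prefix) is an assumption about what is convenient to check.

```shell
#!/usr/bin/env bash
# Poll a probe command until it succeeds or the attempts run out.
# Usage: wait_until_up <attempts> <delay_seconds> <probe command...>
# (hypothetical helper, not part of any Gerrit or WMF tooling)
wait_until_up() {
    local attempts=$1
    local delay=$2
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        # The probe's own output is discarded; only its exit status matters.
        if "$@" >/dev/null 2>&1; then
            echo "up after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "still down after $attempts attempt(s)" >&2
    return 1
}

# Example, run by hand after the restart (probe URL is an assumption):
#   sudo systemctl restart gerrit
#   wait_until_up 30 2 curl -fsS https://gerrit.wikimedia.org/r/config/server/version
```

The probe is passed as a command so the same loop can wrap an SSH check or a systemctl is-active call instead.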
Monitoring
JavaMelody monitors the state of the Gerrit JVM. Its metrics are collected by Prometheus from https://gerrit.wikimedia.org/r/monitoring?prometheus
- JavaMelody in Gerrit (only accessible to logged-in Gerrit Administrators/Gerrit Managers)
- Grafana "Gerrit" folder
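When looking at the raw Prometheus output rather than Grafana, a small filter makes it easier to follow a single metric. This is a sketch only: prom_metric is a hypothetical helper, it assumes the standard Prometheus text exposition format, and the endpoint may only be reachable from hosts that are allowed to scrape it.

```shell
#!/usr/bin/env bash
# Print the sample lines of one metric from Prometheus text-format input
# on stdin. Usage: prom_metric <metric_name>
# (hypothetical helper; metric names depend on what JavaMelody exports)
prom_metric() {
    awk -v name="$1" '
        /^#/ { next }                  # skip HELP/TYPE comment lines
        {
            metric = $1
            sub(/[{].*/, "", metric)   # drop any {label="..."} suffix
            if (metric == name) print $0
        }
    '
}

# Example, with a placeholder metric name:
#   curl -fsS "https://gerrit.wikimedia.org/r/monitoring?prometheus" | prom_metric <metric_name>
```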
Important Graphs
- Gerrit overview dashboard
- Memory Usage
- GC Timing
- Grafana GC Timing
- Garbage collection metrics. Times in the hundreds of milliseconds, rather than in the tens of milliseconds, can indicate a problem (running low on memory)
- Active Threads
- Gerrit Active Threads
- Grafana Active Threads
- Usually there are fewer than 20 active threads at any given time. More than that typically means that you should take a Thread Dump and restart.
Gerrit metrics
On top of the JavaMelody data, Gerrit has internal metrics.
For users having the View Metrics capabilities, various internal Gerrit metrics can be retrieved via:
Retrieving them requires authentication. This complements gerrit show-caches.
We use the metrics-reporter-prometheus plugin, which exposes these metrics for collection by Prometheus. They also appear on the JavaMelody MBeans page under the metrics branch.
See Gerrit Grafana dashboards folder.
Logs
Logs are consumed by our logging infrastructure and are also available in the Apache access logs.
Main logs
Logs are available on the Gerrit servers at /var/log/gerrit/. There are a number of log files:
- gerrit.log: The main log file; it shows stack traces and errors.
- gerrit.json: Like gerrit.log but not really human readable. Used for sending structured logs to logstash.
- sshd_log: Log of sshd events.
- gc_log: Logs for git gc, not the JVM garbage collection (those logs are available in /srv/gerrit/jvmlogs).
- plugin_log: Information about plugins being loaded and reloaded. This information is also in gerrit.log.
HTTP Logs
Gerrit sits behind Apache. Access and error logs are both in /var/log/apache2:
- gerrit.wikimedia.org.https.access.log
- gerrit.wikimedia.org.https.error.log
In the logging infrastructure, you can find Gerrit's logs by searching with type:log4j.
JVM
Thread Dump
A thread dump is often useful in troubleshooting. To capture one, use jstack. This command should be safe to run at any time while Gerrit is running, and is run frequently:
sudo -u gerrit2 jstack -l "$(pgrep -u gerrit2 java)" > "/srv/gerrit/jstack-$(date +%Y-%m-%d-%H%M%S).dump"
It's often useful to upload the resulting file to https://fastthread.io/ to detect problems.
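For a quick look before uploading anything, the thread states in a dump can be tallied locally. This is a rough sketch: thread_state_summary is a hypothetical helper, and it relies on the java.lang.Thread.State: lines that jstack prints for each Java thread.

```shell
#!/usr/bin/env bash
# Count threads per state in a jstack dump file.
# Usage: thread_state_summary <dump file>
# Prints one "<STATE> <count>" line per state, sorted by state name.
thread_state_summary() {
    awk '/java\.lang\.Thread\.State:/ { count[$2]++ }
         END { for (state in count) print state, count[state] }' "$1" | sort
}

# Example (file name is a placeholder):
#   thread_state_summary /srv/gerrit/jstack-<timestamp>.dump
```

A large share of BLOCKED threads, or a RUNNABLE count far above the usual level, is a hint about where to look in the full dump; it does not replace reading the stack traces.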
Java trace
Display a summary of garbage collection statistics every 1000 ms:
sudo -u gerrit2 /usr/lib/jvm/java-8-openjdk-amd64/bin/jstat -gcutil "$(pgrep -u gerrit2 java)" 1000
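The jstat output can be piped through a small filter to highlight samples where the old generation is nearly full. This is only a sketch: flag_old_gen is a hypothetical helper, the threshold is an arbitrary value chosen by the operator, and the column is located through the O header that jstat -gcutil prints on its first line.

```shell
#!/usr/bin/env bash
# Read `jstat -gcutil` output on stdin and print a line for every sample
# whose old generation utilisation (column "O") exceeds the limit.
# Usage: flag_old_gen <limit in percent>
flag_old_gen() {
    awk -v limit="$1" '
        # First line is the header: remember which column is "O".
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "O") col = i; next }
        col && $col + 0 > limit + 0 {
            printf "old gen at %s%% (limit %s%%)\n", $col, limit
        }
    '
}

# Example (the 90 percent limit is an arbitrary assumption):
#   sudo -u gerrit2 /usr/lib/jvm/java-8-openjdk-amd64/bin/jstat -gcutil "$(pgrep -u gerrit2 java)" 1000 | flag_old_gen 90
```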
Java heap usage
Requires openjdk-X-dbg for the debugging symbols
sudo /usr/lib/jvm/java-8-openjdk-amd64/bin/jmap -heap "$( pgrep -u gerrit2 java)"
Access h2 account_patch_reviews
On copies of account_patch_reviews* files:
java -cp h2-1.3.176.jar org.h2.tools.Shell -url jdbc:h2:/home/hashar/account_patch_reviews
Which gives you an SQL prompt:
sql> show columns from ACCOUNT_PATCH_REVIEWS
...> ;
FIELD        | TYPE         | NULL | KEY | DEFAULT
ACCOUNT_ID   | INTEGER(10)  | NO   | PRI | 0
CHANGE_ID    | INTEGER(10)  | NO   | PRI | 0
PATCH_SET_ID | INTEGER(10)  | NO   | PRI | 0
FILE_NAME    | VARCHAR(255) | NO   | PRI | ''
(4 rows, 16 ms)
Blocking misbehaving bots / IPs
If necessary either IP addresses or user agents that are misbehaving can be blocked by making edits to modules/profile/templates/gerrit/apache.erb in the operations/puppet public git repository and merging them.
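Before blocking anything, it helps to see which clients generate the most requests. This is a sketch under a stated assumption: top_clients is a hypothetical helper, and it assumes the client address is the first whitespace-separated field of each access log line, which may not match the log format configured on the Gerrit hosts.

```shell
#!/usr/bin/env bash
# Print "<client> <request count>" for the busiest clients in an access
# log, most requests first.
# Usage: top_clients <access log> [how many, default 10]
# Assumption: the client address is the first field on each line.
top_clients() {
    local logfile=$1
    local limit=${2:-10}
    awk '{ hits[$1]++ } END { for (ip in hits) print ip, hits[ip] }' "$logfile" |
        sort -k2,2nr -k1,1 | head -n "$limit"
}

# Example (reading the log will likely need elevated privileges):
#   top_clients /var/log/apache2/gerrit.wikimedia.org.https.access.log 20
```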
Throttling IPs
Since September 2024, throttling is implemented through the profile::firewall::nftables_throttling keys in Hiera.
You can also observe data related to this on the Grafana dashboard for Gerrit.
Killing ssh connections
It can happen that a user reaches the limit of 8 concurrent ssh connections and then says they can't push to Gerrit anymore over ssh.
A member of the Gerrit administrators group can run commands like these to kill connections for them:
ssh yourusernameongerrit@gerrit.wikimedia.org -p 29418 gerrit show-connections
ssh yourusernameongerrit@gerrit.wikimedia.org -p 29418 gerrit close-connection <connection ID>
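When a user holds several stale sessions, the two commands can be chained. This is a sketch, not a supported tool: both helpers are hypothetical, and connection_ids_for_user assumes the show-connections table lists Session, Start, Idle, User and Remote Host in that order, so check the output by eye before closing anything.

```shell
#!/usr/bin/env bash
GERRIT_SSH_PORT=29418

# Print the session IDs whose User column equals the given account name.
# Reads `gerrit show-connections` output on stdin.
# Assumption: the user name is the 4th whitespace-separated column.
connection_ids_for_user() {
    awk -v user="$1" '$4 == user { print $1 }'
}

# Close each listed connection through the Gerrit SSH command.
# Usage: close_connections <admin@host> <connection ID>...
close_connections() {
    local target=$1
    local id
    shift
    for id in "$@"; do
        ssh -p "$GERRIT_SSH_PORT" "$target" gerrit close-connection "$id"
    done
}

# Example (account names are placeholders):
#   ssh -p 29418 <admin>@gerrit.wikimedia.org gerrit show-connections | connection_ids_for_user <user>
#   close_connections <admin>@gerrit.wikimedia.org <connection ID> <connection ID>
```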
Switch over
Based on https://phabricator.wikimedia.org/T326368
If you're migrating on gerrit2003, please see https://phabricator.wikimedia.org/T338470#10506291
Try to ensure that the user running Gerrit also owns the synced data.
The old version of this section is still visible here: https://wikitech.wikimedia.org/w/index.php?title=Gerrit/Operations&oldid=2235783
Schedule and Announce Downtime
Announce the scheduled downtime for Gerrit services.
Prepare patches
The following Hiera keys have to be updated, along with the DNS configuration:
- profile::gerrit::active_host
- profile::gerrit::replica_hosts
- profile::gerrit::lfs_sync_dest
Data Synchronization on src-gerrit
rsync -avpPz --delete /var/lib/gerrit2/review_site/ rsync://dst-gerrit.wikimedia.org/gerrit-var-lib/
rsync -avpPz --delete --exclude='*.hprof' /srv/gerrit/ rsync://dst-gerrit.wikimedia.org/gerrit-data/
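Because --delete removes files on the destination, it can be worth previewing the transfer first. This is a minimal sketch: sync_gerrit_data is a hypothetical wrapper around the two commands of this step, dst-gerrit.wikimedia.org is the same placeholder host name used throughout this section, and --dry-run is rsync's standard option for listing changes without applying them.

```shell
#!/usr/bin/env bash
# Run the two syncs of this step. Pass --dry-run as the only argument to
# preview what would be transferred or deleted without changing anything.
# Usage: sync_gerrit_data [--dry-run]
sync_gerrit_data() {
    local extra=${1:-}
    # ${extra:+"$extra"} expands to nothing when no option was given.
    rsync -avpPz --delete ${extra:+"$extra"} \
        /var/lib/gerrit2/review_site/ rsync://dst-gerrit.wikimedia.org/gerrit-var-lib/ &&
    rsync -avpPz --delete ${extra:+"$extra"} --exclude='*.hprof' \
        /srv/gerrit/ rsync://dst-gerrit.wikimedia.org/gerrit-data/
}

# Example:
#   sync_gerrit_data --dry-run   # preview
#   sync_gerrit_data             # real transfer
```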
Begin Scheduled Downtime
Announce the start of the scheduled downtime on IRC #wikimedia-operations and on Slack #engineering-all.
Downtime Management
On cumin1002, execute:
sudo cookbook sre.hosts.downtime -r 'maintenance' -D 30 src-gerrit.wikimedia.org && sudo cookbook sre.hosts.downtime -r 'maintenance' -H 1 dst-gerrit.wikimedia.org
Manually schedule downtime for the checks connected to the virtual server "gerrit.wikimedia.org" on icinga.wikimedia.org.
Disable Puppet and Stop Gerrit on dst-gerrit
Execute the following commands on dst-gerrit:
sudo disable-puppet 'gerrit maintenance' && sudo systemctl stop gerrit
Merge the DNS changes that remove gerrit-new and switch the IP of gerrit.wikimedia.org.
Update DNS
Run authdns-update on ns0.wikimedia.org and review the diff, but do not commit yet.
Stop Gerrit on src-gerrit
Execute the following commands on src-gerrit:
sudo disable-puppet 'gerrit maintenance' && sudo systemctl stop gerrit
Repeat Data Synchronization on src-gerrit
Repeat the rsync commands from the "Data Synchronization on src-gerrit" step.
Start Gerrit on dst-gerrit
Execute the following command on dst-gerrit:
sudo systemctl start gerrit
Finalize DNS Update
Confirm the DNS change and merge it.
Testing
Wait for 5 minutes, then test:
- HTTPS access via browser: https://gerrit.wikimedia.org
- SSH access:
ssh yourusernameongerrit@gerrit.wikimedia.org -p 29418
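The checks above can be complemented by confirming that DNS already points at the new host and that the SSH daemon answers a trivial command. This is a sketch only: resolves_to and gerrit_ssh_version are hypothetical helpers, the expected address must be supplied by the operator, and gerrit version is the standard Gerrit SSH command that prints the server version.

```shell
#!/usr/bin/env bash
# Succeed when <name> currently resolves to <expected address>.
# Usage: resolves_to <name> <expected address>
resolves_to() {
    # -F fixed string, -x whole-line match, -q quiet (exit status only)
    dig +short "$1" | grep -Fxq -- "$2"
}

# Print the Gerrit version reported over SSH, a cheap end-to-end check.
# Usage: gerrit_ssh_version <user@host>
gerrit_ssh_version() {
    ssh -p 29418 "$1" gerrit version
}

# Example (the address is a placeholder for the new host's IP):
#   resolves_to gerrit.wikimedia.org <new IP> && echo "DNS switched"
#   gerrit_ssh_version yourusernameongerrit@gerrit.wikimedia.org
```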
Announce Downtime Conclusion
Announce that the downtime is over.
Post-Migration Tasks
- Ensure src-gerrit has Puppet disabled and/or services are masked.
- Determine the grace period duration.
- Decommission the old host as per T336427.