Gerrit/Operations

From Wikitech

Restarting

Restarting Gerrit is a last resort. We used to restart it often, both because some of its behavior was misunderstood and because of a nasty memory leak. As of February 2021, a restart should not be carried out without first thoroughly reviewing the current behavior and taking traces. These traces help greatly in identifying a potential bug or a configuration setting that needs tuning.

If all investigation leaves you without a lead, or you really have no other option, you can restart Gerrit through systemd: sudo systemctl restart gerrit.

The service takes a few seconds to come back, during which any end-user operation errors out (some Puppet catalogues, CI, developers).
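Because the service is unavailable for a few seconds, it can help to wait until it answers again before telling users it is back. The following is a minimal sketch, not part of the documented procedure: wait_until is a hypothetical helper, and the probe URL in the usage comment (the version endpoint of Gerrit's REST API under the /r/ prefix) is an assumption.

```shell
# wait_until ATTEMPTS DELAY COMMAND...
# Hypothetical helper: re-run COMMAND until it succeeds, at most ATTEMPTS
# times, sleeping DELAY seconds between tries.
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt failed.
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Usage (assumed probe URL): poll for up to about a minute after the restart.
# wait_until 30 2 curl -fsS https://gerrit.wikimedia.org/r/config/server/version
```

Any command that fails while Gerrit is down can be used as the probe; curl with -f is convenient because it exits non-zero on HTTP errors.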

Monitoring

JavaMelody monitors the state of the Gerrit JVM. Its metrics are collected by Prometheus from https://gerrit.wikimedia.org/r/monitoring?prometheus
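To check by hand what Prometheus scrapes, the text exposition can be filtered by metric name prefix. This is a minimal sketch: prom_filter is a hypothetical helper, and the javamelody_ prefix in the usage comment is an assumption about how the metrics are named, not something confirmed on this page. Fetching the URL may also require appropriate access.

```shell
# prom_filter PREFIX
# Hypothetical helper: read Prometheus text exposition on stdin and print the
# sample lines whose metric name starts with PREFIX. Lines beginning with "#"
# (HELP and TYPE metadata) are skipped.
prom_filter() {
    awk -v prefix="$1" '
        /^#/ { next }                       # comments and metadata
        index($0, prefix) == 1 { print }    # literal prefix match, no regex
    '
}

# Usage (assumed prefix):
# curl -s "https://gerrit.wikimedia.org/r/monitoring?prometheus" | prom_filter javamelody_
```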

Important Graphs

Gerrit metrics

On top of the JavaMelody data, Gerrit has internal metrics.

For users with the View Metrics capability, various internal Gerrit metrics can be retrieved via:

This requires authentication. It complements gerrit show-caches.

We use the metrics-reporter-prometheus plugin, which exposes these metrics for collection by Prometheus. They are also visible on the JavaMelody MBeans page under the metrics branch.

See Gerrit Grafana dashboards folder.

Logs

They are consumed by our logging infrastructure and available in the Apache access logs.

Main logs

Logs are available on the gerrit servers at: /var/log/gerrit/. There are a number of logfiles:

  • gerrit.log: This is the main log file and will show stacktraces and errors
  • gerrit.json: Like gerrit.log but not really human readable. For sending structured logs to logstash.
  • sshd_log: Log of sshd events
  • gc_log: Logs for git gc, not the JVM garbage collection (those logs are available in /srv/gerrit/jvmlogs)
  • plugin_log: Info about plugins being loaded and reloaded; this information is also in gerrit.log

HTTP Logs

Gerrit sits behind Apache. Its access and error logs are both in /var/log/apache2:

  • gerrit.wikimedia.org.https.access.log
  • gerrit.wikimedia.org.https.error.log

In the logging infrastructure, Gerrit's logs can be found by searching with type:log4j.

JVM

Thread Dump

A thread dump is often useful in troubleshooting. To capture one, use jstack. This command should be safe to run at any time, and is run frequently while Gerrit is running:

sudo -u gerrit2 jstack -l "$(pgrep -u gerrit2 java)" > "/srv/gerrit/jstack-$(date +%Y-%m-%d-%H%M%S).dump"

It's often useful to upload the resulting file to https://fastthread.io/ to detect problems.
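For a quick local look before uploading, the dump can be summarised by thread state. This is a minimal sketch: thread_states is a hypothetical helper that relies on the "java.lang.Thread.State:" lines jstack prints for each thread.

```shell
# thread_states FILE
# Hypothetical helper: count threads per state in a jstack dump by reading
# the "java.lang.Thread.State: <STATE> ..." lines.
# Prints one "STATE COUNT" pair per line, sorted by state name.
thread_states() {
    awk '
        $1 == "java.lang.Thread.State:" { count[$2]++ }
        END { for (s in count) print s, count[s] }
    ' "$1" | sort
}

# Usage: thread_states /srv/gerrit/jstack-<timestamp>.dump
```

A large number of BLOCKED threads, or a count that keeps growing across successive dumps, is usually the first thing worth investigating.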

Java trace

This command isn't run very often, and it is unclear how safe it is to run. It is kept here for those who are familiar with jstat.

Display a summary of garbage collection statistics every 1000 ms:

sudo -u gerrit2 /usr/lib/jvm/java-8-openjdk-amd64/bin/jstat -gcutil "$(pgrep -u gerrit2 java)" 1000

Java heap usage

Requires openjdk-X-dbg for the debugging symbols

sudo /usr/lib/jvm/java-8-openjdk-amd64/bin/jmap -heap "$(pgrep -u gerrit2 java)"

Access h2 account_patch_reviews

On copies of account_patch_reviews* files:

java -cp h2-1.3.176.jar org.h2.tools.Shell -url jdbc:h2:/home/hashar/account_patch_reviews

Which gives you a sql prompt:

sql> show columns from ACCOUNT_PATCH_REVIEWS
...> ;
FIELD        | TYPE         | NULL | KEY | DEFAULT
ACCOUNT_ID   | INTEGER(10)  | NO   | PRI | 0
CHANGE_ID    | INTEGER(10)  | NO   | PRI | 0
PATCH_SET_ID | INTEGER(10)  | NO   | PRI | 0
FILE_NAME    | VARCHAR(255) | NO   | PRI | ''
(4 rows, 16 ms)

Blocking misbehaving bots / IPs

If necessary, misbehaving IP addresses or user agents can be blocked by editing modules/profile/templates/gerrit/apache.erb in the public operations/puppet git repository and merging the change.

example change

Throttling IPs

Since September 2024, throttling is implemented through the profile::firewall::nftables_throttling keys in Hiera.

You can also observe data related to this on the grafana dashboard for gerrit.

Killing ssh connections

It can happen that a user reaches the limit of 8 concurrent ssh connections and then says they can't push to Gerrit anymore over ssh.

A member of Gerrit admins can run commands like these to kill connections for them:


ssh <username>@gerrit.wikimedia.org -p 29418 gerrit show-connections
ssh <username>@gerrit.wikimedia.org -p 29418 gerrit close-connection <connection ID>
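When a user holds several connections, picking the session IDs out of the listing by hand is tedious. This is a minimal sketch: sessions_for_user is a hypothetical helper, and it assumes the show-connections columns are Session, Start, Idle, User, Remote Host, in that order. Check the real output before relying on it.

```shell
# sessions_for_user USER
# Hypothetical helper: read `gerrit show-connections` output on stdin and
# print the session ID (first column) of every row whose fourth column is USER.
# Assumes the column order: Session, Start, Idle, User, Remote Host.
sessions_for_user() {
    awk -v user="$1" '
        $1 == "Session" || /^-/ { next }   # header row and separator lines
        $4 == user { print $1 }
    '
}

# Usage: pipe the show-connections output through the helper, then pass each
# printed ID to `gerrit close-connection`:
# ssh -p 29418 <gerrit host> gerrit show-connections | sessions_for_user <username>
```

Rows without an authenticated user have fewer columns, so they will not match a username and are left alone.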

Switch over

Based on https://phabricator.wikimedia.org/T326368

If you're migrating on gerrit2003, please see https://phabricator.wikimedia.org/T338470#10506291

Try to ensure the user running Gerrit will be also owning the synced data.
The old version of this section is still visible here: https://wikitech.wikimedia.org/w/index.php?title=Gerrit/Operations&oldid=2235783
This section has recently been entirely rewritten and is still undergoing tests

Schedule and Announce Downtime

Announce the scheduled downtime for Gerrit services.

Prepare patches

Those hiera keys have to be updated:

profile::gerrit::active_host
profile::gerrit::replica_hosts
profile::gerrit::lfs_sync_dest

and the DNS configuration.

Data Synchronization on src-gerrit

rsync -avpPz --delete /var/lib/gerrit2/review_site/ rsync://dst-gerrit.wikimedia.org/gerrit-var-lib/
rsync -avpPz --delete --exclude='*.hprof' /srv/gerrit/ rsync://dst-gerrit.wikimedia.org/gerrit-data/
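Because --delete removes files on the destination, it can be worth previewing the transfer first. This is a minimal sketch, not the documented procedure: sync_gerrit is a hypothetical wrapper around the two commands above that passes extra flags through to rsync, so that --dry-run can be supplied for a preview. The RSYNC variable is also an assumption, added only so the binary can be swapped out.

```shell
# sync_gerrit DST_HOST [EXTRA_RSYNC_FLAGS...]
# Hypothetical wrapper: run the two rsync transfers from this step against
# DST_HOST, appending any extra flags (for example --dry-run) to both.
# RSYNC can be set to override the rsync binary; it defaults to "rsync".
sync_gerrit() {
    dst=$1
    shift
    ${RSYNC:-rsync} -avpPz --delete "$@" \
        /var/lib/gerrit2/review_site/ "rsync://$dst/gerrit-var-lib/" &&
    ${RSYNC:-rsync} -avpPz --delete --exclude='*.hprof' "$@" \
        /srv/gerrit/ "rsync://$dst/gerrit-data/"
}

# Preview what would be copied or deleted, without changing anything:
# sync_gerrit dst-gerrit.wikimedia.org --dry-run
# Then run the real transfer:
# sync_gerrit dst-gerrit.wikimedia.org
```

The second transfer only starts if the first one succeeds, which avoids a half-synced destination going unnoticed.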

Begin Scheduled Downtime

Announce the start of the scheduled downtime on IRC #wikimedia-operations and on Slack #engineering-all.

Downtime Management

On cumin1002, execute:

sudo cookbook sre.hosts.downtime -r 'maintenance' -D 30 src-gerrit.wikimedia.org && sudo cookbook sre.hosts.downtime -r 'maintenance' -H 1 dst-gerrit.wikimedia.org

Manually schedule downtime for checks connected to the virtual server "gerrit.wikimedia.org" on icinga.wikimedia.org.

Disable Puppet and Stop Gerrit on dst-gerrit

Execute the following commands on dst-gerrit:

sudo disable-puppet 'gerrit maintenance' && sudo systemctl stop gerrit

Merge DNS changes to remove gerrit-new and switch the IP of gerrit.wikimedia.org.

Update DNS

Run authdns-update on ns0.wikimedia.org, review the diff but do not commit yet.

Stop Gerrit on src-gerrit

Execute the following commands on src-gerrit:

sudo disable-puppet 'gerrit maintenance' && sudo systemctl stop gerrit

Repeat Data Synchronization on src-gerrit

Repeat the rsync commands from the "Data Synchronization on src-gerrit" step above.

Start Gerrit on dst-gerrit

Execute the following command on dst-gerrit:

sudo systemctl start gerrit

Finalize DNS Update

Confirm the DNS change and merge it.

Testing

Wait for 5 minutes, then test:

Announce Downtime Conclusion

Announce that the downtime is over.

Post-Migration Tasks

  1. Ensure src-gerrit has Puppet disabled and/or services are masked.
  2. Determine the grace period duration.
  3. Decommission the old host as per T336427.
