Continuous integration/Zuul

shortcut: CI/Z
From mediawiki.org

Zuul is a Python daemon that acts as a gateway between Gerrit and Jenkins. It listens to the Gerrit stream-events feed and triggers job functions registered by Jenkins through the Jenkins Gearman plugin. The job triggering specification is written in YAML and hosted in the git repository integration/config.git as /zuul/layout.yaml.

Operational information

What            Where
Server          contint.wikimedia.org (scheduler + merger)
Puppet classes  manifests/role/zuul.pp, modules/contint, modules/zuul
Config          /etc/zuul/zuul.conf
Init scripts    /etc/init.d/zuul, /etc/init.d/zuul-merger
Log             /var/log/zuul/*.log
Quick checks    pgrep -l zuul (should yield zuul-merger and 2x zuul-server on the active contint host); https://integration.wikimedia.org/zuul/

A few monitoring probes in Icinga alert members of the 'contint' group.

Name Description
zuul_service_running Ensures the zuul-server daemon runs, as well as the forked Gearman server (which carries the same name). There must be two processes matching zuul-server.
zuul_gearman_service The Gearman server forked from Zuul must respond to TCP queries on port 4730. If it fails to respond while the two processes are present, the server is stalled or otherwise misbehaving. Otherwise it might just be that zuul-server itself is down (and there is thus no process).
zuul_gearman_wait_queue Alerts whenever too many Gearman function requests are waiting in the queue. It might be transient (a spike of requests, a temporary overload) or point to trouble with Jenkins or the Jenkins agent.
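
As a rough illustration of what the zuul_gearman_service probe looks for, the sketch below opens a TCP connection to the Gearman port and sends the text command status, which the Gearman administrative protocol answers with one line per function followed by a line holding a single dot. The names read_admin_response and gearman_admin_status, the default host and the timeout are assumptions made for this example; this is not the actual Icinga check.

```python
import socket


def read_admin_response(stream):
    """Collect response lines from a binary stream until the lone "." terminator."""
    lines = []
    for raw in stream:
        line = raw.decode("utf-8", "replace").rstrip("\r\n")
        if line == ".":
            break
        lines.append(line)
    return lines


def gearman_admin_status(host="127.0.0.1", port=4730, timeout=5.0):
    """Send the admin command "status" and return the raw response lines.

    Raises OSError (for example a timeout) when the server does not answer,
    which is the situation the probe is meant to catch.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        with sock.makefile("rb") as stream:
            return read_admin_response(stream)


# Example (run on the host where the Gearman server listens):
#     for line in gearman_admin_status():
#         print(line)
```

A timeout or connection error from gearman_admin_status while both zuul-server processes are present would correspond to the stalled case described in the table.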

For the Gearman wait queue, one can look at the Grafana board https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10. It is often just a spike of requests for the zuul-merger, and sometimes it is due to all Jenkins executors being busy. The alarm usually resolves itself.

On the CI master, one can look at the work status using gearadmin --status; see the Debugging section below.

Architecture overview

The settings described below come mostly from /etc/zuul/zuul.conf, which is maintained in Puppet. They might not be up to date on this wiki page.

Zuul maintains an SSH connection with the Gerrit master. It connects as the user jenkins-bot and issues the Gerrit command stream-events, which provides a JSON feed of anything happening in Gerrit that can be seen by the jenkins-bot user.
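
To give an idea of what a consumer of that feed deals with, here is a minimal sketch that decodes one line of the stream and pulls out a few fields. The sample event is hand-written for illustration (it is not captured from Gerrit), and summarize_event is a hypothetical helper, not part of Zuul.

```python
import json


def summarize_event(line):
    """Return (type, project, branch, change number) for one JSON line of the feed.

    Events that do not concern a change (no "change" object) yield None
    for the last three fields.
    """
    event = json.loads(line)
    change = event.get("change", {})
    return (
        event.get("type"),
        change.get("project"),
        change.get("branch"),
        change.get("number"),
    )


# Hand-written sample, trimmed to the fields used above (assumption).
sample = (
    '{"type": "patchset-created", '
    '"change": {"project": "mediawiki/core", "branch": "master", "number": "591214"}, '
    '"patchSet": {"number": "1"}}'
)

print(summarize_event(sample))
```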

The main process is zuul-server. On startup it forks to boot an embedded Gearman server used to communicate with Jenkins. Another independent process is zuul-merger, which connects to zuul-server and handles the git merges of proposed patches on top of the tip of the target branch.

Zuul git repositories

Whenever a new project is detected, Zuul clones a non-bare repository from the Gerrit master under the base path defined by git_dir in zuul.conf. As of September 2013, that is /srv/ssd/zuul/git. Zuul uses non-bare repositories to merge the received patchsets against the tip of the branch they are made against. The end result is often a merge commit which is marked with a git reference (refs/zuul/<branch>/Z…). The reference is passed when triggering a job so Jenkins can ultimately fetch it.

The local merge commits are not available publicly nor in Gerrit. Nonetheless, the Zuul bare repositories are made available to the Wikimedia internal network over the git protocol on port 9418. This is made possible by git-daemon, configured via /etc/default/git-daemon. The daemon is restricted to the internal network using ferm rules defined in Puppet.

Access by replica to Zuul repositories

The Zuul repositories should be accessed with the hostname zuul.eqiad.wmnet, which points to the active contint server.

On the server one can clone the mediawiki/core repository using: git clone git://zuul.eqiad.wmnet:9418/mediawiki/core/ though the master branch there will not be the one from gerrit but a random patch merge.

As of July 2014, work is under way to have a Zuul merger run on the second server, lanthanum.eqiad.wmnet. The flow overview is:

Drawing of Wikimedia continuous integration flows between Zuul mergers and Jenkins replica client.

A second merger on lanthanum is not implemented yet since labs instances do NOT have access to production private IP addresses.

Git replications

Note that the continuous integration production servers also receive Git repositories under /srv/ssd/gerrit. Those are bare repositories, which are not suitable for testing patch sets via Zuul. The replication has been set up for two main uses:

  • take snapshots via git archive, which is not supported by Gerrit 2.8
  • use them as reference repositories so that Jenkins replicas avoid fetching the whole repository over the network. Git clone creates hard links since those repositories are on the same disk (SSD) as the workspace.

Triggering

When an event is received, Zuul passes it through a workflow specification defined in a YAML file (available in integration/config.git). Zuul communicates with its internal Gearman daemon to launch a Gearman function, then resumes processing. The Gearman server receives from Zuul a set of parameters such as the project name and commit SHA1, and then finds a suitable worker to execute the function. As of January 2014 there is only one worker, which is the Continuous integration Jenkins master server. Jenkins runs the job and executes a Gearman function to report back the test results, which are handled by the Jenkins worker to update job descriptions and by Zuul itself to report back in Gerrit as a comment.

Whenever Jenkins is not reachable or a job is deleted while running, the build result is considered lost and Zuul reports the status of the build as LOST.

Split between check and test

Jobs executed on patch upload are split between those that execute code from the uploaded patch, which run in the check pipeline, and those that don't, which run in the check and test pipeline. This is so that unknown registered accounts can't execute code on the Jenkins replicas. (This will no longer be needed once everything runs in Continuous_integration/Architecture/Isolation.)

The white list for test pipeline and the negated white list for the check pipeline should be kept in sync.

Debugging

The Gearman server is embedded inside Zuul and uses the gear python module. You can send administrative commands to the server by using the gearadmin utility. List of commands:

Command           Description
status            List functions and their workers
show jobs         (broken)
show unique jobs  (broken)
cancel job        (broken)
version           Gear module version
workers           List the workers and all their registered functions

To list jobs registered in Gearman, send the status administrative command to the Zuul Gearman server:

$ gearadmin --status
build:mwext-TemplateData-phpcs-HEAD:hasSlaveScripts    0    0    13
build:mwext-TemplateData-lint    0    0    13
build:mwext-TemplateData-lint:hasSlaveScripts    0    0    13
build:mwext-TemplateData-testextensions-master:hasSlaveScripts    0    0    13
build:mwext-TemplateData-testextensions-master    0    0    13
build:mwext-TemplateData-jslint    0    0    13
build:mwext-TemplateData-phpcs-HEAD    0    0    13
build:mwext-TemplateData-qunit    0    0    5
build:mwext-TemplateData-jslint:hasSlaveScripts    0    0    13
...
$

The fields read as:

  • the Gearman function (which is build: followed by the Jenkins job name)
  • the number of currently queued instances of that job
  • the number of currently running jobs
  • the number of workers for the job (there is one Gearman worker per executor)
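
The four fields listed above can be read programmatically. The following is a minimal sketch, assuming the output of gearadmin --status has been captured as text; FunctionStatus and parse_status are names invented for this example.

```python
from typing import NamedTuple


class FunctionStatus(NamedTuple):
    name: str      # Gearman function, e.g. "build:" followed by the Jenkins job name
    queued: int    # currently queued instances
    running: int   # currently running jobs
    workers: int   # workers able to run the function


def parse_status(output):
    """Parse captured `gearadmin --status` text into FunctionStatus rows."""
    rows = []
    for line in output.splitlines():
        parts = line.split()
        # Skip shell prompts, "..." and the terminating "." line.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[1:]):
            continue
        rows.append(FunctionStatus(parts[0], *(int(p) for p in parts[1:])))
    return rows


sample = """\
$ gearadmin --status
build:mwext-TemplateData-lint    0    0    13
build:mwext-TemplateData-qunit    0    0    5
merger:merge\t2803\t2\t2
...
"""

rows = parse_status(sample)
backlog = sum(row.queued for row in rows)
print(len(rows), backlog)
```

Summing the queued column, as done for backlog, gives a single number comparable to what the wait queue alert watches.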


The list of workers and their attached jobs is obtained with the workers command. The output below is cut to the first three fields and truncated:

$ gearadmin --workers|cut -d\  -f1-3
13 ::ffff:127.0.0.1 -
16 ::ffff:127.0.0.1 -
30 ::ffff:127.0.0.1 Zuul
31 ::ffff:208.80.154.132 Zuul
15 ::ffff:208.80.153.39 Zuul
17 ::ffff:127.0.0.1 172.17.0.1_manager
18 ::ffff:127.0.0.1 contint2002_exec-5
19 ::ffff:127.0.0.1 contint2002_exec-7
20 ::ffff:127.0.0.1 contint2002_exec-10
21 ::ffff:127.0.0.1 contint2002_exec-9
...

The fields read as:

  • worker number
  • worker IP address
  • worker name. The Jenkins Gearman plugin forges it using: node name, '_exec-', executor slot
  • list of functions the worker can handle (not shown in the output above)
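
A short sketch of how one such line could be split into those fields, including the node name and executor slot encoded in the worker name. It assumes the function list follows a " : " separator, as in the Gearman administrative protocol; parse_worker is a hypothetical helper and the second sample line is made up.

```python
def parse_worker(line):
    """Split one line of `gearadmin --workers` output into its fields."""
    head, _, tail = line.partition(" : ")
    fd, ip, name = head.split()[:3]
    # Jenkins executors are named <node name>_exec-<executor slot>.
    node, sep, slot = name.partition("_exec-")
    return {
        "fd": int(fd),
        "ip": ip,
        "name": name,
        "node": node if sep else None,
        "executor": int(slot) if sep else None,
        "functions": tail.split(),
    }


print(parse_worker("18 ::ffff:127.0.0.1 contint2002_exec-5"))
# Made-up line showing the function list after the separator (assumption).
print(parse_worker("19 ::ffff:127.0.0.1 contint2002_exec-7 : build:job-a build:job-b"))
```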

You can generate a thread dump by sending SIGUSR2 to the zuul process. The result is sent to the debug log in /var/log/zuul/stack_dump.log. Warning: do not send the signal to the forked zuul process which runs the Gearman server; doing so terminates it and causes havoc.

Replay events

Use the zuul command on the contint host (e.g. contint1001) to replay a Gerrit event to Zuul. This will then queue the same Jenkins jobs as if the event had just occurred.

This can be useful when iterating locally on a Jenkins job that is managed via JJB (e.g. if it is difficult or impossible to trigger such a build directly in Jenkins, or when testing logic for the Zuul merger or the Zuul environment variables themselves), or after creating a documentation publishing job, to generate documentation for a backlog of previous releases.

Below are some examples:

# Patch jobs
zuul enqueue --trigger gerrit --pipeline test --project fresh --change 591214,1
# Post-merge jobs
zuul enqueue --trigger gerrit --pipeline postmerge --project mediawiki/extensions/EventLogging --change 591769,1
# Release tag jobs
zuul enqueue-ref --trigger gerrit --pipeline publish --project mediawiki/php/luasandbox --ref 'refs/tags/3.0.3' --newrev '41dfc79bbcd619e50f7dc44891d19b9b3f812aa9' --oldrev '0000000000000000000000000000000000000000'

zuul enqueue-ref --trigger gerrit --pipeline publish --project oojs/core --ref 'refs/tags/v2.0.0' --newrev '3cad296dc5b722c5061c12ae75c13fa8102fc693' --oldrev '0000000000000000000000000000000000000000'
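
When many events need replaying (for example a backlog of release tags), the command lines above can be generated rather than typed. This is only a sketch: the helper names are invented, and it builds the command strings without running them, using the same flags as the examples above.

```python
import shlex

NULL_SHA1 = "0" * 40  # the all-zero --oldrev used for newly created refs above


def enqueue_command(pipeline, project, change, trigger="gerrit"):
    """Argument list for replaying a change, e.g. change="591214,1"."""
    return ["zuul", "enqueue", "--trigger", trigger, "--pipeline", pipeline,
            "--project", project, "--change", change]


def enqueue_ref_command(pipeline, project, ref, newrev, oldrev=NULL_SHA1,
                        trigger="gerrit"):
    """Argument list for replaying a ref update such as a new tag."""
    return ["zuul", "enqueue-ref", "--trigger", trigger, "--pipeline", pipeline,
            "--project", project, "--ref", ref,
            "--newrev", newrev, "--oldrev", oldrev]


print(shlex.join(enqueue_command("test", "fresh", "591214,1")))
print(shlex.join(enqueue_ref_command(
    "publish", "oojs/core", "refs/tags/v2.0.0",
    "3cad296dc5b722c5061c12ae75c13fa8102fc693")))
```

Keeping the commands as argument lists means they can later be passed to subprocess.run without going through a shell.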

Update configuration

Change configuration

Clone the integration/config repository:

git clone https://gerrit.wikimedia.org/r/integration/config

The Zuul configuration file is zuul/layout.yaml. Edit the file and push your commit to Gerrit then ask for review.

Deploy configuration

Once your configuration change is merged, it needs to be deployed on the continuous integration server (contint). This can be done by anyone allowed to sudo as the zuul user.

The deployment is done using a shell script named fab in the integration/config repository.

From the configuration directory, run ./fab deploy_zuul

That will:

  • ssh to the contint server where the Zuul scheduler runs,
  • update the local git clone of integration/config,
  • show a diff of the changes,
  • ask you to accept the diff,
  • if you are happy with it, update (rebase) the repository and reload the Zuul scheduler service.

IMPORTANT: In a second terminal you might want to have a look at the Zuul log file:

$ tail -f -n100 /var/log/zuul/zuul.log

Announce the deployment to the RelEng SAL via !log in #wikimedia-releng.

If you see any error in the log file, revert your change locally (git reset --hard HEAD^) and reload the daemon again, then revert the patch in Gerrit and merge the revert.

Restart

Graceful

A plain "restart" is graceful.

ssh contint.wikimedia.org
sudo /usr/sbin/service zuul restart && tail -f -n100 /var/log/zuul/zuul.log

Forced

A plain restart waits for currently queued jobs to finish. If Zuul is unresponsive, restarting is futile since it leaves Zuul no less stuck than it already is. In that case, perform a stop followed by a start. The stop command, unlike restart, is not graceful: it terminates the process immediately with no regard for currently running or queued jobs.

ssh contint.wikimedia.org
sudo /usr/sbin/service zuul stop
sudo /usr/sbin/service zuul start
tail -n100 /var/log/zuul/zuul.log

WMF Setup

Zuul source code is maintained by OpenStack. The WMF maintains a copy of their git repository in its own Gerrit installation under the project integration/zuul. The Continuous Integration team manually updates our master branch from the OpenStack master.

The puppet module zuul handles installation. It clones the source code from the WMF git repository and installs it on the server using python setup.py. WMF-specific configuration is handled via our puppet role classes: role::zuul::production and role::zuul::labs. The role classes invoke the zuul module using a set of parameters that fit our context. Changes to this configuration must be approved by the Operations team (it is in the project operations/puppet).

Zuul has additional configuration to finely tune how jobs are triggered. Since this is regularly updated by people in charge of Continuous Integration, the related configuration files have been extracted to a git repository outside of Operations' responsibility: integration/config. This lets CI people make changes without bothering Operations with configuration changes that are harmless to most WMF servers. A wrong change can still render Zuul inoperable, but CI people should be able to fix it by themselves.

Log files are available under /var/log/zuul/ and are rotated daily. zuul.log should cover most needs; if not, debug.log has more detailed information. The logging configuration is handled via the puppet module zuul, which copies the file to /etc/zuul/logging.conf.

The configuration repository is initially deployed by puppet simply by cloning the repository under /etc/zuul/wikimedia. /etc/zuul/zuul.conf refers to it. Whenever a change is merged in integration/config, one needs to update the git working directory and reload Zuul. Watch the log file: since Zuul does not validate its configuration, it can easily become unstable whenever a typo appears in the zuul/layout.yaml file.

Upgrading

New package

We deploy Zuul using Debian packages. The Debian sources are in integration/zuul.git, in branches named debian/os-name.

The quilt patches under debian/patches are maintained using gbp-pq, which grabs the patches from the sub-branches patch-queue/debian/os-version.

To build for Jessie:

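ssh integration-slave-jessie-1001.integration.eqiad.wmflabs
git clone https://gerrit.wikimedia.org/r/integration/zuul
cd zuul
# Create a local upstream branch; git-buildpackage builds the source tarball from it
git checkout -b upstream origin/upstream
git checkout debian/jessie-wikimedia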

# We use dh-virtualenv which fetches from pypi
echo "USENETWORK=yes" > ~/.pbuilderrc

sudo -s
DEB_BUILD_OPTIONS=nocheck GIT_PBUILDER_AUTOCONF=no DIST=jessie WIKIMEDIA=yes git-buildpackage -us -uc --git-builder=git-pbuilder

You should then have the resulting package files in the parent directory:

$ ls -1 ../zuul_*
zuul_2.5.1.orig.tar.gz
zuul_2.5.1-wmf10_amd64.changes
zuul_2.5.1-wmf10_amd64.deb
zuul_2.5.1-wmf10.debian.tar.xz
zuul_2.5.1-wmf10.dsc
$

git-buildpackage creates the source tarball based on your local upstream branch. Make sure your local branch matches the version in the debian/changelog.
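
One way to check that match is to read the upstream version out of the first changelog line and compare it with what the upstream branch contains. The sketch below only does the first half; the sample line is hypothetical (in particular the distribution and urgency values), and upstream_version is a name made up for this example.

```python
import re


def upstream_version(changelog_first_line):
    """Extract the upstream version from a 'package (version) dist; ...' line."""
    match = re.match(r"^(\S+) \(([^)]+)\)", changelog_first_line)
    if match is None:
        raise ValueError("not a Debian changelog header: %r" % changelog_first_line)
    version = match.group(2)
    version = version.split(":", 1)[-1]           # drop an epoch such as "1:"
    upstream, sep, _revision = version.rpartition("-")
    return upstream if sep else version           # native packages have no revision


# Hypothetical header line, consistent with the file names listed above.
line = "zuul (2.5.1-wmf10) jessie-wikimedia; urgency=medium"
print(upstream_version(line))
```

With the sample line, the result corresponds to the version in the zuul_2.5.1.orig.tar.gz file name shown in the listing above.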

You should diff the package with the previous one to see potential differences with debdiff or by extracting them:

$ dpkg-deb -x zuul_2.5.1-wmf9_amd64.deb current
$ dpkg-deb -x zuul_2.5.1-wmf10_amd64.deb new
$ colordiff -ur current new

Or to review only source code modifications:

$ colordiff -ur {wmf2,wmf3}/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul
diff -ur wmf2/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/lib/gerrit.py wmf3/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/lib/gerrit.py
--- wmf2/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/lib/gerrit.py	2015-02-05 15:46:17.000000000 +0000
+++ wmf3/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/lib/gerrit.py	2015-07-23 14:50:19.000000000 +0000
@@ -120,7 +120,7 @@
             if v is True:
                 cmd += ' --%s' % k
             else:
-                cmd += ' --label %s=%s' % (k, v)
+                cmd += ' --%s %s' % (k, v)
         cmd += ' %s' % change
         out, err = self._ssh(cmd)
         return err
$

Actually upgrade

On the contint host, as root, stop the servers and uninstall Zuul entirely:

/etc/init.d/zuul stop
/etc/init.d/zuul-merger stop
pip uninstall zuul

Repeat pip uninstall zuul in case several versions were installed, until a message confirms it is no longer installed:

Cannot uninstall requirement zuul, not installed
Storing complete log in /root/.pip/pip.log

Change the master branch of the local git working space to point to the desired commit. On contint, as root:

cd /usr/local/src/zuul
git remote update
git log --oneline --decorate --graph master..origin/master

If happy with the changes, continue:

git reset --hard origin/master
HTTP_PROXY=. HTTPS_PROXY=. python setup.py install

If easy_install attempts to download a Python module, it will bail out. You will have to roll back master to the previous commit and package the missing Python module.

MAKE SURE the layout still validates:

zuul-server -c /etc/zuul/zuul.conf -l /etc/zuul/wikimedia/zuul/layout.yaml -t

Any stack trace there means Zuul will not be able to reload the configuration. Roll back.

Restart the services:

/etc/init.d/zuul-merger start
/etc/init.d/zuul start

Check /var/log/zuul/debug.log and /var/log/zuul/merger-debug.log to verify the daemons start properly. Once they have settled, you can update a dummy patch in Gerrit to confirm.

Known issues

Force merge

A force merge is clicking "Submit" while Zuul is still working through tests, so that the patch is merged before Zuul expects it to be. This causes Zuul to enter a bad state and clogs the queue.

Gearman deadlock

The Gearman server sometimes deadlocks when a job is created in Jenkins. The Gearman process is still around but TCP connections time out completely and it does not process anything. The workaround is to disconnect Jenkins from the Gearman server:

  1. Open https://integration.wikimedia.org/ci/configure logged in with a WMF LDAP account
  2. Log what you're about to do at the RelEng SAL via #wikimedia-releng !log
  3. Search for "Gearman"
  4. Untick checkbox "Enable Gearman"
  5. "Save" at the bottom
  6. Search for "Gearman"
  7. Tick checkbox "Enable Gearman"
  8. "Save" at the bottom

Jenkins execution lock

Sometimes a Jenkins node (in particular deployment-deploy03, which runs the Beta Cluster update jobs) gets stuck. To recover:

  1. Open https://integration.wikimedia.org/ci/computer/deployment-deploy03/
  2. Log what you're about to do at the RelEng SAL via #wikimedia-releng !log
  3. Mark node as temporarily offline (there's a button at the top right of the page)
  4. Disconnect (there's a link in the left hand panel of the page)
  5. Relaunch replica agent
  6. Bring node back online

Very high queue of merger:merge functions

Zuul might be flooded with a lot of merger:merge function requests (task T140297). That is usually due to a single repository sending far too many patches. When the server cannot be restarted (that would lose the queue), one can make merger:merge fail fast by preventing read access to the git repository.

To confirm, on the Zuul master check the number of jobs waiting. In the example below, 2803:

$ gearadmin --status|grep merger:merge
merger:merge	2803	2	2

Identify the spamming repository by watching the merger debug log on the zuul-merger instances:

tail -f /var/log/zuul/merger-debug.log

You should see a flood of messages such as:

DEBUG zuul.Repo: CreateZuulRef master/Zxxxx at yyy on <git.Repo "/srv/zuul/git/someproject/.git">

On the zuul-merger instances, change ownership to root and prevent reads from the zuul user:

chown root:root /srv/zuul/git/someproject/.git
chmod go-rx /srv/zuul/git/someproject/.git

The merger:merge function will thus fail quickly and errors will show up in /var/log/zuul/merger-debug.log. Wait until the queue has drained to a more reasonable level:

$ gearadmin --status|grep merger:merge
merger:merge	19	2	2

Then restore the ownership/permissions:

chown zuul:zuul /srv/zuul/git/someproject/.git
chmod go+rx /srv/zuul/git/someproject/.git

All Gerrit patches complain of merge conflicts

This appears to be caused by gerrit-bot holding open SSH connections and hitting the connection limit.

It is usually resolved by restarting Zuul per https://phabricator.wikimedia.org/T308943#7947453

In case that doesn't work, check the ssh connections to gerrit via the show-connections command. You'll need to be in the Gerrit Administrators group to do this (see Gerrit:cmd-show-connections).

$ ssh gerrit.wikimedia.org -p 29418 gerrit show-connections | grep jenkins-bot
deadbeef   jenkins-bot     contint2002.wikimedi
2626b319   jenkins-bot     contint2002.wikimedi
125baedf   jenkins-bot     contint2002.wikimedi

There should be two connections. More than two connections is a bad sign: it means something is hung in Zuul. Go ahead and try to kill the oldest connection (the first in the list). You'll need to be in the Gerrit Administrators group to do this (see Gerrit:cmd-close-connection).

$ ssh gerrit.wikimedia.org -p 29418 gerrit close-connection deadbeef
closing connection deadbeef...
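
The check described above (more than two jenkins-bot connections, close the first one listed) can be sketched as follows. It assumes the output of show-connections has been captured as text in the layout shown above, with the session id in the first column; the function names and the expected parameter are invented for this example.

```python
def sessions_for(output, user="jenkins-bot"):
    """Return the session ids of `user`, in the order they are listed."""
    sessions = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and user in fields[1:]:
            sessions.append(fields[0])
    return sessions


def session_to_close(sessions, expected=2):
    """Return the first listed session when there are more than `expected`."""
    return sessions[0] if len(sessions) > expected else None


sample = """\
deadbeef   jenkins-bot     contint2002.wikimedi
2626b319   jenkins-bot     contint2002.wikimedi
125baedf   jenkins-bot     contint2002.wikimedi
"""

sessions = sessions_for(sample)
print(len(sessions), session_to_close(sessions))
```

The sketch only reports which session would be closed; issuing close-connection is left as a manual step.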
