Continuous integration/Zuul
This page needs to be improved: puppet configuration, the layout file, what we do, etc.
Zuul is a Python daemon which acts as a gateway between Gerrit and Jenkins. It listens to the Gerrit stream-events feed and triggers job functions registered by Jenkins using the Jenkins Gearman plugin. The job triggering specification is written in YAML and hosted in the git repository integration/config.git as /zuul/layout.yaml.
Operational information
What | Where |
---|---|
Server | contint.wikimedia.org (scheduler + merger) |
Puppet classes | manifests/role/zuul.pp modules/contint modules/zuul |
Config | /etc/zuul/zuul.conf |
Init scripts | /etc/init.d/zuul |
Log | /var/log/zuul/*.log |
Quick checks | pgrep -l zuul (should yield zuul-merger and 2x zuul-server on the active contint host); https://integration.wikimedia.org/zuul/ |
There are a few monitoring probes in Icinga which alert members of the 'contint' group.
Name | Description |
---|---|
zuul_service_running | Ensure the zuul-server daemon runs as well as the forked gearman server (which carries the same name). There must be two processes matching zuul-server. |
zuul_gearman_service | The gearman server forked from Zuul must respond to TCP queries on port 4730. If it fails to respond but the two processes are present, the server is stalled or misbehaving somehow. Otherwise it might just be that zuul-server itself is down (and there is thus no process). |
zuul_gearman_wait_queue | Alert whenever there are too many Gearman function requests waiting in the queue. It might be transient (a spike of requests, a temporary overload) or point to a problem with Jenkins or the Jenkins agent. |
For the Gearman wait queue, one can look at the Grafana board https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10. It is often just a spike of requests for the zuul-merger, and sometimes it might be due to all Jenkins executors being busy. The alarm usually resolves itself.
On the CI master, one can look at the work status using: gearadmin --status (see the Debugging section below).
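The reasoning behind the zuul_service_running and zuul_gearman_service checks can be sketched in a few lines of Python. This is only an illustration, not the actual Icinga probe: the function names and the returned strings are hypothetical, and only the port number (4730) and the "two zuul-server processes" rule come from the table above.

```python
import socket

GEARMAN_PORT = 4730  # port named in the zuul_gearman_service check


def port_responds(host, port=GEARMAN_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(zuul_server_processes, gearman_responds):
    """Map the two observations onto the cases described in the table.

    The returned labels are hypothetical and only meant for illustration.
    """
    if gearman_responds:
        return "ok"
    if zuul_server_processes >= 2:
        # Both processes exist but the port is silent: the server is stalled.
        return "gearman server stalled"
    # Fewer than two processes: zuul-server (or its fork) is likely not running.
    return "zuul-server probably down"
```

On the contint host, one might call `diagnose(2, port_responds("localhost"))`, taking the process count from `pgrep -l zuul`.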
Architecture overview
The settings described below come mostly from /etc/zuul/zuul.conf, which is maintained in Puppet. They might not be up to date on this wiki page.
Zuul maintains an SSH connection with the Gerrit master. It connects as the user jenkins-bot and issues the Gerrit command stream-events, which provides a JSON feed of anything happening in Gerrit that can be seen by the jenkins-bot user.
The main process is zuul-server. On startup it forks to boot an embedded Gearman server used to communicate with Jenkins. Another independent process is zuul-merger, which connects to zuul-server and handles the git merges of proposed patches on the tip of the target branch.
Zuul git repositories
Whenever a new project is detected, Zuul clones a non-bare repository from the Gerrit master under the base path defined by git_dir in zuul.conf. As of September 2013, that is /srv/ssd/zuul/git. Zuul uses non-bare repositories to merge the received patchsets against the tip of the branch they are made against. The end result is often a merge commit, which is marked with a git reference (under refs/zuul/<branch>/Z…). The reference is passed when triggering a job so Jenkins can ultimately fetch it.
The local merge commits are not available publicly nor in Gerrit. Nonetheless, the Zuul repositories are made available to the Wikimedia internal network over the git protocol on port 9418. This is made possible by git-daemon, configured via /etc/default/git-daemon. The daemon is restricted to the internal network using ferm rules defined in Puppet.
Access by replica to Zuul repositories
The Zuul repositories should be accessed with the hostname zuul.eqiad.wmnet, which points to the active contint server.
On the server one can clone the mediawiki/core repository using: git clone git://zuul.eqiad.wmnet:9418/mediawiki/core/ though the master branch there will not be the one from Gerrit but a random patch merge.
As of July 2014, work is ongoing to have a Zuul merger run on the second server, lanthanum.eqiad.wmnet. A second merger on lanthanum is not implemented yet, since labs instances do NOT have access to production private IP addresses.
Git replications
Note that the continuous integration production servers also receive Git repositories under /srv/ssd/gerrit. Those are bare repositories, which are not suitable for testing patch sets via Zuul. The replication has been set up for two main uses:
- take snapshots via git archive, which is not supported by Gerrit 2.8
- use them as a reference repository so that Jenkins replicas avoid fetching the whole repository over the network. Git clone will create hardlinks since those repositories are on the same disk (SSD) as the workspace.
Triggering
When an event is received, Zuul passes it through a workflow specification defined in a YAML file (available in integration/config.git). Zuul communicates with its internal Gearman daemon to launch a Gearman function, then resumes processing. The Gearman server receives from Zuul a set of parameters such as the project name and commit SHA1, then finds a suitable worker to execute the function. As of January 2014 there is only one worker, which is the Continuous integration Jenkins master server. Jenkins runs the job and executes a Gearman function to report back the test results. The report is handled by the Jenkins worker to update job descriptions, and by Zuul itself to report back in Gerrit as a comment.
Whenever Jenkins is not reachable or a job gets deleted while running, the build result is considered lost and Zuul reports the status of the build as LOST.
Split between check and test
Jobs executed on patch upload are split between those that execute code from the uploaded patch, which run only in the test pipeline, and those that do not, which run in both the check and test pipelines. This is so that unknown registered accounts can't execute code on the Jenkins replicas. (This will not be needed any more once everything runs in Continuous_integration/Architecture/Isolation.)
The whitelist for the test pipeline and the negated whitelist for the check pipeline should be kept in sync.
Debugging
The Gearman server is embedded inside Zuul and uses the gear Python module. You can send administrative commands to the server by using the gearadmin utility. List of commands:
Command | Description |
---|---|
status | List functions and their workers |
show jobs | (broken) |
show unique jobs | (broken) |
cancel job | (broken) |
version | Gear module version |
workers | List the workers and all their registered functions. |
To list jobs registered in Gearman, send the status administrative command to the Zuul Gearman server:
$ gearadmin --status
build:mwext-TemplateData-phpcs-HEAD:hasSlaveScripts 0 0 13
build:mwext-TemplateData-lint 0 0 13
build:mwext-TemplateData-lint:hasSlaveScripts 0 0 13
build:mwext-TemplateData-testextensions-master:hasSlaveScripts 0 0 13
build:mwext-TemplateData-testextensions-master 0 0 13
build:mwext-TemplateData-jslint 0 0 13
build:mwext-TemplateData-phpcs-HEAD 0 0 13
build:mwext-TemplateData-qunit 0 0 5
build:mwext-TemplateData-jslint:hasSlaveScripts 0 0 13
...
$
The fields read as:
- Gearman function (which is build: followed by the Jenkins job name)
- the number of currently queued instances of that job
- the number of currently running jobs
- the number of workers for the job (there is one Gearman worker per executor)
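As an illustration of the four fields listed above, the status output can be parsed with a short Python helper. This is a minimal sketch: `FunctionStatus`, `parse_status` and `total_queued` are hypothetical names, and the sample lines are copied from the examples on this page.

```python
from typing import NamedTuple


class FunctionStatus(NamedTuple):
    name: str     # Gearman function, e.g. "build:<Jenkins job name>"
    queued: int   # currently queued instances
    running: int  # currently running jobs
    workers: int  # workers able to run the function


def parse_status(output):
    """Parse `gearadmin --status` output into FunctionStatus records."""
    records = []
    for line in output.splitlines():
        parts = line.split()
        # Skip blank lines, shell prompts and the "..." elision.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[1:]):
            continue
        name, queued, running, workers = parts
        records.append(
            FunctionStatus(name, int(queued), int(running), int(workers))
        )
    return records


def total_queued(records):
    """Sum the queued counts over all functions."""
    return sum(r.queued for r in records)


SAMPLE = """\
build:mwext-TemplateData-lint 0 0 13
build:mwext-TemplateData-qunit 0 0 5
merger:merge 2803 2 2
...
"""

records = parse_status(SAMPLE)
```

Feeding it the real output, for example via `subprocess.run(["gearadmin", "--status"], capture_output=True, text=True).stdout`, gives a quick way to spot which function has the deepest queue.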
The list of workers and their attached jobs is obtained with the workers command. Output limited to the first three fields and the first few lines:
$ gearadmin --workers|cut -d' ' -f1-3
13 ::ffff:127.0.0.1 -
16 ::ffff:127.0.0.1 -
30 ::ffff:127.0.0.1 Zuul
31 ::ffff:208.80.154.132 Zuul
15 ::ffff:208.80.153.39 Zuul
17 ::ffff:127.0.0.1 172.17.0.1_manager
18 ::ffff:127.0.0.1 contint2002_exec-5
19 ::ffff:127.0.0.1 contint2002_exec-7
20 ::ffff:127.0.0.1 contint2002_exec-10
21 ::ffff:127.0.0.1 contint2002_exec-9
...
The fields read as:
- worker number
- worker IP address
- worker name. The Jenkins Gearman plugin builds it from: node name, '_exec-', executor slot
- list of functions the worker can handle (not shown in the output above)
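The naming scheme for Jenkins workers described above (node name, '_exec-', executor slot) can be reversed with a small Python helper. This is only a sketch with hypothetical function names. It assumes the three leading fields are separated by whitespace, as in the sample output.

```python
def parse_worker_line(line):
    """Return (worker number, IP address, worker name) from one output line.

    Raises ValueError if the line has fewer than three fields.
    """
    number, address, name = line.split()[:3]
    return int(number), address, name


def split_worker_name(worker_name):
    """Split '<node>_exec-<slot>' into (node, slot).

    Returns None for names that do not follow the Jenkins Gearman plugin
    scheme, such as 'Zuul', '-' or '172.17.0.1_manager'.
    """
    node, sep, slot = worker_name.rpartition("_exec-")
    if not sep or not node or not slot.isdigit():
        return None
    return node, int(slot)


number, address, name = parse_worker_line("20 ::ffff:127.0.0.1 contint2002_exec-10")
node_and_slot = split_worker_name(name)
```

Grouping the results by node name gives the number of executors each Jenkins node has registered with Gearman.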
You can generate a thread dump by sending SIGUSR2 to the zuul process. The result is sent to the debug log in /var/log/zuul/stack_dump.log. Warning: do not send the signal to the forked zuul process which runs the gearman server; that would terminate it and cause havoc.
Replay events
Use the zuul command on the contint host (e.g. contint1001) to replay a Gerrit event to Zuul. This will then queue the same Jenkins jobs as if the event had just occurred.
This can be useful when iterating locally on a Jenkins job that is managed via JJB (e.g. if it is difficult or impossible to trigger such a build directly in Jenkins, or when testing logic for the Zuul merger or the Zuul environment variables themselves), or after creating a documentation publishing job, to generate documentation for a backlog of previous releases.
Below are some examples:
# Patch jobs
zuul enqueue --trigger gerrit --pipeline test --project fresh --change 591214,1
# Post-merge jobs
zuul enqueue --trigger gerrit --pipeline postmerge --project mediawiki/extensions/EventLogging --change 591769,1
# Release tag jobs
zuul enqueue-ref --trigger gerrit --pipeline publish --project mediawiki/php/luasandbox --ref 'refs/tags/3.0.3' --newrev '41dfc79bbcd619e50f7dc44891d19b9b3f812aa9' --oldrev '0000000000000000000000000000000000000000'
zuul enqueue-ref --trigger gerrit --pipeline publish --project oojs/core --ref 'refs/tags/v2.0.0' --newrev '3cad296dc5b722c5061c12ae75c13fa8102fc693' --oldrev '0000000000000000000000000000000000000000'
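When replaying many changes, it can help to build the command line programmatically. The sketch below only assembles the `zuul enqueue` invocation shown in the examples above. The helper name `enqueue_command` is hypothetical, and the check on the change format ('<change>,<patchset>') is an assumption based on those examples.

```python
import shlex


def enqueue_command(pipeline, project, change, trigger="gerrit"):
    """Build the argv for `zuul enqueue` for a change such as '591214,1'."""
    change_number, sep, patchset = change.partition(",")
    if not (sep and change_number.isdigit() and patchset.isdigit()):
        raise ValueError("change must look like '<change>,<patchset>'")
    return [
        "zuul", "enqueue",
        "--trigger", trigger,
        "--pipeline", pipeline,
        "--project", project,
        "--change", change,
    ]


argv = enqueue_command("test", "fresh", "591214,1")
command_line = shlex.join(argv)  # shell-quoted string, for logging or review
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on the contint host, which avoids shell quoting issues with project names.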
Update configuration
Change configuration
Clone the integration/config repository:
git clone https://gerrit.wikimedia.org/r/integration/config
The Zuul configuration file is zuul/layout.yaml. Edit the file, push your commit to Gerrit, then ask for review.
Deploy configuration
Once your configuration change is merged, it needs to be deployed on the continuous integration server (contint). This can be done by someone allowed to sudo as the zuul user.
The deployment is done using a shell script named fab in the integration/config repository.
From the configuration directory, run ./fab deploy_zuul
That will:
- ssh to the contint server where the Zuul scheduler runs,
- update the local git clone of integration/config,
- show a diff of the changes,
- ask you to accept the diff,
- if you are happy with it, update (rebase) the repository and reload the Zuul scheduler service.
IMPORTANT: In a second terminal you might want to have a look at the Zuul log file:
$ tail -f -n100 /var/log/zuul/zuul.log
Announce the deployment to the RelEng SAL via !log in #wikimedia-releng.
If you see any error in the log file, you should revert your change locally (git reset --hard HEAD^) and reload the daemon again (and revert the patch in Gerrit, and merge the revert).
Restart
Do not restart Zuul when deploying a configuration change. Zuul can reread configuration from disk while running using the reload command (See Reload). This way no Gerrit events are missed. Do not take restarting Zuul lightly, as it means any Gerrit events during that time will be missed and need to be manually re-submitted to trigger tests (and merging).
- Graceful
A plain "restart" is graceful.
ssh contint.wikimedia.org
sudo /usr/sbin/service zuul restart && tail -f -n100 /var/log/zuul/zuul.log
- Forced
You don't need to do this in most cases where Zuul looks stuck. See #Known issues for common debugging methods and solutions.
A plain restart waits for currently queued jobs to finish. If Zuul is unresponsive, restarting will be futile, as that will leave it no less stuck than it already is. In that case, perform a stop followed by a start. The stop command, contrary to restart, is not graceful and terminates the process immediately with no regard for currently running or queued jobs.
ssh contint.wikimedia.org
sudo /usr/sbin/service zuul stop
sudo /usr/sbin/service zuul start
tail -n100 /var/log/zuul/zuul.log
WMF Setup
Zuul source code is maintained by OpenStack. The WMF maintains a copy of their git repository in its own Gerrit installation under the project integration/zuul. The Continuous Integration team manually updates our master branch from the OpenStack master.
The puppet module zuul handles installation. It clones the source code from the WMF git repository and installs it on the server using python setup.py. WMF-specific configuration is handled via our puppet role classes: role::zuul::production and role::zuul::labs. The role classes invoke the zuul module using a set of parameters that fit our context. Changes to this configuration must be approved by the Operations team (it is in the project operations/puppet).
Zuul has additional configuration to finely tune how to trigger jobs. Since this is regularly updated by people in charge of Continuous Integration, the related configuration files have been extracted to a git repository outside of Operations' responsibility: integration/config. This lets CI people make changes without bothering Operations with configuration changes that are harmless to most WMF servers. A wrong change can still render Zuul inoperable, but CI people should be able to fix it by themselves.
Log files are available under /var/log/zuul/ and are rotated daily. zuul.log should cover most needs; if not, debug.log has more detailed information. The logging configuration is handled via the puppet module zuul, which copies the file to /etc/zuul/logging.conf.
The configuration repository is initially deployed by puppet, simply by cloning the repository under /etc/zuul/wikimedia. /etc/zuul/zuul.conf refers to it. Whenever a change is merged in integration/config, one needs to update the git working directory and reload Zuul. Watch the log file: since Zuul does not validate its configuration, it can well become unstable whenever a typo appears in the zuul/layout.yaml file.
Upgrading
New package
We deploy Zuul using Debian packages. The Debian sources are in integration/zuul.git in branches debian/os-name.
The quilt patches under debian/patches are maintained using gbp-pq, which grabs the patches from sub-branches patch-queue/debian/os-version.
To build for Jessie:
ssh integration-slave-jessie-1001.integration.eqiad.wmflabs
git clone https://gerrit.wikimedia.org/r/integration/zuul
cd zuul
git checkout origin/upstream
git checkout debian/jessie-wikimedia
# We use dh-virtualenv which fetches from pypi
echo "USENETWORK=yes" > ~/.pbuilderrc
sudo -s DEB_BUILD_OPTIONS=nocheck GIT_PBUILDER_AUTOCONF=no DIST=jessie WIKIMEDIA=yes git-buildpackage -us -uc --git-builder=git-pbuilder
You should then have the resulting .deb files in the parent directory:
$ ls -1 ../zuul_*
zuul_2.5.1.orig.tar.gz
zuul_2.5.1-wmf10_amd64.changes
zuul_2.5.1-wmf10_amd64.deb
zuul_2.5.1-wmf10.debian.tar.xz
zuul_2.5.1-wmf10.dsc
$
git-buildpackage creates the source tarball based on your local upstream branch. Make sure your local branch matches the version in debian/changelog.
You should diff the package against the previous one to spot potential differences, either with debdiff or by extracting both:
$ dpkg-deb -x zuul_2.5.1-wmf9_amd64.deb current
$ dpkg-deb -x zuul_2.5.1-wmf10_amd64.deb new
$ colordiff -ur current new
Or to review only source code modifications:
$ colordiff -ur {wmf2,wmf3}/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul
diff -ur wmf2/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/lib/gerrit.py wmf3/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/lib/gerrit.py
--- wmf2/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/lib/gerrit.py 2015-02-05 15:46:17.000000000 +0000
+++ wmf3/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/lib/gerrit.py 2015-07-23 14:50:19.000000000 +0000
@@ -120,7 +120,7 @@
             if v is True:
                 cmd += ' --%s' % k
             else:
-                cmd += ' --label %s=%s' % (k, v)
+                cmd += ' --%s %s' % (k, v)
         cmd += ' %s' % change
         out, err = self._ssh(cmd)
         return err
$
Actually upgrade
On the contint host, as root, stop the servers and uninstall Zuul entirely:
/etc/init.d/zuul stop
/etc/init.d/zuul-merger stop
pip uninstall zuul
Repeat pip uninstall zuul in case several versions were installed, until you get a message confirming it is no longer installed:
Cannot uninstall requirement zuul, not installed
Storing complete log in /root/.pip/pip.log
Change the master branch of the local git working space to point to the desired commit. On contint, as root:
cd /usr/local/src/zuul
git remote update
git log --oneline --decorate --graph master..origin/master
If happy with the changes, continue:
git reset --hard origin/master
HTTP_PROXY=. HTTPS_PROXY=. python setup.py install
If easy_install attempts to download a Python module, it will bail out. You will then have to roll back master to the previous commit and package the missing Python module.
MAKE SURE the layout still validates:
zuul-server -c /etc/zuul/zuul.conf -l /etc/zuul/wikimedia/zuul/layout.yaml -t
Any stack trace there means Zuul will not be able to reload the configuration. Roll back.
Restart the services:
/etc/init.d/zuul-merger start
/etc/init.d/zuul start
Check /var/log/zuul/debug.log and /var/log/zuul/merger-debug.log to verify the daemons start properly. Once they have settled, you can update a dummy patch in Gerrit to confirm.
Known issues
Force merge
Force merge is clicking "Submit" while Zuul is working through tests, so that the patch is merged before Zuul thinks it is. This causes Zuul to enter a bad state and clogs the queue.
Gearman deadlock
The Gearman server sometimes deadlocks when a job is created in Jenkins. The Gearman process is still around, but TCP connections time out completely and it does not process anything. The workaround is to disconnect Jenkins from the Gearman server, then reconnect it:
- Open https://integration.wikimedia.org/ci/configure logged in with a WMF ldap account
- Log what you're about to do at the RelEng SAL via !log in #wikimedia-releng
- Search for "Gearman"
- Untick checkbox "Enable Gearman"
- "Save" at the bottom
- Search for "Gearman"
- Tick checkbox "Enable Gearman"
- "Save" at the bottom
Jenkins execution lock
Sometimes a Jenkins node (in particular deployment-deploy03, which runs the Beta Cluster update jobs) gets stuck. To recover:
- Open https://integration.wikimedia.org/ci/computer/deployment-deploy03/
- Log what you're about to do at the RelEng SAL via !log in #wikimedia-releng
- Mark node as temporarily offline (there's a button at the top right of the page)
- Disconnect (there's a link in the left hand panel of the page)
- Relaunch replica agent
- Bring node back online
Very high queue of merger:merge functions
Zuul might be flooded with a lot of merger:merge function requests (task T140297). That is usually due to a single repository sending far too many patches. When the server cannot be restarted (that would lose the queue), one can make the merger:merge functions fail fast by preventing read access to the git repository.
To confirm, on the Zuul master check the number of jobs waiting. In the example below, 2803:
$ gearadmin --status|grep merger:merge
merger:merge 2803 2 2
Identify the spamming repository by running:
tail -f /var/log/zuul/merger-debug.log
on the zuul-merger instances. You should see a flood of messages such as:
DEBUG zuul.Repo: CreateZuulRef master/Zxxxx at yyy on <git.Repo "/srv/zuul/git/someproject/.git">
On the zuul-merger instances, change ownership to root and prevent reads from the zuul user:
chown root:root /srv/zuul/git/someproject/.git
chmod go-rx /srv/zuul/git/someproject/.git
The merger:merge function will thus fail quickly, and errors will show up in /var/log/zuul/merger-debug.log. Wait until the queue has drained to a more reasonable level:
$ gearadmin --status|grep merger:merge
merger:merge 19 2 2
Then restore the ownership/permissions:
chown zuul:zuul /srv/zuul/git/someproject/.git
chmod go+rx /srv/zuul/git/someproject/.git
All Gerrit patches complain of merge conflicts
This appears to be caused by gerrit-bot holding open SSH connections and hitting the connection limit.
It is usually resolved by restarting Zuul per https://phabricator.wikimedia.org/T308943#7947453
In case that doesn't work, check the SSH connections to Gerrit via the show-connections command. You'll need to be in the Gerrit Administrators group to do this (see Gerrit:cmd-show-connections).
$ ssh gerrit.wikimedia.org -p 29418 gerrit show-connections | grep jenkins-bot
deadbeef jenkins-bot contint2002.wikimedi
2626b319 jenkins-bot contint2002.wikimedi
125baedf jenkins-bot contint2002.wikimedi
There should be two connections. More than two connections means something is hung in Zuul. Go ahead and try to kill the oldest connection (the first in the list). You'll need to be in the Gerrit Administrators group to do this (see Gerrit:cmd-close-connection).
$ ssh gerrit.wikimedia.org -p 29418 gerrit close-connection deadbeef
closing connection deadbeef...
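The rule above (two connections expected, close the first listed one if there are more) can be sketched in Python. The function names are hypothetical. The parsing assumes the session ID is the first whitespace-separated field and that the user name appears as a later field on the same line, as in the sample output. The exact columns may differ between Gerrit versions.

```python
def bot_connections(output, user="jenkins-bot"):
    """Return the session IDs of connections held by `user`, in listed order."""
    sessions = []
    for line in output.splitlines():
        fields = line.split()
        # First field is the session ID; the user name is one of the others.
        if len(fields) >= 2 and user in fields[1:]:
            sessions.append(fields[0])
    return sessions


def connection_to_close(sessions, expected=2):
    """Return the first listed session if there are more than `expected`."""
    return sessions[0] if len(sessions) > expected else None


SAMPLE = """\
deadbeef jenkins-bot contint2002.wikimedi
2626b319 jenkins-bot contint2002.wikimedi
125baedf jenkins-bot contint2002.wikimedi
"""

sessions = bot_connections(SAMPLE)
candidate = connection_to_close(sessions)
```

The returned session ID is the argument to pass to `gerrit close-connection`. When there are only two connections, `connection_to_close` returns None and nothing needs to be closed.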
References
- Online documentation:
- For Zuul v2, the version used by Wikimedia CI: https://people.wikimedia.org/~thcipriani/docs/zuul/
- For Zuul v3, entirely rewritten and based on Ansible: https://zuul-ci.org/docs/zuul/latest/index.html
- Wikimedia integration/config zuul/layout.yaml
- Jenkins Gearman plugin