Continuous integration/Data center switch
This page is currently a draft.
The core of the CI infrastructure is hosted on two production machines, one in each datacenter. Most services are active on only one of the hosts, with the other host acting as a cold spare. When doing hardware maintenance or operating system upgrades, we move the services and their data from one host to the other. This document describes the steps needed to do the swap.
Hosts and services
We have two bare metal hosts, contint1001.wikimedia.org and contint2001.wikimedia.org, one in each of our primary datacenters. They host a variety of services:
- Zuul: the scheduler and workflow system
- Zuul mergers and their associated git-daemon. Active on both servers!
- Jenkins holding jobs, their build history and artifacts
- The website https://integration.wikimedia.org/ and the proxies to the above services:
  - Zuul status https://integration.wikimedia.org/zuul/
  - Jenkins interface https://integration.wikimedia.org/ci/
- docker-pkg to build images
- Docker daemon and images
Switching over
The general idea is to synchronize the Jenkins files from the primary to the spare server before anything else. The overall sequence is:
- synchronize build artifacts
- Stop all services on the primary
- rsync data and states
- change DNS for contint.wikimedia.org
- change primary in Puppet / Hiera
- Start Jenkins
- Start Zuul scheduler
synchronize build artifacts
This step should be done ahead of time since the transfer takes hours.
The Jenkins build history and artifacts exist only on the primary Jenkins, in /srv/jenkins/builds. They amount to hundreds of gigabytes and millions of files and directories.
On the spare server, ensure /srv/jenkins/builds is empty.
TODO: check Transfer.py or MariaDB/ImportTableSpace.
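The emptiness check on the spare can be scripted. The sketch below is a minimal illustration only: the helper name dir_is_empty and the BUILDS_DIR variable are made up for this example, and only the path comes from this page.

```shell
#!/bin/bash
# Sketch: confirm the build directory on the spare has no content
# before seeding it. dir_is_empty is a hypothetical helper.
dir_is_empty() {
    # Succeeds only if $1 is a directory with no entries (hidden files included).
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Path taken from this page; override BUILDS_DIR to try the check elsewhere.
BUILDS_DIR="${BUILDS_DIR:-/srv/jenkins/builds}"
if dir_is_empty "$BUILDS_DIR"; then
    echo "OK: $BUILDS_DIR is empty"
else
    echo "WARNING: $BUILDS_DIR is missing or not empty"
fi
```

The check deliberately treats a missing directory as a warning too, since an absent path usually means the host is not set up the way the transfer expects.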
Stop all services
On the primary:
systemctl stop jenkins
systemctl stop zuul
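Because the next step copies state files, it may be worth confirming that both units are really down before going further. The following is a sketch under assumptions: all_stopped is a hypothetical helper, and the unit names are the ones used in the commands above.

```shell
#!/bin/bash
# Sketch: report any unit that is still running.
# all_stopped is a hypothetical helper, not an existing tool.
all_stopped() {
    local unit rc=0
    for unit in "$@"; do
        # "systemctl is-active --quiet" exits 0 when the unit is active.
        # Units are checked one at a time because, given several units,
        # is-active succeeds as soon as any one of them is active.
        if systemctl is-active --quiet "$unit"; then
            echo "still running: $unit" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Possible usage on the primary, after the stop commands:
#   all_stopped jenkins zuul && echo "services are down"
```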
rsync data and states
Using rsync over ssh as root:
- refresh /srv/jenkins/builds on the spare with the artifacts from the primary.
Transfer the Jenkins and Zuul states:
- /var/lib/jenkins: job configurations, build indices, plugins, etc.
- /var/lib/zuul/times: durations of function executions, used to estimate an ETA for each build
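The transfers listed above could be scripted roughly as follows. This is a sketch, not the documented procedure: it assumes the commands run as root on the spare and pull from the primary, that contint1001.wikimedia.org is the current primary, that all three paths are directories, and that the rsync flags shown are suitable. The names sync_paths, PRIMARY, PATHS and RUN are invented for the example. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/bash
# Sketch: print (or run) one rsync per path, pulling from the old primary.
# Assumptions: run as root on the spare; PRIMARY is the current primary.
PRIMARY="${PRIMARY:-contint1001.wikimedia.org}"
PATHS=(/srv/jenkins/builds /var/lib/jenkins /var/lib/zuul/times)

sync_paths() {
    local path cmd
    for path in "${PATHS[@]}"; do
        # Trailing slashes copy directory contents; --delete removes files
        # on the spare that no longer exist on the primary.
        cmd=(rsync -a --delete -e ssh "root@${PRIMARY}:${path}/" "${path}/")
        if [ "${RUN:-0}" = "1" ]; then
            "${cmd[@]}" || return 1
        else
            echo "${cmd[*]}"
        fi
    done
}

# Without RUN=1 this only prints the commands.
sync_paths
```

Setting RUN=1 in the environment executes the commands instead of printing them. Since --delete is destructive, reviewing the printed commands before a real run is the safer order.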
change DNS
The Varnish/ATS layer reaches the backend via the DNS entry contint.wikimedia.org, which in turn points to the primary host. Change that entry so it points to the new primary.
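One way to confirm where the entry currently points is to compare the address it resolves to with the address of a given host. The sketch below assumes dig is installed; dns_points_to is a hypothetical helper. The answer reflects whatever the local resolver returns, so a cached record may still show the old target until its TTL expires.

```shell
#!/bin/bash
# Sketch: compare the address behind the service name with a host's address.
# dns_points_to is a hypothetical helper; it relies on dig being installed.
dns_points_to() {
    local service="$1" host="$2" service_ip host_ip
    # With +short, a CNAME chain is printed before the address record,
    # so the last line is the address. Assumes one A record per name.
    service_ip="$(dig +short "$service" A | tail -n 1)"
    host_ip="$(dig +short "$host" A | tail -n 1)"
    [ -n "$service_ip" ] && [ "$service_ip" = "$host_ip" ]
}

# Possible usage after the DNS change, with contint2001 as the new primary:
#   dns_points_to contint.wikimedia.org contint2001.wikimedia.org && echo "DNS updated"
```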
change primary in Puppet / Hiera
TODO: find the changes that need to happen. Ideally this should just be a role change.
Start services
On the new primary:
systemctl start jenkins
systemctl start zuul