Continuous integration/Data center switch

From mediawiki.org

The core of the CI infrastructure is hosted on two production machines, one in each datacenter. Most services are active on only one of the hosts, with the other host acting as a cold spare. When doing hardware maintenance or operating system upgrades, we move the services and their data from one host to the other. This document describes the steps needed to do the swap.

Hosts and services

We have two bare metal hosts, contint1001.wikimedia.org and contint2001.wikimedia.org, one in each of our primary datacenters. They host a variety of services, including Jenkins and Zuul.

Switching over

The general idea is to synchronize the Jenkins files from the primary to the spare server before anything else. Once that is done, the sequence is:

  • Synchronize build artifacts
  • Stop all services on the primary
  • rsync data and states
  • Change DNS for contint.wikimedia.org
  • Change primary in Puppet / Hiera
  • Start Jenkins
  • Start Zuul scheduler

Synchronize build artifacts

This step should be done ahead of time since the transfer takes hours.

The Jenkins build history and artifacts exist only on the primary Jenkins, in /srv/jenkins/builds. They amount to hundreds of gigabytes and millions of files and directories.

On the spare server, ensure /srv/jenkins/builds is empty.
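The check and the initial copy could be sketched as below. This is only an illustration under assumptions: contint1001 is taken to be the current primary, the copy is pulled from the spare as root, and the helper names (dir_is_empty, run, presync_builds) are made up for this example. Commands are printed rather than executed unless DRY_RUN=0 is set.

```shell
# Sketch: run on the spare as root. PRIMARY is an assumption; set it to the
# host that currently holds the Jenkins build history.
PRIMARY="${PRIMARY:-contint1001.wikimedia.org}"
BUILDS_DIR="${BUILDS_DIR:-/srv/jenkins/builds}"

# Succeeds only if the directory exists and contains no entries.
dir_is_empty() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Print the command by default; set DRY_RUN=0 to execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Initial bulk copy of the build history, pulled from the primary.
# -a preserves permissions, ownership, timestamps and symlinks.
presync_builds() {
    if ! dir_is_empty "$BUILDS_DIR"; then
        echo "refusing to sync: $BUILDS_DIR is missing or not empty" >&2
        return 1
    fi
    run rsync -a "root@${PRIMARY}:${BUILDS_DIR}/" "${BUILDS_DIR}/"
}
```

Calling presync_builds prints the rsync command for review. Running it inside a terminal multiplexer is advisable given the hours the transfer takes.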

TODO: check Transfer.py or MariaDB/ImportTableSpace.

Stop all services

On the primary:

  systemctl stop jenkins
  systemctl stop zuul
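Before copying any state, it may be worth confirming that both units are really stopped, since files changing under a running service could leave the copy inconsistent. A minimal sketch, where all_stopped is a hypothetical helper name and the unit names are the ones used above:

```shell
# Succeeds only when none of the given systemd units is active.
# "systemctl is-active --quiet" exits 0 for an active unit, non-zero otherwise.
all_stopped() {
    local unit
    for unit in "$@"; do
        if systemctl is-active --quiet "$unit"; then
            echo "$unit is still active" >&2
            return 1
        fi
    done
}
```

For example, `all_stopped jenkins zuul && echo "safe to rsync"` on the primary.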

rsync data and states

Using rsync over ssh as root:

  • Refresh /srv/jenkins/builds on the spare from the primary, to pick up the artifacts created since the initial synchronization.

Transfer the Jenkins and Zuul states:

  • /var/lib/jenkins: job configurations, build indices, plugins, etc.
  • /var/lib/zuul/times: durations of function executions, used to estimate the ETA of each build
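The three transfers above could be sketched as a single loop. The assumptions here are that contint1001 is the outgoing primary, that the copy is pulled from the spare as root, and that --delete is wanted so files removed on the primary since the initial copy also disappear on the spare. The names run and sync_states are made up for this example, and commands are only printed unless DRY_RUN=0 is set.

```shell
# Sketch: run on the spare as root, after services are stopped on the primary.
PRIMARY="${PRIMARY:-contint1001.wikimedia.org}"

# Print the command by default; set DRY_RUN=0 to execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Refresh the build history, then copy the Jenkins and Zuul states.
# Trailing slashes make rsync copy directory contents, not the directory itself.
sync_states() {
    local path
    for path in /srv/jenkins/builds /var/lib/jenkins /var/lib/zuul/times; do
        run rsync -a --delete "root@${PRIMARY}:${path}/" "${path}/" || return 1
    done
}
```

Since --delete removes files on the receiving side, reviewing the printed commands before setting DRY_RUN=0 is a cheap safeguard against running it in the wrong direction.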

Change DNS

The Varnish/ATS layer reaches the backend through the DNS entry contint.wikimedia.org, which in turn points to the primary host. Update that entry so it points to the new primary.
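Once the DNS change is deployed, one way to confirm it is to compare what the service name and the new primary resolve to. The sketch below assumes dig is available and that only IPv4 A records matter; resolve_a and points_to are hypothetical helper names. Resolvers may keep serving the old answer until the record's TTL expires.

```shell
# Print the sorted IPv4 addresses a name resolves to, ignoring any CNAME
# lines that "dig +short" prints before the final addresses.
resolve_a() {
    dig +short A "$1" | { grep -E '^[0-9]+(\.[0-9]+){3}$' || true; } | sort
}

# Succeeds when both names resolve to the same, non-empty set of addresses.
points_to() {
    local got want
    got=$(resolve_a "$1")
    want=$(resolve_a "$2")
    [ -n "$got" ] && [ "$got" = "$want" ]
}
```

For example, `points_to contint.wikimedia.org contint2001.wikimedia.org` would succeed once the entry points to contint2001.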

Change primary in Puppet / Hiera

TODO: find the changes that need to happen. Ideally should just be a role change.

Start services

On the new primary:

  systemctl start jenkins
  systemctl start zuul
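The sequence overview lists Jenkins before the Zuul scheduler, so it can help to wait until Jenkins is reported active before moving on. A minimal sketch, where wait_active is a hypothetical helper and the default of 30 one-second attempts is an arbitrary assumption:

```shell
# Poll a systemd unit until it is active, giving up after a number of tries.
# Usage: wait_active UNIT [TRIES]
wait_active() {
    local unit="$1"
    local tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        if systemctl is-active --quiet "$unit"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    echo "$unit did not become active" >&2
    return 1
}
```

For example, `systemctl start jenkins && wait_active jenkins && systemctl start zuul`. An active unit only means systemd considers the process started; Jenkins may still be loading its job configurations.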