Program Goals and Status for FY18/19

TEC6 Address Infrastructure Gaps

Goal Owner: Mark Bergsma
Program Goals for FY18/19: At the conclusion of this program, Zend PHP7 will be the only PHP runtime supported or used in the Wikimedia Foundation production environment.
Annual Plan: TEC6 Address Infrastructure Gaps
- Primary Goal is Knowledge as a Service: Evolve our systems and structures
- Tech Goal: Sustaining

Q1 Goals

Outcome 2 / Output 3

Technical staff have increased visibility into the operation of our services and infrastructure.

Modernize logging, alerting and metrics monitoring infrastructure

Dependencies on: Search Platform; Primary team: Infrastructure Foundations

Goal(s)

Adopt Logstash Done

Review Logstash/Kibana's architecture and installation and identify next steps and gaps to be addressed.
Audit log producers across the infrastructure and plan their transition to centralized logging.
Investigate log shipping methods and standardize on them.

Status

Note: July 2018

In progress

Note: August 14, 2018

In progress

Note: September 11, 2018

In progress. A comprehensive design document for logging has been prepared and is currently in final review.

Outcome 3 / Output 4

Wikimedia projects and content are protected against major disasters that threaten availability.

Strengthen backups with reliable and redundant backup infrastructure

Dependencies on: Infrastructure Foundations; Primary teams: Data Persistence

Goal(s)

Monitor database backup generation for failure or incorrect generation Done

Generate metrics and historic data about databases (objects, table and wiki sizes, growth over time, etc.)
Detect and alert on backup metrics anomalies
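
As an illustration of the anomaly detection goal above, here is a minimal sketch of one possible check. It is not the team's actual software: the function name `backup_size_anomaly`, the median baseline, and the 20% tolerance are all assumptions chosen for the example.

```python
from statistics import median


def backup_size_anomaly(history, latest, tolerance=0.2):
    """Return True if `latest` deviates from the median of `history`
    by more than `tolerance` (a fraction, e.g. 0.2 for 20%).

    `history` is a list of previous backup sizes in any consistent unit.
    All names and the threshold are hypothetical, for illustration only.
    """
    if not history:
        # Nothing to compare against yet, so do not raise an alert.
        return False
    baseline = median(history)
    if baseline == 0:
        # Avoid dividing by zero: any non-empty backup differs from an empty baseline.
        return latest != 0
    return abs(latest - baseline) / baseline > tolerance
```

A backup that suddenly shrinks to half its usual size would be flagged, while normal growth of a few percent would not. A real deployment would likely also track table counts and generation time, as listed in the goals.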

Status

Note: July 30, 2018

In progress

Note: August 14, 2018

In progress

Note: September 11, 2018

In progress. Software to generate and track metrics for database backups has been written, and will soon be used to set up alerts on backup anomalies.

Outcome 4 / Output 6

Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.

Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Dependencies on: Traffic; Primary teams: Infrastructure Foundations, Data center operations

Goal(s)

Migrate the hardware inventory from Racktables to Netbox Done

Define Netbox existing and custom fields usage standards/best practices
Switch over from Racktables to Netbox
Stretch: Investigate Netbox reporting capabilities to automatically validate data
Stretch: Investigate Netbox potential future integrations, towards a single source of truth To do

Status

Note: July 30, 2018

In progress

Note: August 2018

In progress

Note: September 11, 2018

In progress. A final proposal for custom fields usage standards/best practices is under discussion; work will continue (including switching to Netbox) after the data center switch from eqiad to codfw.

Q2 Goals

Outcome 2 / Output 3

Technical staff have increased visibility into the operation of our services and infrastructure.

Modernize logging, alerting and metrics monitoring infrastructure

Primary teams: SRE / Infrastructure Foundations

Goal(s)

Begin the implementation of Q1's Logging Infrastructure design

Procure and provision Logging pipeline hardware in multiple datacenters In progress
Migrate >=90% of existing Logstash traffic to the logging pipeline In progress
Onboard at least 10 new non-sensitive log producers to the logging pipeline In progress
Investigate approaches to ingest sensitive log producers To do
[stretch] Deprecate >= 50% of udp2log producers To do

Expand modern metrics infrastructure coverage

Plan and execute a new organization scheme for SRE Grafana dashboards In progress
Retire >= 80% of production Diamond collectors In progress
Provision >= 50% of statsd/Graphite-only metrics in Prometheus In progress

Status

Note: November 14, 2018

Updated goals for current status.

Note: December 12, 2018

The implementation of logging infrastructure is going well and is mostly still In progress; it is expected to be Done by the end of December. The stretch goals will be done in Q3.

Expanding the metrics infrastructure is going well; it is In progress and should be done by end of quarter.

Outcome 3 / Output 4

Wikimedia projects and content are protected against major disasters that threaten availability.

Strengthen backups with reliable and redundant backup infrastructure

Primary teams: SRE / Data Persistence

Goal(s)

Design and prepare infrastructure for database binary backups In progress
- Research options for producing binary backups (lvm snapshots, cold backups, mariabackup) Done
- Implement a proof of concept of a snapshot cycle automation for a mediawiki section database In progress
- Procure hardware for binary backups In progress

Status

Note: November 14, 2018

Updated goals for current status.

Note: December 12, 2018

This goal is progressing much more slowly than expected, for a variety of reasons, and will be completed in Q3.

Outcome 3 / Output 4 (Performance)

Wikimedia projects and content are protected against major disasters that threaten availability.

Primary teams: SRE / Data Persistence, Performance

Goal(s)

Test Performance implications of MySQL TLS connectivity in production, once ready (carried over from 1718Q4) To do
Start migrating watchlist last-view updates to hybrid stash/async-DB to avoid the huge rate of DB writes on page views To do

Status

Note: November 14, 2018

Updated goals for current status.

Note: December 12, 2018

TLS is still Stalled on DBA technology selection/implementation due to other work requirements that have higher priorities.

Watchlist is also Stalled due to emergent work and other work that has higher priorities; we hope to get it done in early Q3.

Outcome 4 / Output 6

Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.

Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Primary teams: SRE / Infrastructure Foundations

Goal(s)

Expand Spicerack library and SRE Cookbooks

Split and convert the existing wmf-auto-reimage-lib into Spicerack modules In progress
Convert wmf-auto-reimage scripts to Cookbooks In progress
Convert other wmf-* scripts to Cookbooks (e.g. decom, downtime, upgrade & reboot, upgrade Varnish) To do
Generate documentation for Spicerack To do

Expand Netbox usage

Upgrade Netbox to the latest version (>= 2.4) Done
Track additional categories of infrastructure topology information (e.g. VLANs, IP space, network circuits, etc.) Done
Explore Netbox/NAPALM integration to pull live data from network devices Done
Develop and deploy at least three Netbox reports to assist with data correctness and consistency In progress
[stretch] Add a Cumin backend for Netbox To do

Status

Note: November 14, 2018

The migration of logging to Logstash and metrics into Prometheus is In progress. Logstash hardware for the codfw data center is still being procured. Spicerack modules are being written and refactored with wmf-auto-reimage functionality. Netbox has been upgraded to a new version.

Note: December 12, 2018

Convert wmf-auto-reimage scripts to Cookbooks is In progress and will mostly be finished in Q3 due to holidays. The other two goals will start after the conversion is done.

Upgrade Netbox to the latest version is Done, but the stretch goal will mostly be tackled in Q3.

Q3 Goals

Outcome 1 / Output 1

Technical staff are able to deploy their changes to Production with confidence that their improvements have been tested in a credible staging environment to work according to expectation.

Create a staging cluster comparable to production infrastructure

Primary teams: SRE / Service Operations, Release Engineering

Goal(s)

First steps towards Canary Deployments

Introduce progressive rollouts to the mediawiki train
Introduce deployment run state in scap to keep track of successful scap runs
Investigate the use of versioning in MediaWiki, allowing scap to keep track of deployed revisions

Status

Note: April 8, 2019

This has been Postponed to Q4.

Outcome 2 / Output 3

Technical staff have increased visibility into the operation of our services and infrastructure.

Modernize logging, alerting and metrics monitoring infrastructure

Primary teams: SRE / Infrastructure Foundations

Goal(s)

Build an understanding of our needs around external monitoring services

Produce a short document with a cost/benefit analysis of our current external monitoring systems
Gather a set of requirements, desires, and likely technology choices for an external monitoring system, with a focus on achievability in a short timeframe (1-2 quarters)

Increase utilization of application logging pipeline

Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch
Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs (candidates: log4j, udp2log, syslog/syslog_tls etc.)
Retire udp2log: onboard its producers and consumers to the logging pipeline
[stretch] Implement sensitive log access control, onboard 3 sensitive log producers

Upgrade metrics monitoring infrastructure core components

Serve >= 50% of production Prometheus systems with Prometheus v2
Upgrade production prometheus-node-exporter to >= 0.16
[stretch] Investigate distributed and long term storage solutions for Prometheus
- Formulate requirements around aggregation, retention, hardware, etc.
- Evaluate M3 and Thanos

Status

Note: April 8, 2019

Build an understanding of our needs around external monitoring services is Partially done in Q3.
Increase utilization of application logging pipeline is Partially done: there is still work to be done on the 'Migrate at least 3 existing Logstash inputs' goal, and retiring udp2log and the stretch goal have been Postponed to Q4.

Outcome 3 / Output 4 (SRE / Data Persistence)

Wikimedia projects and content are protected against major disasters that threaten availability.

Strengthen backups with reliable and redundant backup infrastructure

Primary teams: SRE / Data Persistence

Goal(s)

Design and prepare infrastructure for database binary backups

Design a backup policy for logical and binary backups for both short term and long term storage
Procure and setup final hardware for binary backups
Fully implement binary backups and its rotation policy for all MediaWiki metadata and misc databases

Status

Note: April 8, 2019

The backup policy is Done, but hardware procurement and the implementation of binary backups have been Postponed to Q4.

Outcome 4 / Output 6

Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.

Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Primary teams: SRE / Infrastructure Foundations

Goal(s)

Build automated workflows for server provisioning

Take additional steps towards a "single source of truth" system (Netbox)
- Upgrade Netbox to v2.5 and use the new cable tracking feature
- Expose production VMs to Netbox and keep them synchronized with Ganeti
- Incorporate at least two more categories of data (server interfaces, server IPs, MAC addresses, network device IPs, management/OOB, etc.)
Redesign the server provisioning and decommissioning process to facilitate orchestration
- Add Netbox module to Spicerack and integrate it in the reimage and decom cookbooks
- Convert virtual machine creation script to a cookbook
- Reduce the number of manual steps involved in the provisioning process by at least 4

Status

Note: April 8, 2019

Both goals are In progress and will continue into Q4

Q4 Goals

Outcome 2 / Output 3

Technical staff have increased visibility into the operation of our services and infrastructure.

Modernize logging, alerting and metrics monitoring infrastructure

Primary teams: SRE / Infrastructure Foundations

Dependencies on:

Goal(s)

Logging

Deprecate all non-Kafka logstash inputs
[stretch] Implement sensitive log access control, onboard 3 sensitive log producers

Metrics

100% of Prometheus traffic served by Prometheus v2
Migrate all metrics originated by PoPs from statsd to Prometheus
Investigate distributed and long term storage solutions for Prometheus

Status

Note: May 8, 2019

Logging - deprecating non-Kafka is In progress, stretch goal is still To do
Metrics: 100% of Prometheus traffic served by Prometheus v2 is now Done! :)
Migrating the metrics and investigating the distributed storage solutions are In progress

Note: June 13, 2019

Logging: In progress, but might be pushed into next quarter along with the stretch goal.
Metrics: 100% of Prometheus traffic served by Prometheus v2 is Done; migrating all metrics is currently Blocked, but we should be able to resolve that by end of quarter; investigating long term storage is Partially done and will be completely done by end of quarter.

Outcome 3 / Output 4 (SRE / Data Persistence)

Wikimedia projects and content are protected against major disasters that threaten availability.

Strengthen backups with reliable and redundant backup infrastructure

Primary teams: SRE / Data Persistence

Goal(s)

Stretch: Setup and deploy backup hardware

Install and setup eqiad/codfw backups/recovery hosts
Install and setup dump slaves
Perform fine tuning of snapshot and dumps performance on final hardware
Decommission old backups hosts dbstore1001, dbstore2001 and dbstore2002

Status

Note: May 8, 2019

Install and setup of the backups/recovery hosts and dump slaves are Done and the rest is still In progress: fine tuning is ongoing and decommissioning of the old hosts will take place later.

Note: June 13, 2019

Install and setup eqiad/codfw backups/recovery hosts is Done
Install and setup dump slaves is Done
Perform fine tuning of snapshot is still In progress and will be done by end of quarter
Decommission old backups hosts is Blocked on time: we have to wait until the other work is done by end of quarter.

Outcome 4 / Output 6

Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.

Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Primary teams: SRE (Infrastructure Foundations, Data Persistence, Service Operations)

Goal(s)

Database workflows automation

Complete and deploy the tool for pooling/depooling databases dynamically from MediaWiki (dbconfig)
Migrate MediaWiki to use etcd for the database configuration in production
Write Spicerack abstractions for common database operations (pool/depool)
[stretch] Write Spicerack cookbooks to automate 2 common DBA workflows

Status

Note: May 8, 2019

This is fully In progress except for the stretch goal

Note: June 13, 2019

Complete and deploy the tool should be finished by the end of this quarter; the rest of this particular goal will go into next quarter.

Outcome 4 / Output 7

Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.

Provision a centralized, self-service identity and access management for privileged staff and volunteer accounts

Primary teams: SRE / Infrastructure Foundations

Dependencies on: Cloud Services, Security

Goal(s)

Developer account management

Audit production and WMCS infrastructure and document all authenticated services and their authentication & authorization capabilities
Engage with stakeholders and collect functional and non-functional requirements for identity and access management for web services
Evaluate free & open source Identity Management/SSO software solutions against our requirements and create a short list of 1-2
Build a migration plan from OpenStackManager and Striker towards a unified identity and access management system for developer accounts

Status

Note: May 8, 2019

Audit production and WMCS infrastructure and document is In progress and the others are awaiting its completion.

Note: June 13, 2019

Audit production and WMCS infrastructure is Done
Engage with stakeholders and collect functional and non-functional requirements is In progress and should be done by end of quarter
Evaluate free & open source Identity Management/SSO software solutions is Partially done
Build a migration plan is To do; the team met this week, so it should soon be In progress, but it will probably finish early next quarter.