Wikimedia Technology/Annual Plans/FY2019/TEC6: Address Infrastructure Gaps/Goals

From mediawiki.org

Program Goals and Status for FY18/19

  • Goal Owner: Mark Bergsma
  • Program Goals for FY18/19: At the conclusion of this program, Zend PHP7 will be the only PHP runtime supported or used in the Wikimedia Foundation production environment.
  • Annual Plan: TEC6 Address Infrastructure Gaps

Outcome 2 / Output 3

Technical staff have increased visibility into the operation of our services and infrastructure.

Modernize logging, alerting and metrics monitoring infrastructure

Dependencies on: Search Platform; Primary team: Infrastructure Foundations

Goal(s)

Adopt Logstash: Done

  • Review Logstash/Kibana's architecture and installation and identify next steps and gaps to be addressed.
  • Audit log producers across the infrastructure and plan their transition to centralized logging.
  • Investigate log shipping methods and standardize on them.

Status

Note: July 2018

In progress

Note: August 14, 2018

In progress

Note: September 11, 2018

In progress: A comprehensive design document for logging has been prepared and is currently in final review.


Outcome 3 / Output 4

Wikimedia projects and content are protected against major disasters that threaten availability.

Strengthen backups with reliable and redundant backup infrastructure

Dependencies on: Infrastructure Foundations; Primary teams: Data Persistence

Goal(s)

Monitor database backup generation for failure or incorrect generation: Done

  • Generate metrics and historic data about databases (objects, table and wiki sizes, growth over time, etc.)
  • Detect and alert on backup metrics anomalies
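
As a rough illustration of the second bullet, the sketch below flags a backup whose size is missing or deviates strongly from the median of earlier runs. It is a minimal sketch under stated assumptions, not the software the team wrote: the function name `find_backup_anomalies`, the input shapes, the section names and the 20% tolerance are all hypothetical.

```python
from statistics import median


def find_backup_anomalies(history, latest, tolerance=0.2):
    """Return (section, reason) pairs for backups that look wrong.

    history: dict mapping a section name to a list of previous backup sizes (bytes).
    latest: dict mapping a section name to the newest backup size; a missing
            key or None is treated as a failed or absent backup.
    tolerance: allowed relative deviation from the median of previous sizes
               (hypothetical default, chosen only for illustration).
    """
    anomalies = []
    for section, sizes in history.items():
        size = latest.get(section)
        if size is None:
            anomalies.append((section, "missing"))
            continue
        if not sizes:
            continue  # no earlier backups to compare against
        baseline = median(sizes)
        if baseline > 0 and abs(size - baseline) / baseline > tolerance:
            anomalies.append((section, "size deviates from baseline"))
    return anomalies


# Hypothetical sample data: s1 grows slightly, s2 shrinks sharply.
history = {"s1": [100, 102, 104], "s2": [50, 51, 52]}
latest = {"s1": 105, "s2": 20}
print(find_backup_anomalies(history, latest))
```

A real check would presumably also consider object counts and growth over time, as the first bullet suggests; size alone is used here to keep the example short.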

Status

Note: July 30, 2018

In progress

Note: August 14, 2018

In progress

Note: September 11, 2018

In progress: Software to generate and track metrics for database backups has been written and will soon be used to set up alerts on backup anomalies.


Outcome 4 / Output 6

Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.

Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Dependencies on: Traffic; Primary teams: Infrastructure Foundations, Data center operations

Goal(s)

Migrate the hardware inventory from Racktables to Netbox: Done

  • Define Netbox existing and custom fields usage standards/best practices
  • Switch over from Racktables to Netbox
  • Stretch: Investigate Netbox reporting capabilities to automatically validate data
  • Stretch: Investigate Netbox potential future integrations, towards a single source of truth: To do

Status

Note: July 30, 2018

In progress

Note: August 2018

In progress

Note: September 11, 2018

In progress: A final proposal for custom fields usage standards/best practices is under discussion; work (including the switch to Netbox) will continue after the data center switch from eqiad to codfw.

Outcome 2 / Output 3

Technical staff have increased visibility into the operation of our services and infrastructure.

Modernize logging, alerting and metrics monitoring infrastructure

Primary teams: SRE / Infrastructure Foundations

Goal(s)

Begin the implementation of Q1's Logging Infrastructure design

  • Procure and provision Logging pipeline hardware in multiple datacenters: In progress
  • Migrate >=90% of existing Logstash traffic to the logging pipeline: In progress
  • Onboard at least 10 new non-sensitive log producers to the logging pipeline: In progress
  • Investigate approaches to ingest sensitive log producers: To do
  • [stretch] Deprecate >= 50% of udp2log producers: To do

Expand modern metrics infrastructure coverage

Status

Note: November 14, 2018

Updated goals to reflect current status.

Note: December 12, 2018

The implementation of the logging infrastructure is going well. Most of it is still In progress and is expected to be Done by the end of December. The stretch goals will be done in Q3.
Expanding the metrics infrastructure is going well; it is In progress and should be done by the end of the quarter.


Outcome 3 / Output 4

Wikimedia projects and content are protected against major disasters that threaten availability.

Strengthen backups with reliable and redundant backup infrastructure

Primary teams: SRE / Data Persistence

Goal(s)

  • Design and prepare infrastructure for database binary backups: In progress
    • Research options for producing binary backups (lvm snapshots, cold backups, mariabackup): Done
    • Implement a proof of concept of a snapshot cycle automation for a mediawiki section database: In progress
    • Procure hardware for binary backups: In progress

Status

Note: November 14, 2018

Updated goals to reflect current status.

Note: December 12, 2018

This goal is progressing much more slowly than expected, for various reasons, and will be completed in Q3.


Outcome 3 / Output 4 (Performance)

Wikimedia projects and content are protected against major disasters that threaten availability.

Primary teams: SRE / Data Persistence, Performance

Goal(s)

  • Test Performance implications of MySQL TLS connectivity in production, once ready (carried over from 1718Q4): To do
  • Start migrating watchlist last-view updates to hybrid stash/async-DB to avoid the huge rate of DB writes on page views: To do

Status

Note: November 14, 2018

Updated goals to reflect current status.

Note: December 12, 2018

TLS is still Stalled on DBA technology selection/implementation because other work has higher priority.
Watchlist is also Stalled due to emergent work and other higher-priority work; we hope to get it done in early Q3.


Outcome 4 / Output 6

Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.

Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Primary teams: SRE / Infrastructure Foundations

Goal(s)

Expand Spicerack library and SRE Cookbooks

  • Split and convert the existing wmf-auto-reimage-lib into Spicerack modules: In progress
  • Convert wmf-auto-reimage scripts to Cookbooks: In progress
  • Convert other wmf-* scripts to Cookbooks (e.g. decom, downtime, upgrade & reboot, upgrade Varnish): To do
  • Generate documentation for Spicerack: To do

Expand Netbox usage

  • Upgrade Netbox to the latest version (>= 2.4): Done
  • Track additional categories of infrastructure topology information (e.g. VLANs, IP space, network circuits, etc.): Done
  • Explore Netbox/NAPALM integration to pull live data from network devices: Done
  • Develop and deploy at least three Netbox reports to assist with data correctness and consistency: In progress
  • [stretch] Add a Cumin backend for Netbox: To do

Status

Note: November 14, 2018

The migration of logging to Logstash and of metrics into Prometheus is In progress. Logstash hardware for the codfw data center is still being procured. Spicerack modules are being written and refactored with wmf-auto-reimage functionality. Netbox has been upgraded to a new version.

Note: December 12, 2018

Convert wmf-auto-reimage scripts to Cookbooks is In progress and will mostly be finished in Q3 due to holidays. The other two goals will start after the conversion is done.
Upgrade Netbox to the latest version is Done, but the stretch goal will mostly be tackled in Q3.


Outcome 1 / Output 1

Technical staff are able to deploy their changes to Production with confidence that their improvements have been tested in a credible staging environment to work according to expectation.

Create a staging cluster comparable to production infrastructure

Primary teams: SRE / Service Operations, Release Engineering

Goal(s)

First steps towards Canary Deployments

  • Introduce progressive rollouts to the mediawiki train
  • Introduce deployment run state in scap to keep track of successful scap runs
  • Investigate the use of versioning in MediaWiki, allowing scap to keep track of deployed revisions

Status

Note: April 8, 2019

  • This has been Postponed to Q4


Outcome 2 / Output 3

Technical staff have increased visibility into the operation of our services and infrastructure.

Modernize logging, alerting and metrics monitoring infrastructure

Primary teams: SRE / Infrastructure Foundations

Goal(s)

Build an understanding of our needs around external monitoring services

  • Produce a short document with a cost/benefit analysis of our current external monitoring systems
  • Gather a set of requirements, desires, and likely technology choices for an external monitoring system, with a focus on achievability in a short timeframe (1-2 quarters)

Increase utilization of application logging pipeline

  • Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch
  • Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs (candidates: log4j, udp2log, syslog/syslog_tls etc.)
  • Retire udp2log: onboard its producers and consumers to the logging pipeline
  • [stretch] Implement sensitive log access control, onboard 3 sensitive log producers

Upgrade metrics monitoring infrastructure core components

  • Serve >= 50% of production Prometheus systems with Prometheus v2
  • Upgrade production prometheus-node-exporter to >= 0.16
  • [stretch] Investigate distributed and long term storage solutions for Prometheus
    • Formulate requirements around aggregation, retention, hardware, etc.
    • Evaluate M3 and Thanos

Status

Note: April 8, 2019

  • Build an understanding of our needs around external monitoring services is Partially done in Q3
  • Increase utilization of application logging pipeline is Partially done: there is still work to be done on the 'Migrate at least 3 existing Logstash' goal (Partially done), and retiring udp2log and the stretch goal have been Postponed to Q4


Outcome 3 / Output 4 (SRE / Data Persistence)

Wikimedia projects and content are protected against major disasters that threaten availability.

Strengthen backups with reliable and redundant backup infrastructure

Primary teams: SRE / Data Persistence

Goal(s)

Design and prepare infrastructure for database binary backups

  • Design a backup policy for logical and binary backups for both short term and long term storage
  • Procure and setup final hardware for binary backups
  • Fully implement binary backups and its rotation policy for all MediaWiki metadata and misc databases

Status

Note: April 8, 2019

  • The backup policy is Done, but hardware procurement and implementation have been Postponed to Q4


Outcome 4 / Output 6

Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.

Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Primary teams: SRE / Infrastructure Foundations

Goal(s)

Build automated workflows for server provisioning

  • Take additional steps towards a "single source of truth" system (Netbox)
    • Upgrade Netbox to v2.5 and use the new cable tracking feature
    • Expose production VMs to Netbox and keep them synchronized with Ganeti
    • Incorporate at least two more categories of data (server interfaces, server IPs, MAC addresses, network device IPs, management/OOB, etc.)
  • Redesign the server provisioning and decommissioning process to facilitate orchestration
    • Add Netbox module to Spicerack and integrate it in the reimage and decom cookbooks
    • Convert virtual machine creation script to a cookbook
    • Reduce the number of manual steps involved in the provisioning process by at least 4

Status

Note: April 8, 2019

  • Both goals are In progress and will continue into Q4

Outcome 2 / Output 3

Technical staff have increased visibility into the operation of our services and infrastructure.

Modernize logging, alerting and metrics monitoring infrastructure

Primary teams: SRE / Infrastructure Foundations

Dependencies on:

Goal(s)

Logging

  • Deprecate all non-Kafka logstash inputs
  • [stretch] Implement sensitive log access control, onboard 3 sensitive log producers

Metrics

  • 100% of Prometheus traffic served by Prometheus v2
  • Migrate all metrics originated by PoPs from statsd to Prometheus
  • Investigate distributed and long term storage solutions for Prometheus

Status

Note: May 8, 2019

  • Logging: deprecating non-Kafka inputs is In progress; the stretch goal is still To do
  • Metrics: 100% of Prometheus traffic served by Prometheus v2 is now Done! :)
  • Migrating the metrics and investigating the distributed storage solutions are In progress

Note: June 13, 2019

  • Logging: In progress, but it might be pushed into next quarter along with the stretch goal.
  • Metrics: 100% of Prometheus is Done; migrating all metrics is currently Blocked, but this should be resolved by the end of the quarter; investigating long term storage is Partially done and will be completed by the end of the quarter.


Outcome 3 / Output 4 (SRE / Data Persistence)

Wikimedia projects and content are protected against major disasters that threaten availability.

Strengthen backups with reliable and redundant backup infrastructure

Primary teams: SRE / Data Persistence

Goal(s)

Stretch: Setup and deploy backup hardware

  • Install and setup eqiad/codfw backups/recovery hosts
  • Install and setup dump slaves
  • Perform fine tuning of snapshot and dumps performance on final hardware
  • Decommission old backups hosts dbstore1001, dbstore2001 and dbstore2002

Status

Note: May 8, 2019

  • Installing and setting up the backup hosts and dump slaves is Done and the rest is still In progress; fine tuning is ongoing and removal of the old hosts will take place later.

Note: June 13, 2019

  • Install and setup eqiad/codfw backups/recovery hosts is Done
  • Install and setup dump slaves is Done
  • Perform fine tuning of snapshot is still In progress and will be done by the end of the quarter
  • Decommission old backups hosts is Blocked on time: we have to wait until the other work is done by the end of the quarter.


Outcome 4 / Output 6

Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.

Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Primary teams: SRE (Infrastructure Foundations, Data Persistence, Service Operations)

Goal(s)

Database workflows automation

  • Complete and deploy the tool for pooling/depooling databases dynamically from MediaWiki (dbconfig)
  • Migrate MediaWiki to use etcd for the database configuration in production
  • Write Spicerack abstractions for common database operations (pool/depool)
  • [stretch] Write Spicerack cookbooks to automate 2 common DBA workflows

Status

Note: May 8, 2019

  • All of this is In progress except for the stretch goal

Note: June 13, 2019

  • Complete and deploy the tool should be finished by the end of this quarter; the rest of this goal will move into next quarter.


Outcome 4 / Output 7

Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.

Provision a centralized, self-service identity and access management for privileged staff and volunteer accounts

Primary teams: SRE / Infrastructure Foundations

Dependencies on: Cloud Services, Security

Goal(s)

Developer account management

  • Audit production and WMCS infrastructure and document all authenticated services and their authentication & authorization capabilities
  • Engage with stakeholders and collect functional and non-functional requirements for identity and access management for web services
  • Evaluate free & open source Identity Management/SSO software solutions against our requirements and create a short list of 1-2
  • Build a migration plan from OpenStackManager and Striker towards a unified identity and access management system for developer accounts

Status

Note: May 8, 2019

  • Audit production and WMCS infrastructure and document is In progress and the others are awaiting its completion.

Note: June 13, 2019

  • Audit production and WMCS infrastructure is Done
  • Engage with stakeholders and collect functional and non-functional requirements is In progress and should be done by the end of the quarter
  • Evaluate free & open source Identity Management/SSO software solutions is Partially done
  • Build a migration plan is To do; the team met this week, so it should soon be In progress, but it will probably finish early next quarter.