Wikimedia Technology/Annual Plans/FY2019/TEC6: Address Infrastructure Gaps/Goals
Program Goals and Status for FY18/19
- Goal Owner: Mark Bergsma
- Program Goals for FY18/19: At the conclusion of this program, Zend PHP7 will be the only PHP runtime supported or used in the Wikimedia Foundation production environment.
- Annual Plan: TEC6 Address Infrastructure Gaps
- Primary Goal is Knowledge as a Service: Evolve our systems and structures
- Tech Goal: Sustaining
Outcome 2 / Output 3
Technical staff have increased visibility into the operation of our services and infrastructure.
- Modernize logging, alerting and metrics monitoring infrastructure
Dependencies on: Search Platform; Primary team: Infrastructure Foundations
Goal(s)
Adopt Logstash Done
- Review Logstash/Kibana's architecture and installation and identify next steps and gaps to be addressed.
- Audit log producers across the infrastructure and plan their transition to centralized logging.
- Investigate log shipping methods and standardize on them.
Status
Note: July 2018
- In progress
Note: August 14, 2018
- In progress
Note: September 11, 2018
- In progress: a comprehensive design document for logging has been prepared and is currently in final review.
Outcome 3 / Output 4
Wikimedia projects and content are protected against major disasters that threaten availability.
- Strengthen backups with reliable and redundant backup infrastructure
Dependencies on: Infrastructure Foundations; Primary teams: Data Persistence
Goal(s)
Monitor database backup generation for failure or incorrect generation Done
- Generate metrics and historic data about databases (objects, table and wiki sizes, growth over time, etc.)
- Detect and alert on backup metrics anomalies
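As an illustration of what anomaly detection on backup metrics could look like, the sketch below flags a backup whose size deviates sharply from the median of earlier backups. It is a minimal example under stated assumptions, not the software the team wrote: the function name `find_backup_anomalies`, the 25% tolerance, and the three-sample warm-up are all hypothetical choices.

```python
from statistics import median


def find_backup_anomalies(sizes_by_date, tolerance=0.25, min_history=3):
    """Return (date, size) pairs that deviate from the running baseline.

    sizes_by_date maps an ISO date string to a backup size (any unit).
    The baseline is the median of previously accepted backups; both the
    tolerance and min_history values are hypothetical defaults.
    """
    anomalies = []
    history = []
    for date, size in sorted(sizes_by_date.items()):
        if len(history) >= min_history:
            baseline = median(history)
            if baseline > 0 and abs(size - baseline) / baseline > tolerance:
                anomalies.append((date, size))
                continue  # keep the anomalous sample out of the baseline
        history.append(size)
    return anomalies


if __name__ == "__main__":
    sizes = {
        "2018-09-01": 100,
        "2018-09-02": 102,
        "2018-09-03": 101,
        "2018-09-04": 40,   # truncated or failed backup
        "2018-09-05": 103,
    }
    print(find_backup_anomalies(sizes))
```

Using the median rather than the mean keeps a single bad backup from shifting the baseline; a real alert would also need to account for expected long-term growth.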
Status
Note: July 30, 2018
- In progress
Note: August 14, 2018
- In progress
Note: September 11, 2018
- In progress: software to generate and track metrics for database backups has been written and will soon be used to set up alerts on backup anomalies.
Outcome 4 / Output 6
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
- Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management
Dependencies on: Traffic; Primary teams: Infrastructure Foundations, Data center operations
Goal(s)
Migrate the hardware inventory from Racktables to Netbox Done
- Define Netbox existing and custom fields usage standards/best practices
- Switch over from Racktables to Netbox
- Stretch: Investigate Netbox reporting capabilities to automatically validate data
- Stretch: Investigate Netbox potential future integrations, towards a single source of truth To do
Status
Note: July 30, 2018
- In progress
Note: August 2018
- In progress
Note: September 11, 2018
- In progress: a final proposal for custom field usage standards/best practices is under discussion; work (including the switch to Netbox) will continue after the data center switchover from eqiad to codfw.
Outcome 2 / Output 3
Technical staff have increased visibility into the operation of our services and infrastructure.
- Modernize logging, alerting and metrics monitoring infrastructure
Primary teams: SRE / Infrastructure Foundations
Goal(s)
Begin the implementation of Q1's Logging Infrastructure design
- Procure and provision Logging pipeline hardware in multiple datacenters In progress
- Migrate >=90% of existing Logstash traffic to the logging pipeline In progress
- Onboard at least 10 new non-sensitive log producers to the logging pipeline In progress
- Investigate approaches to ingest sensitive log producers To do
- [stretch] Deprecate >= 50% of udp2log producers To do
Expand modern metrics infrastructure coverage
- Plan and execute a new organization scheme for SRE Grafana dashboards In progress
- Retire >= 80% of production Diamond collectors In progress
- Provision >= 50% of statsd/Graphite-only metrics in Prometheus In progress
Status
Note: November 14, 2018
- updated goals for current status
Note: December 12, 2018
- The implementation of the logging infrastructure is going well; it is mostly still In progress and is expected to be Done by the end of December. The stretch goals will be done in Q3.
- Expanding the metrics infrastructure is going well; it is In progress and should be done by the end of the quarter.
Outcome 3 / Output 4
Wikimedia projects and content are protected against major disasters that threaten availability.
- Strengthen backups with reliable and redundant backup infrastructure
Primary teams: SRE / Data Persistence
Goal(s)
- Design and prepare infrastructure for database binary backups In progress
- Research options for producing binary backups (lvm snapshots, cold backups, mariabackup) Done
- Implement a proof of concept of a snapshot cycle automation for a mediawiki section database In progress
- Procure hardware for binary backups In progress
Status
Note: November 14, 2018
- updated goals for current status
Note: December 12, 2018
- This goal is progressing much more slowly than expected, for various reasons, and will be completed in Q3.
Outcome 3 / Output 4 (Performance)
Wikimedia projects and content are protected against major disasters that threaten availability.
Primary teams: SRE / Data Persistence, Performance
Goal(s)
- Test Performance implications of MySQL TLS connectivity in production, once ready (carried over from 1718Q4) To do
- Start migrating watchlist last-view updates to hybrid stash/async-DB to avoid the huge rate of DB writes on page views To do
Status
Note: November 14, 2018
- updated goals for current status
Note: December 12, 2018
- TLS is still Stalled on DBA technology selection/implementation because other work has higher priority.
- Watchlist is also Stalled because of emergent and other higher-priority work; we hope to get it done in early Q3.
Outcome 4 / Output 6
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
- Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management
Primary teams: SRE / Infrastructure Foundations
Goal(s)
Expand Spicerack library and SRE Cookbooks
- Split and convert the existing wmf-auto-reimage-lib into Spicerack modules In progress
- Convert wmf-auto-reimage scripts to Cookbooks In progress
- Convert other wmf-* scripts to Cookbooks (e.g. decom, downtime, upgrade & reboot, upgrade Varnish) To do
- Generate documentation for Spicerack To do
Expand Netbox usage
- Upgrade Netbox to the latest version (>= 2.4) Done
- Track additional categories of infrastructure topology information (e.g. VLANs, IP space, network circuits, etc.) Done
- Explore Netbox/NAPALM integration to pull live data from network devices Done
- Develop and deploy at least three Netbox reports to assist with data correctness and consistency In progress
- [stretch] Add a Cumin backend for Netbox To do
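To make the idea of reports for data correctness and consistency concrete, here is a small standalone sketch of the kind of checks such a report might run. It deliberately does not use Netbox's own report API: devices are plain dictionaries, and the field names (`name`, `serial`, `status`, `rack`) and the rules themselves are assumptions chosen for illustration.

```python
from collections import Counter


def check_inventory(devices):
    """Return a list of human-readable problems found in device records.

    Each device is a dict with hypothetical keys: name, serial, status, rack.
    """
    problems = []
    # Count serial numbers so duplicates can be reported per device.
    serial_counts = Counter(d["serial"] for d in devices if d.get("serial"))
    for device in devices:
        name = device.get("name", "<unnamed>")
        serial = device.get("serial")
        if not serial:
            problems.append(f"{name}: missing serial")
        elif serial_counts[serial] > 1:
            problems.append(f"{name}: duplicate serial {serial}")
        # Assumed rule: a device marked active must be placed in a rack.
        if device.get("status") == "active" and not device.get("rack"):
            problems.append(f"{name}: active device without a rack")
    return problems


if __name__ == "__main__":
    sample = [
        {"name": "db1001", "serial": "ABC123", "status": "active", "rack": "A1"},
        {"name": "db1002", "serial": "ABC123", "status": "active", "rack": None},
        {"name": "spare01", "serial": "", "status": "spare", "rack": None},
    ]
    for problem in check_inventory(sample):
        print(problem)
```

Returning a list of messages rather than raising on the first failure lets one run surface every inconsistency at once, which suits a periodic report.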
Status
Note: November 14, 2018
- The migration of logging to Logstash and metrics into Prometheus is In progress. Logstash hardware for the codfw data center is still being procured. Spicerack modules are being written and refactored with wmf-auto-reimage functionality. Netbox has been upgraded to a new version.
Note: December 12, 2018
- Convert wmf-auto-reimage scripts to Cookbooks is In progress and will mostly be finished in Q3 due to holidays. The other two goals will start after the conversion is done.
- Upgrade Netbox to the latest version is Done, but the stretch goal will mostly be tackled in Q3.
Outcome 1 / Output 1
Technical staff are able to deploy their changes to Production with confidence that their improvements have been tested in a credible staging environment to work according to expectation.
- Create a staging cluster comparable to production infrastructure
Primary teams: SRE / Service Operations, Release Engineering
Goal(s)
First steps towards Canary Deployments
- Introduce progressive rollouts to the mediawiki train
- Introduce deployment run state in scap to keep track of successful scap runs
- Investigate the use of versioning in MediaWiki, allowing scap to keep track of deployed revisions
Status
Note: April 8, 2019
- This has been Postponed to Q4
Outcome 2 / Output 3
Technical staff have increased visibility into the operation of our services and infrastructure.
- Modernize logging, alerting and metrics monitoring infrastructure
Primary teams: SRE / Infrastructure Foundations
Goal(s)
Build an understanding of our needs around external monitoring services
- Produce a short document with a cost/benefit analysis of our current external monitoring systems
- Gather a set of requirements, desires, and likely technology choices for an external monitoring system, with a focus on achievability in a short timeframe (1-2 quarters)
Increase utilization of application logging pipeline
- Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch
- Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs (candidates: log4j, udp2log, syslog/syslog_tls etc.)
- Retire udp2log: onboard its producers and consumers to the logging pipeline
- [stretch] Implement sensitive log access control, onboard 3 sensitive log producers
Upgrade metrics monitoring infrastructure core components
- Serve >= 50% of production Prometheus systems with Prometheus v2
- Upgrade production prometheus-node-exporter to >= 0.16
- [stretch] Investigate distributed and long term storage solutions for Prometheus
Status
Note: April 8, 2019
- Build an understanding of our needs around external monitoring services is Partially done in Q3
- Increase utilization of application logging pipeline is Partially done: work remains on the 'Migrate at least 3 existing Logstash inputs' goal, and retiring udp2log and the stretch goal have been Postponed to Q4
Outcome 3 / Output 4 (SRE / Data Persistence)
Wikimedia projects and content are protected against major disasters that threaten availability.
- Strengthen backups with reliable and redundant backup infrastructure
Primary teams: SRE / Data Persistence
Goal(s)
Design and prepare infrastructure for database binary backups
- Design a backup policy for logical and binary backups for both short term and long term storage
- Procure and setup final hardware for binary backups
- Fully implement binary backups and its rotation policy for all MediaWiki metadata and misc databases
Status
Note: April 8, 2019
- The backup policy is Done, but hardware procurement and the implementation of binary backups have been Postponed to Q4
Outcome 4 / Output 6
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
- Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management
Primary teams: SRE / Infrastructure Foundations
Goal(s)
Build automated workflows for server provisioning
- Take additional steps towards a "single source of truth" system (Netbox)
- Upgrade Netbox to v2.5 and use the new cable tracking feature
- Expose production VMs to Netbox and keep them synchronized with Ganeti
- Incorporate at least two more categories of data (server interfaces, server IPs, MAC addresses, network device IPs, management/OOB, etc.)
- Redesign the server provisioning and decommissioning process to facilitate orchestration
- Add Netbox module to Spicerack and integrate it in the reimage and decom cookbooks
- Convert virtual machine creation script to a cookbook
- Reduce the number of manual steps involved in the provisioning process by at least 4
Status
Note: April 8, 2019
- Both goals are In progress and will continue into Q4
Outcome 2 / Output 3
Technical staff have increased visibility into the operation of our services and infrastructure.
- Modernize logging, alerting and metrics monitoring infrastructure
Primary teams: SRE / Infrastructure Foundations
Dependencies on:
Goal(s)
Logging
- Deprecate all non-Kafka logstash inputs
- [stretch] Implement sensitive log access control, onboard 3 sensitive log producers
Metrics
- 100% of Prometheus traffic served by Prometheus v2
- Migrate all metrics originated by PoPs from statsd to Prometheus
- Investigate distributed and long term storage solutions for Prometheus
Status
Note: May 8, 2019
- Logging - deprecating non-Kafka is In progress, stretch goal is still To do
- Metrics: 100% of Prometheus traffic served by Prometheus v2 is now Done! :)
- Migrating the metrics and investigating the distributed storage solutions are In progress
Note: June 13, 2019
- Logging: In progress, but might be pushed into next quarter along with the stretch goal.
- Metrics: 100% of Prometheus traffic served by Prometheus v2 is Done; migrating all metrics is currently Blocked, but the blocker should be resolved by the end of the quarter; investigating long term storage is Partially done and will be completed by the end of the quarter.
Outcome 3 / Output 4 (SRE / Data Persistence)
Wikimedia projects and content are protected against major disasters that threaten availability.
- Strengthen backups with reliable and redundant backup infrastructure
Primary teams: SRE / Data Persistence
Goal(s)
Stretch: Setup and deploy backup hardware
- Install and setup eqiad/codfw backups/recovery hosts
- Install and setup dump slaves
- Perform fine tuning of snapshot and dumps performance on final hardware
- Decommission old backups hosts dbstore1001, dbstore2001 and dbstore2002
Status
Note: May 8, 2019
- Installing and setting up the backup hosts and dump slaves is Done; the rest is still In progress: fine tuning is ongoing, and the decommissioning of the old hosts will take place later.
Note: June 13, 2019
- Install and setup eqiad/codfw backups/recovery hosts is Done
- Install and setup dump slaves is Done
- Perform fine tuning of snapshot is still In progress and will be done by end of quarter
- Decommission old backups hosts is Blocked on time - we have to wait until the other work is done by end of quarter.
Outcome 4 / Output 6
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
- Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management
Primary teams: SRE (Infrastructure Foundations, Data Persistence, Service Operations)
Goal(s)
Database workflows automation
- Complete and deploy the tool for pooling/depooling databases dynamically from MediaWiki (dbconfig)
- Migrate MediaWiki to use etcd for the database configuration in production
- Write Spicerack abstractions for common database operations (pool/depool)
- [stretch] Write Spicerack cookbooks to automate 2 common DBA workflows
Status
Note: May 8, 2019
- This is fully In progress except for the stretch goal
Note: June 13, 2019
- Complete and deploy the tool should be finished by the end of this quarter; the rest of this goal will carry over into next quarter.
Outcome 4 / Output 7
Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.
- Provision a centralized, self-service identity and access management for privileged staff and volunteer accounts
Primary teams: SRE / Infrastructure Foundations
Dependencies on: Cloud Services, Security
Goal(s)
Developer account management
- Audit production and WMCS infrastructure and document all authenticated services and their authentication & authorization capabilities
- Engage with stakeholders and collect functional and non-functional requirements for identity and access management for web services
- Evaluate free & open source Identity Management/SSO software solutions against our requirements and create a short list of 1-2
- Build a migration plan from OpenStackManager and Striker towards a unified identity and access management system for developer accounts
Status
Note: May 8, 2019
- Audit production and WMCS infrastructure and document is In progress and the others are awaiting its completion.
Note: June 13, 2019
- Audit production and WMCS infrastructure is Done
- Engage with stakeholders and collect functional and non-functional requirements is In progress and should be done by end of quarter
- Evaluate free & open source Identity Management/SSO software solutions is Partially done
- Build a migration plan is To do; the team met this week, so it should soon be In progress, but it will probably finish early next quarter.