Wikimedia Technology/Annual Plans/ERF OKR: Platform Excellence: Resilience

Platform Excellence: Resilience

FY21/22 Organization Efficacy & Resilience OKR for Wikimedia Technology Department

Accountable: Faidon Liambotis

OKR Overview

Our services, infrastructure and data are resilient to and/or quick to recover from unexpected malicious or nonmalicious events

Key Result 1
Wikimedia's infrastructure is scaled to address known compute, storage and traffic capacity risks, by adding a new data center in EMEA (by end of Q1), expanding our main data center by at least 20% (by end of Q2), and by documenting two new capacity plans (by end of Q4)

Key Result 2
Service and security operational issues are detected, escalated, remediated and communicated to stakeholders and the movement, as measured by a 20% incident score improvement

Key Result 3
Security and privacy services are enterprise wide, centrally coordinated, scalable and resilient in a way that empowers all users to make good security and privacy decisions, measured by a 10% increase in consumption of consultation services and a 30% decrease in operational services


Objective Rationale

As part of the Foundation's Efficacy and Resilience Framework, this objective is intended to capture the programmatic work that goes into improving the resilience of our services, processes, infrastructure and data. In many ways, this objective is a continuation of our Front-Line Defenses objective from FY20-21; it builds upon that work, extending its scope to cover resilience concerns, including reliability and security, for malicious and nonmalicious events alike.

The activities to achieve this objective will happen across our technical stack, in the infrastructure and software we build, and also in process and documentation improvements, cross-training, consulting and advocacy.

The work is necessary for the Foundation to evolve to meet the demands of a changing landscape, and specifically:

  • The emergence of new use cases (e.g. machine learning & data engineering) or traffic patterns, including ones that result in increased usage of our resources;
  • Technologies in the industry evolving, requiring us to respond to new challenges and threats and to adjust the way we work;
  • Growth of our engineering organization, requiring us to adjust for our size;
  • Past investments (such as capital investments, large infrastructure contracts, or technical implementation choices) reaching the end of their runway, requiring extended or renewed commitments for us to sustain our current pace of growth.

Three concrete Key Results are envisioned for this Objective. However, the Objective is purposefully broad in nature: additional OKRs at various levels of the organization are expected to align to it on a quarter-by-quarter basis, in service of that broader goal.

The objective meets the budgetary guardrails necessary to surface it as one of the top priorities of the department and organization: it requires work from multiple teams and at least 15 FTEs, with an OpEx budget of over $1M.


Key Result 1: Address capacity risks

Wikimedia's infrastructure is scaled to address known compute, storage and traffic capacity risks, by adding a new data center in EMEA (by end of Q1), expanding our main data center by at least 20% (by end of Q2), and by documenting two new capacity plans (by end of Q4)

Intent and Desired Outcomes
This is a multi-faceted KR, attempting to "move the needle" on multiple fronts.

The first part is a continuation of the FY20-21 Front-line Defenses Key Result around a data center expansion for our EMEA service region (originally envisioned to complete in FY21-22 Q1). It captures the addition of a site in Marseille, France, to increase our resilience against failures. Our Amsterdam location (“esams”) has grown to serve half of our traffic, making it “too big to fail” or to be “drained” for scheduled maintenance without the potential for cascading failures in other sites and the infrastructure as a whole, or performance degradation in all of our regions. Marseille is also strategically located in a network hub that is not only well connected to our backbone network and US sites, and reasonably priced, but also uniquely placed where multiple submarine communication cables interconnecting Europe, the Middle East and Africa land. Therefore, while the primary impact is envisioned to be increased resiliency for our infrastructure, we expect this key result will also contribute to improved website performance for users in North Africa and the Middle East.

The second part is the growth of one of our main data centers, specifically our largest one in Ashburn. As the Foundation has steadily grown its use cases and overall footprint, the utilization of that site has increased to the point where the data center has only a few months of runway before space runs out entirely, and the effects of space constraints are already visible today for some of our use cases. Given that data center contracts require an upfront investment with fixed bootstrapping costs, we envision a growth step of at least 20%, with power usage ramping up to 50% over the coming year.

Finally, the third part of this KR is about building a culture of structured capacity planning, to support efforts to predict future growth and provide inflection points for future decision making. This can be a complex endeavour, at the heart of the SRE practice; we envision this KR delivering two capacity plans as pilots during this fiscal year, to help inform the FY22-23 annual planning process.
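
As a purely illustrative sketch of the kind of arithmetic such a capacity plan pilot might formalize (not an actual Foundation plan; the figures and function name below are invented), a simple linear projection can translate current usage and observed growth into remaining runway:

  # Illustrative only: linear projection of resource usage to estimate runway.
  # All figures are invented; a real capacity plan would use measured trends.

  def months_of_runway(current_used: float, capacity: float, growth_per_month: float) -> float:
      """Months until usage reaches capacity, assuming linear growth."""
      if growth_per_month <= 0:
          return float("inf")  # flat or shrinking usage never exhausts capacity
      return (capacity - current_used) / growth_per_month

  # Example: 400 of 500 rack units in use, growing by 10 units/month -> 10.0 months left.
  print(months_of_runway(current_used=400, capacity=500, growth_per_month=10))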


Definitions and Scoping

  • SRE – Site Reliability Engineering, an engineering discipline and a Foundation team comprising engineers & managers specializing in the discipline and responsible for the Foundation’s site reliability
  • Data center – large facilities where servers are hosted. In this context, we are referring to leased space, power & cooling in secured spaces within a vendor’s larger facility.
  • EMEA – Europe, Middle East and Africa
  • FTE – Full-time Equivalent


Related Quarterly OKRs

  • Add a new data center in the EMEA region (Marseille) [Q1-Q2]
  • Expand our main data center (eqiad) by at least 20% [Q1-Q2]
  • Document two capacity plans [Q1-Q3]


Activities and Deliverables

  • EMEA data center (“drmrs”)
    • Contract & hardware procurement
    • Physical deployment (buildout)
    • Network design and deployment
    • Traffic edge site deployment and turn-up
  • Main data center (“eqiad”) expansion
    • Contract & hardware procurement
    • Physical deployment (buildout)
    • Network design and deployment
  • Capacity plan pilots


Resourcing

Activity | Responsible | Accountable | Consulted | Informed
EMEA data center: Contract & hardware procurement | Data Center Operations team | Willy Pao | Contracts, Purchasing, FP&A | Infrastructure Foundations team
EMEA data center: Physical deployment | Data Center Operations team | Willy Pao | Traffic team, Infrastructure Foundations team | SRE organization
EMEA data center: Network design & deployment | Infrastructure Foundations team | Joanna Borun | Traffic team | SRE organization
EMEA data center: Traffic edge site deployment and turn-up | Traffic team | Mark Bergsma | Performance team, Infrastructure Foundations team | The world
Main data center expansion: Contract & hardware procurement | Data Center Operations team | Willy Pao | Traffic team, Infrastructure Foundations team | SRE organization
Main data center expansion: Physical deployment | Data Center Operations team | Willy Pao | Infrastructure Foundations team | SRE organization
Main data center expansion: Network design and deployment | Infrastructure Foundations team | Joanna Borun | Traffic team | SRE organization
Capacity plan pilots | To be selected later | Lukasz Sobanski | SRE organization | Budget owners for SRE & APP delegates


Key Result 2: Incident management

Service and security operational issues are detected, escalated, remediated and communicated to stakeholders and the movement, as measured by a 20% incident score improvement

Intent and Desired Outcomes
As the Site Reliability Engineering (SRE) organization grows and evolves, some practices require maturing and polishing. Stability is a critical aspect of steady and predictable operations. Systems at the Foundation are generally stable, and the incident count is arguably low, considering Wikimedia’s traffic scale; however, there is always room for improvement.

In 2020, the ONFIRE working group, alongside the Foundation’s SRE Observability team, put together an internal Incident Management Survey. The survey received about 20 responses from engineers, indicating a need for improvement in the processes, people and tooling around the overall incident management practice. This bottom-up feedback has been combined with top-down requirements.

The result was a set of broad directional practices intended to bolster our overall incident management:

  • Transparency & communication: a better experience for our users and communities, achieved by openly sharing relevant information about impactful events, expectations for recovery, impact to the movement, and our actions towards remediation, evolving and enhancing our current practices (“look at Phabricator or Wikitech” is not sufficient).
  • Efficiency & sustainability: as the organization has grown, practices that worked in the past ("all hands on deck" paging for every incident) are, at the current team size, not just inefficient but also counterproductive.
  • Availability & scalability: efforts made in service of incident management should result in a long-term reduction in the number and severity of incidents, and a reduction in time to recover.
  • Equity & fairness: preventing disproportionate impact on the individuals who have to respond to incidents around the clock, promoting fairness and equity, in alignment with our “Resilient and Inclusive Foundation” organizational goal.

Applying these directional practices, the intent is for progress in FY21-22 to result in the following desired outcomes:

  • Engineers, including but not limited to Site Reliability Engineers, hold the necessary knowledge and skills to engage effectively in any incident.
  • An incident management process that is well documented and understood by everyone in the technology organization.
  • Structured incident documentation with detailed event categorization, severity, impact, and other relevant metrics or Incident Artifacts.
  • Clear guidelines for communications and escalation during incidents.
  • Adequate tooling to communicate, respond and engage effectively during incidents.

To measure progress on all of these fronts, the intent is for work primarily in Q1 and Q2 to construct an incident scorecard, to be applied to every major incident, measuring the team’s incident response and engagement score. Through the activities that have already been envisioned and described below, as well as other activities to be devised throughout the course of FY21-22, we expect the score to improve by 20% by the end of the fiscal year.
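
Since the scorecard itself will only be defined in Q1 and Q2, the following is a hypothetical sketch of how weighted criteria could roll up into a per-incident score, and how a 20% improvement over a baseline average would be computed; the criteria, weights and figures are invented for illustration and do not represent the actual scorecard.

  # Hypothetical scorecard sketch; criteria, weights and figures are invented.
  CRITERIA_WEIGHTS = {
      "detected_by_monitoring": 0.25,   # detected by alerting rather than user reports
      "escalated_within_policy": 0.25,  # paging/escalation path followed
      "stakeholders_updated": 0.25,     # status communicated during the incident
      "review_published": 0.25,         # incident review ("post-mortem") written
  }

  def incident_score(criteria_met: dict) -> float:
      """Weighted score in [0, 1] for a single incident."""
      return sum(weight for name, weight in CRITERIA_WEIGHTS.items() if criteria_met.get(name))

  def relative_improvement(baseline_avg: float, current_avg: float) -> float:
      """Improvement of the average score relative to the baseline."""
      return (current_avg - baseline_avg) / baseline_avg

  # Example incident: detected by monitoring and reviewed, but escalation and
  # stakeholder updates were missed -> score 0.5.
  print(incident_score({"detected_by_monitoring": True, "review_published": True}))

  # Example: a baseline average of 0.60 rising to 0.72 is a 20% improvement.
  print(f"{relative_improvement(0.60, 0.72):.0%}")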

This Key Result is also expected to benefit significantly from the work described in the “Culture, Equity and Team Practices” FY21-22 Objective, and specifically the SLO activities in its KR1. The gradual deployment of Service Level Objectives (SLOs), along with their associated Service Level Indicators (SLIs), will be instrumental in defining entry and exit conditions for the incident management processes described in this Key Result; progress in these two initiatives is therefore expected to be synergistic.


Definitions

  • Incident: An incident is an outage, security issue, or other operational issue whose severity demands an immediate human response.
  • SLO: A Service Level Objective (SLO) is an understanding between teams about expectations for reliability and performance: a target value or range of values for a service level, as measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound (e.g. more than 99% of all requests are successful); see the sketch after this list.
  • SLI: An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided. The measurements are often aggregated: i.e., raw data is collected over a measurement window and then turned into a rate, average, or percentile. Ideally, the SLI directly measures a service level of interest, but sometimes only a proxy is available because the desired measure may be hard to obtain or interpret.
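
As a minimal sketch of how the two concepts fit together, assuming a success-rate SLI with a 99% lower-bound target (the function names are illustrative, not existing tooling):

  # Minimal SLI/SLO sketch: a success-rate SLI checked against a lower-bound target.

  def availability_sli(successful_requests: int, total_requests: int) -> float:
      """SLI: fraction of requests served successfully over the measurement window."""
      if total_requests == 0:
          return 1.0  # no traffic in the window; treat the target as met
      return successful_requests / total_requests

  def meets_slo(sli: float, target: float = 0.99) -> bool:
      """SLO check: the SLI must be at or above the target (e.g. at least 99% success)."""
      return sli >= target

  # Example: 995,000 of 1,000,000 requests succeeded -> SLI = 99.50%, SLO met.
  sli = availability_sli(995_000, 1_000_000)
  print(f"SLI={sli:.2%}, SLO met: {meets_slo(sli)}")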


Scoping

  • People: How do we prepare our people to respond to incidents better?
  • Process: How should we behave/operate during an incident?
  • Tooling: What tools should we implement to facilitate responding to incidents?


Related Quarterly OKRs

  • Define and document a project plan which outlines changes to people practices, processes, and tooling required to ensure the success of the objective [Q1]
  • Define and document the scorecard used to measure incident engagement [Q1]
  • 100% adoption of scorecard across all incidents to establish metrics baseline [draft, Q2]
  • ~10% scorecard assessment improvement over previous quarter [draft, Q3]
  • ~10% scorecard assessment improvement over previous quarter [draft, Q4]


Activities and Deliverables
There are two overarching activities envisioned in Q1 & Q2: to define and document a detailed project plan as well as the scorecard that will be used to measure incident engagement.

After this phase of the project is complete, and a baseline is established, activities in Q3 & Q4 will be selected based on where we can make the most measurable impact on the incident scores. Activities that have been envisioned so far include (but are not limited to) the following, with only a subset of them expected to be implemented in FY 21-22:

  • People
    • Development of a training & certification program for incident responders
    • Clear expectation setting for 24/7 incident responders
    • Tabletop incident walkthroughs and simulations
  • Process
    • Incident response process assessment, documentation and revamp
    • “Post-mortem” incident review protocols & standards
  • Tooling
    • Development of incident management coordination tooling
    • Improvements in alerting and escalation tooling
    • Improvements on public communication and visibility of incidents


Resourcing

Activity | Responsible | Accountable | Consulted | Informed
Define and document a detailed project plan | Working group | Leo Mata | SRE organization |
Define and document scorecard | To be defined | Leo Mata | SRE organization |
Development of a training & certification program for incident responders | To be defined | Leo Mata | SRE organization |
Clear expectation setting for 24/7 incident responders | Leo Mata | Faidon Liambotis | T&C organization |
Tabletop incident walkthroughs and simulations | Multiple SREs | Leo Mata | SRE organization, Security |
Incident response process assessment, documentation and revamp | Working group | Leo Mata | Technology Department |
“Post-mortem” incident review protocols & standards | | Leo Mata | SRE organization, Release Engineering, Security | Technology Department
Development of incident management coordination tooling | Observability, Infrastructure Foundations | Leo Mata | SRE organization |
Improvements in alerting and escalation tooling | Observability | Leo Mata | SRE organization, Security | Leadership
Improvements on public communication and visibility of incidents | Observability, Infrastructure Foundations | Leo Mata | SRE organization, Communications | The world


Key Result 3: Security and privacy services

Security and privacy services are enterprise wide, centrally coordinated, scalable and resilient in a way that empowers all users to make good security and privacy decisions, measured by a 10% increase in consumption of consultation services and a 30% decrease in operational services

Intent and Desired Outcomes
The intention is to identify, prioritize, coordinate and scale security and privacy activities across the Foundation. The outcomes will be expressed as a reduction in operational work: the Security team will deliver security and privacy services in a consumable way, so that other teams in Technology and Product can make good security and privacy decisions. This Key Result is about helping the Security and Privacy teams better develop and deliver their services.


Definitions
Work will begin within the Security team, where we will baseline the following security services and their consumption:

  • Application Security
  • Privacy Engineering
  • Threat and Vulnerability Management
  • Security Incident Response
  • Capabilities Management
  • Cyber Risk

The first pass will be to baseline these activities, to understand volume and who is consuming these services and how.

The second pass will be to understand bottlenecks, prioritize and identify service gaps, and determine where we need to equip consumers differently.

The third pass will be to apply controls that address efficacy, efficiency and other gaps in our deliverables.
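
As a hypothetical sketch of the first pass (the request-log structure and records are invented for illustration; the service names follow the list above), a simple tally per service and per consumer is enough to establish a volume baseline:

  # Hypothetical baseline tally; the request records below are invented.
  from collections import Counter

  requests = [
      {"service": "Application Security", "consumer": "Product"},
      {"service": "Privacy Engineering", "consumer": "Technology"},
      {"service": "Application Security", "consumer": "Technology"},
  ]

  volume_by_service = Counter(r["service"] for r in requests)
  consumers_by_service = Counter((r["service"], r["consumer"]) for r in requests)

  print(volume_by_service)      # how much each service is consumed
  print(consumers_by_service)   # who is consuming which service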


Related Quarterly OKRs

  • Security and privacy services have each identified and documented baseline measurements for the purposes of transparency, accountability, and quality control [Q1]


Activities and Deliverables

  • Fusion Center
    • Count of Severity 1 & 2 Security Incidents
    • Count of Critical and high risk vulnerabilities
    • Count of supplier security reviews
  • Capabilities Management
    • Count of onboarding security awareness modules
      • Not yet available
    • Count of new members in #talk-to-security
      • Increase from 31 members on 8/3/21 (first date of measurement) to 59 members on 8/25/21
  • Cyber Risk
    • Count of Critical and High risk issues
  • Architecture
    • Count of privacy engineering risk assessments


Resourcing

Activity | Responsible | Accountable | Consulted | Informed
Baseline services | Security team | Jennifer Cross / John Bennett | Various teams | Various consumers
Identify service delivery issues | Security team | Jennifer Cross / John Bennett | Various teams | Various consumers
Implement service delivery controls | Security team | Jennifer Cross / John Bennett | Various teams | Various consumers
Test and provide feedback on deliverables | Various consumers | Security team | Security team | Jennifer Cross / John Bennett