Data as a Service

FY21/22 MTP Priority OKR for Wikimedia Technology Department

Accountable: Tajh Taylor

OKR Overview

Wikimedia application data is easily discoverable and well-prepared to enable data-informed decision making, application development and research by the community and the Foundation.

Key Result 1
Establish organizational data management structure - Build a browseable, shareable data dictionary, and describe 25% of known data elements.

Key Result 2
Enable efficient program evaluation and decision support through three novel use cases. - Create a baseline selection of three use cases.

Key Result 3
Build machine learning services - Operationalize an ML governance strategy for the Foundation, and create ways to understand, evaluate, and provide feedback on ML models. Baseline to be determined.

Objective Rationale

Today, there are many barriers to the use of Wikimedia application data for analysis, decision-making, intelligence, and applied data science. These include lack of shared information describing the data, varying methods of access and access control, distributed and unclear data stewardship, technical and architectural impedance mismatches, unclear responsibility for data policy enforcement, etc. Although we have several teams around the organization performing data analysis and using data for a variety of purposes, their capacity is limited by these barriers.

Our purpose is to address these problems at a scope and scale that crosses organizational boundaries, to establish a home for clear answers to questions about data access, accountability, and organizational policy, and to dissolve the barriers, enabling and empowering the data capabilities of the entire community (staff, volunteers, and external users of data).

By establishing the data governance capabilities described in Key Result 1, we provide the organizational structure to manage data at the Foundation level. In fulfilling the use cases described in Key Result 2, we demonstrate the ability to deliver capabilities that have previously been stymied by the barriers described above. And in Key Result 3, we transform our machine learning capabilities to be modern, standardized, flexible, scalable, and transparent.

To fulfill these goals, we must create a data strategy that clearly articulates how enhancing the data management capabilities of the Foundation enables us to better support Movement and Foundation strategy and to better measure our own performance and capabilities in the intersection of systems, programs, and people.

Key Result 1: Organizational Data Management and Data Catalog

Establish organizational data management structure - Build a browseable, shareable data dictionary, and describe 25% of known data elements.

Accountability for this Key Result is assigned to Olja Dimitrijevic, Director of Data Engineering

Intent and Desired Outcomes
Data management is an organizational-scale discipline and approach that:

Recognizes the high value of reliable, well-maintained, and easy-to-use data to inform our strategic mission and that of our community
Describes the access to and the use of data for the Foundation and community as a set of services for stakeholders
Provides systematic approaches to data discovery, assurance, access control, and policy application

The strategic value of data is evident in the frequent questions that we are asked and the informational requests that we receive for data. We can currently answer only a fraction of the questions and fulfill only some of the requests even when the requisite data is collected and available.

It is important to recognize that we are not starting from a blank slate. Some teams around the Foundation manage their own data sets, with varying tools and degrees of sophistication. This independent approach has the advantage of liberating teams to determine their own data destinies, but presents several issues: duplication of effort as teams determine and implement Foundation policy for things like sensitive data access control; derivable insight is somewhat limited to what can be learned from within particular sets of data because combining data from different sources and teams is cumbersome and difficult; and not every team has the same degree of technical data expertise on-hand to fulfill needs.

To address these issues and to more fully enable the distillation of value from Foundation and movement data, we intend to establish a Foundation-wide data management practice, with the following outcomes:

A data governance council, composed of stakeholder representatives from across the Foundation, and empowered by senior management to make decisions and determinations regarding data standards and practices
A Foundation data strategy describing our principles and objectives regarding the use and development of Foundation data and access
A data catalog describing the Foundation’s data (encoding, format, location, character, provenance, applicable policy, accessibility, stewardship), available online to Foundation and community users
An internal data services team that will support the embedded domain analysts, as well as teams that do not have expertise in data preparation and delivery
Recognition and elevation of data services and operations as products that are publicly accessible and available, with establishment of relevant SLOs and other treatment

Clearly, not all of these outcomes are achievable within 12 months, and they are not all presently included in this fiscal year’s annual planning. They represent the long-term goals, and will be reflected in the data strategy we create.

Definitions

Wikimedia Application data – Structured data elements and records that are generated by the operation of our publicly accessible systems. This includes product metrics and production user data collected by and generated from Wikimedia properties, but does not include unstructured wiki content data, survey data, or third party data, which may be used with Wikimedia application data to support analyses or other work.
Metadata – Data about data. E.g. location, modification time, ownership, format, access permissions, privacy sensitivity, etc.
Discoverable – New and experienced users of WM data are able to find new data elements relevant to their use cases.
Accessible – The experience of locating and retrieving relevant data sets or viewing relevant reporting is self-directed and easy.
Well-defined – Data elements and records each have definitions that explain their provenance, appropriate and inappropriate uses, formats, constraints, expected ranges/distributions, and restrictions.
Navigable – The relationships between data elements are documented and defined.
Sourceable – It is easy to load data into systems for serving production features and analytics.
Integrable – It is easy to use data sets from different sources in combination with one another.
Prepared – The data is in or close to the format in which the user wants to consume it.

Related Quarterly OKRs

OKRs to be drafted

Activities & Deliverables

We expect that we will undertake the following activities in fulfillment of our objective:

1. Data Strategy

Organizing a working group of staff participants to define the strategy
Identify and describe the high-level mission principles and objectives that will guide the development of a data strategy
Working group to meet regularly to determine the scope and content of the strategy
Identifying and soliciting input from community members with high interest in data access, including tool developers, researchers and current users of bulk data access
Ratifying a first draft of the data strategy with signoff from stakeholders

2. Data Governance Assessment and Plan

Standard templates/formats to collect information about data sources and data policies
Inventory data sources (leverage use cases)
- Where data comes from
- Data stewards - Who may own it (if owned)
- Maintenance policies
Collation of data policies
Organize investigation and insights from use cases to begin building out:
- Data catalog
- Data dictionary
SWOT analysis & preliminary findings
Present findings to the organization
Validate & decide on focus areas

3. Data Governance Implementation

Soliciting participant commitments from all relevant departments and stakeholders for the data governance council
Build a shared understanding of data principles, concepts, and best practices
- Targeting the C-Team
- Targeting data council members
Establish an initial (non-comprehensive) charter to define the work of the data governance council for the current FY
Determine the high-level requirements for the data catalog
Establish a comprehensive document of data sources across the organizational silos
Identify and train data stewards
Iterate:
- Determine the scope of data sets and elements to enter into governance
- Collect and synthesize current and new governing policies and practices in a single place
- Write guidance for the access and use of the scoped data elements
- Solicit feedback on the guidance at the Foundation and with the movement (?)
- Release guidance
Repeat

4. Data Catalog

Determine scope and establish requirements for gathering and publishing metadata information
Engineering design and implementation of data catalog
Data steward assignments and dataset metadata review
Iterate with release and community input
Determine policies to show and hide metadata
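
As a rough illustration of the Data Catalog activities above, the sketch below models a catalog entry, a policy for hiding metadata fields from a public view, and a way to measure progress toward describing 25% of known data elements. The record schema, the `HIDDEN_FIELDS` policy, and every dataset name are hypothetical; the actual requirements are still to be determined.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


# Hypothetical catalog record. The field names loosely follow the attributes
# the catalog is meant to describe; the schema itself is an assumption.
@dataclass
class CatalogEntry:
    name: str
    data_format: str
    location: str
    provenance: str
    applicable_policy: str
    steward: str
    described: bool = False  # True once a steward has reviewed the metadata


# Assumed policy: metadata fields that are hidden from the public view.
HIDDEN_FIELDS = {"location"}


def public_view(entry: CatalogEntry) -> dict:
    """Return the entry's metadata with policy-hidden fields removed."""
    return {k: v for k, v in asdict(entry).items() if k not in HIDDEN_FIELDS}


def described_share(entries: list[CatalogEntry]) -> float:
    """Fraction of known data elements whose metadata has been reviewed."""
    if not entries:
        return 0.0
    return sum(1 for e in entries if e.described) / len(entries)


# Illustrative inventory of four made-up data elements, one of them described.
entries = [
    CatalogEntry("example_a", "parquet", "hypothetical://a", "application logs",
                 "example-retention-policy", "Team A", described=True),
    CatalogEntry("example_b", "json", "hypothetical://b", "event stream",
                 "example-retention-policy", "Team B"),
    CatalogEntry("example_c", "csv", "hypothetical://c", "survey export",
                 "example-privacy-policy", "Team C"),
    CatalogEntry("example_d", "parquet", "hypothetical://d", "application logs",
                 "example-retention-policy", "Team A"),
]

coverage = described_share(entries)
```

With one of four entries reviewed, `coverage` is 0.25, which would just meet the 25% target in this toy inventory.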

Resourcing

Activity	Responsible	Accountable	Consulted	Informed
Data Source	Data Engineering	Data PM	Data	Executives
Data Strategy Document	Tajh Taylor, VP of Data Science & Engineering; Olja Dimitrijevic, Director of Data Engineering; Desiree Abad, Director of Product Management; Sumeet Bodington, Director of Global Data Insights; Chris Albon, Director of Machine Learning; Leila Zia, Director of Research; Guillaume Lederrey, Engineering Manager for Search; Kate Zimmerman, Director of Data Science	Tajh Taylor	Technology leadership, Product leadership, Advancement, Community Investment, Legal, Trust & Safety, Site Reliability Engineering, Administration	VP Cohort, Foundation, Community
Governance Council Formation	Tajh Taylor, Olja Dimitrijevic, John Bennett / Director of Security, Desiree Abad, Kate Zimmerman / Director of Data Science, other participants in Data Engineering	Tajh Taylor	C-Team, VP Cohort, other departmental leaders
Data Governance Assessment and Plan	Data Engineering Team (Dan Andreescu, Olja Dimitrijevic), Security Team, Legal Team	Olja Dimitrijevic	Data Governance Council, T&S, Data Persistence
Inventory	(Dan A., Olja D.), Global Data Insights, Data PM (Desiree as stand-in)	Data PM (Desiree as stand-in)	Governance Council, Product Analytics, Data Persistence, GD Insights & ops FR-Tech SLO working group
Data Catalog Requirements	Data PM (Desiree as stand-in), Data Engineering (Dan Andreescu, Olja Dimitrijevic, Desiree Abad)	Data PM (Desiree as stand-in)	Data Governance Council, Product Analytics, Security, GD Insights & Ops, Designers
Data Catalog Implementation	Data Engineering Team (Dan Andreescu, Olja Dimitrijevic)	Olja Dimitrijevic	Product Analytics, Data Engineering, Product Management, Research, Machine Learning

Key Result 2: New Efficiencies in Program Evaluation

Enable efficient program evaluation and decision support through three novel use cases. - Create a baseline selection of three use cases

Accountability for this Key Result is assigned to Desiree Abad, Director of Product Management

Intent and Desired Outcomes
Over time, the Foundation has invested heavily in various departments, teams, tools, and processes that can collect and analyze data to distill business insights. While we have analytics capabilities across the organization, these capabilities are often siloed by team and/or dataset, making it difficult to answer questions, distill insights, and derive intelligence across the Foundation. Key challenges include:

Data is often PII sensitive and must have adequate security and privacy controls to ensure that data is sanitized, access is controlled, and that data is handled as per our policies.
Data is siloed across the organization, being stored in different locations with different policies and security.
Data is difficult to translate within and across departments due to misaligned definitions and interpretations and the lack of a common language.
Data is sourced from a wide variety of product platforms, applications, tools, and surveys, requiring customized ingestion/pipeline solutions.
Data analysis tools and skill sets vary across individuals and teams.

In order to address these challenges, we will narrow down the scope of these problems in the context of three novel use cases that specifically target challenging areas with cross-functional involvement:

1. Grant & Grantee Reporting

Context:
- The Foundation uses an application called Fluxx to manage grants and interact with Grantees. As such, this data will need to be collected from Fluxx, existing sources will need to be switched, and additional analytics will be required.
Goals:
- Ingest grant-related data from Fluxx and reflect it on wiki
- Combine site-generated data with grant data
- Identify and generate grantee impact metrics
Challenges:
- Custom data ingestion will be required for Fluxx
- A data privacy & security strategy must be identified and agreed upon across teams to support co-location of data, serving on wiki, and/or dataset blending for the purpose of impact metrics.
- Data analysis skills vary, especially across different analytics datasets.

2. Spambots' impact on content, admins, and users

Context:
- We currently hypothesize that an inadequate captcha is letting spambots through, which results in spammed content and in administrators spending time removing these accounts and reverting spambot changes.
Goals:
- Understand the impact of spambots on our content and users
- Understand whether captchas are successfully blocking spambot accounts
- Identify root causes of spam
Challenges:
- Inventory the observability and product data we have and determine any potential gaps
- Examine how we can isolate specific observability data and serve that data so that it may be joined with product analytics data
- Support additional ETL, as needed, in order to join and analyze different datasets

3. Diversity, Equity, and Inclusion (DEI) Reporting

Context:
- Global Data Insights (GDI) will launch the Foundation's first public-facing dashboard to enable movement organizers and partners to map diversity, equity & inclusion among movement-wide programs and spaces.
Goals:
- Understand what data is available and what data is missing that we would need to collect for each lens
- Establish public-facing dashboards, using any existing data, that can be used to analyze and uncover insights and to glean intelligence that informs data-driven decisions in the DEI space
Challenges:
- Legal barriers to data collection and use
- Data can be over- or under-counted

Other Use Cases

4. VisualEditor (VE) Load Time

Why this was not selected: While a better understanding of load time would be valuable for product and engineering decisions, the specific case of VE was tabled for a variety of reasons, and the Editing team is focused on developing Talk Pages this year
Context:
- User adoption of the VE on mobile was slower than expected and didn't meet expectations in certain geographic areas. One hypothesis was that VE caused load time to be too long. https://phabricator.wikimedia.org/T221198
Goals:
- Determine whether we collect page load time and related data at the required level of granularity and with the appropriate additional metadata to answer the questions in the hypothesis
- Join the load time data with the product data in a single dashboard to drive product intelligence.
- Understand what level of performance is required and distill SLOs.
Challenges:
- Inventory the observability and performance data we have and determine any potential gaps.
- Examine how we can isolate specific observability and performance data and serve that data so that it may be joined with product analytics data.
- Support additional ETL, as needed, in order to join and analyze different datasets.

5. A/B testing

Why this was not selected: We are not yet ready for this; it has more involved architecture and stack needs.

Definitions

ETL – Extract, Transform, and Load. A common approach to retrieving and preparing transactional application data for analytical use cases.
PII – Personally-identifying information. Data elements, such as user name or IP address, that either alone or in combination with other data elements can be used to uniquely identify a person.
Observability Data – Data collected for the primary purpose of server administration, but which may also have applications for data insights, such as request rate and server-side latency.
Data Blending – A technique to combine data from multiple data sources in data analytics, reporting, and/or visualizations (see "Data blending" on Wikipedia).
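
To make the ETL and data blending definitions above more concrete, here is a minimal sketch that blends two small in-memory datasets on a pseudonymous key instead of a raw user name. The record shapes, field names, and the salt are all hypothetical, and a keyed hash is pseudonymization rather than full anonymization, so it would not by itself satisfy a privacy policy.

```python
import hashlib
import hmac

# Assumption: a real deployment would keep this key in managed secret storage.
SECRET_SALT = b"example-only-salt"


def pseudonymize(user_name: str) -> str:
    """Replace a user name (PII) with a keyed, truncated SHA-256 digest."""
    digest = hmac.new(SECRET_SALT, user_name.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Extract: two hypothetical sources that share a user name column.
grants = [
    {"user_name": "ExampleUserA", "grant_id": "G-1"},
    {"user_name": "ExampleUserB", "grant_id": "G-2"},
]
edits = [
    {"user_name": "ExampleUserA", "edit_count": 120},
    {"user_name": "ExampleUserC", "edit_count": 7},
]


def blend(grant_rows, edit_rows):
    """Join grant rows to edit counts on the pseudonymous key."""
    # Transform: index edit counts by pseudonymous key.
    edits_by_key = {
        pseudonymize(row["user_name"]): row["edit_count"] for row in edit_rows
    }
    blended = []
    for row in grant_rows:
        key = pseudonymize(row["user_name"])
        # Load: emit one output row per grant; the raw user name is dropped.
        blended.append(
            {
                "user_key": key,
                "grant_id": row["grant_id"],
                "edit_count": edits_by_key.get(key, 0),
            }
        )
    return blended


blended_rows = blend(grants, edits)
```

The join is a left join from grants to edits, so a grantee with no recorded edits still appears with an `edit_count` of 0.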

Activities & Deliverables

Data Security - Ensure data is stored, transformed, and accessed in a way that doesn’t compromise security and privacy.
- Supporting individuals’ privacy needs
- Blending and/or Co-location of data - establish a framework for how to blend and/or co-locate data
- Access and usage controls
  - Authentication
    - Provision an authentication system at the required level of security and scale to facilitate more broad user access to data analysis tools
  - Authorization
    - Who can access what dashboards and data sets
    - Who can access data for exploratory use-cases
- Address retention policies and practices
- Auditability of access as well as access control changes
Define & Develop Data Ingestion Processes
- Establish ingestion methods for:
  - Fluxx
  - Bespoke datasets (ex: survey data)
- Streamline data ingestion for:
  - Product instrumentation data (metrics platform)
- Determine a methodology to support blending or cross-dataset analysis
  - Data destination
Discovery & Analytics
- Do product instrumentation dives to look at what data can be leveraged
- Look across datasets
Data Analysis Support
- Uncover and address any unmet technical requirements for data analysis engines and analytical data storage
- Address scale and performance issues in data reporting tools; provision or improve an existing shared platform for data reporting
Data Dictionaries
- For each use case define the fields, definitions, calculation methodology, etc. in a standardized format
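
The authorization and auditability items under Data Security above could be prototyped along the following lines. This is only a sketch under assumed names: the roles, resources, and log format are hypothetical, and a production system would rely on the Foundation's chosen authentication and authorization tooling rather than an in-memory dictionary.

```python
from datetime import datetime, timezone

# Hypothetical role-to-resource grants; all names are illustrative only.
ROLE_GRANTS = {
    "grants_analyst": {"grants_dashboard"},
    "product_analyst": {"product_dashboard", "exploratory_dataset"},
}

# Append-only record of access checks and access control changes.
audit_log = []


def _record(event, actor, role, resource, allowed=None):
    """Append one audit entry with a UTC timestamp."""
    audit_log.append(
        {
            "time": datetime.now(timezone.utc).isoformat(),
            "event": event,
            "actor": actor,
            "role": role,
            "resource": resource,
            "allowed": allowed,
        }
    )


def check_access(user, role, resource):
    """Return whether the role may use the resource, and log the check."""
    allowed = resource in ROLE_GRANTS.get(role, set())
    _record("access", user, role, resource, allowed)
    return allowed


def grant_access(admin, role, resource):
    """Give a role access to a resource, and log the change."""
    ROLE_GRANTS.setdefault(role, set()).add(resource)
    _record("grant", admin, role, resource)
```

Both the access checks and the grant changes land in the same log, which is one simple way to make access and access control changes auditable together.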

Related Quarterly OKRs

OKRs to be drafted

Resourcing

Activity	Decision-Makers	Responsible	Accountable	Consulted	Informed
Discovery & Analytics: Product Instrumentation	Kate Zimmerman, Mikhail Popov	Product Analytics (Maya Kampurath, Irene Florez, and analysts working with consulted Product Teams)	Kate Zimmerman	Product Teams
Discovery & Analytics: Grants Data & Metrics	Kassia Echevarri-Queen	Community Investment	Kassia Echevarri-Queen	Product Analytics, Ilana Fried, Irene Florez
Discovery & Analytics: DEI Reporting	Sumeet Bodington	Global Data Insights	Sumeet Bodington	Product Analytics
Technology Implementation: Analytics Stack	Olja Dimitrijevic, Data Engineering	Data Engineering	Olja Dimitrijevic	Product Analytics
Technology Implementation: Metrics Platform	Jason Linehan, Analytics Data PM, Desiree Abad	Analytics Data PM (Desiree back-up)	Desiree Abad	Product Analytics, Data Engineering	Senior Leadership, Product
Data Privacy & Security	John Bennett	Data Engineering	John Bennett	Security, Privacy, Legal, Product Analytics	Senior Leadership, Product
Requirements, Roadmaps, & Product Management	Analytics Data PM, Implementation Teams	Platform Product Management	Desiree Abad	Product Analytics

Key Result 3: Build Machine Learning Services

Build machine learning services - Operationalize an ML governance strategy for the Foundation, and create ways to understand, evaluate, and provide feedback on ML models. Baseline to be determined.

Accountability for this Key Result is assigned to Chris Albon, Director of Machine Learning

Intent and Desired Outcomes
Over the last five years, the Wikimedia Foundation has proven that machine learning can add meaningful value to the experiences of both readers and editors. However, despite these successes, there are areas for improvement around how the Foundation conducts machine learning, including:

Models are trained using a framework that is difficult to maintain and limited in the variety of models able to be created.
Models are served using an aging infrastructure requiring users to be mindful of their own usage in order to prevent system failures.
There is no formal review process for evaluating models hosted by the Foundation.
Models hosted by the Foundation are largely opaque to the communities impacted.

In this Key Result, we have an opportunity to elevate machine learning at the Foundation to an example of best practices for applied ethical machine learning, while at the same time strengthening the technical infrastructure and expanding its features.

To accomplish this, the desired outcome for the fiscal year is to create a new modern training, serving, and management infrastructure incorporating the best practices in MLOps and used to host a wide variety of existing and new machine learning models, all of which are transparent and accessible to the public and governed by a well-developed Wikimedia machine learning strategy.

Activities & Deliverables

1. Ethical Machine Learning Governance

Publish data and model cards
Draft machine learning governance strategy

2. New Model Deployment

Launch Lift Wing model serving framework
Migrate ORES/RevScoring models to Lift Wing
Deploy new models to Lift Wing
Deprecate the ORES infrastructure

3. New Model Training And Management Infrastructure

Launch minimum viable product of Train Wing model training framework
Launch minimum viable product feature store

Related Quarterly OKRs

1. All machine learning models hosted by the Foundation are managed by an Ethical Machine Learning Governance Strategy

Q2 - 50% of machine learning models hosted by the Foundation have an accompanying model card
Q2 - Draft a proposed ML governance strategy
Q4 (Stretch KR) - Operationalize final governance strategy

2. Machine learning models hosted by the Foundation are easy to train, deploy, and manage at scale

Q2 - A trained model can be loaded, deployed, and serving API requests in less than four hours in a repeatable process
Q4 - Wikimedia hosts and serves 100 machine learning models on Lift Wing
Q4 (Stretch KR) - One machine learning model is automatically retrained and deployed nightly in a repeatable process
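
As one possible way to track the Q2 model card key result above, the sketch below computes the share of hosted models that have an accompanying model card. The `HostedModel` record, the model names, and the URLs are hypothetical placeholders, not a description of the actual Lift Wing inventory.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical inventory record for a model hosted by the Foundation.
@dataclass
class HostedModel:
    name: str
    model_card_url: Optional[str] = None  # None until a card is published


def model_card_coverage(models: List[HostedModel]) -> float:
    """Fraction of hosted models that have a published model card."""
    if not models:
        return 0.0
    with_card = sum(1 for m in models if m.model_card_url)
    return with_card / len(models)


# Illustrative inventory: two of four made-up models have cards.
models = [
    HostedModel("example-model-a", "https://example.org/cards/a"),
    HostedModel("example-model-b", "https://example.org/cards/b"),
    HostedModel("example-model-c"),
    HostedModel("example-model-d"),
]

Q2_TARGET = 0.5  # 50% of hosted models, per the Q2 key result
meets_q2_target = model_card_coverage(models) >= Q2_TARGET
```

In this toy inventory the coverage is exactly 0.5, so `meets_q2_target` is `True`.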

Resourcing

Activity	Responsible	Accountable	Consulted	Informed
Publish data and model cards	Machine Learning	Chris Albon	Product Management, Design	Research
Draft machine learning governance strategy	Machine Learning	Chris Albon	Movement Comms, Product Management, Data Engineering	Research
Internal Launch Lift Wing	Machine Learning, Data Center Ops	Chris Albon	SRE teams, Product Management	Research
Migrate ORES/RevScoring models to Lift Wing	Machine Learning	Chris Albon		SRE teams, Product Management
Deploy new models to Lift Wing	Machine Learning	Chris Albon	Product Management, Research
Deprecate ORES	Machine Learning, Data Center Ops	Chris Albon	Movement Comms, Product Management, Data Engineering
Launch Train Wing MVP	Machine Learning, Data Center Ops	Chris Albon	Research, Product Management
Launch feature store	Machine Learning	Chris Albon	Data Engineering, Research, Product Management