Wikimedia Technology/Annual Plans/FY2019/TEC5: Scoring Platform

The Scoring Platform team is an experimental, research-focused, community-supported, AI-as-a-service team. Our work focuses on balancing the efficiency that machine classification strategies bring to wiki processes with transparency, ethics, and fairness. Our primary platform is ORES, an AI service that supports wiki processes such as vandal fighting, gap detection, and new page patrolling. The current set of ORES-supported products is loved by our communities, and our team's work of relieving overloaded community processes with AI has shown great potential to enable conversations about growing our community (Knowledge Equity). In this proposal, we'll describe what we think we can accomplish given our current, minimal staffing. We'll also propose to fully staff the team along the lines of the original FY2018 Scoring Platform proposal so that we can expand our capacity in the critical area of bias detection and mitigation.

Overview of FY2018

Last year, we invested in the Scoring Platform team by giving Aaron a budget and staffing the team with Adam Wight as a senior engineer (80%) and Amir Sarabadani as a junior engineer (50%). We also retained a contracting budget to hire experts to develop new AIs and evaluation strategies. In total, we have a staff of 2.55 FTEs.

Despite this minimal staffing, the team has been quite successful.

Delivered many more models to many more wikis (targeting emerging communities, increasing capacity for knowledge equity).
Deployed ORES on a dedicated cluster and refactored the ORES extension (more uptime, evolving infrastructure).
Collaborated with Community Tech on a study of new page review issues, training and testing a critical technology for mitigating them (evolving our infrastructure and experimenting with new strategies for supporting newcomers).
Published papers about why people cite what they cite and the dynamics of bot governance (increasing our understanding of wiki processes).
Performed a community consultation and system design process for JADE, our proposed auditing support infrastructure.

Contingency planning for FY2019

In order to deal with funding realities, we've prepared two annual plans for our department. The first alternative presents our ideal plan, which includes a reasonable amount of growth. The other is what we can accomplish if staffing levels cannot be improved.

Staffing increased as requested

Ask

Bring the team up to a higher level of capacity and robustness by:

Promoting Amir to a full-time requisition holder.
Hiring an engineering manager/tech lead to remove this burden from Aaron. This was proposed in our original plan for FY2018.

Benefits

Support more languages and wikis: we have a large backlog of requests for ORES support.
Bring our new auditing system, mw:JADE, online. Start tracking algorithmic bias, the kind of problem that keeps some potential contributors out, much more effectively (Knowledge Equity).
More robust ORES service.
Develop new prediction models more quickly (Knowledge as a Service). Many of the models we will target are intended to provide fertile ground for experimentation around the balance between efficient quality control and better newcomer support (Knowledge Equity).
Once Aaron is wearing fewer hats, he'll be less of a bottleneck for the team. With more time, Aaron will be able to participate in thought leadership/outreach more effectively.

Staffing unchanged from FY2018 levels

Ask

Continue funding the Scoring Platform Team at FY2018 levels.

Benefits

In the next fiscal year, we will continue our work of making ORES more robust, and expanding our prediction models to new wiki processes and under-served wiki communities.

Slowly increase model support to more wikis, prioritizing emerging communities.
Experiment with the new article routing models and expand them to more communities.
Publish datasets and papers about wiki processes and machine-based process augmentation.

Risks and challenges

While the Scoring Platform team has been able to collaborate effectively with volunteers to supplement its minimal resourcing, the development of ORES (useful AIs) and JADE (our auditing system) has been slowed substantially by understaffing. We have the chance to help lead the industry on this front, but it may escape us.
Our bus factor is still far too low. Were we to lose the one full-time engineer on the team, development and deployments would nearly come to a halt. Or worse, if Aaron were to be lost, the majority of the team's infrastructure would leave with him.

Program outline

The following program outline describes the expanded set of goals that we think we can achieve with a fully staffed team. If we don't expand the team this year, then we'll need to limit our goals to outcomes 1 and 2.

Teams contributing to the program

Scoring Platform

Annual Plan priorities

Primary Goal: 1. Knowledge Equity: Grow new contributors and content

How does your program affect annual plan priority?

Knowledge as a service: Basic machine learning support is essential to keep Wikipedia and other Wikimedia projects open at scale. Without machine learning support, quality control work and other curation activities are too cumbersome to maintain, and the urge to control access and limit newcomers becomes overwhelming. We've seen this in English Wikipedia with restrictions on anonymous editors and registered newcomers. Collaborations between the Scoring Platform team and product teams in Audiences represent the best hope we have of re-opening English Wikipedia and ensuring that other communities don't close off to new participants.

Knowledge equity: While these machine learning technologies are critical to continuing our mission, they also come with a great potential cost with regard to bias. This proposal includes a request for growth so that we can focus more effectively on the development of auditing technologies. These technologies ensure that the AIs that we use to curate content are accountable to our volunteer community. If we do not invest in the development of auditing technologies, we risk furthering our current inequities by encoding them in our prediction models. This is a great risk when it comes to quality control, because these algorithms are part of the decision system controlling who gets to contribute and who does not.

Through investments in this program, we hope to (1) boost the capacity of our communities to effectively curate content at the scale of human knowledge and (2) ensure that the systems we build serve as a force for good in Wikimedia.

Program Goal

Improve the efficiency of wiki processes and mitigate the effects of algorithmic biases that are introduced.

Outcome 1: More wiki communities benefit from semi-automated curation support

Output 1: ORES supports the edit quality prediction models for more wikis/languages

Output 2: ORES supports the draft quality/topic prediction models for more wikis/languages

Outcome 2: Grow the community of wiki decision process modelers and tool builders (staff, volunteers, academics)

Output 3: Published posts about ORES, AI, wiki processes, etc. on the Wikimedia blog

Output 4: Workshops run, papers published, datasets published, tutorials published, hackathons co-organized

Outcome 3: Users of ORES-based tools can build a repository of human judgement to contrast with model predictions

Output 5: JADE (our auditing system) accepts and stores human judgements

Output 6: JADE supports basic curation activities (reverts, suppression, watchlists -- MediaWiki integration)

Outcome 4: Developers and volunteer analysts will be able to analyze trends in ORES bias.

Output 7: JADE data appears in mw:Quarry and public dumps

Output 8: Reports about ORES bias are published.

Outcome 5: Tool developers and product teams will be able to use JADE to help patrollers collaborate by providing a central location for noting which items have been reviewed and what the outcome of that review was.

Output 9: A stream of judgements is available for consumption by tools/products.

Resources

We can meet outcomes 1 and 2 with current resourcing. Outcomes 3, 4, and 5 will require additional resourcing highlighted below.

	FY2017–18	FY2018–19
People (OpEx)	Principal research scientist; Senior engineer; 0.5 ✕ Junior engineer (contract budget); Contracting budget for expert modelers/interns; 0.1 ✕ Tech Writer	Principal research scientist (no change); Senior engineer (no change); Junior engineer (FTE conversion); Contracting budget for expert modelers/interns (no change); 0.5 ✕ Product manager (new hire, shared from WMCS); 0.25 ✕ Tech Writer (new hire, shared from WMCS)
Stuff (CapEx)	New ORES cluster servers (Kubernetes nodes)	No substantial new hardware needed
Travel & Other	2 ✕ Wikimania; 2 ✕ Wikimedia Hackathon; 2 ✕ professional conference	n/a ✕ Wikimania (centralized); 3 ✕ Wikimedia Hackathon (+1 for new hire); 3 ✕ professional conference (+1 for new hire)

Targets

Outcome 1

More wiki communities benefit from semi-automated curation support

Target: 2 new wikis with advanced edit quality models each quarter; At least one semi-automated tool adopts the draftquality and drafttopic prediction models (e.g. PageCuration)

Measurement method

Count of Wikis supported after models are deployed.
Tool developers self-report use of ORES prediction models
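
The first measurement above could be automated along the following lines. This is only a minimal sketch: it assumes that a snapshot of deployed models is available as a mapping from wiki name to a set of model names. The function names, the snapshot format, and the sample wikis are hypothetical illustrations, not actual deployment data.

```python
from typing import Dict, List, Set

# Assumed snapshot format: wiki name -> set of model names deployed there.
Snapshot = Dict[str, Set[str]]


def count_supported_wikis(snapshot: Snapshot, model: str) -> int:
    """Count the wikis in a snapshot that have `model` deployed."""
    return sum(1 for models in snapshot.values() if model in models)


def newly_supported(before: Snapshot, after: Snapshot, model: str) -> List[str]:
    """List the wikis that gained `model` between two snapshots, sorted by name."""
    had = {wiki for wiki, models in before.items() if model in models}
    has = {wiki for wiki, models in after.items() if model in models}
    return sorted(has - had)


# Hypothetical snapshots taken at the start and end of a quarter.
start_of_quarter: Snapshot = {
    "enwiki": {"damaging", "goodfaith"},
    "lvwiki": set(),
}
end_of_quarter: Snapshot = {
    "enwiki": {"damaging", "goodfaith"},
    "lvwiki": {"damaging"},
    "sqwiki": {"damaging", "goodfaith"},
}

gained = newly_supported(start_of_quarter, end_of_quarter, "damaging")
total = count_supported_wikis(end_of_quarter, "damaging")
meets_target = len(gained) >= 2  # target: 2 new wikis per quarter

print(total, gained, meets_target)
```

Comparing two snapshots rather than keeping a running counter keeps the count reproducible from the deployment record alone.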

Outcome 2

Grow the community of wiki decision process modelers and tool builders (staff, volunteers, academics)

Target: Publish two papers in peer-reviewed journals about wiki process support with AI models; Publish two datasets used in modeling wiki processes; Publish two Wikimedia blog posts about modeling, auditing, and problems of scale; Recruit and train new developers at the hackathon events (Wikimedia Hackathon and Wikimania)

Measurement method

Papers, datasets, and blog posts published by the team
Wiki data science workshops organized
Papers published that cite papers and datasets we release
Count of collaborators recruited at hackathons and how many remain active (retained) on the mailing list (ai@lists) or the IRC channel (#wikimedia-ai)

Outcome 3

Users of ORES-based tools can build a repository of human judgement to contrast with model predictions

Target: A test JADE service is deployed in WMF Cloud; JADE is ready to be deployed into production wikis; JADE has revert/suppression/watchlist integrations in MediaWiki

Measurement method

Track deployments
The number of third-party tools that build JADE integrations (self-reported and discovered)

Outcome 4

Developers and volunteer analysts will be able to analyze trends in ORES bias.

Target: JADE data appears in mw:Quarry and in public database dumps; Publish at least one report about bias/non-bias in ORES using JADE data

Measurement method

Demo query in Quarry & inclusion in dumps.wikimedia.org
Number of bias report publications

Outcome 5

Tool developers and product teams will be able to use JADE to help patrollers collaborate by providing a central location for noting which items have been reviewed and what the outcome of that review was.

Target: JADE judgments appear in mw:EventStream; JADE judgments appear along with predictions in ORES; At least one tool adopts the use of JADE for distributed coordination between patrollers

Measurement method

Inclusion in EventStream
Deployment of JADE data to ORES
Developers report the usage of JADE data in curation tools

Dependencies

We rely on Operations for basic hardware and service support. We'll also need minimal hardware resources for bringing JADE to production, should we decide to do so this year. We expect that JADE's primary systems resource usage will be in production MediaWiki, but we may want to have some minimal services producing novel event streams (Judgements) and data dumps.

We'll rely on WMCS for support in hosting JADE-related datasets in public analytics infrastructure like PAWS and Quarry. This will help us make sure that JADE data is open for analysis by our volunteer communities.

We expect that Research will depend on us for datasets and for productionizing some of their experimental models.

We expect that Wikimedia Product teams in the Contributors department will depend on us to support their use of our prediction models (ORES) and auditing/distributed-curation support (JADE) in their tools.
