When we first deployed ORES, we reached out to several different wiki communities and invited them to test out the system for use in patrolling for vandalism. In these announcements, we encouraged editors to install ScoredRevisions -- the only tool that made use of ORES' edit quality models at the time. ScoredRevisions both highlights edits that are likely to be damaging (as predicted by the model) and displays the likelihood of the prediction as a percentage.
It didn't take long before our users began filing false-positive reports on wiki pages of their own design. In this section, we describe three cases in which users independently developed these false-positive reporting pages and how they used them to understand ORES and the role of automated quality control in their own spaces, and to communicate with us.
Case studies
Report mistakes (Wikidata)
When we first deployed prediction models for Wikidata, a free and open knowledge base that can be read and edited by both humans and machines[1], we were breaking new ground by building a damage detection classifier for a structured data wiki[2]. So we created a page called "Report mistakes" and invited users to report the prediction model's mistakes there, leaving the format and structure of the reports largely up to them.
Within 20 minutes, we received our first report from User:Mbch that ORES was flagging edits as potentially damaging that couldn't possibly be vandalism. As reports streamed in, we responded to them and adjusted the model building process to address data extraction bugs and to increase the signal so that the model could better differentiate damaging from non-damaging edits. After a month of reports and bug fixes, we built a table to represent the progress we made across model iterations against the reported false-positives. See Figure ?? for a screenshot of the table. Each row represents a mis-classified edit, and each column corresponds to a later iteration of the model, recording whether that iteration still flagged the edit as damaging. Through this process, we learned how Wikidata editors saw damage and how our modeling and feature extraction process captured signals in ways that differed from those editors' understandings. We were also able to publicly demonstrate improvements to this community.
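To make the structure of that table concrete, here is a minimal sketch of the underlying bookkeeping: re-rendering each reported false positive against successive model builds. All revision IDs, version labels, probabilities, and the 0.5 cutoff below are invented placeholders, not values from the actual Wikidata model.

```python
# Sketch: tabulate reported false positives against successive model builds.
# Every value below is an invented placeholder; in practice each probability is
# the damaging score that a given model build assigned to the reported edit.

REPORTS = {
    # rev_id: {model_version: P(damaging) assigned by that build}
    111111111: {"0.1.0": 0.91, "0.2.0": 0.64, "0.3.0": 0.22},
    222222222: {"0.1.0": 0.83, "0.2.0": 0.31, "0.3.0": 0.18},
    333333333: {"0.1.0": 0.77, "0.2.0": 0.71, "0.3.0": 0.69},
}
VERSIONS = ["0.1.0", "0.2.0", "0.3.0"]
THRESHOLD = 0.5  # assumed cutoff for "flagged as damaging"

print("rev_id     " + "  ".join(f"{v:>13}" for v in VERSIONS))
for rev_id, scores in REPORTS.items():
    cells = ["fixed" if scores[v] < THRESHOLD else "still flagged" for v in VERSIONS]
    print(f"{rev_id}  " + "  ".join(f"{c:>13}" for c in cells))
```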
Patrolling/ORES (Italian Wikipedia)
Italian Wikipedia was one of the first wikis where we deployed basic edit quality models. Our local collaborator who helped us develop the language-specific features, User:Rotpunkt, created a page for ORES[3] with a section for reporting false-positives ("falsi positivi"). Within several hours, Rotpunkt and a few other editors started to notice trends in their false-positive reports. First, Rotpunkt noticed that ORES was flagging several counter-vandalism edits as potentially damaging, so he made a section for collecting that specific type of mistake ("annullamenti di vandalismo", reverts of vandalism). A few reports later, he added a section for corrections to the verb for "have" ("correzioni verbo avere"). Through this process, editors from Italian Wikipedia were essentially performing a grounded theory exploration of the general classes of errors that ORES was making.
Once there were several of these mistake-type sections and several reports within each section, Rotpunkt reached out to us to let us know what he'd found. He explained to us (via our IRC channel) that many of ORES' mistakes were understandable, but there were some general trends in mistakes around the Italian verb for "have": "ha". We knew immediately what was likely to be the issue. In English and many other languages, "ha" signals laughter -- an example of informal language that doesn't belong in an encyclopedia article. In Italian, however, "ha" is a conjugated form of the verb "to have" and is perfectly acceptable in articles.
Because of the work of Rotpunkt and his collaborators on Italian Wikipedia, we were able to recognize the source of this issue (a set of features intended to detect the use of informal language in articles, based on a list of informal words) and to remove "ha" from that list for Italian Wikipedia. This is just one example of the many issues we were able to address because of the grounded theory and thematic analysis performed by Italian Wikipedians.
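For illustration only, here is a simplified sketch of how a regex-based informal-language feature of this kind might behave, and why excluding "ha" from the Italian word list matters. The word lists below are toy examples we made up; they are not the actual feature lists used by ORES.

```python
import re

# Toy informal-word lists (illustrative only; not the actual ORES feature lists).
INFORMAL_WORDS = {
    "english": [r"ha(ha)*", r"lol", r"wtf"],
    # After the Italian Wikipedians' reports, "ha" would be excluded here,
    # since it is a conjugated form of "avere" (to have).
    "italian": [r"ahah(ah)*", r"lol"],
}

def informal_word_count(text, language):
    """Count tokens in `text` that match the language's informal-word patterns."""
    patterns = [re.compile(rf"^{p}$", re.IGNORECASE) for p in INFORMAL_WORDS[language]]
    tokens = re.findall(r"\w+", text, re.UNICODE)
    return sum(1 for token in tokens if any(p.match(token) for p in patterns))

print(informal_word_count("ha ha very funny", "english"))        # 2: "ha" is informal in English
print(informal_word_count("Roma ha molti monumenti", "italian"))  # 0: "ha" is a normal verb form
```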
PatruBOT (Spanish Wikipedia)
Soon after we released support for Spanish Wikipedia, User:jem developed a robot to automatically revert damaging edits using ORES predictions (PatruBOT). This robot was not running for long before our discussion pages started to be bombarded with confused Spanish-speaking editors asking us questions about why ORES did not like their work. We struggled to understand the origin of the complaints until someone reached out to us to tell us about PatruBOT and its activities.
We haven't been able to find the source code for PatruBOT, but from what we've been able to gather by looking at its activity, it appears that PatruBOT was too sensitive and was likely reverting edits that ORES did not have enough confidence about. Generally, when running an automated counter-vandalism bot, the most immediate operational concern is precision (the proportion of positive predictions that are true-positives), because mistakes are especially expensive when there is no human judgement between a prediction and a revert (rejection of the contribution). The proportion of all damaging edits that the bot actually catches (recall) is a secondary concern to be optimized.
We generally recommend that bot developers who are interested in running an automated counter-vandalism bot use a threshold that maximizes recall at high precision (90% is a good starting point). According to our threshold optimization query, the Spanish Wikipedia damaging model can be expected to achieve 90% precision and catch 17% of damage if the bot only reverts edits whose likelihood estimate is above 0.959.
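The following is a minimal sketch of this kind of threshold optimization, assuming a held-out set of labeled edits and the model's damaging probabilities for them. The labels and scores below are synthetic, so the resulting numbers will not match the 0.959 threshold quoted above; ORES computes the equivalent statistics from its own test data.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic stand-ins: true labels (1 = damaging) and model probabilities
# for a held-out test set. In practice these come from the model's test data.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=5000)
y_score = np.clip(y_true * 0.55 + rng.normal(0.3, 0.2, size=5000), 0, 1)

def max_recall_at_precision(y_true, y_score, min_precision=0.90):
    """Find the score threshold that maximizes recall subject to precision >= min_precision."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall have one more entry than thresholds; drop the final point.
    qualifying = np.where(precision[:-1] >= min_precision)[0]
    if qualifying.size == 0:
        return None  # the model can't reach this precision at any threshold
    best = qualifying[np.argmax(recall[qualifying])]
    return thresholds[best], precision[best], recall[best]

threshold, p, r = max_recall_at_precision(y_true, y_score)
print(f"revert only when P(damaging) >= {threshold:.3f} "
      f"(precision ~{p:.2f}, recall ~{r:.2f})")
```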
We reached out to the bot developer to try to help, but given the voluntary nature of their work, they were not available to discuss the issue with us. Eventually, other editors who were concerned with PatruBOT's behavior organized an informal crowdsourced evaluation of the fitness of PatruBOT's behavior[4] where they randomly sampled 1000 reverts performed by PatruBOT and reviewed their appropriateness. At the time of writing, PatruBOT has been stopped[5] and the informal evaluation is ongoing.
Discussion
These case studies of responses to ORES provide a window into how our team has been able to work with members of various communities to refine our understanding of their needs, into methods for recognizing and addressing biases in ORES' models, and into how people think about what types of automation they find acceptable in their spaces.
Refining our understanding and iterating on our models. The information divide between us researchers/engineers and the members of a community is often wider than we realize. Through iteration on the Wikidata and Italian models, we learned about incorrect assumptions we'd made about how edits happen (e.g. client edits in Wikidata) and how language works (e.g. "ha" is not laughter in Italian). It's likely we'd never have been able to fully understand the context in which damage detection models should operate before deploying them. But these case studies demonstrate how, with a tight communication loop, many surprising and wrong assumptions that were baked into our modeling process could be identified and addressed quickly. It seems that many of the relevant issues in feature engineering and model tuning become *very* apparent when the model is used in context to address a real problem (in these cases, vandalism).
Methods for recognizing and addressing bias. The Italian Wikipedians showed us something surprising and interesting about collaborative evaluation of machine prediction: thematic analysis is very powerful. Through the collection of ORES' mistakes and iteration, our Italian collaborators helped us understand general trends in the types of mistakes that ORES made. It strikes us that this is a somewhat general strategy for bias detection. While our users certainly brought their own biases to their audit of ORES, they were quick to discover and come to consensus about trends in ORES' issues. Before they had performed this process and shared their results with us, we had no idea that any issue was present. After all, the fitness statistics for the damage detection model looked pretty good -- probably good enough to publish a research paper! Their use of thematic analysis seems like a powerful tool that developers will want to make sure is well supported in any crowd-based auditing technology.
How people think about acceptable automation. In our case study, Spanish Wikipedians are in the process of coming to agreement about what roles are acceptable for automated agents. By watching PatruBOT work in practice, they decided that its false discovery rate (i.e., 1 - precision) was too high, and they started their own independent analysis to find a quantitative, objective answer about what the real rate is. Eventually they may come to a conclusion about an acceptable rate, or they may decide that no revert is acceptable without human intervention.
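As a rough illustration of the kind of quantitative, objective answer such a sampled review can produce, here is a sketch that estimates a bot's false discovery rate, with a confidence interval, from a random sample of reviewed reverts. The counts are invented for the example and are not results from the Spanish Wikipedians' ongoing evaluation.

```python
from math import sqrt

def wilson_interval(bad, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = bad / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Invented example counts: out of 1000 sampled reverts, reviewers judged
# 150 to have rejected a good-faith, non-damaging edit.
sampled_reverts = 1000
judged_inappropriate = 150

fdr = judged_inappropriate / sampled_reverts
low, high = wilson_interval(judged_inappropriate, sampled_reverts)
print(f"estimated false discovery rate: {fdr:.1%} (95% CI: {low:.1%} - {high:.1%})")
```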
- ↑ https://wikidata.org
- ↑ Sarabadani, A., Halfaker, A., & Taraborelli, D. (2017, April). Building automated vandalism detection tools for Wikidata. In Proceedings of the 26th International Conference on World Wide Web Companion (pp. 1647-1654). International World Wide Web Conferences Steering Committee.
- ↑ it:Progetto:Patrolling/ORES
- ↑ es:Wikipedia:Mantenimiento/Revisión_de_errores_de_PatruBOT/Análisis
- ↑ es:Wikipedia:Café/Archivo/Miscelánea/Actual#Parada_de_PatruBOT