Talk:ORES/Paper

About this board

Previous discussion was archived at meta:Research talk:ORES paper/Archive 1 on 2016-12-19.

Simplifying occurrences of "reification"

Adamw (talkcontribs)

I had a question about the use of “reification”, whether there’s a specific background or reason for using that word?  If not, I find it distracting and maybe wrong… AFAICT, reification is more about ideas becoming a thing in people’s minds, rather than ideas actually turning “real”/“physical”.  I’m thinking we can say “values are embodied in technologies” rather than reified…. "embedded" as you have elsewhere in the paper works for me as well. Or maybe “brought to life”.

Reply to "Simplifying occurrences of "reification""

ORES system: Open, transparent process

EpochFail (talkcontribs)

Our goal in the development of ORES and the deployment of its models is to keep the whole process -- the flow of data from random samples through model training and evaluation -- open for review, critique, and iteration. In this section, we describe how we implemented transparent replayability in our model development process and how ORES outputs a wealth of useful and nuanced information for users. By making this detailed information available to users and developers, we hope to enable flexibility and power in the evaluation and use of ORES predictions for novel purposes.

Gathering labeled data

There are two primary strategies for gathering labeled data for ORES' models: found traces and manual labels.

Found traces. For many models, there is already a rich set of digital traces that can be assumed to reflect a useful human judgement. For example, in Wikipedia, it's very common that damaging edits will be reverted and that good edits will not be reverted. Thus the revert action (and the traces it leaves behind) can be used to infer that the reverted edit was damaging. We have developed a re-usable script[1] that, given a sample of edits, labels each edit as "reverted_for_damage" or not based on a set of constraints: the edit was reverted within 48 hours, the reverting editor was not the same person as the original author, and the edit was not later restored by another editor.
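
To make these constraints concrete, here is a minimal sketch of the heuristic applied to a single edit. This is an illustration only, not the actual autolabel script; the edit/revert dictionaries and their fields are hypothetical.

from datetime import timedelta

REVERT_WINDOW = timedelta(hours=48)

def reverted_for_damage(edit, revert=None, later_restored=False):
    """Apply the constraints described above to a single edit.

    `edit` and `revert` are assumed to be dicts with "user" and "timestamp"
    keys; `later_restored` is True if another editor re-applied the edit.
    """
    if revert is None:
        return False  # never reverted
    if revert["timestamp"] - edit["timestamp"] > REVERT_WINDOW:
        return False  # reverted too late to be treated as damage
    if revert["user"] == edit["user"]:
        return False  # self-reverts are not counted as damage
    if later_restored:
        return False  # another editor restored the edit, so it wasn't damage
    return True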

However, this "reverted_for_damage" label is problematic in that many edits are reverted not because they are damaging but because they are involved in some content dispute. Also, the label does not differentiate damage that is a good-faith mistake from damage that is intentional vandalism. So in the case of damage prediction models, we'll only make use of the "reverted_for_damage" label when manually labeled data is not available.

Another case of found traces is article quality assessments -- named "wp10" after the Wikipedia 1.0 assessment process that originated the article quality assessment scale[2]. We follow the process developed by Warncke-Wang et al.[3] to extract the revision of an article that was current at the time of an assessment. Many other wikis employ a similar process of article quality labeling (e.g. French Wikipedia and Russian Wikipedia), so we can use the same script to extract their assessments with some localization[4]. However, other wikis either apply the labeling scheme inconsistently or not at all, and in those cases manual labeling is our only option.

The Wiki labels interface embedded in Wikipedia

Manual labeling. We treat manual labels as the gold standard for training a model to replicate a specific human judgement. This contrasts with found data, which is much easier to come by when it is available; manual labeling requires a substantial upfront investment of human labor. In order to minimize the time investment asked of our collaborators (mostly volunteer Wikipedians), we've developed a system called "Wiki labels"[5]. Wiki labels allows Wikipedians to submit judgments of specific samples of wiki content through a convenient interface after logging in via their Wikipedia account.

To improve our models of edit quality, we replace the models based on found "reverted_for_damage" traces with models trained on manual judgments, where we specifically ask labelers to distinguish damaging edits from good ones and good-faith edits from vandalism. Using these labels, we can build two separate models that allow users to filter for edits that are likely to be good-faith mistakes[6], to focus on vandalism, or to focus on all damaging edits broadly.

We've completed manual labeling campaigns for article quality in Turkish and Arabic Wikipedia (wp10) as well as item quality in Wikidata. We've found that, when working with manually labeled data, we can attain relatively high levels of fitness with 150 observations per quality class.

Explicit pipelines

One of our openness goals for how prediction models are trained and deployed in ORES is to make the whole data flow clear. Consider the following code, which represents a common pattern from our model-building Makefiles:

datasets/enwiki.human_labeled_revisions.20k_2015.json:
        ./utility fetch_labels \
                https://labels.wmflabs.org/campaigns/enwiki/4/ > $@

datasets/enwiki.labeled_revisions.w_cache.20k_2015.json: \
                datasets/enwiki.labeled_revisions.20k_2015.json
        cat $< | \
        revscoring extract \
                editquality.feature_lists.enwiki.damaging \
                --host https://en.wikipedia.org \
                --extractor $(max_extractors) \
                --verbose > $@

models/enwiki.damaging.gradient_boosting.model: \
                datasets/enwiki.labeled_revisions.w_cache.20k_2015.json
        cat $^ | \
        revscoring cv_train \
                revscoring.scoring.models.GradientBoosting \
                editquality.feature_lists.enwiki.damaging \
                damaging \
                --version=$(damaging_major_minor).0 \
                (... model parameters ...) \
                --center --scale > $@

Essentially, this code helps someone determine where the labeled data comes from (manually labeled via the Wiki Labels system). It makes clear how features are extracted (using the revscoring extract utility and the enwiki.damaging feature set). Finally, the dataset with extracted features is used to cross-validate and train a model predicting the "damaging" label, and a serialized version of that model is written to a file. A user could clone this repository, install the set of requirements, and run "make enwiki_models" with the expectation that the entire data pipeline would be reproduced.

By explicitly using public resources and releasing our utilities and Makefile source code under an open license (MIT), we have essentially implemented a turn-key process for replicating our model building and evaluation pipeline. A developer can review this pipeline for issues knowing that they are not missing a step of the process because all steps are captured in the Makefile. They can also build on the process (e.g. add new features) incrementally and restart the pipeline. In our own experience, this explicit pipeline has been extremely useful for identifying the origin of our own model building bugs and for making incremental improvements to ORES' models.

At the very base of our Makefile, a user can run "make models" to rebuild all of the models of a certain type. We regularly perform this process ourselves to ensure that the Makefile is an accurate representation of the data flow pipeline. Performing a complete rebuild is essential when a breaking change is made to one of our libraries. The resulting serialized models are saved to the source code repository so that a developer can review the history of any specific model and even experiment with generating scores using old model versions.

Model information

In order to use a model effectively in practice, a user needs to know what to expect from model performance. For example, how often will an edit that is predicted to be "damaging" actually be damaging (precision)? What proportion of damaging edits should I expect the model to catch (recall)? Which metric matters operationally depends strongly on the intended use of the model. Given that our goal with ORES is to allow people to experiment with the use of, and reflection on, prediction models in novel ways, we sought to build a general model information strategy.

https://ores.wikimedia.org/v3/scores/enwiki/?model_info&models=damaging returns:

      "damaging": {
        "type": "GradientBoosting",
        "version": "0.4.0",
        "environment": {"machine": "x86_64", ...},
        "params": {center": true, "init": null, "label_weights": {"true": 10},
                   "labels": [true, false], "learning_rate": 0.01, "min_samples_leaf": 1,
                   ...},
        "statistics": {
          "counts": {"labels": {"false": 18702, "true": 743},
                     "n": 19445,
                     "predictions": {"false": {"false": 17989, "true": 713},
                                     "true": {"false": 331, "true": 412}}},
          "precision": {"labels": {"false": 0.984, "true": 0.34},
                        "macro": 0.662, "micro": 0.962},
          "recall": {"labels": {"false": 0.962, "true": 0.555},
                     "macro": 0.758, "micro": 0.948},
          "pr_auc": {"labels": {"false": 0.997, "true": 0.445},
                     "macro": 0.721, "micro": 0.978},
          "roc_auc": {"labels": {"false": 0.923, "true": 0.923},
                      "macro": 0.923, "micro": 0.923},
          ...
        }
      }

The output captured in Figure ?? shows a heavily trimmed JSON (human- and machine-readable) output of model_info for the "damaging" model in English Wikipedia. Note that many fields have been trimmed in the interest of space with an ellipsis ("..."). What remains gives a taste of what information is available. Specifically, there's structured data about what kind of model is being used, how it is parameterized, the computing environment used for training, the size of the train/test set, the basic set of fitness metrics, and a version number so that secondary caches know when to invalidate old scores. A developer using an ORES model in their tools can use these fitness metrics to make decisions about whether or not a model is appropriate and to report to users what fitness they might expect at a given confidence threshold.
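
As an illustration, a tool could fetch and inspect this information at runtime with a few lines of Python. This is a sketch using the requests library; the path into the response JSON below is our reading of the full, untrimmed output.

import requests

# Fetch model information for the English Wikipedia "damaging" model using
# the URL shown above.
resp = requests.get(
    "https://ores.wikimedia.org/v3/scores/enwiki/?model_info&models=damaging")
resp.raise_for_status()

# The {"enwiki": {"models": {"damaging": ...}}} wrapper is our reading of the
# full response; the fields below appear in the trimmed output above.
info = resp.json()["enwiki"]["models"]["damaging"]
statistics = info["statistics"]
print("model version:", info["version"])
print("precision (true):", statistics["precision"]["labels"]["true"])
print("recall (true):", statistics["recall"]["labels"]["true"])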

The scores

The predictions made through ORES are also, of course, human- and machine-readable. In general, our classifiers will report a specific prediction along with a probability (likelihood) for each class. Consider the article quality (wp10) prediction output in Figure ??.

https://ores.wikimedia.org/v3/scores/enwiki/34234210/wp10 returns

        "wp10": {
          "score": {
            "prediction": "Start",
            "probability": {
              "FA": 0.0032931301528326693,
              "GA": 0.005852955431273448,
              "B": 0.060623380484537165,
              "C": 0.01991363271632328,
              "Start": 0.7543301344435299,
              "Stub": 0.15598676677150375
            }
          }
        }

A developer making use of a prediction like this may choose to present the raw prediction "Start" (one of the lower quality classes) to users or to implement some visualization of the probability distribution across the predicted classes (75% Start, 16% Stub, etc.). They might even choose to build an aggregate metric that weights the quality classes by their predicted probability (e.g. Ross's student support interface[7] or the weighted_sum metric from [8]).
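
For illustration, a weighted-sum aggregate over the probabilities above might look like the following sketch. The 0-5 class weighting is an assumption chosen for illustration, not the exact metric used in the cited work.

# Map each assessment class to a numeric weight (an illustrative 0-5 scale).
QUALITY_WEIGHTS = {"Stub": 0, "Start": 1, "C": 2, "B": 3, "GA": 4, "FA": 5}

def weighted_quality(probability):
    """Collapse a wp10 probability distribution into a single quality score."""
    return sum(QUALITY_WEIGHTS[cls] * p for cls, p in probability.items())

probability = {"FA": 0.0033, "GA": 0.0059, "B": 0.0606,
               "C": 0.0199, "Start": 0.7543, "Stub": 0.1560}
print(round(weighted_quality(probability), 2))  # ~1.02 -- a weak "Start"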

Threshold optimization

(import from other thread/essay)

  1. see "autolabel" in https://github.com/wiki-ai/editquality
  2. en:WP:WP10
  3. Warncke-Wang, Morten (2017): English Wikipedia Quality Assessment Dataset. figshare. Fileset. https://doi.org/10.6084/m9.figshare.1375406.v2
  4. see the "extract_labelings" utility in https://github.com/wiki-ai/articlequality
  5. m:Wiki labels
  6. see our report meta:Research_talk:Automated_classification_of_edit_quality/Work_log/2017-05-04
  7. Sage Ross, Structural completeness
  8. Keilana Effect paper
Reply to "ORES system: Open, transparent process"
Design rationale

EpochFail (talkcontribs)

This is largely adapted from Jmorgan's notes.

Wikipedia as a genre ecology. Unlike traditional mass-scale projects, Wikipedia's structure and processes are not centrally planned. Wikipedia's system functions as a heterogeneous assemblage of humans, practices, policies, and software. Wikipedia is an open system and its processes are dynamic, complex, and non-deterministic.

A theoretical framework that accounts for the totality of factors and their relationships is essential to building a system-level understanding of state and change processes. Genre ecologies[1] give us such a framework. A genre ecology consists of “an interrelated group of genres (artifact types and the interpretive habits that have developed around them) used to jointly mediate the activities that allow people to accomplish complex objectives.”[2].

Morgan & Zachry (2010) used genre ecologies to characterize the relationships between Wikipedia’s official policies and essays--unofficial rules, best practices, and editing advice documents that are created by editors in order to contextualize, clarify, and contradict policies. Their research demonstrated that on Wikipedia, essays and policies not only co-exist, but interact. The “proper” interpretation of Wikipedia’s official Civility policy[3] within a particular context is mediated by the guidance provided in the related essay No Angry Mastodons[4].

In genre ecology terms, performing the work of enforcing civil behavior on Wikipedia is mediated by a dynamic equilibrium between the guidance provided in the official policy and the guidance provided in any related essays, with the unofficial genres providing interpretive flexibility in the application of official rules to local circumstances as well as challenging and re-interpreting official ideologies and objectives.

Algorithmic systems clearly have a role in mediating policy, values, and rules in social spaces as well[5]. When looking at Wikipedia's articulation work through the genre ecology lens, it's clear that robots mediate the meaning of policies (cf. Sinebot's enforcement of the signature policy[6]) and human-computation software mediates the way that Wikipedia enacts quality control (cf. Huggle's vision of quality in Wikipedia as separating good from bad[7]).

Wikipedia's problems in automated mediation. Wikipedia has a long-standing problem with regard to how quality control is enacted. In 2006, when Wikipedia was growing exponentially, the volunteers who managed quality control processes were overwhelmed, and they turned to software agents to help make their process more efficient[8]. But the software they developed and appropriated focused only on reifying quality standards and not on good community management practices[9]. The result was a sudden decline in the retention of new editors in Wikipedia and a threat to the core values of the project.

Past work has described these problems as systemic and related to dominant shared-understandings embedded in policies, processes, and software agents[10]. Quality control itself is a distributed cognition system that emerged based on community needs and volunteer priorities[11]. So, where does change come from in such a system -- where problematic assumptions have been embedded in the mediation of policy and the design of software for over a decade? Or maybe more generally, how does deep change take place in a genre ecology?

Making change is complicated by the distributed nature of the system. Since the publication of a seminal report about declining retention in Wikipedia, knowledge that Wikipedia's quality control practices are problematic and at the heart of an existential problem for the project has become widespread. Several initiatives have been started that are intended to improve socialization practices (e.g. the Teahouse, a question and answer space for newcomers[12], and outreach efforts like the Inspire Campaigns, which elicit ideas from contributors on the margins of the community). Such initiatives can show substantial gains under controlled experimentation[13].

However, the process of quality control itself has remained largely unchanged. This assemblage of mindsets, policies, practices, and software prioritizes quality/efficiency and does so effectively (cite: Levee paper and Snuggle paper). To move beyond the current state of quality control, we need alternatives to the existing mode of seeing and acting within Wikipedia.

While it’s tempting to conclude that we just need to fix quality control, it’s not at all apparent what better quality control would look like. Worse, even if we knew, how does one cause systemic change in a distributed system like Wikipedia? Harding and Haraway’s concept of successors[14][15] gives us insight into how we might think about the development of new software/process/policy components. Past work has explored specifically developing a successor view that prioritizes the support of new editors in Wikipedia over the efficiency of quality control[16][17], but a single point rarely changes the direction of an entire conversation, so change is still elusive.

Given past efforts to improve the situation for newcomers[18] and the general interest among Wikipedia's quality control workers in improving socialization[19], we know that there is general interest in balancing quality/efficiency and diversity/welcomingness more effectively. So where are the designers who would incorporate this expanded set of values? How do we help them bring forward their alternatives? How do we help them re-mediate Wikipedia’s policies and values through their lens? How do we support the development of more successors?

Expanding the margins of the ecology. Successors come from the margin -- they represent non-dominant values and engage in the re-mediation of articulation. We believe that history suggests that such successors are a primary means of change in an open ecology like Wikipedia. For anyone looking to enact a new view of quality control in the design of a software system, there’s a high barrier to entry -- the development of a realtime machine prediction model. Without exception, all of the critical, high-efficiency quality control systems that keep Wikipedia clean of vandalism and other damage employ a machine prediction model for highlighting the edits that are most likely to be bad. For example, Huggle[20] and STiki[21] use machine prediction models to highlight likely damaging edits for human review. ClueBot NG[22] uses a machine prediction model to automatically revert edits that are highly likely to be damaging. These automated tools and their users work together as a multi-stage filter that quickly and efficiently addresses vandalism[23].

So, historically, the barrier to entry with regards to participating in the mediation of quality control policy was a deep understanding of machine classification models. Without this deep understanding, it wasn't possible to enact an alternative view of how quality control should work while also accounting for efficiency and the need to scale. Notably, one of the key interventions in this area that did so was built by a computer scientist[24].

The result is the dominance of a certain type of individual -- a computer scientist (stereotypically, with an eye toward efficiency and less interest in messy human interaction). This high barrier to entry and peculiar in-group have minimized the margins and entrenched the dominance of quality control regimes that were largely developed in 2006 -- long before the social costs of efficient quality control were understood.

If the openness of this space to the development of successors (the re-mediation of quality control) is limited by a rare literacy, then we have two options for expanding the margins beyond the current authorities: (1) increase general literacy around machine classification techniques or (2) remove the need to deeply understand practical machine learning in order to develop an effective quality control tool.

Through the development of ORES, we seek to achieve the latter. By deploying a high-availability machine prediction service and engaging in basic outreach efforts, we intend to dramatically lower the barriers to the development of successors. We hope that by opening the margin to alternative visions of what quality control and newcomer socialization in Wikipedia should look like, we also open the doors to participation of alternative views in the genre ecology around quality control. If we’re successful, we’ll see new conversations about how algorithmic tools affect editing dynamics. We’ll see new types of tools take advantage of these resources (implementing alternative visions).

  1. ?
  2. (Spinuzzi & Zachry, 2000)
  3. en:WP:CIVIL
  4. en:WP:MASTODON
  5. Lessig's Code is Law
  6. Lives of bots
  7. Snuggle paper
  8. Snuggle paper
  9. R:The Rise and Decline paper
  10. Snuggle paper
  11. Banning of a vandal
  12. Teahouse CSCW paper
  13. Teahouse Opensym paper
  14. Haraway, D. 1988. “Situated Knowledges: The Science Question in Feminism and the Privilege of Partial Perspective.” Feminist Studies, Vol. 14, No.3. (Autumn, 1988), pp. 575-599.
  15. Harding, S. 1987. The Science Question in Feminism. Ithaca: Cornell University Press.
  16. Snuggle paper
  17. Geiger, R.S. (2014, October 22-24). Successor systems: the role of reflexive algorithms in enacting ideological critique. Paper presented at Internet Research 15: The 15th Annual Meeting of the Association of Internet Researchers. Daegu, Korea: AoIR. Retrieved from http://spir.aoir.org.
  18. Teahouse CSCW paper
  19. Snuggle paper
  20. en:WP:Snuggle
  21. en:WP:STiki
  22. en:User:ClueBot NG
  23. When the levee breaks
  24. Snuggle paper
Jmorgan (WMF) (talkcontribs)

@EpochFail This is excellent. I made two very small textual changes. There's one additional piece of argument that you might want to add. Starting in the 4th paragraph from the end, you start to describe barriers to participation in quality control. You discuss the technical/expertise barrier around implementing machine learning systems, and I agree that is very important. I think it would also be useful to discuss the ADDITIONAL barrier created by the systems and practices that have developed around the use of these models. Could you argue, for example, that the existing models prioritize recall over precision in vandalism detection, and ignore editor intent, and that this is because those design decisions reflect a particular set of values (or a mindset) related to quality control? People who don't share that mindset--people who are more interested in mentoring new editors, or who care about the negative impacts of being reverted on new editor retention--won't use these tools because they don't share the values and assumptions embedded in the tools. By creating alternative models that embed different values--through interpretability, adjustable thresholds, and "good faith" scores--you provide incentives for folks who were previously marginalized from participating in quality control. Thoughts?

Adamw (talkcontribs)

I’m trying to catch up with the genre ecologies reading, and a first impression is that genre diagrams have a lot in common with data flow diagrams.  The edges contain a process, and the nodes might contain multiple data stores.  I appreciate that the genre theory is giving us a more zoomed-out perspective, in which human behaviors like habits and culture begin to emerge.  From my quick browsing of the background work on genre ecology, I think you’re breaking ground by suggesting that machines mediate in this space as well, in other words considering the data flows which become invisible because they don’t generate genres.  For example, editors will read the genre of ORES scores via a UI, and their administrative actions create a record of reverts, but we must account for the mostly automatic process of training a ML model on the reverts and updating scores, which changes the network topology into a feedback loop.  I’d appreciate help freeing myself of my data flow interpretation on genre ecologies, at some point.  If machine mediation is something new in genre ecologies, then I’m curious about what we gain by bringing in this theory.

Great to see the focus on effecting change!  I personally agree wholeheartedly that “successors come from the margin”, that we could design interventions all day and the results might even be quite positive, but that the most just, lasting, and visionary change will come from empowering our stakeholders to “let a hundred algorithms bloom”, and we may be able to catalyze this by creating space at the margins.

Not sure we need to present a stereotypical computer programmer who prefers determinism and logic to messy humans.  It feels like a straw dog, although I won’t deny I’ve heard those exact words at the lunch table…  Maybe better to just point out how simplistic solutions are seductive, and are encouraged by techie culture.

I want to hear more about how we’re opening the margins.  So far, I’m left with the suggestion that JADE will allow patrollers to push our models in new directions without ML-expert mediation.  This won’t be the obvious conclusion for most readers, I’m guessing, and I’d love to see this conclusion expanded.

EpochFail (talkcontribs)

First, I'm not sure I can address your thoughts re. process diagrams. I'm personally not as interested in actually modeling out the ecology as much as using the framework to communicate effectively about general dynamics. Maybe Jmorgan has some thoughts.

I love how you put this:

we could design interventions all day and the results might even be quite positive, but that the most just, lasting, and visionary change will come from empowering our stakeholders to “let a hundred algorithms bloom”, and we may be able to catalyze this by creating space at the margins.

When I'm thinking about margins, I'm imagining the vast space for re-mediation of quality control process without pushing the prediction models at all -- just making use of them in novel ways. I think that one does not have to fully open the world in order for effective openness to happen in a marginal sense. Though still, I do think there's going to be some interesting future work potential around making the prediction models more malleable. In the end, if there's a single shared model for "damaging" then that model will represent an authority and not a marginal perspective. We'd instead need to allow multiple damaging models if we were to support marginal activities at that level.

Reply to "Design rationale"

ORES system overview

EpochFail (talkcontribs)
The ORES architecture at a high-level view.

ORES can be understood as a machine prediction model container service where the "container" is referred to as a ScoringModel. A ScoringModel contains a reference to a set of dependency-aware features (see discussion of Dependency Injection) and has a common interface method called "score()" that takes the extracted feature values as a parameter and produces a JSON blob (called a "score"). ORES is responsible for extracting the features and serving the score object via a RESTful HTTP interface. In this section, we describe ORES' architecture and how we have engineered the system to support the needs of our users.
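
The following is a simplified, hypothetical sketch of the ScoringModel interface described above; the real implementation lives in the revscoring library and differs in detail.

class ScoringModel:
    """Container for a trained estimator and its dependency-aware features."""

    def __init__(self, features, estimator, version):
        self.features = features    # dependency-aware feature definitions
        self.estimator = estimator  # e.g. a trained gradient boosting classifier
        self.version = version

    def score(self, feature_values):
        """Take extracted feature values and return a JSON-able 'score' blob."""
        probabilities = self.estimator.predict_proba([feature_values])[0]
        labels = list(self.estimator.classes_)
        prediction = labels[int(probabilities.argmax())]
        return {"prediction": prediction,
                "probability": {str(label): float(p)
                                for label, p in zip(labels, probabilities)}}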

Horizontal scaling

In order to be a useful tool for Wikipedians and tool developers, the ORES system uses distributed computation strategies to provide a robust, fast, high-availability service. In order to make sure that ORES can keep up with demand, we've focused on two points at which the ORES system implements horizontal scalability: the input-output (IO) workers (uwsgi[1]) and the computation workers (celery[2]). When a request is received, it is split across the pool of available IO workers. During this step of computation, all of the root dependencies are gathered for feature extraction using external APIs (e.g. the MediaWiki API[3]). Then these root dependencies are submitted to a job queue managed by celery for the CPU-intensive work. By implementing ORES in this way, we can add and remove IO and CPU workers dynamically in order to adjust to demand.
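
The sketch below illustrates this IO/CPU split using an assumed Celery setup; the function names and payloads are illustrative placeholders, not ORES' actual code.

from celery import Celery

app = Celery("ores_sketch",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

def fetch_root_dependencies(rev_id):
    """IO-bound step (runs in a uwsgi worker): gather revision text, metadata,
    etc. from external APIs. Placeholder for the real MediaWiki API calls."""
    return {"rev_id": rev_id, "text": "..."}

@app.task
def score_revision(model_name, root_dependencies):
    """CPU-bound step (runs in a celery worker): feature extraction + scoring.
    Placeholder for the real revscoring-based computation."""
    return {"model": model_name, "score": {"prediction": False}}

def handle_request(model_name, rev_id):
    """Entry point for an IO worker: do the IO, then enqueue the CPU work."""
    roots = fetch_root_dependencies(rev_id)
    return score_revision.delay(model_name, roots)  # returns an AsyncResult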

Robustness

Currently, IO workers and CPU workers are split across a set of 9 servers in two datacenters (for a total of 18 servers). Each of these 9 servers runs 90 CPU workers and 135 IO workers. The major limitation on running more workers on a single server is memory (RAM), due to the requirement of keeping several different prediction models loaded into memory. IO and CPU workers draw from a shared queue, so other servers can take over should any individual server go down. Further, should one datacenter go fully offline, our load-balancer can detect this and route traffic to the remaining datacenter. This provides a high level of robustness and allows us to guarantee a high degree of uptime. Given the relative youth of the ORES system, it's difficult to give a fair estimate of the exact uptime percentage[4].

Batch processing

Many of our users' use-cases involve the batch scoring of a large number of revisions. E.g. when using ORES to build work-lists for Wikipedia editors, it's common to include an article quality prediction. Work-lists are either built from the sum total of all 5m+ articles in Wikipedia or from some large subset specific to a single WikiProject (e.g. WikiProject Women Scientists claims about 6k articles[5]). Robots that maintain these work-lists will periodically submit large batch scoring jobs to ORES, typically once per day. It's relevant to note that many researchers are also making use of ORES for various historical analyses, and their activity usually shows up in our logs as a sudden burst of requests.

The separation between IO and CPU work is very useful as it allows us to efficiently handle multi-score requests. A request to score 50 revisions can take advantage of batch IO during the first step of processing and still extract features for all 50 scores in parallel during the second, CPU-intensive step. This batch processing affords up to a 5X increase in scoring speed for large numbers of scores[6]. We generally recommend that individuals looking to do batch processing with ORES submit requests in blocks of 50 scores using up to two parallel connections. This would allow a user to easily score 1 million revisions in less than 24 hours in the worst-case scenario that none of the scores were cached -- which is unlikely for recent Wikipedia activity.
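
A minimal sketch of that recommended pattern follows; it assumes the scoring endpoint's revids parameter accepts pipe-separated revision IDs, as in the public ORES API.

import requests
from concurrent.futures import ThreadPoolExecutor

def score_batch(rev_ids, context="enwiki", model="wp10"):
    """Request scores for up to ~50 revisions in a single call."""
    resp = requests.get(
        "https://ores.wikimedia.org/v3/scores/{0}/".format(context),
        params={"models": model, "revids": "|".join(map(str, rev_ids))})
    resp.raise_for_status()
    return resp.json()

def score_all(rev_ids, batch_size=50, connections=2):
    """Split a long list of revisions into batches and use two connections."""
    batches = [rev_ids[i:i + batch_size]
               for i in range(0, len(rev_ids), batch_size)]
    with ThreadPoolExecutor(max_workers=connections) as pool:
        return list(pool.map(score_batch, batches))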

Single score processing

Many of our users' use-cases involve the request for a single score/prediction. E.g. when using ORES for realtime counter-vandalism, tool developers will likely listen to a stream of edits as they are saved and submit a scoring request immediately. It's critical that these requests return in a timely manner. We implement several strategies to optimize this request pattern.

Single score speed. In the worst case scenario, ORES is generating a score from scratch. This is the common case when a score is requested in real-time -- right after the target edit/article is saved. We work to ensure that the median score duration is around 1 second. Currently our metrics tracking suggests that for the week April 6-13th, our median, 75%, and 95% score response timings are 1.1, 1.2, and 1.9 seconds respectively.

Caching and Precaching. In order to take advantage of our users' overlapping interest in recent activity, we also maintain a basic LRU cache (using redis[7]) with a deterministic score naming scheme (e.g. enwiki:1234567:damaging would represent a score for the English Wikipedia damaging model for the edit identified by revision ID 1234567). This allows requests for scores that have recently been generated to be returned within about 50ms via HTTPS.
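
A sketch of the naming scheme and cache lookup follows (assuming a local redis server; this is illustrative, not ORES' actual cache code).

import json
import redis

cache = redis.StrictRedis(host="localhost", port=6379, db=0)

def score_key(wiki, rev_id, model):
    return "{0}:{1}:{2}".format(wiki, rev_id, model)  # e.g. "enwiki:1234567:damaging"

def get_or_compute(wiki, rev_id, model, compute_score):
    key = score_key(wiki, rev_id, model)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)               # cache hit: fast response
    score = compute_score(wiki, rev_id, model)  # cache miss: full scoring job
    cache.set(key, json.dumps(score))
    return score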

In order to make sure that scores for recent edits are available in the cache for real-time use-cases, we implement a "precaching" strategy that listens to a high-speed stream of recent activity in Wikipedia and automatically requests scores for a specific subset of actions (e.g. edits). This allows us to attain a cache hit rate of about 80% consistently.
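
The sketch below illustrates the precaching idea using Wikimedia's public EventStreams feed; the stream URL and event fields come from that public API, and using it here is our illustration rather than a description of ORES' internal listener.

import json
import requests

stream = requests.get(
    "https://stream.wikimedia.org/v2/stream/recentchange", stream=True)

for line in stream.iter_lines():
    if not line.startswith(b"data: "):
        continue  # skip SSE keep-alives and event/id lines
    event = json.loads(line[len(b"data: "):])
    if event.get("wiki") == "enwiki" and event.get("type") == "edit":
        # Request the score so it lands in the cache before anyone asks for it.
        requests.get(
            "https://ores.wikimedia.org/v3/scores/enwiki/",
            params={"models": "damaging",
                    "revids": str(event["revision"]["new"])})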

There are also secondary caches of ORES scores implemented outside of our service. E.g. the ORES Review Tool (an extension of MediaWiki) roughly mimics our own precaching strategy for gathering scores for recent edits in Wikipedia. Since this cache and its access patterns are outside the metrics gathering system we use for the service, our cache hit rate is actually likely much higher than we're able to report.

De-duplication. In real-time use-cases of ORES, it's common that we'll receive many requests to score the same edit/article right after it was saved. We use the same deterministic score naming scheme from the cache to identify scoring tasks and to ensure that simultaneous requests for the same score attach to the same result (or pending result) rather than starting a duplicate scoring job. This pattern is very advantageous in the case of precaching because of our network latency advantage: we can generally guarantee that the precaching request for a specific score precedes the external request for that score. The result is that the external request attaches to the result of a score generation process that started before the external request arrived. So even worst-case scenarios where the score has not yet been generated often result in a better-than-expected response speed from the tool developers' and users' point of view.
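
A simplified sketch of this de-duplication pattern (not ORES' actual implementation):

import threading

_pending = {}            # score key -> in-flight task
_pending_lock = threading.Lock()

def request_score(key, start_scoring_job):
    """start_scoring_job(key) must return an object with a .result() method
    (e.g. a concurrent.futures.Future or a celery AsyncResult)."""
    with _pending_lock:
        task = _pending.get(key)
        if task is None:
            task = start_scoring_job(key)  # only the first request starts a job
            _pending[key] = task
    try:
        return task.result()               # later requests attach to the same task
    finally:
        with _pending_lock:
            _pending.pop(key, None)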

Empirical access patterns

External requests per minute. The number of requests per minute is plotted for ORES for the week ending on April 13th, 2018. A 4 hour block is broken out to show the shape of a recent, periodic burst of activity that usually happens at 11:40 UTC.
Precaching requests per minute. The total number of precaching requests per minute is plotted for the week ending on April 13th, 2018.

The ORES service has been online since July of 2015[8]. Since then, we have seen steadily rising usage as we've developed and deployed new models. Currently, ORES supports 66 different models for 33 different language-specific wikis.

Generally, we see 50 to 125 requests per minute from external tools that are using ORES' predictions (excluding the MediaWiki extension, which is more difficult to track). Sometimes these external requests burst up to 400-500 requests per minute. Figure ?? shows the periodic and bursty nature of scoring requests received by the ORES service. Note that every day at about 11:40 UTC, the request rate jumps due to some batch scoring job -- most likely a bot.

Figure ?? shows our rate of precaching requests coming from our own systems. This graph roughly reflects the rate of edits happening across all of the wikis that we support, since we start a scoring job for nearly every edit as it happens. Note that the number of precaching requests is about an order of magnitude higher than our known external score request rate. This is expected, since Wikipedia editors and the tools they use will not request a score for every single revision. It is the computational price we pay to attain a high cache hit rate and to ensure that our users get the quickest response possible for the scores that they do need.

  1. https://uwsgi-docs.readthedocs.io/en/latest/
  2. http://www.celeryproject.org/
  3. MW:API
  4. en:High_availability#"Nines"
  5. https://quarry.wmflabs.org/query/14033
  6. Sarabadani, A., Halfaker, A., & Taraborelli, D. (2017, April). Building automated vandalism detection tools for Wikidata. In Proceedings of the 26th International Conference on World Wide Web Companion (pp. 1647-1654). International World Wide Web Conferences Steering Committee.
  7. https://redis.io/
  8. See our announcement in Nov. 2015: https://blog.wikimedia.org/2015/11/30/artificial-intelligence-x-ray-specs/
Reply to "ORES system overview"

Talk about proto-jade (misclassification pages)

EpochFail (talkcontribs)

When we first deployed ORES, we reached out to several different wiki-communities and invited them to test out the system for use in patrolling for vandalism. In these announcements, we encouraged editors to install ScoredRevisions -- the only tool that made use of ORES' edit quality models at the time. ScoredRevisions both highlights edits that are likely to be damaging (as predicted by the model) and displays the likelihood of the prediction as a percentage.

It didn't take long before our users began filing false-positive reports on wiki pages of their own design. In this section we will describe three cases where our users independently developed these false-positive reporting pages and how they used them to understand ORES, the roles of automated quality control in their own spaces, and to communicate with us.

Case studies

Report mistakes (Wikidata)

ORES report mistakes -- improvements table.

When we first deployed prediction models for Wikidata, a free and open knowledge base that can be read and edited by both humans and machines[1], we were breaking new ground by building a damage detection classifier for a structured data wiki[2]. So we created a page called "Report mistakes" and invited users to report mistakes that the prediction model made, but we left the format and structure of the page largely up to the users.

Within 20 minutes, we received our first report from User:Mbch that ORES was flagging edits as potentially damaging that couldn't possibly be vandalism. As reports streamed in, we began to respond to them and make adjustments to the model building process to address data extraction bugs and to increase the signal so that the model could better differentiate damaging from non-damaging edits. After a month of reports and bug fixes, we built a table to represent the progress we had made, across iterations of the model, against the reported false positives. See Figure ?? for a screenshot of the table. Each row represents a misclassified edit and each column describes the progress we made in no longer detecting those edits as damaging in subsequent iterations of the model. Through this process, we learned how Wikidata editors saw damage and how our modeling and feature extraction process captured signals in ways that differed from Wikidata editors' understandings. We were also able to publicly demonstrate improvements to this community.

Patrolling/ORES (Italian Wikipedia)

Italian Wikipedia was one of the first wikis where we deployed basic edit quality models. Our local collaborator who helped us develop the language-specific features, User:Rotpunkt, created a page for ORES[3] with a section for reporting false positives ("falsi positivi"). Within several hours, Rotpunkt and a few other editors started to notice some trends in their false positive reports. First, Rotpunkt noticed that there were several counter-vandalism edits that ORES was flagging as potentially damaging, so he made a section for collecting that specific type of mistake ("annullamenti di vandalismo"). A few reports later, he added a section for corrections to the verb "to have" ("correzioni verbo avere"). Through this process, editors from Italian Wikipedia were essentially performing a grounded theory exploration of the general classes of errors that ORES was making.

Once there were several of these mistake-type sections and several reports within each section, Rotpunkt reached out to us to let us know what he'd found. He explained to us (via our IRC channel) that many of ORES' mistakes were understandable, but that there was a general trend in mistakes around the Italian verb for "have": "ha". We knew immediately what was likely to be the issue. It turns out that "ha" in English and many other languages represents laughter -- an example of informal language that doesn't belong in an encyclopedia article -- while the word "ha" in Italian is a form of the verb "to have" and is perfectly acceptable in articles.

Because of the work of Rotpunkt and his collaborators in Italian Wikipedia, we were able to recognize the source of this issue (a set of features intended to detect the use of informal language in articles) and to remove "ha" from that list for Italian Wikipedia. This is just one example of many issues we were able to address because of the grounded theory and thematic analysis performed by Italian Wikipedians.

PatruBOT (Spanish Wikipedia)

Soon after we released support for Spanish Wikipedia, User:jem developed a robot to automatically revert damaging edits using ORES predictions (PatruBOT). This robot was not running for long before our discussion pages started to be bombarded with confused Spanish-speaking editors asking us questions about why ORES did not like their work. We struggled to understand the origin of the complaints until someone reached out to us to tell us about PatruBOT and its activities.

We haven't been able to find the source code for PatruBOT, but from what we've been able to gather by looking at its activity, it appears to us that PatruBOT was too sensitive and was likely reverting edits that ORES did not have enough confidence about. Generally, when running an automated counter-vandalism bot, the most immediate operational concern is precision (the proportion of positive predictions that are true positives). This is because mistakes are extra expensive when there is no human judgement between a prediction and a revert (rejection of the contribution). The proportion of all damaging edits that are actually caught by the bot (recall) is a secondary concern to be optimized.

We generally recommend that bot developers who are interested in running an automated counter-vandalism bot use a threshold that maximizes recall at high precision (90% is a good starting point). According to our threshold optimization query, the Spanish Wikipedia damaging model can be expected to reach 90% precision and catch 17% of damage if the bot only reverts edits where the damaging likelihood estimate is above 0.959.
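
In code, the recommended pattern reduces to a comparison against that threshold (a sketch; the 0.959 value comes from the threshold optimization described above):

DAMAGING_THRESHOLD = 0.959  # ~90% precision, ~17% recall for the eswiki "damaging" model

def should_auto_revert(damaging_score):
    """damaging_score is the ORES 'damaging' score blob for one revision."""
    return damaging_score["score"]["probability"]["true"] >= DAMAGING_THRESHOLD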

We reached out to the bot developer to try to help, but given the voluntary nature of their work, they were not available to discuss the issue with us. Eventually, other editors who were concerned with PatruBOT's behavior organized an informal crowdsourced evaluation of the fitness of PatruBOT's behavior[4] where they randomly sampled 1000 reverts performed by PatruBOT and reviewed their appropriateness. At the time of writing, PatruBOT has been stopped[5] and the informal evaluation is ongoing.

Discussion

These case studies of responses to ORES provide a window into how our team has been able to work with the locals in various communities to refine our understanding of their needs, into methods for recognizing and addressing biases in ORES' models, and into how people think about what types of automation they find acceptable in their spaces.

Refining our understandings and iterating our models. The information divide between us researchers/engineers and the members of a community is often wider than we realize. Through iteration on the Wikidata and Italian models, we learned about incorrect assumptions we'd made about how edits happen (e.g. client edits in Wikidata) and how language works (e.g. "ha" is not laughter in Italian). It's likely we'd never be able to fully understand the context in which damage detection models should operate before deploying the models. But these case studies demonstrate how, with a tight communication loop, many surprising and wrong assumptions that were baked into our modeling process could be identified and addressed quickly. It seems that many of the relevant issues in feature engineering and model tuning become *very* apparent when the model is used in context to address a real problem (in these cases, vandalism).

Methods for recognizing and addressing bias. The Italian Wikipedians showed us something surprising and interesting about collaborative evaluation of machine prediction: thematic analysis is very powerful. Through the collection of ORES mistakes and iteration, our Italian collaborators helped us understand general trends in the types of mistakes that ORES made. It strikes us that this is a somewhat general strategy for bias detection. While our users certainly brought their own biases to their audit of ORES, they were quick to discover and come to consensus about trends in ORES' issues. Before they performed this process and shared their results with us, we had no idea that any issue was present. After all, the fitness statistics for the damage detection model looked pretty good -- probably good enough to publish a research paper! Their use of thematic analysis seems like a powerful tool that developers will want to make sure is well supported in any crowd-based auditing technology.

How people think about acceptable automation. In our third case study, Spanish Wikipedians are in the process of coming to agreement about what roles are acceptable for automated agents. By watching PatruBOT work in practice, they decided that its false discovery rate (i.e., 1 - precision) was too high, and they started their own independent analysis to find a quantitative, objective answer about what the real rate is. Eventually they may come to a conclusion about an acceptable rate, or they may decide that no revert is acceptable without human intervention.

  1. https://wikidata.org
  2. Sarabadani, A., Halfaker, A., & Taraborelli, D. (2017, April). Building automated vandalism detection tools for Wikidata. In Proceedings of the 26th International Conference on World Wide Web Companion (pp. 1647-1654). International World Wide Web Conferences Steering Committee.
  3. it:Progetto:Patrolling/ORES
  4. es:Wikipedia:Mantenimiento/Revisión_de_errores_de_PatruBOT/Análisis
  5. es:Wikipedia:Café/Archivo/Miscelánea/Actual#Parada_de_PatruBOT
Reply to "Talk about proto-jade (misclassification pages)"

TODO: ORES system (Threshold optimizations)

EpochFail (talkcontribs)

When we first started developing ORES, we realized that interpreting the likelihood estimates of our prediction models would be crucial to using the predictions effectively. Essentially, the operational concerns of Wikipedia's curators need to be translated into a likelihood threshold. For example, counter-vandalism patrollers seek to catch all (or almost all) vandalism before it is allowed to stay in Wikipedia for very long. That means they have an operational concern around the recall of a damage prediction model. They'd also like to review as few edits as possible in order to catch that vandalism, so they also have an operational concern around the filter rate -- the proportion of edits that are not flagged for review by the model[1].

By finding the threshold of prediction likelihood that optimizes the filter-rate at a high level of recall, we can provide vandal-fighters with an effective trade-off for supporting their work. We refer to these optimizations in ORES as threshold optimizations and ORES provides information about these thresholds in a machine-readable format so that tools can automatically detect the relevant thresholds for their wiki/model context.

Originally, when we developed ORES, we defined these threshold optimizations in our deployment configuration. But eventually, it became apparent that our users wanted to be able to search through fitness metrics to derive their own optimizations. Adding new optimizations and redeploying quickly became a burden on us and a delay for our users. So we developed a syntax for requesting an optimization from ORES in realtime using fitness statistics from the models' test data. E.g. "maximum recall @ precision >= 0.9" gets a useful threshold for a counter-vandalism bot, while "maximum filter_rate @ recall >= 0.75" gets a useful threshold for semi-automated edit review (with human judgement).

Example:

https://ores.wikimedia.org/v3/scores/enwiki/?models=damaging&model_info=statistics.thresholds.true.'maximum filter_rate @ recall >= 0.75'

Returns:

  {"threshold": 0.299, ..., 
   "filter_rate": 0.88, "fpr": 0.097, "match_rate": 0.12, "precision": 0.215, "recall": 0.751}

This result shows that, when the threshold is set at a 0.299 likelihood of damaging=true, you can expect a recall of 0.751, a precision of 0.215, and a filter rate of 0.88. While the precision is low, this threshold reduces the overall workload of vandal-fighters by 88% while still catching 75% of (the most egregious) damaging edits.
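
A tool could look such a threshold up at runtime with a request like the following sketch; the exact path into the response JSON (including the list under "true") is our assumption based on the trimmed output above.

import requests

query = "maximum recall @ precision >= 0.9"  # e.g. for a counter-vandalism bot
resp = requests.get(
    "https://ores.wikimedia.org/v3/scores/enwiki/",
    params={"models": "damaging",
            "model_info": "statistics.thresholds.true.'{0}'".format(query)})
resp.raise_for_status()

info = resp.json()["enwiki"]["models"]["damaging"]
optimization = info["statistics"]["thresholds"]["true"][0]
print(optimization["threshold"], optimization["precision"], optimization["recall"])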

Reply to "TODO: ORES system (Threshold optimizations)"
EpochFail (talkcontribs)

Beyond the obvious remediation of quality control processes itself.

Think technological probe:

The transparency encourages people to consider the relationship of their process to the algorithmic predictions that support them. "What is the proper role of algorithmic tools in quality control in Wikipedia?" See the discussion of PatruBOT in Spanish Wikipedia.

The availability of the algorithm allows people to critique dominant narratives about quality control issues. See ACTRIAL and the draft quality model.

Reply to "Re-mediating what?"

Preserving the margins on digital platforms

EpochFail (talkcontribs)

https://medium.com/@gmugar/preserving-the-margins-on-digital-platforms-c42bdbab8dad

While [Open Online Platforms] welcome and encourage a wide range of participation, they have distinct terms of participation that constrain what we can and cannot do.

My take-away from this is that the "margins" are a source of innovation. The conflict between order and wide participation is negotiated and adapted in the margins. In ORES' case, the intervention originated from the observation that Wikipedia was failing to adapt to a problem. The goal is to expand the margins around IUI tool developers in order to jump-start innovation/adaptation there.

Jmorgan what would the lit on genre ecologies have to say about this?

EpochFail (talkcontribs)

I want to mash together the ideas around Genre Ecologies, Successors, Hearing to speech/Boundary preservation and ask what we'd expect to see. I think the answer is that we'd expect to see stuff like what Sage et al. are developing.

EpochFail (talkcontribs)

By hearing to speech, we're opening the door for re-mediation artifacts that can gain ground/support/anchor.

"By virtue of the way that ORES is designed, it creates more opportunity for different interpretations for what quality control means or how quality control is enacted." --J-Mo

"Can you say that a tool built on ORES fundamentally changes the way that people view quality control process?" --J-Mo

"The system for quality control has taken on a kind of funnel model where ... [lines of defense] ... and that has stabilized."

"JADE is part of the ORES system. It is a hearing-to-speech."

"JADE/ORES-transparency encourages people to articulate what they think quality means."

Reply to "Preserving the margins on digital platforms"

Title ideas -- what *is* ORES anyway?

EpochFail (talkcontribs)

Hey folks, I've been working on figuring out what to call ORES -- and in effect, what to title the paper. ORES is an intervention that enables "successors" to the status quo. It "expands/preserves the margin" of quality control tool development. It's a successor system itself -- given that we now value transparency, audit-ability, and interrogability and thus ORES has been developed with those values in mind.

I joked on twitter that we call ORES a "successor platform" -- as in a "platform for developing successors". I wonder if there's something from the genre ecologies literature that might give this a name. E.g. what does one call the ecosystem or maybe something that improves viability in the ecosystem? ORES enables an increase in ecological diversity for the purposes of boosting the adaptive capacity of the larger system.

EpochFail (talkcontribs)

Facilitating re-mediation of Wikipedia's socio-technical problems.

Reply to "Title ideas -- what *is* ORES anyway?"

Observation: Adoption patterns

EpochFail (talkcontribs)

When we designed and developed ORES, we were targeting a specific problem -- expanding the set of values applied to the design of quality control tools to include a recent understanding of the importance of newcomer socialization. However, we don't have any direct control over how developers choose to use ORES. We hypothesized that, by making edit quality predictions available to all developers, we would lower the barrier to experimentation in this space; in practice, it's clear that we lowered barriers to experimentation generally. After we deployed ORES, we implemented some basic tools to showcase it, and we have since observed steady adoption of our various prediction models by external developers, both in existing tools and through the development of new tools.

Showcase tools

In order to showcase the utility of ORES, we developed two simple tools to surface ORES predictions within MediaWiki -- the software that powers Wikipedia: ScoredRevisions and the ORES Review Tool.

ScoredRevisions[1] is a JavaScript-based "gadget" that runs on top of MediaWiki. When certain pages load in the MediaWiki interface (e.g. Special:RecentChanges, Special:Watchlist, etc.), ScoredRevisions submits requests to the ORES service to score the edits present on the page. The script then updates the page with highlighting based on ORES predictions. Edits that are likely to be "damaging" are highlighted in red. Edits that might be damaging and are worth reviewing are highlighted in yellow. Other edits are left with the default background.

While this interface was excellent for displaying ORES' potential, it had limited utility. First, it was severely limited by the performance of the ORES system. While ORES is reasonably fast when scoring a single edit, scoring 50-500 edits (the range that commonly appears on these pages) can take 30 seconds to 2 minutes, so a user is left waiting for the highlighting to appear. Also, because ScoredRevisions could only score edits after they were rendered, there was no way for a user to ask the system to filter edits ahead of time -- for example, to only show edits that are likely to be damaging. The user instead needed to visually filter long lists based on highlighted rows.

The ORES Review Tool[2] is a MediaWiki extension implemented in PHP. It uses an offline process to score all recent edits to Wikipedia and store those scores in a table for querying and quick access. This tool implemented similar functionality to ScoredRevisions, but because it pre-cached ORES scores in a table, it rendered highlights for likely damaging edits as soon as the page loaded, and it enabled users to filter based on likely damaging edits.

We released the ORES Review Tool as a "beta feature" on Wikimedia wikis where we were able to develop advanced edit quality models. The response was extremely positive. Over 26k editors in Wikipedia had manually enabled the ORES Review Tool by April of 2017. For reference, the total number of active editors across all languages of Wikipedia varies around 70k[3], so a large proportion of active editors consciously chose to enable the feature.

Adoption in current tools

Many tools for counter-vandalism in Wikipedia were already available when we developed ORES. Some of them made use of machine prediction (e.g. Huggle[4], STiki, ClueBot NG), but most did not. Soon after we deployed ORES, many developers who had not previously included prediction models in their tools were quick to adopt it. For example, RealTime Recent Changes[5] includes ORES predictions alongside its realtime interface, and FastButtons[6], a Portuguese Wikipedia gadget, began displaying ORES predictions next to its buttons for quickly reviewing and reverting damaging edits.

Other tools that were not targeted at counter-vandalism also found ORES predictions -- specifically article quality (wp10) -- useful. For example, RATER[7], a gadget for supporting the assessment of article quality, began to include ORES predictions to help its users assess the quality of articles, and SuggestBot[8], a robot for suggesting articles to an editor, began including ORES predictions in its tables of recommendations.

New tools

A screenshot of the Edit Review Filters interface with ORES score-based filters displayed at the top of the list

Many new tools have been developed since ORES was released that may not have been developed at all otherwise. For example, the Wikimedia Foundation developed a complete redesign of MediaWiki's Special:RecentChanges interface that implements a set of powerful filters and highlighting. They took the ORES Review Tool to its logical conclusion with an initiative that they referred to as Edit Review Filters[9]. In this interface, ORES scores are prominently featured at the top of the list of available filters.

When we first developed ORES, English Wikipedia was the only wiki we were aware of with a robot that used machine prediction to automatically revert obvious vandalism[10]. After we deployed ORES, several wikis developed bots of their own that use ORES predictions to automatically revert vandalism. For example, PatruBOT in Spanish Wikipedia[11] and Dexbot in Persian Wikipedia[12] now automatically revert edits that ORES predicts are damaging with high confidence. These bots have been received with mixed acceptance. Because of the lack of human oversight, concerns were raised about PatruBOT's false positive rate, but after consulting with the developer, we were able to help them find an acceptable confidence threshold for auto-reverts.

One of the most noteworthy new tools is the suite of tools developed by Sage Ross to support the Wiki Education Foundation's[13] activities. Their organization supports classroom activities that involve editing Wikipedia. They develop tools and dashboards that help students contribute successfully and help teachers monitor their students' work. Ross has recently published on how they interpret meaning from ORES' article quality models[14] and has integrated these predictions into their tools and dashboards to recommend work that students need to do to bring their articles up to Wikipedia's standards. See our discussion of interrogation in Section [foo].

Reply to "Observation: Adoption patterns"