When we first started developing ORES, we realized that interpreting the likelihood estimates of our prediction models would be crucial to using the predictions effectively. Essentially, the operational concerns of Wikipedia's curators need to be translated into a likelihood threshold. For example, counter-vandalism patrollers seek to catch all (or almost all) vandalism before it sticks in Wikipedia for long. That means they have an operational concern around the recall of a damage prediction model. They would also like to review as few edits as possible in order to catch that vandalism, so they also have an operational concern around the filter rate -- the proportion of edits that are not flagged for review by the model[1].
By finding the threshold of prediction likelihood that optimizes the filter rate at a high level of recall, we can provide vandal-fighters with an effective trade-off for supporting their work. We refer to these optimizations in ORES as threshold optimizations, and ORES provides information about these thresholds in a machine-readable format so that tools can automatically detect the relevant thresholds for their wiki/model context.
Originally, when we developed ORES, we defined these threshold optimizations in our deployment configuration. But eventually, it became apparent that our users wanted to be able to search through fitness metrics to adapt their own optimizations. Adding new optimizations and redeploying quickly became a burden on us and a delay for our users. So we developed a syntax for requesting an optimization from ORES in real time, computed from the fitness statistics of the model's tests. For example, "maximum recall @ precision >= 0.9" returns a useful threshold for a counter-vandalism bot, while "maximum filter_rate @ recall >= 0.75" returns a useful threshold for semi-automated edit review (with human judgement).
Example:
https://ores.wikimedia.org/v3/scores/enwiki/?models=damaging&model_info=statistics.thresholds.true.'maximum filter_rate @ recall >= 0.75'
Returns:
{"threshold": 0.299, ..., "filter_rate": 0.88, "fpr": 0.097, "match_rate": 0.12, "precision": 0.215, "recall": 0.751}
This result shows that, when a threshold is set at 0.299 likelihood of damaging=true, you can expect a recall of 0.751, a precision of 0.215, and a filter rate of 0.88. While the precision is low, this threshold reduces the overall workload of vandal-fighters by 88% while still catching 75% of (the most egregious) damaging edits.
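To make the workflow concrete, the following is a minimal sketch (not taken from any existing ORES client) of how a tool might fetch such a threshold optimization and then apply it to incoming edits. It assumes the v3 response nesting suggested by the example above (wiki context, then models or scores, then the model name); the exact structure, and whether the optimization result is wrapped in a list, should be verified against the live API. The revision ID at the end is purely hypothetical.

import requests

ORES = "https://ores.wikimedia.org/v3/scores/enwiki/"

def get_threshold(optimization="maximum filter_rate @ recall >= 0.75"):
    # Ask ORES for the likelihood threshold that satisfies the optimization.
    params = {
        "models": "damaging",
        "model_info": "statistics.thresholds.true.'{0}'".format(optimization)}
    response = requests.get(ORES, params=params).json()
    # Assumed nesting: context -> models -> model name -> statistics;
    # check this against the actual response before relying on it.
    result = response["enwiki"]["models"]["damaging"]["statistics"]["thresholds"]["true"]
    if isinstance(result, list):  # some responses wrap the result in a list
        result = result[0]
    return result["threshold"]  # e.g. 0.299 in the example above

def needs_review(rev_id, threshold):
    # Score one revision and flag it if its damaging=true likelihood
    # meets or exceeds the threshold.
    response = requests.get(
        ORES, params={"models": "damaging", "revids": rev_id}).json()
    score = response["enwiki"]["scores"][str(rev_id)]["damaging"]["score"]
    return score["probability"]["true"] >= threshold

threshold = get_threshold()
print(needs_review(123456789, threshold))  # hypothetical revision ID

Because the optimization is requested at query time rather than baked into deployment configuration, a tool can swap in a different constraint (e.g. "maximum recall @ precision >= 0.9") without any change on the ORES side.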