User:TJones (WMF)/Notes/Analysis of DYM Method 0
May 2019. See TJones_(WMF)/Notes for other projects. See also T212888. For help with technical jargon, check out the Search Glossary.
Background
Spelling corrections and other "Did you mean...?" (DYM) suggestions on Wikipedia and its sister projects are often not great. (I personally think it is due in part to the ridiculously wide coverage of topics on Wikipedia and the very large number of low-frequency terms in many languages on Wiktionary, for example.) So, we've undertaken a project to improve the DYM suggestions on Wikipedia.
The goal is to implement multiple methods for making suggestions that address the issues of different languages and projects, including difficulties introduced by writing systems and the amount of data available. (See T212884.) "Method 0" (M0, T212888) mines search logs for queries and apparent corrections to those queries made by the same person (e.g., in that last sentence I typed "smae" and then retyped it as "same").
Detection of such human-made self-corrections is not perfect, and short queries can be one character off from many other reasonable queries (e.g., suy could have been intended to be sub, sun, sur, say, shy, sly, soy, spy, sty, buy, or guy, or it could be Sōy, the village in Iran), but generally it is a high-accuracy approach. For the search nerds: it's high precision but low recall, because it can only apply to searches a human searcher has corrected.
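For illustration, here is a minimal sketch of the general idea behind this kind of log mining. It is not the actual M0 implementation; the session format, similarity threshold, and minimum pair count are all made up for the example.

```python
from collections import Counter
from difflib import SequenceMatcher

def mine_self_corrections(sessions, min_similarity=0.75, min_count=2):
    """Collect (query, correction) pairs from consecutive queries typed by
    the same searcher. Thresholds here are illustrative, not M0's."""
    pairs = Counter()
    for queries in sessions:
        for first, second in zip(queries, queries[1:]):
            if first == second:
                continue
            # Treat a near-identical follow-up query as a likely self-correction.
            if SequenceMatcher(None, first, second).ratio() >= min_similarity:
                pairs[(first, second)] += 1
    # Require a pair to show up more than once, to filter out noise.
    return {pair: n for pair, n in pairs.items() if n >= min_count}

# e.g., "smae" retyped as "same" by two different searchers
print(mine_self_corrections([["smae", "same"], ["smae", "same"]]))
```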
The goal of this analysis is to get a sense of how broadly Method 0 applies to queries on English Wikipedia, and to establish a baseline from the current DYM process to compare it to. I also want to get a rough idea of the quality of the suggestions, and some information on the queries themselves. Finally, I'll have some suggestions for improving M0 processing.
Data
David ran M0 training (i.e., looking for and evaluating human self-corrections) on two months' worth of query data and then took a third month's worth of data (about 170K queries) for evaluation. He pulled the original query and the current DYM suggestion, if any, made at the time, and then computed the M0 DYM suggestion, if any.
Note #0: The data we are using is not a perfect reflection of the user experience. There are some "last mile" tricks used to improve suggestions. For example, if a query is a title match for an article, we don't make a suggestion. The most common current suggestion is a common word in place of a YouTube personality's name, but since there is an article with that exact name, the suggestion doesn't actually get shown to anyone. These deviations are minor, and shouldn't affect the overall impression we get from our corpus.
Note #1: We only keep 90 days' worth of logs, so we don't have a ton of data to train or test on. (Commercial systems are reported to use years' worth of log data, for example.) So, M0 performance (particularly coverage) is probably somewhat worse than it would be if we used all the data available for training (which we would do if we were showing the results to users).
Note #2: M0 normalizes queries before making suggestions. Simple normalizations include lowercasing and using standard whitespace, so that "same thing", "Same Thing", "SaMe ThiNG ", and " sAmE tHiNg " are all treated as the same thing. [badum tiss]
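A minimal sketch of that kind of normalization, assuming only the lowercasing and whitespace steps mentioned above (M0 may well do more):

```python
import re

def normalize_query(query):
    """Lowercase and collapse whitespace, so trivially different queries
    are treated as the same string. Illustrative only."""
    return re.sub(r"\s+", " ", query.strip()).lower()

for q in ["same thing", "Same Thing", "SaMe ThiNG ", " sAmE tHiNg "]:
    assert normalize_query(q) == "same thing"
```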
Stats and Analysis
There are 170,181 queries in our test corpus. 86,940 (51.087%) of them are identical after M0 normalization.
- Current DYM: 70,202 queries (41.251%) got a DYM suggestion from the current production system. This is higher than some of us thought it would be. 67,948 queries (39.927%) got suggestions from the current DYM system but not from M0.
- M0 DYM: 2,622 queries (1.541%) got a DYM suggestion from M0. This is a bit lower than we expected, though this version was only built on 2/3 of the available data. 368 queries (0.216%) got suggestions from M0 but not from the current DYM system.
- Both: 2,254 queries (1.324%) got suggestions from both systems. 1,583 of the suggestions (0.930%) were the same for both the current DYM system and M0. There weren't any additional suggestions that differed only by case.
Multiple Suggestions
The current DYM system only makes one suggestion at a time, and we're only looking at one suggestion from M0 for any given query. (Longer term we may consider multiple suggestions for a given query. It is difficult, but possible, to get Google to generate multiple suggestions; I usually do it by putting together two ambiguously misspelled words that don't have anything to do with each other. For example, glake bruck, at the time of this writing, gets four suggestions from Google: glade brook, glaze brook, glass brick, and glow brick.)
However, because the current DYM suggestions come from different shards of the index (or from the same shard accessed weeks apart), the underlying word statistics can differ, and the exact DYM suggestion shown to users can vary.
In our corpus, the current production DYM system gave differing suggestions for 82 distinct queries, which is very, very few. Most of those queries got two different suggestions, but four of them each got three!
None of the queries got differing suggestions from M0, but over months they could, as the stats underlying M0 change.
Head-to-Head Evaluation
I pulled out a sample of 100 queries where the current DYM and M0 both had suggestions and those suggestions differed, and I reviewed them manually.
The final categorization counts are as follows:
- 68: M0 DYM gave a better suggestion.
- 5½: the current DYM gave a better suggestion. (In one case, the current DYM gave two different suggestions at different times. One of them was good and one wasn't, so I counted it as ½.)
- 7: Both suggestions were reasonable.
- 7: Neither suggestion was very good.
- 12: User intent was too unclear.
While the sample size is not large enough to make fine-grained estimates (the 95% confidence interval for 68% (68/100) is ~58-76%), it's clear that M0 is better than the current DYM when they both make suggestions.
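For reference, those bounds are roughly what a standard Wilson score interval gives for 68 out of 100 (I'm assuming that's more or less how the interval was computed); a quick check:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

print(wilson_interval(68, 100))  # ~ (0.583, 0.763), i.e., roughly 58-76%
```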
Not-Done Evaluations
I didn't look deeply into the quality of the suggestions for these three other obvious classes of queries:
- When the current DYM and M0 agree: I would guess these are more likely to be good suggestions, since the methods are rather different.
- Current DYM only (~40%): these could be better (more easy ones) or worse (random DYM guesses).
- M0 DYM only (~0.2%): based on the method internals, I'd expect these to be pretty good.
I could evaluate samples for any of these, if desired.
Queries by Script Type
I also broke down the queries by script type (after ignoring numbers and punctuation). This list isn't exhaustive; it only includes the easily categorizable and excludes queries with certain invisible characters.
Almost 97% of queries are in the Latin script, so obviously that's a reasonable place to put our focus, though it is surprising that almost 1% of queries are in Arabic script and a bit more than 0.5% are in Cyrillic.
The "Other/Mixed" category is generally made up of queries in multiple scripts/languages, or those that have emoji or other rarer characters.
The "IPA-ish" category attracts writing systems that use characters that are also used in the International Phonetic Alphabet (IPA); in this case, 4 of the 7 are Azerbaijani (which uses ə and ı), 2 look like genuine IPA, and the last one is unclear.
Category | Queries | Current DYM | M0 DYM |
Latin | 164615 | 69877 | 2610 |
Arabic | 1475 | 69 | 7 |
Cyrillic | 926 | 17 | 0 |
Ideographic | 690 | 1 | 2 |
Bengali | 346 | 3 | 0 |
Devanagari | 236 | 3 | 0 |
Hangul | 166 | 0 | 0 |
Greek | 111 | 2 | 0 |
Hebrew | 107 | 0 | 0 |
Thai | 76 | 0 | 0 |
Katakana | 53 | 1 | 0 |
Georgian | 38 | 0 | 0 |
Myanmar | 34 | 0 | 0 |
Tamil | 27 | 0 | 0 |
Khmer | 21 | 0 | 0 |
Malayalam | 19 | 0 | 0 |
Kannada | 11 | 0 | 0 |
Gujarati | 10 | 0 | 0 |
Hiragana | 10 | 0 | 0 |
Ethiopic | 8 | 0 | 0 |
IPA-ish | 7 | 3 | 0 |
Sinhala | 6 | 0 | 0 |
Armenian | 5 | 0 | 0 |
Telugu | 4 | 0 | 0 |
Mongolian | 1 | 0 | 0 |
Tibetan | 1 | 0 | 0 |
Other/Mixed | 504 | 82 | 0 |
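(The script bucketing above was done with our own tooling; purely for illustration, here is a rough sketch of one way to bucket queries by script using just the Python standard library. It keys off Unicode character names rather than proper script properties, so it is only approximate, and it is not the code used to build the table.)

```python
import unicodedata
from collections import Counter

def script_bucket(query):
    """Very rough script bucketing based on Unicode character names.
    Digits, punctuation, and spaces are ignored; anything with letters
    from more than one script lands in 'Other/Mixed'."""
    scripts = set()
    for ch in query:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "UNKNOWN CHARACTER")
        scripts.add(name.split()[0])
    if not scripts:
        return "None"
    if len(scripts) > 1:
        return "Other/Mixed"
    return scripts.pop()

queries = ["the beatles", "спартак", "mig-21", "Спam"]
print(Counter(script_bucket(q) for q in queries))
# e.g., Counter({'LATIN': 2, 'CYRILLIC': 1, 'Other/Mixed': 1})
```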
Invisibles
Looking at language analyzers, I often run into invisible characters that block proper searching. The most common are zero-width non-joiners, zero-width joiners, non-breaking spaces, soft hyphens, and bi-directional marks.
These marks all occur in our query sample, in queries in the Arabic, Bengali, Hebrew, Latin, Myanmar, and Sinhala scripts. A handful of the Latin examples get suggestions from the current DYM suggester. It makes sense to me to strip these characters out (or, in the case of non-breaking spaces, substitute regular spaces) during the M0 normalization process.
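If we do add that to normalization, a minimal sketch might look like the following; the character inventory here is just my reading of the list above, not a definitive set.

```python
# Zero-width (non-)joiners, soft hyphens, and bidi marks are removed outright;
# non-breaking spaces become regular spaces. (Illustrative list, not exhaustive.)
INVISIBLES = {
    "\u200c": "",   # zero-width non-joiner
    "\u200d": "",   # zero-width joiner
    "\u00ad": "",   # soft hyphen
    "\u200e": "",   # left-to-right mark
    "\u200f": "",   # right-to-left mark
    "\u00a0": " ",  # non-breaking space -> regular space
}
INVISIBLES_TABLE = str.maketrans(INVISIBLES)

def strip_invisibles(query):
    return query.translate(INVISIBLES_TABLE)

assert strip_invisibles("same\u200cthing\u00a0here") == "samething here"
```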
Numbers
There are a number of queries in our corpus that are primarily numbers. M0 makes very few suggestions for these (only 3), and the current DYM system only made 98 such suggestions, though there are clearly some phone numbers and ISBNs (and, less importantly, dates) in the list. Queries without suggestions include IP addresses and other ID-looking numbers.
Also, there is a query that is all Bengali numbers, though it doesn't get any suggestions.
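A rough sketch of the kind of numbers-and-punctuation filter suggested in the summary below; using Unicode categories means Bengali digits and the like count as digits too. The exact rules and examples are assumptions for illustration.

```python
import unicodedata

def is_numeric_query(query):
    """True if a query is nothing but digits, whitespace, and punctuation.
    Unicode categories are used, so non-ASCII digits count as well."""
    stripped = query.strip()
    if not stripped:
        return False
    return all(
        ch.isspace() or unicodedata.category(ch).startswith(("N", "P"))
        for ch in stripped
    )

assert is_numeric_query("978-0-14-044913-6")   # ISBN-like string
assert is_numeric_query("১২৩৪৫")                # Bengali digits
assert not is_numeric_query("mig-21")
```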
Additional Observations
- While the current DYM system occasionally makes suggestions with a different number of tokens than the query, that's clearly a place where M0 does much better. For example:
- bang;adesh: M0 gives bangladesh rather than band adele
- mig21: mig-21 versus might
- M0 is better at word-initial errors:
- toy tory 4: toy story 4 versus toy tony 4
- the ebatles: the beatles versus the ebatela
- M0 is better with names (especially those that don't occur in the wiki, though making a perfect suggestion and then finding nothing may not make users particularly happy). Some examples:
- marcus johanson: marcus johansson versus marcus johnson
- sanza stark: sansa stark versus santa stars
- M0 is better at transpositions and misplaced letters:
- shamaz: shazam versus shaman
- arbath: abarth versus armagh
- Numbers are hard. Both systems will change years in queries, though the current DYM system is much more prone to it. Search for "spiderman 2019" and get a suggestion for "spider man 2017". I think some of this for M0 comes from people clicking on current DYM suggestions. Overall, that's probably a good thing, since it reinforces good suggestions made by the current DYM suggester.
Summary and Suggestions
M0 is clearly much higher precision than the current DYM suggester, though with limited input data it doesn't have as much coverage as one would like.
Clearly it's ready for an A/B test of some sort (if we feel that is still necessary; I'm happy with the overwhelming improvement it gives).
However, we could still improve its performance in a few easy and not-so-easy ways:
- We should filter more aggressively on queries that are all numbers, spaces, and punctuation.
- We should remove the most common invisible characters (except for non-breaking spaces, which should just be regular spaces).
- We might want to modify M0's edit distance algorithm to assign a lower cost to swaps, which would give a higher weight to such potential self-corrections. Adjacent swaps are easy to handle ("be" for "eb" in "ebatles"), while longer-distance swaps ("shamaz" vs "shazam") are harder to detect. These kinds of errors make sense linguistically, too; when people do it in speech it's called metathesis. (A rough sketch of the adjacent-swap case follows this list.)
- Long term, it might be helpful to track self-corrections that come from people clicking on the current DYM suggestions (or, when M0 or others are deployed, any suggestions). At least for current DYM suggestions, these are often lower quality, and I think many people are clicking on them without carefully assessing them.
- We should work on figuring out how/whether we can increase the time-depth of our training data. Two thoughts come to mind: holding on to the self-correction data longer than 3 months, and holding on to particularly useful self-correction data (i.e., data that was used to make a suggestion that got clicked on one or more times) for longer than 3 months. The concerns in both cases may be the same, though, so the distinction may not matter.
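Here is the rough sketch of the cheaper-swap idea mentioned above: an optimal string alignment (restricted Damerau-Levenshtein) distance with a configurable transposition cost. The cost values are placeholders, not anything M0 actually uses.

```python
def edit_distance(a, b, swap_cost=0.5):
    """Optimal string alignment distance with a reduced cost for adjacent
    transpositions. Long-distance transpositions ("shamaz" -> "shazam")
    still count as two substitutions; catching those cheaply would need
    a different approach."""
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = float(i)
    for j in range(len(b) + 1):
        d[0][j] = float(j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap_cost)
    return d[len(a)][len(b)]

print(edit_distance("ebatles", "beatles"))  # 0.5: one adjacent swap
print(edit_distance("shamaz", "shazam"))    # 2.0: distant swap, two substitutions
```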
Other thoughts:
- There may be an error in our logging or in the extraction script: the query template:discpute gets a current DYM suggestion of template:dispute, but only dispute is recorded in this corpus. I don't think this is very important for this analysis, because M0 doesn't make any suggestions for queries with colons in them, but something weird is happening.