User:TJones (WMF)/Notes/Analysis of DYM Method 0
May 2019. See TJones_(WMF)/Notes for other projects. See also T212888. For help with technical jargon, check out the Search Glossary.
Background
Spelling corrections and other "Did you mean...?" (DYM) suggestions on Wikipedia and its sister projects are often not great. (I personally think it is due in part to the ridiculously wide coverage of topics on Wikipedia and the very large number of low-frequency terms in many languages on Wiktionary, for example.) So, we've undertaken a project to improve the DYM suggestions on Wikipedia.
The goal is to implement multiple methods for making suggestions that address the issues of different languages and projects, including difficulties introduced by writing systems and the amount of data available. (See T212884.) "Method 0" (M0, T212888) mines search logs for queries and apparent corrections to those queries made by the same person (e.g., in that last sentence I typed "smae" and then retyped it as "same").
Detection of such human-made self-corrections is not perfect, and short queries can be one character off from many other reasonable queries (e.g., suy could have been intended to be sub, sun, sur, say, shy, sly, soy, spy, sty, buy, or guy, or it could be Sōy, the village in Iran), but generally it is a high-accuracy approach. For the search nerds: it's high precision but low recall, because it can only apply to searches a human searcher has corrected.
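For illustration, here is a minimal sketch of the general idea behind this kind of log mining. It is not the actual M0 implementation; the session format, similarity threshold, and minimum pair count are all made up for the example.

```python
from collections import Counter
from difflib import SequenceMatcher

def mine_self_corrections(sessions, min_similarity=0.75, min_count=2):
    """Collect (query, correction) pairs from consecutive queries typed by
    the same searcher. Thresholds here are illustrative, not M0's."""
    pairs = Counter()
    for queries in sessions:
        for first, second in zip(queries, queries[1:]):
            if first == second:
                continue
            # Treat a near-identical follow-up query as a likely self-correction.
            if SequenceMatcher(None, first, second).ratio() >= min_similarity:
                pairs[(first, second)] += 1
    # Require a pair to show up more than once, to filter out noise.
    return {pair: n for pair, n in pairs.items() if n >= min_count}

# e.g., "smae" retyped as "same" by two different searchers
print(mine_self_corrections([["smae", "same"], ["smae", "same"]]))
```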
The goal of this analysis is to get a sense of how broadly Method 0 applies to queries on English Wikipedia, and to establish a baseline from the current DYM process to compare it to. I also want to get a rough idea of the quality of the suggestions, and some information on the queries themselves. Finally, I'll have some suggestions for improving M0 processing.
Data
David ran M0 training (i.e., looking for and evaluating human self-corrections) on two months' worth of query data and then took a third month's worth of data (about 170K queries) for evaluation. He pulled the original query and the current DYM suggestion, if any, made at the time, and then computed the M0 DYM suggestion, if any.
Note #0: The data we are using is not a perfect reflection of the user experience. There are some "last mile" tricks used to improve suggestions. For example, if a query is a title match for an article, we don't make a suggestion. The most common current suggestion is a common word in place of a YouTube personality's name, but since there is an article with that exact name, the suggestion doesn't actually get shown to anyone. These deviations are minor, and shouldn't affect the overall impression we get from our corpus.
Note #1: We only keep 90 days' worth of logs, so we don't have a ton of data to train or test on. (Commercial systems are reported to use years' worth of log data, for example.) So, M0 performance (particularly coverage) is probably somewhat worse than it would be if we used all the data available for training (which we would do if we were showing the results to users).
Note #2: M0 normalizes queries before making suggestions. Simple normalizations include lowercasing and using standard whitespace, so that "same thing", "Same Thing", "SaMe ThiNG ", and " sAmE tHiNg " are all treated as the same thing. [badum tiss]
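A minimal sketch of that kind of normalization, assuming only the lowercasing and whitespace steps mentioned above (M0 may well do more):

```python
import re

def normalize_query(query):
    """Lowercase and collapse whitespace, so trivially different queries
    are treated as the same string. Illustrative only."""
    return re.sub(r"\s+", " ", query.strip()).lower()

for q in ["same thing", "Same Thing", "SaMe ThiNG ", " sAmE tHiNg "]:
    assert normalize_query(q) == "same thing"
```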
Stats and Analysis
There are 170,181 queries in our test corpus. 86,940 (51.087%) of them are identical after M0 normalization.
- Current DYM: 70,202 queries (41.251%) got a DYM suggestion from the current production system. This is higher than some of us thought it would be. 67,948 queries (39.927%) got suggestions from the current DYM system but not from M0.
- M0 DYM: 2,622 queries (1.541%) got a DYM suggestion from M0. This is a bit lower than we expected, though this version was only built on 2/3 of the available data. 368 queries (0.216%) got suggestions from M0 but not from the current DYM system.
- Both: 2,254 queries (1.324%) got suggestions from both systems. 1,583 of the suggestions (0.930%) were the same for both the current DYM system and M0. There weren't any additional suggestions that differed only by case.
Multiple Suggestions
The current DYM system only makes one suggestion at a time, and we're only looking at one suggestion from M0 for any given query. (Longer term we may consider multiple suggestions for a given query. It is difficult, but possible, to get Google to generate multiple suggestions; I usually do it by putting together two ambiguously misspelled words that don't have anything to do with each other. For example, glake bruck, at the time of this writing, gets four suggestions from Google: glade brook, glaze brook, glass brick, and glow brick.)
However, because the current DYM suggestions come from different shards of the index (or from the same shard accessed weeks apart), the underlying word statistics can differ, and the exact DYM suggestion shown to users can vary.
In our corpus, the current production DYM system gave differing suggestions for 82 distinct queries, which is very, very few. Most of those queries got two different suggestions, but four of them each got three!
None of the queries got differing suggestions from M0, but over months they could, as the stats underlying M0 change.
Head-to-Head Evaluation
I pulled out a sample of 100 queries where the current DYM and M0 both had suggestions and those suggestions differed, and I reviewed them manually.
The final categorization counts are as follows:
- 68: M0 DYM gave a better suggestion.
- 5½: the current DYM gave a better suggestion. (In one case, the current DYM gave two different suggestions at different times. One of them was good and one wasn't, so I counted it as ½.)
- 7: Both suggestions were reasonable.
- 7: Neither suggestion was very good.
- 12: User intent was too unclear.
While the sample size is not large enough to make fine-grained estimates (the 95% confidence interval for 68% (68/100) is ~58-76%), it's clear that M0 is better than the current DYM when they both make suggestions.
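For reference, those bounds are roughly what a standard Wilson score interval gives for 68 out of 100 (I'm assuming that's more or less how the interval was computed); a quick check:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

print(wilson_interval(68, 100))  # ~ (0.583, 0.763), i.e., roughly 58-76%
```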
Not-Done Evaluations
I didn't look deeply into the quality of the suggestions for these three other obvious classes of queries:
- When the current DYM and M0 agree: I would guess these are more likely to be good suggestions, since the methods are rather different.
- Current DYM only (~40%): these could be better (more easy ones) or worse (random DYM guesses).
- M0 DYM only (~0.2%): based on the method internals, I'd expect these to be pretty good.
I could evaluate samples for any of these, if desired.
Queries by Script Type
I also broke down the queries by script type (after ignoring numbers and punctuation). This list isn't exhaustive; it only includes the easily categorizable and excludes queries with certain invisible characters.
Almost 97% of queries are in the Latin script, so obviously that's a reasonable place to put our focus, though it is surprising that almost 1% of queries are in Arabic script and a bit more than 0.5% are in Cyrillic.
The "Other/Mixed" category is generally made up of queries in multiple scripts/languages, or those that have emoji or other rarer characters.
The "IPA-ish" category attracts writing systems that use characters that are also used in the International Phonetic Alphabet (IPA); in this case, 4 of the 7 are Azerbaijani (which uses ə and ı), 2 look like genuine IPA, and the last one is unclear.
Category | Queries | Current DYM | M0 DYM |
Latin | 164615 | 69877 | 2610 |
Arabic | 1475 | 69 | 7 |
Cyrillic | 926 | 17 | 0 |
Ideographic | 690 | 1 | 2 |
Bengali | 346 | 3 | 0 |
Devanagari | 236 | 3 | 0 |
Hangul | 166 | 0 | 0 |
Greek | 111 | 2 | 0 |
Hebrew | 107 | 0 | 0 |
Thai | 76 | 0 | 0 |
Katakana | 53 | 1 | 0 |
Georgian | 38 | 0 | 0 |
Myanmar | 34 | 0 | 0 |
Tamil | 27 | 0 | 0 |
Khmer | 21 | 0 | 0 |
Malayalam | 19 | 0 | 0 |
Kannada | 11 | 0 | 0 |
Gujarati | 10 | 0 | 0 |
Hiragana | 10 | 0 | 0 |
Ethiopic | 8 | 0 | 0 |
IPA-ish | 7 | 3 | 0 |
Sinhala | 6 | 0 | 0 |
Armenian | 5 | 0 | 0 |
Telugu | 4 | 0 | 0 |
Mongolian | 1 | 0 | 0 |
Tibetan | 1 | 0 | 0 |
Other/Mixed | 504 | 82 | 0 |
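(The script bucketing above was done with our own tooling; purely for illustration, here is a rough sketch of one way to bucket queries by script using just the Python standard library. It keys off Unicode character names rather than proper script properties, so it is only approximate, and it is not the code used to build the table.)

```python
import unicodedata
from collections import Counter

def script_bucket(query):
    """Very rough script bucketing based on Unicode character names.
    Digits, punctuation, and spaces are ignored; anything with letters
    from more than one script lands in 'Other/Mixed'."""
    scripts = set()
    for ch in query:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "UNKNOWN CHARACTER")
        scripts.add(name.split()[0])
    if not scripts:
        return "None"
    if len(scripts) > 1:
        return "Other/Mixed"
    return scripts.pop()

queries = ["the beatles", "спартак", "mig-21", "Спam"]
print(Counter(script_bucket(q) for q in queries))
# e.g., Counter({'LATIN': 2, 'CYRILLIC': 1, 'Other/Mixed': 1})
```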
Invisibles
Looking at language analyzers, I often run into invisible characters that block proper searching. The most common are zero-width non-joiners, zero-width joiners, non-breaking spaces, soft hyphens, and bi-directional marks.
These marks all occur in our query sample, in queries in the Arabic, Bengali, Hebrew, Latin, Myanmar, and Sinhala scripts. A handful of the Latin examples get suggestions from the current DYM suggester. It makes sense to me to strip these characters out (or, in the case of non-breaking spaces, substitute regular spaces) during the M0 normalization process.
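If we do add that to normalization, a minimal sketch might look like the following; the character inventory here is just my reading of the list above, not a definitive set.

```python
# Zero-width (non-)joiners, soft hyphens, and bidi marks are removed outright;
# non-breaking spaces become regular spaces. (Illustrative list, not exhaustive.)
INVISIBLES = {
    "\u200c": "",   # zero-width non-joiner
    "\u200d": "",   # zero-width joiner
    "\u00ad": "",   # soft hyphen
    "\u200e": "",   # left-to-right mark
    "\u200f": "",   # right-to-left mark
    "\u00a0": " ",  # non-breaking space -> regular space
}
INVISIBLES_TABLE = str.maketrans(INVISIBLES)

def strip_invisibles(query):
    return query.translate(INVISIBLES_TABLE)

assert strip_invisibles("same\u200cthing\u00a0here") == "samething here"
```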
Numbers
There are a number of queries in our corpus that are primarily numbers. M0 makes very few suggestions for these (only 3), and the current DYM system only made 98 such suggestions, though there are clearly some phone numbers and ISBNs (and, less importantly, dates) in the list. Queries without suggestions include IP addresses and other ID-looking numbers.
Also, there is a query that is all Bengali numbers, though it doesn't get any suggestions.
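A rough sketch of the kind of numbers-and-punctuation filter suggested in the summary below; using Unicode categories means Bengali digits and the like count as digits too. The exact rules and examples are assumptions for illustration.

```python
import unicodedata

def is_numeric_query(query):
    """True if a query is nothing but digits, whitespace, and punctuation.
    Unicode categories are used, so non-ASCII digits count as well."""
    stripped = query.strip()
    if not stripped:
        return False
    return all(
        ch.isspace() or unicodedata.category(ch).startswith(("N", "P"))
        for ch in stripped
    )

assert is_numeric_query("978-0-14-044913-6")   # ISBN-like string
assert is_numeric_query("১২৩৪৫")                # Bengali digits
assert not is_numeric_query("mig-21")
```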
Additional Observations
- While the current DYM system occasionally makes suggestions with a different number of tokens than the query, that's clearly a place where M0 does much better. For example:
- bang;adesh: M0 gives bangladesh rather than band adele
- mig21: mig-21 versus might
- M0 is better at word-initial errors:
- toy tory 4: toy story 4 versus toy tony 4
- the ebatles: the beatles versus the ebatela
- M0 is better with names (especially those that don't occur in the wiki, though making a perfect suggestion and then finding nothing may not make users particularly happy). Some examples:
- marcus johanson: marcus johansson versus marcus johnson
- sanza stark: sansa stark versus santa stars
- M0 is better at transpositions and misplaced letters:
- shamaz: shazam versus shaman
- arbath: abarth versus armagh
- Numbers are hard. Both systems will change years in queries, though the current DYM system is much more prone to it. Search for "spiderman 2019" and get a suggestion for "spider man 2017". I think some of this for M0 comes from people clicking on current DYM suggestions. Overall, that's probably a good thing, since it reinforces good suggestions made by the current DYM suggester.
Summary and Suggestions
M0 is clearly much higher precision than the current DYM suggester, though with limited input data it doesn't have as much coverage as one would like.
Clearly it's ready for an A/B test of some sort (if we feel that is still necessary; I'm happy with the overwhelming improvement it gives).
However, we could still improve its performance in a few easy and not-so-easy ways:
- We should filter more aggressively on queries that are all numbers, spaces, and punctuation.
- We should remove the most common invisible characters (except for non-breaking spaces, which should just be regular spaces).
- We might want to modify M0's edit distance algorithm to assign a lower cost to swaps, which would give a higher weight to such potential self-corrections. Adjacent swaps are easy to handle ("be" for "eb" in "ebatles"), while longer-distance swaps ("shamaz" vs "shazam") are harder to detect. These kinds of errors make sense linguistically, too; when people do it in speech it's called metathesis. (A rough sketch of the adjacent-swap case follows this list.)
- Long term, it might be helpful to track self-corrections that come from people clicking on the current DYM suggestions (or, when M0 or others are deployed, any suggestions). At least for current DYM suggestions, these are often lower quality, and I think many people are clicking on them without carefully assessing them.
- We should work on figuring out how/whether we can increase the time-depth of our training data. Two thoughts come to mind: holding on to the self-correction data longer than 3 months, and holding on to particularly useful self-correction data (i.e., data that was used to make a suggestion that got clicked on one or more times) for longer than 3 months. The concerns in both cases may be the same, though, so the distinction may not matter.
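Here is the rough sketch of the cheaper-swap idea mentioned above: an optimal string alignment (restricted Damerau-Levenshtein) distance with a configurable transposition cost. The cost values are placeholders, not anything M0 actually uses.

```python
def edit_distance(a, b, swap_cost=0.5):
    """Optimal string alignment distance with a reduced cost for adjacent
    transpositions. Long-distance transpositions ("shamaz" -> "shazam")
    still count as two substitutions; catching those cheaply would need
    a different approach."""
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = float(i)
    for j in range(len(b) + 1):
        d[0][j] = float(j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap_cost)
    return d[len(a)][len(b)]

print(edit_distance("ebatles", "beatles"))  # 0.5: one adjacent swap
print(edit_distance("shamaz", "shazam"))    # 2.0: distant swap, two substitutions
```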
Other thoughts:
- There may be an error in our logging or in the extraction script: the query template:discpute gets a current DYM suggestion of template:dispute, but only dispute is recorded in this corpus. I don't think this is very important for this analysis, because M0 doesn't make any suggestions for queries with colons in them, but something weird is happening.