User:TJones (WMF)/Notes/Balanced Language Identification Evaluation Set for Queries

February 2016 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T121539)

Balanced Language Identification Evaluation Set for Queries

Building the Corpus

The goal of this task was to create a balanced language identification evaluation set for queries from the top 21 wikis by query volume. It would have been the top 20, but I accidentally grabbed the top 20 after English, so we get 21. The purpose of a hand-selected, balanced query set is to test the accuracy of language identification when all languages compete equally (by volume) and all queries are decent exemplars of the language in question.

The 21 languages are: Arabic, Chinese, Czech, Dutch, English, French, German, Hebrew, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, Ukrainian, and Vietnamese.

I extracted a few days’ worth of full-text queries from all wikis (19,273,806 queries total). For each of the 21 languages, I randomly selected several hundred queries and whittled them down to 200, removing queries that were composed primarily of names of people, places, or products; contained text in the wrong language; were badly misspelled; consisted of numbers or acronyms; appeared bot-like (i.e., a very large number of very similar queries); or had similar problems. Names made up of normal words were kept: “The Revenant” and “Bridge of Spies” are names of movies, but they are made up of non-name words. Longer queries were allowed a small bit of text from any of the unacceptable categories.

I did not filter out queries that would obviously be hard for language identification, such as very short, unaccented queries in the Latin script, like Portuguese os (“the”), Swedish ur (“from”), English the, and French rue (“street”). The longest queries are hundreds of characters.
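As a rough illustration of the sampling step only (this is not the script that was actually used, and the manual review that followed cannot be automated this way), the random draw of candidate queries might look like the sketch below. The function name, the pool size of 500, the fixed seed, and the removal of exact duplicates are all assumptions made for the example.

```python
import random


def sample_candidates(queries_by_language, per_language=500, seed=0):
    """Draw a random pool of candidate queries for each language.

    queries_by_language maps a language name to the list of queries
    logged on that language's wiki. The pool is larger than the 200
    queries finally kept, because manual review discards many of them.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    pools = {}
    for language, queries in queries_by_language.items():
        # Assumption: exact duplicates are dropped before sampling.
        unique = sorted(set(queries))
        k = min(per_language, len(unique))
        pools[language] = rng.sample(unique, k)
    return pools
```

Each pool would then be reviewed by hand and cut down to 200 acceptable queries per language.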

TextCat Evaluation

I tested TextCat against the balanced corpus of 200 queries in each of 21 languages (4,200 queries total) in two ways:

against the known list of 21 languages
against the full list of 59 languages for which language models have been built on query data.

Note that some of the full set of 59 models are known to be pretty poor (Igbo has way too much English in the training data, for example) and part of the purpose of this set is to let us better evaluate these models.

In each case, I tested language models in increments of 500 ngrams up to 10,000 ngrams. Previous work on a sample derived from enwiki queries showed an optimal model size of 3,000 ngrams (on messy data that was also heavily unbalanced, i.e., mostly English). In this case, surprisingly, the best results came from the maximum 10,000-ngram models! However, the improvement over the 3,000-ngram models probably isn’t enough to warrant the extra cost in speed and memory of using the 10K models: it is less than 4 percentage points overall (86.5% vs. 88.3% with 21 languages, 74.7% vs. 78.5% with 59).
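To make “model size” concrete, here is a simplified sketch of rank-order ngram classification of the kind TextCat is based on. It is not the actual TextCat code: the function names, the underscore padding, the maximum ngram length of 5, and the toy training text are all assumptions for illustration. A model is a list of ngrams ordered by frequency, and a smaller model is just that list truncated to its first N entries.

```python
from collections import Counter


def ngram_profile(text, max_n=5, size=3000):
    """Return the `size` most frequent character ngrams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Descending count, then alphabetical so that ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:size]


def out_of_place(query_profile, model_profile):
    """Sum of rank differences; ngrams absent from the model get the maximum penalty."""
    model_rank = {g: r for r, g in enumerate(model_profile)}
    penalty = len(model_profile)
    return sum(
        abs(r - model_rank[g]) if g in model_rank else penalty
        for r, g in enumerate(query_profile)
    )


def identify(query, models, candidates=None, size=3000):
    """Pick the candidate language whose truncated model is closest to the query."""
    if candidates is None:
        candidates = list(models)
    q = ngram_profile(query, size=size)
    return min(candidates, key=lambda lang: out_of_place(q, models[lang][:size]))


# Toy models built from a single sentence each; real models are trained on query logs.
models = {
    "English": ngram_profile("the cat and the dog and the bird"),
    "Spanish": ngram_profile("el gato y el perro y el pájaro"),
}
print(identify("the dog", models))
print(identify("the dog", models, candidates=["Spanish"]))
```

The `candidates` argument corresponds to the two test conditions above: restricting the choice to the 21 known languages, or allowing all 59 available models to compete.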

Looking at Model Sizes

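F_1 scores against the known 21 languages (the per-language values at 3,000 ngrams match the f1 column of the detailed report below; in the TOTAL row, F_0.5, F_1, and F_2 are identical):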

     ngrams	1000	2000	3000	4000	5000	6000	7000	8000	9000	10000
      TOTAL	84.0%	85.6%	86.5%	87.1%	87.2%	87.5%	87.8%	87.9%	88.2%	88.3%
     Arabic	92.6%	92.1%	92.5%	92.7%	93.0%	93.9%	93.9%	93.9%	93.7%	93.7%
    Chinese	81.5%	85.5%	86.9%	87.7%	87.1%	87.9%	89.0%	89.0%	89.0%	89.4%
      Czech	89.9%	91.1%	92.9%	91.9%	91.8%	92.6%	93.2%	93.0%	93.8%	93.8%
      Dutch	72.8%	75.6%	78.0%	78.3%	78.2%	79.4%	79.6%	80.2%	80.9%	81.1%
    English	77.6%	83.7%	86.6%	87.3%	86.2%	84.9%	85.4%	85.3%	86.2%	86.8%
     French	85.7%	88.2%	89.3%	88.9%	89.0%	90.1%	88.7%	88.9%	89.6%	88.8%
     German	75.7%	77.6%	80.5%	79.3%	79.5%	80.2%	80.8%	81.7%	82.6%	82.8%
     Hebrew	99.3%	100.0%	100.0%	99.8%	99.8%	99.8%	100.0%	100.0%	100.0%	99.8%
 Indonesian	80.5%	83.1%	83.4%	83.9%	85.4%	86.1%	86.3%	86.7%	86.7%	86.1%
    Italian	74.8%	74.6%	73.3%	74.6%	76.3%	77.3%	78.2%	78.2%	78.6%	78.8%
   Japanese	79.9%	83.9%	85.4%	87.2%	86.2%	86.7%	87.9%	87.9%	88.5%	88.8%
     Korean	99.2%	99.5%	99.7%	99.7%	99.5%	99.7%	99.7%	99.7%	99.7%	99.7%
    Persian	91.9%	91.7%	92.0%	92.0%	92.5%	93.6%	93.5%	93.1%	93.0%	93.0%
     Polish	90.0%	91.7%	93.3%	93.6%	93.3%	94.3%	94.1%	94.8%	95.3%	96.0%
 Portuguese	73.4%	73.1%	74.6%	76.1%	77.1%	75.8%	78.4%	78.4%	79.7%	79.5%
    Russian	85.5%	84.8%	85.0%	85.0%	84.4%	84.4%	83.9%	83.9%	83.9%	84.5%
    Spanish	72.5%	74.2%	73.4%	78.1%	78.7%	77.4%	78.1%	77.9%	78.0%	78.4%
    Swedish	68.9%	72.3%	76.6%	77.5%	78.0%	78.0%	78.9%	79.1%	80.1%	80.4%
    Turkish	89.6%	92.4%	92.1%	93.4%	93.2%	93.6%	93.6%	93.9%	93.6%	93.1%
  Ukrainian	82.9%	81.0%	82.1%	82.4%	81.6%	81.3%	80.8%	80.8%	81.2%	81.7%
 Vietnamese	97.5%	98.5%	99.0%	99.3%	99.0%	98.8%	98.8%	98.3%	98.0%	97.8%

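F_1 scores against all 59 available languages (the per-language values at 3,000 ngrams match the f1 column of the detailed report below; in the TOTAL row, F_0.5, F_1, and F_2 are identical):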

     ngrams	1000	2000	3000	4000	5000	6000	7000	8000	9000	10000
      TOTAL	69.8%	73.3%	74.7%	76.0%	76.6%	76.9%	77.4%	77.5%	78.1%	78.5%
     Arabic	92.2%	91.5%	91.9%	92.6%	93.6%	93.8%	94.1%	93.9%	93.6%	93.9%
    Chinese	52.7%	55.8%	58.0%	61.1%	60.2%	60.6%	62.2%	64.3%	65.9%	67.2%
      Czech	83.9%	87.5%	88.6%	87.6%	88.3%	87.7%	87.4%	87.6%	87.9%	87.7%
      Dutch	67.3%	71.1%	76.1%	75.7%	76.6%	77.6%	78.5%	79.1%	80.0%	79.6%
    English	66.7%	72.3%	73.2%	74.5%	74.2%	74.2%	72.8%	72.2%	72.8%	74.2%
     French	84.6%	87.6%	86.7%	87.2%	87.7%	87.3%	87.3%	87.7%	88.0%	87.9%
     German	74.5%	76.3%	80.0%	79.7%	81.5%	81.6%	80.9%	82.2%	83.6%	83.9%
     Hebrew	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%	100.0%
 Indonesian	42.1%	50.7%	54.1%	58.7%	62.6%	66.0%	68.7%	68.7%	69.7%	71.2%
    Italian	71.9%	74.6%	74.4%	75.1%	76.9%	78.4%	79.5%	78.8%	78.5%	78.5%
   Japanese	80.4%	82.9%	85.2%	86.2%	85.9%	86.4%	87.6%	87.6%	88.2%	88.2%
     Korean	99.5%	99.5%	99.7%	99.7%	99.7%	99.7%	99.7%	99.7%	99.7%	99.7%
    Persian	85.3%	86.6%	87.0%	87.6%	87.6%	88.4%	88.0%	87.4%	86.8%	86.8%
     Polish	92.1%	93.1%	93.8%	93.9%	93.8%	94.1%	94.1%	95.4%	95.6%	96.4%
 Portuguese	70.2%	70.4%	72.2%	74.0%	74.5%	74.7%	76.6%	77.1%	78.2%	78.2%
    Russian	72.8%	77.0%	76.8%	79.6%	77.5%	77.2%	77.8%	78.4%	78.7%	79.1%
    Spanish	66.7%	70.4%	72.0%	74.6%	75.5%	75.8%	76.5%	75.7%	78.4%	78.7%
    Swedish	55.2%	59.2%	62.0%	62.8%	65.3%	65.9%	65.2%	66.3%	66.9%	68.5%
    Turkish	85.8%	88.8%	89.8%	90.7%	91.5%	91.0%	91.3%	91.2%	90.2%	90.5%
  Ukrainian	78.9%	79.9%	80.2%	80.8%	79.7%	78.5%	78.5%	77.8%	77.9%	77.9%
 Vietnamese	97.5%	99.0%	99.0%	99.2%	98.7%	98.5%	98.5%	98.5%	98.5%	98.5%

Obviously, performance is noticeably worse when additional “spoiler” languages are available to be selected: with 3,000-ngram models, the overall score drops from 86.5% to 74.7%.
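The size-versus-accuracy trade-off can be checked directly from the TOTAL rows of the two tables. The sketch below copies those rows and finds the smallest model size that comes within a chosen tolerance of the best score. The tolerance is a hypothetical knob introduced for this example, not a threshold used in the evaluation.

```python
SIZES = list(range(1000, 10001, 1000))

# TOTAL rows copied from the two tables above (percent).
TOTAL_21 = [84.0, 85.6, 86.5, 87.1, 87.2, 87.5, 87.8, 87.9, 88.2, 88.3]
TOTAL_59 = [69.8, 73.3, 74.7, 76.0, 76.6, 76.9, 77.4, 77.5, 78.1, 78.5]


def smallest_size_within(scores, tolerance, sizes=SIZES):
    """Smallest model size whose score is within `tolerance` points of the best."""
    best = max(scores)
    for size, score in zip(sizes, scores):
        if best - score <= tolerance:
            return size
    raise ValueError("scores must not be empty")


def gain(scores, small=3000, large=10000, sizes=SIZES):
    """Percentage-point gain from moving from the `small` to the `large` model."""
    return round(scores[sizes.index(large)] - scores[sizes.index(small)], 1)


print(gain(TOTAL_21), gain(TOTAL_59))
print(smallest_size_within(TOTAL_21, 2.0), smallest_size_within(TOTAL_59, 4.0))
```

With a tolerance of 2 points for the 21-language case, or 4 points for the 59-language case, the 3,000-ngram models already qualify.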

Looking at Languages with a 3,000-ngram model

Since we are using 3,000-ngram models for our current A/B tests, we’ll evaluate those models by language.

21 Known Languages

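Here is the detailed accuracy report by language when using the set of 21 known languages, with 3,000-ngram models. For each language, “hits” are its queries that were correctly identified, so recall is hits/total; “misses” are queries from other languages that were incorrectly assigned to it (false positives), so precision is hits/(hits + misses):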

                f0.5    f1      f2      recall  prec   total   hits    misses
      TOTAL     86.5%   86.5%   86.5%   86.5%   86.5%  4200    3635    565
     Arabic     90.8%   92.5%   94.3%   95.5%   89.7%  200     191     22
    Chinese     83.8%   86.9%   90.2%   92.5%   81.9%  200     185     41
      Czech     91.9%   92.9%   93.8%   94.5%   91.3%  200     189     18
      Dutch     81.6%   78.0%   74.6%   72.5%   84.3%  200     145     27
    English     90.4%   86.6%   83.2%   81.0%   93.1%  200     162     12
     French     86.9%   89.3%   91.8%   93.5%   85.4%  200     187     32
     German     81.1%   80.5%   79.9%   79.5%   81.5%  200     159     36
     Hebrew    100.0%  100.0%  100.0%  100.0%  100.0%  200     200     0
 Indonesian     80.9%   83.4%   86.1%   88.0%   79.3%  200     176     46
    Italian     72.4%   73.3%   74.3%   75.0%   71.8%  200     150     59
   Japanese     91.0%   85.4%   80.5%   77.5%   95.1%  200     155     8
     Korean     99.9%   99.7%   99.6%   99.5%  100.0%  200     199     0
    Persian     93.6%   92.0%   90.5%   89.5%   94.7%  200     179     10
     Polish     92.6%   93.3%   94.0%   94.5%   92.2%  200     189     16
 Portuguese     75.8%   74.6%   73.3%   72.5%   76.7%  200     145     44
    Russian     81.3%   85.0%   89.1%   92.0%   79.0%  200     184     49
    Spanish     70.6%   73.4%   76.4%   78.5%   68.9%  200     157     71
    Swedish     79.0%   76.6%   74.4%   73.0%   80.7%  200     146     35
    Turkish     91.3%   92.1%   92.9%   93.5%   90.8%  200     187     19
  Ukrainian     86.6%   82.1%   78.0%   75.5%   89.9%  200     151     17
 Vietnamese     98.7%   99.0%   99.3%   99.5%   98.5%  200     199     3
                f0.5    f1      f2      recall  prec   total   hits    misses
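The columns of the report can be recomputed from the three counts in each row. The sketch below assumes the standard F-beta definition and treats the “misses” column as false positives, which is consistent with the figures in the table. The function names are made up for this example.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall (beta < 1 favors precision)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def report_row(total, hits, false_positives):
    """Recompute one row of the report, as percentages rounded to one decimal."""
    recall = hits / total if total else 0.0
    assigned = hits + false_positives  # everything labeled as this language
    precision = hits / assigned if assigned else 0.0
    values = {
        "f0.5": f_beta(precision, recall, 0.5),
        "f1": f_beta(precision, recall, 1.0),
        "f2": f_beta(precision, recall, 2.0),
        "recall": recall,
        "prec": precision,
    }
    return {name: round(100 * value, 1) for name, value in values.items()}


# Arabic row from the table above: total 200, hits 191, misses 22.
print(report_row(200, 191, 22))
```

Because every query receives exactly one label, the total number of false positives equals the total number of missed queries, so precision and recall coincide in the TOTAL row, and so do all three F scores.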

The poorest performers in recall are Dutch (72.5%), Portuguese (72.5%), Swedish (73.0%), Italian (75.0%), Ukrainian (75.5%), Japanese (77.5%), and Spanish (78.5%).

The poorest performers in precision are Spanish (68.9%), Italian (71.8%), Portuguese (76.7%), and Russian (79.0%).

Below are the most common identification errors for each language (all cases ≥10, plus highest for each language), grouped by similarity (language and/or script family) when there is considerable confusion within the group.

Most common identification errors (actual language, followed by the languages it was misidentified as, with counts):

Arabic     Persian (9)
Persian    Arabic (21)

Chinese    Japanese (8)
Japanese   Chinese (41)

Dutch      German (17)
German     Dutch (16)

French     Italian (4)
Italian    Spanish (12)    Indonesian (11) Portuguese (11)
Portuguese Spanish (37)    Italian (11)
Spanish    Portuguese (21) Italian (11)

Russian    Ukrainian (16)
Ukrainian  Russian (49)

Czech      Polish (4)
English    Dutch (5)       French (5)      German (5)      Spanish (5)
Indonesian Italian (6)
Korean     Turkish (1)
Polish     Indonesian (3)
Swedish    Indonesian (15)
Turkish    Indonesian (3)  Swedish (3)
Vietnamese Italian (1)
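A list like the one above can be produced from (actual, predicted) pairs using the stated rule: keep every confusion with a count of at least 10, plus the most frequent confusion for each language (including ties). This is a minimal sketch with hypothetical names. The sample pairs reuse a few counts from the list, plus one invented low-count pair (Italian/French) to show the filtering.

```python
from collections import Counter, defaultdict


def common_errors(pairs, min_count=10):
    """Map each actual language to its most common wrong identifications."""
    errors = defaultdict(Counter)
    for actual, predicted in pairs:
        if actual != predicted:
            errors[actual][predicted] += 1
    report = {}
    for actual, counter in errors.items():
        ranked = counter.most_common()
        top_count = ranked[0][1]
        # Keep everything at or above the threshold, plus the top entry (and its ties).
        report[actual] = [
            (lang, n) for lang, n in ranked if n >= min_count or n == top_count
        ]
    return report


sample_pairs = (
    [("Ukrainian", "Ukrainian")] * 151
    + [("Ukrainian", "Russian")] * 49
    + [("Italian", "Spanish")] * 12
    + [("Italian", "Indonesian")] * 11
    + [("Italian", "French")] * 2   # invented pair, below the threshold
    + [("Korean", "Turkish")] * 1
)
for language, confusions in common_errors(sample_pairs).items():
    print(language, confusions)
```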

So, confusion among Arabic/Persian, Chinese/Japanese, Dutch/German, French/Italian/Portuguese/Spanish, and Russian/Ukrainian is not too surprising.

Indonesian seems to be the most obvious outlier here, incorrectly claiming a fair number of Italian and Swedish queries.

59 Available Language Models

Keep in mind that some of these are known to be a bit dodgy.

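Here is the detailed accuracy report by language when using the full set of 59 languages, with 3,000-ngram models. Languages outside the balanced set have no queries of their own (total 0), so every query assigned to them counts as a miss: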

                   f0.5    f1      f2      recall  prec   total   hits    misses
         TOTAL     74.7%   74.7%   74.7%   74.7%   74.7%  4200    3138    1062
        Arabic     91.0%   91.9%   92.9%   93.5%   90.3%  200     187     20
       Chinese     67.5%   58.0%   50.9%   47.0%   75.8%  200     94      30
         Czech     94.8%   88.6%   83.2%   80.0%   99.4%  200     160     1
         Dutch     82.9%   76.1%   70.4%   67.0%   88.2%  200     134     18
       English     85.0%   73.2%   64.3%   59.5%   95.2%  200     119     6
        French     88.0%   86.7%   85.4%   84.5%   88.9%  200     169     21
        German     84.1%   80.0%   76.3%   74.0%   87.1%  200     148     22
        Hebrew    100.0%  100.0%  100.0%  100.0%  100.0%  200     200     0
    Indonesian     66.1%   54.1%   45.8%   41.5%   77.6%  200     83      24
       Italian     80.5%   74.4%   69.1%   66.0%   85.2%  200     132     23
      Japanese     91.8%   85.2%   79.4%   76.0%   96.8%  200     152     5
        Korean     99.9%   99.7%   99.6%   99.5%  100.0%  200     199     0
       Persian     91.7%   87.0%   82.6%   80.0%   95.2%  200     160     8
        Polish     95.6%   93.8%   92.1%   91.0%   96.8%  200     182     6
    Portuguese     76.9%   72.2%   68.0%   65.5%   80.4%  200     131     32
       Russian     80.7%   76.8%   73.2%   71.0%   83.5%  200     142     28
       Spanish     74.6%   72.0%   69.5%   68.0%   76.4%  200     136     42
       Swedish     71.7%   62.0%   54.5%   50.5%   80.2%  200     101     25
       Turkish     92.5%   89.8%   87.2%   85.5%   94.5%  200     171     10
     Ukrainian     87.9%   80.2%   73.8%   70.0%   94.0%  200     140     9
    Vietnamese     99.0%   99.0%   99.0%   99.0%   99.0%  200     198     2
      Albanian      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       12
   Azerbaijani      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       11
        Basque      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       25
       Bengali      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       16
     Bulgarian      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       50
     Cantonese      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       107
       Catalan      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       39
      Croatian      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       7
        Danish      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       26
      Estonian      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       7
       Finnish      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       25
     Hungarian      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       7
          Igbo      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       37
        Kazakh      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       8
         Latin      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       65
       Latvian      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       10
    Lithuanian      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       8
    Macedonian      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       21
         Malay      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       85
     Malayalam      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       1
     Mongolian      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       2
     Norwegian      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       28
      Romanian      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       29
       Serbian      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       2
Serbo-Croatian      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       15
        Slovak      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       27
     Slovenian      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       14
       Tagalog      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       19
         Tamil      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       2
          Urdu      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       25
                   f0.5    f1      f2      recall  prec   total   hits    misses

The poorest performers in recall are Indonesian (41.5%), Chinese (47.0%), Swedish (50.5%), English (59.5%), Portuguese (65.5%), Italian (66.0%), Dutch (67.0%), and Spanish (68.0%).

The poorest performers in precision are Chinese (75.8%), Spanish (76.4%), and Indonesian (77.6%).

The poorest performers in terms of false positives among the languages not in the balanced query set are Cantonese (107), Malay (85), Latin (65), Bulgarian (50), Catalan (39), and Igbo (37).
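Since the balanced set contains no queries in the “spoiler” languages, every query assigned to one of them is a false positive. Tallying them is a simple count over the predicted labels, sketched here with hypothetical names and a tiny invented sample.

```python
from collections import Counter


def spoiler_false_positives(pairs, evaluated_languages):
    """Count predictions that fall outside the set of evaluated languages."""
    counts = Counter(
        predicted
        for _actual, predicted in pairs
        if predicted not in evaluated_languages
    )
    return counts.most_common()


# Invented sample: three Chinese queries and two Indonesian queries.
sample = [
    ("Chinese", "Cantonese"),
    ("Chinese", "Cantonese"),
    ("Chinese", "Chinese"),
    ("Indonesian", "Malay"),
    ("Indonesian", "Indonesian"),
]
print(spoiler_false_positives(sample, {"Chinese", "Indonesian"}))
```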

Below are the most common identification errors for each language (all cases ≥10, plus highest for each language), grouped by similarity (language and/or script family) when there is considerable confusion within the group.

Most common identification errors (actual language, followed by the languages it was misidentified as, with counts):

Arabic      Persian (8)
Persian     Arabic (20)     Urdu (20)

Chinese     Cantonese (94)
Japanese    Chinese (30)    Cantonese (13)

Dutch       German (12)
German      Dutch (11)

French      Catalan (7)
Italian     Latin (10)
Portuguese  Spanish (24)    Latin (17)
Spanish     Portuguese (18) Catalan (13)

Russian     Bulgarian (28)  Macedonian (15)
Ukrainian   Russian (28)    Bulgarian (22)

Czech       Slovak (20)

Indonesian  Malay (75)

Swedish     Norwegian (17)  Danish (11)

English     Igbo (32)

Korean      Azerbaijani (1)
Polish      Latin (3)       Serbo-Croatian (3)
Turkish     Azerbaijani (7)
Vietnamese  Italian (1)     Latin (1)

As before, confusion among Arabic/Persian/Urdu, Chinese/Japanese/Cantonese, Dutch/German, French/Italian/Portuguese/Spanish/Catalan/Latin, and Russian/Ukrainian/Bulgarian/Macedonian is not too surprising. Nor is confusion among Czech/Slovak, Indonesian/Malay, or Swedish/Norwegian/Danish.

English/Igbo would be a surprise, but we already know there’s a lot of English in the Igbo training data.

Conclusions

For the 21 languages, we should be able to release these query-based models and include them with the PHP version of TextCat used for our A/B tests.

Indonesian needs the most work, since it is performing poorly in unexpected ways (i.e., with Swedish and Italian).

The other language/script families that perform poorly may also benefit from additional work to improve the quality of their training data.

For the full list of 59 languages, Igbo sticks out as the worst performing. As expected, language/script families are generally more easily confused.

Next Steps

To Do:

Release the rest of the 21 languages in the balanced query set, because they seem to be working reasonably well on reasonably clean and balanced data. T121539

To Consider:

Try to improve the training data for Indonesian, and re-assess against this test set. T121547
Try to improve the training data for the various language/script families, and re-assess against this test set. (Also T121547.)
Release improved models.

Add to the balanced test set additional languages, based on query volume, the uniqueness of the language-script mapping (e.g., Thai, Armenian), by language family, or some other criteria of desirability. Assess performance on this set.
Determine which models need improvement, and release the acceptable models.