User:TJones (WMF)/Notes/Language Detection Evaluation/Corpus Info
Language Identification Corpus Information
- 1452 zero result queries
- 775 (53.4%) are tagged as being in some language
%lang   %total  lang
77.3%   41.3%   English
5.5%    3.0%    Spanish
2.6%    1.4%    Chinese
2.5%    1.3%    Portuguese
1.3%    0.7%    Arabic
1.3%    0.7%    French
1.2%    0.6%    Tagalog
1.0%    0.6%    German
0.8%    0.4%    Malay
0.6%    0.3%    Russian
0.6%    0.3%    Turkish
0.5%    0.3%    Indonesian
0.5%    0.3%    Persian
0.5%    0.3%    Swahili
0.4%    0.2%    Korean
0.3%    0.1%    Bengali
0.3%    0.1%    Bulgarian
0.3%    0.1%    Hindi
0.3%    0.1%    Italian
0.3%    0.1%    Norwegian
0.1%    0.1%    Croatian
0.1%    0.1%    Dutch
0.1%    0.1%    Estonian
0.1%    0.1%    Finnish
0.1%    0.1%    Greek
0.1%    0.1%    Hmong
0.1%    0.1%    Japanese
0.1%    0.1%    Kannada
0.1%    0.1%    Latin
0.1%    0.1%    Polish
0.1%    0.1%    Serbian
0.1%    0.1%    Somali
0.1%    0.1%    Swedish
0.1%    0.1%    Tamil
0.1%    0.1%    Thai
0.1%    0.1%    Uzbek
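The two percentage columns differ only in their denominator: %lang is a share of the 775 language-tagged queries, while %total is a share of all 1452 zero result queries. The following is a minimal sketch of that arithmetic, not the script used to produce the table. The function name `shares` and the example count of 43 are assumptions for illustration; the page lists percentages only, not raw per-language counts.

```python
TOTAL_QUERIES = 1452   # zero result queries in the corpus (from the page)
TAGGED_QUERIES = 775   # queries tagged as being in some language (from the page)


def shares(count):
    """Return (%lang, %total) for a language tagged on `count` queries."""
    pct_lang = round(100 * count / TAGGED_QUERIES, 1)    # share of tagged queries
    pct_total = round(100 * count / TOTAL_QUERIES, 1)    # share of all queries
    return pct_lang, pct_total


# Share of the whole corpus that carries a language tag.
tagged_share = round(100 * TAGGED_QUERIES / TOTAL_QUERIES, 1)

# Hypothetical count, chosen only to illustrate the two denominators.
example = shares(43)
print(tagged_share, example)
```

Because both columns are rounded to one decimal place, converting a rounded %lang value directly into %total can be off by 0.1; working from counts avoids that.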
Tokens per Query
number of tokens (total)
queries  tokens
469      1
364      2
213      3
127      4
86       5
58       6
40       7
19       8
23       9
11       10
9        11
5        12
2        13
4        14
3        15
1        16
4        17
1        18
1        19
1        21
1        23
2        28
2        30
2        31
1        33
1        34
1        61
1        84

number of tokens (lang)
queries  tokens
160      1
152      2
141      3
91       4
63       5
49       6
35       7
18       8
22       9
10       10
9        11
3        12
2        13
4        14
3        15
1        16
3        17
1        18
1        21
1        23
2        28
2        30
1        31
1        34
1        61

number of tokens (non-lang)
queries  tokens
309      1
212      2
72       3
36       4
23       5
9        6
5        7
1        8
1        9
1        10
2        12
1        17
1        19
1        31
1        33
1        84
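A tokens-per-query distribution like the ones above can be tallied in a few lines. This is a rough sketch under a stated assumption: the page does not say how queries were tokenized, so splitting on whitespace is a guess, and the sample queries and the name `token_histogram` are hypothetical.

```python
from collections import Counter


def token_histogram(queries):
    """Map token count -> number of queries with that many tokens.

    Assumes whitespace tokenization; the tokenizer actually used
    for the corpus is not stated in the source.
    """
    return Counter(len(query.split()) for query in queries)


# Hypothetical sample queries, not taken from the corpus.
sample = ["barack obama", "wikipedia", "how to tie a tie", "python"]
hist = token_histogram(sample)

# Print in the same "queries, tokens" order as the tables above.
for n_tokens, n_queries in sorted(hist.items()):
    print(n_queries, n_tokens)
```

Running the same function separately over the language-tagged and untagged subsets would give the (lang) and (non-lang) breakdowns, whose per-row counts add up to the (total) column.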