Which languages does this software support? Can it be trained with all Wikipedia's languages?
Topic on Talk:TextCat
We currently have two classes of language models: those based on Wikipedia query strings (30 languages) and those based on Wikipedia text (70 models). We get better performance from the query-string models on new query strings, because queries differ from encyclopedic text in several ways (less formality, more question words, fewer diacritics in languages that use them, more nouns and fewer verbs, etc.). However, I'm working on a new config that will allow us to use both query-based and Wiki-text-based models together (since, for example, the Oriya Wiki-text model is probably good enough).
You can see the currently available list on GitHub. The LM-query directory has the query-based models, and the LM directory has the Wiki-text-based models. They are named for their Wiki codes, which are usually but not always the same as or similar to current or former ISO 639 codes.
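To make the two-directory layout concrete, here is a rough Python sketch of one way the two model sets could be combined: take the query-based model for a language when one exists, and fall back to the Wiki-text model otherwise. This is only an illustration, not how the new config actually works. The function names `codes_in` and `pick_models` and the `.lm` file suffix are assumptions made for the example; only the directory names LM-query and LM come from the description above.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def codes_in(directory: Path, suffix: str = ".lm") -> List[str]:
    """List the Wiki codes of the model files in a directory.

    Assumes (hypothetically) one file per language, named <code><suffix>.
    """
    return sorted(p.stem for p in directory.glob(f"*{suffix}"))


def pick_models(
    query_models: Iterable[str],
    wikitext_models: Iterable[str],
) -> Dict[str, str]:
    """Map each Wiki code to the directory its model would be loaded from.

    Start with every Wiki-text model, then let query-based models
    override them wherever both exist.
    """
    chosen = {code: "LM" for code in wikitext_models}
    chosen.update({code: "LM-query" for code in query_models})
    return chosen


# Made-up example: English has both kinds of model, Oriya only a Wiki-text one.
example = pick_models(query_models=["en", "fr"], wikitext_models=["en", "or"])
print(example)
```

With real directories, the inputs would come from something like `codes_in(Path("LM-query"))` and `codes_in(Path("LM"))`.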
The query-based models require a lot of manual work, since many queries are not in the language of the Wiki. (Igbo Wikipedia, for example, had about half its queries in English in a sample I took in 2015; English Wikipedia has lots of other languages show up, which is what started this project.) The Wiki-text models are less work, but still require validation. For smaller wikis there isn't enough text, and for the ones still in development (like Igbo Wikipedia), there's a lot of text that's not in the language of the wiki (often English).
For the larger, more well-developed Wikipedias, we could build models for all of them. But it does take some work, and so I haven't done it, though I'd like to.
I'd also like to cover a topic that I think is implicit in your question, but which you may not have intended. Having all those language models available wouldn't make it easy to detect all those languages. Running all those models would require more computational power, and it would also lead to worse results in language detection. As I mentioned, right now we don't detect French on English Wikipedia because there are too many false positives for French and too few actual queries in French. I will be able to turn on French on English Wikipedia soon, but having all the languages available would lead to too many errors (e.g., Scots vs English, or all the Romance languages) in the general case. Instead, we enable at most the languages we see in the query logs for a given Wikipedia, minus the ones that give more errors than correct answers in that context.
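The selection rule in the last sentence can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual process: the function name `select_languages`, the shape of the inputs, and all the numbers are made up for the example, and it assumes you already have a labeled sample of queries that tells you how often each language was identified correctly or incorrectly on that wiki.

```python
from typing import Dict, List, Tuple


def select_languages(
    seen_in_logs: Dict[str, int],
    eval_counts: Dict[str, Tuple[int, int]],
) -> List[str]:
    """Return the language codes worth enabling for one wiki.

    seen_in_logs: language code -> number of sampled queries in that language.
    eval_counts:  language code -> (correct, incorrect) identifications
                  on a labeled sample from the same wiki.
    """
    enabled = []
    for lang, count in sorted(seen_in_logs.items()):
        if count <= 0:
            # Never seen in this wiki's query logs: not a candidate at all.
            continue
        # Languages with no evaluation data default to (0, 0) and are kept.
        correct, incorrect = eval_counts.get(lang, (0, 0))
        if incorrect > correct:
            # More errors than correct answers in this context: leave it off.
            continue
        enabled.append(lang)
    return enabled


# Made-up numbers, loosely modeled on the French-on-English-Wikipedia case.
sample_logs = {"en": 900, "es": 40, "fr": 5, "sco": 0}
sample_eval = {"en": (880, 20), "es": (35, 5), "fr": (3, 12)}

enabled = select_languages(sample_logs, sample_eval)
print(enabled)
```

In this toy data, French appears in the logs but is dropped because its false positives outnumber its correct identifications, and Scots is never considered because it does not appear in the logs.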
Thanks for the questions! And please let me know if I can help explain anything else.
Thanks a lot for the detailed answer.
I'd love to see a page that lists suggestions for people who write in small language wikis about how to get their wikis' content usable for this. E.g., as obvious as it should sound, a suggestion to not write in English in the Igbo Wikipedia needs to be explicit.
Good point. And it's not only that people are writing in English on the Igbo Wikipedia (though there can be lots of titles in English, say for an American actor or singer); templates also have English fallbacks when no Igbo translation is available. I'm not sure if that ends up in the extracted Wiki-text or not.
Where do you think would be a good place to put such a list of suggestions, if one were formulated?
Something like TextCat/Best practices for editors would be a good start.