User:TJones (WMF)/Notes/Folding Diacritics in Slovak
June/July/October 2019 — See TJones_(WMF)/Notes for other projects. See also T223787 and T235561. For help with the technical jargon used in Analysis Chain Analysis, see the Language Analysis section of the Search Glossary.
Background
In March 2018 I did an analysis of potential Slovak Stemmers and the use of the best stemmer in an analysis chain.
I followed my usual process for new analysis chains, which I developed after my experience with doing it exactly wrong for Swedish (see T155822). I enabled ICU folding (which is fairly aggressive normalization of Unicode characters, including diacritic removal), with exceptions for letters in the alphabet of the wiki's language (the Slovak alphabet); in this case, Áá Ää Čč Ďď Éé Íí Ĺĺ Ľľ Ňň Óó Ôô Ŕŕ Šš Ťť Úú Ýý Žž.
In a clever bit of foreshadowing, I looked briefly at the question of whether to enable folding before or after stemming. At the time it didn't seem to matter much because the differences were very slight if you exclude the Slovak letters from folding. I also cleverly reminded my future self that we have a "preserve" option, which allows us to index both the folded and unfolded version of a token.
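The two ideas above, folding with per-language exceptions and a "preserve" option, can be sketched as follows. This is only an approximation: it is not ICU folding (which does much more than remove diacritics), it uses NFD decomposition from Python's standard `unicodedata` module instead, and the names `fold`, `fold_preserve`, and `SLOVAK_EXCEPTIONS` are made up for illustration.

```python
import unicodedata

# Slovak letters exempted from folding (the list given in the text above).
SLOVAK_EXCEPTIONS = frozenset("ÁáÄäČčĎďÉéÍíĹĺĽľŇňÓóÔôŔŕŠšŤťÚúÝýŽž")


def fold(token, exceptions=frozenset()):
    """Strip combining diacritics from every character not in `exceptions`.

    Assumption: this only approximates ICU folding via NFD decomposition.
    """
    out = []
    for ch in unicodedata.normalize("NFC", token):
        if ch in exceptions:
            out.append(ch)  # exempted letter: keep as-is
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)


def fold_preserve(token, exceptions=frozenset()):
    """Return the original token plus its folded form, if they differ."""
    folded = fold(token, exceptions)
    return [token] if folded == token else [token, folded]
```

With the exceptions in place, `fold("Pařízek", SLOVAK_EXCEPTIONS)` gives `Parízek` (the Czech ř is folded, the Slovak í is kept); without them, `fold("Čaputová")` gives `Caputova`, and `fold_preserve("Čaputová")` gives both forms.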
At the 2019 Hackathon in Prague, Jetam2 and I talked about Slovak search, and he told me why it sucks... He expressed a concern that people don't always have access to a Slovak keyboard, so I said I'd look into the impact of removing the exceptions from ICU folding (and here we are). I looked into the Universal Language Selector and there is already a Slovak keyboard mapping for touch typists, and it would be possible to create a keyboard that could convert more widely available characters into diacritical characters. (For example, á as a~/, ô as o~^, č as c~v, ä as a~:.)
However, in the discussion on Phabricator (T155822) and on the Slovak Wikipedia Teahouse, Teslaton pointed out that Slovak search usually ignores diacritics and it usually doesn't cause any problems.
I was still worried (in the abstract) that Wikipedia and Wiktionary in particular have lots of text from other languages (or in IPA) which could cause weird results (though this should be mitigated in many cases by matching in the plain field). There's also the possible interaction of folding and stemming, which might be mitigated by changes to the stemmer, since we maintain the code for it.
Data
The usual process for creating a sample of documents (for testing language analysis modifications) is to retrieve 10,000 Wikipedia articles and 10,000 Wiktionary entries for the language in question. Sometimes we get fewer than 10,000 if there aren’t that many articles available in a particular project. Wikipedia articles usually provide a good example of typical formal written text in the language, and Wiktionary usually provides a larger number of distinct forms of words, and some additional variety of foreign scripts and languages. Foreign scripts and languages are not always processed well by language-specific text processing.
I sanitize the documents by removing markup (mostly HTML tags) and leading white space, and deduplicating individual lines. Deduplication reduces the number of instances of wiki-specific words, such as the local equivalent of "References", "See also", "Noun", "Etymology", etc.
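The sanitizing step described above could look roughly like the sketch below. It is not the script actually used for this analysis: the regular expression for tags is deliberately crude, dropping empty lines is an extra assumption, and the name `sanitize` is hypothetical.

```python
import re

# Crude tag matcher; assumes markup is mostly simple HTML-style tags.
TAG_RE = re.compile(r"<[^>]+>")


def sanitize(lines):
    """Remove tags and leading white space, then deduplicate lines.

    Keeps the first occurrence of each line, in order. Dropping lines
    that end up empty is an assumption, not something the text specifies.
    """
    seen = set()
    cleaned = []
    for line in lines:
        text = TAG_RE.sub("", line).lstrip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

For example, `sanitize(["  <b>Referencie</b>", "Referencie", "<p>Bratislava je mesto.</p>"])` keeps one copy of `Referencie` and the stripped sentence.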
For this analysis, I also pulled a random collection of 50,000 user queries from Slovak Wikipedia over a couple of months and 9,266 (~9k) user queries from Slovak Wiktionary (which is everything that was available at the time).
Analyzing the user queries will be a new kind of analysis, since I usually use the Wikipedia article text as a reference for the way people write in a language. Some of the info from the user queries will probably be less detailed compared to the usual analysis.
Query Data: Inspection
I started out by looking at the most common queries and most common words in queries on Slovak Wikipedia. Two of the top results were Zuzana Čaputová and Maroš Šefčovič, the two candidates in the recent Slovak presidential election. There were many variants of their names. Ignoring extra spaces, single-word queries include:
Čaputová: 48 čaputová, 33 caputova, 31 Caputova, 26 Čaputová, 17 Čaputova, 11 čaputova, 1 CaputovA
Šefčovič: 29 šefčovič, 29 Sefcovic, 27 Šefčovič, 24 sefcovic, 3 Šefčovic, 1 šefcovic, 1 ŠEFČOVIČ, 1 Sefčovič, 1 Sefcovič
If we ignore case, the lists look like this:
Čaputová: 74 čaputová, 65 caputova, 28 čaputova
Šefčovič: 57 šefčovič, 53 sefcovic, 3 šefčovic, 1 šefcovic, 1 sefčovič, 1 sefcovič
Clearly, at least for these two presumably very well-known names, searching without diacritics is common.
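The case-insensitive tallies above are just the raw counts merged after lowercasing. A small sketch, using the Čaputová counts from the list above (the function name `merge_case` is made up):

```python
from collections import Counter

# Raw single-word query counts for the Čaputová variants, from the list above.
raw_counts = {
    "čaputová": 48, "caputova": 33, "Caputova": 31, "Čaputová": 26,
    "Čaputova": 17, "čaputova": 11, "CaputovA": 1,
}


def merge_case(counts):
    """Sum counts of queries that differ only in upper/lowercase."""
    merged = Counter()
    for query, n in counts.items():
        merged[query.lower()] += n
    return merged


merged = merge_case(raw_counts)
# Share of these queries typed with no diacritics at all.
plain_share = merged["caputova"] / sum(merged.values())
```

Here `merged` comes out as 74 čaputová, 65 caputova, and 28 čaputova, so the fully diacriticless spelling accounts for 65 of 167 queries.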
I searched for other common words with diacritics and then searched for variants without diacritics. There are many cases that seem relatively unambiguous—such as čím, článok, Kočner, Košice, planéta, škola, štáty, voľby, Žilina, and živí. In these cases, the diacriticless version is also common, often equally common, as above. (It seems that the length-marking diacritic ´ is more likely to be dropped—especially for the vowels áéíóúý, but also the consonants ĺŕ. But overall it seems to happen frequently with any diacritic.)
So, clearly Slovak searchers are expecting diacriticless searches to get results, contrary to the expectations of the Swedish searchers.
I have a future concern for Slovak Wiktionary. Right now it only has about 26K entries so there aren't as many other languages represented. However, on English Wiktionary, there are often diacriticless versions of Slovak words in other Slavic languages. (Google Translate also often suggests Czech—and sometimes Slovenian and Swedish—for the diacriticless versions of Slovak words.)
On the other hand, (a) English Wiktionary folds all diacritics, and it usually works okay, (b) if Slovak-speaking searchers are used to diacriticless search, they at least won't be surprised, and (c) quotes are always available, and they aren't as restrictive on Wiktionary as they are on Wikipedia (because (i) you are more likely to be looking for an exact form of a word, and (ii) all forms of a word are much more likely to be on the page for the base form).
Option 1: Enabling Folding
The first thing I tried was disabling the ICU folding exception for the Slovak diacritical letters (Áá Ää Čč Ďď Éé Íí Ĺĺ Ľľ Ňň Óó Ôô Ŕŕ Šš Ťť Úú Ýý Žž).
Interestingly, this led to an increase in the number of post-analysis types in the Wikipedia sample (i.e., the number of distinct words coming out of the analysis chain), from 131,091 to 137,538.
There were a lot of new collisions—words that would be indexed the same: 12,207 pre-analysis types (5.484% of pre-analysis types) / 168,977 tokens (10.305% of tokens) were added to 4,863 groups (3.710% of post-analysis types), affecting a total of 26,744 pre-analysis types (12.014% of pre-analysis types) in those groups.
Collisions are what we expect: words getting folded together. The impact is pretty high, though: about 5% of distinct words and 10% of all words got folded together with something new.
There were also many new splits: 9,220 pre-analysis types (4.142% of pre-analysis types) / 41,111 tokens (2.507% of tokens) were lost from 4,475 groups (3.414% of post-analysis types), affecting a total of 29,339 pre-analysis types (13.180% of pre-analysis types) in those groups.
That's 4% of distinct words and 2.5% of all words that would no longer be indexed together with the rest of their old group.
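To make "collisions" and "splits" concrete, here is a minimal sketch of how group changes could be computed from two word-to-stem mappings. It is not the tool that produced the numbers above; the function names and the sample stems `rubén` and `angarskeho` are assumptions for illustration, and only stems present both before and after are compared.

```python
from collections import defaultdict


def group_by_stem(analysis):
    """Turn a {word: stem} mapping into {stem: set of words}."""
    groups = defaultdict(set)
    for word, stem in analysis.items():
        groups[stem].add(word)
    return groups


def diff_groups(old_analysis, new_analysis):
    """For each stem present before and after, list gained and lost words."""
    old = group_by_stem(old_analysis)
    new = group_by_stem(new_analysis)
    changes = {}
    for stem in old.keys() & new.keys():
        gained = new[stem] - old[stem]  # collisions: newly indexed together
        lost = old[stem] - new[stem]    # splits: no longer indexed together
        if gained or lost:
            changes[stem] = (sorted(gained), sorted(lost))
    return changes


# Hypothetical stems: folding merges Rubén into ruben, but leaves
# Angarského with an unstemmed form of its own.
old = {"Ruben": "ruben", "Rubén": "rubén",
       "Angarského": "angarsk", "Angarský": "angarsk"}
new = {"Ruben": "ruben", "Rubén": "ruben",
       "Angarského": "angarskeho", "Angarský": "angarsk"}
changes = diff_groups(old, new)
```

In this toy data the `ruben` group gains `Rubén` (a collision) and the `angarsk` group loses `Angarského` (a split).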
The main cause of the splits seems to be interference with the stemmer.
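The interference is easy to reproduce with a toy suffix stripper. The sketch below is not the Slovak stemmer used in the analysis chain: `SUFFIXES` is a tiny hypothetical list, and `fold` is a plain NFD-based diacritic stripper standing in for ICU folding. It only shows why the order of the two steps matters when suffixes themselves carry diacritics.

```python
import unicodedata

# Tiny, hypothetical suffix list; the real stemmer has many more rules.
SUFFIXES = [unicodedata.normalize("NFC", s)
            for s in ["ého", "ých", "ým", "ý", "á", "é", "a", "e", "y"]]


def fold(token):
    """Strip combining diacritics (stand-in for ICU folding)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def toy_stem(token):
    """Lowercase and strip the longest matching suffix, if any."""
    token = unicodedata.normalize("NFC", token.lower())
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def fold_then_stem(token):
    return toy_stem(fold(token))


def stem_then_fold(token):
    return fold(toy_stem(token))
```

Stemming first maps both `Angarského` and `Angarský` to `angarsk`. Folding first turns `-ého` into `-eho`, which the toy suffix list does not recognize, so `Angarského` ends up as `angarskeho` while `Angarský` still reaches `angarsk`.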
The Wiktionary sample had a roughly similar number of collisions: 5% of distinct words and 4.7% of all words. Wiktionary had very differently balanced splits: <1% of distinct words, but still 5% of all words. The difference seems to come down to a much smaller sample size—the Wikipedia sample has approximately 25x as many tokens in it—and many more distinct words in the Wiktionary sample.
[Note: I've made the fold-first examples collapsible since we didn't get any speaker review, and stemming first is probably the right way to go.]
Fold-First Examples
Speaker Review: Overview
The core task of the speaker doing the review is to decide whether words are being properly grouped together for search, and whether any changes to those groupings are better or worse. When words are grouped together, it means that searching for one word in the group will find all of the other words in the group, too. With the current English language processing, for example, searching for any of the words hope, hopes, hoped, hoping, hope’s, hoper, or hopers will find all of the others. (Note that the results in each case will be ranked differently because exact matches are preferred.)
In addition to listing the words that are grouped together, we also include the number of times each word appears in the text sample. This helps us estimate the relative importance of potential errors. For example, if two words are improperly grouped together, but the words are very rare, that’s not as bad as if they were very common.
[For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]
When we make less extreme modifications to the language processing done for search, like introducing diacritic folding, we can usually look more meaningfully at groups before and after the modification to assess the effect of the group changes. Old-vs-new groups are presented as follows:
hope >> 2
  o: [152 Hope][23 Hopes][1208 hope][346 hoped][488 hopes]
  n: [152 Hope][1 Hopē][23 Hopes][1208 hope][346 hoped][488 hopes][2 ĥợṕễ]
The first line shows the stem (here, hope), the direction of the change (>> for gains, << for losses, >< for both), and the number of distinct words gained or lost. The stem is the form that all of the other words were reduced to. The stem does not have to be the actual root form of the word or even a word at all. However, seeing the stem sometimes makes it easier to understand what the stemmer or other parts of the analysis were trying to do.
In terms of gains and losses, the o: line shows the old group and the n: line shows the new group; in this example the group gained two distinct words, Hopē and ĥợṕễ. The number with each word (e.g., [1208 hope]) is the number of times that word appears in the sample. Problems can arise when more common words are grouped together incorrectly.
Speaker Review: Folding Groups that Lost Members
The question for speakers of Slovak reviewing these sections (Random Sample, High-Impact Groups, and High-Frequency Words) is this: would it be bad if searching for the "lost" words no longer found the remaining words, and vice versa?
Random Sample
Looking at a random sample of the word groups is the best way to see what the typical effects of a modification are. If the majority of changes are good, and any less desirable changes are understandable and acceptable, then overall the modification is good.
Below is a sample of 25 randomly selected stemming groups (words that would all be indexed together) that lost members as a result of folding Slovak diacritical characters. (These are from the Wikipedia sample.)
All of these examples seem to be caused by the presence of the following suffixes (with counts of how many times they are lost in the sample below; the total is more than 25 because some groups lost multiple members):
1 -ách, 1 -aný, 12 -ého, 3 -ému, 1 -í, 7 -ú, 11 -ých, 5 -ým, 6 -ými
Other obvious suffixes (-á, -é, -ý) don't have problems because unaccented versions (-a, -e, -y) are also suffixes in Slovak (though they have different meanings). -é and -e can cause differences in the way the rest of the word root is normalized, though there's no evidence of that here.
Key:
Lost members are bolded. gazdovsk << 1 o: [1 Gazdovská][1 Gazdovské][1 gazdovské][2 gazdovský][1 gazdovskými] n: [1 Gazdovská][1 Gazdovské][1 gazdovské][2 gazdovský] odlet << 1 o: [1 odletom][1 odletovou][1 odletových] n: [1 odletom][1 odletovou] implementovan << 1 o: [1 implementovaná][3 implementované][1 implementovanú][6 implementovaný] n: [1 implementovaná][3 implementované][6 implementovaný] vyzdvihovan << 1 o: [1 vyzdvihované][1 vyzdvihovaný][1 vyzdvihovaných] n: [1 vyzdvihované][1 vyzdvihovaný] novonaroden << 1 o: [1 Novonarodená][1 novonarodených] n: [1 Novonarodená] rytmick << 2 o: [1 Rytmická][3 rytmickej][1 rytmickou][2 rytmicky][8 rytmická][2 rytmické] [1 rytmickému][3 rytmickú] n: [1 Rytmická][3 rytmickej][1 rytmickou][2 rytmicky][8 rytmická][2 rytmické] angarsk << 1 o: [1 Angarského][2 Angarský] n: [2 Angarský] komunikuj << 1 o: [6 komunikuje][5 komunikujú] n: [6 komunikuje] balneologick << 1 o: [6 Balneologické][2 Balneologického][1 balneologické] n: [6 Balneologické][1 balneologické] vyjadren << 1 o: [1 vyjadrenou][14 vyjadrená][4 vyjadrené][1 vyjadreného][5 vyjadrení] [2 vyjadrený] n: [1 vyjadrenou][14 vyjadrená][4 vyjadrené][5 vyjadrení][2 vyjadrený] domorod << 4 o: [3 domorodé][2 domorodého][1 domorodí][6 domorodých][1 domorodým] [2 domorodými] n: [3 domorodé][1 domorodí] divadl << 1 o: [14 Divadla][12 Divadle][48 Divadlo][1 Divadlom][4 Divadlá][128 divadla] [1 divadlami][51 divadle][127 divadlo][10 divadlom][15 divadlá][10 divadlách] n: [14 Divadla][12 Divadle][48 Divadlo][1 Divadlom][4 Divadlá][128 divadla] [1 divadlami][51 divadle][127 divadlo][10 divadlom][15 divadlá] hamersk << 1 o: [2 Hamerského][1 Hamerský] n: [1 Hamerský] karpatsk << 6 o: [1 KARPATSKÁ][6 Karpatskej][1 Karpatsko][2 Karpatská][10 Karpatské] [2 Karpatského][1 Karpatskí][1 Karpatskú][3 Karpatský][1 Karpatskými] [2 karpatskej][2 karpatsko][1 karpatské][9 karpatského][2 karpatskému] [1 karpatskí][3 karpatský][5 karpatských] n: [1 KARPATSKÁ][6 Karpatskej][1 Karpatsko][2 
Karpatská][10 Karpatské] [1 Karpatskí][3 Karpatský][2 karpatskej][2 karpatsko][1 karpatské] [1 karpatskí][3 karpatský] samotn << 9 o: [11 Samotná][16 Samotné][1 Samotného][1 Samotnému][1 Samotní][1 Samotnú] [17 Samotný][35 samotnej][14 samotnom][6 samotnou][28 samotná][33 samotné] [36 samotného][2 samotnému][6 samotní][15 samotnú][36 samotný] [12 samotných][14 samotným][4 samotnými] n: [11 Samotná][16 Samotné][1 Samotní][17 Samotný][35 samotnej][14 samotnom] [6 samotnou][28 samotná][33 samotné][6 samotní][36 samotný] krewsk << 1 o: [1 Krewská][1 krewská][1 krewskú] n: [1 Krewská][1 krewská] slienit << 2 o: [1 slienité][1 slienitého][1 slienitých] n: [1 slienité] madridsk << 1 o: [1 Madridským][1 madridskom] n: [1 madridskom] rastr << 2 o: [1 Rastrová][1 rastra][2 rastri][1 rastrovej][1 rastrového][1 rastrový] [1 rastrových] n: [1 Rastrová][1 rastra][2 rastri][1 rastrovej][1 rastrový] ontogenetick << 1 o: [1 ontogenetického][1 ontogenetický] n: [1 ontogenetický] pondelk << 1 o: [3 pondelka][1 pondelkového] n: [3 pondelka] nitovan << 1 o: [1 nitované][1 nitovaných] n: [1 nitované] zostupuj << 1 o: [1 Zostupuje][3 zostupuje][1 zostupujú] n: [1 Zostupuje][3 zostupuje] umiestnen << 5 o: [1 Umiestnený][6 umiestnenej][6 umiestnenou][62 umiestnená][92 umiestnené] [4 umiestneného][11 umiestnení][4 umiestnenú][50 umiestnený] [13 umiestnených][4 umiestneným][4 umiestnenými] n: [1 Umiestnený][6 umiestnenej][6 umiestnenou][62 umiestnená][92 umiestnené] [11 umiestnení][50 umiestnený] rovnocenn << 3 o: [1 rovnocennej][6 rovnocenné][1 rovnocenní][4 rovnocenných][2 rovnocenným] [1 rovnocennými] n: [1 rovnocennej][6 rovnocenné][1 rovnocenní] High-Impact Groups[edit]High-impact groups are those with 10 or more changes to the number of distinct words in the group (gains Sometimes an apparent high-impact group is not really an outlier. This happens when a large group has the stem of a small group. 
For example, if a group of 10 words and a group of 2 words merge, you could see it as the group of 10 gaining 2 new members (which is not an outlier), or as the group of 2 gaining 10 new members (which looks like an outlier). The most interesting cases are when two relatively large groups merge, or when more than two medium-sized groups merge, because then lots of potentially unrelated words are being grouped together.
There are thirteen stemming groups that lost 10 or more members. Many of the same suffixes as above show up, so I've excluded the groups that appear in the list only because they are more common words with more variants that take those same suffixes. There are three new phenomena here:
Lost members are bolded. anglick << 11 o: [48 Anglicka][3 Anglickej][47 Anglicko][10 Anglickom][1 Anglickou][8 Anglická] [2 Anglické][1 Anglickí][3 Anglický][2 Angličtina][359 anglickej] [6 anglicko][19 anglickom][5 anglickou][1073 anglicky][30 anglická] [16 anglické][75 anglického][2 anglickému][4 anglickú][190 anglický] [16 anglických][9 anglickým][1 anglickými][8 angličtina][45 angličtine] [1 angličtinou][20 angličtiny] n: [48 Anglicka][3 Anglickej][47 Anglicko][10 Anglickom][1 Anglickou][8 Anglická] [2 Anglické][1 Anglickí][3 Anglický][359 anglickej][6 anglicko] [19 anglickom][5 anglickou][1073 anglicky][30 anglická][16 anglické] [190 anglický] bohat << 10 o: [2 Bohaté][1 Bohatému][1 Bohatý][12 bohatej][8 bohato][2 bohatom] [13 bohatou][20 bohatá][22 bohaté][7 bohatého][5 bohatí][15 bohatú] [21 bohatý][12 bohatých][10 bohatým][3 bohatými][4 najbohatším] n: [2 Bohaté][12 bohatej][8 bohato][2 bohatom][13 bohatou][22 bohaté][5 bohatí] nov << 14 o: [1 NOV][1 NOVA][1 NOVÁ][8 Nov][14 Nova][1 Nove][66 Novej][9 Novi][1 Novo] [54 Novom][1 Novou][153 Nová][181 Nové][1 NovéHO][47 Nového][2 Novému] [3 Noví][10 Novú][103 Nový][14 Nových][15 Novým][1 Novými][1 nov][2 nova] [3 nove][119 novej][34 novo][34 novom][27 novou][2 novus][1 novy][98 nová] [282 nové][152 nového][15 novému][12 noví][84 novú][209 nový] [156 nových][50 novým][30 novými][1 novším] n: [1 NOV][1 NOVA][1 NOVÁ][8 Nov][14 Nova][1 Nove][66 Novej][9 Novi][1 Novo] [54 Novom][1 Novou][153 Nová][181 Nové][3 Noví][103 Nový][1 nov][2 nova] [3 nove][119 novej][34 novo][34 novom][27 novou][2 novus][1 novy][98 nová] [282 nové][12 noví][209 nový] tureck << 10 o: [26 Turecka][1 Tureckej][31 Turecko][6 Tureckom][5 Turecká][2 Turecké] [1 Tureckí][1 Tureckú][3 Turecký][1 Tureckých][34 tureckej][1 turecki] [1 turecko][1 tureckom][3 tureckou][9 turecky][8 turecká][4 turecké] [9 tureckého][4 tureckému][2 tureckú][16 turecký][18 tureckých] [9 tureckým][3 tureckými][7 turečtine][1 turečtiny] n: [26 Turecka][1 Tureckej][31 
Turecko][6 Tureckom][5 Turecká][2 Turecké] [1 Tureckí][3 Turecký][34 tureckej][1 turecki][1 turecko][1 tureckom] [3 tureckou][9 turecky][8 turecká][4 turecké][16 turecký]
High-Frequency Words
High-frequency words are those that occur 1,000 times or more in the sample. These are more likely to be very common words, so it’s important to look at cases where a high-frequency word was added or removed from a group, to make sure the change isn’t going to cause problems.
I also looked for high-frequency words that were lost from a group, but there weren't any in the Wikipedia sample. The Wiktionary sample had one example, podstatného, which means "of the noun", and so occurs very frequently in Wiktionary. The other lost words in that group have the now-familiar suffixes.
Lost members are bolded.
podstatn << 4
  o: [8 podstatné][1909 podstatného][1 podstatných][1 podstatným] [1 podstatnými]
  n: [8 podstatné]
Speaker Review: Folding Groups that Gained Members
The question for speakers of Slovak reviewing these sections (Random Sample, High-Impact Groups, and High-Frequency Words) is this: would it be bad if searching for the "gained" words now found the other words, and vice versa?
Random Sample
Looking at a random sample of the word groups is the best way to see what the typical effects of a modification are. If the majority of changes are good, and any less desirable changes are understandable and acceptable, then overall the modification is good.
Below is a sample of 25 randomly selected stemming groups (words that would all be indexed together) that gained members as a result of folding Slovak diacritical characters. (These are from the Wikipedia sample.)
Key:
Note that the group shown as "gaining" new members is always the one whose stem has no accents. In the case of Putna "gaining" Pútny, pútne, and pútny, you could argue just as well that Pútny, pútne, and pútny added Putna to their group. What actually happened is that the stem putn and the stem pútn merged. The same goes for the new additions to the budapest group.
A lot of the changes here are the kinds we'd expect to see, with accented versions of words (especially names) being merged. Some notes:
Gained members are bolded. amali >> 4 o: [3 Amalia][1 Amalie] n: [3 Amalia][1 Amalie][12 Amália][4 Amálie][1 Amáliina][2 Amáliou] pospolitost >> 1 o: [1 pospolitosti] n: [1 pospolitosti][2 pospolitosť] polen >> 1 o: [2 Polen][1 polene] n: [2 Polen][2 Poleň][1 polene] sni >> 1 o: [1 Snipes] n: [1 Snipes][1 sní] niob >> 1 o: [1 Nioba][1 Niobe][1 Nioby] n: [1 Nioba][1 Niobe][1 Nioby][1 niób] dal >> 5 o: [4 Dal][2 Dala][8 Dale][2 Dalo][1 Dalou][1 Dalího][158 dal][47 dala][57 dali] [23 dalo] n: [4 Dal][2 Dala][8 Dale][2 Dalo][1 Dalou][1 Dalího][158 dal][47 dala][57 dali] [23 dalo][12 dál][1 najďalej][2 Ďale][103 Ďalej][340 ďalej] ruben >> 1 o: [1 Ruben] n: [1 Ruben][2 Rubén] ilov >> 3 o: [1 Ilové] n: [1 Ilové][1 ílov][1 ílovou][2 ílové] ultim >> 2 o: [1 Ultima][1 ultimo] n: [1 Ultima][1 ultimo][2 ultimáta][1 Última] taih >> 1 o: [2 Taiho] n: [2 Taiho][2 Taihó] ods >> 1 o: [6 ODS][6 ods] n: [6 ODS][6 ods][1 odsať] spas >> 4 o: [4 SPAS][2 Spas] n: [4 SPAS][2 Spas][14 spása][1 spásať][4 spáse][2 spásy] hellad >> 1 o: [2 Hellados] n: [2 Hellados][1 Helládos] evoqu >> 1 o: [1 Evoque] n: [1 Evoque][1 évoque] bedarieux >> 1 o: [1 Bedarieux] n: [1 Bedarieux][2 Bédarieux] parizek >> 1 o: [1 Parizek] n: [1 Parizek][1 Pařízek] hojnost >> 1 o: [2 hojnosti] n: [1 Hojnosť][2 hojnosti] giap >> 1 o: [1 GIAP] n: [1 GIAP][1 Giáp] bazin >> 1 o: [2 Bazin] n: [2 Bazin][1 bažin] styri >> 2 o: [1 Styria] n: [1 Styria][6 Štyria][13 štyria] baton >> 1 o: [2 Baton] n: [2 Baton][4 Batón] putn >> 3 o: [1 Putna] n: [1 Putna][1 Pútny][1 pútne][1 pútny] budapest >> 4 o: [7 Budapest] n: [7 Budapest][1 Budapesť][78 Budapešti][35 Budapešť][1 Budapešťi] zp >> 1 o: [1 ZP] n: [1 ZP][3 ŽP] partizan >> 8 o: [1 Partizanom] n: [1 Partizanom][2 Partizán][3 Partizáni][2 partizán][1 partizána] [5 partizáni][1 partizánmi][11 partizánom][3 partizánov] High-Impact Groups[edit]High-impact groups are those with 10 or more changes to the number of distinct words in the group (gains Sometimes an apparent high-impact 
group is not really an outlier. This happens when a large group has the stem of a small group. For example, if a group of 10 words and a group of 2 words merge, you could see it as the group of 10 gaining 2 new members (which is not an outlier), or as the group of 2 gaining 10 new members (which looks like an outlier). The most interesting cases are when two relatively large groups merge, or when more than two medium-sized groups merge, because then lots of potentially unrelated words are being grouped together.
There were 140 groups with 10 or more additions, so I raised the threshold to 15 or more additions, which gave 29 groups. I've removed the groups that are like the putn or budapest groups above, where a large group with diacritics merged with one or two distinct words (after ignoring upper- and lowercase) that have a stem without diacritics.
The remaining 16 groups are shown below. These represent large groups with diacritics merging with medium to large groups without diacritics. The converse (large groups without diacritics merging with smaller groups with diacritics) is not represented. I can go looking for examples if anyone thinks they would be significantly different from the ones here.
The stal and pol groups look to be made up of the largest distinct groups that merged.
Gained members are bolded.
byval >> 16 o: [1 ByVal][1 byvalá][1 byvalé][1 byvalý] n: [1 ByVal][1 Býval][4 Bývalá][2 Bývalé][4 Bývalí][9 Bývalý][1 byvalá] [1 byvalé][1 byvalý][27 býval][9 bývala][46 bývalej][17 bývali] [5 bývalo][20 bývalom][7 bývalou][60 bývalá][13 bývalé][12 bývalí] [189 bývalý] desperad >> 1 o: [1 Desperado] n: [1 Desperado][1 desperádmi] lud >> 19 o: [1 Lud][1 Lude][1 Ludiès][1 Ludo][1 Ludus] n: [1 Lud][1 Lude][1 Ludiès][1 Ludo][1 Ludus][1 luďom][1 ĽUDÍ][1 Ľud] [1 Ľuda][1 Ľudo][1 Ľudoví][1 Ľudí][1 Ľuďom][46 ľud][1 ľude][6 ľudi] [1 ľudmi][7 ľudom][3 ľudy][421 ľudí][42 ľuďmi][13 ľuďoch][25 ľuďom] [1 ľuďí] narodn >> 18 o: [1 Narodna][1 Narodni][1 narodne] n: [1 Narodna][1 Narodni][2 NÁRODNÁ][92 Národnej][22 Národnom][13 Národnou] [44 Národná][37 Národné][16 Národní][1 Národního][43 Národný] [1 narodne][10 národne][103 národnej][6 národno][37 národnom][16 národnou] [96 národná][171 národné][9 národní][103 národný] nas >> 17 o: [77 NASA][1 NaS][1 Nas][1 Naso][4 nas][1 nasi][1 naso] n: [77 NASA][1 NaS][1 Nas][1 Naso][11 Naša][8 Naše][1 Našej][5 Naši][1 Našou] [9 Náš][4 nas][2 nasatý][1 nasi][1 naso][10 naša][19 naše][30 našej] [4 naši][37 našich][15 našom][2 našou][1 naší][84 nás][19 náš] plan >> 19 o: [12 Plan][3 Planina][1 plan][2 plane][1 planej][14 planina][6 planine] [3 planinou][15 planiny][1 plané] n: [12 Plan][3 Planina][11 Plán][2 Plánom][3 Plány][1 Pláň][1 plan][2 plane] [1 planej][14 planina][6 planine][3 planinou][15 planiny][1 plané] [12 planéte][1 planín][49 plán][16 pláne][3 pláni][4 plánmi][5 plánoch] [7 plánom][20 plánov][3 plánovať][37 plány][1 plání][3 pláň] [2 pláňami][1 pláňou] pol >> 16 o: [1 POLE][5 Pol][6 Pola][24 Pole][2 Poli][5 Polo][1 Polom][1 Polus][69 pol] [81 pole][56 poli][5 polo][2 polom][8 polos][2 poly][19 polí][1 polích] n: [1 POLE][5 Pol][6 Pola][24 Pole][2 Poli][5 Polo][1 Polom][1 Polus][1 Poľa] [1 Póly][69 pol][81 pole][1 poletí][56 poli][5 polo][2 polom][8 polos] [2 poly][19 polí][1 polích][1 polôch][39 poľ][57 poľa][4 poľami] [13 
poľom][13 pól][5 póla][16 póle][2 pólmi][12 pólo][2 póloch] [5 pólom][1 póly] polsk >> 17 o: [1 Polsce][11 Polska][6 Polski][7 Polskich][1 Polsko][1 polski][1 polsko] n: [1 Polsce][11 Polska][6 Polski][7 Polskich][1 Polsko][93 Poľska][20 Poľskej] [138 Poľsko][20 Poľskom][15 Poľská][11 Poľské][2 Poľskí][10 Poľský] [1 polski][1 polsko][49 poľskej][22 poľsko][12 poľskom][6 poľskou] [37 poľsky][32 poľská][22 poľské][7 poľskí][82 poľský] post >> 18 o: [41 Post][102 post][25 poste][4 postoch][1 postom][1 postov][2 posty] n: [41 Post][2 Pošta][1 Poštovou][2 Poštová][2 Poštové][1 Poštový] [102 post][25 poste][4 postoch][1 postom][1 postov][2 posty][12 pošta] [2 pošte][5 poštou][2 poštovej][1 poštovou][4 poštová][1 poštové] [1 poštoví][4 poštový][15 pošty][1 pôst][1 pôsty][3 pôšt] povodn >> 16 o: [9 povodne][3 povodni][2 povodní] n: [127 Pôvodne][1 Pôvodnou][25 Pôvodná][17 Pôvodné][1 Pôvodní] [29 Pôvodný][9 povodne][3 povodni][2 povodní][5 povodňami][3 povodňou] [307 pôvodne][77 pôvodnej][17 pôvodnom][8 pôvodnou][19 pôvodná] [61 pôvodné][6 pôvodní][55 pôvodný] premier >> 15 o: [2 PREMIER][19 Premier][1 Premiera][1 Premierom][3 première] n: [2 PREMIER][19 Premier][1 Premiera][1 Premierom][1 Premiér][22 Premiéra] [2 Premiérom][1 Premiérový][3 première][33 premiér][26 premiéra] [14 premiére][10 premiérom][2 premiérou][1 premiérov][4 premiérovo] [1 premiérovom][3 premiérový][6 premiéry][1 premiér] seri >> 17 o: [23 Serie][1 Serio][1 seria] n: [23 Serie][1 Serio][9 Séria][11 Série][4 Sérii][4 Sériová][3 Sériové] [1 seria][65 séria][1 sériami][102 série][52 sérii][9 sériou] [10 sériovej][11 sériovo][2 sériovou][8 sériová][5 sériové][5 sériový] [11 sérií] stal >> 19 o: [51 Stal][13 Stala][1 Stali][30 Stalin][8 Stalina][4 Stalinom][3 Stalinovi] [16 Stalo][862 stal][338 stala][1 stale][150 stali][173 stalo] n: [51 Stal][13 Stala][1 Stali][30 Stalin][8 Stalina][4 Stalinom][3 Stalinovi] [16 Stalo][3 Stál][4 Stála][16 Stále][3 Stálej][2 Stáli][1 Stálo] [1 Stálou][2 Stály][862 stal][338 
stala][1 stale][2 staletí][150 stali] [173 stalo][67 stál][41 stála][249 stále][7 stálej][28 stáli][19 stálo] [2 stálom][3 stálou][13 stály][1 Štál] stat >> 20 o: [1 Stat][48 State][7 Status][11 stat][11 state][2 stati][53 status][5 statí] n: [1 Stat][48 State][7 Status][11 stat][11 state][2 stati][53 status][5 statí] [71 stať][4 stát][12 stáť][2 sťatá][2 sťatí][1 sťatý][1 sťať] [14 Štát][1 Štátoch][3 Štátov][3 Štáty][1 štatom][1 štatov] [186 štát][124 štáte][37 štátmi][116 štátoch][66 štátom] [238 štátov][91 štáty] studi >> 21 o: [1 STUDIO][2 Studia][4 Studie][20 Studio][36 Studios][2 studie][7 studio] [1 studií] n: [1 STUDIO][2 Studia][4 Studie][20 Studio][36 Studios][1 Stúdió][2 studie] [7 studio][1 studií][12 Štúdia][3 Štúdie][6 Štúdio][2 Štúdiom] [3 Štúdiové][1 Štúdioví][2 Štúdiá][140 štúdia][4 štúdiami] [57 štúdie][12 štúdii][18 štúdio][17 štúdiom][3 štúdiou] [3 štúdiovom][1 štúdiová][5 štúdiové][67 štúdiový][16 štúdiá] [50 štúdií] system >> 17 o: [1 SYSTEM][52 System][17 system][1 systema][1 systeme] n: [1 SYSTEM][52 System][65 Systém][1 Systémová][1 Systémové][10 Systémy] [17 system][1 systema][1 systeme][415 systém][16 systémami][67 systéme] [37 systémoch][70 systémom][86 systémov][3 systémovej][1 systémovom] [2 systémová][9 systémové][4 systémový][102 systémy][1 sýstéma] High-Frequency Words[edit]I also looked for high-frequency words that were added to a group. High-frequency words are those that occur 1,000 times or more in the sample. These are more likely to be very common words, so it’s important to look at cases where a high-frequency word was added or removed from a group, to make sure the change isn’t going to cause problems. [For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.] I dropped groups that are easily interpreted as a small number of words without diacritics being added to a larger group of words with diacritics, one of which is high-frequency. 
For example, 1 instance of ktoru would be indexed with 1063 instances of ktorú, which isn't actually very interesting (and may just be a typo). (Though, see the mixed groups below for more on ktorú.) I kept groups where the group being added to had at least 3 different words in it, or at least one of the words had 10 or more instances. The remaining 8 groups with high-frequency words are below. The most interesting collisions (ignoring case) seem to be:
Gained members are bolded.
az >> 2
  o: [84 AZ][1 Az][6 az]
  n: [84 AZ][1 Az][101 Až][6 az][2357 až]
cast >> 23
  o: [2 Cast][2 Castles][6 Castres][1 caste][1 casti]
  n: [2 Cast][2 Castles][6 Castres][1 caste][1 casti][1 Časti][67 Často] [1 Častou][7 Častá][5 Časté][5 Častý][37 Časť][1 častej][1171 časti] [1 častich][541 často][1 častom][3 častou][7 častá][39 časté] [240 častí][6 častý][1 často][1012 časť][18 časťami][122 časťou] [5 část][1 části]
co >> 3
  o: [3 CO][28 Co][7 Comes][7 co][2 comes]
  n: [3 CO][28 Co][7 Comes][26 Côtes][7 co][2 comes][50 Čo][1418 čo]
ked >> 2
  o: [1 Kedy][1 ked][461 kedy]
  n: [1 Kedy][297 Keď][1 ked][461 kedy][1084 keď]
podl >> 3
  o: [4 Podla][1 Podle][9 podla][3 podle][1 podlete]
  n: [1 PODĽA][4 Podla][1 Podle][531 Podľa][9 podla][3 podle][1 podlete] [1254 podľa]
su >> 5
  o: [6 SU][141 Su][2 Sü][9 su][3 sü]
  n: [6 SU][141 Su][146 Sú][2 Sü][9 su][3855 sú][3 sü][19 ŠÚ][1 šu][3 šú]
wikipedi >> 5
  o: [2 Wikipedia][1 Wikipedie][2 wikipedia]
  n: [2 Wikipedia][1 Wikipedie][6 Wikipédia][10 Wikipédie][1477 Wikipédii] [1 Wikipédiou][2 wikipedia][1 wikipédie]
ze >> 2
  o: [4 Ze][14 ze]
  n: [4 Ze][14 ze][4 Že][3268 že]
Speaker Review: Folding Groups that Lost and Gained (Mixed) Members
The question for speakers of Slovak reviewing these sections (Random Sample and High-Frequency Words) is this: would it be bad if searching for the new groups of words found each other, instead of the old groups? (That's a bit clunky, but after looking separately at groups that lost and gained members, the idea should be clear enough.)
Random Sample
Looking at a random sample of the word groups is the best way to see what the typical effects of a modification are. If the majority of changes are good, and any less desirable changes are understandable and acceptable, then overall the modification is good.
I don't see a lot of differences here, other than we happen to have both gains and losses applying at once. However, I'm including them in case there is something non-obvious. Below is a sample of 10 randomly selected stemming groups (words that would all be indexed together) that both lost and gained members as a result of folding Slovak diacritical characters. (These are from the Wikipedia sample.) Key:
Lost and gained members are bolded.
otcov >< 3
o: [6 Otcov][3 Otcovy][9 otcov][13 otcovej][1 otcových][1 otcovým]
n: [6 Otcov][3 Otcovy][1 Otcové][9 otcov][13 otcovej]
braln >< 2
o: [2 Bralná][1 bralných]
n: [2 Bralná][2 bralnatý]
odtrhnut >< 2
o: [1 odtrhnutí][1 odtrhnutých]
n: [1 odtrhnutí][1 odtrhnúť]
pohyb >< 4
o: [5 Pohyb][2 Pohybová][3 Pohyby][106 pohyb][24 pohybe][4 pohybmi][20 pohybom][11 pohybov][4 pohybovej][1 pohybovo][3 pohybová][4 pohybové][2 pohybového][4 pohybovú][1 pohybový][2 pohybových][19 pohyby]
n: [5 Pohyb][2 Pohybová][3 Pohyby][106 pohyb][24 pohybe][4 pohybmi][20 pohybom][11 pohybov][17 pohybovať][4 pohybovej][1 pohybovo][3 pohybová][4 pohybové][1 pohybový][19 pohyby]
stop >< 4
o: [12 Stop][1 Stopové][3 Stopy][5 stop][9 stopa][2 stopami][7 stope][2 stopom][2 stopou][2 stopový][3 stopových][44 stopy][10 stopách][1 stopám]
n: [12 Stop][1 Stopové][3 Stopy][5 stop][9 stopa][2 stopami][7 stope][2 stopom][2 stopou][2 stopový][44 stopy][26 stôp]
tatr >< 3
o: [41 Tatra][2 Tatrami][2 Tatre][2 Tatro][2 Tatrou][60 Tatry][47 Tatrách]
n: [41 Tatra][2 Tatrami][2 Tatre][2 Tatro][2 Tatrou][60 Tatry][1 Tatrín][1 Tátra]
lisk >< 11
o: [8 Liskovej][5 Lisková][1 Lištinou]
n: [8 Liskovej][5 Lisková][1 Liška][1 Lišková][2 Líška][1 Líšková][2 Líščí][1 liška][1 liščí][4 líška][2 líšky][2 líščí]
konkol >< 2
o: [2 Konkol][1 Konkolových][5 Konkoly]
n: [2 Konkol][5 Konkoly][2 Konkoľ]
stol >< 12
o: [1 STOL][3 Stolová][2 Stolového][3 Stolový][1 Stoly][1 stol][5 stola][8 stole][2 stoloch][2 stolom][4 stolová][1 stolové][1 stolových][1 stoly][1 stolé]
n: [1 STOL][3 Stolová][3 Stolový][1 Stoly][1 Stół][1 Stôl][1 stol][5 stola][8 stole][13 století][2 stoloch][2 stolom][4 stolová][1 stolové][1 stoly][1 stolé][1 stół][18 stôl][1 Štola][1 Štóla][1 štola][5 štóla][1 štôl]
reform >< 3
o: [2 Reform][2 Reforma][12 reforma][3 reformami][8 reforme][4 reformou][29 reformy][5 reformách][3 reformám]
n: [2 Reform][2 Reforma][12 reforma][3 reformami][8 reforme][4 reformou][3 reformovať][29 reformy]
High-Impact Groups
High-impact groups are those with 10 or more changes to the number of distinct words in the group (gains >>, losses <<, or a mix ><). These groups are more likely to have problems because they are outliers.
Sometimes an apparent high-impact group is not really an outlier. This happens when a large group has the stem of a small group. For example, if a group of 10 words and a group of 2 words merge, you could see it as the group of 10 gaining 2 new members (which is not an outlier), or as the group of 2 gaining 10 new members (which looks like an outlier).
The most interesting cases are when two relatively large groups merge, or when more than two medium-sized groups merge, because then lots of potentially unrelated words are being grouped together.
[For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]
There were 68 groups with 10 or more changes, so I raised the threshold to 18 or more changes, which gave only 7 groups, which are shown below.
Lost and gained members are bolded.
elektrick >< 18
o: [1 Elektrickej][10 Elektrická][3 Elektrické][2 Elektrickú][19 Elektrický][45 elektrickej][3 elektrickom][3 elektrickou][9 elektricky][31 elektrická][51 elektrické][37 elektrického][5 elektrickému][17 elektrickú][36 elektrický][19 elektrických][11 elektrickým][1 elektrickými]
n: [1 Elektrickej][10 Elektrická][3 Elektrické][19 Elektrický][3 Električka][3 Električková][1 Električky][45 elektrickej][3 elektrickom][3 elektrickou][9 elektricky][31 elektrická][51 elektrické][36 elektrický][8 električka][2 električkami][2 električkou][6 električkovej][12 električková][3 električkové][5 električkový][21 električky]
horn >< 19
o: [6 Horn][17 Hornej][3 Hornina][10 Horniny][13 Hornom][130 Horná][48 Horné][21 Horného][23 Horní][1 Horních][1 Horního][2 Hornú][22 Horný][6 Horných][3 Horným][57 hornej][19 hornina][18 horninami][6 hornine][3 horninou][4 horninové][91 horniny][26 horninách][2 horninám][48 hornom][4 hornou][15 horná][7 horné][15 horného][1 hornému][6 hornú][5 horný][17 horných][7 horným][1 hornými]
n: [6 Horn][2 Hornatý][17 Hornej][3 Hornina][10 Horniny][13 Hornom][130 Horná][48 Horné][23 Horní][1 Horních][1 Horního][22 Horný][1 Hôrny][3 hornatá][2 hornatý][57 hornej][19 hornina][18 horninami][6 hornine][3 horninou][91 horniny][48 hornom][4 hornou][15 horná][7 horné][97 hornín][5 horný][7 hôrny]
mlad >< 20
o: [3 Mladej][30 Mladá][19 Mladé][1 Mladého][2 Mladí][2 Mladú][12 Mladý][1 Mladých][1 Mladým][2 mlada][1 mlade][24 mladej][1 mlado][7 mladom][3 mladou][14 mladá][26 mladé][20 mladého][5 mladému][16 mladí][7 mladú][45 mladý][63 mladých][6 mladým][6 mladými][9 mladším][11 najmladším]
n: [3 Mladej][30 Mladá][19 Mladé][2 Mladí][12 Mladý][1 Mláďa][19 Mláďatá][2 mlada][1 mlade][24 mladej][1 mlado][7 mladom][3 mladou][14 mladá][26 mladé][16 mladí][45 mladý][3 mládí][8 mláďa][3 mláďat][23 mláďatá][1 mláďať][1 mláďaťa]
pas >< 19
o: [1 PASO][119 Pas][1 Paso][1 Passes][1 Pasú][5 pas][1 pasy][1 pasú]
n: [1 PASO][119 Pas][1 Paso][1 Passes][5 Paša][2 Pás][1 Páse][2 Pásy][5 pas][1 pasy][1 pasátoch][1 paša][1 paše][1 paši][41 pás][1 pása][1 pásami][18 páse][8 pásmi][3 pásoch][11 pásom][5 pásy][1 páší]
platn >< 18
o: [1 Platná][1 Platné][1 Platným][28 platne][2 platnej][4 platni][5 platnom][1 platnou][5 platná][11 platné][2 platného][5 platní][1 platnú][2 platný][11 platných][3 platným][3 platnými]
n: [1 Platná][1 Platné][3 Platňa][1 Platňová][28 platne][2 platnej][4 platni][5 platnom][1 platnou][5 platná][11 platné][5 platní][2 platný][25 platňa][1 platňami][10 platňou][1 platňovej][1 platňové][7 plátna][6 plátne][6 plátno][3 plátnom][1 plátnová]
svat >< 20
o: [1 Svatom][5 Svatá][4 Svaté][10 Svatého][12 Svatý][2 svaté][2 svatého][1 svatý][1 svatých]
n: [1 Svatom][1 Svatoš][5 Svatá][4 Svaté][12 Svatý][1 Sváti][39 Svätej][1 Sväto][10 Svätom][1 Svätou][43 Svätá][6 Sväté][3 Svätí][94 Svätý][2 svaté][1 svatý][51 svätej][2 svätom][5 svätou][8 svätá][4 sväté][2 svätí][21 svätý]
velk >< 21
o: [1 Velkom][1 Velkou][13 Velká][4 Velké][2 Velkého][15 Velký][1 Velkým][2 velkou][2 velké][1 velkého]
n: [1 Velkom][1 Velkou][13 Velká][4 Velké][15 Velký][1 Veľk][1 Veľka][151 Veľkej][18 Veľkom][16 Veľkou][163 Veľká][83 Veľké][1 Veľkí][180 Veľký][2 velkou][2 velké][136 veľkej][1 veľko][38 veľkom][1 veľkos][57 veľkou][96 veľká][306 veľké][2 veľkí][223 veľký]
High-Frequency Words
High-frequency words are those that occur 1,000 times or more in the sample. These are more likely to be very common words, so it’s important to look at cases where a high-frequency word was added or removed from a group, to make sure the change isn’t going to cause problems.
[For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]
Again, I looked for groups with mixed losses and gains that involve high-frequency words. The five examples are below.
Lost and gained members are bolded.
byt >< 7
o: [1 Bytom][1 Byty][15 byt][9 byte][7 bytmi][2 bytoch][4 bytom][5 bytové][12 byty]
n: [1 Bytom][1 Byty][4 Byť][1 Být][15 byt][9 byte][7 bytmi][2 bytoch][4 bytom][12 byty][1931 byť][4 být][1 býti][2 býť]
ktor >< 7
o: [2 Ktorá][1 Ktoré][1 Ktorý][746 ktorej][524 ktorom][117 ktorou][3655 ktorá][3353 ktoré][580 ktorého][80 ktorému][672 ktorí][1063 ktorú][3250 ktorý][692 ktorých][255 ktorým][104 ktorými]
n: [2 Ktorá][1 Ktoré][1 Ktorý][746 ktorej][524 ktorom][117 ktorou][3655 ktorá][3353 ktoré][672 ktorí][1 ktoróm][3250 ktorý]
ma >< 8
o: [5 MA][16 Ma][2 Makes][1 Manes][1 Mates][1 mA][61 ma][1 makes][1 malém][1 mares]
n: [5 MA][16 Ma][4 Maheš][1 Mahéš][2 Makes][2 Mamés][1 Manes][1 Mareš][1 Mates][476 Má][2 Mánes][1 mA][61 ma][1 makes][1 mares][3428 má]
neskor >< 8
o: [1 Neskoro][1 Neskorším][13 neskorej][13 neskoro][3 neskorom][1 neskorou][1 neskory][2 neskoré][10 neskorého][7 neskorý][6 neskorých][9 neskorším]
n: [1 Najneskôr][1 Neskoro][270 Neskôr][9 najneskôr][13 neskorej][13 neskoro][3 neskorom][1 neskorou][1 neskory][2 neskoré][7 neskorý][1075 neskôr]
ponuk >< 5
o: [1 Ponuka][9 ponuka][9 ponuke][3 ponukou][20 ponuky][1 ponukách]
n: [1 Ponuka][8 Ponúka][9 ponuka][9 ponuke][3 ponukou][20 ponuky][3 ponúk][2285 ponúka][5 ponúkať]
Wiktionary Notes
The Wiktionary sample is generally similar in terms of words lost and gained from stemming groups. The most obvious difference other than the smaller size of the sample is the presence of pronunciations in IPA; e.g., slɔvniː stems with slovné because it gets folded to slovni before stemming. These generally aren't changed by the folding changes.
Interlude: Some Stemmer Struggles
Ugh. While looking into Option 2—Stem Before Folding—I ran into some unexpected changes.
I noticed that francúzskeho and Francúzského got split up. That makes sense, since the -ého suffix is stripped, but not -eho. However, the numbers were backwards from what I expected: 62 francúzskeho, but only 1 Francúzského, making it look like Francúzského was the typo. After a little research, I discovered that some adjectives take the -eho suffix instead of the -ého suffix, and the stemmer doesn't strip it.
I pulled some Slovak declension and conjugation tables from English Wiktionary and discovered that a lot of Slovak suffixes are not handled by the stemmer, including some unaccented varieties. There are a lot of potential reasons for this, like some suffixes being too ambiguous. For example, in English -ing can be a verbal suffix (hoping, talking, thinking) or just the way a word ends (ceiling, sibling, lightning), which makes stripping -ing harder than it could be. Another likely source of the problem is that -ého could be more common than -eho—though a very rough search on Slovak Wikipedia gives a similar number of instances.
We didn't detect this when looking at the stemmer because the process doesn't really focus on false negatives. As long as everything grouped together is supposed to be together (true positives), it's "right". Plus, you can't always infer that a missing form is a stemmer deficiency. For example, if you have hope, hoped, and hoping together, but not hopes, is that because hopes isn't processed properly, or because it isn't in your sample?
In the future when looking at stemmers, I'll try to pull some relevant data from Wiktionary inflection tables and spend some time looking for false negatives, too.
I've gathered a few (probably unrepresentative) examples of Slovak adjectives, nouns, and verbs with inflection tables on English Wiktionary, and run all the inflections through the stemmer. The stems are collected on a sub-page for future reference. The first few are perfect—every form has the same stem—but some of the later ones are all over the place.
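A false-negative check of this kind could be sketched roughly as follows. This is only an illustration: `toy_stem` is a hypothetical stand-in for the real Slovak stemmer (it strips just -ého), and the inflection table is a hand-made two-form example built from the words mentioned above, not data pulled from Wiktionary.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stand-in for the real stemmer: strips only -ého."""
    word = word.lower()
    return word[:-3] if word.endswith("ého") else word

def find_split_lemmas(tables, stem):
    """Return the lemmas whose inflected forms do not all share one stem."""
    split = {}
    for lemma, forms in tables.items():
        by_stem = defaultdict(list)
        for form in forms:
            by_stem[stem(form)].append(form)
        if len(by_stem) > 1:  # more than one stem: a possible false negative
            split[lemma] = dict(by_stem)
    return split

# Illustrative table only; the lemma key is just a label for the two forms.
tables = {"francúzsky": ["francúzskeho", "Francúzského"]}
report = find_split_lemmas(tables, toy_stem)
```

Any lemma that shows up in `report` has forms landing on different stems, which is a candidate for review rather than proof of a stemmer bug.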
For now, I'll open a Phab ticket (T227924) and leave improving the stemmer for a future project.
Option 2: Stem Before Folding
The most obvious solution to the problem of the unexpectedly large number of lost tokens is to first stem words with diacritics, then fold and remove the diacritics.
One potential problem with this approach is that suffixes that always include diacritics won't be removed by the stemmer if the diacritics are missing—leading to false negatives. Option 3: Modify the Stemmer, below, could address that, though it is possible that it could introduce new problems if it results in the stemmer being too aggressive, or if suffixes that differ only in diacritics should be treated differently.
Some positive aspects of stemming first should include:
- We won't lose tokens with diacritical suffixes (and forms involving čt will be treated correctly), which seems desirable.
- Many of the merged groups will still merge, because their stems will merge after stemming.
- e.g., Amalia will stem to amali, while Amália will stem to amáli, and then be folded to amali, so Amalia and Amália will still be grouped together—for better or worse.
- We won't get false positives on suffix removal, so -áta won't be treated as a suffix.
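The difference between the two orderings can be sketched with a toy pipeline. Everything here is an assumption made for illustration: `toy_stem` knows only two suffixes (-ého and -a) and is not the real Slovak stemmer, and `fold` approximates ICU folding by stripping combining marks after NFD decomposition, which misses characters such as ł and æ that real ICU folding also handles.

```python
import unicodedata

# Hypothetical two-suffix list; the real stemmer's rules are far larger.
TOY_SUFFIXES = ("ého", "a")

def fold(token):
    """Strip combining marks; a rough stand-in for ICU folding."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def toy_stem(token):
    """Lowercase, then strip the first matching toy suffix."""
    token = unicodedata.normalize("NFC", token).lower()
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def stem_then_fold(token):
    """Option 2 ordering: stem with diacritics intact, then fold."""
    return fold(toy_stem(token))

def fold_then_stem(token):
    """Fold-first ordering: diacritics are gone before the stemmer runs."""
    return toy_stem(fold(token))
```

With these toy rules, `stem_then_fold` sends both Amalia and Amália to amali, so they still merge. It also keeps francúzskeho and Francúzského apart (the toy stemmer strips -ého but not -eho), whereas `fold_then_stem` groups them, because after folding neither ends in a suffix the toy stemmer recognizes.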
The first good sign from stemming before folding is that the total number of distinct post-analysis types (unique words in the sample) decreased, from 131,091 to 125,638—as opposed to increasing when we folded before stemming.
As before, there were a lot of new collisions—words that would be indexed the same: 11,594 pre-analysis types (5.208% of pre-analysis types) / 161,318 tokens (9.838% of tokens) were added to 4,167 groups (3.179% of post-analysis types), affecting a total of 24,699 pre-analysis types (11.096% of pre-analysis types) in those groups.
That's not quite as many new collisions as when folding first, but the impact is very similar: 5% of distinct words and almost 10% of all words got folded together with something new.
There were very few splits: 148 pre-analysis types (0.066% of pre-analysis types) / 294 tokens (0.018% of tokens) were lost from 137 groups (0.105% of post-analysis types), affecting a total of 776 pre-analysis types (0.349% of pre-analysis types) in those groups.
That's much less than before: less than 0.1% of distinct words and less than 0.02% of all words will no longer be indexed together.
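As a quick sanity check on the group percentages above, here is a small sketch. It assumes that "% of post-analysis types" is measured against the 131,091 baseline count; that is my reading of the numbers, and the arithmetic appears to bear it out.

```python
def pct(part, whole):
    """Percentage of whole, rounded to three decimals like the stats above."""
    return round(100 * part / whole, 3)

baseline_types = 131_091    # distinct post-analysis types before the change
stem_first_types = 125_638  # distinct post-analysis types when stemming first

groups_gaining = pct(4_167, baseline_types)  # groups that gained members
groups_losing = pct(137, baseline_types)     # groups that lost members
type_reduction = pct(baseline_types - stem_first_types, baseline_types)
```

Under that assumption, `groups_gaining` and `groups_losing` reproduce the 3.179% and 0.105% figures, and `type_reduction` shows the drop in distinct post-analysis types to be a little over 4%.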
So if the new collisions are good, then this arrangement is probably doing much of the same good work as folding first, without the bad side effects.
The Wiktionary sample had roughly similar collision stats: about 3.8% of both distinct words and total words got folded with other words. There were more splits than in the fold-first test, with 1.8% of distinct words and about 0.9% of all words no longer being indexed with something they were indexed with before.
Speaker Review: Overview
The core task of the speaker doing the review is to decide whether words are being properly grouped together for search, and whether any changes to those groupings are better or worse. When words are grouped together, it means that searching for one word in the group will find all of the other words in the group, too. With the current English language processing, for example, searching for any of the words hope, hopes, hoped, hoping, hope’s, hoper, or hopers will find all of the others. (Note that the results in each case will be ranked differently because exact matches are preferred.)
In addition to listing the words that are grouped together, we also include the number of times each word appears in the text sample. This helps us estimate the relative importance of potential errors. For example, if two words are improperly grouped together, but the words are very rare, that’s not as bad as if they were very common.
[For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]
When we make less extreme modifications to the language processing done for search—like introducing diacritic folding—we can usually look more meaningfully at groups before and after the modification to assess the effect of the group changes.
Old-vs-new groups are presented as follows:
hope >> 2
o: [152 Hope][23 Hopes][1208 hope][346 hoped][488 hopes]
n: [152 Hope][1 Hopē][23 Hopes][1208 hope][346 hoped][488 hopes][2 ĥợṕễ]
The first line shows the stem (hope), a pair of arrow heads (>>) indicating whether words were gained or lost by the group, and a number indicating how many gains and/or losses there were (2).
The stem is the form that all of the other words were reduced to. The stem does not have to be the actual root form of the word or even a word at all. However, seeing the stem sometimes makes it easier to understand what the stemmer or other parts of the analysis were trying to do.
In terms of gains and losses:
- >> indicates that words were gained by the group
- << indicates that words were lost from the group
- >< indicates that there were both losses and gains
The o: section (for “old”) shows all the words that shared a stem before the change. The n: section (for “new”) shows all the words that shared a stem after the change. Sharing a stem means that searching for any of the words will find all of the others. (Note that while searching for each word in a group will give the same results, the results could be in a very different order, in particular because exact matches are given more weight.)
The numbers with each word (e.g., [1208 hope] and [1 Hopē]) indicate how many times a given word appears in the text sample. In this case, hope is over a thousand times more common than Hopē. Rare words that are not great matches with the rest of a group are less of a problem because they don’t occur very often. When you search for them, exact matching will usually bring them to the top of the results list.
Problems can arise when more common words are grouped together incorrectly. For example, a grouping like [1208 hope][747 hop] would be worse, because these words don’t belong together, and both words are common.
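For anyone who wants to work with these listings programmatically, the old-vs-new notation can be read with a few lines of code. This is a minimal sketch based only on the format described above, not the tool that generated the listings; the function names are my own, and the "==" marker for "no change" is a hypothetical addition.

```python
import re

# Matches one "[count word]" entry, e.g. "[1208 hope]".
ENTRY = re.compile(r"\[(\d+) ([^\]]+)\]")

def parse_members(text):
    """Turn '[152 Hope][23 Hopes]' into {'Hope': 152, 'Hopes': 23}."""
    return {word: int(count) for count, word in ENTRY.findall(text)}

def compare_groups(old_text, new_text):
    """Return (marker, gained, lost) for an old/new pair of group listings."""
    old, new = parse_members(old_text), parse_members(new_text)
    gained = {w: c for w, c in new.items() if w not in old}
    lost = {w: c for w, c in old.items() if w not in new}
    if gained and lost:
        marker = "><"
    elif gained:
        marker = ">>"
    elif lost:
        marker = "<<"
    else:
        marker = "=="  # hypothetical marker: nothing changed
    return marker, gained, lost

OLD = "[152 Hope][23 Hopes][1208 hope][346 hoped][488 hopes]"
NEW = "[152 Hope][1 Hopē][23 Hopes][1208 hope][346 hoped][488 hopes][2 ĥợṕễ]"
marker, gained, lost = compare_groups(OLD, NEW)
```

For the hope example, this yields the >> marker with two gained words and none lost, and the counts attached to the gained words give a rough sense of how much each change matters.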
Speaker Review: Folding Groups that Lost Members
The question for speakers of Slovak reviewing the Random Sample is this: would it be bad if searching for the "lost" words no longer found the remaining words, and vice versa?
Random Sample
Looking at a random sample of the word groups is the best way to see what the typical effects of a modification are. If the majority of changes are good, and any less desirable changes are understandable and acceptable, then overall the modification is good.
[For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]
Below is a sample of 25 randomly selected stemming groups (words that would all be indexed together) that lost members as a result of folding Slovak diacritical characters after stemming. (These are from the Wikipedia sample.)
The lost terms almost all seem to have the same pattern: a diacritic on one of the last few letters in the word that blocks the stemmer from removing what otherwise looks like a Slovak suffix. (An exception is Gæa, which is too short to be stemmed, while the folded version, Gaea, is not.)
Some of the lost terms look to be incorrectly lost, to me, but possibly unavoidably so. Jarząbcza, Jarząbczej and Jarząbczy look to be inflected forms of the name Jarząbczą, though the final ą blocks the citation form of the name from being stemmed.
Key:
- bechyn << 1
- bechyn indicates that all of these words were stemmed to bechyn. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
- << 1 indicates that from "old" to "new", this stemming group lost 1 member.
- o: — the "old" group, in this case, the current behavior
- n: — the "new" group, in this case, with Slovak letters folded after stemming
- [19 Bechyně] — Bechyně occurs 19 times in our sample (of 10K articles)
Lost members are bolded.
bechyn << 1
o: [4 Bechyni][19 Bechyně]
n: [4 Bechyni]
desk << 1
o: [1 Deskový][2 deska][7 dešti][1 deště]
n: [1 Deskový][2 deska][7 dešti]
gae << 1
o: [1 GAE][2 Gaea][2 Gæa]
n: [1 GAE][2 Gaea]
gyongy << 1
o: [1 Gyöngyi][2 Gyöngyös]
n: [1 Gyöngyi]
issar << 1
o: [1 Issari][1 Issarlès]
n: [1 Issari]
jarzabcz << 1
o: [3 Jarząbcza][2 Jarząbczej][2 Jarząbczy][2 Jarząbczą]
n: [3 Jarząbcza][2 Jarząbczej][2 Jarząbczy]
jesk << 2
o: [2 Jesko][1 Ještě][2 ještě]
n: [2 Jesko]
kart << 1
o: [1 Karta][2 Kartová][1 Kartové][17 karta][5 kartami][3 karte][2 karti][1 kartiny][9 kartou][1 kartovej][4 kartová][5 kartové][2 kartových][27 karty][2 kartách][1 kartą]
n: [1 Karta][2 Kartová][1 Kartové][17 karta][5 kartami][3 karte][2 karti][1 kartiny][9 kartou][1 kartovej][4 kartová][5 kartové][2 kartových][27 karty][2 kartách]
kork << 1
o: [6 Korçë][1 korkových]
n: [1 korkových]
lau << 1
o: [1 Lau][1 Laua][5 Lauzès]
n: [1 Lau][1 Laua]
maneth << 3
o: [2 Manetho][1 Manethos][1 Manethovi][1 manetʰō][1 maˈnetʰō][1 maˈnetʰōs]
n: [2 Manetho][1 Manethos][1 Manethovi]
melk << 1
o: [1 Melk][1 Melka][1 mělčině]
n: [1 Melk][1 Melka]
mu << 1
o: [3 MU][8 Mu][3 Mureș][1 Musím][625 mu][3 musím]
n: [3 MU][8 Mu][1 Musím][625 mu][3 musím]
nasz << 1
o: [1 Nasza][1 naszą]
n: [1 Nasza]
national << 2
o: [2 NATIONAL][84 National][1 Nationala][6 Nationale][1 Națională][11 national][4 nationale][1 națională]
n: [2 NATIONAL][84 National][1 Nationala][6 Nationale][11 national][4 nationale]
nestl << 1
o: [1 Nestle][1 Nestlé][1 ˈnɛstlə]
n: [1 Nestle][1 Nestlé]
niccol << 1
o: [2 Niccola][1 Niccolo][8 Niccolò]
n: [2 Niccola][1 Niccolo]
nicol << 1
o: [4 Nicola][8 Nicole][2 Nicolò]
n: [4 Nicola][8 Nicole]
paran << 1
o: [15 Paraná][1 Paranã]
n: [15 Paraná]
sabra << 1
o: [1 Sabrazes][1 Sabrazès]
n: [1 Sabrazes]
vor << 2
o: [1 VOR][1 Vorë][1 Vőrös]
n: [1 VOR]
vrchov << 1
o: [2 Vrchovinami][59 vrchovina][19 vrchovine][2 vrchovinou][38 vrchoviny][1 vrchovině]
n: [2 Vrchovinami][59 vrchovina][19 vrchovine][2 vrchovinou][38 vrchoviny]
vresovisk << 1
o: [1 vresoviskového][2 vresoviská][2 vřesoviště]
n: [1 vresoviskového][2 vresoviská]
want << 1
o: [18 Want][1 Wantą]
n: [18 Want]
zem << 2
o: [2 ZEM][2 ZEMÍCH][72 Zem][157 Zeme][66 Zemi][13 Zemou][1 Země][76 zem][47 zeme][43 zemi][8 zemou][7 zemí][1 zemích][6 země]
n: [2 ZEM][2 ZEMÍCH][72 Zem][157 Zeme][66 Zemi][13 Zemou][76 zem][47 zeme][43 zemi][8 zemou][7 zemí][1 zemích]
High-Impact Groups
There are no stemming groups that lost 10 or more members in either the Wikipedia or Wiktionary samples.
High-Frequency Words
There are no high-frequency words (> 1000 occurrences) lost from any groups in either the Wikipedia or Wiktionary samples.
Speaker Review: Folding Groups that Gained Members
The question for speakers of Slovak reviewing these sections (Random Sample, High-Impact Groups, and High-Frequency Words) is this: would it be bad if searching for the "gained" words now found the other words, and vice versa?
Random Sample
Looking at a random sample of the word groups is the best way to see what the typical effects of a modification are. If the majority of changes are good, and any less desirable changes are understandable and acceptable, then overall the modification is good.
[For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]
Below is a sample of 25 randomly selected stemming groups (words that would all be indexed together) that gained members as a result of folding Slovak diacritical characters. (These are from the Wikipedia sample.)
Key:
- alternativ >> 7
- alternativ indicates that all of these words were stemmed to alternativ. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
- >> 7 indicates that from "old" to "new", this stemming group gained 7 members.
- o: — the "old" group, in this case, the current behavior
- n: — the "new" group, in this case, with Slovak letters folded after stemming
- [5 Alternative] — Alternative occurs 5 times in our sample (of 10K articles)
Note that when two groups merge, the group shown as "gaining" new members is always the one whose stem has no accents.
A lot of the changes here are the kinds we'd expect to see, with accented versions of words (especially names) being merged.
Some notes:
- For longer words that aren't names, it's very likely that the words are related. For example, it's hard to imagine that pohyblivosti and pohyblivosťou are not related, though whether searching for one should find the other is a different question (hence, speaker review).
Gained members are bolded.
alternativ >> 7
o: [5 Alternative][2 alternative]
n: [5 Alternative][1 Alternatívou][1 Alternatívy][2 alternative][1 alternatív][14 alternatíva][1 alternatívami][4 alternatívou][2 alternatívy]
ange >> 1
o: [76 Angeles]
n: [76 Angeles][1 Ángeles]
cedric >> 1
o: [1 Cedric]
n: [1 Cedric][1 Cédric]
dubravk >> 4
o: [1 Dubravko][1 dubravka]
n: [1 Dubravko][7 Dúbravka][3 Dúbravke][1 Dúbravkou][1 Dúbravky][1 dubravka]
emili >> 3
o: [8 Emilia][2 Emilio]
n: [8 Emilia][2 Emilio][5 Emília][1 Emílie][2 Émilie]
gerard >> 1
o: [14 Gerard][1 Gerarda][2 Gerardo][1 Gerardus][1 gerard]
n: [14 Gerard][1 Gerarda][2 Gerardo][1 Gerardus][16 Gérard][1 gerard]
gramatik >> 1
o: [3 Gramatika][1 Gramatiko][1 Gramatiky][7 gramatik][5 gramatika][5 gramatike][3 gramatikom][3 gramatikou][5 gramatiky]
n: [3 Gramatika][1 Gramatiko][1 Gramatiky][7 gramatik][5 gramatika][5 gramatike][3 gramatikom][3 gramatikou][5 gramatiky][1 gramatík]
hermely >> 1
o: [1 Hermelyová]
n: [1 Hermelyová][1 Hermélyová]
hors >> 8
o: [5 Horse][5 horse]
n: [5 Horse][2 Horší][5 horse][1 horšej][1 horšom][2 horší][2 horších][1 najhoršom][2 najhorší][3 najhorších]
hra >> 2
o: [56 Hra][135 hra][7 hrami][4 hraním]
n: [56 Hra][11 Hrá][135 hra][7 hrami][4 hraním][95 hrá]
kalabrijsk >> 5
o: [1 kalabrijské]
n: [1 Kalabríjska][1 Kalábrijský][1 Kalábrijských][1 kalabrijské][1 kalábrijskom][2 kalábrijský]
karol >> 1
o: [192 Karol][80 Karola][3 Karolina][15 Karolom][1 Karolova][9 Karolovej][9 Karolovi][1 Karoly]
n: [192 Karol][80 Karola][3 Karolina][15 Karolom][1 Karolova][9 Karolovej][9 Karolovi][1 Karoly][4 Károly]
kuril >> 1
o: [2 Kurilová][1 Kurily]
n: [2 Kurilová][1 Kurily][2 Kuríl]
magic >> 1
o: [34 Magic]
n: [34 Magic][1 Mágico]
mocnost >> 2
o: [2 mocnosti][6 mocností]
n: [2 mocnosti][6 mocností][2 mocnosť][1 mocnosťami]
pohyblivost >> 2
o: [2 pohyblivosti]
n: [2 pohyblivosti][3 pohyblivosť][1 pohyblivosťou]
prohask >> 1
o: [2 Prohaska]
n: [2 Prohaska][1 Proháska]
romk >> 3
o: [1 ROMKY][1 ROMky][2 Romka]
n: [1 ROMKY][1 ROMky][2 Romka][1 Rómka][1 rómčina][1 rómčine]
sob >> 1
o: [1 Sob][1 soba][1 sobe][1 soby]
n: [1 Sob][1 soba][1 sobe][1 soby][1 ŠOBA]
spalovac >> 6
o: [1 spalovacej]
n: [1 Spalovač][1 spalovacej][21 spaľovacej][1 spaľovacom][1 spaľovacou][4 spaľovací][3 spaľovacích]
studn >> 4
o: [1 Studna][5 studne][1 studni]
n: [1 Studna][1 Studňa][1 Studňou][5 studne][1 studni][7 studňa][2 studňou]
ukladani >> 1
o: [5 Ukladanie][5 ukladania][10 ukladanie]
n: [5 Ukladanie][5 ukladania][10 ukladanie][1 ukládanie]
util >> 1
o: [4 Utila]
n: [4 Utila][1 Útila]
vals >> 1
o: [1 Vals]
n: [1 Vals][1 Valšov]
volov >> 1
o: [4 volov][1 volovými]
n: [4 volov][1 volovými][1 vôľovej]
High-Impact Groups
High-impact groups are those with 10 or more changes to the number of distinct words in the group (gains >>, losses <<, or a mix ><). These groups are more likely to have problems because they are outliers.
Sometimes an apparent high-impact group is not really an outlier. This happens when a large group has the stem of a small group. For example, if a group of 10 words and a group of 2 words merge, you could see it as the group of 10 gaining 2 new members (which is not an outlier), or as the group of 2 gaining 10 new members (which looks like an outlier).
The most interesting cases are when two relatively large groups merge, or when more than two medium-sized groups merge—because then lots of potentially unrelated words are being grouped together.
[For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]
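The point above about apparent outliers can be made concrete with a small sketch. The word lists are placeholders, and `changes` and `is_high_impact` are hypothetical helper names; the threshold of 10 is the one defined above.

```python
def changes(old_group, new_group):
    """Number of distinct words gained plus lost between two groups."""
    return len(set(old_group) ^ set(new_group))

def is_high_impact(old_group, new_group, threshold=10):
    """True when the group changed by at least `threshold` distinct words."""
    return changes(old_group, new_group) >= threshold

# Placeholder words: a group of 10 and a group of 2 that end up merged.
big = {f"big{i}" for i in range(10)}
small = {"small0", "small1"}
merged = big | small

from_big_view = changes(big, merged)      # the group of 10 gains 2 words
from_small_view = changes(small, merged)  # the group of 2 gains 10 words
```

The same merge looks unremarkable from the large group's side and like an outlier from the small group's side, which is why the count alone does not say whether a high-impact group deserves attention.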
There were 202 groups with 10 or more additions, so I raised the threshold to 15 or more additions, which gave 60 groups. I've removed the groups where a large group with diacritics merged with one or two distinct words (after ignoring upper- and lowercase) that have a stem without diacritics. (Though I kept groups like greck where multiple stems were merged, in this case grečk-, gréck-, and gréčt-.)
The remaining 30 groups are shown below. These represent large groups with diacritics merging with medium to large groups without diacritics. The converse—large groups without diacritics merging with smaller groups with diacritics—is not represented. I can go looking for examples if anyone thinks they would be significantly different from the ones here or above.
One thing I noticed is that Czech ř gets folded to r, which presumably ends up merging Czech/Slovak cognates, which is probably not a bad thing.
Gained members are bolded.
byval >> 22 o: [1 ByVal][1 byvalá][1 byvalé][1 byvalý] n: [1 ByVal][1 Býval][4 Bývalá][2 Bývalé][4 Bývalí][9 Bývalý][1 byvalá] [1 byvalé][1 byvalý][27 býval][9 bývala][46 bývalej][17 bývali] [5 bývalo][20 bývalom][7 bývalou][60 bývalá][13 bývalé][53 bývalého] [5 bývalému][12 bývalí][4 bývalú][189 bývalý][63 bývalých] [15 bývalým][6 bývalými] cast >> 29 o: [2 Cast][2 Castles][6 Castres][1 caste][1 casti] n: [2 Cast][2 Castles][6 Castres][1 caste][1 casti][1 Časti][67 Často] [1 Častou][7 Častá][5 Časté][5 Častý][3 Častým][2 Častými] [37 Časť][1 častej][1171 časti][1 častich][541 často][1 častom] [3 častou][7 častá][39 časté][2 častého][240 častí][6 častý] [4 častých][14 častým][9 častými][1 často][1012 časť][18 časťami] [122 časťou][5 část][1 části] ciel >> 19 o: [3 Ciel][2 Ciele][5 ciel][1 ciela][63 ciele][2 cieli] n: [3 Ciel][2 Ciele][4 Cieľ][33 Cieľom][1 Cieľovou][5 ciel][1 ciela][63 ciele] [2 cieli][52 cieľ][16 cieľa][4 cieľmi][2 cieľoch][204 cieľom][34 cieľov] [4 cieľovej][1 cieľovou][2 cieľová][1 cieľové][4 cieľového] [1 cieľovú][4 cieľový][3 cieľových][1 cieľovým][1 cieľovými] drah >> 15 o: [1 Drahý][1 drahej][1 draho][5 drahá][7 drahé][2 drahú][6 drahý] [3 drahých][1 drahými][3 najdrahším] n: [1 Drahý][6 Dráha][1 Dráhovej][1 Dráhovou][5 Dráhy][1 drahej][1 draho] [5 drahá][7 drahé][2 drahú][6 drahý][3 drahých][1 drahými][15 dráh] [36 dráha][3 dráhami][50 dráhe][17 dráhou][6 dráhovej][2 dráhové] [1 dráhovú][1 dráhových][117 dráhy][1 dráze][3 najdrahším] elektrick >> 15 o: [1 Elektrickej][10 Elektrická][3 Elektrické][2 Elektrickú][19 Elektrický] [45 elektrickej][3 elektrickom][3 elektrickou][9 elektricky][31 elektrická] [51 elektrické][37 elektrického][5 elektrickému][17 elektrickú] [36 elektrický][19 elektrických][11 elektrickým][1 elektrickými] n: [1 Elektrickej][10 Elektrická][3 Elektrické][2 Elektrickú][19 Elektrický] [3 Električka][3 Električková][1 Električky][45 elektrickej][3 elektrickom] [3 elektrickou][9 elektricky][31 elektrická][51 elektrické][37 
elektrického] [5 elektrickému][17 elektrickú][36 elektrický][19 elektrických] [11 elektrickým][1 elektrickými][8 električka][2 električkami] [2 električkou][6 električkovej][12 električková][3 električkové] [4 električkového][2 električkovú][5 električkový][2 električkových] [21 električky][1 električkách] greck >> 23 o: [1 greckej] n: [1 Grečka][2 Grečko][95 Grécka][13 Grécke][5 Gréckej][4 Grécki] [56 Grécko][10 Gréckom][1 Gréckou][8 Grécky][1 Gréčtiny][1 greckej] [25 grécka][51 grécke][137 gréckej][14 grécki][9 grécko][7 gréckom] [4 gréckou][170 grécky][9 gréčtina][9 gréčtine][4 gréčtinou] [13 gréčtiny] katolick >> 15 o: [1 Katolickom][2 Katolický][1 katolicko][1 katolické] n: [1 Katolickom][2 Katolický][19 Katolícka][5 Katolícke][12 Katolíckej] [2 Katolícki][3 Katolíckou][8 Katolícky][1 katolicko][1 katolické] [21 katolícka][14 katolícke][46 katolíckej][1 katolícki][2 katolíckom] [7 katolíckou][34 katolícky][1 katolíckého][1 katolíčky] kral >> 34 o: [1 Kral][1 Krali] n: [1 KRÁĽ][1 Kral][1 Krali][18 Král][1 Králi][4 Králova][5 Královo] [1 Královou][29 Králové][1 Králového][57 Kráľ][9 Kráľa][2 Kráľom] [4 Kráľov][7 Kráľova][3 Kráľovej][1 Kráľovi][1 Kráľovo][1 Kráľovou] [12 Kráľová][1 kraľ][2 král][1 krála][3 krále][12 králi][1 králom] [1 králov][288 kráľ][309 kráľa][76 kráľom][51 kráľov][3 kráľova] [1 kráľovej][42 kráľovi][1 kráľovo][1 kráľových] kriz >> 33 o: [1 Kriza][2 krizy] n: [1 Kriza][5 Kríza][7 Kríž][10 Kríža][1 Krížom][2 Krížovej] [2 Krížová][1 Krížové][5 Kříž][2 krizy][2 kríz][17 kríza][6 krízou] [1 krízovej][1 krízové][2 krízového][1 krízovú][1 krízový] [2 krízových][28 krízy][36 kríž][26 kríža][1 krížmi][1 krížoch] [11 krížom][2 krížov][8 krížovej][7 krížovou][3 krížová] [2 krížové][2 krížového][2 krížovú][2 krížový][1 krížovým] [2 krížovými] lav >> 23 o: [1 lava] n: [1 Láv][2 Láva][1 Lávy][1 lava][1 láv][2 láva][2 lávami][2 láve] [8 lávové][22 lávy][1 Ľavej][1 Ľavom][1 Ľavá][3 Ľavé][1 Ľavý] [44 ľavej][47 ľavom][1 ľavou][2 ľavá][4 ľavé][7 ľavého][7 
ľavú] [8 ľavý][2 ľavým] minut >> 15 o: [1 Minute][1 Minutos][1 minute][3 minutus][1 minutých] n: [1 Minute][1 Minutos][1 minute][3 minutus][1 minutých][107 minút][6 minúta] [12 minúte][1 minútovej][2 minútovom][1 minútovou][2 minútová] [1 minútové][2 minútového][1 minútovú][3 minútový][1 minútových] [2 minútovými][24 minúty][1 minúť] nas >> 16 o: [77 NASA][1 NaS][1 Nas][1 Naso][4 nas][1 nasi][1 naso] n: [77 NASA][1 NaS][1 Nas][1 Naso][11 Naša][8 Naše][1 Našej][5 Naši][1 Našou] [9 Náš][4 nas][1 nasi][1 naso][10 naša][19 naše][30 našej][4 naši] [37 našich][15 našom][2 našou][1 naší][84 nás][19 náš] pas >> 17 o: [1 PASO][119 Pas][1 Paso][1 Passes][1 Pasú][5 pas][1 pasy][1 pasú] n: [1 PASO][119 Pas][1 Paso][1 Passes][1 Pasú][5 Paša][2 Pás][1 Páse][2 Pásy] [5 pas][1 pasy][1 pasú][1 paša][1 paše][1 paši][41 pás][1 pása] [1 pásami][18 páse][8 pásmi][3 pásoch][11 pásom][5 pásové][5 pásy] [1 páší] plan >> 16 o: [12 Plan][3 Planina][1 plan][2 plane][1 planej][14 planina][6 planine] [3 planinou][15 planiny][1 plané] n: [12 Plan][3 Planina][11 Plán][2 Plánom][3 Plány][1 Pláň][1 plan][2 plane] [1 planej][14 planina][6 planine][3 planinou][15 planiny][1 plané][49 plán] [16 pláne][3 pláni][4 plánmi][5 plánoch][7 plánom][20 plánov][37 plány] [1 plání][3 pláň][2 pláňami][1 pláňou] polsk >> 31 o: [1 Polsce][11 Polska][6 Polski][7 Polskich][1 Polsko][1 polski][1 polsko] n: [1 Polsce][11 Polska][6 Polski][7 Polskich][1 Polsko][93 Poľska][20 Poľskej] [138 Poľsko][20 Poľskom][15 Poľská][11 Poľské][7 Poľského][2 Poľskí] [1 Poľskú][10 Poľský][1 Poľských][1 Poľským][1 polski][1 polsko] [49 poľskej][22 poľsko][12 poľskom][6 poľskou][37 poľsky][32 poľská] [22 poľské][39 poľského][2 poľskému][7 poľskí][5 poľskú][82 poľský] [21 poľských][16 poľským][2 poľskými][2 poľština][8 poľštine] [2 poľštinou][6 poľštiny] post >> 19 o: [41 Post][102 post][25 poste][4 postoch][1 postom][1 postov][2 posty] n: [41 Post][2 Pošta][1 Poštovou][2 Poštová][2 Poštové][1 Poštový] [102 post][25 poste][4 postoch][1 
postom][1 postov][2 posty][12 pošta] [5 poštou][2 poštovej][1 poštovou][4 poštová][1 poštové][2 poštového] [1 poštoví][4 poštový][4 poštových][15 pošty][1 pôst][1 pôsty] [3 pôšt] premier >> 18 o: [2 PREMIER][19 Premier][1 Premiera][1 Premierom][3 première] n: [2 PREMIER][19 Premier][1 Premiera][1 Premierom][1 Premiér][22 Premiéra] [2 Premiérom][1 Premiérový][3 première][33 premiér][26 premiéra] [14 premiére][10 premiérom][2 premiérou][1 premiérov][4 premiérovo] [1 premiérovom][1 premiérového][1 premiérovému][1 premiérovú] [3 premiérový][6 premiéry][1 premiér] prirodn >> 25 o: [1 prirodne] n: [1 Prírodne][3 Prírodnej][2 Prírodnou][46 Prírodná][4 Prírodné] [2 Prírodnú][5 Prírodný][1 Prírodných][4 Přírodní][1 prirodne] [1 prírodna][3 prírodne][32 prírodnej][1 prírodniny][1 prírodno] [5 prírodnom][19 prírodnou][192 prírodná][49 prírodné][29 prírodného] [14 prírodnú][15 prírodný][85 prírodných][6 prírodným][8 prírodnými] [17 přírodní] rimsk >> 18 o: [1 Rimského] n: [1 Rimského][7 Rímska][5 Rímske][24 Rímskej][2 Rímski][4 Rímsko] [1 Rímskom][3 Rímskou][10 Rímsky][18 rímska][24 rímske][45 rímskej] [12 rímski][12 rímsko][5 rímskom][2 rímskou][41 rímsky][3 Římské] [1 římskou] siet >> 18 o: [4 SIETe][6 Siete][103 siete][53 sieti][12 sietí] n: [4 SIETe][11 SIEŤ][6 Siete][8 Sieť][4 Sieťová][1 Sieťový][103 siete] [53 sieti][12 sietí][82 sieť][4 sieťami][16 sieťou][2 sieťovej] [1 sieťovom][1 sieťovou][1 sieťová][4 sieťové][1 sieťového] [2 sieťovú][1 sieťový][7 sieťových][3 sieťovým][1 sieťovými] skol >> 15 o: [2 skole][1 skoly] n: [2 skole][1 skoly][1 ŠKOLA][1 ŠKOLY][24 Škola][2 Škole][3 Školy][2 škol] [329 škola][2 školami][163 škole][12 školou][306 školy][46 školách] [2 školám][1 škoła][65 škôl] stal >> 18 o: [51 Stal][13 Stala][1 Stali][30 Stalin][8 Stalina][4 Stalinom][3 Stalinovi] [16 Stalo][862 stal][338 stala][1 stale][150 stali][173 stalo] n: [51 Stal][13 Stala][1 Stali][30 Stalin][8 Stalina][4 Stalinom][3 Stalinovi] [16 Stalo][3 Stál][4 Stála][16 Stále][3 Stálej][2 
Stáli][1 Stálo] [1 Stálou][2 Stály][862 stal][338 stala][1 stale][150 stali][173 stalo] [67 stál][41 stála][249 stále][7 stálej][28 stáli][19 stálo][2 stálom] [3 stálou][13 stály][1 Štál] stat >> 20 o: [1 Stat][48 State][7 Status][11 stat][11 state][2 stati][53 status][5 statí] n: [1 Stat][48 State][7 Status][11 stat][11 state][2 stati][53 status][5 statí] [71 stať][4 stát][12 stáť][2 sťatá][2 sťatí][1 sťatý][1 sťať] [14 Štát][1 Štátoch][3 Štátov][3 Štáty][1 štatom][1 štatov] [186 štát][124 štáte][37 štátmi][116 štátoch][66 štátom] [238 štátov][91 štáty] studi >> 27 o: [1 STUDIO][2 Studia][4 Studie][20 Studio][36 Studios][2 studie][7 studio] [1 studií] n: [1 STUDIO][2 Studia][4 Studie][20 Studio][36 Studios][2 studie][7 studio] [1 studií][12 Štúdia][3 Štúdie][6 Štúdio][2 Štúdiom][3 Štúdiové] [1 Štúdioví][2 Štúdiá][1 študiovým][140 štúdia][4 štúdiami] [57 štúdie][12 štúdii][18 štúdio][17 štúdiom][3 štúdiou] [3 štúdiovom][1 štúdiová][5 štúdiové][14 štúdiového][67 štúdiový] [10 štúdiových][4 štúdiovým][2 štúdiovými][16 štúdiá] [34 štúdiách][1 štúdiám][50 štúdií] styl >> 15 o: [2 Style][1 Stylos][3 styl][3 style] n: [2 Style][1 Stylos][3 styl][3 style][5 Štýl][1 Štýlom][1 Štýlové] [69 štýl][61 štýle][4 štýlmi][3 štýloch][20 štýlom][13 štýlov] [1 štýlovej][2 štýlovo][3 štýlové][1 štýlovú][1 štýlový] [13 štýly] svat >> 25 o: [1 Svatom][5 Svatá][4 Svaté][10 Svatého][12 Svatý][2 svaté][2 svatého] [1 svatý][1 svatých] n: [1 Svatom][5 Svatá][4 Svaté][10 Svatého][12 Svatý][1 Sváti][39 Svätej] [1 Sväto][10 Svätom][1 Svätou][43 Svätá][6 Sväté][47 Svätého] [7 Svätému][3 Svätí][3 Svätú][94 Svätý][2 svaté][2 svatého][1 svatý] [1 svatých][51 svätej][2 svätom][5 svätou][8 svätá][4 sväté] [140 svätého][5 svätému][2 svätí][3 svätú][21 svätý][59 svätých] [7 svätým][2 svätými] system >> 18 o: [1 SYSTEM][52 System][17 system][1 systema][1 systeme] n: [1 SYSTEM][52 System][1 Systémová][1 Systémové][10 Systémy][17 system] [1 systema][1 systeme][16 systémami][67 systéme][37 systémoch][70 
systémom] [86 systémov][3 systémovej][1 systémovom][2 systémová][9 systémové] [3 systémového][1 systémovú][4 systémový][6 systémových][102 systémy] [1 sýstéma] velk >> 30 o: [1 Velkom][1 Velkou][13 Velká][4 Velké][2 Velkého][15 Velký][1 Velkým] [2 velkou][2 velké][1 velkého] n: [1 Velkom][1 Velkou][13 Velká][4 Velké][2 Velkého][15 Velký][1 Velkým] [1 Veľk][1 Veľka][151 Veľkej][18 Veľkom][16 Veľkou][163 Veľká] [83 Veľké][70 Veľkého][3 Veľkému][1 Veľkí][31 Veľkú][180 Veľký] [15 Veľkých][25 Veľkým][3 Veľkými][2 velkou][2 velké][1 velkého] [136 veľkej][1 veľko][38 veľkom][1 veľkos][57 veľkou][96 veľká] [306 veľké][83 veľkého][15 veľkému][2 veľkí][120 veľkú][223 veľký] [137 veľkých][79 veľkým][36 veľkými] voln >> 19 o: [1 Volnin][1 Volný][2 volne][1 volné][1 volného][1 volný] n: [1 Volnin][1 Volný][1 Voľne][1 Voľná][8 Voľné][1 Voľným][2 volne] [1 volné][1 volného][1 volný][2 voľna][64 voľne][23 voľnej][3 voľno] [18 voľnom][3 voľnou][8 voľná][21 voľné][17 voľného][1 voľnému] [4 voľnú][19 voľný][22 voľných][38 voľným][1 voľnými] vyber >> 18 o: [11 vyberá][2 vyberú] n: [3 Výber][1 Výberová][1 Výberové][11 vyberá][2 vyberú][76 výber] [17 výbere][1 výbermi][9 výberom][1 výberov][2 výberovom][1 výberová] [1 výberové][1 výberovú][6 výberový][2 výberových][2 výberovým] [1 výberovými][1 výběr][1 výběrový]
High-Frequency Words
I also looked for high-frequency words that were added to a group.
High-frequency words are those that occur 1,000 times or more in the sample. These are more likely to be very common words, so it’s important to look at cases where a high-frequency word was added or removed from a group, to make sure the change isn’t going to cause problems.
[For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]
I dropped groups that are easily interpreted as a small number of words without diacritics being added to a larger group of words with diacritics, one of which is high-frequency. For example, 1 instance of clanok would be indexed with 1501 instances of článok, which isn't actually very interesting (and may just be a typo).
I kept groups where the group being added to had at least 3 different words in it, or at least one of the words had 10 or more instances. The remaining 12 groups with high-frequency words are below.
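The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for the analysis: the function names and the reading of "the group being added to" as the old (pre-change) group are assumptions.

```python
import re

# Matches one "[count word]" entry, e.g. "[84 AZ]".
TOKEN = re.compile(r"\[(\d+) ([^\]]+)\]")

def parse_members(text):
    """Turn '[84 AZ][1 Az]' into {'AZ': 84, 'Az': 1}."""
    return {word: int(count) for count, word in TOKEN.findall(text)}

def keep_group(old, new, high_freq=1000, min_words=3, min_count=10):
    """Hypothetical filter: keep a group if it gained a high-frequency word
    and the old group had >= 3 words or a word with >= 10 instances."""
    gained = {w: c for w, c in new.items() if w not in old}
    has_high_freq_gain = any(c >= high_freq for c in gained.values())
    old_is_substantial = (
        len(old) >= min_words or any(c >= min_count for c in old.values())
    )
    return has_high_freq_gain and old_is_substantial

# The "az" group from the list below is kept; the clanok example is dropped.
az_old = parse_members("[84 AZ][1 Az][6 az]")
az_new = parse_members("[84 AZ][1 Az][101 Až][6 az][2357 až]")
clanok_old = parse_members("[1 clanok]")
clanok_new = parse_members("[1 clanok][1501 článok]")
```

With these inputs, `keep_group(az_old, az_new)` is true and `keep_group(clanok_old, clanok_new)` is false, matching the decision described in the text.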
As before, the most interesting collisions (ignoring case) seem to be:
- co and čo
- kedy and keď
- su, sú, šu, and šú
Gained members are bolded.
az >> 2
o: [84 AZ][1 Az][6 az]
n: [84 AZ][1 Az][101 Až][6 az][2357 až]

byt >> 6
o: [1 Bytom][1 Byty][15 byt][9 byte][7 bytmi][2 bytoch][4 bytom][5 bytové][12 byty]
n: [1 Bytom][1 Byty][4 Byť][1 Být][15 byt][9 byte][7 bytmi][2 bytoch][4 bytom][5 bytové][12 byty][1931 byť][4 být][1 býti][2 býť]

cast >> 29
o: [2 Cast][2 Castles][6 Castres][1 caste][1 casti]
n: [2 Cast][2 Castles][6 Castres][1 caste][1 casti][1 Časti][67 Často][1 Častou][7 Častá][5 Časté][5 Častý][3 Častým][2 Častými][37 Časť][1 častej][1171 časti][1 častich][541 často][1 častom][3 častou][7 častá][39 časté][2 častého][240 častí][6 častý][4 častých][14 častým][9 častými][1 často][1012 časť][18 časťami][122 časťou][5 část][1 části]

co >> 3
o: [3 CO][28 Co][7 Comes][7 co][2 comes]
n: [3 CO][28 Co][7 Comes][26 Côtes][7 co][2 comes][50 Čo][1418 čo]

ked >> 2
o: [1 Kedy][1 ked][461 kedy]
n: [1 Kedy][297 Keď][1 ked][461 kedy][1084 keď]

ma >> 3
o: [5 MA][16 Ma][2 Makes][1 Manes][1 Mates][1 mA][61 ma][1 makes][1 malém][1 mares]
n: [5 MA][16 Ma][2 Makes][1 Manes][1 Mates][476 Má][2 Mánes][1 mA][61 ma][1 makes][1 malém][1 mares][3428 má]

neskor >> 4
o: [1 Neskoro][1 Neskorším][13 neskorej][13 neskoro][3 neskorom][1 neskorou][1 neskory][2 neskoré][10 neskorého][7 neskorý][6 neskorých][9 neskorším]
n: [1 Najneskôr][1 Neskoro][1 Neskorším][270 Neskôr][9 najneskôr][13 neskorej][13 neskoro][3 neskorom][1 neskorou][1 neskory][2 neskoré][10 neskorého][7 neskorý][6 neskorých][9 neskorším][1075 neskôr]

podl >> 3
o: [4 Podla][1 Podle][9 podla][3 podle][1 podlete]
n: [1 PODĽA][4 Podla][1 Podle][531 Podľa][9 podla][3 podle][1 podlete][1254 podľa]

ponuk >> 3
o: [1 Ponuka][9 ponuka][9 ponuke][3 ponukou][20 ponuky][1 ponukách]
n: [1 Ponuka][8 Ponúka][9 ponuka][9 ponuke][3 ponukou][20 ponuky][1 ponukách][3 ponúk][2285 ponúka]

region >> 9
o: [6 Region][1 Regione][2 Regionova]
n: [6 Region][1 Regione][2 Regionova][7 Región][1 Regióny][76 región][1859 regióne][3 regiónmi][13 regiónoch][9 regiónom][28 regiónov][11 regióny]

su >> 5
o: [6 SU][141 Su][2 Sü][9 su][3 sü]
n: [6 SU][141 Su][146 Sú][2 Sü][9 su][3855 sú][3 sü][19 ŠÚ][1 šu][3 šú]

ze >> 2
o: [4 Ze][14 ze]
n: [4 Ze][14 ze][4 Že][3268 že]
Speaker Review: Folding Groups that Lost and Gained (Mixed) Members
The question for speakers of Slovak reviewing these sections (Random Sample and High-Frequency Words) is this: would it be bad if searching for a word in one of the new groups found the other words in that new group, instead of the words in the old group? (That's clunky, but after looking separately at groups that lost and gained members, the idea should be clear enough.)
Random Sample
Looking at a random sample of the word groups is the best way to see what the typical effects of a modification are. If the majority of changes are good, and any less desirable changes are understandable and acceptable, then overall the modification is good.
[For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]
I don't see a lot of differences here, other than we happen to have both gains and losses applying at once. However, I'm including them in case there is something non-obvious.
Below is a sample of 10 randomly selected stemming groups (words that would all be indexed together) that both lost and gained members as a result of folding Slovak diacritical characters after stemming. (These are from the Wikipedia sample.)
Key:
- cest >< 5
- cest indicates that all of these words were stemmed to cest. The stem does not have to be the root form of the word or even a word at all, but seeing it sometimes makes it easier to understand what the stemmer did.
- >< 5 indicates that from "old" to "new", this stemming group lost some members and gained some members, and the total lost or gained is 5.
- o: — the "old" group, in this case, the current behavior
- n: — the "new" group, in this case, with Slovak letters folded after stemming
- [1 CESTA] — CESTA occurs 1 time in our sample (of 10K articles)
Lost and gained members are bolded.
cest >< 5
o: [1 CESTA][76 Cesta][1 Cestami][2 Ceste][12 Cestou][21 Cesty][283 cesta][20 cestami][118 ceste][2 cesto][94 cestou][177 cesty][24 cestách][2 cestám][1 cestě]
n: [1 CESTA][76 Cesta][1 Cestami][2 Ceste][12 Cestou][21 Cesty][283 cesta][20 cestami][118 ceste][2 cesto][94 cestou][177 cesty][24 cestách][2 cestám][3 Česť][1 česti][1 čestine][10 česť]

dob >< 2
o: [9 Doba][2 Dobové][2 dob][36 doba][1 dobami][317 dobe][18 dobou][8 dobové][213 doby][15 dobách][1 době]
n: [9 Doba][2 Dobové][2 dob][36 doba][1 dobami][317 dobe][18 dobou][8 dobové][213 doby][15 dobách][6 dôb]

hlav >< 2
o: [11 Hlava][1 Hlavina][2 Hlavou][1 Hlavový][4 Hlavy][26 hlava][2 hlavami][27 hlave][44 hlavou][2 hlavová][1 hlavovú][1 hlavový][70 hlavy][2 hlavách][1 hlavým][1 hlavě]
n: [11 Hlava][1 Hlavina][2 Hlavou][1 Hlavový][4 Hlavy][26 hlava][2 hlavami][27 hlave][44 hlavou][2 hlavová][1 hlavovú][1 hlavový][70 hlavy][2 hlavách][1 hlavým][37 hláv]

kop >< 2
o: [8 Kop][10 Kopa][1 Kopú][1 Kopę][2 kop][14 kopa][5 kope][1 kopom][2 kopou][17 kopy]
n: [8 Kop][10 Kopa][1 Kopú][2 kop][14 kopa][5 kope][1 kopom][2 kopou][17 kopy][1 kôp]

mat >< 7
o: [1 MAT][9 Mat][2 Mate][63 Matej][6 Mato][1 Matom][2 Matěj][5 mat]
n: [1 MAT][1 MAŤA][9 Mat][2 Mate][63 Matej][6 Mato][1 Matom][3 Mať][1 Maťo][5 mat][192 mať][1 máta][1 máte]

otrokyn >< 2
o: [1 Otrokyně][4 otrokyne]
n: [4 otrokyne][5 otrokyňa]

pohrebisk >< 2
o: [1 Pohrebisko][1 Pohrebiská][1 pohrebiska][6 pohrebisko][1 pohrebiskom][1 pohrebiská][1 pohřebiště]
n: [1 Pohrebisko][1 Pohrebiská][1 pohrebiska][6 pohrebisko][1 pohrebiskom][1 pohrebiská][3 pohrebísk]

sa >< 2
o: [14 SA][5 Sa][1 Sages][2 Sales][3 Savès][24614 sa]
n: [14 SA][5 Sa][1 Sages][2 Sales][1 Sá][24614 sa]

slavk >< 2
o: [4 Slavkov][1 Slavkova][6 Slavkove][1 Slavkově]
n: [4 Slavkov][1 Slavkova][6 Slavkove][1 Slávka]

sut >< 5
o: [2 sute][1 sutí][1 sutě]
n: [2 sute][1 sutí][2 suť][1 suťové][1 Šuta][1 Šuty]
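The notation described in the key can be unpacked mechanically. The sketch below is only an illustration of how an old/new pair maps to a marker and a change count; the function names are hypothetical, and the "==" marker for an unchanged group is an assumption that does not appear in these notes.

```python
import re

# Matches one "[count word]" entry, e.g. "[4 otrokyne]".
TOKEN = re.compile(r"\[(\d+) ([^\]]+)\]")

def parse_members(text):
    """Turn '[1 Otrokyně][4 otrokyne]' into {'Otrokyně': 1, 'otrokyne': 4}."""
    return {word: int(count) for count, word in TOKEN.findall(text)}

def diff_group(old_text, new_text):
    """Return (marker, lost, gained) for one stemming group."""
    old, new = parse_members(old_text), parse_members(new_text)
    lost = sorted(set(old) - set(new))
    gained = sorted(set(new) - set(old))
    if lost and gained:
        marker = "><"   # mixed: lost some members and gained some
    elif gained:
        marker = ">>"   # gains only
    elif lost:
        marker = "<<"   # losses only
    else:
        marker = "=="   # hypothetical marker for "no change"
    return marker, lost, gained

# The otrokyn group from the sample above.
marker, lost, gained = diff_group(
    "[1 Otrokyně][4 otrokyne]",
    "[4 otrokyne][5 otrokyňa]",
)
total_changes = len(lost) + len(gained)
```

For the otrokyn group this gives one lost word and one gained word, which is where the 2 in "otrokyn >< 2" comes from.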
High-Impact Groups
High-impact groups are those with 10 or more changes to the number of distinct words in the group (gains >>, losses <<, or a mix ><). These groups are more likely to have problems because they are outliers.
Sometimes an apparent high-impact group is not really an outlier. This happens when a large group has the stem of a small group. For example, if a group of 10 words and a group of 2 words merge, you could see it as the group of 10 gaining 2 new members (which is not an outlier), or as the group of 2 gaining 10 new members (which looks like an outlier).
The most interesting cases are when two relatively large groups merge, or when more than two medium-sized groups merge—because then lots of potentially unrelated words are being grouped together.
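The 10-and-2 merge described above can be made concrete with a small sketch. The member words are made-up placeholders and the helper name is hypothetical; only the threshold of 10 comes from the definition of a high-impact group.

```python
HIGH_IMPACT = 10  # threshold from the definition of a high-impact group

def changes_seen_by(group, merged):
    """Count distinct words gained or lost from this group's point of view."""
    return len(set(group) ^ set(merged))

# Placeholder members: a group of 10 words and a group of 2 words that merge.
big = {f"big{i}" for i in range(10)}
small = {"small0", "small1"}
merged = big | small

seen_by_big = changes_seen_by(big, merged)      # the big group gains 2 words
seen_by_small = changes_seen_by(small, merged)  # the small group gains 10 words

looks_high_impact_from_big = seen_by_big >= HIGH_IMPACT
looks_high_impact_from_small = seen_by_small >= HIGH_IMPACT
```

The same merge is unremarkable when counted from the larger group and crosses the threshold when counted from the smaller one, which is why an apparent high-impact group needs a second look before it is treated as an outlier.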
[For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]
There were 8 groups with 10 or more changes, which are shown below. Lost and gained members are bolded.
dom >< 12
o: [2 DOM][71 Dom][4 Doma][9 Dome][1 Domo][3 Domus][7 Domy][170 dom][58 doma][6 domami][55 dome][10 domoch][7 domom][78 domy][1 domě]
n: [2 DOM][71 Dom][4 Doma][9 Dome][1 Domo][3 Domus][7 Domy][7 Dóm][1 Dóma][2 Dóme][22 Dôme][170 dom][58 doma][6 domami][55 dome][10 domoch][7 domom][78 domy][9 dóm][1 dóma][4 dómami][3 dóme][1 dómoch][3 dómom][1 dómy]

kon >< 10
o: [4 Kon][4 Kone][3 Koná][1 Koně][16 kone][11 koni][43 koná][50 koní]
n: [4 Kon][4 Kone][3 Koná][1 Koňa][1 Kóňa][5 Kôň][16 kone][11 koni][43 koná][50 koní][19 koňa][3 koňmi][5 koňoch][4 koňom][3 koňovi][18 kôň]

lud >< 21
o: [1 Lud][1 Lude][1 Ludiès][1 Ludo][1 Ludus]
n: [1 Lud][1 Lude][1 Ludo][1 Ludus][1 luďom][1 ĽUDÍ][1 Ľud][1 Ľuda][1 Ľudo][4 Ľudové][1 Ľudí][1 Ľuďom][46 ľud][1 ľude][6 ľudi][1 ľudmi][7 ľudom][33 ľudové][3 ľudy][421 ľudí][42 ľuďmi][13 ľuďoch][25 ľuďom][1 ľuďí]

metod >< 10
o: [9 Metod][20 Metoda][1 Metodom][1 Metodov][1 Metodova][2 Metodovi][1 Metodových][1 Metoděj]
n: [9 Metod][20 Metoda][1 Metodom][1 Metodov][1 Metodova][2 Metodovi][1 Metodových][8 Metóda][1 Metódou][5 Metódy][34 metód][58 metóda][15 metódami][5 metóde][33 metódou][62 metódy]

roman >< 17
o: [1 ROMAN][66 Roman][12 Romana][1 Romani][5 Romano][5 Romanom][1 Romanov][1 Romanova][2 Romanovej][3 Romanovi][2 Romanus][1 Romany][1 Româna][3 roman][2 române][1 română]
n: [1 ROMAN][66 Roman][12 Romana][1 Romani][5 Romano][5 Romanom][1 Romanov][1 Romanova][2 Romanovej][3 Romanovi][2 Romanus][1 Romany][23 Román][1 Románi][1 Româna][3 roman][195 román][19 románe][4 románmi][4 románoch][11 románom][27 románov][9 románovej][1 románovou][5 románová][1 románové][2 románového][1 románový][2 románových][33 romány][2 române]

slovak >< 10
o: [1 Slovaci][15 Slovak][1 Slovakė][1 Slovači]
n: [1 Slovaci][15 Slovak][1 Slovači][47 Slováci][22 Slovák][3 Slováka][5 Slovákmi][2 Slovákoch][6 Slovákom][58 Slovákov][1 Slovákovi][1 slováci]

volb >< 10
o: [1 volba][1 volbě]
n: [5 Voľba][5 Voľby][1 volba][14 voľba][10 voľbami][15 voľbe][5 voľbou][58 voľby][117 voľbách][3 voľbám]

zbran >< 11
o: [6 Zbrane][144 zbrane][1 zbrani][78 zbraní][1 zbraně]
n: [6 Zbrane][9 Zbraň][144 zbrane][1 zbrani][78 zbraní][64 zbraň][16 zbraňami][11 zbraňou][2 zbraňová][2 zbraňové][1 zbraňového][1 zbraňový][10 zbraňových][1 zbraňovými]
High-Frequency Words
High-frequency words are those that occur 1,000 times or more in the sample. These are more likely to be very common words, so it’s important to look at cases where a high-frequency word was added or removed from a group, to make sure the change isn’t going to cause problems.
[For more details about speaker review of modifications to language processing for search, see the Speaker Review Notes.]
There are no high-frequency words (1,000 or more occurrences) in any groups that both lost and gained words in either the Wikipedia or the Wiktionary sample.
Wiktionary Notes
The Wiktionary sample is generally similar in terms of words lost and gained from stemming groups. The most obvious difference, other than the smaller size of the sample, is the presence of pronunciations in IPA. In particular, IPA transcriptions ending with ː (the vowel lengthening mark), such as slɔvniː, no longer get stemmed.
Speaker Review Summary
Update October 2019: Jetam2 looked over the samples here, and said, "When I compare the old and new there, it really seems that the ones we lost are rather foreign, versus the ones we gained are rather useful."
More details on specific groups, specific examples, and some minor concerns:
- In the random sample where words are lost from groups (the list starting with bechyn), the most common groups are low-frequency non-Slovak words or names. The primary reason for words to be lost from a group is that a diacritic near the end of the word blocks the stemmer from removing what looks like a suffix. So, -e is a valid suffix, but -ě is not a suffix, so Bechyně no longer gets stemmed to bechyn. Some related forms may not group together, but they are non-Slovak, so that’s okay. Overall, this is fine.
- In the random sample where words are added to groups (the list starting with alternativ), the groups are generally Slovak words and they all look okay. The only concern is Vals/Valšov, where Valšov gets -ov removed by the stemmer, giving valš, which gets folded to vals. This is the kind of thing that happens with diacritic folding, so it’s expected, and based on this sample, not what happens in the majority of cases. This is good!
- In the random sample where words are both added to and lost from groups (the list starting with cest), Czech cognates ending in -ě are lost because the stemmer doesn’t recognize them anymore. It would be okay if they were grouped with the Slovak cognate, but it’s okay if they are not. This group is basically a combination of the two previous groups: some related non-Slovak words are lost from the group because diacritics block stemming, and the additions are generally good. Overall, this is fine.
These three random groups are the most representative of the changes we will see, and they are generally good, so we are good to go.
- In the groups where high-frequency words were added (the list starting with az), the only case that might be a problem is kedy and keď, but they look to be etymologically related, high-frequency function words, and so not a huge problem. Overall, this is fine.
- In the groups with high-impact changes that both gained and lost words (the list starting with dom), most lost words are non-Slovak and most gained words are good additions. This is good!
In summary, folding after stemming is a net improvement, and we should implement the change to enable this version of the analysis chain for now, and follow up on improving the stemmer at a later time. (See T227924.)
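The Bechyně and Valšov examples from the review can be reproduced with a toy stem-then-fold pipeline. This is a sketch under loud assumptions: the two-suffix stemmer is a stand-in invented for illustration (not the real Slovak stemmer), and stripping combining marks after NFD decomposition only approximates ICU folding for these particular letters.

```python
import unicodedata

# Hypothetical suffix list covering only the two cases discussed in the text.
TOY_SUFFIXES = ("ov", "e")

def toy_stem(word):
    """Strip the first matching toy suffix, keeping at least a 3-letter stem."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def fold(word):
    """Approximate diacritic folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def analyze(word):
    """Lowercase, stem, then fold (folding after stemming)."""
    composed = unicodedata.normalize("NFC", word)
    return fold(toy_stem(composed.lower()))

results = {w: analyze(w) for w in ("Bechyně", "Bechyne", "Valšov", "Vals")}
```

Because the stemmer runs first, the final -ě in Bechyně is not recognized as a suffix, so the word ends up as bechyne rather than bechyn, while Valšov is stemmed to valš and then folded to vals, where it meets Vals.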
Next Steps
- DONE Modify the Slovak analysis chain to enable diacritic folding for Slovak diacritics. (See T235561.)
- Once that is merged and in production, re-index Slovak-language wikis. (See T235654.)
Option 3: Modify the Stemmer
THIS IS A PLACEHOLDER FOR POSSIBLE FUTURE WORK.
And, getting waaaaay ahead of myself, another option to consider if stemming before folding doesn't work is to modify the stemmer to include an option to work on words without diacritics. This could be a fair amount of work to minimize the number of inaccurate stems.
I did note above that the stemmer needs some additional suffixes added to it, which is a separate task. (See Phab ticket T227924.)