User:TJones (WMF)/Notes/Adding Ascii-Folding to French Wikipedia
August 2016. See TJones_(WMF)/Notes for other projects. (T142620) For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.
Summary
We want to add ascii-folding to French Wikipedia, so we thought we'd try it out "in the lab" and see how many extra indexing collisions it caused.
Highlights:
- The default French analysis chain unexpectedly does some ascii-folding already, after stemming.
- Unpacking the default French analysis chain per the Elasticsearch docs leads to different results, but most of the changes are desirable, and the effect size is very small.
- English and Italian, which have been similarly unpacked to add ascii-folding in the past, include a bit of extra tokenizing help for periods and underscores, which we may want to do for French as well, though it does violence to acronyms and may not work with BM25.
- Ascii-folding itself affects significantly more tokens in French than it does in English: 50 times as many (as a percentage) for a 50K-article corpus. This is not entirely a surprise, since many more accented characters are regularly used in French.
Introduction
The current configuration of the Elasticsearch text analysis for French Wikipedia uses the default French analysis chain, which includes handling elision (e.g., converting l'amour, d'amour, etc., to amour), stop words (usually small, ignorable non-content words that are dropped, like the, a, an, it, etc., in English, and au, de, il, etc., in French), and stemming, but no separate ascii-folding step. Unlike in English, the French stemmer expects accents and handles them fine. See my recent English write-up for a more detailed description of stemming and ascii-folding from an English-language point of view.
The lack of ascii-folding is causing problems for query terms like louÿs (see T141216). David suggested that we should enable asciifolding_preserve, which not only indexes the ascii-folded version of a term, but also preserves and indexes the original unfolded version. The point of this experiment is to make sure that not too much unexpected noise would be introduced by such a reconfiguration.
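For concreteness, Elasticsearch's asciifolding token filter takes a preserve_original option, which is what asciifolding_preserve amounts to. A minimal sketch, shown as a Python dict in the shape of index settings (the filter name is just a label):

  # Sketch of a folding filter that also keeps the unfolded token.
  # "asciifolding" and "preserve_original" are real Elasticsearch
  # options; the surrounding config is illustrative.
  asciifolding_preserve = {
      "type": "asciifolding",
      "preserve_original": True,  # index both "louÿs" and "louys"
  }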
Corpora Generation and Processing
I extracted two corpora of randomly selected articles from French Wikipedia, with 1K and 50K articles, respectively. The main point of the 1K corpus is to test code and do exploratory analysis. After extracting the corpora, I filtered out all HTML and other XML-like tags from the text.
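The tag filtering was nothing fancy; something along these lines (a sketch, not the exact script used) is enough to drop HTML and XML-like tags from the extracted text:

  import re

  def strip_tags(text):
      # Replace anything that looks like an opening or closing HTML/XML
      # tag with a space, so words on either side stay separated.
      return re.sub(r"</?[A-Za-z][^>]*>", " ", text)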
In order to add ascii-folding to the analysis chain, it was necessary to unpack the built-in French analyzer into its constituent parts so that the additional step could be added. (This had previously been done for English to add ascii-folding to the end, as we suggest doing here, and similarly for Italian.) The equivalent explicit analysis chain for French in Elasticsearch 2.3 is available in the Elasticsearch docs.
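Pieced together from those docs, the unpacked chain looks roughly like the sketch below (shown as Python-style settings; the filter and analyzer names are our own labels, and the ascii-folding step at the end is the proposed addition rather than part of the default chain):

  # Approximately the built-in French analyzer, unpacked into its
  # parts, with the proposed ascii-folding step appended at the end.
  french_analysis = {
      "filter": {
          "french_elision": {
              "type": "elision",
              "articles_case": True,
              "articles": ["l", "m", "t", "qu", "n", "s", "j", "d",
                           "c", "jusqu", "quoiqu", "lorsqu", "puisqu"],
          },
          "french_stop": {"type": "stop", "stopwords": "_french_"},
          "french_stemmer": {"type": "stemmer",
                             "language": "light_french"},
          "asciifolding_preserve": {
              "type": "asciifolding",
              "preserve_original": True,
          },
      },
      "analyzer": {
          "french_unpacked": {
              "tokenizer": "standard",
              "filter": ["french_elision", "lowercase", "french_stop",
                         "french_stemmer", "asciifolding_preserve"],
          },
      },
  }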
For each corpus, I planned to call the Elasticsearch analyzer in each configuration and note the results. In particular, we were looking for words that would now "collide", that is, would be indexed under the same analyzed form, when they hadn't collided before. There were a few unexpected bumps along the way.
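The harness for this is conceptually simple; here is a minimal sketch (assuming a local test index, here called frwiki_test, configured with the analyzers above; the helper names are mine):

  from collections import defaultdict
  import requests

  ES = "http://localhost:9200/frwiki_test"  # hypothetical test index

  def analyze(text, analyzer):
      # Ask Elasticsearch how it would tokenize and normalize `text`.
      resp = requests.get(ES + "/_analyze",
                          params={"analyzer": analyzer, "text": text})
      return [t["token"] for t in resp.json()["tokens"]]

  def buckets(words, analyzer):
      # Group original words by the analyzed form they are indexed under.
      groups = defaultdict(set)
      for word in words:
          for token in analyze(word, analyzer):
              groups[token].add(word)
      return groups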
N.B.: See the notes in the English analysis on types (unique words) and tokens (all words) if you aren't familiar with the terms or want more details and examples.
Unexpected Features of French Analysis Chain
While manually running analyses on the command line to make sure I'd properly switched my local config from English to French, I discovered that some ascii-folding is already going on, apparently after stemming. Some common French diacritics are folded to their unaccented variants, while some French diacritics and other more general diacritics are not. In particular:
- folded: á â à é ê è î ô û ù ç
- unfolded: ä å ã ë í ï ì ó ö ò ø õ ú ü ÿ ñ ð œ æ
Of special note, the characters that are folded are left unchanged if the word is less than five letters long, so âge is not folded to age. Also, deduplication doesn't happen if the word is fewer than five characters: aaaa (4xa) is indexed as aaaa, but aaaaa (5xa) comes out as just a. (This turns out to be pretty important!)
I was also able to determine that folding happens after stemming, based on the analyzed versions of élément and element. The proper form, élément ("element"), is analyzed as element (without accents, as per the list above). Meanwhile, element was analyzed as ele, with the final -ment (roughly equivalent to the English adverbial ending -ly) stripped by the stemmer. (As a result, you can search French Wikipedia for ele and get results on element, which often occurs in accentless redirects. As a comparison in English, searching Bever Hills on English Wikipedia gives "exact matches" on Beverly Hills, because Beverly is stemmed in English to bever: the -ly looks like an adverbial ending, even though it isn't.)
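You can see all of this directly with the _analyze API; using the analyze() helper sketched above against the built-in French analyzer, the expected results (per the observations above) look like this:

  for word in ["élément", "element", "âge", "aaaa", "aaaaa"]:
      print(word, "->", analyze(word, "french"))
  # élément -> ['element']  (stemmed, then accents folded)
  # element -> ['ele']      (final -ment stripped by the stemmer)
  # âge     -> ['âge']      (under five letters, so no folding)
  # aaaa    -> ['aaaa']     (under five letters, so no deduplication)
  # aaaaa   -> ['a']        (deduplicated)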
Unexpected Differences in Unpacked Analysis Chain
After re-implementing the French analysis chain as its component steps, I re-ran my small 1K sample to make sure that the results were the same as with the built-in analysis chain. It turns out that there are a few differences that don't seem to be a product of my re-implementation. I re-ran the 50K sample, too, to get a better idea of the differences.
The differences seem to mostly be improvements:
- Invisible Unicode characters are stripped; they would otherwise keep some words from being indexed properly. Examples:
- bidirectional codes like U+202C and U+200E.
- the byte order mark, U+FEFF.
- Better handling of Greek:
- ς (the word-final form of σ) is folded to σ.
- cursive forms are folded to non-cursive forms: ϐ to β (apparently this is a French thing!)
- capitals are properly lowercased, like Ω to ω.
- other variants are folded, such as ϑ to θ.
- Letter-like symbols are folded into matching characters: the micro sign (µ) is folded to lowercase letter mu (μ), and the aleph symbol (ℵ) is folded into the letter aleph (א). Depending on your font, those pairs can be indistinguishable!
- Double-struck letters (common in mathematics) are folded to their normal versions: ℚ, ℝ, and ℤ become q, r, and z. (This isn't always ideal; e.g., the mathematical nℤ is folded in with NZ, the abbreviation for New Zealand.)
- German ß is folded to s; it is probably folded to ss first, but the French analysis chain is already known to dedupe repeated characters.
- Some phonetic characters are folded to their normal counterparts: ʲ and ʰ become j and h.
- Other small raised letters, as in 1ᵉʳ (cf. English 1ˢᵗ), are folded to their normal counterparts, e and r.
- Fullwidth and halfwidth CJK characters are folded to their more typical variants: fullwidth ３ becomes 3, and halfwidth katakana is folded to its standard forms.
- Soft hyphens are ignored.
- Arabic âisolatedâ variants are folded with the normal character.
- One obvious regression: Turkish dotted I (İ) is no longer folded to plain i.
I also noticed that there were slightly different versions of the unpacked French analysis chain in different versions of the docs. Whenever we update Elasticsearch, we should check the docs to see if the default analysis chains have changed. If they have, we might want to consider making similar changes to our unpacked analysis chains (English, Italian, and probably now French), even if the results of the unpacked chains are not identical to the built-ins.
Results: Built-in French vs Unpacked French
The size of the effect is very small, and generally positive.
50K sample | built-in French | unpacked French |
total tokens | 13,442,996 | 13,442,955 |
pre-analysis types | 631,687 | 631,677 |
post-analysis types | 427,911 | 427,627 |
new collision types | 318 | 0.074% |
new collision tokens | 994 | 0.007% |
- total tokens: basically the total number of words in the corpora
- pre-analysis types: the number of unique different forms of words before stemming and folding
- post-analysis types: the number of unique different forms of words after stemming and folding
- new collision types: the number of unique post-analysis forms of words that are bucketed together by the change (the % changed is in comparison to the post-analysis types)
- new collision tokens: the number of individual words that are bucketed together by the change (the % changed is in comparison to the original total tokens)
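In terms of the buckets() sketch from earlier, computing these two collision numbers amounts to something like the following; this is a sketch of the bookkeeping only, and the real accounting (especially around tokens) is fiddlier:

  def new_collisions(old, new, freq):
      # old, new: maps from analyzed form -> set of original words
      # freq: corpus frequency of each original word
      types = tokens = 0
      for form, words in new.items():
          if len(words) > 1 and words != old.get(form, set()):
              types += 1                             # newly merged bucket
              tokens += sum(freq[w] for w in words)  # words now conflated
      return types, tokens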
We have a very small net loss in tokens; the tokenizers appear to be slightly different between the built-in and unpacked French analysis chains.
types | tokens | |
folded | 261 | 735 |
folded_plurals | 4 | 4 |
other | 61 | 293 |
Note that "new collisions" are post-analysis types (final buckets), and all the others are pre-analysis types (original forms). This is confusing; sorry. The number of types changes (it is reduced) after analysis, but the number of tokens doesn't.
- plurals: where an apparent singular and plural came together, such as Crónica/Crónicas.
- folded: where accented and unaccented forms came together, such as Eric/Éric, and Elias/Elías.
- folded_plurals: got a match both folded and pluralized.
- others: where it wasn't likely that the new bucketing was helpful, such as Gore/Göring.
N.B.: I did not count a match if the capitalization was different (too many names out there), but there are more matches that could be made if case were not a factor: Œuvres / oeuvres, for example.
The overall impact is very small, but most of it is clearly positive.
Because of these changes, I had to re-analyze my larger sample with the unpacked French analysis chain to form a new baseline to isolate the effect of the ascii-folding. It looks like some of the ascii-folding job is done by unpacking the French analysis chain; however, our motivating character (ÿ, and umlauts/trémas in general) is not affected.
Notes on Italian, character filters and tokenizing, etc.
Looking at the config for English and Italian (which have also been similarly unpacked so that ascii-folding could be added), I noticed that both the English and Italian configurations (which may have been copied one from the other) include the word_break_helper character filter in the tokenizer. This is a custom filter that maps underscores, periods, and parens to spaces, to make sure those things are definitely counted as word breaks. (It looks like parens are already word boundaries for French, at least.)
Among other effects, this splits up domain names like wikipedia.org and youtube.com into parts, so that queries like wikipedia and youtube, respectively, could match the domains.
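Based on its description, word_break_helper is essentially a mapping character filter along these lines (a sketch matching the behavior described here; the actual CirrusSearch definition may differ in detail):

  # Map word-joining punctuation to spaces before tokenization.
  word_break_helper = {
      "type": "mapping",
      "mappings": [
          "_=>\\u0020",  # snake_case_titles
          ".=>\\u0020",  # wikipedia.org, 01.06.1958
          "(=>\\u0020",
          ")=>\\u0020",
      ],
  }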
Since it takes 6 hours to run the full 50K French corpus, I only ran a quick test on my 1K corpus to see what effect the word_break_helper has on tokenizing.
- Dates and similar period-separated numbers like 01.06.1958 are broken up into parts (01, 06, and 1958).
- The same applies to letters: acronyms (A.S.S.E.T.T. or A.D.) and web domains are split up.
- Certain typos (d'années.Elle) are split up and processed correctly.
Mini Results: Unpacked French vs Unpacked French with word_break_helper
Looking at collisions and token counts:
1K sample | unpacked | unpacked + WBH |
total tokens | 298,110 | 298,617 |
pre-analysis types | 58,123 | 57,947 |
post-analysis types | 38,852 | 38,618 |
new collision types | 237 | 0.610% |
new collision tokens | 773 | 0.259% |
It was a fairly minor impact. I think it's a net positive, though I don't like the way acronyms are treated.
After talking to David, and looking at the impact of word_break_helper on acronyms in English and how it interacts with his BM25 work, I think maybe we shouldn't implement it, and maybe we should turn it off for English, too.
Results: Unpacked French vs Unpacked and Ascii-folded French
These results focus on the effect of ascii-folding (preserving the original accented form as well).
50K sample | unpacked | unpacked + folded |
total tokens | 13,442,955 | 13,878,966 |
pre-analysis types | 631,677 | 664,876 |
post-analysis types | 427,627 | 445,823 |
new collision types | 8,906 | 2.083% |
new collision tokens | 351,167 | 2.612% |
There is an increase in total tokens because we preserve the accented form and the ascii-folded form. The effect is relatively large (> 2%) and would be even larger for the full set of 1.7M French articles.
More than 80% of the new collisions are folded matches: the accentless form exists as its own pre-analysis type, generally indicative of a decent match.
types | tokens | |
plural | 0 | 0 |
folded | 9,258 | 287,462 |
folded_plurals | 725 | 17,606 |
other | 3,581 | 48,604 |
Review
There were 361 new collisions in the 1K data. I reviewed them and here's what jumped out at me:
- lots of short word folding: Á, à, and ä are now all folded in with a; âge and âgé are folded in with age. This was already happening with longer words, but now happens for the short ones, too.
- names with diacritics are folded in with their accentless versions: Agnès & Agnés with Agnes; aïkido with aikido; Düsseldorf with Dusseldorf, Rodríguez with Rodriguez, Shâh with Shah, etc.
- I also noticed a typo (Bají for Spanish Bajá), which would now get indexed properly.
- some of the short words that are folded together, especially when deduplication has happened, don't strike me as great: Bâle with balle(s), bébé with Beebe.
- English contractions and possessives with smart quotes are correctly being indexed with straight-quote variants: can’t, don’t, it’s, ain’t, King’s. This is also happening to a few French words, like aujourd’hui. Looks like the French stemmer can handle straight or smart quotes for elision, but doesn't fold them in general.
- Long words with very short stems are folded together. education is stemmed to educ, and éducation is stemmed to éduc; since the stems are only four letters long, there's no ascii-folding of é in the French stemmer, and these were indexed separately. Now they are one!
- This happens with plurals, too, so that édits is stemmed to édit, which is too short to be ascii-folded by the French stemmer.
- It seems that inside the French stemmer, some stemming happens before ascii-folding, some after. édité, éditée, éditées, édités, éditeur, and éditeurs all stem to edit, but édits does not. With explicit post-stemmer ascii-folding, they are all indexed together.
- Ewww. The short word thing hits some masc/fem pairs. égal (the masculine, "equal") is indexed as égal, but égale (the feminine) is 5 letters, and eligible for ascii-folding before the final e comes off. It comes out as egal. With post-stemmer ascii-folding, they all end up together under egal. Similarly for reçu / reçue.
- Better handling of digraphs: œuvre with oeuvre, Phœnix with Phoenix, Schnæbelé with Schnaebelé, Cæsar with Caesar.
- It's not all great: thé with the is going to be the worst, I'm sure.
- Even with the specific ascii-folding step, Turkish dotted I (İ) is no longer folded to plain i, so İstanbul and Istanbul are no longer indexed together. We could fix this by mapping İ to I before tokenization (see the sketch below); dotted İ is not going to be distinctive very often in French.
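The fix mentioned in the last point could be as simple as a mapping character filter ahead of the tokenizer; a minimal sketch (the filter name is just a label):

  # Fold Turkish dotted İ (U+0130) to plain I before tokenization, so
  # İstanbul and Istanbul are indexed together again.
  dotted_I_fix = {
      "type": "mapping",
      "mappings": ["\\u0130=>I"],
  }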
I was less optimistic at first because of all the short words I was seeing, but looking through the whole list, I think it's a net positive. David kindly looked more closely at the 244 "other" collisions that didn't fall into the folded or plurals categories and gave his native-speaker judgements on the quality of the merges. His judgements, plus the automatically assessed folded & plural counts, are below.
types | tokens | |
Folded | 315 | 2064 |
Folded Plural | 54 | 647 |
Other (Good) | 106 | 612 |
Other (Bad) | 83 | 417 |
Other (Unclear) | 55 | 348 |
So, only 10.20% of tokens (417/4088) involved in new collisions in the 1K sample are demonstrably worse.
It looks like we have a winner!
Potential Hard-to-Explain Behavior
I haven't come across a concrete example, but I'm going to write this down here because it's going to confuse someone at some point. The French stemmer does some ascii-folding (as noted with élément and element) before it does deduplication (hhhhoooommmmmmmmeeeee goes to home). The additional, more universal ascii-folding step we've added comes after that, so the deduplication doesn't always work out like you'd expect.
So, here we have an artificial example of 6 a's in a row, with various accents. The French stemmer folds á to a, but not Scandinavian å. Deduplication happens on exact character matches (modulo case). Then the extra ascii-folding happens (preserving the "original", which is the output of the French stemmer). None of the five tokens used to index the three originals are the same as each other, and only one is the same as its original.
original token | ååaaåå | åaåaåa | áaáaáa |
stemmer-based folding | ååaaåå | åaåaåa | aaaaaa |
deduplication | åaå | åaåaåa | a |
asciifolding_preserve | åaå, aaa | åaåaåa, aaaaaa | a |
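To make the interaction concrete, here is a toy model of the three steps; it only mimics the behavior described above for á and å, and is not the real stemmer:

  import re

  def stemmer_fold(token):
      # The French stemmer folds á to a for words of 5+ letters,
      # but leaves å alone.
      return token.replace("á", "a") if len(token) >= 5 else token

  def dedup(token):
      # Repeated characters collapse, again only for words of 5+ letters.
      return re.sub(r"(.)\1+", r"\1", token) if len(token) >= 5 else token

  def fold_preserve(token):
      # asciifolding_preserve: emit the folded form, and keep the
      # original token if it differs.
      folded = token.replace("á", "a").replace("å", "a")
      return [token, folded] if folded != token else [token]

  for original in ["ååaaåå", "åaåaåa", "áaáaáa"]:
      print(original, "->", fold_preserve(dedup(stemmer_fold(original))))
  # ååaaåå -> ['åaå', 'aaa']
  # åaåaåa -> ['åaåaåa', 'aaaaaa']
  # áaáaáa -> ['a']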
I don't know when or where, but this will eventually come up. Mark my words!
Conclusions
- The overall impact of performing ascii-folding on French Wikipedia is largely positive. We should do it!
- We should probably set up a character map from İ to I so as not to regress on Turkish names.
- Adding the word_break_helper character filter is dubious.
Deployment Notes
Since this change affects how terms are indexed, it requires a full re-index of French Wikipedia. We'll be doing that for the BM25 implementation within the next quarter or so, so it makes sense to roll out BM25 and the new folding configuration at the same time.