User:TJones (WMF)/Notes/On Merging Apostrophes and Other Unicode Characters
August 2016 – See TJones_(WMF)/Notes for other projects. (T41501) For help with the technical jargon used in discussing Analysis Chains, check out the Language Analysis section of the Search Glossary.
Intro
I wrote this up for T41501, but wanted to put it with my other notes for easy finding later.
T41501 is a pretty old ticket, dating back to a 2009 discussion about straight and curly apostrophes in lsearchd, which has since been retired. Below is an abbreviated version of the ticket description.
Task Description
When doing a search with the apostrophe character U+0027 ("apostrophe/single quote"), available on most keyboards, results should match other Unicode apostrophe-like characters, like the preferred apostrophe U+2019 and others.
Basically, indexing should convert all apostrophes to U+0027, and searching should convert all apostrophes to U+0027. So articles containing U+2019, for example, would be matched when searching with U+0027, U+2019, or other apostrophes.
From the 2009 discussion, the list of apostrophes was:
- U+0027 APOSTROPHE
- U+2018 LEFT SINGLE QUOTATION MARK
- U+2019 RIGHT SINGLE QUOTATION MARK
- U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
- U+2032 PRIME
- U+00B4 ACUTE ACCENT
- U+0060 GRAVE ACCENT
- U+FF40 FULLWIDTH GRAVE ACCENT
- U+FF07 FULLWIDTH APOSTROPHE
I would add other characters for which U+0027 is often used as an accessible substitute, like some modifier letters and saltillo:
- U+02B9 MODIFIER LETTER PRIME
- U+02BB MODIFIER LETTER TURNED COMMA
- U+02BC MODIFIER LETTER APOSTROPHE
- U+02BD MODIFIER LETTER REVERSED COMMA
- U+02BE MODIFIER LETTER RIGHT HALF RING
- U+02BF MODIFIER LETTER LEFT HALF RING
- U+0384 GREEK TONOS
- U+1FBF GREEK PSILI
- U+A78B LATIN CAPITAL LETTER SALTILLO
- U+A78C LATIN SMALL LETTER SALTILLO
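As a rough illustration of what the task description asks for, here is a minimal Python sketch that maps all of the characters listed above to U+0027. The character set is taken from the two lists; the constant and function names are made up for this example, and this is not how Elasticsearch or CirrusSearch actually implements any folding.

```python
# Minimal sketch of the requested normalization: map apostrophe-like
# characters to U+0027 before indexing and before searching.
# Illustrative only; not how the production search stack does it.
APOSTROPHE_LIKE = (
    "\u2018\u2019\u201B\u2032\u00B4\u0060\uFF40\uFF07"  # 2009 list (minus U+0027 itself)
    "\u02B9\u02BB\u02BC\u02BD\u02BE\u02BF\u0384\u1FBF"  # modifier letters, tonos, psili
    "\uA78B\uA78C"                                      # capital and small saltillo
)
_TO_APOSTROPHE = str.maketrans(dict.fromkeys(APOSTROPHE_LIKE, "'"))

def normalize_apostrophes(text: str) -> str:
    """Replace every apostrophe-like character with U+0027 APOSTROPHE."""
    return text.translate(_TO_APOSTROPHE)

print(normalize_apostrophes("prickett\u2019s charge"))  # -> prickett's charge
```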
My Response
There are a lot of connected issues here. I'll try to untangle some of them.
In Elasticsearch, there are two particularly relevant steps in processing text: tokenizing, which breaks up text into tokens to be indexed (usually what we think of as words), and ascii-folding, which converts non-plain ascii into plain ascii where possible (though, for example, you can't convert Chinese characters into plain ascii because there's no reasonable mapping).
The rules Elasticsearch uses for tokenizing and other processing can differ by language, so I've only tested these on the English analysis chain for now.
A normal apostrophe is treated as a word break, so looking at prickett's (from "prickett's charge" in the article from the Daily WTF), we get prickett and s as our terms to be indexed. Searching for prickett's charge actually searches for three tokens: prickett, s, and charge. The obvious title comes up because that phrase in that exact order is the title of the article, which is usually a very good result.
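You can inspect this kind of tokenization directly with Elasticsearch's _analyze API. The sketch below assumes a local Elasticsearch instance on localhost:9200 and uses the stock english analyzer as a stand-in; the actual on-wiki English analysis chain is customized, so the exact tokens it returns may differ from what is described here.

```python
import requests  # assumes a local Elasticsearch instance at localhost:9200

def analyze(text, analyzer="english"):
    """Return the tokens Elasticsearch produces for the given text and analyzer."""
    resp = requests.post(
        "http://localhost:9200/_analyze",
        json={"analyzer": analyzer, "text": text},
    )
    resp.raise_for_status()
    return [t["token"] for t in resp.json()["tokens"]]

# See how the straight apostrophe is handled by whatever analyzer you point this at.
print(analyze("prickett's charge"))
```

Swapping other apostrophe-like characters into the text (U+02BB, the saltillos, and so on) is an easy way to reproduce the word-break observations in the next paragraph.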
Many of the apostrophe-like characters listed above also serve as word breaks in English. The ones listed here that are not word breaks include all the listed modifier letters and the small saltillo (oddly, the capital saltillo is a word break). Of course, in other languages the analysis could be different, though I checked Greek and the separate tonos is still a word breaker. (I think it's because it's not a modifier mark, since all the vowels with tonos have precomposed Unicode characters, but I'm guessing.)
For characters that are not word breaks, ascii-folding often does what you'd want, but not always. Ascii-folding is currently enabled on English Wikipedia, so searching for prïckétt's chärgè works like you'd want. In my (not quite done) research into French (T142620), Turkish dotted-I (İ) is properly folded to I by the default French analysis chain, but not by the explicit ascii-folding step. The French stemmer does some ascii-folding, but generally not as much as the explicit ascii-folding step (dotted-I notwithstanding).
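The explicit ascii-folding step can also be exercised on its own through _analyze by naming the asciifolding token filter. Again this assumes a local Elasticsearch instance, and it uses the stock standard tokenizer plus lowercase and asciifolding rather than the full on-wiki chain, so it only approximates the behavior described above.

```python
import requests  # assumes a local Elasticsearch instance at localhost:9200

resp = requests.post(
    "http://localhost:9200/_analyze",
    json={
        "tokenizer": "standard",
        "filter": ["lowercase", "asciifolding"],  # the explicit ascii-folding step
        "text": "prïckétt's chärgè",
    },
)
resp.raise_for_status()
print([t["token"] for t in resp.json()["tokens"]])
```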
In general, the Elasticsearch ascii-folding is pretty good, though linguists cringe at folding ɰ to m. Undoubtedly there are other minor errors in the ascii-folding.
The tokenizer is causing some of these problems, particularly with the multiplication sign, ×, which is a non-word character and so acts as a word break. When using the multiplication sign, 3×4 is tokenized as two tokens: 3 and 4; when using an x, 3x4 is tokenized as three tokens: 3, x, and 4.
We are currently doing explicit ascii-folding for English and Italian, and we're adding it for French (which will come with BM25). Some folding probably happens in other language-specific analysis chains, but we don't know exactly what or where without testing.
It is possible to add any of these others (x for ×, I for İ) as Elasticsearch character filters, which just uniformly map one character to another, but that could have unintended consequences. We would then definitely no longer distinguish between the mapped characters, so we couldn't apply them universally; in Turkish, for example, the distinction between I and İ matters.
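For reference, a character filter of the kind described here is just a mapping filter in an index's analysis settings. The sketch below is hypothetical (the index name, analyzer name, filter name, and the two mappings are mine, not production CirrusSearch configuration); it shows how × and İ could be uniformly mapped, with the caveat above that doing so universally would be wrong for Turkish.

```python
import requests  # assumes a local Elasticsearch instance at localhost:9200

settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                # Hypothetical mapping character filter: uniformly replace one
                # character with another before tokenization.
                "fold_special_chars": {
                    "type": "mapping",
                    "mappings": ["\u00d7 => x", "\u0130 => I"],  # × => x, İ => I
                }
            },
            "analyzer": {
                "english_with_char_map": {
                    "tokenizer": "standard",
                    "char_filter": ["fold_special_chars"],
                    "filter": ["lowercase", "asciifolding"],
                }
            },
        }
    }
}
requests.put("http://localhost:9200/char-filter-demo", json=settings).raise_for_status()
```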
There can always be problems with particular "non-native" characters and particular symbols that the default tokenizing and ascii-folding doesn't handle as well as we'd like. More issues will come up, but I'd consider closing this specific task: it is based on the behavior of lsearchd, which is no longer around; all of the original apostrophe-like characters now behave like apostrophes; and we are looking into ICU folding (T137830), which is more appropriate for languages that aren't using the Latin alphabet (it's already enabled for Greek).