User:TJones (WMF)/Notes/Khmer Reordering

From mediawiki.org

September 2019 — See TJones_(WMF)/Notes for other projects. See also T185721.

Syllable Re-Writing to Improve Khmer Search Performance

Background

Khmer, also called Cambodian, is the official language of Cambodia. The Khmer script is an abugida closely related to Thai and Lao and more distantly to the other Brahmic scripts.

Khmer is written left to right, and often without spaces between words. That's the easy part! (Though it's still hard.)

Khmer orthographic syllables[*] are built around a "base character" that represents a consonant and a vowel. Up to two additional subscript consonants may be added to the onset of the syllable by stacking them underneath the base character (usually), and the vowel of the base character can be changed by adding other characters to the base character. Other diacritics may also be added to alter the pronunciation of the consonant or vowel.

[* Orthographic syllables are a new concept for me, and they are not well documented on English Wikipedia, alas. The basic idea is that orthographic syllables represent the way words are broken down when writing, which may not line up 1-to-1 with the way they are broken down in speaking. This makes more sense in some of the Brahmic scripts (which include Khmer) than it does in English.]

The various elements that glom on to the base character can attach themselves below, above, to the left, or to the right of the base character, and sometimes multiple elements can stack, especially above or below the base character. Some subscript consonants can also go to one side of the base character, and some of the vowel diacritics have two parts—one to the left, and the other above or to the right. Some diacritics change shape or location in the presence of other diacritics—for example, if both would normally go on top of the base character—presumably to keep things from getting too crowded.

Different fonts and different operating systems support typing the code points of an orthographic syllable in different orders, in that they will render the resulting syllable the same (or very nearly the same). This happens because—to simplify a bit—there isn't really an obvious linear order to elements that glom on top, to the left, and below the base character.

The Unicode Standard sets out a preferred order, which corresponds to the order the elements are spoken (though there seem to still be some elements that are ordered arbitrarily). Incorrect ordering should result in glyphs with a dotted circle (◌) showing that they aren't combining correctly, but many fonts, applications, and OSes will still render non-canonical orders perfectly fine, or at least reasonably well, and some don't show the dotted circle even when they render poorly.

Another issue is that for some of the diacritics, multiple copies will render directly on top of each other so that you can't see that there are multiple copies of the diacritic. This can apply to vowels, subscript consonants, and other diacritics.

All of this variability causes great difficulty in search, because two words that look exactly the same could underlyingly be composed of very different sequences of code points.

A similar but much smaller-scale problem in English is that é can be either a single character or two characters (e + ´) composed together. Khmer has the same problem, but turned up to 11! A more analogous example would be if the character sequences strap, srtap, satrp, and straaaaap all looked identical.

Khmer Examples

As an example, the syllable ង្រ្កា consists of a base character ង, plus the subscript consonants ្រ and ្ក, and the vowel ា. In terms of rendering ្រ goes to the left, ្ក goes below, and ា goes to the right. Depending on how smart and how forgiving the font you are using is, the elements can be re-ordered to get the same or nearly the same final visual form:

  • ង្រ្កា ( ង + ្រ + ្ក + ា )
  • ង្រា្ក ( ង + ្រ + ា + ្ក )
  • ង្ក្រា ( ង + ្ក + ្រ + ា )
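
To make the é analogy concrete, here is a minimal Python sketch. It assumes only the standard `unicodedata` module; the variable names are illustrative. Standard Unicode normalization (NFC) unifies the two spellings of é, but it leaves two of the Khmer orderings listed above as distinct strings, which is why a Khmer-specific re-ordering step is needed.

```python
import unicodedata

def nfc(text):
    """Apply Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e + combining acute accent

# Two orderings of the same visual syllable (base + two subscripts + vowel).
variant_a = "\u1784\u17d2\u179a\u17d2\u1780\u17b6"  # ង + ្រ + ្ក + ា
variant_b = "\u1784\u17d2\u1780\u17d2\u179a\u17b6"  # ង + ្ក + ្រ + ា

print(precomposed == decomposed)            # False
print(nfc(precomposed) == nfc(decomposed))  # True
print(nfc(variant_a) == nfc(variant_b))     # False
```

The Khmer consonants sit between the coengs and block any canonical re-ordering, so NFC leaves both variants unchanged.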

Another example, ញ៉ាំ, includes the diacritic ៉, which can be rendered in this context similarly to ុ, which can be used instead to generate a similar looking result. ុ is also more flexible in its ordering than ៉, so there are up to seven variants that look approximately the same (depending on your font, OS, and application support):

  • ញ៉ាំ ( ញ + ៉ + ា + ំ )
  • ញុាំ ( ញ + ុ + ា + ំ )
  • ញុំា ( ញ + ុ + ំ + ា )
  • ញាុំ ( ញ + ា + ុ + ំ )
  • ញាំុ ( ញ + ា + ំ + ុ )
  • ញំុា ( ញ + ំ + ុ + ា )
  • ញំាុ ( ញ + ំ + ា + ុ )

As mentioned before, some of the vowel diacritics are multi-part, with a component to the left and another to the right or above, such as ើ or ោ. The components also exist as independent pieces, and can be combined as such (as always depending on font support):

  • កើ ( ក + ើ )
  • កេី ( ក + េ + ី )

or:

  • កោ ( ក + ោ )
  • កេា ( ក + េ + ា )

The most extreme case of duplicate diacritics I found in my sample includes fourteen copies of ំ on top of each other. Depending on the font, it can be rendered as a barely noticeable thickening of the lines, which is almost impossible to see (and definitely easy to miss) in a small text size.

  • តិំ ( ត + ិ + ំ )
  • តិំំំំំំំំំំំំំំ ( ត + ិ + ំ + ំ + ំ + ំ + ំ + ំ + ំ + ំ + ំ + ំ + ំ + ំ + ំ + ំ )
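
Since these differences are often invisible on screen, it helps to dump the underlying code points. The following is a small sketch using Python's standard `unicodedata` module; the helper name `code_points` is made up for illustration.

```python
import unicodedata

def code_points(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
            for ch in text]

single = "\u178f\u17b7\u17c6"              # ត + ិ + ំ
stacked = "\u178f\u17b7" + "\u17c6" * 14   # same, with fourteen copies of ំ

for line in code_points(single):
    print(line)

# The two strings may render almost identically, but they differ in length.
print(len(single), len(stacked))  # 3 16
```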

The Plan

The goal, then, is to create an Elasticsearch plugin (or other config) that can re-order Khmer syllables into a standard order before indexing.

To make it easier for me and for Khmer speakers to evaluate the results, my initial goal is to re-order syllables into the correct canonical order, though it may be necessary to back off to a standard order that is not always correct, but always consistent.

Several people have suggested simply "alphabetizing" the syllable elements to give them a standard order. Readers would never see the re-ordered syllables, since they would only be used internally as part of the index. I thought of doing that, too, and it would definitely work for simpler syllables. For the more complex syllables, I think it could cause problems. The elements of the syllable can have structure of their own. For example, the subscript consonant ្ក is made up of a special symbol, "coeng" ( ្), plus the full consonant ក. I also found at least one case where three consonants seem to be able to appear in different orders: ក្ម្ស (kms-) vs. ក្ស្ម (ksm-).

(Update from the future: There is also an issue of determining syllable boundaries, which turns out to be a problem with some typos. Cleverer processing will make this more accurate.)

The Prototype

I decided that my first task should be to build a prototype command line tool to parse out Khmer syllables and re-order them. This makes it much easier to iterate on the re-ordering algorithm, debug, etc. It also puts off any problems that might crop up with Elasticsearch tokenizing while I figure out the re-ordering.

I did find some existing repositories on GitHub that do "Khmer re-ordering". However, the vast majority of them were copies of a library to re-order elements for display as a font, and the one that works on orthographic syllables assumes much cleaner data than what I'm seeing.

With a sample of 5,000 Khmer Wikipedia articles, 5,000 Wiktionary entries, and some guidelines on the theoretical order of syllable elements (see Some References, below), I set out to find and re-order the syllables of the text.

The Algorithm

My current algorithm is as follows (parts of it are driven by the references I've found, and parts of it are data-driven, so it doesn't cover all possible cases, but it covers everything I've found so far, except for the syllable boundary problem):

  • Define useful character classes
    • consonants = ក ខ គ ឃ ង ច ឆ ជ ឈ ញ ដ ឋ ឌ ឍ ណ ត ថ ទ ធ ន ប ផ ព ភ ម យ រ ល វ ឝ ឞ ស ហ ឡ អ [U+1780–U+17A2]
      • ro = រ [U+179A]
    • independent vowels = ឣ ឤ ឥ ឦ ឧ ឨ ឩ ឪ ឫ ឬ ឭ ឮ ឯ ឰ ឱ ឲ ឳ [U+17A3–U+17B3]
      • The first two are deprecated and shouldn't occur after regularization
    • dependent vowels = ា ិ ី ឹ ឺ ុ ូ ួ ើ ឿ ៀ េ ែ ៃ ោ ៅ [U+17B6–U+17C5]
    • coeng = ្ [U+17D2]
    • diacritics = ំ ះ ៈ ៉ ៊ ់ ៌ ៍ ៎ ៏ ័ ៑ [U+17C6–U+17D1 U+17DD]
      • register shifters = [U+17C9 U+17CA]
      • robat = [U+17CC]
      • non-spacing diacritics = [U+17C6 U+17CB U+17CD–U+17D1 U+17DD]
      • spacing diacritics = [U+17C7 U+17C8]
    • zero-width elements = zero width space (ZWSP), zero width non-joiner (ZWNJ), zero-width joiner (ZWJ), soft-hyphen (SHY), invisible separator (InvSep) [U+200B–U+200D U+00AD U+2063]
  • Regularize the text
    • replace obsolete ligature ឨ (U+17A8) with ឧក (U+17A7 U+1780)
    • replace deprecated independent vowel ឣ (U+17A3) with អ (U+17A2)
    • replace deprecated independent vowel digraph ឤ (U+17A4) with អា (U+17A2 U+17B6)
    • replace ឲ (U+17B2), a variant of ឱ, with ឱ (U+17B1)
    • replace deprecated trigraph ៘ (U+17D8) with ។ល។ (U+17D4 U+179B U+17D4)
    • delete non-visible inherent vowels (឴) (U+17B4) and (឵) (U+17B5) (used for transliteration)
    • replace obsolete ATTHACAN ៝ (U+17DD) with VIRIAM ៑ (U+17D1)
    • replace deprecated BATHAMASAT ៓ (U+17D3) with NIKAHIT ំ (U+17C6) as a likely error
  • Find syllables
    • A syllable is a consonant or independent vowel followed by any sequence of coeng(s)+consonant clusters, coeng(s)+independent vowel clusters, dependent vowels, diacritics, or zero-width elements.
      • coeng(s) are 1 or more coengs. There should be only one, but they are often invisible, and I've seen as many as three used at once.
      • KNOWN BUG: This works well for reasonably formatted text, but when there are certain typos—in particular a missing base consonant in the next syllable—this can gather up too many vowels. I haven't decided yet whether to try to detect the excess vowels or just let typos do silly things, as typos often do. (We want to catch duplicate vowels and split vowels, so we do need to allow multiple vowels in some cases.)
  • Re-order each syllable
    • remove zero-width elements
    • save the first character (consonant or independent vowel) as the base character
    • split the remainder of the syllable into chunks
      • coeng(s) + (consonant or independent vowel) + register shifter
      • coeng(s) + (consonant or independent vowel)
      • everything else splits into one-character chunks
    • collect the chunks, in their original order, into the following groups:
      • chunks starting with a coeng
        • deduplicate coengs within these chunks
      • dependent vowels
      • other register shifters
      • robats
      • non-spacing diacritics
      • spacing diacritics
      • [note that there shouldn't be any leftovers]
    • de-duplicate each group
      • if the same chunk occurs multiple times in a row, reduce it to one instance
    • repair split vowels
      • replace េ + ី (U+17C1 U+17B8) with ើ (U+17BE)
      • replace ី + េ (U+17B8 U+17C1) with ើ (U+17BE)
      • replace េ + ា (U+17C1 U+17B6) with ោ (U+17C4)
    • re-order subscript consonants (ro is always last)
      • if coeng + ro comes before coeng + consonant, swap them
    • join the elements: base character + other register shifters + robat + coeng chunks + dependent vowels + non-spacing diacritics + spacing diacritics
      • note that some register shifters could be in the coeng chunks
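
The steps above can be sketched in Python roughly as follows. This is an illustrative reconstruction from the list, not the actual prototype: the names (`reorder_text`, `reorder_syllable`, and the constants) are assumptions, and it makes no attempt to handle the syllable boundary bug noted above.

```python
import re

COENG, RO = "\u17d2", "\u179a"
BASE = "[\u1780-\u17b3]"                       # consonants + independent vowels
VOWEL = "\u17b6-\u17c5"                        # dependent vowels
SHIFTER = "\u17c9\u17ca"                       # register shifters
ROBAT = "\u17cc"
NON_SPACING = "\u17c6\u17cb\u17cd-\u17d1\u17dd"
SPACING = "\u17c7\u17c8"
ZERO_WIDTH = "\u200b-\u200d\u00ad\u2063"

# Regularization: deprecated or obsolete characters and their replacements.
REGULARIZE = str.maketrans({
    "\u17a8": "\u17a7\u1780", "\u17a3": "\u17a2", "\u17a4": "\u17a2\u17b6",
    "\u17b2": "\u17b1", "\u17d8": "\u17d4\u179b\u17d4",
    "\u17b4": None, "\u17b5": None, "\u17dd": "\u17d1", "\u17d3": "\u17c6",
})

# Base character, then any mix of coeng clusters, vowels, diacritics,
# and zero-width elements.
SYLLABLE = re.compile(
    f"{BASE}(?:{COENG}+{BASE}|[{VOWEL}\u17c6-\u17d1\u17dd{ZERO_WIDTH}])*")
# coeng(s) + base (+ optional register shifter); otherwise single characters.
CHUNK = re.compile(f"{COENG}+{BASE}[{SHIFTER}]?|.")
GROUPS = [("shifter", SHIFTER), ("robat", ROBAT), ("vowel", VOWEL),
          ("nonspacing", NON_SPACING), ("spacing", SPACING)]
SPLIT_VOWELS = [("\u17c1\u17b8", "\u17be"), ("\u17b8\u17c1", "\u17be"),
                ("\u17c1\u17b6", "\u17c4")]

def dedupe(chunks):
    """Collapse runs of the same chunk into a single instance."""
    return [c for i, c in enumerate(chunks) if i == 0 or c != chunks[i - 1]]

def reorder_syllable(syllable):
    syllable = re.sub(f"[{ZERO_WIDTH}]", "", syllable)
    base, rest = syllable[0], syllable[1:]
    groups = {name: [] for name, _ in GROUPS}
    groups["coeng"] = []
    for chunk in CHUNK.findall(rest):
        if chunk[0] == COENG:
            # Keep exactly one coeng at the front of the chunk.
            groups["coeng"].append(COENG + chunk.lstrip(COENG))
        else:
            name = next(n for n, cls in GROUPS
                        if re.fullmatch(f"[{cls}]", chunk))
            groups[name].append(chunk)
    groups = {name: dedupe(chunks) for name, chunks in groups.items()}
    vowels = "".join(groups["vowel"])
    for split, whole in SPLIT_VOWELS:          # repair split vowels
        vowels = vowels.replace(split, whole)
    subs = groups["coeng"]
    for i in range(len(subs) - 1):             # move coeng + ro toward the end
        if subs[i][1] == RO and subs[i + 1][1] != RO:
            subs[i], subs[i + 1] = subs[i + 1], subs[i]
    return "".join([base, *groups["shifter"], *groups["robat"], *subs,
                    vowels, *groups["nonspacing"], *groups["spacing"]])

def reorder_text(text):
    """Regularize the text, then re-order each syllable in place."""
    text = text.translate(REGULARIZE)
    return SYLLABLE.sub(lambda m: reorder_syllable(m.group()), text)
```

With this sketch, `reorder_text` maps ង + ្រ + ្ក + ា, ង + ្រ + ា + ្ក, and ង + ្ក + ្រ + ា to the same string, repairs ក + េ + ី to កើ, and reduces fourteen stacked ំ to one.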

Early Results

I've broken the results up into several categories for review by readers of Khmer. I've had one round of review so far, and things look pretty good. There are lots of re-ordered syllables that look the same or very nearly the same[*] as the originals.

[* I've slowly relaxed my criteria for "looking the same" as I've gained more experience with Khmer text. I installed a couple dozen Khmer fonts, and I've found that different ones are more or less forgiving of different kinds of non-canonical orderings. So, if things look the same or very nearly so in the more forgiving fonts, I consider them to be the same.]

In order of decreasing confidence, the groups are:

  • Syllables that were relatively simple and that looked the same when their elements were re-ordered into the canonical order.
  • Syllables with "invisible duplicates"—those with multiple copies of elements that in most fonts don't show up as duplicates.
  • Split vowels: syllables where the multipart vowels are entered in parts instead of as a single character.
  • Syllables with zero-width spaces, zero-width non-joiners, and zero-width joiners—these are supposed to control ligatures and shouldn't have any effect on meaning.
  • Syllables with swapped subscript consonants—these all have the subscript consonant ្រ ("ro") in them. Ro always comes third in a set of three consonants, but it is written to the left of the base character, so it is often typed second.
  • Syllables where the frequency of the non-canonical order is much higher than that of the canonical order. These all seem reasonable by looking at them, but the inverted frequency was worrisome, so I pulled them out for special consideration.
  • Syllables with duplicate subscript consonants. These duplicates are more often visible than not (using my font collection), so I wasn't as comfortable putting them in the "invisible duplicates" section.
  • Other visible duplicates—these have duplicate elements that show up in almost all of my fonts.

After speaker review, all of the above groups appear to be reasonably re-ordered.

The difficult groups include:

  • "Questionably Reordered Syllables"—these seem to follow the rules, but often looked really different before and after re-ordering, or rendered poorly in all or most fonts even after re-ordering.
  • "???"—these were ones that were so confusing I didn't know how to categorize them.

After speaker review, these mostly fell into two groups:

  • The majority of the "Questionably Reordered Syllables" and a minority of the "???" syllables were actually re-ordered okay.
  • The majority of the "???" syllables and a minority of the "Questionably Reordered Syllables"—the ones that looked the worst—have syllable boundary errors, caused by typos in the original text.

(There's one example left in the "???" category, which has both ត ("ta") and ដ ("da") as subscript consonants. They look the same as subscripts, but they are technically different letters, and in this case they both always show up. Maybe in some font somewhere they overlay each other and you can only see one of them. Not sure—but if I escape with only one really unclear example unresolved, that'll be pretty good!)

For the syllables with boundary errors, after the correct syllable there are additional unattached dependent vowels. These seem to be missing their base consonant. The best approach is probably to split these extra vowels off into a separate syllable. It's a little complicated because "split vowels" can have two vowel elements, and both visible and invisible duplicates can have the same element repeated.

Alternatively, we could say that typos are typos and they mess things up, and whatever happens, happens. I’d prefer to fix things when I can, but I may have more limitations in the final implementation in Elasticsearch than I have in the prototype.

Syllable Stats

From the 5K Wikipedia article sample:

  • 4,165,057 total syllables
  • 7,172 changed syllables (0.17% of total)
  • 28 syllable boundary errors (0.00067% of total, 0.39% of changed)
    • 19 distinct syllables

From the 5K Wiktionary item sample:

  • 178,392 total syllables
  • 260 changed syllables (0.15% of total)
  • 1 syllable boundary error (0.00056% of total, 0.38% of changed)
    • 1 distinct syllable, obviously

These results are surprisingly consistent, compared to some of the stats I've seen when doing language analyzer analysis. I think that is because this is based on how people generally type words, rather than on the differences in style and content between Wikipedia and Wiktionary.

It's overall good that less than 0.2% of syllables need to be re-ordered and very good that less than 0.4% of syllables that we try to re-order (and approximately 0% of all syllables) have boundary errors.
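
The percentages above follow directly from the raw counts. As a quick check, here is a small Python sketch; the function name `rates` is just for illustration.

```python
def rates(total, changed, boundary_errors):
    """Return (changed % of total, errors % of total, errors % of changed)."""
    return (100 * changed / total,
            100 * boundary_errors / total,
            100 * boundary_errors / changed)

samples = {
    "Wikipedia": (4_165_057, 7_172, 28),
    "Wiktionary": (178_392, 260, 1),
}

for name, counts in samples.items():
    changed_pct, err_total_pct, err_changed_pct = rates(*counts)
    print(f"{name}: {changed_pct:.2f}% changed, "
          f"{err_total_pct:.5f}% boundary errors overall, "
          f"{err_changed_pct:.2f}% of changed")
```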

Quick Review of Khmer Wikipedia Queries

I pulled 90 days' worth of queries from Khmer Wikipedia, giving 27,747 queries total. I did a quick analysis of the queries since I had the data in front of me.

Slightly less than half of all queries contained any Khmer characters. Slightly more than half were primarily Latin characters, with the usual preponderance of porn-related search terms (primarily xxx or some variant of the xnxx web site name).

  • 27,747 queries
  • 13,624 with some Khmer characters (49.1%)
  • 13,904 Mostly Latin (50.1%)
    • 5,447 xnx (xnxx, xnx, xxnx, etc.) (19.6%)
    • 1,855 xxx (6.7%)

A handful of queries were mostly numbers, symbols, or emojis:

  • 176 mostly numerals
  • 62 all punctuation
  • 12 mostly emojis

And there was the usual small collection of queries in other scripts:

  • 52 Thai
  • 28 Chinese
  • 23 Cyrillic
  • 8 Korean
  • 3 Hebrew
  • 3 Greek
  • 2 Katakana
  • 2 Hiragana
  • 1 Lao
  • 1 Devanagari
  • 1 Bengali

Syllable Stats

I also processed the queries using my command line prototype...

From the 27K Wikipedia query sample:

  • 60,800 total syllables
  • 800 changed syllables (1.3% of total)
  • 3 syllable boundary errors (0.0049% of total, 0.38% of changed)
    • 3 distinct syllables

Syllables in queries are re-ordered at roughly 7 to 9 times the rate (1.3%) seen in the Wikipedia and Wiktionary article text, and syllable boundary errors occur at roughly 7 to 9 times the article rate, though that is still only 3 errors total.

Other observations

  • There are the usual junk queries (especially repeated characters, like ឥឥឥឥឥឥឥឥឥឥឥឥឥឥឥឥឥឥឥ or ,,,,,,, zzzzzzzz), and at least one of the syllable boundary errors looks like a junk query (as opposed to a typo).
  • The re-ordered syllables in the query text are generally similar to the ones in the article text, and the majority of them were ones I'd seen before (597/800, or 74.6%)—which makes me think that certain errors are reasonably common, but there is a long, long tail.
    • There were probably more unattached coeng characters in the queries, though.

Enter Elasticsearch

Khmer-language wikis use Elasticsearch's ICU tokenizer. Some documentation for it says that it detects syllables, but other documentation says it also has a dictionary to detect longer words.

Testing shows that it does have a dictionary. That's probably good for text with properly ordered Khmer elements, but might be bad for text with improperly ordered elements—the dictionary may not recognize all the variants, even if they look the same on-screen.

The easiest approach to implement would be to re-order elements after the tokenizer has tokenized them. However, improperly ordered elements could change the tokenization.

Command Line Re-Ordering Analysis Analysis

The first step is to understand the current ICU-based baseline analysis, since it is somewhat Khmer-aware.

Next, I think that pre-tokenization syllable re-ordering will be better but more difficult to implement, so I'm planning to test both pre-tokenization and post-tokenization re-ordering and compare them to the baseline and to each other.

Current Baseline Khmer Analysis

The baseline Khmer analysis chain consists of ICU tokenization and ICU normalization. From the ICU normalization we see the usual things, like cʰay normalizing to chay, ß to ss, µ to μ, and stripping of bi-directional markers.

The ICU tokenizer apparently uses a Khmer dictionary for known words and a Khmer syllable detector for unknown words. As a result, it generates plenty of multi-syllable tokens, like អាថ៌កំបាំង ("mystery").
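
For orientation, a baseline chain of this shape could be declared roughly as below. This is a sketch written as a Python dict that serializes to Elasticsearch index settings. The analyzer name `khmer_baseline` is made up, and the real CirrusSearch configuration may well differ; the only parts taken as given are the `icu_tokenizer` tokenizer and `icu_normalizer` token filter provided by the ICU analysis plugin.

```python
import json

# Hypothetical index settings for an ICU-tokenizer + ICU-normalizer analyzer.
baseline_settings = {
    "analysis": {
        "analyzer": {
            "khmer_baseline": {          # illustrative name, not from the source
                "type": "custom",
                "tokenizer": "icu_tokenizer",
                "filter": ["icu_normalizer"],
            }
        }
    }
}

print(json.dumps(baseline_settings, indent=2))
```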

There are some distinct tokens that are normalized the same by the current baseline analysis.

  • The Khmer dictionary matching seems to ignore zero-width joiners, zero-width non-joiners, soft-hyphens, invisible separators, and inherent vowels.
  • ICU normalization strips out zero-width joiners, zero-width non-joiners, soft-hyphens, invisible separators, and inherent vowels.
    • Invisible separators (also called invisible commas) are intended to separate elements like mathematical indexes that are written without separation (e.g., the subscript ij in x_ij). Fun fact: "invisible times" and "invisible plus" also exist.
    • Inherent vowels are invisible Khmer vowels that match the vowel inherent in an unmodified character as an aid to transliteration.

Soft hyphens are something I hadn't considered, and the invisible separator is something I hadn't even heard of before! Both are relatively rare—as are inherent vowels—but because they are normally invisible, a few slip by.

I added soft-hyphens and the invisible separator to my command line tool, updated the algorithm above, and added some soft-hyphen examples to the examples page. Like other invisible zero-width elements, they seem to creep in by accident sometimes—what with being invisible and all—or be intended to control typographic ligatures.
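
Stripping the invisible characters discussed here is straightforward in isolation. The sketch below uses `str.translate` with a deletion table; the name `strip_invisible` is hypothetical, and the character set is the one listed in this document (zero-width space, non-joiner, and joiner, soft hyphen, invisible separator, and the two Khmer inherent vowels).

```python
# Map each invisible code point to None so that str.translate deletes it.
INVISIBLE = dict.fromkeys(
    map(ord, "\u200b\u200c\u200d\u00ad\u2063\u17b4\u17b5"))

def strip_invisible(text):
    """Remove zero-width elements and Khmer inherent vowels from text."""
    return text.translate(INVISIBLE)

with_soft_hyphen = "\u1780\u00ad\u1781"   # ក + soft hyphen + ខ
with_inv_sep = "\u1780\u2063"             # ក + invisible separator

print(strip_invisible(with_soft_hyphen) == "\u1780\u1781")  # True
print(strip_invisible(with_inv_sep) == "\u1780")            # True
```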

Post-Tokenization Re-Ordering

Re-ordering syllables after tokenization is probably easier to implement than pre-tokenization re-ordering, because all the necessary bookkeeping to index the token back into the original string has been done during tokenization. A token filter can do anything it wants to the token without changing that bookkeeping.

Re-ordering tokens doesn't change the final number of tokens. In the 5K Wikipedia article sample, there are 3,085,383 tokens with or without re-ordering. The 5K Wiktionary sample has 127,092 tokens.

The number of types (unique tokens) decreases with re-ordering, as expected, as some re-orderings merge tokens. In the Wikipedia sample, the number decreased from 101,859 to 101,075 unique (post-analysis) types.

The decrease in types is all due to collisions (i.e., tokens merging). In the Wikipedia data, 775 pre-analysis types (0.716% of pre-analysis types) / 19,506 tokens (0.632% of tokens) were added to 716 groups (0.703% of post-analysis types), affecting a total of 1,501 pre-analysis types (1.387% of pre-analysis types) in those groups.

In the Wiktionary data, the impact was smaller: 41 pre-analysis types (0.218% of pre-analysis types) / 1,150 tokens (0.905% of tokens) were added to 41 groups (0.220% of post-analysis types), affecting a total of 82 pre-analysis types (0.437% of pre-analysis types) in those groups.

That is, overall less than 1% of tokens merged with other tokens as a result of re-ordering.
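
The difference between tokens and types under post-tokenization re-ordering can be shown with a toy example. In this sketch, `repair_split_vowels` is a simplified stand-in for the full re-ordering step, and the token list is invented: the token count stays the same while the number of distinct types drops as variants collide.

```python
from collections import Counter

def repair_split_vowels(token):
    """Toy normalizer: rejoin two kinds of split vowels."""
    return (token.replace("\u17c1\u17b8", "\u17be")    # េ + ី -> ើ
                 .replace("\u17c1\u17b6", "\u17c4"))   # េ + ា -> ោ

tokens = [
    "\u1780\u17be",          # កើ
    "\u1780\u17c1\u17b8",    # កេី (split vowel)
    "\u1780\u17c4",          # កោ
    "\u1780\u17c1\u17b6",    # កេា (split vowel)
    "\u1780\u17be",          # កើ again
]

before = Counter(tokens)
after = Counter(repair_split_vowels(t) for t in tokens)

print(sum(before.values()), sum(after.values()))  # 5 5  (tokens unchanged)
print(len(before), len(after))                    # 4 2  (types merged)
```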

Interesting side note: Looking at the tokens that are re-ordered, it is apparent that the ICU Tokenizer also has trouble with syllable boundaries caused by missing-consonant typos.

Pre-Tokenization Re-Ordering

Re-ordering syllables before tokenization is probably harder to implement than post-tokenization re-ordering, because there is a lot of necessary bookkeeping to index the token back into the original string.

Since I used my command line tool to process the text before sending it to the analyzer for this test, a lot of that bookkeeping information is lost, and we can't easily map the final re-ordered tokens back to their original strings. Instead, we have to look at "lost" and "found" tokens and changes in token counts that point to tokenization differences.

In general, "lost" tokens are those that are present in the baseline sample but not the re-ordered sample, and "found" tokens are vice versa.
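
A "lost and found" comparison of this kind can be sketched with two token counters. The function name `compare_tokens` and the token lists are invented for illustration; the Khmer strings are glossed in the comments.

```python
from collections import Counter

def compare_tokens(baseline, reordered):
    """Return (lost, found, count_changes) between two token streams."""
    before, after = Counter(baseline), Counter(reordered)
    lost = set(before) - set(after)
    found = set(after) - set(before)
    changes = {t: after[t] - before[t]
               for t in set(before) & set(after) if after[t] != before[t]}
    return lost, found, changes

gooseberry = "\u1780\u1793\u17d2\u1791\u17bd\u178f"  # កន្ទួត
wild = "\u1796\u17d2\u179a\u17c3"                    # ព្រៃ
split_form = "\u1780\u17c1\u17b8"                    # កេី (split vowel)
repaired = "\u1780\u17be"                            # កើ

baseline = [gooseberry, gooseberry, wild, wild, split_form]
reordered = [gooseberry, gooseberry + wild, wild, repaired]

lost, found, changes = compare_tokens(baseline, reordered)
print(lost == {split_form})                     # True
print(found == {gooseberry + wild, repaired})   # True
print(changes == {gooseberry: -1, wild: -1})    # True
```

Without the original offsets, a drop in one token's count and the appearance of a longer token are the only hints that the two are related.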

With re-ordering, there are 112,970 fewer tokens (-3.661%, out of 3,085,383) in the Wikipedia data. It's not that 3.7% of tokens changed, but rather that the changes allowed longer tokens to be recognized by the ICU Tokenizer's Khmer dictionary. The effect on the Wiktionary data is even larger, with 13,642 fewer tokens (-10.734% of 127,092).

The overall impact of the re-ordering is quite large. However, because the original strings are not tracked, we can't directly see the new collisions (merged tokens).

There are a small number of apparent splits (less than 0.1%) in both the Wikipedia and Wiktionary data caused by incidental "clean up" from the command line tool, which removes many of the invisible characters (zero-width joiners, zero-width non-joiners, soft-hyphens, invisible separators, and inherent vowels).

Looking at token count changes gives a better idea of the impact. For the Wikipedia data:

  • 10,417 pre-analysis types (9.623% of pre-analysis types) gained 124,310 tokens (4.029% of tokens) across 10,417 groups (10.227% of post-analysis types).
    • That is, about 4% of tokens were merged with other tokens, and the mergers affected about 10% of the unique tokens. These seem to be either re-ordered tokens that merged with other, existing tokens, or longer words recognized by the ICU Tokenizer's Khmer dictionary after re-ordering.
  • 6,240 pre-analysis types (5.764% of pre-analysis types) lost 215,730 tokens (6.992% of tokens) across 6,240 groups (6.126% of post-analysis types).
    • That is, about 7% of tokens disappeared from the about 6% of groups they were in, either because the old form disappeared after re-ordering, or because the token became part of a longer token.

Some examples:

  • The token (ក + [InvSep]) no longer exists because the invisible separators were removed by the command line tool.
  • The token ឆាំ្ម1884 (ឆ + ា + ំ + ្ + ម) no longer exists because it has been replaced with ឆ្មាំ1884 (ឆ + ្ + ម + ា + ំ). Depending on font, these can look identical (or not).
  • The count for the token ព្រៃ (ព + ្ + រ + ៃ), meaning "wild", decreased from 1,444 to 1,240. The token count for កន្ទួត (ក + ន + ្ + ទ + ួ + ត), meaning "gooseberry", decreased from 27 to 26.
    • The removal of a zero-width space between one of each allowed for the creation of the new token កន្ទួតព្រៃ, "wild gooseberry". Other instances of ព្រៃ ("wild") similarly ended up in other, longer tokens.
  • The split vowel version of ប៉ុណ្ណេាះ (ប + ៉ + ុ + ណ + ្ + ណ + េ + ា + ះ) ("only that much") gets incorrectly split into two tokens (ប៉ុ and ណ្ណេាះ), while the repaired version, ប៉ុណ្ណោះ (ប + ៉ + ុ + ណ + ្ + ណ + ោ + ះ), is treated as a single token.

Another source of lost tokens is numerals, which can be tokenized as attached to the words around them when invisible characters are removed, for example, ពី200ពាន់ ("from 200 thousand").

Interesting side note A: Both Arabic numerals (0123456789) and Khmer numerals (០១២៣៤៥៦៧៨៩) are treated the same, so it makes sense to map one to the other for indexing purposes.
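
Mapping Khmer numerals to Arabic numerals is a simple one-to-one character substitution. Here is a minimal Python sketch (the names are illustrative); in an Elasticsearch analysis chain, a mapping character filter would be one plausible way to do the same thing.

```python
KHMER_DIGITS = "\u17e0\u17e1\u17e2\u17e3\u17e4\u17e5\u17e6\u17e7\u17e8\u17e9"
TO_ARABIC = str.maketrans(KHMER_DIGITS, "0123456789")

def map_numerals(text):
    """Replace Khmer digits (០ to ៩) with Arabic digits (0 to 9)."""
    return text.translate(TO_ARABIC)

print(map_numerals("\u17e1\u17e8\u17e8\u17e4"))  # 1884
```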

Interesting side note B: Google Chrome does some sort of normalization funny business when searching within a page. សំស្ក្រឹត and សំស្ក្រឹតៈ are treated equivalently. (So, this is a chance to plug one of my favorite Chrome extensions: Chrome Regex Search—which searches for exactly what you type (in addition to matching regexes), which matters when you are working with anything other than flat ASCII.)

No additional insights came from comparing pre-tokenization re-ordering and post-tokenization re-ordering directly. I didn't expect anything, but I had to check.

Next Steps

  • ✔︎ Get a sense of the scope of the typo-caused incorrect syllable boundary problem. The complexity of fixing it and the scope of the problem will help determine whether it's worth it to try to handle these cases.
    • DONE: The syllable boundary error rate is less than 0.001% of all syllables, so it isn't a huge concern. I will address it if it is easy to do so, but I won't worry about it too much, since these are caused by typos—which often lead to strange results—and we can fix 250x as many currently unfindable syllables.
  • ✔︎ Gather query data and check the prevalence of syllable re-ordering in queries. (I expect it will be a bit higher, as queries generally are less carefully/formally written.)
    • DONE: See Quick Review of Khmer Wikipedia Queries above.
  • ✔︎ Test the effects of re-ordering on tokenization and matching of re-ordered words. I'll look at re-ordering using the command line prototype both before tokenization and after, and see how big a difference it makes on the results. (My guess is that pre-tokenization re-ordering will be much better, but it's complex enough that it's worth it to see how big a difference it makes.)
    • DONE: See Command Line Re-Ordering Analysis Analysis above.
  • ✔︎ Create a custom Khmer analysis chain.
    • Create a character filter plugin to re-order Khmer syllables (and do all necessary bookkeeping—sigh) and test adding it to the Khmer analysis chain.
    • Test adding a character filter to map Khmer numerals (០១២៣៤៥៦៧៨៩) to Arabic numerals (0123456789) to the Khmer analysis chain.
    • DONE: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/659369

Some References

These are some references that have proven useful to me in trying to figure out Khmer syllables.