User:TJones (WMF)/Notes/Khmer Reordering/Examples
Below are examples of Khmer syllables that I found in a sample of 5,000 Khmer Wikipedia articles and automatically re-ordered. They are divided into groups that have similar re-ordering. I'm looking for feedback on what is right and what is wrong, and advice on how to fix the things that are wrong.
The groups are sorted by how much help I need understanding them. The ones that are the most confusing to me are listed first.
These are only a (diverse) sample of all the syllables I found and re-ordered. Many more examples are on the Khmer Reordering/Examples/More sub-page.
The columns are:
- rewritten, the re-ordered version of the syllable, expanded out so all the elements are visible.
- original, the syllable as found on Khmer Wikipedia.
- context, a selection of text containing the original syllable.
- The original syllable is highlighted in red. Finding and highlighting the original syllable was done automatically, so there may be errors.
- The entire context is a link to Khmer Wikipedia, which should bring up a link to the original article containing the text. Of course, there may be no result because the original article has changed since I took the sample.
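The expanded form in the first two columns can be sketched roughly as follows. This is a minimal illustration, not the script used to build the tables: the function name `expand` is hypothetical, and it assumes a subscript consonant is encoded as the Khmer sign coeng (U+17D2) followed by a consonant and shown as one element, as it is in the tables.

```python
# Labels for invisible characters, matching the bracketed names used in the tables.
INVISIBLE_LABELS = {
    "\u200b": "[ZWSP]",
    "\u200c": "[ZWNJ]",
    "\u200d": "[ZWJ]",
}
COENG = "\u17d2"  # Khmer sign coeng; marks the next consonant as a subscript


def expand(syllable):
    """Return 'syllable ( a + b + c )' with every element made visible."""
    parts = []
    i = 0
    while i < len(syllable):
        ch = syllable[i]
        if ch == COENG and i + 1 < len(syllable):
            # Keep coeng together with the consonant it subscripts.
            parts.append(syllable[i:i + 2])
            i += 2
        else:
            parts.append(INVISIBLE_LABELS.get(ch, ch))
            i += 1
    return f"{syllable} ( {' + '.join(parts)} )"
```

For example, `expand("\u1780\u17d2\u179a\u17b6")` lists three elements: the base consonant, the coeng + consonant pair, and the vowel.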
???
These syllables don't actually have a lot in common other than they are confusing to me and I don't know what to make of them. Perhaps these are not actually single syllables and I have found incorrect syllable boundaries, or they are typing mistakes in the original text, or something else is going on. Any ideas on how to treat these correctly would be appreciated!
Update: After speaker review, I've split this table into three. The first has the one that is still confusing (it has both แ and แ as subscript consonants; they look the same as subscripts). The second table has the ones that are split into syllables incorrectly because of typos, so I know I need to work on those. The last table has the ones that look funny to me, but are probably reasonably re-ordered.
rewritten | original | context |
---|---|---|
แแแแแแถ ( แ + แแ + แแ + แถ ) | แแแแถแแ ( แ + แแ + แถ + แแ ) | ...แแถแแแผแแแแแแแแแแถแแแแถแแแแแโแแแแแแแแแถแแฎ... |
rewritten | original | context |
---|---|---|
แแปแถแ ( แ + แป + แถ + แ ) | แแปแแถแ ( แ + แป + แ + แถ + แ ) | ...แแ แก แแแแปแโแฑแแโ แแปแแถแแโแ แทแแแโแฑแแโแแแแ... |
แแธแถแ ( แ + แธ + แถ + แ ) | แแธแแถ ( แ + แธ + แ + แถ ) | ...แ + แแแแถแ โแโ แแแแธแแถแแแแแถ แแแ แแบ แแ... |
แแแถแ ( แ + แ + แถ + แ ) | แแแแถ ( แ + แ + แ + แถ ) | ...แแ แถแแแแ แแแธแ แขแถแแแแถแแแแธแ แกแถแแแ... |
แ แแแแแแธ ( แ + แแ + แ + แแ + แธ ) | แ แแแแธแแ ( แ + แแ + แ + แธ + แแ ) | ...แแฝแแแถแแฑแแแแนแแแถ แ แแแแธแแแ แแบแแถแแถแแแถแ แแ... |
แแแแ ( แ + แแ + แ ) | แโโแแแ ( แ + [ZWSP] + [ZWSP] + แแ + แ ) | ...แแแถแแปโแฏแแแถแโแแปแแโโแแแแขแแแโแแแโแขแถแ โแ... |
แแแแ ( แ + แแ + แ ) | แโแแแ ( แ + [ZWSP] + แ + แแ ) | ...แโแแแโแแพแแแแธโแแนแโแแแโแแฝแโแ... |
Questionably Reordered Syllables
These seem to be in the correct order according to the rules I have found, but they look different in all or most fonts.
These usually include แ, or แแ (though the first few include แแ, แแ, and แแ). My best guess is that I have found incorrect syllable boundaries, but I don't know what the right thing to do is.
There are additional samples like these on the Khmer Reordering/Examples/More page.
Update: After speaker review, I've split the table into two. The first has syllable boundary errors, as in the ??? section above, which I know I need to work on. The second has the ones where the sub-consonant is after the vowel, and even though it renders differently for me, it is probably reasonable to re-order them.
Visible Duplicates
These multiple vowels and other diacritics always show up in all the fonts I have tried. My understanding is that each syllable should have only one dependent vowel. These have multiple dependent vowels (and one has duplicated แ). I don't think they are mistakes because the duplicates are easy to see when typing. Maybe they look correct using a font or operating system I don't have.
There are additional samples like these on the Khmer Reordering/Examples/More page.
Update: After speaker review, these are probably reasonably re-ordered.
Duplicate Subscript Consonants
Depending on the font, these duplicates are sometimes visible, sometimes not. So, I think they are rewritten correctly, but I want to make sure.
Update: After speaker review, these are probably reasonably re-ordered.
Original Is More Common
These look the same or very similar when rewritten, but the rewritten form is much less common in my sample. Some of these have hundreds more instances of the "original" form than the "rewritten" form. Others appear 3 or 4 times, but only as the "original" form. This makes me worry that there is something wrong with the way I'm re-ordering them, though I think they are correct.
There are additional samples like these on the Khmer Reordering/Examples/More page.
Update: After speaker review, these are probably reasonably re-ordered.
Consonant Swaps
These all have แแ before another subscript consonant. As far as I can tell, แแ should always be the third consonant if there are three consonants. In some fonts, the original form doesn't render properly, so I think these are correct.
There are additional samples like these on the Khmer Reordering/Examples/More page.
Update: After speaker review, these are probably reasonably re-ordered.
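The swap described in this section can be sketched as below. This is only an illustration under a loud assumption: the subscript named in the text did not survive in this copy of the page, and the sketch assumes it is subscript ro (coeng U+17D2 followed by U+179A). The function name is hypothetical, and only the simple case of one subscript directly before another is handled.

```python
import re

COENG = "\u17d2"  # Khmer sign coeng
RO = "\u179a"     # Khmer letter ro (assumed to be the subscript in question)

# Subscript ro followed directly by another subscript consonant (U+1780..U+17A2).
RO_FIRST = re.compile(f"({COENG}{RO})({COENG}[\u1780-\u17a2])")


def move_subscript_ro_last(syllable):
    """Swap 'subscript ro + other subscript' so that subscript ro comes second."""
    return RO_FIRST.sub(r"\2\1", syllable)
```

A syllable that already has subscript ro after the other subscript consonant is returned unchanged.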
rewritten | original | context |
---|---|---|
แแแแแ ( แ + แแ + แแ ) | แแแแแ ( แ + แแ + แแ ) | ...แแแแแแแแฝแแแแแแแแแแแแ แแพแแแแธแแทแแแแถแแ... |
แแแแแแ ( แ + แแ + แแ + แ ) | แแแแแแ ( แ + แแ + แ + แแ ) | ...แแแแแฝแแถแ แแทแแแแแแแแแแ แแแแแแแแ แฑแแแแ... |
แแแแแแถ ( แ + แแ + แแ + แถ ) | แแแแแแถ ( แ + แแ + แแ + แถ ) | ...แแ แแถแแ - แแแ แแแแแแแแถ- แแแ แแนแแแแปแ -... |
แแแแแแแ ( แ + แแ + แแ + แ + แ ) | แแแแแแแ ( แ + แแ + แแ + แ + แ ) | ...แแแถแแแแแแแแแแแแแแแแแแแแแแถแแแแแถแแแ... |
แแแแแแถ ( แ + แแ + แแ + แถ ) | แแแแถแแ ( แ + แแ + แถ + แแ ) | ...แขแแแแแแแฏแแแถแขแแแแแแแแถแแแแแ แถแ แแแแแ แฑแแแข... |
แแแ แแแถ ( แ + แแ + แแ + แถ ) | แแแแถแแ ( แ + แแ + แถ + แแ ) | ...แแแถแแฅแแแแทแแแแแแแแแแถแแ แแแถแแ... |
แแแ แแแถแ ( แ + แแ + แแ + แถ + แ ) | แแแแแ แถแ ( แ + แแ + แแ + แถ + แ ) | ...แแผแแแถแแแแแแแแฝแแ แทแแแแแ แถแ แแแแนแแแแแถแแแทแแ... |
แแแ แแแ ( แ + แแ + แแ + แ ) | แแแแแ แ ( แ + แแ + แแ + แ ) | ...แแแแทแแ แแแแแแถ แแแแแแ แแ แฌ แแแขแธโ แฃ แแ... |
แแแแแแ ( แ + แแ + แแ + แ ) | แแแแแแ ( แ + แแ + แแ + แ ) | ...แแปแแแธแแแแแแแถแ แแแแแแแแ แแ แแแแถแแแแโ แแถ... |
แแแแแแถ ( แ + แแ + แแ + แถ ) | แแแแแแถ ( แ + แแ + แแ + แถ ) | ...แแแแแถแแแแแแถแแแแแทแแแแแแถแ แพแ แแถแแแแแแแแถแแ... |
แแแแแแ ( แ + แแ + แแ + แ ) | แแแแแแ ( แ + แแ + แ + แแ ) | ...แแทแแแแแแแแแแแแแแแแแแแแแแแถแแ แแถแแแแแแป... |
แแแแแแธ ( แ + แแ + แแ + แธ ) | แแแแธแแ ( แ + แแ + แธ + แแ ) | ...แแแ แแฝแแแถแแฝแแแแแแแแแธแแแแแแแแแแปแ - แแแ... |
แแแแแแ ( แ + แแ + แแ + แ ) | แแแแแแ ( แ + แแ + แ + แแ ) | ...แแแแแถแแถแแแแแทแแขแ แทแแแแแแแแแแแแแแแแแแทแแปแ... |
แแแแแแ ( แ + แแ + แแ + แ ) | แแแแแแ ( แ + แแ + แแ + แ ) | ...แแแแแบแแแแพแแแทแแขแ แทแแแแแแแแแ แแทแ แแ แแทแแแแ... |
แแแแแ ( แ + แแ + แแ ) | แแแแแ ( แ + แแ + แแ ) | ...โแแแแแโแแแแปแโแแ แทแแแแแแแปแแธโ แแแแถแแถแโแแถ... |
แแแแแแถ ( แ + แแ + แแ + แถ ) | แแแแถแแ ( แ + แแ + แถ + แแ ) | ...แแฝแแแแ แผแแแแแแแแถแแแแแถแแแแถแแแแธ แแ... |
แแแแแแธ ( แ + แแ + แแ + แธ ) | แแแแธแแ ( แ + แแ + แธ + แแ ) | ...แแแแแแถแแแแปแแแแแแฅแแแแธแแแแธแแแแแแแฝแแแผ แแ... |
แแแแแแ ( แ + แแ + แแ + แ ) | แแแแแแ ( แ + แแ + แ + แแ ) | ...1.แแแแแแแ... |
แแแแแแถ ( แ + แแ + แแ + แถ ) | แแแแถแแ ( แ + แแ + แถ + แแ ) | ...แแแ แแถแแแแแแแแปแแแแแถแแแแ แแถแแผ แแแแ แถแแแ ... |
แแแแแ ( แ + แแ + แแ ) | แแแแแ ( แ + แแ + แแ ) | ...แแแแแนแแแแทแแ แแแ แแแแแแแ แแแแถแแ... |
แแแแแแถ ( แ + แแ + แแ + แถ ) | แแแแถแแ ( แ + แแ + แถ + แแ ) | ...แแแแถแแแแถแแแ แถแแแ... |
แแแแแแถ ( แ + แแ + แแ + แถ ) | แแแแแแถ ( แ + แแ + แแ + แถ ) | ...แแแ แแฝแ โแแแแแถโแแถแแแแแแถโแขแถแแทแแแแโแแแแถแ แ... |
แแแแแแธ ( แ + แแ + แแ + แธ ) | แแแแธแแ ( แ + แแ + แธ + แแ ) | ...แแแแถแแแแแแผแแแแแธแแแแแแถแแแแแแผแแ แแแ... |
แแแแแแถ ( แ + แแ + แแ + แถ ) | แแแแถแแ ( แ + แแ + แถ + แแ ) | ...แแแแแ แแแแแธแ แแถแแแแถแแแแแแนแแแนแแแถแแพแแ... |
แแแแแแถ ( แ + แแ + แแ + แถ ) | แแแแแแถ ( แ + แแ + แแ + แถ ) | ...แ แแแแแ แฏแแแถแ แแถแแแแแแถ แแแแแธ แแแแฝแแแแแถ... |
แแแแแแธ ( แ + แแ + แแ + แธ ) | แแแแธแแ ( แ + แแ + แธ + แแ ) | ...แแถแแแแแแแปแแ แแทแ แแแแธแแแแ แแแแปแแแแแปแแ แแแฝ... |
แแแแแแธ ( แ + แแ + แแ + แธ ) | แแแแแแธ ( แ + แแ + แแ + แธ ) | ...แแ,แแถแแฝแแแถแแแแถแแแแแแธแ แ แพแแแทแแฅแแถแแแแแถ... |
แแแแแแธ ( แ + แแ + แแ + แธ ) | แแแแแแธ ( แ + แแ + แแ + แธ ) | ...แแแโแแโแแ โแแแแแ แแแแแแธแโแ แแแแ... |
แ แแแแ ( แ + แแ + แแ ) | แ แแแแ ( แ + แแ + แแ ) | ...แแธ13แแบแ แแแถแ แแ แกแ แ แแแแแแแแแปแแแแแถแแแแแแฝ... |
แ แแแแแถแ ( แ + แแ + แแ + แถ + แ ) | แ แแแแแถแ ( แ + แแ + แแ + แถ + แ ) | ...แ แแแธแแถแ, frein แ แแแแแถแแ, cafรฉ =แแถแ แแแ... |
แ แแแแแท ( แ + แแ + แแ + แท ) | แ แแแทแแ ( แ + แแ + แท + แแ ) | ...แแแถแแแแถแแแแแแปแแขแถแ แแแทแแแ แขแถแแแธ แแทแ แขแถแแ... |
แ แแแแแท ( แ + แแ + แแ + แท ) | แ แแแแแท ( แ + แแ + แแ + แท ) | ...แ แแแผแแ แแแแกแแ แขแถแ แแแแแทแแแถแแแแแผแ แแแธแแแถ... |
แ แแแแแแท ( แ + แแ + แแ + แ + แท ) | แ แแแแแแท ( แ + แแ + แแ + แ + แท ) | ...แแแแแแแปแแแแแแแแขแถแ แแแแแแทแ แแแแถแแแแแแแถแแ... |
แ แแแแแ ( แ + แแ + แแ + แ ) | แ แแแแแ ( แ + แแ + แ + แแ ) | ...แ แแแ แแแฝแ แแแทแ แแ แ แแแแแแแ แแแผแ แแแแแแ แขแถ... |
Zero-Width Spaces & (Non-)Joiners
These have U+200B (zero-width space [ZWSP]), U+200C (zero-width non-joiner [ZWNJ]), or U+200D (zero-width joiner [ZWJ]) in them, which I believe is intended to change the rendering (but not the meaning) of diacritics or other elements. The rewritten form here isn't necessarily better, but I think it is the form that should be indexed for search.
Update: After speaker review, these are probably reasonably re-ordered. (They may be typos or they may be intended to control ligatures, but either way, the zero-width elements should not affect meaning, so they should be ignored, especially in the cases where they don't change the meaning.)
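Ignoring the zero-width elements for indexing could look something like the sketch below. The function name is hypothetical, and this is only one simple way to drop the three code points listed above; it is not necessarily how the actual re-ordering handles them.

```python
# Map each zero-width code point to None so that str.translate deletes it.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d"))


def strip_zero_width(text):
    """Remove ZWSP, ZWNJ, and ZWJ, leaving everything else untouched."""
    return text.translate(ZERO_WIDTH)
```

Because the characters are deleted rather than replaced, two forms that differ only in zero-width elements end up identical in the index.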
Soft Hyphens
NEW! The ICU tokenizer for Khmer ignores soft hyphens (U+00AD), so we should too. These all seem reasonable.
Split Vowels
These have េ + ា or េ + ី (or ី + េ) instead of ោ and ើ. Since they look the same, I assume that the single vowel form is correct. In some fonts ី + េ does not render properly, so I think swapping them is correct.
There are additional samples like these on the Khmer Reordering/Examples/More page.
Update: After speaker review, these are probably reasonably re-ordered.
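A minimal sketch of merging split vowels follows. It assumes the pairs in question are U+17C1 + U+17B6 (merged to U+17C4) and U+17C1 + U+17B8 in either order (merged to U+17BE); the names `SPLIT_VOWELS` and `merge_split_vowels` are made up for the illustration.

```python
import re

E, AA, II = "\u17c1", "\u17b6", "\u17b8"  # vowel signs e, aa, ii
OO, OE = "\u17c4", "\u17be"               # single-code-point vowels oo, oe

# Each two-character sequence and the single vowel that looks the same.
SPLIT_VOWELS = {
    E + AA: OO,
    E + II: OE,
    II + E: OE,  # reversed order, which fails to render in some fonts
}
SPLIT_PATTERN = re.compile("|".join(map(re.escape, SPLIT_VOWELS)))


def merge_split_vowels(text):
    """Replace each split-vowel pair with its single-vowel equivalent."""
    return SPLIT_PATTERN.sub(lambda m: SPLIT_VOWELS[m.group(0)], text)
```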
Invisible Duplicates
In many fonts I looked at, these multiple vowels, subscript consonants, or other multiple diacritics render only once, so I take these to be mistakes that should be de-duplicated.
There are additional samples like these on the Khmer Reordering/Examples/More page.
Update: After speaker review, these are probably reasonably re-ordered.
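De-duplication of this kind might be sketched as below. The assumptions are mine: a "duplicate" is taken to be the same element repeated immediately, diacritics are taken to be the range U+17B6 to U+17D1, and a subscript consonant is coeng (U+17D2) plus a consonant. The sketch collapses every such repeat and makes no attempt to tell invisible duplicates from visible ones.

```python
import re

# One element: a subscript consonant (coeng + consonant) or a single
# dependent vowel / sign, followed by one or more exact repeats of itself.
DUPLICATE = re.compile("(\u17d2[\u1780-\u17a2]|[\u17b6-\u17d1])" + r"\1+")


def dedupe(text):
    """Collapse immediately repeated vowels, signs, or subscript consonants."""
    return DUPLICATE.sub(r"\1", text)
```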
Reordered Syllables
These seem to be reasonably reordered. These are the ones I am most confident in because they always look the same, or the original renders incorrectly in certain fonts. This is the largest group, but I hope these are easy to review because they are mostly correct!
There are a lot of additional samples like these on the Khmer Reordering/Examples/More page. Part 1, Part 2, Part 3.
Update: After speaker review, these are probably reasonably re-ordered.