Thanks so much, @Eltimbalino!
> But I'm sure I'm better than nothing.
Your help is much, much better than nothing—and much appreciated!
???
The general pattern I’m getting for a lot of these is that if there is a second vowel (other than the ones that can form a split vowel, like េ and ា), we should consider it a different syllable. Sound right?
Alternatively, we could say that typos are typos and they mess things up, and whatever happens, happens. (I’d prefer to fix things when I can, but I may have more limitations in the final implementation than I have in this prototype.)
Cases like ញុាំ ( ញ + ុ + ា + ំ ) / ញុំាំ ( ញ + ុ + ំ + ា + ំ ) and ឆ្មាំ ( ឆ + ្ម + ា + ំ ) / ឆាំ្ម ( ឆ + ា + ំ + ្ម ) render just differently enough for me not to be sure. I’ll be a little more forgiving about the ones that are very close.
I need to think about this section more when I have more time—definitely on Monday.
Questionably Reordered Syllables
It sounds like I should treat the “vowel + sub-consonant” ones as correct to re-order. I was unsure about them because they don’t render the same (for me) in the two different orders, unlike some others.
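To make the “vowel + sub-consonant” case concrete, here is a rough sketch of the kind of re-ordering I have in mind. It is only an illustration, not the actual implementation: the function name is made up, and I am assuming that the dependent vowels are U+17B6 to U+17C5, the consonants are U+1780 to U+17A2, and the subscript marker (coeng) is U+17D2.

```python
import re

# Assumed code point ranges (my reading of the Khmer block, not a tested spec):
#   dependent vowels  U+17B6..U+17C5
#   consonants        U+1780..U+17A2
#   coeng             U+17D2 (marks the following consonant as a subscript)
VOWEL_THEN_SUBSCRIPTS = re.compile(
    "([\u17B6-\u17C5])((?:\u17D2[\u1780-\u17A2])+)"
)

def move_subscripts_before_vowel(text: str) -> str:
    """Hypothetical helper: put sub-consonants ahead of a preceding vowel."""
    # Swap the two captured groups: subscripts first, then the vowel.
    return VOWEL_THEN_SUBSCRIPTS.sub(r"\2\1", text)
```

Text that already has the sub-consonant before the vowel is left alone, so running this on correctly ordered text should be harmless.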
> Basically, if it fails to render, then it is incorrect and probably a typo.
The problem I’m having is that rendering seems to be very font-specific, and even application-specific; I’m working on a Mac and TextEdit sometimes renders the same fonts differently than Chrome!
The rest that don’t ever render correctly I’ll move up into the ??? section for more thinking.
Duplicate Supplementary Consonants
> but it doesn't render, I'm going to deduce that it is always wrong because what is the point of typing a character that is never seen?
I agree! But the problem, again, is that different fonts render differently. So I’ll take it that if I have a font that doesn’t render them both, it’s okay to deduplicate.
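As a sketch of what that deduplication could look like (the function name is hypothetical, and I am again assuming consonants are U+1780 to U+17A2 and coeng is U+17D2), only exact repeats of the same sub-consonant get collapsed:

```python
import re

# A coeng + consonant pair followed by one or more exact copies of itself.
DUPLICATE_SUBSCRIPT = re.compile("(\u17D2[\u1780-\u17A2])\\1+")

def dedupe_subscripts(text: str) -> str:
    """Hypothetical helper: collapse repeated identical sub-consonants."""
    return DUPLICATE_SUBSCRIPT.sub(r"\1", text)
```

Two different sub-consonants in a row, such as ្ត followed by ្ដ, are not touched by this, since they are not exact duplicates.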
(As a side note, this one—ស្ត្ដា ( ស + ្ត + ្ដ + ា ) / ស្តា្ដ ( ស + ្ត + ា + ្ដ )—from the ??? section is listed there not here because the sub-consonants are ត and ដ!)
Visible Duplicates / Consonants Swaps / Invisible Duplicates / Reordered Syllables
Good news! Woo hoo!
Zero-Width (Non-)Joiners
My info (Unicode spec (PDF), page 382) says they are used to control ligatures in Muul/Muol/Mool–type fonts, and to keep muusikatoan or triisap from being subscripts (which also varies by font). My plan is to just ignore them.
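Ignoring them could be as simple as dropping the two characters before anything else happens. A minimal sketch, with a made-up function name, assuming only U+200C (zero-width non-joiner) and U+200D (zero-width joiner) need to go:

```python
# Map both code points to None so str.translate deletes them.
ZERO_WIDTH_JOINERS = {0x200C: None, 0x200D: None}

def strip_joiners(text: str) -> str:
    """Hypothetical helper: remove ZWNJ and ZWJ from the text."""
    return text.translate(ZERO_WIDTH_JOINERS)
```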
Split Vowels
So it sounds like merging these is a good thing. If someone were using a different keyboard they might type េ + ា and not realize they were still two separate characters because they look like ោ.
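Here is a small sketch of that merge for the one pair mentioned above, assuming េ is U+17C1, ា is U+17B6, and ោ is U+17C4. The function name is only for illustration, and other split vowels would need their own entries:

```python
# Two-character sequence -> single combined vowel (only the pair discussed here).
SPLIT_VOWELS = {
    "\u17C1\u17B6": "\u17C4",  # េ + ា -> ោ
}

def merge_split_vowels(text: str) -> str:
    """Hypothetical helper: replace split-vowel pairs with the single character."""
    for pair, combined in SPLIT_VOWELS.items():
        text = text.replace(pair, combined)
    return text
```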
---
> Even if mediawiki were to run a script that corrected everything to be in an approved sequence, that would be only half of the battle. That script would also need to be run on any search phrase before the normal processes took over.
Ahh! You’ve hit on the crux of the problem—and there is a plan! I don’t actually plan to correct the text in the articles. That would be a never-ending task, since people would always be adding new content that could have differently ordered text. (It may be possible to have something like a spell-checker that corrects text as people type, but that’s far outside my area of expertise and there may be rare cases where you wouldn’t want to make those corrections.)
Instead, the plan is to re-order the text on the way into the search index. Both article text and search queries get the same treatment, so everything would match! (We do the same kind of thing for English, for example, just much less complicated: we lowercase and strip diacritics before putting things in the index, so Einstein matches ÉÎÑSTËÌŃ.)
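The English case is easy to show in a few lines. This is a toy illustration, not how the real search index works: `fold` and the dictionary standing in for an index are made up, and the key point is only that the same function runs on both the indexed text and the query.

```python
import unicodedata

def fold(text: str) -> str:
    """Toy normalizer: strip diacritics, then lowercase."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn").
    base = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return base.lower()

# Index side: store the folded form of the article text.
toy_index = {fold("ÉÎÑSTËÌŃ"): ["some article"]}

# Query side: fold the query the same way before looking it up.
hits = toy_index.get(fold("Einstein"), [])
```

The Khmer re-ordering would slot into the same place as `fold` here, applied once on the way into the index and once to each query.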
---
> Matt, the creator of kheng.info
I will definitely ping him and see if I can get him into the conversation. Thanks!
---
Whew! On Monday I’ll reorganize some of the samples based on our conversation and think harder about some of the ??? examples.
Any additional replies based on what I’ve tried to understand here would be great, too!
Thanks so much. This is definitely helpful, and I feel more confident that we are going in the right direction!