Thanks, Brion!
Even a nice carefully constructed language has irregularities—especially to a computer! Language is always messy.
I think the list of exceptions came from Wikipedia or Wikibooks, and I don't think the fact that they could be inflected was taken into account (typical English-speaker thinking on my part, at least; we have completely different pronoun forms, for no particularly good reason). The current stemming exceptions are just left unaltered. The goal was to keep ĉio from losing its -o, but ĉion should definitely be treated the same way. If it's easy to say which items in the list can be inflected and which can't, that would be great; otherwise I can try to work it out.
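To make the ĉio/ĉion idea concrete, here is a rough sketch of the "unbreakable stems" approach. Everything in it is an assumption for illustration, not the current stemmer: the names `PROTECTED`, `INFLECTIONS`, `naive_stem`, and `stem` are made up, the protected list holds only two sample entries, and `naive_stem` is a crude stand-in for the real stemming rules.

```python
# Hypothetical sketch: protect whole words such as "ĉio" while still
# recognizing their inflected forms ("ĉion", "ĉiujn").

PROTECTED = {"ĉio", "ĉiu"}        # sample entries only, not the real list
INFLECTIONS = ("jn", "j", "n")    # plural/accusative endings, longest first


def naive_stem(word):
    """Crude stand-in for the real stemmer: drop -j/-n, then a final vowel."""
    for ending in INFLECTIONS:
        if word.endswith(ending):
            word = word[: -len(ending)]
            break
    if len(word) > 2 and word[-1] in "oae":
        word = word[:-1]
    return word


def stem(word, fallback=naive_stem):
    """Return the protected form if the word is one, inflected or not."""
    if word in PROTECTED:
        return word
    for ending in INFLECTIONS:
        if word.endswith(ending) and word[: -len(ending)] in PROTECTED:
            return word[: -len(ending)]
    return fallback(word)
```

With this, `stem("ĉion")` and `stem("ĉio")` both give `ĉio`, while `naive_stem("ĉio")` alone would give `ĉi`. Because the match is on the whole word, unrelated words that merely end in -ĉio are not caught by the exception.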
I've been thinking about the stemming options for demokrat-. Keeping in mind that the goal is not necessarily to get a correct stem, but rather a unique stem, maybe it doesn't matter. (Though in this case it picked up Demokrito, too—but stemming names is always a gamble.) It seems like it could get very complex to deal with in the general case. Productive prefixes give related forms like maldemokratia and pseŭdodemokratia—listing them all (either all the related forms or all the acceptable prefixes) would be annoying and prone to problems. On the other hand, while blocking ĉio from having the -o stripped off makes sense, it looks like other words end in -ĉio that are not related, so allowing any arbitrary prefix is ugly, too. Any thoughts on dealing with that? Maybe one way will seem obviously best if you can come up with any other potential problem cases.
Do you have any insight into how often the h-system and x-system forms are used in written text and in searching? If lots of people can't type ĝ and so search for gh or gx, it's probably not something we should ignore. A potential problem is the treatment of foreign words, although it doesn't matter if "ghost", "though", and "laugh" are internally represented as ĝost, thouĝ, and lauĝ, as long as they don't collide with other words. I can try that out and see what impact it has on the words in my sample.
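For reference, the automatic conversion I have in mind would look roughly like the sketch below. The function and table names are hypothetical, and it is deliberately blind: it converts every matching letter pair, which is exactly why foreign words come out as ĝost and lauĝ. Note that the h-system normally writes ŭ as a plain u, so that letter can't be recovered by a simple rule and is left alone here.

```python
import re

# x-system and h-system digraphs mapped to the proper Esperanto letters.
X_SYSTEM = {"cx": "ĉ", "gx": "ĝ", "hx": "ĥ", "jx": "ĵ", "sx": "ŝ", "ux": "ŭ"}
H_SYSTEM = {"ch": "ĉ", "gh": "ĝ", "hh": "ĥ", "jh": "ĵ", "sh": "ŝ"}


def convert(word, table):
    """Replace every digraph in `table`, scanning left to right."""
    pattern = re.compile("|".join(table))
    return pattern.sub(lambda m: table[m.group(0)], word)


def normalize(word):
    """Lowercase, then apply x-system and h-system conversion in turn."""
    word = word.lower()
    return convert(convert(word, X_SYSTEM), H_SYSTEM)
```

So `normalize("gxi")` and `normalize("ghi")` both give `ĝi`, and `normalize("though")` gives `thouĝ`.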
Help with the missing diacritical forms would be great, whether a pull request or a list here or elsewhere.
The to-do list:
- make sure all the exceptions have proper diacritics
- find the exceptions that can be inflected, like ĉio, and handle them properly (add to a general list of unbreakable stems, or explicitly map forms to stems)
- remove h-system and x-system words from the exception list
- test the impact of automatic h-system and x-system conversion on stemming collisions; if it's small enough, just do it
- decide how to handle ambiguous stems like demokrat- (accept defective stems with some errors, do something clever to handle prefixes, or something else TBD)
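
For the collision test in the list above, something along these lines is what I would run over my sample. It is only a sketch: `h_to_diacritics` is a toy h-system converter, `conversion_impact` is a made-up name, and the word list is toy data rather than real results.

```python
import re

H_SYSTEM = {"ch": "ĉ", "gh": "ĝ", "hh": "ĥ", "jh": "ĵ", "sh": "ŝ"}
_H_PATTERN = re.compile("|".join(H_SYSTEM))


def h_to_diacritics(word):
    """Toy h-system conversion: replace every matching letter pair."""
    return _H_PATTERN.sub(lambda m: H_SYSTEM[m.group(0)], word)


def conversion_impact(vocabulary, transform):
    """Return (changed, collisions) for a word list.

    changed: words the transform alters, mapped to their new form.
    collisions: the subset whose new form is already another word.
    """
    vocab = set(vocabulary)
    changed = {}
    for word in vocab:
        converted = transform(word)
        if converted != word:
            changed[word] = converted
    collisions = {w: c for w, c in changed.items() if c in vocab}
    return changed, collisions


# Toy data, not from the real sample.
sample = ["ŝi", "shi", "flughaveno", "domo"]
changed, collisions = conversion_impact(sample, h_to_diacritics)
```

Here `changed` contains `shi` and `flughaveno` (the latter becomes `fluĝaveno`, since the g and h straddle a compound boundary), and `collisions` contains only `shi`, which lands on the existing word `ŝi`. If the collision set stays small on real data, that supports just doing the conversion.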
Thanks for all the help!