Sorry, the stemming list includes some compounds, which are divided into parts and are searchable by any of the parts, though exact matches are best. So the compound cases are fine, assuming the tokenization (breaking the text into words) is reasonable. Because there's a parser involved, context can change the way characters are treated, which adds to the complexity. (If you want to inspect these yourself, there's a sketch after the list below.)
- 르귄 / 르 / 귄—a compound, with 르 and 귄 tagged as proper nouns.
- 빙 / 리빙—in isolation, 리빙 comes out as a single token. There are three instances of 리빙 in my Wikipedia corpus, and two of them are treated correctly. However, in "태양의 아이들 (2011, 웅진리빙하우스) ISBN 9788901136059", it gets indexed as a compound. Probably still a parsing error.
- 사라 / 사라코너—yep, I see it. But for some reason the name 사라코너 is also being treated as a compound [사라코너 • 사라 • 코너].
- 리아디 / 아디—again, 리아디 is treated as a compound, and the part 아디 is indexed under the whole.
- 우러 is interpreted as 우르/VV(Verb)+어/E(Verbal endings), so it gets grouped with other instances of 우르.
- 비제이펜—interpreted as a compound, all proper nouns: "비/NNP(Proper Noun)+제이/NNP(Proper Noun)+펜/NNP(Proper Noun)", and so grouped under each of the parts.
- 손휴—again, interpreted as a compound of proper nouns.
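For anyone who wants to poke at cases like these, here's a minimal sketch of how I'd check a tokenization with the `_analyze` API. It assumes a local Elasticsearch instance with the Nori plugin installed; the URL and the example string are just illustrative, not our production setup:

```python
import requests

# Hypothetical local cluster with the analysis-nori plugin installed;
# adjust the URL for your own setup.
ES = "http://localhost:9200"

body = {
    "tokenizer": "nori_tokenizer",
    "text": "사라코너",
    "explain": True,  # include token attributes like POS tags
}
resp = requests.post(f"{ES}/_analyze", json=body).json()

# With "explain", the tokenizer output is under detail.tokenizer.tokens,
# and each token carries Nori's leftPOS/rightPOS attributes.
for tok in resp["detail"]["tokenizer"]["tokens"]:
    print(tok["token"], tok.get("leftPOS"), tok.get("rightPOS"))
```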
This brings up the possibility that we should not index compounds by their parts. The default setting throws away the original compound and only keeps the parts; I thought keeping the original would increase precision when you know exactly what you are looking for. Not keeping the parts would get rid of some of these errors, but it would also make it harder to match when you only have part of a compound. For example, right now a longer compound can match a shorter compound that is contained within it: a four-part compound, ABCD, can match the three-part compound, ABC, because A, B, and C are all indexed separately.
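For reference, this behavior is controlled by the Nori tokenizer's `decompound_mode` setting. Here's a sketch of what choosing an option looks like in the index settings; the index, tokenizer, and analyzer names are made up for illustration:

```python
import requests

# Nori's decompound_mode controls what happens to compounds:
#   "discard" - the default: index only the parts, drop the original
#   "mixed"   - index the original compound and its parts
#   "none"    - index only the original, never split it
settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "korean_mixed": {
                    "type": "nori_tokenizer",
                    "decompound_mode": "mixed",
                }
            },
            "analyzer": {
                "korean": {
                    "type": "custom",
                    "tokenizer": "korean_mixed",
                }
            },
        }
    }
}
requests.put("http://localhost:9200/ko_test", json=settings)
```

With "mixed", ABCD would still match a query for ABC through the shared parts; with "none", only the exact compound would match.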
Based on the general review of Tokenization and Compounds, though, I think we are okay, with more correct tokenizations than errors.
Thanks again, revi, for all the help! Any more comments on anything would be welcome!