User:TJones (WMF)/Notes/Kuromoji Analyzer Analysis
June-July 2017 — See TJones_(WMF)/Notes for other projects. See also T166731. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.
Intro
The Kuromoji Japanese language analyzer has lots of configuration options, can be unpacked for custom configuration, and is supported by Elastic. It seems like the right place to start.
Baseline (CJK) vs Kuromoji
The Kuromoji analyzer produces many fewer tokens than the baseline CJK analyzer on a 5K jawiki article sample:
- 5,515,225 (CJK)
- 2,379,072 (Kuromoji)
And many fewer token pre-/post-analysis types:
- 310,339 / 305,770 (CJK)
- 139,282 / 127,983 (Kuromoji)
- (there are fewer post-analysis types because some merge, like Apple and apple)
Many non-Japanese, non-Latin characters are not handled well:
- Arabic, Armenian, Bengali, Devanagari, Georgian, Hangul, Hebrew, IPA, Mongolian, Myanmar, Thaana, Thai, and Tibetan are removed.
- Latin words are split on apostrophes, periods, and colons; word breaks are added between numbers and letters (4G → 4 + g), and numbers are split on commas and periods. Typical European accented Latin characters (áéíóú, àèìòù, äëïöüÿ, âêîôû, ñãõ, ç, ø, å) are handled fine (but not folded).
- Cyrillic words are split on our old friend, the combining acute accent.
- Greek is treated very oddly, with some characters removed, and words sometimes split into individual letters, sometimes not. I can't figure out the pattern.
Lots of Japanese tokens (Hiragana, Katakana, and Ideographic characters) also change, but that's because the baseline CJK tokenizer works on bigrams rather than actually trying to segment words.
Fullwidth numbers starting with 1 are split character-by-character (１９３３ → １ + ９ + ３ + ３ rather than １９３３). Mixed fullwidth and halfwidth numbers are handled inconsistently, depending not only on where the fullwidth and halfwidth forms are, but also on which digits they are. A leading fullwidth １ seems to like to split off.
- 1933 → 1933
- 1933 → 1 + 933
- 1933 → 1 + 933
- 1933 → 1 + 933
- 2933 → 2933
- 2933 → 2933
- 2933 → 2933
- 2933 → 2933
That seems odd, and while not all of these mixed patterns occur in my sample, some do.
Kuromoji vs Kuromoji Unpacked
I unpacked the Kuromoji analyzer according to the docs on the Elastic website. To focus on the effect of unpacking, I disabled our usual automatic upgrade of the lowercase filter to the icu_normalizer filter.
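For reference, here is a rough sketch of what the unpacked configuration looks like, following the Elastic plugin docs. The index name, analyzer name, and comments are illustrative; the actual unpacked config we use is managed elsewhere, not in this snippet.

```python
import requests

ES = "http://localhost:9200"

# Sketch of the unpacked kuromoji analyzer, per the Elastic analysis-kuromoji docs.
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "ja_text": {
                    "type": "custom",
                    "tokenizer": "kuromoji_tokenizer",
                    "filter": [
                        "kuromoji_baseform",        # inflected forms -> dictionary form
                        "kuromoji_part_of_speech",  # drop tokens by part-of-speech tag
                        "cjk_width",                # fold fullwidth/halfwidth variants
                        "ja_stop",                  # Japanese stop words
                        "kuromoji_stemmer",         # strip trailing prolonged sound mark (ー)
                        "lowercase",                # normally auto-upgraded to icu_normalizer
                    ],
                }
            }
        }
    }
}

requests.put(f"{ES}/jawiki_unpacked_test", json=settings).raise_for_status()
```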
On a 1K jawiki article sample, the number of tokens was very similar:
- 520,971 (Kuromoji)
- 525,325 (unpacked)
Also, the numbers of pre- and post-analysis types are similar:
- 59,350 / 55,448 (Kuromoji)
- 59,643 / 55,716 (unpacked)
Some non-Japanese, non-Latin characters are treated differently:
- Arabic, Armenian, Bengali, Devanagari, Georgian, Hangul, Hebrew, Mongolian, Myanmar, Thaana, Thai, and Tibetan are now preserved!
- IPA characters are preserved, but words are split on them: dʒuglini → d + ʒ + uglini.
Others, the same:
- Latin words are split on apostrophes, periods, and colons; word breaks are added between numbers and letters (4G → 4 + g), and numbers are split on commas and periods. Typical European accented Latin characters (áéíóú, àèìòù, äëïöüÿ, âêîôû, ñãõ, ç, ø, å) are handled fine (but not folded).
- Cyrillic words are split on our old friend, the combining acute accent.
- Greek and full-width numbers are treated oddly, as before.
Overall, it's a big improvement for non-Japanese, non-Latin character handling.
Lowercase vs ICU Normalizer
I re-enabled the lowercase-to-icu_normalizer upgrade. The differences in the 1K sample were very slight and expected:
- ², ⑥, and ⑦ were converted to 2, 6, and 7.
- Greek final-sigma ς became regular sigma σ.
- German ß became ss.
- Pre-composed Roman numerals (Ⅲ ← that's a single character) were decomposed (III ← that's three i's).
Those are all great, so we'll leave that on in the unpacked version for now.
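For reference, the re-enabled upgrade just swaps the last filter in the chain sketched above; a minimal sketch (filter names as in that earlier snippet, which is itself illustrative):

```python
# Same chain as the sketch above, with lowercase swapped for icu_normalizer
# (requires the ICU analysis plugin).
ja_text_filters = [
    "kuromoji_baseform",
    "kuromoji_part_of_speech",
    "cjk_width",
    "ja_stop",
    "kuromoji_stemmer",
    "icu_normalizer",  # replaces "lowercase"; also folds ², ⑥, ß, Ⅲ, etc.
]
```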
Fullwidth Numbers
I did some experimentation, and the problem with the fullwidth numbers is coming from the tokenizer. I've added a custom character filter to convert fullwidth numbers to halfwidth numbers before the tokenizer, which solves the weird inconsistencies.
It does have one semi-undesirable side effect: months, like 4月 (4th month = "April"), are split into two tokens, 4 + 月, when a halfwidth number is used. While I think indexing ４月 or 4月 as a unit is better, that only happens for the fullwidth version, so while this is slightly worse, it is also more consistent, in that ４月 and 4月 will now be indexed the same. A sketch of the character filter follows.
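This is a minimal sketch of such a character filter using the standard mapping char filter; the filter and analyzer names are illustrative, not our production names.

```python
# Map fullwidth digits to halfwidth before the kuromoji_tokenizer runs.
fullwidth_digits = "０１２３４５６７８９"
fullwidth_numbers_char_filter = {
    "fullwidth_to_halfwidth_numbers": {
        "type": "mapping",
        "mappings": [f"{fw}=>{hw}" for fw, hw in zip(fullwidth_digits, "0123456789")],
    }
}

# Attach it to the front of the analyzer sketched earlier.
ja_text_analyzer = {
    "type": "custom",
    "char_filter": ["fullwidth_to_halfwidth_numbers"],
    "tokenizer": "kuromoji_tokenizer",
    "filter": [
        "kuromoji_baseform",
        "kuromoji_part_of_speech",
        "cjk_width",
        "ja_stop",
        "kuromoji_stemmer",
        "icu_normalizer",
    ],
}
```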
Bits and Bobs
I tested a number of other options available with Kuromoji. These are the ones that didn't pan out.
Kuromoji Tokenizer Modes
The Kuromoji tokenizer has several modes and a couple of expert parameters (see documentation). I didn't want to dig into all of them, but the "search" mode seemed interesting.
The "normal" mode returns compounds as single terms. So, 関西国際空港 ("Kansai International Airport") is indexed as just 関西国際空港.
In "search" mode, 関西国際空港 is indexed as four terms, the full compound (関西国際空港) and the three component parts (関西, 国際, 空港—"Kansai", "International", "Airport").
It turns out that "search" is the default, even though "normal" is listed first and sounds like it might be the normal mode of operation.
There is also an "extended" mode that breaks up unknown tokens into unigrams. (For comparison, the CJK analyzer breaks up everything into bigrams.)
The "search" mode seems better, so we'll stick with that.
Part of Speech Token Filter
I disabled the kuromoji_part_of_speech token filter, and it had no effect on a 1K sample of jawiki articles. According to the docs, it filters based on part-of-speech tags, but apparently none are configured, so it does nothing. Might as well leave it disabled if it isn't doing anything.
Iteration Mark Expansion
There's an option to expand iteration marks (e.g., 々), which indicate that the previous character should be repeated. However, sometimes the versions of a word with and without the iteration mark have different meanings. More importantly, expanding the iteration mark can change the tokenization—usually resulting in a word being split into pieces. Leaving the iteration mark unexpanded seems much less ambiguous, so I think we should leave it alone.
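For the record, here is a minimal sketch of the expansion we decided not to enable, using the plugin's iteration-mark char filter (option names per the plugin docs; the filter name here is made up):

```python
# Char filter that would rewrite iteration marks before tokenization.
iteration_mark_char_filter = {
    "expand_iteration_marks": {
        "type": "kuromoji_iteration_mark",
        "normalize_kanji": True,  # e.g., 時々 -> 時時
        "normalize_kana": True,   # e.g., こゝ -> ここ
    }
}
```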
Stop Words
A small number of types (~125) but a huge number of tokens (278,572 stop words vs 525,320 non-stop words—34.6%) are filtered as stop words. A quick survey of the top stop words shows they all make sense.
Prolonged Sound Mark "Stemmer"
The kuromoji_stemmer token filter isn't really a stemmer (the kuromoji_baseform token filter does some stemming for verbs and adjectives); it just strips the Prolonged Sound Mark (ー) from the ends of words. A quick test disabling it shows that's exactly what it does.
This seems to be a mark used in loanwords to indicate long vowel sounds. I'm not sure why you'd want to remove it at the end of a word, but it often doesn't make any difference in Google Translate, so it seems to be semi-optional.
The removal is on by default, so I'll leave it that way.
Japanese Numeral Conversion
The kuromoji_number token filter normalizes Japanese numerals (〇, 一, etc.) to Arabic numerals (0, 1, etc.).
However, it is wildly aggressive and ignores spaces, commas, periods, leading zeros, dashes, slashes, colons, number signs, and more. So, "1, 2, 3. : 456 -7 #8" is tokenized as 12345678, and "一〇〇.一九 #七 九:〇" as 10019790. It also tokenizes 0.001 as just 1.
Unfortunately, the rules for parsing properly formatted Japanese numerals can't be implemented as simple digit-by-digit substitution, so a simple char filter can't solve this problem.
Fortunately, not using it is no worse than the current situation.
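For completeness, enabling it would just be one more entry in the token filter chain; a sketch (not something we're actually doing), based on the chain from the earlier snippets:

```python
# Chain with kuromoji_number added, shown only for reference; per the notes
# above, the filter runs digits together across punctuation, so we leave it out.
ja_text_filters_with_number = [
    "kuromoji_baseform",
    "kuromoji_part_of_speech",
    "cjk_width",
    "kuromoji_number",  # 〇/一/二... -> 0/1/2..., but far too aggressive
    "ja_stop",
    "kuromoji_stemmer",
    "icu_normalizer",
]
```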
Groups for Review
Below are some groupings for review by fluent speakers of Japanese. These are tokens that are indexed together, so searching for one should find the others. The format is <normalized_form> - [<count> <token>].... The <normalized_form> is the internal representation of all the other tokens. It's sometimes meaningful, and sometimes not, depending on the analyzer and the tokens being considered, so take it as a hint to the meaning of the group, but not definitive. The <token> is a token found by the language analyzer (more or less a "word" but not exactly—it could be a grammatical particle, etc.) and <count> is the number of times it was found in the sample. While accuracy is important, frequency of errors also matters.
Groups with no common prefix/suffix
These are the groups that don't have a common prefix or suffix across all tokens in the group. That's not necessarily a bad thing (good, better, best is a fine group in English, for example)—but it's worth looking at them just to be sure. I've filtered out the half-width/full-width variants that were obvious to me.
- く - [47 か][14 きゃ][220 く][8 け][351 っ]
- くる - [2299 き][38 く][2 くっ][5 くら][10 くり][733 くる][10 くれ][13 くろ][187 こ][14 こい]
- す - [36 さ][4 し][81 しゃ][199 す]
- たい - [68 た][1015 たい][67 たかっ][5 たかつ][3 たから][12 たき][143 たく][4 たけれ][4 たし][3 てぇ][20 とう]
- ぬ - [4 ざり][142 ざる][1 ざれ][4927 ず][5 ずん][407 ぬ][84 ね]
- り - [512 り][188 る]
- る - [4 よ][1 りゃ][166 る][13 るる][1 るれ][38 れ][2 ろ]
Things are a bit complicated by the fact that "る" is in two different groups above. A number of tokens are normalized in multiple ways, which appears to be context-dependent.
Examples of where these tokens come from in the text are available on a separate page.
Largest Groups
These are the groups with the largest number of unique tokens. Again, these aren't necessarily wrong, but it is good to review them. They are actually pretty small groups compared to other language analyzers. The first two are duplicates from above.
- たい - [68 た][1015 たい][67 たかっ][5 たかつ][3 たから][12 たき][143 たく][4 たけれ][4 たし][3 てぇ][20 とう]
- くる - [2299 き][38 く][2 くっ][5 くら][10 くり][733 くる][10 くれ][13 くろ][187 こ][14 こい]
These are not duplicates:
- てる - [115 て][5 てっ][1 てよ][7 てら][3 てり][1 てりゃ][282 てる][1 てれ][11 てん]
- よい - [32 よ][217 よい][22 よかっ][2 よから][1 よかれ][19 よき][138 よく][6 よけれ][134 よし]
- 悪い - [33 悪][250 悪い][1 悪う][27 悪かっ][1 悪から][1 悪き][143 悪く][1 悪けれ][5 悪し]
- 良い - [75 良][383 良い][47 良かっ][2 良かれ][38 良き][265 良く][1 良けりゃ][2 良けれ][16 良し]
Examples of where these tokens come from in the text are available on a separate page.
Random Groups
Below are 50 random groups. I filtered out groups that consisted of just a deleted Prolonged Sound Mark (ー), since the "stemmer" (see above) is supposed to do that.
Examples of where these tokens come from in the text are available on a separate page.
- かき消す - [1 かき消し][2 かき消す][1 かき消そ]
- きつい - [6 きつ][7 きつい][1 きつう][1 きつき][6 きつく]
- こうじる - [41 こうじ][3 こうじろ]
- こじれる - [6 こじれ][1 こじれる]
- つける - [547 つけ][11 つけよ][179 つける][3 つけれ][9 つけろ][2 つけん]
- てんじる - [3 てんじ][1 てんじん]
- のく - [1 のい][12 のき][3 のく][2 のこ]
- ぶつ - [2 ぶた][2 ぶち][2 ぶちゃ][10 ぶっ][80 ぶつ][1 ぶて]
- まぎる - [1 まぎ][1 まぎら][1 まぎる][1 まぎれ][1 まぎん]
- もてる - [12 もて][3 もてる]
- 丸っこい - [2 丸っこい][1 丸っこく]
- 乗り換える - [24 乗り換え][19 乗り換える]
- 乗る - [16 乗][260 乗っ][8 乗ら][67 乗り][92 乗る][2 乗れ][1 乗ろ]
- 仕上げる - [23 仕上げ][7 仕上げる]
- 任じる - [111 任じ][4 任じる]
- 任す - [75 任さ][1 任し]
- 信じる - [179 信じ][5 信じよ][29 信じる][1 信じろ]
- 働かせる - [9 働かせ][3 働かせる]
- 助かる - [9 助かっ][4 助から][10 助かる][1 助かれ]
- 取りやめる - [29 取りやめ][1 取りやめる]
- 取り戻す - [3 取り戻さ][85 取り戻し][50 取り戻す][2 取り戻せ][12 取り戻そ]
- 唸る - [2 唸ら][1 唸り][3 唸る][1 唸れ]
- 太い - [81 太][25 太い][32 太く]
- 差し入れる - [1 差し入れ][2 差し入れる]
- 弱まる - [7 弱まっ][7 弱まり][6 弱まる]
- 従える - [24 従え][2 従えよ][5 従える]
- 思いとどまる - [3 思いとどまら][1 思いとどまり][8 思いとどまる]
- 承る - [164 承][1 承り]
- 押さえ込む - [1 押さえ込ま][2 押さえ込み][6 押さえ込む][1 押さえ込ん]
- 振る舞う - [6 振る舞い][17 振る舞う][4 振る舞っ]
- 携わる - [98 携わっ][3 携わら][23 携わり][48 携わる][1 携われ]
- 摂る - [32 摂][4 摂っ][3 摂ら][4 摂り][10 摂る][1 摂れ]
- 暴く - [4 暴い][12 暴か][3 暴き][5 暴く][1 暴こ]
- 癒える - [11 癒え][1 癒える]
- 策する - [1 策し][1 策する]
- 脱ぐ - [10 脱い][2 脱が][4 脱ぎ][4 脱ぐ]
- 致す - [1 致し][1 致す]
- 虐げる - [12 虐げ][3 虐げる]
- 裁く - [1 裁い][6 裁か][1 裁き][10 裁く]
- 要す - [3 要さ][42 要し][2 要す]
- 見て取れる - [2 見て取れ][5 見て取れる]
- 解く - [28 解い][37 解か][41 解き][35 解く][5 解こ]
- 言い表す - [2 言い表さ][2 言い表し][1 言い表す]
- 試す - [6 試さ][15 試し][15 試す][3 試そ]
- 謀る - [5 謀っ][1 謀ら][2 謀り][2 謀る]
- 譲り渡す - [1 譲り渡さ][3 譲り渡し]
- 護る - [3 護っ][8 護り][8 護る][1 護れ][1 護ろ]
- 起こす - [32 起こさ][400 起こし][155 起こす][2 起こせ][9 起こそ]
- 遠い - [113 遠][77 遠い][2 遠かっ][7 遠き][58 遠く][1 遠し]
- 闘う - [7 闘い][42 闘う][1 闘え][1 闘お][14 闘っ][4 闘わ]
Longest Tokens
Below are the tokens that are 25 characters or longer.
The longest tokens in Latin characters are all reasonable—the first one is the English/basic Latin alphabet, while the next two are rotated versions of the alphabet (I smell a cipher). There are a few long German words, some chemical names, some English words run together as part of a URL, and a really long string that looks to be a transliteration of Sanskrit translated to Tibetan and back. (The corresponding English Wikipedia article breaks it up into multiple words.)
- abcdefghijklmnopqrstuvwxyz
- zabcdefghijklmnopqrstuvwxy
- hijklmnopqrstuvwxyzabcdefg
- luftwaffenausbildungskommando
- polizeidienstauszeichnung
- staedtischermusikvereinduesseldorf
- chlorobenzalmalononitrile
- dimethylmethylideneammonium
- dinitrosopentamethylenetetramine
- glycerylphosphorylcholine
- hydroxydihydrochelirubine
- hydroxydihydrosanguinarine
- hydroxyphenylacetaldehyde
- methylenedioxypyrovalerone
- diggingupbutchandsundance
- mahāvairocanābhisaṃbodhivikurvitādhiṣṭhānavaipulyasūtrendrarāja
The long Thai tokens look to be noun phrases in Thai, written without spaces in the usual Thai way. (The Japanese language analyzer isn't really supposed to know what to do with those.)
- ที่ทําการปกครองอําเภอเทิง
- ศูนย์เทคโนโลยีสารสนเทศและการสื่อสาร
The longest Japanese tokens can be broken up into two groups. The first group are in katakana. These are long tokens that are indexed both as one long string and as smaller parts (see "Kuromoji Tokenizer Modes" above). The breakdown of the tokens is provided beneath each long token. Notice that some of the sub-tokens are still pretty long ("クリテリウムイベント", "レーシングホールオブフェイムステークス", "カロッツェリアサテライトクルージングシステム").
- ジャパンカップサイクルロードレースクリテリウムイベント
- ジャパン - カップ - サイクル - ロードレース - クリテリウムイベント
- ナショナルミュージアムオブレーシングホールオブフェイムステークス
- ナショナル - ミュージアム - オブ - レーシングホールオブフェイムステークス
- パイオニアカロッツェリアサテライトクルージングシステム
- パイオニア - カロッツェリアサテライトクルージングシステム
- パシフィックゴルフグループインターナショナルホールディングス
- パシフィック - ゴルフ - グループ - インターナショナル - ホールディングス
The second group of Japanese tokens is all hiragana, and a lot of them start with "ょ". When submitted to the analyzer, these all come back as single tokens with no alternate breakdown. Based on Google Translate, I think most or all of these are errors (I wouldn't be shocked if a few turned out to be the Japanese equivalent of antidisestablishmentarianism and supercalifragilisticexpialidocious).
- ざいさんぎょうだいじんのしょぶんにかかるしんさきじゅんとう
- ゃくかちょうみたてほんぞうふでつむしこえのとりどり
- ゅういっかいせんばつちゅうとうがっこうやきゅうたいかい
- ゅうとうがっこうゆうしょうやきゅうたいかいしこくたいかい
- ょうがいをりゆうとするさべつのかいしょうにかんするほうりつ
- ょうさつじんこういをおこなっただんたいのきせいにかんするほうりつ
- ょうせいほうじんこくりつびょういんきこうかながわびょういん
- ょうせいほうじんこくりつびょういんきこうもりおかびょういん
- ょうせんとうきょくによってらちされたひがいしゃとうのしえんにかんするほうりつ
- ょくかんれんさんぎょうろうどうくみあいそうれんごう
- ょけんおよびぎじゅつじょうのちしきのこうりゅうをよういにするためのにほんこくせいふと
- ょせいかんりょうとかぞくのよんひゃくろくじゅうごにち
One way to deal with these very long tokens is to change the tokenizer mode to "extended", which breaks up unknown words into unigrams (see "Kuromoji Tokenizer Modes" above). This would improve recall, but at the expense of precision.
For now, I say we let the pendulum swing in the other direction (away from indexing bigrams), but keep this potential problem in the back of our minds.
I still need to test the general tokenization of the language analyzer. If that's generally very good, I'll stick with this suggestion. If it's not great, we can reconsider the unigrams for unknown tokens.
Kuromoji Tokenization Analysis
One of the big benefits of a Japanese-specific language analyzer is better tokenization. Written Japanese generally doesn't use spaces, so the default CJK analyzer breaks everything up into bigrams (i.e., overlapping pairs of characters). Trying to find actual words in the text is better, if you can do a decent job of it.
A rough parallel in English would be to break English text up by some unit of meter (apologies to any poets, I'm gonna just wing it). So "the president of the united states" might be broken up into "the presi", "president", "dent of", "of the", "the unit", "united", and "ed states". You can't apply stop words (i.e., ignoring "the" and "of") and the matches you do get are not guaranteed to be what you intended. So, "..independent of the presiding consul's concerns, the unit belonging to Ed States, United Airlines president ..." matches all the pieces, relatively close together, but isn't at all about the president of the US.
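To make the bigram behavior concrete, here's a toy sketch of what overlapping character bigrams look like (an illustration only, not the actual Lucene CJK code):

```python
def cjk_style_bigrams(text):
    """Overlapping character bigrams, ignoring spaces; no word boundaries."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]

print(cjk_style_bigrams("関西国際空港"))
# ['関西', '西国', '国際', '際空', '空港']
```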
Tools and Corpora
Using the SIGHAN framework I used to test Chinese segmentation, I set out to analyze Kuromoji's Japanese segmentation.
I was able to extract tokenization information from the much more heavily annotated KNBC corpus. There are 4,186 sentences in the corpus. I had to drop a handful of them because my extracted tokenization did not match the original sentence after "de-tokenizing" it. Some were missing, some mismatched. My final corpus had 4,178 sentences, so I don't think there is any major bias to the dropped sentences.
From the list of Longest Tokens above, we know that Kuromoji's tokenization is not 100% perfect—at least a small number of errors are expected.
I disabled stop words and tested the extended, search, and normal tokenizer modes (see Kuromoji Tokenizer Modes above). The extended mode is expected to have problems since unknown strings are split into individual characters. The search mode also has problems because it includes multiple tokens for the same string (strings recognized as compounds are indexed both as one longer string, and as constituent parts).
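As a rough sketch of how segmentation recall and precision can be scored, each tokenization of the same sentence can be converted into a set of character spans and the span sets compared; this illustrates the idea but is not necessarily the exact SIGHAN scoring script.

```python
def spans(tokens):
    """Convert a tokenization into a set of (start, end) character spans."""
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out

def precision_recall_f1(reference_tokens, hypothesis_tokens):
    ref, hyp = spans(reference_tokens), spans(hypothesis_tokens)
    hits = len(ref & hyp)
    precision = hits / len(hyp)
    recall = hits / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Toy example (synthetic, not from KNBC): the reference keeps して together,
# the hypothesis splits it as し + て, costing both precision and recall.
ref_toks = ["京都", "を", "観光", "して", "いる"]
hyp_toks = ["京都", "を", "観光", "し", "て", "いる"]
print(precision_recall_f1(ref_toks, hyp_toks))  # roughly (0.67, 0.80, 0.73)
```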
Punctuation Problems
I also had to deal with a somewhat unexpected problem with punctuation. The Kuromoji analyzer, rightly, drops punctuation and non-word symbols it isn't going to index. My script that extracts the tokenization fills in the dropped bits from the original string. That's mostly okay, but sometimes it causes problems when there are multiple punctuation marks and other symbols in a row. For example, when tokenizing "「そうだ 、京都 、行こう。」", Kuromoji returned the tokens そう, だ, 京都, 行こ, and う. My script filled in 「, 、, 、, and 。」. The problem is that the KNBC annotations split 。」 into separate tokens: 。 and 」. Disagreement here isn't terribly important, since we aren't indexing those tokens.
I wrote a script to identify correspondences in tokenization, and used it to identify where I should manually munge tokenization differences for punctuation and other symbols (e.g., "−−>" vs "− − >", "…。" vs "… 。", etc.). I also normalized a small number of fullwidth spaces.
Recall and Precision Results
It turns out that the differences are pretty small, with recall roughly 82-83% and precision 72-76%, depending on the tokenizer mode:
Mode | Recall | Precision | F1
extended | 81.8% | 72.0% | 76.6%
search | 82.8% | 75.1% | 78.7%
normal | 82.4% | 75.4% | 78.8%
norm munged | 83.3% | 76.1% | 79.5%
These are okay results, but not great.
However, there are some systematic differences between the KNBC tokenization and the Kuromoji tokenization.
Common Tokenization Discrepancies
Below is a list of the most common alternations (10 or more occurrences) found between the KNBC and Kuromoji tokenizations. I've also provided Google translations for the ones with 20 or more occurrences. (I've added a bullet, •, between tokens because I have trouble seeing the spaces in fullwidth text, and I suspect others who don't read Japanese may as well.)
The Google translations are far from definitive and comments from speakers of Japanese would be helpful, but the translations do hint that the most common alternations are not really content words. I've bolded the ones that seem to have some content. The rest account for 3,010 out of 9,214 alternations (32.6%, including those with <10 occurrences, which are not shown here).
Alternations | Tokenization | Google Translation | ||
Freq | KNBC | Kuromoji | KNBC | Kuromoji |
374 | して | し • て | do it | Then. The |
256 | と • いう | という | When. Say | It is called |
236 | ました | まし • た | Was | Better. It was |
131 | した | し • た | did | Then. It was |
131 | である | で • ある | Is | so. is there |
119 | いた | い • た | Was there | Yes. It was |
103 | れて | れ • て | Have been | Re The |
101 | なって | なっ • て | Become | Become The |
93 | のだ | の • だ | It was | of. It is |
89 | のです | の • です | It is | of. is |
88 | ように | よう • に | like | Looks like. Into |
80 | と • か | とか | When. Or | And |
80 | と • して | として | When. do it. | As |
76 | だった | だっ • た | was | So. It was |
71 | だろう | だろ • う | right | Right. Cormorant |
71 | なかった | なかっ • た | There was not | Not. It was |
69 | いて | い • て | Stomach | Yes. The |
63 | 行って | 行っ • て | go | Go. The |
60 | ような | よう • な | like | Looks like. What |
56 | 見て | 見 • て | look | You see. The |
55 | なった | なっ • た | became | Become It was |
55 | んです | ん • です | It is | Hmm. is |
52 | 思って | 思っ • て | I thought to | I thought. The |
50 | 行った | 行っ • た | went | Go. It was |
47 | れた | れ • た | Was done | Re It was |
47 | 使って | 使っ • て | Use | Use. The |
45 | でした | でし • た | was | It is. It was |
43 | 清水 • 寺 | 清水寺 | Shimizu. temple | Kiyomizudera |
41 | きた | き • た | Came | き It was |
41 | しまった | しまっ • た | Oops | Oops. It was |
40 | 的に | 的 • に | Specifically | Target. Into |
39 | でしょう | でしょ • う | Oh, yeah. | right. Cormorant |
39 | なくて | なく • て | I do not need it. | Not. The |
37 | あった | あっ • た | there were | Ah. It was |
36 | あって | あっ • て | There | Ah. The |
34 | 的な | 的 • な | Sophisticated | Target. What |
33 | でも | で • も | But | so. Also |
32 | いって | いっ • て | Go | I say. The |
32 | に • とって | にとって | To Handle | for |
32 | 好きな | 好き • な | Favorite | Like. What |
31 | いえば | いえ • ば | Speaking | House. The |
31 | に • ついて | について | To about | about |
31 | 持って | 持っ • て | Wait | Have. The |
30 | ので | の • で | Because | of. so |
30 | やって | やっ • て | do it | Do it. The |
29 | んだ | ん • だ | I | Hmm. It is |
29 | 出て | 出 • て | Came out | Out The |
28 | のだろう | の • だろ • う | Would be | of. Right. Cormorant |
27 | 食べて | 食べ • て | eat | eat. The |
25 | 入って | 入っ • て | go in | Enter. The |
24 | きて | き • て | come | き The |
24 | したり | し • たり | Or | Then. Or |
24 | 考えて | 考え • て | think | Thoughts. The |
22 | わけで | わけ • で | For that | Why so |
21 | に • よって | によって | To Accordingly | By |
21 | 思った | 思っ • た | thought | I thought. It was |
20 | のでしょう | の • でしょ • う | I guess | of. right. Cormorant |
20 | みて | み • て | look | Only. The |
20 | 住んで | 住ん • で | Live | Live. so |
19 | お • 寺 | お寺 | ||
19 | であった | で • あっ • た | ||
19 | 言って | 言っ • て | ||
18 | いけない | いけ • ない | ||
18 | そうです | そう • です | ||
18 | 修学 • 旅行 | 修学旅行 | ||
18 | 夏 • 休み | 夏休み | ||
18 | 来て | 来 • て | ||
18 | 様々な | 様々 • な | ||
18 | 確かに | 確か • に | ||
18 | 買って | 買っ • て | ||
17 | であり | で • あり | ||
17 | と • いった | といった | ||
17 | みた | み • た | ||
17 | 書いて | 書い • て | ||
16 | 他の | 他 • の | ||
16 | 知って | 知っ • て | ||
15 | これ • から | これから | ||
15 | そこ • で | そこで | ||
15 | なければ | なけれ • ば | ||
15 | ひと • つ | ひとつ | ||
15 | られた | られ • た | ||
15 | 一 • つ | 一つ | ||
15 | 有名な | 有名 • な | ||
15 | 来た | 来 • た | ||
15 | 買った | 買っ • た | ||
14 | お • 茶 | お茶 | ||
14 | のである | の • で • ある | ||
14 | られて | られ • て | ||
14 | んじゃ | ん • じゃ | ||
14 | 歩いて | 歩い • て | ||
14 | 非常に | 非常 • に | ||
13 | いった | いっ • た | ||
13 | って • いう | っていう | ||
13 | ついて | つい • て | ||
13 | もの • の | ものの | ||
13 | んだろう | ん • だろ • う | ||
13 | 聞いて | 聞い • て | ||
13 | 見た | 見 • た | ||
13 | 逆に | 逆 • に | ||
12 | お • 金 | お金 | ||
12 | それ • で | それで | ||
12 | できた | でき • た | ||
12 | できて | でき • て | ||
12 | ようです | よう • です | ||
12 | 何度 | 何 • 度 | ||
12 | 好きだ | 好き • だ | ||
12 | 河原 • 町 | 河原町 | ||
11 | しよう | しよ • う | ||
11 | わけです | わけ • です | ||
11 | 三 • 条 | 三条 | ||
11 | 入れて | 入れ • て | ||
11 | 目の前 | 目 • の • 前 | ||
11 | 聞いた | 聞い • た | ||
10 | いつでも | いつ • でも | ||
10 | お • 気に入り | お気に入り | ||
10 | かけて | かけ • て | ||
10 | このような | この • よう • な | ||
10 | して • る | し • てる | ||
10 | せて | せ • て | ||
10 | それ • でも | それでも | ||
10 | たかった | たかっ • た | ||
10 | に • 対して | に対して | ||
10 | みたいな | みたい • な | ||
10 | よかった | よかっ • た | ||
10 | 一 • 度 | 一度 | ||
10 | 今では | 今 • で • は | ||
10 | 作った | 作っ • た | ||
10 | 四 • 条 | 四条 | ||
10 | 変わって | 変わっ • て | ||
10 | 始めて | 始め • て | ||
10 | 百人一首 | 百 • 人 • 一 • 首 | ||
10 | 簡単に | 簡単 • に | ||
10 | 置いて | 置い • て |
Below I have split out the tokens that participate in alternations, to help identify regular patterns across alternations. I've included those with >=50 occurrences.
On the Kuromoji side, the top 9 are single characters, and 8 of them are identified by English Wiktionary as being particles (very briefly: particles are typically small words that provide additional grammatical information). To get a sense of the scope here, this would be like deciding whether "have been" or "look up" should be tokenized as one word or two. Consistency is probably more important than choosing either option.
It seems that Kuromoji is more aggressive about separating particles, and these 8 account for 6,656 of the 17,510 Kuromoji tokens (38.0%) that appear in alternations.
Most Commonly Alternating Tokens
Freq | KNBC | Freq | Kuromoji | Notes
468 | と | 2286 | て | Request maker sentence-final particle. | |
466 | して | 1650 | た | interrogative personal pronoun | |
271 | いう | 567 | し | Conjunctive particle | |
238 | ました | 526 | で | Particle meaning at/ or with | |
168 | に | 494 | な | Several particle meanings | |
133 | した | 463 | に | Several particle meanings | |
131 | である | 398 | の | case marking particle | |
119 | いた | 296 | う | ? | |
107 | れて | 272 | だ | nominal predicate particle | |
104 | か | 256 | という | ||
102 | なって | 240 | まし | ||
95 | お | 235 | です | ||
93 | のだ | 211 | い | ||
89 | のです | 203 | よう | ||
88 | ように | 184 | ある | ||
83 | る | 174 | ば | ||
76 | だった | 168 | なっ | ||
71 | だろう | 166 | れ | ||
71 | なかった | 138 | ん | ||
69 | いて | 127 | 行っ | ||
67 | 寺 | 116 | だろ | ||
65 | 行って | 114 | だっ | ||
60 | ような | 109 | あっ | ||
57 | んです | 100 | たり | ||
56 | なった | 84 | 的 | ||
56 | 見て | 82 | とか | ||
53 | 思って | 81 | たら | ||
51 | 行った | 80 | として | ||
| | 80 | 思っ |
| | 78 | き |
| | 75 | なかっ |
| | 72 | も |
| | 72 | 見 |
| | 68 | 好き |
| | 67 | てる |
| | 65 | でしょ |
| | 60 | いっ |
| | 59 | そう |
| | 58 | でし |
| | 57 | ー |
| | 57 | 使っ |
| | 54 | と |
| | 53 | しまっ |
Tokenization Analysis Summary
My sense is that a significant portion of the disagreements between KNBC and Kuromoji come down to how aggressively particular parts of speech are separated. There are surely a fair number of errors in the Kuromoji tokenization, but I'm not so worried that I'd want to stop rather than proceed with setting up the test index in labs.
Further Review
Below is some additional analysis, done as a result of issues brought up by speaker review or elsewhere. In particular, check out the discussion on Phab with whym, starting here.
Some 1- and 2-Character Tokens
In light of the discussion with whym on Phab and the concern that 1- and 2-character tokens are often highly ambiguous and can be grammatical suffixes, I've taken all of the 1- and 2-character tokens in the Groups with no common prefix/suffix above and run some additional analysis. In the tables below we have:
- token: the 1- or 2-character token from Groups with no common prefix/suffix above
- char_freq: the number of times the token string occurs in my 10,000-article corpus.
- omitted: the number of times Kuromoji omitted the string and did not index it.
- omit%: omitted/char_freq as a percentage. Values below 95% are bolded.
The remaining columns come in triples, which are:
- freq: the number of times the token was normalized in a particular way
- %: freq/char_freq as a percentage. Values above 1% are bolded.
- norm: the normalized version of the token.
The first table is the single-character tokens, which are generally much more common in the corpus. Many of these are indexed only vanishingly rarely, with 98% or more of the instances in the corpus being omitted from the index. Those with significant rates of indexing are relatively uncommon, occurring hundreds to fewer than ten thousand times in the corpus, rather than one to two hundred thousand times.
token | char_freq | omitted | omit% | .. | freq | % | norm | .. | freq | % | norm | .. | freq | % | norm |
か | 78465 | 78418 | 99.940% | 47 | 0.060% | く | |||||||||
き | 35541 | 33237 | 93.517% | 5 | 0.014% | きる | 2299 | 6.469% | くる | ||||||
く | 37657 | 37326 | 99.121% | 220 | 0.584% | く | 73 | 0.194% | くい | 38 | 0.101% | くる | |||
け | 29808 | 29594 | 99.282% | 8 | 0.027% | く | 206 | 0.691% | け | ||||||
こ | 59029 | 58317 | 98.794% | 187 | 0.317% | くる | 494 | 0.837% | こ | 31 | 0.053% | こい | |||
さ | 71158 | 71122 | 99.949% | 36 | 0.051% | す | |||||||||
し | 169553 | 169549 | 99.998% | 4 | 0.002% | す | |||||||||
す | 62089 | 61890 | 99.679% | 199 | 0.321% | す | |||||||||
ず | 9163 | 4236 | 46.229% | 4927 | 53.771% | ぬ | |||||||||
た | 205766 | 205698 | 99.967% | 68 | 0.033% | たい | |||||||||
っ | 79632 | 79281 | 99.559% | 351 | 0.441% | く | |||||||||
ぬ | 905 | 498 | 55.028% | 407 | 44.972% | ぬ | |||||||||
ね | 2891 | 2392 | 82.740% | 84 | 2.906% | ぬ | 402 | 13.905% | ね | 13 | 0.450% | ねる | |||
よ | 43157 | 42158 | 97.685% | 963 | 2.231% | よ | 32 | 0.074% | よい | 4 | 0.009% | る | |||
り | 71178 | 70666 | 99.281% | 512 | 0.719% | り | |||||||||
る | 218636 | 218282 | 99.838% | 188 | 0.086% | り | 166 | 0.076% | る | ||||||
れ | 120917 | 120879 | 99.969% | 38 | 0.031% | る | |||||||||
ろ | 6659 | 6423 | 96.456% | 2 | 0.030% | る | 234 | 3.514% | ろ |
The second table is the two-character tokens. I didn't bold higher % values since almost everything would be bolded. There are many fewer occurrences of these strings in the corpus overall, with some occurring fewer than 10 times and none more than 1500 times (compared to hundreds of thousands of occurrences above).
token | char_freq | omitted | omit% | .. | freq | % | norm | .. | freq | % | norm | .. | freq | % | norm | .. | freq | % | norm | .. | freq | % | norm |
きゃ | 68 | 54 | 79.41% | 14 | 20.59% | く | |||||||||||||||||
くっ | 113 | 106 | 93.81% | 5 | 4.42% | くう | 2 | 1.77% | くる | ||||||||||||||
くら | 816 | 756 | 92.65% | 53 | 6.50% | くら | 2 | 0.25% | くらい | 5 | 0.61% | くる | |||||||||||
くり | 678 | 589 | 86.87% | 79 | 11.65% | くり | 10 | 1.47% | くる | ||||||||||||||
くる | 976 | 243 | 24.90% | 733 | 75.10% | くる | |||||||||||||||||
くれ | 544 | 249 | 45.77% | 10 | 1.84% | くる | 285 | 52.39% | くれる | ||||||||||||||
くろ | 183 | 119 | 65.03% | 13 | 7.10% | くる | 51 | 27.87% | くろい | ||||||||||||||
こい | 162 | 78 | 48.15% | 14 | 8.64% | くる | 56 | 34.57% | こい | 2 | 1.23% | こう | 5 | 3.09% | こく | 7 | 4.32% | こぐ | |||||
ざり | 23 | 19 | 82.61% | 4 | 17.39% | ぬ | |||||||||||||||||
ざる | 160 | 14 | 8.75% | 4 | 2.50% | ざる | 142 | 88.75% | ぬ | ||||||||||||||
ざれ | 5 | 4 | 80.00% | 1 | 20.00% | ぬ | |||||||||||||||||
しゃ | 488 | 407 | 83.40% | 81 | 16.60% | す | |||||||||||||||||
ずん | 14 | 9 | 64.29% | 5 | 35.71% | ぬ | |||||||||||||||||
たい | 1497 | 480 | 32.06% | 1015 | 67.80% | たい | 2 | 0.13% | たく | ||||||||||||||
たき | 129 | 85 | 65.89% | 12 | 9.30% | たい | 21 | 16.28% | たき | 11 | 8.53% | たく | |||||||||||
たく | 522 | 339 | 64.94% | 143 | 27.39% | たい | 40 | 7.66% | たく | ||||||||||||||
たし | 1103 | 1092 | 99.00% | 4 | 0.36% | たい | 7 | 0.63% | たす | ||||||||||||||
てぇ | 3 | 0 | 0.00% | 3 | 100.00% | たい | |||||||||||||||||
とう | 828 | 653 | 78.86% | 20 | 2.42% | たい | 155 | 18.72% | とう | ||||||||||||||
りゃ | 16 | 15 | 93.75% | 1 | 6.25% | る | |||||||||||||||||
るる | 47 | 11 | 23.40% | 13 | 27.66% | る | 23 | 48.94% | るる | ||||||||||||||
るれ | 4 | 3 | 75.00% | 1 | 25.00% | る |
Overall, how these 1- and 2-character tokens are indexed is still a concern, but the numbers lean towards it not being a gigantic problem.
Non-Indexed Characters
I’ve noticed that the analyzer drops a lot of characters and just doesn’t index them. (This isn’t a disaster—we have the “text” field with the analyzed text, but also the “plain” field, which is generally unchanged, so exact matches are always possible.)
As an example of text being dropped, I analyzed the sentence fragment below. The characters in [square brackets] are not indexed. Running the text through Google Translate, there don’t seem to be any egregious errors—lots of function words (or at least things that get translated to function words) are getting omitted. A sketch of how these dropped spans can be found follows the example.
- グレート [・]アトラクター [が] 数億光年 [に] 渡る宇宙 [の] 領域内 [にある] 銀河 [とそれが] 属する銀河団 [の] 運動 [に] 及ぼす影響 [の] 観測 [から] 推定 [されたものである。]
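Here is a rough sketch of how the bracketed spans can be identified: run _analyze with the analyzer in question and mark any character not covered by a returned token's offsets. The host, index, and analyzer names are illustrative and match the earlier sketches, not production.

```python
import requests

ES = "http://localhost:9200"
text = "グレート・アトラクターが数億光年に渡る宇宙の領域内にある銀河"

resp = requests.post(
    f"{ES}/jawiki_unpacked_test/_analyze",
    json={"analyzer": "ja_text", "text": text},
)
resp.raise_for_status()

# Mark every character position covered by some token's offsets.
covered = [False] * len(text)
for tok in resp.json()["tokens"]:
    for i in range(tok["start_offset"], tok["end_offset"]):
        covered[i] = True

# Characters never covered by any token are the ones the analyzer dropped.
dropped = "".join(c for c, hit in zip(text, covered) if not hit)
print(dropped)
```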
To Do
- ✓ Get some native/fluent speaker review of the groupings above (Done! Thanks whym!)
- ✓ Test tokenization independently (Done—see above)
- ✓ Figure out what to do about BM25 (Done—as with Chinese, we'll enable it in the labs version and if it is well received, we'll go enable it in production)
- ✗ Enable BM25 for Japanese in prod if the labs review goes well.
- It didn't go well...
- ✓ Set up one or more of the configurations in labs (Done: http://ja-wp-kuromoji-relforge.wmflabs.org/w/index.php?search= )
- ✓ Post request for feedback to the Village Pump (Done: got feedback—check it out!)
- ✗ Do the deployment + reindexing dance! ♩♫♩
Abandon Ship!
Unfortunately, the user/speaker review from the Village Pump didn't go well. There were some problems with scoring and configuration in Labs, but even with that settled, the results were often not as good, and often included lots of extraneous results. (Extra results probably would have been okay if better results had ended up at the top of the list, but that didn't happen.)
It's possible that better scoring and weighting would give better results, but there's no simple, obvious fix to try, and careful tuning would require significant time and significant help from a fluent speaker. Since we weren't specifically trying to fix a problem with Japanese, just offering a potential improvement, it's okay to abandon this change.
We can come back to Kuromoji or another analyzer in the future if it offers better accuracy, or if we think it would fix a problem for the Japanese language wikis.