
User:TJones (WMF)/Notes/Nori Analyzer Analysis


August/September 2018 – See TJones_(WMF)/Notes for other projects. See also T178925 and T206874. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Background


A new Korean analyzer, called Nori, [1] [2] supported by Elastic, is part of Elasticsearch 6.4 / Lucene 7.4. Even if we don't upgrade to ES 6.4 as our first version of ES 6, it seems worth waiting a little longer for an ES-supported analyzer.

My goal is to install ES 6.4 on my local machine and test the analyzer there. If Nori is up to the job, then deployment can wait until we are at ES 6.4+.

Data


As usual, I'm doing my analysis on 10,000 random Korean Wikipedia articles and 10,000 random Korean Wiktionary entries, with most markup removed, and lines deduplicated to reduce unnaturally frequent wiki-specific phrases, like the Korean equivalents of "references", "see also", "noun", etc.

Deep Background


I went looking to see if anything new and interesting had happened in the world of Korean morphological analysis since I made my list of analyzers to look into in October 2017.

MeCab was still high on the list, and there are a couple of ES plugin wrappers around it on Github.

Then I found more discussion of an ES analyzer called Seunjeon, used by Amazon and available on BitBucket. Seunjeon is based on MeCab, so it looked really promising.

I found a blog post on the Elastic blog comparing Seunjeon and two other analyzers, Arirang and Open-korean-text. I missed it last year because it was published 2 days after I wrote the first draft of my list. Ugh.

The author of the blog post, Kiju Kim, is a Korean-speaking engineer at Elastic, and he did a great job looking at the speed and memory usage of the three analyzers. He also did an analysis of the tokens found by each in a short span of text.

Since that blog post was so helpful, I looked at other blog posts he's written, and discovered a nice series on the basics of working with Korean, Japanese, and Chinese text.

But then there was his newest post, from this August, announcing a new Korean analyzer in Lucene 7.4 and Elastic 6.4! It's called Nori.

An Elastic-supported analyzer, if it is linguistically adequate, is definitely the way to go. Supporting our own analyzers, like we do for Esperanto, Slovak, and Serbian, is good, but still a maintenance burden, and relying on third-party analyzers is great, except when they don't update to the right version of Elasticsearch quite as fast as we'd like them to.

A Brief Note on Hangeul


If you aren't familiar with the Korean writing system, Hangeul,† you can obviously read a lot more about it on Wikipedia. Very briefly, Hangeul characters are generally "syllabic blocks" that have an internal structure made up of individual consonant and vowel symbols, called jamo.

† The name of the script can be transliterated in different systems as "Hangul" (a simplification of the more careful transliteration "Han'gŭl") or "Hangeul" in English. Which form is the most correct? "한글", obviously.

The name of the writing system, Hangeul, is written as 한글... han + geul. But 한 (han) is made up of ㅎ + ㅏ + ㄴ (h + a + n) and 글 (geul) is made up of ㄱ + ㅡ + ㄹ (g + eu + l). There are different ways of arranging the jamo into a syllabic block, but they are generally a mix of left-to-right and top-to-bottom, depending on the number of jamo elements and their size and shape.

Knowing this makes it a little easier to follow some of the transformations that happen during stemming. When adding the present tense suffix -ᆫ다/-nda, for example, the ᆫ/n can join the final syllable of the word it is added to. Similarly, when it is removed, the final syllable left behind can lose the ᆫ/n. When stripping the suffix from 간다/ganda, we are left with 가/ga, which may not look like the original word. If you look closely, though, the first syllable 간/gan is 가/ga sitting on top of ᆫ/n.

The ᆫ/n in -ᆫ다/-nda can also replace a final ㄹ/l, so the stem of 안다/anda (note that ㅇ here indicates there is no initial consonant and ㅏ is a) is 알/al. Even if you can't remember the sounds of the individual jamo, knowing the structure of the syllabic blocks makes it easier to see the relationship between 안 and 알: only the final consonant of the syllable changed.

In addition to Hangeul, Korean writing also uses Hanja, which are Chinese characters that have been borrowed into Korean.

Status Quo: The CJK Analyzer


Korean-language wikis currently use the CJK analyzer, which does some normalization (like converting Ｆｕｌｌｗｉｄｔｈ characters to halfwidth characters), but its most obvious feature is the way tokenization is done. Strings of CJK characters are broken up into overlapping bigrams. If we did that in English, the word bigram would be indexed as bi, ig, gr, ra, and am. Korean 위키백과 ("Wikipedia") is tokenized as 위키, 키백, and 백과. Word boundaries that aren't marked by a space or some punctuation are ignored. It's not great, but it mostly works. Better systems are definitely better, but also a lot more complex. (Here's hoping the Nori Korean analyzer is better!)
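
The bigram behavior is easy to see with the _analyze API. A minimal sketch, assuming a local Elasticsearch instance; the expected output is just the example above.

```python
# Sketch: ask the built-in "cjk" analyzer to tokenize 위키백과 ("Wikipedia")
# and print the overlapping bigrams it produces.
import json
import requests

resp = requests.post(
    "http://localhost:9200/_analyze",
    headers={"Content-Type": "application/json"},
    data=json.dumps({"analyzer": "cjk", "text": "위키백과"}),
)
# Expected, per the discussion above: ['위키', '키백', '백과']
print([t["token"] for t in resp.json()["tokens"]])
```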

CJK Solo Analysis


When I investigated the Kuromoji analyzer for Japanese last summer, I neglected to do an analysis of the CJK analyzer on its own. (I keep discovering new and weird ways for analyses to go wrong, so my checklist of things to look for keeps growing.)

The most obvious things I see in this solo analysis are:

  • There are very few tokens that get normalized to be the same as other tokens. Essentially, all CJK bigrams are unique.
  • Full width forms like "１２０", "ＩＭＦ", and "：" get normalized to their halfwidth (i.e., "normal" for English) forms, "120", "IMF", and ":".
  • Most longer tokens, especially in the 10 to 20 character range, are in Latin script (including English and German words, domain names, underscored_phrases, or long file names), with a bit of Thai and some long numbers with commas in them thrown in. There are also occasional multi-byte characters that Elastic can't handle as-is, so "𐰜𐰇𐰚" gets converted at a 12:1 ratio into "\uD803\uDC1C\uD803\uDC07\uD803\uDC1A", and thus shows up internally as a 36-character token.
  • Above 25 characters, the longer tokens in the Wikipedia corpus are actually mostly Korean script!
    • The most common case is Korean words or phrases separated by a middle dot or interpunct (·, U+00B7), which is used like a comma or in-line bullet in lists: "교리·부응교·수원부사·이조참의·병조참판·도승지·대사간·대사헌".
    • It looks like any change in script from Hangeul to Latin or numbers will block the bigramming, so I see tokens like "๋Š”1988๋…„11์›”24์ผ๋ถ€ํ„ฐ1999๋…„8์›”8์ผ๊นŒ์ง€๋ฏธ๊ตญktma" and "์•ˆ๋…•์€ํ•˜์ฒ ๋„999๊ทน์žฅํŒ2.1981๋…„8์›”8์ผ.์ผ๋ณธ๊ฐœ๋ด‰์ž‘1999๋…„์žฌ๋”๋น™videoํŒ".
    • The presence of any non-CJK letter seems to block tokenization and bigramming. I replaced the middle dot with a Latin, Cyrillic, Armenian, Devanagari, Hebrew, or Arabic character and got the same result. Unusual space characters (six-per-em space, figure space, no-break space) did not break the tokenization.
    • A general solution would be complex and detailed, but a 95% solution would be to replace middle dots with spaces. We should do that if we don't adopt the Nori analyzer.
  • The Wiktionary data doesn't show any middle dots used for lists, and the vast majority of longer tokens are IPA phonetic transcriptions of phrases, which are often pleasantly detailed, so they have lots of diacritics that up the character count. For example, "sอˆษ›ฬษกษฏnbaฬ ษญtอˆaฬ ksอˆษ›ฬษกษฏnbaฬ ษญtอˆaฬ kสฐaฬ daฬ "โ€”which is 26 letters + 12 diacritics. Some are just really long, like "kum.beล‹.i.do.bal.bษจ.mjษ™n.kโ€™um.tสฐษจl.han.da"โ€”which uses periods to mark syllable boundaries.

A New Contender Emerges: The Nori Analyzer


The Nori analyzer consists of several parts:

  • A Korean tokenizer: based on the MeCab dictionary. It can also optionally break up compounds into parts (with an option to keep or discard the original compound; discarding is the default), and it can make use of a user dictionary of additional nouns. If the tokenization works well, it should give much more accurate search results than the CJK bigrams!
  • A part of speech filter: Unsurprisingly, proper tokenization can be a little easier if you take into account parts of speech. In English, for example, the blank in "the ____ is ..." is going to be a noun phrase, which can help you figure out how to parse it. In "the building is ...", building is a noun, and so maybe we don't want to strip off the final -ing because it is not a verbal ending, as it would be in "She is building a fort." Anyway, since we have the parts of speech, we can filter out affixes, particles, and other low-information elements.
  • A reading form filter: This converts Hanja (Chinese characters) into their equivalent Hangeul. The Hangeul is more ambiguous, but may occur instead of the Hanja in some contexts. For common Hangeul equivalents, this can conflate a lot of Hanja.
  • A lowercasing filter: For the stray words and letters that show up in Latin, Cyrillic, Greek, or Armenian scripts.
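
To make the pieces above concrete, here is a sketch of what they look like as an Elasticsearch analysis config, based on the ES 6.4 analysis-nori documentation; the analyzer and tokenizer names (and the commented-out user dictionary path) are just illustrative.

```python
# Sketch: the Nori components expressed as an unpacked custom analyzer
# (requires the analysis-nori plugin; names like "korean_nori" are mine).
nori_analysis = {
    "analysis": {
        "tokenizer": {
            "korean_nori_tok": {
                "type": "nori_tokenizer",
                "decompound_mode": "discard",  # the default; "none" and "mixed" also exist
                # "user_dictionary": "userdict_ko.txt",  # optional extra nouns
            }
        },
        "analyzer": {
            "korean_nori": {
                "type": "custom",
                "tokenizer": "korean_nori_tok",
                "filter": [
                    "nori_part_of_speech",   # drops affixes, particles, and other low-information POS
                    "nori_readingform",      # Hanja -> Hangeul
                    "lowercase",             # for stray Latin/Cyrillic/Greek/Armenian tokens
                ],
            }
        },
    }
}
```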

Nori Solo Analysis


The most common cause for input tokens to be stemmed the same seems to be the Hanja-to-Hangeul (Chinese-to-Korean) normalization. Since the tokens often have no characters in common, my automatic detection of potential problem stems goes crazy and almost every stemming group is a "potential problem" (i.e., there is no common beginning or ending substring across all terms). I will randomly sample some Hanja-to-Hangeul groups for native speaker review.

Some tokens that stem together are indeed inflections, but because of the syllabic blocks, it's hard to see that they are related. It's possible to decompose them (using NFD normalization). Doing so reveals our friend -ᆫ다/-nda as a likely Korean suffix, but since most stemming groups are Hanja/Hangeul groups, it didn't do too much for narrowing the range of potential problem suffixes.

Other tokens that stem together come from compounds. The default behavior for Nori is to break a compound into parts, discarding the original and keeping the parts. So, 위키백과 ("Wikipedia") gets divided into 위키 (a transliteration of wiki) and 백과 (an analog of the "encyclo" part of encyclopedia, meaning "all subjects").

Generally, I'm in favor of indexing the original compound (for increased precision) and the individual parts (for increased recall), and letting the scoring sort it out.

Feeding examples to the Nori stemmer on the command line also makes it clear that the context of a token affects its stemming and its status as a compound (probably mediated by the part of speech tagging). For example, when I tokenize the string "기다리. 기다림." (both are forms of 기다리다, meaning "to wait for") I get back two instances of the stem "기다리". With just a space between them, as "기다리 기다림", the stems are 기다리 and 다리, with the initial 기- apparently removed from the second token. (Though my very poorly-educated guess is that the tokenizer may sometimes ignore spaces and in this case is interpreting -기 as a suffix, since it has several suffixed forms. Tokenizing "기다리기다림" gives the 기다리 and 다리 stems, too.)
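
For anyone who wants to poke at the same kinds of examples, the _analyze API with "explain": true exposes the tokenizer's decisions, including its part-of-speech tags. A minimal sketch, assuming a local Elasticsearch with the analysis-nori plugin installed (treat the attribute names as an assumption):

```python
# Sketch: inspect Nori's tokenization and part-of-speech tags for an example string.
import json
import requests

body = {"tokenizer": "nori_tokenizer", "text": "기다리. 기다림.", "explain": True}
resp = requests.post(
    "http://localhost:9200/_analyze",
    headers={"Content-Type": "application/json"},
    data=json.dumps(body),
)
for tok in resp.json()["detail"]["tokenizer"]["tokens"]:
    # leftPOS/rightPOS show the tagger's decision for each token
    print(tok["token"], tok.get("leftPOS"), tok.get("rightPOS"))
```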

The compound processing results in some input tokens generating multiple output tokens. Out of 128,352 pre-analysis tokens, 126,342 generated only one output token. 1,001 generated 2, 9 generated 3, 2 generated 4, and 1 generated 0! I definitely need to see where that empty token came from, and double check on those potential three- and four-part compounds.

Frequency of number of tokens generated and examples for 2+

output token count   freq.    examples
0                    1
1                    126342
2                    1001     바덴뷔르템베르크주 → 바덴뷔르템베르크 - 주
3                    9        건진 → 것 - 이 - 지; 바스코다가마 → 바스코 - 다 - 가마
4                    2        양재시민의숲역 → 양재 - 시민 - 숲 - 역; 학동·증심사입구역 → 학동 - 증심사 - 입구 - 역

Oddly, the empty output token maps back to an empty input token (with length zero). It is triggered by the presence of the four characters "그레이맨" (part of the name of a manga character, 디 그레이맨). The four characters in the name are parsed as an input token, followed by a zero-length token. It's weird. There's only one in my 10K Wikipedia sample, and none in the Wiktionary sample, but there could potentially be dozens or even hundreds in the full Korean Wikipedia, so an empty-token filter seems to be called for.
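
A minimal sketch of such a filter, using the standard length token filter (the filter name is mine); it would go at the end of the analyzer's filter chain:

```python
# Sketch: drop zero-length tokens with a "length" token filter.
empty_token_filter = {
    "remove_empty": {
        "type": "length",
        "min": 1,   # discard any token shorter than one character
    }
}
# ...then add "remove_empty" to the analyzer's "filter" list.
```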

Some other things I see in this solo analysis are:

  • The longest non-Korean tokens are similar, though there are no domain names, words_with_underscores, numbers with commas, or long IPA phonetic transcriptions.
  • Nori doesn't have the middle dot problem CJK does, but the longest Korean tokens still look similar: "ใ†๋„๋กœใ†์ง€๋ฐ˜ใ†์ˆ˜์ž์›ใ†๊ฑด์„คํ™˜๊ฒฝใ†๊ฑด์ถ•ใ†ํ™”์žฌ์„ค๋น„์—ฐ๊ตฌ". Some people use an obsolete character called "arae-a" (ใ†, U+318D) in place of a middle dot. Visually they are very similarโ€” ใ† vs ยท โ€”though YMMV, depending on your fonts.
    • Nori doesn't have the non-CJK letter problem the CJK analyzer does, so it breaks on numbers, Latin, Cyrillic, Greek, Armenian, Hebrew, or Arabic characters. It also breaks on Japanese Hiragana and Katakana characters. Chinese characters can be the same as the Hanja based on them, so they seem to be tokenized according to the internal workings of the Nori tokenizer, which often splits them off as separate tokens.
    • Note: The character arae-a (ใ†, U+318D) mentioned above is technically the "HANGUL LETTER ARAEA", and there is another form, the "HANGUL JUNGSEONG ARAEA" (แ†ž, U+119E), which looks pretty much identical and is also, probably incorrectly, used in lists ("์ƒˆ๋กœ์šด ์ƒ๊ฐแ†ž ๋ฐฐ๋ คํ•˜๋Š” ๋งˆ์Œแ†ž ์ปค๊ฐ€๋Š” ๊ฟˆ"). Its more proper use is as a "jungseong" character, which is the medial character of a syllabic block. If there is no precomposed Unicode character for a given syllabic block, you can specify the block as initial/medial/final parts (choseong/jungseong/jongseong) and if your fonts and operating system are up to the challenge, you get nice-looking syllabic blocks as a result. Below is an example of the same characters in Arial Unicode (which shows them individually) and in Noto Sans CJK (which is a newer font and knows how to do the right thing). There are even fewer of these used incorrectly, and some used correctly for historical syllabic blocks, so the right solution is probably to find them when used as bullet points, and replace them with a different character.

CJK vs Nori (monolithic)


Since the tokenization of Korean is wildly different between CJK and Nori, comparing the two is also about how they treat everything else: normalization, treatment of non-Korean CJK text, treatment of non-CJK text, etc.

The token count differences for the Wikipedia corpus are huge: 3,698,656 for CJK and 2,382,379 for Nori. For a string of, say, 8 Korean characters without spaces, CJK will give 7 bigrams, while Nori will likely give only 2 to 4 tokens (the median length of Nori tokens is 3, and the mode is 2). The Wiktionary data shows less of a difference (105,389 tokens for CJK and 103,702 tokens for Nori), but Nori splits up certain non-Korean tokens that CJK doesn't, and there are many more non-Korean tokens in Wiktionary that both treat the same.

Other differences of note (โœ“=Good for Nori; โ€”=Neutral for Nori; โœ—=Bad for Nori):

  • โœ“ Nori splits up mixed-Korean/non-CJK and mixed-script sequences. The most common are ones like date elements like "3์›”" ("third month/March") and "1990๋…„" ("the year 1990"), and measurements like "0.875ํ†ค์ด" ("0.875 tons"). With Nori we can still match phrases, but CJK would have trouble matching "1990" to "1990๋…„".
    • โœ“ Another common case is "์˜" which seems to mean "of", stuck to the end of a non-Korean word, like "allegory์˜" or "ะฐะฝะณะตะป์˜". Mixed-script CJK bigrams (like "ใซ้†ค", in which the first character is Japanese and the second is Chinese) are split up, too.
    • โœ— Nori kindly offers the option to get some of internal details, and it seems to divide non-CJK characters into type "SL(Foreign language)" and "SY(Other symbol)" (among others). It seems to generally break between characters from different character sets, though which Unicode blocks count as "symbols" is weird. Most Latin or Latin-based characters are "foreign", but IPA extensions are "symbols".
      • โœ— The "Greek Extended" block is also treated as "symbols" rather than as Greek, so "ฮตแผฐฮผฮฏ" gets split up into "ฮต" + "แผฐ" + "ฮผฮฏ", because แผฐ ("GREEK SMALL LETTER IOTA WITH PSILI" in the "Greek Extended" block) is treated as a symbol.
      • โœ— Our friend "sอˆษ›ฬษกษฏnbaฬ ษญtอˆaฬ ksอˆษ›ฬษกษฏnbaฬ ษญtอˆaฬ kสฐaฬ daฬ " gets tokenized by Nori as "s" + "อˆษ›ฬษกษฏ" + "nba" + "ฬ ษญ" + "t" + "อˆ" + "a" + "ฬ " + "ks" + "อˆษ›ฬษกษฏ" + "nba" + "ฬ ษญ" + "t" + "อˆ" + "a" + "ฬ " + "k" + "สฐ" + "a" + "ฬ " + "da" + "ฬ ".
    • โ€” Numbers are a different category for Nori, so "UPC600" gets split into "UPC" and "600".
    • โœ— Nori splits on combining characters (also treated as "symbols"), so that "ะ‘ะฐฬ€ะปั‚ะธั‡ะบะพฬ„" gets split into four tokens: ะ‘ะฐ + ฬ€ + ะปั‚ะธั‡ะบะพ + ฬ„. This happens a lot in Cyrillic, where combining accents are used to show stress.
    • โœ— Nori splits on apostrophes and some other apostrophe-like characters, including the curly apostrophe (โ€™) and the Hebrew ืณ (U+05F3, "HEBREW PUNCTUATION GERESH").
    • โœ— A lot of the combining and modifying characters are indexed on their own by Nori, resulting in low-quality and relatively high-volume tokens: ฬ€ โ€ข ฬ โ€ข ฬ‚ โ€ข ฬƒ โ€ข ฬ„ โ€ข ฬ โ€ข ฬชสฒ โ€ข ฬชห  โ€ข สท โ€ข สป โ€ข สผ โ€ข สพ โ€ข สฟ โ€ข ห€ ... etc.
  • โœ— CJK leaves soft hyphens in place, Nori splits tokens on them. Neither is ideal. Stripping them before tokenization might be better.
  • โœ“ Nori strips bidirectional markers (used when switching between left-to-right and right-to-left scripts), CJK leaves them in place, which is wrong.
  • โœ— CJK leaves zero-width non-joiner characters in place, while Nori splits on them, both of which are usually wrong, since the usual purpose of the character is to prevent ligatures. Stripping them seems to be the right thing to do.
  • โ€” Nori also splits on periods (example.com and 3.14159), colons (commons:category:vienna), underscores (service_file_system_driver), commas (in numbers), where CJK does not. These are generally good, though it's not great for acronyms.
  • โœ“ CJK eats encircled numbers (โ‘ โ‘กโ‘ข), "dingbat" circled numbers (โž€โžโž‚), parenthesized numbers (โ‘ดโ‘ตโ‘ถ), fractions (ยผ โ…“ โ…œ ยฝ โ…” ยพ), superscript numbers (ยนยฒยณ), and subscript numbers (โ‚โ‚‚โ‚ƒ); Nori keeps them. Normalizing them is probably best.
  • โœ“ It's a minor thing, but CJK eats some characters, like "๐„ž", "๐ŸŽ„", and some private use area characters, while Nori keeps them. They both eat other characters, like "โ™ก".
  • โ€” Strings of Japanese Hiragana or Katakana are kept whole. This is probably good for titles and other short phrases, but not good for extended strings of Japanese that would get tokenized as one long token.
  • โœ— Oddly, the string "ํŠœํ† ๋ฆฌ์–ผ" gets tokenized with an extra space at the end, as "ํŠœํ† ๋ฆฌ์–ผ ". It's the only token like that I've found. It seems to be stemmed correctly from the forms in the text ("ํŠœํ† ๋ฆฌ์–ผ์—" and "ํŠœํ† ๋ฆฌ์–ผ์„"), it just has an extra space. Weird.

Nori (monolithic) vs Nori (unpacked)


I unpacked Nori according to the Elasticsearch 6.4 documentation and the results were identical to the monolithic Nori analyzer. So that's good. Now we can test other variations, like changing the compound processing and introducing ICU normalization and custom character filters to address some of the problems Nori has.
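
The check itself is mechanical. Here is a sketch of the kind of before-and-after comparison, assuming an index (called ko_test here, a made-up name) whose settings include the unpacked korean_nori analyzer from the earlier sketch:

```python
# Sketch: confirm the monolithic "nori" analyzer and the unpacked custom
# analyzer produce the same token stream for the same text.
import json
import requests

def tokens(index, analyzer, text):
    resp = requests.post(
        f"http://localhost:9200/{index}/_analyze",
        headers={"Content-Type": "application/json"},
        data=json.dumps({"analyzer": analyzer, "text": text}),
    )
    return [(t["token"], t["position"]) for t in resp.json()["tokens"]]

sample = "위키백과"
assert tokens("ko_test", "nori", sample) == tokens("ko_test", "korean_nori", sample)
```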

Nori: Enabling "Mixed" Compounds


Nori has three options for dealing with compounds: break the compound into pieces and discard the original compound (the default), leave the compound as is, or index both the compound and its sub-parts, which they call "mixed". I prefer the "mixed" option if the compound splitting is good, as I noted above, because it allows for more precise matches on the whole compound, but also reasonable matches on parts of the compound.
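
Configuration-wise, the change is just the decompound_mode setting on the tokenizer; a sketch (the tokenizer name is mine):

```python
# Sketch: switch the Nori tokenizer to "mixed" compound handling so both the
# original compound and its parts are indexed.
nori_mixed_tokenizer = {
    "korean_nori_tok": {
        "type": "nori_tokenizer",
        "decompound_mode": "mixed",
    }
}
```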

We'll need native speaker review to judge the quality of the compound splitting, but we can still get a sense of the size of the impact on our 10K corpora.

For the Wikipedia corpus, the "default" Nori config generated 2,382,379 tokens, while "mixed" Nori generated 2,659,656 tokens. The extra ~277K (11.6%) tokens should be the original compounds that were split into sub-parts and discarded in the "default" config. The Wiktionary corpus gave a similar though smaller increase: 103,702 vs 111,331 (~7.6K / 7.4%).

New collisions are very rareโ€”on the order of 0.1% or less. This makes sense, because a new collision in this case means that a compound "AB" that had previously been indexed only as "A" and "B" is indexed as "AB", but there are already existing tokens indexed as "AB".

For the Wikipedia corpus, 45 pre-analysis types (0.035% of pre-analysis types) / 3644 tokens (0.153% of tokens) were added to 45 groups (0.041% of post-analysis types), affecting a total of 121 pre-analysis types (0.094% of pre-analysis types) in those groups.

For the Wiktionary corpus: 12 pre-analysis types (0.05% of pre-analysis types) / 67 tokens (0.065% of tokens) were added to 12 groups (0.058% of post-analysis types), affecting a total of 28 pre-analysis types (0.116% of pre-analysis types) in those groups.

So, the impact on increased ambiguity (new collisions) is very low, but the number of compounds indexed, which increases precision when searching for those compounds, is high!

Barring negative speaker feedback, indexing in "mixed" mode seems to be the way to go.

Other notes:

  • I did find one token out of the entire 10K Wikipedia corpus that has a middle dot in it, "학동·증심사입구역", where the middle dot seems to be playing the role of a hyphen in the name of a subway station with two names. The English title uses an en dash ("Hakdong–Jeungsimsa Station") but the opening sentence uses the original middle dot ("Hakdong·Jeungsimsa Station")! A very brief search did not turn up any other instances of titles with middle dots that are tokenized as one long token.
    • Note that this kind of thing may be less of a problem for Nori when it does happen because the bigger token is broken down into smaller tokens for both "default" Nori and "mixed" Nori.

Nori: Enabling ICU Normalization


I compared the Nori "mixed compounds" config against the same config, but with ICU normalization enabled instead of simple lowercasing.
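
Concretely, this is the same unpacked analyzer as before with icu_normalizer (from the analysis-icu plugin) in place of lowercase; a sketch:

```python
# Sketch: replace the trailing lowercase filter with ICU normalization.
# icu_normalizer defaults to NFKC plus case folding, so it still lowercases.
korean_nori_icu = {
    "type": "custom",
    "tokenizer": "korean_nori_tok",
    "filter": ["nori_part_of_speech", "nori_readingform", "icu_normalizer"],
}
```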

Since this involves normalizing strings after tokenization, the number of tokens found in each corpus remains unchanged (2,659,656 for Wikipedia; 111,331 for Wiktionary).

The impact is again very small, on the order of 0.1% or less.

For the Wikipedia corpus: 44 pre-analysis types (0.028% of pre-analysis types) / 396 tokens (0.015% of tokens) were added to 32 groups (0.024% of post-analysis types), affecting a total of 90 pre-analysis types (0.058% of pre-analysis types) in those groups.

For the Wiktionary corpus: 13 pre-analysis types (0.049% of pre-analysis types) / 887 tokens (0.797% of tokens) were added to 10 groups (0.043% of post-analysis types), affecting a total of 25 pre-analysis types (0.093% of pre-analysis types) in those groups.

New collisions are mostly other versions of letters and numbers (superscript, subscript, fullwidth, encircled, parenthesized), along with ligatures ("ﬁ"/"fi"), precomposed Roman numerals, German ß -> ss, and Greek ς -> σ. The only Korean collisions are the letters ㅔ and ㆍ being converted to their jungseong counterparts.

Other changes include the usual ICU normalizations.

One (familiar) problem is that dotted capital I (İ) is converted to lowercase i with an extra dot (i̇), as in İtalya -> i̇talya. (Indexing already-lowercase i̇talya gives three tokens, since there is no precomposed character for i̇, which is a regular i plus a combining dot.) This can be fixed with a character filter converting Turkish İ to I early on.

The primary effect on Korean text is to convert individual Unicode "letters" into the corresponding choseong/jungseong/jongseong. So, "ㅍㅇㅎㄴㅌ" is converted to "ᄑᄋᄒᄂᄐ", which may look the same, unless you have clever fonts that can try to correctly format the choseong/jungseong/jongseong as syllabic blocks.

Early ICU Normalization


The lowercase filter part of the Nori analysis chain happens last. Early on I thought that was a bit odd, so after unpacking Nori, I moved the ICU Normalizer (which replaced the lowercase filter) to be first among the token filters. It didn't make any difference for Wikipedia or Wiktionary, with the default or mixed compound processing.

There is an ICU Normalization character filter (which applies before tokenization) which could have a positive impact on the tokenization of non-Korean text.
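
For reference, that variant is the icu_normalizer character filter, which runs before the tokenizer; a sketch of how it would be attached:

```python
# Sketch: ICU normalization as a character filter, applied before tokenization.
icu_char_filter = {
    "char_filter": {
        "icu_pre_normalize": {"type": "icu_normalizer", "name": "nfkc_cf"}
    }
}
# ...referenced from the analyzer via "char_filter": ["icu_pre_normalize"]
```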

I didn't run a full analysis, because after testing some examples ("ฮตแผฐฮผฮฏ", "ะ‘ะฐฬ€ะปั‚ะธั‡ะบะพฬ„", "sอˆษ›ฬษกษฏnbaฬ ษญtอˆaฬ ksอˆษ›ฬษกษฏnbaฬ ษญtอˆaฬ kสฐaฬ daฬ ", "kum.beล‹.i.do.bal.bษจ.mjษ™n.kโ€™um.tสฐษจl.han.da"), it didn't actually do anything useful. There were minor changes, like "tสฐษจl" becoming th + ษจ + l instead of t + hษจ + l.

More aggressive ICU folding, rather than mere normalization, would probably convert "tʰɨl" to "thil", but ICU folding is only available as a token filter, after tokenization.

Since most of the "weird" problems are for non-Korean text, they aren't show-stoppers if we can't fix them.

Nori + Custom Filters


I added some custom character and token filters to the unpacked, mixed-compound, ICU-normalizing Nori config:

  • A mapping character filter to:
    • convert middle dot (·, U+00B7) and letter arae-a (ㆍ, U+318D) to spaces
    • convert dotted capital I (İ) to I
    • remove soft hyphens and zero-width non-joiner
  • A pattern_replace character filter to strip combining diacritic characters from U+0300 to U+0331.
  • A minimum-length token filter to remove empty strings
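
Wired together, the whole thing looks roughly like the sketch below; every name here is illustrative, the \u escapes assume the mapping parser's escape syntax, and the exact combining-character range is still open to debate.

```python
# Sketch: the custom character and token filters described above, attached to
# the unpacked, mixed-compound, ICU-normalizing Nori analyzer.
custom_korean = {
    "analysis": {
        "char_filter": {
            "korean_char_map": {
                "type": "mapping",
                "mappings": [
                    "\\u00B7 => \\u0020",   # middle dot -> space
                    "\\u318D => \\u0020",   # arae-a -> space
                    "\\u0130 => I",         # dotted capital I -> I
                    "\\u00AD => ",          # strip soft hyphens
                    "\\u200C => ",          # strip zero-width non-joiners
                ],
            },
            "strip_combining": {
                "type": "pattern_replace",
                "pattern": "[\\u0300-\\u0331]",
                "replacement": "",
            },
        },
        "filter": {
            "remove_empty": {"type": "length", "min": 1},
        },
        "tokenizer": {
            "korean_nori_tok": {"type": "nori_tokenizer", "decompound_mode": "mixed"},
        },
        "analyzer": {
            "korean_nori": {
                "type": "custom",
                "char_filter": ["korean_char_map", "strip_combining"],
                "tokenizer": "korean_nori_tok",
                "filter": [
                    "nori_part_of_speech",
                    "nori_readingform",
                    "icu_normalizer",
                    "remove_empty",
                ],
            }
        },
    }
}
```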

I'm not sure what to do about the apostrophes and apostrophe-like characters, so I've left that alone for now.

Some stats:

  • The net effect of the filters on tokenization in the Wikipedia corpus was small: 2,659,656 tokens before, 2,659,650 tokens after; presumably the effects of merging tokens that were broken up by diacritics were offset by splitting up tokens joined by dots. The Wiktionary corpus had a bigger net decrease in tokens, from 111,331 to 106,283; pronunciations are still getting split by character type, but no longer also on every diacritic.
  • The impact on the Wikipedia corpus was small, with on the order of 0.1% of tokens or less affected.
    • New collisions: 33 pre-analysis types (0.021% of pre-analysis types) / 37 tokens (0.001% of tokens) were added to 32 groups (0.024% of post-analysis types), affecting a total of 77 pre-analysis types (0.05% of pre-analysis types) in those groups.
    • New splits: 13 pre-analysis types (0.008% of pre-analysis types) / 30 tokens (0.001% of tokens) were lost from 13 groups (0.01% of post-analysis types), affecting a total of 216 pre-analysis types (0.139% of pre-analysis types) in those groups.
  • The impact on the Wiktionary corpus was a bit larger, with up to almost 2% of tokens being affected.
    • New collisions: 202 pre-analysis types (0.754% of pre-analysis types) / 1940 tokens (1.743% of tokens) were added to 181 groups (0.77% of post-analysis types), affecting a total of 405 pre-analysis types (1.512% of pre-analysis types) in those groups.

Observations:

  • The one token that might have been okay with a middle dot (학동·증심사입구역) does not get tokenized as one token anymore. (With the mixed compound config, its parts were getting indexed before, and still are.)
  • There are some momentarily confusing results, such as Cyrillic "Его" no longer being in the same group as "его", because it's actually part of "Его́ров" (which had been tokenized as "его + ́ + ров", but is now kept together as "егоров").
  • Lots of good results, like "M­B­C" (with soft hyphens) indexed with "MBC" and "Ви́ктор" with "Виктор".
  • Wiktionary has lots of additional collisions, with phonetic transcription bits grouping with plain text.

Overall, this seems like a reasonable improvement, though the exact list of combining characters to ignore is unclear.

Speaker Review


There's a lot going on with the Korean analyzer beyond stemming, which has often been the focus of my analyses. Tokenization and compound processing are also important. There is also the Hanja-to-Hangeul transformation.

Tokenization and Compounds


Below are ten random example sentences pulled from Korean Wikipedia, and seven example sentences with specific phrases that get processed as compounds and are split into three or more tokens.

Speaker Notes: Please review the 17 examples below for proper tokenization, which is the process of breaking up text into words or other units. There can be some disagreement about the exact way to break up a particular text, so it doesn't have to be perfect, just reasonable. Some words are identified as "compounds", and are also broken up into smaller pieces. For example, 양재시민의숲역 is broken up into 양재, 시민, 숲, and 역. For the purposes of search, searching for any of these five tokens would match a document that contains the full form, 양재시민의숲역.

In the examples below, each token is in [brackets]. When multiple tokens come from the same phrase, they are bracketed together, like this: [양재시민의숲역 • 양재 • 시민 • 숲 • 역].

We are generally only worried about really bad tokenization and compound processing. As an example in English, "football" could be broken up into "foot" and "ball" (i.e., [football • foot • ball]) or it could just occur as one word, [football]. Those are both acceptable. Something like [football • foo • tball] would be bad.

Note: some words or endings may be missing from the tokenization. Nori also removes words/characters/jamo that it determines are in the categories verbal endings, interjections, ending particles, general adverbs, conjunctive adverbs, determiners, prefixes, adjective suffixes, noun suffixes, verb suffixes, and various kinds of punctuation. Words have also been stemmed, that is, reduced to their base forms, which may introduce some additional errors or unexpected words.

10 random sentences


Reviewers: you can also look at some review and discussion that's already happened, to see if you agree or disagree or have anything to add.

input ๊น€๋Œ€์ค‘ ๋Œ€ํ†ต๋ น์€ 2003๋…„๊นŒ์ง€ ํ•™๊ธ‰๋‹น ํ•™์ƒ์ˆ˜๋ฅผ 35๋ช… ์ดํ•˜๋กœ ๊ฐ์ถ•ํ•œ๋‹ค๋Š”๋‚ด์šฉ์˜ '7.20 ๊ต์œก์—ฌ๊ฑด ๊ฐœ์„ ๊ณ„ํš' ์„ ๋ฐœํ‘œํ–ˆ๋‹ค.
tokens [๊น€๋Œ€์ค‘] โ€” [๋Œ€ํ†ต๋ น] โ€” [2003] โ€” [๋…„] โ€” [ํ•™๊ธ‰] โ€” [ํ•™์ƒ] โ€” [์ˆ˜] โ€” [35] โ€” [๋ช…] โ€” [์ดํ•˜] โ€” [๊ฐ์ถ•] โ€” [๋‚ด์šฉ] โ€” [7] โ€” [20] โ€” [๊ต์œก] โ€” [์—ฌ๊ฑด] โ€” [๊ฐœ์„ ] โ€” [๊ณ„ํš] โ€” [๋ฐœํ‘œ]
input ๋ชจ๋“  ๋ชจ๋ธ์€ MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, ํ–ฅ์ƒ๋œ ์ธํ…” ์Šคํ”ผ๋“œ์Šคํ… ๊ธฐ์ˆ (EIST), EM64T(Extended Memory 64 Technology), XD ๋น„ํŠธ, ๊ฐ€์ƒํ™” ๊ธฐ์ˆ , ์Šค๋งˆํŠธ ์บ์‹œ, ์ธํ…” ํ„ฐ๋ณด ๋ถ€์ŠคํŠธ ์ง€์›
tokens [๋ชจ๋ธ] โ€” [mmx] โ€” [sse] โ€” [sse] โ€” [2] โ€” [sse] โ€” [3] โ€” [ssse] โ€” [3] โ€” [sse] โ€” [4] โ€” [1] โ€” [sse] โ€” [4] โ€” [2] โ€” [ํ–ฅ์ƒ] โ€” [์ธํ…”] โ€” [์Šคํ”ผ๋“œ์Šคํ… โ€ข ์Šคํ”ผ๋“œ โ€ข ์Šคํ…] โ€” [๊ธฐ์ˆ ] โ€” [eist] โ€” [em] โ€” [64] โ€” [t] โ€” [extended] โ€” [memory] โ€” [64] โ€” [technology] โ€” [xd] โ€” [๋น„ํŠธ] โ€” [๊ฐ€์ƒ] โ€” [๊ธฐ์ˆ ] โ€” [์Šค๋งˆํŠธ] โ€” [์บ์‹œ] โ€” [์ธํ…”] โ€” [ํ„ฐ๋ณด] โ€” [๋ถ€์ŠคํŠธ] โ€” [์ง€์›]
input ๋‹ค ์ž๋ผ๋ฉด ๋ชธ๊ธธ์ด๋Š” 61 cm, ๋ชธ๋ฌด๊ฒŒ๋Š” 1.4~2.7 kg ์ •๋„๊ฐ€ ๋œ๋‹ค.
tokens [์ž๋ผ] โ€” [๋ชธ๊ธธ์ด โ€ข ๋ชธ โ€ข ๊ธธ์ด] โ€” [61] โ€” [cm] โ€” [๋ชธ๋ฌด๊ฒŒ โ€ข ๋ชธ โ€ข ๋ฌด๊ฒŒ] โ€” [1] โ€” [4] โ€” [2] โ€” [7] โ€” [kg] โ€” [์ •๋„] โ€” [๋œ๋‹ค โ€ข ๋˜]
input 7์›” 14์ผ์—๋Š” ํƒœํ•ญ์‚ฐ์— ์žˆ๋˜ ์กฐ์„ ์ฒญ๋…„์—ฐํ•ฉํšŒ ์†Œ์† ๋ณ‘์‚ฌ๋“ค์ด ํ•˜๋ถ์„ฑ์— ๋„์ฐฉํ•˜์ž, ๋‹น์ผ ํ•˜๋ถ์„ฑ ์„ญํ˜„์—์„œ ๊น€๋‘๋ด‰, ๋ฐ•ํšจ์‚ผ ๋“ฑ๊ณผ ํ•จ๊ป˜ ์กฐ์„ ์˜์šฉ๊ตฐ์„ ๋ฐœ์กฑ์‹œํ‚ค๊ณ  ์ด์‚ฌ๋ น๊ด€์— ์ทจ์ž„ํ–ˆ๋‹ค.
tokens [7] โ€” [์›”] โ€” [14] โ€” [์ผ] โ€” [ํƒœํ•ญ] โ€” [์‚ฐ] โ€” [์žˆ] โ€” [์กฐ์„ ] โ€” [์ฒญ๋…„] โ€” [์—ฐํ•ฉํšŒ โ€ข ์—ฐํ•ฉ โ€ข ํšŒ] โ€” [์†Œ์†] โ€” [๋ณ‘์‚ฌ] โ€” [ํ•˜๋ถ์„ฑ โ€ข ํ•˜๋ถ โ€ข ์„ฑ] โ€” [๋„์ฐฉ] โ€” [๋‹น์ผ] โ€” [ํ•˜๋ถ์„ฑ โ€ข ํ•˜๋ถ โ€ข ์„ฑ] โ€” [์„ญ] โ€” [ํ˜„] โ€” [๊น€๋‘๋ด‰] โ€” [๋ฐ•] โ€” [ํšจ] โ€” [์‚ผ] โ€” [๋“ฑ] โ€” [์กฐ์„ ] โ€” [์šฉ๊ตฐ] โ€” [๋ฐœ์กฑ] โ€” [์ด์‚ฌ๋ น๊ด€ โ€ข ์ด โ€ข ์‚ฌ๋ น โ€ข ๊ด€] โ€” [์ทจ์ž„]
input ์—ฐํ•ฉ๊ฐ๋ฆฌ๊ตํšŒ์˜ ์กฐ์ง์€ ๋ฏธ๊ตญ ์ด์™ธ์—๋„ ์บ๋‚˜๋‹ค์™€ ์œ ๋Ÿฝ, ์•„ํ”„๋ฆฌ์นด์™€ ํ•„๋ฆฌํ•€์˜ ๊ตํšŒ๋“ค์„ ํฌํ•จํ•œ๋‹ค.
tokens [์—ฐํ•ฉ] โ€” [๊ฐ๋ฆฌ] โ€” [๊ตํšŒ] โ€” [์กฐ์ง] โ€” [๋ฏธ๊ตญ] โ€” [์ด์™ธ] โ€” [์บ๋‚˜๋‹ค] โ€” [์œ ๋Ÿฝ] โ€” [์•„ํ”„๋ฆฌ์นด] โ€” [ํ•„๋ฆฌํ•€] โ€” [๊ตํšŒ] โ€” [ํฌํ•จ]
input 2006๋…„ ์ค‘ํ™”์ธ๋ฏผ๊ณตํ™”๊ตญ์—์„œ๋Š” ๋‹จ๋ฐฑ์งˆ์˜ ํ•จ๋Ÿ‰์„ ์†์—ฌ์„œ, ๋ฏธ๊ตญ์œผ๋กœ ์ˆ˜์ถœํ•  ๊ฐ€์ถ• ์‚ฌ๋ฃŒ์˜ ์›๋ฃŒ์ธ ๋ฐ€๊ธ€๋ฃจํ… ๋“ฑ ์กฐ๋‹จ๋ฐฑ ํ•จ๋Ÿ‰์ด ๋†’์€ ์‚ฌ๋ฃŒ ์›๋ฃŒ์˜ ๋‹จ๋ฐฑ์งˆ์–‘์„ ๊ณผ์žฅํ•˜์—ฌ ๋ถ€ํ’€๋ฆฌ๋Š” ๋ฐ ์ด์šฉํ•˜์˜€๋‹ค.
tokens [2006] โ€” [๋…„] โ€” [์ค‘ํ™”] โ€” [์ธ๋ฏผ๊ณตํ™”๊ตญ โ€ข ์ธ๋ฏผ โ€ข ๊ณตํ™”๊ตญ] โ€” [๋‹จ๋ฐฑ์งˆ โ€ข ๋‹จ๋ฐฑ โ€ข ์งˆ] โ€” [ํ•จ๋Ÿ‰] โ€” [์†์—ฌ์„œ โ€ข ์†์ด] โ€” [๋ฏธ๊ตญ] โ€” [์ˆ˜์ถœ] โ€” [๊ฐ€์ถ•] โ€” [์‚ฌ๋ฃŒ] โ€” [์›๋ฃŒ] โ€” [์ธ โ€ข ์ด] โ€” [๋ฐ€] โ€” [๊ธ€๋ฃจํ…] โ€” [๋“ฑ] โ€” [์กฐ๋‹จ] โ€” [๋ฐฑ] โ€” [ํ•จ๋Ÿ‰] โ€” [๋†’] โ€” [์‚ฌ๋ฃŒ] โ€” [์›๋ฃŒ] โ€” [๋‹จ๋ฐฑ์งˆ โ€ข ๋‹จ๋ฐฑ โ€ข ์งˆ] โ€” [์–‘] โ€” [๊ณผ์žฅ] โ€” [๋ถ€ํ’€๋ฆฌ] โ€” [๋ฐ] โ€” [์ด์šฉ]
input ์ผ๋ณธ ์š”๋ฆฌ๋Š” ์‡ผ๊ตฐ ์น˜ํ•˜ ๋™์•ˆ์— ์—˜๋ฆฌํŠธ์ฃผ์˜๋ฅผ ์—†์• ๋ ค ํ–ˆ๋˜ ์ค‘์„ธ ์‹œ๋Œ€๊ฐ€ ์ถœํ˜„ํ•˜๋ฉฐ ๋ณ€ํ™”ํ•˜์˜€๋‹ค.
tokens [์ผ๋ณธ] โ€” [์š”๋ฆฌ] โ€” [์‡ผ๊ตฐ] โ€” [์น˜ํ•˜] โ€” [๋™์•ˆ] โ€” [์—˜๋ฆฌํŠธ์ฃผ์˜ โ€ข ์—˜๋ฆฌํŠธ โ€ข ์ฃผ์˜] โ€” [์—†์• ] โ€” [ํ–ˆ โ€ข ํ•˜] โ€” [์ค‘์„ธ] โ€” [์‹œ๋Œ€] โ€” [์ถœํ˜„] โ€” [๋ณ€ํ™”]
input ใ€Ž์‚ฐ๋ฆ‰๋„๊ฐ์˜๊ถคใ€ ๋“ฑ ๋ฌธํ—Œ์— ์˜ํ•˜๋ฉด ์„ธ์ข… ์˜๋ฆ‰(่‹ฑ้™ต), ๋ช…์ข… ๊ฐ•๋ฆ‰(ๅบท้™ต), ์ธ์กฐ ์žฅ๋ฆ‰(้•ท้™ต), ํšจ์ข… ์˜๋ฆ‰(ๅฏง้™ต)์˜ ์ •์ž๊ฐ์ด ํŒ”์ž‘์ง€๋ถ•์ด์—ˆ์œผ๋‚˜, ํ›„๋Œ€์— ๋ชจ๋‘ ๋งž๋ฐฐ์ง€๋ถ•์œผ๋กœ ๊ต์ฒด๋˜์–ด ํ˜„์žฌ๋Š” ์ˆญ๋ฆ‰์˜ ์ •์ž๊ฐ๋งŒ ํŒ”์ž‘์ง€๋ถ•์œผ๋กœ ๋‚จ์•„ ์žˆ๋‹ค.
tokens [์‚ฐ๋ฆ‰๋„๊ฐ โ€ข ์‚ฐ๋ฆ‰ โ€ข ๋„๊ฐ] โ€” [๊ถค] โ€” [๋“ฑ] โ€” [๋ฌธํ—Œ] โ€” [์˜ํ•˜] โ€” [์„ธ์ข…] โ€” [์˜๋ฆ‰] โ€” [์˜๋ฆ‰] โ€” [๋ช…์ข…] โ€” [๊ฐ•๋ฆ‰] โ€” [๊ฐ•๋ฆ‰] โ€” [์ธ์กฐ] โ€” [์žฅ๋ฆ‰] โ€” [์žฅ๋ฆ‰] โ€” [ํšจ์ข…] โ€” [์˜๋ฆ‰] โ€” [ๅฏง] โ€” [๋ฆ‰] โ€” [์ •์ž๊ฐ โ€ข ์ •์ž โ€ข ๊ฐ] โ€” [ํŒ”์ž‘์ง€๋ถ• โ€ข ํŒ”์ž‘ โ€ข ์ง€๋ถ•] โ€” [์ด] โ€” [ํ›„๋Œ€] โ€” [๋งž๋ฐฐ์ง€๋ถ• โ€ข ๋งž๋ฐฐ โ€ข ์ง€๋ถ•] โ€” [๊ต์ฒด] โ€” [ํ˜„์žฌ] โ€” [์ˆญ๋ฆ‰] โ€” [์ •์ž๊ฐ โ€ข ์ •์ž โ€ข ๊ฐ] โ€” [ํŒ”์ž‘์ง€๋ถ• โ€ข ํŒ”์ž‘ โ€ข ์ง€๋ถ•] โ€” [๋‚จ] โ€” [์žˆ]
input 1934๋…„ ํŒŒ์šธ ํฐ ํžŒ๋ด๋ถ€๋ฅดํฌ ๋Œ€ํ†ต๋ น์ด ์‚ฌ๋งํ•œ ํ›„ ํžˆํ‹€๋Ÿฌ๋Š” ์ˆ˜์ƒ๊ณผ ๋Œ€ํ†ต๋ น์ง์„ ๊ฒธ๋ฌดํ•ด์„œ ๊ตญ๋ฐฉ๊ตญ ์ตœ๊ณ  ์ง€ํœ˜๊ถŒ์„ ์†์— ๋„ฃ๊ฒŒ ๋˜์—ˆ๋‹ค.
tokens [1934] โ€” [๋…„] โ€” [ํŒŒ์šธ] โ€” [ํฐ] โ€” [ํžŒ๋ด๋ถ€๋ฅดํฌ] โ€” [๋Œ€ํ†ต๋ น] โ€” [์‚ฌ๋ง] โ€” [ํ›„] โ€” [ํžˆํ‹€๋Ÿฌ] โ€” [์ˆ˜์ƒ] โ€” [๋Œ€ํ†ต๋ น] โ€” [์ง] โ€” [๊ฒธ๋ฌด] โ€” [๊ตญ๋ฐฉ] โ€” [๊ตญ] โ€” [์ตœ๊ณ ] โ€” [์ง€ํœ˜] โ€” [์†] โ€” [๋„ฃ] โ€” [๋˜]
input ๋ถ€์‚ฐ์ง€๋ฐฉ๋ฒ•์›์™€ ์„œ์šธํ˜•์‚ฌ์ง€๋ฐฉ๋ฒ•์› ๋“ฑ์—์„œ ๋ถ€์žฅํŒ์‚ฌ๋ฅผ ํ•˜๋‹ค๊ฐ€ ๋ถ€์‚ฐ์ง€๋ฐฉ๋ฒ•์›, ์ œ์ฃผ์ง€๋ฐฉ๋ฒ•์›, ์ถ˜์ฒœ์ง€๋ฐฉ๋ฒ•์›, ๊ด‘์ฃผ๊ณ ๋“ฑ๋ฒ•์›์—์„œ ๋ฒ•์›์žฅ์„ ์—ญ์ž„ํ•˜์˜€์œผ๋ฉฐ ์ดํ›„ ๊ณต์ง์—์„œ ๋ฌผ๋Ÿฌ๋‚˜ ๋ณ€ํ˜ธ์‚ฌ ํ™œ๋™์„ ํ–ˆ๋‹ค.
tokens [๋ถ€์‚ฐ] โ€” [์ง€๋ฐฉ] โ€” [๋ฒ•์›] โ€” [์„œ์šธ] โ€” [ํ˜•์‚ฌ] โ€” [์ง€๋ฐฉ] โ€” [๋ฒ•์›] โ€” [๋“ฑ] โ€” [๋ถ€์žฅ] โ€” [ํŒ์‚ฌ] โ€” [ํ•˜] โ€” [๋ถ€์‚ฐ] โ€” [์ง€๋ฐฉ] โ€” [๋ฒ•์›] โ€” [์ œ์ฃผ] โ€” [์ง€๋ฐฉ] โ€” [๋ฒ•์›] โ€” [์ถ˜์ฒœ] โ€” [์ง€๋ฐฉ] โ€” [๋ฒ•์›] โ€” [๊ด‘์ฃผ] โ€” [๊ณ ๋“ฑ] โ€” [๋ฒ•์›] โ€” [๋ฒ•์›์žฅ โ€ข ๋ฒ•์› โ€ข ์žฅ] โ€” [์—ญ์ž„] โ€” [์ดํ›„] โ€” [๊ณต์ง] โ€” [๋ฌผ๋Ÿฌ๋‚˜ โ€ข ๋ฌผ๋Ÿฌ๋‚˜] โ€” [๋ณ€ํ˜ธ์‚ฌ โ€ข ๋ณ€ํ˜ธ โ€ข ์‚ฌ] โ€” [ํ™œ๋™] โ€” [ํ–ˆ โ€ข ํ•˜]

7 sentences with many-part compounds


Reviewers: you can also look at some review and discussion that's already happened, to see if you agree or disagree or have anything to add.

input ์–‘์žฌ์—ญ - ์–‘์žฌ์‹œ๋ฏผ์˜์ˆฒ์—ญ - ์–‘์žฌ ๋‚˜๋“ค๋ชฉ (์ œ๋ถ€์—ฌ๊ฐ์œผ๋กœ ์ด๊ด€)
tokens [์–‘์žฌ์—ญ โ€ข ์–‘์žฌ โ€ข ์—ญ] โ€” [์–‘์žฌ์‹œ๋ฏผ์˜์ˆฒ์—ญ โ€ข ์–‘์žฌ โ€ข ์‹œ๋ฏผ โ€ข ์ˆฒ โ€ข ์—ญ] โ€” [์–‘์žฌ] โ€” [๋‚˜๋“ค๋ชฉ โ€ข ๋‚˜๋“ค โ€ข ๋ชฉ] โ€” [์ œ๋ถ€์—ฌ๊ฐ โ€ข ์ œ๋ถ€ โ€ข ์—ฌ๊ฐ] โ€” [์ด๊ด€]
input ์ œ41๊ถŒ ใ€Š๋น„ํ‹€์Šค๋ฅผ ์œ„๊ธฐ์—์„œ ๊ฑด์ง„ ๋…ธ๋ž€ ์ž ์ˆ˜ํ•จใ€‹
tokens [41] โ€” [๊ถŒ] โ€” [๋น„ํ‹€์Šค] โ€” [์œ„๊ธฐ] โ€” [๊ฑด์ง„ โ€ข ๊ฒƒ โ€ข ์ด โ€ข ์ง€] โ€” [๋…ธ๋ž€ โ€ข ๋…ธ๋ž—] โ€” [์ž ์ˆ˜ํ•จ โ€ข ์ž ์ˆ˜ โ€ข ํ•จ]
input 17๋ฒˆํŠธ๋ž™ <์ข‹์€๋‚ > ๋ธŒ๋ผ์šด์•„์ด๋“œ๊ฑธ์Šค ๋ฒ„์ „์„ ํŽธ๊ณก
tokens [17] โ€” [๋ฒˆ] โ€” [ํŠธ๋ž™] โ€” [์ข‹] โ€” [๋‚ ] โ€” [๋ธŒ๋ผ์šด์•„์ด๋“œ๊ฑธ์Šค โ€ข ๋ธŒ๋ผ์šด โ€ข ์•„์ด๋“œ โ€ข ๊ฑธ์Šค] โ€” [๋ฒ„์ „] โ€” [ํŽธ๊ณก]
input ์‚ฌํƒ•์ˆ˜์ˆ˜๋Š” ์›๋ž˜ ์—ด๋Œ€ ๋‚จ์•„์‹œ์•„์™€ ๋™๋‚จ์•„์‹œ์•„์—์„œ ์ „ํ•ด์ ธ์™”๋‹ค.
tokens [์‚ฌํƒ•์ˆ˜์ˆ˜ โ€ข ์‚ฌํƒ• โ€ข ์ˆ˜์ˆ˜] โ€” [์—ด] โ€” [๋Œ€] โ€” [๋‚จ์•„์‹œ์•„ โ€ข ๋‚จ โ€ข ์•„์‹œ์•„] โ€” [๋™๋‚จ์•„์‹œ์•„ โ€ข ๋™๋‚จ โ€ข ์•„์‹œ์•„] โ€” [์ „ํ•ด์ ธ์™” โ€ข ์ „ํ•˜ โ€ข ์ง€ โ€ข ์˜ค]
input ๋ฏธ๊ตญ๊ตฐ์ด ์ฒ˜์Œ์œผ๋กœ ๋ผ์ธ๊ฐ•์„ ๋„ํ•˜ํ•œ๋‹ค
tokens [๋ฏธ๊ตญ] โ€” [๊ตฐ] โ€” [์ฒ˜์Œ] โ€” [๋ผ์ธ๊ฐ• โ€ข ๋ผ์ธ โ€ข ๊ฐ•] โ€” [๋„ํ•˜]
input ์ „ ๊ตฌ๊ฐ„ ์•ผ๋งˆ๊ตฌ์น˜ํ˜„์— ์†Œ์žฌ.
tokens [๊ตฌ๊ฐ„] โ€” [์•ผ๋งˆ๊ตฌ์น˜ํ˜„ โ€ข ์•ผ๋งˆ๊ตฌ์น˜ โ€ข ํ˜„] โ€” [์†Œ์žฌ]
input ๋‹น์‹ ์ด ๋ญ”๋ฐ ์—ฌ๊ธฐ์„œ ํฐ์†Œ๋ฆฌ๋ฅผ ์น˜๋Š”๊ฑฐ์•ผ.
tokens [๋‹น์‹ ] โ€” [๋ญ”๋ฐ โ€ข ๋ญ โ€ข ์ด] โ€” [์—ฌ๊ธฐ] โ€” [ํฐ์†Œ๋ฆฌ โ€ข ํฐ โ€ข ์†Œ๋ฆฌ] โ€” [์น˜] โ€” [๊ฑฐ โ€ข ๊ฒƒ] โ€” [์•ผ โ€ข ์ด]

Hanja-to-Hangul


One of the unique features of the Nori analyzer is that it converts Hanja (Chinese characters borrowed into Korean) to Hangeul (the syllabic Korean script) to make them easier to search for. We want to make sure the conversion seems reasonable.

Speaker Notes: Below are 55 random examples of Chinese tokens, which are presumably Hanja, being grouped together with Korean tokens. Searching for either would find the other. Are these groupings reasonable? (Note that the last 14 examples have more than one Chinese token.)

  • [ๆฃฎ] [์‚ผ]
  • [่ฏ] [๋ จ]
  • [ๆ ธ] [ํ•ต]
  • [ๆŸณ] [๋ฅ˜]
  • [็•ฅ] [๋žต]
  • [ไบ”็ตƒ] [์˜คํ˜„]
  • [ไบ”้“] [์˜ค๋„]
  • [ไบคๅญ] [๊ต์ž]
  • [ๅˆ†ๆดพ] [๋ถ„ํŒŒ]
  • [ๅˆ†้…] [๋ถ„๋ฐฐ]
  • [ๅฏ่ƒฝ] [๊ฐ€๋Šฅ]
  • [ๅ–ฎๅ…‰] [๋‹จ๊ด‘]
  • [ๅฅ‡ๅฝข] [๊ธฐํ˜•]
  • [ๅฅ‰ๆˆด] [๋ด‰๋Œ€]
  • [ๅฉฆๅฎถ] [๋ถ€๊ฐ€]
  • [ๅชฝๅชฝ] [๋งˆ๋งˆ]
  • [ๅฑฑไธญ] [์‚ฐ์ค‘]
  • [ๅนณๅœฐ] [ํ‰์ง€]
  • [ๅฝขๆ…‹] [ํ˜•ํƒœ]
  • [ๅฟƒ่ก“] [์‹ฌ์ˆ ]
  • [ๅฟซๆ„Ÿ] [์พŒ๊ฐ]
  • [ๆ”ฟ่ฎŠ] [์ •๋ณ€]
  • [ๆ™‚็”จ] [์‹œ์šฉ]
  • [ๆญฆ้™ต] [๋ฌด๋ฆ‰]
  • [ๆบชๆน–] [๊ณ„ํ˜ธ]
  • [็จๅญค] [๋…๊ณ ]
  • [็พไปฃ] [ํ˜„๋Œ€]
  • [็จ…ๅ‹™] [์„ธ๋ฌด]
  • [็ด€ๅ‚ณ] [๊ธฐ์ „]
  • [่ฐๆ˜Ž] [์ด๋ช…]
  • [่ฅฟๆฑŸ] [์„œ๊ฐ•]
  • [่งฃๆ”พ] [ํ•ด๋ฐฉ]
  • [่ฎ€ๅˆธ] [๋…๊ถŒ]
  • [่ตคๆ ธ] [์ ํ•ต]
  • [้‡Ž็”Ÿ] [์•ผ์ƒ]
  • [้Ž”่Œƒ] [์šฉ๋ฒ”]
  • [้™ฝๅˆป] [์–‘๊ฐ]
  • [้›ฒ่‡บ] [์šด๋Œ€]
  • [้Ÿ“ๆ—ฅ] [ํ•œ์ผ]
  • [้ฌช็ˆญ] [ํˆฌ์Ÿ]
  • [้ปƒ้พ] [ํ™ฉ์ข…]
  • [ไบบๅคฉ] [ไปๅท] [์ธ์ฒœ]
  • [ๅ…จๅœ‹] [ๆˆฐๅœ‹] [์ „๊ตญ]
  • [ๅญคๅฑฑ] [้ซ˜ๅฑฑ] [๊ณ ์‚ฐ]
  • [ๅฎถๅฃ] [ๆžถๆง‹] [๊ฐ€๊ตฌ]
  • [ๅฐ‡็›ธ] [้•ทไธŠ] [์žฅ์ƒ]
  • [ๅฐๅธซ] [็ด ็ ‚] [์†Œ์‚ฌ]
  • [ๆญฃๅผ] [็จ‹ๅผ] [์ •์‹]
  • [้ฃ›้ณฅ] [้ผป็ฅ–] [๋น„์กฐ]
  • [ๅ‹‡] [ๅบธ] [่Œธ] [่ธŠ] [์šฉ]
  • [ๅ‡ๆƒณ] [ๅ‡่ฑก] [ๅ˜‰็ฅฅ] [๊ฐ€์ƒ]
  • [ๅ…ƒๅฎš] [ๅ…ƒๆญฃ] [้ ๅพ] [์›์ •]
  • [ๅคงๅฏถ] [ๅคง่ผ”] [๋Œ€๋ณด] [๋Œ€๋ณธ]
  • [ๅˆบ็นก] [ๅญ—ๆ•ธ] [็ดซ็ถฌ] [่‡ชไฟฎ] [์ž์ˆ˜]
  • [ไปฃๅ„Ÿ] [ๅคงๅ•†] [ๅคง็›ธ] [ๅคง่ณž] [ๅฐ่ฑก] [้šŠๅ•†] [๋Œ€์ƒ]

Stemming


Speaker Notes: Below are 50 random samples of "stemming groups", which are words grouped together by trying to reduce them to their base form. In English, this groups words like hope, hopes, hoped, and hoping. These would be indicated as "hope: [hope] [hoped] [hopes] [hoping]".

Another example, from below: "빠져나오: [빠져나온] [빠져나왔]" means that searching for either of "빠져나온" or "빠져나왔" will find the other. Both are stored internally as the stemmed form "빠져나오". The stemmed form is usually close to the most basic form of a word, but does not need to be correct. The important question is whether it is good that searching for one form in [brackets] will find the others.

Note that some lists may include compounds, which can be broken up into parts. So, you might see something like "ball: [ball] [football] [baseball] [basketball]" because "football" could be stored internally as "football", "foot", and "ball"; "baseball" as "baseball", "base", and "ball"; etc.

  • ๊ฐ€๋ฅด๋‹ค: [๊ฐ€๋ฅด๋‹ค] [๊ฐ€๋ฅด๋‹คํ˜ธ]
  • ๊ฐˆ๋ผ์„œ: [๊ฐˆ๋ผ์„œ] [๊ฐˆ๋ผ์„ฐ]
  • ๊ท„: [๊ท„] [๋ฅด๊ท„]
  • ๋Œ์–ด๋‹น๊ธฐ: [๋Œ์–ด๋‹น๊ฒจ์„œ] [๋Œ์–ด๋‹น๊ฒจ์ ธ] [๋Œ์–ด๋‹น๊ธฐ] [๋Œ์–ด๋‹น๊ธด๋‹ค]
  • ๋ˆˆ๋ถ€์‹œ: [๋ˆˆ๋ถ€์‹œ] [๋ˆˆ๋ถ€์‹ ]
  • ๋‹ค์Šค๋ฆฌ: [๋‹ค์Šค๋ ค] [๋‹ค์Šค๋ ธ] [๋‹ค์Šค๋ฆฌ] [๋‹ค์Šค๋ฆฐ] [๋‹ค์Šค๋ฆฐ๋‹ค] [๋‹ค์Šค๋ฆด] [๋‹ค์Šค๋ฆผ]
  • ๋‹ฌ๋ฆฌ: [๋‹ฌ๋ ค] [๋‹ฌ๋ ค๋‚˜๊ฐ„๋‹ค] [๋‹ฌ๋ ค๋ผ] [๋‹ฌ๋ ค์„œ] [๋‹ฌ๋ ค์•ผ] [๋‹ฌ๋ ค์ ธ] [๋‹ฌ๋ ธ] [๋‹ฌ๋ฆฌ] [๋‹ฌ๋ฆฐ] [๋‹ฌ๋ฆฐ๋‹ค] [๋‹ฌ๋ฆด]
  • ๋ค๋ฒผ๋“ค: [๋ค๋ฒผ๋“œ] [๋ค๋ฒผ๋“ค]
  • ๋…ํ•˜: [๋…ํ•˜] [๋…ํ•œ]
  • ๋’คํ”๋“ค: [๋’คํ”๋“œ] [๋’คํ”๋“ ] [๋’คํ”๋“ค]
  • ๋“ค๋œจ: [๋“ค๋– ] [๋“ค๋œจ] [๋“ค๋œธ]
  • ๋ง: [๋ง] [๋ฐ”์ด๋ง]
  • ๋งค๋‹ฌ๋ฆฌ: [๋งค๋‹ฌ๋ ค] [๋งค๋‹ฌ๋ ค์„œ] [๋งค๋‹ฌ๋ ธ] [๋งค๋‹ฌ๋ฆฌ] [๋งค๋‹ฌ๋ฆฐ] [๋งค๋‹ฌ๋ฆด]
  • ๋งค์‚ฌ์ถ”์„ธ์ธ : [๋งค์‚ฌ์ถ”์„ธ์ธ ] [๋งค์‚ฌ์ถ”์„ธ์ธ ์ฃผ]
  • ๋ฉ‹์ง€: [๋ฉ‹์ ธ] [๋ฉ‹์กŒ] [๋ฉ‹์ง€] [๋ฉ‹์ง„]
  • ๋ชธ๋ถ€๋ฆผ์น˜: [๋ชธ๋ถ€๋ฆผ์ณค] [๋ชธ๋ถ€๋ฆผ์น˜]
  • ๋ฌด๋ฅ: [๋ฌด๋”์šด] [๋ฌด๋ฅ]
  • ๋ฌด๋ฅด๋งŒ์Šคํฌ: [๋ฌด๋ฅด๋งŒ์Šคํฌ] [๋ฌด๋ฅด๋งŒ์Šคํฌ์ฃผ]
  • ๋ฐ”๋ด๋ท”๋ฅดํ…œ๋ฒ ๋ฅดํฌ: [๋ฐ”๋ด๋ท”๋ฅดํ…œ๋ฒ ๋ฅดํฌ] [๋ฐ”๋ด๋ท”๋ฅดํ…œ๋ฒ ๋ฅดํฌ์ฃผ]
  • ๋ถ€๋Ÿฌ๋œจ๋ฆฌ: [๋ถ€๋Ÿฌ๋œจ๋ ธ] [๋ถ€๋Ÿฌ๋œจ๋ฆฌ]
  • ๋ถˆ๋Ÿฌ์ผ์œผํ‚ค: [๋ถˆ๋Ÿฌ์ผ์œผ์ผฐ] [๋ถˆ๋Ÿฌ์ผ์œผํ‚ค] [๋ถˆ๋Ÿฌ์ผ์œผํ‚จ] [๋ถˆ๋Ÿฌ์ผ์œผํ‚จ๋‹ค] [๋ถˆ๋Ÿฌ์ผ์œผํ‚ฌ]
  • ๋น™: [๋ฆฌ๋น™] [๋น™]
  • ๋น ๋œจ๋ฆฌ: [๋น ๋œจ๋ ค] [๋น ๋œจ๋ ธ] [๋น ๋œจ๋ฆฌ] [๋น ๋œจ๋ฆด]
  • ๋น ์ ธ๋‚˜์˜ค: [๋น ์ ธ๋‚˜์˜จ] [๋น ์ ธ๋‚˜์™”]
  • ๋ป—์น˜: [๋ป—์ณ] [๋ป—์ณ์„œ] [๋ป—์น˜] [๋ป—์นœ]
  • ์‚ฌ๋ผ: [์‚ฌ๋ผ] [์‚ฌ๋ผ์ฝ”๋„ˆ]
  • ์‚ฌ์šฐ์Šค๋‹ค์ฝ”ํƒ€: [์‚ฌ์šฐ์Šค๋‹ค์ฝ”ํƒ€] [์‚ฌ์šฐ์Šค๋‹ค์ฝ”ํƒ€์ฃผ]
  • ์‚ฌ์šฐ์Šค์บ๋กค๋ผ์ด๋‚˜: [์‚ฌ์šฐ์Šค์บ๋กค๋ผ์ด๋‚˜] [์‚ฌ์šฐ์Šค์บ๋กค๋ผ์ด๋‚˜์ฃผ]
  • ์Š๋ ˆ์Šค๋น„ํžˆํ™€์Šˆํƒ€์ธ: [์Š๋ ˆ์Šค๋น„ํžˆํ™€์Šˆํƒ€์ธ] [์Š๋ ˆ์Šค๋น„ํžˆํ™€์Šˆํƒ€์ธ์ฃผ]
  • ์‹นํŠธ: [์‹นํ„ฐ] [์‹นํŠธ] [์‹นํŠผ] [์‹นํŠผ๋‹ค๊ณ ]
  • ์•„๋””: [๋ฆฌ์•„๋””] [์•„๋””]
  • ์•„ํ‚คํƒ€: [์•„ํ‚คํƒ€] [์•„ํ‚คํƒ€ํ˜„]
  • ์• ์“ฐ: [์• ์จ] [์• ์จ๋„] [์• ์ผ] [์• ์“ฐ] [์• ์“ด๋‹ค]
  • ์•ผ๋‹จ์น˜: [์•ผ๋‹จ์น˜] [์•ผ๋‹จ์นœ๋‹ค]
  • ์—ด๋ฆฌ: [์—ด๋ ค] [์—ด๋ ค๋ผ] [์—ด๋ ค์•ผ] [์—ด๋ ค์ ธ] [์—ด๋ ธ] [์—ด๋ ธ์œผ๋ฉฐ] [์—ด๋ฆฌ] [์—ด๋ฆฐ] [์—ด๋ฆฐ๋‹ค] [์—ด๋ฆฐ๋‹ค๋Š”] [์—ด๋ฆด]
  • ์˜ค๋ž˜๋˜: [์˜ค๋ž˜๋œ] [์˜ค๋ž˜๋จ]
  • ์šฐ๋ฅด: [์šฐ๋Ÿฌ] [์šฐ๋ฅด]
  • ์›จ์Šคํ„ด์˜ค์ŠคํŠธ๋ ˆ์ผ๋ฆฌ์•„: [์›จ์Šคํ„ด์˜ค์ŠคํŠธ๋ ˆ์ผ๋ฆฌ์•„] [์›จ์Šคํ„ด์˜ค์ŠคํŠธ๋ ˆ์ผ๋ฆฌ์•„์ฃผ]
  • ์œ„์•ˆ์žฅ: [์œ„์•ˆ์žฅ] [์œ„์•ˆ์žฅ๊ฐ•]
  • ์œ ํ”„๋ผํ…Œ์Šค: [์œ ํ”„๋ผํ…Œ์Šค] [์œ ํ”„๋ผํ…Œ์Šค๊ฐ•]
  • ์ž˜์ธ ๋ถ€๋ฅดํฌ: [์ž˜์ธ ๋ถ€๋ฅดํฌ] [์ž˜์ธ ๋ถ€๋ฅดํฌ์ฃผ]
  • ์ž ๊ธฐ: [์ž ๊ฒจ] [์ž ๊ฒจ์„œ] [์ž ๊ฒผ] [์ž ๊ธฐ] [์ž ๊ธด] [์ž ๊ธด๋‹ค]
  • ์ง€๋‚ด: [์ง€๋‚ด] [์ง€๋‚ธ] [์ง€๋‚ธ๋‹ค] [์ง€๋‚ผ] [์ง€๋ƒ„] [์ง€๋ƒˆ] [์ง€๋ƒˆ์œผ๋‚˜] [์ง€๋ƒˆ์œผ๋ฉฐ] [์ง€์–ด๋‚ด]
  • ์ซ“๊ธฐ: [์ซ“๊ฒจ] [์ซ“๊ฒจ๊ฐ„] [์ซ“๊ฒผ] [์ซ“๊ธฐ]
  • ์ถ”ํ•˜: [์ถ”ํ•˜] [์ถ”ํ•œ]
  • ํ…Œ๋„ค์‹œ: [ํ…Œ๋„ค์‹œ] [ํ…Œ๋„ค์‹œ์ฃผ]
  • ํŽœ: [๋น„์ œ์ดํŽœ] [ํŽœ]
  • ํ›„๋ ค์น˜: [ํ›„๋ ค์ณ] [ํ›„๋ ค์ณค] [ํ›„๋ ค์น˜]
  • ํ›„์ฟ ์‹œ๋งˆ: [ํ›„์ฟ ์‹œ๋งˆ] [ํ›„์ฟ ์‹œ๋งˆํ˜„]
  • ํœด: [์†ํœด] [ํœด]

Large Groups


Very large groups of tokens that are grouped together are sometimes a sign of something going wrong. Sometimes it just means there are a lot of common related forms or a lot of ambiguity. If there are a relatively small number of really bad stems (as might happen with a statistical model), then we can specifically filter the worst ones (like we do for Polish), or add other filters, say, based on part-of-speech tags.

Speaker Notes: Below are some of the largest "stemming groups", which are words grouped together by trying to reduce them to their base form. In English, this groups words like hope, hopes, hoped, and hoping. These would be indicated as "hope: [hope] [hoped] [hopes] [hoping]".

Note that some lists may include compounds, which can be broken up into parts. So, you might see something like "ball: [ball] [football] [baseball] [basketball]" because "football" could be stored internally as "football", "foot", and "ball"; "baseball" as "baseball", "base", and "ball"; etc.

If it is too difficult to understand why some tokens are grouped with others without context, I can try to provide context for these tokens by tracking the specific sentences they came from.

I've also included some notes from my own investigations for the first two. I'm only listing the top three from Wikipedia until we get a sense of what's going on.

[Note that these large groups are not necessarily indicative of the general overall performance of the Nori analyzer.]

  • ์ง€: [ไน‹] [ๅœฐ] [ๅฟ—] [ๆ‘ฏ] [ๆ™บ] [ๆฑ ] [็Ÿฅ] [่‡ณ] [่Šท] [๊ฐ€๊นŒ์›Œ์ ธ] [๊ฐ€๊นŒ์›Œ์กŒ] [๊ฐ€๊นŒ์›Œ์ง„] [๊ฐ€๊นŒ์›Œ์ง„๋‹ค] [๊ฐ€๊นŒ์›Œ์งˆ] [๊ฐ€๋ ค์ ธ] [๊ฐ€๋ ค์กŒ] [๊ฐ€๋ ค์ง„] [๊ฐ€๋ ค์ง„๋‹ค] [๊ฐ€๋ ค์ง] [๊ฐ€๋ฅด์ณ์ง„] [๊ฐ€๋ฒผ์›Œ์กŒ] [๊ฐ€๋ฒผ์›Œ์ง„] [๊ฐ€ํ•ด์ ธ์•ผ] [๊ฐ€ํ•ด์กŒ] [๊ฐ€ํ•ด์ง„] [๊ฐ€ํ•ด์งˆ] [๊ฐˆ๋ผ์ ธ์„œ] [๊ฐˆ๋ผ์กŒ] [๊ฐˆ๋ผ์ง„๋‹ค] [๊ฐˆ๋ผ์ง] [๊ฐ์ถฐ์งˆ] [๊ฐ•ํ•ด์ ธ] [๊ฐ•ํ•ด์ ธ์„œ] [๊ฐ•ํ•ด์กŒ] [๊ฐ•ํ•ด์ง„๋‹ค] [๊ฐ•ํ•ด์งˆ] [๊ฐ–์ถฐ์ ธ] [๊ฐ–์ถฐ์ง„] [๊ฑด์ง„] [๊ฑธ๋ ค์กŒ] [๊ฑธ์ณ์ ธ] [๊ฑธ์ณ์กŒ] [๊ฑธ์ณ์ง„] [๊ฒน์ณ์ ธ] [๊ณ ์ณ์กŒ] [๊ณฑํ•ด์ง„] [๊ตฌ์›Œ์ง„] [๊ตฌํ•ด์ง„๋‹ค] [๊ทธ๋ ค์ ธ] [๊ทธ๋ ค์กŒ] [๊ทธ๋ ค์กŒ์œผ๋ฉฐ] [๊ทธ๋ ค์ง„] [๊ทธ๋ ค์ง„๋‹ค] [๊ทธ๋ ค์ง„๋‹ค๊ณ ] [๊ทธ๋ ค์งˆ] [๊ทธ๋ฆฌ์›Œ์งˆ] [๊ธธ๋“ค์—ฌ์ง„] [๊ธธ๋Ÿฌ์กŒ] [๊ธธ๋Ÿฌ์กŒ์œผ๋ฉฐ] [๊ธธ๋Ÿฌ์ง„๋‹ค] [๊บผ๋ ค์กŒ] [๊บผ์ ธ] [๊บผ์ง] [๊พธ๋ฉฐ์ ธ] [๊พธ๋ฉฐ์กŒ] [๊พธ๋ฉฐ์ง„] [๋Œ์–ด๋‹น๊ฒจ์ ธ] [๋ผ์›Œ์ ธ] [๋‚˜๋ˆ ์ ธ] [๋‚˜๋น ์ ธ] [๋‚˜๋น ์ ธ์„œ] [๋‚˜๋น ์กŒ] [๋‚˜๋น ์ง„] [๋‚˜๋น ์ง„๋‹ค] [๋‚˜๋น ์งˆ] [๋‚˜์ง„] [๋‚จ๊ฒจ์ ธ] [๋‚จ๊ฒจ์ง„] [๋‚จ๊ฒจ์งˆ] [๋‚ด๋˜์ ธ์ ธ] [๋‚ด๋ ค์ง„] [๋‚ด๋ ค์ง] [๋„˜๊ฒจ์กŒ] [๋„˜๊ฒจ์ง„] [๋„˜๊ฒจ์ง„๋‹ค] [๋†“์—ฌ์ ธ] [๋†“์—ฌ์กŒ] [๋†“์—ฌ์ง„] [๋Š๊ปด์กŒ] [๋Š๊ปด์ง„] [๋Š๊ปด์ง„๋‹ค] [๋Š๊ปด์งˆ] [๋Š๋ ค์กŒ] [๋Š๋ ค์ง„๋‹ค] [๋Š๋ ค์งˆ] [๋Šฆ์ถฐ์กŒ] [๋Šฆ์ถฐ์ง„] [๋‹ค๋ค„์ ธ] [๋‹ค๋ค„์ ธ์•ผ] [๋‹ค๋ค„์กŒ] [๋‹ค๋ค„์ง„] [๋‹ค๋ค„์ง„๋‹ค] [๋‹ฌ๊ถˆ์ง„] [๋‹ฌ๋ผ์ ธ์•ผ] [๋‹ฌ๋ผ์กŒ] [๋‹ฌ๋ผ์ง„๋‹ค] [๋‹ฌ๋ผ์ง„๋‹ค๋Š”] [๋‹ฌ๋ ค์ ธ] [๋‹ด๊ฒจ์ ธ] [๋‹ด๊ฒจ์ง„] [๋”๋Ÿฝํ˜€์ง„] [๋”๋ ตํ˜€์ ธ] [๋”์›Œ์งˆ] [๋”ํ•ด์ ธ] [๋˜์ ธ์ ธ] [๋˜์ ธ์ง„] [๋ง๋ถ™์—ฌ์ง„] [๋ฎ์—ฌ์ ธ] [๋ฐ์›Œ์ง„] [๋Œ๋ ค์กŒ] [๋˜๋Œ๋ ค์กŒ] [๋‘๊บผ์›Œ์ ธ] [๋‘๊บผ์›Œ์ง„๋‹ค] [๋‘˜๋Ÿฌ์ ธ] [๋“œ๋ฆฌ์›Œ์ง„] [๋“ค์—ฌ์ง„] [๋”ฐ๋ผ์ง„] [๋œจ๊ฑฐ์›Œ์ง„] [๋œจ๊ฑฐ์›Œ์งˆ] [๋œธํ•ด์กŒ] [๋กœ์›Œ์งˆ] [๋ง๋ ค์ ธ] [๋งž์ถฐ์ ธ] [๋งž์ถฐ์กŒ] [๋งž์ถฐ์ง„] [๋งž์ถฐ์ง„๋‹ค] [๋งก๊ฒจ์ ธ] [๋งก๊ฒจ์กŒ] [๋งค๊ฒจ์ง„] [๋งค๊ฒจ์ง„๋‹ค] [๋งค๊ฒจ์งˆ] [๋ฉˆ์ถฐ์ง„] [๋ฉ”์›Œ์ ธ] [๋ชจ์…”์ ธ] [๋ชจ์…”์กŒ] [๋ชจ์…”์ง„] [๋ชจ์…”์ง„๋‹ค] [๋ชจ์•„์กŒ] [๋ชจ์•„์ง„] [๋ฌด๊ฑฐ์›Œ์กŒ] [๋ฌด๊ฑฐ์›Œ์ง„] [๋ญ‰์ณ์ง„] [๋ฏธ๋ค„์ ธ] [๋ฏธ๋ค„์กŒ] [๋ฐ”์ณ์ง„] [๋ฐ›์•„๋“ค์—ฌ์ ธ] [๋ฐ›์•„๋“ค์—ฌ์กŒ] [๋ฐ›์•„๋“ค์—ฌ์ง„] [๋ฐ›์•„๋“ค์—ฌ์ง„๋‹ค] [๋ฐ›์ณ์ง„] [๋ฐœ๋ผ์ ธ] [๋ฐํ˜€์ ธ] [๋ฐํ˜€์กŒ] [๋ฐํ˜€์กŒ์œผ๋‚˜] [๋ฐํ˜€์กŒ์œผ๋ฉฐ] [๋ฐํ˜€์กŒ์œผ๋ฏ€๋กœ] [๋ฐํ˜€์ง„] [๋ฐํ˜€์ง„๋‹ค] [๋ฐํ˜€์งˆ] [๋ฐํ˜€์ง] [๋ฒ„๋ ค์ ธ] [๋ฒ„๋ ค์กŒ] [๋ฒ„๋ ค์ง„] [๋ฒ„๋ ค์ง„๋‹ค] [๋ฒŒ๋ ค์ง„] [๋ฒŒ์—ฌ์กŒ] [๋ฒ—๊ฒจ์กŒ] [๋ฒ—๊ฒจ์ง„๋‹ค] [๋ณด์—ฌ์กŒ] [๋ณด์—ฌ์ง„๋‹ค] [๋ด‰ํ•ด์ ธ] [๋ด‰ํ•ด์กŒ] [๋ด‰ํ•ด์กŒ์œผ๋‚˜] [๋ด‰ํ•ด์ง„] [๋ถ€๋“œ๋Ÿฌ์›Œ์กŒ] [๋ถ€๋“œ๋Ÿฌ์›Œ์ง„] [๋ถˆ๋ ค์ ธ] [๋ถˆํƒœ์›Œ์ ธ] [๋ถˆํƒœ์›Œ์กŒ] [๋ถ™์—ฌ์ ธ] [๋ถ™์—ฌ์กŒ] [๋ถ™์—ฌ์ง„๋‹ค] [๋ถ™์—ฌ์งˆ] [๋น„์›Œ์กŒ] [๋น„์ถฐ์กŒ] [๋นจ๋ผ์ ธ์„œ] [๋นจ๋ผ์กŒ] [๋นจ๋ผ์ง„] [๋นจ๋ผ์ง„๋‹ค] [๋ฟŒ๋ ค์กŒ] [๋ฟŒ๋ ค์ง„] [๋ฟŒ๋ ค์ง„๋‹ค] [์‚ฐ์ง€] [์‚ด๋ ค์ง„] [์ƒˆ๊ฒจ์ ธ] [์ƒˆ๊ฒจ์กŒ] [์ƒˆ๊ฒจ์ง„] [์ƒˆ๊ฒจ์งˆ] [์ƒˆ๋กœ์›Œ์ง„] [์„ธ์›Œ์ ธ] [์„ธ์›Œ์ ธ์•ผ] [์„ธ์›Œ์กŒ] [์„ธ์›Œ์กŒ์œผ๋ฉฐ] [์„ธ์›Œ์ง„] [์„ธ์›Œ์ง„๋‹ค] [์„ธ์›Œ์งˆ] [์„ธ์›Œ์ง] [์ˆจ๊ฒจ์ ธ] [์ˆจ๊ฒจ์ ธ์˜จ] [์ˆจ๊ฒจ์ง„] [์‰ฌ์›Œ์กŒ] [์‰ฌ์›Œ์ง„๋‹ค] [์Šค๋Ÿฌ์›Œ์ง„] [์‹œ๋„๋Ÿฌ์›Œ์กŒ] [์‹ฌํ•ด์ ธ] [์‹ฌํ•ด์ ธ์„œ] [์‹ฌํ•ด์กŒ] [์‹ฌํ•ด์ง„] [์‹ฌํ•ด์ง„๋‹ค๋Š”] [์‹ฌํ•ด์งˆ] [์Œ“์—ฌ์ ธ] [์Œ“์—ฌ์กŒ] [์Œ“์—ฌ์ง„] [์จ์ ธ] [์จ์กŒ] [์จ์ง„] [์“ฐ์—ฌ์ ธ] [์“ฐ์—ฌ์ ธ์•ผ] [์“ฐ์—ฌ์กŒ] [์“ฐ์—ฌ์ง„] [์“ฐ์—ฌ์ง„๋‹ค] [์“ฐ์—ฌ์ง„๋‹ค๋ฉด] [์”Œ์—ฌ์กŒ] [์”Œ์—ฌ์ง„] [์•ˆ์ง€] [์•Œ๋ ค์ ธ] [์•Œ๋ ค์ ธ์„œ] [์•Œ๋ ค์ ธ์•ผ] [์•Œ๋ ค์กŒ] [์•Œ๋ ค์กŒ์—ˆ] [์•Œ๋ ค์กŒ์œผ๋‚˜] [์•Œ๋ ค์กŒ์œผ๋ฉฐ] [์•Œ๋ ค์ง„] [์•Œ๋ ค์ง„๋‹ค] [์•Œ๋ ค์งˆ] [์•Œ๋ ค์ง] [์•ž๋‹น๊ฒจ์ง„] [์•ฝํ•ด์ ธ] [์•ฝํ•ด์กŒ] [์•ฝํ•ด์ง„] [์•ฝํ•ด์ง„๋‹ค] [์–ด๋‘์›Œ์ง„] [์–ด๋‘์›Œ์งˆ] [์–ด๋ ค์›Œ์กŒ] [์–ด๋ ค์›Œ์ง„] [์–ด๋ ค์›Œ์ง„๋‹ค] [์–ด๋ ค์›Œ์งˆ] [์–นํ˜€์ ธ] [์–นํ˜€์ง„] [์—ฌ๊ฒจ์ ธ] [์—ฌ๊ฒจ์กŒ] 
[์—ฌ๊ฒจ์กŒ์—ˆ] [์—ฌ๊ฒจ์กŒ์œผ๋‚˜] [์—ฌ๊ฒจ์ง„] [์—ฌ๊ฒจ์ง„๋‹ค] [์—ฌ๊ฒจ์ง„๋‹ค๋Š”] [์—ฌ์ ธ] [์—ฌ์ ธ์„œ] [์—ฌ์ ธ์•ผ] [์—ฌ์กŒ] [์—ฌ์ง„] [์—ฌ์ง„๋‹ค] [์—ฌ์ง] [์—ฐ์ง€] [์—ด๋ ค์ ธ] [์˜ˆ๋ป์งˆ] [์˜ฌ๋ ค์ ธ] [์˜ฌ๋ ค์กŒ] [์˜ฌ๋ ค์ง„] [์˜ฎ๊ฒจ์ ธ] [์˜ฎ๊ฒจ์ ธ์„œ] [์˜ฎ๊ฒจ์กŒ] [์˜ฎ๊ฒจ์ง„] [์˜ฎ๊ฒจ์งˆ] [์˜ฎ๊ฒจ์ง] [์›Œ์ง„] [์ด๋ค„์กŒ] [์ฝํ˜€์ง„] [์ฝํ˜€์ง„๋‹ค] [์ž…ํ˜€์ ธ] [์žŠํ˜€์ ธ] [์žŠํ˜€์กŒ] [์žŠํ˜€์กŒ์œผ๋ฉฐ] [์žŠํ˜€์ง„] [์žŠํ˜€์ง„๋‹ค] [์ž˜๋ ค์กŒ] [์ €์งˆ๋Ÿฌ์กŒ์œผ๋ฉฐ] [์ ํ˜€์ ธ] [์ „ํ•ด์ ธ] [์ „ํ•ด์ ธ์˜จ๋‹ค] [์ „ํ•ด์ ธ์™”] [์ „ํ•ด์กŒ] [์ „ํ•ด์ง„] [์ „ํ•ด์ง„๋‹ค] [์ „ํ•ด์งˆ] [์ „ํ•ด์ง] [์ •ํ•ด์ ธ] [์ •ํ•ด์ ธ์•ผ] [์ •ํ•ด์กŒ] [์ •ํ•ด์ง„] [์ •ํ•ด์ง„๋‹ค] [์ œ์ด์ง€] [์ ธ] [์ ธ๊ฐ] [์ ธ๊ฐ”] [์ ธ๋‚˜์™€] [์ ธ๋„] [์ ธ๋ผ] [์ ธ๋ฒ„๋ฆฐ] [์ ธ๋ณธ] [์ ธ์„œ] [์ ธ์•ผ] [์ ธ์ค€] [์กŒ] [์กŒ์–ด๋„] [์กŒ์—ˆ] [์กŒ์œผ๋‚˜] [์กŒ์œผ๋ฉฐ] [์กŒ์„] [์ขํ˜€์ง„] [์ง€] [์ง€๊ธฐ] [์ง€๋ฉด] [์ง€์›Œ์ ธ] [์ง€์›Œ์กŒ] [์ง€์›Œ์งˆ] [์ง€์งˆ] [์ง€์ผœ์กŒ] [์ง€์ผœ์ง„๋‹ค] [์ง„] [์ง„๋‹ค] [์ง„๋‹ค๊ณ ] [์ง„๋‹ค๋Š”] [์ง„๋‹ค๋ฉด] [์ง„๋‹จ] [์งˆ] [์งˆ๊นŒ] [์งˆ์ˆ˜๋ก] [์ง] [์งœ์—ฌ์ ธ] [์งœ์—ฌ์กŒ] [์งœ์—ฌ์ง„] [์ฐข๊ฒจ์ง„] [์ฐจ๊ฐ€์›Œ์ง„๋‹ค] [์ฑ„์›Œ์ ธ] [์ฑ„์›Œ์กŒ] [์ฑ„์›Œ์ง„] [์ฑ„์›Œ์ง„๋‹ค] [์ฒ˜ํ•ด์ง„๋‹ค] [์ทจํ•ด์กŒ] [์ทจํ•ด์ง„] [์น˜๋Ÿฌ์ ธ์„œ] [์น˜๋Ÿฌ์กŒ] [์น˜๋Ÿฌ์กŒ์œผ๋ฉฐ] [์น˜๋Ÿฌ์ง„] [์น˜๋Ÿฌ์ง„๋‹ค] [์น˜๋ค„์กŒ] [์น˜๋ค„์ง„] [์นœํ•ด์ ธ] [์นœํ•ด์กŒ] [์นœํ•ด์ง„๋‹ค] [์น ํ•ด์ ธ] [์นญํ•ด์กŒ] [์ปค์ ธ] [์ปค์ ธ์„œ] [์ปค์กŒ] [์ปค์ง„] [์ปค์ง„๋‹ค๋Š”] [์ปค์งˆ] [์ปค์งˆ์ˆ˜๋ก] [์ปค์ง] [ํƒœ์›Œ์ ธ] [ํƒœ์›Œ์กŒ] [ํŠ•๊ฒจ์ ธ] [ํŒŒ์—ฌ์ ธ] [ํŽธํ•ด์ง„๋‹ค] [ํŽผ์ณ์ ธ] [ํ•ฉ์ณ์ €] [ํ•ฉ์ณ์ ธ] [ํ•ฉ์ณ์ ธ์„œ] [ํ•ฉ์ณ์ ธ์•ผ] [ํ•ฉ์ณ์กŒ] [ํ•ฉ์ณ์ง„] [ํ•ด์ ธ] [ํ•ด์ ธ๊ฐˆ] [ํ•ด์ ธ๊ฐ”] [ํ•ด์ ธ์„œ] [ํ•ด์ ธ์•ผ] [ํ•ด์กŒ] [ํ•ด์กŒ์œผ๋‚˜] [ํ•ด์กŒ์œผ๋ฉฐ] [ํ•ด์ง„] [ํ•ด์ง„๋‹ค] [ํ•ด์ง„๋‹ค๊ณ ] [ํ•ด์ง„๋‹ค๋Š”] [ํ•ด์งˆ] [ํ•ด์ง] [ํ–‰ํ•ด์ ธ] [ํ–‰ํ•ด์กŒ์œผ๋ฉฐ] [ํ–‰ํ•ด์ง„] [ํ–‰ํ•ด์ง„๋‹ค] [ํ–‰ํ•ด์งˆ] [ํ–‰ํ•ด์ง] [ํฉ๋ฟŒ๋ ค์ ธ]
    • 지/ji has 6 etymologies and 6 meanings on English Wiktionary, so there's bound to be some ambiguity and some errors. Some of the Hanja that are converted, like "智", are listed in Wiktionary as just 지/ji, while others, like "知", have multiple Hangeul versions (in this case, 알/al or 지/ji), and it looks like Nori picked this one. (Turns out it is an eumhun, which lists a meaning and a pronunciation, so 지/ji is the pronunciation and 알/al is the meaning. It's confusing if you aren't familiar with it, but it does make sense.) In several other cases, especially where the token ends with -진, the part of speech tagger is marking 지 as an auxiliary verb, which is maybe another category of parts of speech we should filter.
  • ์ด: [ไผŠ] [ๅฝ] [็•ฐ] [็ฆป] [๊ฐ ] [๊ฑฐ๋‚˜] [๊ฑฐ๋“ ์š”] [๊ฑด] [๊ฑด๊ฐ€] [๊ฑด๋ฐ] [๊ฑด์ง€] [๊ฑด์ง„] [๊ฑธ๊นŒ] [๊ฑธ๊นŒ์š”] [๊ฒ๋‹ˆ๋‹ค] [๊ฒŒ] [๊ฒ์ง€] [๊ฒ ] [๊ฒจ] [๊ณ ] [๊ณค] [๊ตฌ] [๊ตฌ๋‚˜] [๊ตฌ๋งˆ] [๊ตฐ] [๊ตฐ๋ฐ] [๊ทธ๋ž˜์„œ์ธ์ง€] [๊ธฐ] [๊ธด] [๊ธด๊ณ ] [๊ธด๋ฐ] [๊นŒ] [๊นŒ์ง„] [๊บผ] [๊ผฌ] [๋‚˜๋ผ] [๋‚จ์ธ๋ฐ] [๋ƒ] [๋ƒ๊ณ ] [๋ƒ๋Š”] [๋ƒ๋ฉฐ] [๋ƒ๋ฉด] [๋„ค] [๋‡จ] [๋ˆ„๊ตฐ๊ฐ€] [๋ˆ„๊ตฐ๋ฐ] [๋ˆ„๊ตฐ์ง€] [๋‹ˆ] [๋‹ˆ๊นŒ] [๋‹ˆ๋‹ค] [๋‹ค] [๋‹ค๊ณ ] [๋‹ค๋ƒ] [๋‹ค๋Š”] [๋‹ค๋‹ˆ] [๋‹ค๋ผ๊ณ ] [๋‹ค๋ž€] [๋‹ค๋งŒ] [๋‹จ๊ฐ€] [๋‹จ๋ฐ] [๋‹ต] [๋‹ต๋‹ˆ๋‹ค] [๋Œ€ํ•ด์„œ] [๋”๋ผ] [๋”๋ผ๋„] [๋˜๊ฐ€] [๋ฐ] [๋ด] [๋ด์ง€] [๋„๋ก] [๋ˆ] [๋ผ] [๋“œ๋‹ˆ] [๋“œ๋ผ] [๋“ ] [๋“ ์ง€] [๋””] [๋””์š”] [๋ผ] [๋ผ๊ณ ] [๋ผ๊ณค] [๋ผ๊ธฐ] [๋ผ๋‚˜] [๋ผ๋„ค] [๋ผ๋‡จ] [๋ผ๋Š”] [๋ผ๋Š”๋ฐ] [๋ผ๋‹ˆ] [๋ผ๋„] [๋ผ๋กœ] [๋ผ๋ฉฐ] [๋ผ๋ฉด] [๋ผ๋ฉด์„œ] [๋ผ์„œ] [๋ผ์•ผ] [๋ผ์˜ค] [๋ผ์š”] [๋ผ์šฐ] [๋ผ์ง€๋งŒ] [๋ฝ] [๋ž€] [๋ž€๋‹ค] [๋ž„] [๋žŒ] [๋ž๋‹ˆ๋‹ค] [๋ž˜] [๋ž˜๋‚˜] [๋ž˜๋„] [๋žœ] [๋Ÿฌ] [๋Ÿฌ๋‹ˆ] [๋Ÿฐ] [๋ ค] [๋ จ] [๋กœ] [๋กœ๊ตฐ] [๋กœ๋‹ค] [๋ก ๊ฐ€] [๋ก ์ง€] [๋ฅœ] [๋ฆฌ] [๋งˆ] [๋จธ] [๋จผ] [๋ฉฐ] [๋ฉด] [๋ฉด์„œ] [๋ฉด์€] [๋ชจ๋ฆฌ] [๋ชฌ๋ฐ] [๋ชฌ์ง€] [๋ฌด์–ด] [๋ฌด์–ธ๊ฐ€] [๋ฌธ์ง€] [๋ญ”๊ฐ€] [๋ญ˜๊นŒ] [๋ฏ€๋กœ] [๋ฐ˜๋ฐ] [๋ถ€ํ„ด๊ฐ€] [์„œ] [์„ ๊ฐ€] [์„ ์ง€] [์„ธ] [์„ธ์š”] [์„ผ๊ฐ€] [์„ผํ„ฐ] [์…”] [์…จ] [์†Œ] [์‡ผ] [์Šˆ] [์‹ ] [์‹ ๊ฐ€] [์‹ ์ง€] [์‹ญ๋‹ˆ๊นŒ] [์จ์„œ] [์•ผ] [์–˜๊น๋‹ˆ๋‹ค] [์–ด๋”˜๊ฐ€] [์–ด์งธ์„œ์ธ์ง€] [์–ธ๊ณ ] [์–ธ์  ๊ฐ„] [์—์„ ์ง€] [์—์š”] [์—”์ง€] [์—ฌ] [์—ฌ๋„] [์—ฌ์„œ] [์—ฌ์„œ๋ผ๋„] [์—ฌ์„ ] [์—ฌ์•ผ] [์—ด] [์˜€] [์˜€์—ˆ] [์˜€์œผ๋‚˜] [์˜€์œผ๋‹ˆ] [์˜€์œผ๋ฆฌ๋ผ] [์˜€์œผ๋ฉฐ] [์˜€์œผ๋ฏ€๋กœ] [์˜€์„] [์˜€์„์ง€๋ผ๋„] [์˜€์Œ์—๋„] [์˜€์Œ์„] [์˜€์Œ์ด] [์˜ˆ] [์˜ˆ์š”] [์˜Œ์ง€] [์˜จ๊ฐ€] [์˜จ๋ฐ] [์˜ฌ] [์™ ] [์š”] [์š”๋ฆฐ๋ฐ] [์šฐ] [์›์ด์—ˆ๋‹ค๋Š”] [์œ„ํ•ด์„œ] [์œ ] [์˜ํ•ด์„œ] [์ด] [์ด๊ณ ] [์ดํƒ๋ฆผ] [์ธ] [์ธ๊ฐ€] [์ธ๊ฐ€๋ผ๋Š”] [์ธ๊ธฐ] [์ธ๋‹ค] [์ธ๋ฐ] [์ธ๋ฐ๋‹ค] [์ธ๋ฐ์š”] [์ธ๋“ค] [์ธ๋“ฏ] [์ธ๋””] [์ธ์ฆ‰] [์ธ์ง€] [์ธ์ง„] [์ผ] [์ผ๊นŒ] [์ผ๊นŒ์š”] [์ผ๋ฆฌ] [์ผ์ˆ˜๋ก] [์ผ์ง€] [์ผ์ง€๋ผ] [์ž„] [์ž…] [์ž…๋‹ˆ๊นŒ] [์ž…๋‹ˆ๋‹ค] [์žŠ์–ด๋ฒ„๋ฆฐ๋‹ค] [์ž‘์ธ] [์ž”] [์ž–] [์ž–์•„] [์ €์ธ] [์ „ํ™˜] [์ •๋ฐ˜๋Œ€] [์ œ] [์ œ์กฐ์—…์ฒด์ธ] [์  ] [์ฃ ] [์ฅฌ] [์ง€] [์ง€๋งŒ] [์งผ] [์ฐจ์ธ] [์ฐฌ๊ฐ€] [ํ‚ค๋ก ] [ํ‚ค์ง€] [ํ…Œ] [ํ…Œ๋‹ˆ] [ํ…] [ํ…๋ฐ] [ํ‹ด๋””] [ํ”„๋ก ] [ํ•œ๊ฑด] [ํ• ์ง€] [ํ•จ์ธ] [ํ•จ์ธ๋ฐ] [ํ•ฉ๋‹ˆ๋‹ค] [ํ•ด์„œ] [ํ•ด์„œ์ธ์ง€] [ํ•ด์ค„ํ…Œ] [ํ˜•์‚ฐ] [ํ›„์—]
    • 이/i has 11 etymologies and 16 meanings (one of which has 37 sub-parts?!?) on English Wiktionary, so there's bound to be lots of ambiguity and some errors. Only 4 of 5 Hanja are in English Wiktionary, but all have 이/i as their Hangeul counterpart. For the rest, some are hard to track down: without any other context, the tokens shown here, like "답", don't generate 이 when analyzed.
    • Other examples: "반데" is analyzed as 반데 • 바 • 이, where 이 is marked as a "positive designator". "였을지라도" is analyzed as a series of "verbal endings" with 이 as a "positive designator" in the middle. All the verbal endings are dropped, which is kind of weird.
  • ํ•˜: [ไธ‹] [ๅค] [ๆฒณ] [๊ฑฐ๋“ ] [๊ฒ ๋‹ค] [๊ฒ ๋‹ค๊ณ ] [๊ฒ ๋‹ค๋Š”] [๊ณ ๋งˆ์›Œํ–ˆ] [๊ธฐ๋ปํ• ] [๊ธฐ๋ปํ–ˆ] [๊บผ๋ คํ•œ] [๋”๋‹ˆ] [๋‘๋ ค์›Œํ–ˆ] [๋”ฐ๋ผํ•œ] [๋ž˜๋ผ] [๋ ธ์œผ๋‚˜] [๋ฏธ์›Œํ• ] [๋ถ€๋Ÿฌ์›Œํ•œ๋‹ค] [์Šค๋Ÿฌ์›Œํ•œ] [์Šค๋Ÿฌ์›Œํ•œ๋‹ค] [์Šค๋Ÿฌ์›Œํ–ˆ] [์Šฌํผํ•œ๋‹ค] [์‹œ๊ณ ] [์‹œ๋„ค] [์‹œ๋Š”] [์‹œ๋˜] [์•„ํŒŒํ–ˆ] [์•ผ] [์•ผ๊ฒ ๋‹ค] [์—ด] [์ค˜์•ผ] [์ง€์ผœ์•ผ] [์น˜] [์น ] [์นด] [์ผ€ํ•œ] [ํ‚ค์ง€] [ํ•˜] [ํ•œ] [ํ•œ๊ฑด] [ํ•œ๊ฑธ] [ํ•œ๋‹ค] [ํ•œ๋‹ค๊ฑฐ๋‚˜] [ํ•œ๋‹ค๊ณ ] [ํ•œ๋‹ค๋Š”] [ํ•œ๋‹ค๋Š”๋ฐ] [ํ•œ๋‹ค๋˜๊ฐ€] [ํ•œ๋‹ค๋˜์ง€] [ํ•œ๋‹ค๋“ ์ง€] [ํ•œ๋‹ค๋ฉฐ] [ํ•œ๋‹ค๋ฉด] [ํ•œ๋‹ค๋ฉด์„œ] [ํ•œ๋‹ค์ง€๋งŒ] [ํ•œ๋ฐ] [ํ•œ๋ฐ๋‹ค๊ฐ€] [ํ•œ๋“ค] [ํ•œ์ง€] [ํ•œ์ง€๋ผ] [ํ• ] [ํ• ๊นŒ] [ํ• ๊นŒ์š”] [ํ• ๋ผ] [ํ• ๋ ค๊ณ ] [ํ• ๋ ค๋ฉด] [ํ• ์ˆ˜๋ก] [ํ• ์ง€] [ํ• ์ง€๋ผ๋„] [ํ•จ] [ํ•จ์ธ] [ํ•ฉ๋‹ˆ๋‹ค] [ํ•ฉ๋‹ˆ๋‹ค๋งŒ] [ํ•ฉ์‹œ๋‹ค] [ํ•ด] [ํ•ด๊ฐ€] [ํ•ด๋Œ„๋‹ค] [ํ•ด๋„] [ํ•ด๋ผ] [ํ•ด๋ฒ„๋ ธ] [ํ•ด์„œ] [ํ•ด์„œ์ธ์ง€] [ํ•ด์„ ] [ํ•ด์•ผ] [ํ•ด์•ผ๊ฒ ๋„ค] [ํ•ด์˜ด] [ํ•ด์™€] [ํ•ด์™”] [ํ•ด์™”์—ˆ] [ํ•ด์™”์œผ๋ฉฐ] [ํ•ด์š”] [ํ•ด์ค€๋‹ค] [ํ•ด์ค€๋‹ค๊ณ ] [ํ•ด์ค€๋‹ค๋ฉด] [ํ•ด์ค„๋ž˜] [ํ•ด์คŒ] [ํ•ด์ค˜] [ํ•ดํ•œ๋‹ค] [ํ•ดํ–ˆ] [ํ•ดํ–ˆ์—ˆ] [ํ–ˆ] [ํ–ˆ์–ด๋„] [ํ–ˆ์—ˆ] [ํ–ˆ์—ˆ์œผ๋‚˜] [ํ–ˆ์—ˆ์œผ๋ฉฐ] [ํ–ˆ์œผ๋‚˜] [ํ–ˆ์œผ๋‹ˆ] [ํ–ˆ์œผ๋ฉฐ] [ํ–ˆ์œผ๋ฏ€๋กœ] [ํ—€] [ํ—ˆ] [ํ—ค]

Additional Part-of-Speech Filters


Update (October 2018)

A big takeaway from the speaker review (see talk page) is that while there aren't a lot of errors, a lot of the errors that are there come from the "positive designator" (VCP) part-of-speech category, and to a lesser degree the "auxiliary verb or adjective" (VX) category. Filtering these parts of speech might improve the analysis results and keep otherwise unrelated forms from matching. (In English, this would be something like searching for have worked and matching have walked, have percolated, and have transmogrified because they all use the auxiliary verb have.)

There's also a "negative designator" (VCN) part-of-speech category that seems parallel to "positive designator", so it is also a candidate for filtering.

Impact of POS Filtering

[edit]

It's easy enough to see the size of the impact of enabling any additional POS filtersโ€”without necessarily being able to say if it is positive or negativeโ€”by running a before-and-after comparison.

After unpacking the analyzer and making sure nothing changed as a result, I enabled an additional filter for each of the three categories separately, so that I could attribute its relative impact to each one.
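
The actual numbers below come from my usual analysis tools, but the shape of the before-and-after count can be sketched like this: run each line of a sample file through _analyze twice, once with the baseline stoptags and once with the three extra tags added, and total up the tokens. The localhost instance and the file name are placeholders.

  # Rough before-and-after token count via _analyze, using an inline
  # nori_part_of_speech filter. The corpus file name is a placeholder, and
  # the sample file is assumed to be non-empty.
  import requests

  ES = "http://localhost:9200"
  BASE_TAGS = ["E", "IC", "J", "MAG", "MAJ", "MM", "SP", "SSC", "SSO", "SC",
               "SE", "XPN", "XSA", "XSN", "XSV", "UNA", "NA", "VSV"]
  EXTRA_TAGS = ["VCP", "VCN", "VX"]

  def count_tokens(text, stoptags):
      body = {
          "tokenizer": "nori_tokenizer",
          "filter": [{"type": "nori_part_of_speech", "stoptags": stoptags}],
          "text": text,
      }
      resp = requests.post(ES + "/_analyze", json=body)
      resp.raise_for_status()
      return len(resp.json()["tokens"])

  old_total = new_total = 0
  with open("kowiki_sample.txt", encoding="utf-8") as corpus:
      for line in corpus:
          old_total += count_tokens(line, BASE_TAGS)
          new_total += count_tokens(line, BASE_TAGS + EXTRA_TAGS)

  delta = new_total - old_total
  print(old_total, new_total, delta, "%.3f%%" % (100.0 * delta / old_total))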

The tables below summarize the number of tokens filtered for each part-of-speech tag individually, and for all three together.

VCP: Positive designator
              old tokens   new tokens      delta        pct
  Wikipedia    2,659,650    2,582,967    -76,683    -2.883%
  Wiktionary     106,283      105,489       -794    -0.747%

VCN: Negative designator
              old tokens   new tokens      delta        pct
  Wikipedia    2,659,650    2,657,152     -2,498    -0.094%
  Wiktionary     106,283      106,250        -33    -0.031%

VX: Auxiliary verb or adjective
              old tokens   new tokens      delta        pct
  Wikipedia    2,659,650    2,612,579    -47,071    -1.770%
  Wiktionary     106,283      105,560       -723    -0.680%

VCP+VCN+VX
              old tokens   new tokens      delta        pct
  Wikipedia    2,659,650    2,533,398   -126,252    -4.747%
  Wiktionary     106,283      104,733     -1,550    -1.458%

VCP (-2.883%) and VX (-1.770%) have a large effect on Wikipedia, and a smaller but still significant effect on Wiktionary. VCN, the "negative designator", seems to be much rarer across the board.

  • VCP/"positive designator": The most obvious effect of filtering VCP was to break up the "์ด" group, which was reduced from 262 distinct tokens to only 12 in the Wikipedia data, and from 59 to 5 in the Wiktionary data. These larger groups are often the source of false positive results, so this is good. A small number of other token types (<100) had their overall frequency decrease. Details below.
    • Wikipedia
      • Splits: 250 pre-analysis types (0.161% of pre-analysis types) / 7,822 tokens (0.294% of tokens) were lost from 1 group (the "이" group; 0.001% of post-analysis types), affecting a total of 262 pre-analysis types (0.169% of pre-analysis types) in that group.
      • Token count decreases: 92 pre-analysis types (0.059% of pre-analysis types) lost 66,458 tokens (2.499% of tokens) across 86 groups (0.064% of post-analysis types).
    • Wiktionary
      • Splits: 55 pre-analysis types (0.202% of pre-analysis types) / 314 tokens (0.295% of tokens) were lost from 2 groups (0.008% of post-analysis types), affecting a total of 61 pre-analysis types (0.224% of pre-analysis types) in those groups.
      • Token count decreases: 20 pre-analysis types (0.073% of pre-analysis types) lost 384 tokens (0.361% of tokens) across 20 groups (0.084% of post-analysis types).
  • VCN/"negative designator": Filtering VCN had very little effect, with 5 or fewer types affected and no breaking up of any stemming groups in the Wikipedia data. Details below:
    • Wikipedia
      • Token count decreases: 5 pre-analysis types (0.003% of pre-analysis types) lost 32 tokens (0.001% of tokens) across 4 groups (0.003% of post-analysis types).
    • Wiktionary
      • Splits: 1 pre-analysis type (0.004% of pre-analysis types) / 1 token (0.001% of tokens) was lost from 1 group (0.004% of post-analysis types), affecting a total of 2 pre-analysis types (0.007% of pre-analysis types) in that group.
      • Token count decreases: 1 pre-analysis type (0.004% of pre-analysis types) lost 1 token (0.001% of tokens) across 1 group (0.004% of post-analysis types).
  • VX/"Auxiliary Verb or Adjective": Filtering VX had a similar size impact to filtering VPC, though the effect was spread out across a larger number of stemming groups. The most obvious effect was breaking up the "์ง€" group, which is the largest group in the Wikipedia data. The group was reduced from 424 distinct tokens to only 25 in the Wikipedia data, and from 24 to 8 in the Wiktionary data. Other large groups in the Wikipedia data were also reduced in size: the "์ฃผ" group when from 180 unique tokens to 136; the "ํ•˜" group went from 110 to 71; and the "์˜ค" group went from 89 to 39. The overall distribution of large groups changed radically for the better.
    • Wikipedia
      • Splits: 716 pre-analysis types (0.462% of pre-analysis types) / 5,733 tokens (0.216% of tokens) were lost from 33 groups (0.024% of post-analysis types), affecting a total of 1,250 pre-analysis types (0.806% of pre-analysis types) in those groups.
      • Token count decreases: 318 pre-analysis types (0.205% of pre-analysis types) lost 40,098 tokens (1.508% of tokens) across 197 groups (0.145% of post-analysis types).
    • Wiktionary
      • Splits: 37 pre-analysis types (0.136% of pre-analysis types) / 53 tokens (0.05% of tokens) were lost from 9 groups (0.038% of post-analysis types), affecting a total of 102 pre-analysis types (0.374% of pre-analysis types) in those groups.
      • Token count decreases: 56 pre-analysis types (0.205% of pre-analysis types) lost 443 tokens (0.417% of tokens) across 41 groups (0.172% of post-analysis types).
  • VCP+VCN+VX: Since tokens are only assigned one part of speech, the effects of filtering multiple parts of speech are cumulative in the obvious way.
    • It's worth noting that the five largest group sizes for Wikipedia dropped from (424, 262, 180, 110, 89) to (136, 71, 44, 39, 32), and for Wiktionary from (59, 25, 24, 21, 16) to (21, 19, 13, 13, 12), which is a great decrease in the size of the larger, error-prone stemming groups.
    • The "high-frequency" (>500) tokens filtered in the Wikipedia data, and which all seem reasonable, are:
      • ๋ผ๋Š”, which is normally a verbal ending, but is sometimes interpreted as "์ด/VCP(Positive designator)+๋ผ๋Š”/E(Verbal endings)", meaning that it got lumped into the "์ด" group.
      • ์•„๋‹ˆ, which seems to mean "no/not", and is at least sometimes treated as a negative designator.
      • ๋ชปํ•˜, which seems to mean "cannot", and is at least sometimes treated as an auxiliary verb.

Overall, this looks like a good additional filter to implement.

Next Steps

[edit]
  • โœ”๏ธŽ Still To Do (since speaker review went well) (ON HOLDโ€”Oct 2018โ€”waiting for Elasticsearch 6.4.2 versions of all the rest of our plugins for configuration and integration testing):
    • Set up config (T206874โ€”DONE)
      • Determine whether we need to change the config for the plain field and the completion suggester.
      • Implement the configs in AnalysisConfigBuilder and add tests.
        • Command line version of the Elasticsearch config is here.
    • Re-index (T216738โ€”DONE)
      • Figure out how re-indexing Korean with a very different analyzer interacts with LTR
      • Re-index Korean-language wikis
  • CJK follow up (ON HOLDโ€”Oct 2018โ€”longer term to-do):
    • test CJK with Japanese mixed-script tokens, including middle dots
    • look for easily-fixed anomalies with other long Japanese tokens
    • Look at soft hyphens and zero-width non-joiners
    • possibly unpack CJK for Japanese and add fixes
    • possibly consolidate fixes for CJK across Japanese and Korean; may need to test and include Chinese, even though we don't use CJK for Chinese.
  • โœ”๏ธŽ Get speaker review of the samples and examples above. (DONE see talk page)
    • โœ˜ If speaker review is unclear:
      • Consider setting up an instance with Nori on RelForge for people to test
    • โœ˜ If speaker review is generally negative:
      • Unpack CJK for Korean and add middle-dot-to-space conversion, and strip soft hyphens and zero-width non-joiners (see the char filter sketch after this list).
    • Speaker review highlights:
      • Generally things are positive; there are definitely errors and sub-optimal behavior, but nothing that seems horrible, given the overall complexity of the task.
      • The parser categories of VCP/"positive designator" and VX/"auxiliary verb or adjective" look like they should also be filtered by the part of speech filter. I'll have to test and see what impact filtering them might haveโ€”it's always possible they are doing something useful in another area we didn't dig into.
      • Not surprisingly, compounds can be difficult, and some longer words are tokenized fine at the top level, but then are broken up into smaller bits inappropriately.
      • I don't know how to read Korean Hanja entries in Wiktionary. But I did eventually learn what "eumhun" means, though it is not well-documented on Wiktionary.
  • โœ”๏ธŽ Open upstream tickets (DONE)
    • CJK bugs: (DONE)
      • mixed-script tokens are treated as one long token
      • leaves soft hyphens in place
      • leaves bidi markers in place
      • leaves zero-width non-joiner in place
      • eats encircled numbers (โ‘ โ‘กโ‘ข), "dingbat" circled numbers (โž€โžโž‚), parenthesized numbers (โ‘ดโ‘ตโ‘ถ), fractions (ยผ โ…“ โ…œ ยฝ โ…” ยพ), superscript numbers (ยนยฒยณ), and subscript numbers (โ‚โ‚‚โ‚ƒ)
        • I was invited to edit the docs to explain the normalization issues, so I'll try to do that eventually
    • Nori bugs: (DONE: Elasticsearch & Lucene ticket 1, ticket 2)
      • arae-a used as middle dot creates one long token (done)
      • empty token after "๊ทธ๋ ˆ์ด๋งจ" (done)
      • "ํŠœํ† ๋ฆฌ์–ผ" gets tokenized with an extra space at the end (done)
      • splits on soft hyphens and zero-width non-joiners (denied, since these can be fixed by a char filter)
      • tokens split on different "types", including IPA extensions, Extended Greek, diacritics, and apostrophes (separate ticket)
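
For reference, the middle-dot and invisible-character cleanup mentioned in the "unpack CJK for Korean" item above could be handled with mapping char filters along these lines. This is only a sketch in Python dict form; the filter names and exact character inventory are illustrative, and it wasn't needed since speaker review was positive.

  # Hypothetical char filters for the "unpack CJK for Korean" fallback above:
  # map middle dots (and arae-a used as a middle dot) to spaces, and strip
  # soft hyphens and zero-width non-joiners before tokenization. The \uXXXX
  # escape form is used because a bare space can't be written as a mapping
  # target.
  cjk_char_filters = {
      "middle_dot_to_space": {
          "type": "mapping",
          "mappings": [
              "\\u00B7=>\\u0020",  # middle dot -> space
              "\\u318D=>\\u0020",  # arae-a (when used as a middle dot) -> space
          ],
      },
      "strip_invisibles": {
          "type": "mapping",
          "mappings": [
              "\\u00AD=>",  # soft hyphen -> removed
              "\\u200C=>",  # zero-width non-joiner -> removed
          ],
      },
  }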