I don't see much feedback from Japanese wikis, so I'd like to give some. The new search works pretty well for me on Japanese Wikipedia and Wiktionary. I especially like the section title highlighting and the improved word count in each search result. For example, against this query the old search gives a result with a line saying "6 kb (24 words)", which is unreasonable, while the new search gives "6 kb (1,836 words)", which is reasonable.
Topic on Talk:Search/Old/status
Yay! Thanks! The old search used spaces for word counting (I believe), but Cirrus delegates to the text analyzer, which has some knowledge of Japanese.
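A toy sketch (my own illustration, not the actual Cirrus code) of why a space-based word count fails for Japanese, which is written without spaces between words:

```python
# Example sentence: "This is a Japanese sentence" -- no spaces anywhere.
text = "これは日本語の文章です"

# Splitting on whitespace sees the whole sentence as a single "word",
# which is how the old search ended up with counts like "24 words".
print(len(text.split()))   # 1

# A CJK-aware analyzer emits one token per character (or per morpheme),
# so the count tracks the actual amount of text much more closely.
print(len(list(text)))     # 11
```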
Questions:
- There is an Elasticsearch plugin that is supposed to make Japanese analysis better. Would you be willing to try it out if I expose it on jawiki in beta, and tell me whether it is better, worse, or the same?
- I'd like to start enabling Cirrus as the default on more Wikipedias. We're almost everywhere but the Wikipedias. Anyway, would you be willing to talk about it on jawiki's village pump? I'd love to do it with community support rather than force it on folks.
+1 on Nik: whym, it would be wonderful if you could help with that. :)
I deployed the plugin in beta this afternoon and loaded a few pages. You can try it and compare: http://ja.wikipedia.beta.wmflabs.org/w/index.php?title=%E7%89%B9%E5%88%A5%3A%E6%A4%9C%E7%B4%A2&profile=default&search=%E4%B8%89&fulltext=Search
NEverett, I'd like to try both. Do you know exactly what the difference is between the kuromoji plugin and the one you currently use? Is the current one inherited from lsearchd? Knowing this will also help the community understand what the difference will be (and maybe how they can help debug).
The version on beta.wmflabs.org doesn't look bad, but it is hard to say whether it is "better" unless we test the analyzers against the same document set. In general, I believe the differences between Japanese analysis engines will be very subtle in terms of search result quality, as long as they use the same or a similar morphological dictionary.
The Kuromoji plugin looks to be an effort to integrate this, which claims support for lemmatization and readings for kanji. I'm playing with its default setup and I don't see any kanji normalization, but it does a much better job with word segmentation than the one deployed on jawiki now, which is Lucene's StandardAnalyzer implementing Unicode word segmentation. I haven't dug into that deeply enough to explain it, but here are some examples:
- 日本国 becomes:
  - 日本 and 国 in kuromoji
  - 日 and 本 and 国 in standard
- にっぽんこく becomes:
  - にっぽん and こく in kuromoji
  - に and っ and ぽ and ん and こ and く in standard
From that it looks like kuromoji should be better, but standard is saved by executing the search for all the characters as a phrase search, which makes everything line up _reasonably_ well. It won't perform as well, but that should be ok too.
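A toy sketch (mine, not the actual Lucene code) of why phrase matching rescues per-character segmentation: the word boundaries are lost, but the characters still have to appear consecutively and in order, so a query like 日本国 only matches documents containing that exact run:

```python
def char_tokenize(text):
    # Mimic StandardAnalyzer's one-token-per-ideograph behaviour.
    return list(text)

def phrase_match(doc_tokens, query_tokens):
    # True if the query tokens appear consecutively, in order, in the doc.
    m = len(query_tokens)
    return any(doc_tokens[i:i + m] == query_tokens
               for i in range(len(doc_tokens) - m + 1))

doc = char_tokenize("日本国憲法")  # ['日', '本', '国', '憲', '法']
print(phrase_match(doc, char_tokenize("日本国")))  # True: consecutive run
print(phrase_match(doc, char_tokenize("国日")))    # False: wrong order
```

The downside is that this is positional matching over many tiny tokens, which is slower and less precise than matching real dictionary words, hence "it won't perform as well".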
And it looks like my fancy highlighter chokes on kuromoji, which isn't cool. Look here: there are results without anything highlighted, which isn't good.
With regards to lsearchd: I'm not sure what it uses. It doesn't have an API that lets me see how text is analyzed, so I have to guess from reading the code, and there is a lot of it.
Do you want to continue working on the kuromoji plugin until the highlighting works? Or do you want to make the current beta feature official as it is? I agree with your observation that kuromoji's segmentation is more linguistically meaningful, which could improve search. However, failing to highlight is a major issue, and so far I personally cannot see how much kuromoji would improve search results and snippets.
Would it be easy for you to import all pages from jawiki into the test instance, or to create another test instance using the same reduced document set but processed by StandardAnalyzer? I'd be interested in testing various queries to check differences in search results, looking at what is retrieved and what is not by each.
My, I'm bad at replying to these. Sorry for the late reply. So, yeah, my plan right now is to go ahead with the standard analyzer and do more work later on making the Japanese analysis better.