Topic on Extension talk:CirrusSearch

Stopwords configuration

2 comments • 14:39, 9 May 2023 1 year ago


212.224.228.93

Hi there,

Can I find a list of the words considered stopwords in my language? How can I add a word to the list?

Thanks!

09:35, 3 May 2023

TJones (WMF)

> Can I find a list of the words considered stopwords in my language?

It would be easier to answer if you mentioned the language, since there is a lot going on with stopwords!

For many languages, we use the stopword filters built into Elasticsearch, which are based on data from Lucene. Elastic has a list with links to the Lucene code base. Note that there are both Portuguese and "Brazilian" lists; we use Portuguese. We don't use the CJK list except for Japanese (and that may change eventually). Oddly, the CJK stopword list is all English; we use the actual English list, which is only slightly different from the CJK list.
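
For anyone who hasn't seen how one of these built-in lists gets selected, here is a minimal sketch of Elasticsearch index settings, written as a Python dict. The `stop` token filter and the predefined list names (`_portuguese_`, `_english_`) come from the Elasticsearch documentation; the filter and analyzer names (`pt_stop`, `en_stop`, `pt_sketch`) are made up for illustration, and this is not the configuration CirrusSearch actually generates.

```python
import json


def stop_filter(language_list):
    """Build a `stop` token filter that uses one of Elasticsearch's predefined lists."""
    return {"type": "stop", "stopwords": language_list}


# Hypothetical names; only the filter type and the list names are real Elasticsearch values.
settings = {
    "analysis": {
        "filter": {
            "pt_stop": stop_filter("_portuguese_"),
            "en_stop": stop_filter("_english_"),
        },
        "analyzer": {
            "pt_sketch": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "pt_stop"],
            },
        },
    }
}

# This JSON is what would go in the "settings" section of an index-creation request.
print(json.dumps(settings, indent=2))
```

Swapping `_portuguese_` for `_brazilian_` would select the other Lucene list mentioned above.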

For some historical reason, certain language analyzers in Lucene are kept separate from the rest. I think it's because they were originally developed outside Lucene. They include:

Kuromoji (Japanese, which we don't use yet): stopwords
Ukrainian Morfologik: stopwords. For technical reasons, however, we maintain our own copy; currently the two are the same.
Nori (Korean): doesn't use stopwords per se, but rather filters on the part-of-speech tags the parser puts on words. We have a custom list.
SmartCN (Chinese): it has a stopword list, but it is only punctuation (for technical reasons)
Stempel (Polish): stopwords
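
To make the Nori case a bit more concrete: instead of a word list, Elasticsearch's `nori_part_of_speech` token filter takes a list of part-of-speech tags (`stoptags`) to drop. A rough sketch of such settings as a Python dict follows. The three tags are picked from the Elasticsearch documentation purely as examples and are not the custom CirrusSearch list; the filter and analyzer names are hypothetical.

```python
import json

# Example tags only (assumption, not the CirrusSearch list):
# E = verbal ending, IC = interjection, J = ending particle.
EXAMPLE_STOPTAGS = ["E", "IC", "J"]

nori_settings = {
    "analysis": {
        "filter": {
            # Hypothetical filter name; the type and the `stoptags` parameter are real.
            "ko_pos_sketch": {
                "type": "nori_part_of_speech",
                "stoptags": EXAMPLE_STOPTAGS,
            },
        },
        "analyzer": {
            # Hypothetical analyzer name.
            "ko_sketch": {
                "type": "custom",
                "tokenizer": "nori_tokenizer",
                "filter": ["ko_pos_sketch"],
            },
        },
    }
}

print(json.dumps(nori_settings, indent=2))
```

Tokens whose tag is in `stoptags` are removed, which has much the same effect as a stopword list without naming individual words.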

We have some custom stopword lists in CirrusSearch:

For Moroccan Arabic (ary) and Egyptian Arabic (arz), but not Standard Arabic (ar), we add a fair number of additional stopwords.
For Romanian, we add variants of some words. The Lucene list is so old that it uses the incorrect letters (ş & ţ), because the correct letters (ș & ț) were not available on computers back then (to be fair, they weren't reliably available until almost 2010).
The Mirandese stopword list was provided by a community member, inspired by the Portuguese stopword list.
The Polish list is the same as the Stempel list above, except we add "o.o" to go with "o.o." because, by the time we get to stopwords, no tokens have final periods, so "o.o." on its own doesn't filter anything.
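
The Romanian and Polish customizations above can be illustrated with a toy sketch in Python. This is not the CirrusSearch code: the function names are invented, the Romanian words are illustrative rather than copied from the Lucene list, and `strip_final_period` is only a stand-in for whatever earlier analysis steps leave tokens without final periods.

```python
# Map the legacy cedilla letters to the correct comma-below letters.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș
    "\u0163": "\u021b",  # ţ -> ț
    "\u015e": "\u0218",  # Ş -> Ș
    "\u0162": "\u021a",  # Ţ -> Ț
})


def with_comma_variants(stopwords):
    """Return the stopwords plus a comma-below variant of every cedilla spelling."""
    expanded = set(stopwords)
    expanded.update(word.translate(CEDILLA_TO_COMMA) for word in stopwords)
    return expanded


def strip_final_period(token):
    """Toy stand-in for earlier analysis steps that leave no final periods."""
    return token.rstrip(".")


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


# Romanian: illustrative words spelled with the old cedilla letters.
romanian = with_comma_variants({"\u015fi", "to\u0163i"})

# Polish: "sp. z o.o." after the final periods have been removed.
tokens = [strip_final_period(t) for t in ["sp.", "z", "o.o."]]
only_dotted = remove_stopwords(tokens, {"o.o."})           # "o.o." never matches
with_undotted = remove_stopwords(tokens, {"o.o.", "o.o"})  # "o.o" does match

print(sorted(romanian))
print(only_dotted)
print(with_undotted)
```

In this sketch the dotted stopword alone leaves "o.o" in the token list, while adding the undotted form removes it.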

We have smaller lists of additional stopwords that are embedded in the code.

For Armenian, we add two spelling variants.
For Chinese/SmartCN, we have our own punctuation list, which is just a comma (again for technical reasons).
We have additional stopword filters for Irish and Polish, but they aren't for proper stopwords; they are just tools for filtering bits and bobs that come up during analysis. (The SmartCN filter is like that, too, I guess.)

> How can I add a word to the list?

So, it depends on the language and where the stopword list comes from, whose list you want to update, and how long you want to wait to see results.

For quicker results for on-wiki search, we can make changes to CirrusSearch. You can tell me the language and the word(s) and I can take care of it; you can open a ticket on Phabricator and add the tag "Discovery-Search" if you want to track progress; or, if you are a MediaWiki programmer, you could submit a patch to the codebase and the Search Team would be happy to review it.

If you want to help a wider audience, you could open a ticket or a pull request upstream. Elastic is our immediate source of stopwords for most of these, but its filters are just wrappers around Lucene. If Elastic paid attention to a ticket, they'd just open a ticket in Lucene, so you can skip that step and open the ticket or pull request with Lucene directly. If it's accepted, it will eventually trickle down to Elastic again, though not directly to CirrusSearch, because we can't upgrade Elasticsearch anymore because of licensing changes. We haven't worked out our longer-term plan yet, but there is a decent chance we will end up on an Elasticsearch fork or another Lucene-based search engine and see the benefit eventually.

For most of the core Lucene stopword lists, there's another source mentioned in the code. The most common sources are Jacques Savoy and Snowball, though there are others. You can try to contact Lucene's upstream source and get them to update their list of stopwords, too, which might reach a wider audience and might eventually trickle down to Lucene. (They did update their Snowball-based stemmers and stopword lists 3 years ago. I think it's ad hoc, but they do update from time to time.)

And now the question you didn't ask, but you must be thinking if you read this far...

> Why is it so complicated!

At least, I ask myself this now and then. Lucene tries to be the central repository for lots of open source language analysis because they want to make it available to their users, but they don't have everything. We make modifications and customizations in CirrusSearch in response to things we find in our data, or that community members bring to our attention. We try to push things upstream, but it can take a long time, and it's work when there are other things to do.

Edited 14:39, 9 May 2023
