Hi there,
Can I find a list of the words considered stopwords in my language? How can I add a word to the list?
Thanks!
> Can I find a list of the words considered stopwords in my language?
It would be easier to answer if you mentioned the language, since there is a lot going on with stopwords!
For many languages, we use the stopword filters built into Elasticsearch, which are based on data from Lucene. Elastic has a list with links to the Lucene code base. Note that there are both Portuguese and "Brazilian" lists; we use Portuguese. We don't use CJK except for Japanese (and that may change eventually). Oddly, the CJK stopword list is all English; we use the actual English list, which is only slightly different from the CJK list.
For some historical reason, certain language analyzers in Lucene are kept separate from the rest. I think it's because they were originally developed outside Lucene. They include:
We have some custom stopword lists in CirrusSearch:
We also have smaller lists of additional stopwords that are embedded in the code.
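To make the layering a bit more concrete, here is a minimal sketch in Python of Elasticsearch analysis settings that combine a built-in stopword list with a few extra words. This is not the actual CirrusSearch configuration: the function name, the filter and analyzer names (`builtin_stop`, `extra_stop`, `text_with_stops`), and the extra words are all made up for illustration. The only parts taken from Elasticsearch itself are the `stop` token filter type and the predefined `_portuguese_` list name.

```python
import json


def build_analysis_settings(builtin_list, extra_stopwords):
    """Build illustrative Elasticsearch 'analysis' settings with two stop filters.

    builtin_list: name of a predefined Elasticsearch list, e.g. "_portuguese_".
    extra_stopwords: additional words to filter (hypothetical examples here).
    """
    return {
        "analysis": {
            "filter": {
                # Built-in list, referenced by its predefined name.
                "builtin_stop": {"type": "stop", "stopwords": builtin_list},
                # Extra words layered on top; de-duplicated and sorted.
                "extra_stop": {
                    "type": "stop",
                    "stopwords": sorted(set(extra_stopwords)),
                },
            },
            "analyzer": {
                # Hypothetical analyzer name; lowercase runs before the
                # stop filters so that matching is case-insensitive.
                "text_with_stops": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "builtin_stop", "extra_stop"],
                }
            },
        }
    }


# "exemplo" and "outra" are placeholder words, not real proposals.
settings = build_analysis_settings("_portuguese_", ["exemplo", "outra", "exemplo"])
print(json.dumps(settings, indent=2, ensure_ascii=False))
```

Adding a word in a setup like this means appending it to the extra list, which is roughly why a local change can ship faster than a change to the upstream list.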
> How can I add a word to the list?
So, it depends on the language and where the stopword list comes from, whose list you want to update, and how long you want to wait to see results.
For quicker results for on-wiki search, we can make changes to CirrusSearch. You can tell me the language and the word(s) and I can take care of it; you can open a ticket on Phabricator and add the tag "Discovery-Search" if you want to track progress; or, if you are a MediaWiki programmer, you could submit a patch to the codebase and the Search Team would be happy to review it.
If you want to help a wider audience, you could open a ticket or a pull request upstream. Elastic is our immediate source of stopwords for most of these, but its stopword filters are just wrappers around Lucene, so if Elastic paid attention to a ticket, they'd just open a ticket in Lucene. You can skip that step and open the ticket or pull request with Lucene directly. If it's accepted, it will eventually trickle down to Elastic again, though not directly to CirrusSearch, since licensing changes mean we can't upgrade Elasticsearch anymore. We haven't worked out our longer-term plan yet, but there is a decent chance we will end up on an Elasticsearch fork or another Lucene-based search engine and see the benefit eventually.
For most of the core Lucene stopword lists, there's another source mentioned in the code. The most common sources are Jacques Savoy and Snowball, though there are others. You can try to contact Lucene's upstream source and get them to update their list of stopwords, too, which might reach a wider audience, and might eventually trickle down to Lucene. (Lucene did update its Snowball-based stemmers and stopword lists 3 years ago. I think it's ad hoc, but they do update from time to time.)
And now the question you didn't ask, but must be thinking if you've read this far...
> Why is it so complicated?
At least, I ask myself this now and then. Lucene tries to be the central repository for lots of open source language analysis because they want to make it available to their users, but they don't have everything. We make modifications and customizations in CirrusSearch in response to things we find in our data, or that community members bring to our attention. We try to push things upstream, but it can take a long time, and it's work when there are other things to do.