Topic on Extension talk:CirrusSearch

Possible to exclude "to", "and", "or" etc. from search weights?

T0lk (talk | contribs)

A sentence-length search on my wiki often returns results where the only highlighted words are short, frequent words such as "and", "or", and "to", which are not useful as highlights. See this screenshot for an example: http://i.imgur.com/f9IWEc0.png

Can short words be excluded? If I exclude them manually I get a much better result: http://i.imgur.com/bLm6nhQ.png

DCausse (WMF) (talk | contribs)

Do you use the Wikimedia experimental highlighter?

If not, I'd suggest giving it a try. You'll need to install it as a plugin on every node of your Elasticsearch cluster.

See this page on GitHub for more information on how to install it.

Once it is installed (this requires a cluster restart), you can activate it on the MediaWiki side by setting the following config options:

$wgCirrusSearchUseExperimentalHighlighter = true;

$wgCirrusSearchOptimizeIndexForExperimentalHighlighter = true;
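
Since the plugin has to be present on every node, it can help to check that before flipping these settings. The following is only a rough sketch: it assumes you have already fetched the output of Elasticsearch's nodes info API (GET /_nodes/plugins) and decoded it into a PHP array. The function name, the sample data, and the "highlighter" substring used to recognise the plugin are all made up for illustration, so compare against the plugin name your own cluster reports.

```php
<?php
/**
 * Return the names of nodes on which no installed plugin name contains $needle.
 * $nodesInfo is the decoded response of GET /_nodes/plugins.
 */
function nodesMissingPlugin( array $nodesInfo, string $needle ): array {
	$missing = [];
	foreach ( $nodesInfo['nodes'] ?? [] as $nodeId => $node ) {
		$found = false;
		foreach ( $node['plugins'] ?? [] as $plugin ) {
			// Case-insensitive substring match on the plugin name (assumption).
			if ( stripos( $plugin['name'] ?? '', $needle ) !== false ) {
				$found = true;
				break;
			}
		}
		if ( !$found ) {
			// Fall back to the node id when the node has no name.
			$missing[] = $node['name'] ?? (string)$nodeId;
		}
	}
	return $missing;
}

// Hypothetical, heavily trimmed response for a two-node cluster.
$nodesInfo = [
	'nodes' => [
		'abc123' => [
			'name' => 'es-node-1',
			'plugins' => [ [ 'name' => 'experimental-highlighter' ] ],
		],
		'def456' => [
			'name' => 'es-node-2',
			'plugins' => [],
		],
	],
];

print_r( nodesMissingPlugin( $nodesInfo, 'highlighter' ) );
```

An empty result would suggest that every node reports a matching plugin. Any node name that is listed probably still needs the plugin installed and a restart.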

T0lk (talk | contribs)

I followed the steps and installed the Experimental Highlighter. Nothing seems to have changed. Aside from not getting any errors, is there any way to make sure I installed it correctly/it's actually working?

T0lk (talk | contribs)

After some testing I confirmed that I had the experimental highlighter plugin installed all along, and I can retrieve results using the highlight syntax when running searches from the command line. So it seems that setting those two options to true simply did not change the search results for that specific query. For further testing, it would be useful to know what type of search term would yield different results when those two options are set to true.

DCausse (WMF) (talk | contribs)

You can double-check that the highlighter is in use by dumping the CirrusSearch query. Ask MediaWiki to do so by adding the &cirrusDumpQuery parameter to the search URL. It returns a JSON page where you can look at the Elasticsearch query sent by Cirrus.

Under the "highlight" section you'll find the list of fields, and type should be set to experimental. If it isn't, then $wgCirrusSearchUseExperimentalHighlighter is probably not being evaluated properly.

If you see type: experimental, then you are using the highlighter. Unfortunately it's not smart enough to handle your example, so the official answer to your question would be no.
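
If the dumped JSON is too long to read by eye, something along these lines could pull out the highlighter types for you. This is only a sketch: collectHighlighterTypes is a made-up helper, and the sample dump is a hypothetical, heavily trimmed stand-in. The real output of &cirrusDumpQuery is much larger and may be nested differently, which is why the helper looks for a "highlight" key at any depth instead of assuming a fixed path.

```php
<?php
/**
 * Collect every string "type" value found anywhere below a "highlight" key
 * in a decoded query dump.
 */
function collectHighlighterTypes( array $node, bool $insideHighlight = false ): array {
	$types = [];
	foreach ( $node as $key => $value ) {
		if ( $insideHighlight && $key === 'type' && is_string( $value ) ) {
			$types[] = $value;
		} elseif ( is_array( $value ) ) {
			// Once below a "highlight" key, stay in "inside highlight" mode.
			$types = array_merge(
				$types,
				collectHighlighterTypes( $value, $insideHighlight || $key === 'highlight' )
			);
		}
	}
	// De-duplicate and reindex so the result is a plain list.
	return array_values( array_unique( $types ) );
}

// Hypothetical dump, trimmed down to the part that matters here.
$json = '{"query":{"highlight":{"fields":{"title":{"type":"fvh"},"text":{"type":"fvh"}}}}}';
$dump = json_decode( $json, true );

print_r( collectHighlighterTypes( $dump ) );
```

To run it against a real dump, save the JSON page to a file and pass json_decode( file_get_contents( $path ), true ) instead of the sample. A result containing only experimental would mean the setting is taking effect. Anything else, such as fvh, would mean it is not.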


You can read on if you are comfortable with PHP and willing to hack your MediaWiki installation.

This highlighter supports a number of config options, but unfortunately they are not configurable via MediaWiki config vars.

If you'd like to hack something, everything is in the PHP file includes/Search/ResultsType.php, more precisely in the class FullTextResultsType. You can either try to tweak some scoring values such as boost_before, or implement a very ugly hack to highlight only on the field which excludes stop words:

Simply add

        if ( $name === 'text' ) { continue; }

inside the loop of the private method addMatchedFields( $fields ).

It should look like:

        /**
         * @param array[] $fields
         * @return array[]
         */
        private function addMatchedFields( $fields ) {
                foreach ( array_keys( $fields ) as $name ) {
                        // ugly hack: leave 'text' without matched_fields so it is highlighted
                        // only on the field with stop words excluded, not on text.plain
                        if ( $name === 'text' ) { continue; }
                        $fields[$name]['matched_fields'] = array( $name, "$name.plain" );
                }
                return $fields;
        }

T0lk (talk | contribs)

That's awesome help, thank you. Unfortunately I see "type": "fvh". I'm not quite sure why my config is ignoring ExperimentalHighlighter settings. I will spend some time trying to figure that out. Thanks for your help!
