Extension:CirrusSearch/Scoring

From mediawiki.org

This page aims to provide some insight into the scoring functions and techniques used by CirrusSearch to rank search results.

Basics

Cirrus follows a basic concept used by many search engines. A document's score combines two types of sub-scores:

  1. A score that measures the similarity between the query and the document
  2. Scores that depend only on the document metadata (e.g. recency, number of incoming links, language...)

Combining these scores is a delicate balance. If type 2 scores are ignored, low-quality and unpopular pages may rank very high, but if they are over-weighted, pages that are totally irrelevant to the user query could reach the top 10.
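The trade-off can be illustrated with a minimal sketch. The multiplicative formula, the `metadata_weight` parameter, and the sample documents below are assumptions made for illustration, not the formula Cirrus actually uses.

```python
from math import prod


def combined_score(similarity, metadata_boosts, metadata_weight=1.0):
    """Multiply the type 1 similarity by each type 2 boost raised to a weight.

    Hypothetical formula: a weight of 0 ignores metadata entirely.
    """
    return similarity * prod(b ** metadata_weight for b in metadata_boosts)


def rank(docs, metadata_weight):
    """Return document names ordered by descending combined score."""
    return sorted(
        docs,
        key=lambda name: combined_score(*docs[name], metadata_weight),
        reverse=True,
    )


# Hypothetical documents: name -> (similarity, [metadata boosts])
docs = {
    "good_article": (7.5, [1.5]),
    "relevant_stub": (8.0, [0.8]),
    "popular_offtopic": (2.0, [3.0]),
}

for weight in (0.0, 1.0, 3.0):
    print(weight, rank(docs, weight))
```

With a weight of 0 the low-quality stub ranks first, and with a weight of 3 the popular but off-topic page ranks first. Only the intermediate weight puts the good article on top.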

Query Architecture

Cirrus Query Architecture

The whole purpose of CirrusSearch is to translate the user query into an ElasticSearch query, using the features available in the ElasticSearch Query DSL.

ElasticSearch Query

This is the query sent to the cluster to retrieve ranked results. The full query is rather large, even for a single-word query. It can be broken down into several components that serve different purposes. Note that the small number labels (1 or 2) on the diagram indicate whether a component participates in a type 1 or type 2 sub-score.

Retrieval

The purpose of this step is to retrieve the documents in the index that match the user query. There are two different ways to retrieve documents:

  • full-text queries, which compute a score for each document.
  • filters, which do not compute any score.

The full-text query in Cirrus currently has two parts. The first is a specific query on the title and redirects (referred to below as NearMatch). To match, the query must contain all the words of the title or redirect in the same order. Its impact on the score is very high. This part is important because it ensures that if the user types a query that perfectly matches a title or redirect, that page is likely to be in the top search results.

The second part of the full-text query can use QueryString. It uses AND as the default operator between words, meaning that all the words in the query must appear in the document.
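As a rough sketch, a `query_string` clause with an AND default operator can be built as a plain dictionary. The field names `title` and `text` and the helper names are assumptions. The clause Cirrus generates is considerably larger. The second helper only illustrates the AND semantics in plain Python, without any text analysis.

```python
def build_query_string(user_query, fields):
    """Build a query_string clause in which every word is required."""
    return {
        "query_string": {
            "query": user_query,
            "fields": fields,  # assumed field names, for illustration only
            "default_operator": "AND",
        }
    }


def matches_all_words(user_query, document_text):
    """Illustrate AND semantics: every query word must occur in the document."""
    doc_words = set(document_text.lower().split())
    return all(word in doc_words for word in user_query.lower().split())


clause = build_query_string("solar eclipse", ["title", "text"])
print(clause)
print(matches_all_words("solar eclipse", "A total solar eclipse occurred"))
print(matches_all_words("solar eclipse", "The solar system"))
```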

An alternative to QueryString is also available in Cirrus. It uses the Common Terms Query as a base component and is useful for long search queries such as questions. It works by separating words into two groups, frequent (common) words and non-common words, so that different matching criteria can be set for each group.
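The grouping idea can be sketched as follows. The document frequencies, the cutoff value, and the field name are made up for illustration. The `common` clause uses the `cutoff_frequency` and `low_freq_operator` parameters of the ElasticSearch Common Terms Query, which is only available in ElasticSearch versions that still ship that query type.

```python
def split_by_frequency(words, doc_freq, total_docs, cutoff=0.1):
    """Split words into (common, rare) by the share of documents containing them."""
    common, rare = [], []
    for word in words:
        ratio = doc_freq.get(word, 0) / total_docs
        (common if ratio > cutoff else rare).append(word)
    return common, rare


def build_common_terms(field, user_query, cutoff=0.1):
    """Build a common terms clause that requires all low-frequency words."""
    return {
        "common": {
            field: {
                "query": user_query,
                "cutoff_frequency": cutoff,
                "low_freq_operator": "and",
            }
        }
    }


# Hypothetical document frequencies in an index of 1000 documents
doc_freq = {"what": 400, "is": 900, "the": 990, "capital": 50, "of": 950, "mongolia": 3}
question = "what is the capital of mongolia"
common, rare = split_by_frequency(question.split(), doc_freq, total_docs=1000)
print(common, rare)
```

In this example only "capital" and "mongolia" fall into the non-common group, so they are the words that drive which documents are retrieved.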

In the end, a document must match either NearMatch or QueryString/CommonTermsQuery (if the query matches NearMatch, it is very likely to match the second part as well). The score computed during the retrieval step determines the order in which the documents will be rescored. This score is extremely important because it is the main contributor to the type 1 score.

Filters work in nearly the same way, with the exception that they do not participate in the score. A very common filter in Cirrus is the namespace filter, but other filters can be activated by using a special syntax.
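The overall shape of the retrieval query can be sketched with a `bool` query: the two scoring parts are `should` clauses of which at least one must match, and the namespace restriction is a non-scoring `filter`. The two inner clauses below are simplified stand-ins, and the field names are assumptions. This is not the query Cirrus actually emits.

```python
def build_retrieval_query(near_match, full_text, namespaces):
    """Combine two scoring clauses (either may match) with a namespace filter."""
    return {
        "bool": {
            "should": [near_match, full_text],
            "minimum_should_match": 1,
            # Filter context: restricts the result set without affecting scores.
            "filter": [{"terms": {"namespace": namespaces}}],
        }
    }


# Simplified stand-ins for the NearMatch and QueryString parts
near_match = {"match_phrase": {"title": "solar eclipse"}}
full_text = {"query_string": {"query": "solar eclipse", "default_operator": "AND"}}

query = build_retrieval_query(near_match, full_text, namespaces=[0])
print(query)
```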

It is essential for the retrieval step to be extremely efficient.

Rescoring

The rescoring step allows Cirrus to rescore the top-N[1] documents returned by the retrieval phase. By working on a limited subset of documents, it is possible to perform "costly" operations that could not be done in the retrieval step.
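The windowing idea can be sketched in plain Python: only the first `window_size` documents are passed to the expensive function, and the rest keep their retrieval order. The function and parameter names are hypothetical, and the sketch leaves aside how ElasticSearch combines the original and rescore scores.

```python
def rescore_top_n(ranked, window_size, rescore_fn):
    """Rescore only the first window_size (doc, score) pairs of a ranked list.

    Documents outside the window keep their position after the rescored head.
    """
    head = ranked[:window_size]
    tail = ranked[window_size:]
    rescored = [(doc, rescore_fn(doc, score)) for doc, score in head]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored + tail


# Hypothetical retrieval output, already sorted by retrieval score
ranked = [("a", 3.0), ("b", 2.0), ("c", 1.0)]


def costly_boost(doc, score):
    """Stand-in for a costly rescore function: doubles the score of 'b'."""
    return score * 2 if doc == "b" else score


print(rescore_top_n(ranked, window_size=2, rescore_fn=costly_boost))
```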

  • Phrase rescore: when the query contains more than two words (generally a phrase), this rescore function tries to rank documents that contain the same phrase higher. This function is very costly, so by default it is applied only to the top 512 documents, while other methods are applied to the top 8192 documents.
  • Incoming links: applies a boost factor that depends on the number of incoming links to the page.
  • Recency: applies an exponential decay based on the document timestamp. This is useful for ranking recent pages higher and is used by default on Wikinews.
  • Language: on wikis that support multiple languages, a page can be boosted according to the user language or the default wiki language.
  • Templates: a list of templates can be configured, each with an associated boost factor. This is in use on wikis where a template is used to flag and rank the quality of a page (Featured Article/List, Good Article, Stub...).
  • Custom: some extensions can add specific metadata to the default Cirrus document model (e.g. number of labels, sitelinks and others for Wikidata). Using a custom rescore profile, wiki admins can configure how these various criteria are combined to alter document scores.
  • MLR: CirrusSearch also supports w:Learning_to_rank using the ltr plugin. LTR uses machine learning and a set of features to rank results. Not all the tools required to build an MLR model are provided out of the box by CirrusSearch (see wikitech:Search/MLR_Pipeline for documentation on how MLR is applied on WMF wikis).
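Three of the metadata boosts listed above can be sketched as small functions. The half-life form of the decay, the `log10(incoming_links + 2)` formula, and the template names and factors are assumptions chosen for illustration. The formulas and defaults Cirrus applies are defined in its rescore profiles and may differ.

```python
import math


def recency_boost(age_days, half_life_days):
    """Exponential decay: the boost halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def incoming_links_boost(incoming_links):
    """Logarithmic boost so that heavily linked pages do not dominate."""
    return math.log10(incoming_links + 2)


def template_boost(page_templates, boosts):
    """Multiply the configured factors of every boosted template on the page."""
    factor = 1.0
    for template in page_templates:
        factor *= boosts.get(template, 1.0)
    return factor


# Hypothetical template configuration
boosts = {"Template:Featured article": 2.0, "Template:Stub": 0.5}

print(recency_boost(age_days=30, half_life_days=30))
print(incoming_links_boost(98))
print(template_boost(["Template:Featured article", "Template:Infobox"], boosts))
```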

Display

Fallback methods

Query rewriting

Interwiki

Analysis

Fields

Cross-wiki

The wiki blocks are ordered by recall[2]. Large wikis are likely to be ordered first most of the time[3]. For some wikis there may be small variations, such as filtering by title or relying on boosted templates.
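A minimal sketch of this ordering, assuming that recall here means the number of results each wiki returned for the query. The wiki names and hit counts are made up.

```python
def order_wiki_blocks(hits_per_wiki):
    """Order wiki names by descending number of matching results."""
    return sorted(hits_per_wiki, key=lambda wiki: hits_per_wiki[wiki], reverse=True)


# Hypothetical hit counts for a single query
hits_per_wiki = {"wiktionary": 12, "wikibooks": 3, "wikiquote": 40}
print(order_wiki_blocks(hits_per_wiki))
```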

References