Extension:CirrusSearch/Query Construction

From mediawiki.org

This page describes how the user query is manipulated to be reconstructed as a structured Elasticsearch query.

This document describes CirrusSearch internals and may rapidly become out of date, as it describes details of the current code base.

Overview

CirrusSearch interacts with MediaWiki core by extending SearchEngine. This class exposes three main ways to query the index and find pages, called SearchEngine entry points in CirrusSearch.

When the query string and its associated metadata[1] enter Cirrus, they undergo several transformation steps:

  1. Parsing
  2. Profile selection
  3. Elasticsearch query building
  4. Elasticsearch responses transformation
  5. Fallback methods evaluation
  6. Cross-project searches

Parsing

Parsing is responsible for extracting features[2] from the user query string. While parsing is particularly important for full text search queries, it is also present for the other search entry points. For instance, namespace prefix extraction is present in all searches and can be considered a parsing step.

Parsing produces a SearchQuery instance that contains all the information known about the query and its context:

  • the search engine entry point
  • all its metadata (size, offset, ...)
  • contextual filters (e.g. the prefix option provided by Extension:InputBox)
  • the parsed query (AST)

The SearchQuery is immutable.
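
To make the shape of this object concrete, here is a minimal Python sketch of an immutable query object and a namespace prefix extraction step. Every name in it (ParsedQuery, SearchQuery, parse, KNOWN_NAMESPACES) is a hypothetical stand-in for illustration, not the CirrusSearch PHP API, and the real parser handles far more than a single prefix.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# All names below are hypothetical; they are not the CirrusSearch classes.


@dataclass(frozen=True)
class ParsedQuery:
    """Stand-in for the parsed query (AST): raw text plus extracted features."""
    raw: str
    remaining_text: str
    namespace_prefix: Optional[str] = None


@dataclass(frozen=True)
class SearchQuery:
    entry_point: str                     # the search engine entry point
    limit: int                           # metadata: size
    offset: int                          # metadata: offset
    namespaces: Tuple[int, ...]          # metadata: namespaces
    contextual_filters: Tuple[str, ...]  # e.g. a prefix option
    parsed: ParsedQuery                  # the parsed query


# Illustrative subset of namespace names and their numeric ids.
KNOWN_NAMESPACES = {"talk": 1, "help": 12}


def parse(raw, entry_point="fulltext", limit=20, offset=0):
    """Extract a leading namespace prefix, then freeze everything."""
    head, sep, tail = raw.partition(":")
    key = head.strip().lower()
    if sep and key in KNOWN_NAMESPACES:
        parsed = ParsedQuery(raw, tail.strip(), key)
        namespaces = (KNOWN_NAMESPACES[key],)
    else:
        parsed = ParsedQuery(raw, raw.strip())
        namespaces = (0,)  # fall back to the main namespace
    return SearchQuery(entry_point, limit, offset, namespaces, (), parsed)


query = parse("Help:search tips")
```

Because both dataclasses are frozen, any later attempt to assign to a field such as query.limit raises an error, which is the property the later building steps rely on.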

Profile selection

Profile selection is the process responsible for deciding which profiles are best suited to a given SearchQuery. This component is currently under discussion.

Elasticsearch query building

This is the process of building the Elasticsearch search request body.

As of this writing, the intent is to move the logic that builds the Elasticsearch query into a set of transformations whose input is the immutable SearchQuery and whose output is a part of the Elasticsearch search request body. The current technique uses a mutable context that all building components can modify.
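
The intended design can be sketched as a list of pure functions, each mapping the immutable query to one fragment of the request body. This is an illustration of the idea only: the reduced SearchQuery, the function names, and the "text" field are assumptions, and the conflict check is one possible policy rather than anything the code base prescribes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SearchQuery:
    """Hypothetical, reduced stand-in for the immutable query object."""
    text: str
    limit: int
    offset: int
    namespaces: Tuple[int, ...]


def paging(q):
    # "size" and "from" are the standard Elasticsearch paging parameters.
    return {"size": q.limit, "from": q.offset}


def retrieval(q):
    # The field name "text" is an assumption made for this sketch.
    return {"query": {"match": {"text": q.text}}}


def build_request(q, transformations):
    """Merge the fragments; no transformation can see or alter another's output."""
    body = {}
    for transform in transformations:
        part = transform(q)
        overlap = body.keys() & part.keys()
        if overlap:
            raise ValueError(f"conflicting request parts: {sorted(overlap)}")
        body.update(part)
    return body


request = build_request(SearchQuery("kitten", 20, 0, (0,)), [paging, retrieval])
```

With a mutable context, by contrast, the order in which components run can change the result, because each one may overwrite what an earlier one set.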

Retrieval query

The retrieval query is meant to extract all the documents that match the user query. This Elasticsearch query is split into two parts.

Scoring part

Elements of the query that affect scoring. Changing something here should not change the set of hits found by the retrieval query. This section of the query must only affect the initial ranking of the results. The scoring part of a query is controlled by a FullTextQueryBuilder[3], which is currently only supported by the full text SearchEngine entry point.

Filtering part

Elements of the query that do not affect ranking. Changing something here does not affect ranking but changes the set of hits found by the retrieval query. Filtering is also controlled by FullTextQueryBuilder, but is expected to change in the same way so that it takes a SearchQuery as input.

Full text search keywords can interact with the filters by implementing FilterQueryFeature.
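
One way to obtain this separation in the Elasticsearch query DSL is a bool query: clauses in "filter" restrict the set of hits without contributing to the score, and "should" clauses placed next to a filter contribute to the score without restricting hits. The sketch below assumes that layout. It is not necessarily how CirrusSearch assembles its query, and the helper, the keyword class, and the field names are all hypothetical.

```python
# Hypothetical sketch: none of these names are CirrusSearch APIs.


def retrieval_query(scoring_clauses, filter_clauses):
    """Combine both parts into one Elasticsearch bool query."""
    return {"bool": {
        # Filter context: decides which documents are hits, adds no score.
        "filter": list(filter_clauses),
        # Optional clauses: with a filter present they only adjust the score.
        "should": list(scoring_clauses),
    }}


class InCategoryFilter:
    """Illustrative keyword that narrows the result set and nothing else."""
    keyword = "incategory"

    def filter_clause(self, value):
        # The field name "category" is an assumption for this sketch.
        return {"term": {"category": value}}


filters = [
    {"match": {"text": {"query": "black kitten", "operator": "and"}}},
    {"terms": {"namespace": [0]}},
    InCategoryFilter().filter_clause("Cats"),
]
scoring = [{"match_phrase": {"title": "black kitten"}}]

query = retrieval_query(scoring, filters)
other = retrieval_query([], filters)  # different scoring, same set of hits
```

Here the keyword object plays the role that a FilterQueryFeature implementation plays for a full text search keyword: it contributes a clause to the filtering part only.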

Rescore query

Fine-tuning of the ranking. Depending on the need, multiple rescore queries can be combined, and their scores can be combined as well. Some searches may prefer to combine the score from the scoring part of the retrieval query with some rescore components.

Full text search keywords can interact with the rescore functions by implementing BoostFunctionFeature.
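
As a rough illustration, the Elasticsearch request body accepts a list of rescorers that run in sequence, each with its own window, weights, and score_mode. The helper name, window sizes, weights, and the "incoming_links" field below are assumptions chosen for the sketch, not values taken from a CirrusSearch profile.

```python
def rescore_block(window_size, rescore_query, query_weight=1.0,
                  rescore_query_weight=1.0, score_mode="total"):
    """Build one entry of the Elasticsearch "rescore" list (hypothetical helper)."""
    return {
        "window_size": window_size,  # top N hits per shard to rescore
        "query": {
            "rescore_query": rescore_query,
            "query_weight": query_weight,                  # weight of the original score
            "rescore_query_weight": rescore_query_weight,  # weight of the rescore score
            "score_mode": score_mode,                      # how the two are combined
        },
    }


# Boost documents by a numeric signal; the field name is an assumption.
popularity = {"function_score": {"field_value_factor": {
    "field": "incoming_links", "modifier": "log2p", "missing": 0,
}}}

# Boost documents whose title contains the exact phrase.
phrase = {"match_phrase": {"title": "black kitten"}}

request = {
    "query": {"match": {"text": "black kitten"}},
    "rescore": [
        rescore_block(1000, popularity),                    # sum with the retrieval score
        rescore_block(100, phrase, score_mode="multiply"),  # then scale the best hits
    ],
}
```

Using score_mode "total" in the first block is what combines the score from the retrieval query with the rescore component, as described above.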

Fetch phase configuration

This is the part of the search request that instructs Elasticsearch what data to extract for every hit we display to the user. This phase of the query building process is not yet fully designed and the current way of doing things is not optimal. A ResultsType is chosen early in the process and is responsible for selecting the fields to extract and the fields to highlight.
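
In request-body terms, this part corresponds to the "_source" and "highlight" sections. The sketch below shows two hypothetical result types choosing those sections; the class names, field names, fragment settings, and highlight tags are illustrative assumptions, not the actual ResultsType implementations.

```python
# Hypothetical Python analogues of a results type; not the CirrusSearch classes.


class TitleResultsType:
    """Only needs enough data to build a title; no highlighting."""

    def source_fields(self):
        return ["namespace", "title"]

    def highlight_config(self):
        return None


class FullTextResultsType:
    """Needs extra fields and highlighted snippets for a results page."""

    def source_fields(self):
        return ["namespace", "title", "timestamp"]

    def highlight_config(self):
        return {
            "pre_tags": ['<span class="searchmatch">'],
            "post_tags": ["</span>"],
            "fields": {
                "title": {"number_of_fragments": 0},  # highlight the whole field
                "text": {"fragment_size": 150, "number_of_fragments": 1},
            },
        }


def fetch_config(results_type):
    """Return the fetch-phase part of the search request body."""
    body = {"_source": results_type.source_fields()}
    highlight = results_type.highlight_config()
    if highlight is not None:
        body["highlight"] = highlight
    return body


title_fetch = fetch_config(TitleResultsType())
fulltext_fetch = fetch_config(FullTextResultsType())
```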

This is tricky because it is directly connected to the way search hits are displayed in Special:Search. Some extensions may want to extract and display specific data that they stored using a custom mapping and a custom ContentHandler. Some keywords may want to tell the user that they matched a particular part of the document. Some extensions may want to completely transform the data and appearance of what is displayed, using hooks like Manual:Hooks/ShowSearchHit.

In general, ShowSearchHit is currently used in a dual capacity: as a hook for extensions to incrementally tweak search results (i.e. add some widget or formatting), and as a means to completely override the result display, as Wikibase does (both with and without CirrusSearch enabled). The challenge here is that some scenarios, like Wikibase without CirrusSearch, may call for a complete display override without actually involving a custom result type, so the only way to implement such customization now is the hook.

Currently:

Drawbacks are:

  • None of these techniques can be strongly coupled but they are highly interdependent[4]
  • ResultsType is not driven by profile and it's unclear when it should be constructed. Cirrus decides the ResultsType before anything else, but some FullTextQueryBuilder may override it
  • Manual:Hooks/ShowSearchHit being a hook gives no guarantee that it'll be executed in the right order (not have its values overridden) nor that it has all the required context to know what to do.
  • Keywords are unable to cleanly add new highlighting hints[5]

Elasticsearch responses transformation

The process of reading the Elasticsearch response and returning a:

This transformation is performed by a CirrusSearch ResultsType.

Wikidata's wbsearchentities uses a special ResultsType implementation to create TermSearchResult arrays instead of the SearchEngine types.
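
The transformation itself amounts to walking the hits of the Elasticsearch response and building result objects from each hit's "_source" and "highlight" sections. A minimal sketch, assuming a hypothetical Hit type and the field names "title", "namespace", and "text":

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Hit:
    """Hypothetical result object; not one of the SearchEngine types."""
    title: str
    namespace: int
    score: float
    snippet: Optional[str]


def transform(response) -> List[Hit]:
    """Turn a raw Elasticsearch search response into result objects."""
    results = []
    for hit in response["hits"]["hits"]:
        source = hit.get("_source", {})
        # "highlight" maps each highlighted field to a list of fragments.
        fragments = hit.get("highlight", {}).get("text", [])
        results.append(Hit(
            title=source["title"],
            namespace=source["namespace"],
            score=hit["_score"],
            snippet=fragments[0] if fragments else None,
        ))
    return results


sample_response = {"hits": {"hits": [
    {"_score": 3.2,
     "_source": {"title": "Kitten", "namespace": 0},
     "highlight": {"text": ["A <b>kitten</b> is a juvenile cat."]}},
    {"_score": 1.1,
     "_source": {"title": "Cat", "namespace": 0}},
]}}

results = transform(sample_response)
```

A different results type would keep the same loop but emit a different object per hit, which is the kind of substitution the wbsearchentities case relies on.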

Fallback methods evaluation

Fallback methods are only used (for now) in full text searches. Fallback evaluation is a process that spans the entirety of the query construction up to the results evaluation. It is meant to repair a query that may not produce desirable results (e.g., at least 3 results to display).

Phrase suggester

Attaches an Elasticsearch suggest request to the main search query and displays the suggestion if a title is not highlighted. It may rewrite the entire result set, using the suggestion as the query, if the initial query did not produce any results. It is intended to detect typos and fix them.
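
The flow can be sketched as follows, using the Elasticsearch "suggest" section with a phrase suggester. The function names, the suggester name "did_you_mean", the "suggest" field, and the stubbed backend are assumptions made for this sketch, and the check on whether a title was highlighted is left out for brevity.

```python
def with_phrase_suggest(body, text, field="suggest"):
    """Return a copy of the request body with a phrase suggester attached."""
    request = dict(body)
    request["suggest"] = {
        "text": text,
        "did_you_mean": {"phrase": {"field": field, "size": 1}},
    }
    return request


def pick_suggestion(response):
    """Return the first suggested rewrite, or None if there is none."""
    for entry in response.get("suggest", {}).get("did_you_mean", []):
        for option in entry.get("options", []):
            return option["text"]
    return None


def total_hits(response):
    total = response["hits"]["total"]
    # Newer Elasticsearch versions return an object, older ones an integer.
    return total["value"] if isinstance(total, dict) else total


def search_with_fallback(run, build_body, text):
    """Run the search; rerun it with the suggestion if nothing was found."""
    response = run(with_phrase_suggest(build_body(text), text))
    suggestion = pick_suggestion(response)
    if suggestion and total_hits(response) == 0:
        return run(build_body(suggestion)), suggestion, True
    return response, suggestion, False


def build_body(text):
    return {"query": {"match": {"text": text}}}


def fake_run(body):
    """Stub backend standing in for a real Elasticsearch call."""
    if body["query"]["match"]["text"] == "kiten":
        return {
            "hits": {"total": {"value": 0}, "hits": []},
            "suggest": {"did_you_mean": [
                {"text": "kiten", "options": [{"text": "kitten", "score": 0.9}]},
            ]},
        }
    return {"hits": {"total": {"value": 1},
                     "hits": [{"_score": 1.0, "_source": {"title": "Kitten"}}]}}


response, suggestion, rewritten = search_with_fallback(fake_run, build_body, "kiten")
```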

TextCat language detection

This process runs language detection on the user query and runs a second search on the corresponding wiki in the detected language. The results are appended to the first ones.

Cross-project searches

This process runs the SearchQuery against every sister project of the same language. The search requests are attached to the main one using the Elasticsearch msearch (multi search) feature.
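
An msearch request body is newline-delimited JSON: each search contributes a header line naming the target index, followed by a line with the search body, and the payload ends with a newline. The sketch below builds such a payload; the helper name and the index names are hypothetical.

```python
import json


def msearch_body(searches):
    """Serialize (index, body) pairs into an msearch NDJSON payload."""
    lines = []
    for index, body in searches:
        lines.append(json.dumps({"index": index}))  # header line
        lines.append(json.dumps(body))              # search body line
    return "\n".join(lines) + "\n"  # the payload must end with a newline


main_query = {"query": {"match": {"text": "kitten"}}, "size": 20}
# Sister projects typically only need a few hits; the size is an assumption.
sister_query = {**main_query, "size": 3}

payload = msearch_body([
    ("enwiki_content", main_query),          # index names are hypothetical
    ("enwiktionary_content", sister_query),
    ("enwikibooks_content", sister_query),
])
```

The responses come back as a list in the same order as the requests, so the main result set and each sister project's results can be told apart by position.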

Notes and references

  1. namespaces, size, offset...
  2. See Help:CirrusSearch for example features in the full text search query
  3. will be deprecated soon to take benefit of the SearchQuery class
  4. task T190130
  5. task T195881