Extension:CirrusSearch/Schema

From mediawiki.org

CirrusSearch uses Elasticsearch as the underlying search engine. The schema used by CirrusSearch is defined through Elasticsearch index settings and mappings. Both the settings and the mappings can be requested from any wiki running CirrusSearch to retrieve the current configuration. Attempts are made to keep the documentation here up to date, but the API responses are the source of truth.
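
As a rough illustration of requesting the configuration, the sketch below only builds the API URLs and does not send any request. The cirrus-settings-dump action is named later on this page; the cirrus-mapping-dump action name, the api.php location, and the format parameters are assumptions made for the example.

```python
from urllib.parse import urlencode


def cirrus_dump_url(api_base: str, action: str = "cirrus-settings-dump") -> str:
    """Build a MediaWiki API URL for a CirrusSearch dump action.

    api_base is the wiki's api.php endpoint (assumed here, not taken from
    this page). Nothing is fetched; the caller decides how to request it.
    """
    query = urlencode({"action": action, "format": "json", "formatversion": "2"})
    return f"{api_base}?{query}"


# Hypothetical endpoint used purely for illustration.
API_BASE = "https://www.mediawiki.org/w/api.php"
settings_url = cirrus_dump_url(API_BASE)
# "cirrus-mapping-dump" is assumed to be the companion action for mappings.
mapping_url = cirrus_dump_url(API_BASE, "cirrus-mapping-dump")
```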


Analysis Chains Used

CirrusSearch defines a variety of analysis chains that are used throughout the schema to allow text fields to be searched in different ways. These are exposed as sub-properties when querying Elasticsearch. For example, the near_match analysis of the title field is typically exposed as title.near_match. There are no strict guarantees about sub-property naming, but by convention the sub-property shares the name of the analyzer.

The results of using an analysis chain can be checked with the Elasticsearch analyze API. This can be queried on the cloudelastic servers, or on a local Elasticsearch instance after importing the settings provided by the cirrus-settings-dump API call.
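
A minimal sketch of preparing such an analyze request follows. It only builds the request path and JSON body for the Elasticsearch `_analyze` endpoint; the index name is a made-up placeholder, and sending the request (and to which host) is left to the reader.

```python
import json


def build_analyze_request(index: str, analyzer: str, text: str):
    """Return (path, json_body) for an Elasticsearch _analyze request.

    The analyzer must be defined in the settings of the given index.
    """
    path = f"/{index}/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text})
    return path, body


# "mywiki_content" is a hypothetical index name used for illustration.
path, body = build_analyze_request("mywiki_content", "near_match", "Example_Title")
```

The body would be sent with a POST to the returned path on whichever Elasticsearch instance holds the imported settings.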

keyword

Strict matching of the property text with the queried text. The text is not split into words; the whole text must match from beginning to end. The property text is truncated to 5000 characters, so nothing after the first 5000 characters is taken into consideration when matching.
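
The behaviour described above can be approximated in a few lines. This is only a sketch: it counts Python characters (code points), which may not be the exact unit Elasticsearch uses for the limit, and the function name is invented for the example.

```python
KEYWORD_LIMIT = 5000  # truncation length stated in the text above


def keyword_token(value: str, limit: int = KEYWORD_LIMIT) -> str:
    """Emit the whole value as a single token, truncated to the limit.

    No splitting into words and no normalization is performed.
    """
    return value[:limit]


short = keyword_token("New York City")
long_value = keyword_token("a" * 6000)
```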

lowercase_keyword

Identical to keyword, but with ICU normalization and folding applied.

near_match

Identical to keyword, but with additional flattening of various space-like tokens to spaces. This is used to power the "Go" functionality of CirrusSearch.

near_match_asciifolding

Identical to lowercase_keyword, but with additional flattening of various space-like tokens to spaces.

plain

Applied to textual content to represent the words in a form very close to the original words. Minimal transformations are applied. This only represents words; various special characters (quotes, commas, etc.) are removed in the tokenization step.

prefix

Generates all possible prefixes of a keyword. ICU normalization is applied along with flattening of various space-like tokens to spaces. Any matching against a prefix must start from the very first character of the field.
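
The prefix generation itself can be sketched as below. The ICU normalization and the flattening of space-like tokens are deliberately left out, so this shows only the "every prefix starting at the first character" part.

```python
def prefixes(keyword: str) -> list:
    """Return every prefix of the keyword, shortest first.

    All prefixes start at the first character, which is why a match can
    never begin in the middle of the field.
    """
    return [keyword[:end] for end in range(1, len(keyword) + 1)]


tokens = prefixes("cat")
```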

prefix_asciifolding

Similar to prefix, but with ICU folding applied as well.

trigram

Generates trigrams, or three character sequences, of the textual content. This is primarily used to accelerate regex search. For example the string "example text" will yield the tokens: "exa", "xam", "amp", "mpl", "ple", "le ", "e t", " te", "tex", "ext"
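
The trigram example above can be reproduced with a short sliding-window sketch. It illustrates the idea only and does not apply any of the normalization the real analysis chain may perform.

```python
def trigrams(text: str, size: int = 3) -> list:
    """Return every run of `size` consecutive characters, in order."""
    return [text[start:start + size] for start in range(len(text) - size + 1)]


tokens = trigrams("example text")
```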

text

Standard analyzer for text content. This is similar to the plain analyzer but with more aggressive normalization applied to the content. These normalizations may include stop word filtering, stemming, and other language specific handling.

short_text

Similar to the text analyzer, but specialized for short text strings such as headings and titles.

source_text_plain

Analyzer primarily used against wikitext to provide word level queries. Uses only ICU normalization along with some special rules to help separate words seen in wikitext.

suggest

Shingled analyzer used to power search suggestions (also known as "did you mean"). Shingles are similar to trigrams, but operate on the word level instead of the character level. This analyzer is configured to emit 1-, 2- and 3-grams. For example the string "cats with hats" will emit the tokens: "cats", "cats with", "cats with hats", "with", "with hats", "hats"
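
A small sketch of word-level shingling that reproduces the example above. Splitting on whitespace stands in for the real tokenizer, which is an assumption made to keep the example short.

```python
def shingles(text: str, max_size: int = 3) -> list:
    """Return word n-grams of 1..max_size words, grouped by start position."""
    words = text.split()
    result = []
    for start in range(len(words)):
        for size in range(1, max_size + 1):
            if start + size <= len(words):
                result.append(" ".join(words[start:start + size]))
    return result


tokens = shingles("cats with hats")
```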

token_count

Reports the number of tokens in a field, rather than the textual content.

Native Document Properties

These properties are calculated in the CirrusSearch extension and provided to Elasticsearch when sending updates.

version

The revision ID that was indexed.

wiki

The dbname of the wiki this document belongs to.

namespace

The integer namespace the document is in.

namespace_text

The textual representation of the namespace the document is in. This is in the wiki's content language.

title

The title of the page this document represents. The title uses the text format, where spaces in the title are preserved.

timestamp

The timestamp at which the most recently indexed revision of this page was created. Timestamps are in the format YYYY-MM-DDTHH:MM:SSZ.

create_timestamp

The timestamp at which the first revision of this page was created. Timestamps are in the format YYYY-MM-DDTHH:MM:SSZ.
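
For readers consuming these fields, the sketch below formats and parses the timestamp layout described above. It assumes the trailing Z denotes UTC, as in ISO 8601; the helper names are invented for the example.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # YYYY-MM-DDTHH:MM:SSZ


def format_timestamp(moment: datetime) -> str:
    """Render a timezone-aware datetime in the index timestamp layout."""
    return moment.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)


def parse_timestamp(value: str) -> datetime:
    """Parse an index timestamp into a UTC-aware datetime."""
    return datetime.strptime(value, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


stamp = format_timestamp(datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
```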

category

A list of categories the page belongs to. The categories use the text format, where spaces in the title are preserved.

external_link

A list of external URLs this page links to.

outgoing_link

A list of wiki pages that are linked from this page. The wiki pages are in dbkey format, where spaces are replaced with underscores.
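
The difference between the text format and the dbkey format mentioned in these descriptions comes down to spaces versus underscores. A minimal sketch, ignoring any other title normalization MediaWiki performs (helper names are invented here):

```python
def to_dbkey(text_title: str) -> str:
    """Convert a text-format title to dbkey format (spaces become underscores)."""
    return text_title.replace(" ", "_")


def to_text(dbkey_title: str) -> str:
    """Convert a dbkey-format title back to text format."""
    return dbkey_title.replace("_", " ")


dbkey = to_dbkey("Albert Einstein")
```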

template

A list of templates that are used in this page, as reported by the MediaWiki wikitext parser. The template names are in text format, where spaces in the title are preserved.

text

The textual content of the page. This is roughly constructed by running the wikitext through the parser to generate HTML, removing non-text content, and stripping all HTML. Content removed from this field, such as tables, captions, and hatnotes, is moved to the auxiliary_text field.

source_text

The source wikitext of the page.

text_bytes

The size of the content as reported by the associated MediaWiki Content implementation. For wikitext this is the number of bytes in the wikitext.

content_model

String representing the name of the content model for this page.

wikibase_item

String containing the Wikidata Q-item this page is associated with.

coordinates

List of coordinates associated with this page. Each coordinate has the following properties:

  • coord - Elasticsearch geo_point. Represented as an object with two properties, lat and lon. Both contain a floating point number in the domain (-180, 180).
  • country - Country code.
  • dim - Dimension. Integer radius, in meters, of the item being referenced.
  • globe - The globe the coordinates are on. Typically "earth".
  • name - Name of the item referenced. Often null.
  • primary - Boolean representing whether this is the primary coordinate for the article. Only one coordinate can be primary.
  • region - Sub-region of the country this coordinate is within. For example, if the country code is US, the region will be a two-letter US state code.
  • type - ???. Same value as gt_type field of GeoData table in mysql

language

The language code this page is in.

heading

List of headings on this page.

opening_text

Text content of the page prior to the first heading. The content is also available in the text property.

auxiliary_text

List of strings removed from the text property. The content that is moved from the text property to this one is controlled by the WikiTextStructure::$auxiliaryElementSelectors property in MediaWiki core.

display_title

Contains the display title of the page if it differs from the regular page title in ways other than casing. If the display title is prefixed with the translated namespace of the page in the page's language, the namespace name is stripped.

file_bits

Contains the integer bit depth of the media represented by this page.

file_height

Contains the integer height of the media represented by this page.

file_media_type

Contains the media type of the media represented by this page.

file_mime

Contains the mime type of the media represented by this page.

file_resolution

Contains an integer representation of the resolution of the media represented by this page. This is calculated as floor(square_root(file_width * file_height)).
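
The formula above can be written directly with integer arithmetic. This sketch uses `math.isqrt`, which returns the floor of the square root for non-negative integers; the function name is invented for the example.

```python
import math


def file_resolution(file_width: int, file_height: int) -> int:
    """Compute floor(sqrt(file_width * file_height)) without float rounding."""
    return math.isqrt(file_width * file_height)


resolution = file_resolution(1920, 1080)
```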

file_size

Contains the size of the media represented by this page in bytes.

file_text

Contains the text content of the media represented by this page for mime types that MediaWiki knows how to extract the content of, such as PDF and DJVU. The length of text indexed is limited by $wgCirrusSearchMaxFileTextLength, which is unlimited by default and 50kB on WMF wikis.

file_width

Contains the width of the media represented by this page in pixels.

incoming_links

Contains an integer representing the number of pages on the same wiki that link to this page.

redirect

List of redirects on the same wiki that redirect to this page. Each redirect is represented by an object with two properties: The integer namespace in the namespace property, and the title of the redirect in

Properties only populated on commonswiki

local_sites_with_dupe

Only found on commonswiki in the File namespace. Contains a list of wiki dbnames that have an uploaded file with the exact same name as this file.

Properties only populated on wikis running wikibase repo

descriptions.*

label_count

labels.*

lemma

lexeme_forms

id

representation

lexeme_language

lexical_category

statement_count

statement_keywords

External Document Properties

These properties are calculated external to CirrusSearch and populated within the production search clusters.

popularity_score

A floating point number representing the percentage of page views to this wiki that request this page. This is only available for content pages.

weighted_tags

Contains classification predictions about the page from various sources, including ORES models and link recommendations. While the name says articletopic, this will be renamed to something semantically appropriate, perhaps predicted_classes or even classifications, in the future.

Predictions are provided in the source documents in an array with per-model prefixes and a suffixed integer in [0,1000] representing the confidence. The analysis chain interprets this value as the term frequency. For legacy reasons unprefixed predictions (without a /) belong to the ORES articletopic model. For example:

   [
       "STEM.Computing|780",
       "drafttopic/STEM.STEM*|988",
       "link_recommend/exists|1",
   ]
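
To make the tag layout concrete, here is a small parsing sketch for entries in the format shown above. The `WeightedTag` type and the use of "articletopic" as the label for unprefixed entries are choices made for this example, not names taken from CirrusSearch.

```python
from typing import NamedTuple


class WeightedTag(NamedTuple):
    model: str   # per-model prefix, e.g. "drafttopic"
    name: str    # predicted class, e.g. "STEM.STEM*"
    weight: int  # confidence, read by the analysis chain as term frequency


def parse_weighted_tag(raw: str) -> WeightedTag:
    """Split 'prefix/name|weight' into its parts.

    Entries without a '/' are treated as ORES articletopic predictions,
    following the legacy rule described above.
    """
    tag, _, weight = raw.rpartition("|")
    if "/" in tag:
        model, name = tag.split("/", 1)
    else:
        model, name = "articletopic", tag
    return WeightedTag(model, name, int(weight))


parsed = [parse_weighted_tag(raw) for raw in (
    "STEM.Computing|780",
    "drafttopic/STEM.STEM*|988",
    "link_recommend/exists|1",
)]
```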

copy_to Document Properties

These properties are not provided directly by CirrusSearch. Instead, the Elasticsearch mapping is instructed to create these fields by copying content from other fields.

all

Contains all text content copied to a single field. This consolidation into a single field is an optimization; semantically it shouldn't be important. The general idea is to use it as a first-pass filter that removes most irrelevant results, leaving the individual field queries to only affect scoring.

all_near_match

Contains both titles and redirects in a single field for filtering with the near_match analyzer.

suggest

The suggest field is populated by the copy_to section of the title and redirect fields. The suggest field uses shingles (word ngrams) which provides phrase matching in a way that doesn't have to be restricted to the rescore window for performance reasons.

labels_all

Only generated on wikis containing wikibase repo. Contains a copy of all labels in all languages.