Extension:CirrusSearch/Schema

CirrusSearch uses Elasticsearch as the underlying search engine. The schema used by CirrusSearch is defined through Elasticsearch index settings and mappings. Both the settings and mappings can be requested from any wiki running CirrusSearch to retrieve the current configuration. Attempts are made to keep the documentation here up to date, but the API responses are the source of truth.
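
As a rough sketch of how the settings and mappings might be requested, the following builds the API URLs for a wiki. The cirrus-settings-dump action is named later on this page; the cirrus-mapping-dump action name and the example endpoint are assumptions that do not appear here, so adjust them for the wiki of interest.

```python
from urllib.parse import urlencode


def cirrus_dump_url(api_base: str, action: str) -> str:
    """Build the URL for one of the CirrusSearch dump API actions."""
    query = urlencode({"action": action, "format": "json"})
    return f"{api_base}?{query}"


# Assumed example endpoint; substitute the api.php URL of the wiki of interest.
API_BASE = "https://en.wikipedia.org/w/api.php"

settings_url = cirrus_dump_url(API_BASE, "cirrus-settings-dump")
# Assumption: the mapping dump action follows the same naming pattern.
mapping_url = cirrus_dump_url(API_BASE, "cirrus-mapping-dump")

print(settings_url)
print(mapping_url)
```

Fetching either URL with any HTTP client should return the configuration as JSON.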

Analysis Chains Used

CirrusSearch defines a variety of analysis chains that are used throughout the schema to allow text fields to be searched in different ways. These are exposed as sub-properties when querying Elasticsearch. For example, the near_match analysis of the title field is typically exposed as title.near_match. There are no strict guarantees about sub-property naming, but by convention the property shares the name of the analyzer.

The results of using an analysis chain can be checked with the elasticsearch analyze API. This can be queried on the cloudelastic servers or by importing the settings provided by the cirrus-settings-dump api call into a local elasticsearch instance.
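
A minimal sketch of preparing a request for the analyze API is shown below. It only builds the request path and JSON body; the index name mywiki_content is hypothetical, and the sketch assumes the analyzer is defined in that index's settings.

```python
import json
from typing import Tuple


def build_analyze_request(index: str, analyzer: str, text: str) -> Tuple[str, str]:
    """Return the path and JSON body for an Elasticsearch _analyze request."""
    path = f"/{index}/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text})
    return path, body


# "mywiki_content" is a hypothetical index name used only for illustration.
path, body = build_analyze_request("mywiki_content", "near_match", "Main Page")
print(path)
print(body)
```

The body can be sent as a POST request to the path on a local Elasticsearch instance that has the settings imported; the response lists the tokens the analyzer produced.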

keyword

Strict matching of the property text with the queried text. The text is not split into words; the whole text must match from beginning to end. The property text is truncated to 5000 characters, so nothing after the first 5000 characters is taken into consideration when matching.

lowercase_keyword

Identical to keyword, but with icu normalization and folding applied.

near_match

Identical to keyword, but with additional flattening of various space-like tokens to spaces. This is used to power the "Go" functionality of CirrusSearch.

near_match_asciifolding

Identical to lowercase_keyword, but with additional flattening of various space-like tokens to spaces.

plain

Applied to textual content to represent the words in a form very close to the original words. Minimal transformations are applied. This only represents words; various special characters (quotes, commas, etc.) are removed in the tokenization step.

prefix

Generates all possible prefixes of a keyword. ICU normalization is applied along with flattening of various space-like tokens to spaces. Any matching against a prefix must start from the very first character of the field.
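
The prefix generation can be illustrated with a small sketch. It deliberately leaves out the ICU normalization and space flattening steps, and any length limits the real analyzer may apply; the function name is made up for this example.

```python
def all_prefixes(keyword: str) -> list:
    """Return every prefix of keyword, shortest first."""
    return [keyword[:end] for end in range(1, len(keyword) + 1)]


print(all_prefixes("cat"))
```

Because every emitted token starts at the first character, a query can only match the beginning of the field.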

prefix_asciifolding

Similar to prefix, but with icu folding applied as well.

trigram

Generates trigrams, or three character sequences, of the textual content. This is primarily used to accelerate regex search. For example the string "example text" will yield the tokens: "exa", "xam", "amp", "mpl", "ple", "le ", "e t", " te", "tex", "ext"
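
The example above can be reproduced with a short sketch. This is a plain sliding window over characters, not the actual Elasticsearch tokenizer, and the function name is illustrative.

```python
def trigrams(text: str) -> list:
    """Return every three character sequence of text, in order."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


print(trigrams("example text"))
```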

text

Standard analyzer for text content. This is similar to the plain analyzer but with more aggressive normalization applied to the content. These normalizations may include stop word filtering, stemming, and other language specific handling.

short_text

Similar to the text analyzer, but specialized for short text strings such as headings and titles.

source_text_plain

Analyzer primarily used against wikitext to provide word level queries. Uses only icu normalization along with some special rules to help separate words seen in wikitext.

suggest

Shingled analyzer used to power search suggestions (also known as "did you mean"). Shingles are similar to trigrams, but operate on the word level instead of the character level. This analyzer is configured to emit 1, 2 and 3-grams. For example the string "cats with hats" will emit the tokens: "cats", "cats with", "cats with hats", "with", "with hats", "hats"
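
A minimal sketch of shingling, assuming simple whitespace tokenization rather than the analyzer's real tokenizer, reproduces the tokens listed above. The function name and the max_size parameter are illustrative.

```python
def shingles(text: str, max_size: int = 3) -> list:
    """Return word n-grams of 1 to max_size words, grouped by start word."""
    words = text.split()
    result = []
    for start in range(len(words)):
        for size in range(1, max_size + 1):
            if start + size <= len(words):
                result.append(" ".join(words[start:start + size]))
    return result


print(shingles("cats with hats"))
```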

token_count

Reports the number of tokens in a field, rather than the textual content.

Native Document Properties

These properties are calculated in the CirrusSearch extension and provided to elasticsearch when sending updates.

version

The revision id that was indexed

wiki

The dbname of the wiki this document belongs to

namespace

The integer namespace the document is in

namespace_text

The textual representation of the namespace the document is in. This is in the wiki's content language

title

The title of the page this document represents. The title uses the text format, where spaces in the title are preserved.

timestamp

The timestamp the most recently indexed revision of this page was created at. Timestamps are in the format YYYY-MM-DDTHH:MM:SSZ

create_timestamp

The timestamp the first revision of this page was created at. Timestamps are in the format YYYY-MM-DDTHH:MM:SSZ.
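
For reference, the timestamp format used by timestamp and create_timestamp can be produced and parsed as sketched below. The sketch assumes the trailing Z denotes UTC, as in ISO 8601; the helper names are illustrative.

```python
from datetime import datetime, timezone

# strftime/strptime pattern for YYYY-MM-DDTHH:MM:SSZ
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def format_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime in the index timestamp format."""
    return moment.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)


def parse_timestamp(value: str) -> datetime:
    """Parse an index timestamp, assuming the trailing Z means UTC."""
    return datetime.strptime(value, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


example = datetime(2021, 3, 4, 5, 6, 7, tzinfo=timezone.utc)
print(format_timestamp(example))
```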

external_link

A list of external URLs this page links to.

outgoing_link

A list of wiki pages that are linked from this page. The wiki pages are in dbkey format, where spaces are replaced with underscores.

template

A list of templates that are used in this page, as reported by the MediaWiki wikitext parser. The template names are in text format, where spaces in the title are preserved.
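
The difference between the dbkey format used by outgoing_link and the text format used by title and template comes down to how spaces are written. A simplified sketch follows; it ignores other title normalization MediaWiki performs, and the helper names are made up.

```python
def to_dbkey(title: str) -> str:
    """Convert a text-format title to dbkey format (spaces become underscores)."""
    return title.replace(" ", "_")


def to_text(dbkey: str) -> str:
    """Convert a dbkey-format title to text format (underscores become spaces)."""
    return dbkey.replace("_", " ")


print(to_dbkey("Template:Citation needed"))
print(to_text("Main_Page"))
```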

text

The textual content of the page. This is roughly constructed by running the wikitext through the parser to generate HTML, removing non-text content, and stripping all HTML. Content removed from this field, such as tables, captions, and hatnotes, is moved to the auxiliary_text field.

source_text

The source wikitext of the page.

text_bytes

The size of the content as reported by the associated mediawiki Content implementation. For wikitext this is the number of bytes in the wikitext.

content_model

String representing the name of the content model for this page.

wikibase_item

String containing the wikidata Q-item this page is associated with.

coordinates

List of coordinates associated with this page. Each coordinate has the following properties:

coord - elasticsearch geo_point. Represented as object with two properties: lat/lon. Both contain a floating point number in the domain (-180, 180)
country - country code
dim - dimension. Integer radius, in meters, of the item being referenced
globe - The globe the coordinates are on. Typically "earth".
name - Name of the item referenced. Often null
primary - Boolean representing if this is the primary coordinate for the article. Only one coordinate can be primary.
region - Sub-region of country this coordinate is within. For example if country code is US region will be a two letter US State code.
type - ???. Same value as gt_type field of GeoData table in mysql
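
To make the structure above concrete, here is a sketch of what a coordinates list might look like, along with a helper that picks the primary coordinate. All values are invented for illustration and are not taken from a real document; in particular the type value is only a guess at what the field may hold.

```python
from typing import Optional

# Invented values for illustration only.
coordinates = [
    {
        "coord": {"lat": 48.8584, "lon": 2.2945},
        "country": "FR",
        "dim": 1000,
        "globe": "earth",
        "name": None,
        "primary": True,
        "region": None,
        "type": "landmark",  # hypothetical value
    },
    {
        "coord": {"lat": 48.8606, "lon": 2.3376},
        "country": "FR",
        "dim": 500,
        "globe": "earth",
        "name": None,
        "primary": False,
        "region": None,
        "type": "landmark",  # hypothetical value
    },
]


def primary_coordinate(coords: list) -> Optional[dict]:
    """Return the coordinate flagged as primary, or None if there is none."""
    for coordinate in coords:
        if coordinate.get("primary"):
            return coordinate
    return None


print(primary_coordinate(coordinates)["coord"])
```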

language

The language code this page is in

heading

List of headings on this page

opening_text

Text content of the page prior to the first heading. The content is also available in the text property.

auxiliary_text

List of strings removed from the text property. The content that is moved from the text property to this one is controlled by the WikiTextStructure::$auxiliaryElementSelectors property in mediawiki core.

display_title

Contains the display title of the page if it differs from the regular page title in ways other than casing. If the display title is prefixed with the translated namespace of the page in the page's language, the namespace name is stripped.

file_bits

Contains the integer bit depth of the media represented by this page

file_height

Contains the integer height of the media represented by this page

file_media_type

Contains the media type of the media represented by this page.

file_mime

Contains the mime type of the media represented by this page.

file_resolution

Contains an integer representation of the resolution of the media represented by this page. This is calculated as floor(square_root(file_width * file_height)).
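
The calculation can be sketched as follows. The function name is illustrative, and math.isqrt is used because it returns the floor of the square root for non-negative integers without floating point rounding.

```python
import math


def file_resolution(file_width: int, file_height: int) -> int:
    """Compute floor(square_root(file_width * file_height))."""
    return math.isqrt(file_width * file_height)


print(file_resolution(1920, 1080))
```

For a 1920 by 1080 image this gives 1440, since 1440 squared is exactly 2073600.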

file_size

Contains the size of the media represented by this page in bytes.

file_text

Contains the text content of the media represented by this page for mime types that MediaWiki knows how to extract the content of, such as PDF and DJVU. The length of text indexed is limited by $wgCirrusSearchMaxFileTextLength, which is unlimited by default and 50kB on WMF wikis.

file_width

Contains the width of the media represented by this page in pixels

incoming_links

Contains an integer representing the number of pages on the same wiki that link to this page.

redirect

List of redirects on the same wiki that redirect to this page. Each redirect is represented by an object with two properties: The integer namespace in the namespace property, and the title of the redirect in

Properties only populated on commonswiki

local_sites_with_dupe

Only found on commonswiki in the File namespace. Contains list of wiki dbnames that have an uploaded file with the exact same name as this file.

Properties only populated on wikis running wikibase repo

descriptions.*

label_count

labels.*

lemma

lexeme_forms

id

representation

lexeme_language

lexical_category

sitelink_count

statement_count

statement_keywords

External Document Properties

These properties are calculated external to CirrusSearch and populated within the production search clusters

popularity_score

A floating point number representing the percentage of page views to this wiki that request this page. This is only available for content pages.

weighted_tags

Contains classification predictions about the page from various sources, including ORES models and link recommendations. While the name says articletopic, this will be renamed to something semantically appropriate, perhaps predicted_classes or even classifications, in the future.

Predictions are provided in the source documents in an array with per-model prefixes and a suffixed integer in [0,1000] representing the confidence. The analysis chain interprets this value as the term frequency. For legacy reasons unprefixed predictions (without a /) belong to the ORES articletopic model. For example:

   [
       "STEM.Computing|780",
       "drafttopic/STEM.STEM*|988",
       "link_recommend/exists|1",
   ]
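
The format of these entries can be unpacked with a short sketch. It follows the description above: the integer after the last | is the confidence, and entries without a / are attributed to the ORES articletopic model. The function name and the "articletopic" label used for unprefixed entries are assumptions made for this example.

```python
from typing import Tuple


def parse_weighted_tag(raw: str) -> Tuple[str, str, int]:
    """Split an entry into (model prefix, prediction name, confidence)."""
    # Assumes every entry carries a |<integer> suffix, as in the example above.
    tag, _, weight = raw.rpartition("|")
    if "/" in tag:
        prefix, name = tag.split("/", 1)
    else:
        # Legacy unprefixed entries belong to the ORES articletopic model.
        prefix, name = "articletopic", tag
    return prefix, name, int(weight)


for entry in ["STEM.Computing|780", "drafttopic/STEM.STEM*|988", "link_recommend/exists|1"]:
    print(parse_weighted_tag(entry))
```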

copy_to Document Properties

These properties are not provided directly by CirrusSearch, rather the elasticsearch mapping is instructed to create these fields by copying content from other fields.

all

Contains all text content copied to a single field. This consolidation into a single field is an optimization; semantically it shouldn't be important. The general idea is to use it as a first-pass filter that removes most irrelevant results, leaving the individual field queries to only affect scoring.

all_near_match

Contains both titles and redirects in a single field for filtering with the near_match analyzer.

suggest

The suggest field is populated by the copy_to section of the title and redirect fields. The suggest field uses shingles (word ngrams) which provides phrase matching in a way that doesn't have to be restricted to the rescore window for performance reasons.

labels_all

Only generated on wikis containing wikibase repo. Contains a copy of all labels in all languages.