Manual:Architectural modules/Parser

Module
Parser
Responsibilities	Parsing of wikitext and several related tasks
Implementation	The main Parser.php file plus 14 other separate files that implement supporting functionality for the parser

Responsibilities

The Parser module has several responsibilities in MediaWiki. Besides the actual parsing of wikitext, it is used for several other tasks. Its role comprises the following functional areas, each of which is described in more detail below:

Parse wikitext into HTML
- Apply options for parsing
- Deliver output object to be used for rendering
Cache the output of parsing
Provide parser functions
Provide tag hooks
Extract and replace sections when they are edited

Parse wikitext into HTML

Article content is stored in the database as wikitext. When an article has to be shown in the browser, this wikitext must be converted into proper HTML. Performing this task is the core functionality of the Parser module. The entry point for parsing is Parser::parse(), which sequentially executes a series of operations, each of which transforms one kind of wikitext construct (tables, lists, headers, etc.) into HTML. The execution steps of parse() are shown in the table Parsing Steps.

Parsing Steps
startParse()	Sets the title, options and outputType
internalParse()
preprocessToDom()	Preprocesses wikitext and returns the document tree
$frame->expand()	Expands templates; expands variables and parser functions; inserts unique prefixes for nowiki and puts values in the mStripState array under nowiki; inserts unique prefixes for sections and puts values in the mStripState array under general; removes HTML comments
Sanitizer::removeHTMLtags()	Cleans up HTML, removes dangerous tags and attributes, removes HTML comments
doTableStuff()	Renders wikitext for tables
preg_replace()	Inserts <hr /> tag for thematic break (start of sections)
doDoubleUnderscore()	Removes valid double-underscore items, like __NOTOC__, and puts them into array `$Parser->mDoubleUnderscores`.
doHeadings()	Renders section headers, i.e. "==" are replaced with <h2> tags
replaceInternalLinks()	Puts placeholders for internal links in [[ ]] and stores them in `$Parser->mLinkHolders`; renders section links; removes categories and puts them in `$mOutput->mCategories`
doAllQuotes()	Replaces single quotes with HTML markup (<i>, <b>, etc)
replaceExternalLinks()	Renders external links
doMagicLinks()	Replaces special strings like "ISBN xxx" and "RFC xxx" with magic external links
formatHeadings()	Auto-numbers headings if that option is enabled; adds an [edit] link to sections for users who have enabled the option and can edit the page; adds a table of contents at the top for users who have enabled the option; auto-anchors headings
exit internalParse()
$this->mStripState->unstripGeneral()	Inserts back general stripped items from StripState
preg_replace()	Cleans up special characters
doBlockLevels()	Renders lists from lines starting with ':', '*', '#', etc. Renders new lines and paragraphs
replaceLinkHolders()	Replaces link placeholders with actual links from $Parser->mLinkHolders
$this->getConverterLanguage()->convert()	The text is language converted (when applicable)
mStripState->unstripNoWiki()	Inserts nowiki text back from the StripState array
replaceTransparentTags()	Replaces transparent tags with values which are provided by the callback functions in $Parser->mTransparentTagHooks. Transparent tag hooks are like regular XML-style tag hooks, but they operate on HTML instead of wikitext.
$this->mStripState->unstripGeneral()	Inserts back general stripped items from StripState
Sanitizer::normalizeCharReferences()	Ensures that any entities and character references are legal for XML and XHTML specifically
MWTidy::tidy($text)	If HTML tidy is enabled, MWTidy::tidy is called to do the tidying
Limit report	If limit report is enabled, produces limit report
mOutput->setText($text)	Sets the parsed text to ParserOutput
return $this->mOutput	Returns the ParserOutput object with HTML text of the wiki page
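
The sequential character of parse() can be illustrated with a deliberately small sketch. It is written in Python rather than PHP and is not MediaWiki code: the names toy_horizontal_rule, toy_headings and toy_parse are hypothetical, and only two of the steps from the table are imitated, in the same order as the table lists them.

```python
import re

def toy_horizontal_rule(text):
    # Imitates the preg_replace() step: a line of four or more dashes becomes <hr />.
    return re.sub(r"^-{4,}[ \t]*$", "<hr />", text, flags=re.MULTILINE)

def toy_headings(text):
    # Imitates the idea of doHeadings(): "== Title ==" on its own line becomes <h2>.
    return re.sub(r"^==[ \t]*(.+?)[ \t]*==[ \t]*$", r"<h2>\1</h2>", text,
                  flags=re.MULTILINE)

def toy_parse(text):
    # Each step receives the output of the previous one, as in internalParse().
    for step in (toy_horizontal_rule, toy_headings):
        text = step(text)
    return text

html = toy_parse("----\n== Intro ==\nBody text")
print(html)
```

Every step works on the whole text and hands its result to the next one, which is why the order of the steps in the table matters.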


ParserOptions

Wikitext is parsed according to the applied ParserOptions. ParserOptions are initialized from the RequestContext, where the Language and User are the two important parameters. User preferences, such as thumbnail size, numbering of headings or language, are copied into the ParserOptions. Other options get their values from MediaWiki's global configuration variables, for example mAllowExternalImages or mMaxTemplateDepth (maximum recursion depth for templates within templates). An example of a ParserOptions object with values can be found in the figure ParserOptions. As a result, the HTML output for the same article (and the same revision) can vary depending on the parser options used.


ParserOutput

An important aspect of the parsing process is the incorporation of extension hooks. Hooks allow custom code to be executed when a defined event occurs. The code in parse() already includes some defined events and runs hooks when they occur. For example, the ParserBeforeInternalParse event runs its hooks before the internal parse proceeds. Additional functions can be registered on these events, and new events can be created, as everywhere in MediaWiki.

The result of parsing is stored in a ParserOutput object. Its main attribute is mText, which holds the whole HTML representation of the article content. In addition, ParserOutput separately holds categories, links, images, sections, templates and other "parts" of the article as variables. It also holds information relevant for caching: cache time, expiry and revision. An example of a ParserOutput object can be found in the figure ParserOutput. After the ParserOutput is produced, the values of its attributes are transferred to the OutputPage object.

Cache the output of parsing

To optimize the performance of MediaWiki and avoid parsing the wikitext into HTML on every request, the ParserOutput can be cached. Two PHP files in the parser directory relate to the caching mechanism: ParserCache.php and CacheTime.php. A ParserOutput is always created using some ParserOptions. There are more than 30 options, and they can be found in the ParserOptions class. Examples of such options are:

$mDateFormat – specifies date format
$mInterfaceMessage – specifies which language to call for plural and grammar
$mEnableLimitReport – specifies whether to enable limit report in an HTML comment on output
$mNumberHeadings – specifies whether headings should be automatically numbered
$mThumbSize – specifies thumb size preferred by user

All of these options are taken into account when producing the ParserOutput, and some of them are critical for creating the cache.

After the wikitext is parsed and the ParserOutput is created (in PoolWorkArticleView::doWork()), the output is cached if the cache expiry is greater than 0 and the latest revision of the article is being handled. This is done by calling ParserCache::singleton()->save(). The key under which the ParserOutput is stored is generated from the page ID and the ParserOptions used. The options are hashed by ParserOptions::optionsHash() in the form '!value' or '!*': '!' marks the beginning of a new option, and '*' is used when no value is found for that option. The resulting key for the cached ParserOutput looks like this:

key = buildings_en:pcache:idhash:31-0!*!0!!en!2

where

buildings_en – name of the database
pcache – constant for parser cache
31 – page id
0 – render key
!*!0!!en!2 – different options, 'en' for example stands for user language
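
How such a key is assembled can be sketched as follows. This is a Python illustration of the format described above, not the PHP implementation: the function names are hypothetical, and the list of option values (and their order) is assumed purely so that the result matches the example key.

```python
def options_hash(option_values):
    # "!" starts each option; "*" stands in for an option that has no value (None here).
    return "".join("!" + ("*" if value is None else str(value))
                   for value in option_values)

def parser_output_key(db_name, page_id, render_key, option_values):
    # <database>:pcache:idhash:<page id>-<render key><hashed options>
    return f"{db_name}:pcache:idhash:{page_id}-{render_key}{options_hash(option_values)}"

# Assumed option values, chosen only to reproduce the example key.
key = parser_output_key("buildings_en", 31, 0, [None, 0, "", "en", 2])
print(key)
```

Note that an option with an empty value contributes just '!', which is why the example key contains '!!' before 'en'.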

After the key is created, the output is saved with $this->mMemc->set(). mMemc is the configured back-end storage mechanism (a memcached client or a BagOStuff derivative). It is set when ParserCache is constructed by passing $parserMemc, which gets its value from the global variable $wgParserCacheType.

When the current revision of a page is requested and the request is not a redirect, the client and file caches are checked first. If neither the client cache nor the file cache is available and the variable $useParserCache is set to true, the parser cache is tried. Cache retrieval consists of two steps: getting optionsKey from the cache, and then getting the actual cached ParserOutput using parserOutputKey. optionsKey is a CacheTime object that holds, in particular, $mUsedOptions, $mCacheTime and $mCacheExpiry. The key for getting the optionsKey is generated from the page ID:

key=buildings_en:pcache:idoptions:31

where pageid=31.

If optionsKey is found in the cache and has not expired, the parserOutputKey can be obtained by calling ParserCache::getParserOutputKey(). The $usedOptions coming from $optionsKey->mUsedOptions are hashed to become part of the parserOutputKey, as described before.

Once the parserOutputKey is available, its value is looked up in the cache. If an entry is found, has not expired, and no different revision was requested, the cached ParserOutput object is returned and served to the user. Otherwise false is returned and the page is parsed again.

As described before, ParserOptions are important for saving the cache. If a requested page was saved with options different from those required for retrieval, it has to be parsed again. Following the example above, a user whose preferred language is Spanish would get the following parserOutputKey for exactly the same page:

buildings_en:pcache:idhash:31-0!*!0!!es!*!*

Because the key contains es for the user language, the page has to be parsed again in order to provide the correct representation. The Spanish user would then see, for example, 'Editar' instead of 'Edit' on the section edit links.
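
The two-step lookup and the effect of differing options can be modelled with a plain dictionary standing in for the cache back end. This is a hedged Python sketch, not the ParserCache code: save(), get(), the option names userlang and thumbsize, and the stored dictionaries are all assumptions made for illustration.

```python
import time

def options_hash(used_options, options):
    # "!" starts each option; "*" marks an option that has no value.
    return "".join("!" + str(options.get(name, "*")) for name in used_options)

def save(cache, db, page_id, options, used_options, html, ttl_seconds):
    expires = time.time() + ttl_seconds
    # Step 1 record: which options were used, stored under a key built from the page id.
    cache[f"{db}:pcache:idoptions:{page_id}"] = {"used": used_options, "expires": expires}
    # Step 2 record: the output itself, stored under a key that includes the hashed options.
    output_key = f"{db}:pcache:idhash:{page_id}-0{options_hash(used_options, options)}"
    cache[output_key] = {"html": html, "expires": expires}

def get(cache, db, page_id, options):
    meta = cache.get(f"{db}:pcache:idoptions:{page_id}")
    if meta is None or meta["expires"] <= time.time():
        return False
    output_key = f"{db}:pcache:idhash:{page_id}-0{options_hash(meta['used'], options)}"
    entry = cache.get(output_key)
    if entry is None or entry["expires"] <= time.time():
        return False
    return entry["html"]

cache = {}
english = {"userlang": "en", "thumbsize": 2}
spanish = {"userlang": "es", "thumbsize": 2}
save(cache, "buildings_en", 31, english, ["userlang", "thumbsize"], "<p>Edit</p>", 3600)
hit = get(cache, "buildings_en", 31, english)   # same options: served from cache
miss = get(cache, "buildings_en", 31, spanish)  # different language: must be parsed again
```

The Spanish request finds the options record for the page, but the output key built from its own option values does not exist yet, so the lookup returns false.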

Provide parser functions

A parser function is any magic word that takes parameters and returns a value calculated from these parameters. The Parser module implements the required functions and provides an interface for using them. An example of a built-in parser function is PLURAL.

This page has been accessed {{PLURAL:110|once|110 times}}

would become

This page has been accessed 110 times.

Parser functions can be used not only while parsing articles written in wikitext, but at any point where the result of the function is needed. MediaWiki developers can implement additional parser functions by creating an extension. This gives users more options for applying magic words with parameters suited to their specific needs.
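
The idea of registering a callback under a magic word and calling it during expansion can be sketched like this. The code is a Python model, not the MediaWiki API: PARSER_FUNCTIONS, set_function_hook() and expand() are hypothetical names, the plural rule covers English only, and nested calls are not handled.

```python
import re

PARSER_FUNCTIONS = {}

def set_function_hook(name, callback):
    # Associates a magic word with the function that computes its value.
    PARSER_FUNCTIONS[name] = callback

def plural(count, singular, plural_form):
    # English-only rule: exactly 1 selects the first form.
    return singular if int(count) == 1 else plural_form

set_function_hook("PLURAL", plural)

def expand(text):
    def call(match):
        name, args = match.group(1), match.group(2).split("|")
        callback = PARSER_FUNCTIONS.get(name)
        # Unknown names are left untouched.
        return callback(*args) if callback else match.group(0)
    return re.sub(r"\{\{(\w+):([^{}]*)\}\}", call, text)

result = expand("This page has been accessed {{PLURAL:110|once|110 times}}.")
print(result)
```

An extension would add its own entry to the registry in the same way, after which its magic word becomes usable in wikitext.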

Provide tag hooks

MediaWiki markup includes tags that delimit some text so that it is processed in a special way depending on the meaning of the tag. There are four core tags in MediaWiki (nowiki, pre, gallery and html) that the parser can process automatically; that is, their processing is built into the parser. Users, however, might want to extend the wiki markup, and they can do so by introducing new tags. This is done by implementing a tag extension and integrating it with the parser. A tag extension consists of a function that is called during parsing and renders the tagged text into HTML.
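
A tag extension boils down to a tag name mapped to a rendering function. The following Python sketch models that mapping; it is not MediaWiki code, and TAG_HOOKS, set_hook(), render_tags() and the <shout> tag are all made up for illustration. Attributes and nested tags are ignored.

```python
import html
import re

TAG_HOOKS = {}

def set_hook(tag, callback):
    # Registers the function that renders the content of <tag>...</tag>.
    TAG_HOOKS[tag] = callback

def render_shout(content):
    # Hypothetical <shout> tag: upper-cases the text and escapes it for HTML.
    return "<strong>" + html.escape(content.upper()) + "</strong>"

set_hook("shout", render_shout)

def render_tags(text):
    def call(match):
        tag, content = match.group(1), match.group(2)
        hook = TAG_HOOKS.get(tag)
        # Tags without a registered hook are left as they are.
        return hook(content) if hook else match.group(0)
    return re.sub(r"<(\w+)>(.*?)</\1>", call, text, flags=re.DOTALL)

out = render_tags("Say <shout>hi & bye</shout> now")
print(out)
```

The hook receives only the text between the tags and is responsible for returning safe HTML, which is why the sketch escapes the content before wrapping it.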

Extract and replace sections when they are edited

MediaWiki makes it possible to edit a specific section instead of the whole article. Extracting the section and replacing it after it has been changed is also done by the Parser module. The entry points for these operations are Parser::getSection() and Parser::replaceSection().
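
Section extraction and replacement can be pictured as slicing the wikitext at heading lines. The sketch below is a simplification in Python under stated assumptions: get_section() and replace_section() are hypothetical counterparts of the PHP methods, section 0 is taken to be the text before the first heading, and heading levels are ignored, so a section ends at the next heading of any level.

```python
import re

# A heading is a line that starts and ends with one or more "=" signs.
HEADING = re.compile(r"^=+.*=+[ \t]*$", re.MULTILINE)

def section_bounds(text, index):
    # Section 0 starts at offset 0; section n starts at the n-th heading.
    starts = [0] + [m.start() for m in HEADING.finditer(text)]
    if index >= len(starts):
        return None
    end = starts[index + 1] if index + 1 < len(starts) else len(text)
    return starts[index], end

def get_section(text, index):
    bounds = section_bounds(text, index)
    return None if bounds is None else text[bounds[0]:bounds[1]]

def replace_section(text, index, new_text):
    bounds = section_bounds(text, index)
    if bounds is None:
        return text
    return text[:bounds[0]] + new_text + text[bounds[1]:]

page = "Lead\n== A ==\nalpha\n== B ==\nbeta\n"
section_a = get_section(page, 1)
updated = replace_section(page, 1, "== A ==\nALPHA\n")
```

Editing a section then amounts to extracting it, letting the user change it, and writing the changed text back into the same slice of the page.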

Implementation Information

Files related to the Parser module are placed in the includes/parser directory. The directory consists of the main Parser.php file as well as 14 other separate files that implement supporting functionality for the parser.

Parser.php

Contains the core functionality of the Parser module.
The main entry point is parse() that does the parsing of wikitext.
Other entry points:
- preSaveTransform()
- preprocess()
- cleanSig()
- getSection()
- replaceSection()
- getPreloadText()

CacheTime.php

Sets the time of caching.
Sets the time (in number of seconds) when the cache should expire and checks if it is expired.
Contains variable $mUsedOptions with ParserOptions that were used to produce the ParserOutput.

CoreParserFunctions.php

Parser functions provided by MediaWiki core. These functions are registered for the Parser in Parser::firstCallInit().
A parser function is any magic word that takes one or more parameters. Parser functions are sometimes prefixed with a hash (#) to distinguish them from templates.
Example: {{PAGEID: page name}} – Returns the page identifier of the specified page.

CoreTagHooks.php

Tag hooks provided by MediaWiki core. These functions are registered for the Parser in Parser::firstCallInit().
Core tags are:
- <pre> – works like the normal HTML <pre> tag; the text inside is not treated as wikitext (ignored by the parser).
- <nowiki> – the text inside is not treated as wikitext (ignored by the parser).
- <gallery> – images are displayed in rows and columns.
- <html> – the text inside is treated as raw HTML, but only if $wgRawHtml is enabled.

DateFormatter.php

Date formatter recognizes dates in plain text and formats them according to user preferences.

LinkHolderArray.php

Temporarily holds the links of the wiki page. There are two types of links: $internals and $interwikis. At some point during parsing, all internal and interwiki links are cut out of the wikitext and placed into this array as key-value pairs, and the keys are inserted into the wikitext in their place. Later, the real links, correctly formatted as HTML, are put back into the page.
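
The placeholder technique can be sketched as two passes. This is a Python model and not the LinkHolderArray class: the placeholder format, the function names and the idea of passing in a set of existing pages are assumptions for illustration, and titles are not HTML-escaped as real code would require.

```python
import re

def hold_links(text):
    holders = {}
    def stash(match):
        # Replace [[Title]] with a key and remember the title under that key.
        key = f"<!--LINK {len(holders)}-->"
        holders[key] = match.group(1)
        return key
    return re.sub(r"\[\[([^\[\]|]+)\]\]", stash, text), holders

def replace_holders(text, holders, existing_pages):
    # All titles are known at this point, so they can be resolved together.
    for key, title in holders.items():
        href = "/wiki/" + title.replace(" ", "_")
        css = "" if title in existing_pages else ' class="new"'
        text = text.replace(key, f'<a href="{href}"{css}>{title}</a>')
    return text

held, holders = hold_links("See [[Main Page]] and [[Missing]].")
final = replace_holders(held, holders, {"Main Page"})
print(final)
```

Deferring the replacement means the rest of the pipeline only ever sees opaque keys, and the final HTML for each link is produced once, when everything about the link targets is known.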

Parser_DiffTest.php (removed in 1.35)

Fake parser that outputs the difference of two different parsers.

ParserCache.php

Handles caching of the ParserOutput:
- generates cache key
- saves in cache
- retrieves from cache

ParserOptions.php

Holds options used to create ParserOutput.
Generates hashed options that will be used as a part of the key to store ParserOutput in cache.

ParserOutput.php

Represents the output of the parsing.
The main variable is $mText which contains HTML text that will be rendered in browser.
Other variables include:
- $mLanguageLinks – List of the full text of language links, in the order they appear
- $mCategories – Map of category names to sort keys
- $mTitleText – Title text of the chosen language variant
- ...

Preprocessor.php

Interfaces for preprocessors (Preprocessor, PPFrame, PPNode).
Implementation classes can be set in configuration for Parser.

Preprocessor_DOM.php (removed in 1.35)

Preprocessor using PHP's dom extension.
Contains several classes: Preprocessor_DOM (implements Preprocessor), PPDStack, PPDStackElement, PPDPart, PPFrame_DOM, PPTemplateFrame_DOM, PPCustomFrame_DOM, PPNode.
Main functionalities:
- Preprocesses wikitext and returns a document tree.
- Creates a PPFrame DOM object and calls its expand() method, which builds the structure of the wiki page (creates new lines for sections and lists) and expands templates and parser functions.

Preprocessor_Hash.php

Preprocessor using PHP arrays

StripState.php

Holder for stripped items when parsing wikitext. There are two types of stripped items: nowiki and general.
Holds these items as key-value pairs while parsing is done. The items are removed from the wikitext and their keys are inserted in their place. When all the parsing is done, the items are inserted back into the text.
This is done so that nowiki text is left as it is while the rest of the wikitext is parsed normally.
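
The strip and unstrip cycle for nowiki text can be sketched as follows. The class is a Python toy, not StripState.php: ToyStripState, its method names and the marker format are assumptions, and only the nowiki type is modelled.

```python
import re

class ToyStripState:
    def __init__(self):
        self.nowiki = {}  # marker -> original text

    def strip_nowiki(self, text):
        def stash(match):
            # The marker is an unlikely string that later parsing steps will not touch.
            marker = f"\x7fUNIQ-nowiki-{len(self.nowiki)}-QINU\x7f"
            self.nowiki[marker] = match.group(1)
            return marker
        return re.sub(r"<nowiki>(.*?)</nowiki>", stash, text, flags=re.DOTALL)

    def unstrip_nowiki(self, text):
        for marker, original in self.nowiki.items():
            text = text.replace(marker, original)
        return text

def bold(text):
    # Stand-in for the normal parsing that happens between strip and unstrip.
    return re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)

state = ToyStripState()
stripped = state.strip_nowiki("'''real''' and <nowiki>'''literal'''</nowiki>")
result = state.unstrip_nowiki(bold(stripped))
print(result)
```

Only the first pair of triple quotes is turned into bold markup; the second pair was hidden behind a marker while the bold step ran and comes back unchanged.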

Tidy.php

HTML validation and correction.
Parsing/Replacing Tidy
