Preprocessor ABNF
MediaWiki preprocessor syntax in augmented Backus–Naur Form (ABNF) (RFC 5234).
Ideal rules
; START = start of string
; END = end of string
; LINE-START = start of line
; LINE-END = end of line
;
; The string starts with LINE-START. An LF input produces the tokens
; LINE-END LF LINE-START, and the string ends with LINE-END.
;
; The starting symbol of the grammar is wikitext-L1.
xml-char = %x9 / %xA / %xD / %x20-D7FF / %xE000-FFFD / %x10000-10FFFF
sptab = SP / HTAB
; everything except ">" (%x3E)
attr-char = %x9 / %xA / %xD / %x20-3D / %x3F-D7FF / %xE000-FFFD / %x10000-10FFFF
literal = *xml-char
title = wikitext-L3
part-name = wikitext-L3
part-value = wikitext-L3
part = ( part-name "=" part-value ) / ( part-value )
parts = [ title *( "|" part ) ]
tplarg = "{{{" parts "}}}"
template = "{{" parts "}}"
link = "[[" wikitext-L3 "]]"
comment = "<!--" literal "-->"
unclosed-comment = "<!--" literal END
; the 1* repetition (one or more comments) in the line-eating-comment rule
; was absent between MW 1.12 and MW 1.22
line-eating-comment = LF LINE-START *SP 1*( comment *SP ) LINE-END
attr = *attr-char
nowiki-element = "<nowiki" attr ( "/>" / ( ">" literal ( "</nowiki>" / END ) ) )
; ...and similar rules added by XML-style extensions.
xmlish-element = nowiki-element / ... extensions ...
heading = LINE-START heading-inner [ *sptab comment ] *sptab LINE-END
heading-inner = "=" wikitext-L3 "=" /
                "==" wikitext-L3 "==" /
                "===" wikitext-L3 "===" /
                "====" wikitext-L3 "====" /
                "=====" wikitext-L3 "=====" /
                "======" wikitext-L3 "======"
; wikitext-L1 is a simple proxy to wikitext-L2, except in inclusion mode, where it
; has a role in <onlyinclude> syntax (see below)
wikitext-L1 = wikitext-L2 / *wikitext-L1
wikitext-L2 = heading / wikitext-L3 / *wikitext-L2
wikitext-L3 = literal / template / tplarg / link / comment /
              line-eating-comment / unclosed-comment / xmlish-element /
              *wikitext-L3
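The line-boundary tokens described in the comments at the top of the grammar can be illustrated with a short Python sketch. It is not taken from the MediaWiki source. The function name `line_tokens` is hypothetical, and START and END are left out because the grammar does not state where they sit relative to LINE-START and LINE-END.

```python
def line_tokens(text):
    """Return the characters of text interleaved with line-boundary tokens.

    Assumption: boundary tokens are modelled as the strings "LINE-START"
    and "LINE-END"; every other list item is one input character.
    """
    tokens = ["LINE-START"]  # the string starts with LINE-START
    for ch in text:
        if ch == "\n":
            # an LF input produces LINE-END LF LINE-START
            tokens += ["LINE-END", "LF", "LINE-START"]
        else:
            tokens.append(ch)
    tokens.append("LINE-END")  # the string ends with LINE-END
    return tokens
```

For the two-line input `"a\nb"` this yields LINE-START, a, LINE-END, LF, LINE-START, b, LINE-END.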
In inclusion mode, these rules are added:
noinclude-element = "<noinclude" attr ( "/>" / ( ">" literal ( "</noinclude>" / END ) ) )
inclusion-ignored-tag = "<includeonly>" / "</includeonly>"
closed-onlyinclude-item = ignored-text "<onlyinclude>" wikitext-L2 "</onlyinclude>"
unclosed-onlyinclude-item = ignored-text "<onlyinclude>" wikitext-L2
ignored-text = literal
onlyinclude-sequence = *closed-onlyinclude-item *unclosed-onlyinclude-item
xmlish-element =/ noinclude-element
wikitext-L1 =/ onlyinclude-sequence
wikitext-L3 =/ inclusion-ignored-tag / onlyinclude-sequence
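As a rough illustration of onlyinclude-sequence, the following Python sketch collects the text that survives in inclusion mode when `<onlyinclude>` is present. It works on raw text only and ignores the precedence of comments and other xmlish elements, so it is a simplification of the rules above rather than a description of the PHP implementation. The function name `onlyinclude_segments` is hypothetical.

```python
OPEN = "<onlyinclude>"
CLOSE = "</onlyinclude>"

def onlyinclude_segments(text):
    """Return the contents of each onlyinclude item, in order.

    Text before each "<onlyinclude>" plays the role of ignored-text.
    An item with no closing tag runs to the end of the input, like
    unclosed-onlyinclude-item.
    """
    segments = []
    pos = 0
    while True:
        start = text.find(OPEN, pos)
        if start == -1:
            break  # no further items; the remainder is ignored here
        start += len(OPEN)
        end = text.find(CLOSE, start)
        if end == -1:
            segments.append(text[start:])  # unclosed item
            break
        segments.append(text[start:end])  # closed item
        pos = end + len(CLOSE)
    return segments
```

For example, `onlyinclude_segments("a<onlyinclude>b</onlyinclude>c<onlyinclude>d")` returns `["b", "d"]`: `a` and `c` are ignored text, and the last item is unclosed.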
In non-inclusion mode, these rules are added:
includeonly-element = "<includeonly" attr ( "/>" / ( ">" literal ( "</includeonly>" / END ) ) )
noninclusion-ignored-tag = "<noinclude>" / "</noinclude>" / "<onlyinclude>" / "</onlyinclude>"
xmlish-element =/ includeonly-element
wikitext-L3 =/ noninclusion-ignored-tag
Ideal precedence
- Angle bracket constructs: onlyinclude-sequence, xmlish-element, comment, unclosed-comment, line-eating-comment, inclusion-ignored-tag, noninclusion-ignored-tag
- Bracketed syntax: tplarg, template, link
- heading
- literal
In ambiguity between angle-bracket constructs, the first-opened structure takes precedence. For example:
<nowiki><!--</nowiki>-->
The nowiki-element wins.
In ambiguity between template, tplarg and link, the structure with the rightmost opening takes precedence. For example:
[[ {{ ]] }}
The template wins because it was opened after the link.
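One way to picture the rightmost-opening rule is a stack of open structures in which a closer can only close the most recently opened one. The Python sketch below is an illustration under that assumption, not the PHP Preprocessor's code. It handles only `{{ }}` and `[[ ]]` (no tplarg, headings, pipes or xmlish elements), and the names `resolve`, `CLOSER_FOR` and `NAME_FOR` are hypothetical.

```python
import re

CLOSER_FOR = {"{{": "}}", "[[": "]]"}
NAME_FOR = {"{{": "template", "[[": "link"}
DELIMITER = re.compile(r"\{\{|\}\}|\[\[|\]\]")

def resolve(text):
    """Return (closed, unclosed) structure names for text.

    closed lists structures in the order they were closed; unclosed
    lists structures still open at the end, outermost first.
    """
    open_stack = []
    closed = []
    for match in DELIMITER.finditer(text):
        token = match.group()
        if token in CLOSER_FOR:
            open_stack.append(token)  # a new opening goes on top
        elif open_stack and CLOSER_FOR[open_stack[-1]] == token:
            closed.append(NAME_FOR[open_stack.pop()])
        # any other closer does not match the top of the stack
        # and is left as literal text
    unclosed = [NAME_FOR[token] for token in open_stack]
    return closed, unclosed
```

With the example above, `resolve("[[ {{ ]] }}")` returns `(["template"], ["link"])`: the `]]` cannot close the link because the template was opened after it, so the template closes and the link stays open.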
tplarg takes precedence over template where braces alone are involved. But it is neither higher nor lower in precedence than link. Sequences of matching braces are thus interpreted as follows:
- 4: {{{{·}}}} → {·{{{·}}}·}
- 5: {{{{{·}}}}} → {{·{{{·}}}·}}
- 6: {{{{{{·}}}}}} → {{{·{{{·}}}·}}}
- 7: {{{{{{{·}}}}}}} → {·{{{·{{{·}}}·}}}·}
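The four cases listed above follow a simple pattern: working from the inside out, take three braces as a tplarg for as long as at least three remain, then treat a remainder of two as a template and a remainder of one as a literal brace. The Python sketch below reproduces the listed cases. Applying it to counts other than 4 to 7 is an extrapolation, and the function names are hypothetical.

```python
def brace_layers(count):
    """Return brace widths from innermost to outermost layer.

    3 = tplarg, 2 = template, 1 = literal brace.
    """
    layers = []
    while count >= 3:
        layers.append(3)  # tplarg takes precedence over template
        count -= 3
    if count:
        layers.append(count)  # leftover: template (2) or literal (1)
    return layers

def render(count):
    """Render the interpretation in the notation used in the list above."""
    result = "·"
    for index, width in enumerate(brace_layers(count)):
        if index:
            result = "·" + result + "·"  # separator between layers
        result = "{" * width + result + "}" * width
    return result
```

For instance, `render(5)` gives `{{·{{{·}}}·}}`, a template wrapped around a tplarg.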
Practicalities
The main implementation challenge is avoiding infinite backtracking when disambiguating between competing bracketed constructs: template, tplarg, link and heading. The xmlish elements (including comments) don't suffer this problem because an unclosed xmlish element runs to the end, forcing a literal interpretation of the contents.
For example:
{{ [[ x | y | ...long string... }}
The square brackets are unclosed, and so the pipe characters should be interpreted as separating the parts of a template. But we don't know if the link is valid until the cursor reaches the end of the long string. This has traditionally been dealt with by adding a number of "broken" rules with the same precedence as the unbroken rules.
Since forever:
broken-tplarg = "{{{" parts-L2
broken-template = "{{" parts-L2
broken-link = "[[" wikitext-L2
Since MW 1.12:
broken-heading = LINE-START 1*6"=" wikitext-L3 LINE-END
Where parts-L2 is like parts except that it allows headings inside it:
part-L2 = ( part-name-L2 "=" part-value-L2 ) / ( part-value-L2 )
part-name-L2 = wikitext-L2
part-value-L2 = wikitext-L2
parts-L2 = [ part-L2 1*( "|" part-L2 ) ]
These "broken" rules, when matched, produce output similar to a literal start followed by ordinary wikitext. The difference is that they compete on the same precedence level as the unbroken rules. So the previous example is parsed as a broken-template containing a broken-link containing a long string and a literal "}}".
Based on the ideal rules, we would expect the literal interpretation of "}}" to have a lower precedence than its interpretation as the end of a template. But with the "broken" rules, the broken-link takes precedence over the template, being the rightmost-opened structure.
Broken rules always run to the end of the input string, because the only other way to terminate a broken rule is to turn it into an unbroken rule by closing it.
Because a heading or a broken-heading can appear in a part-L2, there is now ambiguity between the equals sign of the name/value separator and the equals sign for the heading. We resolve it in the following way:
- For level 1 headings (i.e. one equals sign on each side), the part takes precedence.
- For level 2-6 headings, the heading takes precedence.
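The two cases above can be sketched as a small Python classifier for a single line inside a part-L2. This is an illustration only: it assumes the line is passed without its LF, requires the same number of equals signs on both sides (as heading-inner does), and ignores the optional trailing comment that the heading rule allows. The names `HEADING` and `classify` and the returned labels are hypothetical.

```python
import re

# 1 to 6 "=" signs, some content, then the same run of "=" signs,
# optionally followed by spaces or tabs.
HEADING = re.compile(r"(={1,6})(.+)\1[ \t]*")

def classify(line):
    """Classify one whole line found inside a part-L2."""
    match = HEADING.fullmatch(line)
    if match and len(match.group(1)) >= 2:
        return "heading"  # level 2-6: the heading takes precedence
    if "=" in line:
        # level 1 headings land here too: the part takes precedence,
        # so the first "=" acts as the name/value separator
        return "named-part"
    return "unnamed-part"
```

Under these assumptions `classify("== x ==")` returns `"heading"`, while `classify("= x =")` returns `"named-part"`.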
If the part-L2 later becomes a part because the template or tplarg is closed, we could now have an errant heading in wikitext-L3, where it's not allowed. The heading can easily be disabled, but the name/value separator can't easily be recovered. To represent the syntactic effect of this, we introduce another rule:
disabled-heading = heading
wikitext-L3 =/ disabled-heading
The disambiguation of disabled-heading with part works in the same way as the disambiguation of heading with part-L2, described above.
Note that even with the changes described in this section, the grammar outlined here has ambiguities and precedence issues and does not correspond to the implementation of the PHP Preprocessor. This spec shouldn't be relied on as an authoritative machine-readable reference; treat it as a guide that helps a human reader understand the intended precedence and semantics of the preprocessor.
Possible improvements
If an efficient algorithm could be found for disambiguating the ideal rules without introducing "broken" rules, that would be a clear improvement. It would be a backwards-compatibility (b/c) break, but probably a beneficial one. Backwards compatibility was broken anyway by the introduction of broken-heading (the "newsome" bug on m:MNPP).
Line-eating comments could very easily be made to match at the start of the string. Currently they don't, since there is no LF at the start of the string, just a LINE-START.
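The start-of-string limitation can be seen with a regular-expression approximation of the line-eating-comment rule. The Python sketch below is an assumption-laden illustration, not the preprocessor's own code: LINE-END is modelled as a lookahead for LF or end of string, and deleting the matched span is just a way to make the match visible. The names `COMMENT`, `LINE_EATING` and `strip_line_eating_comments` are hypothetical.

```python
import re

# "<!--", then any characters up to the first "-->"
COMMENT = r"<!--(?:(?!-->).)*-->"

# LF, optional spaces, one or more comments each followed by optional
# spaces, then LINE-END (modelled as: next char is LF, or end of string)
LINE_EATING = re.compile(r"\n *(?:" + COMMENT + r" *)+(?=\n|\Z)", re.DOTALL)

def strip_line_eating_comments(text):
    """Delete every span that matches the approximated rule."""
    return LINE_EATING.sub("", text)
```

A comment on its own line in the middle of the text is consumed together with the preceding LF, so `"a\n <!-- c --> \nb"` becomes `"a\nb"`. The same comment on the first line, as in `"<!-- c -->\nb"`, is left alone because no LF precedes it.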
The "rightmost opening" rule for bracketed precedence is arbitrary, an artifact of implementation. Leftmost opening would probably be more intuitive.