Markup spec/BNF/Inline text
EBNF grammar project
"Inline text" is the guts of Wikitext formatting. It covers every situation where "normal" text is allowed, such as image captions, table data, and to an extent (yet to work out how to enforce this restriction...), headings and link text.
Code
<inline-text> ::= <inline-element> [<inline-text>]
<inline-element> ::=
| <category-link>
| <internal-link>
| <external-link>
| <magic-link>
| <image-inline> | <gallery-block> | <media-inline>
| <text-with-formatting>
<text-with-formatting> ::=
| <formatting>
| <inline-html>
| <noparseblock>
| <behaviour-switch>
| <open-guillemet> | <close-guillemet>
| <html-entity>
| <html-unsafe-symbol>
| <text>
| <random-character>
| (more missing?)...
<text> ::= { <harmless-character> }+
Detail:
- noparse-block: Markup spec/BNF/Noparse-block
- link: Links
- magic-link: Magic links
The parser should try the options in order, with text matching if all else fails. In particular, <category-link> should be matched before <internal-link> because a category is a special type of link, and we don't want the normal link parsing to occur.
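The "try the options in order" rule can be sketched as an ordered-choice tokenizer. This is only an illustration: the regular expressions are hypothetical, heavily simplified stand-ins for the real productions, and the set of harmless characters (letters, digits, space) is an assumption.

```python
import re

# Hypothetical, simplified matchers, listed in the order they are tried.
# <category-link> comes before <internal-link> so categories never fall
# through to normal link parsing.
RULES = [
    ("category-link", re.compile(r"\[\[Category:[^\]|]+(?:\|[^\]]*)?\]\]")),
    ("internal-link", re.compile(r"\[\[[^\]]+\]\]")),
    ("text", re.compile(r"[A-Za-z0-9 ]+")),           # assumed harmless characters
    ("random-character", re.compile(r".", re.DOTALL)),  # fallback, always matches
]


def tokenize(source):
    """Split source into (rule-name, matched-text) pairs using ordered choice."""
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in RULES:
            match = pattern.match(source, pos)
            if match:
                tokens.append((name, match.group(0)))
                pos = match.end()
                break
    return tokens
```

Because the last rule accepts any single character, the loop always advances, which mirrors the "matching if all else fails" fallback.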
HTML entity
The parser recognises validly constructed HTML entities and leaves them alone.
<html-entity> ::= "&" <html-entity-name> ";"
| "&#" <decimal-number> ";"
| "&#x" <hex-number> ";"
<html-entity-name> ::= Sanitizer::$wgHtmlEntities (case sensitive)
(* "Aacute" | "aacute" | ... *)
Rendering
- The whole sequence is output literally.
- The current parser does very complicated things with escaping and de-escaping. So maybe there are places where something more complicated needs to happen.
HTML unsafe symbol
These "unsafe" symbols are turned into HTML entities if they haven't matched part of a valid HTML entity above. It's probably not too efficient having single-character level matching rules...perhaps should be combined with "text".
<html-unsafe-symbol> ::= <unescaped-ampersand> | <unescaped-less-than> | <unescaped-greater-than>
<unescaped-ampersand> ::= "&"
<unescaped-less-than> ::= "<"
<unescaped-greater-than> ::= ">"
Rendering
- <unescaped-ampersand> → &amp;
- <unescaped-less-than> → &lt;
- <unescaped-greater-than> → &gt;
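A minimal sketch of how the two rules interact: valid entities pass through untouched, and any remaining "&", "<" or ">" is escaped. Python's html.entities.name2codepoint is used here as a stand-in for Sanitizer::$wgHtmlEntities, which is an assumption; the real list may differ. The function name is made up for illustration.

```python
import re
from html.entities import name2codepoint  # assumed stand-in for $wgHtmlEntities

# Named, decimal, or hexadecimal entity, following the <html-entity> rule.
ENTITY = re.compile(r"&(?:([A-Za-z][A-Za-z0-9]*)|#([0-9]+)|#x([0-9A-Fa-f]+));")

ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}


def render_unsafe(text):
    """Leave valid entities alone; escape every other &, < and >."""
    out = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch == "&":
            match = ENTITY.match(text, pos)
            # Named entities must be known (case sensitive); numeric ones pass.
            if match and (match.group(1) is None or match.group(1) in name2codepoint):
                out.append(match.group(0))
                pos = match.end()
                continue
        out.append(ESCAPES.get(ch, ch))
        pos += 1
    return "".join(out)
```

In the full grammar, inline HTML tags would already have been matched before this rule is reached, so the sketch only covers text that nothing else claimed.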
Text
Harmless characters are characters that couldn't be anything else. I'm not sure how useful this is as a distinction, but perhaps it will help speed things up?
A "random character" is any character which hasn't matched anything else.
<harmless-character> ::= /[A-Za-z0-9] etc
<random-character> ::= ? any character ... ?
Rendering
Both types are written literally.
This section from the "fundamental elements" section...time to mangle!
<character> ::= <whitespace-char> | <non-whitespace-char> | <html-entity>
<whitespace> ::= <whitespace-char> [<whitespace>] | EOF
<newlines> ::= <newline> [<newlines>]
<space-tabs> ::= <space-tab> [<space-tabs>]
<whitespace-char> ::= <space-tab> | <newline>
<space-tab> ::= <space> | TAB
<spaces> ::= <space> [<spaces>]
<space> ::= " "
<newline> ::= CR LF | LF CR | CR | LF
<BOL> ::= <newline> | BOF
<EOL> ::= <newline> | EOF
<non-whitespace-char> ::= <letter> | <decimal-digit> | <symbol>
<letter> ::= <ucase-letter> | <lcase-letter>
<ucase-letter> ::= "A" | "B" | ... | "Y" | "Z"
<lcase-letter> ::= "a" | "b" | ... | "y" | "z"
<symbol> ::= <html-unsafe-symbol> | <underscore> | "." | "," | ...
<underscore> ::= "_"
<decimal-number> ::= <decimal-digit> [<decimal-number>]
<decimal-digit> ::= "0" | "1" | ... | "8" | "9"
<hex-number> ::= <hex-digit> [<hex-number>]
<hex-digit> ::= <decimal-digit>
| "A" | "B" | "C" | "D" | "E" | "F"
| "a" | "b" | "c" | "d" | "e" | "f"
Formatting
Bold/italics is the biggest problem with switching to a consume-parse-render parser. It will not be possible to describe the current, extremely esoteric rules in simple (E)BNF. The best we can hope for is to store tokens representing the apostrophe clumps and do a second pass to make more sense of them. It would be very useful to define a second, unambiguous set of formatting syntax (most likely // and **), and encourage people to use those wherever apostrophes and bold/italics meet.
Some rules for parsing bold/italics as recognised by the current parser. These must be implemented (Brion said so). In increasing order of complexity:
- ''italics'', '''bold''', '''''bold-italics'''''
- italics, bold, bold-italics
- '''''bold-italics''' just italics'' normal
- bold-italics just italics normal
- Some text about l''''Arc de triomphe'''.
- Some text about l'Arc de triomphe.
- However: '''bold is l''''Arc de triomphe'''.
- However: bold is l'Arc de triomphe.
Optimistic view:
<formatting> ::= <bold-italic-toggle> | <bold-toggle> | <italic-toggle>
<bold-italic-toggle> ::= "'''''"
<bold-toggle> ::= "'''"
<italic-toggle> ::= "''"
Reality:
<formatting> ::= <apostrophe-jungle>
<apostrophe-jungle> ::= "''" { "'" }
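The "store tokens representing the apostrophe clumps" idea can be sketched as a first pass that only records the length of each run of two or more apostrophes and leaves interpretation to a later pass. The function name and token shapes are made up for illustration.

```python
import re

# <apostrophe-jungle> ::= "''" { "'" }   i.e. two or more apostrophes
APOSTROPHE_JUNGLE = re.compile(r"'{2,}")


def split_apostrophe_runs(text):
    """First pass: return ("text", str) and ("apostrophes", run_length) tokens."""
    tokens = []
    pos = 0
    for match in APOSTROPHE_JUNGLE.finditer(text):
        if match.start() > pos:
            tokens.append(("text", text[pos:match.start()]))
        tokens.append(("apostrophes", len(match.group(0))))
        pos = match.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens
```

A single apostrophe never forms a token of its own here, so it stays part of the surrounding text.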
Rendering
- Once the parser has decided which way the toggles go:
- bold-toggle-on -> <b>
- bold-toggle-off -> </b>
- italic-toggle-on -> <i>
- italic-toggle-off -> </i>
- bold-italic-toggle-on -> <b> <i>
- bold-italic-toggle-off -> </i> </b>
Determining the behaviour of apostrophes
The following describes the behaviour of repeated apostrophes. "Bold" means "toggle bold", rather than "turn bold on". "Bold, italics" means "Toggle bold and italics independently", rather than "turn bold and italics on" or "toggle bold and italics the same way".
- One ('): Always a single apostrophe.
  - e.g. (hello ' blah) → hello ' blah
- Two (''): Always italics on or off
  - e.g. (hello '' blah) → hello blah
- Three ('''):
  - Bold (default)
    - e.g. (hello ''' blah) → hello blah
  - Apostrophe, italics
    - If there is otherwise an odd number of both bold and italics
      - If the preceding characters are <space><non-space> (and there are no earlier such sequences)
        - e.g. (hello l'''amour'' l'''ouest''' blah) → hello l'amour louest blah
      - Else if the preceding characters are <non-space><non-space> (and there are no earlier such sequences)
        - e.g. (hello mon'''amour'' blah) → hello mon'amour blah
      - Else (the preceding character is <space>) (and there are no earlier such sequences)
        - e.g. (hello '''amour'' '''blah '''blah) → hello 'amour blah blah
  - Italics, apostrophe (never)
- Four (''''):
  - Bold, apostrophe (never)
  - Apostrophe, bold (default, if either bold or italics ends up balanced)
    - e.g. (hello ''''amour''' now ''italics unbalanced, but that's ok) → hello 'amour now italics unbalanced, but that's ok
    - e.g. (hello ''''amour''' now, '''bold unbalanced, but that's ok) → hello 'amour now, bold unbalanced, but that's ok
  - Apostrophe, apostrophe, italics
    - If the default treatment leads to an odd number of bold and italics then this can meet condition 1 under the second case of three italics, above.
      - e.g. (hello ''''amour''' now '''''bold and italics unbalanced, so invoke this special case) → hello ''amour now bold and italics unbalanced, so invoke this special case
- Five ('''''):
  - Bold, italics; or italics, bold (default, the two cases are equivalent)
    - e.g. (hello ''''' blah) → hello blah
  - Italics, apostrophe, italics (never)
- More than five:
  - Apostrophes, bold+italics (default)
    - e.g. (hello '''''''''' blah) → hello ''''' blah
    - e.g. (hello '''bold '''''''''' blah) → hello bold ''''' blah
  - Bold+italics, apostrophes (never)
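The default readings in the list above can be written down as a small lookup. This sketch covers only the defaults; the balancing special cases for three and four apostrophes (odd numbers of both bold and italics) would need a second pass over the whole line and are deliberately left out. The function name and return shape are made up for illustration.

```python
def default_interpretation(run_length):
    """Return (literal_apostrophes, toggles) for a run of apostrophes.

    literal_apostrophes are written out before the toggles are applied.
    Only the default reading is modelled; the rebalancing special cases
    for runs of three and four are not handled here.
    """
    if run_length < 1:
        raise ValueError("run_length must be at least 1")
    if run_length == 1:
        return (1, [])                 # always a single apostrophe
    if run_length == 2:
        return (0, ["italic"])         # always italics on or off
    if run_length == 3:
        return (0, ["bold"])           # bold (default)
    if run_length == 4:
        return (1, ["bold"])           # apostrophe, bold (default)
    # Five or more: surplus apostrophes first, then bold and italics.
    return (run_length - 5, ["bold", "italic"])
```

For instance, a run of ten apostrophes yields five literal apostrophes followed by a bold and an italic toggle, matching the "More than five" examples.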
Inline HTML
The parser recognises and cleans a large number of HTML tags, as defined in Sanitizer.php.
A decision has to be made here on whether to attempt to parse these things as a matched set, or whether to leave that to a later pass.
A loose definition assuming they are treated individually:
<InlineHTML> ::= <InlineHTML-Open> | <InlineHTML-Close> | <InlineHTML-OpenClose> | <HTMLComment>
<InlineHTML-Open> ::= "<" <InlineHTMLtagname> [<extra-characters>] ">"
<InlineHTML-Close> ::= "</" <InlineHTMLtagname> [<extra-characters>] ">"
<InlineHTML-OpenClose> ::= "<" <InlineHTMLtagname> [<extra-characters>] "/>"
<extra-characters> ::= <word-boundary-char> {characters - ">"}
<word-boundary-char> ::= " " | "-" | ":" | " " | "\"" | "/" | "*" | "#" | "!" | "$" | "%" | ...
Remarks
- The range of "word-boundary-char" seems to be an artefact of the regular expression:
if( preg_match( '!^(/?)(\\w+)([^>]*?)(/{0,1}>)([^<]*)$!', $x, $regs ) ) {
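To see what that expression captures, here is a Python transcription of the same pattern. It assumes that $x holds the text following a "<" (the input having been split on that character), which is not stated above; the function name and dictionary keys are made up for illustration.

```python
import re

# Same pattern as the preg_match call; re.ASCII keeps \w limited to
# [A-Za-z0-9_], as in PCRE without the /u modifier.
TAG_PIECE = re.compile(r"^(/?)(\w+)([^>]*?)(/?>)([^<]*)$", re.ASCII)


def split_tag_piece(piece):
    """Split the text after a "<" into its five captured parts, or None."""
    match = TAG_PIECE.match(piece)
    if match is None:
        return None
    slash, name, params, brace, rest = match.groups()
    return {"slash": slash, "name": name, "params": params,
            "brace": brace, "rest": rest}
```

The tag name is \w+, so whatever follows it (group 3) necessarily starts with a non-word character, which would explain the odd-looking range of <word-boundary-char>.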
The list of "tags that must be closed":
- block elements
  - p, span, table, div
- lists
  - ol, ul, dl
- paragraph formatting
  - h1, h2, h3, h4, h5, h6, cite, center, blockquote, caption, pre
- character formatting
  - b, del, i, ins, u, font, big, small, sub, sup, code, em, s, strike, strong, tt, var
- Ruby
  - rt, rb, rp, ruby
Tags that can appear singly, and possibly paired
- br, hr, li, dt, dd
Tags that must not be paired
- br, hr
Tags that can be nested (source code is dubious on this)
- table, tr, td, th, div, blockquote, ol, ul, dl, font, big, small, sub, sup, span
Tags that can only appear inside a table
- td, th, tr
Tags that make lists
- ul, ol
And tags that can appear inside lists
- li
The significance of these groupings is shown as follows:
A <blockquote> B <span>C </blockquote> D </span> E
Here, blockquote and span are both "nesting" tags. When the close-blockquote tag is found inside the span block, it is escaped.
This doesn't work:
<span>Some text [[Image:foo.jpg|close </span>it.]]
But this does:
<b>Some text [[Image:foo.jpg|close </b>it.]]
Rendering
- Tags that have to be paired are forced closed according to some sort of logic.
- <extra-characters> are "sanitized": all attributes and styles are stripped except those on a pre-approved whitelist.
- Tags are then written out literally: <InlineHTMLTagname> " " <sanitized-attributes> > etc.
- HTML comments are completely discarded, with some whitespace massaging (Sanitizer.php):
  - To avoid leaving blank lines, when a comment is both preceded and followed by a newline (ignoring spaces), trim leading and trailing spaces and one of the newlines.
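The comment rule can be sketched as two substitutions. This is an approximation under stated assumptions, not a port of Sanitizer.php: it assumes every comment is properly closed with "-->", and the function name is made up.

```python
import re

# A comment alone on its line: newline, optional spaces, the comment,
# optional spaces, and then (not consumed) another newline.
# (?:(?!-->).)* stops the match at the first "-->".
COMMENT_ON_OWN_LINE = re.compile(r"\n *<!--(?:(?!-->).)*--> *(?=\n)", re.DOTALL)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(text):
    """Discard HTML comments without leaving blank lines behind."""
    # Drop the leading newline and surrounding spaces; keep the trailing newline.
    text = COMMENT_ON_OWN_LINE.sub("", text)
    # Any other comment is simply removed in place.
    return COMMENT.sub("", text)
```

Keeping the trailing newline out of the match (via the lookahead) lets several consecutive comment-only lines collapse one after another.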
Non-breaking spaces
This is pretty trivial and used basically to improve the appearance of punctuation in French, which always places a space before certain punctuation, and places spaces inside guillemets. Other languages use these characters, but without the spaces. Currently performed directly in the parse() method.
<nbsp-before> ::= [any character] <space> ("»" | "?" | ":" | ";" | "!" | "%")
<nbsp-after> ::= "«" <space>
Rendering
- In both cases, the space is converted to a &nbsp; string.
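Both rules fit in two regular-expression substitutions. A minimal sketch, assuming the replacement string is "&nbsp;" (a numeric entity would work the same way); the function name is made up.

```python
import re


def add_nbsp(text):
    """Replace the space before » ? : ; ! % and the space after «."""
    # <nbsp-before>: a space followed by one of the listed characters.
    text = re.sub(r" (?=[»?:;!%])", "&nbsp;", text)
    # <nbsp-after>: an opening guillemet followed by a space.
    return re.sub(r"« ", "«&nbsp;", text)
```

The lookahead leaves the punctuation itself untouched, so only the space is rewritten.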
Behaviour switches
Not to be confused with magic links. These seem to be able to be used virtually anywhere: a table of contents in an image caption even works. See Help:Magic words#Behaviour switches.
<behaviour-switch> ::= <behaviourswitch-toc> | <behaviourswitch-forcetoc> | <behaviourswitch-notoc> | <behaviourswitch-noeditsection> | <behaviourswitch-nogallery>
<behaviourswitch-toc> ::= mw("toc")
<behaviourswitch-forcetoc> ::= mw("forcetoc")
<behaviourswitch-notoc> ::= mw("notoc")
<behaviourswitch-noeditsection> ::= mw("noeditsection")
<behaviourswitch-nogallery> ::= mw("nogallery")
/* defaults, i->case insensitive, s->case sensitive */
mw("toc") ::= "__TOC__"i
mw("forcetoc") ::= "__FORCETOC__"i
mw("notoc") ::= "__NOTOC__"i
mw("noeditsection") ::= "__NOEDITSECTION__"i
mw("nogallery") ::= "__NOGALLERY__"i
Notes:
- These are the "default" strings to be matched. They can be modified in languages/messages/MessagesXx.php, where Xx is the language.
- Each magicword can have more than one string associated with it.
- The magic words are by default case insensitive but this can be changed in the file.
- Plenty of other "magic words" exist, including "magic variables" (eg {{CURRENTMONTH}}) which will be handled by the preprocessor. However it looks like all sorts of other "magic words" exist and are processed in different places.
Semantics
- behaviourswitch-toc: a miniature contents page will be rendered and inserted at the first instance of this token.
- behaviourswitch-forcetoc: a contents box will be rendered even if the normal criteria (typically, 4 sections) have not been met. Irrelevant if behaviourswitch-toc is present.
- behaviourswitch-notoc: no miniature contents pages will be rendered. Only takes effect if neither behaviourswitch-toc nor behaviourswitch-forcetoc is present.
- behaviourswitch-noeditsection: no edit links are to be displayed for any sections.
- behaviourswitch-nogallery: unclear. According to the code (parser::stripNoGallery): if the string (not case-sensitive) occurs in the HTML, do not add TOC. Perhaps it only has an effect in certain namespaces.
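The interaction between the three TOC switches can be sketched as follows. The sketch uses only the default strings with case-insensitive matching, and treats "4 sections" as a configurable threshold; the function names are made up, and localised aliases from the messages files are ignored.

```python
import re

# Default strings only; localised aliases are not modelled.
SWITCHES = {
    "toc": "__TOC__",
    "forcetoc": "__FORCETOC__",
    "notoc": "__NOTOC__",
    "noeditsection": "__NOEDITSECTION__",
    "nogallery": "__NOGALLERY__",
}


def find_switches(text):
    """Return the names of all behaviour switches present in text."""
    return {
        name
        for name, literal in SWITCHES.items()
        if re.search(re.escape(literal), text, re.IGNORECASE)
    }


def shows_toc(found, section_count, threshold=4):
    """Decide whether a contents box is rendered, per the semantics above."""
    if "toc" in found or "forcetoc" in found:
        return True          # notoc is ignored when either is present
    if "notoc" in found:
        return False
    return section_count >= threshold
```

Where the contents box is placed (at the first __TOC__, or in the default position) is a separate question that this sketch does not address.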
Images, media, gallery
Links to images and media should be handled as normal links. It's inline images and media that are being dealt with here.
Originally from MetaWiki.
Images
ImageInline ::= "[[" , "Image:" , PageName, ".", ImageExtension, ( { <Pipe>, ImageOption, } ) "]]" ;
ImageName ::= PageName, ".", ImageExtension
ImageExtension ::= "jpg" | "jpeg" | "png" | "svg" | "gif" | "bmp" ;
ImageOption ::= ImageModeParameter | ImageSizeParameter | ImageAlignParameter
| ImageValignParameter | Caption
ImageModeParameter ::= ImageModeManualThumb | ImageModeAutoThumb | ImageModeFrame | ImageModeFrameless
ImageModeManualThumb ::= mw("img_manualthumb");
ImageModeAutoThumb ::= mw("img_thumbnail");
ImageModeFrame ::= mw("img_frame");
ImageModeFrameless ::= mw("img_frameless");
/* Default settings: */
mw("img_manualthumb") ::= "thumbnail=", ImageName | "thumb=", ImageName
mw("img_thumbnail") ::= "thumbnail" | "thumb";
mw("img_frame") ::= "framed" | "enframed" | "frame";
mw("img_frameless") ::= "frameless";
ImageOtherParameter ::= ImageParamPage | ImageParamUpright | ImageParamBorder
ImageParamPage ::= mw("img_page")
ImageParamUpright ::= mw("img_upright")
ImageParamBorder ::= mw("img_border")
/* Default settings: */
mw("img_page") ::= "page=$1" | "page $1" ??? (where is this used?)
mw("img_upright") ::= "upright" [, ["=",] PositiveInteger]
mw("img_border") ::= "border"
ImageSizeParameter ::= mw("img_width");
/* Default setting: */
mw("img_width") ::= PositiveNumber "px" ;
ImageAlignParameter ::= ImageAlignLeft | ImageAlignCenter | ImageAlignRight | ImageAlignNone
ImageAlignLeft ::= mw("img_left")
ImageAlignCenter ::= mw("img_center")
ImageAlignRight ::= mw("img_right")
ImageAlignNone ::= mw("img_none")
/* Default settings: */
mw("img_left") ::= "left"
mw("img_center") ::= "center" | "centre"
mw("img_right") ::= "right"
mw("img_none") ::= "none"
ImageValignParameter ::= ImageValignBaseline | ImageValignSub | ImageValignSuper | ImageValignTop
| ImageValignTextTop | ImageValignMiddle | ImageValignBottom | ImageValignTextBottom
ImageValignBaseline ::= mw("img_baseline")
ImageValignSub ::= mw("img_sub")
ImageValignSuper ::= mw("img_super")
ImageValignTop ::= mw("img_top")
ImageValignTextTop ::= mw("img_text_top")
ImageValignMiddle ::= mw("img_middle")
ImageValignBottom ::= mw("img_bottom")
ImageValignTextBottom ::= mw("img_text_bottom")
/* By default: */
mw("img_baseline") ::= "baseline"
mw("img_sub") ::= "sub"
mw("img_super") ::= "super" | "sup"
mw("img_top") ::= "top"
mw("img_text_top") ::= "text-top"
mw("img_middle") ::= "middle"
mw("img_bottom") ::= "bottom"
mw("img_text_bottom") ::= "text-bottom"
Caption ::= <inline-text>
Semantics
- Renders an image inline using the <img> tag.
- It is not an error to specify multiple alignment parameters; the first specified is the one used.
- It is not an error to specify multiple captions; the last specified is the one used.
- The caption has no effect if ThumbImageParameter is not given.
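The "first alignment wins, last caption wins" rules can be illustrated with a small option classifier. This is a sketch only: it uses the default option strings, skips vertical alignment and the other parameters, assumes the caption contains no pipes or nested links, and keeps the first mode parameter (an arbitrary choice, since the text above does not say). All names are made up.

```python
import re

MODES = {"thumbnail", "thumb", "framed", "enframed", "frame", "frameless"}
ALIGNMENTS = {"left", "center", "centre", "right", "none"}
WIDTH = re.compile(r"^(\d+)px$")
IMAGE_LINK = re.compile(r"\[\[Image:([^|\]]+)((?:\|[^|\]]*)*)\]\]")


def parse_image_link(wikitext):
    """Classify the pipe-separated options of a simple inline image link."""
    match = IMAGE_LINK.fullmatch(wikitext)
    if match is None:
        return None
    result = {"name": match.group(1), "mode": None, "align": None,
              "width": None, "caption": None}
    # group(2) is "" or "|opt|opt..."; drop the empty piece before the first pipe.
    for option in match.group(2).split("|")[1:]:
        width = WIDTH.match(option)
        if option in MODES:
            if result["mode"] is None:      # assumption: first mode is kept
                result["mode"] = option
        elif option in ALIGNMENTS:
            if result["align"] is None:     # first alignment is the one used
                result["align"] = option
        elif width:
            result["width"] = int(width.group(1))
        else:
            result["caption"] = option      # last caption is the one used
    return result
```

Anything that is not a recognised keyword falls through to the caption branch, which matches the grammar's Caption ::= <inline-text> being the catch-all option.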
Media
MediaInline ::= "[[" , "Media:" , PageName "." MediaExtension "]]" ;
MediaExtension ::= "ogg" | "wav" ;
Gallery
GalleryBlock ::= "<gallery>" [ NewLine ] GalleryImage { [ NewLine ] GalleryImage } [ NewLine ] "</gallery>" ;
GalleryImage ::= (to be defined: essentially foo.jpg[|caption] )
Remarks:
- The gallery block can technically be used in the middle of a sentence so is not a "special block". It doesn't render particularly nicely when you do that though.