Parsing/Notes/HTML5

This page records some notes and observations about the HTML5 spec and parsing algorithm as a quick / easy reference and will be filled out progressively.

Non-obvious terminology / notes

Content categories

The spec defines a bunch of content categories. Elements can belong to zero or more categories. The list below should give you a sense of what the categories represent.

Flow content - pretty much everything except a few elements
Metadata content - link, meta, ..
Heading content - h1 - h6, ..
Sectioning content - article, aside, nav, section
Embedded content - audio, video, embed, object, etc.
Interactive content - form controls (button, input, select, textarea), links, and the like
Phrasing content - all phrasing content is flow content; heading & sectioning content cannot be phrasing content

Palpable content:
- elements in this category should provide at least one non-empty text node or audio/video.
- with this category, the spec effectively discourages empty elements
- we may not enforce this in MediaWiki but rely on linting tools to flag scenarios where this might be happening
Script-supporting elements: script, template
Media element: audio, video
Sectioning roots: blockquote, body, details, dialog, fieldset, figure, td

Observations

Elements that are Flow but not Phrasing: table, lists, headings, p, div, blockquote, section, figure, header, footer and other uncommon ones. Loosely speaking, this is the block node notion from HTML4.
Phrasing content is, loosely speaking, the inline node notion from HTML4.
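
As a rough illustration of the two observations above, category membership can be modeled as sets, with the block-like elements falling out as a set difference. The sets below are a small hand-picked subset rather than the spec's full lists, and the helper name is made up for this sketch.

```python
# Partial, illustrative category membership; NOT the spec's complete lists.
FLOW = {
    "a", "b", "blockquote", "div", "em", "figure", "h1", "i",
    "ol", "p", "section", "span", "table", "ul",
}
PHRASING = {"a", "b", "em", "i", "span"}

# All phrasing content is flow content.
assert PHRASING <= FLOW


def is_block_like(tag: str) -> bool:
    """Flow but not phrasing: loosely, the HTML4 block node notion."""
    return tag in FLOW and tag not in PHRASING


# Elements that are Flow but not Phrasing, within this partial subset.
print(sorted(FLOW - PHRASING))
```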

Content model

Transparent content model: an element with a transparent content model takes its content model from its parent, or from its nearest non-transparent ancestor if the parent is itself transparent.
Nothing content model: no content can be present / nested in these elements
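
A minimal sketch of how a transparent content model might be resolved: walk up the ancestors until a non-transparent element is found. The Node class, the CONTENT_MODEL map and the function name are hypothetical, and the map covers only a handful of elements. Falling back to flow for a transparent element without a parent follows our reading of the spec.

```python
from dataclasses import dataclass
from typing import Optional

TRANSPARENT = "transparent"

# Partial, illustrative content models keyed by tag name.
CONTENT_MODEL = {
    "div": "flow",
    "p": "phrasing",
    "a": TRANSPARENT,
    "ins": TRANSPARENT,
    "del": TRANSPARENT,
    "br": "nothing",
}


@dataclass
class Node:
    tag: str
    parent: Optional["Node"] = None


def effective_content_model(node: Node) -> str:
    """Return the content model that actually applies to node's children."""
    current: Optional[Node] = node
    while current is not None:
        model = CONTENT_MODEL[current.tag]
        if model != TRANSPARENT:
            return model
        current = current.parent
    # Assumption: a transparent element with no parent accepts flow content.
    return "flow"


div = Node("div")
link_in_ins = Node("a", parent=Node("ins", parent=div))
print(effective_content_model(link_in_ins))
```

Here the a tag sits inside an ins tag inside a div, so both transparent elements are skipped and the div's content model applies.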

Paragraphs

Paragraphs in HTML5 are a structural concept, not a semantic / logical one.
Runs of phrasing content form paragraphs. In other words, p-tags can only contain phrasing content.
The p end tag can be omitted if the p element is followed by one of a set of tags. I imagine this is just grandfathering in the HTML seen in the wild.
It is not required to add p tags around runs of phrasing content that form paragraphs. But it is better to add them for clarity and to avoid edge cases in rendering. We'll probably always add them in MediaWiki.
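
A rough sketch of what "always adding p tags" could look like, assuming a toy tree representation where text is a str and an element is a (tag, children) tuple. The PHRASING set is partial, the function names are made up, and leaving whitespace-only runs unwrapped is our assumption rather than something the spec or MediaWiki prescribes.

```python
# Partial, illustrative list of phrasing elements.
PHRASING = {"a", "b", "em", "i", "span", "strong"}


def is_phrasing(item) -> bool:
    """Text nodes (str) and known phrasing elements count as phrasing."""
    if isinstance(item, str):
        return True
    tag, _children = item
    return tag in PHRASING


def wrap_paragraphs(children):
    """Wrap each run of phrasing content in an explicit p element."""
    out = []
    run = []

    def flush():
        if not run:
            return
        if all(isinstance(x, str) and not x.strip() for x in run):
            # Assumption: whitespace-only runs are not paragraphs.
            out.extend(run)
        else:
            out.append(("p", list(run)))
        run.clear()

    for item in children:
        if is_phrasing(item):
            run.append(item)
        else:
            flush()
            out.append(item)
    flush()
    return out


print(wrap_paragraphs(["Hello ", ("b", ["world"]), ("div", ["block"]), "tail"]))
```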

Composition Spec notes

For each element, build a map of the contexts in which it can show up and the content model it expects. There is clearly a hierarchical relation here. The content model for a node determines the context in which children can show up. So, these constraints should line up properly.
This map can probably be used to come up with a set of composition rules / spec when document fragments need to be composed into a final document.
The HTML5 parsing algorithm specifies a fragment parsing mode that can handle this scenario, but we are then left to the whims of what the parsing algorithm does instead of specifying what we would like the behavior to be. For example, we might handle a-in-a differently than what the fragment parsing algorithm would do.
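
One possible shape for such a map, as a sketch only. ElementSpec, SPECS and can_contain are hypothetical names, the entries are a tiny subset of the table further down, and exclusions are checked only against direct children here even though the spec's exclusions apply to all descendants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementSpec:
    categories: frozenset            # categories the element belongs to
    content: str                     # category its children must belong to
    excluded: frozenset = frozenset()  # tags not allowed despite matching


# Partial, illustrative entries.
SPECS = {
    "div": ElementSpec(frozenset({"flow"}), "flow"),
    "p": ElementSpec(frozenset({"flow"}), "phrasing"),
    "span": ElementSpec(frozenset({"flow", "phrasing"}), "phrasing"),
    "nav": ElementSpec(frozenset({"flow", "sectioning"}), "flow", frozenset({"main"})),
    "main": ElementSpec(frozenset({"flow"}), "flow"),
    "hr": ElementSpec(frozenset({"flow"}), "nothing"),
}


def can_contain(parent: str, child: str) -> bool:
    """True if child's categories satisfy parent's content model."""
    parent_spec, child_spec = SPECS[parent], SPECS[child]
    if child in parent_spec.excluded:
        return False
    # "nothing" never appears in any category set, so it always fails.
    return parent_spec.content in child_spec.categories


print(can_contain("div", "p"), can_contain("p", "div"))
```

The point of the sketch is that the parent's content model and the child's categories are checked against each other, which is the "lining up" of constraints described above.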

Composition constraints

One of the things to work out with the balanced templates RFC and the Wikitext 2.0 proposal is how to compose fragments so that they yield a well-formed, spec-conformant document. Note that since we have well-formed DOM fragments, we don't need to worry about the parts of the HTML parsing algorithm that deal with unclosed or misnested tags. We only need to worry about the content model constraints.

Based on the table below, here is a summary of composition constraints (partial, since it covers only a largish subset of elements):

Elements that only accept phrasing content: h1 - h6, p, pretty much all the text-content elements (span, i, b, em, strong, small, sup, sub, etc. -- see section 4.5 in the table below). We have two options here:
1. strip non-phrasing tags from the content: This seems the right approach for h1 - h6 tags
2. split the parent node to ensure constraints are satisfied: This seems the right approach for p and text-content elements
Custom exclusions / constraints: No a-tags inside a; no table-tags inside caption; No main inside nav, aside, ... ; etc.
- The best solution here is to strip the offending tags from the fragment. So, if an a-tag is used inside another a-tag, the inner a-tag is stripped out. An alternative is to convert the inner a-tag to text. In either case, the inner a-tag itself is removed. This has an impact on real use cases on Wikipedias: markup like [http://website.com Company with [[Website WikiPage]] here] is found on wikis, and it leads to broken rendering for reads and headaches for Parsoid for editing and round-tripping. The solution proposed here is a better, uniform one.
Constraint on insertion context: li inside ol/ul, td/th inside tr, etc. Some possibilities are listed below; option 3 seems like the best approach.
1. Suppress the fragment content entirely: Might work for some cases, but probably not a good idea.
2. Insert necessary required tags, i.e., insert a ul-tag or a table tag as necessary: Unclear that this is a good solution.
3. Strip just the offending tags, i.e. <td>x y z</td> is converted to x y z
Deviations from content-model and context constraint:
- It looks like the HTML5 parser does not enforce content model constraints in some cases. Try parsing <pre>a <ol><li>x</li></ol>y</pre>. The parser allows Flow content inside the pre tag, which violates what the spec says is the normative content model of a pre tag. Since wikitext overrides the <pre> tag as a native extension with wikitext semantics, we don't have to deal with this in MediaWiki: an HTML pre tag can never show up in wikitext.
- It lets the li tag be used outside a list. Try parsing <li>x</li>. The list item is allowed to exist outside a list. To be clear, the spec does say that context requirements are non-normative, so there is that.
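
The tag-stripping options above (nested a-tags, and option 3 for insertion context) can be sketched as a single tree walk. This assumes a toy representation where text is a str and an element is a (tag, children) tuple; REQUIRED_PARENTS and compose are hypothetical names, the parent map is partial, and keeping the content of a stripped tag is an assumption based on the <td>x y z</td> example.

```python
# Partial, illustrative map: tag -> parents it is allowed to appear in.
REQUIRED_PARENTS = {"li": {"ol", "ul"}, "td": {"tr"}, "th": {"tr"}}


def compose(children, parent_tag, inside_a=False):
    """Return children with offending tags stripped and their content kept."""
    out = []
    for item in children:
        if isinstance(item, str):
            out.append(item)
            continue
        tag, kids = item
        nested_a = tag == "a" and inside_a
        wrong_context = (
            tag in REQUIRED_PARENTS and parent_tag not in REQUIRED_PARENTS[tag]
        )
        if nested_a or wrong_context:
            # Strip just the tag; its content is re-checked against the
            # same parent the stripped tag had.
            out.extend(compose(kids, parent_tag, inside_a))
        else:
            out.append((tag, compose(kids, tag, inside_a or tag == "a")))
    return out


fragment = [("a", ["Company with ", ("a", ["Website WikiPage"]), " here"])]
print(compose(fragment, "p"))
print(compose([("td", ["x y z"])], "div"))
```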

So, overall it looks like we can come up with a fairly reasonable set of fragment composition rules based on common sense notions (derived from the HTML5 spec). Within the wikitext markup spec, we might even specify exceptions / minor variations from the spec if it aids reasoning and/or eliminates edge cases.

Quick reference table of HTML5 elements and their content model

Element	Content categories	Context (Where can this element be used?)	Content Model (What elements can be used in its DOM tree)
4.2 Document
html	None	document's doc element / wherever a fragment is allowed	head followed by body
head	None	First elt of html	1+ elts of metadata with 1 title and <= 1 base elt
title	Metadata	In head without other titles	Text that is not inter-element whitespace (IEW).
base	Metadata	In head without other bases	Nothing
link	Metadata; If allowed in body flow & phrasing	metadata OR noscript OR phrasing	Nothing
meta	Metadata; flow & phrasing if itemprop is present	.. complicated ..	Nothing
style	Metadata	Metadata content	.. complicated ..
4.3 Sections
body	Sectioning root	second elt of html	Flow
article	Flow, Sectioning, Palpable	Flow	Flow
section	Flow, Sectioning, Palpable	Flow	Flow
nav	Flow, Sectioning, Palpable	Flow	Flow - {main}
aside	Flow, Sectioning, Palpable	Flow	Flow - {main}
h1 - h6	Flow, Heading, Palpable	Flow, child of hgroup	Phrasing
hgroup	Flow, Heading, Palpable	Flow	zero or more h1..h6, template
header	Flow, Palpable	Flow	Flow - {header,footer,main}
footer	Flow, Palpable	Flow	Flow - {header,footer,main}
address	Flow, Palpable	Flow	Flow - {header,footer,main} - Heading - Sectioning
4.4 Grouping content
p	Flow, Palpable	Flow	Phrasing
hr	Flow	Flow	Nothing
pre	Flow, Palpable	Flow	Phrasing
blockquote	Flow, Sectioning root, Palpable	Flow	Flow
ol	Flow, Palpable if li present	Flow	>= 0 li and script-supporting
ul	Flow, Palpable if li present	Flow	>= 0 li and script-supporting
li	None	In ol, ul and <menu type='toolbar'>	Flow
dl	Flow	Flow	>= 0 groups of [dt+, dd+]
dt	None	Before dd or dt inside dl	Flow - {header,footer} - Sectioning - Heading
dd	None	After dd or dt inside dl	Flow
figure	Flow, Sectioning root, Palpable	Flow	Flow with optional figcaption before/after the flow content
figcaption	None	First/Last child of figure	Flow
main	Flow, Palpable	Flow	Flow
div	Flow, Palpable	Flow	Flow
4.5 Text-level
a	Flow, Phrasing, Palpable	Phrasing	Transparent, No interactive or a
em	Flow, Phrasing, Palpable	Phrasing	Phrasing
strong	Flow, Phrasing, Palpable	Phrasing	Phrasing
small	Flow, Phrasing, Palpable	Phrasing	Phrasing
s	Flow, Phrasing, Palpable	Phrasing	Phrasing
cite	Flow, Phrasing, Palpable	Phrasing	Phrasing
q	Flow, Phrasing, Palpable	Phrasing	Phrasing
dfn	Flow, Phrasing, Palpable	Phrasing	Phrasing
abbr	Flow, Phrasing, Palpable	Phrasing	Phrasing
ruby	Flow, Phrasing, Palpable	Phrasing	.. complicated ..
rt	None	child of ruby	Phrasing
rp	None	child of ruby immediately before/after rt	Text
data	Flow, Phrasing, Palpable	Phrasing	Phrasing
time	Flow, Phrasing, Palpable	Phrasing	Phrasing if datetime attr present, constrained text otherwise (see spec for details)
code	Flow, Phrasing, Palpable	Phrasing	Phrasing
var	Flow, Phrasing, Palpable	Phrasing	Phrasing
samp	Flow, Phrasing, Palpable	Phrasing	Phrasing
kbd	Flow, Phrasing, Palpable	Phrasing	Phrasing
sub	Flow, Phrasing, Palpable	Phrasing	Phrasing
sup	Flow, Phrasing, Palpable	Phrasing	Phrasing
i	Flow, Phrasing, Palpable	Phrasing	Phrasing
b	Flow, Phrasing, Palpable	Phrasing	Phrasing
u	Flow, Phrasing, Palpable	Phrasing	Phrasing
mark	Flow, Phrasing, Palpable	Phrasing	Phrasing
bdi	Flow, Phrasing, Palpable	Phrasing	Phrasing
bdo	Flow, Phrasing, Palpable	Phrasing	Phrasing
span	Flow, Phrasing, Palpable	Phrasing	Phrasing
br	Flow, Phrasing	Phrasing	Nothing
wbr	Flow, Phrasing	Phrasing	Nothing
4.7 Edits
ins	Flow, Phrasing, Palpable	Phrasing	Transparent
del	Flow, Phrasing	Phrasing	Transparent
4.8 Embedded
picture	Flow, Phrasing, Embedded	Embedded	0+ source tags followed by img optionally intermixed with script-supporting elements
source	None	child of picture, before img; child of a media elt before Flow or track elements	Nothing
img	Flow, Phrasing, Embedded, Form-associated, Interactive?, Palpable	Embedded	Nothing
iframe	Flow, Phrasing, Embedded, Interactive, Palpable	Embedded	Text with constraints (see spec for details)
embed	Flow, Phrasing, Embedded, Interactive, Palpable	Embedded	Nothing
object	Flow, Phrasing, Embedded, Interactive?, Palpable, Listed & submittable form-associated elt	Embedded	0+ param followed by transparent
param	None	child of object before Flow	Nothing
video	Flow, Phrasing, Embedded, Interactive?, Palpable	Embedded	.. complicated ..
audio	Flow, Phrasing, Embedded, Interactive?, Palpable?	Embedded	.. complicated ..
track	None	child of media element before Flow	Nothing
map	Flow, Phrasing, Palpable	Phrasing	Transparent
area	Flow, Phrasing	Phrasing, but within a map ancestor	Nothing
4.9 Tabular data
table	Flow, Palpable	Flow	caption?, colgroup*, thead?, (tbody* OR tr+), tfoot?, intermixed with optional script-supporting elements
caption	None	first element of table	Flow - {table}
colgroup	None	child of table, after caption, before thead, tbody, tr, tfoot	Nothing if span attr is present; 0+ col and template if not
col	None	child of colgroup without a span attr	Nothing
tbody	None	child of table, after caption, colgroup, thead, but only if there are no tr children of the table	0+ tr and script-supporting elements
thead	None	child of table, after caption, colgroup, before tbody, tfoot, tr; No other thead allowed	0+ tr and script-supporting elements
tfoot	None	child of table, after caption, colgroup, thead, tbody, tr; No other tfoot allowed	0+ tr and script-supporting elements
tr	None	child of thead, tbody, tfoot; child of table after caption, colgroup and thead, but only if there are no tbody	0+ td, th and script-supporting elements
td	Sectioning root	child of tr	Flow
th	None	child of tr	Flow - {header, footer} - Sectioning - Heading
4.10 Form
.. skipped ..
4.11 Interaction
.. skipped ..
4.12 Scripting
script
template	..	..	content has no conformance requirements.
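
As a sketch of how the summary in "Composition constraints" could be derived mechanically, the rows of the table above can be loaded into a lookup and queried. The rows below are copied from the table; parse_rows and accepts_only_phrasing are hypothetical helper names, and rows without all four columns (section headings, skipped entries) are simply ignored.

```python
# A few rows copied from the table above. Columns are tab-separated:
# element, content categories, context, content model.
ROWS = [
    "4.4 Grouping content",
    "p\tFlow, Palpable\tFlow\tPhrasing",
    "hr\tFlow\tFlow\tNothing",
    "div\tFlow, Palpable\tFlow\tFlow",
    "em\tFlow, Phrasing, Palpable\tPhrasing\tPhrasing",
    "span\tFlow, Phrasing, Palpable\tPhrasing\tPhrasing",
]


def parse_rows(rows):
    """Build {element: {categories, context, content}} from table rows."""
    table = {}
    for row in rows:
        fields = row.split("\t")
        if len(fields) != 4:
            continue  # section heading or incomplete row
        element, categories, context, content = fields
        table[element] = {
            "categories": [c.strip() for c in categories.split(",")],
            "context": context,
            "content": content,
        }
    return table


def accepts_only_phrasing(table):
    """Elements whose content model is exactly Phrasing."""
    return sorted(tag for tag, row in table.items() if row["content"] == "Phrasing")


TABLE = parse_rows(ROWS)
print(accepts_only_phrasing(TABLE))
```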