Parsing/Notes/HTML5
This page records some notes and observations about the HTML5 spec and parsing algorithm as a quick / easy reference and will be filled out progressively.
Non-obvious terminology / notes
Content categories
The spec defines a number of content categories. An element can belong to zero or more of them. The list below gives a sense of what each category represents.
- Flow content - pretty much everything except a few elements
- Metadata content - link, meta, ..
- Heading content - h1 - h6, ..
- Sectioning content - article, aside, nav, section
- Embedded content - audio, video, embed, object, etc.
- Interactive content - forms, buttons, and the like
- Phrasing content - all phrasing content is flow content; heading & sectioning content cannot be phrasing content
- Palpable content:
  - elements in this category should provide at least one non-empty text node or audio/video.
  - with this category, the spec effectively discourages empty elements.
  - we may not enforce this in MediaWiki but rely on linting tools to flag scenarios where this might be happening.
- Script-supporting elements: script, template
- Media element: audio, video
- Sectioning roots: blockquote, body, details, dialog, fieldset, figure, td
Observations
- Elements that are Flow but not Phrasing: table, lists, headings, p, div, blockquote, section, figure, header, footer and other uncommon ones. Loosely speaking, this is the block node notion from HTML4.
- Phrasing content is, loosely speaking, the inline node notion from HTML4.
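The two observations above can be sketched as lookups over a category map. The map below is a small hand-picked assumption covering a handful of elements (the name `CATEGORIES` and both helper functions are hypothetical), not the spec's full list.

```python
# Hypothetical, partial category map; only a handful of elements are listed.
CATEGORIES = {
    "div": {"flow", "palpable"},
    "p": {"flow", "palpable"},
    "table": {"flow", "palpable"},
    "h2": {"flow", "heading", "palpable"},
    "section": {"flow", "sectioning", "palpable"},
    "span": {"flow", "phrasing", "palpable"},
    "em": {"flow", "phrasing", "palpable"},
    "br": {"flow", "phrasing"},
    "li": set(),  # belongs to no category; only valid in a list context
}


def is_block_like(tag):
    """Flow but not phrasing: loosely, the HTML4 'block' notion."""
    cats = CATEGORIES.get(tag, set())
    return "flow" in cats and "phrasing" not in cats


def is_inline_like(tag):
    """Phrasing content: loosely, the HTML4 'inline' notion."""
    return "phrasing" in CATEGORIES.get(tag, set())
```

Note that elements such as li fall into neither bucket, which is why context constraints need separate handling.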
Content model
- Transparent content model: elements with this model inherit the content model of their parent; if the parent is transparent too, this effectively resolves to the nearest non-transparent ancestor.
- Nothing content model: no content can be present / nested in these elements
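As a rough sketch of how a transparent content model could be resolved, the function below walks up an ancestor chain until it finds a non-transparent element. The `CONTENT_MODEL` table is a partial, hypothetical stand-in for the reference table on this page, and the fallback to flow for a chain with no non-transparent element follows the spec's rule for transparent elements without a parent.

```python
# Hypothetical, partial content-model table. "transparent" marks elements
# that take their content model from their parent.
CONTENT_MODEL = {
    "body": "flow",
    "div": "flow",
    "p": "phrasing",
    "span": "phrasing",
    "a": "transparent",
    "ins": "transparent",
    "del": "transparent",
    "br": "nothing",
    "hr": "nothing",
}


def effective_content_model(ancestors):
    """Resolve the content model of the innermost element.

    `ancestors` lists tag names from the outermost element down to the
    element being queried, e.g. ["body", "p", "a"].
    """
    for tag in reversed(ancestors):
        model = CONTENT_MODEL[tag]
        if model != "transparent":
            return model
    # Every element in the chain is transparent: treat as flow.
    return "flow"
```

For example, an a inside a p only accepts phrasing content, while an a inside a div accepts flow content.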
Paragraphs
- A paragraph in HTML5 is a structural concept, not a semantic / logical one.
- Runs of phrasing content form paragraphs. In other words, p-tags can only contain phrasing content.
- The closing </p> can be omitted when the p element is immediately followed by one of a set of tags listed in the spec. I imagine this is just grandfathering in the HTML seen in the wild.
- It is not required to add p tags around runs of phrasing content that form paragraphs, but it is better to add them for clarity and to avoid edge cases in rendering. We'll probably always add them in MediaWiki.
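A minimal sketch of "always add p tags around runs of phrasing content", assuming a toy tree representation that is not how MediaWiki or Parsoid represent the DOM: a node is either a text string or a `(tag, children)` tuple. The `PHRASING` set is a partial, hypothetical list.

```python
from itertools import groupby

# Hypothetical, partial set of phrasing elements.
PHRASING = {"a", "b", "i", "em", "strong", "span", "br"}


def is_phrasing(node):
    """Text (a str) and known phrasing elements count as phrasing content."""
    if isinstance(node, str):
        return True
    tag, _children = node
    return tag in PHRASING


def wrap_paragraphs(children):
    """Wrap each run of phrasing content among `children` in a p element."""
    out = []
    for phrasing, run in groupby(children, key=is_phrasing):
        run = list(run)
        if not phrasing:
            out.extend(run)  # block-level nodes pass through unchanged
        elif any(not isinstance(n, str) or n.strip() for n in run):
            out.append(("p", run))
        else:
            out.extend(run)  # whitespace-only run: leave it alone
    return out
```

Whitespace-only runs are left unwrapped so that the result does not contain empty paragraphs, in line with the palpable-content guidance above.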
Composition Spec notes
- For each element, build a map of the context in which the node can show up and the content model it expects. There is clearly a hierarchical relation here: the content model of a node determines the context in which its children can show up, so these constraints should line up properly.
- This map can probably be used to come up with a set of composition rules / spec when document fragments need to be composed into a final document.
- The HTML5 parsing algorithm specifies a fragment parsing mode that can handle this scenario, but we are then left to the whims of what the parsing algorithm does instead of specifying what we would like the behavior to be. For example, we might handle a-in-a differently than what the fragment parsing algorithm would do.
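One way such a map might look, purely as a sketch: each entry records an element's categories, its content model, and any excluded descendants, and a checker reports why a child would not fit under a given ancestor chain. The `ELEMENTS` table and the `violations` function are hypothetical and cover only a few elements.

```python
# Hypothetical, partial map:
# element -> (categories, content model, excluded descendants)
ELEMENTS = {
    "div": ({"flow"}, "flow", set()),
    "p": ({"flow"}, "phrasing", set()),
    "h2": ({"flow", "heading"}, "phrasing", set()),
    "span": ({"flow", "phrasing"}, "phrasing", set()),
    "a": ({"flow", "phrasing"}, "transparent", {"a"}),
    "nav": ({"flow", "sectioning"}, "flow", {"main"}),
    "main": ({"flow"}, "flow", set()),
}


def violations(path, child):
    """List the reasons `child` may not be placed under `path`.

    `path` holds ancestor tag names from outermost to innermost.
    """
    problems = []
    # Resolve the effective content model, skipping transparent elements.
    model = "flow"
    for tag in reversed(path):
        if ELEMENTS[tag][1] != "transparent":
            model = ELEMENTS[tag][1]
            break
    if model not in ELEMENTS[child][0]:
        problems.append(f"{child} is not {model} content")
    # Custom exclusions apply to all descendants, not just direct children.
    for tag in path:
        if child in ELEMENTS[tag][2]:
            problems.append(f"{child} is not allowed inside {tag}")
    return problems
```

A composition spec could then attach a policy (strip, split, or reject) to each kind of violation instead of deferring to whatever the fragment parsing algorithm does.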
Composition constraints
One of the things to work out with the balanced templates RFC and the Wikitext 2.0 proposal is how to properly compose fragments to yield a well-formed, spec-conformant document. Note that since we have well-formed DOM fragments, we don't need to worry about the parts of the HTML parsing algorithm that deal with unclosed or misnested tags. We only need to worry about the content model constraints.
Looking at the table below, the following is a summary of composition constraints (partial since it only covers a largish subset of elements):
- Elements that only accept phrasing content: h1 - h6, p, and pretty much all the text-level elements (span, i, b, em, strong, small, sup, sub, etc. -- see section 4.5 in the table below). We have two options here:
  - Strip non-phrasing tags from the content: this seems the right approach for h1 - h6 tags.
  - Split the parent node to ensure constraints are satisfied: this seems the right approach for p and text-level elements.
- Custom exclusions / constraints: no a-tags inside a; no table-tags inside caption; no main inside nav, aside, ...; etc.
  - The best solution here is to strip the offending tags from the fragment. So, if an a-tag is used inside another a-tag, the inner a-tag is stripped out. An alternative is to convert the a-tag to text. In either case, the a-tag itself is removed. This has an impact on real use cases on Wikipedias. Wikitext like
    [http://website.com Company with [[Website WikiPage]] here]
    is found on wikis, and it leads to broken rendering for reads and headaches for Parsoid for editing and round-tripping. The solution proposed here is a better, uniform one.
- Constraint on insertion context: li inside ol/ul, td/th inside tr, etc. Some possibilities are listed below. Option 3 seems like the best approach.
  1. Suppress the fragment content entirely: might work for some cases, but probably not a good idea.
  2. Insert the required tags, i.e., insert a ul-tag or a table-tag as necessary: unclear that this is a good solution.
  3. Strip just the offending tags, i.e.
     <td>x <i>y</i> z</td>
     is converted to
     x <i>y</i> z
- Deviations from content-model and context constraints:
  - It looks like the HTML5 parser does not enforce content model constraints in some cases. Try parsing
    <pre>a <ol><li>x</li></ol>y</pre>
    The parser allows Flow content inside the pre tag, which violates what the spec says is the normative content model of a pre tag. Since wikitext overrides the <pre> tag as a native extension with wikitext semantics, we don't have to deal with this in MediaWiki: an HTML pre tag can never show up in wikitext.
  - It lets the li tag be used outside a list. Try parsing
    <li>x</li>
    The list item is allowed to exist outside a list. To be clear, the spec does say that context requirements are non-normative, so there is that.
So, overall it looks like we can come up with a fairly reasonable set of fragment composition rules based on common sense notions (derived from the HTML5 spec). Within the wikitext markup spec, we might even specify exceptions / minor variations from the spec if it aids reasoning and/or eliminates edge cases.
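As an illustration of the "strip the offending tags" rule for a-in-a, here is a minimal sketch over a toy tree in which a node is either a text string or a `(tag, children)` tuple. The representation and the `strip_nested` helper are assumptions made for this example, not Parsoid's actual data structures or API.

```python
def strip_nested(node, tag, inside=False):
    """Unwrap every `tag` element that is nested inside another `tag`.

    Returns a list of nodes that replaces `node`. The offending element
    is dropped, but its content is kept in place.
    """
    if isinstance(node, str):
        return [node]
    name, children = node
    is_target = name == tag
    new_children = []
    for child in children:
        new_children.extend(strip_nested(child, tag, inside or is_target))
    if is_target and inside:
        return new_children  # drop the tag itself, keep what it contained
    return [(name, new_children)]


# The external-link example from above: an inner wikilink inside an a-tag.
link = ("a", ["Company with ", ("a", ["Website WikiPage"]), " here"])
result = strip_nested(link, "a")
```

Here `result` is a single a element whose children are three text nodes. The same traversal could handle other custom exclusions (for example main inside nav) by passing the excluding ancestor and the excluded tag separately.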
Quick reference table of HTML5 elements and their content model
Element | Content categories | Context (Where can this element be used?) | Content Model (What elements can be used in its DOM tree) |
---|---|---|---|
4.2 Document | |||
html | None | document's doc element / wherever a fragment is allowed | head followed by body |
head | None | First elt of html | 1+ elts of metadata with 1 title and <= 1 base elt |
title | Metadata | In head without other titles | Text that is not inter-element whitespace (IEW) |
base | Metadata | In head without other bases | Nothing |
link | Metadata; If allowed in body flow & phrasing | metadata OR noscript OR phrasing | Nothing |
meta | Metadata; flow & phrasing if itemprop is present | .. complicated .. | Nothing |
style | Metadata | Metadata content | .. complicated .. |
4.3 Sections | |||
body | Sectioning root | second elt of html | Flow |
article | Flow, Sectioning, Palpable | Flow | Flow |
section | Flow, Sectioning, Palpable | Flow | Flow |
nav | Flow, Sectioning, Palpable | Flow | Flow - {main} |
aside | Flow, Sectioning, Palpable | Flow | Flow - {main} |
h1 - h6 | Flow, Heading, Palpable | Flow, child of hgroup | Phrasing |
hgroup | Flow, Heading, Palpable | Flow | zero or more h1..h6, template |
header | Flow, Palpable | Flow | Flow - {header,footer,main} |
footer | Flow, Palpable | Flow | Flow - {header,footer,main} |
address | Flow, Palpable | Flow | Flow - {header,footer,address} - Heading - Sectioning |
4.4 Grouping content | |||
p | Flow, Palpable | Flow | Phrasing |
hr | Flow | Flow | Nothing |
pre | Flow, Palpable | Flow | Phrasing |
blockquote | Flow, Sectioning root, Palpable | Flow | Flow |
ol | Flow, Palpable if li present | Flow | >= 0 li and script-supporting |
ul | Flow, Palpable if li present | Flow | >= 0 li and script-supporting |
li | None | In ol, ul and <menu type='toolbar'> | Flow |
dl | Flow | Flow | >= 0 groups of [dt+, dd+] |
dt | None | Before dd or dt inside dl | Flow - {header,footer} - Sectioning - Heading |
dd | None | After dd or dt inside dl | Flow |
figure | Flow, Sectioning root, Palpable | Flow | Flow with optional figcaption before/after the flow content |
figcaption | None | First/Last child of figure | Flow |
main | Flow, Palpable | Flow | Flow |
div | Flow, Palpable | Flow | Flow |
4.5 Text-level | |||
a | Flow, Phrasing, Palpable | Phrasing | Transparent, No interactive or a |
em | Flow, Phrasing, Palpable | Phrasing | Phrasing |
strong | Flow, Phrasing, Palpable | Phrasing | Phrasing |
small | Flow, Phrasing, Palpable | Phrasing | Phrasing |
s | Flow, Phrasing, Palpable | Phrasing | Phrasing |
cite | Flow, Phrasing, Palpable | Phrasing | Phrasing |
q | Flow, Phrasing, Palpable | Phrasing | Phrasing |
dfn | Flow, Phrasing, Palpable | Phrasing | Phrasing |
abbr | Flow, Phrasing, Palpable | Phrasing | Phrasing |
ruby | Flow, Phrasing, Palpable | Phrasing | .. complicated .. |
rt | None | child of ruby | Phrasing |
rp | None | child of ruby immediately before/after rt | Text |
data | Flow, Phrasing, Palpable | Phrasing | Phrasing |
time | Flow, Phrasing, Palpable | Phrasing | Phrasing if datetime attr present; otherwise constrained text (see spec for details) |
code | Flow, Phrasing, Palpable | Phrasing | Phrasing |
var | Flow, Phrasing, Palpable | Phrasing | Phrasing |
samp | Flow, Phrasing, Palpable | Phrasing | Phrasing |
kbd | Flow, Phrasing, Palpable | Phrasing | Phrasing |
sub | Flow, Phrasing, Palpable | Phrasing | Phrasing |
sup | Flow, Phrasing, Palpable | Phrasing | Phrasing |
i | Flow, Phrasing, Palpable | Phrasing | Phrasing |
b | Flow, Phrasing, Palpable | Phrasing | Phrasing |
u | Flow, Phrasing, Palpable | Phrasing | Phrasing |
mark | Flow, Phrasing, Palpable | Phrasing | Phrasing |
bdi | Flow, Phrasing, Palpable | Phrasing | Phrasing |
bdo | Flow, Phrasing, Palpable | Phrasing | Phrasing |
span | Flow, Phrasing, Palpable | Phrasing | Phrasing |
br | Flow, Phrasing | Phrasing | Nothing |
wbr | Flow, Phrasing | Phrasing | Nothing |
4.7 Edits | |||
ins | Flow, Phrasing, Palpable | Phrasing | Transparent |
del | Flow, Phrasing | Phrasing | Transparent |
4.8 Embedded | |||
picture | Flow, Phrasing, Embedded | Embedded | 0+ source tags followed by img optionally intermixed with script-supporting elements |
source | None | child of picture, before img; child of a media elt before Flow or track elements | Nothing |
img | Flow, Phrasing, Embedded, Form-associated, Interactive?, Palpable | Embedded | Nothing |
iframe | Flow, Phrasing, Embedded, Interactive, Palpable | Embedded | Text with constraints (see spec for details) |
embed | Flow, Phrasing, Embedded, Interactive, Palpable | Embedded | Nothing |
object | Flow, Phrasing, Embedded, Interactive?, Palpable, Listed & submittable form-associated elt | Embedded | 0+ param followed by transparent |
param | None | child of object before Flow | Nothing |
video | Flow, Phrasing, Embedded, Interactive?, Palpable | Embedded | .. complicated .. |
audio | Flow, Phrasing, Embedded, Interactive?, Palpable? | Embedded | .. complicated .. |
track | None | child of media element before Flow | Nothing |
map | Flow, Phrasing, Palpable | Phrasing | Transparent |
area | Flow, Phrasing | Phrasing, but within a map ancestor | Nothing |
4.9 Tabular data | |||
table | Flow, Palpable | Flow | caption?, colgroup*, thead?, (tbody* OR tr+), tfoot?, intermixed with optional script-supporting elements |
caption | None | first element of table | Flow - {table} |
colgroup | None | child of table, after caption, before thead, tbody, tr, tfoot | Nothing if span attr is present; 0+ col and template if not |
col | None | child of colgroup without a span attr | Nothing |
tbody | None | child of table, after caption, colgroup, thead, but only if the table has no tr children | 0+ tr and script-supporting elements |
thead | None | child of table, after caption, colgroup, before tbody, tfoot, tr; No other thead allowed | 0+ tr and script-supporting elements |
tfoot | None | child of table, after caption, colgroup, thead, tbody, tr; No other tfoot allowed | 0+ tr and script-supporting elements |
tr | None | child of thead, tbody, tfoot; child of table after caption, colgroup and thead, but only if there are no tbody | 0+ td, th and script-supporting elements |
td | Sectioning root | child of tr | Flow |
th | None | child of tr | Flow - {header, footer} - Sectioning - Heading |
4.10 Form | |||
.. skipped .. | |||
4.11 Interaction | |||
.. skipped .. | |||
4.12 Scripting | |||
script | |||
template | .. | .. | contents have no conformance requirements |