Parsing/Notes/HTML5 Compliance

Sources of non-compliance

There are at least 3 sources of non-compliance in MediaWiki output.

1. Use of obsolete tags from HTML4 (e.g. big, font)
2. Use of obsolete tag attributes (e.g. bgcolor)
3. Violation of content model constraints (e.g. <span><div>..</div></span> or <small><ul><li>..</li></ul></small>); see T200562
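As a rough illustration of the first two sources, the sketch below scans an HTML string for obsolete tags and attributes using Python's standard html.parser. The two sets are small illustrative subsets chosen for this example, not the full lists from the HTML5 spec, and find_obsolete_markup is a hypothetical helper name, not an existing MediaWiki or Linter API.

```python
from html.parser import HTMLParser

# Illustrative subsets only (assumption); the HTML5 spec lists more.
OBSOLETE_TAGS = {"big", "font", "center", "strike", "tt"}
OBSOLETE_ATTRS = {"bgcolor", "align", "valign", "cellpadding", "cellspacing"}


class ObsoleteMarkupFinder(HTMLParser):
    """Collects obsolete tags and attributes in document order."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag in OBSOLETE_TAGS:
            self.findings.append(("tag", tag))
        for name, _value in attrs:
            if name in OBSOLETE_ATTRS:
                self.findings.append(("attr", f"{tag}@{name}"))


def find_obsolete_markup(html):
    """Hypothetical helper: return a list of (kind, name) findings."""
    finder = ObsoleteMarkupFinder()
    finder.feed(html)
    finder.close()
    return finder.findings
```

For instance, feeding it a table with a bgcolor attribute and a nested <big> reports both, which is roughly the kind of signal a lint pass would surface.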

This affects the output of the PHP parser as well as Parsoid.

Fixing non-compliance

To fix 1, one option is to use the Linter extension to flag these tags as deprecated and to change the wikitext and visual editors so that their interfaces no longer offer buttons for them (e.g. phab:T40487). For example, both WikiEditor and VisualEditor emit <big> tags. An alternative is to treat these tags as special in wikitext and have the sanitizer emit <span> tags with classes or inline styles instead (see T154067).
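The second option (emitting <span> with a class in place of an obsolete tag) could look roughly like the sketch below. It is not how the MediaWiki sanitizer is implemented: the class names in TAG_TO_CLASS and the function rewrite_obsolete_tags are made up for illustration, and the sketch assumes well-formed input without comments, <script> or <style> content, which it does not preserve.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical mapping; these class names are invented for this example.
TAG_TO_CLASS = {"big": "mw-legacy-big", "tt": "mw-legacy-tt"}


class ObsoleteTagRewriter(HTMLParser):
    """Re-serializes HTML, replacing mapped obsolete tags with <span class=...>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        attrs = list(attrs)
        if tag in TAG_TO_CLASS:
            # Merge any existing class value with the replacement class.
            classes = [v for k, v in attrs if k == "class" and v]
            attrs = [(k, v) for k, v in attrs if k != "class"]
            attrs.append(("class", " ".join(classes + [TAG_TO_CLASS[tag]])))
            tag = "span"
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_startendtag(self, tag, attrs):
        # Emit "<br/>"-style tags once, without a separate end tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append(f"</{'span' if tag in TAG_TO_CLASS else tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape text.
        self.out.append(escape(data, quote=False))


def rewrite_obsolete_tags(html):
    """Hypothetical helper: return html with mapped obsolete tags rewritten."""
    rewriter = ObsoleteTagRewriter()
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.out)
```

Using a class rather than an inline style leaves the actual presentation to a stylesheet, which is one reason to prefer it; whether classes or inline styles are the better fit is the open question in the paragraph above.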

To fix 2, similar options exist: use the Linter extension to deprecate use of these attributes (T173944) and, where necessary, have the sanitizer rewrite them to equivalent HTML5 attributes. See T68413, T42632 and Manual:$wgCleanupPresentationalAttributes for related discussion.
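A minimal sketch of such a rewrite, assuming attributes arrive as a plain dict, is shown below. It is not the behaviour of $wgCleanupPresentationalAttributes; the mapping covers only two attributes, the value check is a deliberately strict allowlist chosen for this example, and rewrite_presentational_attrs is a hypothetical name.

```python
import re

# Illustrative mapping from obsolete attribute to CSS property (assumption).
CSS_EQUIVALENTS = {"bgcolor": "background-color", "valign": "vertical-align"}

# Strict allowlist for values, so nothing unexpected is injected into style.
SAFE_VALUE = re.compile(r"#?[A-Za-z0-9]+")


def rewrite_presentational_attrs(attrs):
    """Hypothetical helper: move mapped obsolete attributes into style."""
    result = {}
    declarations = []
    for name, value in attrs.items():
        prop = CSS_EQUIVALENTS.get(name.lower())
        if prop is None:
            result[name] = value  # not a mapped attribute: keep as-is
        elif SAFE_VALUE.fullmatch(value.strip()):
            declarations.append(f"{prop}: {value.strip()}")
        # Mapped attributes with unsafe values are dropped entirely.
    if declarations:
        existing = result.get("style", "").strip().rstrip(";")
        # Converted declarations go first so an explicit style still wins,
        # matching the lower precedence of presentational attributes.
        parts = declarations + ([existing] if existing else [])
        result["style"] = "; ".join(parts)
    return result
```

For example, {"bgcolor": "#FFCC00", "class": "wikitable"} would come back with the class untouched and a style of "background-color: #FFCC00".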

The situation with 3 is more complicated. Tidy does a better job of complying with 3 than Parsoid or RemexHTML, the Tidy replacement. But Tidy targets HTML4 and does too much, so emulating it is not the solution. The non-compliance in Parsoid and the like exists because the HTML5 tree builder has to be lenient about what it accepts, so a serialize(parse(html5-string)) round trip does not guarantee that content model constraints are enforced. The tree builder algorithm that HTML5 uses to parse input strings is deliberately designed this way because of the vast number of non-compliant documents out there. So we cannot rely on the tree builder to fix up content model violations.

If we wanted to ensure compliant output, we would have to either rely on a post-processor to fix up the output (more feasible) or never generate non-compliant output in the first place (less feasible). With Parsoid, the post-processing pass is further complicated because it also has to fix up DSR offsets and any other private round-tripping information, so that selective serialization continues to function properly. This will become much less serious as we remove more and more of that information.
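Before a post-processor can fix anything, it has to find the violations. The sketch below covers only that detection step, and only for one simple class of violation: a flow-only element appearing as a direct child of an element that accepts phrasing content only. It uses Python's html.parser with a manual stack, which is not an HTML5 tree builder, so it assumes the input is already well-nested serialized output. The element sets are illustrative subsets, check_content_model is a hypothetical name, and transparent content models (such as <a>) and DSR handling are out of scope.

```python
from html.parser import HTMLParser

# Illustrative subsets (assumption), not the full HTML5 content categories.
PHRASING_PARENTS = {"span", "small", "b", "i", "em", "strong", "sub", "sup", "code"}
FLOW_ONLY = {"div", "ul", "ol", "p", "table", "blockquote", "pre",
             "h1", "h2", "h3", "h4", "h5", "h6"}
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ContentModelChecker(HTMLParser):
    """Records (parent, child) pairs where a flow-only child sits in a phrasing parent."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.violations = []

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        if parent in PHRASING_PARENTS and tag in FLOW_ONLY:
            self.violations.append((parent, tag))
        if tag not in VOID:
            self.stack.append(tag)  # void elements never have children

    def handle_endtag(self, tag):
        # Pop back to the matching open element; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def check_content_model(html):
    """Hypothetical helper: return a list of (parent, child) violations."""
    checker = ContentModelChecker()
    checker.feed(html)
    checker.close()
    return checker.violations
```

Run on the two examples from the first section, this reports ("span", "div") and ("small", "ul") respectively. Deciding how to repair each case, and keeping Parsoid's offsets consistent while doing so, is the harder part and is not attempted here.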

Separately, Parsoid has non-standard uses of <link> tags that won't directly validate with an HTML5 validator. However, we should verify that Parsoid's uses comply with the RDFa extensions to the HTML5 syntax.

Other HTML5 spec issues

Beyond this, we might want to consider other fixes to our output. For example, element ids generated in MediaWiki are HTML4 ids, which are more constrained than HTML5 ids. We could migrate to generating HTML5 ids instead, but this is an involved task as well. T152540 has more details.
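To make the difference in constraints concrete, the sketch below checks a string against both rule sets. HTML4 requires an id to start with an ASCII letter, followed by letters, digits, hyphens, underscores, colons or periods; HTML5 only requires at least one character and no ASCII whitespace. The function names are chosen for this example, and the sample strings are illustrative rather than taken from MediaWiki output.

```python
import re

# HTML4 ID token: a letter, then letters, digits, "-", "_", ":" or ".".
HTML4_ID = re.compile(r"[A-Za-z][A-Za-z0-9\-_:.]*")

# ASCII whitespace as defined by the HTML standard: tab, LF, FF, CR, space.
HTML5_WHITESPACE = re.compile(r"[\t\n\f\r ]")


def is_valid_html4_id(value):
    """True if value satisfies the HTML4 ID token rules."""
    return HTML4_ID.fullmatch(value) is not None


def is_valid_html5_id(value):
    """True if value is non-empty and contains no ASCII whitespace."""
    return value != "" and HTML5_WHITESPACE.search(value) is None
```

Under these rules a dot-encoded string such as "Caf.C3.A9" passes both checks, while "Café" and "2019_events" are valid only as HTML5 ids, and "a b" is valid as neither.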

The sanitizer code in the core parser, as well as in Parsoid, reflects HTML5 semantics only partially. At some point, both should be updated to adopt HTML5 semantics more fully (while accounting for HTML4 tags and attributes that are still in use, as indicated earlier). T145002 is the task for that.

Related discussions elsewhere

enwiki VP (Policies) RfC: Should deprecated/invalid/unsupported HTML tags be discouraged?
enwiki VP (Technical) discussion: Time to knock out obsolete HTML tags
T68413 has some relevant discussion about invalid attributes, including dropping and migrating them.

Pros / cons of shooting for compliance

TO BE COMPLETED.

This section is to collect arguments for / against shooting for compliance.

Compliance is a binary state. Even so, we can still discuss which parts of the HTML5 spec we want to comply with: for example, which subset of 1 to 3 in the first section we want to aim for, and what the pros and cons of each are.

The WHATWG FAQ says "Validity (more often referred to as document conformance in the WHATWG) is a quality assurance tool to help authors avoid mistakes. We don't make things non-conforming (invalid) for the sake of it, we use conformance as a guide for developers to help them avoid bad practices or mistakes (like typos)."

As such, having conformance as a goal allows us to use standard output validation to check the quality of the MediaWiki code base.

Tim says: The HTML5 specification further suggests that vendor-neutral extensions, such as non-standard tags, be supported by publishing an applicable standard to which the document conforms. It could be called HTML+MediaWiki. Legoktm started User:Legoktm/HTML+MediaWiki to bring our usage of <big> into compliance. Our usage of <figure-inline> also needs to be documented.
