Parsoid/Normalizations
While serializing (html2wt), Parsoid performs a number of normalizations.
Most of them are implemented in DOMNormalizer.php.
Normalizations
These are the normalizations that Parsoid performs:
- Tag minimization (<i>/<b> tags)
- Serialize invalid <a> tags to text
- Enforce single-line context (in headings and lists)
- Strip empty headings and style tags (only performed on new nodes)
- Tag minimization (<a> tags, when at least one is new)
- Whitespace at the start of paragraphs
- New links that end in spaces
- New table cells starting with escapable prefixes
Other normalizations work around issues in Parsoid or in VE and other clients. For now, they are a simpler way to generate clean wikitext:
- Force category links and behaviour switches to serialize before/after headings (only performed on new nodes)
- Strip <br> tags in headings (Parsoid introduces these in some paragraphs, and they stick around when the paragraph is converted to a heading in VE)
- Strip trailing <nowiki/> from wikitext lines (this one will be unnecessary once Parsoid stops introducing these)
Examples
Tag minimization (<i>/<b> tags)
<b>X</b><b>Y</b>
// becomes
<b>XY</b>
and
<i>A</i><b><i>X</i></b><b><i>Y</i></b><i>Z</i>
// becomes
<i>A<b>XY</b>Z</i>
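The first example above (merging adjacent siblings) can be sketched in a few lines. This is only an illustration and not Parsoid's implementation: it uses Python's standard xml.etree.ElementTree, the function name merge_adjacent is hypothetical, and it does not attempt the tag reordering shown in the second example.

```python
import xml.etree.ElementTree as ET

def merge_adjacent(parent):
    """Merge adjacent children of `parent` that share tag and attributes.

    Hypothetical helper for illustration; not part of Parsoid.
    """
    i = 0
    while i < len(parent) - 1:
        a, b = parent[i], parent[i + 1]
        # Only merge when no text sits between the two siblings.
        if a.tag == b.tag and a.attrib == b.attrib and not a.tail:
            # Append b's leading text to the end of a's content.
            if len(a):
                a[-1].tail = (a[-1].tail or "") + (b.text or "")
            else:
                a.text = (a.text or "") + (b.text or "")
            a.extend(list(b))   # move b's children into a
            a.tail = b.tail     # keep any text that followed b
            parent.remove(b)
        else:
            i += 1
    # Repeat the merge inside each remaining child.
    for child in parent:
        merge_adjacent(child)

root = ET.fromstring("<p><b>X</b><b>Y</b></p>")
merge_adjacent(root)
print(ET.tostring(root, encoding="unicode"))  # <p><b>XY</b></p>
```

Siblings separated by text, including whitespace, are left alone, because merging them would change the rendered content.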
Force category links and behaviour switches to serialize before/after headings
<h2>hello there<link href="Category:A1" rel="mw:PageProp/Category" /></h2>
// becomes
<h2>hello there</h2>
<link href="Category:A1" rel="mw:PageProp/Category" />
and
<h2><meta property="mw:PageProp/toc" /> ok</h2>
// becomes
<meta property="mw:PageProp/toc" />
<h2> ok</h2>
Serialize invalid <a> tags to text
<a rel="mw:WikiLink" href="[[foo]]">text</a>
// serializes to
text
and
<a rel="mw:WikiLink" href="foo">*a foo</a>
// serializes to
*a [[foo]]
Enforce single-line context
<h2>testing
123</h2>
// becomes
<h2>testing 123</h2>
and
<ul><li>asd
sdf</li></ul>
// becomes
<ul><li>asd sdf</li></ul>
However, newlines in transclusion parameters are preserved.
<h2> hi <span about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"bogus","href":"./Template:Bogus"},"params":{"1":{"wt":"there\nyou"}},"i":0}}]}'>there</span><span about="#mwt1">
</span><span about="#mwt1">you</span> </h2>
// serializes to
== hi {{bogus|there
you}} ==
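As a rough illustration of the single-line rule for plain text content, the sketch below collapses each newline into one space. The function name to_single_line is hypothetical, and the sketch deliberately ignores the transclusion exception shown above, which a real implementation would have to handle by leaving template parameters untouched.

```python
import re

def to_single_line(text):
    """Replace each newline, plus adjacent spaces or tabs, with one space.

    Hypothetical helper for illustration; not part of Parsoid.
    """
    return re.sub(r"[ \t]*\n[ \t]*", " ", text)

print(to_single_line("testing\n123"))  # testing 123
print(to_single_line("asd\nsdf"))      # asd sdf
```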
Strip empty headings and style tags
Normally,
<h2></h2>
<i></i><b></b>
// serializes to
==<nowiki/>==
''<nowiki/>'''''<nowiki/>'''
but with scrubbing it's all dropped.
Tag minimization (<a> tags)
<a href="Football">Foot</a><a href="Football">ball</a>
// becomes
<a href="Football">Football</a>
and
<a href="Football"><i>Foot</i></a><a href="Football"><b><i>ball</i></b></a>
// becomes
<a href="Football"><i>Foot<b>ball</b></i></a>
Move formatting from link text to the entire link (with some exceptions)
<a rel="mw:WikiLink" href="./Football"><u><i><b>Football</b></i></u></a>
// becomes
<u><i><b><a rel="mw:WikiLink" href="./Football">Football</a></b></i></u>
This enables the simplified wikilink format when the href and the link text match. Without the reordering, [[Football|<u>'''''Football'''''</u>]] would be emitted. With the reordering, <u>'''''[[Football]]'''''</u> is emitted.
Exceptions:
- If the formatting tags have attributes such as color, style, or class, the reordering is not done, since it can change rendering in some cases. The <a> tag's color style overrides the outer style, so <i color='brown'>[[Foo]]</i> doesn't render the same as [[Foo|<i color='brown'>Foo</i>]].
- If the link text is not identical to the href, the reordering is not done since the simplified link form is not enabled in this case.
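The two exceptions amount to a simple precondition check, sketched below. The function name, the (tag, attributes) representation of the wrapping tags, and the way the target is derived from the href (dropping a leading "./") are all assumptions made for illustration. Parsoid's actual title comparison is likely more involved.

```python
def can_move_formatting_out(href, link_text, wrappers):
    """Return True if formatting may be moved from the link text to the link.

    wrappers: list of (tag_name, attributes) pairs for the formatting
    elements that wrap the whole link text, outermost first.
    Hypothetical helper for illustration; not part of Parsoid.
    """
    # Exception 1: any attribute on a formatting tag blocks the move.
    if any(attrs for _tag, attrs in wrappers):
        return False
    # Exception 2: the link text must be identical to the link target.
    # Assumption: the target is the href without a leading "./".
    target = href[2:] if href.startswith("./") else href
    return target == link_text

plain = [("u", {}), ("i", {}), ("b", {})]
print(can_move_formatting_out("./Football", "Football", plain))                   # True
print(can_move_formatting_out("./Foo", "Foo", [("i", {"color": "brown"})]))       # False
print(can_move_formatting_out("./Football", "ball", [("i", {})]))                 # False
```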
Whitespace at the start of paragraphs
These nowikis prevent the text from roundtripping as preformatted text.
<p> hi
 ho</p>
// normally serializes to
<nowiki> </nowiki>hi
<nowiki> </nowiki>ho
// but with scrubbing becomes
hi
ho
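The scrubbed result can be approximated by dropping leading spaces from every line of the paragraph text, so that no line can start a preformatted block. This is a minimal sketch with a hypothetical function name, and it assumes the paragraph content is already available as a plain string.

```python
def strip_leading_space(paragraph_text):
    """Remove leading spaces from each line of a paragraph's text.

    Hypothetical helper for illustration; not part of Parsoid.
    """
    return "\n".join(line.lstrip(" ") for line in paragraph_text.split("\n"))

print(strip_leading_space(" hi\n ho"))
# hi
# ho
```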
New links that end in spaces
The nowiki here is to prevent link trails.
<p><a rel="mw:WikiLink" href="./Berlin" title="Berlin">Berlin </a>is the capital of Germany.</p>
// normally serializes to
[[Berlin ]]<nowiki/>is the capital of Germany.
// but with scrubbing becomes
[[Berlin]] is the capital of Germany.
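One way to picture this normalization is as moving trailing spaces out of the link text so that they follow the closing brackets. The helper below is hypothetical and works on plain strings rather than on the DOM.

```python
def split_trailing_space(link_text):
    """Split link text into (text without trailing spaces, trailing spaces).

    Hypothetical helper for illustration; not part of Parsoid.
    """
    stripped = link_text.rstrip(" ")
    return stripped, link_text[len(stripped):]

text, trailing = split_trailing_space("Berlin ")
print(f"[[{text}]]{trailing}is the capital of Germany.")
# [[Berlin]] is the capital of Germany.
```

Because the space now sits between the link and the following word, the word can no longer be read as a link trail and no <nowiki/> is needed.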
New table cells starting with escapable prefixes
<table>
<tr><td>a</td></tr>
<tr><td>-</td></tr>
<tr><td>+</td></tr>
</table>
// normally serializes to
{|
|a
|-
|<nowiki>-</nowiki>
|-
|<nowiki>+</nowiki>
|}
// but with scrubbing becomes
{|
|a
|-
| -
|-
| +
|}
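The two outputs above can be reproduced with a small sketch for a table that has one cell per row. The function names, the scrub flag, and the prefix set ("-" and "+", taken only from this example) are assumptions. The real set of escapable prefixes and the exact nowiki placement in Parsoid may differ.

```python
# Assumption: only the prefixes shown in the example above are handled.
ESCAPABLE_PREFIXES = ("-", "+")

def serialize_cell(content, scrub):
    """Serialize one table cell (hypothetical helper, not part of Parsoid)."""
    if content.startswith(ESCAPABLE_PREFIXES):
        if scrub:
            # A leading space keeps "|-" and "|+" from being read as syntax.
            return "| " + content
        return "|<nowiki>" + content + "</nowiki>"
    return "|" + content

def serialize_table(cells, scrub):
    """Serialize a table with one cell per row."""
    lines = ["{|"]
    for i, cell in enumerate(cells):
        if i > 0:
            lines.append("|-")  # row separator
        lines.append(serialize_cell(cell, scrub))
    lines.append("|}")
    return "\n".join(lines)

print(serialize_table(["a", "-", "+"], scrub=True))
```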