Parsing/Replacing Tidy

From mediawiki.org
(Redirected from Parsing/Removing Tidy)

Please see the FAQ for more focused discussion of why we are replacing html4-tidy with RemexHtml.

Tidy has had a large number of bugs filed against it and the current binary deployed on the Wikimedia cluster is based on HTML4 semantics. Additionally, since the effect of running Tidy on MW Parser main pass output is poorly specified, the Parsing team is working on instead using the HTML 5 parsing algorithm to clean up bad HTML in wikitext, similar to the way browsers deal with tag soup. This is the approach taken by Parsoid, and using the same approach in MediaWiki will help provide consistent output in the two parsers.

This effort was tracked in Phabricator in phab:T89331. The switch was deliberately not made during 2016, so that communities had several months to check their pages for errors.

After testing three different Tidy-replacement implementations (one in Java, and two in PHP), we have settled on a version based on RemexHTML, a PHP-only HTML5 parsing library.

See also the simplified instructions for editors.

What this means for editors

Some templates currently rely on behaviour specific to Tidy which will not be retained. These templates will have to be updated. In initial testing (see below), we found issues such as:

  • Templates which generate plainly broken output (such as mismatched start and end tags), which RemexHTML cleans up in a different way to Tidy. These templates should be fixed.
  • Some templates generate unnecessary line breaks, and these line breaks cause MediaWiki's paragraph formatting (doBlockLevels) to output broken HTML. Tidy then cleans up the broken HTML differently from RemexHTML. Editors may have to work around this MediaWiki bug.
  • Tidy rearranges the whitespace in the HTML, mostly in a misguided attempt at pretty-printing. It assumes that these changes have no effect on the final layout; however, the CSS white-space property can make them visible. This behaviour is usually harmless, but some templates rely on it, especially navboxes which use "white-space: nowrap" to prevent breaking in the middle of a list item. Such templates will need to be fixed.

To identify some of the pages that need fixing, we are working to add new categories to the Linter extension for scenarios 2. and 3. in the Things to fix section below. In addition, to assist editors in migrating wikitext to the new rules, we deployed an extension called ParserMigration. If you enable ParserMigration in your preferences (under "Editing > Developer tools > Enable parser migration tool"), a link called "Edit with migration tool" is added to the toolbox of all articles, which can show the current (Tidy) and expected (RemexHTML) output side-by-side, and can preview article text changes in the same side-by-side view. Given this, you can make the required changes to the wikitext and compare how the page renders with Tidy and with RemexHTML via the side-by-side preview.

We currently have no fixed schedule for turning off Tidy and replacing it with the RemexHTML-based version. We will do this when we are satisfied that the impact on readers will be minor and tolerable. However, we also do not want this to drag on indefinitely, so it would be ideal if editors prioritized the template fixes.

Things to fix

Based on visual diff testing, here are 3 main categories of markup that need fixing:

  1. Self-closing tags: <b/> <div/> are no longer treated as empty tags in an HTML5 parser. Both the PHP parser and Parsoid currently have backward-compatibility code to handle these. However, it is good to fix them up. The cleanup effort appears to be going well so far. This dashboard tracks progress.
  2. Broken wikitext markup: Tidy and an HTML5 parser might fix up mismatched tags or missing end tags differently in some scenarios.
  3. Pages/templates relying on Tidy's whitespace mangling behavior: There are 3 kinds of whitespace mangling that Tidy does that an HTML5 parser does not do. These matter because CSS rules can make whitespace significant, and pages that use such rules are exactly the ones that will be affected.
    • Tidy migrates trailing whitespace out of inline tags like <span>, <b>, etc. to outside the tag
    • Tidy adds \n chars between a closing tag and an opening tag. For example, newlines between list items or table cells.
    • Tidy strips whitespace between a tag and its contents. Ex: whitespace inside a list item or a table cell.

It is possible to provide tool support for scenarios 1 and 2 above.

For 1, we have been adding maintenance categories to pages using self-closing tags, and editors have been using those categories and other tools to fix them up.

For 2, there are the various Project CheckWiki tools. Besides that, the parsing team will deploy the Linter extension to help identify and fix some of these scenarios.

However, we don't know how to provide tooling to detect / list pages that rely on Tidy's whitespace mangling behavior. See below for more information about this.

Effect of whitespace changes

The net effect of Tidy's whitespace mangling is that wikitext / HTML like this

<ul><li> a </li><li> b </li></ul>

will be transformed to

<ul>
<li>a</li>
<li>b</li>
</ul>

In most cases, this does not make a difference to rendering. However, some pages and templates will be affected. If a page sets the CSS white-space: nowrap property or displays a list with the CSS display:inline property, it will be affected, because none of the Tidy replacements mangle whitespace in this way.
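
The transformation shown above can be imitated with a few regular expressions. The following Python sketch is only an approximation for experimenting with snippets: the function name, the tag lists and the patterns are assumptions made for illustration, and none of it is taken from Tidy's source code.

```python
import re

INLINE_TAGS = ("span", "b", "i")  # illustrative subset, not Tidy's real list


def tidy_like_whitespace(html: str) -> str:
    """Approximate the three whitespace changes attributed to Tidy."""
    inline = "|".join(INLINE_TAGS)
    # 1. Move trailing whitespace out of inline tags: "<b>x </b>" -> "<b>x</b> ".
    html = re.sub(rf"(\s+)(</(?:{inline})>)", r"\2\1", html)
    # 2. Strip whitespace between a list item or table cell tag and its contents.
    html = re.sub(r"(<(?:li|td|th)\b[^>]*>)\s+", r"\1", html)
    html = re.sub(r"\s+(</(?:li|td|th)>)", r"\1", html)
    # 3. Add a newline after a list tag that is directly followed by another tag.
    html = re.sub(r"(<(?:ul|ol)>|</(?:ul|ol|li)>)(?=<)", r"\1\n", html)
    return html


print(tidy_like_whitespace("<ul><li> a </li><li> b </li></ul>"))
```

Running a template's output through such a function and comparing it with the unmodified HTML gives a rough idea of whether the rendering depends on the mangled whitespace.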

See T155634 (affects a template), T74416#2384571 (affects a template that has been fixed). There are some other templates and pages that have been identified below. We expect navbox templates and similar templates might need fixing to not rely on whitespace mangling for correct rendering.

Detailed information about what to fix

The FAQ page may contain more up to date information.

Here is a classification of diffs into different categories along with example titles, detailed description where useful, and proposed resolutions.

Self-closing tags like <b/> <div/>, etc.

Tidy strips self-closing tags like <b/>, <div/>, but an HTML5 parser treats them as the opening tags <b>, <div>, etc. This dashboard tracks progress of editors fixing these.

Sample searches to find these on your wiki:

  • To find <b/> paste this into the search box: insource:/\<b\/\>/
  • To find <p/> paste this into the search box: insource:/\<p\/\>/
  • To find <td/> paste this into the search box: insource:/\<td\/\>/
  • To find <div/> paste this into the search box: insource:/\<div\/\>/ (Note that this will find only "empty" divs; structures such as <div xxx="yyy" /> are also wrong.)
  • To find <span/> paste this into the search box: insource:/\<span\/\>/ (Note that this will find only "empty" spans; structures such as <span xxx="yyy" /> are also wrong.)
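
For wikitext that is available offline (for example from a dump or a copied template), a similar check can be scripted. The Python sketch below is a minimal example under stated assumptions: the function name and the pattern are hypothetical, and the set of checked tags is simply taken from the searches listed on this page. Tags outside that set, such as <br/> or <nowiki/>, are deliberately ignored.

```python
import re

# HTML tags taken from the search examples on this page; extend as needed.
CHECKED_TAGS = {
    "b", "big", "blockquote", "center", "del", "div", "font",
    "h1", "h2", "h3", "h4", "h5", "i", "p", "s", "small", "span",
    "strike", "sub", "sup", "td", "th", "tr", "u",
}

# Matches <tag/>, <tag />, and <tag attr="value" />.
SELF_CLOSED = re.compile(r"<(?P<tag>[A-Za-z][A-Za-z0-9]*)(?:\s[^<>]*?)?\s*/>")


def find_self_closed_tags(wikitext: str) -> list[tuple[int, str]]:
    """Return (offset, matched text) for self-closed tags from CHECKED_TAGS."""
    hits = []
    for match in SELF_CLOSED.finditer(wikitext):
        if match.group("tag").lower() in CHECKED_TAGS:
            hits.append((match.start(), match.group(0)))
    return hits


sample = 'a <b/> c <div id="x" /> <br/> <nowiki/> <span>ok</span>'
for offset, text in find_self_closed_tags(sample):
    print(offset, text)
```

Unlike the simple insource: searches, this also reports tags with attributes, such as <div xxx="yyy" />.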

For a long list of many more searches that have yielded results on the English Wikipedia, see the list below. Note that regular expression searches do not always return complete results. See Phabricator bug T106685 for details.

[Note: some of these searches find tags that are not self-closed tags, but are broken in some other way and should be fixed.]

  • insource:/\<\/*b *\/\>/
  • insource:/\<\/*big *\/\>/
  • insource:/\<\/*blockquote *\/\>/
  • insource:/\<\/br *>/
  • insource:/\<\s*\/*\s*b r\s*\|*\s*\/*\s*\>/
  • insource:/\<\s*\/*\s*br\s*[\\\?]+\s*\/*>/
  • insource:/\<\s*\/*\s*br\s*[\|\/\?\.]+\s*\/\>/
  • insource:/\<\s*\/\s*br[\s\|\/\?\.]*\>/
  • insource:/\<\s*\<\/*\s*br\s*\/*>/
  • insource:/\<\/*center *\/\>/
  • insource:/\<\/*del *\/\>/
  • insource:/\<\/*div *\/\>/
  • insource:/(\<div class=\"[a-zA-Z0-9_:; #%\-]+\" *)\/\>/
  • insource:/(\<div id=[\'\"][ça-zA-Z0-9\-_ ]+[\'\"] +style=\"[a-zA-Z0-9_:; #%\-]+[\'\"] *)\/\>/
  • insource:/(\<div id=[\'\"]*[ça-zA-Z0-9\-_ ]+[\'\"]* *)\/\>/
  • insource:/(\<div style=\"[a-zA-Z0-9_:; #%\-]+\" *)\/\>/
  • insource:/\<font *\/\>/
  • insource:/\<font color=\"[a-z ]+\"*\/\>/
  • insource:/\<font style=\"*[a-z ]+\"*\/\>/
  • insource:/\<\/*h1 *\/\>/
  • insource:/\<\/*h2 *\/\>/
  • insource:/\<\/*h3 *\/\>/
  • insource:/\<\/*h4 *\/\>/
  • insource:/\<\/*h5 *\/\>/
  • insource:/\<\/*i *\/\>/
  • insource:/\<\/*p *\/\>/
  • insource:/(\<p id=[\'\"][ça-zA-Z0-9\-_ ]+[\'\"] *)\/\>/
  • insource:/\<\/*s *\/\>/
  • insource:/\<\/*small *\/\>/
  • insource:/\<\/span *\/\>/
  • insource:/(\<span class\s*=\s*\"*[ça-zA-Z0-9\-_ ]+\"* *)\/\>/
  • insource:/(\<span id\s*=\s*\"*[ça-zA-Z0-9\-_ \(\)\–\.\,\:\&\'\"\;\/\%\!]+\"* *)\/\>/
  • insource:/\<span *\/\>/
  • insource:/\<span style=\"color\"*\/\>/
  • insource:/\<\/*strike *\/\>/
  • insource:/\<\/*sub *\/\>/
  • insource:/\<\/*sup *\/\>/
  • insource:/\<\/*td *(colspan=\d+)\/\>/
  • insource:/\<\/*td *\/\>/
  • insource:/(\<td style=\"[a-zA-Z0-9_:; #%\-=]+\" *)\/\>/
  • insource:/\<\/*th *\/\>/
  • insource:/\<\/*tr *\/\>/
  • insource:/\<\/*u *\/\>/

Example titles, with a detailed description, proposed resolution, and Linter category for each:

  • Example titles: enwiki:Horse, dewiki:Nachwachsender Rochstoff
    • Detailed description: Self-closing bold tags "<b/>" were used in the {{hands}} template as wikitext syntax modifiers, for example to prevent interpretation of punctuation at the start of the line. HTML 5 specifies that "<b/>" is to be treated the same as "<b>", but with a parse error emitted, whereas Tidy treats them as empty elements and removes them. So in RemexHTML, bold formatting ran on to the end of the article. The standard solution is <nowiki/>, but the author chose <b/> instead in order to reduce the post-expand include size. Further discussion: w:User talk:Wikid77#Empty_bold_tags. Same with the dewiki page that has a <b/> in the first paragraph of the article. It serves no purpose and can be safely deleted.
    • Proposed resolution: Update wikitext by either removing <b/> or replacing it with a <nowiki/> tag as needed.
    • Linter category: self-closed-tag
  • Example titles: enwiki:Villafranchian, ruwiki pages using Template:Автомобиль (Ex: ruwiki: Renault Espace)
    • Detailed description: A self-closing div tag in the middle of {{Neogene ELMA}} is interpreted as <div> instead of <div></div>, causing the remainder of the article to move inside the timeline box. Same with the ruwiki template.
    • Proposed resolution: Update wikitext by either removing the self-closing tag or converting it to a pair of tags that surrounds the intended content: <div>My '''Content'''</div>
    • Linter category: self-closed-tag
  • Example titles: enwiki:2016 Malaysia Premier League and possibly other pages
    • Detailed description: Use of <div id="Perlis v ATM"/> and many other divs like that causes content following that section to be swallowed into the <div> tags.
    • Linter category: self-closed-tag
  • Example titles: Most of the itwikisource pages that are showing diffs
    • Detailed description: Use of <span class="interwiki-info" id="el" title="(orig.)" style="display:none;" /> in {{IncludiIntestazione}} which is stripped by Tidy and not in HTML5depurate accounts for the large rendering diffs.
    • Linter category: self-closed-tag
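
Where a self-closed <div> or <span> was meant to be an empty element (for example an anchor that only carries an id), one possible mechanical rewrite is to expand it into an explicit empty pair. The Python sketch below illustrates that idea; the function name and default tag list are assumptions, and the result still needs human review, because in cases like <b/> the right fix is to remove the tag or use <nowiki/> instead.

```python
import re


def expand_self_closed(wikitext: str, tags=("div", "span")) -> str:
    """Rewrite <tag .../> as <tag ...></tag> for the given tag names."""
    names = "|".join(re.escape(tag) for tag in tags)
    pattern = re.compile(rf"<({names})\b([^<>]*?)\s*/>", re.IGNORECASE)
    # Keep the attributes, drop the slash, and append an explicit end tag.
    return pattern.sub(
        lambda m: f"<{m.group(1)}{m.group(2)}></{m.group(1)}>", wikitext
    )


print(expand_self_closed('<div id="Perlis v ATM"/>Match report'))
```

With the explicit end tag in place, an HTML5 parser no longer swallows the following content into the element.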

The following tools can help detect and fix such issues:

Wikitext markup errors

Examples: unclosed tables; nested tables in fosterable position; <small>…<small> instead of <small>…</small>. Tidy and HTML5depurate fix these up differently. Nothing needs to change in HTML5depurate; the obvious fix is to correct the affected templates and pages.
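
A rough first pass for spotting missing end tags is to count start and end tags per element. The Python sketch below does only that, for a small hypothetical list of tags; it does not detect misnesting or fostered content, and it knows nothing about wikitext table syntax, so it is no substitute for the Linter extension.

```python
import re
from collections import Counter

# Start tag, end tag, or self-closed tag for a few commonly unclosed elements.
TAG = re.compile(r"<(/?)(small|i|b|span|div)\b[^<>]*?(/?)>", re.IGNORECASE)


def unbalanced_tags(wikitext: str) -> dict[str, int]:
    """Return tags whose start/end counts differ (positive: unclosed)."""
    balance = Counter()
    for closing, name, self_closed in TAG.findall(wikitext):
        if self_closed:
            continue  # self-closed tags are a separate problem
        balance[name.lower()] += -1 if closing else 1
    return {name: count for name, count in balance.items() if count != 0}


print(unbalanced_tags("<small>a<small> b <i>c</i>"))
```

A positive count means more start tags than end tags, which is the pattern produced by writing <small>…<small>.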

Example titles, with a detailed description, proposed resolution, and Lint category for each:

  • Example titles: eswiki:Bob Esponja
    • Detailed description: There is an unclosed table that then runs into a new section with another table.
    • Proposed resolution: Update wikitext
    • Lint category: misnested-tag
  • Example titles: enwiki:2015-16 Odense Bulldogs season, enwiki:2015-16 ABA League
    • Detailed description: Unclosed <small> tags in http://en.wikipedia.org/w/index.php?title=Template:2015%E2%80%9316%20Metal%20Ligaen%20table. Unclosed <small> tag in http://en.wikipedia.org/w/index.php?title=2015–16_ABA_League&action=edit&section=6
    • Proposed resolution: Update wikitext
    • Lint category: missing-end-tag
  • Example titles: Pretty much all the various svwiki page diffs. Ex: svwiki:Kugelstein
    • Detailed description: Template:klimatöversikt has a HTML table, and all the svwiki pages have markup of the form
      {| border="1"
      {{klimatöversikt
      ...
      }}
      |}
      This markup is parsed as 2 separate tables in a HTML5 parser since the inner <table> is in fosterable position whereas Tidy fixes up the HTML differently and introduces nested tables.
    • Proposed resolution: ?
    • Lint category: deletable table tag
  • Example titles: itwiki:Juventus_Football_Club_1982-1983 and several other itwiki sports pages
    • Detailed description: Template https://it.wikipedia.org/w/index.php?title=Template:Incontro_di_club has a nested table in fosterable position causing different fixups in Tidy and HTML5 parsers.
    • Lint category: fostered
  • Example titles: ruwiki:Флатт,_Рэйчел (Rachael Flatt), ruwiki:Сабликова, Мартина (Martina Sáblíková)
    • Detailed description: Navbox templates such as {{Чемпионы мира по фигурному катанию среди юниоров}} use {{nowrap begin}}/{{nowrap end}}, with items delimited with {{·w}}, which theoretically should break the nowrap span with a space outside the nowrap section. But doBlockLevels() inserts a misnested paragraph tag which starts inside the first nowrap span, and ends inside the last nowrap span. Tidy fixes this by splitting the spans, whereas depurate moves the whole paragraph inside the first nowrap span, causing the whole list to be nowrapped.
    • Proposed resolution: remove div tag around nowrap begin/end
    • Lint category: pwrap-bug-workaround
  • Example titles: enwiki:Wildcat (comics)
    • Detailed description: An unclosed <i> in a heading causes the contents of every block starting from halfway through the TOC to be wrapped in a separate <i>, thanks to AFE reconstruction.
    • Lint category: missing-end-tag

Trailing whitespace migration from inline tags like <span>, <b>, etc.

Tidy migrates trailing whitespace out of inline tags like <span>, <b>, etc. to outside the tag, but this is broken Tidy behavior. An HTML5 parser will not do this.
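
Places where a template may depend on this behavior can be located by looking for whitespace directly before the end tag of an inline element. The Python sketch below is a simple heuristic with a hypothetical function name and an illustrative tag list; it reports candidates only, and whether a hit matters still depends on the CSS in effect (for example "white-space: nowrap").

```python
import re

# Whitespace immediately before the end tag of a few inline elements.
TRAILING_WS = re.compile(r"\s+</(span|b|i|small)>", re.IGNORECASE)


def find_trailing_inline_whitespace(wikitext: str) -> list[tuple[int, str]]:
    """Return (line number, tag name) for inline tags ending in whitespace."""
    hits = []
    for lineno, line in enumerate(wikitext.splitlines(), start=1):
        for match in TRAILING_WS.finditer(line):
            hits.append((lineno, match.group(1).lower()))
    return hits


sample = 'x\n<span style="white-space: nowrap">Bode Miller </span><span>next</span>'
print(find_trailing_inline_whitespace(sample))
```

For each hit, moving the space to just after the end tag in the wikitext reproduces what Tidy used to do implicitly.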

Example titles, with a detailed description and proposed resolution for each:

  • Example titles: ruwiki:Миллер, Боде (Bode Miller) and lots of other ruwiki pages
    • Detailed description: The template {{Обладатели Кубка мира по горнолыжному спорту в общем зачёте}} is displayed with a width of 7400 pixels due to it consisting of a series of adjacent spans with "white-space: nowrap". There is a space at the end of each span's contents, which tidy moves outside the span, allowing the browser to break the line. With Depurate, the space is not moved, so the line is not broken.
    • Proposed resolution: Update wikitext. See this example for how you might do it
  • Example titles: itwikivoyage:Avezzano and possibly others
    • Detailed description: The Tidy version has "</span></a></span></b> <span" and the HTML5 depurate version has "</span></a> </span></b><span". That accounts for the red diffs running down the page in the lower-right quarter of the upright diff.
    • Proposed resolution: Update wikitext

display:inline list wrapping diffs because of inter-element white-space (possibly a concern in other rendering scenarios?)

In the Tidy version, there seem to be \n chars between </li> and the next <li>. In the HTML5depurate version, in some cases, they are not present. This seems to cause rendering differences in wrapping of lists.

Example titles, with a detailed description and proposed resolution for each:

  • Example titles: ptwiki:José Serra; it is not known whether other pages are affected similarly.
    • Detailed description: The last line with portals is a list where every element is styled as 'display:inline'. The Tidy version has "</li>\n<li…". The HTML5depurate version has "</li><li…". This missing newline causes the list to render as a long list, which causes the entire page to flow and render differently, producing larger visual diffs. This could potentially be a concern in other scenarios. It looks like http://pt.base.wikitextexp.wmflabs.org/w/index.php?title=Template:Portal3&action=edit is the template that generates this list. It doesn't look like there should be a newline rendered between list items, but Tidy adds the newlines on its own, causing rendering diffs.
    • Proposed resolution: This looks like a Tidy bug and requires fixing pages / templates that rely on this behavior. See Phab:T74416 for an example where an editor fixed a template on frwiki.

Other notes (previously used by the parsing team)

Visual diff testing

We found that a common impact of RemexHTML was to cause changes to the HTML which either don't affect the visual layout at all, or cause only minor vertical whitespace changes. In the belief that minor vertical whitespace changes would be tolerable, we wrote an image differ called UprightDiff which is able to identify vertical motion within an image, and to discount such motion for the purposes of automated testing.

We exported a subset of about 64000 articles from various Wikimedia projects, and rendered them with Tidy and with RemexHTML, then used UprightDiff to analyse the result. Current results can be seen at http://mw-expt-tests.wmflabs.org/ .
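
The idea of discounting vertical motion can be illustrated with a much-simplified row-alignment sketch in Python. This is not UprightDiff's algorithm: the representation of an image as a list of pixel-row tuples, the function names, and the background value are all assumptions chosen to keep the example small.

```python
from difflib import SequenceMatcher


def is_blank(row, background=255):
    """A row is blank if every pixel has the background value."""
    return all(pixel == background for pixel in row)


def rows_changed(before, after, background=255):
    """Count differing non-blank rows after aligning the images row by row."""
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    changed = 0
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "equal":
            continue  # identical rows, possibly at a different vertical offset
        old = sum(not is_blank(row, background) for row in before[a0:a1])
        new = sum(not is_blank(row, background) for row in after[b0:b1])
        changed += max(old, new)  # inserted or removed blank rows count as 0
    return changed


WHITE, BLACK = (255, 255), (0, 0)
# The second image has one extra blank row above the content.
print(rows_changed([WHITE, BLACK, WHITE], [WHITE, WHITE, BLACK, WHITE]))
```

Content that only moves down because a blank row was inserted above it is not counted, whereas rows whose pixels actually change are.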

See Uprightdiff numeric scoring for more details about how we assign a test score to each tested page.

P-tags wrapping newlines

<p> tags wrapping newlines, which are present in the HTML5depurate output but stripped by Tidy, cause minor whitespace margin diffs and seem to be the source of a lot of noise in the visual diff output.

NOTE for editors: HTML5depurate is taking care of this automatically right now. Eventually we might remove this compatibility pass, but this won't be an issue in the initial Tidy removal rollout.

Example titles, with a detailed description and proposed resolution for each:

  • Example titles: frwiki:Kefteji and many others
    • Detailed description: There are 3 <p>\n</p> tags before the image that is showing up in the visual diff image.
    • Proposed resolution:
      • gerrit:288573 A new version of RemexHTML with this fix is out and a new visual diff test run is complete. Now, > 87% of pages render without any pixel diffs (compared to ~39% before).
      • gerrit:290616 Mark empty p-elements with mw-empty-elt + Add display:none CSS rule for this class (https://gerrit.wikimedia.org/r/#/c/290614/) + ensure that the new CSS rules are enabled when using HTML5depurate. With all these fixes, > 93% of pages render without any pixel diffs (compared to ~87% before).
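
The marking step described above can be pictured with the following Python sketch. It is an illustration only: the actual change lives in MediaWiki's PHP code, and the function name and the regular expression here are assumptions. Only the class name mw-empty-elt and the display:none rule come from the description above.

```python
import re

# A <p> element with no attributes that contains nothing but whitespace.
EMPTY_P = re.compile(r"<p>(\s*)</p>")


def mark_empty_paragraphs(html: str) -> str:
    """Add the mw-empty-elt class to whitespace-only <p> elements."""
    # A CSS rule such as ".mw-empty-elt { display: none; }" then hides them.
    return EMPTY_P.sub(r'<p class="mw-empty-elt">\1</p>', html)


print(mark_empty_paragraphs("<p>\n</p><p>text</p>"))
```

Hiding the marked paragraphs with CSS removes the extra margins without changing the structure of the document.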

See also
