User:GWicke/oldUserPage

From mediawiki.org

I am Gabriel Wicke, a software developer working for the Wikimedia Foundation. I am currently working on the Parsoid parser for MediaWiki in support of the VisualEditor project.

Before joining the Foundation in October 2011, I was a volunteer contributor, with a very active period between 2003 and 2005. During that time I designed and implemented the first version of the Squid cache layer and the MonoBook skin, added the ability to develop JavaScript and CSS on the wiki (user and common scripts/styles), and tweaked the parser to emit slightly less broken output. That last effort was not very successful, which is why I then added HTML Tidy as a post-processor to clean up the mess before output.

The new parser should fix a big part of that problem by design. The challenge now is to support the existing content of a few encyclopedias, including esoteric template trickery, parser functions and extensions, and to make it round-trip back to wikitext ;)

I am using my user page as a personal bookmark list and note area. If you find anything that is interesting to you as well, then all the better! If you would like to contact me, then you can catch me as gwicke in #mediawiki on irc.freenode.net (US west coast times). You can also send me a mail through a form or at gwicke@wikimedia.org.

General links:

Visual Editor

Parser

HTML 5 parsers

HTML 5 parsing overlaps heavily with the needs of a wiki parser, and covers many areas of informal grammar and fix-ups. The main differences are in the tokenizer (different syntax) and in the handling of non-matching elements (ignore vs. show as plain text).

The plan is to convert the PEG parser into a tokenizer (which handles most wiki-specific issues) and use any HTML5-compliant parser as a backend that builds the (DOM) tree from 'token soup'. If we can get away with an unmodified HTML parser, then this will allow a reuse of specification and implementations for the back-end.

HTML5 parsers include (apart from those in browsers):

Tokenizer interfaces

html5:

  • uses events module for dispatch to parser; supports streaming sources (EventEmitter)
        this.emit('token', tok);
        {type: 'Characters', data: c}
        {type: 'StartTag', name: 'li',
            data: [{nodeName: 'attr1', nodeValue: 'attrvalue1'}]}

dom.js

  • direct call to current parser, by insertion mode
        insertToken(TEXT, s);
        insertToken(COMMENT, s);
        insertToken(TAG, tagname, [['an1', 'av']]);
        insertToken(ENDTAG, tagname);
        -> parser(..)

validator.nu htmlparser:

        TreeBuilder.startTag(tagName, [[key,value]], selfClosing);
        TreeBuilder.endTag(tagName);
        TreeBuilder.comment(commentstring);

A common emit function could be passed into the tokenizer constructor: emit(TYPE, "tagname", [[key, value]]). Passing attributes as a list of key-value pairs preserves their order and any duplicate attributes, which helps round-tripping where possible. TYPE is one of TAG, ENDTAG, TEXT, COMMENT, SELFCLOSINGTAG (incomplete list). Source positions would be an interesting addition to enable some degree of reconciliation.
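
As a rough illustration of the emit idea, here is a minimal sketch of an adapter that forwards emit(TYPE, ...) calls to a tree builder shaped like the validator.nu-style interface noted above. The names makeEmit and makeRecordingBuilder are hypothetical, and so is the characters() method used for TEXT tokens, which does not appear in the notes above. It is not the actual Parsoid or html5 library code.

```javascript
// Token types named in the notes; the list there is explicitly incomplete.
const TAG = 'TAG';
const ENDTAG = 'ENDTAG';
const TEXT = 'TEXT';
const COMMENT = 'COMMENT';
const SELFCLOSINGTAG = 'SELFCLOSINGTAG';

// Hypothetical adapter: returns an emit function that forwards tokens to a
// tree builder exposing startTag/endTag/comment (as in the notes) plus an
// assumed characters() method for text.
function makeEmit(treeBuilder) {
  return function emit(type, value, attribs) {
    // Copy the pair list as-is so order and duplicate keys are preserved.
    const pairs = (attribs || []).map(([key, val]) => [key, val]);
    switch (type) {
      case TAG:
        treeBuilder.startTag(value, pairs, false);
        break;
      case SELFCLOSINGTAG:
        treeBuilder.startTag(value, pairs, true);
        break;
      case ENDTAG:
        treeBuilder.endTag(value);
        break;
      case COMMENT:
        treeBuilder.comment(value);
        break;
      case TEXT:
        treeBuilder.characters(value); // assumed method, see lead-in
        break;
      default:
        throw new Error('Unknown token type: ' + type);
    }
  };
}

// Stand-in tree builder that only records the calls it receives.
function makeRecordingBuilder() {
  const calls = [];
  return {
    calls,
    startTag(name, attribs, selfClosing) {
      calls.push(['startTag', name, attribs, selfClosing]);
    },
    endTag(name) { calls.push(['endTag', name]); },
    comment(text) { calls.push(['comment', text]); },
    characters(text) { calls.push(['characters', text]); },
  };
}

// Usage: a tokenizer would receive `emit` in its constructor and call it.
const builder = makeRecordingBuilder();
const emit = makeEmit(builder);
emit(TAG, 'li', [['class', 'a'], ['class', 'b']]); // duplicate key is kept
emit(TEXT, 'item');
emit(ENDTAG, 'li');
console.log(JSON.stringify(builder.calls));
```

Keeping attributes as an ordered list of pairs rather than an object is what lets duplicates like the two class attributes survive; an object keyed by attribute name would silently drop one of them.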

Wiki-specific parser work

Differences between Tidy and HTML5 parser behavior

Generally, HTML5 parsers perform only very limited correction of invalid nesting according to the content model. Content in locations where neither inline nor block-level content is legal (for example between 'table' and 'tr' tags) is generally adopted by elements further up in the tree according to a 'foster parent' algorithm. Block-level ('flow' in HTML5 lingo) content where only inline ('phrasing') content is allowed is not corrected at all. Browsers manage to display these unspecified nestings with mostly acceptable results.

Tidy, on the other hand, tries harder to correct content-model violations, sometimes with surprising and not very localized effects if other invalid content (especially content with missing end tags) precedes the mis-nested content.

Examples:

  • Block elements in headings are moved after the heading, or joined up with unclosed block elements before the heading

Nuggets from the HTML 5 spec

Formatting elements

  • a, b, big, code, em, font, i, nobr, s, small, strike, strong, tt, and u. [4]
  • scope limited by applet elements, buttons, object elements, marquees, table cells, and table captions
  • formatting restored when entering other elements: search for "Reconstruct the active formatting elements" in [5]
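
The bullets above can be illustrated with a much simplified sketch. It assumes elements are identified by tag name only (the spec tracks actual element nodes), maps "buttons" and "table cells" to the tag names button, td and th, and uses hypothetical helper names (noteStartTag, formattingToReconstruct). It only computes which formatting elements would need to be reopened and is not the full algorithm from the spec.

```javascript
// Formatting element names as listed in the notes above.
const FORMATTING = new Set(['a', 'b', 'big', 'code', 'em', 'font', 'i',
  'nobr', 's', 'small', 'strike', 'strong', 'tt', 'u']);

// Elements that limit formatting scope, per the notes above; the concrete
// tag names for "buttons" and "table cells" are assumptions.
const SCOPE_MARKERS = new Set(['applet', 'button', 'object', 'marquee',
  'td', 'th', 'caption']);

// Sentinel pushed onto the active formatting list at a scope boundary.
const MARKER = Symbol('marker');

// Record a start tag in the list of active formatting elements.
function noteStartTag(activeFormatting, name) {
  if (SCOPE_MARKERS.has(name)) {
    activeFormatting.push(MARKER);
  } else if (FORMATTING.has(name)) {
    activeFormatting.push(name);
  }
}

// Walk back from the end of the list until a marker or a still-open element
// is found; everything after that point needs to be reopened, in order.
function formattingToReconstruct(activeFormatting, openElements) {
  const pending = [];
  for (let i = activeFormatting.length - 1; i >= 0; i--) {
    const entry = activeFormatting[i];
    if (entry === MARKER || openElements.includes(entry)) {
      break;
    }
    pending.unshift(entry);
  }
  return pending;
}

// <p><b><i>one</p> followed by new content: b and i were closed implicitly
// with the paragraph, so both would be reopened.
console.log(formattingToReconstruct(['b', 'i'], ['html', 'body', 'p']));

// A table cell limits the scope: only formatting opened inside it counts.
const active = [];
['b', 'td', 'i'].forEach((name) => noteStartTag(active, name));
console.log(formattingToReconstruct(active, ['html', 'body', 'table', 'tr', 'td']));
```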

Other UI stuff

Test cases

The parser is being written against the MediaWiki parser test suite (currently a little more than 660 test cases). Next up will be round-trip testing and running through dumps. See also Parsoid/test cases.

Fun with templates and parser functions

Potential future server-side DOM processing stuff

ECMAScript & co

Code review and process

Community

Legalese

Although I work for the Wikimedia Foundation, contributions under this account do not necessarily represent the actions or views of the Foundation unless expressly stated otherwise. For example, edits to articles or uploads of other media are done in my individual, personal capacity unless otherwise stated.