User:GWicke/oldUserPage
I am Gabriel Wicke, a software developer working for the Wikimedia Foundation. I am currently working on the Parsoid parser for MediaWiki in support of the Visual editor project.
Before joining the Foundation in October 2011, I was a volunteer contributor with a very active period between 2003 and 2005. During this time, I designed and implemented the first version of the Squid cache layer and the MonoBook skin, added the capability to develop JavaScript and CSS on the wiki (user and common scripts/styles), and tweaked the parser to emit slightly less broken output. That was not too successful, which is why I then added HTML Tidy as a post-processor to clean up the mess before output.
The new parser should fix a big part of that problem by design. The challenge now is to support the existing content of a few encyclopedias, including esoteric template trickery, parser functions and extensions, and to make it all round-trip back to wikitext ;)
I am using my user page as a personal bookmark list and note area. If you find anything here that is interesting to you as well, then all the better! If you would like to contact me, you can catch me as gwicke in #mediawiki on irc.freenode.net (US west coast times). You can also send me a mail through a form or at gwicke@wikimedia.org.
General links:
- Current activity in Gerrit
- Etherpad index
- Current engineering report, Parsoid status
- Bugs I am CCed on, high priority bugs, parser bugs, @20%
- JavaScript performance, MVL
- Subpages of this user page
- Load on the Parsoid cluster
Visual Editor
Parser
- Parsoid, the project main page.
- PEG.js documentation
- WikiDom docs, talk and example document
- Parser tests in bugzilla: have, need
HTML 5 parsers
HTML5 parsing overlaps widely with the needs of a wiki parser, and covers many areas of informal grammar and fix-ups. The main differences are in the tokenizer part (different syntax) and in the handling of non-matching elements (ignore vs. show as plain text).
The plan is to convert the PEG parser into a tokenizer (which handles most wiki-specific issues) and use any HTML5-compliant parser as a back-end that builds the (DOM) tree from 'token soup'. If we can get away with an unmodified HTML parser, this will allow reuse of the specification and existing implementations for the back-end.
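As a rough illustration of that split (all names and token shapes here are just a sketch, not actual Parsoid code): the tokenizer only needs to recognize wikitext constructs and emit a flat HTML5-style token stream, while tree construction, including recovery from mis-nesting, is left to the back-end.
// Hypothetical sketch of the intended division of labour (not Parsoid code).
// The PEG tokenizer turns wikitext like "'''bold''' and [[Foo|a link]]"
// into flat HTML5-style 'token soup':
var tokens = [
    { type: 'StartTag', name: 'b', data: [] },
    { type: 'Characters', data: 'bold' },
    { type: 'EndTag', name: 'b' },
    { type: 'Characters', data: ' and ' },
    { type: 'StartTag', name: 'a', data: [['href', './Foo']] },
    { type: 'Characters', data: 'a link' },
    { type: 'EndTag', name: 'a' }
];
// An HTML5-compliant tree builder then consumes the stream and handles tree
// construction (including recovery from mis-nested tags); a stand-in consumer
// is used here just to show the hand-off:
tokens.forEach(function (tok) {
    console.log(tok.type, tok.name || JSON.stringify(tok.data));
});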
HTML5 parsers include (apart from those in browsers):
- Java: http://about.validator.nu/htmlparser/, especially startTag in src/nu/validator/htmlparser/impl/TreeBuilder.java
- Based on Google Web Toolkit, can compile to JavaScript (Debian build script) and C++ (the C++ version is used in Gecko; live JavaScript version)
- PHP, Python and Ruby ports: http://code.google.com/p/html5lib/
- Dom.js, a cleaner JS parser + DOM implementation sponsored by Mozilla; only works on SpiderMonkey (not on node.js) due to its use of proxies, const and other advanced JS features: [1], [2]
- Node HTML5 parser library: https://github.com/aredridel/html5, slower than the Mozilla one according to [3]
Tokenizer interfaces
- uses events module for dispatch to parser; supports streaming sources (EventEmitter)
this.emit('token', tok);
{type: 'Characters', data: c}
{type: 'StartTag', name: 'li', data: [{nodeName: 'attr1', nodeValue: 'attrvalue1'}]}
- direct call to current parser, by insertion mode
insertToken(TEXT, s); insertToken(COMMENT, s); insertToken(TAG, tagname, [['an1', 'av']]); insertToken(ENDTAG, tagname); -> parser(..)
validator.nu htmlparser:
TreeBuilder.startTag(tagName, [[key,value]], selfClosing); TreeBuilder.endTag(tagName); TreeBuilder.comment(commentstring);
Common emit function, passed into the tokenizer constructor: emit(TYPE, "tagname", [[key, value]]). The list of attribute key-value pairs preserves order and duplicate attributes for round-tripping where possible. TYPE is one of TAG, ENDTAG, TEXT, COMMENT, SELFCLOSINGTAG (incomplete list). Source positions would be an interesting addition to enable some degree of reconciliation. A rough sketch of such an emit adapter follows below.
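A minimal sketch of what such an emit adapter could look like, assuming the validator.nu-style TreeBuilder methods listed above (the characters() call for text is an assumption, and makeEmitter is a placeholder name, not actual Parsoid code):
// Maps the common (TYPE, name-or-text, attributes) emit calls onto a
// validator.nu-style tree builder. Attributes stay an ordered
// [[key, value], ...] list so order and duplicates survive for round-tripping.
function makeEmitter(treeBuilder) {
    return function emit(type, data, attribs) {
        switch (type) {
            case 'TAG': treeBuilder.startTag(data, attribs || [], false); break;
            case 'SELFCLOSINGTAG': treeBuilder.startTag(data, attribs || [], true); break;
            case 'ENDTAG': treeBuilder.endTag(data); break;
            case 'TEXT': treeBuilder.characters(data); break;   // assumed method
            case 'COMMENT': treeBuilder.comment(data); break;
        }
    };
}
// The tokenizer would receive this function in its constructor and call e.g.:
// emit('TAG', 'li', [['class', 'foo']]); emit('TEXT', 'item text'); emit('ENDTAG', 'li');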
Wiki-specific parser work
- List of alternative parsers, notes from Berlin Hackathon 2011
- Markup spec/ANTLR/draft and Wikitext-l discussion
- Kiwi grammar
- Python parser by Mozilla.org
- Sweble: in particular sweble-wikitext/swc-parser-lazy/src/main/java/org/sweble/wikitext/lazy/postprocessor and sweble-wikitext/swc-parser-lazy/src/main/autogen/org/sweble/wikitext/lazy/parser (grammar) (and damn those deep hierarchies!!)
- Hook output handling in current parser: bugzilla:8997
- LOLCode version of the MW parser
Differences between Tidy and HTML5 parser behavior
Generally, HTML5 parsers only perform very limited correction of invalid nesting according to the content model. Content in locations where neither inline nor block-level content is legal (for example between 'table' and 'tr' tags) is generally adopted by elements further up in the tree according to a 'foster parent' algorithm. Block-level ('flow' in HTML5 lingo) content where only inline ('phrasing') content is allowed is not corrected at all. Browsers manage to display these unspecified nestings with mostly acceptable results.
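A quick way to see the foster-parenting behavior, assuming a browser whose DOMParser supports 'text/html' (or an equivalent HTML5 DOM environment):
// Stray text between 'table' and 'tr' is not legal table content, so an HTML5
// parser 'foster-parents' it: the text ends up before the table in the DOM.
var doc = new DOMParser().parseFromString(
    '<table>stray text<tr><td>cell</td></tr></table>', 'text/html');
console.log(doc.body.innerHTML);
// -> stray text<table><tbody><tr><td>cell</td></tr></tbody></table>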
Tidy, on the other hand, tries harder to correct content-model violations, sometimes with surprising and not very localized effects if other invalid content (especially content with missing end tags) precedes the mis-nested content.
Examples:
- Block elements in headings are moved after the heading, or joined up with unclosed block elements before the heading
Nuggets from the HTML 5 spec
Formatting elements
- a, b, big, code, em, font, i, nobr, s, small, strike, strong, tt, and u. [4]
- scope limited by applet elements, buttons, object elements, marquees, table cells, and table captions
- formatting restored when entering other elements: search for "Reconstruct the active formatting elements" in [5]; a small illustration follows below
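A small illustration of that reconstruction, again assuming a DOM environment with an HTML5 parser (DOMParser with 'text/html'):
// The unclosed 'b' is an active formatting element; when the parser enters the
// 'p' element, it reconstructs the formatting element inside it.
var doc = new DOMParser().parseFromString('<b>bold <p>still bold', 'text/html');
console.log(doc.body.innerHTML);
// -> <b>bold </b><p><b>still bold</b></p>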
Editor-related bits from the HTML spec
- UndoManager and DOM transaction interface WIP. Mozilla implementation, partial WebKit implementation which was deferred and removed from trunk
- WhatWG Web-Apps standard WIP
Other UI stuff
- CKEditor adds widget support: edit widgets for special content (think thumbnails with captions, templates etc.). They were discussing HTML5+RDFa too, but in the end went with data attributes like this:
<span class="time" data-widget="time" timestamp=1357220873242 utc=false seconds=false>14:47</span>
See some widget example code. They also added support for restricting possible HTML content models by context, and filters that enforce this for pasted HTML from external sources. CKEditor was just chosen as the editor for Drupal.
- Visual diff in localwiki, using a DaisyDiff Java diff service to produce diff annotations. Contact: Philip and Mike on #localwiki.
- RDFa/JSON-LD editing: https://github.com/bergie/vie and create.js
Test cases
The parser is being written against the MediaWiki parser test suite (currently a little more than 660 test cases). Next up will be round-trip testing and running through dumps. See also Parsoid/test cases.
Fun with templates and parser functions
- The Qif template: the historical beginnings of parser functions, using default parameters to implement conditionals.
- Help:Extension:ParserFunctions
- en:Template:Selfsubst - self-replicating subst template; example for subst functionality not readily moved to the editor.
- Extension:TemplateProfiler, sample output by User:Zocky
- en:Template:Get regnum()
Potential future server-side DOM processing stuff
- Plans for the Lua scripting extension: currently produces text, essentially Lua-programmable parser functions
- TAL: Spec, http://code.google.com/p/jstal/; Very similar: Genshi, AngularJS, Plates, emberjs, knockoutjs, Jade
- Resource limits and sandboxing:
- HTML5 WebWorkers can be used to provide resource limits and sandboxing. Data can be passed in using zero-copy binary structure passing (see the first sketch after this list).
- Alternative: node.js runInNewContext, sandboxed and time-limited JS execution using V8 (see the second sketch after this list). Checking the code in node is likely a good template for wrapping V8 from C/C++ with similar limits.
- non-eval JS expression evaluation: http://silentmatt.com/javascript-expression-evaluator/
- jsdom supports the execution of inline JavaScript in onload attributes
- namespace scoping similar to TAL limits possible interactions, which is a good thing for DOM subtree caching and general sanity; the source of JavaScript functions can be retrieved using toString, which could potentially be used to implement scoping that follows the DOM tree by rewriting the source
- https://github.com/aptana/Jaxer/tree/master/server: Gecko DOM with SpiderMonkey JS engine, server-side. Similar, but older: http://simile.mit.edu/wiki/Crowbar
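A minimal sketch of the WebWorker idea from the list above; the file name and the processing step are made up, and the hard kill via terminate() stands in for a real resource limit:
// main.js: hand a binary buffer to the worker without copying it (the
// ArrayBuffer is listed as transferable, so ownership moves to the worker),
// and enforce a crude time budget by terminating the worker from outside.
var worker = new Worker('dom-task.js');   // hypothetical worker script
var buf = new ArrayBuffer(1024 * 1024);
var watchdog = setTimeout(function () { worker.terminate(); }, 1000);
worker.onmessage = function (e) {
    clearTimeout(watchdog);
    console.log('result:', e.data);
};
worker.postMessage({ input: buf }, [buf]);

// dom-task.js: the worker only sees what was posted to it.
onmessage = function (e) {
    var view = new Uint8Array(e.data.input);
    postMessage(view.length);   // placeholder result
};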
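And a minimal sketch of the node.js runInNewContext variant, assuming a Node version whose vm module accepts a timeout option (older versions need an external watchdog instead):
// Run untrusted code in a fresh V8 context: it only sees the properties of the
// sandbox object, and is aborted if it exceeds the time budget.
var vm = require('vm');
var sandbox = { result: null };
try {
    vm.runInNewContext('result = 6 * 7;', sandbox, { timeout: 100 });   // ms
    console.log(sandbox.result);   // -> 42
} catch (e) {
    console.log('script rejected or timed out:', e.message);
}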
ECMAScript & co
- IcedCoffeeScript: async extension for CoffeeScript
- node-fibers: Lightweight userspace threads for node
Code review and process
Community
Legalese
Although I work for the Wikimedia Foundation, contributions under this account do not necessarily represent the actions or views of the Foundation unless expressly stated otherwise. For example, edits to articles or uploads of other media are done in my individual, personal capacity unless otherwise stated.