User:GWicke/oldUserPage
I am Gabriel Wicke, a software developer working for the Wikimedia Foundation. I am currently working on the Parsoid parser for MediaWiki in support of the Visual editor project.
Before joining the Foundation in October 2011, I was a volunteer contributor with a very active period between 2003 and 2005. During this time, I designed and implemented the first version of the Squid cache layer and the MonoBook skin, added the capability to develop JavaScript and CSS on the wiki (user and common scripts/styles), and tweaked the parser to emit slightly less broken output. That was not too successful, which is why I then added HTML Tidy as a post-processor to clean up the mess before output.
The new parser should fix a big part of that problem by design. The challenge now is to support the existing content of a few encyclopedias, including esoteric template trickery, parser functions and extensions, and to make it all round-trip back to wikitext ;)
I am using my user page as a personal bookmark list and note area. If you find anything here that is interesting to you as well, then all the better! If you would like to contact me, you can catch me as gwicke in #mediawiki on irc.freenode.net (US west coast times). You can also send me a mail through a form or at gwicke@wikimedia.org.
General links:
- Current activity in Gerrit
- Etherpad index
- Current engineering report, Parsoid status
- Bugs I am CCed on, high priority bugs, parser bugs, @20%
- JavaScript performance, MVL
- Subpages of this user page
- Load on the Parsoid cluster
Visual Editor
Parser
- Parsoid, the project main page.
- PEG.js documentation
- WikiDom docs, talk and example document
- Parser tests in bugzilla: have, need
HTML 5 parsers
HTML5 parsing overlaps widely with the needs of a wiki parser, and covers many areas of informal grammar and fix-ups. The main differences are in the tokenizer part (different syntax) and in the handling of non-matching elements (ignore vs. show as plain text).
The plan is to convert the PEG parser into a tokenizer (which handles most wiki-specific issues) and use any HTML5-compliant parser as a back-end that builds the (DOM) tree from 'token soup'. If we can get away with an unmodified HTML parser, this will allow reuse of the specification and existing implementations for the back-end.
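As a rough illustration of that split (all names and token shapes here are just a sketch, not actual Parsoid code): the tokenizer only needs to recognize wikitext constructs and emit a flat HTML5-style token stream, while tree construction, including recovery from mis-nesting, is left to the back-end.
// Hypothetical sketch of the intended division of labour (not Parsoid code).
// The PEG tokenizer turns wikitext like "'''bold''' and [[Foo|a link]]"
// into flat HTML5-style 'token soup':
var tokens = [
    { type: 'StartTag', name: 'b', data: [] },
    { type: 'Characters', data: 'bold' },
    { type: 'EndTag', name: 'b' },
    { type: 'Characters', data: ' and ' },
    { type: 'StartTag', name: 'a', data: [['href', './Foo']] },
    { type: 'Characters', data: 'a link' },
    { type: 'EndTag', name: 'a' }
];
// An HTML5-compliant tree builder then consumes the stream and handles tree
// construction (including recovery from mis-nested tags); a stand-in consumer
// is used here just to show the hand-off:
tokens.forEach(function (tok) {
    console.log(tok.type, tok.name || JSON.stringify(tok.data));
});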
HTML5 parsers include (apart from those in browsers):
- Java: http://about.validator.nu/htmlparser/, especially startTag in src/nu/validator/htmlparser/impl/TreeBuilder.java
- Based on Google Web Toolkit, can compile to JavaScript (Debian build script) and C++ (the C++ version is used in Gecko; live JavaScript version)
- PHP, Python and Ruby ports: http://code.google.com/p/html5lib/
- Dom.js, a cleaner JS parser + DOM implementation sponsored by Mozilla; only works on SpiderMonkey (not on node.js) due to its use of proxies, const and other advanced JS features: [1], [2]
- Node HTML5 parser library: https://github.com/aredridel/html5, slower than the Mozilla one according to [3]
Tokenizer interfaces
- uses events module for dispatch to parser; supports streaming sources (EventEmitter)
this.emit('token', tok);
{type: 'Characters', data: c}
{type: 'StartTag', name: 'li', data: [{nodeName: 'attr1', nodeValue: 'attrvalue1'}]}
- direct call to current parser, by insertion mode
insertToken(TEXT, s); insertToken(COMMENT, s); insertToken(TAG, tagname, [['an1', 'av']]); insertToken(ENDTAG, tagname); -> parser(..)
validator.nu htmlparser:
TreeBuilder.startTag(tagName, [[key,value]], selfClosing); TreeBuilder.endTag(tagName); TreeBuilder.comment(commentstring);
Common emit function, passed into the tokenizer constructor: emit(TYPE, "tagname", [[key, value]]). The list of attribute key-value pairs preserves order and duplicate attributes for round-tripping where possible. TYPE is one of TAG, ENDTAG, TEXT, COMMENT, SELFCLOSINGTAG (incomplete list). Source positions would be an interesting addition to enable some degree of reconciliation. A rough sketch of such an emit adapter follows below.
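A minimal sketch of what such an emit adapter could look like, assuming the validator.nu-style TreeBuilder methods listed above (the characters() call for text is an assumption, and makeEmitter is a placeholder name, not actual Parsoid code):
// Maps the common (TYPE, name-or-text, attributes) emit calls onto a
// validator.nu-style tree builder. Attributes stay an ordered
// [[key, value], ...] list so order and duplicates survive for round-tripping.
function makeEmitter(treeBuilder) {
    return function emit(type, data, attribs) {
        switch (type) {
            case 'TAG': treeBuilder.startTag(data, attribs || [], false); break;
            case 'SELFCLOSINGTAG': treeBuilder.startTag(data, attribs || [], true); break;
            case 'ENDTAG': treeBuilder.endTag(data); break;
            case 'TEXT': treeBuilder.characters(data); break;   // assumed method
            case 'COMMENT': treeBuilder.comment(data); break;
        }
    };
}
// The tokenizer would receive this function in its constructor and call e.g.:
// emit('TAG', 'li', [['class', 'foo']]); emit('TEXT', 'item text'); emit('ENDTAG', 'li');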
Wiki-specific parser work
- List of alternative parsers, notes from Berlin Hackathon 2011
- Markup spec/ANTLR/draft and Wikitext-l discussion
- Kiwi grammar
- Python parser by Mozilla.org
- Sweble: in particular sweble-wikitext/swc-parser-lazy/src/main/java/org/sweble/wikitext/lazy/postprocessor and sweble-wikitext/swc-parser-lazy/src/main/autogen/org/sweble/wikitext/lazy/parser (grammar) (and damn those deep hierarchies!!)
- Hook output handling in current parser: bugzilla:8997
- LOLCode version of the MW parser
Differences between Tidy and HTML5 parser behavior
Generally, HTML5 parsers only perform very limited correction of invalid nesting according to the content model. Content in locations where neither inline nor block-level content is legal (for example between 'table' and 'tr' tags) is generally adopted by elements further up in the tree according to a 'foster parent' algorithm. Block-level ('flow' in HTML5 lingo) content where only inline ('phrasing') content is allowed is not corrected at all. Browsers manage to display these unspecified nestings with mostly acceptable results.
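A quick way to see the foster-parenting behavior, assuming a browser whose DOMParser supports 'text/html' (or an equivalent HTML5 DOM environment):
// Stray text between 'table' and 'tr' is not legal table content, so an HTML5
// parser 'foster-parents' it: the text ends up before the table in the DOM.
var doc = new DOMParser().parseFromString(
    '<table>stray text<tr><td>cell</td></tr></table>', 'text/html');
console.log(doc.body.innerHTML);
// -> stray text<table><tbody><tr><td>cell</td></tr></tbody></table>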
Tidy, on the other hand, tries harder to correct content-model violations, sometimes with surprising and not very localized effects if other invalid content (especially content with missing end tags) precedes the mis-nested content.
Examples:
- Block elements in headings are moved after the heading, or joined up with unclosed block elements before the heading
Nuggets from the HTML 5 spec
Formatting elements
- a, b, big, code, em, font, i, nobr, s, small, strike, strong, tt, and u. [4]
- scope limited by applet elements, buttons, object elements, marquees, table cells, and table captions
- formatting restored when entering other elements: search for "Reconstruct the active formatting elements" in [5]; a small illustration follows below
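A small illustration of that reconstruction, again assuming a DOM environment with an HTML5 parser (DOMParser with 'text/html'):
// The unclosed 'b' is an active formatting element; when the parser enters the
// 'p' element, it reconstructs the formatting element inside it.
var doc = new DOMParser().parseFromString('<b>bold <p>still bold', 'text/html');
console.log(doc.body.innerHTML);
// -> <b>bold </b><p><b>still bold</b></p>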
Editor-related bits from the HTML spec
- UndoManager and DOM transaction interface WIP. Mozilla implementation, partial WebKit implementation which was deferred and removed from trunk
- WhatWG Web-Apps standard WIP
Other UI stuff
- CKEditor adds widget support: edit widgets for special content (think thumbnails with captions, templates etc.). They were discussing HTML5+RDFa too, but in the end went with data attributes like this:
<span class="time" data-widget="time" timestamp=1357220873242 utc=false seconds=false>14:47</span>
See some widget example code. They also added support for restricting possible HTML content models by context, and filters that enforce this for pasted HTML from external sources. CKEditor was just chosen as the editor for Drupal.
- Visual diff in localwiki, using a DaisyDiff Java diff service to produce diff annotations. Contact: Philip and Mike on #localwiki.
- RDFa/JSON-LD editing: https://github.com/bergie/vie and create.js
Test cases
The parser is being written against the MediaWiki parser test suite (currently a little more than 660 test cases). Next up will be round-trip testing and running through dumps. See also Parsoid/test cases.
Fun with templates and parser functions
- The Qif template: the historical beginnings of parser functions, using default parameters to implement conditionals.
- Help:Extension:ParserFunctions
- en:Template:Selfsubst - self-replicating subst template; example for subst functionality not readily moved to the editor.
- Extension:TemplateProfiler, sample output by User:Zocky
- en:Template:Get regnum()
Potential future server-side DOM processing stuff
- Plans for the Lua scripting extension: currently produces text, essentially Lua-programmable parser functions
- TAL: Spec, http://code.google.com/p/jstal/; Very similar: Genshi, AngularJS, Plates, emberjs, knockoutjs, Jade
- Resource limits and sandboxing:
- HTML5 WebWorkers can be used to provide resource limits and sandboxing. Data can be passed in using zero-copy binary structure passing (see the first sketch after this list).
- Alternative: node.js runInNewContext, sandboxed and time-limited JS execution using V8 (see the second sketch after this list). Checking the code in node is likely a good template for wrapping V8 from C/C++ with similar limits.
- non-eval JS expression evaluation: http://silentmatt.com/javascript-expression-evaluator/
- jsdom supports the execution of inline JavaScript in onload attributes
- namespace scoping similar to TAL limits possible interactions, which is a good thing for DOM subtree caching and general sanity; the source of JavaScript functions can be retrieved using toString, which could potentially be used to implement scoping that follows the DOM tree by rewriting the source
- https://github.com/aptana/Jaxer/tree/master/server: Gecko DOM with SpiderMonkey JS engine, server-side. Similar, but older: http://simile.mit.edu/wiki/Crowbar
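A minimal sketch of the WebWorker idea from the list above; the file name and the processing step are made up, and the hard kill via terminate() stands in for a real resource limit:
// main.js: hand a binary buffer to the worker without copying it (the
// ArrayBuffer is listed as transferable, so ownership moves to the worker),
// and enforce a crude time budget by terminating the worker from outside.
var worker = new Worker('dom-task.js');   // hypothetical worker script
var buf = new ArrayBuffer(1024 * 1024);
var watchdog = setTimeout(function () { worker.terminate(); }, 1000);
worker.onmessage = function (e) {
    clearTimeout(watchdog);
    console.log('result:', e.data);
};
worker.postMessage({ input: buf }, [buf]);

// dom-task.js: the worker only sees what was posted to it.
onmessage = function (e) {
    var view = new Uint8Array(e.data.input);
    postMessage(view.length);   // placeholder result
};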
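And a minimal sketch of the node.js runInNewContext variant, assuming a Node version whose vm module accepts a timeout option (older versions need an external watchdog instead):
// Run untrusted code in a fresh V8 context: it only sees the properties of the
// sandbox object, and is aborted if it exceeds the time budget.
var vm = require('vm');
var sandbox = { result: null };
try {
    vm.runInNewContext('result = 6 * 7;', sandbox, { timeout: 100 });   // ms
    console.log(sandbox.result);   // -> 42
} catch (e) {
    console.log('script rejected or timed out:', e.message);
}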
ECMAScript & co
- IcedCoffeeScript: async extension for CoffeeScript
- node-fibers: Lightweight userspace threads for node
Code review and process
Community
Legalese
Although I work for the Wikimedia Foundation, contributions under this account do not necessarily represent the actions or views of the Foundation unless expressly stated otherwise. For example, edits to articles or uploads of other media are done in my individual, personal capacity unless otherwise stated.