Parsoid/C++

Libraries

peg/leg is a parser generator that emits plain C. It supports the same grammar features as PEG.js, and actions can contain C++ code.
HTML5 parser (we actually only need the tree builder) and DOM library candidates:

Webkit

Advantages: HTML5 tree builder, reasonable token objects, JS runtime integration. Disadvantages: complex integration and build system.

The parser code is at [1]. We'd need to write a wrapper similar to HTMLDocumentParser.

Related projects:

PhantomJS embeds a headless WebKit that supports loading web pages, running any JS embedded in them, and exporting the resulting DOM. It performs full rendering and can save screenshots or PDFs of pages.
Webkitdriver is a headless port which cuts out the rendering portion (and replaces it with stubs).

Gecko

Related projects:

Crowbar is a Gecko wrapper without rendering

TODO:

We should investigate whether we can extract the required pieces from Gecko and use them independently.

libhubbub is a simple stand-alone HTML5 parser written for the NetSurf browser project: 18k lines of simple C. Its example libxml DOM integration would be a good starting point for a minimalist solution that still provides libxml features like XSLT and XPath.

libxml is the native XML binding for PHP and has bindings to any language imaginable including Lua and V8/node, which makes it very attractive for a highly portable representation of the DOM.

Work on the libhubbub tree builder integration is done in the wmf-parsoid-libhubbub branch on GitHub.

Memory management

Token chunks are cached and shared between concurrent expansion threads, so a mechanism like refcounting would be needed. This documentation about refcounted pointers in WebKit is quite interesting: http://www.webkit.org/coding/RefPtr.html. Doing this per-chunk should help to amortize the overheads of thread-safe refcounting.
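The per-chunk idea can be sketched with the standard library instead of WebKit's RefPtr. This is a minimal sketch, not the Parsoid design: Token, TokenChunk, ChunkRef and makeChunk are hypothetical names, and std::shared_ptr is swapped in because its reference count is already updated atomically. One counter covers a whole vector of tokens, so the atomic increment and decrement are paid once per chunk rather than once per token.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token type; the real token layout is not specified here.
struct Token {
    std::string type;
    std::string text;
};

// A chunk groups many tokens behind a single reference count.
struct TokenChunk {
    std::vector<Token> tokens;
};

// Shared, read-only handle. Copying it bumps the atomic count in the
// shared_ptr control block; the chunk is freed when the last copy dies.
using ChunkRef = std::shared_ptr<const TokenChunk>;

ChunkRef makeChunk(std::vector<Token> tokens) {
    return std::make_shared<TokenChunk>(TokenChunk{std::move(tokens)});
}

// Example consumer: walks several shared chunks without copying tokens.
std::size_t countTokens(const std::vector<ChunkRef>& chunks) {
    std::size_t total = 0;
    for (const ChunkRef& chunk : chunks) {
        total += chunk->tokens.size();
    }
    return total;
}
```

Because the chunk is exposed as const, concurrent expansion threads can read it without further locking; only the handle copies touch the shared counter.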

Thread architecture and PHP interfacing

We would like to split parser execution into at least three threads: one each for the tokenizer, the token stream transforms and the tree builder. Token stream transforms can be parallelized further. These threads could be started on demand for each call to the parser, or kept around in a thread pool. PHP extensions can register a module setup function which is called when Apache starts up, and can allocate SAPI-global state that is preserved across requests [2][3]. This could be used to set up a Parsoid thread pool per SAPI.
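The three-stage split could look roughly like the following sketch. Everything here is an assumption for illustration: Channel and runPipeline are hypothetical names, the stages are stand-ins (split on spaces, upper-case, collect), and threads are started on demand rather than taken from a pool.

```cpp
#include <cctype>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Blocking queue linking two stages; close() marks the end of the stream.
template <typename T>
class Channel {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(std::move(value));
        }
        ready_.notify_one();
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
    // Blocks until an item arrives; returns false once closed and drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<T> items_;
    bool closed_ = false;
};

std::vector<std::string> runPipeline(const std::string& input) {
    Channel<std::string> tokens;
    Channel<std::string> transformed;
    std::vector<std::string> tree;  // written only by the builder thread

    std::thread tokenizer([&] {
        std::string current;
        for (char c : input) {
            if (c != ' ') { current.push_back(c); continue; }
            if (!current.empty()) tokens.push(std::move(current));
            current.clear();
        }
        if (!current.empty()) tokens.push(std::move(current));
        tokens.close();
    });
    std::thread transformer([&] {
        std::string tok;
        while (tokens.pop(tok)) {
            for (char& c : tok) {
                c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
            }
            transformed.push(std::move(tok));
        }
        transformed.close();
    });
    std::thread builder([&] {
        std::string tok;
        while (transformed.pop(tok)) tree.push_back(std::move(tok));
    });

    tokenizer.join();
    transformer.join();
    builder.join();
    return tree;
}
```

With a single transformer thread the output order matches the input order. Running several transformer threads in parallel would need sequence numbers or per-chunk ordering to keep that property.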

PHP generally supports synchronous callbacks into the interpreter (examples: luasandbox.c,[4],[5],[6]). Callbacks need to be performed in a single-threaded fashion. Asynchronous callbacks (with the main PHP thread running in parallel) are not supported, as all internal state (memory allocation etc.) assumes single-threaded execution. Simple asynchronous signaling can still be performed via the socket-like stream API or file descriptors.

Error handling in PHP is normally done via longjmp, which should be avoided since this would leave the parser in an undefined state. It might be possible to wrap all callbacks into try/catch and register error handlers so that the longjmp is avoided.

For the parser, template source retrieval would be the main application for parallel IO. The need for this can be reduced a lot by passing the parser the source of included templates up front: the list is available in the links table, and the sources can be retrieved with a batch lookup from memcached. Only cache misses would then trigger the sequential DB retrieval.
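The batch-then-fallback lookup can be sketched as below. This is an illustration under assumptions: fetchTemplateSources, BatchCacheGet and DbGet are hypothetical names, and the two backends are passed in as plain callables rather than real memcached or database clients.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

using SourceMap = std::map<std::string, std::string>;

// Hypothetical backends: one batched cache lookup that returns only the
// hits, and one per-title database read.
using BatchCacheGet = std::function<SourceMap(const std::vector<std::string>&)>;
using DbGet = std::function<std::string(const std::string&)>;

SourceMap fetchTemplateSources(const std::vector<std::string>& titles,
                               const BatchCacheGet& cacheGet,
                               const DbGet& dbGet) {
    // One round trip covers every title known from the links table.
    SourceMap sources = cacheGet(titles);
    // Only the misses fall back to sequential database retrieval.
    for (const std::string& title : titles) {
        if (sources.find(title) == sources.end()) {
            sources[title] = dbGet(title);
        }
    }
    return sources;
}
```

The resulting map would then be handed to the parser before expansion starts, so that no IO is needed during expansion for the common case.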

Bindings to other languages

The basic options are hand-written bindings for PHP vs. autogenerated SWIG bindings for several languages. Most of the interface will be about callbacks back into PHP, which is not directly supported by SWIG, so the value of SWIG bindings would be very limited in this case. A careful definition of a narrow API helps to keep the interface implementation manageable.
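A narrow callback interface might be as small as one struct of function pointers plus an opaque context. The sketch below is hypothetical throughout (HostCallbacks, fetchTemplate and Parser are invented names, not an existing Parsoid or PHP API); it only illustrates how few entry points a hand-written binding would have to implement.

```cpp
#include <string>

// The host (for example a PHP extension) fills in this struct. The context
// pointer is opaque to the parser and passed back on every call.
struct HostCallbacks {
    void* context;
    // Writes the template source into *out; returns false if not found.
    bool (*fetchTemplate)(void* context, const char* title, std::string* out);
};

class Parser {
public:
    explicit Parser(HostCallbacks callbacks) : callbacks_(callbacks) {}

    // Stand-in for real expansion: asks the host for the source and
    // returns a placeholder when the host cannot provide it.
    std::string expand(const std::string& title) const {
        std::string source;
        if (callbacks_.fetchTemplate == nullptr ||
            !callbacks_.fetchTemplate(callbacks_.context, title.c_str(), &source)) {
            return "[[missing: " + title + "]]";
        }
        return source;
    }

private:
    HostCallbacks callbacks_;
};
```

Each additional callback adds one function pointer here and one wrapper per host language, which is the cost a narrow API keeps down.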

Interesting libraries

folly has some goodies like folly::fbstring and folly::dynamic for relatively natural JSON data handling
Luabind: quite elegant Lua bindings
boost::asio is a nice C++ event loop library, including running of callbacks on multiple cores
http://pugixml.org/: Memory efficient XML C++ library with a relatively nice interface (you can do things like node.parent().parent().attribute("id").value()). No XML namespace support though. 8 words per empty DOM node vs. 15 for libxml2.
http://utfcpp.sourceforge.net/: UTF8 iteration, append etc. on std::string. Blog post on the approach. Alternative: Boost.Locale (ICU bindings) or ICU directly. The latter two are mostly multi-byte, which tends to be less efficient if you are mostly interested in processing ASCII-delimited UTF8.
Comparison of C++ XML parsers and DOM libraries
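The point about ASCII-delimited UTF8 in the list above can be illustrated without any library. In valid UTF-8 every byte of a multi-byte sequence has its high bit set, so a byte-wise search for an ASCII delimiter can never match inside a character. The helper names below (splitAscii, countCodePoints) are hypothetical, and the sketch assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits on an ASCII delimiter by scanning raw bytes; no decoding needed.
std::vector<std::string> splitAscii(const std::string& utf8, char delimiter) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true) {
        std::size_t pos = utf8.find(delimiter, start);
        if (pos == std::string::npos) {
            parts.push_back(utf8.substr(start));
            break;
        }
        parts.push_back(utf8.substr(start, pos - start));
        start = pos + 1;
    }
    return parts;
}

// Counts code points by skipping continuation bytes (bit pattern 10xxxxxx).
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

Decoding to code points is only needed where the parser actually cares about non-ASCII characters, which is why staying on std::string bytes tends to be cheaper for this workload.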