Parsoid/Parser Unification/Updates

Project Updates

  • February 2024 (planned):
    • Parsoid read views enabled by default for talk pages on wikitech.
    • Parsoid read views enabled by default for main namespace and talk pages on officewiki.
  • November 2023:
    • Individual user opt-in to Parsoid read views for articles was deployed as part of the ParserMigration extension.
  • July 2023:
  • Feb 2023:
  • Dec 2022:
  • Oct 2022:
  • Sept 2022:
  • July 2022:
    • Started work on adding i18n and l10n support in Parsoid.
    • (incomplete: to be updated)
  • 2022 (Jan - June):
    • Started evaluating size differences between Parsoid HTML and core HTML, and identifying which parts of Parsoid HTML (tags, annotations, attributes) to strip so that Parsoid-HTML read views don't seriously affect bandwidth and client rendering latencies. Acceptable divergences will be selected in conversations with the Performance team.
    • Started work on making Kartographer compatible with Parsoid.
    • Updated Graph to be compatible with Parsoid.
    • Updated the MediaWiki core ParserTest runner to support running Parsoid tests in CI and development. This lets extensions that target Parsoid (especially those that operate on wikitext) test their implementation via parser tests.
    • (incomplete: to be updated with changes in core repo and any other extensions)
  • 2021:
    • Made the <translate> extension functional with Parsoid
    • Substantial performance work to reduce wt2html transformation latencies
    • Switched a few wikis (testwiki, test2wiki, mediawikiwiki, officewiki, wikitech, group0) to use Parsoid-style HTML for media wikitext. Complete rollout to all wikis blocked on ironing out a number of other compatibility issues.
    • Experimental / prototyping work to switch to the Dodo DOM library from native PHP DOM -- project put on hold indefinitely after running into performance issues.
    • Many bug fixes, including fixes for edge-case incompatibilities between Parsoid and the legacy parser.
    • (incomplete: to be updated with changes in core repo and any other extensions)
  • December 2020:
    • Finished addressing the bulk of the functionality differences between the Parsoid and core versions of the Cite implementation
    • Fixed parsing differences between Parsoid and core in use of templates for table-cell attributes and the like
    • Started migrating core output for media wikitext to use Parsoid-style output to reduce output differences between Parsoid and core
    • Updated the parsertests framework to enable extension implementations to be tested against Parsoid. Finishing this lets us move extension code out of the Parsoid repository into the extension repositories, and lets other extensions start adding Parsoid support and running tests with Parsoid
    • Narrowed our original plan of making all extensions Parsoid-compatible to a smaller subset. We will initially target only those extensions that implement tag hooks OR use parser hooks. Those that simply use public parser methods could continue to use the core parser for a while
  • November 2020:
    • Introduced uniform error handling for extensions with boilerplate code handled by Parsoid
    • Identified resource-module-related issues in Parsoid output that result in rendering differences between Parsoid and core output
    • Made ImageMap extension Parsoid-compatible
  • October 2020:
    • Several technical debt fixes ending in using a single document per request with document fragments for nested pipeline parses
    • Ongoing fixes to Parsoid's Cite implementation
    • Identified CSS fixes to reduce rendering differences between Parsoid and core output
    • Ongoing syncing and consultation with the Platform Engineering Team to upgrade the ParserCache infrastructure to accommodate Parsoid use cases and Parsoid clients in the future
  • September 2020:
    • We continued outreach about Parsoid's Extension API to gather feedback (emails on the wikitech-l and mediawiki-l lists).
    • We have been continuing to fix functionality gaps between Parsoid's Cite implementation and the default Cite implementation.
  • August 2020:
    • We have been publishing results of weekly visual diff runs comparing Parsoid rendering and core parser rendering.
    • We filed a TechCom RFC for Parsoid's Extension API. We also presented a Tech Talk about this.
    • Fixed some performance bugs which greatly reduced out-of-memory errors seen in production.
  • July 2020:
    • We upgraded the visual diffing infrastructure and started initial test runs comparing Parsoid rendering and core rendering on a 25K sample of pages from a small set of wikis. We plan to run these tests every week and monitor progress as we fix rendering and functionality gaps. Results are accessible at http://parsoid-vs-core.wmflabs.org/.
    • We prepared Parsoid for the MediaWiki 1.35 LTS release so that Parsoid and VisualEditor can be used out of the box.
    • Reduced impedance mismatches between Parsoid and the core parser in how they apply HTML4 notions of block and inline tags.
    • Work in Progress:
      • Integrating parser test infrastructure between Parsoid and core.
      • Enabling extension tests to be run against Parsoid.
  • April - June 2020:
    • We started addressing functionality gaps in Parsoid (especially error handling in the Cite extension) and rendering differences (output for media wikitext).
    • We fine-tuned the Parsoid Extension API further in preparation for wider consultation in the coming months.
    • We implemented a registration mechanism for extensions to hook into Parsoid.
  • Jan - March 2020:
    • We addressed a range of technical debt incurred during the porting and integrated Parsoid more closely with MediaWiki core.
    • Starting March 2020, Parsoid is deployed as part of the weekly MediaWiki train.
    • We also started drafting a Parsoid Extension API for extensions to hook directly into Parsoid.
  • 2019:
    • Around the end of January, we started porting Parsoid to PHP. By the end of the year, we had completed the project by deploying Parsoid/PHP to the Wikimedia cluster and serving all traffic from it.
    • This blog post provides a good overview of the porting project.
  • 2018:
    • Early experimentation, prototyping and preparation to port Parsoid from JavaScript to PHP. This was low-key background work.
  • 2015 - 2018:
    • HTML4 Tidy was replaced with HTML5 RemexHtml. Besides upgrading the MediaWiki infrastructure, this eliminated one of the biggest sources of rendering differences between Parsoid and the core parser. This blog post is a good overview of the reasons for the replacement and how it was carried out.