Alternative parsers
This page is a compilation of links, descriptions, and status reports of the various alternative MediaWiki parsers—that is, programs and projects, other than MediaWiki itself, which are able or intended to translate MediaWiki's text markup syntax into something else. Some of these have quite narrow purposes, while others are possible contenders for replacing the somewhat labyrinthine code that currently drives MediaWiki itself.
Many of the projects linked here are likely to be out of date, under-maintained, or even abandoned. But in the interest of not duplicating the same work over and over, it seemed sensible to collect what is "out there" in one place. Although many alternative parsers exist, almost no wiki site is powered by an unofficial parser; the exception is wikitextparser, which powers the OpenTTD wiki through TrueWiki.
Parsers that build an abstract syntax tree (AST) and provide access to it are listed under #Parsers providing an AST; parsers that don't build an AST but extract some information are listed under #Parsers extracting some information; the rest of the parsers are listed under #Other parsers.
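To make the second category concrete, here is a toy sketch of a parser that extracts some information without building an AST. It is not taken from any project listed on this page; the function name and the regular expression are hypothetical, and it only handles simple `[[target]]` and `[[target|label]]` links.

```python
import re

# Toy pattern: simple [[target]], [[target#anchor]] and [[target|label]] links.
# Real wikitext (nested templates, <nowiki>, comments, file captions with
# inner links) needs a proper parser; this is only an illustration.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

def extract_links_and_categories(wikitext):
    """Return (links, categories) found in wikitext, in document order."""
    links, categories = [], []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        if target.lower().startswith("category:"):
            # Keep only the category name, without the namespace prefix.
            categories.append(target[len("category:"):].strip())
        else:
            links.append(target)
    return links, categories

sample = "'''Foo''' is a [[bar|kind of bar]] related to [[Baz]].\n[[Category:Things]]"
links, categories = extract_links_and_categories(sample)
print(links)
print(categories)
```

An AST-providing parser would instead return a tree in which the bold text, the links, and the category are nodes that can be navigated and, in some libraries, serialized back to markup.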
Parsers providing an AST
Free software
Name and link | Principal author(s) | Language | Input | Output | Complete implementation | Can convert output back to markup | Comments / other info | License |
---|---|---|---|---|---|---|---|---|
Parsoid | Gabriel Wicke and the Parsoid / Visual editor team | PEG / PHP (formerly Node.js) | markup, XML dumps, test cases | tokens, HTML5 DOM with RDFa and round-trip data | Yes | Yes | Fully-featured round-tripping parser/runtime that powers the Visual editor on Wikipedia. Work ongoing to provide a HTML-only read / edit interface, and later to become the default parser for MediaWiki. See roadmap. | GPLv2+ |
DizzyLogic Wiki Parser | Dizzy Logic | C++ | XML dumps | Syntax tree in XML, plain text | No | No | Fast datamining-oriented parser for English Wikipedia. Capable of processing all of English Wikipedia into plain text and XML in 2-3 hours on a modern processor. Convenient graphical interface. Windows installer available (64-bit). | MIT license |
mwparserfromhell | The Earwig | Python | markup | AST | almost | Yes | A Python library to convert Wiki markup to a navigable string, which can be used to examine and manipulate templates. Written in pure Python, compatible with Python 2.7 and 3, and no dependencies. | MIT License |
wtf_wikipedia | Spencer Kelly | JavaScript | markup | JSON | almost | No | Supports recursive links and templates, parses infoboxes and links, resolves special templates, parses images and categories. Runs server-side and in the browser. | MIT |
wikiapi | kanashimi | JavaScript | markup | JavaScript native object | almost | Yes | Parses sections, templates with parameters, links, images and categories, wiki-table to JS array or JS array to wiki-table, and many more. You may modify parts of the wikitext, then regenerate the page just using parsed.toString(). Runs on Node.js and in the browser. | MIT |
Sweble Wikitext Parser | Hannes Dohrn | Java | markup | AST, XML, HTML | almost | ? | Claims to be very thorough. There are three papers surrounding the Sweble Wikitext Parser. | Apache License 2.0 |
wikitextparser | 5j9 | Python | markup | AST | almost | Yes | Provides several accessor methods in an object tree to navigate to structural elements like sections, tables, links etc. Supports extracting table data as list of lists. Available via pip, supports Python 3. | GPLv3 |
mwlib | PediaPress.com | Python with C library | markup and other | parse tree, HTML, PDF, XML, OpenDocument | No | ? | Used by MediaWiki's "Print/export" feature, see Reading/Web/PDF Functionality. | BSD |
wb2pdf | Dirk Hünniger | Haskell | online article | LaTeX, PDF, Parse Tree, HTML, OpenDocument, EPUB | No | ? | Recursive Descent based on Monadic Parser Combinators. Allows for non context-free input, especially non well formatted HTML as often found on Wikipedia. | GPL |
XWiki Rendering Framework | XWiki dev team | Java | various WikiMarkups | Well formed sequence of events, HTML/XHTML, other WikiMarkups | No | No | XWiki can be used as a full-fledged wiki supporting several WikiMarkups (including MediaWiki's markup). It also offers a standalone Rendering Engine that can be used as a Java library for parsing/rendering WikiMarkups. Cannot output MediaWiki format as of 2016/03, though. | LGPL |
mediawiki-parser | Peter Potrowl, Erik Rose | Python | markup | XHTML, raw text, AST | No | No | GSoC-2011 project; the use of a PEG parser makes it easy to improve. Parser functions are not supported yet. | GPLv3 |
smc.mw | Marcus Brinkmann | Python | markup | AST, HTML | No | No | Stateful PEG parser based on Grako (Archived 2014-03-09 at the Wayback Machine), with a very clean separation of parsing stages, grammars and semantic transformations. | BSD |
Pandoc | John MacFarlane | Haskell | markup | many & AST | No | not identical | Can convert subset of mediawiki markup to ~35 different formats (5 of which are flavors of markdown). | GPLv2 |
MwParserFromScratch | CXuesong | C# | markup | AST | No | Yes | A portable .NET library that parses wikitext into Abstract Syntax Tree. For now it supports most of the common markup expressions except file links, double-underscored magic words, and tables. | Apache License |
gensim.segment_wiki | RaRe Technologies | Python | MediaWiki XML | JSON | No | No | Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python; segment_wiki is its script for Wikipedia parsing and extraction. | LGPLv2.1 |
mediawiki-parser | Ben Gamari | Haskell | markup or MediaWiki XML | AST | almost | No | mediawiki-parser served as the basis of the extraction pipeline of the NIST TREC Complex Answer Retrieval information retrieval track. It is a PEG parser capable of producing an abstract syntax tree representing most of the MediaWiki syntax. | BSD-3-Clause |
parse_wiki_text | Fredrik Portström | Rust | markup | AST | No | No | Parse Wiki Text attempts to take all uncertainty out of parsing wiki text by converting it to another format that is easy to work with. The target format is Rust objects that can ergonomically be processed using iterators and match expressions. | modified MIT |
wikitextprocessor | Tatu Ylonen | Python | XML dumps | AST | ? | ? | Can expand templates and Lua macros. | MIT unless otherwise noted in individual files (see LICENSE) |
wikiparser-node | Bhsd | TypeScript | markup | AST, HTML | almost | Yes | Parsing, modifying, and linting wikitext. Runs in Node.js and browser (online playground). | GPLv3 |
Proprietary
Name and link | Principal author(s) | Language | Input | Output | Complete implementation | Can convert output back to markup | Comments / other info | License |
---|---|---|---|---|---|---|---|---|
WikiTaxi | Ralf Junker | Delphi / Pascal | MediaWiki markup, page or fragment | Node-tree, HTML, potentially others | almost | Hand-crafted parser with template expansion, parser functions (core and extended), tag extensions (<ref>, <source>), wiki text parsing. Used for the WikiTaxi offline reader. | No sources available |
Abandoned
Name and link | Principal author(s) | Language | Input | Output | Complete implementation | Can convert output back to markup | Comments / other info | License |
---|---|---|---|---|---|---|---|---|
DKPro JWPL parser | Torsten Zesch, Richard Eckart de Castilho, Oliver Ferschke, Elisabeth Niemann | Java | XML dump | API to access pages, outlinks, inlinks and more | No | "JWPL (Java Wikipedia Library) is a free, Java-based application programming interface that allows to access all information contained in Wikipedia." "JWPL is for you: If you need structured access to Wikipedia in Java." Older parser not maintained any more - JWPL uses Sweble now. | LGPL | |
FlexBisonParse | Timwi | flex, bison and C | markup fragment | Custom XML | No | Intended as an eventual replacement to the parsing code inside MediaWiki itself. | ||
sanskrit-coders/wiki-tools | Vishvas Vasuki | Scala | Mediawiki text | Mediawiki text and Section tree | No | Only parses mediawiki sections - that's it. One can parse a wiki page with multiple sections, get a section tree, add, access and delete sections. | Creative commons | |
Perl Wikipedia Toolkit | Michal Jurosz | Perl | XML dump, SQL dump | Own parse tree, MediaWiki markup | No | Developed for computer-assisted Wikipedia translation. Barely functional. | ||
WikiOnCD (Archived 2006-01-15 at the Wayback Machine) | Andrew Rodland | Perl | SQL dump or markup | HTML, Parse tree (eventually?) | No | Started out as an offline wiki browser, but grew a parser when Wiki2static turned out to be too limiting. No web presence yet; code is in the SVN. | GPL | |
WikiPress Publisher[dead link] | Erwin Jurschitza | Delphi 7 | XML dump | DocBook XML, Digibib XML, HTML | No | Used for the German DVD, generates lists of bad markup. | No sources available | |
Saya.Parser.Wiki[dead link] | Nana Sakisaka | C++ | markup | AST | No | Pure C++11 parser implemented with Boost.Spirit.Qi. | Boost Software License 1.0 |
Parsers extracting some information
Name and link | Principal author(s) | Language | Input | Output | Complete implementation | Can convert output back to markup | Comments / other info | License |
---|---|---|---|---|---|---|---|---|
PHP-Wikipedia-Syntax-Parser | Don Wilson | PHP | markup | Associative array | No | Parses top-level sections, w:Wikipedia:Persondata, infoboxes, external links, categories, and interlanguage links. | GPL | |
Wiki-infobox-parser | Zhipeng Jiang | JavaScript | markup | JSON | No | A light Wikipedia Infobox Parser written in JavaScript. | MIT | |
wiktextract | Tatu Ylonen | Python | XML dumps | JSON | ? | Parses most of the English Wiktionary into JSON. Can expand templates and Lua macros. You can run it locally, or download the hosted JSON output directly. | MIT | |
ParseWiki | Gerges | PHP | wikitext | Associative array | Yes | A library that helps parse wikitext data | GPL-3.0 |
Other parsers
Name and link | Principal author(s) | Language | Input | Output | Complete implementation | Comments / other info | License |
---|---|---|---|---|---|---|---|
Mylyn WikiText | David Green | Java | Local files | HTML, DocBook, Eclipse Help, DITA, extensible | No | Integration with Ant and Eclipse runtime. | EPL |
wikipedia-js | kenshiro_o | Node.js | markup | HTML | No | A simple client that enables you to query Wikipedia articles in English. The results are formatted in basic HTML. You can retrieve either a summary of an article (i.e. before the table of contents) or a full article. | MIT |
WikiExtractor | Giuseppe Attardi, Antonio Fuschetto | Python | XML dumps | text | No | Simple and fast tool for extracting plain text from Wikipedia dumps. It performs template expansion and handles parser functions (core and extended). | GPL |
Mediawiki2HTML Machine | Johannes Buchner | PHP | markup | HTML | No | Project for parsing without the Mediawiki engine. | AGPL3 + any later version |
Java API (Bliki engine) | Axel Kramer | Java | markup fragment | HTML, PDF | almost | Java Wikipedia API - (supports ParserFunctions, Lua/Scribunto...). | EPLv1.0 or GPLv2.1+ |
WikiCloth | nricciar | Ruby | markup | HTML | No | Ruby implementation of the MediaWiki markup language, including a fair amount of the parser functions. | MIT |
YaCy | YaCy dev team | Java | XML dump | XML with Dublin Core Metadata | No | YaCy is a search engine and a MediaWiki parser is included as one of the import modules. MediaWiki xml dumps are first converted to Dublin Core XML as intermediate format and then inserted into the search index using the built-in Dublin Core importer. | GPL |
WiktionaryParser | dev team | Python | markup | JSON | No | Wiktionary parser. As of October 2019, downloads the article on-the-fly and parses "etymologies, definitions, pronunciations, examples, audio links and related words". | MIT |
LuaWiki | Alexander Misel | Lua, PEG | markup | HTML | No | LuaWiki has a parser which supports most of the common syntax used in the article namespace; however, it defines a different grammar for templates. | GPLv3 |
wiktionary-dumps | excarnateSojourner | Python | XML dump | various | No | A collection of scripts for extracting various information specifically from database dumps of the English Wiktionary. Only one or two may be more broadly useful. Active as of 2023. | CC0 |
wikiparser-java | javalc6 | Java | XML dump | various | No | The library has been developed to parse and render English Wiktionary. In addition to English, several other languages are supported. | Apache License |
other wiktionary parsers | various | various | markup | various | No | See list at <stackoverflow.com/q/3364279> | various |
Abandoned
Name and link | Principal author(s) | Language | Input | Output | Complete implementation | Comments / other info | License |
---|---|---|---|---|---|---|---|
libmwparser | Saitmoh | C | XML dumps, markup | XML, XHTML, Expanded WikiText | almost | Primarily an offline reader for Wikimedia wikis, with interwiki support. Libmwparser is a source-independent library which supports most of MediaWiki syntax and some extensions like math or gallery. | GPL |
Wiky.php | Toni Lähdekorpi | PHP, Regular Expressions | markup | HTML | No | A tiny PHP library that uses only regular expressions to convert Wiki markup to HTML. | Apache License/GPL/LGPL/MPL/CC |
Wiky | Tanin Na Nakorn | Ruby | markup | HTML | No | A simple Ruby library to convert Wiki markup to HTML. | Apache License |
Wiky.js | Tanin Na Nakorn | JavaScript | markup | HTML | No | A simple JavaScript library to convert Wiki markup to HTML (limited subset). | Apache License |
txtwiki.js | Joao Sa | JavaScript | markup | Text | No | A JavaScript library to convert MediaWiki markup to plaintext. | MIT License |
mw2html | Connelly Barnes | Python | Wiki url | HTML | No | Minimal setup - gets the basic job of creating a static copy of the wiki done. | Public Domain |
PHP5 WP | Dan Goldsmith | PHP | markup | HTML | No | Parser With Plugin Framework To Add Additional Syntax. Configurable for alternative markup i.e. PMWIKI. | MPL 2.0 |
JAMWiki | Ryan | Java | JAMWiki front-end | HTML | No | Java Wiki engine that supports MediaWiki syntax. The roadmap also calls for XML import and export that will be compatible with Mediawiki. | LGPLv2 |
InstaView | Pilaf | JavaScript | markup fragment | HTML | No | Provides instant preview while editing a page (without reloading). | BSD |
InstaView | C. Scott Ananian | JavaScript | markup fragment | HTML | No | Port of Pilaf's code to node.js, volo, and the browser. | BSD |
Tero-dump | Tero Karvinen | ? | Local wiki installation, including MySQL, PHP, web server | HTML | No | Scripts for grabbing the whole wiki; does not include images. | |
Text_Wiki_Mediawiki | Multiple | PHP | markup | HTML, LaTeX, Plain text | No | Part of the Text_Wiki library. | LGPL |
TomeRaider export | Erik Zachte | Perl | XML dump | TomeRaider database | No | See en:Wikipedia:TomeRaider database for more details. | |
Waikiki | Magnus Manske | C++ | SQL dump (via SQLite) | HTML | No | Abandoned in favour of "flexbisonparse", but has been used inside some experimental "front ends". | |
Wikiwyg (Archived 2008-12-16 at the Wayback Machine) | Jim Higson | JavaScript | A live installation of MediaWiki | HTML (via XML) | No | More than just a parser; attempts to create a fully functional client-side interface. | |
wik2dict | Guaka | Python | SQL dump | DICT | No | ||
wiki2pdf | Stephan Walter | Python (and PHP) | markup fragment or set of online articles | LaTeX, PDF | No | Project is incomplete and dormant. | |
WikiPDF | Felipe Sanches | Python (and PHP) | One selected article | LaTeX based on templates, PDF | No | Mediawiki extension that uses Stephan Walter's wiki2pdf as backend. | |
Wikifilter | ? | C++ (VS) | XML dumps | HTML | No | A Windows program that uses Apache/IIS to serve the pages. Abandoned in 2006, before ParserFunctions were available. | |
Wikipedia Dump Reader | Benjamin Thyreau | Python | XML dumps | On screen | No | Cross platform viewer. | GPLv2/~BSD license |
Marker | Ryan Blue | Ruby | markup (subset) | HTML or formatted text | No | Marker is a Ruby implementation of a subset of the MediaWiki markup language, intended to bring MediaWiki's markup language to non-wiki applications with multiple output formats. | GPL |
Kiwi | Thomas Luce, Karl Matthias, AboutUs.org | C, Ruby, PEG | markup | HTML | almost | Kiwi is a PEG-based C implementation with Ruby bindings and a command line parser. It is very fast and supports most of the MediaWiki syntax. | BSD |
A non-parser dumper
One of the common uses of alternative parsers is to dump wiki content into static form, such as HTML or PDF. Tim Starling has written a script which isn't a parser, but uses the MediaWiki internal code to dump an entire wiki to HTML, from the command line. See Extension:DumpHTML. This has been used (years ago) to create the static dumps at https://dumps.wikimedia.org.
There are also similar dumpers as part of the Kiwix project, for example mwoffliner, and you can query the RESTBase API to obtain HTML-format output with semantic information (such as transclusions) included.
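As a rough illustration of the RESTBase route, the sketch below builds the URL for one page's HTML and shows how it might be fetched. It assumes the public Wikimedia REST path `/api/rest_v1/page/html/{title}`; the function names and the User-Agent string are hypothetical, and endpoint availability may change over time.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

def page_html_url(title, host="en.wikipedia.org"):
    """Build the (assumed) REST URL for the HTML rendering of one page."""
    # Titles use underscores for spaces; everything else is percent-encoded,
    # including "/" so that subpage titles stay a single path segment.
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"https://{host}/api/rest_v1/page/html/{encoded}"

def fetch_page_html(title, host="en.wikipedia.org"):
    """Download the HTML. Requires network access; not called in this sketch."""
    request = Request(
        page_html_url(title, host),
        headers={"User-Agent": "alt-parser-example/0.1 (hypothetical)"},
    )
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")

print(page_html_url("Main Page"))
```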
Related topics
- If you want to convert MediaWiki documents into some other format, the above tools are useful. If you want to convert HTML documents or other formats into MediaWiki documents, you may find Wikipedia:Tools/Editing tools#Wikisyntax conversion utilities and Manual:Importing external content more useful.
- One-pass parser
- MediaWiki lexer and MediaWiki flexer (not parsers as such, just grammar definitions; probably superseded by/within other projects below)
- en:Wikipedia:Text editor support lists scripts and extensions that add features such as syntax highlighting to editors like Emacs and Vim; some of these may include rudimentary parsing capabilities.
- Here are some proof of concept rules for a subset of the Mediawiki markup: these are written in a metalanguage that treats preformatted text as source text, and everything else as comment.
- Markup spec aims to produce a specification of MediaWiki's markup format.
- Help:Extension:ParserFunctions is the main parser extension for MediaWiki.
- mwparserfromhell and Parsoid's similar jsapi are useful tools for extraction and transformation tasks.
- If no library suits your needs, you still have the option of parsing the data dumps: see meta:Data_dumps and meta:Data_dumps/Other_tools.
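For the last option, a minimal sketch of reading pages out of an XML dump with only the Python standard library might look like the following. The helper names are hypothetical and the inline sample is a drastically reduced stand-in for a real dump; it assumes the usual `<page>`, `<title>`, and `<text>` elements of the MediaWiki export format.

```python
import io
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(xml_stream):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                # With several revisions, the last <text> seen wins.
                text = child.text or ""
        yield title, text
        elem.clear()  # release the subtree; real dumps are very large

SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Hello [[world]].</text></revision>
  </page>
  <page>
    <title>Empty</title>
    <revision><text /></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

The extracted wikitext is still raw markup; turning it into plain text or a tree is where one of the parsers listed above, or your own code, comes in.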