WikiWho

WikiWho is a service that provides authorship attribution using a content persistence algorithm. It was first developed by the Karlsruhe Institute of Technology and GESIS – Leibniz Institute for the Social Sciences. In August 2021, it was moved to Wikimedia Cloud Services infrastructure, and it is now maintained and further developed by Community Tech and the Wiki Education Foundation.

Technical details

The core functionality of WikiWho is to parse the complete revision history of a wiki page to determine, at token level, who wrote, removed, or reinserted each piece of text, and in which revision. As a result, the individual addition, removal, and reintroduction history is available for every token (such as a word).

The original algorithm is described in a WWW 2014 paper, which also reports an extensive evaluation showing 95% accuracy on fairly revision-rich articles. The current version of the code is available on GitHub.

In a nutshell, the approach divides each revision into hierarchically nested paragraph, sentence, and token elements and tracks their appearance across all revisions through the content graph it builds from them. It is currently implemented for wikitext but can in principle run on any kind of text (although the tokenization rules might have to be adapted).
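
The nesting described above can be illustrated with a minimal sketch. The splitting rules here (blank lines for paragraphs, sentence-final punctuation for sentences, a simple regular expression for tokens) and the function names are assumptions chosen for illustration; they are not WikiWho's actual tokenization rules for wikitext.

```python
import re

def split_paragraphs(text):
    # Assumption: paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def split_tokens(sentence):
    # Assumption: a token is a run of word characters or a single
    # punctuation mark; everything is lowercased before comparison.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

def decompose(text):
    """Return a revision as paragraphs -> sentences -> tokens."""
    return [
        [split_tokens(s) for s in split_sentences(p)]
        for p in split_paragraphs(text)
    ]

revision = "Hello world. Second one!\n\nNew paragraph."
for paragraph in decompose(revision):
    print(paragraph)
```

Working at three levels lets unchanged paragraphs and sentences be matched as whole units, so that only the parts of a revision that actually changed need to be compared token by token.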

[Figure: WikiWho algorithm example. Example of how the token metadata is generated.]

In this way, it becomes possible to track, for every single token, all original additions, deletions, re-insertions, and re-deletions and the revisions in which they took place. This in turn makes it possible to infer the editor, timestamp, and other metadata of those revisions. In addition, each token retains a unique ID, which makes it possible to distinguish two tokens with identical strings in different text positions.
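
The per-token bookkeeping can be sketched as follows. This is a simplified illustration and not the WikiWho implementation: it swaps in Python's `difflib.SequenceMatcher` for the paragraph, sentence, and token graph, and it treats a token as reinserted whenever a previously removed token with the same string exists. The function name `track_tokens` and the field names `origin`, `editor`, `out`, and `in` are hypothetical.

```python
from difflib import SequenceMatcher

def track_tokens(revisions):
    """revisions: list of (rev_id, editor, token_strings), oldest first."""
    tokens = {}    # token ID -> metadata for that token
    current = []   # token IDs of the latest revision, in text order
    removed = {}   # token string -> IDs of tokens currently deleted
    next_id = 0
    for rev_id, editor, words in revisions:
        old_words = [tokens[t]["str"] for t in current]
        new_current, dropped = [], []
        matcher = SequenceMatcher(a=old_words, b=words, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                new_current.extend(current[i1:i2])  # tokens survive unchanged
                continue
            dropped.extend(current[i1:i2])          # empty for 'insert'
            for word in words[j1:j2]:               # empty for 'delete'
                pool = removed.get(word)
                if pool:
                    # Heuristic (assumption): reuse a token deleted earlier.
                    tid = pool.pop()
                    tokens[tid]["in"].append(rev_id)
                else:
                    tid = next_id
                    next_id += 1
                    tokens[tid] = {"str": word, "origin": rev_id,
                                   "editor": editor, "out": [], "in": []}
                new_current.append(tid)
        # Record removals last, so a token moved within one revision
        # is not matched against itself (it gets a new ID instead).
        for tid in dropped:
            tokens[tid]["out"].append(rev_id)
            removed.setdefault(tokens[tid]["str"], []).append(tid)
        current = new_current
    return tokens, current

history = [
    (1, "alice", ["the", "cat", "sat"]),
    (2, "bob", ["the", "sat"]),
    (3, "carol", ["the", "cat", "sat", "the"]),
]
tokens, current = track_tokens(history)
for tid in current:
    print(tid, tokens[tid])
```

In this example, the token "cat" keeps its ID and original editor across its removal in revision 2 and its reinsertion in revision 3, while the two occurrences of "the" in the final revision carry different IDs and different origin revisions.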

Currently supported wikis

Tools powered by WikiWho