Incremental dumps/File format/XML output

From mediawiki.org

The XML output from incremental dumps should be exactly the same as the current XML dumps, with the following exceptions. Any difference not listed here is most likely a bug and should be reported.

The exceptions (from most serious to least):

  1. Revisions of a page are ordered by their id in history dumps. The current XML dumps do not specify any order.
  2. The <restrictions> tag is omitted.
    The page_restrictions field in the database is not used anymore, so the <restrictions> tag doesn't provide accurate information about the restrictions of a page.
  3. The id attribute is missing for the <text> tag in stub dumps.
    This is currently used in the dump infrastructure for creating pages dumps, but is not useful to users.
  4. Comments that are 255 bytes long and end in an invalid UTF-8 sequence are shortened.
    In the current dumps, the invalid sequence is replaced with U+FFFD REPLACEMENT CHARACTER. In the XML produced by incremental dumps, the invalid sequence is removed.
    This applies only to the last character of full-length comments. In other cases, incremental dumps use U+FFFD REPLACEMENT CHARACTER, just like current dumps.
  5. Addresses of anonymous IPv6 contributors that are not in full form (i.e. that contain ::) are normalized to full form. This should be very rare, because the addresses should almost always be in full form already.
  6. The minor tag is consistently written as <minor /> (with a space).
    In current dumps, this is inconsistent: pages dumps use <minor />, while stub dumps use <minor/>.
    This could affect users who read the dumps using regular expressions or similar methods; it doesn't make any difference for those who use XML parsers.
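
For consumers who process the dumps line by line, or who compare output from the two dump systems, exceptions 5 and 6 can be absorbed with tolerant matching. The following is a minimal Python sketch, not part of the dump tooling. The helper names has_minor_tag and same_ipv6 are hypothetical, and the sketch compares IPv6 addresses by value rather than assuming the exact textual "full form" that the dumps write.

```python
import ipaddress
import re

# Accepts both spellings of the tag: <minor /> and <minor/>.
MINOR_TAG = re.compile(r"<minor\s*/>")


def has_minor_tag(line: str) -> bool:
    """Return True if the line contains a self-closing minor tag in either spelling."""
    return MINOR_TAG.search(line) is not None


def same_ipv6(a: str, b: str) -> bool:
    """Compare two IPv6 address strings by value.

    Parsing both sides makes the comparison independent of '::'
    compression, leading zeros, and letter case.
    """
    return ipaddress.IPv6Address(a) == ipaddress.IPv6Address(b)


if __name__ == "__main__":
    print(has_minor_tag("      <minor />"))
    print(has_minor_tag("      <minor/>"))
    print(same_ipv6("2001:db8::1", "2001:0DB8:0:0:0:0:0:1"))
```

Comparing parsed values means a contributor recorded with a compressed address in one dump and a fully expanded address in the other is still recognized as the same contributor.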