
Manual:Importing XML dumps

From mediawiki.org

This page describes methods to import XML dumps. XML dumps contain the content of a wiki (wiki pages with all their revisions), without the site-related data. An XML dump is not a full backup of the wiki database: it does not contain user accounts, images, edit logs, etc.

The Special:Export page of any MediaWiki site, including any Wikimedia site and Wikipedia, creates an XML file (content dump). See meta:Data dumps and Manual:DumpBackup.php. XML files are explained in more detail on Help:Export.

How to import?


There are several methods for importing these XML dumps.

Importing large XML dumps (such as English Wikipedia)


Guide to Importing the English Wikipedia XML File


Importing the English Wikipedia dump is a long-running task because of its size, but it can be done reliably when the work is split into batches. This guide outlines the process using the importDump.php script located in the maintenance folder of your MediaWiki installation.

Prepare your environment

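Before starting the import process, ensure that your server has enough disk space for both the XML dump and the resulting database, and enough memory and CPU for a long-running import. The example server described in the conclusion of this guide has 16 cores, 16 GB of RAM, and a 2 TB SSD.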

Obtain the Wikipedia XML Dump

The English Wikipedia XML dump can be downloaded from the Wikimedia Foundation's website. At the time of this writing, the file size is approximately 91GB.

Importing the XML Dump

There are several methods to import the XML dump into your MediaWiki installation:

  • Direct Import: You can use the importDump.php script provided in the maintenance folder. This method is reliable but may be resource-intensive.
  • Individual File Import: Alternatively, you can download and import each multipart XML file individually. While this approach allows for more manageable imports, it requires additional effort.
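
The individual-file approach can be sketched as a loop over the downloaded parts. This is a minimal sketch rather than a tested procedure: the function name import_parts, the mw_run variable, and the file pattern in the usage comment are illustrative assumptions, and it relies on importDump accepting a dump file path as an argument.

```shell
#!/bin/bash
# Sketch: import multipart dump files one at a time, stopping at the first failure.
# mw_run is an assumed path to the maintenance runner; adjust it to your installation.
mw_run=${mw_run:-/var/www/html/maintenance/run}

import_parts() {
    local part
    for part in "$@"; do
        echo "Importing ${part}"
        # Stop at the first part that fails so it can be retried before continuing
        "$mw_run" importDump "$part" || { echo "Failed on ${part}" >&2; return 1; }
    done
}

# Usage (the file pattern is a placeholder for the parts you downloaded):
#   import_parts /var/www/html/dumps/*.xml
```

Stopping at the first failure keeps the list of remaining parts easy to reconstruct: everything before the failed part has been imported, everything after it has not.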

Managing Import Errors

Importing such a large file may result in errors that interrupt the process. To mitigate this, consider setting up cron jobs to import the XML dump in smaller batches over time. This limits the impact of any single failure: an interrupted run only affects its own batch, and the next run picks up near where the previous one stopped.

Simple Cron Schedule Example

Notes:

  1. The Wiki files are assumed to be located in the main Web Server folder, typically /var/www/html/.
  2. The XML dump file is also assumed to be located in the same directory.
  3. This cron job is executed by the root user, and log files are created in the /root/wikilogs folder.
  4. Replace {your-server} with the URL of your server in the --server argument.
  5. This schedule starts the import script every 15 minutes during hours 4 through 22, and runs the other maintenance scripts once a day at fixed times between midnight and 05:00.

Please make changes to point to your specific folders, or create any folders you wish to use.

*/15 4-22 * * * /root/wikilogs/10-cronjob.sh > ~/wikilogs/import-$(date +\%Y\%m\%d_\%H\%M\%S)-cron.log 2>&1
0 0 * * * /var/www/html/maintenance/run updateArticleCount.php --use-master --globals --memory-limit default --profiler json --update > ~/wikilogs/articleCount-`date +\%Y\%m\%d\%H\%M\%S`-cron.log 2>&1
30 1 * * * /var/www/html/maintenance/run update --quick > ~/wikilogs/update-`date +\%Y\%m\%d\%H\%M\%S`-cron.log 2>&1
30 2 * * * /var/www/html/maintenance/run rebuildrecentchanges.php --globals --memory-limit default --profiler json > ~/wikilogs/recentChanges-`date +\%Y\%m\%d\%H\%M\%S`-cron.log 2>&1
30 3 * * * /var/www/html/maintenance/run cleanupInvalidDbKeys.php --globals --memory-limit default --profiler json > ~/wikilogs/cleanupInvalidDbKeys-`date +\%Y\%m\%d\%H\%M\%S`-cron.log 2>&1
0 4 * * * /var/www/html/maintenance/run rebuildall.php --use-master --globals --memory-limit default --profiler json --update > ~/wikilogs/rebuildAll-`date +\%Y\%m\%d\%H\%M\%S`-cron.log 2>&1
0 5 * * * /var/www/html/maintenance/run cleanupUsersWithNoId.php --use-master --globals --memory-limit default --profiler json --update > ~/wikilogs/cleanupUsersWithNoId-`date +\%Y\%m\%d\%H\%M\%S`-cron.log 2>&1

The cron entries above call a simple shell script, /root/wikilogs/10-cronjob.sh, to import the dump:

#!/bin/bash

# Get the Wiki version to test for compatibility
# Get the BASH version to test for compatibility
# Get the current number of pages, round down to the nearest 1000's or 0 if less than 1000
# Get the server URL
# Start the import to the number of Articles
# Show total pages imported

# Function to convert HH:MM:SS to seconds
function time_to_seconds() {
    local time=$1
    local hours minutes seconds
    IFS=":" read -r hours minutes seconds <<< "$time"
    # Force base 10 so values with a leading zero such as "08" are not read as octal
    echo $((10#$hours * 3600 + 10#$minutes * 60 + 10#$seconds))
}

# Function to compare version numbers
function version_compare() {
    local version1="$1"
    local version2="$2"
    if [[ "$(printf '%s\n' "$version1" "$version2" | sort -V | head -n1)" == "$version1" ]]; then
        return 0
    else
        return 1
    fi
}

wikilocation="/var/www/html"

# Get the currently installed Wiki Version
rawWikiVersion=$(${wikilocation}/maintenance/run version)
wikiVersion=$(echo "$rawWikiVersion" | awk -F': ' '{split($2,a," "); print a[1]}')

# Compare the version with 1.40.0 (version_compare A B succeeds when A <= B)
# if ! version_compare "1.40.0" "$wikiVersion"; then
#     echo "MediaWiki version is older than 1.40.0. Aborting import process."
#     exit 1
# fi

# Get the latest number of pages in the stats
totalStartPages=$(${wikilocation}/maintenance/run showSiteStats | awk '/Total pages/ {print $NF}')

# Round down
articleCount=$(echo "scale=0; (${totalStartPages} / 1000) * 1000" | bc)

# The URL to the wiki
server="<YOUR HTTPS://WEBSITE HERE>"

# Location of the xml file
inputFile="/var/www/html/enwiki-20240120-pages-articles-multistream.xml"

start_time=$(date +%s)
echo "Starting on page ${articleCount} at $(date '+%m-%d-%y %I:%M %p') - Wiki Version ${wikiVersion} with ${server} for the URL"
${wikilocation}/maintenance/run importDump --quiet --memory-limit default --skip-to "${articleCount}" --server "${server}" < ${inputFile} > /dev/null
end_time=$(date +%s)

# Calculate the total number of pages imported
totalEndPages=$(${wikilocation}/maintenance/run showSiteStats | awk '/Total pages/ {print $NF}')
pages_imported=$((totalEndPages - totalStartPages))

# Calculate the duration of the import process in seconds
duration=$((end_time - start_time))

# Convert duration to minutes
duration_minutes=$((duration / 60))

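# Calculate pages per minute from the duration in seconds, so that runs
# shorter than one minute do not divide by zero
pages_per_minute=$(awk -v p="${pages_imported}" -v s="${duration}" 'BEGIN {printf "%.2f", (s > 0) ? p * 60 / s : 0}')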

echo "Completed ${pages_imported} pages in ${duration_minutes} minutes (${pages_per_minute} pages/minute) on $(date '+%m-%d-%y %I:%M %p')"

Monitoring and Troubleshooting

During the import process, monitor the server logs for errors and warnings. Common issues include database connection errors, memory exhaustion, and file permission problems. Troubleshoot any problems promptly so that later batches are not affected.
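
One way to keep an eye on the cron logs is to list the files that mention a likely problem. The following is a minimal sketch: the function name find_problem_logs and the keyword list are assumptions, and the actual wording of error messages depends on your PHP and database setup.

```shell
#!/bin/bash
# Sketch: list log files in a folder that mention a likely problem.
# The keyword list is an assumption; adjust it to what your logs contain.
find_problem_logs() {
    local log_dir=$1
    # -l: print only file names, -i: ignore case, -E: extended regular expression
    grep -l -i -E 'error|fatal|exception|out of memory' "$log_dir"/*.log 2>/dev/null
}

# Usage, with the log folder from the cron example:
#   find_problem_logs /root/wikilogs
```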

Post-Import Tasks

Once the import is complete, perform any necessary post-import tasks, such as rebuilding the search index or updating site configurations. Verify the integrity of the imported content to ensure consistency with the original Wikipedia articles.

Conclusion

On a server with 16 cores, 16 GB of RAM, and a 2 TB SSD, the average import rate over a two-week period was 96 pages per minute. This figure gives a rough idea of the throughput to expect from a comparable server and can be used to plan and schedule the import of the English Wikipedia XML dump.

To estimate how long it will take to import 59,795,345 pages at an average rate of 96 pages per minute, assuming the import runs around the clock, use the following formula:

Total pages ÷ Pages per minute ÷ Minutes per hour ÷ Hours per day ÷ Days per week

Plugging in the values: 59,795,345 ÷ 96 ÷ 60 ÷ 24 ÷ 7 ≈ 61.79

Rounding up, it will take approximately 62 weeks to import 59,795,345 pages at this rate.
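
The same arithmetic can be scripted so that the estimate is easy to repeat with your own page count and measured rate. This is a minimal sketch; the function name estimate_weeks is illustrative and not part of MediaWiki, and it assumes awk is available and that the import runs continuously.

```shell
#!/bin/bash
# Hypothetical helper (not part of MediaWiki): estimate the import time in weeks
# from a total page count and an average pages-per-minute rate.
estimate_weeks() {
    local total_pages=$1
    local pages_per_minute=$2
    # pages -> minutes -> hours -> days -> weeks
    awk -v p="$total_pages" -v r="$pages_per_minute" \
        'BEGIN { printf "%.2f\n", p / r / 60 / 24 / 7 }'
}

# Values taken from the text above
estimate_weeks 59795345 96
```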

Using Special:Import


Special:Import can be used by wiki users with import permission (by default this is users in the sysop group) to import a small number of pages (about 100 should be safe).

Trying to import large dumps this way may result in timeouts or connection failures.

  • See Help:Import for a basic description of how the importing process works.[1]

You are asked to give an interwiki prefix. For instance, if you exported from the English Wikipedia, you have to type 'en'. XML importing requires the import and importupload permissions. See Manual:User rights.

Large XML uploads


Large XML uploads might be rejected because they exceed the PHP upload limit set in the php.ini file.

 ; Maximum allowed size for uploaded files.
 upload_max_filesize = 20M

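If raising upload_max_filesize is not enough, try changing these four settings in php.ini as well. post_max_size needs to be at least as large as upload_max_filesize, because the uploaded file is part of the POST data: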

 ; Maximum size of POST data that PHP will accept.
 post_max_size = 20M

 max_execution_time = 1000  ; Maximum execution time of each script, in seconds
 max_input_time = 2000      ; Maximum amount of time each script may spend parsing request data, in seconds

 ; Default timeout for socket based streams (seconds)
 default_socket_timeout = 2000
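
Before uploading, it can help to check whether a dump fits under the configured limits. The following is a rough sketch: the function names php_size_to_bytes and fits_upload_limits are hypothetical, the limit values are passed in by hand rather than read from php.ini, and the check ignores the small overhead that the rest of the POST request adds to the file size.

```shell
#!/bin/bash
# Sketch: convert a php.ini shorthand size (K, M or G suffix) to bytes.
php_size_to_bytes() {
    local value=$1
    local number=${value%[KkMmGg]}   # strip a single trailing unit letter, if any
    case $value in
        *[Kk]) echo $((number * 1024)) ;;
        *[Mm]) echo $((number * 1024 * 1024)) ;;
        *[Gg]) echo $((number * 1024 * 1024 * 1024)) ;;
        *)     echo "$number" ;;
    esac
}

# Hypothetical check: does a file of the given size (in bytes) fit under both limits?
fits_upload_limits() {
    local file_bytes=$1 upload_max=$2 post_max=$3
    [ "$file_bytes" -le "$(php_size_to_bytes "$upload_max")" ] &&
        [ "$file_bytes" -le "$(php_size_to_bytes "$post_max")" ]
}

# A 25 MB file checked against the 20M values shown above
if fits_upload_limits $((25 * 1024 * 1024)) 20M 20M; then
    echo "fits"
else
    echo "too large: raise upload_max_filesize and post_max_size"
fi
```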

Using importDump.php, if you have shell access

Recommended method for general use, but slow for very big data sets.
See: Manual:ImportDump.php, including tips on how to use it for large wikis.

Using importTextFiles.php maintenance Script

MediaWiki version:
1.23
MediaWiki version:
1.27

If you have a lot of content converted from another source (word processor files, content from another wiki, etc.), you may have several files that you would like to import into your wiki. In MediaWiki 1.27 and later, you can use the importTextFiles.php maintenance script.

You can also use the edit.php maintenance script for this purpose.

rebuildall.php


You can run rebuildall.php after importing an XML dump, but it will take a long time because it has to parse all pages. For that reason it is not recommended for large data sets.

Using pywikibot, pagefromfile.py and Nokogiri


Pywikibot is a collection of tools written in Python that automate work on Wikipedia or other MediaWiki sites. Once it is installed on your computer, you can use the tool pagefromfile.py, which uploads pages from a wiki text file to Wikipedia or other MediaWiki sites. The XML file created by dumpBackup.php can be transformed into a wiki file suitable for pagefromfile.py using a simple Ruby program similar to the following. The program transforms all XML files in the current directory, which is needed if your MediaWiki site is a family:

# -*- coding: utf-8 -*-
# dumpxml2wiki.rb

require 'rubygems'
require 'nokogiri'

# This program dumpxml2wiki reads MediaWiki xml files dumped by dumpBackup.php
# on the current directory and transforms them into wiki files which can then 
# be modified and uploaded again by pywikipediabot using pagefromfile.py on a MediaWiki family site.
# The text of each page is searched with xpath and its title is added on the first line as
# an html comment: this is required by pagefromfile.py.
# 
Dir.glob("*.xml").each do |filename|
  input = Nokogiri::XML(File.new(filename), nil, 'UTF-8')

  puts filename.to_s  # prints the name of each .xml file

  File.open("out_" + filename + ".wiki", 'w') {|f| 
    input.xpath("//xmlns:text").each {|n|
      pptitle = n.parent.parent.at_css "title" # searching for the title
      title = pptitle.content
      f.puts "\n{{-start-}}<!--'''" << title.to_s << "'''-->" << n.content  << "\n{{-stop-}}"
    }
  }
end

For example, here is an excerpt of a wiki file output by the command 'ruby dumpxml2wiki.rb' (two pages can then be uploaded by pagefromfile.py, a Template and a second page which is a redirect):

{{-start-}}<!--'''Template:Lang_translation_-pl'''--><includeonly>Tłumaczenie</includeonly>
{{-stop-}}

{{-start-}}#REDIRECT[[badania demograficzne]]<!--'''ilościowa demografia'''-->
<noinclude>
[[Category:Termin wielojęzycznego słownika demograficznego (pierwsze wydanie)|ilościowa demografia]]
[[Category:Termin wielojęzycznego słownika demograficznego (pierwsze wydanie) (redirect)]]
[[Category:10]]</noinclude>
{{-stop-}}

The program reads each XML file, extracts the text within the <text> </text> tags of each page, looks up the title of the enclosing page, and encloses the text in the paired {{-start-}}<!--'''Title of the page'''--> {{-stop-}} commands used by 'pagefromfile' to create or update a page. The name of the page is in an HTML comment, delimited by triple quotes, on the same first line as the start command. Note that the name of the page can be written in Unicode. Sometimes it is important that the page starts directly with a command, as for a #REDIRECT; in that case the comment giving the name of the page must come after the command but still on the first line.

Note that the XML dump files produced by dumpBackup.php declare a default XML namespace:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">

In order to access the text node using Nokogiri, you need to prefix your path with 'xmlns':

input.xpath("//xmlns:text")

Nokogiri is a Ruby library that provides HTML, XML, SAX, and Reader parsers and can search documents via XPath or CSS3 selectors.

Example of the use of 'pagefromfile' to upload the output wiki text file:

python pagefromfile.py -file:out_filename.wiki -summary:"Reason for changes" -lang:pl -putthrottle:01

How to import logs?


Exporting and importing logs with the standard MediaWiki scripts often proves very hard; an alternative for import is the script pages_logging.py in the WikiDAT tool, as suggested by Felipe Ortega.

Troubleshooting


Merging histories, revision conflict, edit summaries, and other complications


When importing a page with history information, if the user name already exists in the target project, change the user name in the XML file to avoid confusion. If the user name is new to the target project, the contributions will still be available, but no account will be created automatically.

When a page is referenced through a link or a URL, generic namespace names are converted automatically. If the prefix is not a recognized namespace name, the page will default to the main namespace. However, prefixes like "mw:" might be ignored on projects that use them for interwiki linking. In such cases, it might be beneficial to change the prefix to "Project:" in the XML file before importing.

If a page with the same name already exists, importing revisions will combine their histories. Adding a revision between two existing ones can make the next user's changes appear different. To see the actual change, compare the two original revisions, not the inserted one. So, only insert revisions to properly rebuild the page history.

A revision won't be imported if there's already a revision with the same date and time. This usually happens if it has been imported before, either to the current wiki, to a previous wiki, or both, possibly from a third site.

An edit summary might reference or link to another page, which can be confusing if the page being linked to has not been imported, even though the referencing page has been.

The edit summary doesn't automatically indicate whether the page has been imported, but in the case of an upload import, you can add this information to the edit summaries in the XML file before the import. This can help prevent confusion. When modifying the XML file with find/replace, keep in mind that adding text to the edit summaries requires differentiating between edits with and without an edit summary. If there are multiple comment tags in the XML file, only the last set will be applied.

Interwikis


If you get the message "Interwiki", the problem is that some pages to be imported have a prefix that is used for interwiki linking. For example, pages with a prefix of 'Meta:' would conflict with the interwiki prefix meta:, which by default links to https://meta.wikimedia.org.

You can do any of the following.

  • Remove the prefix from the interwiki table. This will preserve page titles, but prevent interwiki linking through that prefix.
    Example: you will preserve page titles 'Meta:Blah blah' but will not be able to use the prefix 'meta:' to link to meta.wikimedia.org (although it will be possible through a different prefix).
    How to do it: before importing the dump, run the query DELETE FROM interwiki WHERE iw_prefix='prefix' (note: do not include the colon in the prefix). Alternatively, if you have enabled editing the interwiki table, you can simply go to Special:Interwiki and click the 'Delete' link on the right side of the row belonging to that prefix.
  • Replace the unwanted prefix in the XML file with "Project:" before importing. This will preserve the functionality of the prefix as an interwiki link, but will replace the prefix in the page titles with the name of the wiki they are imported into, and can be tedious to do on large dumps.
    Example: replace all 'Meta:' with 'Project:' in the XML file. MediaWiki will then replace 'Project:' with the name of your wiki during importing.
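
For the second option, the replacement can be done with a stream editor instead of by hand. This is a minimal sketch under a few assumptions: the function name retitle_prefix and the file names in the usage comment are placeholders, each title sits in a <title> element as in Special:Export output, and the prefixes contain no characters that are special to sed. Only page titles are rewritten; links inside the page text are left unchanged.

```shell
#!/bin/bash
# Sketch: rewrite a title prefix in an XML dump before importing.
# Reads the dump on standard input and writes the result to standard output.
retitle_prefix() {
    local old_prefix=$1   # e.g. Meta
    local new_prefix=$2   # e.g. Project
    sed "s|<title>${old_prefix}:|<title>${new_prefix}:|"
}

# Usage (file names are placeholders):
#   retitle_prefix Meta Project < dump.xml > dump-retitled.xml
printf '%s\n' '    <title>Meta:Blah blah</title>' | retitle_prefix Meta Project
```

Because sed processes the file line by line, this also works on dumps that are too large to open in an editor.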


References

  1. See Manual:XML Import file manipulation in CSharp for a C# code sample that manipulates an XML import file.