
Manual:ImportDump.php

From mediawiki.org
This is the recommended method for general use, but it is slow for very large data sets. See "Importing English Wikipedia or other large wikis", below.

importDump.php is a maintenance script that imports XML dump files into the current wiki. It reads pages from an XML file produced by Special:Export or dumpBackup.php and saves them into the current wiki. It is located in the maintenance folder of your MediaWiki installation.

Description of operation

The script reports progress in 100-page increments (by default), giving the number of pages and revisions imported per second for each increment, so you can monitor its activity and see that it hasn't hung. It can take 30 seconds or more between increments.

The script is robust: it skips past previously loaded pages rather than overwriting them, so it can pick up where it left off fairly quickly after being interrupted and restarted. It still displays progress increments while doing this, and they go by quickly.

Pages are imported preserving the timestamp of each edit. Because of this, if a page being imported is older than the existing page, the import only populates the page history and does not replace the most recent revision with an older one. If that behavior is not desired, either delete the existing pages before the import, or edit them afterwards to revert to the last imported revision found in the page history.

The wiki is usable during the import.

The wiki will look odd at first, with most of the templates missing and many red links, but it gets better as the import proceeds.

Examples

If you have shell access, you can call importDump.php from within the maintenance folder like this (add paths as necessary):

php importDump.php --conf ../LocalSettings.php /path_to/dumpfile.xml.gz --username-prefix=""

or this:

php importDump.php < dumpfile.xml

where dumpfile.xml is the name of the XML dump file. If the file is compressed and has a .gz or .bz2 file extension (but not .tar.gz or .tar.bz2), it is decompressed automatically.

Due to a known bug, it may be necessary to specify --username-prefix="" when importing files.

Afterwards, use importImages.php to import the images:

php importImages.php ../path_to/images

Running importDump.php can take a long time. For a large Wikipedia dump with millions of pages, it may take days, even on a fast server. Add --no-updates for a faster import. The information in Import about merging histories and related topics also applies.

After running importDump.php, you may want to run rebuildrecentchanges.php in order to update the content of your Special:Recentchanges page.

If you imported a dump with the --no-updates parameter, you'll need to run rebuildall.php to populate all the links, templates and categories.
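
The fast-import sequence described above can be sketched as a small shell function. This is a minimal sketch, assuming it is run from the maintenance folder; the function name import_fast and the PHP_BIN variable are illustrative and not part of MediaWiki.

```shell
#!/bin/sh
# Sketch: import with --no-updates, then rebuild the link tables.
# Assumptions: the current directory is the maintenance folder;
# PHP_BIN is a hypothetical variable naming the PHP binary (default: php).

import_fast() {
    dump="$1"
    php_bin="${PHP_BIN:-php}"

    # Faster import, but the wiki is left in an inconsistent state.
    "$php_bin" importDump.php --no-updates "$dump" || return 1

    # Populate links, templates and categories afterwards.
    "$php_bin" rebuildall.php
}

# Usage: import_fast /path_to/dumpfile.xml.gz
```

The rebuild step only runs if the import exits successfully, so an interrupted import can be restarted before any rebuilding begins.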

Options

Option/parameter Description
--report Report position and speed after every n pages processed.
--namespaces Import only the pages from namespaces belonging to the list of pipe-separated namespace names or namespace indexes.
--dry-run Parse dump without actually importing pages.
--debug Output extra verbose debug information.
--uploads Process file upload data if included (experimental).
--no-updates Disable link table updates. This is faster but leaves the wiki in an inconsistent state. Run rebuildall.php after the import to correct the link table.
--image-base-path Import files from a specified path.
--skip-to Start from the given page number, by skipping first n-1 pages.
--username-prefix Add a prefix to usernames. Due to a known bug, it may be necessary to specify --username-prefix="" when importing files.

FAQ

How do I set up debug mode?

Use the command-line option --debug.

How do I make a dry run (no data added to the database)?

Use the command-line option --dry-run.

Error messages

Failed to open stream

If you get the error "failed to open stream: No such file or directory", make sure that the specified file exists and that PHP has access to it.
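
The check above can be done from the shell before starting the import. This is a minimal sketch; check_dump is a hypothetical helper name, and it tests readability for the current user, which is assumed to be the same user that runs the PHP command-line script.

```shell
#!/bin/sh
# Sketch: verify that a dump file exists and is readable by the current user.
# check_dump is an illustrative helper, not part of MediaWiki.

check_dump() {
    file="$1"
    if [ ! -e "$file" ]; then
        echo "missing: $file"
        return 1
    fi
    if [ ! -r "$file" ]; then
        echo "not readable: $file"
        return 1
    fi
    echo "ok: $file"
}

# Usage: check_dump /path_to/dumpfile.xml.gz && php importDump.php /path_to/dumpfile.xml.gz
```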

Error while running importImages

Typed

roots@hello:~# php importImages.php /maps gif bmp PNG JPG GIF BMP

Error

> PHP Deprecated:  Comments starting with '#' are deprecated in /etc/php5/cli/conf.d/mcrypt.ini on line 1 in Unknown on line 0
> Could not open input file: importImages.php

Cause

importImages.php was run from the wrong directory. Before running it, change to the maintenance folder, which contains the importImages.php maintenance script.

Error while running MAMP

DB connection error: No such file or directory (localhost)

Solution

Use specific database credentials:

$wgDBserver         = "localhost:/Applications/MAMP/tmp/mysql/mysql.sock";
$wgDBadminuser      = "XXXX";
$wgDBadminpassword  = "XXXX";

Importing English Wikipedia or other large wikis

For very large data sets, importDump.php may take a long time (days or weeks). There are alternative methods that can be much faster for full site restoration; see Manual:Importing XML dumps.

If you can't get the other methods to work, here are some pointers for using importDump.php to import large wikis while reducing import time as much as possible.

Parallelizing the import

You could try running importDump.php multiple times simultaneously on the same dump, using the --skip-to option.

In an experiment on Ubuntu, the script was run (on a decompressed dump) multiple times in separate windows simultaneously using the --skip-to option. On a quad-core laptop computer, running the script in 4 windows sped up import by a factor of 4. In the experiment, the --skip-to parameter was set 250 000 to 1 000 000 pages apart per instance, and the import was monitored (checked on from time to time), to stop each instance before catching up to another.

Note: The experiment did not try running multiple instances without the --skip-to parameter, to avoid potential clashes. If you try this without --skip-to, or you let the instances catch up to each other, please post your findings here. In this experiment, 2 of the windows caught up, and no error messages resulted. The instances of the script appeared to be jumping past each other.

Using --skip-to differs from normal operation in that progress increments are not displayed during the skip; only the (blinking) cursor is shown. After a few minutes, the increment reports begin to display.
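
The spacing of instances in the experiment above can be sketched as follows. This is a minimal sketch under stated assumptions: skip_to_commands is a hypothetical helper that only prints one command per window and does not run anything, and the spacing value is an example taken from the range mentioned above.

```shell
#!/bin/sh
# Sketch: print one importDump.php command per instance, with start
# pages spaced a fixed number of pages apart via --skip-to.
# skip_to_commands is an illustrative helper, not part of MediaWiki.

skip_to_commands() {
    dump="$1"       # path to the decompressed dump
    instances="$2"  # number of windows to use
    spacing="$3"    # pages between the starting points of two instances
    i=0
    while [ "$i" -lt "$instances" ]; do
        if [ "$i" -eq 0 ]; then
            # The first instance starts at the beginning of the dump.
            echo "php importDump.php $dump"
        else
            # --skip-to=n starts at page n, skipping the first n-1 pages.
            echo "php importDump.php --skip-to=$(( i * spacing + 1 )) $dump"
        fi
        i=$(( i + 1 ))
    done
}

# Usage: skip_to_commands dumpfile.xml 4 250000
```

Each printed command is meant to be pasted into its own window. Every instance still runs to the end of the dump unless it is stopped, so the import needs to be monitored as described above.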

Data segmentation

It may be a good idea to segment the data first with an XML splitter before importing it in parallel. Then run importDump.php on each segment in a separate window, which would avoid potential clashes. (If you successfully split a dump so that it works with this process, please post how you did it here.)

Import the most useful namespaces first

To speed up import of the most useful parts of the wiki, use the --namespaces parameter. Import templates first, because articles without working templates look awful. Then import articles. Or, do both at the same time, in multiple windows, as described above, starting templates first, as they import faster and the articles window(s) won't catch up.

Note: The main namespace doesn't have a prefix, so it must be specified using 0. "Main" and "Article" fail to run and return errors.

Once these imports are complete, run importDump.php again to import the pages in all the other namespaces.
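
The templates-first order can be sketched as a loop over namespace indexes. This is a minimal sketch, assuming it is run from the maintenance folder and that the Template namespace has its standard index, 10; import_namespaces and PHP_BIN are illustrative names, not part of MediaWiki.

```shell
#!/bin/sh
# Sketch: import selected namespaces one after another, in the order given.
# Assumptions: the current directory is the maintenance folder;
# PHP_BIN is a hypothetical variable naming the PHP binary (default: php).

import_namespaces() {
    dump="$1"
    shift
    php_bin="${PHP_BIN:-php}"
    for ns in "$@"; do
        # One run per namespace index; stop if a run fails.
        "$php_bin" importDump.php --namespaces="$ns" "$dump" || return 1
    done
}

# Usage: templates (10) first, then the main namespace (0):
#   import_namespaces dumpfile.xml 10 0
```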

Estimating how long it will take

Before you can estimate how long an import will take, you need to find out how many pages are in the wiki you are importing. That figure is displayed at Special:Statistics in each wiki. As of October 2023, the English Wikipedia had over 59 000 000 pages, including all page types such as talk pages and redirects, but not including pictures ("files").

To see how fast the import is going, go to Special:Statistics in the wiki you are importing into. Note the time and jot down the total pages. Then come back later and see by how much that number has changed. Convert that to pages per day, and then divide the total pages of the wiki you are importing by that figure to see how many days the import will take.

For example, in the experiment mentioned above, importing using parallelization, the total pages shown in Special:Statistics grew by about 1 000 000 pages per day. At that rate, it would take around 59 days to import the 59 000 000 pages (as of October 2023) in the English Wikipedia (not including pictures).
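
The arithmetic above can be sketched with shell integer arithmetic. This is a minimal sketch; pages_per_day and estimate_days are hypothetical helper names, and the sample readings in the usage comments are illustrative numbers chosen to match the rate quoted above, not measurements.

```shell
#!/bin/sh
# Sketch: estimate import duration from two Special:Statistics readings.
# Both helpers are illustrative; they use integer arithmetic only.

# pages_per_day FIRST_COUNT SECOND_COUNT SECONDS_BETWEEN_READINGS
pages_per_day() {
    echo $(( ($2 - $1) * 86400 / $3 ))
}

# estimate_days TOTAL_PAGES PAGES_PER_DAY (rounded up to whole days)
estimate_days() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Usage:
#   pages_per_day 0 125000 10800     # 125 000 new pages in 3 hours, prints 1000000
#   estimate_days 59000000 1000000   # prints 59
```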

Notes

Since MediaWiki 1.29 (task T144600), importDump.php doesn't update statistics. You should run initSiteStats.php manually after the import to update page and revision counts.

Troubleshooting

If errors occur when importing files, it may be necessary to use the --username-prefix option.

See also