Manual:Pywikibot/replace.py
Source file in the Wikimedia Git repository: scripts/replace.py
Replace.py is part of the Pywikibot framework.
This bot replaces text. It can take the list of pages to edit from an XML dump or a text file, or work on a single page. For more information, run:
$ python pwb.py replace -help
On Windows, "python" can be omitted.
On Linux (for example, Ubuntu), you must type "python3" instead of "python".
Overview
The script can be used in two ways:
- All parameters, including the text to search for and the replacement text, are given on the command line. This suits simple tasks. For example,
python pwb.py replace -ns:0 color colour -search:color
searches for the word "color" in articles only (ns:0) and replaces lowercase occurrences with "colour", asking for confirmation before each replacement and using the default edit summary. Be careful: there may be cases where "color" must not be replaced by "colour" (for example, in an article about CSS). Never run the script in automatic mode unless you are sure the replacement is correct in 100% of cases! This is the simplest form of the command (see the minimal set of parameters below).
  - -search:color is equivalent to searching for the word color with the wiki's search engine and is a quick way to collect articles, but colored and colors will not be found.
  - If the old or new text contains spaces, use quotation marks!
  - Recurring tasks can be saved in a batch file (Windows) or a shell script (Linux).
- The main parameters, including the old and new text, the exceptions and the edit summary, are stored in a file. A single file can hold several reusable tasks, called fixes. This file may be either fixes.py, which is included in your Pywikibot distribution but may change with every update (so keep a copy of your own), or user-fixes.py, which is designed for personal use; it is not included in Pywikibot and may be created with generate_user_files.py. The latter has a slightly different syntax, but comes with an example. This approach is much more efficient and flexible but needs some preparation.
These two methods may be combined; however, some parameters stored in the fix will override the corresponding parameters given on the command line.
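As a rough illustration of what a stored fix can look like, here is a minimal sketch in Python. The fix name 'colour', its summary text and the preview() helper are hypothetical; the dictionary keys follow the layout used by the examples shipped in fixes.py, and the exception tags listed are an assumption about what your Pywikibot version supports. The preview() helper only mimics the regex pairs and ignores the exceptions.

```python
import re

# In a real user-fixes.py the `fixes` dictionary already exists;
# it is created here only so that the sketch runs on its own.
fixes = {}

fixes['colour'] = {  # hypothetical fix name
    'regex': True,
    'msg': {'_default': 'Bot: color -> colour'},
    'replacements': [
        (r'\bcolor\b', 'colour'),
        (r'\bColor\b', 'Colour'),
    ],
    'exceptions': {
        # Assumed tag names; check them against your Pywikibot version.
        'inside-tags': ['nowiki', 'comment', 'math', 'pre'],
    },
}


def preview(fix, text):
    """Apply the fix's regex pairs in order, ignoring its exceptions."""
    for old, new in fix['replacements']:
        text = re.sub(old, new, text)
    return text


print(preview(fixes['colour'], 'Color is a color.'))
```

Such a fix would then be called by name, for example with python pwb.py replace -ns:0 -search:color -fix:colour.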
Once you have chosen between these options, you have another decision:
- You make simple text replacements like the one above. This is a good way to change words, templates, categories, section titles or names, but it is not flexible. For example, the command above will replace "color", but not colors, colored or the capitalised Color. (If you are only worried about case, you may still type python pwb.py replace -ns:0 color colour Color Colour -search:color.)
- You use regular expressions (often called regexes). These search for patterns and replace them with patterns. There are some examples in fixes.py. For agglutinative and inflecting languages this is the only efficient way to correct spelling.
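To get a feel for what a regex adds over a fixed word, the following sketch uses Python's re module directly. The pattern and the sample sentence are made up for illustration; it is not how replace.py is implemented, only the kind of expression you could pass to it with -regex.

```python
import re

text = 'Color, color, colors and colored, but not Colorado.'

# One pattern covers both cases and two endings; \b keeps it from
# touching longer words such as "Colorado".
pattern = re.compile(r'\b([Cc])olor(s|ed)?\b')

# \1 restores the original first letter, \2 the optional ending.
result = pattern.sub(r'\1olour\2', text)
print(result)
```

The groups in parentheses are what make the replacement "a pattern" too: the same expression handles four word forms without listing them one by one.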
Your third decision will be this:
- You search for pages to be modified in the live wiki. This gives acceptable speed if you work with templates, categories or the search engine, but is usually very slow for simple iteration over pages, especially in large and medium-sized wikis. So python pwb.py replace -start:! is the least recommended way of usage, as it wastes your time and the resources of the server. (However, it is sometimes necessary and unavoidable if your wiki does not have dumps.)
- You download an XML dump of your wiki from https://dumps.wikimedia.org (usually xxwiki-latest-pages-articles.xml.bz2) and use it with -xml. This speeds up your bot and makes more efficient use of your time, your computer and the server. This is the recommended way of searching for color, colors and colored together, because the search engine unfortunately does not handle regexes. The disadvantage of this method is that you won't find articles to which the given text was added after the dump was made.
  - Direct access to the dumps of your wiki is something like download:huwiki/. Change the first two letters to your language code.
  - Dumps at the given link are available only for Wikimedia wikis. For other wikis, contact the maintainer of that wiki to learn whether they have dumps.
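The idea behind working from a dump can be sketched with the Python standard library alone: read the file page by page and keep the titles whose text matches a regex. This is only an illustration under simplifying assumptions; the sample below is a tiny stand-in for a dump, the function names are hypothetical, and replace.py uses its own dump reader rather than this code.

```python
import io
import re
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump; real dumps are far larger, bz2-compressed
# and declare an XML namespace on every tag.
SAMPLE_DUMP = b"""<mediawiki>
  <page><title>Paint</title>
    <revision><text>Many colors exist.</text></revision></page>
  <page><title>Sound</title>
    <revision><text>Nothing to replace here.</text></revision></page>
</mediawiki>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit('}', 1)[-1]


def titles_matching(stream, pattern):
    """Return the titles of pages whose text matches the regex pattern."""
    regex = re.compile(pattern)
    title, hits = None, []
    for _event, elem in ET.iterparse(stream, events=('end',)):
        name = local_name(elem.tag)
        if name == 'title':
            title = elem.text
        elif name == 'text' and regex.search(elem.text or ''):
            hits.append(title)
        elif name == 'page':
            elem.clear()  # free memory once a page is done
    return hits


found = titles_matching(io.BytesIO(SAMPLE_DUMP), r'\bcolor(s|ed)?\b')
print(found)
```

For a real file, the stream could be opened with bz2.open(filename, 'rb') instead of io.BytesIO.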
And, last but not least, you face one more decision:
- You search for pages and modify them on the fly. This will again result in acceptable speed if you work with templates, categories or the search engine, but may be very slow if you just search for a regex in the complete namespace or wiki.
Minimal set of parameters
At least the following must be given to the bot every time:
- Where and how to search for the pages to be edited?
  - The corresponding parameter may be any of -start, -file, -page, -search, -xml, -cat etc.; see the Source section of the table below.
- The old text to be replaced and the new text to be substituted.
  - This may be one or more pairs of strings or the name of a fix.
- It is not mandatory, but it is usually worthwhile, and strongly recommended for beginners, to limit the work to the main namespace with -ns:0. This way you avoid changing users' contributions on talk pages, or "correcting" the title of an article on a page where that very title is being discussed. The visible part of templates (but not the code itself!) and file descriptions are also seen by readers. It is better not to modify talk pages, user pages and project pages (the "Wikipedia" namespace) at first, and doing so needs special care and community consensus even later. Don't be surprised by angry reactions, or by your bot being blocked, if you omit the namespace parameter.
Files
The bot uses three files in addition to the framework:
- replace.py: the main module
- fixes.py: a few predefined "fixes"
- user-fixes.py: a file for adding one's own fixes. It is created nearly empty by generate_user_files.py.
Files that may be used for input and/or output:
- filename.txt: a file with a list of articles, if specified with the parameter "-file"
- filename.xml: a local XML dump, if used with the parameter "-xml"
- replacelog: the log, with a name that may be specified with the parameter "-log"
Parameters
Local
You can run replace.py with the following parameters (for example, python pwb.py replace -file:articles_list.txt "errror" "error").
Source | |
---|---|
-xml | Retrieve information from a local XML dump (pages_current, see https://dumps.wikimedia.org). Argument can also be given as "-xml:filename". |
-xmlstart | Use with -xml. Start at the given title (titles are usually in the order of their first edit). If you quit with Ctrl+C, replace.py prints on the screen where to continue. |
-file | Work on all pages listed in a local text file. Reads any [[wiki link]] and uses these articles. Argument can also be given as "-file:filename". |
-cat | Work on all pages (not categories) which are in a specific category. Argument can also be given as "-cat:categoryname". |
-catr | Like -cat, but also recursively includes pages in subcategories, sub-subcategories etc. of the given category. |
-subcats | Work on all subcategories of a specific category. Argument can also be given as "-subcats:categoryname" or as "-subcats:categoryname|fromtitle". |
-subcatsr | Like -subcats, but also includes sub-subcategories etc. of the given category. |
-transcludes | Work on all pages which transclude a specific template. Argument can also be given as "-transcludes:referredtemplate", e.g. "-transcludes:stub" means pages transcluding the stub template. |
-page | Only edit a specific page. Argument can also be given as "-page:pagetitle". You can give this parameter multiple times to edit multiple pages. |
-ref | Work on all pages that link to a certain page. Argument can also be given as "-ref:referredpagetitle". |
-filelinks | Work on all pages that link to a certain image. Argument can also be given as "-filelinks:ImageName". |
-links | Work on all pages that are linked to from a certain page. Argument can also be given as "-links:linkingpagetitle". |
-start | Work on all pages in the wiki, starting at a given page. Choose "-start:!" to start at the beginning. Note: You are advised to use -xml instead of this option; this is meant for cases where there is no recent XML dump. |
-prefixindex | Work on pages whose titles begin with a common prefix. |
-titleregex | Work on pages that have titles matching the given regular expression, e.g. -titleregex:'.*foo.*' |
-search | Work on pages that contain the given search string, e.g. -search:"Color" |
Target | |
Replace parameters | |
-excepttitle:XYZ | Skip pages with titles that contain XYZ. If the -regex argument is given, XYZ will be regarded as a regular expression. |
-excepttext:XYZ | Skip pages which contain the text XYZ. If the -regex argument is given, XYZ will be regarded as a regular expression. |
-exceptinside:XYZ | Skip occurrences of the to-be-replaced text which lie within XYZ. If the -regex argument is given, XYZ will be regarded as a regular expression. |
-exceptinsidetag:XYZ | Skip occurrences of the to-be-replaced text which lie within an XYZ tag. |
-summary:XYZ | Set the summary message text, bypassing the default edit summaries. |
-fix:XYZ | Perform one of the predefined replacement tasks, which are given in the dictionary 'fixes' defined inside the file fixes.py or user-fixes.py. The -regex argument and given replacements will be ignored if you use -fix. See fixes.py for the currently available predefined fixes. |
-pairsfile | Lines from the given file name(s) will be read as if they were added to the command line at that point. For example, a file containing the lines "a" and "b", used as python pwb.py replace -page:X -pairsfile:file c d, will replace 'a' with 'b' and 'c' with 'd'. However, python pwb.py replace -page:X c -pairsfile:file d will also work, and will replace 'c' with 'a' and 'b' with 'd'. |
-namespace:n (abbrev. -ns:n) | Number of the namespace to process. The parameter can be used multiple times. It works in combination with all other parameters except for the -start parameter. (If you want to change all pages in a particular namespace, add the namespace prefix; for example, -start:User:!.) |
unnamed | First unnamed argument is the old text, second argument is the new text. If the -regex argument is given, the first argument will be regarded as a regular expression, and the second argument might contain expressions like \1 or \g<name>. |
Options | |
-always | Don't prompt you for each replacement. |
-recursive | Repeat the replacement for as long as the pattern still matches. |
-nocase | Use case insensitive search expressions (including regex). |
-allowoverlap | When occurrences of the pattern overlap, replace all of them. Warning! Don't use this option if you don't know what you're doing, because it might easily lead to infinite loops. |
-regex | Make replacements using regular expressions. If this argument isn't given, the bot will make simple text replacements. |
-dotall | A dot (.) also matches line breaks when using regex. |
-multiline | '^' and '$' match the beginning and end of each line. |
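The ordering rule described for -pairsfile can be pictured with a short Python sketch. It is a hypothetical model of the behaviour stated in the table, not the script's own code: the function name and the in-memory stand-in for the pairs file are assumptions made for illustration.

```python
def pair_up(args, pairs_files):
    """Expand -pairsfile arguments in place, then pair old/new texts.

    `pairs_files` maps a file name to its lines; it stands in for
    reading real files so that the sketch is self-contained.
    """
    flat = []
    for arg in args:
        if arg.startswith('-pairsfile:'):
            flat.extend(pairs_files[arg.split(':', 1)[1]])
        else:
            flat.append(arg)
    if len(flat) % 2:
        raise ValueError('every old text needs a new text')
    # Consecutive items form (old, new) pairs.
    return list(zip(flat[::2], flat[1::2]))


files = {'file': ['a', 'b']}
print(pair_up(['-pairsfile:file', 'c', 'd'], files))
print(pair_up(['c', '-pairsfile:file', 'd'], files))
```

The second call shows why the position of -pairsfile matters: the file's lines are spliced between "c" and "d", so the pairs shift.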
Global
This page is outdated.
These parameters override the settings in user-config.py.
Parameter | Description | Config variable |
---|---|---|
-dir:PATH | Read the bot's configuration from the directory given by PATH instead of the default directory. | |
-config:file | The user config filename. Default is user-config.py. | user-config.py |
-lang:xx | Set the language of the wiki you want to work on, overriding the configuration in user-config.py. Replace xx with the language code (for example, ru). | mylang |
-family:xyz | Set the family of the wiki you want to work on, e.g. wikipedia, wiktionary, commons, wikitravel, ... This overrides the configuration in user-config.py. | family |
-user:xyz | Log in as user 'xyz' instead of the default user. | usernames |
-daemonize:xyz | Immediately return control to the terminal and redirect stdout and stderr to the file xyz (use only for bots that need no input from stdin). | |
-help | Show the help text. | |
-log | Enable the log file, using the default file name script_name-bot.log. Logs are stored in the logs subdirectory. | log |
-log:xyz | Enable the log file, using 'xyz' as the file name. | logfilename |
-nolog | Disable the log file (if it is enabled by default). | |
-maxlag | Set a new maxlag parameter (a number of seconds). Bot edits are deferred during periods of database server lag. The default value is set in config.py. | maxlag |
-putthrottle:n -pt:n -put_throttle:n | Set the minimum time (in seconds) the bot will wait between saving pages. | put_throttle |
-debug:item -debug | Enable the log file and include extensive debugging data for the component "item" (for all components if the second form is used). | debug_log |
-verbose -v | Print more debugging information to the console. | verbose_output |
-cosmeticchanges -cc | Toggle the cosmetic_changes setting made in config.py or user-config.py to its inverse and override it. All other settings and restrictions are left untouched. | cosmetic_changes |
-simulate | Disable writing to the server. Useful for testing and debugging new code (if given, no real changes are made; the bot only shows what would have been changed). | simulate |
-<config var>:n | You may use any numeric config variable as a parameter and change its value from the command line. | |
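For orientation, the config variables in the right-hand column are ordinary Python assignments in user-config.py. The fragment below is a hedged sketch: the values and the account name ExampleBot are hypothetical, and the usernames stand-in exists only so that the fragment runs outside Pywikibot, where that mapping is already provided.

```python
from collections import defaultdict

# Stand-in so the fragment runs on its own; inside a real
# user-config.py the `usernames` mapping already exists.
usernames = defaultdict(dict)

family = 'wikipedia'   # overridden by -family:xyz
mylang = 'ru'          # overridden by -lang:xx
usernames['wikipedia']['ru'] = 'ExampleBot'  # hypothetical account; see -user:xyz
put_throttle = 10      # seconds to wait between saves; see -putthrottle:n
maxlag = 5             # seconds of server lag; see -maxlag
simulate = False       # True would behave like -simulate
```

A command-line parameter such as -lang:de then takes precedence over the stored value for that run only.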
Examples
If you want to change templates from the old syntax, e.g. {{msg:Stub}}, to the new syntax, e.g. {{Stub}}, download an XML dump file (page table) from https://dumps.wikimedia.org, then use this command:
$ python pwb.py replace -xml -regex "{{msg:(.*?)}}" "{{\1}}"
You can match patterns across more than one line:
$ python pwb.py replace -regex -start:! "First line\nSecond line" ""
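How line breaks interact with a pattern can be tried out in Python's re module before running the bot. The sketch below is an illustration with made-up sample text; it assumes that the -dotall and -multiline options correspond to the re.DOTALL and re.MULTILINE flags, as their descriptions in the options table suggest.

```python
import re

page = 'First line\nSecond line\nThird line'

# "\n" in a pattern matches the line break itself.
joined = re.sub(r'First line\nSecond line', '', page)

# Without DOTALL a dot stops at a line break, so nothing is replaced.
plain = re.sub(r'First.*Second', 'X', page)

# With DOTALL (or an inline "(?s)") the dot crosses line breaks.
dotall = re.sub(r'First.*Second', 'X', page, flags=re.DOTALL)

# With MULTILINE "^" matches at the start of every line.
bulleted = re.sub(r'^', '* ', page, flags=re.MULTILINE)

print(repr(joined), repr(plain), repr(dotall), repr(bulleted), sep='\n')
```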
You can insert or append text to a page (note the replacement text has embedded new lines):
$ python pwb.py replace -regex '(?s)^(.*)$' "\1
> ==new message==
> blah
> "
If you have a dump called foobar.xml and want to fix typos, e.g. Errror -> Error, use this:
$ python pwb.py replace -xml:foobar.xml "Errror" "Error"
If you have a page called 'John Doe' and want to convert HTML tags to wiki syntax, use:
$ python pwb.py replace -page:John_Doe -fix:HTML
If you run the bot without giving the old and new text, you will be prompted for the replacements, possibly several times:
$ python pwb.py replace -file:blah.txt
The script asks the user before modifying an article. It is recommended to double-check the result to be sure that the bot did not introduce errors (especially with misspelled words). A set of articles can be specified with an external text file containing wiki links:
[[plane]]
[[vehicle]]
[[train]]
[[car]]
The bot is then called using something like:
$ python pwb.py replace [global-arguments] -file:articles_list.txt "errror" "error"
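A list file like the one above can be produced from any list of titles with a few lines of Python. This is a minimal sketch with hypothetical helper names; the regex used to read the links back is a simplification for checking the file, not the parser Pywikibot itself uses.

```python
import re


def to_link_list(titles):
    """Return the file content: one [[wiki link]] per line."""
    return ''.join(f'[[{title}]]\n' for title in titles)


# Simplified link pattern: the title ends at "]", "|" or "#".
LINK = re.compile(r'\[\[([^\]|#]+)')


def read_titles(content):
    """Extract the linked titles from the file content."""
    return [title.strip() for title in LINK.findall(content)]


content = to_link_list(['plane', 'vehicle', 'train', 'car'])
print(content, end='')
```

Writing the string to articles_list.txt with UTF-8 encoding gives a file that can be passed with -file:articles_list.txt.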
Rather than specifying regular expressions on the command line, it is preferable to add them to user-fixes.py as a fix and call that fix by name:
$ python pwb.py replace -file:articles_list.txt -fix:example2
Example: Replacing multiple paragraphs
The original text of the page Meta:Sandbox is:
This page is for any tests.

Welcome to the sandbox!
To swap the two statements (so that the second one comes before the first), type the following:
$ python pwb.py replace -page:Meta:Sandbox -regex "This page is for any tests.\r\n\r\nWelcome to the sandbox!" "Welcome to the sandbox!\n\nThis page is for any tests."
To insert a line break, use \n.
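The same swap can be rehearsed in Python before touching the wiki. The sketch below is an assumption-laden variant rather than the command above: it captures both paragraphs in groups and uses \r?\n so that it matches whether the text uses Windows or Unix line breaks.

```python
import re

# Two paragraphs separated by an empty line, in either line-ending style.
pattern = re.compile(
    r'(This page is for any tests\.)(?:\r?\n){2}(Welcome to the sandbox!)'
)


def swap(text):
    """Put the second paragraph first, separated by one empty line."""
    return pattern.sub(r'\2\n\n\1', text)


unix_text = 'This page is for any tests.\n\nWelcome to the sandbox!'
windows_text = 'This page is for any tests.\r\n\r\nWelcome to the sandbox!'
print(swap(unix_text))
```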
Example: Plenty of unbolding within an article
In this article a great many bolded episode titles, spread over several tables, had to be unbolded. This is a case where you may want to use a bot on one single article, and it shows the role of some interesting parameters.
- What are we looking for?
  - Texts between pairs of ''',
  - which are within a table (we don't want to replace in the rest of the article!),
  - but do not contain a | character (just for safety, to make sure we are still within one cell; we might perhaps omit this),
  - paying attention to having many occurrences within a table (recursion),
  - and that the tables span several lines (dotall),
  - and that every table opening tag should match its own closing tag and every beginning of bold text should match its own closing ''' (that's why we use ?s to make the expressions ungreedy).
- What do we replace it with?
With the text parts before, between and after the bold markers; these are put in parentheses so that we can refer to them by their group numbers.
- The command
(It is wrapped here for readability, but you should of course write it on one line.)
$ python pwb.py replace -page -regex -recursive -dotall -summary:Vastagtalanítás "(\{\|.*?)'''([^\|]*?)'''(.*?\|\})" "\1\2\3"
The bot will ask for the title since we have not given it. Enclosing the expressions in double quotation marks suits the command line and leaves us free to use apostrophes inside them.
- Result
Here you are. (Don't click if your computer is not strong enough!)
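The effect of combining -regex, -dotall and -recursive in this example can be reproduced on a small sample in Python. This is a sketch under assumptions: the sample table is invented, and the loop that repeats the substitution until the text stops changing is a stand-in for what -recursive is described as doing, not the script's own code.

```python
import re

# The same pattern and replacement as in the command, with DOTALL.
pattern = re.compile(r"(\{\|.*?)'''([^\|]*?)'''(.*?\|\})", re.DOTALL)


def unbold_in_table(text):
    """Repeat the replacement until no bold text is left in the table."""
    while True:
        new_text = pattern.sub(r'\1\2\3', text)
        if new_text == text:
            return text
        text = new_text


sample = (
    "'''Intro''' stays bold.\n"
    "{|\n"
    "| '''One''' || '''Two'''\n"
    "|-\n"
    "| '''Three'''\n"
    "|}\n"
    "After '''table'''."
)
print(unbold_in_table(sample))
```

Each pass removes only one pair of bold markers per table, because the match swallows the rest of the table; that is why the replacement has to be repeated.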
Advanced use of fixes: own functions
Becoming a wizard with replace.py is within reach if you are familiar with the basics of Python programming. textlib.py (another module of the Pywikibot framework) has a wonderful but not widely used ability: if you give a function instead of a constant text or a regular expression as the replacement text or as an exception, it will recognize and execute it. With a little programming you can take advantage of this feature and use replace.py at a higher level. Needless to say, your own function gives much more flexibility than a simple regex. You may also use a function to generate the replacement expressions so as to keep them clearly arranged.
To learn how to use your own functions in fixes.py and user-fixes.py and what this is good for, see hu:Szerkesztő:Bináris/Fixes and functions HOWTO.
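As a small taste of the idea, the sketch below puts a function where the replacement text would normally go. The fix name, the summary and the keep_case function are hypothetical, and the result is previewed with Python's re.sub, which accepts a function in the same position; how your Pywikibot version calls such a function should be checked against the HOWTO.

```python
import re


def keep_case(match):
    """Return 'colour' with the capitalisation of the matched word."""
    word = match.group(0)
    return 'Colour' if word[0].isupper() else 'colour'


# In a real user-fixes.py the `fixes` dictionary already exists.
fixes = {}

fixes['colour-func'] = {  # hypothetical fix name
    'regex': True,
    'msg': {'_default': 'Bot: color -> colour'},
    'replacements': [
        (r'\b[Cc]olor\b', keep_case),  # a function instead of a string
    ],
}

old, new = fixes['colour-func']['replacements'][0]
result = re.sub(old, new, 'Color or color?')
print(result)
```

One function here replaces the two separate pairs that a plain-string fix would need for the upper-case and lower-case forms.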
External links
- The Python Standard Library » Regular expression operations
- Kodos - The Python Regular Expression Debugger - to test regular expressions specifically for python
- Regular-Expression.Info - introduction and comparison of various regex implementations, including python
- Unicode to HTML converter (useful to create articles_list.txt)