User:Bináris/Pywikibot cookbook
This page is under construction. Please help review and edit this page.
Pywikibot is the ninth wonder of the world, the eighth being MediaWiki itself. Pywikibot is a very flexible and powerful tool to edit Wikipedia or any other MediaWiki instance. However, there comes the moment when you feel that something is missing from it, and the Universe calls you to write your own scripts. Don't be afraid, this is not a disease, this is the natural way of personal evolution. Pywikibot is waiting for you.
For general help see the bottom right template. In this book we go into coding examples with some deeper explanation. (A personal confession from the creator of this page: I just wanted to use Pywikipedia, as we called it in the old times, then I wanted to slightly modify some of the scripts to better fit my needs, then I went to the book store and bought my first Python book. So it goes.)
Introduction
Creating a script
Running a script
You have basically two ways. The recommended one is to call your script through pwb.py.
However, if you don't need these features, especially if you don't use global options, you may also run your script directly with Python.

Coding style
Of course, we have PEP 8, Manual:Coding conventions/Python and Manual:Pywikibot/Development/Guidelines. But sometimes we feel like just hacking a small piece of code for ourselves and not bothering with style. Several times a small piece of temporary code begins to grow beyond our initial expectations, and we have to clean it. If you'll take my advice, do what you want, but my experience is that it is always worth coding for myself as if I coded for the world. On the other hand, when you use Pywikibot interactively (see below), it is normal to be lazy and use abbreviations and aliases. For example:
>>> import pywikibot as p
>>> import pywikibot.pagegenerators as pg
Note that the pagegenerators module is not imported automatically with pywikibot; it has its own import line. However, in this cookbook we won't use these abbreviations, for better readability.

Beginning and ending
In most cases you see something like this in the very first line of Pywikibot scripts:
#!/usr/bin/env python3
This is a shebang. If you use a Unix-like system, you know what it is for. If you run your scripts on Windows, you may just omit this line, it does not do anything. But it can be a good idea to use it anyway, in case someday others want to run your script on another system. The very last two lines of the scripts also follow a pattern. They usually look like this: if __name__ == '__main__':
main()
This is a good practice in Python. When you run the script directly from the command line, the condition will be true, and the main() function will run.

Scripting vs interactive use
For proper work we use scripts. But there is an interesting way of creating a sandbox. Just go to your Pywikibot root directory (where pwb.py is), start Python, and at the Python prompt type >>> import pywikibot
>>> site = pywikibot.Site()
Now you are in the world of Pywikibot (if the framework is properly set up). For example: >>> page = pywikibot.Page(site, 'titlecomeshere')
>>> page.text = page.text.replace('Pywikipedia', 'Pywikibot')
>>> page.save('Pywikibot forever!')
A big advantage of the shell is that you may omit the print() function:
page.title()
is equivalent to
print(page.title())
The #Working with namespaces section shows a rare exception when these are not equivalent, and we can take advantage of the difference for understanding what happens.

Documentation and help
We have three levels of documentation. As you go forward into understanding Pywikibot, you will become more and more familiar with these levels.
Notes
Basic concepts
Throughout the manual and the documentation we speak about MediaWiki rather than Wikipedia and Wikibase rather than Wikidata because these are the underlying software. You may use Pywikibot on other projects of the Wikimedia Foundation, on any non-WMF wiki and repository on the Internet, even on a MediaWiki or Wikibase instance on your home computer. See the right-hand menu for help.
>>> import pywikibot
>>> site = pywikibot.Site()
>>> site
APISite("hu", "wikipedia")
>>> site.data_repository()
DataSite("wikidata", "wikidata")
>>> site.image_repository()
DataSite("commons", "commons")
All the above concepts are classes or class-like factories; in the scripts we instantiate them. For example:
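A minimal sketch of such instantiations (the titles here are just placeholders):
import pywikibot

site = pywikibot.Site()                               # the wiki set in user-config.py
page = pywikibot.Page(site, 'Budapest')               # a wiki page
cat = pywikibot.Category(site, 'Category:Hungary')    # a category page
user = pywikibot.User(site, 'BinBot')                 # a user
repo = site.data_repository()                         # the attached Wikibase repository (e.g. Wikidata)
item = pywikibot.ItemPage(repo, 'Q1781')              # a Wikidata item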
Testing the framework
Let's try this at Python prompt: >>> import pywikibot
>>> site = pywikibot.Site()
>>> print(site.username())
BinBot
Of course, you will have the name of your own bot there if you have set it in user-config.py. If you save the above code to a file called test.py: import pywikibot
site = pywikibot.Site()
print(site.username())
and run it with python pwb.py test, you get the same result. Now try print(site.user())
This already really contacts your wiki; the result is the name of your bot if you are logged in, otherwise None.
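The difference is that site.username() only reads your configuration, while site.user() verifies the login against the live wiki. A minimal, hedged sketch of a login check (not from the original page):
import pywikibot

site = pywikibot.Site()
print(site.username())   # taken from user-config.py, no network access
site.login()             # log in if we are not logged in yet
print(site.user())       # asks the wiki; None would mean we are not logged in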
Getting a single page
Creating a Page object from title
In the rest of this cookbook, unless otherwise stated, we always assume that you have already used these two basic statements: import pywikibot
site = pywikibot.Site()
You want to get the article about Budapest in your wiki. As long as it is in the article namespace, it is as simple as page = pywikibot.Page(site, 'Budapest')
Note that Python is case sensitive, and in its world page and Page are different things. For such simple experiments the interactive Python shell is comfortable, as you can easily see the results without using print(): >>> page = pywikibot.Page(site, 'Budapest')
>>> page
Page('Budapest')
>>> type(page)
<class 'pywikibot.page._page.Page'>
Getting the type of an object is often useful when you want to discover the capabilities of Pywikibot. The result may look strange, but the main thing is that you got a Page. Now let's see the user page of your bot. Either you prefix the title with the namespace name: >>> title = site.username()
>>> page = pywikibot.Page(site, 'User:' + title)
>>> page
Page('Szerkesztő:BinBot')
or you give the namespace number (2 for User) as a third argument: >>> title = site.username()
>>> page = pywikibot.Page(site, title, 2)
>>> page
Page('Szerkesztő:BinBot')
will give the same result.

Getting the title of the page
On the other hand, if you already have a Page object, you may want its title as a string: >>> page = pywikibot.Page(site, 'Budapest')
>>> page.title()
'Budapest'
Possible errors
While getting pages may cause far fewer errors than saving them, a few types are worth mentioning; some of them are technical, while others are possible contradictions between our expectations and reality. Let's speak about them before actually getting the page.
Getting the content of the page
Important: by this time we don't have any knowledge about the existence of the page. We have not contacted the live wiki yet. We just created an object. It is just like a street address: you may write it on a document, but there may or may not be a house there. There are two main approaches to getting the content. It is important to understand the difference.

Page.text
You may notice that text is written without parentheses: it is a property rather than a method. >>> page = pywikibot.Page(site, 'Budapest')
>>> page.text
will write the whole text on your screen. Of course, this is just for experimenting. You may write text = page.text
Page.text is not a method, so referring to it several times does not slow down your bot. Just manipulate page.text or assign it a new value, then save.
If you want to know the details of how a property works, search for "Python decorators". For using it in your scripts it is enough to know the behaviour. Click on the above link and go through the right-hand menu. You will find some other properties without parentheses.
>>> page = pywikibot.Page(site, 'Arghhhxqrwl!!!')
>>> if page.text:
... print('Got it!')
... else:
... print(f'Page {page.title()} does not exist or has no content.')
...
Page Arghhhxqrwl!!! does not exist or has no content.
>>> page = pywikibot.Page(site, 'Arghhhxqrwl!!!')
>>> if page.text:
... print(len(page.text))
... else:
... print(page.exists())
...
False
While page creation does not contact the live wiki, referring to text for the first time and calling exists() do.

Page.get()
The traditional way is page.get(): >>> page = pywikibot.Page(site, 'Budapest')
>>> text = page.get()
>>> len(text)
165375
A non-existing page causes a NoPageError: >>> page = pywikibot.Page(site, 'Arghhhxqrwl!!!')
>>> text = page.get()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "c:\Pywikibot\pywikibot\page\_page.py", line 397, in get
self._getInternals()
File "c:\Pywikibot\pywikibot\page\_page.py", line 436, in _getInternals
self.site.loadrevisions(self, content=True)
File "c:\Pywikibot\pywikibot\site\_generators.py", line 772, in loadrevisions
raise NoPageError(page)
pywikibot.exceptions.NoPageError: Page [[hu:Arghhhxqrwl!!!]] doesn't exist.
A redirect page causes an IsRedirectPageError: >>> page = pywikibot.Page(site, 'Time to Shine')
>>> text = page.get()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "c:\Pywikibot\pywikibot\page\_page.py", line 397, in get
self._getInternals()
File "c:\Pywikibot\pywikibot\page\_page.py", line 444, in _getInternals
raise self._getexception
pywikibot.exceptions.IsRedirectPageError: Page [[hu:Time to Shine]] is a redirect page.
If you don't want to handle redirects separately and just want to distinguish existing from non-existing pages, use get(get_redirect=True): >>> page = pywikibot.Page(site, 'Time to Shine')
>>> page.get(get_redirect=True)
'#ÁTIRÁNYÍTÁS [[Time to Shine (egyértelműsítő lap)]]'
Here is a piece of code to handle all the cases. It is already too long for the prompt, so I saved it to a file. for title in ['Budapest', 'Arghhhxqrwl!!!', 'Time to Shine']:
page = pywikibot.Page(site, title)
try:
text = page.get()
print(f'Length of {page.title()} is {len(text)} bytes.')
except pywikibot.exceptions.NoPageError:
print(f'{page.title()} does not exist.')
except pywikibot.exceptions.IsRedirectPageError:
print(f'{page.title()} redirects to {page.getRedirectTarget()}.')
print(type(page.getRedirectTarget()))
Which results in: Length of Budapest is 165375 bytes.
Arghhhxqrwl!!! does not exist.
Time to Shine redirects to [[hu:Time to Shine (egyértelműsítő lap)]].
<class 'pywikibot.page._page.Page'>
For a practical application see #Content pages and talk pages.

Reloading
If your bot runs slowly and you are in doubt whether the page text is still current, use get(force=True): >>> import pywikibot as p
>>> site = p.Site()
>>> page = p.Page(site, 'Kisbolygók listája (1–1000)')
>>> page.text
'[[#1|1–500.]] • [[#501|501–1000.]]\n\n{{:Kisbolygók listája (1–500)}}\n{{:Kisbolygók listája (501–1000)}}\n\n[[Kategória:A Naprendszer kisbolygóinak
listája]]'
>>> page.text = 'Luke, I am your father!'
>>> page.text
'Luke, I am your father!'
>>> page.get(force=True)
'[[#1|1–500.]] • [[#501|501–1000.]]\n\n{{:Kisbolygók listája (1–500)}}\n{{:Kisbolygók listája (501–1000)}}\n\n[[Kategória:A Naprendszer kisbolygóinak
listája]]'
>>> page.text
'Luke, I am your father!'
>>>
>>> page.text = page.get()
>>> page.text
'[[#1|1–500.]] • [[#501|501–1000.]]\n\n{{:Kisbolygók listája (1–500)}}\n{{:Kisbolygók listája (501–1000)}}\n\n[[Kategória:A Naprendszer kisbolygóinak
listája]]'
Notes
Saving a single page
Here we also have two solutions, but the difference is much smaller than for getting. The new and recommended way is save(); the old one is put(). save() is hard to follow in the code because it calls a chain of further methods. On the other hand, put() is easy to understand: page = pywikibot.Page(site, 'Special:MyPage')
text = page.get()
# Do something with text
page.put(text, 'Hello, I modified it!')
Doing the same with save(): page.text = text
page.save('Hello, I modified it!')
If you look into the code, this is just what put() does: it assigns the text to page.text and calls save(). The only capability of old put() that save() does not offer is passing the new text directly as an argument, so
page.put('This page was emptied.', 'Empty talk pages')
in a loop is simpler than assigning page.text and then saving in every iteration.
Possible errors
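A hedged sketch of catching typical save-related exceptions (the page title and summary are just placeholders; the exception classes come from pywikibot.exceptions):
import pywikibot

site = pywikibot.Site()
page = pywikibot.Page(site, 'Sandbox')
page.text += '\nTest.'
try:
    page.save('Bot: test edit')
except pywikibot.exceptions.LockedPageError:
    print(f'{page.title()} is protected, skipping.')
except pywikibot.exceptions.EditConflictError:
    print('Edit conflict, skipping.')
except pywikibot.exceptions.OtherPageSaveError as error:
    print(f'Saving failed: {error}')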
Page generators – working with plenty of pages
Overview
Page generators form one of the most powerful tools of Pywikibot. A page generator iterates over the desired pages. Why use page generators?
A possible reason to write your own page generator is mentioned in the Follow your bot section. Most page generators are available via command line arguments for end users. See https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.pagegenerators.html for details. If you write your own script, you may use these arguments, but if they are permanent for the task, you may want to directly invoke the appropriate generator instead of handling command line arguments (see the sketch below). Life is too short to list them all here, but the most used generators are listed under the above link. You may also discover them in the pagegenerators package.
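A minimal, hedged sketch of the two approaches (the category name is a placeholder):
import pywikibot
from pywikibot import pagegenerators

# 1) Let end users select the pages with -cat:, -page:, -file: etc.
local_args = pywikibot.handle_args()
factory = pagegenerators.GeneratorFactory()
for arg in local_args:
    factory.handle_arg(arg)
gen = factory.getCombinedGenerator()

# 2) Or, if the page set is fixed for the task, call a generator directly:
# cat = pywikibot.Category(pywikibot.Site(), 'Living people')
# gen = pagegenerators.CategorizedPageGenerator(cat)

if gen:
    for page in gen:
        print(page.title())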
Pagegenerators package (group 1 and 2)
Looking into the package, a simple example is: import pywikibot.pagegenerators
for page in pywikibot.pagegenerators.AllpagesPageGenerator(total=10):
print(page)
which is almost equivalent to: from pywikibot.pagegenerators import AllpagesPageGenerator
for page in AllpagesPageGenerator(total=10):
print(page)
This directory appears for us as the pywikibot.pagegenerators package.
API generators (group 3)
MediaWiki offers a lot of low-level page generators, which are implemented in the site module.

Usage
Generators may be used in for loops or turned into lists: print(list(AllpagesPageGenerator(total=10)))
But be careful: while loops process pages continuously, the list conversion may take a while because it has to read all the items from the generator. This statement is very fast for total=10, takes noticeable time for total=1000, and is definitely slow for total=100000. Of course, it will consume a lot of memory for big numbers, so usually it is better to use generators in a loop.

A few interesting generators
A non-exhaustive list of useful generators. All these may be imported from pywikibot.pagegenerators.

Autonomous generators (_generators.py)
Most of them correspond to a special page on wiki.
... and much more.

An odd one out
Filtering generators (_filters.py)
Other wrappers (__init__.py)
Examples
List Pywikibot user scripts:
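A hedged sketch of one way to do this (the user name and the '.py' title pattern are assumptions), combining a prefix generator with a title filter:
import pywikibot
from pywikibot.pagegenerators import PrefixingPageGenerator, RegexFilterPageGenerator

site = pywikibot.Site()
# All subpages of the given user page...
gen = PrefixingPageGenerator('User:BinBot/', site=site)
# ...restricted to titles ending with '.py'
for page in RegexFilterPageGenerator(gen, r'\.py$'):
    print(page.title())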
Working with page histories
Revisions
Processing page histories may be frightening due to the amount of data, but is easy because we have plenty of methods. Some of them extract a particular piece of information such as a user name, while others return an object called a revision. A revision represents one line of a page history with all its data, which is more than you see in the browser and more than you usually need. Before we look into these methods, let's have a look at revisions. We have to keep a few things in mind.
Furthermore, bots are not directly marked in page histories. You see in recent changes if an edit was made by a bot because this property is stored in the recent changes table of the database and is available there only for a few weeks. If you want to know whether an older edit was made by a bot, you need some heuristics (see the Is it a bot edit? section below).
API:Revisions gives a deeper insight, while Manual:Slot says something about slots and roles (for most of us this is not too interesting). Methods returning a single revision also return the content of the page, so it is a good idea to choose a short page for experiments (see Special:ShortPages on your home wiki). For now I choose a page which is short, but has some page history: hu:8. évezred (8th millennium). Well, we have really little to say about it, and we suffer from a lack of reliable sources. Let's see the last (current) revision! page = pywikibot.Page(site, '8. évezred')
rev = page.latest_revision # It's a property, don't use ()!
print(rev)
{'revid': 24120313, 'parentid': 15452110, 'minor': True, 'user': 'Misibacsi', 'userid': 110, 'timestamp': Timestamp(2021, 8, 10, 7, 51), 'size': 341,
'sha1': 'ca17cba3f173188954785bdbda1a9915fb384c82', 'roles': ['main'], 'slots': {'main': {'contentmodel': 'wikitext', 'contentformat': 'text/x-wiki',
'*': "{{évezredek}}\n\nA '''8. évezred''' ciklusa 7001. [[január 1.|január 1-jén]] kezdődik, és 8000. [[december 31.|december 31-éig]] tart.\n\n== Csillagászati előrejelzések ==\n\n* A [[90377 Sedna|Szedna]] [[kisbolygó]] keringése során pályájának legtávolabbi pontjába ([[aphélium]]) ér 7800 körül.
\n\n[[Kategória:Évezredek|08]]"}}, 'comment': '/* Csillagászati előrejelzések */ link', 'parsedcomment': '<span dir="auto"><span class="autocomment"><a href="/wiki/8._%C3%A9vezred#Csillagászati_előrejelzések" title="8. évezred">→\u200eCsillagászati előrejelzések</a>: </span> link</span>', 'tags': []
, 'anon': False, 'userhidden': False, 'commenthidden': False, 'text': "{{évezredek}}\n\nA '''8. évezred''' ciklusa 7001. [[január 1.|január 1-jén]] kezdődik, és 8000. [[december 31.|december 31-éig]] tart.\n\n== Csillagászati előrejelzések ==\n\n* A [[90377 Sedna|Szedna]] [[kisbolygó]] keringése során pályájának legtávolabbi pontjába ([[aphélium]]) ér 7800 körül.\n\n[[Kategória:Évezredek|08]]", 'contentmodel': 'wikitext'}
As we look into the code, we don't get too much information about how to print it more readably, but we notice that it can be iterated like a dictionary: for item in rev.items():
print(item)
('revid', 24120313)
('parentid', 15452110)
('minor', True)
('user', 'Misibacsi')
('userid', 110)
('timestamp', Timestamp(2021, 8, 10, 7, 51))
('size', 341)
('sha1', 'ca17cba3f173188954785bdbda1a9915fb384c82')
('roles', ['main'])
('slots', {'main': {'contentmodel': 'wikitext', 'contentformat': 'text/x-wiki', '*': "{{évezredek}}\n\nA '''8. évezred''' ciklusa 7001. [[január 1.|január 1-jén]] kezdődik, és 8000. [[december 31.|december 31-éig]] tart.\n\n== Csillagászati előrejelzések ==\n\n* A [[90377 Sedna|Szedna]] [[kisbolygó]] keringése során pályájának legtávolabbi pontjába ([[aphélium]]) ér 7800 körül.\n\n[[Kategória:Évezredek|08]]"}})
('comment', '/* Csillagászati előrejelzések */ link')
('parsedcomment', '<span dir="auto"><span class="autocomment"><a href="/wiki/8._%C3%A9vezred#Csillagászati_előrejelzések" title="8. évezred">→\u200eCsillagászati előrejelzések</a>: </span> link</span>')
('tags', [])
('anon', False)
('userhidden', False)
('commenthidden', False)
('text', "{{évezredek}}\n\nA '''8. évezred''' ciklusa 7001. [[január 1.|január 1-jén]] kezdődik, és 8000. [[december 31.|december 31-éig]] tart.\n\n== Csillagászati előrejelzések ==\n\n* A [[90377 Sedna|Szedna]] [[kisbolygó]] keringése során pályájának legtávolabbi pontjába ([[aphélium]]) ér 7800 körül.\\n[[Kategória:Évezredek|08]]")
('contentmodel', 'wikitext')
While a revision may look like a dictionary on the screen, it is not a dictionary but a Revision object. Non-existing pages raise pywikibot.exceptions.NoPageError when you ask for their revisions. The parentid of a page-creating revision is 0, which evaluates to False:
>>> 'Modification' if rev.parentid else 'Creation'
'Creation'
Extract data from revision
We don't have to transform a Revision to extract data from it; both the dictionary-like and the attribute-like syntax work: print(rev['comment'])
print(rev.comment)
/* Csillagászati előrejelzések */ link
/* Csillagászati előrejelzések */ link
As you see, they are identical. But keep in mind that both solutions may cause problems if some parts of the revision were hidden by an admin. Let's see what happens upon hiding:
You may say this is not quite consistent, but this is the experimental result. You have to handle hidden properties, and for general code you should know whether the bot runs as an admin. A possible solution: admin = 'sysop' in pywikibot.User(site, site.user()).groups()
If you are not an admin but need admin rights for testing, you may get them on https://test.wikipedia.org. For example print(rev.commenthidden or rev.parsedcomment)
will never raise an error. if rev.text:
# do something
will do something if the content is not hidden for you and not empty. if rev.text is not None:
# do something
will distinguish the two cases. The example was found when an oversighter suppressed the content of a revision.

Is it a bot edit?
Have a look at this page history. It has a lot of bots, some of which are no longer or were never registered. Pywikibot has a site.isBot() method, but it only knows about currently registered bots, so we complete it with a heuristic: def maybebot(rev):
if site.isBot(rev.user):
return True
user = rev.user.lower()
comment = rev.comment.lower()
return user.endswith('bot') or user.endswith('script') \
or comment.startswith('bot:') or comment.startswith('robot:')
page = pywikibot.Page(site, 'Ordovicesek')
for rev in page.revisions():
print(f'{rev.user:15}\t{site.isBot(rev.user)}\t{maybebot(rev)}\t{rev.comment}')
Addbot False True Bot: 15 interwiki link migrálva a [[d:|Wikidata]] [[d:q768052]] adatába
Hkbot True True Bottal végzett egyértelműsítés: Tacitus –> [[Publius Cornelius Tacitus]]
ArthurBot False True r2.6.3) (Bot: következő hozzáadása: [[hr:Ordovici]]
Luckas-bot False True r2.7.1) (Bot: következő hozzáadása: [[eu:Ordoviko]]
Beroesz False False
Luckas-bot False True Bot: következő hozzáadása: [[sh:Ordoviki]]
MondalorBot False True Bot: következő hozzáadása: [[id:Ordovices]]
Xqbot True True Bot: következő módosítása: [[cs:Ordovikové]]
ArthurBot False True Bot: következő hozzáadása: [[cs:Ordovicové]]
RibotBOT False True Bot: következő módosítása: [[no:Ordovikere]]
Xqbot True True Bot: következő hozzáadása: [[br:Ordovices]]; kozmetikai változtatások
Pasztillabot True True Kategóriacsere: [[Kategória:Ókori népek]] -> [[Kategória:Ókori kelta népek]]
SamatBot True True Robot: következő hozzáadása: [[de:Ordovicer]], [[es:Ordovicos]]
Istvánka False False
Istvánka False False +iwiki
Adapa False False
Data Destroyer False False Új oldal, tartalma: „'''Ordovicesek''', [[ókor]]i népcsoport. [[Britannia]] nyugati partján éltek, szemben [[Anglesea]] szigetével. [[Tacitus]] tesz említést róluk. ==Források== {{p...”
hu:Kategória:Bottal létrehozott olasz vasútállomás cikkek contains articles created by bot. Here is a page generator that yields pages which were possibly never edited by any human: cat = pywikibot.Category(site, 'Bottal létrehozott olasz vasútállomás cikkek')
def gen():
for page in cat.articles():
if all([maybebot(rev) for rev in page.revisions()]):
yield page
Test with: for page in gen():
print(page)
Timestamp
The documentation suggests using pywikibot.Timestamp, which is a subclass of datetime.datetime. Elapsed time since the last edit: >>> page = pywikibot.Page(site, 'Ordovicesek')
>>> print(site.server_time() - page.latest_revision.timestamp)
3647 days, 21:09:56
Pretty much, isn't it? :-) The result is a datetime.timedelta object. In the shell timestamps are shown in an easily readable form, but when you print them from a script, they get a machine-readable format. If you want to restore the easily readable format, use the repr() function: >>> page = pywikibot.Page(site, 'Budapest')
>>> rev = page.latest_revision
>>> time = rev.timestamp
>>> time
Timestamp(2023, 2, 26, 9, 4, 14)
>>> print(time)
2023-02-26T09:04:14Z
>>> print(repr(time))
Timestamp(2023, 2, 26, 9, 4, 14)
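Since the subtraction above yields a datetime.timedelta, it can be compared with other timedelta values; a minimal, hedged sketch (page is assumed to be a Page object as above):
from datetime import timedelta

age = site.server_time() - page.latest_revision.timestamp
if age > timedelta(days=365):
    print(f'{page.title()} has not been edited for {age.days} days.')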
For the above subtraction we used site.server_time(), which returns the current time of the wiki as a Timestamp.

Useful methods
Methods discussed in this section belong to the Page class.

Page history in general
For example, to get a diff link between the first and the second version of a page without knowing its oldid (works for every language version): print(f'[[Special:Diff/{list(page.revisions(reverse=True))[1].revid}]]')
And this one is a working piece of code from hu:User:DhanakBot/szubcsonk.py. This bot administers substubs. Before we mark a substub for deletion, we wonder if it has been vandalized. Maybe it was a longer article, but someone truncated it, and a good-faith user marked it as a substub without looking at the page history. So the bot places a warning if the page was 1000 bytes longer or twice as long at any point of its history as it is now. def shortenedMark(self, page):
""" Mark if it may have been vandalized."""
versions = list(page.revisions())
curLength = versions[0]['size']
sizes = [r['size'] for r in versions]
maxLength = max(sizes)
if maxLength >= 2 * curLength or maxLength > curLength + 1000:
return '[[File:Ambox warning orange.svg|16px|The article became shorter!]]'
else:
return ''
Last version of the page
The last version got special attention from the developers and is very comfortable to use.
For example, to get a diff link between the newest and the second newest version of a page without knowing its oldid (works for every language version): print(f'[[Special:Diff/{page.latest_revision.revid}]]')
Oldest version of a page
Determine how many times the current version is longer than the first version (beware of division by zero, which is unlikely but possible): >>> page = pywikibot.Page(site, 'Budapest')
Determine how many times is the current version longer then the first version (beware of division by zero which is unlikely but possible): >>> page = pywikibot.Page(site, 'Budapest')
>>> page.latest_revision.size / page.oldest_revision.size
115.17727272727272
>>> pywikibot.Page(site, 'Test').put('', 'Test')
Page [[Test]] saved
>>> page.latest_revision.size / page.oldest_revision.size
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero
Determine which user how many articles has created in a given category, not including its subcategories: >>> import collections
>>> cat = pywikibot.Category(site, 'Budapest parkjai') # Parks of Budapest
>>> collections.Counter(page.oldest_revision.user for page in cat.articles())
Counter({'OsvátA': 4, 'Antissimo': 3, 'Perfectmiss': 2, 'Szilas': 1, 'Pásztörperc': 1, 'Fgg': 1, 'Solymári': 1, 'Zaza~huwiki': 1, 'Timur lenk': 1, 'Pa
sztilla (régi)': 1, 'Millisits': 1, 'László Varga': 1, 'Czimmy': 1, 'Barazoli40x40': 1})
Knowing that Main Page is a valid alias for the main page in every language and family, sort several Wikipedias by creation date: >>> from pprint import pprint
>>> langs = ('hu', 'en', 'de', 'bg', 'fi', 'fr', 'el')
>>> pprint(sorted([(lang, pywikibot.Page(pywikibot.Site(lang, 'wikipedia'), 'Main Page').oldest_revision.timestamp) for lang in langs], key=lambda tup: tup[1]))
[('en', Timestamp(2002, 1, 26, 15, 28, 12)),
('fr', Timestamp(2002, 10, 31, 10, 33, 35)),
('hu', Timestamp(2003, 7, 8, 10, 26, 5)),
('fi', Timestamp(2004, 3, 22, 16, 12, 24)),
('bg', Timestamp(2004, 10, 25, 18, 49, 23)),
('el', Timestamp(2005, 9, 7, 13, 55, 9)),
('de', Timestamp(2017, 1, 27, 20, 11, 5))]
For some reason it gives a false date for dewiki, where Main Page is redirected to Wikipedia:Hauptseite. You may want to know when the original creator of an article last edited your wiki; in some cases it is a question whether it's worth contacting them. The result is a timestamp as described above, so you can subtract it from the current date to get the elapsed time. See also the Working with users section. >>> page = pywikibot.Page(site, 'Budapest')
>>> user = pywikibot.User(site, page.oldest_revision.user)
>>> user.last_edit[2]
Timestamp(2006, 5, 21, 11, 11, 22)
Other
for rev in page.revisions():
print(f'{repr(rev.timestamp)}\t{rev.revid}\t{page.permalink(rev.revid, with_protocol=True)}')
Timestamp(2013, 3, 9, 2, 3, 14) 13179698 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=13179698
Timestamp(2011, 7, 10, 5, 50, 56) 9997266 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=9997266
Timestamp(2011, 3, 13, 17, 41, 19) 9384635 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=9384635
Timestamp(2011, 1, 15, 23, 56, 3) 9112326 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=9112326
Timestamp(2010, 11, 18, 15, 25, 44) 8816647 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=8816647
Timestamp(2010, 9, 27, 13, 16, 24) 8539294 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=8539294
Timestamp(2010, 3, 28, 10, 59, 28) 7422239 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=7422239
etc.
However, besides this URL format, over the years MediaWiki has introduced a nicer form for internal use. You may use it in any language: print(f'[[Special:Permalink/{rev.revid}]]')
This will result in such permalinks that you can use on your wiki:
Deleted revisions
When a page is deleted and recreated, it will get a new id. Thus the only way of mining in the deleted revisions is to identify the page by its title. On the other hand, when a page is moved (renamed), it takes the old id to the new title and a new redirect page is created with a new id and the old title. Taking everything into account, an investigation may be complicated, as deleted versions may be under the old title and deleted versions under the same title may belong to another page. It may be easier without a bot if it is about one page. Now we take a simple case where the page was never renamed.
>>> page = pywikibot.Page(site, '2023')
>>> page.has_deleted_revisions()
True
The following methods need admin rights, otherwise they will raise an error.
print(len(list(page.loadDeletedRevisions())))
The main use case is to get timestamps for getDeletedRevision(). This method takes a timestamp, which is most easily got from the above loadDeletedRevisions(). The returned deleted revision has these keys: dict_keys(['revid', 'user', 'timestamp', 'slots', 'comment']). Theoretically, with content=True the deleted text itself can also be retrieved: text = page.getDeletedRevision('2023-01-30T19:11:36Z', content=True)['slots']['main']['*']
The underlying method for both of the above is site.deletedrevs().
File pages
FilePage is a subclass of Page, so you can use all the above methods, but it also has some special ones (a hedged usage sketch follows the list below). Keep in mind that a FilePage represents a file description page in the File: namespace. Files (images, sounds) themselves are in the Media: pseudo namespace.
- https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.FilePage.get_file_history
- https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.FilePage.get_file_historyversion
- https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.FilePage.getFileVersionHistoryTable
- https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.FilePage.oldest_file_info
- https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.FilePage.latest_file_info
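A minimal, hedged sketch of using a FilePage (the file name is just a placeholder; latest_file_info and oldest_file_info hold metadata of the newest and the first uploaded version):
import pywikibot

site = pywikibot.Site('commons', 'commons')
filepage = pywikibot.FilePage(site, 'File:Example.jpg')
latest = filepage.latest_file_info
print(latest.timestamp, latest.user, latest.width, latest.height)
oldest = filepage.oldest_file_info
print(oldest.timestamp, oldest.user)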
Working with namespaces
Walking the namespaces
Does your wiki have an article about Wikidata? A Category:Wikidata? A template named Wikidata or a Lua module? This piece of code answers the question: for ns in site.namespaces:
page = pywikibot.Page(site, 'Wikidata', ns)
print(page.title(), page.exists())
In the Page.text section we got to know properties that work without parentheses; site.namespaces is such a property. Our documentation says that it behaves like a dictionary keyed by the namespace numbers, so for ns in site.namespaces:
print(ns, site.namespaces[ns])
will write the namespace indices and the canonical names. The latter are usually English names, but for example the #100 namespace in Hungarian Wikipedia has a Hungarian canonical name because the English Wikipedia does not have the counterpart any more. Namespaces may vary from wiki to wiki; for Wikipedia, WMF sysadmins set them in the config files, but for your private wiki you may set them yourself following the documentation. If you run the above loop, you may notice that File and Category appear as :File: and :Category:, with a leading colon. Now we have to dig into the code to see that namespace objects have an absolutely different, much richer representation: for ns in site.namespaces:
print(repr(site.namespaces[ns]))
The first few lines of the result from Hungarian Wikipedia are:
Namespace(id=-2, custom_name='Média', canonical_name='Media', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=False)
Namespace(id=-1, custom_name='Speciális', canonical_name='Special', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=False)
Namespace(id=0, custom_name=, canonical_name=, aliases=[], case='first-letter', content=True, nonincludable=False, subpages=False)
Namespace(id=1, custom_name='Vita', canonical_name='Talk', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=True)
Namespace(id=2, custom_name='Szerkesztő', canonical_name='User', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=True)
Namespace(id=3, custom_name='Szerkesztővita', canonical_name='User talk', aliases=['User vita'], case='first-letter', content=False, nonincludable=False, subpages=True)
Namespace(id=4, custom_name='Wikipédia', canonical_name='Project', aliases=['WP', 'Wikipedia'], case='first-letter', content=False, nonincludable=False, subpages=True)
So custom names mean the localized names in your language, while aliases are usually abbreviations such as WP for Wikipedia, or old names kept for backward compatibility. Now we know what we are looking for. But how to get it properly? At top level the documentation suggests using site.ns_normalize(): for ns in site.namespaces:
print(ns,
site.namespaces[ns],
site.ns_normalize(str(site.namespaces[ns])) if ns else '')
-2 Media: Média
-1 Special: Speciális
0 :
1 Talk: Vita
2 User: Szerkesztő
3 User talk: Szerkesztővita
4 Project: Wikipédia
5 Project talk: Wikipédia-vita
6 :File: Fájl
etc.
This will write the indices, canonical (English) names and localized names side by side. There is another way that gives a nicer result, but we have to guess it from the code of the namespace objects. This keeps the colons: for ns in site.namespaces:
print(ns,
site.namespaces[ns],
site.namespaces[ns].custom_prefix())
-2 Media: Média:
-1 Special: Speciális:
0 : :
1 Talk: Vita:
2 User: Szerkesztő:
3 User talk: Szerkesztővita:
4 Project: Wikipédia:
5 Project talk: Wikipédia-vita:
6 :File: :Fájl:
etc.
Determine the namespace of a page
Building on the above results we can determine the namespace of a page in any form. We investigate an article and a user talk page. Although the namespace object we get is said to be a dictionary in the documentation, it is quite unique, and its behaviour and even the apparent type depend on what we ask. It can be equal to an integer and to several strings at the same time. The reason for this strange personality is that the default methods are overwritten. If you want to deeply understand what happens here, open the Namespace class in the code. >>> page = pywikibot.Page(site, 'Budapest')
>>> page.namespace()
Namespace(id=0, custom_name='', canonical_name='', aliases=[], case='first-letter', content=True, nonincludable=False, subpages=False)
>>> print(page.namespace())
:
>>> page.namespace() == 0
True
>>> page.namespace().id
0
>>> page = pywikibot.Page(site, 'user talk:BinBot')
>>> page.namespace()
Namespace(id=3, custom_name='Szerkesztővita', canonical_name='User talk', aliases=['User vita'], case='first-letter', content=False, nonincludable=False, subpages=True)
>>> print(page.namespace())
User talk:
>>> page.namespace() == 0
False
>>> page.namespace() == 3
True
>>> page.namespace().custom_prefix()
'Szerkesztővita:'
>>> page.namespace() == 'User talk:'
True
>>> page.namespace() == 'Szerkesztővita:' # *
True
>>> page.namespace().id
3
>>> page.namespace().custom_name
'Szerkesztővita'
>>> page.namespace().aliases
['User vita']
The starred command will give True only on Hungarian Wikipedia, because it compares to the localized name. Any of the values may be got with the dotted syntax as shown in the last three lines. It is common to get unknown pages from a page generator or a similar iterator, and it may be important to know what kind of page we got. In this example we walk through all the pages that link to an article. def for_readers(ns: int) -> bool:
return ns in (0, 6, 10, 14, 100)
# (main, file, template, category and portal)
basepage = pywikibot.Page(site, 'Budapest')
for page in basepage.getReferences(total=110):
print(page.title(),
page.namespace(),
page.namespace().custom_prefix(),
page.namespace().id,
for_readers(page.namespace().id)
)
Content pages and talk pages
Let's have a rest with a much easier exercise! Another frequent task is to switch from a content page to its talk page or vice versa. We have a method to toggle them and another to decide whether a page is in a talk namespace: >>> page = pywikibot.Page(site, 'Budapest')
>>> page
Page('Budapest')
>>> page.isTalkPage()
False
>>> talk = page.toggleTalkPage()
>>> talk
Page('Vita:Budapest')
>>> talk.isTalkPage()
True
>>> talk.toggleTalkPage()
Page('Budapest')
Note that pages in pseudo namespaces (such as Special: and Media:) have no talk pages. The next example shows how to work with content and talk pages together. Many wikis place a template on the talk pages of living persons' biographies. This template collects the talk pages into a category. We wonder if there are talk pages without the content page having a "Category:Living persons". These pages need attention from users. The first experience is that separate listing of blue pages (articles), green pages[1] (redirects) and red (missing) pages is useful, as they need a different approach. We walk the category (see #Working with categories), get the articles and search them for the living persons' categories by means of a regex (it is specific to Hungarian Wikipedia, not important here). As the purpose is to separate pages by colour, we decide to use the old approach of getting the content (see #Page.get()). import re
import pywikibot
site = pywikibot.Site()
cat = pywikibot.Category(site, 'Kategória:Élő személyek életrajzai')
regex = re.compile(
r'(?i)\[\[kategória:(feltehetően )?élő személyek(\||\]\])')
blues = []
greens = []
reds = []
for talk in cat.members():
page = talk.toggleTalkPage()
try:
if not regex.search(page.get()):
blues.append(page)
except pywikibot.exceptions.NoPageError:
reds.append(page)
except pywikibot.exceptions.IsRedirectPageError:
greens.append(page)
Note that running into errors on purpose and preferring them to preliminary checks is a common Python idiom ("easier to ask forgiveness than permission").

Notes
Working with users
https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.User
User is a subclass of Page. Therefore user.exists() means that the user page exists. To determine whether the user exists, use user.isRegistered(). These are independent: either may be true without the other.
Last edit of an anon
See also the Revisions section.
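A minimal, hedged sketch (the IP address is just a placeholder; last_edit is a property returning a (page, revid, timestamp, comment) tuple, or None if the user has no edits):
import pywikibot

site = pywikibot.Site()
user = pywikibot.User(site, '127.0.0.1')
print(user.isRegistered())   # False for an IP address
print(user.isAnonymous())    # True for an IP address
last = user.last_edit
if last:
    page, revid, timestamp, comment = last
    print(f'Last edited {page.title()} at {timestamp}')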
Working with categories
Task: hu:Kategória:A Naprendszer kisbolygói (minor planets of the Solar System) has several subcategories (one level is enough) with redundantly categorized articles. Let's remove the parent category from the articles in the subcategories, EXCEPT for stubs!
Chances are that it could be solved by category.py after reading the documentation carefully, but for me this time it was faster to hack:
sum = 'Redundáns kategória ki, ld. [[Wikipédia-vita:Csillagászati műhely#Redundáns kategorizálás]]'
cat = pywikibot.Category(site, 'Kategória:A Naprendszer kisbolygói')
for subcat in cat.subcategories():
if subcat.title(with_ns=False) == 'Csonkok (kisbolygó)': # Stubs category
continue
for page in subcat.articles():
page.change_category(cat, None, summary=sum)
Creating and reading lists
Creating a list of pages is a frequent task. For example:
A list may be saved to a file or to a wikipage. listpages.py does something like this, but the input is restricted to built-in page generators and the output has a lot of options. If you write your own script, you may want a simple solution in place. Suppose that you have any iterable of Page objects (list, tuple or generator) called pages. Something like this: '\n'.join(['* ' + page.title(as_link=True) for page in pages])
will give an appropriate list that is suitable both for wikipage and file. It looks like this: * [[Article1]]
* [[Article2]]
* [[Article3]]
* [[Article4]]
On Windows sometimes you get a UnicodeEncodeError when writing the file with a plain open(); in this case use the codecs module (or the encoding argument of open()): import codecs
with codecs.open('myfile.txt', 'w', 'utf-8') as file:
file.write(text)
Of course, imports should be at the top of your script, this is just a sample. While a file does not require the linked form, it is useful to keep both in the same form so that a list can be copied from a file to a wikipage at any time. To retrieve your list from the page [[Special:MyPage/Mylist]] use: from pywikibot.pagegenerators import LinkedPageGenerator
listpage = pywikibot.Page(site, 'Special:MyPage/Mylist')
pages = list(LinkedPageGenerator(listpage))
If you want to read the pages from the file to a list, do: # Use with codecs.open('myfile.txt', 'r', 'utf-8') as file: if you saved with codecs.
with open('myfile.txt') as file:
text = file.read()
import re
pattern = re.compile(r'\[\[(.*?)\]\]')
pages = [pywikibot.Page(site, m) for m in pattern.findall(text)]
If you are not familiar with regular expressions, just copy it, it will work. :-)

Where to save the files?
A possible solution is to create a directory directly under the Pywikibot root (in the listing below it is called t): 2023.03.01. 20:35 <DIR> .
2023.03.01. 20:35 <DIR> ..
2022.11.21. 01:09 <DIR> .git
2022.10.12. 07:16 <DIR> .github
2023.01.30. 13:38 <DIR> .svn
2023.03.04. 20:07 <DIR> __pycache__
2022.10.12. 07:35 <DIR> _distutils_hack
2023.03.03. 14:32 <DIR> apicache-py3
2023.01.30. 14:53 <DIR> docs
2023.02.19. 21:28 <DIR> logs
2022.10.12. 07:35 <DIR> mwparserfromhell
2022.10.12. 07:35 <DIR> pkg_resources
2023.01.30. 13:35 <DIR> pywikibot
2023.01.30. 22:44 <DIR> scripts
2022.10.12. 07:35 <DIR> setuptools
2023.02.24. 11:08 <DIR> t
2023.02.14. 12:15 <DIR> tests
Now, instead of 'myfile.txt', write 't/myfile.txt' (or your own directory name) in your scripts.
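A tiny, hedged sketch (assuming the directory created above is called t and text holds the list built earlier):
with open('t/myfile.txt', 'w', encoding='utf-8') as file:
    file.write(text)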
Working with your watchlist
We have a watchlist.py among the scripts which deals with the watchlist of the bot. This does not sound too exciting. But if you have several thousand pages on your watchlist, handling it by bot may sound good. To do this you have to run the bot with your own username rather than that of the bot. Either you overwrite it in user-config.py or you save the commands in a script and run python pwb.py -user:MyUserAccount myscript

Loading your watchlist
The first tool we need is site.watched_pages(). Print the number of watched pages: print(len(list(site.watched_pages())))
This may take a while as it goes over all the pages. For me it is 18235. Hmm, it's odd, in both senses. :-) How can a doubled number be odd? (Pages and their talk pages are watched together, so the count should be even.) This reveals a technical error from decades ago (see the InvalidTitleError below). If you don't have such a problem, then watchlist = [page for page in site.watched_pages() if not page.isTalkPage()]
print(len(watchlist))
will do both tasks: turn the generator into a real list and throw away talk pages, as dealing with them separately is senseless. In my case, however, this raises pywikibot.exceptions.InvalidTitleError: The (non-)talk page of 'Vita:WP:AZ' is a valid title in another namespace. So I have to write a loop for the same purpose: watchlist = []
for page in site.watched_pages():
try:
if not page.isTalkPage():
watchlist.append(page)
except pywikibot.exceptions.InvalidTitleError:
pass
print(len(watchlist))
Well, the previous one looked nicer. For the number I get 9109, which is not exactly half of the previous value.

Basic statistics
In one way or another, we have at last a list with our watched pages. The first task is to create some statistics. I wonder how many pages are on my list by namespace. I wish I had the data in SQLite, but I don't. So a possible solution: # Create a sorted list of unique namespace numbers in your watchlist
ns_numbers = sorted(list(set([page.namespace().id for page in watchlist])))
# Create a dictionary of them with a default 0 value
stat = dict.fromkeys(ns_numbers, 0)
for page in watchlist:
stat[page.namespace().id] += 1
print(stat)
{0: 1871, 2: 4298, 4: 1803, 6: 98, 8: 96, 10: 391, 12: 3, 14: 519, 90: 15, 100: 10, 828: 5}
There is another way if we steal the technique from the collections module: from collections import Counter
stat = Counter(p.namespace().id for p in watchlist)
print(stat)
Counter({2: 4298, 0: 1871, 4: 1803, 14: 519, 10: 391, 6: 98, 8: 96, 90: 15, 100: 10, 828: 5, 12: 3})
This is a subclass of dict, so it may be used as a dict. The difference compared to the previous solution is that a Counter is ordered by decreasing frequency rather than by namespace number.

Selecting anon pages and unwatching according to a pattern
The above statistics show that almost half of my watchlist consists of user pages, because I patrol recent changes, welcome people and warn them if necessary. And it is necessary often. Now I focus on anons: from pywikibot.tools import is_ip_address
anons = [page for page in watchlist
if page.namespace().id == 2 and is_ip_address(page.title(with_ns=False))]
I could use the User instance's own method to determine if they are anons without importing anything, but for that I would have to convert the pages to Users: anons = [page for page in watchlist
if page.namespace().id == 2
and pywikibot.User(page).isAnonymous()]
Anyway, to unwatch the pages of IPv4 addresses starting with a given prefix: for page in anons:
if page.title(with_ns=False).startswith('195.199'):
print(page.watch(unwatch=True))
With print() you also see the return value, which is True on success.

Watching and unwatching a list of pages
So far we dealt with pages one by one, using the watch() method of Page. There is also a watch() method of the Site object. Even more exciting, this method can handle complete lists at once, and even better, the list items may be strings – this means you don't have to create Page objects of them, just provide titles. Furthermore it supports other sequence types like a generator function, so page generators may be used directly. To watch a lot of pages if you have the titles, just do this: titles = [] # Write titles here or create a list by any method
site.watch(titles)
To unwatch a lot of pages if you already have Page objects: pages = [] # Somehow create a list of pages
site.watch(pages, unwatch=True)
For the use of page generators see the second example under #Red pages.

Further ideas for existing pages
With Pywikibot you may watch or unwatch any quantity of pages easily if you can create a list or generator for them. Let your brain storm! Some patterns:
API:Watch shows that the MediaWiki API may have further parameters such as expiry and built-in page generators. At the time of writing this article Pywikibot does not support them yet. Please hold on.

Red pages
Non-existing pages differ from existing ones in that we have to know the exact titles in advance to watch them. Watch the yearly death articles in English Wikipedia for the next decade so that you see when they are created: for i in range(2023, 2033):
pywikibot.Page(site, f'Deaths in {i}').watch()
hu:Wikipédia:Érdekességek has "Did you know?" subpages by the two hundreds. It has other subpages too, and you want to watch all these tables up to 10000, half of which are blue and half red. So follow the name pattern: prefix = 'Wikipédia:Érdekességek/'
for i in range(1, 10000, 200):
pywikibot.Page(site, f'{prefix}{i}–{i + 199}').watch()
While English Wikipedia tends to list only existing articles, in other Wikipedias list articles are meant to show all the relevant titles, either blue or red. So the example is from Hungarian Wikipedia. Let's suppose you are interested in the history of Umayyad rulers. hu:Omajjád uralkodók listája lists them, but the articles of the Córdoba branch are not yet written. You want to watch all of them and know when a new article is created. You notice that portals are also linked from the page, but you want to watch only the articles, so you use a wrapper generator to filter the links. from pywikibot.pagegenerators import \
LinkedPageGenerator, NamespaceFilterPageGenerator
basepage = pywikibot.Page(site, 'Omajjád uralkodók listája')
site.watch(NamespaceFilterPageGenerator(LinkedPageGenerator(basepage), 0))
List of ancient Greek rulers differs from the previous: many year numbers are linked which are not to be watched. You exclude them by title pattern. basepage = pywikibot.Page(site, 'Ókori görög uralkodók listája')
pages = [page for page in
NamespaceFilterPageGenerator(LinkedPageGenerator(basepage), 0)
if not page.title().startswith('Kr. e')]
site.watch(pages)
Or just to watch the red pages in the list: basepage = pywikibot.Page(site, 'Ókori görög uralkodók listája')
pages = [page for page in LinkedPageGenerator(basepage) if not page.exists()]
site.watch(pages)
In the first two examples we used standalone pages in a loop, then a page generator, then lists. They all work.

Summary
Working with dumps
Working with logs
Working with Wikidata
Using predefined bots as parent classes
See https://doc.wikimedia.org/pywikibot/master/library_usage.html.
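A minimal, hedged sketch of subclassing a predefined bot class (the replacement text, the summary and the page set are placeholders; ExistingPageBot with its treat_page()/put_current() interface comes from pywikibot.bot):
import pywikibot
from pywikibot import pagegenerators
from pywikibot.bot import ExistingPageBot

class CleanupBot(ExistingPageBot):
    """Replace an outdated name on every existing page of the generator."""

    summary = 'Bot: Pywikipedia -> Pywikibot'

    def treat_page(self) -> None:
        """Process one page; self.current_page is set by the framework."""
        new_text = self.current_page.text.replace('Pywikipedia', 'Pywikibot')
        self.put_current(new_text, summary=self.summary)

def main() -> None:
    site = pywikibot.Site()
    gen = pagegenerators.CategorizedPageGenerator(
        pywikibot.Category(site, 'Pywikibot'))  # assumed category
    CleanupBot(generator=gen).run()

if __name__ == '__main__':
    main()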
Working with textlib
Example: https://www.mediawiki.org/wiki/Manual:Pywikibot/Cookbook/Creating_pages_based_on_a_pattern
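Two frequently used textlib helpers are replaceExcept(), which replaces text while skipping comments, nowiki, pre and similar parts, and extract_sections(), which splits wikitext into lead, sections and footer. A minimal, hedged sketch (the page title and the replacement are placeholders):
import pywikibot
from pywikibot import textlib

site = pywikibot.Site()
page = pywikibot.Page(site, 'Budapest')
text = page.text

# Replace, but not inside comments, nowiki or pre blocks
new_text = textlib.replaceExcept(
    text, 'Pywikipedia', 'Pywikibot',
    exceptions=['comment', 'nowiki', 'pre'], site=site)

# Split the wikitext into header, sections and footer
parts = textlib.extract_sections(text, site)
print(parts.header[:80])
for section in parts.sections:
    print(section[0].strip())   # the section heading, e.g. '== History =='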
Creating pages based on a pattern
Pywikibot is your friend when you want to create a lot of pages that follow some pattern. In the first task we create more than 250 pages in a loop. Then we go on to categories. We prepare a lot of them, but in one run we create only as many as we want to fill with articles, in order to avoid a lot of empty categories.

Rules of orthography
Rules of Hungarian orthography have 300 points, several of which have a lot of subpoints marked with letters. There is no letter a without b, and the last letter is l. We have templates pointing to these on an outer source. Templates cannot be used in an edit summary, but inner links can, so we create a lot of pages with short inner links that hold these templates. Of course, the bigger part is bot work, but first we have to list the letters. Each letter from b to l gets a list with the numbers of the points of which this is the last letter (the lists b–l in the code). For example, 120 is in the list of e, so we create the pages for the 120, 120 a) ... 120 e) points. The idea is to build a title generator (the gen() function below). (It also could be a page generator, but the title was more comfortable.) The result is at hu:Wikipédia:AKH. import pywikibot as p
site = p.Site()
mainpage = p.Page(site, 'WP:AKH')
b = [4, 7, 25, 27, 103, 104, 108, 176, 177, 200, 216, 230, 232, 261, 277, 285, 286, 288, 289, 291]
c = [88, 101, 102, 141, 152, 160, 174, 175, 189, 202, 250, 257, 264, 267, 279, 297]
d = [2, 155, 188, 217, 241, 244, 259, 265,]
e = [14, 120, 249,]
f = [82, 195, 248,]
g = [226,]
i = [263,]
l = [240,]
def gen():
for j in range(1, 301):
yield str(j)
if j in b + c + d + e + f + g + i + l:
yield str(j) + ' a'
yield str(j) + ' b'
if j in c + d + e + f + g + i + l:
yield str(j) + ' c'
if j in d + e + f + g + i + l:
yield str(j) + ' d'
if j in e + f + g + i + l:
yield str(j) + ' e'
if j in f + g + i + l:
yield str(j) + ' f'
if j in g + i + l:
yield str(j) + ' g'
if j in i + l:
yield str(j) + ' h'
yield str(j) + ' i'
if j in l:
yield str(j) + ' j'
yield str(j) + ' k'
yield str(j) + ' l'
maintext = ''
summary = 'A szerkesztési összefoglalókban használható hivatkozások létrehozása a helyesírási szabályzat pontjaira'
for s in gen():
print(s)
title = 'WP:AKH' + s.replace(' ', '')
li = s.split(' ')
try:
s1 = li[0] + '|' + li[1]
s2 = li[0] + '. ' + li[1] + ')'
except IndexError:
s1 = li[0]
s2 = li[0] + '.'
templ = '{{akh|' + s1 + '}}\n\n'
print(title, s1, s2, templ)
maintext += f'[[{title}]] '
page = p.Page(site, title)
print(page)
text = templ
text += f'Ez az oldal hivatkozást tartalmaz [[A magyar helyesírás szabályai]] 12. kiadásának {s2} pontjára. A szerkesztési összefoglalókban '
text += f'<nowiki>[[{title}]]</nowiki> címmel hivatkozhatsz rá, így a laptörténetekből is el lehet jutni a szabályponthoz.\n\n'
text += 'Az összes hivatkozás listája a [[WP:AKH]] lapon látható.\n\n[[Kategória:Hivatkozások a helyesírási szabályzat pontjaira]]\n'
print(text)
page.put(text, summary)
maintext += '\n\n[[Kategória:Hivatkozások a helyesírási szabályzat pontjaira| ]]'
print(maintext)
mainpage.put(maintext, summary)
Categories of notable pupils and teachers
We want to create categories for famous pupils and teachers of Budapest schools based on a pattern. Of course, this is not relevant for every school; first we want to see which articles have a "famous pupils" or "famous teachers" section, which may occur in several forms, so the best thing is to review it by eye. We also check whether the section contains enough notable people to have a category. In this task we don't bother creating Wikidata items; these categories are huwiki-specific, and creating items in Wikidata by bot needs an approval.
>>> import pywikibot
>>> from pywikibot.textlib import extract_sections
>>> site = pywikibot.Site()
>>> cat = pywikibot.Category(site, 'Budapest középiskolái')
>>> text = ''
>>> for page in cat.articles():
... text += '\n;' + page.title(as_link=True) + '\n'
... sections = [sec[0] for sec in extract_sections(page.text, site).sections]
... for sect in sections:
... text += ':' + sect.replace('=', '').strip() + '\n'
...
>>> pywikibot.Page(site, 'user:BinBot/try').put(text, 'Listing schools of Budapest')
import re
import pywikibot
site = pywikibot.Site()
base = pywikibot.Page(site, 'user:BinBot/try')
regex = re.compile(r';\[\[(.*?)\]\]:(pt?):(.*?):(.*?)\n')
main = '[[Kategória:Budapesti iskolák tanárai és diákjai iskola szerint|{k}]]\n'
comment = 'Budapesti iskolák diákjainak, tanárainak kategóriái'
cattext = \
'Ez a kategória a{prefix} [[{school}]] és jogelődjeinek {member} tartalmazza.'
for m in regex.findall(base.text):
cat = pywikibot.Category(site, m[2] + ' tanárai és diákjai')
if not cat.exists():
cat.put(main.format(k=m[3]), comment, minor=False, botflag=False)
prefix = 'z' if m[0][0] in 'EÓÚ' else '' # Some Hungarian grammar stuff
# Pupils
catp = pywikibot.Category(site, m[2] + ' diákjai')
if not catp.exists():
text = cattext.format(prefix=prefix, school=m[0], member='diákjait')
text += f'\n[[{cat.title()}|D]]\n'
catp.put(text, comment, minor=False, botflag=False)
if not 't' in m[1]:
continue
# Teachers
catt = pywikibot.Category(site, m[2] + ' tanárai')
if not catt.exists():
text = cattext.format(prefix=prefix, school=m[0], member='tanárait')
text += f'\n[[{cat.title()}|T]]\n'
catt.put(text, comment, minor=False, botflag=False)