Manual:Pywikibot/Cookbook/Getting a single page
Creating a Page object from title
In the rest of this cookbook, unless otherwise stated, we always assume that you have already run these two basic statements:
import pywikibot
site = pywikibot.Site()
You want to get the article about Budapest in your wiki. As long as it is in the article namespace, this is as simple as
page = pywikibot.Page(site, 'Budapest')
Note that Python is case sensitive, and in its world Site and Page mean classes,[1] Site() and Page() class instances, while lowercase site and page should be variables.
For such simple experiments the interactive Python shell is convenient, as you can easily see the results without using print(), saving and running your code.
>>> page = pywikibot.Page(site, 'Budapest')
>>> page
Page('Budapest')
>>> type(page)
<class 'pywikibot.page._page.Page'>
Getting the type of an object is often useful when you want to discover the capabilities of Pywikibot. The output may look strange, but the main thing is that you got a Page. pywikibot.page._page.Page shows the path to it: you may find the Page class in Pywikibot/pywikibot/page/_page.py.
Now let's see the user page of your bot. Either you prefix the title with the namespace ('User' and other English names work everywhere, while the localized names work only in your own wiki) or you give the namespace number as the third argument. So
>>> title = site.username()
>>> page = pywikibot.Page(site, 'User:' + title)
>>> page
Page('Szerkesztő:BinBot')
and
>>> title = site.username()
>>> page = pywikibot.Page(site, title, 2)
>>> page
Page('Szerkesztő:BinBot')
will give the same result. 'Szerkesztő' is the localized version of 'User' in Hungarian. Pywikibot does not preserve the English namespace name I used in my command; the result is always localized.
Getting the title of the page
On the other hand, if you already have a Page object and you need its title as a string, the title() method will do the job:
>>> page = pywikibot.Page(site, 'Budapest')
>>> page.title()
'Budapest'
Possible errors
While getting pages causes far fewer errors than saving them, a few types are worth mentioning. Some of them are technical, while others are possible contradictions between our expectations and reality. Let's talk about them before actually getting the page.
- The page does not exist.
- The page is a redirect.
- You may have been misled regarding the content in some namespaces. If your page is in the Category namespace, the content is the descriptor page. If it is in the User namespace, the content is the user page. The trickiest is the File namespace: the content is the file descriptor page, not the file itself; however, if the file comes from Commons, the page may not exist in your wiki at all, while you still see the picture.
- The expected piece of text is not in the page content because it is transcluded from a template. You see the text on the page, but cannot replace it directly by bot.
- Sometimes badly formed code may work well. For example [[Category:Foo  bar]] with two spaces will behave as [[Category:Foo bar]]. While the page is in the category and you will get it from a page generator (see below), you won't find the desired string in it.
- And, unfortunately, Wikipedia servers sometimes face errors. If you get a 500 error, go and read a book until the server comes back.
- InvalidTitleError is raised in very rare cases. A possible reason is that you wanted to get a page title that contains illegal characters.
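The double-space pitfall from the list above can be worked around by searching with a whitespace-tolerant pattern instead of an exact string. The following is only a sketch: category_link_pattern is a hypothetical helper, it assumes that runs of spaces and underscores in a title are equivalent to a single space, and it knows only the English namespace name, so localized names would have to be added for your wiki.

```python
import re


def category_link_pattern(category_name):
    """Build a regex for [[Category:<name>]] that tolerates extra whitespace.

    Hypothetical helper, not part of Pywikibot. Assumes that runs of spaces
    and underscores inside the title count as one space, and handles only
    the English namespace name 'Category'.
    """
    # Split the name into words, then allow any run of spaces or
    # underscores between neighbouring words.
    words = re.split(r'[ _]+', category_name.strip())
    flexible_name = r'[ _]+'.join(re.escape(word) for word in words)
    # The namespace name is matched case-insensitively; an optional sort key
    # after a pipe is accepted as well.
    return re.compile(
        r'\[\[\s*(?i:Category)\s*:\s*' + flexible_name
        + r'\s*(?:\|[^\]]*)?\]\]'
    )


pattern = category_link_pattern('Foo bar')
wikitext = 'Some text.\n[[Category:Foo  bar]]\n'

# The plain substring test misses the two-space variant, the pattern finds it.
print('[[Category:Foo bar]]' in wikitext)
print(bool(pattern.search(wikitext)))
```

In a bot you would apply the compiled pattern to page.text, for example with pattern.sub(), rather than calling str.replace() with a fixed string.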
Getting the content of the page
Important: at this point we know nothing about the existence of the page. We have not contacted the live wiki yet. We just created an object. It is just like a street number: you may write it on a document, whether or not there is a house at that address.
There are two main approaches to getting the content. It is important to understand the difference.
Page.text
You may notice that text does not have parentheses. Looking into the code we discover that it is not a method but a property. This means text is ready to use without calling it, may be assigned a value, and is what gets written when you save the page.
>>> page = pywikibot.Page(site, 'Budapest')
>>> page.text
will write the whole text on your screen. Of course, this is only for experimenting.
You may write
text = page.text
if you need a copy of the text, but usually this is unnecessary. Page.text is not a method, so referring to it several times does not slow down your bot. Just manipulate page.text or assign it a new value, then save.
If you want to know details on how a property works, search for "Python decorators". For using it in your scripts it is enough to know the behaviour. Click on the above link and go through the right-hand menu. You will find some other properties without parentheses.
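To make the behaviour described above more tangible, here is a toy class with a lazily loaded property. It is not Pywikibot code, and Pywikibot's real implementation differs in detail; LazyDocument and its loader argument are invented for this illustration. It only shows why a property needs no parentheses, why repeated reads cost nothing extra, and how assignment works.

```python
class LazyDocument:
    """Toy class (not Pywikibot code) with a lazily loaded text property."""

    def __init__(self, loader):
        self._loader = loader   # callable standing in for a slow fetch
        self._text = None       # nothing has been loaded yet
        self.load_count = 0     # how many times the loader actually ran

    @property
    def text(self):
        # Load only on first access; later reads reuse the stored value.
        if self._text is None:
            self._text = self._loader()
            self.load_count += 1
        return self._text

    @text.setter
    def text(self, value):
        # Assignment replaces the stored value without calling the loader.
        self._text = value


doc = LazyDocument(lambda: 'original content')
first = doc.text            # triggers the loader
second = doc.text           # served from the stored value
doc.text = 'edited content'
print(first, '|', doc.text, '|', doc.load_count)
```

The loader runs once no matter how often text is read, which is the reason why referring to a property several times does not slow a script down.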
Page.text will never raise an error. If the page is a redirect, you will get the redirect link instead of the content of the target page. If the page does not exist, you will get an empty string, which is just what happens if the page does exist but is empty (this is common on talk pages). Try this:
>>> page = pywikibot.Page(site, 'Arghhhxqrwl!!!')
>>> if page.text:
...     print('Got it!')
... else:
...     print(f'Page {page.title()} does not exist or has no content.')
...
Page Arghhhxqrwl!!! does not exist or has no content.
Page.text is convenient if you don't have to deal with the existence of the page; otherwise it is your responsibility to tell the two cases apart. An easy way is Page.exists().
>>> page = pywikibot.Page(site, 'Arghhhxqrwl!!!')
>>> if page.text:
...     print(len(page.text))
... else:
...     print(page.exists())
...
False
While creating the Page object does not contact the live wiki, referring to text for the first time and calling Page.exists() usually do. For many pages this will take a while. If it is too slow for you, go to the Working with dumps section. page.has_content() shows whether retrieval is necessary; if it returns True, the bot will not retrieve the page again. Consequently, it also returns True for non-existing pages, as it is pointless to reload them. Although this is a public method, you are unlikely to have to use it directly.
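Since an empty text is ambiguous, the checks from this section can be combined into one small function. This is a sketch rather than a recommended pattern: describe_page is a hypothetical helper that relies only on exists(), isRedirectPage() and text, and FakePage is a stand-in so that the example runs without a wiki connection. With a real bot you would pass a pywikibot.Page instead.

```python
def describe_page(page):
    """Return 'missing', 'redirect', 'empty' or 'content' for a page.

    Hypothetical helper. It expects an object offering exists(),
    isRedirectPage() and a text attribute, as pywikibot.Page does.
    """
    if not page.exists():
        return 'missing'
    if page.isRedirectPage():
        return 'redirect'
    if not page.text.strip():
        return 'empty'
    return 'content'


class FakePage:
    """Stand-in for pywikibot.Page so the sketch runs offline."""

    def __init__(self, text='', exists=True, redirect=False):
        self.text = text
        self._exists = exists
        self._redirect = redirect

    def exists(self):
        return self._exists

    def isRedirectPage(self):
        return self._redirect


samples = {
    'no such page': FakePage(exists=False),
    'blank talk page': FakePage(text=''),
    'redirect': FakePage(text='#REDIRECT [[Target]]', redirect=True),
    'article': FakePage(text='Budapest is the capital of Hungary.'),
}
for label, sample in samples.items():
    print(label, '->', describe_page(sample))
```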
Page.get()
The traditional way is page.get(), which forces you to handle the errors. In this case we store the value in a variable.
>>> page = pywikibot.Page(site, 'Budapest')
>>> text = page.get()
>>> len(text)
165375
A non-existing page causes a NoPageError:
>>> page = pywikibot.Page(site, 'Arghhhxqrwl!!!')
>>> text = page.get()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Pywikibot\pywikibot\page\_page.py", line 397, in get
    self._getInternals()
  File "c:\Pywikibot\pywikibot\page\_page.py", line 436, in _getInternals
    self.site.loadrevisions(self, content=True)
  File "c:\Pywikibot\pywikibot\site\_generators.py", line 772, in loadrevisions
    raise NoPageError(page)
pywikibot.exceptions.NoPageError: Page [[hu:Arghhhxqrwl!!!]] doesn't exist.
A redirect page causes an IsRedirectPageError:
>>> page = pywikibot.Page(site, 'Time to Shine')
>>> text = page.get()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Pywikibot\pywikibot\page\_page.py", line 397, in get
    self._getInternals()
  File "c:\Pywikibot\pywikibot\page\_page.py", line 444, in _getInternals
    raise self._getexception
pywikibot.exceptions.IsRedirectPageError: Page [[hu:Time to Shine]] is a redirect page.
If you don't want to handle redirects and only need to distinguish existing pages from non-existing ones, get_redirect will make its behaviour more similar to that of text:
>>> page = pywikibot.Page(site, 'Time to Shine')
>>> page.get(get_redirect=True)
'#ÁTIRÁNYÍTÁS [[Time to Shine (egyértelműsítő lap)]]'
Here is a piece of code to handle the cases. It is already too long for the interactive prompt, so I saved it to a file.
for title in ['Budapest', 'Arghhhxqrwl!!!', 'Time to Shine']:
    page = pywikibot.Page(site, title)
    try:
        text = page.get()
        print(f'Length of {page.title()} is {len(text)} bytes.')
    except pywikibot.exceptions.NoPageError:
        print(f'{page.title()} does not exist.')
    except pywikibot.exceptions.IsRedirectPageError:
        print(f'{page.title()} redirects to {page.getRedirectTarget()}.')
        print(type(page.getRedirectTarget()))
Which results in:
Length of Budapest is 165375 bytes.
Arghhhxqrwl!!! does not exist.
Time to Shine redirects to [[hu:Time to Shine (egyértelműsítő lap)]].
<class 'pywikibot.page._page.Page'>
While Page.text is simple, it gives only the text of the redirect page. With getRedirectTarget() we got another Page instance without parsing the text. Of course, the target page may also not exist or be another redirect. Scripts/redirect.py gives a deeper insight.
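Because a redirect target may itself be missing or be another redirect, a script that needs the final page has to follow the chain with some care. Below is a minimal sketch of that idea. final_target and its max_hops parameter are hypothetical, FakePage is a stand-in that reads from a dictionary so the example runs offline, and the sketch does not catch the exceptions that Pywikibot itself may raise for some broken redirects.

```python
def final_target(page, max_hops=5):
    """Follow redirects from page and return the final page, or None.

    Hypothetical helper. None is returned when a page in the chain does not
    exist, when the chain loops, or when it needs more than max_hops hops.
    """
    seen = set()
    for _ in range(max_hops + 1):
        if not page.exists():
            return None
        if not page.isRedirectPage():
            return page
        if page.title() in seen:
            return None  # we have been here before: circular redirect
        seen.add(page.title())
        page = page.getRedirectTarget()
    return None


class FakePage:
    """Stand-in for pywikibot.Page backed by a dict instead of a wiki."""

    def __init__(self, title, wiki):
        self._title = title
        self._wiki = wiki  # maps title -> text, or ('redirect', target)

    def title(self):
        return self._title

    def exists(self):
        return self._title in self._wiki

    def isRedirectPage(self):
        return isinstance(self._wiki.get(self._title), tuple)

    def getRedirectTarget(self):
        return FakePage(self._wiki[self._title][1], self._wiki)


wiki = {
    'A': ('redirect', 'B'),
    'B': ('redirect', 'C'),
    'C': 'Some article text.',
    'X': ('redirect', 'Y'),
    'Y': ('redirect', 'X'),
    'D': ('redirect', 'Gone'),
}
for start in ['A', 'X', 'D']:
    target = final_target(FakePage(start, wiki))
    print(start, '->', target.title() if target else None)
```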
For a practical application see Content pages and talk pages.
Reloading
If your bot runs slowly and you doubt that the page text is still current, use get(force=True). The experiment shows that it does not update page.text, which is good on the one hand, as you don't lose your data, but on the other hand you need to be aware of it:
>>> import pywikibot as p
>>> site = p.Site()
>>> page = p.Page(site, 'Kisbolygók listája (1–1000)')
>>> page.text
'[[#1|1–500.]] • [[#501|501–1000.]]\n\n{{:Kisbolygók listája (1–500)}}\n{{:Kisbolygók listája (501–1000)}}\n\n[[Kategória:A Naprendszer kisbolygóinak listája]]'
>>> page.text = 'Luke, I am your father!'
>>> page.text
'Luke, I am your father!'
>>> page.get(force=True)
'[[#1|1–500.]] • [[#501|501–1000.]]\n\n{{:Kisbolygók listája (1–500)}}\n{{:Kisbolygók listája (501–1000)}}\n\n[[Kategória:A Naprendszer kisbolygóinak listája]]'
>>> page.text
'Luke, I am your father!'
>>>
>>> page.text = page.get()
>>> page.text
'[[#1|1–500.]] • [[#501|501–1000.]]\n\n{{:Kisbolygók listája (1–500)}}\n{{:Kisbolygók listája (501–1000)}}\n\n[[Kategória:A Naprendszer kisbolygóinak listája]]'
Page.exists() currently does not reflect a forced reload, see phab:T330980.
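If you often need a fresh copy, the two steps of the experiment can be wrapped in a small function that makes the choice explicit. This is a sketch under stated assumptions: reload_text and its keep_local_changes parameter are hypothetical, and FakePage merely imitates the behaviour observed above (get(force=True) returns the live text but leaves text untouched) so that the example runs offline. With a real page, get() may still raise NoPageError or IsRedirectPageError.

```python
def reload_text(page, keep_local_changes=True):
    """Fetch the latest wikitext with get(force=True) and return it.

    Hypothetical helper. page.text is overwritten with the fresh text only
    when keep_local_changes is False.
    """
    fresh = page.get(force=True)
    if not keep_local_changes:
        page.text = fresh
    return fresh


class FakePage:
    """Stand-in imitating the reload behaviour shown in the experiment."""

    def __init__(self, live_text):
        self.live_text = live_text  # what the "server" holds right now
        self._cached = live_text    # what get() returned last time
        self.text = live_text       # the locally editable copy

    def get(self, force=False):
        if force:
            self._cached = self.live_text
        return self._cached


page = FakePage('old wikitext')
page.text = 'Luke, I am your father!'   # local, unsaved edit
page.live_text = 'new wikitext'         # somebody edited the live page

fresh = reload_text(page)               # local edit is kept
print(fresh, '|', page.text)
reload_text(page, keep_local_changes=False)
print(page.text)
```

Comparing the returned text with page.text before saving is one way to notice that the live page changed while the bot was working.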
Notes
[1] This is not quite true; as we saw earlier, Site is a factory that creates objects. The difference is hidden on purpose because it acts like a class, and Site() will really be an instance.