Manual:Pywikibot/Cookbook/Working with page histories

From mediawiki.org

Revisions

Processing page histories may look daunting because of the amount of data, but it is easy because we have plenty of methods. Some of them extract one particular piece of information, such as a user name, while others return an object called a revision. A revision represents one line of a page history with all its data, which is more than you see in the browser and more than you usually need. Before we look into these methods, let's have a look at revisions. We have to keep in mind that

  • some revisions may be deleted
  • in existing revisions the text, the comment or the contributor's name may be hidden by admins so that non-admins won't see them (this may cause errors that you have to handle)
  • oversighters may hide revisions so deeply that even admins won't see them

Furthermore, bots are not directly marked in page histories. You can see in recent changes whether an edit was made by a bot, because this property is stored in the recent changes table of the database and is available there for a few weeks. If you want to know whether an edit was made by a bot, you may

  • guess it from the bot name and the comment (not yet implemented, but we will try below)
  • trace through a lot of database tables which contributor had a bot flag at the time of the edit, and consider that registered bots can switch off their flag temporarily and admins can revert changes in bot mode (good luck!)
  • retrieve the current bot flag owners from the wiki and suppose that the same users were bots at the time of the edit (that's what Pywikibot does)

API:Revisions gives a deeper insight, while Manual:Slot says something about slots and roles (for most of us this is not too interesting).

Methods returning a single revision also return the content of the page, so it is a good idea to choose a short page for experiments (see Special:ShortPages on your home wiki). By default, revisions() does not retrieve the text unless you ask for it with content=True.

If you want to know what sha1 is good for, see here. Tl;dr: for comparing revisions and recognizing reverts.
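Recognizing reverts by sha1 can be sketched as follows. This is a minimal illustration, not a Pywikibot feature: the helper name find_reverts is made up, and SimpleNamespace objects stand in for real revisions, assuming only that each revision offers revid and sha1 and that the list is ordered oldest first.

```python
from types import SimpleNamespace


def find_reverts(revisions):
    """Yield (revid, restored_revid) pairs for revisions whose content
    hash equals that of an earlier revision.

    `revisions` must be ordered oldest first. Entries whose sha1 is
    None (hidden or suppressed content) are skipped.
    """
    first_seen = {}  # sha1 -> revid of the earliest revision with it
    for rev in revisions:
        if rev.sha1 is None:
            continue
        if rev.sha1 in first_seen:
            yield rev.revid, first_seen[rev.sha1]
        else:
            first_seen[rev.sha1] = rev.revid


# Stand-in data (hypothetical); real input could come from
# page.revisions(reverse=True).
history = [
    SimpleNamespace(revid=1, sha1='aaa'),
    SimpleNamespace(revid=2, sha1='bbb'),  # e.g. vandalism
    SimpleNamespace(revid=3, sha1='aaa'),  # back to the state of revid 1
    SimpleNamespace(revid=4, sha1=None),   # hidden content
]
reverts = list(find_reverts(history))
print(reverts)
```

Comparing hashes is cheaper than comparing texts, because the content of the revisions does not have to be retrieved at all.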

For now I choose a page which is short, but has some page history: hu:8. évezred (8th millennium). Well, we have very little to say about it, and we suffer from a lack of reliable sources. Let's see the last (current) revision!

page = pywikibot.Page(site, '8. évezred')
rev = page.latest_revision  # It's a property, don't use ()!
print(rev)

{'revid': 24120313, 'parentid': 15452110, 'minor': True, 'user': 'Misibacsi', 'userid': 110, 'timestamp': Timestamp(2021, 8, 10, 7, 51), 'size': 341,
'sha1': 'ca17cba3f173188954785bdbda1a9915fb384c82', 'roles': ['main'], 'slots': {'main': {'contentmodel': 'wikitext', 'contentformat': 'text/x-wiki',
'*': "{{évezredek}}\n\nA '''8. évezred''' ciklusa 7001. [[január 1.|január 1-jén]] kezdődik, és 8000. [[december 31.|december 31-éig]] tart.\n\n== Csillagászati előrejelzések ==\n\n* A [[90377 Sedna|Szedna]] [[kisbolygó]] keringése során pályájának legtávolabbi pontjába ([[aphélium]]) ér 7800 körül.
\n\n[[Kategória:Évezredek|08]]"}}, 'comment': '/* Csillagászati előrejelzések */ link', 'parsedcomment': '<span dir="auto"><span class="autocomment"><a href="/wiki/8._%C3%A9vezred#Csillagászati_előrejelzések" title="8. évezred">→\u200eCsillagászati előrejelzések</a>: </span> link</span>', 'tags': []
, 'anon': False, 'userhidden': False, 'commenthidden': False, 'text': "{{évezredek}}\n\nA '''8. évezred''' ciklusa 7001. [[január 1.|január 1-jén]] kezdődik, és 8000. [[december 31.|december 31-éig]] tart.\n\n== Csillagászati előrejelzések ==\n\n* A [[90377 Sedna|Szedna]] [[kisbolygó]] keringése során pályájának legtávolabbi pontjába ([[aphélium]]) ér 7800 körül.\n\n[[Kategória:Évezredek|08]]", 'contentmodel': 'wikitext'}

As we look into the code, we don't get much information about how to print it more readably, but we notice that Revision is a subclass of Mapping, which is described here. So we can try items():

for item in rev.items():
    print(item)

('revid', 24120313)
('parentid', 15452110)
('minor', True)
('user', 'Misibacsi')
('userid', 110)
('timestamp', Timestamp(2021, 8, 10, 7, 51))
('size', 341)
('sha1', 'ca17cba3f173188954785bdbda1a9915fb384c82')
('roles', ['main'])
('slots', {'main': {'contentmodel': 'wikitext', 'contentformat': 'text/x-wiki', '*': "{{évezredek}}\n\nA '''8. évezred''' ciklusa 7001. [[január 1.|január 1-jén]] kezdődik, és 8000. [[december 31.|december 31-éig]] tart.\n\n== Csillagászati előrejelzések ==\n\n* A [[90377 Sedna|Szedna]] [[kisbolygó]] keringése során pályájának legtávolabbi pontjába ([[aphélium]]) ér 7800 körül.\n\n[[Kategória:Évezredek|08]]"}})
('comment', '/* Csillagászati előrejelzések */ link')
('parsedcomment', '<span dir="auto"><span class="autocomment"><a href="/wiki/8._%C3%A9vezred#Csillagászati_előrejelzések" title="8. évezred">→\u200eCsillagászati előrejelzések</a>: </span> link</span>')
('tags', [])
('anon', False)
('userhidden', False)
('commenthidden', False)
('text', "{{évezredek}}\n\nA '''8. évezred''' ciklusa 7001. [[január 1.|január 1-jén]] kezdődik, és 8000. [[december 31.|december 31-éig]] tart.\n\n== Csillagászati előrejelzések ==\n\n* A [[90377 Sedna|Szedna]] [[kisbolygó]] keringése során pályájának legtávolabbi pontjába ([[aphélium]]) ér 7800 körül.\n\n[[Kategória:Évezredek|08]]")
('contentmodel', 'wikitext')
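For a quick overview it may help to leave out the bulky keys. The following is a minimal sketch that assumes only that the object offers items() like any Mapping; the helper name summarize and the sample dictionary are made up for illustration.

```python
def summarize(rev, skip=('text', 'slots', 'parsedcomment')):
    """Return 'key: value' lines for a mapping, leaving out bulky keys."""
    return [f'{key}: {value!r}' for key, value in rev.items()
            if key not in skip]


# A plain dict stands in for a Revision here; both offer items().
sample = {
    'revid': 24120313,
    'user': 'Misibacsi',
    'size': 341,
    'text': 'some long wikitext',
}
lines = summarize(sample)
print('\n'.join(lines))
```

With a real revision the call would be summarize(rev); the skip tuple can be adjusted to taste.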

A revision may look like a dictionary on the screen, but it is not one. A print(type(item)) in the above loop would show that these tuple-like pairs are real tuples.

Non-existing pages raise NoPageError if you get their revisions.

parentid keeps the oldid of the previous edit and will be zero for a newly created page. For any revision as rev:

>>> 'Modification' if rev.parentid else 'Creation'
'Creation'

Extract data from revision

We don't have to transform a Revision object into a dictionary to use it. The above experiment was just for an overview. Now we know what to look for, and we can get the required data directly. Even better, this structure is more comfortable to use than a common dictionary, because you have two ways to get a value:

print(rev['comment'])
print(rev.comment)

/* Csillagászati előrejelzések */ link
/* Csillagászati előrejelzések */ link

As you see, they are identical. But keep in mind that both solutions may cause problems if some parts of the revision were hidden by an admin. Let's see what happens upon hiding:

Key | Value when not hidden | Value when hidden | Value when hidden (adminbot)
Page content
text | text of the given revision as str (may be '' if empty) | None | None
User
user | user name (str) (not a User object!) | '' (empty string) | same as if it wasn't hidden
userid | user id (int) (0 for anons) | AttributeError | same as if it wasn't hidden
anon | True or False | False | same as if it wasn't hidden
userhidden | False | True | True
Edit summary
comment | human-readable comment (str) | '' (empty string) | same as if it wasn't hidden
parsedcomment | comment suitable for page histories and recent changes (clickable if /* section */ present) (str) | AttributeError | same as if it wasn't hidden
commenthidden | False | True | True

You may say this is not quite consistent, but this is the experimental result. You have to handle hidden properties, and for general-purpose code you should know whether the bot runs as admin. A possible solution:

admin = 'sysop' in pywikibot.User(site, site.user()).groups()

If you are not an admin but need admin rights for testing, you may get one on https://test.wikipedia.org.

For example

print(rev.commenthidden or rev.parsedcomment)

will never raise an AttributeError, but is not very useful in most cases. On the other hand,

if rev.text:
    # do something

will do something if the content is not hidden for you and not empty. A false value here means either an empty page or hidden content. If the difference matters to you,

if rev.text is not None:
    # do something

will tell the two cases apart.

An example was found where an oversighter had suppressed the content of the revision. text and sha1 were both None for admin and non-admin bots alike, and an additional suppressed key appeared with the value '' (empty string).
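The experimental results above can be turned into defensive code along these lines. This is only a sketch under the assumption that the values behave as described in the table; the function name describe and the SimpleNamespace stand-ins are hypothetical.

```python
from types import SimpleNamespace


def describe(rev):
    """Build a one-line description that copes with hidden parts."""
    # An empty user name means the name is hidden from us.
    user = rev.user or '(hidden user)'
    if rev.comment:
        comment = rev.comment
    elif rev.commenthidden:
        comment = '(hidden comment)'
    else:
        comment = '(no comment)'
    # getattr() with a default swallows the AttributeError described above.
    userid = getattr(rev, 'userid', None)
    if rev.text is None:
        content = 'content hidden'
    else:
        content = f'{len(rev.text)} characters'
    return f'{user} (id: {userid}): {comment} [{content}]'


# Stand-ins for a fully visible and a fully hidden revision.
visible = SimpleNamespace(user='Misibacsi', userid=110, comment='link',
                          commenthidden=False, text='abc')
hidden = SimpleNamespace(user='', comment='', commenthidden=True, text=None)
print(describe(visible))
print(describe(hidden))
```

Testing the values themselves rather than only the userhidden flag keeps the function usable for admin bots too, which still see the hidden name.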

Is it a bot edit?

Have a look at this page history. It contains a lot of bots, some of which no longer exist or were never registered. Pywikibot has a site.isBot() method which takes a user name (not an object) and checks whether it has a bot flag. This won't detect all these bots. We may improve on it by also considering the user name and the comment. This method is far from certain and may give false positives as well as false negatives, but, as shown in the third column, it gives a better result than site.isBot(), which is in the second column.

def maybebot(rev):
    if site.isBot(rev.user):
        return True
    user = rev.user.lower()
    comment = rev.comment.lower()
    return user.endswith('bot') or user.endswith('script') \
        or comment.startswith('bot:') or comment.startswith('robot:')

page = pywikibot.Page(site, 'Ordovicesek')
for rev in page.revisions():
    print(f'{rev.user:15}\t{site.isBot(rev.user)}\t{maybebot(rev)}\t{rev.comment}')

Addbot          False   True    Bot: 15 interwiki link migrálva a [[d:|Wikidata]] [[d:q768052]] adatába
Hkbot           True    True    Bottal végzett egyértelműsítés: Tacitus > [[Publius Cornelius Tacitus]]
ArthurBot       False   True    r2.6.3) (Bot: következő hozzáadása: [[hr:Ordovici]]
Luckas-bot      False   True    r2.7.1) (Bot: következő hozzáadása: [[eu:Ordoviko]]
Beroesz         False   False
Luckas-bot      False   True    Bot: következő hozzáadása: [[sh:Ordoviki]]
MondalorBot     False   True    Bot: következő hozzáadása: [[id:Ordovices]]
Xqbot           True    True    Bot: következő módosítása: [[cs:Ordovikové]]
ArthurBot       False   True    Bot: következő hozzáadása: [[cs:Ordovicové]]
RibotBOT        False   True    Bot: következő módosítása: [[no:Ordovikere]]
Xqbot           True    True    Bot:  következő hozzáadása: [[br:Ordovices]]; kozmetikai változtatások
Pasztillabot    True    True    Kategóriacsere: [[Kategória:Ókori népek]] -> [[Kategória:Ókori kelta népek]]
SamatBot        True    True    Robot:  következő hozzáadása: [[de:Ordovicer]], [[es:Ordovicos]]
Istvánka        False   False
Istvánka        False   False   +iwiki
Adapa           False   False
Data Destroyer  False   False   Új oldal, tartalma: '''Ordovicesek''', [[ókor]]i népcsoport. [[Britannia]] nyugati partján éltek, szemben [[Anglesea]] szigetével. [[Tacitus]] tesz említést róluk.  ==Források==  {{p...

hu:Kategória:Bottal létrehozott olasz vasútállomás cikkek contains articles created by bot. Here is a page generator that yields pages which were possibly never edited by any human:

cat = pywikibot.Category(site, 'Bottal létrehozott olasz vasútállomás cikkek')
def gen():
    for page in cat.articles():
        if all(maybebot(rev) for rev in page.revisions()):
            yield page

Test with:

for page in gen():
    print(page)

Timestamp

Revision.timestamp is a pywikibot.time.Timestamp object which is well documented here. It is a subclass of datetime.datetime. Most importantly, MediaWiki always stores times in UTC, regardless of your time zone and daylight saving time.

The documentation suggests using Site.server_time() for the current time; it is also a pywikibot.time.Timestamp in UTC.

Elapsed time since last edit:

>>> page = pywikibot.Page(site, 'Ordovicesek')
>>> print(site.server_time() - page.latest_revision.timestamp)
3647 days, 21:09:56

Pretty much, isn't it? :-) The result is a datetime.timedelta object.

In the shell timestamps are human-readable. But when you print them from a script, they get a machine-readable format. If you want to restore the easily readable format, use the repr() function:

>>> page = pywikibot.Page(site, 'Budapest')
>>> rev = page.latest_revision
>>> time = rev.timestamp
>>> time
Timestamp(2023, 2, 26, 9, 4, 14)
>>> print(time)
2023-02-26T09:04:14Z
>>> print(repr(time))
Timestamp(2023, 2, 26, 9, 4, 14)

For the above subtraction print() is nicer, because repr() gives days and seconds, without converting them to hours and minutes.
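If neither form suits you, the days and seconds of a datetime.timedelta can be converted by hand. A minimal sketch for non-negative differences; the helper name format_elapsed is made up, and the timedelta is built directly as a stand-in for the result of the subtraction shown earlier.

```python
from datetime import timedelta


def format_elapsed(delta):
    """Format a non-negative timedelta as 'D days, H hours, M minutes'."""
    # timedelta stores whole days and the remaining seconds separately
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    return f'{delta.days} days, {hours} hours, {minutes} minutes'


# Stand-in for site.server_time() - page.latest_revision.timestamp
elapsed = timedelta(days=3647, hours=21, minutes=9, seconds=56)
print(repr(elapsed))
print(format_elapsed(elapsed))
```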

Useful methods

Methods discussed in this section belong to the BasePage class with one exception, so they may be used for practically any page.

Page history in general

  • BasePage.getVersionHistoryTable() will create a wikitable from the page history. The order may be reversed and the number of rows may be limited. Useful e.g. when the page history becomes unavailable and you want to save it to a talk page.
  • BasePage.contributors() returns a small statistic: contributors with the number of their edits in the form of a dictionary, sorted by decreasing number of edits. Hidden names appear as an empty string for both admin and non-admin bots.
  • BasePage.revisions() will iterate through the revisions of a page, beginning with the latest. As detailed above, this differs from a single revision in that by default it does not retrieve the content of the revision. To get a certain revision, turn the iterator into a list.
Use:
  • reverse=True for beginning from the oldest version
  • content=True for retrieving the page contents
  • total=5 for limiting the iteration to 5 entries
  • starttime= endtime= with a pywikibot.Timestamp() to limit the iteration in time

For example, to get a difflink between the first and the second version of a page without knowing its oldid (works for every language version):

print(f'[[Special:Diff/{list(page.revisions(reverse=True))[1].revid}]]')
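The one-liner above walks through the whole history and fails with an IndexError on pages with a single revision. A more careful variant might look like this; it is a sketch that relies only on the reverse and total parameters listed above, and the function name second_revision_difflink is made up.

```python
def second_revision_difflink(page):
    """Return a wikilink to the diff of the second revision, or None."""
    # total=2 stops the iteration after the two oldest revisions
    oldest_two = list(page.revisions(reverse=True, total=2))
    if len(oldest_two) < 2:
        return None  # the page has a single revision only
    return f'[[Special:Diff/{oldest_two[1].revid}]]'
```

It can be called with any page object, for example second_revision_difflink(pywikibot.Page(site, 'Budapest')).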

And this one is a working piece of code from hu:User:DhanakBot/szubcsonk.py. This bot administers substubs. Before we mark a substub for deletion, we wonder if it has been vandalized. Maybe it was a longer article, but someone truncated it, and a good-faith user marked it as a substub without checking the page history. So the bot places a warning if at any point of its history the page was more than 1000 bytes longer than now, or at least twice as long.

    def shortenedMark(self, page):
        """Mark if it may have been vandalized."""
        versions = list(page.revisions())
        curLength = versions[0]['size']  # revisions() begins with the latest
        sizes = [r['size'] for r in versions]
        maxLength = max(sizes)
        if maxLength >= 2 * curLength or maxLength > curLength + 1000:
            return '[[File:Ambox warning orange.svg|16px|The article became shorter!]]'
        else:
            return ''

  • Site.loadrevisions() may also be interesting; this is the underlying method that is called by revisions(), but it has some extra features. You may specify the user whose contributions you want or don't want to have.

Last version of the page

The last version has received special attention from the developers and is very comfortable to use.

  • BasePage.latest_revision (property): returns the current revision for this page. It's a Revision object as detailed above.

For example, to get a difflink between the newest and the second newest version of a page without knowing its oldid (works for every language version):

print(f'[[Special:Diff/{page.latest_revision.revid}]]')

But some properties are available directly (they are equivalent to retrieving the values from latest_revision):

Oldest version of a page

  • BasePage.oldest_revision (property) is very similar to BasePage.latest_revision, but returns the first version rather than last.

Determine how many times longer the current version is than the first version (beware of division by zero, which is unlikely but possible):

>>> page = pywikibot.Page(site, 'Budapest')
>>> page.latest_revision.size / page.oldest_revision.size
115.17727272727272

>>> page = pywikibot.Page(site, 'Test')
>>> page.put('', 'Test')
Page [[Test]] saved
>>> page.latest_revision.size / page.oldest_revision.size
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero
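A guarded version of the division might look like this. It is a plain-Python sketch; the function name growth_ratio is made up, and the sizes would come from page.latest_revision.size and page.oldest_revision.size.

```python
def growth_ratio(latest_size, oldest_size):
    """Return latest_size / oldest_size, or None if the page began empty."""
    if oldest_size == 0:
        return None
    return latest_size / oldest_size


print(growth_ratio(300, 100))
print(growth_ratio(0, 0))
```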

Determine how many articles each user has created in a given category, not including its subcategories:

>>> import collections
>>> cat = pywikibot.Category(site, 'Budapest parkjai')  # Parks of Budapest
>>> collections.Counter(page.oldest_revision.user for page in cat.articles())

Counter({'OsvátA': 4, 'Antissimo': 3, 'Perfectmiss': 2, 'Szilas': 1, 'Pásztörperc': 1, 'Fgg': 1, 'Solymári': 1, 'Zaza~huwiki': 1, 'Timur lenk': 1, 'Pasztilla (régi)': 1, 'Millisits': 1, 'László Varga': 1, 'Czimmy': 1, 'Barazoli40x40': 1})

(Use cat.articles(recurse=True) if you are interested in subcategories, too, but that will be slightly slower.)

Knowing that Main Page is a valid alias for the main page in every language and family, sort several Wikipedias by creation date:

>>> from pprint import pprint
>>> langs = ('hu', 'en', 'de', 'bg', 'fi', 'fr', 'el')
>>> pprint(sorted([(lang, pywikibot.Page(pywikibot.Site(lang, 'wikipedia'), 'Main Page').oldest_revision.timestamp) for lang in langs], key=lambda tup: tup[1]))
[('en', Timestamp(2002, 1, 26, 15, 28, 12)),
 ('fr', Timestamp(2002, 10, 31, 10, 33, 35)), 
 ('hu', Timestamp(2003, 7, 8, 10, 26, 5)),
 ('fi', Timestamp(2004, 3, 22, 16, 12, 24)),
 ('bg', Timestamp(2004, 10, 25, 18, 49, 23)),
 ('el', Timestamp(2005, 9, 7, 13, 55, 9)),
 ('de', Timestamp(2017, 1, 27, 20, 11, 5))]

For some reason it gives a false date for dewiki where Main Page is redirected to Wikipedia: namespace, but looks nice anyway. :-)

Suppose you want to know when the original creator last edited in your wiki. In some cases the question is whether it's worth contacting him/her. The result is a timestamp as described above, so you can subtract it from the current date to get the elapsed time. See also the Working with users section.

>>> page = pywikibot.Page(site, 'Budapest')
>>> user = pywikibot.User(site,  page.oldest_revision.user)
>>> user.last_edit[2]
Timestamp(2006, 5, 21, 11, 11, 22)

Other

  • BasePage.permalink() will return a permalink to the given version. The oldid argument may be taken from the revid value of that revision. If you omit it, the latest id (latest_revision_id) will automatically be used. For getting permalinks for all versions of the page:

for rev in page.revisions():
    print(f'{repr(rev.timestamp)}\t{rev.revid}\t{page.permalink(rev.revid, with_protocol=True)}')

Timestamp(2013, 3, 9, 2, 3, 14)                13179698 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=13179698
Timestamp(2011, 7, 10, 5, 50, 56)               9997266 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=9997266
Timestamp(2011, 3, 13, 17, 41, 19)              9384635 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=9384635
Timestamp(2011, 1, 15, 23, 56, 3)               9112326 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=9112326
Timestamp(2010, 11, 18, 15, 25, 44)             8816647 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=8816647
Timestamp(2010, 9, 27, 13, 16, 24)              8539294 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=8539294
Timestamp(2010, 3, 28, 10, 59, 28)              7422239 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=7422239
etc.

However, besides this URL format, over the years MediaWiki introduced a nicer form for internal use. In any language you may use

print(f'[[Special:Permalink/{rev.revid}]]')

This will result in such permalinks that you can use on your wiki: [[Special:Permalink/6833510]].

  • BasePage.getOldVersion takes an oldid and returns the text of that version (not a Revision object!). May be useful if you know the version id from somewhere.

Deleted revisions

When a page is deleted and recreated, it will get a new id. Thus the only way of mining the deleted revisions is to identify the page by its title. On the other hand, when a page is moved (renamed), it takes the old id to the new title, and a new redirect page is created with a new id and the old title. Taking everything into account, the investigation may be complicated, as deleted versions may be under the old title, and deleted versions under the same title may belong to another page. It may be easier without a bot if it is about one page. Now we take a simple case where the page was never renamed.

  • BasePage.has_deleted_revisions() does not need admin rights, and simply answers yes or no to the question of whether the page has any deleted revisions. Don't ask me for a use case.

>>> page = pywikibot.Page(site, '2023')
>>> page.has_deleted_revisions()
True

The following methods need admin rights, otherwise they will raise pywikibot.exceptions.UserRightsError.

  • BasePage.loadDeletedRevisions() iterates through the timestamps of deleted revisions and yields them. Meanwhile it caches other data in a private variable for later use. Iterators may be processed with a for loop or transformed into lists. For example, to see the number of deleted revisions:

print(len(list(page.loadDeletedRevisions())))

The main use case is to get timestamps for getDeletedRevision().

  • BasePage.getDeletedRevision() takes a timestamp, which is most easily obtained from the above loadDeletedRevisions(), and returns a dictionary. Don't be misled by the name; this is not a Revision object. Its keys are:

dict_keys(['revid', 'user', 'timestamp', 'slots', 'comment'])

Theoretically a content=True argument should return the text of the revision (otherwise the text is returned only if it had previously been retrieved). Currently the documentation does not exactly cover what happens, see phab:T331422. In practice, the revision text may be obtained (with an example timestamp) as

text = page.getDeletedRevision('2023-01-30T19:11:36Z', content=True)['slots']['main']['*']
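The chained lookup raises a KeyError as soon as one level is missing. A more tolerant variant can be sketched as follows; it assumes that absent content shows up as a missing key, which is not confirmed by the documentation, and the function name deleted_text and the sample dictionaries are made up.

```python
def deleted_text(deleted_rev):
    """Return the main-slot text of a deleted-revision dict, or None."""
    main_slot = deleted_rev.get('slots', {}).get('main', {})
    return main_slot.get('*')


# Stand-in dictionaries shaped like the keys listed above.
with_text = {'revid': 1, 'user': 'Example',
             'slots': {'main': {'*': 'Some text'}}}
without_text = {'revid': 2, 'user': 'Example', 'slots': {'main': {}}}
print(deleted_text(with_text))
print(deleted_text(without_text))
```

With a real page the argument would be the dictionary returned by page.getDeletedRevision(timestamp, content=True).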

The underlying method for both of the above methods is Site.deletedrevs(), which makes it possible to get the deleted revisions of several pages together and to include only, or rather exclude, the revisions by a given user.