Topic on User talk:J.a.coffman/GSoc 2013 Proposal

Accessing the revisions

Svick (talkcontribs)
Sorting the revisions by timestamp would allow fast access to any revision.

How exactly would that work? How would you quickly access a specific revision, considering that in a normal compressed file, you need to decompress all content before a specific point to read data from that point? And how would you find that point, based on the timestamp?

Also, how would you, for example, get the latest revision for a certain page? Or all revisions of that page?

J.a.coffman (talkcontribs)

When we start the dump process, we note the time that the dump started. We identify revisions that have been added since the last dump by comparing their timestamps to the timestamp of the previous dump. Once the dump has completed, we have a big list of revisions, sorted by the dump timestamp. When someone wants to update their local dump, all we need from them is the timestamp for the last dump they downloaded.
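
The timestamp comparison described above can be sketched as a binary search over the sorted list. This is only an illustration, not the proposal's implementation: the parallel lists `timestamps` and `revision_ids` and the function `revisions_since` are hypothetical names, and it assumes a revision stamped exactly at the previous dump time was already included in that dump.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Hypothetical data: two parallel lists, sorted by revision timestamp.
timestamps = [
    datetime(2013, 4, 1, tzinfo=timezone.utc),
    datetime(2013, 4, 10, tzinfo=timezone.utc),
    datetime(2013, 4, 20, tzinfo=timezone.utc),
    datetime(2013, 5, 1, tzinfo=timezone.utc),
]
revision_ids = [101, 102, 103, 104]


def revisions_since(timestamps, revision_ids, last_dump_time):
    """Return the IDs of revisions made strictly after last_dump_time.

    Assumption: a revision stamped exactly at last_dump_time was already
    part of the previous dump, so it is excluded here.
    """
    # Binary search for the first entry newer than the previous dump.
    start = bisect_right(timestamps, last_dump_time)
    return revision_ids[start:]


last_dump = datetime(2013, 4, 10, tzinfo=timezone.utc)
new_revisions = revisions_since(timestamps, revision_ids, last_dump)
```

Because the list is sorted, finding the starting point takes logarithmic time; everything after that point is what the user still needs to download.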

In what case would we need to get the latest revision for a certain page, or all revisions for a page? I was under the impression that a user would only ever want to download all changes that have been made since their local dump. With the format I am proposing, this would not require per-page access to revisions.

Svick (talkcontribs)

Yes, but when the user has downloaded (and updated) the dump, he is going to access it. And accessing it is going to be much easier if he can quickly get to the revision(s) he wants. I don't see how your proposed format would allow that, and I thought the part I quoted meant it would.

J.a.coffman (talkcontribs)

Sorry, I'll edit my proposal to make it clearer.

The user still employs the same page-oriented dump format that they use now. Once they've received all the revisions they need, the script that the user uses to download them will then decompress each block of revisions and apply the changes to the user's local dump. The format that I describe in my proposal is only used to store the revisions server-side. This means we do not need to worry about breaking any systems the user might use that rely on the current dump format.
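
As a rough sketch of the client-side step described above, the following decompresses one downloaded block of revisions and merges it into a page-oriented local structure. The block encoding (gzip-compressed JSON), the field names `page_id`, `rev_id`, and `timestamp`, and the function `apply_revision_block` are all assumptions made for illustration; the real dump format is XML and is not modelled here.

```python
import gzip
import json
from collections import defaultdict


def apply_revision_block(local_dump, compressed_block):
    """Decompress one block of revisions and file each one under its page.

    Assumption: a block is a gzip-compressed JSON list of revision records.
    """
    revisions = json.loads(gzip.decompress(compressed_block).decode("utf-8"))
    touched_pages = set()
    for rev in revisions:
        local_dump[rev["page_id"]].append(rev)
        touched_pages.add(rev["page_id"])
    # Keep each touched page's history ordered by timestamp.
    for page_id in touched_pages:
        local_dump[page_id].sort(key=lambda r: r["timestamp"])
    return local_dump


# Hypothetical local dump: page ID -> list of revisions for that page.
local_dump = defaultdict(list)
local_dump[7].append({"page_id": 7, "rev_id": 1, "timestamp": "2013-04-01T00:00:00Z"})

# Hypothetical downloaded block, ordered by timestamp rather than by page.
block = gzip.compress(json.dumps([
    {"page_id": 9, "rev_id": 2, "timestamp": "2013-04-15T00:00:00Z"},
    {"page_id": 7, "rev_id": 3, "timestamp": "2013-04-20T00:00:00Z"},
]).encode("utf-8"))

apply_revision_block(local_dump, block)
```

The server-side ordering (by timestamp) and the local ordering (by page) stay independent: the script is the only place where one is translated into the other.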

Svick (talkcontribs)

Does that mean the user has to decompress and recompress the whole dump on every update? I don't think that would work, because doing that would take a very long time.

J.a.coffman (talkcontribs)

No, because the individual revisions are collected into blocks of arbitrary size. We only need to decompress one block at a time. So we decompress a block, apply updates, compress it again, and move on to the next one. Or we could delete each block once all of its updates have been applied.
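
The block-at-a-time update could look roughly like the sketch below. It assumes each block is compressed independently (here with `zlib`, holding newline-separated revision text), which is what lets a single block be rewritten without touching the rest. The names `update_block` and `apply_updates` and the mapping from block index to updates are hypothetical.

```python
import zlib


def update_block(compressed_block, new_revisions):
    """Decompress one block, append new revisions, and recompress it."""
    lines = zlib.decompress(compressed_block).decode("utf-8").splitlines()
    lines.extend(new_revisions)
    return zlib.compress("\n".join(lines).encode("utf-8"))


def apply_updates(blocks, updates_by_block):
    """Rewrite only the blocks that have pending updates.

    blocks: list of independently compressed blocks (bytes).
    updates_by_block: hypothetical mapping of block index -> new revisions.
    """
    for index, new_revisions in updates_by_block.items():
        blocks[index] = update_block(blocks[index], new_revisions)
    return blocks


blocks = [zlib.compress(b"rev1\nrev2"), zlib.compress(b"rev3")]
untouched = blocks[0]
apply_updates(blocks, {1: ["rev4"]})
```

Only one block is held decompressed at any moment, so memory use and work per update are bounded by the block size rather than by the size of the whole dump.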
