Jump to content

Topic on Talk:Multi-Content Revisions/Database Schema

Summary by Daniel Kinzler (WMDE)

we need rev_len to be the sum of the relevant content_size fields. Nothing changes otherwise.

Anomie (talkcontribs)

page_len is queried for various purposes, including Special:ShortPages. It's presumably denormalized from page_latestā†’rev_len to avoid having to join and to allow for the page_len index on the page table.

rev_len seems to be used just for displaying the size of the revision (and size change compared to the previous revision) in history pages and such. It may still be useful as the sum of content_len for that purpose to avoid history pages having to join slots and content and add the lengths together (either with a GROUP BY or in PHP).

Daniel Kinzler (WMDE) (talkcontribs)

Yes, we should keep rev_len as a sum of content_size. Denormalizing page_len seems dubious, but whether it's needed really depends on the access patterns.

I would love to see the mutable part of page (page_len, page_touched, page_current) split off into a separate table though, so that the "primary" part becomes append-only. But that's definitely a separate project.

Anomie (talkcontribs)

Denormalizing page_len is probably needed for everything that uses the page_len or page_redirect_namespace_len indexes. Otherwise, for the former index it would need to scan all revisions by rev_len (which would need an index) and filter just the latest revisions, and for the latter index it would either have to do the same or would have to fetch all redirects in the namespace and filesort by rev_len.

A lot more of page is mutable than just those three fields, although those three (along with page_links_updated) are probably the ones most often mutated. The only really constant fields are page_id and page_random.

Daniel Kinzler (WMDE) (talkcontribs)

Yea, you are right.

So, we need rev_len to be the sum of the relevant content_size fields. Nothing changes otherwise. Thoughts about page_len are idle musing.