Multi-Content Revisions/Database Schema

This is an overview of the database schema for representing multi-content revisions.

See the Glossary for the intended meaning of the terms used in this document.

The basic entity/relationship structure is as follows: revision <-- slot --> content --> text

For a discussion of the scalability and efficiency of this design, see Multi-Content Revisions/Content Meta-Data#Database Schema. The source code of the proposed schema is available on gerrit. For notes on the initial population of the new tables, see Multi-Content_Revisions/Schema_Migration.

The tables involved are described below, with comments on what should be done with individual fields. The status column gives a quick indication of what changes a field may undergo and which decisions remain to be made:

Status	Explanation
add	needed for MCR
keep	no change needed
keep*	subject to change/removal by another project or idea, no change needed for MCR.
drop	redundant or misleading with MCR. Should be removed, though it can be kept and ignored for a while.
drop...	redundant with MCR, but kept for backwards compatibility for now.

`slots`

Field	Type	Ref	Index	Status	Comment
slot_revision_id	bigint unsigned	rev_id	PRI	add	`slot_revision_id`+`slot_role_id` is unique and can be used as the primary key. Note: benchmark the performance implications of joining a BIGINT against an INT field.
slot_role_id	smallint unsigned	role_id	PRI	add	Normalized representation of the role, as a reference to the `roles` table. The mapping can be aggressively cached.
slot_content_id	bigint(10) unsigned	content_id	IDX	add	Indexed so we can find out which slots use a given content object.
slot_origin	bigint unsigned	rev_id	IDX	add	The revision that last changed this slot's content. Can be used to determine whether a slot's content was inherited from a previous revision by checking if slot_revision_id != slot_origin. (Note: We originally decided against tracking the origin, and in favor of a slot_inherited flag, but went back on this decision due to the need to track the origin of content in the presence of revision deletion and undeletion)
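
The inheritance check described for slot_origin can be sketched as follows. This is a minimal illustration, not the proposed implementation: it uses an in-memory SQLite database as a stand-in, trims `slots` to the four fields listed above, and the index names, helper functions and sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE slots (
        slot_revision_id INTEGER NOT NULL,
        slot_role_id     INTEGER NOT NULL,
        slot_content_id  INTEGER NOT NULL,
        slot_origin      INTEGER NOT NULL,
        PRIMARY KEY (slot_revision_id, slot_role_id)
    )
    """
)
# Hypothetical index names; the schema above only says these fields are indexed.
conn.execute("CREATE INDEX slots_content_idx ON slots (slot_content_id)")
conn.execute("CREATE INDEX slots_origin_idx ON slots (slot_origin)")

# Invented sample data: revision 101 changes role 1 and inherits role 2
# unchanged from revision 100.
conn.executemany(
    "INSERT INTO slots VALUES (?, ?, ?, ?)",
    [
        (100, 1, 1, 100),
        (100, 2, 2, 100),
        (101, 1, 3, 101),
        (101, 2, 2, 100),
    ],
)


def inherited_roles(conn, revision_id):
    """Return the role ids whose content this revision inherited."""
    rows = conn.execute(
        "SELECT slot_role_id FROM slots "
        "WHERE slot_revision_id = ? AND slot_revision_id != slot_origin "
        "ORDER BY slot_role_id",
        (revision_id,),
    )
    return [row[0] for row in rows]


def revisions_using_content(conn, content_id):
    """Return the revision ids that reference the given content row."""
    rows = conn.execute(
        "SELECT DISTINCT slot_revision_id FROM slots "
        "WHERE slot_content_id = ? ORDER BY slot_revision_id",
        (content_id,),
    )
    return [row[0] for row in rows]
```

The second helper shows the lookup that the index on slot_content_id is meant to support.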

`content`

Field	Type	Ref	Index	Status	Comment
content_id	bigint(10) unsigned		PRI	add	Auto-increment key.
content_size	int(10) unsigned			add	Nominal size of the content, in bogo-bytes. Used to re-calculate `rev_len` for new revisions. This is not the length of the blob in bytes!
content_sha1	varbinary(32)			add	Needed to re-calculate `rev_sha1` for new revisions. Note: Should be nullable in case we want to switch to on-the-fly calculation to save space.
content_model	smallint unsigned	model_id		add	Normalized representation of the model, as a reference to the `models` table. The mapping can be aggressively cached.
content_address	varbinary(255)		IDX?	add	URL-Like content address, e.g. "tt:7681347" or "ex:DB://s15/8734/7623".
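
As a rough illustration of how a URL-like content address could be taken apart, the sketch below splits an address at the first colon into a scheme and a scheme-specific location. The `ContentAddress` type and `parse_content_address` function are hypothetical names introduced for this example; how each scheme is actually resolved is not covered here.

```python
from typing import NamedTuple


class ContentAddress(NamedTuple):
    scheme: str    # short prefix before the first colon, e.g. "tt" or "ex"
    location: str  # everything after the first colon, left uninterpreted


def parse_content_address(address: str) -> ContentAddress:
    """Split a content address at its first colon.

    Only the first colon separates the scheme; any later colons belong to
    the location, as in "ex:DB://s15/8734/7623".
    """
    scheme, sep, location = address.partition(":")
    if not sep or not scheme or not location:
        raise ValueError(f"not a valid content address: {address!r}")
    return ContentAddress(scheme, location)
```

With the two example addresses from the table, `parse_content_address("tt:7681347")` gives scheme `"tt"` and location `"7681347"`, and `parse_content_address("ex:DB://s15/8734/7623")` gives scheme `"ex"` and location `"DB://s15/8734/7623"`.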

`revision`

Field	Type	Ref	Status	Comment
rev_id	int(10) unsigned		keep*	Should probably become a BIGINT
rev_page	int(10) unsigned	page_id	keep*	This could be used for sharding if we had a separate `contributions` table. Listing user contributions is the only access pattern for this table that cannot be made to pre-select by page.
rev_text_id	int(10) unsigned	old_id	drop	Replaced by slot_revision_id, slot_content_id, and content_address
rev_comment	varbinary(767)		keep*	Subject to the comment extraction project.
rev_user	int(10) unsigned	user_id	keep*	Subject to the actor extraction project.
rev_user_text	varchar(255)		keep*	Subject to the actor extraction project.
rev_timestamp	binary(14)		keep
rev_minor_edit	tinyint(3) unsigned		keep
rev_deleted	tinyint(3) unsigned		keep
rev_len	int(10) unsigned		keep	Calculated as the sum of the length of all slots associated with the revision. This is the nominal size of the content of the revision in bogo-bytes. Its only legitimate use is to give an approximate idea of how much information an edit added or removed.
rev_parent_id	int(10) unsigned	rev_id	keep*	BIGINT?
rev_sha1	varbinary(32)		keep*	Calculated from the hashes of all slots associated with the revision. (Note: we decided to keep this for now, but may change to calculating it on the fly when generating dumps etc).
rev_content_model	varbinary(32)		drop...	Semantically, a revision no longer has a content model, each slot may have one. We can however keep this field around and use the model of the main slot.
rev_content_format	varbinary(64)		drop...	Semantically, a revision no longer has a content format, each slot may have one. We can just ignore the field for now. However, the benefit of storing the serialization format is dubious: it can be derived from the model, or at least be auto-detected by code associated with the model.
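
The rule for rev_len (the sum of content_size over all slots of a revision) can be sketched as a join from `slots` to `content`. This is an illustration under stated assumptions: in-memory SQLite stands in for the real database, both tables are trimmed to the fields needed, the sample sizes are invented, and `compute_rev_len` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE content (
        content_id   INTEGER PRIMARY KEY AUTOINCREMENT,
        content_size INTEGER NOT NULL
    );
    CREATE TABLE slots (
        slot_revision_id INTEGER NOT NULL,
        slot_role_id     INTEGER NOT NULL,
        slot_content_id  INTEGER NOT NULL REFERENCES content (content_id),
        PRIMARY KEY (slot_revision_id, slot_role_id)
    );
    -- Invented sample data: revision 100 has two slots.
    INSERT INTO content (content_id, content_size) VALUES (1, 1200), (2, 300);
    INSERT INTO slots VALUES (100, 1, 1), (100, 2, 2);
    """
)


def compute_rev_len(conn, revision_id):
    """Sum the nominal sizes of all content attached to one revision."""
    row = conn.execute(
        "SELECT COALESCE(SUM(content_size), 0) "
        "FROM slots JOIN content ON content_id = slot_content_id "
        "WHERE slot_revision_id = ?",
        (revision_id,),
    ).fetchone()
    return row[0]
```

For the sample data, `compute_rev_len(conn, 100)` returns 1500; a revision without any slots yields 0 in this sketch.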

`archive`

Field	Type	Ref	Status	Comment
ar_id	int(10) unsigned		keep*	Ideally, this would be the same as ar_rev_id. Since nothing should refer to these IDs, it should be possible to transition legacy data easily. Removing one of the two fields would be a little more tricky. May need to become a BIGINT.
ar_namespace	int(11)		keep
ar_title	varchar(255)		keep*	Subject to the title normalization project.
ar_text	mediumblob		keep*	Apparently disused in favor of ar_text_id
ar_comment	varbinary(767)		keep*	Subject to the comment extraction project.
ar_user	int(10) unsigned	user_id	keep*	Subject to the actor extraction project.
ar_user_text	varchar(255)		keep*	Subject to the actor extraction project.
ar_timestamp	binary(14)		keep
ar_minor_edit	tinyint(4)		keep
ar_flags	tinyblob		keep*	Flags for disused ar_text
ar_rev_id	int(10) unsigned	(rev_id)	keep	Does not point to an existing revision! Ideally, could be replaced by plain ar_id. But that's for another time.
ar_text_id	int(10) unsigned	old_id	drop	Replaced by slot_revision_id, slot_content_id, and content_address. Note that slot_revision_id refers to ar_rev_id, not to ar_id.
ar_deleted	tinyint(3) unsigned		keep
ar_len	int(10) unsigned		keep	Calculated as the sum of the length of all slots associated with the revision (or just copied from rev_len)
ar_page_id	int(10) unsigned	(page_id)	keep	May not point to an existing page
ar_parent_id	int(10) unsigned	rev_id	keep	May not point to an existing revision
ar_sha1	varbinary(32)		keep*	see rev_sha1
ar_content_model	varbinary(32)		drop...	Semantically, a revision no longer has a content model, each slot may have one. We can however keep this field around and use the model of the main slot.
ar_content_format	varbinary(64)		drop...	Semantically, a revision no longer has a content format, each slot may have one. We can however just ignore this field for now. Anyway, the benefit of storing the serialization format is dubious: it can be derived from the model, or at least be auto-detected by code associated with the model.

`page`

Field	Type	Ref	Status	Comment
page_id	int(10) unsigned		keep
page_namespace	int(11)		keep*	We may want to introduce a `namespaces` table, like the tables we use for normalized storage of the content model, etc.
page_title	varchar(255)		keep*	Subject to the title normalization project.
page_restrictions	tinyblob		keep*	Was replaced by the page_restrictions table a long time ago.
page_is_redirect	tinyint(3) unsigned		keep
page_is_new	tinyint(3) unsigned		keep
page_random	double unsigned		keep
page_touched	binary(14)		keep
page_links_updated	varbinary(14)		keep
page_latest	int(10) unsigned	rev_id	keep	May need to become a BIGINT
page_len	int(10) unsigned		keep*	Redundant to the current revision's length. Whether it's useful to keep this depends on the access patterns.
page_content_model	varbinary(32)		drop...	Semantically, a page no longer has a content model, each stream may have one. We can however keep this field around and use it as the default for the main slot, but the exact semantics of the interaction between page_content_model and rev_content_model have caused confusion before. Also, this is usually NULL.
page_lang	varbinary(35)		keep*	This should really be defined per stream. Would need a separate table. Perhaps later. We could take that opportunity to normalize language code storage.

`text`

Field	Type	Status	Comment
old_id	int(10) unsigned	keep	May need to grow into a BIGINT if we keep using it
old_text	mediumblob	keep	Content blob, or some sort of reference to the content blob, depending on old_flags
old_flags	tinyblob	keep

Note that MCR can use ExternalStore directly by using the appropriate URL in content_address, bypassing the text table entirely.

Also, it was suggested that the text table could become the content table, instead of being referenced from the content table.

Name Tables

All name tables (roles, models, formats, namespaces, etc) have the same structure:

Field	Type	Ref	Status	Comment
xxx_id	smallint unsigned		keep
xxx_name	varbinary(255)		keep	The canonical (human readable) name. Normalization rules and character set restrictions may vary.

The table must allow lookups in both directions. Both columns are unique. The id is auto-incrementing.

The mapping defined by the table may be cached aggressively. The mapping should update automatically when attempting to look up the id for an unknown name.
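
A minimal sketch of such a name table, assuming in-memory SQLite as a stand-in database: lookups work in both directions, results are cached in the process, and asking for the id of an unknown name inserts it. The `NameTable` class and its method names are hypothetical, and the sketch ignores concurrent writers, which a real implementation would have to handle.

```python
import sqlite3


class NameTable:
    """Cached two-way mapping between ids and names, backed by one table."""

    def __init__(self, conn, table, prefix):
        # `table` and `prefix` are trusted identifiers chosen by the caller,
        # e.g. ("roles", "role"); they are never taken from user input.
        self.conn = conn
        self.table = table
        self.id_col = f"{prefix}_id"
        self.name_col = f"{prefix}_name"
        self._id_by_name = {}
        self._name_by_id = {}
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} ("
            f"{self.id_col} INTEGER PRIMARY KEY AUTOINCREMENT, "
            f"{self.name_col} TEXT NOT NULL UNIQUE)"
        )

    def _remember(self, id_, name):
        self._id_by_name[name] = id_
        self._name_by_id[id_] = name

    def get_id(self, name):
        """Return the id for a name, inserting the name if it is unknown."""
        if name in self._id_by_name:
            return self._id_by_name[name]
        row = self.conn.execute(
            f"SELECT {self.id_col} FROM {self.table} WHERE {self.name_col} = ?",
            (name,),
        ).fetchone()
        if row is None:
            cursor = self.conn.execute(
                f"INSERT INTO {self.table} ({self.name_col}) VALUES (?)",
                (name,),
            )
            id_ = cursor.lastrowid
        else:
            id_ = row[0]
        self._remember(id_, name)
        return id_

    def get_name(self, id_):
        """Return the name for an id; unknown ids raise KeyError."""
        if id_ in self._name_by_id:
            return self._name_by_id[id_]
        row = self.conn.execute(
            f"SELECT {self.name_col} FROM {self.table} WHERE {self.id_col} = ?",
            (id_,),
        ).fetchone()
        if row is None:
            raise KeyError(id_)
        self._remember(id_, row[0])
        return row[0]


conn = sqlite3.connect(":memory:")
roles = NameTable(conn, "roles", "role")
main_id = roles.get_id("main")  # unknown name, so a row is inserted
```

Only the name-to-id direction creates rows; looking up an unknown id is treated as an error, since an id cannot be invented a name.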