Multi-Content Revisions/Database Schema
This is an overview over the database schema for representing multi-content revisions.
See the Glossary for the intended meaning of the terms used in this document.
The basic entity/relationship structure is as follows: revision <-- slot --> content --> text
For a discussion of the scalability and efficiency of this design, see Multi-Content Revisions/Content Meta-Data#Database Schema. The source code of the proposed schema is available on gerrit. For notes on the initial population of the new tables, see Multi-Content_Revisions/Schema_Migration.
Below there is a description of all the tables involved, and comments about what should be done with individual fields. The status column gives a quick indication of what changes a field may undergo, and what decisions are to be made:
Status | Explanation |
---|---|
add | needed for MCR |
keep | no change needed |
keep* | subject to change/removal by another project or idea, no change needed for MCR. |
drop | redundant or misleading with MCR. Should be removed, can be kept and ignored for a while, though. |
drop... | redundant with MCR, but keep for backwards compat for now. |
slots
[edit]Field | Type | Ref | Index | Status | Comment |
---|---|---|---|---|---|
slot_revision_id | bigint unsigned | rev_id | PRI | add | slot_revision +slot_role is unique and can be used as primary.
Note: benchmark performance implications of joining a BIGINT against an INT field. |
slot_role_id | smallint unsigned | role_id | PRI | add | Normalized representation of the role, as a reference to the roles table. The mapping can be aggressively cached.
|
slot_content_id | bigint(10) unsigned | content_id | IDX | add | Indexed so we can find out where which content is used. |
slot_origin | bigint unsigned | rev_id | IDX | add | The revision that last changed this slot's content.
Can be used to determine whether a slot's content was inherited from a previous revision by checking if slot_revision_id != slot_origin. (Note: We originally decided against tracking the origin, and in favor of a slot_inherited flag, but went back on this decision due to the need to track the origin of content in the presence of revision deletion and undeletion) |
content
[edit]Field | Type | Ref | Index | Status | Comment |
---|---|---|---|---|---|
content_id | bigint(10) unsigned | PRI | add | Auto-increment key. | |
content_size | int(10) unsigned | add | Nominal size of the content, in bogo-bytes. Used to re-calculate rev_len for new revisions.
This is not the length of the blob in bytes! | ||
content_sha1 | varbinary(32) | add | Needed to re-calculate rev_sha1 for new revisions.
Note: Should be nullable in case we want to switch to on-the-fly calculation to save space. | ||
content_model | smallint unsigned | model_id | add | Normalized representation of the model, as a reference to the models table. The mapping can be aggressively cached.
| |
content_address | varbinary(255) | IDX? | add | URL-Like content address, e.g. "tt:7681347" or "ex:DB://s15/8734/7623". |
revision
[edit]Field | Type | Ref | Status | Comment |
---|---|---|---|---|
rev_id | int(10) unsigned | keep* | Should probably become a BIGINT | |
rev_page | int(10) unsigned | page_id | keep* | This could be used for sharding if we had a separate contributions table. Listing user contributions the the only access pattern for this table that cannot be made to pre-select by page.
|
rev_text_id | int(10) unsigned | old_id | drop | Replaced by slot_rev, slot_content, and content_address |
rev_comment | varbinary(767) | keep* | Subject to the comment extraction project. | |
rev_user | int(10) unsigned | user_id | keep* | Subject to the actor extraction project. |
rev_user_text | varchar(255) | keep* | Subject to the actor extraction project. | |
rev_timestamp | binary(14) | keep | ||
rev_minor_edit | tinyint(3) unsigned | keep | ||
rev_deleted | tinyint(3) unsigned | keep | ||
rev_len | int(10) unsigned | keep | Calculated as the sum of the length of all slots associated with the revision. This is the nominal size of the content of the revision in bogo-bytes. Its only legitimate use is to given an approximate idea of how much information an edit added or removed. | |
rev_parent_id | int(10) unsigned | rev_id | keep* | BIGINT? |
rev_sha1 | varbinary(32) | keep* | Calculated from the hashes of all slots associated with the revision.
(Note: we decided to keep this for now, but may change to calculating it on the fly when generating dumps etc). | |
rev_content_model | varbinary(32) | drop... | Semantically, a revision no longer has a content model, each slot may have one. We can however keep this field around and use the model of the main slot. | |
rev_content_format | varbinary(64) | drop... | Semantically, a revision no longer has a content format, each slot may have one. We can just ignore the field for now. However, the benefit of storing the serialization format is dubious: it can be derived from the model, or at least be auto-detected by code associated with the model. |
archive
[edit]Field | Type | Ref | Status | Comment |
---|---|---|---|---|
ar_id | int(10) unsigned | keep* | Ideally, this would be the same as ar_rev_id. Since nothing should refer to these IDs, it should be possible to transition legacy data easily. Removing one of the two fields would be a little more tricky.
May need to become a BIGINT. | |
ar_namespace | int(11) | keep | ||
ar_title | varchar(255) | keep* | Subject to the title normalization project. | |
ar_text | mediumblob | keep* | Apparently disused in favor of ar_text_id | |
ar_comment | varbinary(767) | keep* | Subject to the comment extraction project. | |
ar_user | int(10) unsigned | user_id | keep* | Subject to the actor extraction project. |
ar_user_text | varchar(255) | keep* | Subject to the actor extraction project. | |
ar_timestamp | binary(14) | keep | ||
ar_minor_edit | tinyint(4) | keep | ||
ar_flags | tinyblob | keep* | Flags for disused ar_text | |
ar_rev_id | int(10) unsigned | (rev_id) | keep | Does not point to an existing revision! Ideally, could be replaced by plain ar_id. But that's for another time. |
ar_text_id | int(10) unsigned | old_id | drop | Replaced by slot_rev, slot_content, and content_address. Note that slot_rev refers to ar_rev_id, not to ar_id. |
ar_deleted | tinyint(3) unsigned | keep | ||
ar_len | int(10) unsigned | keep | Calculated as the sum of the length of all slots associated with the revision (or just copied from rev_len) | |
ar_page_id | int(10) unsigned | (page_id) | keep | May not point to an existing page |
ar_parent_id | int(10) unsigned | rev_id | keep | May not point to an existing revision |
ar_sha1 | varbinary(32) | keep* | see rev_sha1 | |
ar_content_model | varbinary(32) | drop... | Semantically, a revision no longer has a content model, each slot may have one. We can however keep this field around and use the model of the main slot. | |
ar_content_format | varbinary(64) | drop... | Semantically, a revision no longer has a content format, each slot may have one. We can however just ignore this field for now. Anyway, the benefit of storing the serialization format is dubious: it can be derived from the model, or at least be auto-detected by code associated with the model. |
page
[edit]Field | Type | Ref | Status | Comment |
---|---|---|---|---|
page_id | int(10) unsigned | keep | ||
page_namespace | int(11) | keep* | We may want to introduce a namespaces table, like the tables we use for normalized storage of the content model, etc.
| |
page_title | varchar(255) | keep* | Subject to the title normalization project. | |
page_restrictions | tinyblob | keep* | Has been replaced by the page_restrictions table for a long time. | |
page_is_redirect | tinyint(3) unsigned | keep | ||
page_is_new | tinyint(3) unsigned | keep | ||
page_random | double unsigned | keep | ||
page_touched | binary(14) | keep | ||
page_links_updated | varbinary(14) | keep | ||
page_latest | int(10) unsigned | rev_id | keep | May need to become a BIGINT |
page_len | int(10) unsigned | keep* | Redundant to the current revision's length. Whether it's useful to keep this depends on the access patterns. | |
page_content_model | varbinary(32) | drop... | Semantically, a page no longer has a content model, each stream may have one. We can however keep this field around and use it as the default for the main slot, but the exact semantics of the interaction between page_content_model and rev_content_model have been causing confusion before. Also, this usually NULL. | |
page_lang | varbinary(35) | keep* | This should really be defined per stream. Would need a separate table. Perhaps later. We could take that opportunity to normalize language code storage. |
text
[edit]Field | Type | Ref | Status | Comment |
---|---|---|---|---|
old_id | int(10) unsigned | keep | May need to grow into a BIGINT if we keep using it | |
old_text | mediumblob | keep | Content blob, or some sort of reference to the content blob, depending on old_flags | |
old_flags | tinyblob | keep |
Note that MCR can use ExternalStore directly by using the appropriate URL in content_address
, bypassing the text table entirely.
Also, it was suggested that the text table could become the content
table, instead of being referenced from the content
table.
Name Tables
[edit]All name tables (roles, models, formats, namespaces
, etc) have the same structure:
Field | Type | Ref | Status | Comment |
---|---|---|---|---|
xxx_id | smalint unsigned | keep | ||
xxx_name | varbinary(255) | keep | The canonical (human readable) name. Normalization rules and character set restrictions may vary. |
The table must allow lookups in both directions. Both columns are unique. The id is auto-incrementing.
The mapping defined by the table may be cached aggressively. The mapping should update automatically when attempting to look up the id for an unknown name.