Multi-Content Revisions/Content Meta-Data

This page is part of the MCR proposal. Status: stable draft, comments welcome.

The central component of Multi-Content-Revision support is the management of content meta-data. To allow multiple content objects to be managed per revision, part of the storage layer needs to be adapted by adding a new level of indirection between revisions and content objects. This provides a new degree of freedom in modeling the wiki content: pages can now have multiple "streams". We go from the straightforward

  • page -> revision -> text ( -> external store )

to a more indirect model:

  • page -> ( revision -> ) slots -> content -> ( text | external store )

The introduction of this additional layer of indirection requires a new storage interface in PHP, but more importantly it requires a database schema change and data migration described later on this page.

Data Model

An overview in bullet points (refer to the Glossary for a definition of the terms used):

  • Each revision has named slots, the slot names define the slot's role. Each slot may contain a content object.
  • There is always at least one slot defined: the main slot. It will be used in all places that do not explicitly specify a slot. This is the primary backwards compatibility mechanism.
  • Slots can be enumerated (if they are primary, i.e. not derived slots, see below). The link between a revision ID and the associated slots is maintained in the main database (see Database Schema below). Listing all slots is needed for viewing, to create diffs, generate XML dumps, perform revert/undo, etc.
  • A role may be associated with a specific content model (e.g. the "categories" role would use the "categories" model). The main slot, however, may contain any kind of content (the default model depending on the page title, etc.), and some other roles may also not require a specific model to be used.
  • Slots can be uniquely identified by revision ID and slot name.
  • There is meta-data associated with each slot: the content meta-data (content model and format, logical size, and hash) and the slot meta-data (slot name, revision ID).
  • Slots have no intrinsic order. Two revisions are considered equal if their (primary) slots have the same content. Two equal revisions have the same hash and length.
  • A revision's hash is aggregated from the hashes of its (primary) slots. If there is only one slot, the revision's hash is the same as the slot's hash. Similarly, a revision's length is calculated as the sum of the (logical) sizes of the slots.

For the future, it may prove useful to support the association of derived content with a revision.

  • A generic way to associate materialized derived content with a revision could be provided, similar to (or integrated with) the mechanism used for associating primary, user-generated content with a revision.
  • Derived content may be generated and saved ("materialized") when a revision is created, or it may be generated on the fly ("virtual content"). The link between a revision ID and associated (materialized) derived content would be stored in the main database.
  • Derived slots cannot be enumerated.

Content Meta-Data Service (DAO)

For accessing the content of the different slots of a revision, a RevisionSlotLookup service is defined as described below. For a description of the storage interface for retrieving actual content data and listing available slots per revision, see Revision Retrieval.

/**
* @todo find a better name. ContentMetaData?
**/
class RevisionContentInfo {
  public function getBlobAddress();
  public function getContentFormat();
  public function getContentModel();
  public function getLogicalSize();
  public function getHash();
}

class RevisionSlot extends RevisionContentInfo {
  public function getRevisionId();
  public function getRole();
}

interface RevisionSlotLookup {
    /**
    * @param int $revisionId
    * @param string[]|null $slots The names of the slots to return. Null means all primary slots.
    * @return RevisionSlot[]
    **/
    function getRevisionSlots( $revisionId, array $slots = null );
}

interface RevisionSlotStore extends RevisionSlotLookup {
    /**
    * @param RevisionSlot[] $slots The slot data to save.
    **/
    function insertRevisionSlots( $revisionId, array $slots );
}

(Code experiment: https://gerrit.wikimedia.org/r/#/c/217710/ and https://gerrit.wikimedia.org/r/#/c/302492; See also phab:T142980)
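For illustration, a caller might use the lookup interface above roughly as follows. This is only a sketch; how the RevisionSlotLookup instance is obtained (here assumed to come from some service container) is not part of this proposal.

// Illustrative usage of RevisionSlotLookup; service wiring is an assumption.
/** @var RevisionSlotLookup $lookup */
$slots = $lookup->getRevisionSlots( 123456, [ 'main' ] );

foreach ( $slots as $slot ) {
  // RevisionSlot exposes the slot and content meta-data defined above.
  $role = $slot->getRole();            // e.g. "main"
  $model = $slot->getContentModel();   // e.g. "wikitext"
  $address = $slot->getBlobAddress();  // e.g. "TT:123456" or "DB://cluster5/878675423"
}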

TBD: insertRevisionSlots needs to know the parent revision, so it can copy or link any content from the parent revision that was not changed in the present edit. The caller (WikiPage or Revision) typically already knows the parent revision ID, so we could pass it in, but that would pollute the interface, leaking implementation details for the sake of optimization.

Initial Implementation

The initial implementation of RevisionSlotLookup would just be a re-factoring of the current functionality. No schema changes are needed in the database. Only the main slot is supported. Implementation steps:

  • Move storage layer code for accessing revision content from Revision into RevisionSlotLookup.
  • Change Revision to use a RevisionSlot to access revision content.
  • The initial implementation of RevisionSlotLookup will rely on information from the revision table to provide meta-information about the main slot. Later, that information would be moved to a different storage schema.

(Code experiment: https://gerrit.wikimedia.org/r/#/c/246433/6)

Once the application code uses the new interface for accessing revision content, a new implementation can be created that uses the new database schema described below. For the migration period, we will need an implementation that can write to the old and the new schema at once.
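One possible shape for such a dual-writing implementation is sketched below. The class name, constructor arguments, and wiring are illustrative assumptions, not part of the proposal.

// Illustrative only: a slot store that writes to both the old and the new
// schema during migration, and reads from whichever schema is configured.
class DualWriteRevisionSlotStore implements RevisionSlotStore {
  private $oldStore; // backed by the legacy revision/text fields
  private $newStore; // backed by the new content/slots tables
  private $readNew;  // config flag: read from the new schema?

  public function __construct( RevisionSlotStore $oldStore, RevisionSlotStore $newStore, $readNew ) {
    $this->oldStore = $oldStore;
    $this->newStore = $newStore;
    $this->readNew = $readNew;
  }

  public function insertRevisionSlots( $revisionId, array $slots ) {
    $this->oldStore->insertRevisionSlots( $revisionId, $slots );
    $this->newStore->insertRevisionSlots( $revisionId, $slots );
  }

  public function getRevisionSlots( $revisionId, array $slots = null ) {
    return $this->readNew
      ? $this->newStore->getRevisionSlots( $revisionId, $slots )
      : $this->oldStore->getRevisionSlots( $revisionId, $slots );
  }
}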

Database Schema

Please refer to the Multi-Content Revisions/Glossary for a description of the entities modeled by this schema!

In order to allow multiple content objects per revision, the revision table needs to be split, so that the information about the content lives in a separate table, while information about the revision as such stays where it is. The structure of this table was discussed in the Create a content meta-data table RFC, in a somewhat different context.

Introducing the content and slots tables is the core feature needed for multi-content revision support. In terms of entities and relations, it allows revision content to be modeled as follows:

Old schema:
  [page] --page_current--> [revision] --rev_text_id--> [text] --old_text--> (external)
         <----rev_page----            
  
MCR without re-use of content rows:
  [page] --page_current--> [revision] <--cont_revision-- [content] --cont_address--> (text|external)
         <----rev_page----            
  
MCR with re-usable content rows:
  [page] --page_current--> [revision] <--slot_revision-- [slots] --slot_content--> [content] --cont_address--> (text|external)
         <----rev_page----            <--------------------cont_revision---------------------

The table structure is as follows:

 CREATE TABLE /*_*/content (
   cont_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
   cont_revision INT UNSIGNED NOT NULL, -- may need to become a BIGINT
   cont_role SMALLINT UNSIGNED NOT NULL,
   cont_model SMALLINT UNSIGNED NOT NULL,
   cont_format SMALLINT UNSIGNED NOT NULL,
   cont_address VARBINARY(255) NOT NULL, -- the content blob address, similar to what External Store does
   cont_logical_size INT(10) UNSIGNED, -- the logical size of the Content object
   cont_hash VARBINARY(32) NOT NULL, -- the hash of the serialized content (should this be called cont_sha1?)
   cont_deleted TINYINT DEFAULT 0, -- whether this content was suppressed or deleted
   PRIMARY KEY (cont_id)
 ) /*$wgDBTableOptions*/;
 
 CREATE UNIQUE INDEX /*i*/cont_revision_role ON /*_*/content (cont_revision, cont_role);

 CREATE TABLE /*_*/content_roles (
   cr_id SMALLINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
   cr_role VARBINARY(32) NOT NULL
 ) /*$wgDBTableOptions*/;
 
 CREATE UNIQUE INDEX /*i*/cr_role ON /*_*/content_roles (cr_role);

(Code experiment: https://gerrit.wikimedia.org/r/#/c/302056/ and http://sqlfiddle.com/#!9/0b847/7)

A cont_role field would be added to identify the role the content plays in the revision (e.g. main, style, categories, meta, blame, etc.). cont_role would reference a content_roles table defined in the same way as the content_models and content_formats tables proposed in phab:T105652 and phab:T142980. [TBD: instead of separate content_roles, content_models, and content_formats tables, a single names table could map arbitrary names to integers for compact storage.]
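For illustration, the lookup-or-insert that a helper like the assignId() function (mentioned in the migration plan below) would perform for a role name could look roughly like this; caching and race handling are omitted from the sketch:

 -- Look up the numeric ID for a role name, inserting it if it is missing.
 SELECT cr_id FROM content_roles WHERE cr_role = 'categories';

 -- Only if no row was found:
 INSERT INTO content_roles (cr_role) VALUES ('categories');
 SELECT LAST_INSERT_ID(); -- the newly assigned cr_id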

A content row can be uniquely identified by cont_revision and cont_role - that is, there can only be one content object per role in each revision.

In some cases, it may be sufficient to access the content (or slots, see below) table, and bypass the revision table completely. For instance, content (resp. slots) can be joined directly against page_latest to find the content relevant for showing the current revision of a page.

When joining against page_latest, rev_id, etc., cont_role will then have to be fixed (e.g. to the "main" role) to allow a unique match per revision.
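For example, the main-slot meta-data of a page's current revision could be fetched with a query roughly like the one below. This is a sketch against the basic schema above (where cont_revision points directly at the revision); the page title is a placeholder.

 -- Load the main slot of the current revision of a page, bypassing the
 -- revision table. Fixing cont_role to the "main" role makes the match unique.
 SELECT cont_address, cont_model, cont_format, cont_logical_size, cont_hash
 FROM page
 JOIN content ON cont_revision = page_latest
 JOIN content_roles ON cr_id = cont_role AND cr_role = 'main'
 WHERE page_namespace = 0
   AND page_title = 'Example';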

The auto-increment cont_id field is not strictly necessary unless we want to re-use content rows, see below. But being able to identify specific content with a unique id seems like a good idea in any case.

Re-using Content Rows

If we assume that it will be common for one stream of a page to be edited frequently (e.g. the main stream), while other streams are only updated rarely (e.g. categories), it seems inefficient to create rows in the content table for every slot of every revision. Instead, we want to re-use a row in the content table for multiple rows in the revision table. To allow this, we can introduce another table that records the association of content with revisions: the slots table (which was called revision_content in earlier versions of this proposal). The idea is that instead of relying on cont_revision (an n:1 relationship), we use a separate table to allow an n:m relationship:

  CREATE TABLE `slots` (
    `slot_revision` INT UNSIGNED NOT NULL, -- may need to become a BIGINT
    `slot_content` BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (`slot_revision`,`slot_content`),
    KEY `slot_content` (`slot_content`)
  ) /*$wgDBTableOptions*/;

Note that we still need the cont_revision field in the content table to track which revision introduced a specific bit of content.

Also note that multiple rows in the content table may refer to the same blob (that is, they have the same value in cont_address). So with this approach, there are two levels of indirection that allow re-use: revision -> content (n:m), and content -> blob (n:1).
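With the slots table in place, enumerating the (primary) slots of a revision becomes a join through slots into content, roughly as follows (with the role still recorded on the content row, as in the schema above):

 -- List all primary slots of revision 123456 via the n:m association.
 SELECT cr_role, cont_model, cont_format, cont_address, cont_logical_size, cont_hash
 FROM slots
 JOIN content ON cont_id = slot_content
 JOIN content_roles ON cr_id = cont_role
 WHERE slot_revision = 123456;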

An alternative design for associating content rows with revisions rows (TBD):

  CREATE TABLE `slots` (
    `slot_revision` INT UNSIGNED NOT NULL, -- may need to become a BIGINT
    `slot_content` BIGINT UNSIGNED NOT NULL,
    `slot_role` SMALLINT UNSIGNED NOT NULL,
    PRIMARY KEY (`slot_revision`,`slot_content`),
    KEY `slot_content` (`slot_content`)
  ) /*$wgDBTableOptions*/;
  
  CREATE UNIQUE INDEX /*i*/slot_role ON /*_*/slots (slot_revision, slot_role);

With this design, the cont_role field would be removed from the content table, and would be maintained as slot_role in the slots table instead. This is nicer semantically, since the role is really a property of the connection between content and revision, and the same content could in theory be re-used in different roles. However, in practice it seems unlikely that the same content would be re-used in different roles, and we would be repeating the role information for all slots in each revision, even for slots that were not modified. On the other hand, this setup allows us to enforce the constraint that each revision must only have one content object for each role via a unique index. Without the slot_role field, this constraint cannot be enforced directly in the database.

Multi-Content Archive

In order to allow multi-content revisions to be deleted and restored, content can no longer be stored directly in the archive table. Instead, rows from the content and slots tables will remain unchanged when a revision (or page) is deleted. They remain accessible via the ar_rev_id field.

Full migration to MCR requires the ar_text and ar_flags fields to be copied to the text table (or, if we can integrate ExternalStore immediately, into cont_address). This also requires any legacy rows in the archive table that have no ar_rev_id set to be initialized.

However, as an interim solution, it would be possible to set ar_text and ar_flags to NULL for new rows in the archive table, relying on the information in slots and content, while old archive entries continue to use ar_text and ar_flags directly.

Removing Redundant Information

Once we have the content table and have migrated existing data, the following columns in the old revision, archive, and page tables are redundant, and can be dropped:

ALTER TABLE revision DROP COLUMN rev_content_model;
ALTER TABLE revision DROP COLUMN rev_content_format;
ALTER TABLE revision DROP COLUMN rev_text_id;
ALTER TABLE archive DROP COLUMN ar_content_model;
ALTER TABLE archive DROP COLUMN ar_content_format;
ALTER TABLE archive DROP COLUMN ar_flags;
ALTER TABLE archive DROP COLUMN ar_text;
ALTER TABLE archive DROP COLUMN ar_text_id;
ALTER TABLE page DROP COLUMN page_content_model;

Since the old way of storing content_model and content_format was rather inefficient, we should free some space by doing this (even though the vast majority of these fields are currently NULL).

The following fields in the revision table do not become redundant, since they act as summary fields for all content objects of the revision:

  • revision.rev_len: the sum of the (logical) size of the content objects of the revision's slots.
  • revision.rev_sha1: the aggregated hash of the content objects of the revision's slots, i.e. sha1( content3, sha1( content2, sha1( content1 ) ) ). This way, the revision hash for a revision with only a single slot is the same as the slot's hash.

The purpose of these fields is to compare revisions, which would not be possible in an efficient way if they were removed: rev_len is used to indicate how much an edit added or removed, while rev_sha1 can be used to check whether two revisions have the same content (e.g. to identify edits that restored an old revision).

Note that rev_len and rev_sha1 would not cover any derived slots, if support for such was added.
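A sketch of the aggregation described above, operating on the per-slot hashes and logical sizes. The proposal fixes the nesting order of the hashes but not the exact combination function, so treat the concatenation used here as one possible reading; the function names are illustrative.

// Fold the slot hashes in order, mirroring sha1( content3, sha1( content2, sha1( content1 ) ) ).
// For a single slot, the result is simply that slot's own hash.
function aggregateRevisionSha1( array $slotHashes ) {
  $hash = array_shift( $slotHashes );
  foreach ( $slotHashes as $slotHash ) {
    $hash = sha1( $slotHash . $hash );
  }
  return $hash;
}

// rev_len is the sum of the logical sizes of the revision's slots.
function aggregateRevisionLength( array $slotSizes ) {
  return array_sum( $slotSizes );
}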

Scalability

One major concern with the new database design is of course scalability. On wikis like en.wikipedia.org, there are several hundred million revisions in the database, so we have to be careful to keep the model scalable while introducing the additional levels of indirection.

To get an idea of the scalability of the proposed schema, consider the following example, based on the numbers on Size of Wikipedia and site statistics of en.wikipedia.org:

Description | Assumed value when introducing MCR | Extrapolated for +10 million pages (~2 years) | Extrapolated for x2 pages (~8 years) | Extrapolated for x4 pages (~16 years) | Extrapolated for x4 pages, without MCR
p: page rows | 50 million | 60 million | 100 million | 200 million | 200 million
n: revision rows (avg 20 per page) | 1000 million | 1200 million | 2000 million | 4000 million | 4000 million
t: text rows (with External Store) | 1000 million | 1000 million (legacy only) | 1000 million (legacy only) | 1000 million (legacy only) | 4000 million
k: avg slots per revision | 1 by design | 1.33 incl. legacy, 3 for new revisions | 2 incl. legacy, 3 for new revisions | 2.5 incl. legacy, 3 for new revisions | 1
e: avg slots changed per edit | 1 by design | 1.08 incl. legacy, 1.5 for new edits | 1.2 incl. legacy, 1.5 for new edits | 1.375 incl. legacy, 1.5 for new edits | 1
c: content rows = n for old + n * e for new edits (most heavy: cont_hash and cont_address) | 1000 million | 1300 million | 2500 million | 5500 million | 4000 million (as part of the revision row)
r: slots rows = n for old + n * k for new edits | 1000 million | 1600 million | 4000 million | 10,000 million | -

Note that we assume that edits made after the conversion to MCR will on average touch 1.5 slots, and that pages will come to have 3 streams on average. For the extrapolation into the future, a doubling time of 8 years is assumed for the x2 and x4 factors, and a linear growth of 15k pages/day is assumed for the +10 Million column.

Efficiency

Since we will have one entry per revision and stream (resp. slot) in slots (perhaps 3 on average), it is going to be quite "tall", but since it is very "narrow" (only two or three integers per row), this will hopefully not be much of a problem. Since we will have one entry in the content table per revision and slot touched (perhaps 1.5 on average), it is somewhat taller than the old revision table. The content table is rather broad, since it contains the cont_address VARBINARY(255) and cont_hash VARBINARY(32) fields.

This implies that the largest scalability concern stems from the fact that we store blob addresses as URLs instead of an integer ID pointing to the text table. Considering, however, that with External Storage we already store these URLs in the text table for each revision, and will not do so with the new scheme, the new scheme should not need much more space for a single-slot revision than the old system.

Duplication of data for single-slot revisions is also a concern. This is particularly relevant since it affects all legacy revisions that get converted to the new schema. For a single-slot revision, the following fields are the same:

  • cont_hash VARBINARY(32) is the same as rev_sha1 VARBINARY(32) if there is only one slot. This is probably the biggest concern. Perhaps a more compact representation of the hash could be used, such as a BINARY(20) field, as suggested in [1].
  • cont_logical_size INT UNSIGNED is the same as rev_len INT UNSIGNED if there is only one slot.
  • cont_deleted TINYINT is the same as rev_deleted TINYINT if there is only one slot.

Additionally, there are some fields added that act as foreign keys, which introduces overhead:

  • cont_id BIGINT UNSIGNED
  • cont_revision INT UNSIGNED
  • slot_content BIGINT UNSIGNED
  • slot_revision INT UNSIGNED

Some fields contain truly new information:

  • cont_role SMALLINT UNSIGNED resp. slot_role SMALLINT UNSIGNED
  • cont_address VARBINARY(255) (Note: with External Storage enabled, this isn't a new field; the data just gets moved from the text table)

Since cont_address is the "heaviest" field the new scheme introduces, it is worth considering how it will behave for legacy revisions. When converting legacy revisions, cont_address will be set to an address that points to a row in the text table, e.g. "TT:123456" (if we are desperate to save bits, we can use a more compact encoding for the row ID). However, with External Storage enabled, the text table already contains a URL pointing elsewhere, something like "DB://cluster5/878675423". This ES URL can be moved to cont_address during migration (or later, on the fly), and the now redundant row in the text table can be deleted. Note that relevant old_flags need to be encoded in the address, perhaps like "DB://cluster5/878675423;utf8,gzip".
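A rough sketch of that rewrite for a single legacy revision stored in External Storage follows. The "URL;flags" address format is the suggestion from the paragraph above, and the flag handling is glossed over (in practice a script would filter old_flags rather than copy it verbatim).

 -- Move the External Storage URL of text row 123456 into cont_address.
 UPDATE content
 JOIN text ON old_id = 123456
 SET cont_address = CONCAT( old_text, ';', old_flags )
 WHERE cont_address = 'TT:123456';

 -- The text row is now redundant and can be deleted.
 DELETE FROM text WHERE old_id = 123456;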

Optimization

Note that the optimizations discussed below are considerations for the future; they are not part of this proposal.

The proposed schema offers some potential for optimization, which would make the schema less clean and obvious, but could improve performance.

Split cont_address: To reduce the storage requirements for cont_address, it would be possible to split off separate fields for common prefixes and suffixes, which would be stored as integer IDs, referencing a table that contains the actual strings, just like content_roles does for roles. For example, "DB://cluster5/878675423;utf8,gzip" could be split into "DB://cluster5/", "878675423", ";utf8,gzip", which would be saved as e.g. ( 834234, "878675423", 787432 ).

Avoid rev_user_text: with rev_content_model, rev_content_format, and rev_text_id gone, revision rows are already smaller. They could be further reduced by setting rev_user_text to NULL if rev_user is not 0. rev_user_text could be removed completely if rev_user was replaced with a 128-bit binary field: this could be used to represent IPv4 and IPv6 addresses directly, as well as to refer to user accounts using their internal ID encoded in a reserved IPv6 address range.
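Purely as an illustration of the idea, such a 128-bit value could be produced as below. The reserved prefix and the packing convention are made up for this sketch and are not part of the proposal.

// Illustration only: encode either an IP address or a user ID into 16 bytes.
// The reserved 8-byte prefix for user IDs is invented for this sketch.
function encodeRevisionActor( $userId, $ipAddress ) {
  if ( $userId !== 0 ) {
    // 8-byte made-up prefix + 64-bit big-endian user ID.
    return "\xfdMCRUID\x00" . pack( 'J', $userId );
  }
  // inet_pton() returns 4 bytes for IPv4 and 16 bytes for IPv6;
  // left-pad IPv4 so the column can be a fixed BINARY(16).
  return str_pad( inet_pton( $ipAddress ), 16, "\x00", STR_PAD_LEFT );
}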

Factor out rev_comment: the revision comment is potentially very wide, varbinary(767). Since it is not needed in all scenarios, it could be factored out into a separate table. Since revision comments often contain common prefixes, it may also be possible to split them in the same way suggested for cont_address above.

Fixed width revision rows: with the above optimizations of rev_user_text and rev_comment, and with rev_content_model and rev_content_format gone, the only remaining variable width field is rev_sha1, which could easily be converted to a fixed width binary(32) or even binary(20).

Partitioning: the revision, content, and slots tables could benefit from partitioning. Further investigation is needed to determine which criterion would be useful for partitioning, and what the goal of such a partitioning should be. Should the number of partitions used per query be minimized (improving locality) or maximized (improving parallelism)? Assuming we want to optimize access to "hot" data versus access to "stale" data [https://www.percona.com/blog/2010/12/11/mysql-partitioning-can-save-you-or-kill-you/], partitioning by blocks of rev_id resp. cont_revision would be the obvious choice. For optimizing locality, partitioning by modulo of page_id resp. rev_page would be an option, since it keeps information about all revisions of a page in a single partition.
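As a sketch only, partitioning the content table into blocks of cont_revision could look like the following. Note that MySQL requires the partitioning column to be part of every unique key, so the primary key and the unique (cont_revision, cont_role) index would need to be adjusted first; the boundaries below are arbitrary placeholders.

 -- Hypothetical range partitioning of content by revision-ID blocks,
 -- clustering "hot" recent revisions in the newest partition.
 ALTER TABLE content
 PARTITION BY RANGE (cont_revision) (
   PARTITION p0 VALUES LESS THAN (250000000),
   PARTITION p1 VALUES LESS THAN (500000000),
   PARTITION p2 VALUES LESS THAN (750000000),
   PARTITION p3 VALUES LESS THAN MAXVALUE
 );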

Migration Plan

This document describes a migration strategy for introducing the content table.

NOTE: This is intended as a guide for manual migration for large wikis, with millions of rows in the revision table. Wikis with only a moderate number of revisions can rely on the update.php script[*].

Phase 0: Create new tables

The following tables need to be created:

  • content
  • slots
  • content_models
  • content_formats
  • content_roles

Phase I: Fix Legacy Archive Rows

Populate empty ar_rev_id fields:

  • Determine how many rows in archive have ar_rev_id = NULL. Let's call that number m.
  • Reserve m (or m+k, for good measure) IDs in the revision table:
  • Make a note of max( max( rev_id ), max( ar_rev_id ) ), let's call it b.
  • Insert a row with rev_id = b+m+k into the revision table, and delete it again, to bump the auto-increment counter.
  • For any row in archive that has ar_rev_id = NULL, set ar_rev_id to a unique id between b+1 and b+m+k. This could be done via a temporary table, or programmatically (see the SQL sketch after this list).
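One possible way to carry out the reservation and assignment, sketched with placeholder numbers (b = 900000000, m + k = 1000); the user-variable trick is just one of the options mentioned above:

 -- Bump the auto-increment counter past b + m + k (equivalent to inserting
 -- and deleting a dummy row with rev_id = b + m + k).
 ALTER TABLE revision AUTO_INCREMENT = 900001001;

 -- Hand out the reserved IDs to archive rows that have none.
 SET @next_id = 900000000;
 UPDATE archive
 SET ar_rev_id = ( @next_id := @next_id + 1 )
 WHERE ar_rev_id IS NULL;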

Make ar_text and ar_flags unused:

  • For each row in archive that has a non-null ar_text field, insert a row into the text table, copying ar_text to old_text and ar_flags to old_flags. Set ar_text_id to the old_id from the newly created text row.
  • Set ar_text and ar_flags to the empty string everywhere (a per-row SQL sketch follows this list).
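For a single archive row, the copy could look as follows (assuming the archive table has an ar_id primary key; on schemas without one, the batch script would have to key on (ar_namespace, ar_title, ar_timestamp) instead):

 -- Copy the archived blob into the text table ...
 INSERT INTO text (old_text, old_flags)
 SELECT ar_text, ar_flags FROM archive WHERE ar_id = 42;

 -- ... and point the archive row at it.
 UPDATE archive
 SET ar_text_id = LAST_INSERT_ID(), ar_text = '', ar_flags = ''
 WHERE ar_id = 42;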

Phase II: Population

  • Set MediaWiki to write content meta-data to the old AND the new columns (via config[**]). Don't forget to also do this for new entries in the archive table.
  • Wait a bit and watch for performance issues caused by writing to the new table.
  • Run maintenance/populateContentTable.php to populate the content table. The script needs to support chunking (and maybe also sharding, for parallel operation).
  • Keep watching for performance issues while the new table grows.

Operation of populateContentTable.php:

  • Select n rows from the revision table that do not have a corresponding entry in the content table (a WHERE NOT EXISTS subquery is probably better than a LEFT JOIN for this, because of LIMIT).
  • For each such row, construct a corresponding row for the content and slots table[***][****]. The rows can either be collected in an array for later mass-insert, or inserted individually, possibly buffered in a transaction.
  • The content_models, content_formats, and content_roles tables will be populated as a side-effect, by virtue of calling the assignId() function in order to get a numeric ID for content models, formats, and roles.
  • When all rows in one chunk have been processed, insert/commit the new rows in the content table and wait for slaves to catch up.
  • Repeat until there are no more rows in revision that have no corresponding row in content. This will eventually be the case, since web requests are already populating the content table when creating new rows in revision.

The same procedure can then be applied to the archive table.
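The chunked selection step of populateContentTable.php could use a query along these lines (the column list and chunk size are illustrative):

 -- Fetch one chunk of revisions that do not yet have content rows.
 SELECT rev_id, rev_text_id, rev_content_model, rev_content_format, rev_len, rev_sha1
 FROM revision
 WHERE NOT EXISTS (
   SELECT 1 FROM content WHERE cont_revision = rev_id
 )
 ORDER BY rev_id
 LIMIT 1000;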

Phase III: Finalize

  • Set MediaWiki to read content meta-data from the new content table.
  • Set MediaWiki to not populate the ar_text and ar_flags fields.
  • Watch for performance issues caused by adding a level of indirection (a JOIN) to revision loads.
  • Set MediaWiki to insert content meta-data ONLY into the new columns in the content table. (To allow this, the old columns must have a DEFAULT).
  • Enable MCR support in the API and UI (as far as implemented).
  • Optional: Drop the redundant columns from the page, revision, and archive tables, see Removing Redundant Information above.

Phase IV: Migrate External Store URLs

If desired, we can migrate data stored in the External Store away from the text table: The External Store URL that is contained in the text blob can be written to the cont_address field (possibly with a prefix, to be decided, see External Store Integration). Then the corresponding rows can be deleted from the text table.

Footnotes

[*] update.php creates the tables, populates the content table using the same code as populateContentTable.php, and then drops the redundant columns.

[**] Instead of introducing new global settings for each schema migration, $wgMigrationSettings is introduced, with fields for each migration step, as needed. The above migration could use $wgMigrationSettings['content-meta-data/write'] and $wgMigrationSettings['content-meta-data/read'].
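As an illustration, the configuration during Phase II might look like this (the key names follow the examples above; the values are made up):

$wgMigrationSettings['content-meta-data/write'] = 'both'; // write old AND new columns
$wgMigrationSettings['content-meta-data/read'] = 'old';   // switch to 'new' in Phase III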

[***] This generally does not involve loading the content blob, except for cases where rev_sha1 or rev_len is NULL and needs to be computed.

[****] We could migrate text rows that contain ExternalStore URLs at this point, but it seems to be prudent to not do even more at the same time. That migration can be done on the fly, or done by script later.