Requests for comment/image and oldimage tables
image and oldimage tables | |
---|---|
Component | Database |
Creation date | |
Author(s) | Leucosticte, Krinkle |
Document status | accepted --Krinkle 23:09, 6 February 2017 (UTC) See Phabricator. |
Problems
[edit]- Database table image has no auto-incrementing primary key.
- Database table oldimage has no primary key, at all.
- File revisions as a concept (within the code and API) should have unique identifier. Right now we only have "current file title + upload timestamp". The lack of a unique id for file revisions leads to race conditions, makes the table hard to index and hard to query.
- Uploading file revisions should not involve moving rows across tables, or replacing entire rows.
Use cases
[edit]- End-user improvements
- T139294: Persistent media links for files - Right now we only have permalinks for non-current file revisions. The current file revision is only accessible by the canonical url for the file. This in contrast to page revisions, which always have a stable revision id.
- Enable creation of secondary database tables that associate data with files and/or their revisions (e.g. T33257: File properties and License integration). This is not feasible with the current schema due to a lack of primary keys.
- API Improvements
- prop=imageinfo - Use a proper file revision ID instead of fragile timestamp as identifier.
- Technical debt
- LocalFile::recordUpload2() - Eliminate timestamp kludge.
- Eliminate the need to query two tables when needing to find a file revision.
- Eliminate the need to move rows across tables when uploading a file. This is an anti-pattern in relational databases and makes certain improvements to database stability and performance hard or impossible (as found at Wikimedia Foundation).
- Stability: Having a separate file and filerevision table would solve data inconsistency issues. A request will resolve the pointer once and all other queries will fetch information about the same revision. It also makes it more attractive to query a replica instead of the master.
Possible solutions
[edit]1. Add primary keys
[edit]Rejected. This solution would be relatively easy to implement, but does not solve Problem 4.
2. Separate file and file revision
[edit]Advantages
[edit]- Eliminates the legacy handling of needing to interact with two tables for most operations (image and oldimage).
- Aligns the file tables with the design of the page and revision tables.
- Creates proper semantic separation between file entities and the representation of their revisions.
Effective change
[edit]- Add fields:
- img_id: New primary key for image.
- img_latest: Pointer to
oi_id
for the current revision of a file (similar to page.page_latest) - oi_id: Primary key of oldimage.
- oi_img: Pointer to
img_id
of the file this revision corresponds to (similar to revision.rev_page)
- Rename tables:
- Rename
image
tofile
. - Rename
oldimage
tofilerevision
.
- Rename
- Add missing rows in
filerevision
for current revisions, based on rows fromimage
. - Reduce fields in
file
to only those needed for current revisions only.- Currently indexed: Â
img_timestamp
,img_user_text
,img_sha1
,img_media_type
,img_major_mime
,img_minor_mime
,img_size
. - img_timestamp: Remove. Used for contribution history and creation of archive name. Replaced by filerevision.
- img_user_text: Remove. Used for contribution history, ApiQueryAllImages, SpecialMIMEsearch. Query from
filerevision
instead (using a join, if needed). - img_sha1: Keep. Used for duplication detection for current revisions.
- img_media_type, img_major_mime, img_minor_mime: Keep. Used by Special:MIMESearch for current revisions. A separate initiative may start after this RFC is approved and implemented to consider removing this in favour of a SearchEngine-based approach, and/or to find a way to efficiently from
filerevision
instead. - img_size: Remove. Not indexed. Used by checkImages.php (error check) and Special:MediaStatistics (SUM query). Can be queried by joining against
filerevision
instead. - img_width, img_height: Remove. Not indexed. Only store in
filerevision
instead. - img_bits: Remove. Not indexed. Not queried. Only store in
filerevision
instead. - img_description: Remove. Not indexed. Only store in
filerevision
instead. Can be queried there if needed. - img_user: Remove. Only store in
filerevision
instead. Can be queried there if needed. Used by ApiQueryAllImages and NewFilesPager (SpecialNewimages) to join againstuser
for filtering bots. Can join against filerevision instead. Though should probably use recentchanges.rc_bot instead.
- Currently indexed: Â
Migration strategy
[edit]Exact schema migration script to be written as part of the implementation and fine-tuned as needed during code review. We may need two separate strategies due to the size of the migration (one for the default db updater, and an opt-in maintenance script for large farms such as Wikimedia's). A few ideas so far:
- Tim Starling comment #2747454:
- Rename
image
tofilerevision
. Create a view calledimage
. - Add new fields to
filerevision
. - Create
file
table. - (Disable uploads.)
- Populate
file
fromfilerevision
. - Also move
oldimage
rows intofilerevision
. (Not be visible from theimage
view) - (Deploy new MediaWiki version that uses file/filerevision.)
- (Re-enable uploads)
- Drop image and oldimage.
- Rename
- Jaime Crespo mentioned the idea of potentially doing the migration offline while serving traffic from codfw instead of eqiad.
3. Replace with MCR
[edit]Migrate all file information to page and revision tables and migrate meta data and content references by using Multi-Content Revisions per T107595.
We could decide to stop using the image/file tables entirely, in favour of storing the information in the wiki page/revision tables, with file-backend references similar to what we do for ExternalStorage. We could keep "file" as a simplified secondary data table (similar to "categories").