Jump to content

User:Bawolff/metadata table

From mediawiki.org

As part of my GSOC project - enhancing mediawiki's metadata support - I might be splitting the image metadata into its own table. Currently metadata is stored as a serialized php array in a field of the image table, which doesn't lend itself to querying specific values.

Original idea

[edit]

This is what I'm currently thinking might be a good table design for a new image metadata table. The basic idea is to model it after the XMP model, since all the metadata we're interested in can be expressed as XMP, however this complicates things slightly as XMP supports nested structures, so a <page id>,<property>,<value> table will not suffice. See #Nested structures for how i plan to make this support nested structures.

All comments appreciated.

Related links wikipedia XMP article, the spec, my post to wikitech-l on this

Describe metadata;

+-----------------------+---------------------+------+-----+---------+----------------+
| Field                 | Type                | Null | Key | Default | Extra          |
+-----------------------+---------------------+------+-----+---------+----------------+
| meta_id               | int(10) unsigned    | NO   | PRI | NULL    | auto_increment |
| meta_schema           | varchar(255)        | NO   | MUL | NULL    |                |
| meta_name             | varchar(64)         | NO   | MUL | NULL    |                |
| meta_value            | mediumblob          | YES  | MUL | NULL    |                |
| meta_type             | varchar(8)          | NO   |     | 0       |                |
| meta_page_id          | int(8) unsigned     | NO   | MUL | NULL    |                |
| meta_qualifies        | tinyint(1) unsigned | NO   |     | NULL    |                |
| meta_ref              | int(10) unsigned    | YES  |     | NULL    |                |
+-----------------------+---------------------+------+-----+---------+----------------+

Fields

[edit]

meta_id

[edit]

primary key, auto-incrementing integer

meta_page_id

[edit]

meta_page - foreign key for page_id - what image is this for.

meta_type

[edit]

meta_type - type of entry - simple value or some sort of compound structure.

XMP supports ordered/unordered lists, associative array type structures, alternate array's (things like array's of the value in different languages).

Possibly should use an enum for this instead. possible values are

  • simple - value is a string in meta_value field
  • seq - value is an ordered list in other rows (see below for description of nested stuff)
  • bag - unordered list in other rows
  • altLang - value is translated into multiple languages, with the value spread out over multiple rows
  • alt - alternative array (multiple choices, in other rows).
  • struct - nested structure, somewhat akin to an associative array. value is present in other rows
  • Also possibly include url, as its handled slightly different in XMP - not sure.

Nested structures are described below.

meta_schema

[edit]

meta_schema - xmp uses different namespaces to prevent name collisions. exif properties have their own namespace, IPTC properties have their own namespace, etc. We can define our own namespaces if we need to store other info that currently doesn't have a namespace.

This property would be a URI

meta_name

[edit]

meta_name - The name of the property.

For array like structures that don't really have a name:

  • for unordered lists (bag's as xmp refers to them) its name is the lists name with a [] appened to the end
  • for ordered lists (seq's as xmp refers to them) is same deal but with [item-number] at end
  • for alternative-lang lists append [lang-name] to end
  • for alternative lists (non-lang) append [] at the end, and [*] for the default.

I believe [] is an invalid metadata property name, but i should double check that.

See below for how the nested structures would work

meta_value

[edit]

The properties value, or null if it has children. See below for nested stuff in more details

meta_ref

[edit]

a reference to a meta_id of a different row for nested structures, or null if not applicable (or 0 perhaps).

This connects child properties to their parent. It is also used to specify which property a qualifier would qualify.

See below for how this parameter works.

meta_qualifies

[edit]

meta_qualifies - boolean to denote if this property is a qualifier.

In xmp properties can be normal properties, or they can be qualifiers that modify the meaning of its parent property. These types of properties seem rare in actual practise.

A qualifier will have the meta_ref field pointing to the field it refers to

Nested structures

[edit]

Basically for nested structures the parent has a row with the meta_value field null, and the child rows have the meta_ref field pointing to its parent, and have the same name as their parents but with [] characters (of some form) appended to the end of the name. For example an unordered list named foo associated with the image with a page_id of 112, containing the values bar baz and fred would look like this:

meta_idmeta_namemeta_valuemeta_schemameta_typemeta_refmeta_qualifies meta_page_id
7fooNULLhttp://example.combagNULL0112
8foo[]barhttp://example.comsimpleNULL7112
9foo[]bazhttp://example.comsimpleNULL7112
10foo[]fredhttp://example.comsimpleNULL7112

If you want to get all the values of foo, you can do SELECT meta_value FROM metadata where meta_name='foo[]' and meta_page_id=112, however this still retains the structure, so if you wanted to output the xmp, you could SELECT * from metadata where meta_page_id=112 re-connect it in the php, and output the XMP.

Other notes

[edit]

The design goals of this table are:

  1. First and foremost, it should be fast and efficient to get the information necessary to make the box on the image description page
  2. It should be able to store any type of image metadata we wish to, or could wish to in the future.
  3. We should be able to make queries for specific properties (get a list of images that were taken with a flash, etc)

Since XMP is pretty much the most complex metadata format for images, and is extensible, allowing the other formats to be stored within it, I think it makes sense to make the table match fairly closely with the XMP data model. This allows us to store all the various types of metadata we could want, and is fairly future proof.

Nested structures are generally not efficient when stored in a database, but in this case I think its ok (but I am new to scalable table design, so hopefully someone with more knowledge will confirm or reject that). Consider the use-cases. The first use-case is displaying the relevant metadata on an image description page. You SELECT * from metadata WHERE meta_page_id = "whaterver your image id is";. Now for most simple properties you don't even have to worry about nesting. the exif property DateTimeOriginal is only going to appear once for each image, so don't even bother rebuilding the tree. Some properties, like images described in multiple languages you may need to reconnect the tree, but even still you just need to reconnect one small part of the tree, which (I assume) can be done rather efficiently in php (and potentially you might not have to, because you can rely on the children being named things like description[fr]). The other use case of getting a list of page_ids that match some metadata property is very easy as its just a straight select. From what I understand most inefficiencies of putting nested structures into db's is due to using multiple joins to connect it all back together. This design doesn't need to use joins to reconnect the structure back together again, so hopefully that makes it acceptably scalable.

Things to figure out

[edit]
  • should it have a meta_lang field.
  • should urls be a seperate type
  • double check that [] really are illegal in xmp metadata property names

revised plan

[edit]

After getting feedback on the mailing list, it seems like my original plan was more complicated then need be. It was pointed out that we do not need to store all metadata potentially contained in the file, as the only useful metadata is the metadata we understand. So the revised plan is a hybrid approach:

  • Store metadata (Especially anything very nested) as serialized blob of php in img_metadata
  • Expose specific pieces of interesting bits of metadata in a new simple table.