Talk:Analytics/Reportcard/2.0/Database design

otto comments

Latest comment: 12 years ago14 comments2 people in discussion

MySQL...or something new?

Is this for

One would presume so. Sqlite isn't going to scale, and we don't use Oracle/Postgres/MongoDB erc. Reedy 23:43, 12 January 2012 (UTC)Reply

MySQL table

change column type date to datetime

If you don't need to store dates before 1970, I recommend using TIMESTAMP for all date/time columns. They are stored as integers but work with MySQL date functions

OK, will change that.Drdee 01:44, 13 January 2012 (UTC)Reply

It'd be saner if this is still mediawiki based to use "binary(14)" for the timestamp. Reedy 23:27, 12 January 2012 (UTC)Reply

Disagree. Not sure why you would want to forfeit MySQL's date/time query functions, all charts will have a time component (x-axis). In addition, storing it as BINARY makes face validation harder. Drdee 00:54, 13 January 2012 (UTC)Reply

set engine to Innodb instead of MyISAM.

Otto: if we are planning on doing lots of updates, then definitely. If not, then MyISAM is probably fine. Also, if we use MyISAM then we could also use a MERGE table to aggregate a bunch of these smaller tables. As long as we are pretty sure the structure of this table will rarely change, keeping lots of smaller tables and then making them look like one big table is very useful. A rotator script could be responsible for creating a new data table (once a week? once a month?), and then altering the MERGE table UNION definition to include the new table. INSERTs will be inserted into the last table in the UNION list, and SELECTs on the MERGE table will select across all of the underlying tables and aggregate the results.

See: http://dev.mysql.com/doc/refman/5.0/en/merge-storage-engine.html

My understanding is that InnoDB has become the default and that performance wise they are very close. Drdee 02:03, 13 January 2012 (UTC)Reply

OTTO: InnoDB has a lot of advantages if you are doing a lot of updates/deletes on the table. It doesn't do full table locks like MyISAM usually does. But! As long as you never modify the existing data (like with log tables), then MyISAM doesn't do full table locks:

From: http://dev.mysql.com/doc/refman/5.0/en/internal-locking.html

The MyISAM storage engine supports concurrent inserts to reduce contention between readers and writers for a given table: If a MyISAM table has no free blocks in the middle of the data file, rows are always inserted at the end of the data file. In this case, you can freely mix concurrent INSERT and SELECT statements for a MyISAM table without locks. That is, you can insert rows into a MyISAM table at the same time other clients are reading from it. Holes can result from rows having been deleted from or updated in the middle of the table. If there are holes, concurrent inserts are disabled but are enabled again automatically when all holes have been filled with new data.

Also:

Generally, table locks are superior to page-level or row-level locks in the following cases:

Most statements for the table are reads

Statements for the table are a mix of reads and writes, where writes are updates or deletes for a single row that can be fetched with one key read:

Many scans or GROUP BY operations on the entire table without any writers

OTTO: Ah! I was under the assumption that this was going to be a HUGE log table with records for each data point. Pssh, this is aggregate! I take back what I just wrote! InnoDb is fine!

add view column which is of type ENUM

I like ENUMs when either the table size is going to stay small, or we can be 100% certain that we will never need to alter the ENUM column to add or remove values. Could we use a separate join table instead? So, somethinglike:

So instead of this in the data table

 `view` char(9) NOT NULL ENUM('media', 'mobile', 'editors', 'offline', 'targets', 'wikistats', 'pageviews'),

Have this

 `data_view_id` int(9) NOT NULL,

And then

 CREATE TABLE data_view (`data_view_id` int(9) NOT NULL, name varchar(256) ); 
 INSERT INTO data_view(name) VALUES('media', 'mobile', 'editors', 'offline', 'targets', 'wikistats', 'pageviews');

Yes, we could that. However, I am not sure if we can put an index on a foreign key and I want to avoid JOINS. From a performance point of view, ENUM combined with InnoDB is the fastest http://www.mysqlperformanceblog.com/2008/01/24/enum-fields-vs-varchar-vs-int-joined-table-what-is-faster/. The only real downside, IMHO, is adding new values to ENUM as that requires an ALTER TABLE statement. However, I think we know quite well which theme's we want to display in the dashboard so I am not too concerned about having to change this often. Drdee 02:04, 13 January 2012 (UTC)Reply

OTTO: Yeah, but I think in this case we don't really need a join. Actually, we don't even really need the other table at all, as long as the data_view_id is well defined somewhere (in code?). An ENUM is just an integer with string aliases for the integer values. The disadvantage is you can only have as many values as you have names. If you just use an integer, then you can have maxint values. The index goes on the data_view_id in the data table, which is just an integer.

OTTO: Ack! Same! This is an aggregate data table, it won't grow so huge. ENUM is fine for this column.

Don't use a enum. You've got to waste time changing it to add other stuff with invariably will be wanted in future. Use a text based field and save yourselves the hassle Reedy 23:19, 12 January 2012 (UTC)Reply

Actually, we have a good understanding of what themes we want to present in the reportcard and this gives a safeguard against adding new themes that are not supported by the frontend. As the size of the database will be small, 100k's of rows max, ALTER statements are not that bad either. Drdee 00:02, 13 January 2012 (UTC)Reply

Don't implement database referential integrity either. Of course you can index a FK... Reedy

Not sure what this refers to. Drdee 00:02, 13 January 2012 (UTC)Reply

`value` bigint(9) unsigned DEFAULT 0,

Are you always storing integers? maybe a BLOB type here would be more appropriate for free form data?

To keep it simple, I would say let's assume we will only store integers. That allows us to do sum, max, min and other statistical queries. Drdee 01:36, 13 January 2012 (UTC)Reply

OTTO: MMmk!

I wouldn't assume that. Reedy 23:37, 12 January 2012 (UTC)Reply

Why not? What kind of metrics would generate decimals, floats, that we would want to store in the database? Everything is either pageview-related or a count of something. Drdee 23:55, 12 January 2012 (UTC)Reply

rename table name to data

I am a huge fan of long unambiguous names. 'data' is probably appropriate here, because this is, um, 'data', but then again, what isn't data? How about something like 'log_data', or 'log' or 'analytics_data' or something?

Yes, you are right, I'll come up with something more descriptive.Drdee 01:37, 13 January 2012 (UTC)Reply

PRIMARY KEY (`dt`,`project_code`,`language_code`,`view`, `key`)

I believe that all tables should have a single (probably auto increment) unsigned integer primary key. You can make this combo a UNIQUE KEY instead.

maybe we should store the hashed value of `dt`,`project_code`,`language_code`,`view`, `key` and index that, and have an ordinary primary key. Drdee 01:36, 13 January 2012 (UTC)Reply

OTTO: Oo, I like, sortaaa.

Is there a reason not to have an simple integer primary key? Even if you don't need it in the data, it is useful to have them for maintenance purposes. If things go bad with certain bits of data you can work on restoring with ranges on primary key. It is way harder to add back in later, and doesn't hurt to have.

What's the point of this key? Just to ensure that these values are unique? If so, then the hashed unique field could be a good idea anyway, even if it is not the primary key. MySQL will probably have less trouble maintaining this single column index than the composite one.

Why

Latest comment: 12 years ago1 comment1 person in discussion

Why can't we use this like a normal MediaWiki talk page!? Reedy 23:19, 12 January 2012 (UTC)Reply

Teach me! Otto 16 January 2012

language_code length

Latest comment: 12 years ago1 comment1 person in discussion

4 characters is not enough... Longest we have I think is "zh-classical" or "wikimania2012" if we include those as "languages" Reedy 23:25, 12 January 2012 (UTC)Reply

Drdee Fair enough, bumping length to 12 but I will verify this.

Moar indexes

Latest comment: 12 years ago6 comments3 people in discussion

Chances are you're going to be wanting to filter by key, by view, by project, by language code... Going to need a few more indexes there Reedy 23:26, 12 January 2012 (UTC)Reply

Drdee 23:53, 12 January 2012 (UTC) Well, we can always add more indexes :)Reply

Additional clarification, we might not need extra indexes. Two reasons:

There are no defined use-cases for querying a metric just by language, or querying by all metrics for a single project.
The column 'hash' (which is a poor name) contains the hashed value of: timestamp, project, language_code, view, key and so you can really fast read all the values for the combination of a particular project, timestamp, language_code, view and key. You can repeat this for different projects to combine metrics in a single chart.
Adding extra indexes is still cheap, not worried about it

70.28.63.126 00:44, 13 January 2012 (UTC)Reply

On the current implementation, there was apparently a need for filtering by region, country... Reedy 00:47, 13 January 2012 (UTC)Reply

Yes, and that's still possible right? You just need to specify a metric, you don't want to have all metrics for one country in shot. And even if we need that use case then we just add that index. Drdee 01:00, 13 January 2012 (UTC)Reply

Final thought, actually the current index has the exact same functionality as the new proposed index. No change there. In the current design the index is on the combined value of date, project, language,metric and the table restricts it to a particular view. The new index is based on the combined value of timestamp, project, language, view and metric. So, no change here. Drdee 01:35, 13 January 2012 (UTC)Reply

project codes

Latest comment: 12 years ago2 comments2 people in discussion

2 characters is going to end up with clashes, ie Wikispecies, Wikisource.

4 would keep with more common used codes, meaning we don't need arbitary mapping back and forth. And with a public facing API, it has sense to be kept with what the community is used to...

See m:Interwiki map, we have "wikt"

How are we going to do Commons, Meta, MediaWiki Wiki, other *.wikimedia.org domains....

Reedy 23:34, 12 January 2012 (UTC)Reply

Not sure we need the meta:Interwiki_map for this project, you are right that the field should be bit longer to accommodate commons. I will verify longest project name. AFAIK, Mediawiki Wiki is outside the scope of the reportcard project. Drdee 23:58, 12 January 2012 (UTC)Reply

Status of this?

Latest comment: 12 years ago2 comments2 people in discussion

What's the status of this database design? Can it be fixed? It has a number of issues (missing field prefixes, weird indices, it uses enum...). Not really worth fixing if it's already deployed, though. It'd be nice to know the status here. --MZMcBride 01:23, 13 January 2012 (UTC)Reply

It is a proposal for the redesign of the database used for the new reportcard. Drdee 01:26, 13 January 2012 (UTC)Reply

Key/value pairs

Latest comment: 12 years ago1 comment1 person in discussion

I also considered using key/value pairs initially. Obvious advantage, as is part of the current rationale for change, is that new data streams can be added without schema change. And the data are indeed all similar is resolution, monthly counts. However there are drawbacks as well:

The interface becomes more abstract and less self-documenting, which make queries harder to maintain, I guess.
Currently performance is no issue at all, all tables are tiny really. If larger datasets are added it could be more efficient to retrieve one record with 10 fields than 10 separate key/value pairs.
The current design makes creation of SQL views (virtual tables) easy. Those were already prototyped, but then it turned out that we don't do views in Mediawiki for some compatability reason. (so that difference is mute)

Yet I wonder: one could structure almost any SQL database as key/value pairs, yet AFAIK this is not done often. Is this mapreduce design filosophy ported back in time, perhaps to ease migration?

Yes, ease of migration is definitely a consideration for this new design. I am not sure if interface becomes less self-documenting. In original case you would write SELECT 'metric_name' FROM 'table_name' WHERE id='foo', now you would write SELECT * FROM aggregated_data WHERE key = 'metric_name' WHERE hash='foo'. Drdee 21:27, 13 January 2012 (UTC)Reply

Comments on current version

Latest comment: 12 years ago1 comment1 person in discussion

* Merge different tables in one single table
* change CHARSET to UTF8
* change column type date to timestamp
* set engine to Innodb instead of MyISAM. 
* add view column which is of type ENUM
* change size of extension field to 4
* change size of language_code field to 4
* change bigint to bigint unsigned as 'value' will always be 0 or large
* rename binaries to value
* rename table name to aggregated_data

CREATE TABLE `analytics_aggregation` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `hashed_key` integer NOT NULL,
  `trait_instance_key` varchar(512) DEFAULT NULL COMMENT 'hashed_key is generated from this more descriptive key',
  `count` bigint(9) unsigned DEFAULT NULL,
  `project` char(32) NOT NULL COMMENT 'Wikimedia project code',
  `language_code` char(4) NOT NULL COMMENT 'Wikimedia language code, usually derived from project URL',
  `aggregate_timestamp` timestamp NULL DEFAULT NULL COMMENT 'timestamp for which this aggregate counts data',
  `granularity` integer NULL DEFAULT NULL COMMENT 'time granularity in seconds for which this aggregate counts data ',
  `theme` enum('readers','editors','devices','articles','diversity','ecosystem','files','context') DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `hashed_key` (`hashed_key`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Since the last time, a few things have improved, but this schema still needs some work.

All columns need a common prefix. It's good form and it's standard for MediaWiki development. Something like aa_ or agg_ or something.
I always advise any project against using project/language code pairs. It may be unavoidable here, but it'd certainly be a shame. Simply put, they don't work. There are all sorts of edge cases (strategy.wikimedia.org, commons.wikimedia.org, etc.) where language code doesn't make any sense and ends up being either wrong or misleading. The wikis are separated by domain or by database name. If either of those can be used, it'll save a lot of headache. What content language is associated with a particular wiki is a task for another table.
Do you understand the implications of using a utf8 charset? Particularly MySQL's incomplete UTF-8 support?
I know others have already mentioned it, but the enum really ought to go. It looks like a small set of buzzwords currently that could change at any time going forward. Any reason not to normalize that data in a separate table? Using very particular terms like "diversity" or "articles" is just asking for trouble, frankly. Please don't do it.
Looks like the "ts" column has already been properly expanded. Good work on that.
The comments are all out-of-date and wrong now. But before they were extra-wrong, they included something about an extension column. Not sure that ended up in the new schema. Kill or clarify the comment?
Comments should ideally be in -- form using MySQL syntax. At some point. Not sure it's necessary right now, but it can't really hurt. Follow the style of tables.sql, please.
When adding expanded comments, you can clarify the purpose / intended values for the granularity column. :-) Same with the count column. Most of the comments above the new schema can be eliminated. Half are parts of a SVN commit message; the other parts should be inline MySQL comments.
Indexes?

--MZMcBride (talk) 05:07, 16 February 2012 (UTC)Reply