Topic on User talk:Anomie/Abstract schema

Implementation format

8 comments • 15:16, 30 May 2018 6 years ago

8

Legoktm (talkcontribs)

Since I've been poking at the installer lately, I'd like to suggest that if it's possible, the format for the schema definition should be a non-executable format (not PHP, yes JSON/XML) so that the installer can read it without executing any code. Currently the installer autoloads all classes of an extension, and runs the schema change hook. So if that hook checks a configuration variable, it won't be set. There are plenty of other scenarios where this implementation method falls apart.

In my ideal world the installer could just read the schema out of a JSON file (or your favorite format), so there's no need to half-load extensions like we currently do. It doesn't need to be in extension.json, maybe a parallel dbschema.json that extension.json points to.

Reply 15:26, 17 April 2018 6 years ago

Skizzerz (talkcontribs)

I think the implementation would need to consist of two pieces:

JSON schema (as described on the page) which core and extensions use to define tables.
On the PHP side, we process the schema and then pawn it off to Doctrine's DBAL library to handle the actual SQL generation -- gives us less to maintain on the DDL side of things

Using Doctrine directly means that the schema files would need to be PHP instead of JSON, which runs into the issues you've mentioned in the installer and upgrader. As such, I think we still need an abstraction layer on top of Doctrine. If this sounds good, I can work that into the actual page and then we're probably at the point where we can kick off the actual RFC.

Reply 16:54, 28 May 2018 6 years ago

Anomie (talkcontribs)

I never planned to use Doctrine.

I haven't looked too closely at Doctrine, but any schema abstraction layer seems very likely to be closely tied to its own database abstraction layer. Unless we plan to replace MediaWiki's Database class with Doctrine (a far bigger project), I suspect trying to use it only for schema abstraction would have issues.

See for comparison some of the issues identified for cakephp's abstraction in Topic:Uamlkn3gywc8kdds.

Reply 13:54, 29 May 2018 6 years ago

Skizzerz (talkcontribs)

Doctrine's schema layer purports to be wholly disconnected from the rest of the ORM, so that it can be used alongside other things. I think it brings a lot to the table, mainly by allowing us to not need to create all of that migration logic ourselves. Furthermore, using Doctrine would allow for downgrades/rollbacks to previous versions, which if we rolled our own would likely not be supported at least initially simply because it'd be extra engineering effort for very little benefit.

To address the points you listed in the other topic:

This layer has functions that output the SQL to run, and it's up to you to take those SQL statements and run them through whatever db layer you use. This allows deploying that SQL using our existing Database classes for normal installs/upgrades as well as generating .sql files for deployment on larger clusters.
The varchar vs varbinary thing looks to be an issue. We'd have to build our own custom MySQL platform (extending the shipped one) to override the method that generates string fields to return varbinary instead of varchar (ditto for blob vs text on the text fields).
It has native support for binary/varbinary/blob fields but these are mapped to PHP resource streams rather than strings. This would be good for the cases where we're storing true binary data in a column as opposed to utf-8 text.
Translating connection configuration is straightforward given that we're taking it in as separate variables.

In the long term, I would like to replace MediaWiki's Database classes with something that is less MySQL-centric (the way queries are built by concatenating strings works great when you assume everything works the way it does in MySQL, but supporting other dbmses requires nasty fragile hacks in a lot of cases using regex replacement and whatnot). But, that is outside the scope of this RFC, and it is likely better served by upgrading the internals of our Database class to put things together in a more structured manner rather than putting together a SQL string piecemeal rather than replacing it wholesale with something else (keeps largely the same API for backwards compat, doesn't throw away all of our institutional knowledge of what is actually needed by MediaWiki, etc.)

Reply 14:48, 29 May 2018 6 years ago

Anomie (talkcontribs)

I suspect the "nasty fragile hacks in a lot of cases using regex replacement" are not actually required, but are an artifact of people in the distant past taking the easy way to try to get something that functions for an immediate use case with minimal work put into writing the code.

I'm a bit skeptical of claims about rolling back a schema change. How do you roll back the data loss when a table or column is dropped? ... Looking at their documentation, it seems it works by simply having you write both the schema change and the "undo" schema changes, separately, and handling lost data is up to the implementor. If you don't fill in the down() method, the "rollback" won't work.

The "PHP resource streams" thing doesn't seem particularly relevant. First, that seems to be part of their DB abstraction rather than their schema abstraction. Second, PHP strings are already inherently binary, so "resource streams" aren't really needed anyway.

Reply 15:53, 29 May 2018 6 years ago

Skizzerz (talkcontribs)

In mssql, the two main hacks were to support LIMIT/OFFSET and GROUP_CONCAT. I've reworked the LIMIT/OFFSET code to make use of a new feature in SQL Server 2012 that greatly reduces the hackiness (patch still pending and it still requires a bit of regex), but the GROUP_CONCAT hack will need to remain until MediaWiki stops relying on this MySQL-specific feature. The main issue with the original LIMIT/OFFSET was that the method is passed a string and is expected to return a string, rather than getting anything more structured. The method basically assumes that the limit/offset stuff goes at the end of the string, which is simply not true in that case; I needed to actually move clauses around and mess with subqueries whenever offsets are involved. There are other ways to implement offsets, but all of them require wrapping the query into a subquery or CTE and moving or copying the select list.

I don't see any particular need or requirement to be able to roll back changes, which is why I said that should we make our own, that feature would likely not be included.

Reply Edited 16:24, 29 May 2018 6 years ago

Anomie (talkcontribs)

I don't think you can claim a concatenating aggregate function is a MySQL-specific feature, since every other database supports it too (although with no standardized syntax). BTW, it looks like MSSQL implemented it too in SQL Server 2017, as STRING_AGG.

For older versions, would it be possible to make a workalike with CREATE AGGREGATE? Or does that sort of thing always require external DLLs that we couldn't include with MediaWiki?

Reply 14:20, 30 May 2018 6 years ago

Skizzerz (talkcontribs)

Yes, that requires an external dll, which requires sysadmin (root) level access to mssql to enable the loading of external dlls as well as to run the create aggregate statement. In any case, this conversation is diverging a lot from the purpose of this talk page (discussing the RFC). If you want to talk more about our existing database abstraction layer and its shortcomings in terms of non-MySQL support, feel free to poke me on IRC :)

Reply 15:16, 30 May 2018 6 years ago

Reply to "Implementation format"