
Topic on Talk:Requests for comment/DataStore

IRC meeting 2013-10-02

Tim Starling (talkcontribs)

<TimStarling> I think Brion has already said something favourable about this
<TimStarling> I don't see it on the page, maybe it was in an in-person meeting
<gwicke> I like the general idea of having a key/value store available without creating extra tables
<legoktm> what gwicke said
<mwalker> I voiced in the comments that I think this should have some sort of defined structure per key -- that way we can have a unified upgrade process (like we have with a database) and also have a way of filling in initial values
<yuvipanda> +1
<mwalker> otherwise I failed to see the difference between this and just using memcache
<legoktm> this would be persistent
<TimStarling> well, persistence
<legoktm> memcache isn't
<gwicke> in distributed storage range queries are not free, so it might make sense to make those optional
<gwicke> similar with counters
<TimStarling> mwalker: so you're thinking of some sort of schema definition for values?
<mwalker> yes
<mwalker> that way you have a defined upgrade / update process
<MaxSem> schema definition: serialize( $struct )
<mwalker> MaxSem: how do you handle a multiversion jump though? if the structure has evolved and suddenly you don't have the data you expect
<mwalker> you can handle that in the consuming code of course -- but that's a lot of boilerplate that I think is redundant
<TimStarling> mwalker: what do you imagine the upgrade process would be?
<MaxSem> if you want schemas and upgrades, it's a good reason to use MySQL tables
<legoktm> mwalker: i think that's something that the extension needs to handle, with proper deprecation
<legoktm> and migration
<gwicke> mwalker: all you need is a way to traverse the keys and update all values I guess
<mark> some update handler per key/value on fetch?
<gwicke> you can have a version key in each JSON blob you store for example
<mark> supported by the extension
<mwalker> could do it on fetch, or could do it in a maintenance script
<mark> whichever comes first
<TimStarling> mwalker: how would a schema help with upgrading? what boilerplate would be abstracted exactly?
<gwicke> we might want different kinds of key/value stores: those that are randomly ordered and only support listing all keys, those that are ordered and allow efficient range queries, and those with special support for counter values
<mwalker> TimStarling: I imagine that this will probably be abused to store dicts and arrays -- if we know what we're coming from and going to, we can define transforms for the old data into the new
<TimStarling> the requirement for prefix queries does appear to limit the backends you could use
<gwicke> yes, or at least it creates extra overhead for those that don't need the feature
<legoktm> mwalker: i don't think storing an array is abusing the feature ;)
<TimStarling> mwalker: abused?
<MaxSem> gwicke, if you don't want to use prefix queries, don't use them
<mwalker> TimStarling: the examples given in the RfC are simple values
<gwicke> MaxSem: yes, that's why I propose to have different key/value storage classes
<MaxSem> because there can be multiple stores, you can always make some assumptions about the store you're using
<mwalker> I say abused because I see no provision for dealing with more complex values (which is what I'm proposing :))
<gwicke> /ordered-blob/ vs /blob/ for example
<TimStarling> mwalker: maybe you misunderstood MaxSem then, because he just said he thinks values should be serialized with serialize()
<mark> it would probably be good to classify those different stores in the RFC, define the ones likely needed
<yuvipanda> mwalker: perhaps add more data types to the RFC? Lists and Hashes, maybe. I guess different stores can define different datatypes that they support
<gwicke> mark: I have some notes at https://www.mediawiki.org/wiki/User:GWicke/Notes/Storage#Key.2Fvalue_store_without_versioning
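
A rough PHP sketch of the storage-class split gwicke describes above (plain blobs, ordered stores with prefix/range queries, counters); the interface and method names here are hypothetical illustrations, not part of the RFC:

    <?php
    // One capability per storage class, so a backend only has to implement
    // what it can actually support efficiently.
    interface BlobStore {
        public function get( $key );
        public function set( $key, $value );
        public function delete( $key );
    }
    interface OrderedBlobStore extends BlobStore {
        // Range/prefix listing is only promised by ordered backends.
        public function listKeys( $prefix, $limit = 100 );
    }
    interface CounterStore {
        // Counters as their own class, per the "special support for counter values" idea.
        public function increment( $key, $by = 1 );
    }

A caller would then ask a factory for the weakest class it can live with, e.g. only requesting something like an 'ordered-blob' store when it actually needs prefix queries.
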
<mwalker> TimStarling: yes; but serialization doesn't solve the problem of knowing what's in the structure
<MaxSem> mark, the proposal comes with a skeleton code for an SQL store and has a Mongo as another example
<mwalker> if you serialize a php class for example -- deserializing it into a class with the same name but different structure gives very unexpected results
* gwicke lobbies for JSON over serialize()
<TimStarling> I imagine it would be used like the way memcached is used
<mark> yeah, nothing too PHP specific ;)
<MaxSem> gwicke, doable
<TimStarling> i.e. avoiding objects wherever possible, primarily serializing arrays, including a version number in the array
<MaxSem> :)
<TimStarling> when you fetch a value with the wrong version, the typical response in memcached client code is to discard it
<TimStarling> with persistent data, you would instead upgrade it
<gwicke> MaxSem: ok ;)
<TimStarling> that upgrade could be done by some abstracted schema system
<TimStarling> or it could be done by the caller, correct?
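
A minimal sketch of the pattern TimStarling describes here, combined with mwalker's per-version transforms; the helper name, key layout and store methods are assumptions, not settled API:

    <?php
    // Values are plain arrays carrying their own schema version, as with
    // memcached values; on a version mismatch a cache would discard, but a
    // persistent store upgrades the value and writes it back.
    const WIDGET_VERSION = 3;

    // mwalker's idea: a declared transform from each old version to the next.
    $upgrades = [
        1 => function ( array $v ) { $v['limit'] = 50; return $v; },
        2 => function ( array $v ) { $v['flags'] = []; return $v; },
    ];

    function fetchWidget( $store, $key, array $upgrades ) {
        $value = $store->get( $key );   // assumed get()/set() on the store
        if ( $value === false ) {
            return null;                // nothing stored yet
        }
        $dirty = false;
        while ( $value['version'] < WIDGET_VERSION ) {
            $value = $upgrades[$value['version']]( $value );
            $value['version']++;
            $dirty = true;
        }
        if ( $dirty ) {
            $store->set( $key, $value ); // persist the upgraded value
        }
        return $value;
    }
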
<mark> also, is this proposal intended to embrace larger key/value storage applications like... images? external storage?
<mwalker> TimStarling: yes -- that's where I'm going -- but I'm agitating for the schema system so the caller doesn't have to care every place its used
<mark> it doesn't seem to be, but I believe it's not mentioned
<MaxSem> mark, I intended to maybe use it for storing images on small wikis
<TimStarling> mwalker: I think you should write about your idea in more detail
<TimStarling> since this is not exactly a familiar concept for most MW developers
<gwicke> mark: the Cassandra stuff just came up in parallel
<mwalker> TimStarling: ok; I'll write that up tonight
<TimStarling> maybe you could even write a competing RFC
<mwalker> which do you think would be better?
<MaxSem> but it's too generic for an image store of our scale
<mark> when we're either talking about many objects into the millions, or potentially very large objects into the gigabytes, that can matter a lot :)
<TimStarling> mwalker: I would like to know what the API will look like before I decide
<TimStarling> and I would want comments from more people
<MaxSem> mark, the key here is "small wikis":)
<mwalker> TimStarling: ok; I'll write it up as a separate RfC
<TimStarling> yeah, I think that would be easiest
<TimStarling> now, there are obvious applications for a schemaless data store
<gwicke> mark: objects into the gigabytes are unlikely to be handled well by a backend that is also good at small objects
<mark> gwicke: that is my point
<TimStarling> because there are already schemaless data stores in use
<TimStarling> ExternalStore, geo_updates, etc.
<gwicke> mark: I'm interested in the 'at most a few megabytes' space
<MaxSem> so far to move this proposal forward I'd like people to agree upon the interface
<gwicke> primarily revision storage
<mark> yes, we should probably make that a bit more explicit in the RFC
<TimStarling> is it possible to have both a schema data store and a non-schema data store?
<TimStarling> one could be implemented using the other
<TimStarling> I think that would suit existing developers better
<mark> 2 layers of abstraction
<TimStarling> yeah, well that seems like the minimum here
<TimStarling> schemas are not so simple that you would want to do them in a few lines of code embedded in a data store class, right? you would want to have a separate class for that
<mwalker> I think this could even overlay our current memcache
<gwicke> schema as in actually storing structured data and allowing complex queries on it?
<gwicke> that sounds like sql..
<mwalker> just getStore('temporary') or something
<MaxSem> another question: does anybody want eg getMulti() and setMulti()?
<MaxSem> mwalker, temporary is BagOStuff
<TimStarling> MaxSem: ObjectCache callers don't use getMulti very often...
<gwicke> MaxSem: I think it would be great to have that capability for any service backend
<yuvipanda> +1
<mwalker> this is a PersistentBagOStuff though :) why should the API be different?
<TimStarling> in core, just filebackend, by the looks
<gwicke> can be based on curl_multi
<TimStarling> but it is generally considered to be a good thing to have
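
A hedged sketch of the batch methods MaxSem asks about; the signatures are guesses rather than agreed API, and a web-service backend could plausibly implement getMulti() with curl_multi as gwicke notes:

    <?php
    interface BatchDataStore {
        /**
         * @param string[] $keys
         * @return array Map of key => value; keys with no stored value are omitted.
         */
        public function getMulti( array $keys );

        /**
         * @param array $values Map of key => value to store in one round trip.
         * @return bool Whether all writes succeeded.
         */
        public function setMulti( array $values );
    }
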
<mark> it's not always efficient to implement temp/expiry/caching with every service backend
<mark> oh, misunderstood
<TimStarling> no, I think persistent storage does need a different API
<mwalker> yes; mark raised a point I hadn't thought of
<TimStarling> well, ideally
<TimStarling> redis handles persistent storage well enough with a mixed API
<gwicke> there are some backends with built-in expiry
<mwalker> if you set a TTL of zero, it goes into the persistent store?
<gwicke> amazon handles the ttl with special request headers
<TimStarling> anyway, BagOStuff brings a lot of baggage (ha ha)
<gwicke> mwalker: you set it per object normally
<TimStarling> presumably DataStore would be simpler than BagOStuff
<gwicke> same is available in cassandra
<gwicke> but would be good to check other backends
<TimStarling> it wouldn't have incr/decr or lock/unlock
<mark> swift does it, the swift compatible ceph counterpart doesn't
<TimStarling> with a simpler API, DataStore could have more backends than BagOStuff
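
A sketch of the "overlay our current memcache" idea from mwalker and MaxSem above, exposing only the reduced surface TimStarling describes (no incr/decr, no lock/unlock) on top of MediaWiki's real BagOStuff; the adapter class and the BlobStore interface from the earlier sketch are hypothetical:

    <?php
    class BagOStuffDataStore implements BlobStore {
        private $cache;

        public function __construct( BagOStuff $cache ) {
            $this->cache = $cache;
        }
        public function get( $key ) {
            return $this->cache->get( $key );
        }
        public function set( $key, $value ) {
            // Expiry 0 = "keep as long as the backend allows"; a memcached-backed
            // 'temporary' store is still only best-effort persistence.
            return $this->cache->set( $key, $value, 0 );
        }
        public function delete( $key ) {
            return $this->cache->delete( $key );
        }
    }
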
<MaxSem> TimStarling, I actually have increment() - wonder if it's really needed
<PleaseStand> Would we need an atomic increment for things like ss_total_edits?
<gwicke> I'm pushing for a web service API
<MaxSem> it could be helpful eg for implementing SiteStats with DataStore
<MaxSem> gwicke, web service API will be one of backends
<gwicke> PleaseStand: not atomic, but consistent
<gwicke> that should be a special storage class
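
A small sketch of the consistent-counter idea for something like ss_total_edits, reusing the hypothetical CounterStore class from the earlier sketch; nothing here is committed API:

    <?php
    // The counter storage class promises increments are not lost (consistent),
    // even if a read may briefly lag behind (no atomic read-modify-write by the caller).
    function recordEditForStats( CounterStore $stats ) {
        $stats->increment( 'sitestats:total-edits' );
    }
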
<MaxSem> gwicke, know why memcached doesn't work over HTTP?
<TimStarling> MaxSem: maybe you should write a bit on the RFC about what backends you imagine this using, and what their capabilities are
<gwicke> MaxSem: efficiency for very small fetches
<mark> it's not UDP? ;)
<gwicke> afaik it's tcp
<TimStarling> w.r.t. prefix search, increment, lock, etc.
<mwalker> facebook wrote one with udp
<TimStarling> add, cas?
<MaxSem> stupid facebook
<TimStarling> ObjectCache provides all these atomic primitives
<mark> max size of objects
<gwicke> TimStarling: cas on etag?
<gwicke> can be supported optionally in some backends
<TimStarling> I just would like to know if the applications require all these atomic primitives
<TimStarling> and if that limits our backend choice
<MaxSem> TimStarling, cas doesn't seem to be very mixable with eventual-consistency backends
<TimStarling> essentially, there is a tradeoff between feature count and backend diversity, right?
<gwicke> I'd start with the minimal feature set initially
<TimStarling> so we want to know where on the spectrum to put DataStore
<mark> i think an application like the one gwicke is interested in (external-storage-like) is already quite different from the counter/stats-like applications also discussed here
<gwicke> and then consider adding support for something like CAS when the use case and backend landscape is clearer
<TimStarling> that tradeoff is not discussed on the RFC, so I would like to see it discussed
<mark> agreed
