Wikibase/Indexing

Group:	Search and Discovery
Team members:	Stas Malyshev, Nik Everett
Lead:	Nik Everett

Task tracking: wikidata-query-service in Phabricator; watch the project to receive notifications

Goals / Requirements

End of March

Publicly downloadable prototype (query and import and update)
Labs install (details hazy, same prototype)
Hardware requested
12 month roadmap

Phase 1: MVP (Before end of June)

Support use-cases of WikiGrok, remain flexible in architecture to eventually support external requests/third parties.

Must have:

query console for experimentation

As a user, I want to see a WikiGrok question immediately on the mobile site, as soon as I load an article that has any active WikGrok campaigns, so that I can respond quickly and keep getting more questions in real time, based on the ones I've already answered.

Requirements:

available through an API

available through server-side requests, connecting through PHP
for any query, result output in JSON (XML would be nice to have)
results may sometimes come in the form of lists (e.g., List of all possible occupations) and sometimes in the form of a single item
- Simple (single-item) queries: (e.g., "is this item X and not Y?")
  - generate live and run quickly – as fast as possible – to serve immediately to users via WikiGrok (and potentially continue serving more results on the fly after user input)
- Complex queries: (e.g., list of all possible suggested occupations)
  - pre-generate results and store in a table or cache, so these queries can run longer (but still within some reasonable timeframe, e.g. 1 hour)
  - regenerate as often as practical/possible

Phase 2: Support for public/external requests

ideally, public web service
- external requests return within a few seconds, use reasonable resources
  - how to enforce that constraint needs to be determined and influences the architecture
- internal requests are allowed to use more resources & time
  - these need to not crash external requests and external cannot crush internal
needs to support continuous updates to reflect latest Wikidata state
- Seconds or even a minute or two lag seems acceptable at this point but nothing beyond that.
handle high request volumes (horizontal scaling)
handle a large data set (sharding)
robust: automatic handling of node failures, cross-datacenter replication, proven in production
reasonable operational complexity

Candidate solutions

Titan

Distributed graph database
Supports online modification (OLTP), so can reflect current state
Expressive query language (Gremlin); shared with other graph dbs like Neo4j
Implemented as a thin stateless layer on top of Cassandra or HBase: transparent sharding, replication and fail-over
- async multi-cluster replication can be used for isolation of research clusters, DC fail-over
Supports relatively rich indexing, including complex indexes using ElasticSearch
- Can gradually convert complex queries into simple(r) ones by propagating information on the graph & adding indexes
TinkerPop blueprints support, including Gremlin and the GraphSail RDF interface

Magnus' Wikidata Query service

Custom in-memory graph database implemented in C++
Relatively expressive, custom query language
Limited to a single machine
- Current memory usage: 5G RSS
- Load balanced systems are available
The initial dump conversion is extremely slow, so if a server process crashes and its dump gets corrupt you face a prolonged outage. Even if we optimize it by an order of magnitude, it will still be slow.
Server startup time is not nice either, and will only grow with the growth of the dataset. While not critical e.g. for Redis that we keep running for months, the less mature nature of WDQ and the need to cater for development/future bugs will mean a lot of restarts, each being a PITA.
You can't run in production the DB query used to retrieve latest changes is, so this part will have to be redone completely.
- And this update routine results in a possible race condition making it miss some changes.
Each entity's properties are retrieved with a separate uncacheable HTTP request to Special:EntityData which isn't very fast so as the rate of changes increases, WDQ will bump into it hard, not being able to cope with updates.
Thread synchronization model needs to be totally redone, as currently used spinlocks aren't scalable.
Some code paths in the update code are missing a synchronization which makes one wonder how often does it crash and how much often will it crash under a production load. And if you add synchronization there, locking is probably going to bring everything to a halt.
With a crapload of very small objects in the same heap, a long-running server process would have problems with heap fragmentation/memory management performance, requiring custom memory management, etc.
Scalability would be very primitive: as you add more machines they start making more requests, etc.
Would need some manual solutions for approximate clustering, failover, etc.
Would require in-house maintenance.

OrientDB

Apache 2 License
Distributed graph database
Supports online modification (OLTP)
Supports Gremlin, as well as it's own SQL-like query language with graph features and no JOINs
Drop-in Lucene plugin for geospatial indexes (also full-text, not that we would use that)
Replication is multi-master and works via Hazelcast, read/write quorums, and merging conflicts
- Isolation in ACID drops a bit when distributed (intermediate results may show briefly)
- Isolation is SERIALIZABLE for direct FS access case, READ COMMITTED for remote access (what we would use)
- Since I'd like to input data from hub feeds (starting with dumps), async replication isn't really an issue; both DC would just use the same process to pull in updates
  - Actually OrientDB 2.0 supports async replication via config [1]
Supports automatic round-robin sharding as well as application directed sharding (specifying clusters for class item insertions and reads, reads default to checking all partitions)
Supports various indexes (SBTree,hash, both unique or not unique) and primitive as well as embedded data structures (sets/lists/map) that can be indexed
Queries on the JSON itself are also possible regardless of nesting levels
Supports Tinkerpop blueprints
Good query timeout support via TIMEOUT

ArangoDB

Apache 2 License
Multi-model database: key/value, document and graph DB with a query language combining all three models
Complex queries including joins between multiple collections
ACID: transactions spanning multiple documents and collections
distributed: replication (asynchronous master-slave) and sharding
AQL (ArangoDB Query Language), a declarative query language similar in spirit to SQL, designed with multi-model in mind and allowing joins. Has also other querying options.
HTTP API for RESTful interfaces
API extensible by user-defined JavaScript code (V8 embedded) via the Foxx framework
Supports relatively rich indexing (skip-list, hash, geo, full-text)
Extensive reference documentation in gitbook format
Convenient web front end integrated
Packages for Debian, other Linux distributions, Mac OSX and Windows available in repository on website
Virtual machine images and docker container available
Very modern code base (C++11) with many unit tests
Good community and professional support
Few open issues on github, new ones are quickly dealt with
Regular major releases every two to three months
Concerns
- Memory-only indexes
- Only one full-text index per collection, but a river plugin for elastic search exists^[1]
- Complex clustering/sharding model which is still in development, integration with mesos is being worked on
- No indexes on array properties yet^[2], this is being worked on

Neo4j

GPL & AGPL
Some capabilities may be enterprise-only
Dynamic schema, supports fulltext indexes and probably geospatial with plugin, also looks like pluggable indexes are possible
Cypher SQL-like query language, TP3 support
Labels may be used to make efficient lookups for yes/no properties (see in the blog)
Has query planner/profiler
Good HA support, no sharding though
Lucene indexes
Concerns
- Mixed indexes support - i.e. is it possible to use 2 indexes together?
- Edge indexes only old-style - either manual-fill or pre-configured autoindexer which needs to be configured each time anew
- Not clear how we handle geodata
- No ability to store arbitrary JSON/blob for non-indexed secondary lookup
- Multi-valued properties support - TP3 implementation uses node-per-value
- What we're missing by running non-Enterprise version (if anything)
- How sharding-less replication works in practice on our data size?
Questions
- Are edge properties first-class citizens re: indexing?
- Looking up property values - more efficient as edge properties or vertex properties? I.e. are claims edges or vertices?
- Is Enterprise version just AGPL or has some other requirements?

Other possible candidates

"Big Data" - GPLv2, Tinkerpop 2.5
Graphx - Apache License, Alpha, Uses Spark
Virtuoso cluster
4store
Neo4j - GPL and AGPL - Tinkerpop 3.0 reference implementation
Build one directly against Elasticsearch - Horrible/provocative idea - We currently have the expertise but totally against most of our goals – OTOH we were resigned to having to hack on Titan's Elasticsearch integration anyway

Open questions

Paging of large result sets
Handling of cycles in the graph
How to index the graph for efficient common query use cases
Efficient updates for materialized complex query results

Candidate Evaluation

	Titan	OrientDB	ArangoDB	Neo4J
Future Proofing
Automatic sharding for large data set?	Yes, fully automatic at Cassandra layer through DHT structure. Can add one node at a time.	Manual shard setup and assignment, but automatic write balancing across shards. In the community edition, any changes to the cluster config including adding nodes or rebalancing shards require a complete cluster restart. ^[3]	In Development. ^[4] Currently the sharding setup is manual, but the data distribution is automatic. Zero administration and integration with Apache mesos clusters is being worked on.	Replication - yes, sharding - no
Multi-cluster replication?	Yes	No	No	?
Single point of failure?	No. Storage backend is Cassandra which doesn't have a single point of contention, much less failure.	No	No. Automatic failover is being worked on. Currently asynchronous replication is possible but requires manual setup.	No
Large-scale production use?	Yes - Not clear what large number of Titan servers is. Lots of documentation on very large Cassandra clusters. Netflix has blogger about ~285 node Cassandra clusters pushing a million writes a second. We won't get near that hardware.	Yes - five or six server clusters	Not clear what "large-scale" means, but customers use clusters in production.	AFAIK Yes, not sure of the size but read about bllion-nodes-sized setups.
Active community?	Mailing list had one half the emails of Elasticsearch in October.	Mailing list had one half the emails of Elasticsearch in October.	Yes, github, google group and stackoverflow.	stackoverflow, mailing list, github
Usage
Do we like the query language?	Gremlin (maybe SPARQL)	Modified SQL (maybe Gremlin if we really want)	AQL, graph traversal	Cypher, TP3
Can we customize the query language?	Monkey patching, front-end service	front-end service	Yes (Foxx)	Plugins
Indexing capabilties	Hash and Elasticsearch	Hash, range and Lucene	hash, skiplist, geo, fulltext (one per collection), river for elastic search
Parallel request scheduling (Are different parallel queries sufficiently well (fair?) scheduled so that resource intensive and low latency interactive ones can run on the same cluster?)	You'd have to do this by running queries on different machines and sharing the same storage backends. Inside a single machine queries are just have threads doing lazy evaluation and Java threads.		Incoming HTTP requests are scheduled in a fair way. Numbers of server threads can be configured. If not all threads are busy with long running requests, then scheduling is fair.
Query time or cpu limit	Timeout added in very newest version. Its possible to write degenerate queries the circumvent the timeout.	Good, via TIMEOUT	Possible to cancel queries externally
Query memory limit	Done using process/machine isolation. The storage layer is shared but that isn't memory issue.	No	No
Query support (use cases)
items by multiple property values	Yes (has)	Yes	Yes
recursive collections (sub-taxons of x)	Yes(loop, out)	Yes	?
Follow edges in inverse direction	Yes (in, out)	Yes	Yes
top n	Yes (order+loop or [])	Yes	Yes
range	With Elasticsearch	Yes; includes SB-Trees	Yes
2d range, geo	With Elasticsearch	Yes	Yes
prefix match	Yes (Text.PREFIX)	Yes; SB-Trees	Yes
query based on custom set (birth places of 5000 people)	Yes (retain)	Yes?	Yes
Filtering/intersection a custom set	Yes (retain)	Yes?	Yes
Matching against a transitive property (instance of human, including subclasses)	Yes(loop, out)	Yes; using TRAVERSE or Gremlin	?
"Countries sorted by GDP"	Yes (but GDP doesn't seem to be in wikidata props)	Yes; Composite SB-Tree and edge per prop statemet	Yes
"People born in a city with more than 100k inhabitants"	Yes	Yes	Yes
"largest 10 cities in Europe that have a female mayor"	Yes	Yes; subquery or JSON filtering in WHERE for example	Yes
Distinct values of a property	Yes (dedup)	Yes; SB-Tree or hash index	Yes
max/min/sum/count/avg	Yes (sideEffect)	Yes	Yes
intersection/union/difference	Union - yes (copySplit & merge), intersection/difference - probably yes (retain/except) but may have limitations	Yes; via SQL UNION/WHERE/subqueries, maybe Gremlin too	Yes
Join based on property value	Trivial for edges, may not be efficient for properties without specific index	Yes	Yes
Support for subproperties: "people who have <subproperty of location> with Newcastle"	There's no concept of subproperties as such, but properties can be hashmaps and filters can use code to match against them	Yes; nested JSON can be queried all the way down	Yes
Dev
Do we have the expertise to hack on it?	Java and Groovy	Java and maybe Groovy	C++/JS
Is upstream excited to have us?	Yes	Yes!!!!	Yes, definitely!
Does contributing require a CLA?	Looks like no	Yes, copied and pasted from Apache	Yes, see ^[5]
OPS
How do we manage housekeeping? [please define]
Can we extract useful metrics for monitoring and alerting?	Standard Cassandra instrumentation and basic metrics from Titan itself		Yes, statistics are kept and available in the cluster dashboard.
How do we recover from a crash?	Entire cluster: Automatic fail over to other cluster (DC); can restore from backups of immutable LSMs		Whole cluster: Restart cluster and restore from dump.
What happens if a node goes down?	Remove node from LVS pool.		Coordinator node: nothing, DBserver node: Have to manually connect asynchronous replica as new node. Automatic failover is being worked on.
What happens if a backend storage node goes down?	Automatic rebalancing between remaining nodes in cluster.	No backend nodes	Have to manually connect asynchronous replica as new node. Automatic failover is being worked on.
How do we back its data up?	Normal Cassandra disk backup of immutable SSTable files after snapshot.	Backup is not in community edition; use JSON dumps + hub updates (the primary data is in MySQL/ExternalStore anyway)	arangodump and arangorestore for cluster are fully supported in the community edition. Asynchronous replication as well.
Source (for importing data, sample queries, etc)	https://github.com/smalyshev/wikidata-gremlin/ branch titan_flat	https://github.com/AaronSchulz/WikiDataQueryOrient	https://github.com/triAGENS/ArangoDB-Data

WDQS Beta based on RDF/SPARQL

Not candidates

Neo4j (replication only in "Enterprise Edition")
Offline / batch (OLAP) systems like Giraph (we need the capability to keep the index up to date)
Wikidata Query service: single node, no sharding, no replication
Caley: no serious production use
ArangoDB: looks like it might work but they don't allow upstream contributions. Answer to this: This was never true and is not true, see ^[5] for the CLA procedure. When they get that problem solved and OrientDB and Titan fall through then we can promote them again.

References

[1] ttps://github.com/arangodb/elasticsearch-river-arangodb

[2] ttp://comments.gmane.org/gmane.comp.db.arangodb.general/1809

[3] ttp://www.orientechnologies.com/docs/last/orientdb.wiki/Distributed-Sharding.html#hot-management-of-distributed-configuration

[4] ttps://docs.arangodb.com/Sharding/StatusOfImplementation.html

[:0-5] 5.0 ^5.1 https://www.arangodb.com/community

[1]

[2]

[3]

[4]

[5]