Wikibase/Indexing

Revision as of 16:11, 1 December 2014

Goals / Requirements

  • ideally, public web service
    • external requests should return within a few seconds and use reasonable resources
      • how to enforce that constraint needs to be determined and influences the architecture
    • internal requests are allowed to use more resources & time
      • internal requests must not disrupt external requests, and external requests must not crowd out internal ones
  • needs to support continuous updates to reflect latest Wikidata state
    • A lag of seconds, or even a minute or two, seems acceptable at this point, but nothing beyond that.
  • support for queries that satisfy the needs of WikiGrok, cf. Extension:MobileFrontend/WikiGrok/Claim_suggestions
  • handle high request volumes (horizontal scaling)
  • handle a large data set (sharding)
  • robust: automatic handling of node failures, cross-datacenter replication, proven in production
  • reasonable operational complexity

Candidate solutions

Titan

  • Distributed graph database
  • Supports online modification (OLTP), so can reflect current state
  • Expressive query language (Gremlin), shared with other graph DBs like Neo4j; a short query sketch follows this list
  • Implemented as a thin stateless layer on top of Cassandra or HBase: transparent sharding, replication and fail-over
    • async multi-cluster replication can be used for isolation of research clusters, DC fail-over
  • Supports relatively rich indexing, including complex indexes using ElasticSearch
    • Can gradually convert complex queries into simple(r) ones by propagating information on the graph & adding indexes
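
A rough illustration of the Gremlin query style (TinkerPop 2, the version Titan shipped with at the time) may help here. This is a minimal sketch: the "wd_id" vertex key and the use of property IDs such as P31 as edge labels are assumptions made for illustration, not a documented loading scheme.

    // Assumed schema: one vertex per item, keyed by a "wd_id" property
    // (e.g. "Q42"), and one edge per claim, labeled with the property ID
    // (e.g. "P31" for "instance of").

    // Items by property value: all instances of (P31) human (Q5).
    humans = g.V('wd_id', 'Q5').in('P31').toList()

    // Following edges in the inverse direction: items whose
    // place of birth (P19) is Berlin (Q64).
    g.V('wd_id', 'Q64').in('P19').wd_id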

Magnus' Wikidata Query service

  • Custom in-memory graph database implemented in C++
  • Relatively expressive, custom query language
  • Limited to a single machine
    • Current memory usage: 5G RSS

Cayley

  • Graph database "inspired by" Freebase and Google's Knowledge Graph

ArangoDB

  • Key-value, document and graph DB with replication (asynchronous master-slave) and sharding
  • Strong consistency / ACID / transactions
  • AQL (ArangoDB Query Language), a declarative query language similar to SQL; other querying options are also available
  • Supports relatively rich indexing
  • Packages for Debian available in repository on website

Other possible candidates

  • Virtuoso cluster: http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtElasticClusterConfiguration
  • 4store: http://4store.org

Open questions

  • Paging of large result sets (a Gremlin sketch of paging and cycle handling follows this list)
  • Handling of cycles in the graph
  • How to index the graph for efficient common query use cases
  • Efficient updates for materialized complex query results
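
Assuming the same hypothetical Gremlin/Titan schema as above, paging and cycle handling could look roughly like the sketch below. Slice-based paging re-runs the traversal for every page, which is part of why paging of large result sets remains an open question.

    // Paging by slicing the pipeline with a range filter (page 3, 50 per
    // page); deep pages get progressively more expensive.
    page = g.V('wd_id', 'Q5').in('P31')[100..149].toList()

    // Cycle-safe transitive walk over subclass of (assumed edge label
    // "P279"): simplePath drops any traversal that revisits an element, so
    // the loop terminates even if the data contains a subclass cycle. The
    // depth cap of 16 is an arbitrary safety bound.
    subs = g.V('wd_id', 'Q35120').as('start').
        in('P279').simplePath().
        loop('start'){it.loops < 16}{true}.
        dedup().toList()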

Candidate Evaluation

Candidates compared: Titan, OrientDB, ArangoDB.

Future Proofing

  • Automatic sharding for large data sets?
    • Titan: Yes, fully automatic at the Cassandra layer through the DHT structure. Can add one node at a time.
    • OrientDB: Manual shard setup and assignment, but automatic write balancing across shards. In the community edition, any change to the cluster config, including adding nodes or rebalancing shards, requires a complete cluster restart.[1]
    • ArangoDB: Sharding built in.
  • Multi-cluster replication?
    • Titan: Yes
    • OrientDB: No
  • Single point of failure?
    • Titan: No; symmetric DHT design.
  • Large-scale production use?
    • Titan: Yes
    • OrientDB: Yes
  • Active community?
    • Titan: mailing list had half the emails of Elasticsearch in October.
    • OrientDB: mailing list had half the emails of Elasticsearch in October.
    • ArangoDB: mailing list had one tenth the emails of Elasticsearch in October.

Usage

  • Do we like the query language?
    • Titan: Gremlin (maybe SPARQL)
    • OrientDB: modified SQL (maybe Gremlin if we really want)
  • Can we customize the query language?
    • Titan: monkey patching, front-end service
    • OrientDB: Gremlin supports monkey patching; not sure about SQL. Can add a front-end service.
  • Indexing capabilities
    • Titan: hash, range and Elasticsearch
    • OrientDB: hash, range and Lucene

Query support (use cases): the answers recorded so far are for Titan/Gremlin (a sketch of the steps named here follows the table)

  • items by multiple property values: Yes (has)
  • recursive collections (sub-taxons of x)
  • follow edges in inverse direction: Yes (in, out)
  • top n: Yes (order + loop, or [])
  • range: with Elasticsearch
  • 2d range, geo: with Elasticsearch
  • prefix match: Yes (Text.PREFIX)
  • query based on a custom set (birth places of 5000 people)
  • filtering/intersecting a custom set: Yes (retain)
  • matching against a transitive property (instance of human, including subclasses)
  • "Countries sorted by GDP": Yes (but GDP doesn't seem to be in Wikidata props)
  • "People born in a city with more than 100k inhabitants": Yes
  • "Largest 10 cities in Europe that have a female mayor": Yes
  • distinct values of a property: Yes (dedup)
  • max/min/sum/count/avg: Yes (sideEffect)
  • intersection/union/difference
  • join based on property value

Dev

  • Do we have the expertise to hack on it?
    • Titan: Java and Groovy (Nik)
    • OrientDB: Java and maybe Groovy (Nik)
    • ArangoDB: C++ (Max)
  • Is upstream excited to have us?
    • Titan: Yes
    • OrientDB: Yes
    • ArangoDB: ?? (Nik sent email)
  • Does contributing require a CLA?

OPS

  • How do we manage housekeeping? [please define]
  • Can we extract useful metrics for monitoring and alerting?
    • Titan: standard Cassandra & front-end service instrumentation.
  • How do we recover from a crash?
    • Titan: entire cluster: automatic fail-over to the other cluster (DC); can restore from backups of immutable LSM files.
  • What happens if a node goes down?
    • Titan: remove the node from the LVS pool.
  • What happens if a backend storage node goes down?
    • Titan: automatic rebalancing between the remaining nodes in the cluster.
    • OrientDB/ArangoDB: no separate backend storage nodes.
  • How do we back its data up?
    • Titan: normal Cassandra disk backup of immutable SSTable files after a snapshot.
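
As referenced in the table, here is a hedged sketch of how the Gremlin steps named above (order with a range filter, dedup, sideEffect, retain) could serve these use cases, under the same assumed schema as earlier; the "population" property and all item IDs are illustrative stand-ins.

    // top n: order by a (hypothetical) "population" property, then slice
    // the pipeline with a range filter.
    top10 = g.V('wd_id', 'Q515').in('P31').
        order{it.b.population <=> it.a.population}[0..9].toList()

    // Distinct values of a property: dedup.
    g.V('wd_id', 'Q5').in('P31').out('P19').dedup().wd_id

    // max/min/sum/count/avg can be emulated with sideEffect; count shown.
    c = 0
    g.V('wd_id', 'Q5').in('P31').sideEffect{c = c + 1}.iterate()

    // Filtering/intersecting a previously computed custom set: retain.
    cities = g.V('wd_id', 'Q64').toList() + g.V('wd_id', 'Q90').toList()
    g.V('wd_id', 'Q5').in('P31').out('P19').retain(cities).wd_id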

See also

  • Feature comparison table of several graph DBs: https://docs.google.com/spreadsheet/ccc?key=0AlHPKx74VyC5dERyMHlLQ2lMY3dFQS1JRExYQUNhdVE&usp=drive_web#gid=0
  • Paper: NoSQL Databases for RDF: An Empirical Evaluation (http://ribs.csres.utexas.edu/nosqlrdf/nosqlrdf_iswc2013.pdf)
  • Quora thread discussing graph DBs vs. triple stores: http://www.quora.com/What-are-the-differences-between-a-Graph-database-and-a-Triple-store

Not candidates

  • Neo4j (replication only in "Enterprise Edition")
  • Offline / batch (OLAP) systems like Giraph (we need the capability to keep the index up to date)
  • RDF databases (Wikidata's data model is richer than RDF, so there would likely be some loss of precision)
  • Wikidata Query service: single node, no sharding, no replication
  • Cayley: no serious production use

References