Wikibase/Indexing

Goals / Requirements

  • ideally, public web service
    • external requests return within a few seconds, use reasonable resources
      • how to enforce that constraint needs to be determined and influences the architecture
    • internal requests are allowed to use more resources & time
      • these must not crash or crowd out external requests, and external requests must not starve internal ones
  • needs to support continuous updates to reflect latest Wikidata state
    • A lag of seconds, or even a minute or two, seems acceptable at this point, but nothing beyond that.
  • support for queries that satisfy the needs of WikiGrok, cf. Extension:MobileFrontend/WikiGrok/Claim_suggestions
  • handle high request volumes (horizontal scaling)
  • handle a large data set (sharding)
  • robust: automatic handling of node failures, cross-datacenter replication, proven in production
  • reasonable operational complexity

Candidate solutions

Titan

  • Distributed graph database
  • Supports online modification (OLTP), so can reflect current state
  • Expressive query language (Gremlin); shared with other graph dbs like Neo4j
  • Implemented as a thin stateless layer on top of Cassandra or HBase: transparent sharding, replication and fail-over
    • async multi-cluster replication can be used for isolation of research clusters, DC fail-over
  • Supports relatively rich indexing, including complex indexes using ElasticSearch
    • Can gradually convert complex queries into simple(r) ones by propagating information on the graph & adding indexes
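
As a rough illustration of the Gremlin style (a sketch only: it assumes a Gremlin 2 shell with a graph handle g, and hypothetical vertex/edge labels 'human', 'instanceOf' and 'placeOfBirth' standing in for a Wikidata-to-property-graph mapping that has not been defined yet):

    // Distinct birth places of all items that are instances of 'human':
    // start at the 'human' class vertex, walk incoming 'instanceOf'
    // edges to its instances, then outgoing 'placeOfBirth' edges,
    // dropping duplicate results.
    g.V.has('name', 'human').in('instanceOf').out('placeOfBirth').dedup().name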

Magnus' Wikidata Query service

  • Custom in-memory graph database implemented in C++
  • Relatively expressive, custom query language
  • Limited to a single machine
    • Current memory usage: 5G RSS
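
For comparison, WDQ's custom language addresses items and properties by their bare Wikidata numbers. A sketch (based on the public WDQ API; treat the exact syntax as illustrative):

    claim[31:5] AND claim[19:64]

i.e. items whose "instance of" (P31) is "human" (Q5) and whose "place of birth" (P19) is Berlin (Q64).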

Cayley

  • Graph database "inspired by" Freebase and Google's Knowledge Graph

ArangoDB

  • Key-value, document and graph DB with replication (asynchronous master-slave) and sharding
  • Strong consistency / ACID / transactions
  • AQL (ArangoDB Query Language), a declarative query language similar to SQL; other querying options are also available.
  • Supports relatively rich indexing
  • Packages for Debian available in repository on website
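
A rough AQL sketch of the same kind of lookup (assuming a hypothetical 'items' document collection with illustrative attribute names, not a confirmed schema):

    FOR item IN items
      FILTER item.instanceOf == "human" AND item.placeOfBirth == "Berlin"
      RETURN item.label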

Other possible candidates

  • Virtuoso cluster (http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtElasticClusterConfiguration)
  • 4store (http://4store.org)

Open questions

  • Paging of large result sets
  • Handling of cycles in the graph
  • How to index the graph for efficient common query use cases
  • Efficient updates for materialized complex query results

Candidate Evaluation

Candidates compared: Titan, OrientDB, ArangoDB.

Future Proofing

  • Automatic sharding for large data set?
    • Titan: Yes, fully automatic at the Cassandra layer through the DHT structure. Can add one node at a time.
    • OrientDB: Manual shard setup and assignment, but automatic write balancing across shards. In the community edition, any change to the cluster config, including adding nodes or rebalancing shards, requires a complete cluster restart. [1]
    • ArangoDB: Sharding built in.
  • Multi-cluster replication?
    • Titan: Yes
    • OrientDB: No
  • Single point of failure?
    • Titan: No. Symmetric DHT design.
  • Large-scale production use?
    • Titan: Yes
    • OrientDB: Yes
  • Active community?
    • Titan: Mailing list had one half the emails of Elasticsearch in October.
    • OrientDB: Mailing list had one half the emails of Elasticsearch in October.
    • ArangoDB: Mailing list had one tenth the emails of Elasticsearch in October.

Usage

  • Do we like the query language?
    • Titan: Gremlin (maybe SPARQL)
    • OrientDB: Modified SQL (maybe Gremlin if we really want)
  • Can we customize the query language?
    • Titan: Monkey patching, front-end service
    • OrientDB: Gremlin supports monkey patching. Not sure about SQL. Can add a front-end service.
  • Indexing capabilities
    • Titan: Hash, range and Elasticsearch
    • OrientDB: Hash, range and Lucene

Query support (use cases)

  • items by multiple property values
    • Titan: Yes (has)
  • recursive collections (sub-taxons of x)
  • Follow edges in inverse direction
    • Titan: Yes (in, out)
  • top n
    • Titan: Yes (order+loop or [])
  • range
    • Titan: With Elasticsearch
  • 2d range, geo
    • Titan: With Elasticsearch
  • prefix match
    • Titan: Yes (Text.PREFIX)
  • query based on custom set (birth places of 5000 people)
  • Filtering/intersecting a custom set
    • Titan: Yes (retain)
  • Matching against a transitive property (instance of human, including subclasses)
  • "Countries sorted by GDP"
    • Titan: Yes (but GDP doesn't seem to be in Wikidata props)
  • "People born in a city with more than 100k inhabitants"
    • Titan: Yes
  • "largest 10 cities in Europe that have a female mayor"
    • Titan: Yes
  • Distinct values of a property
    • Titan: Yes (dedup)
  • max/min/sum/count/avg
    • Titan: Yes (sideEffect)
  • intersection/union/difference
  • Join based on property value

Dev

  • Do we have the expertise to hack on it?
    • Titan: Java and Groovy (Nik)
    • OrientDB: Java and maybe Groovy (Nik)
    • ArangoDB: C++ (Max)
  • Is upstream excited to have us?
    • Titan: Yes
    • OrientDB: Yes
    • ArangoDB: ?? (Nik sent email)
  • Does contributing require a CLA?

OPS

  • How do we manage housekeeping? [please define]
  • Can we extract useful metrics for monitoring and alerting?
    • Titan: Standard Cassandra & front-end service instrumentation.
  • How do we recover from a crash?
    • Titan: Entire cluster: automatic fail-over to the other cluster (DC); can restore from backups of immutable LSMs.
  • What happens if a node goes down?
    • Titan: Remove node from LVS pool.
  • What happens if a backend storage node goes down?
    • Titan: Automatic rebalancing between remaining nodes in cluster.
    • OrientDB: No backend nodes
  • How do we back its data up?
    • Titan: Normal Cassandra disk backup of immutable SSTable files after snapshot.
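
To make one of the use-case rows concrete, "people born in a city with more than 100k inhabitants" might look roughly like this in Gremlin 2 (again with hypothetical labels; T.gt is the shell's greater-than comparison token):

    // Cities with population over 100,000, then follow 'placeOfBirth'
    // edges in the inverse direction to reach the people born there.
    g.V.has('type', 'city').has('population', T.gt, 100000).in('placeOfBirth').dedup().name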

See also

  • Feature comparison table of several graph DBs (https://docs.google.com/spreadsheet/ccc?key=0AlHPKx74VyC5dERyMHlLQ2lMY3dFQS1JRExYQUNhdVE&usp=drive_web#gid=0)
  • Paper: NoSQL Databases for RDF: An Empirical Evaluation (http://ribs.csres.utexas.edu/nosqlrdf/nosqlrdf_iswc2013.pdf)
  • Quora thread discussing graph dbs vs. triple stores (http://www.quora.com/What-are-the-differences-between-a-Graph-database-and-a-Triple-store)

Not candidates

  • Neo4j (replication only in "Enterprise Edition")
  • Offline / batch (OLAP) systems like Giraph (we need the capability to keep the index up to date)
  • RDF databases (Wikidata's data model is richer than RDF, so there would likely be some loss of precision)
  • Wikidata Query service: single node, no sharding, no replication
  • Cayley: no serious production use

References