Wikibase/Indexing
Revision as of 16:11, 1 December 2014
Goals / Requirements
- ideally, public web service
- external requests return within a few seconds, use reasonable resources
- how to enforce that constraint needs to be determined and influences the architecture
- internal requests are allowed to use more resources & time
- these must not crash external requests, and external requests must not crash internal ones
- needs to support continuous updates to reflect latest Wikidata state
- A lag of seconds, or even a minute or two, seems acceptable at this point, but nothing beyond that.
- support for queries that satisfy the needs of WikiGrok, cf. Extension:MobileFrontend/WikiGrok/Claim_suggestions
- handle high request volumes (horizontal scaling)
- handle a large data set (sharding)
- robust: automatic handling of node failures, cross-datacenter replication, proven in production
- reasonable operational complexity
Candidate solutions
Titan
- Distributed graph database
- Supports online modification (OLTP), so can reflect current state
- Expressive query language (Gremlin); shared with other graph dbs like Neo4j
- Implemented as a thin stateless layer on top of Cassandra or HBase: transparent sharding, replication and fail-over
- async multi-cluster replication can be used for isolation of research clusters, DC fail-over
- Supports relatively rich indexing, including complex indexes using ElasticSearch
- Can gradually convert complex queries into simple(r) ones by propagating information on the graph & adding indexes
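The last point can be sketched in miniature: a transitive query is answered once by traversal, its result is propagated onto the graph as a plain property, and an index over that property turns the complex query into a simple lookup. Everything below (the toy `subclass_of` data, the function and index names) is illustrative Python, not Titan or Wikidata API.

```python
# Sketch: materialize the result of a complex traversal query as a
# simple, indexable property, so later queries become index lookups.
# All data and names here are invented for illustration.
from collections import defaultdict

# Toy graph: item -> direct "subclass of" parents.
subclass_of = {
    "house cat": ["cat"],
    "cat": ["mammal"],
    "mammal": ["animal"],
}

def ancestors(item):
    """The complex query: walk 'subclass of' edges transitively upward."""
    seen = set()
    stack = [item]
    while stack:
        node = stack.pop()
        for parent in subclass_of.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Propagation step: write the transitive result back as plain data and
# index it (ancestor -> set of items).
index = defaultdict(set)
for item in subclass_of:
    for anc in ancestors(item):
        index[anc].add(item)

# The formerly complex query "all items that are, transitively, a kind
# of mammal" is now a single index lookup.
print(sorted(index["mammal"]))  # ['cat', 'house cat']
```

The trade-off is the usual one for materialized results: queries get cheap, but the propagated data must be refreshed when the graph changes (cf. the open question on efficient updates below).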
Magnus' Wikidata Query service
- Custom in-memory graph database implemented in C++
- Relatively expressive, custom query language
- Limited to a single machine
- Current memory usage: 5G RSS
Cayley
- Graph database "inspired by" Freebase and Google's Knowledge Graph
ArangoDB
- Key-value, document and graph DB with replication (asynchronous master-slave) and sharding
- Strong consistency / ACID / transactions
- AQL (ArangoDB Query Language), a declarative query language similar to SQL; other querying options are also available
- Supports relatively rich indexing
- Packages for Debian available in repository on website
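To give a flavor of AQL's SQL-like declarative style, here is a minimal sketch of a query string as a client might send it; the `cities` collection and its attributes are hypothetical and unrelated to Wikidata's schema.

```python
# A sketch of an AQL (ArangoDB Query Language) query, written as the
# string a client driver would send. Collection and attribute names
# are hypothetical.
aql_query = """
FOR city IN cities
    FILTER city.population > 100000
    SORT city.population DESC
    LIMIT 10
    RETURN { name: city.name, population: city.population }
"""
# With a driver such as python-arango this string would be passed to
# the server for execution; here we only illustrate the shape.
print(aql_query.strip().splitlines()[0])  # FOR city IN cities
```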
Other possible candidates
- Virtuoso cluster
- 4store
Open questions
- Paging of large result sets
- Handling of cycles in the graph
- How to index the graph for efficient common query use cases
- Efficient updates for materialized complex query results
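As a starting point for the cycle question above: a transitive "sub-taxons of x" style traversal stays terminating on cyclic data as long as it tracks visited nodes. The graph below, including its deliberate cycle, is invented for illustration.

```python
# Sketch: cycle-safe transitive traversal. A naive recursive walk
# would loop forever on the cat -> animal edge below; tracking visited
# nodes makes the query terminate. Data is invented for illustration.
from collections import deque

# parent -> children ("sub-taxon of" in inverse direction)
children = {
    "animal": ["mammal", "bird"],
    "mammal": ["cat"],
    "cat": ["animal"],  # deliberate cycle
}

def sub_taxons(root):
    """All nodes reachable from root, breadth-first, cycle-safe."""
    seen = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Because of the cycle, the root itself is reachable from itself here.
print(sorted(sub_taxons("animal")))  # ['animal', 'bird', 'cat', 'mammal']
```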
Candidate Evaluation
| | Titan | OrientDB | ArangoDB |
|---|---|---|---|
| Future Proofing | | | |
| Automatic sharding for large data set? | Yes, fully automatic at the Cassandra layer through the DHT structure. Can add one node at a time. | Manual shard setup and assignment, but automatic write balancing across shards. In the community edition, any change to the cluster config, including adding nodes or rebalancing shards, requires a complete cluster restart. [1] | Sharding built in. |
| Multi-cluster replication? | Yes | No | |
| Single point of failure? | No. Symmetric DHT design. | | |
| Large-scale production use? | Yes | Yes | |
| Active community? | Mailing list had half as many emails as Elasticsearch's in October. | Mailing list had half as many emails as Elasticsearch's in October. | Mailing list had one tenth as many emails as Elasticsearch's in October. |
| Usage | | | |
| Do we like the query language? | Gremlin (maybe SPARQL) | Modified SQL (maybe Gremlin if we really want) | |
| Can we customize the query language? | Monkey patching, front-end service | Gremlin supports monkey patching. Not sure about SQL. Can add a front-end service. | |
| Indexing capabilities | Hash, range and Elasticsearch | Hash, range and Lucene | |
| Query support (use cases) | | | |
| Items by multiple property values | Yes (has) | | |
| Recursive collections (sub-taxons of x) | | | |
| Follow edges in inverse direction | Yes (in, out) | | |
| Top n | Yes (order+loop or []) | | |
| Range | With Elasticsearch | | |
| 2D range, geo | With Elasticsearch | | |
| Prefix match | Yes (Text.PREFIX) | | |
| Query based on a custom set (birth places of 5000 people) | | | |
| Filtering/intersecting a custom set | Yes (retain) | | |
| Matching against a transitive property (instance of human, including subclasses) | | | |
| "Countries sorted by GDP" | Yes (but GDP doesn't seem to be in Wikidata props) | | |
| "People born in a city with more than 100k inhabitants" | Yes | | |
| "Largest 10 cities in Europe that have a female mayor" | Yes | | |
| Distinct values of a property | Yes (dedup) | | |
| max/min/sum/count/avg | Yes (sideEffect) | | |
| Intersection/union/difference | | | |
| Join based on property value | | | |
| Dev | | | |
| Do we have the expertise to hack on it? | Java and Groovy (Nik) | Java and maybe Groovy (Nik) | C++ (Max) |
| Is upstream excited to have us? | Yes | Yes | ?? (Nik sent email) |
| Does contributing require a CLA? | | | |
| OPS | | | |
| How do we manage housekeeping? [please define] | | | |
| Can we extract useful metrics for monitoring and alerting? | Standard Cassandra & front-end service instrumentation. | | |
| How do we recover from a crash? | Entire cluster: automatic failover to the other cluster (DC); can restore from backups of immutable LSMs. | | |
| What happens if a node goes down? | Remove node from LVS pool. | | |
| What happens if a backend storage node goes down? | Automatic rebalancing between remaining nodes in the cluster. | No backend nodes | |
| How do we back its data up? | Normal Cassandra disk backup of immutable SSTable files after a snapshot. | | |
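The "Paging of large result sets" open question above can be approached with cursor-style paging: each page returns an opaque cursor (here simply the last key) so a client can resume without the server rescanning earlier results. The data and key scheme below are invented for illustration.

```python
# Sketch: cursor-based paging over a sorted result set, as opposed to
# offset-based paging. Data and key scheme are invented.
results = [f"Q{i}" for i in range(1, 26)]  # pretend sorted result set

def page(after=None, limit=10):
    """Return (items, next_cursor); next_cursor is None on the last page."""
    start = results.index(after) + 1 if after is not None else 0
    items = results[start:start + limit]
    next_cursor = items[-1] if start + limit < len(results) else None
    return items, next_cursor

first, cur = page()
second, cur = page(after=cur)
print(first[0], first[-1], second[0])  # Q1 Q10 Q11
```

A real service would encode the cursor opaquely and sign or expire it, but the resumption idea is the same.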
See also
- Meeting notes November 13th 2014
- Complex queries in the Wikidata development plan
- Paper: A Performance Evaluation of Open Source Graph Databases -- includes Titan, Giraph and others
- Scaling Giraph to a trillion edges @Facebook
- Feature comparison table of several graph DBs
- Paper: NoSQL Databases for RDF: An Empirical Evaluation
- Quora thread discussing graph dbs vs. triple stores
Not candidates
- Neo4j (replication only in "Enterprise Edition")
- Offline / batch (OLAP) systems like Giraph (we need the capability to keep the index up to date)
- RDF databases (Wikidata's data model is richer than RDF, so there would likely be some loss of precision)
- Wikidata Query service: single node, no sharding, no replication
- Cayley: no serious production use