Wikibase/Indexing
Appearance
< Wikibase
This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date. The Wikidata query service is likely what you are looking for. |
Wikibase indexing
Get a WDQ-like thing into production!
|
Task tracking: wikidata-query-service in Phabricator; watch the project to receive notifications
Goals / Requirements
[edit]End of March
[edit]- Publicly downloadable prototype (query and import and update)
- Labs install (details hazy, same prototype)
- Hardware requested
- 12 month roadmap
Phase 1: MVP (Before end of June)
[edit]- Support use-cases of WikiGrok, remain flexible in architecture to eventually support external requests/third parties.
Must have:
[edit]- query console for experimentation
As a user, I want to see a WikiGrok question immediately on the mobile site, as soon as I load an article that has any active WikGrok campaigns, so that I can respond quickly and keep getting more questions in real time, based on the ones I've already answered.
Requirements:
- available through an API
- available through server-side requests, connecting through PHP
- for any query, result output in JSON (XML would be nice to have)
- results may sometimes come in the form of lists (e.g., List of all possible occupations) and sometimes in the form of a single item
- Simple (single-item) queries: (e.g., "is this item X and not Y?")
- generate live and run quickly – as fast as possible – to serve immediately to users via WikiGrok (and potentially continue serving more results on the fly after user input)
- Complex queries: (e.g., list of all possible suggested occupations)
- pre-generate results and store in a table or cache, so these queries can run longer (but still within some reasonable timeframe, e.g. 1 hour)
- regenerate as often as practical/possible
- Simple (single-item) queries: (e.g., "is this item X and not Y?")
Phase 2: Support for public/external requests
[edit]- ideally, public web service
- external requests return within a few seconds, use reasonable resources
- how to enforce that constraint needs to be determined and influences the architecture
- internal requests are allowed to use more resources & time
- these need to not crash external requests and external cannot crush internal
- external requests return within a few seconds, use reasonable resources
- needs to support continuous updates to reflect latest Wikidata state
- Seconds or even a minute or two lag seems acceptable at this point but nothing beyond that.
- handle high request volumes (horizontal scaling)
- handle a large data set (sharding)
- robust: automatic handling of node failures, cross-datacenter replication, proven in production
- reasonable operational complexity
Candidate solutions
[edit]- Distributed graph database
- Supports online modification (OLTP), so can reflect current state
- Expressive query language (Gremlin); shared with other graph dbs like Neo4j
- Implemented as a thin stateless layer on top of Cassandra or HBase: transparent sharding, replication and fail-over
- async multi-cluster replication can be used for isolation of research clusters, DC fail-over
- Supports relatively rich indexing, including complex indexes using ElasticSearch
- Can gradually convert complex queries into simple(r) ones by propagating information on the graph & adding indexes
- TinkerPop blueprints support, including Gremlin and the GraphSail RDF interface
Magnus' Wikidata Query service
[edit]- Custom in-memory graph database implemented in C++
- Relatively expressive, custom query language
- Limited to a single machine
- Current memory usage: 5G RSS
- Load balanced systems are available
- The initial dump conversion is extremely slow, so if a server process crashes and its dump gets corrupt you face a prolonged outage. Even if we optimize it by an order of magnitude, it will still be slow.
- Server startup time is not nice either, and will only grow with the growth of the dataset. While not critical e.g. for Redis that we keep running for months, the less mature nature of WDQ and the need to cater for development/future bugs will mean a lot of restarts, each being a PITA.
- You can't run in production the DB query used to retrieve latest changes is, so this part will have to be redone completely.
- And this update routine results in a possible race condition making it miss some changes.
- Each entity's properties are retrieved with a separate uncacheable HTTP request to Special:EntityData which isn't very fast so as the rate of changes increases, WDQ will bump into it hard, not being able to cope with updates.
- Thread synchronization model needs to be totally redone, as currently used spinlocks aren't scalable.
- Some code paths in the update code are missing a synchronization which makes one wonder how often does it crash and how much often will it crash under a production load. And if you add synchronization there, locking is probably going to bring everything to a halt.
- With a crapload of very small objects in the same heap, a long-running server process would have problems with heap fragmentation/memory management performance, requiring custom memory management, etc.
- Scalability would be very primitive: as you add more machines they start making more requests, etc.
- Would need some manual solutions for approximate clustering, failover, etc.
- Would require in-house maintenance.
- Apache 2 License
- Distributed graph database
- Supports online modification (OLTP)
- Supports Gremlin, as well as it's own SQL-like query language with graph features and no JOINs
- Drop-in Lucene plugin for geospatial indexes (also full-text, not that we would use that)
- Replication is multi-master and works via Hazelcast, read/write quorums, and merging conflicts
- Isolation in ACID drops a bit when distributed (intermediate results may show briefly)
- Isolation is SERIALIZABLE for direct FS access case, READ COMMITTED for remote access (what we would use)
- Since I'd like to input data from hub feeds (starting with dumps), async replication isn't really an issue; both DC would just use the same process to pull in updates
- Actually OrientDB 2.0 supports async replication via config [1]
- Supports automatic round-robin sharding as well as application directed sharding (specifying clusters for class item insertions and reads, reads default to checking all partitions)
- Supports various indexes (SBTree,hash, both unique or not unique) and primitive as well as embedded data structures (sets/lists/map) that can be indexed
- Queries on the JSON itself are also possible regardless of nesting levels
- Supports Tinkerpop blueprints
- Good query timeout support via TIMEOUT
- Apache 2 License
- Multi-model database: key/value, document and graph DB with a query language combining all three models
- Complex queries including joins between multiple collections
- ACID: transactions spanning multiple documents and collections
- distributed: replication (asynchronous master-slave) and sharding
- AQL (ArangoDB Query Language), a declarative query language similar in spirit to SQL, designed with multi-model in mind and allowing joins. Has also other querying options.
- HTTP API for RESTful interfaces
- API extensible by user-defined JavaScript code (V8 embedded) via the Foxx framework
- Supports relatively rich indexing (skip-list, hash, geo, full-text)
- Extensive reference documentation in gitbook format
- Convenient web front end integrated
- Packages for Debian, other Linux distributions, Mac OSX and Windows available in repository on website
- Virtual machine images and docker container available
- Very modern code base (C++11) with many unit tests
- Good community and professional support
- Few open issues on github, new ones are quickly dealt with
- Regular major releases every two to three months
- Concerns
- GPL & AGPL
- Some capabilities may be enterprise-only
- Dynamic schema, supports fulltext indexes and probably geospatial with plugin, also looks like pluggable indexes are possible
- Cypher SQL-like query language, TP3 support
- Labels may be used to make efficient lookups for yes/no properties (see in the blog)
- Has query planner/profiler
- Good HA support, no sharding though
- Lucene indexes
- Concerns
- Mixed indexes support - i.e. is it possible to use 2 indexes together?
- Edge indexes only old-style - either manual-fill or pre-configured autoindexer which needs to be configured each time anew
- Not clear how we handle geodata
- No ability to store arbitrary JSON/blob for non-indexed secondary lookup
- Multi-valued properties support - TP3 implementation uses node-per-value
- What we're missing by running non-Enterprise version (if anything)
- How sharding-less replication works in practice on our data size?
- Questions
- Are edge properties first-class citizens re: indexing?
- Looking up property values - more efficient as edge properties or vertex properties? I.e. are claims edges or vertices?
- Is Enterprise version just AGPL or has some other requirements?
Other possible candidates
[edit]- "Big Data" - GPLv2, Tinkerpop 2.5
- Graphx - Apache License, Alpha, Uses Spark
- Virtuoso cluster
- 4store
- Neo4j - GPL and AGPL - Tinkerpop 3.0 reference implementation
- Build one directly against Elasticsearch - Horrible/provocative idea - We currently have the expertise but totally against most of our goals – OTOH we were resigned to having to hack on Titan's Elasticsearch integration anyway
Open questions
[edit]- Paging of large result sets
- Handling of cycles in the graph
- How to index the graph for efficient common query use cases
- Efficient updates for materialized complex query results
Candidate Evaluation
[edit]Titan | OrientDB | ArangoDB | Neo4J | |
---|---|---|---|---|
Future Proofing | ||||
Automatic sharding for large data set? | Yes, fully automatic at Cassandra layer through DHT structure. Can add one node at a time. | Manual shard setup and assignment, but automatic write balancing across shards. In the community edition, any changes to the cluster config including adding nodes or rebalancing shards require a complete cluster restart. [3] | In Development. [4] Currently the sharding setup is manual, but the data distribution is automatic. Zero administration and integration with Apache mesos clusters is being worked on. | Replication - yes, sharding - no |
Multi-cluster replication? | Yes | No | No | ? |
Single point of failure? | No. Storage backend is Cassandra which doesn't have a single point of contention, much less failure. | No | No. Automatic failover is being worked on. Currently asynchronous replication is possible but requires manual setup. | No |
Large-scale production use? | Yes - Not clear what large number of Titan servers is. Lots of documentation on very large Cassandra clusters. Netflix has blogger about ~285 node Cassandra clusters pushing a million writes a second. We won't get near that hardware. | Yes - five or six server clusters | Not clear what "large-scale" means, but customers use clusters in production. | AFAIK Yes, not sure of the size but read about bllion-nodes-sized setups. |
Active community? | Mailing list had one half the emails of Elasticsearch in October. | Mailing list had one half the emails of Elasticsearch in October. | Yes, github, google group and stackoverflow. | stackoverflow, mailing list, github |
Usage | ||||
Do we like the query language? | Gremlin (maybe SPARQL) | Modified SQL (maybe Gremlin if we really want) | AQL, graph traversal | Cypher, TP3 |
Can we customize the query language? | Monkey patching, front-end service | front-end service | Yes (Foxx) | Plugins |
Indexing capabilties | Hash and Elasticsearch | Hash, range and Lucene | hash, skiplist, geo, fulltext (one per collection), river for elastic search | |
Parallel request scheduling (Are different parallel queries sufficiently well (fair?) scheduled so that resource intensive and low latency interactive ones can run on the same cluster?) | You'd have to do this by running queries on different machines and sharing the same storage backends. Inside a single machine queries are just have threads doing lazy evaluation and Java threads. | Incoming HTTP requests are scheduled in a fair way. Numbers of server threads can be configured. If not all threads are busy with long running requests, then scheduling is fair. | ||
Query time or cpu limit | Timeout added in very newest version. Its possible to write degenerate queries the circumvent the timeout. | Good, via TIMEOUT | Possible to cancel queries externally | |
Query memory limit | Done using process/machine isolation. The storage layer is shared but that isn't memory issue. | No | No | |
Query support (use cases) | ||||
items by multiple property values | Yes (has) | Yes | Yes | |
recursive collections (sub-taxons of x) | Yes(loop, out) | Yes | ? | |
Follow edges in inverse direction | Yes (in, out) | Yes | Yes | |
top n | Yes (order+loop or []) | Yes | Yes | |
range | With Elasticsearch | Yes; includes SB-Trees | Yes | |
2d range, geo | With Elasticsearch | Yes | Yes | |
prefix match | Yes (Text.PREFIX) | Yes; SB-Trees | Yes | |
query based on custom set (birth places of 5000 people) | Yes (retain) | Yes? | Yes | |
Filtering/intersection a custom set | Yes (retain) | Yes? | Yes | |
Matching against a transitive property (instance of human, including subclasses) | Yes(loop, out) | Yes; using TRAVERSE or Gremlin | ? | |
"Countries sorted by GDP" | Yes (but GDP doesn't seem to be in wikidata props) | Yes; Composite SB-Tree and edge per prop statemet | Yes | |
"People born in a city with more than 100k inhabitants" | Yes | Yes | Yes | |
"largest 10 cities in Europe that have a female mayor" | Yes | Yes; subquery or JSON filtering in WHERE for example | Yes | |
Distinct values of a property | Yes (dedup) | Yes; SB-Tree or hash index | Yes | |
max/min/sum/count/avg | Yes (sideEffect) | Yes | Yes | |
intersection/union/difference | Union - yes (copySplit & merge), intersection/difference - probably yes (retain/except) but may have limitations | Yes; via SQL UNION/WHERE/subqueries, maybe Gremlin too | Yes | |
Join based on property value | Trivial for edges, may not be efficient for properties without specific index | Yes | Yes | |
Support for subproperties: "people who have <subproperty of location> with Newcastle" | There's no concept of subproperties as such, but properties can be hashmaps and filters can use code to match against them | Yes; nested JSON can be queried all the way down | Yes | |
Dev | ||||
Do we have the expertise to hack on it? | Java and Groovy | Java and maybe Groovy | C++/JS | |
Is upstream excited to have us? | Yes | Yes!!!! | Yes, definitely! | |
Does contributing require a CLA? | Looks like no | Yes, copied and pasted from Apache | Yes, see [5] | |
OPS | ||||
How do we manage housekeeping? [please define] | ||||
Can we extract useful metrics for monitoring and alerting? | Standard Cassandra instrumentation and basic metrics from Titan itself | Yes, statistics are kept and available in the cluster dashboard. | ||
How do we recover from a crash? | Entire cluster: Automatic fail over to other cluster (DC); can restore from backups of immutable LSMs | Whole cluster: Restart cluster and restore from dump. | ||
What happens if a node goes down? | Remove node from LVS pool. | Coordinator node: nothing, DBserver node: Have to manually connect asynchronous replica as new node. Automatic failover is being worked on. | ||
What happens if a backend storage node goes down? | Automatic rebalancing between remaining nodes in cluster. | No backend nodes | Have to manually connect asynchronous replica as new node. Automatic failover is being worked on. | |
How do we back its data up? | Normal Cassandra disk backup of immutable SSTable files after snapshot. | Backup is not in community edition; use JSON dumps + hub updates (the primary data is in MySQL/ExternalStore anyway) | arangodump and arangorestore for cluster are fully supported in the community edition. Asynchronous replication as well. | |
Source (for importing data, sample queries, etc) | https://github.com/smalyshev/wikidata-gremlin/ branch titan_flat | https://github.com/AaronSchulz/WikiDataQueryOrient | https://github.com/triAGENS/ArangoDB-Data |
See also
[edit]- Meeting notes November 13th 2014
- Complex queries in the Wikidata development plan
- Paper: A Performance Evaluation of Open Source Graph Databases -- includes Titan, Giraph and others
- Scaling Giraph to a trillion edges @Facebook
- Feature comparison table of several graph DBs
- Paper: NoSQL Databases for RDF: An Empirical Evaluation
- Quora thread discussing graph dbs vs. triple stores
- Graph data model proposal
- Extension:MobileFrontend/WikiGrok/Claim_suggestions
- Titan Prototype
- Titan Prototype Benchmarks
- Wikidata queries that people would find useful from a year ago
- Wikibase/Indexing/Query examples
- Wikidata Query languages spreadsheet
- The spreadheet of doom
WDQS Beta based on RDF/SPARQL
[edit]- WDQS Beta deployment
- RDF dump format
- SPARQL Query Examples
- WDQ to SPARQL translator
- Berlin SparQL benchmarks
- Paper: Wikidata RDF export
- Updater performance analysis
Not candidates
[edit]- Neo4j (replication only in "Enterprise Edition")
- Offline / batch (OLAP) systems like Giraph (we need the capability to keep the index up to date)
- Wikidata Query service: single node, no sharding, no replication
- Caley: no serious production use
- ArangoDB: looks like it might work but they don't allow upstream contributions. Answer to this: This was never true and is not true, see [5] for the CLA procedure. When they get that problem solved and OrientDB and Titan fall through then we can promote them again.
References
[edit]- ↑ https://github.com/arangodb/elasticsearch-river-arangodb
- ↑ http://comments.gmane.org/gmane.comp.db.arangodb.general/1809
- ↑ http://www.orientechnologies.com/docs/last/orientdb.wiki/Distributed-Sharding.html#hot-management-of-distributed-configuration
- ↑ https://docs.arangodb.com/Sharding/StatusOfImplementation.html
- ↑ 5.0 5.1 https://www.arangodb.com/community