Wikibase/Indexing/Prototype
This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date. The Wikidata query service is likely what you are looking for.
Source
https://git.wikimedia.org/summary/wikidata%2Fgremlin
Usage
The instructions below may be out of date for Titan 0.9 and will be updated soon.
- Build with mvn install.
- Copy runit.sh and runit.groovy to the Titan directory.
- Prepare config.properties with the Titan configuration.
- Start with sh runit.sh.
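The config.properties step can be sketched as a small shell helper. This is only an illustration: the function name write_titan_config is hypothetical, and the backends and host addresses are assumed values for a single-machine Cassandra/Elasticsearch setup, not the settings used by the prototype.

```shell
#!/bin/sh
# Write a minimal Titan configuration file.
# All values below are illustrative assumptions; adjust them to your setup.
write_titan_config() {
    target=${1:-config.properties}
    cat > "$target" <<'EOF'
# Storage backend (assumed: a local Cassandra node)
storage.backend=cassandra
storage.hostname=127.0.0.1
# Relax consistency checks during bulk import
storage.batch-loading=true
# Search index backend (assumed: a local Elasticsearch node)
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1
EOF
}
```

Calling write_titan_config with no argument writes config.properties in the current directory; pass a path to write it elsewhere.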
Loading data
- Call dataLoader.preload() to initialize sitelink/language properties.
- Separate the properties list: gunzip -c dump | grep '{"id":"P' > props.json. The result should be around 1380 lines.
- Start the console as described above, then run propLoader.file("props.json").load(10000) to load the properties. The argument of load() should be greater than the number of lines in props.json.
- Load data with dataLoader.gzipFile("dump").load(1000000); this loads 1M lines from the dump.
The processed line count is reported in the file processed.dump.1.0 (the actual filename depends on the parameters).
The step of setting batch-loading may be eliminated in the future.
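Because the argument of load() has to exceed the number of lines in props.json, it can help to count them when separating the properties. A minimal sketch, assuming a gzipped dump with one entity per line; the function name extract_props is hypothetical:

```shell
#!/bin/sh
# Extract property entities (ids starting with "P") from a gzipped dump
# and print how many lines were written to the output file.
extract_props() {
    dump=$1
    out=${2:-props.json}
    gunzip -c "$dump" | grep '{"id":"P' > "$out"
    # wc pads the count with spaces on some systems; tr strips the padding
    wc -l < "$out" | tr -d ' '
}
```

For example, n=$(extract_props dump) stores the line count in n, and any value larger than n can then be passed to load() in the console.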
Loader API
The loader class has the following useful methods:
- file(String) / gzipFile(String) - set the source file
- setNum(int) - set the start line in the dump
- failOnError(bool) - if true, loading fails immediately on an exception; otherwise it proceeds and the failing line is written to the rejects file.
- recover() - reset the start line to the last one recorded in the processed file for this run (use the same setNum parameter!)
- load(int) - load the number of lines given as the argument
All methods can be chained except for load().
Benchmarking
Benchmarking can be done with w.benchmark { closure }, which reports the raw running time in ms, and w.measure(times) { closure }, which runs the closure the given number of times for 5 sessions and calculates the average.
Rexster Setup (works for 0.5)
- Copy rexster-wikidata.xml from the repo to $TITAN/conf.
- Copy rexster-init.groovy from the repo to $TITAN/rexhome.
- Edit rexster-wikidata.xml to match your Cassandra/ES setup and parameters.
- Copy titan-wikidata.sh from the repo to $TITAN/bin/.
- Copy or link wikidata-gremlin-0.0.1-SNAPSHOT.jar to the $TITAN/ext directory.
- Run $TITAN/bin/titan-wikidata.sh start to start the server.
- Run $TITAN/bin/rexster-console.sh to connect to the console.
- The logs are in $TITAN/log/rexstitan.log.
- The graph can be instantiated as rexster.getGraph('wikidata') or just gg().
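The copy steps above could be scripted roughly as follows. This is a sketch under assumptions: the function name install_rexster_files is hypothetical, the three Rexster files are assumed to sit in the root of the repo checkout, and the jar is assumed to be in Maven's default target/ directory. Editing rexster-wikidata.xml for your own setup remains a manual step.

```shell
#!/bin/sh
# Copy the Rexster-related files from a repo checkout into a Titan install.
# Assumed layout: the three files are in the repo root, the jar in target/.
install_rexster_files() {
    repo=$1
    titan=$2
    mkdir -p "$titan/conf" "$titan/rexhome" "$titan/bin" "$titan/ext" &&
    cp "$repo/rexster-wikidata.xml" "$titan/conf/" &&
    cp "$repo/rexster-init.groovy" "$titan/rexhome/" &&
    cp "$repo/titan-wikidata.sh" "$titan/bin/" &&
    chmod +x "$titan/bin/titan-wikidata.sh" &&
    cp "$repo/target/wikidata-gremlin-0.0.1-SNAPSHOT.jar" "$titan/ext/"
}
```

A typical call would be install_rexster_files . "$TITAN" from inside the checkout. The function stops at the first failing step and returns a non-zero status.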
Changes/modifications on einsteinium
These are the changes I had to make on einsteinium relative to the default config:
- sudo apt-get install unzip
- sudo apt-get install openjdk-7-jdk
- sudo apt-get install traceroute
- sudo apt-get install groovy
- Set write_request_timeout_in_ms in cassandra.yaml to 5 s (5000 ms)
- grape install org.codehaus.groovy groovy-backports-compat23 2.3.7
- Copy groovy-backports-compat23 to titan/lib
- Set up the proxy:
export HTTP_PROXY=http://webproxy.eqiad.wmnet:8080
export HTTPS_PROXY=http://webproxy.eqiad.wmnet:8080
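The Cassandra timeout change can also be applied from the shell. A minimal sketch, assuming the setting is already present as an uncommented top-level key in cassandra.yaml; the function name set_write_timeout is hypothetical, and the location of cassandra.yaml depends on the installation:

```shell
#!/bin/sh
# Set write_request_timeout_in_ms to 5000 (5 s) in a cassandra.yaml file.
# Lines that do not start with the key are left unchanged.
set_write_timeout() {
    yaml=$1
    sed 's/^write_request_timeout_in_ms:.*/write_request_timeout_in_ms: 5000/' "$yaml" > "$yaml.tmp" &&
    mv "$yaml.tmp" "$yaml"
}
```

The file is rewritten through a temporary copy rather than with sed -i, whose syntax differs between GNU and BSD sed. Note that the replaced file may end up with different ownership or permissions than the original.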