Wikidata Query Service/Implementation
Sources
The source code is in the gerrit project wikidata/query/rdf (GitHub mirror: https://github.com/wikimedia/wikidata-query-rdf). To start working on the Wikidata Query Service codebase, clone this repository or its mirror:
git clone https://github.com/wikimedia/wikidata-query-rdf.git
Then you can build the distribution package by running:
cd wikidata-query-rdf
git submodule update --init --recursive
./mvnw package
and the package will be in the dist/target directory. To run the Blazegraph service from the development environment (e.g. for testing), use:
bash war/runBlazegraph.sh
Add "-d
" option to run it in debug mode. In order to run Updater, use:
bash tools/runUpdate.sh
The build relies on Blazegraph packages that are stored in Wikimedia Archiva; their source is in the wikidata/query/blazegraph gerrit repository. See the instructions on MediaWiki for the case where dependencies need to be rebuilt, and the documentation in the source for further instructions.
The GUI source is in the wikidata/query/gui project, which is a submodule of the main project. The deployment version of the GUI lives in the production branch, which is cherry-picked from the master branch when necessary. The production branch should not contain test and build service files, which currently means some cherry-picks have to be merged manually (a sketch of the cherry-pick flow is shown below).
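A minimal sketch of that cherry-pick flow, assuming a local clone of the gui repository and that the change to ship is already merged on master; the commit hash is a placeholder and the push/review step depends on your gerrit workflow:
# In a clone of wikidata/query/gui: apply a master commit to the production branch.
git fetch origin
git checkout -B production origin/production
git cherry-pick <master-commit-sha>   # resolve conflicts manually if test/build files differ
git review production                 # or push for review however your gerrit setup expects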
Dependencies
The Maven build depends on OpenJDK 8. Because that JDK is on the older side, your distribution might not package it; it can be downloaded from the AdoptOpenJDK website.
WDQS depends on two customized packages that are stored in Wikimedia Archiva: blazegraph and ldfserver. Check the main pom.xml for the current versions.
If changes are required, the new packages have to be built and deployed to Archiva, and then the WDQS binaries rebuilt.
You will need Archiva logins configured for the wikimedia.releases repositories.
Rebuilding Blazegraph
- Use the gerrit repo wikidata/query/blazegraph
- Commit fixes (watch for extra whitespace changes!)
- If you are releasing a private version, set the version and tag, e.g. 2.1.5-wmf.4 (see the sketch after this list)
- Run:
mvn clean; bash scripts/mavenInstall.sh; mvn -f bigdata-war/pom.xml install -DskipTests=true
- Check on a local install that the fixes work
- Run to deploy:
mvn -f pom.xml -P deploy-archiva deploy -P Development; mvn -f bigdata-war/pom.xml -P deploy-archiva deploy -P Development
- After the deploy is done, switch back to a SNAPSHOT version, e.g. 2.1.5-wmf.5-SNAPSHOT
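A minimal sketch of the version bump and tag around those steps, assuming the Maven versions plugin is acceptable for setting the version; the version strings are the examples from the list above:
# Set the private release version across the Blazegraph modules and tag it.
mvn versions:set -DnewVersion=2.1.5-wmf.4 -DgenerateBackupPoms=false
git commit -am "Release 2.1.5-wmf.4"
git tag 2.1.5-wmf.4
# ... build, verify and deploy as described above, then move back to a snapshot version.
mvn versions:set -DnewVersion=2.1.5-wmf.5-SNAPSHOT -DgenerateBackupPoms=false
git commit -am "Back to SNAPSHOT"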
Rebuilding LDFServer
- Check out https://github.com/smalyshev/Server.Java
- Make a new branch and apply the fixes
- Push the new branch to the origin
- Run:
mvn deploy -Pdeploy-archiva
- Bump the version in the WDQS main pom.xml to the new version (see the sketch after this list)
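A minimal sketch of that bump, assuming the ldfserver version is controlled by a property in the main pom.xml; the property name and version below are placeholders, check pom.xml for the actual ones:
# In the wikidata-query-rdf checkout: bump the assumed ldfserver version property.
mvn versions:set-property -Dproperty=ldfserver.version -DnewVersion=1.2.3 -DgenerateBackupPoms=false
git diff pom.xml   # verify that only the intended property changed, then commit and send for review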
Labs Deployment
Note that deployment is currently via git-fat (see below), which may require some manual steps after checkout. This can be done as follows:
- Check out the wikidata/query/deploy repository and update the gui submodule to the current production branch (git submodule update).
- Run git-fat pull to instantiate the binaries if necessary.
- rsync the files to the deploy directory (/srv/wdqs/blazegraph).
Use the role role::wdqs::labs for installing WDQS. You may also want to enable role::labs::lvm::srv to provide adequate disk space in /srv.
Command sequence for manual install:
git clone https://gerrit.wikimedia.org/r/wikidata/query/deploy
cd deploy
git fat init
git fat pull
git submodule init
git submodule update
sudo rsync -av --exclude .git\* --exclude scap --delete . /srv/wdqs/blazegraph
See also Wikidata Query service beta.
Production Deployment
Production deployment is done via the git deployment repository wikidata/query/deploy. The procedure is as follows:
- Run mvn package in the source repository.
- Run mvn deploy -Pdeploy-archiva in the source repository - this deploys the artifacts to Archiva. Note that for this you will need the repositories wikimedia.releases and wikimedia.snapshots configured in ~/.m2/settings.xml with the Archiva username/password (see the sketch after this list).
- Install the new files (which will also be in dist/target/service-*-dist.zip) to the deploy repo above and commit them. Note that since git-fat uses Archiva as primary storage, there can be a delay between files being deployed to Archiva and them appearing on rsync, ready for git-fat deployment.
- Use scap deploy to deploy the new build.
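A minimal sketch of the credentials those deploys expect, assuming standard Maven server entries whose ids match the repository names above; the username/password values are placeholders. The example is written to a scratch file so the server entries can be merged into an existing ~/.m2/settings.xml:
# Example Archiva credentials for Maven; merge the <server> entries into ~/.m2/settings.xml.
cat > /tmp/archiva-servers-example.xml <<'EOF'
<settings>
  <servers>
    <server>
      <id>wikimedia.releases</id>
      <username>ARCHIVA_USER</username>
      <password>ARCHIVA_PASSWORD</password>
    </server>
    <server>
      <id>wikimedia.snapshots</id>
      <username>ARCHIVA_USER</username>
      <password>ARCHIVA_PASSWORD</password>
    </server>
  </servers>
</settings>
EOF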
The puppet role that needs to be enabled for the service is role::wdqs.
It is recommended to test the deployment checkout on beta (see above) before deploying it in production.
GUI deployment
GUI deployment files are in the repository wikidata/query/deploy-gui, branch production. It is a submodule of wikidata/query/deploy, linked as the gui subdirectory.
To build the deployment GUI version, run grunt deploy in the gui subdirectory. This generates a patch for the deploy repo that needs to be merged in gerrit (currently manually). Then update the gui submodule on wikidata/query/deploy to the latest production head, commit/push the change, and deploy as described above (a sketch of the submodule update is shown below).
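A minimal sketch of that submodule bump, assuming a clone of wikidata/query/deploy with the gui submodule initialized; the review step depends on your gerrit workflow:
# In a clone of wikidata/query/deploy: point the gui submodule at the latest production head.
git submodule update --init gui
cd gui
git fetch origin
git checkout origin/production
cd ..
git add gui
git commit -m "Update gui submodule to latest production head"
git review   # or push for review as appropriate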
Services
The wdqs-blazegraph service runs the Blazegraph server.
The wdqs-updater service runs the Updater. It depends on wdqs-blazegraph.
Maintenance mode
To put a server into maintenance mode, create the file /var/lib/nginx/wdqs/maintenance - this makes all HTTP requests return 503, and the load balancer will take the server out of rotation. Note that Icinga monitoring will alert about such a server being down, so take measures to prevent the alert (e.g. scheduling downtime) if you are going to do maintenance on the server.
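A minimal sketch of entering and leaving maintenance mode on a host, using the path from this section:
# Enter maintenance mode: all HTTP requests start returning 503 and the LB depools the host.
sudo touch /var/lib/nginx/wdqs/maintenance
# ... perform the maintenance work ...
# Leave maintenance mode: removing the file puts the host back into rotation.
sudo rm /var/lib/nginx/wdqs/maintenance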
Non-Wikidata deployment
WDQS can be run as a service for any Wikibase instance, not just Wikidata. You can still follow the instructions in this documentation, but you may need some additional configuration. Please refer to the standalone Wikibase documentation for a full description of the necessary steps.
Hardware
We are currently running on three servers in eqiad (wdqs1003, wdqs1004, wdqs1005) and three servers in codfw (wdqs2001, wdqs2002 and wdqs2003). The two clusters are in active/active mode (traffic is sent to both), but due to how we route traffic with GeoDNS, the eqiad cluster sees most of the traffic.
Server specs are similar to the following:
- CPU: dual Intel(R) Xeon(R) CPU E5-2620 v3
- Disk: 800GB of raw RAID SSD space
- RAM: 128GB
The internal cluster has wdqs1006, wdqs1007 and wdqs1008 in eqiad, and wdqs2004, wdqs2005 and wdqs2006 in codfw. The hardware is the same as above.
Releasing to Maven
The release procedure is described here: https://central.sonatype.org/pages/ossrh-guide.html
Releasing new version
- Set the version with mvn versions:set -DnewVersion=1.2.3
- Commit the patch and merge it (git commit/git review). Commit and merge gui first, then the main repo.
- Tag the version: git tag 1.2.3
- Deploy the files to OSS: mvn clean deploy -Prelease. You will need a GPG key configured to sign the release.
- Proceed with the release as described in the OSS guide.
- Set the version back to snapshot: mvn versions:set -DnewVersion=1.2.4-SNAPSHOT
- Commit the patch and merge it (git commit/git review)
Updating specific ID
If the data for specific IDs needs to be updated manually, this can be done as follows (here for IDs Q12345 and Q6790):
runUpdate.sh -n wdq -N -- --ids Q12345,Q6790
The runUpdate.sh script is located in the root of the WDQS deployment directory. Note that each server needs to be updated separately; they do not share databases (see wikitech:Wikidata query service#Manually updating entities for more information). The -N option allows the Updater script to run in parallel with the regular Updater service without a port conflict.
Resetting start time
By default, the Updater will use the timestamp of the last recorded update, or of the dump if no updates have happened yet. Use the -s DATE option to reset the start time. The start time is recorded when the first change is consumed, so if you are dealing with a wiki that does not update often, use the --init option of the Updater to explicitly set this data at startup. A sketch of both options is shown below.
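A minimal sketch of both options, assuming the deployed runUpdate.sh and the wdq namespace used elsewhere on this page; the date value and its format are examples, check the Updater's help output for the exact syntax:
# Restart updating from an explicit point in time; Updater options go after "--" as above.
runUpdate.sh -n wdq -N -- -s 2018-01-01T00:00:00Z --init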
Resetting Kafka offset
If the Updater uses Kafka as the change source, it records Kafka offsets for the latest updates consumed and resumes from them when restarted. To reset these offsets, run this query:
DELETE {
  ?z rdf:first ?head ; rdf:rest ?tail .
}
WHERE {
  [] wikibase:kafka ?list .
  ?list rdf:rest* ?z .
  ?z rdf:first ?head ; rdf:rest ?tail .
} ;
DELETE WHERE {
  <https://www.wikidata.org> wikibase:kafka ?o .
} ;
Replace the Wikidata URL in the last query with your instance's URL if your dataset is not based on Wikidata.
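One way to run that update, assuming a Blazegraph instance on the standard WDQS port 9999 with the wdq namespace (both assumptions, adjust the URL for your setup) and the query saved to a file:
# Post the offset-reset update (saved as reset-kafka.sparql) to the local SPARQL endpoint.
curl -X POST http://localhost:9999/bigdata/namespace/wdq/sparql \
     --data-urlencode update@reset-kafka.sparql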
Units support
For unit conversion support, the unit conversion configuration is stored in mediawiki-config/wmf-config/unitConversionConfig.json.
This config is generated by a script, e.g.:
$ mwscript extensions/Wikibase/repo/maintenance/updateUnits.php --wiki wikidatawiki \
>   --base-unit-types Q223662,Q208469 \
>   --base-uri http://www.wikidata.org/entity/
If the config is changed, then once the new config is in place, the database should be updated (unless a new dump is going to be loaded) by running:
$ mwscript extensions/Wikibase/repo/maintenance/addUnitConversions.php --wiki wikidatawiki --config NEW_CONFIG.json --old-config OLD_CONFIG.json --format nt --output data.nt
This will generate an RDF file which then needs to be loaded into the database (a sketch of one way to load it is shown below).
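A minimal sketch of one way to load the generated file, assuming a Blazegraph instance on the standard WDQS port 9999 with the wdq namespace that accepts N-Triples as text/plain; the URL and content type are assumptions, adjust for your setup:
# Load data.nt (N-Triples) into Blazegraph by POSTing it to the namespace's SPARQL endpoint.
curl -X POST http://localhost:9999/bigdata/namespace/wdq/sparql \
     -H 'Content-Type: text/plain' \
     --data-binary @data.nt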
Monitoring
- Icinga group
- Grafana dashboard: https://grafana.wikimedia.org/d/000000489/wikidata-query-service
- Grafana frontend dashboard: https://grafana.wikimedia.org/d/000000522/wikidata-query-service-frontend
- WDQS dashboard: http://discovery.wmflabs.org/wdqs/
- Prometheus metrics collected: https://github.com/wikimedia/operations-debs-prometheus-blazegraph-exporter/blob/master/prometheus-blazegraph-exporter#L79
Data reload procedure
Please see: wikitech:Wikidata query service#Data reload procedure
Usage constraints
Wikidata Query Service has a public endpoint available at https://query.wikidata.org. As anyone is free to use this endpoint, traffic sees a lot of variability, and thus the performance of the endpoint can vary quite a bit.
Current restrictions are:
- Query timeout of 60 seconds
- One client (user agent + IP) is allowed 60 seconds of processing time each 60 seconds
- One client is allowed 30 error queries per minute
- Clients exceeding the limits above are throttled
Updates to that endpoint are asynchronous, so some lag is expected. Up to 6 hours of lag is considered normal; more than 12 hours of lag indicates a real problem.
We also have an internal endpoint, which serves WMF-internal workloads. The endpoint is at http://wdqs-internal.discovery.wmnet/sparql. The internal endpoint is subject to the following constraints (an example request is sketched after this list):
- 30 second query timeout
- the User-Agent header must be set
- only internal access is allowed
- it must be used only for synchronous user-facing traffic, no batch jobs
- requests are expected to be cheap
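A minimal sketch of a request that respects those constraints, assuming standard SPARQL-over-HTTP GET; the User-Agent value and the query are only examples:
# From inside the WMF network: a cheap query with a User-Agent identifying the caller.
curl -G 'http://wdqs-internal.discovery.wmnet/sparql' \
     -H 'User-Agent: my-service/1.0 (contact@example.org)' \
     -H 'Accept: application/sparql-results+json' \
     --data-urlencode 'query=SELECT * WHERE { wd:Q42 rdfs:label ?label } LIMIT 5'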
The constraints are likely to evolve once we see what the actual use cases are and how the cluster behaves. If you have questions about how or whether to use it, please contact us.
Contacts
If you need more information, talk to User:GLederrey_(WMF) or anybody from the Discovery team.