Wikibase/Indexing/Benchmarks

From mediawiki.org

Titan benchmarks

Measured on einsteinium with an external Cassandra cluster.

Shorter lookups

These are short lookups that must be fast.

Checking random element without fetching property

w.measure(10000) { def a = g.V('wikibaseId','Q'+(random.nextInt(10000000) as String)).hasNext(); }

[18816, 13342, 15188, 12626, 12289]

Average: 14452.2

Time: 1.44522 ms
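
The figures in each section appear to follow one pattern: the bracketed list holds the total wall-clock milliseconds of five runs, "Average" is their mean, and "Time" is that mean divided by the iteration count. The sketch below reproduces the arithmetic. Both `measure` and `perIteration` are hypothetical stand-ins written for illustration; they are not the `w.measure` helper used on this page, whose implementation is not shown here.

```groovy
// Hypothetical stand-in for w.measure (an assumption, not the real helper):
// runs `body` `iterations` times, repeats that `runs` times, and returns
// the total wall-clock milliseconds of each run.
def measure(int iterations, Closure body, int runs = 5) {
    (1..runs).collect {
        long start = System.currentTimeMillis()
        iterations.times { body() }
        System.currentTimeMillis() - start
    }
}

// Mean of the per-run totals, divided by the iteration count,
// gives the time per iteration in milliseconds.
def perIteration(List totals, int iterations) {
    (totals.sum() / totals.size()) / iterations
}

// Totals reported for the first benchmark above (10000 iterations per run).
def totals = [18816, 13342, 15188, 12626, 12289]
assert totals.sum() / totals.size() == 14452.2
assert perIteration(totals, 10000) == 1.44522
```

Under this reading, a run with fewer iterations yields a proportionally larger per-iteration time for the same total.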

Checking random element

w.benchmark { 10000.times { def a = g.V('wikibaseId','Q'+(random.nextInt(10000000) as String)).labelEn.hasNext(); } }

[39330, 28555, 30037, 27755, 35049]

Average: 32145.2

Time: 3.21452 ms

Checking fixed node

This mostly measures cache performance.

w.measure(10000) { a = g.V('wikibaseId', 'Q30').labelEn.hasNext() }

[10889, 9779, 8969, 8930, 9467]

Average: 9606.8

Time: 0.96068 ms

Checking supernode

This also mostly measures cache performance, but for a supernode with a very large number of incoming edges.

w.measure(10000)  { def a = g.V('wikibaseId', 'Q5').labelEn.next(); } 

[9611, 8339, 8174, 8360, 8815]

Average: 8659.8

Time: 0.86598 ms

Checking supernode out - first human

Navigating a "wide" link out of a supernode.

w.measure(100) { def a = g.V('wikibaseId', 'Q5').in("P31")[0].next(); }

[8689, 7015, 7194, 8082, 8515]

Average: 7899

Time: 0.7899 ms

Random human

This may stretch the cache a little more, but should still be cacheable.

w.measure(10000) { def a = g.V('wikibaseId', 'Q5').in("P31")[random.nextInt(10000)].next(); }

[21395, 21192, 21288, 20017, 21699]

Average: 21118.2

Time: 2.11182 ms

Random human with name, bigger spread

This is probably outside the current cache size. Also, [] probably does a linear scan, so performance degrades roughly quadratically, as expected.

w.measure(100) { def a = g.V('wikibaseId', 'Q5').in("P31")[random.nextInt(100000)].labelEn.next(); }

[27543, 24389, 24191, 23185, 26852]

Average: 25232

Time: 252.32 ms
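
The linear-scan guess can be illustrated in plain Groovy, independent of Titan: reaching element n of a lazy iterator means pulling and discarding everything before it. `CountingSource` and `elementAt` are hypothetical names made up for this sketch, and whether the Gremlin [] step really works this way remains an assumption.

```groovy
// Hypothetical lazy source that counts how many elements were pulled from it.
class CountingSource implements Iterator<Integer> {
    int pulled = 0
    boolean hasNext() { true }
    Integer next() { pulled++ }   // returns 0, 1, 2, ... and counts each pull
    void remove() { throw new UnsupportedOperationException() }
}

// Fetch the element at `index` the only way a plain iterator allows:
// pull and discard everything before it.
def elementAt(Iterator iter, int index) {
    index.times { iter.next() }
    iter.next()
}

def source = new CountingSource()
assert elementAt(source, 99999) == 99999
assert source.pulled == 100000   // 100000 pulls to produce a single result
```

With a random index drawn from a range ten times wider, the expected number of discarded elements per lookup also grows tenfold, before any cache effects are counted.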

Random human with name - cached

def a = g.listOf('Q5')[0].next()

Check if random entry is a human - non-cached

This uses the "out" link to Q5.

w.measure(1000) { def a = g.V('wikibaseId', 'Q'+(random.nextInt(10000000) as String)).out("P31").has('wikibaseId', 'Q5').hasNext(); }

[6509, 3882, 4626, 4165, 3371]

Average: 4510.6

Time: 4.5106 ms

Check if random entry is a human - cached

This uses the "link" property on the vertex itself. Surprisingly, there is not much difference.

w.measure(10000) { def a = g.V('wikibaseId', 'Q'+(random.nextInt(10000000) as String)).has('P31link', CONTAINS, 'Q5').hasNext(); }

[54131, 52634, 43485, 41180, 44011]

Average: 47088.2

Time: 4.70882 ms

Check if random entry is human and not disambiguation

Simplistic approach, just going by out links:

w.measure(1000) { def a = g.V('wikibaseId', 'Q'+(random.nextInt(10000000) as String)).as('x').out("P31").has('wikibaseId', 'Q5').back('x').filter{!it.out('P31').has('wikibaseId', 'Q4167410').hasNext()}.hasNext(); }

[9069, 7610, 5076, 4825, 6499]

Average: 6615.8

Time: 6.6158 ms

More sophisticated condition handling using the link property:

w.measure(1000) { def a = g.V('wikibaseId', 'Q'+(random.nextInt(10000000) as String)).filter{'Q5' in it.P31link && !('Q4167410' in it.P31link);}.hasNext(); }

[4489, 3696, 3677, 3597, 3480]

Average: 3787.8

Time: 3.7878 ms
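
The link-property condition is an ordinary Groovy membership test, so its logic can be checked without a graph. In the sketch below, plain maps stand in for vertices and the sample entries are invented for illustration; only the closure body mirrors the filter used in the benchmark.

```groovy
// Hypothetical vertices: plain maps standing in for graph vertices.
// P31link holds the "instance of" targets copied onto the vertex itself.
def vertices = [
    [wikibaseId: 'X1', P31link: ['Q5']],              // human
    [wikibaseId: 'X2', P31link: ['Q515']],            // not a human
    [wikibaseId: 'X3', P31link: ['Q5', 'Q4167410']]   // human, but also a disambiguation page
]

// Same condition as the filter closure in the benchmark.
def humanNotDisambiguation = { v ->
    'Q5' in v.P31link && !('Q4167410' in v.P31link)
}

def matches = vertices.findAll(humanNotDisambiguation)*.wikibaseId
assert matches == ['X1']
```

Both membership tests read a property already loaded with the vertex, which may be why this variant avoids the extra edge traversals of the simplistic approach.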

Collect 1000 non-empty names

Using link property:

w.measure(1000) {t = []; g.V('P31link', 'Q5').labelEn.filter{it != null}[0..1000].aggregate(t).iterate(); assert t.size() == 1001;}

[29682, 29685, 31022, 30879, 28966]

Average: 30046.8

Time: 30.0468 ms

Using the "in" edge. Now there is a big difference:

w.measure(100) {t = []; g.V('wikibaseId', 'Q5').in('P31').labelEn.filter{it != null}[0..1000].aggregate(t).iterate(); assert t.size() == 1001;}

[13203, 11387, 11429, 11385, 11359]

Average: 11752.6

Time: 117.526 ms
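
Both queries assert 1001 results rather than 1000 because the range 0..1000 is inclusive at both ends. The sketch below shows the same behaviour on a plain Groovy list; `names` is a made-up stand-in for the label stream, and whether the Gremlin range step accepts the half-open form is not something this page establishes.

```groovy
// Hypothetical stand-in for the stream of labels.
def names = (1..5000).collect { "name$it" }

assert (0..1000).size() == 1001          // inclusive range
assert names[0..1000].size() == 1001     // same size the benchmarks assert
assert names[0..<1000].size() == 1000    // half-open range: exactly 1000
assert names[0..999].size() == 1000      // equivalent inclusive form
```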

Find country

This would be heavily cached.

w.measure(1000) { def a = g.V('wikibaseId', 'Q1013639').toCountry().labelEn.next(); }

[2905, 2625, 2504, 2358, 2436]

Average: 2565.6

Time: 2.5656 ms

Find country of random neighborhood

This one may have less luck with caching.

w.measure(100) { def a = g.listOf('Q123705').shuffle()[0].toCountry().labelEn.hasNext(); }

[17432, 17212, 16752, 16681, 16310]

Average: 16877.4

Time: 168.774 ms

Check if random neighborhood is in Finland?

w.measure(100) { g.listOf('Q123705').shuffle()[0].toCountry().has('wikibaseId', 'Q33').hasNext(); }

[17707, 17807, 17310, 17461, 18288]

Average: 17714.6

Time: 177.146 ms

Longer list queries

These may generate long lists and are expected to be slower.

List of countries by population

The list is small, so it is most probably cacheable.

w.measure(100) { t= []; g.listOf('Q6256').as('c').groupBy{it}{it.claimValues('P1082').preferred().latest()}.cap.scatter.filter{it.value.size()>0}.transform{it.value = it.value.P1082value.collect{it?it as int:0}.max(); it}.order{it.b.value <=> it.a.value}.transform{[it.key.wikibaseId, it.key.labelEn, it.value]}.aggregate(t).iterate(); } 

[2885, 2838, 2811, 2803, 2776]

Average: 2822.6

Time: 28.226 ms

List of all occupations

This is probably cached too.

w.measure(100) { t = []; g.wd('Q28640').treeIn('P279').instances().dedup().aggregate(t).iterate(); assert t.size() == 2777}

[4647, 4530, 4593, 4549, 4479]

Average: 4559.6

Time: 45.596 ms

List of potential nationalities

WDQ produces 571815 results.

g.listOf('Q5').as('humans').claimValues('P569').filter{it.P569value != 'somevalue' && it.P569value > Date.parse('yyyy', '1750')}
   .back('humans').claimVertices('P19').toCountry().as('countries').select(['humans', 'countries']){it.labelEn}{it.labelEn}

List of humans having occupation writer but not author

This one has 36K+ entries and takes a lot of time. There may be a more efficient way to write the same query.

w.benchmark { g.V.has('P106link', 'Q36180').filter{'Q5' in it.P31link && !('Q482980' in it.P106link)}.dump("authors", "wikibaseId", "labelEn") }
 
 w.benchmark { t = []; g.V.has('P106link', 'Q36180').as('w').has('P106link', 'Q482980').aggregate(t).optional('w').except(t).dump("authors", "wikibaseId", "labelEn") }
 

86.017s
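
The two queries express the same "writer but not author" condition in different ways: the first filters with a negated membership test, the second collects the writers who are also authors and subtracts them. Note that only the first query additionally checks for Q5. A plain-Groovy sketch of the two shapes, using invented sample maps in place of vertices:

```groovy
// Hypothetical sample: maps standing in for vertices with a P106link (occupation) list.
def people = [
    [wikibaseId: 'A', P106link: ['Q36180']],              // writer only
    [wikibaseId: 'B', P106link: ['Q36180', 'Q482980']],   // writer and author
    [wikibaseId: 'C', P106link: ['Q482980']]              // author only
]

def writers = people.findAll { 'Q36180' in it.P106link }

// Variant 1: one filter with a negated membership test.
def viaFilter = writers.findAll { !('Q482980' in it.P106link) }

// Variant 2: collect the writers who are also authors, then subtract them.
def alsoAuthors = writers.findAll { 'Q482980' in it.P106link }
def viaExcept = writers - alsoAuthors

assert viaFilter*.wikibaseId == ['A']
assert viaExcept*.wikibaseId == ['A']
```

On in-memory data the two variants are equivalent; which one is cheaper in the graph depends on how the backend evaluates each step, which this sketch does not model.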

List of humans with no date of death

WDQ produces 14431 results.

w.benchmark { g.listOf('Q5').as('humans').claimValues('P569').filter{it.P569value && it.P569value < Date.parse('yyyy', '1880')}.back('humans').filter{!it.out('P570').hasNext()}.dump("undead", "wikibaseId", "labelEn"); }

4763.817 s

Too slow; this probably needs a value index.
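
As a rough illustration of what a value index would buy, the sketch below compares a full scan with a lookup in a sorted map keyed by birth year. This is plain Groovy with made-up data and a `TreeMap` standing in for the index; it does not use or describe Titan's actual indexing API.

```groovy
// Hypothetical data: item id -> birth year. Ids and values are made up.
def birthYear = [h1: 1850, h2: 1920, h3: 1790, h4: 1879, h5: 1880]

// Without an index: every entry has to be tested.
def scanned = birthYear.findAll { id, year -> year < 1880 }.keySet() as Set

// With a sorted value index (year -> ids), built once up front.
def index = new TreeMap<Integer, List<String>>()
birthYear.each { id, year -> index.get(year, []) << id }

// headMap(1880) returns only the keys strictly below 1880,
// without visiting the entries at or above that value.
def indexed = index.headMap(1880).values().flatten() as Set

assert scanned == indexed
assert indexed == ['h1', 'h3', 'h4'] as Set
```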