User:GWicke/Notes/Storage/Cassandra testing
Testing Cassandra as a backend for the Rashomon storage service. See also User:GWicke/Notes/Storage, Requests for comment/Storage service.
- cerium 10.64.16.147
- praseodymium 10.64.16.149
- xenon 10.64.0.200
Cassandra docs (we are testing 2.0.1, now 2.0.2 with the latest changes):
Setup
Cassandra node setup
apt-get install cassandra openjdk-7-jdk libjna-java libjemalloc1
Set up /etc/cassandra/cassandra.yaml according to the docs. The main things to change (see the example snippet after this list):
- listen_address, rpc_address
- set to external IP of this node
- seed_provider / seeds
- set to list of other cluster node IPs: "10.64.16.147,10.64.16.149,10.64.0.200"
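For example, on xenon (10.64.0.200) the relevant cassandra.yaml entries would look roughly like this (a sketch; everything else stays at its default):
listen_address: 10.64.0.200
rpc_address: 10.64.0.200
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.64.16.147,10.64.16.149,10.64.0.200"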
(Re)start Cassandra: service cassandra restart
The command nodetool status should return information and show your node (and the other nodes) as being up. Example output:
root@xenon:~# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load      Tokens  Owns   Host ID                               Rack
UN  10.64.16.149  91.4 KB   256     33.4%  c72025f6-8ad8-4ab6-b989-1ce2f4b8f665  rack1
UN  10.64.0.200   30.94 KB  256     32.8%  48821b0f-f378-41a7-90b1-b5cfb358addb  rack1
UN  10.64.16.147  58.75 KB  256     33.8%  a9b2ac1c-c09b-4f46-95f9-4cb639bb9eca  rack1
Rashomon setup
The Cassandra bindings used need Node.js 0.10. For Ubuntu Precise LTS, we need to do some extra work [1]:
apt-get install python-software-properties python g++ make
add-apt-repository ppa:chris-lea/node.js
apt-get update
apt-get install build-essential nodejs   # this Ubuntu package also includes npm and nodejs-dev
On Debian unstable, we'd just do apt-get install nodejs npm and get the latest node, including security fixes, rather than the old Ubuntu PPA package.
Now onwards to the actual rashomon setup:
# temporary proxy setup for testing
npm config set https-proxy http://brewster.wikimedia.org:8080
npm config set proxy http://brewster.wikimedia.org:8080
cd /var/lib
https_proxy=brewster.wikimedia.org:8080 git clone https://github.com/gwicke/rashomon.git
cd rashomon
# will package node_modules later
npm install
cp contrib/upstart/rashomon.conf /etc/init/rashomon.conf
adduser --system --no-create-home rashomon
service rashomon start
Create the revision tables (on one node only):
cqlsh < cassandra-revisions.cql
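The actual table definitions live in cassandra-revisions.cql in the rashomon repository. As a rough illustration only (a hypothetical sketch, not the real schema), a revision store keyed by title could look like:
-- hypothetical sketch; see cassandra-revisions.cql for the real definitions
CREATE KEYSPACE testwiki
  WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 3 };
CREATE TABLE testwiki.revisions (
  title text,
  rev int,
  tid timeuuid,
  body text,
  PRIMARY KEY (title, rev, tid)
) WITH CLUSTERING ORDER BY (rev DESC, tid DESC);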
Cassandra issues
- With the default settings and without working jna (see install instructions above), cassandra on one node ran out of heap space during a large compaction. The resulting state was inconsistent enough that it would not restart cleanly. The quick fix was wiping the data on that replica and re-joining the cluster.
- This was caused by multi-threaded compaction, which is scheduled for removal in 2.1. Moving back to the default setting and reducing the number of concurrent writes a bit eliminated this problem. Tweaking the GC settings (see below) also helped.
- Stopping and restarting the cassandra service with service cassandra stop did not work. Faidon tracked this down to a missing '$' in the init script: [2]. Fixed in 2.0.2.
- Compaction was fairly slow for a write benchmark. Changed compaction_throughput_mb_per_sec: 16 to compaction_throughput_mb_per_sec: 48 in cassandra.yaml (see the snippet after this list). Compaction is also niced and single-threaded, so during high load it will use less disk bandwidth than this upper limit. See [3] for background.
- Not relevant for our current use case, but good to double-check if we wanted to start using CAS: bugs in the 2.0.0 Paxos implementation. The relevant bugs [4][5][6] seem to be fixed in 2.0.1.
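The throughput cap can also be raised on a running node without a restart, which is handy while a test is running (value in MB/s; the runtime override is not persisted, so cassandra.yaml still needs the edit):
# cassandra.yaml: persistent setting
compaction_throughput_mb_per_sec: 48
# temporary runtime override on one node
nodetool setcompactionthroughput 48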
Tests
Dump import, 600 writers
Six writer processes working on one of these dumps ([7][8][9][10][11][12]) with up to 100 concurrent requests each. Rashomon uses write consistency level quorum for these writes, so two out of three nodes need to ack. The Cassandra commit log is placed on an SSD, data files on rotating metal RAID1.
6537159 revisions in 42130s (155/s); total size 85081864773
6375223 revisions in 42040s (151/s); total size 84317436542
6679729 revisions in 39042s (171/s); total size 87759806169
5666555 revisions in 32704s (173/s); total size 79429599007
5407901 revisions in 32832s (164/s); total size 72518858048
6375236 revisions in 37758s (168/s); total size 84318152281
==============================================================
37041803 revisions total, 493425716820 total bytes (459.5G)
879/s, 11.1MB/s

du -sS on revisions table, right after test: 162 / 153 / 120 G (avg 31.5% of raw text)
du -sS on revisions table, after some compaction activity: 85G (18.4% of raw text)
du -sS on revisions table, after full compaction: 73.7G (16% of raw text)
- clients, rashomon and cassandra on the same machine
- clients and cassandra CPU-bound, rashomon using little CPU
- basically no IO wait time despite data on spinning disks. Compaction is throttled too much for heavy writes, but wait time stayed low even with a higher max compaction bandwidth cap. In a pure write workload all reads and writes are sequential. Cassandra also uses posix_fadvise for read-ahead and page cache optimization.
Write test 2: Dump import, 300 writers
Basically the same setup, except:
- clients on separate machine
- Cassandra 2.0.2
- additional revision index maintained, which allows revision retrieval by oldid
- better error handling in Rashomon
- client connects to random Rashomon, and Rashomon uses set of Cassandra backends (instead of just localhost)
With the default heap size, one node ran out of heap about 2/3 of the way through the test. Eventually a second node suffered the same, which made all remaining saves fail as there was no more quorum.
A similar failure happened before in preliminary testing. Setting the heap size to 1/2 of the RAM (7G instead of 4G on these machines) fixed this in the follow-up test, the same way it did before the first write test run.
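The heap is configured in /etc/cassandra/cassandra-env.sh; a sketch of the override used here (the stock script expects MAX_HEAP_SIZE and HEAP_NEWSIZE to be set together; the new-generation value shown is the one used in the later tests):
# /etc/cassandra/cassandra-env.sh
MAX_HEAP_SIZE="7G"
HEAP_NEWSIZE="1000M"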
Write test 3: Dump import, 300 writers
Same setup as in Write test 2, except the heap limit was increased from the 4G default to 7G.
5407950 revisions in 46007s (117.5/s); Total size: 72521960786; 36847 retries
5666609 revisions in 52532s (107.8/s); Total size: 79431029856; 41791 retries
6375283 revisions in 67059s (95.0/s); Total size: 84318276123; 38453 retries
6537194 revisions in 64481s (101.3/s); Total size: 85084097888; 41694 retries
6679780 revisions in 60408s (110.5/s); Total size: 87759962590; 43422 retries
6008332 revisions in 50715s (118.4/s); Total size: 64537467290; 39078 retries
=========================================
648/s, 7.5MB/s
0.65% requests needed to be retried after timeout
441.12G total

After test:
85G on-disk for revisions table (19.3%)
2.2G on-disk for idx_revisions_by_revid table
With improved error reporting and handling in the client, these numbers should be more reliable than those from the first test. The secondary index adds another query for each eventually consistent batch action, which slightly reduces the number of revision inserts per second. The higher compaction throughput also performs more of the compaction work up front during the test, and results in significantly smaller disk usage right after the test.
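For illustration, each revision write then becomes a batch across the content table and the new index table, roughly along these lines (hypothetical CQL based on the sketch schema above, not Rashomon's actual statements):
-- hypothetical sketch of the two-table write; idx_revisions_by_revid maps revid -> title
BEGIN BATCH
  INSERT INTO testwiki.revisions (title, rev, tid, body)
    VALUES ('Foo', 12345, now(), '...wikitext...');
  INSERT INTO testwiki.idx_revisions_by_revid (revid, title)
    VALUES (12345, 'Foo');
APPLY BATCH;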
Write test 4-6: Heap size vs. timeouts, GC time and out-of-heap errors
I repeated the write tests a few more times and got one more out-of-heap error on a node. Increasing the heap to 8G had the effect of increasing the number of timeouts to about 90k per writer. The Cassandra documentation mentions 8G as the upper limit for reasonable GC pause times, so it seems that the bulk of those timeouts are related to GC.
In a follow-up test, I reset the heap to the default value (4.3G on these machines) and lowered memtable_total_space_in_mb from the 1/3-of-heap default to 1200M to avoid out-of-heap errors under heavy write load despite a relatively small heap. Cassandra will flush the largest memtable when this much memory is used. Settings tried, and the outcome of each:
- memtable_total_space_in_mb = 1200: out of heap space
- memtable_total_space_in_mb = 800: better, but still many timeouts
- default memtable_total_space_in_mb, memtable_flush_writers = 4 and memtable_flush_queue_size = 10: fast, but not tested on a full run
- memtable_total_space_in_mb = 900, memtable_flush_writers = 4 and memtable_flush_queue_size = 10: out of heap space
- memtable_total_space_in_mb = 800, memtable_flush_writers = 4 and memtable_flush_queue_size = 10: no crash, but a slowdown once the db gets larger, without much IO
- memtable_total_space_in_mb = 700, memtable_flush_writers = 4 and memtable_flush_queue_size = 14: the larger flush queue seems to counteract the slowdown, and flushing more often should reduce heap pressure even further. Still slowed down with heavy GC activity, very likely related to compaction and the pure-JS deflate implementation.
- reduced concurrent_compactors from the default (number of cores) to 4: ran out of heap
- increased heap to 10G, reduced concurrent_compactors to 2, and disabled the GC threshold so that collection is not started only when the heap is 75% full: lowering the number of concurrent compactors is good, but removing the GC threshold actually made it worse, since a full collection then happens only close to the maximum heap size, which makes it more likely to run out of memory.
Flush memory tables often and quickly to avoid timeouts and memory pressure
- use multiple flusher threads: memtable_flush_writers = 4; see [13]
- increase the queue length to absorb bursts: memtable_flush_queue_size = 14
- can play with the queue length for load shedding
- seems to be sensitive to the number of tables: more tables -> more likely to hit the limit
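Putting the flush-related settings from this section together with the memtable cap that survived the full runs above, the cassandra.yaml excerpt looks roughly like:
# cassandra.yaml excerpt (flush tuning discussed above)
memtable_total_space_in_mb: 700
memtable_flush_writers: 4
memtable_flush_queue_size: 14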
GC tuning for short pauses and OOM error avoidance despite heavy write load
After a lot of trial and error, the following strategy seems to work well:
- during a heavy write test that has been running for a few hours, follow the OU column in jstat -gc <cassandra pid> output (see the example command after this list). Check memory usage after a major collection. This is the base heap usage.
- in heavy write tests with small memtables (600M) this was between 1 and 2G
- size MAX_HEAP_SIZE to ~5x the base heap usage for headroom, but not more, to preserve RAM for the page cache
- set CMSInitiatingOccupancyFraction to something around 45 (the 75% default is too close to the max -> OOM with heavy writes). This marks the high point where a full collection is triggered. It should be ~2-3x the size after a major collection. Starting too late makes it hard to really collect all garbage, as CMS is inexact in the name of low pause times and misses some garbage. Starting early keeps the heap to traverse small, which limits pause times.
- size HEAP_NEWSIZE to ~2/3 the base heap use
- for more thorough young generation collection, increase MaxTenuringThreshold from 1 to something between 1 and 15, for example 10. This means that something in the young generation (HEAP_NEWSIZE) needs to survive 10 minor collections to make it into the old generation.
All these settings are in /etc/cassandra/cassandra-env.sh.
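For reference, one way to watch the base heap usage described above (the pgrep expression is just one assumed way to find the Cassandra JVM's pid):
# sample GC counters every 10 seconds;
# OU = old-generation usage in KB, FGC/FGCT = full collection count/time
jstat -gc $(pgrep -f CassandraDaemon) 10s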
Write test 12 (or so)
Settings in /etc/cassandra/cassandra-env.sh:
MAX_HEAP_SIZE = "10G" # can be closer to 6 HEAP_NEWSIZE = "1000M" # about 3.2G, needs to be adjusted to keep 3.2G point # similar when MAX_HEAP_SIZE changes CMSInitiatingOccupancyFraction = 30 MaxTenuringThreshold = 10
Results:
5407948 revisions in 29838s (181.24/s); 72521789775 bytes; 263 retries
5666608 revisions in 31051s (182.49/s); 79431082810 bytes; 295 retries
6375272 revisions in 33578s (189.86/s); 84318196084 bytes; 281 retries
6537196 revisions in 34220s (191.03/s); 85084340595 bytes; 295 retries
6679784 revisions in 34597s (193.07/s); 87759936033 bytes; 284 retries
6008332 revisions in 30350s (197.96/s); 64537469847 bytes; 288 retries
============================
1133 revisions/s
- This also includes a successful concurrent repair (nodetool repair) at high load, after changing settings on all nodes not too long after the test start.
- The retries were all triggered by 'Socket reset' on the test client; there don't seem to be any server timeouts any more. Need to investigate the reason for these.
Write test 13-15
[edit]- MAX_HEAP_SIZE="7G" (reduced from 10)
- HEAP_NEWSIZE="1000M"
Overwrote the db from test 12 a few times. Similar results as in test 12, no issues. Compaction was a bit slow, possibly related to a compaction order bug fixed just after the 2.0.3 release. It eventually succeeded with a manual nodetool compact.
Write test 16
Updated to debs built from git cf83c81d8, and waited for a full compaction before restarting the writes. Mainly interested in compaction behavior.
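To force the full compaction and watch its progress, something like:
nodetool compact              # optionally restrict to <keyspace> <table>
nodetool compactionstats      # pending tasks and progress of active compactions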
Random reads
Goal: simulate a read-heavy workload for revisions (similar to ExternalStore), and verify the writes from the previous test.
- Access a random title, but most likely the newest revision
- verify the md5
The random read workload will be much more IO-heavy. There should be noticeable differences between data on SSDs vs. rotating disks.
Mix of a few writes and random reads
Perform about 50 revision writes per second, and see how many concurrent reads can still be sustained at acceptable latency. This is the closest approximation to the actual production workload. Mainly looking for the impact of writes on read latency.