Performance Testing Guidelines

Environment setup

An entire environment can be set up from scratch with oae-provisioning, slapchop and fabric:

fab ulous:performance

This should create all the machines, run the puppet scripts and give you a working environment.

Once your environment is up and running, it's a good idea to stop the puppet service on all the machines for the duration of the data load: the load is particularly stressful on the application, and stopping puppet avoids additional background noise. From the puppet master machine run:

mco service puppet stop
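Once the data load has finished, puppet can be started back up the same way:

mco service puppet start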

Environment sanity-checks

Before you start loading data and running tests, it's usually a good idea to see if the environment is well balanced. A quick and easy way to check that is to ssh into the machines and run a sysbench test.

To run sysbench on a group of servers of the same class such as app servers, you can do something like the following from the puppet master:

mco package sysbench install

# The following runs the command "sysbench --test=cpu..." (after the -- ) on hosts (-H) app0, app1, app2 and app3 in parallel (-P)
fab -H app0,app1,app2,app3 -P -- sysbench --test=cpu --cpu-max-prime=20000 run

Each execution should result in a response like the following:

sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000


Test execution summary:
    total time:                          33.4556s
    total number of events:              10000
    total time taken by event execution: 33.4531
    per-request statistics:
         min:                                  2.83ms
         avg:                                  3.35ms
         max:                                 19.62ms
         approx.  95 percentile:               4.65ms

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   33.4531/0.00

The "important" number here is the avg execution time. That should be similar across the various groups. For example, a decent number on the app/activity nodes is around 30. Joyent can't always guarantee a solid distribution of nodes, so if a couple of them are off by a factor of 2 it's a good idea to trash those machines and fire up new ones. This can be done with slapchop/fabric like so:

slapchop -d performance destroy -i app0 -i app2 -i activity1
fab "provision_machines:performance,app0;app2;activity1"

Generating/loading the data

If you're starting from a fresh environment you need to generate and load data; data can be generated with OAE-model-loader. If you've already done a dataload and are following this guide, you can restore a backup in Cassandra and start from there. Ensure you have all the user pictures, group pictures, ...

Generating:

nohup node generate.js -b 10 -t oae -u 1000 -g 2000 -c 5000 -d 5000 &
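As the generator runs in the background under nohup, tail its output to follow progress and spot errors early:

tail -f nohup.out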

Loading:

  • Create a tenant with the oae alias
  • Disable reCaptcha on the tenant (or globally)
  • Disable the activity servers, as the activities that get generated by the model loader would kill the db/activity-cache servers. On the puppet machine you can run: mco service hilary stop -I activity0 -I activity1 -I activity2
  • Start the dataload: nohup node loaddata.js -h http://oae.oae-performance.oaeproject.org -b 10 -s 0 -c 2

It's important that the dataload ends without any errors. If some users, groups or content failed to be created, you will end up with a bunch of 400s in the tsung tests, which makes the results hard to read and interpret.
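A quick way to scan the loader output for failures before moving on (assuming it was started with nohup as above):

grep -icE 'error|fail' nohup.out    # anything other than 0 deserves a closer look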

Now that your data is in Cassandra, it's a good idea to take a backup of it so it can be restored for the next test.

Taking a cassandra backup

Take a snapshot on all db nodes:

fab -H db0,db1,db2 -P -- nodetool -h localhost -p 7199 snapshot oae

For more info see http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_backup_takes_snapshot_t.html
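To verify the snapshots were written on each node (the data directory path is an assumption based on the cleanup commands further down this page):

fab -H db0,db1,db2 -P -- find /data/cassandra -type d -name snapshots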

Restoring a cassandra backup

TODO: See http://www.datastax.com/docs/1.0/operations/backup_restore for now
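Until this section is written up properly, the usual flow from the DataStax docs linked above is roughly the following, run on each db node. The <cf> and <snapshot-name> placeholders and the data directory layout are assumptions, so adapt them to your setup:

service dse stop
# Clear the commitlog so stale mutations don't replay over the restored data
rm -rf /var/lib/cassandra/commitlog/*
# Remove the live SSTables for the column family (the snapshots subdirectory is untouched)
rm -f /data/cassandra/oae/<cf>/*.db
# Copy the snapshotted SSTables back into the live column family directory
cp /data/cassandra/oae/<cf>/snapshots/<snapshot-name>/* /data/cassandra/oae/<cf>/
service dse start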

Pre-tests

  • Stop the app/activity/pp servers from the puppet master: mco service hilary stop -W oaeservice::hilary

RabbitMQ:

  • Reset RabbitMQ from mq0: rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app

Clean the ActivityStreams CF in Cassandra:

  • Truncate the ActivityStreams CF:
cqlsh << EOF 
use oae;
truncate ActivityStreams;
EOF
  • Stop Cassandra: fab -H db0,db1,db2 -P -- service dse stop
  • Trash the commitlogs and saved caches: fab -H db0,db1,db2 -P -- rm -rf /var/lib/cassandra/* (note the trailing *; removing the entire /var/lib/cassandra directory would cause permission issues when starting Cassandra back up)
  • Start Cassandra: fab -H db0,db1,db2 -P -- service dse start

Redis data:

  • fab -H cache0,activity-cache0 -P -- redis-cli flushdb

Restart:

  • Start a (single!) app server: mco service hilary start -I app0 (check the logs on app0 to verify everything started up fine). We start a single app server first because each app server tries to create its queues on startup, and starting them all at once can lead to concurrency issues.
  • Start all the app servers: mco service hilary start -W oaeservice::hilary
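Once everything is back up, a quick smoke test against the tenant created during the dataload confirms the app servers are answering (you would expect a 200 here):

curl -s -o /dev/null -w "%{http_code}\n" http://oae.oae-performance.oaeproject.org/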

General tips

Use Munin

Ensure munin is working in your test; it gives valuable OS performance information for identifying bottlenecks. If Tsung's munin integration fails on even one node you will get 0 munin stats :( So tail your performance test for a bit to ensure you're getting OS stats, and investigate the issue if you are not.
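One way to spot-check is to grep the stats log for OS-level entries once the test is underway (the exact stat names can vary between Tsung versions, so treat this as a rough check):

grep -m 5 -E 'cpu|freemem|load' tsung.log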

Cleaning cassandra data

From the puppet master node:

mco service dse stop -W oaeservice::hilary::dse

On each db node

rm -rf /data/cassandra/*
rm -rf /var/lib/cassandra/*
rm -rf /var/log/cassandra/*

# Ensure that the cassandra user has r/w access on all those directories
chown cassandra:cassandra /data/cassandra
chown cassandra:cassandra /var/lib/cassandra
chown cassandra:cassandra /var/log/cassandra

Then start them back up one-by-one

service dse start
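Before starting the next node, it's worth confirming the one you just started has rejoined the ring (same nodetool invocation as in the backup section):

nodetool -h localhost -p 7199 ring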

Might be best to restart opscenterd (from the monitor machine)

service opscenterd restart

Generating a tsung test

Once your data has been generated/loaded you can generate a tsung test with node-oae-tsung which should be at /opt/node-oae-tsung.

node main.js -s /opt/OAE-model-loader/scripts -b 10 -o ./name-of-feature-you-are-testing -a answers/baseline.json

That should give you a directory at /opt/node-oae-tsung/name-of-feature-you-are-testing with the tsung.xml and the properly formatted csv files that tsung can use.

The baseline.json file contains (among other settings) the arrival rates that will be used in tsung. Generally, we use two setups. When you're trying to find the breaking point of the application, it's usually a good idea to ramp up in waves, i.e.:

  • arrival rate of 5 users/s for 10 minutes
  • arrival rate of 0.1 users/s for 10 minutes (cool down phase)
  • arrival rate of 5.5 users/s for 10 minutes
  • arrival rate of 0.1 users/s for 10 minutes (cool down phase)
  • arrival rate of 6 users/s for 10 minutes
  • arrival rate of 0.1 users/s for 10 minutes (cool down phase)
  • arrival rate of 6.5 users/s for 10 minutes
  • arrival rate of 0.1 users/s for 10 minutes (cool down phase) ...

The cool down phases are there to allow the system some time to catch up with users from the previous phase.

If you're trying to compare a feature against master (baseline), it might be easier to have one big phase that stretches over 2 hours with a constant arrival rate:

  • arrival rate of 5 users/s for 2 hours

The answers file (answers/baseline.json in the example above) also specifies which hosts should be monitored with munin.

Running a tsung test

cd /opt/node-oae-tsung/name-of-feature-you-are-testing
nohup tsung -f tsung.xml -l tsung start &
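While the test is running, tsung can report the current phase and user counts:

tsung status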

Generating tsung reports

watch -n 120 /usr/lib/tsung/bin/tsung_stats.pl --stats tsung.log
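tsung_stats.pl writes its graphs and HTML report into the current working directory, so run the above from inside the log directory for this run (tsung creates a timestamped subdirectory under the directory passed to -l). To browse the resulting report.html from your workstation, a throwaway HTTP server does the job (Python 2 syntax, matching the vintage of this stack):

python -m SimpleHTTPServer 8080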