Notes on Configuration and Tuning - adarob/eXpress-d GitHub Wiki

Spark (and EC2)

Spark configuration and JVM tuning are fully described on the Spark website (see the Spark configuration guide and the Spark/JVM tuning guide). This wiki page explains the subset of the Spark guide most applicable to running eXpress-D, and includes general notes on performance tuning based on our experiences with Amazon EC2.

Garbage Collection (GC) by the Java Virtual Machine (JVM)

Full GC runs can take tens of seconds, with duration roughly proportional to the number of live objects on the heap (i.e., how much of the cache has been used up). For context: a cluster of 60 m2.2xlarge nodes running eXpress-D over a 1-billion-fragment dataset (with bias) averages ~24 seconds per iteration. A full GC pause on even a single slave instance not only increases the latency of that iteration across the entire cluster, but also leaves the other slave processes, which did not pause for GC, underutilized while they wait.
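One way to diagnose and mitigate these pauses is through Spark's JVM options and cache sizing. The snippet below is a hedged sketch: the flags are standard HotSpot GC options, but the exact mechanism for passing them (e.g. `SPARK_JAVA_OPTS` in `spark-env.sh` versus `spark.executor.extraJavaOptions` in `spark-defaults.conf`) depends on your Spark version, so check the Spark configuration guide for the release you are running.

```shell
# Sketch only: enable GC logging to see how long pauses actually take,
# and switch to the concurrent (CMS) collector to shorten stop-the-world
# phases. In older Spark releases these go in SPARK_JAVA_OPTS.
export SPARK_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC"

# Leaving more heap headroom for short-lived objects also reduces full GCs.
# In older Spark releases this is controlled by spark.storage.memoryFraction
# (the share of the heap used for the RDD cache); lowering it from its
# default trades cache capacity for fewer GC pauses, e.g.:
#   spark.storage.memoryFraction  0.5
```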

eXpress-D and Spark Configuration

eXpress-D includes configuration options that let the user specify the sizes of the fragment and target partitions. These options can be set in the express-d/config/config.py file.
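As a rough illustration of how such partition settings might be derived, the sketch below computes partition counts from cluster size. The option names (`NUM_FRAGMENT_PARTITIONS`, `NUM_TARGET_PARTITIONS`) and the cores-per-node figure are hypothetical placeholders, not the actual keys in config.py; consult the shipped config.py for the real names.

```python
# Hypothetical sketch of partition sizing for express-d/config/config.py.
# A common Spark rule of thumb is 2-4 partitions per core in the cluster,
# so that every core stays busy and stragglers can be rebalanced.
CORES_PER_NODE = 4   # assumed core count per m2.2xlarge node
NUM_NODES = 60       # cluster size from the example above

# Hypothetical option names -- check config.py for the actual keys.
NUM_FRAGMENT_PARTITIONS = 2 * CORES_PER_NODE * NUM_NODES  # 480
NUM_TARGET_PARTITIONS = CORES_PER_NODE * NUM_NODES        # 240

print(NUM_FRAGMENT_PARTITIONS, NUM_TARGET_PARTITIONS)
```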

EC2 Cluster Details

Spot instances are "auction-driven" instances with pricing that is (intended to be) based on demand. Pricing-wise, we found spot instances to be much more cost-effective than on-demand instances. When we used instances from the us-east-1c availability zone, spot prices for m2.2xlarge averaged $0.115 per hour, which is almost 9 times cheaper than the on-demand price of $1.00 per hour!