Spark Config, monitoring and tuning - salmanbaig8/imp GitHub Wiki

Three locations for config:

  • Spark properties: conf/spark-defaults.conf
  • Env vars: conf/spark-env.sh
  • Logging: conf/log4j.properties
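A sketch of what the three locations look like (all values below are illustrative, not recommendations):

```
# conf/spark-defaults.conf — Spark properties, whitespace-separated key/value pairs
spark.master           spark://master:7077
spark.executor.memory  2g

# conf/spark-env.sh — environment variables, sourced as a shell script
# export SPARK_WORKER_CORES=4

# conf/log4j.properties — logging configuration
# log4j.rootCategory=INFO, console
```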

Override the default conf dir (SPARK_HOME/conf) by setting SPARK_CONF_DIR; that directory should contain:

  • spark-defaults.conf
  • spark-env.sh
  • log4j.properties
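A minimal sketch (the path is a hypothetical example):

```shell
# Point Spark at an alternate config directory instead of SPARK_HOME/conf
export SPARK_CONF_DIR=/etc/spark/conf
echo "$SPARK_CONF_DIR"   # prints /etc/spark/conf
```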

The Spark shell can be verbose. To view ERRORs only, change the INFO value to ERROR in log4j.properties.
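In conf/log4j.properties the root logger line controls verbosity; the appender name below follows the default template:

```
# Show only ERROR-level messages in the Spark shell
log4j.rootCategory=ERROR, console
```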

Spark cluster overview. Components:

  • Driver (hosts the SparkContext)
  • Cluster manager (Standalone, Apache Mesos, or Hadoop YARN)
  • Executors (reside within worker nodes)
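To make the roles concrete, a hypothetical spark-submit invocation (app jar and flag values are placeholders, shown as a command sketch rather than something runnable here):

```shell
# --master selects the cluster manager: yarn, spark://host:7077, or mesos://host:5050
# --deploy-mode cluster runs the driver (SparkContext) inside the cluster
# executors are launched on the worker nodes
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  myapp.jar
```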

Spark Monitoring:

  1. Web UI (port 4040), available for the duration of the app. Contains the following info:
     -A list of scheduler stages and tasks
     -A summary of RDD sizes and memory usage
     -Environment info
     -Info about running executors
     To view history on Mesos/YARN, configure the history server: memory allocated, JVM options, the public address for the server, and various other properties.
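A hedged sketch of history-server settings, split across the two config files above (the log directory path is a placeholder):

```
# conf/spark-defaults.conf — have apps write event logs the history server can read
spark.eventLog.enabled         true
spark.eventLog.dir             hdfs:///spark-logs
spark.history.fs.logDirectory  hdfs:///spark-logs

# conf/spark-env.sh — memory and JVM options for the history server daemon
# export SPARK_DAEMON_MEMORY=1g
# export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080"
```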

  2. Metrics (based on the Coda Hale Metrics library), reported to a variety of sinks (HTTP, JMX, CSV).
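Sinks are wired up in conf/metrics.properties; for example, a CSV sink (directory and period are illustrative):

```
# conf/metrics.properties — report metrics from all instances to a CSV sink
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics
```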

  3. External instruments: Ganglia, OS profiling tools (dstat, iostat, iotop), JVM utilities (jstack, jmap, jstat, jconsole).
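A sketch of the JVM utilities pointed at an executor process (the PID is hypothetical; on a worker node you would find it with `jps` — not runnable as-is):

```shell
PID=12345                   # hypothetical executor PID
jstack $PID                 # thread dump: where is time being spent?
jmap -histo $PID | head     # heap histogram: which objects dominate memory?
jstat -gcutil $PID 1000 5   # GC utilization, sampled every 1s, 5 times
```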

Spark tuning:

  • Data serialization
    -Java serialization
    -Kryo serialization
  • Memory tuning
    -Amount of memory used by the objects
    -Cost of accessing those objects
    -Overhead of garbage collection (GC)
  • Level of parallelism
  • Memory usage of reduce tasks
  • Serialized size of each task (printed on the master)
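The tuning knobs above map onto Spark properties; a hedged spark-defaults.conf sketch (values are illustrative, not recommendations):

```
# Data serialization: Kryo is typically faster and more compact than Java serialization
spark.serializer           org.apache.spark.serializer.KryoSerializer
# Level of parallelism: default number of partitions for shuffle operations
spark.default.parallelism  200
# Memory tuning: fraction of the heap shared by execution and storage
spark.memory.fraction      0.6
```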