Spark Config, monitoring and tuning - salmanbaig8/imp GitHub Wiki
3 locations for config:
- Spark properties: conf/spark-defaults.conf
- env vars: conf/spark-env.sh
- logging: conf/log4j.properties
Override the default conf dir (SPARK_HOME/conf) by setting SPARK_CONF_DIR; place these files there:
- spark-defaults.conf
- spark-env.sh
- log4j.properties
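A minimal sketch of the first location, conf/spark-defaults.conf (the property names are real Spark properties; the values are illustrative, not recommendations):

```properties
# conf/spark-defaults.conf -- whitespace-separated key/value pairs
spark.master            yarn
spark.executor.memory   2g
spark.eventLog.enabled  true
```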
The Spark shell can be verbose. To view ERRORs only, change the INFO level to ERROR in log4j.properties.
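A sketch of that change, assuming the Log4j 1.x properties format used by older Spark releases (copy conf/log4j.properties.template to conf/log4j.properties first):

```properties
# conf/log4j.properties -- raise the root level so only ERRORs reach the console
# (the template ships with: log4j.rootCategory=INFO, console)
log4j.rootCategory=ERROR, console
```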
Spark cluster overview. Components:
- Driver (SparkContext)
- Cluster manager: Standalone / Apache Mesos / Hadoop YARN
- Executors (reside within worker nodes)
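The choice of cluster manager shows up in the `--master` flag of spark-submit; a sketch, where the host, class name, and jar are placeholders:

```shell
# Standalone master vs. YARN -- pick one --master value per cluster manager
spark-submit --master spark://host:7077 --class com.example.App app.jar
spark-submit --master yarn --deploy-mode cluster --class com.example.App app.jar
```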
Spark monitoring:
- Web UI (port 4040), available for the duration of the app. Contains the following info:
  - a list of scheduler stages and tasks
  - a summary of RDD sizes and memory usage
  - env info
  - info about running executors
  To view history on Mesos/YARN, configure the history server: memory allocated, JVM options, the public address for the server, and various other properties.
- Metrics (based on the Coda Hale Metrics library); report to a variety of sinks (HTTP, JMX, CSV).
- External instruments: Ganglia, OS profiling tools (dstat, iostat, iotop), JVM utilities (jstack, jmap, jstat, jconsole).
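A sketch of the metrics setup in conf/metrics.properties (the sink classes are real Spark classes; the period and directory values are illustrative):

```properties
# conf/metrics.properties -- report every instance's metrics to CSV and JMX sinks
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
```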
Spark tuning:
- Data serialization:
  - Java serialization
  - Kryo serialization
- Memory tuning:
  - amount of memory used by the objects
  - cost of accessing those objects
  - overhead of GC (garbage collection)
- Level of parallelism
- Memory usage of reduce tasks
- Serialized size of each task (printed on the master)
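A sketch tying the serialization and parallelism knobs together in conf/spark-defaults.conf (the property names are real Spark properties; the values are illustrative for a small cluster):

```properties
# Switch from Java serialization to Kryo, and set a default parallelism level
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max  64m
spark.default.parallelism        8
```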