EMR 020 Spark Logs
The following description applies to EMR-5.30.0.
First of all, we need to understand that there are two types of logs: (a) Spark application logs, and (b) Spark job history.
Spark application logs are the YARN container logs for the Spark jobs. They are located in /var/log/hadoop-yarn/containers on the core nodes while the application is running. When the application finishes, these logs are copied to HDFS under /var/log/hadoop-yarn/apps/hadoop/logs/.
On a core node, you can check the logs on the local file system with the following command:
ls /var/log/hadoop-yarn/containers
On the master node, you can check the logs on HDFS with the following command:
hdfs dfs -ls /var/log/hadoop-yarn/apps/hadoop/logs/
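If you want to view the aggregated logs for one particular application, you can also use the YARN CLI on the master node. Replace <application_id> with an actual application ID (you can list them with yarn application -list):
yarn logs -applicationId <application_id>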
On HDFS, the logs are aggregated, so the filenames (and file count) are different from what you see on the local file system. The retention period of these logs is governed by the YARN log aggregation and retention configurations. By default, YARN keeps application logs on HDFS for 48 hours. To change the retention period, you need to change the value of yarn.log-aggregation.retain-seconds in /etc/hadoop/conf/yarn-site.xml. Below is the default configuration on EMR-5.30.1:
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>172800</value>
</property>
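If you launch new clusters, you do not have to edit yarn-site.xml by hand; the same property can be supplied through an EMR configuration classification at cluster creation time. Below is a minimal sketch that assumes you want to keep the aggregated logs for 24 hours (86400 seconds); the value is only an example:
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.log-aggregation.retain-seconds": "86400"
    }
  }
]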
When the above-mentioned application logs are copied to HDFS, they are not deleted from the local file system. Rather, the logs are left on the core nodes, waiting for the EMR LogPusher to push them to S3. The retention period of these logs on the local file system is governed by the EMR LogPusher configuration in /etc/logpusher/hadoop.config; by default, the EMR LogPusher deletes the logs from the local file system after 4 hours. Below is the default configuration on EMR-5.30.1:
"/var/log/hadoop-yarn/containers" : {
"includes" : [ "(.*)" ],
"s3Path" : "containers/$0",
"retentionPeriod" : "4h",
"deleteEmptyDirectories": true,
"logType" : [ "USER_LOG", "SYSTEM_LOG" ]
}
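Once the EMR LogPusher has pushed the container logs to S3, you can find them under the cluster's S3 log URI, following the s3Path pattern above. Assuming a hypothetical log bucket and cluster ID (substitute your own), the listing would look something like this:
aws s3 ls --recursive s3://my-emr-logs/logs/j-XXXXXXXXXXXXX/containers/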
It is important to understand that HDFS uses storage space on the core nodes. As long as the aggregated logs remain on HDFS, you will not see a reduction in disk usage on the core nodes.
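To see how much space these logs are actually taking, you can compare the HDFS usage of the aggregated logs with the local disk usage on a core node, for example:
hdfs dfs -du -s -h /var/log/hadoop-yarn/apps/hadoop/logs/
du -sh /var/log/hadoop-yarn/containers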
Spark job history files are located in /var/log/spark/apps on HDFS. On the master node, you can check the logs using the following command:
hdfs dfs -ls /var/log/spark/apps
The retention period of these logs is governed by spark.history.fs.cleaner.enabled (default false), spark.history.fs.cleaner.interval (default 1 day), and spark.history.fs.cleaner.maxAge (default 7 days). To change the retention period, you will need to change these values in /etc/spark/conf/spark-defaults.conf.
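For example, to have the Spark History Server clean up history files older than 3 days (the 3-day value is only an example), you could add the following to /etc/spark/conf/spark-defaults.conf and restart the Spark History Server; on new clusters the same properties can be supplied through the spark-defaults configuration classification:
spark.history.fs.cleaner.enabled   true
spark.history.fs.cleaner.interval  1d
spark.history.fs.cleaner.maxAge    3d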