YARN Job Management - rambabu-chamakuri/PSTL-DOC GitHub Wiki
If you are not familiar with the basic characteristics of a Jaguar Job, please refer to the Job Overview Guide.
If you are not familiar with the basics of launching Jaguar jobs, please refer to the Job Launching Guide.
If you are not familiar with the basics of YARN, please refer to the YARN Documentation. If you are typically a CLI user, you may find the YARN Commands Documentation a useful reference.
WARNING: Jaguar is currently developed and tested against Spark's YARN integration. To date we have not deployed with Spark's Mesos integration. As such, avid users may encounter issues if they attempt to launch Jaguar jobs on Spark with Mesos. This guide is specific to job management on YARN.
WARNING: If you are running YARN securely (e.g., with Kerberos or similar), this guide assumes you are able to authenticate prior to running YARN commands.
Find Running Job
You may have many Jaguar jobs running on your cluster, as well as other applications leveraging YARN for resource management. In these cases, you may need to locate a specific Jaguar job when you do not have the YARN application_id available. Fortunately, we can easily search YARN applications:
[bowdch01@msc02-jag-en-001 ~]$ yarn application -appStates RUNNING -appTypes SPARK -list
18/01/04 23:17:05 INFO impl.TimelineClientImpl: Timeline service address: http://msc02-jag-yrm-001.uat.gdcs.apple.com:8188/ws/v1/timeline/
18/01/04 23:17:06 INFO client.AHSProxy: Connecting to Application History server at msc02-jag-yrm-001.uat.gdcs.apple.com/10.190.178.98:10200
Total number of applications (application-types: [SPARK] and states: [RUNNING]):4
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1514929143019_0007 Spark shell SPARK bowdch01 default RUNNING UNDEFINED 10% http://10.190.178.113:4040
As seen above, we can easily search for applications based on their state and application type. Once we locate the application_id, in this case application_1514929143019_0007, we can then stop the running job, etc.
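If you would rather not eyeball the listing, the Application-Id column can be extracted with standard tools. The sketch below parses the sample row from the listing above (pasted inline so it runs without a cluster); in practice you would pipe yarn application -appStates RUNNING -appTypes SPARK -list straight into awk.

```shell
# Sample row copied from the listing above, so this runs anywhere.
listing='application_1514929143019_0007  Spark shell  SPARK  bowdch01  default  RUNNING  UNDEFINED  10%  http://10.190.178.113:4040'

# Match on the application name ("Spark shell" here) and print the
# first whitespace-separated field, which is the Application-Id.
app_id=$(printf '%s\n' "$listing" | awk '/Spark shell/{print $1}')
echo "$app_id"
```

Matching on the application name is convenient when you launch jobs with predictable names; if names are ambiguous, filter on the User or Queue columns instead.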
Stop Running Job
Once you have successfully launched a Jaguar job, there may come a point in time where you need to stop the job. If you have the job's YARN application_id handy, this is a simple operation:
[bowdch01@msc02-jag-en-001 ~]$ yarn application -kill application_1514929143019_0005
18/01/04 23:05:32 INFO impl.TimelineClientImpl: Timeline service address: http://msc02-jag-yrm-001.uat.gdcs.apple.com:8188/ws/v1/timeline/
18/01/04 23:05:32 INFO client.AHSProxy: Connecting to Application History server at msc02-jag-yrm-001.uat.gdcs.apple.com/10.190.178.98:10200
Killing application application_1514929143019_0005
18/01/04 23:05:33 INFO impl.YarnClientImpl: Killed application application_1514929143019_0005
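The two steps (find, then kill) are easy to combine when you know the job's name. This is a sketch, not a definitive recipe: the yarn shell function below stubs the real CLI so the flow can be illustrated without a cluster, and the job name my-jaguar-job is hypothetical. Remove the stub to run it against a real ResourceManager.

```shell
# Stub of the yarn CLI so this sketch runs without a cluster;
# delete this function to execute the real commands.
yarn() {
  case "$2" in
    -appStates) echo 'application_1514929143019_0007  my-jaguar-job  SPARK  bowdch01  default  RUNNING  UNDEFINED  10%  http://x:4040' ;;
    -kill)      echo "Killing application $3" ;;
  esac
}

# Locate the running job by its (hypothetical) name, then kill it
# only if a matching Application-Id was found.
app_id=$(yarn application -appStates RUNNING -appTypes SPARK -list | awk '/my-jaguar-job/{print $1}')
[ -n "$app_id" ] && yarn application -kill "$app_id"
```

The -n guard matters: running yarn application -kill with an empty id fails with a usage error, so it is worth confirming the lookup actually matched before issuing the kill.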
View Job Logs
While your Jaguar job is running, you may need to periodically view the job logs to investigate outstanding issues. While Jaguar's metrics make processing issues obvious in near real time, they may not necessarily point you to the root cause of an issue. If you have the job's YARN application_id handy, this is a simple operation:
[bowdch01@msc02-jag-en-001 ~]$ yarn logs -applicationId application_1514929143019_0008 -log_files stderr
18/01/04 23:25:22 INFO impl.TimelineClientImpl: Timeline service address: http://msc02-jag-yrm-001.uat.gdcs.apple.com:8188/ws/v1/timeline/
18/01/04 23:25:23 INFO client.AHSProxy: Connecting to Application History server at msc02-jag-yrm-001.uat.gdcs.apple.com/10.190.178.98:10200
Container: container_e52_1514929143019_0008_01_000001 on msc02-jag-dn-015.uat.gdcs.apple.com:45454
==================================================================================================
LogType:stderr
Log Upload Time:Thu Jan 04 23:25:24 +0000 2018
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/01/filecache/539/spark-hdp-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.5.3.0-37/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/01/04 23:24:31 INFO ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
18/01/04 23:24:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/01/04 23:24:31 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1514929143019_0008_000001
...
Note that we specify a specific log file name. While applications may log as they see fit (to custom file names, etc.), it is very common for YARN applications to emit stderr and stdout files. Spark applications like Jaguar often use stderr in place of stdout for reasons not worth detailing here (feel free to ask if you are interested). You can also request all logs for a specific YARN application_id by simply omitting -log_files ....
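Retrieved logs are often large, so filtering for WARN/ERROR lines is a common first pass when investigating an issue. This sketch filters log text pasted from the sample output above so it runs standalone; in practice you would pipe yarn logs -applicationId <id> -log_files stderr into grep directly.

```shell
# Sample stderr lines copied from the output above; in practice this
# text would come from `yarn logs -applicationId <id> -log_files stderr`.
log='18/01/04 23:24:31 INFO ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
18/01/04 23:24:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/01/04 23:24:31 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1514929143019_0008_000001'

# Keep only warning/error lines (the level sits between spaces in
# the standard log4j layout shown above).
printf '%s\n' "$log" | grep -E ' (WARN|ERROR) '
```

Anchoring the pattern on the surrounding spaces avoids false positives from payload text that merely mentions the words WARN or ERROR.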
If you will frequently interact with YARN logging infrastructure, we highly recommend you read the YARN Documentation and the YARN Commands Documentation.