Job Launching

If you are not familiar with the basic characteristics of a Jaguar Job, please refer to the Job Overview Guide.


WARNING: Jaguar is currently feature rich only for Spark's YARN integration. To date, we have not deployed with Spark's MESOS integration, so users may encounter issues if they attempt to launch jobs via Jaguar on Spark with MESOS. Internally, Jaguar's SparkLauncher provisions sane settings when launching in a YARN environment, and we have yet to extend this automation to MESOS.


When working with Spark applications, it is typical to launch your application via spark-submit, providing all of the arguments your application needs to bootstrap dependencies, configuration, and so on. Jaguar is very similar: each launched job is nothing more than a spark-submit, where we provide the implementation of the application in a generic way. Each job launched by Jaguar is an isolated Spark driver running on the underlying cluster. However, Jaguar requires non-trivial dependencies and non-trivial configuration to ensure all of its functionality is correctly available within Spark at runtime. As a result, we generally frown on forcing users to generate their own spark-submit ... commands, since doing so is tedious and error prone.

Instead, Jaguar comes with a custom SparkLauncher. Jaguar's launcher automates spark-submit generation based on well-defined conventions. It also ensures classpaths, configuration, dependencies, jars, etc. are properly defined based on the installation environment. This guide covers getting started with launching jobs, and provides a full overview of the available options so you can tailor configuration to your specific workload.
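
To make this concrete, the following sketches the kind of spark-submit invocation Jaguar's launcher assembles on your behalf. The flags themselves are standard spark-submit options, but the main class and installation paths shown are hypothetical placeholders, not Jaguar's actual internals:

# a minimal sketch, assuming a YARN cluster deployment; the
# com.example.pstl.JobRunner class and jar paths are hypothetical
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.pstl.JobRunner \
  --properties-file my-job/spark.properties \
  --files my-job/spark_files/product.avsc \
  --jars my-job/spark_jars/my-udf-library.jar \
  /opt/pstl/lib/pstl-core.jar my-job/job.sql

Hand-maintaining a command like this for every job, and keeping it in sync with the installation environment, is exactly the burden the launcher removes.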

Getting Started

Similar to Spark's spark-submit, Jaguar comes with a binary for launching and deploying jobs: pstl. If you are having trouble locating the pstl binary, please consult the Jaguar Installation Guide.

Launching a job is as simple as running the following command: pstl --deploy /path/to/job, where /path/to/job is assumed to be a directory. All jobs are assumed to be isolated directory structures in a convention-based format. Let's take a look at a typical job directory:

Chriss-MacBook-Pro-4:~ cbowden$ tree my-job/
my-job/
├── environment.properties
├── job.sql
├── spark.properties
├── spark_files
│   └── product.avsc
└── spark_jars
    └── my-udf-library.jar

2 directories, 5 files

As we learned in the Job Overview Guide, all jobs require a unique job id. Typically, the job directory is self-identifying based on the directory name. In this case, the unique identifier for our example job directory is my-job. However, when deploying a job, we can optionally specify a user-provided job id via the --job-id argument when invoking pstl --deploy:

Chriss-MacBook-Pro-4:~ cbowden$ pstl
pstl --deploy /path/to/job
Option               Description                              
------               -----------                              
--deploy <File>      job directory to deploy                  
--help               show this message                        
--job-id [String]    human readable label to identify this job
--pstl-home [File]   path to pstl installation directory      
--spark-home [File]  path to spark installation directory     
--verbose            provide(s) verbose output                
--version            show version number and quit
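
For example, to deploy the example directory above under a job id other than the directory name (the job id shown is an arbitrary label):

pstl --deploy /path/to/my-job --job-id my-custom-job-id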

job.sql

As we learned in the Job Overview Guide, all jobs also require a job definition. Within the job directory, Jaguar searches for a file named job.sql; if it fails to locate one, it will fail to deploy the job. The contents of job.sql are always assumed to be a set of SQL commands, each terminated by ;. The defined SQL commands must contain at least one action, typically defined as SAVE STREAM .... If no action(s) are defined in the job definition, Jaguar will fail to deploy the job, since no real processing work is defined within the job. This follows suit with Spark's lazy execution model.
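
To make the shape of a job definition concrete, the following is a minimal, hypothetical job.sql. Only the SAVE STREAM action is referenced by this guide; the other command names and elided options are illustrative placeholders, so consult the Job Overview Guide for the exact command set:

-- hypothetical sketch of a job definition: every command is
-- terminated by ';' and at least one action must be present
CREATE STREAM my_source ...;  -- illustrative: define an input
SAVE STREAM my_source ...;    -- the action that gives the job work to do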

environment.properties

Users can optionally specify a file named environment.properties in their job directory. This file must be specified in the standard Java Properties File Format. The key/value pairs defined in this file are exposed as environment variables to the job at runtime. Defining environment variables helps us:

  • Provide values at runtime to the job definition via SQL variable substitution
  • Provide values for certain features, if they expect an environment variable to be defined at runtime

Environment variables defined in the job directory have higher precedence than environment variables defined in the Jaguar installation directory. See the Jaguar Installation Guide for more details.
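
As a sketch, assume a job that reads from Kafka and wants to parameterize its broker list per environment. The property names below are arbitrary; how the job definition consumes them via SQL variable substitution is covered in the Job Overview Guide:

# environment.properties: standard Java properties, exposed to the
# job as environment variables at runtime (names are illustrative)
KAFKA_BROKERS=broker-1:9092,broker-2:9092
AVRO_SCHEMA_FILE=product.avsc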

spark.properties

Users can optionally specify a file named spark.properties in their job directory. This file must be specified in the standard Java Properties File Format. The key/value pairs defined in this file are used to configure the SparkContext and SparkSession when deploying the job definition. Defining spark properties helps us tune configuration settings specific to our job's workload as needed.

Spark properties defined in the job directory typically have higher precedence than spark properties defined in the Jaguar installation directory. See the Jaguar Installation Guide for more details.
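
For example, a job with a heavier workload might override executor sizing in its job-local spark.properties. The keys below are standard Spark configuration properties; the values are illustrative:

# spark.properties: applied to this job's SparkContext/SparkSession
spark.executor.instances=4
spark.executor.memory=4g
spark.executor.cores=2
spark.sql.shuffle.partitions=64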

spark_files

Users can optionally specify additional files required by their job definition to function correctly. These can be simple, small static files such as Avro schema definitions (if no external schema registry is available). These files can be in any arbitrary format required by the user or the job definition; the only constraints are as follows:

  • Subdirectories are not supported
  • File names must be unique (a direct consequence of the first constraint)

Spark files defined in the job directory have higher precedence than spark files defined in the Jaguar installation directory. See the Jaguar Installation Guide for more details.
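
To illustrate the constraints, the following layout is valid (flat, with unique file names; customer.avsc is just an illustrative second file), whereas nesting files under something like spark_files/schemas/product.avsc is not supported:

my-job/
└── spark_files
    ├── customer.avsc
    └── product.avsc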

spark_jars

Users can optionally specify additional jars required by their job definition to function correctly. These are typically additional dependencies such as:

  • User defined SQL function(s)
  • User defined sources
  • User defined sinks

Similar to spark_files, subdirectories are not supported and file names must be unique.

Spark jars defined in the job directory have higher precedence than spark jars defined in the Jaguar installation directory. See the Jaguar Installation Guide for more details.

WARNING: If your job directory contains a JAR already referenced in Jaguar's core library dependencies (e.g., same JAR name, typically based on groupId, artifactId, version), Jaguar's core library takes precedence over the JAR in your job directory.
