Guzzle Parameters - ja-guzzle/guzzle_docs GitHub Wiki
- Overview
- Guzzle System Parameters
- Context Parameters
- User Job Parameters
- Precedence Order of Job Parameters
- Precedence of user defined and system parameters:
- Precedence of Context parameters:
- Using Groovy templates to manipulate Guzzle Parameters
- Params Displayed in Guzzle Logs UI
- Precedence Order of Param
Guzzle offers a wide array of standard parameters, which can be broadly classified into the following categories:
These are the standard job parameters of Guzzle. Some of these parameters are meant purely for traceability, providing the job name, environment, stage, job group, etc. Others are meant to control the behavior of Guzzle jobs and Guzzle common services. The table below provides the full list of these parameters and their purpose:
Sr. | Parameter Name | Description | Applicable to | Mandatory | Sample values | References |
---|---|---|---|---|---|---|
* | job_config_name | Name of the job configuration | All job modules | Yes (automatically populated) | load_airlines | |
* | environment | Name of the environment used when triggering the job | All job modules | Yes (when running jobs from the UI, it is passed as per the selection at the top right of the job list) | | |
* | batch_id | Batch id of the job. A batch is usually used to track the end-to-end batch load for a given system | All job modules | Yes (defaulted to -1 for adhoc job or job group run) | 1231321313131 | |
* | stage_id | Id of the stage under which the job was run. Stages are in turn part of batches | All job modules | Yes (defaulted to -1 for adhoc job or job group run) | 1231321313131 | |
* | job_instance_id | Unique job instance id for the individual job run | All job modules | Yes | 12321313132 | |
* | business_date | Business date for the job run | All | Yes | 2018-01-01 00:00:00 | |
* | job_group | Job group of which this job is a part. Only passed when the job is run as part of a job group or stage | All job types | No | job_group1 | |
* | stage | Stage name if the job is invoked through a stage run | All jobs when called from "Run Stage" | No | FND, STG | |
* | guzzle.spark.name | The Spark environment name | All | Yes | spark_local | |
* | guzzle.job_group.partial | Indicates whether a job group allows partial run. When marked as Y will result in running the job | | | | |
* | guzzle.ingestion.load_type | Applicable to JDBC ingestion jobs | Ingestion | No | F: full, I: Incremental | https://github.com/ja-guzzle/ingestion/issues/31 |
* | guzzle.batchpipeline.threads | Controls the number of files that are processed in parallel when running an ingestion job. Applicable when there is more than one incoming file for a given job | Ingestion | No | 10 | https://github.com/ja-guzzle/ingestion/issues/114 |
* | guzzle.stage.resume | Indicates whether a stage has to be resumed (recovered). Only applicable when the stage is defined as Partial | Job Groups and Job Stages | No | Y or N | https://github.com/ja-guzzle/guzzle_common/issues/91 |
* | hive.storage_format | Used as the storage format for auto-created tables | Ingestion | No | | |
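As a rough illustration of the defaults noted in the table, the sketch below fills in batch_id and stage_id with -1 when a run supplies neither, as the table describes for an adhoc job or job group run. This is a sketch under stated assumptions rather than Guzzle code: the helper name `with_adhoc_defaults` and the sample input are hypothetical.

```python
# Defaults taken from the table above: batch_id and stage_id fall back to -1
# for an adhoc job or job group run.
ADHOC_DEFAULTS = {"batch_id": -1, "stage_id": -1}


def with_adhoc_defaults(params: dict) -> dict:
    """Return a copy of params with batch_id/stage_id defaulted to -1."""
    merged = dict(ADHOC_DEFAULTS)
    merged.update(params)  # explicitly supplied values win over the defaults
    return merged


# Hypothetical adhoc run that passes only the job configuration name.
adhoc_run = with_adhoc_defaults({"job_config_name": "load_airlines"})
print(adhoc_run["batch_id"], adhoc_run["stage_id"])
```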
When setting up a Guzzle environment, one can specify the standard context variables in guzzle.yml. These are then available to Stage, Job Group and Job invocations as a fixed set of additional parameters.
<< screenshot to be added later >>
These context parameters form part of various Guzzle runtime audit tables, namely: job_info, batch_control, recon_summary, recon_detail, check_constraint_summary, check_constraint_detail, watermark and batch_constraint.
Context parameters (also referred to as context columns) are designed to bring consistency when tracking common information about data pipelines, both in the data tables and in the runtime audit and recon tables.
Context parameters can also be used to build sophisticated batch orchestration for large-scale data lake implementations that ingest data from a multitude of systems and locations. This shall be described in a separate document.
Context parameters being prompted in the Job, Job Group and Run Stage respectively: << screenshot to be added later >>
When creating a Guzzle job config, one can refer to additional custom parameters. These parameters can then be passed to the job at the time of invocation.
User-defined parameters can be passed when invoking the job, via an adhoc job group, or when running a Stage (in which case the additional user-defined job parameters are passed to all the jobs that form part of the job group or stage).
System, context and user parameters can be passed from multiple places. Below is the precedence order in which job parameters are applied when running the job:
Precedence of user defined and system parameters:
Precedence Order | Layer | Screenshot |
---|---|---|
1 | Parameter passed during Invocation of Run Stage/ Job Group / Job | |
2 | Environment settings | |
3 | Parameters specified when adding the Job to Job Group | |
Precedence of Context parameters:
Precedence Order | Layer | Screenshot |
---|---|---|
1 | Parameter passed during Invocation of Run Stage/ Job Group / Job | |
2 | Parameters specified when adding the Job to Job Group | |
3 | Environment file | |
Note: 1 means highest precedence and 3 means lowest.
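The layering in the tables above can be pictured as a merge in which a higher-precedence layer overrides a lower one. The following is a minimal Python sketch of that idea, not Guzzle's actual implementation: the function name `resolve_params`, the layer dictionaries and the sample values are all hypothetical, and the layers are passed in the order Invocation, Environment settings, Job Group.

```python
from collections import ChainMap


def resolve_params(*layers):
    """Merge parameter layers given from highest to lowest precedence.

    ChainMap looks a key up in the first mapping that defines it, so an
    earlier (higher-precedence) layer wins over a later one.
    """
    return dict(ChainMap(*layers))


# Hypothetical sample layers, for illustration only.
invocation = {"business_date": "2018-01-01 00:00:00"}
environment = {"guzzle.spark.name": "spark_local", "environment": "dev"}
job_group = {"business_date": "2017-12-31 00:00:00", "environment": "test"}

# Precedence 1 (highest) comes first, precedence 3 (lowest) comes last.
resolved = resolve_params(invocation, environment, job_group)
print(resolved["business_date"])
print(resolved["environment"])
```

Swapping the order of the arguments is enough to model the other precedence table, since the function itself makes no assumption about what each layer represents.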
At times a param has to be modified further before it can be used in the job definition. Guzzle supports Groovy templates for modifying parameters.
An example of a Groovy template is given below: csv/user_${business_date[0..3] + business_date[5..6] + business_date[8..9]}.csv
With a business_date of 2018-01-01 00:00:00, this template evaluates to csv/user_20180101.csv.
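To make the slicing explicit, here is a small Python sketch that mimics what the Groovy expression computes. It is only an illustration: Groovy ranges such as [0..3] include both ends, so they correspond to Python slices such as [0:4], and the helper name `render_file_name` is hypothetical.

```python
def render_file_name(business_date: str) -> str:
    """Mimic the Groovy template by slicing the date string.

    Groovy's business_date[0..3] is inclusive on both ends, which
    corresponds to business_date[0:4] in Python.
    """
    year = business_date[0:4]    # Groovy: business_date[0..3]
    month = business_date[5:7]   # Groovy: business_date[5..6]
    day = business_date[8:10]    # Groovy: business_date[8..9]
    return f"csv/user_{year}{month}{day}.csv"


print(render_file_name("2018-01-01 00:00:00"))  # csv/user_20180101.csv
```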
Below are the details of which params (internal and external) are displayed in the log view:
The log view displays parameters from the job_info and job_info_param tables. The values of the following parameters are taken from the job_info table:
- "Name" -> name
- "Module" -> module
- "Job Id" -> job_instance_id
- "Business Date" -> business_date
- "Start Time" -> start_time
- "End Time" -> end_time
- "Status" -> status
- "Message" -> message
- "Batch Id" -> batch_id
- "Stage Id" -> parent_job_instance_id
All parameters in the job_info_param table are shown, except for the parameters listed below (their values are shown from the job_info record in the section above). The names are displayed exactly as they appear in the table, so there are no friendly names for them.
- "job_config_name"
- "job_instance_id"
- "business_date"
- "batch_id"
- "stage_id"
- "dependency_graph"
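The filtering rule described above can be sketched in Python as follows. This is an illustrative sketch, not the code behind the Guzzle log view: the function name `params_for_log_view` and the sample rows are hypothetical, and job_info_param is modelled as a plain dictionary of name to value.

```python
# Parameters that are excluded from the job_info_param listing,
# as per the list above.
EXCLUDED_FROM_PARAM_LIST = {
    "job_config_name",
    "job_instance_id",
    "business_date",
    "batch_id",
    "stage_id",
    "dependency_graph",
}


def params_for_log_view(job_info_params: dict) -> dict:
    """Return the job_info_param entries that remain after the exclusion."""
    return {
        name: value
        for name, value in job_info_params.items()
        if name not in EXCLUDED_FROM_PARAM_LIST
    }


# Hypothetical job_info_param rows, for illustration only.
sample = {
    "job_config_name": "load_airlines",
    "batch_id": "-1",
    "guzzle.spark.name": "spark_local",
    "job_group": "job_group1",
}
print(params_for_log_view(sample))
```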
A job can get params from the following:
- At the time of execution
- Env
- In the pipeline (open question: what happens if a param is defined in the pipeline but set as blank?)
Note: when run through the scheduler, the job gets the scheduler global params plus the runnable-level params instead of the "At the time of execution" params.