ingestion job configuration - ja-guzzle/guzzle_docs GitHub Wiki
The following is a sample ingestion job configuration for a CSV source and a Hive target:
job:
  type: ingestion
  failure_threshold: 20
  partial_file_load: false
source:
  endpoint: users_base_path
  source_schema_derivation_strategy: source
  properties:
    source_file_pattern: ${location}/${system}/users.*.csv
    format: delimited
    charset: UTF-8
    format_properties:
      column_delimiter: ","
      contains_header: true
    control_file:
      extension: ctl
      path: /control-files/${environment}/${location}/${system}
    processed_file_path: /processed/${environment}/${location}/${system}
schema:
  strict_schema_check: true
  filter_sql: "name like 'user%' and age > 23 or name in (select name from ${endpoint.users_db}.test)"
  columns:
    id:
      primary_key: true
      data_type: INT
      nullable: false
    name:
      data_type: CHAR(10)
      validate_sql: "@ in (select name from ${endpoint.users_db}.test)"
      transform_sql: "case when @ in (select name from ${endpoint.users_db}.test) then @ else 'None' end"
    age:
      data_type: DECIMAL(2,0)
      validate_sql: "@ > 25"
    created_time:
      validate: false
target:
  endpoint: users_db
  columns_to_be_loaded: common
  partition_columns:
    instance_id:
      value: ${job_instance_id}
    system:
      value: ${system}
    location:
      value: ${location}
  properties:
    table: ${database_name}.${users_table}
Two types of placeholders can be used in a job configuration:
- ${job_parameter_name}: resolved from the parameters passed when the job is invoked
- ${endpoint.logical_endpoint_name}: resolved to the database name property of the physical endpoint that corresponds to the given logical endpoint and environment
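The two placeholder forms can be illustrated with a minimal Python sketch. This is not Guzzle's actual implementation: the function name `resolve_placeholders`, the shape of the two dictionaries, and the sample values are all assumptions made for illustration. In particular, the endpoint lookup is assumed to have already been narrowed to the current environment.

```python
import re

# Matches ${name} or ${endpoint.name}; names are assumed to be word characters.
_PLACEHOLDER = re.compile(r"\$\{(endpoint\.)?(\w+)\}")


def resolve_placeholders(text, job_params, endpoint_databases):
    """Replace ${param} and ${endpoint.logical_name} occurrences in text.

    job_params: parameters passed when the job is invoked (hypothetical shape).
    endpoint_databases: logical endpoint name -> database name, already
        looked up for the current environment (hypothetical shape).
    """
    def substitute(match):
        is_endpoint, name = match.group(1), match.group(2)
        source = endpoint_databases if is_endpoint else job_params
        if name not in source:
            # Fail loudly rather than leave a raw placeholder in the config.
            raise KeyError(f"unresolved placeholder: {match.group(0)}")
        return str(source[name])

    return _PLACEHOLDER.sub(substitute, text)


# Hypothetical sample values.
params = {"location": "sg", "system": "crm"}
endpoints = {"users_db": "users_prod"}

print(resolve_placeholders("${location}/${system}/users.*.csv", params, endpoints))
print(resolve_placeholders("select name from ${endpoint.users_db}.test", params, endpoints))
```

Raising on an unknown name is a design choice of this sketch; the source does not say how unresolved placeholders are handled.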
Job:
- type is used by the common service (orchestration) to identify which type of job to trigger
- failure_threshold specifies the percentage of invalid records allowed while processing a single file. If the threshold is reached, the whole file is discarded and no records from that file are processed
- partial_file_load applies when the source contains multiple files and the failure threshold is reached for some of them. It specifies whether to process the remaining files or discard all the source files. The default value is false, which means all files are discarded if the threshold is reached in any single source file
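The interaction between failure_threshold and partial_file_load can be sketched in Python as follows. The function name `select_files_to_load` and the input shape are hypothetical, and the sketch assumes that "reached" means the invalid-record percentage is greater than or equal to the threshold; the source does not state whether the comparison is inclusive.

```python
def select_files_to_load(invalid_counts, failure_threshold=20, partial_file_load=False):
    """Return the names of the files whose records would be loaded.

    invalid_counts: file name -> (invalid_records, total_records);
        a hypothetical shape chosen for this sketch.
    """
    passed, failed = [], []
    for name, (invalid, total) in invalid_counts.items():
        invalid_pct = 100.0 * invalid / total if total else 0.0
        # Assumption: "reached" means invalid percentage >= threshold.
        if invalid_pct >= failure_threshold:
            failed.append(name)
        else:
            passed.append(name)

    if failed and not partial_file_load:
        # Default behaviour: one failing file discards every source file.
        return []
    # Otherwise only the failing files are discarded.
    return passed


# Hypothetical batch: 5% invalid in the first file, 30% in the second.
files = {"users.1.csv": (5, 100), "users.2.csv": (30, 100)}
print(select_files_to_load(files))                          # []
print(select_files_to_load(files, partial_file_load=True))  # ['users.1.csv']
```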
Source:
TODO
Schema:
TODO
Target:
TODO
Support for placeholder
- In source section:
  - source_file_pattern
  - control_file > path
  - processed_file_path
- In schema section:
  - validate_sql
  - transform_sql
- In target section:
  - partition_columns
  - table