ingestion job configuration - ja-guzzle/guzzle_docs GitHub Wiki

The following is a sample ingestion job configuration for a CSV source and a Hive target:

```yaml
job:
    type: ingestion
    failure_threshold: 20
    partial_file_load: false

source:
    endpoint: users_base_path
    source_schema_derivation_strategy: source
    properties:
        source_file_pattern: ${location}/${system}/users.*.csv
        format: delimited
        charset: UTF-8
        format_properties:
            column_delimiter: ","
            contains_header: true
        control_file:
          extension: ctl
          path: /control-files/${environment}/${location}/${system}
        processed_file_path: /processed/${environment}/${location}/${system}

schema:
    strict_schema_check: true
    filter_sql: "name like 'user%' and age > 23 or name in (select name from ${endpoint.users_db}.test)"
    columns:
        id:
            primary_key: true
            data_type: INT
            nullable: false
        name:
            data_type: CHAR(10)
            validate_sql: "@ in (select name from ${endpoint.users_db}.test)"
            transform_sql: "case when @ in (select name from ${endpoint.users_db}.test) then @ else 'None' end"
        age:
            data_type: DECIMAL(2,0)
            validate_sql: "@ > 25"
        created_time:
            validate: false

target:
    endpoint: users_db
    columns_to_be_loaded: common
    partition_columns:
      instance_id:
        value: ${job_instance_id}
      system:
        value: ${system}
      location:
        value: ${location}
    properties:
        table: ${database_name}.${users_table}
```

Two types of placeholders can be used in the job configuration:

  • ${job_parameter_name} : resolved from the parameters passed when the job is invoked
  • ${endpoint.logical_endpoint_name} : resolved by reading the database-name property from the physical endpoint mapped to the given logical endpoint and environment
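The two resolution rules above can be sketched as a small substitution function. This is an illustrative sketch, not Guzzle's actual resolver: `resolve`, `job_params`, and `endpoint_dbs` are hypothetical names, and the real service reads the database name from the physical-endpoint definition for the current environment.

```python
import re

# Matches ${...} placeholders in configuration values.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")

def resolve(template: str, job_params: dict, endpoint_dbs: dict) -> str:
    """Hypothetical placeholder resolver.

    job_params   : parameters passed while invoking the job
    endpoint_dbs : logical endpoint name -> database name (from the
                   physical endpoint for the current environment)
    """
    def replace(match: re.Match) -> str:
        key = match.group(1)
        if key.startswith("endpoint."):
            # ${endpoint.logical_endpoint_name} -> database name property
            return endpoint_dbs[key[len("endpoint."):]]
        # ${job_parameter_name} -> value passed at job invocation
        return job_params[key]

    return PLACEHOLDER.sub(replace, template)

params = {"location": "us", "system": "crm"}
dbs = {"users_db": "users_prod"}
print(resolve("${location}/${system}/users.*.csv", params, dbs))
# -> us/crm/users.*.csv
print(resolve("select name from ${endpoint.users_db}.test", params, dbs))
# -> select name from users_prod.test
```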

Job:

  • type is used by the common (orchestration) service to identify which type of job to trigger
  • failure_threshold specifies the percentage of invalid records allowed while processing a single file. If the threshold is reached, the whole file is discarded and none of its records are processed
  • When multiple files are present in the source and the failure threshold is reached for some of them, partial_file_load specifies whether to process the remaining files or discard all source files. The default is false, meaning all files are discarded if the threshold is reached in any single source file
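The interaction of failure_threshold and partial_file_load described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Guzzle's actual implementation: the function and variable names are hypothetical, and it assumes a file fails once its invalid-record percentage reaches the threshold.

```python
def files_to_load(file_invalid_pct: dict, failure_threshold: float,
                  partial_file_load: bool) -> list:
    """Return the source files whose records should be loaded.

    file_invalid_pct maps file name -> percentage of invalid records.
    A file at or above failure_threshold is always discarded; with
    partial_file_load=False, one failing file discards the whole batch.
    """
    failed = [f for f, pct in file_invalid_pct.items()
              if pct >= failure_threshold]
    if failed and not partial_file_load:
        return []  # discard all source files
    return [f for f in file_invalid_pct if f not in failed]

batch = {"a.csv": 5.0, "b.csv": 30.0, "c.csv": 0.0}
print(files_to_load(batch, 20, partial_file_load=False))  # -> []
print(files_to_load(batch, 20, partial_file_load=True))   # -> ['a.csv', 'c.csv']
```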

Source:
TODO

Schema:
TODO

Target:
TODO

Placeholder support

  • In source section:
    source_file_pattern
    control_file > path
    processed_file_path

  • In schema section:
    validate_sql
    transform_sql

  • In target section:
    partition_columns
    table
