ExcelTemplateToGuzzleJobConfigYAML_readme - ja-guzzle/guzzle_docs GitHub Wiki

Table of Contents

Purpose

Goal of this application is to prepare Guzzle job(only processing job, for now) config yaml files from provided excel files.

This application accepts,

  • Configuration file (which is in yml file which gives diff information on how to locate diff aspects in the the actual excle)
  • Excel file from which job config will be prepared

And Produces,

  • job config YAML files
  • logs

Sample template

STTM_Processing_Excel_Template.xlsx

Run the app

java -jar ExcelTemplateToGuzzleJobConfigYAML.jar "excel=</absolute/path/to/excel.xlsx>" "conf=</absolute/path/to/conf.yml>" "output=</absolute/path/to/outputDir>" "log=</absolute/path/to/logDir>"

Arguments

  1. * Path of folder which have source and target query files. ["excel=/absolute/path/to/excel.xlsx"]
  2. Path of Configuration file. ["conf=/absolute/path/to/conf.yml"] --- [Default: will try to pick config.yml named file from 'excel' argument's directory]
  3. Path of output directory. ["output=/absolute/path/to/outputDir"] --- [Default: 'output' directory under 'excel' argument's path]
  4. Path of log directory. ["log=/absolute/path/to/logDir"] --- [Default: 'log' directory under 'excel' argument's path]

Only * argument is mandatory, for others default values will be used.

Configurations

This confirm is very powerful and gives you a lot of flexibility of how you want to setup your ETL template. It does not go by position by labels of the fields.

general:
    value_delimiter: ","     # to separate ids of additional column and truncate partition columns e.g.: fr_col, add_col
    range_delimiter: ".."    # can be used to specify id range when ids are integer e.g.: 1..7,8,10..13. Usually we will never have column names which are integer.
clause:
    from: "From"             # excel cell value which represents 'from' clause
    where: "Filter"          # excel cell value which represents 'where' clause
    join: "JOINS"            # excel cell value which represents 'join' clause
    order_by: "Order By"     # excel cell value which represents 'order by' clause
    group_by: "Group By"     # excel cell value which represents 'group by' clause
    having: "Having Clause"  # excel cell value which represents 'having' clause
sheet_detail:
    additional_columns:
        sheet_name: "Additional Columns"          # name of additional column sheet
        header_row_index: "0"                     # index of row (0 based) from which header of 'additional columns' starts
    truncate_partition_columns:
        sheet_name: "Truncate Partition Columns"  # name of truncate partition column sheet
        header_row_index: "0"                     # index of row (0 based) from which header of 'truncate partition columns' starts
column_type:
    primary_key: "P"      # put the value you have used in sttm sheet to denote the column is primary key
    merge_key: "M"        # put the value you have used in sttm sheet to denote the column is merge key
    history_column: "H"   # put the value you have used in sttm sheet to denote the column is history column
fields:   # below are necessary fields used in app, provide the mapping (excel cell value) which represents the field
    description: "Description"
    job_type: "Job Type"
    tags: "Tags"
    s_endpoint: "Source Endpoint"
    s_technology: "Source Technology"
    incremental: "Incremental"
    s_table: "Source Schema.Source Table"
    sql_query: "SQL"
    filter: "Filter On Source"
    s_pre_sql: "PreSQL"
    s_post_sql: "PostSQL"
    s_table_dependency: "Source Table Dependancy"
    t_table_dependency: "Target Table Dependancy"
    additional_columns: "Add Columns"
    t_endpoint: "Target Endpoint"
    t_technology: "Target Technology"
    t_schemaName.tableName: "Target Schema.Target Table"
    operation: "Loading Startegy"
    template: "Template"
    t_pre_sql: "PreSQL"
    t_post_sql: "PostSQL"
    soft_delete: "Soft Delete"
    truncate_partition_columns: "Truncate Clumn"
sttm_columns:
    source_table: "Source Table"                         # put the value of excel cell which represents the COLUMN 'source table' in sttm table
    source_column: "Source Column (with Transformation)" # put the value of excel cell which represents the COLUMN source column in sttm table
    expression: "Expression"                             # Not being used in current excel template
    target_column: "Target Column"                       # put the value of excel cell which represents the COLUMN target column in sttm table
    column_type: "Pk/Merge Columns/History Trigger"      # put the value of excel cell which represents the COLUMN column_type (to mark column as primary, merge or history column) in sttm table
additional_columns_sheet_columns:             # header column mapping of 'additional columns' sheet
    column_id: "Key"
    name: "Name"
    value: "Value"
    framework_column: "Framework Column"
    framework_generated: "Framework Generated"
truncate_partition_columns_sheet_columns:     # header column mapping of 'truncate partition columns' sheet
    column_id: "Key"
    name: "Column_name"
    value: "Value"

Output - Job Config YAML

  • The app will produce 1 job config yaml file for 1 excel sheet (It will exclude certain sheets like "Additioanl Column" "truncate partition column" sheet)
  • Generated job config will be named after it's source excel sheet name.

Limitations

  • Job config file may contain invalid query, as it's just merging and replacing strings. Validating formed query is not a part of the app, as of now. Verify before using it in Guzzle.
  • It's correctness depends on the use of accurate syntax in excel sheet.
  • Range delimeter only works with integer ids (this are unlikely and hence can be ignored)
  • Pre-SQL and Post-SQL has to be provided as multi row. this has to be tested.
  • Any formatting /extract char which excel may introduce may have impact to job yml generated

Recommendation

  • While using clauses, use them in order. (eg.: In excel sheet, after sttm query, join clause row must appear in before order by clause row)
  • Use exact syntax of corresponding technology

Notes

  1. If there are other columns, besides specified in sttm_columns conf, will be ignored. [in STTM table, in sttm sheet]
  2. Data above the 'header_row_index' in 'additional column sheet' and 'truncate partition column' sheet, will be ignored.
  3. Continuous 10 blank cells in first column in any sheet will be considered as the end of sheet (when 'header_row_index' is specified, it will count it after 'header_row_index').
  4. This app is tightly coupled with config file and the excel template. So your config and template has to be hand in hand.
⚠️ **GitHub.com Fallback** ⚠️