Recon Test Utility App - ja-guzzle/guzzle_docs GitHub Wiki

Command To Run App

spark-submit --class com.justanalytics.guzzle.ext.recontest.Main --master yarn {/Path/To/Jar/ReconTest.jar} \
"input-dir=/directory_path/to/source_target_query_files/input" \
"output-dir=/directory_path/to/store_result/outputs" \
"log-dir=/directory_path/to/store_log/logs" \
"run-id=example_job_20200316_1"

Input / Arguments

  1. * Path of the folder containing the source and target query files. ["input-dir=/directory_path/to/source_target_files/"]
  2. Path of the configuration file. ["conf-file=/file_path/to/config.yml"] --- [Default: the first YAML file under the "input-dir" (1st arg) path]
  3. Path of the output directory. ["output-dir=/directory_path/to/store_result/"] --- [Default: the "outputs" directory under the "input-dir" (1st arg) path]
  4. Path of the log directory. ["log-dir=/directory_path/to/store_log/"] --- [Default: the "logs" directory under the "input-dir" (1st arg) path]
  5. ID for the job, used as the output CSV filename. ["run-id=job_20200313_1"] --- [Default: result_<timestamp>.csv under the "output-dir" (3rd arg) path]

Only the argument marked with * is mandatory; default values are used for the others.
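The argument-defaulting rules above can be sketched as follows. This is an illustrative Python approximation, not the app's actual Scala code; the `parse_args` function and the timestamp-based default run-id format are assumptions.

```python
import os
import time

def parse_args(argv):
    """Parse key=value arguments and apply the documented defaults.

    Hypothetical sketch: the real Main class may parse differently.
    """
    args = dict(a.split("=", 1) for a in argv if "=" in a)
    if "input-dir" not in args:
        raise ValueError("input-dir is mandatory")
    base = args["input-dir"]
    # Defaults are derived from input-dir, as described above.
    args.setdefault("output-dir", os.path.join(base, "outputs"))
    args.setdefault("log-dir", os.path.join(base, "logs"))
    args.setdefault("run-id", "result_%d" % int(time.time()))
    return args

print(parse_args(["input-dir=/data/recon/input", "run-id=job_1"]))
```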

Configuration YAML File

Format

conf:
  processing_cores: 0
  source_file_suffix: _src
  target_file_suffix: _tgt
  params:
    db: pv_test1
    tb1: src_table
    tb2: tgt_table
  wrapper_query: | # Use '|' for multiline strings and start the string on a new line, indented two spaces
    WITH
    src AS (<<src_query>>),
    tgt AS (<<tgt_query>>),
    src_cnt AS (SELECT <<src_columns>>, count(*) cnt FROM src GROUP BY <<src_columns>>),
    tgt_cnt AS (SELECT <<tgt_columns>>, count(*) cnt FROM tgt GROUP BY <<tgt_columns>>)
    SELECT a.*, case when src_minus_tgt_cnt + tgt_minus_src_cnt = 0 and src_cnt = tgt_cnt then 'Y' else 'N' END AS recon_check FROM 
    (SELECT
      (select count(*) from (select * from src_cnt MINUS select * from tgt_cnt)) AS src_minus_tgt_cnt,
      (select count(*) from (select * from tgt_cnt MINUS select * from src_cnt)) AS tgt_minus_src_cnt,
      (select count(*) from src) AS src_cnt,
      (select count(*) from tgt) AS tgt_cnt
    ) a

  1. processing_cores
    Number of cores to use when running recon queries in parallel.
    If this setting is not specified, or its value is less than 1 or greater than the available cores, the app uses the maximum available cores.
  2. source_file_suffix
    Suffix that identifies source query files in the input directory.
  3. target_file_suffix
    Suffix that identifies target query files in the input directory.
  4. params
    Key-value pairs made available to Groovy expressions in the query files.
  5. wrapper_query
    The final query that is executed against the Hive database (the only supported engine as of now).
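The `params` mechanism can be illustrated with a small sketch. The app evaluates Groovy expressions; the Python code and the `${name}` placeholder syntax below are assumptions made only to show how key-value pairs from the conf file could be substituted into a query file.

```python
import re

def substitute_params(query, params):
    """Replace ${key} occurrences in a query with values from params.

    Assumed syntax: the real app uses Groovy expression evaluation,
    which is more general than this simple substitution.
    """
    return re.sub(r"\$\{(\w+)\}", lambda m: str(params[m.group(1)]), query)

# params block from the example conf above
params = {"db": "pv_test1", "tb1": "src_table", "tb2": "tgt_table"}
src_query = "SELECT id, amount FROM ${db}.${tb1}"
print(substitute_params(src_query, params))
# -> SELECT id, amount FROM pv_test1.src_table
```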

Note: DO NOT use ordinal values in the GROUP BY clause (e.g. GROUP BY 2, 4); use column names instead.

The following static placeholders can be used in the wrapper query as required.

  • <<src_query>>  (replaced with the source file's query)
  • <<tgt_query>>  (replaced with the target file's query)
  • <<src_columns>>  (replaced with a comma-separated column list derived from the source query file)
  • <<tgt_columns>>  (replaced with a comma-separated column list derived from the target query file)
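Placeholder expansion amounts to plain string replacement. The sketch below is illustrative only (the function name and the exact mechanics are assumptions, not the app's source):

```python
def build_final_query(wrapper, src_query, tgt_query, src_columns, tgt_columns):
    """Expand the four static placeholders into the wrapper query.

    Illustrative sketch of the substitution described above.
    """
    replacements = {
        "<<src_query>>": src_query,
        "<<tgt_query>>": tgt_query,
        "<<src_columns>>": ", ".join(src_columns),
        "<<tgt_columns>>": ", ".join(tgt_columns),
    }
    for placeholder, value in replacements.items():
        wrapper = wrapper.replace(placeholder, value)
    return wrapper

wrapper = "WITH src AS (<<src_query>>) SELECT <<src_columns>> FROM src"
print(build_final_query(wrapper,
                        "SELECT id, amt FROM t1", "SELECT id, amt FROM t2",
                        ["id", "amt"], ["id", "amt"]))
# -> WITH src AS (SELECT id, amt FROM t1) SELECT id, amt FROM src
```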

Assumptions

  1. The host machine running the application must have Spark configured with Hive metastore settings.

  2. Source and target query files must be available together under the 'input-dir' path, differing only in their suffix. (e.g. query1_src.txt, query1_tgt.txt)
    Currently only text-format files are supported.
    Files that,

    • are not part of a complete source-target pair, or
    • do not carry a suffix configured in the conf file

    will be ignored.
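The pairing rule above can be sketched as follows. This is an assumed reconstruction of the described behavior, not the app's actual code:

```python
from pathlib import Path

def pair_query_files(files, src_suffix="_src", tgt_suffix="_tgt"):
    """Return {base_name: (src_file, tgt_file)} for complete pairs only.

    Files without a configured suffix, or without a matching
    counterpart, are ignored, as described above.
    """
    src, tgt = {}, {}
    for f in files:
        stem = Path(f).stem          # e.g. "query1_src"
        if stem.endswith(src_suffix):
            src[stem[: -len(src_suffix)]] = f
        elif stem.endswith(tgt_suffix):
            tgt[stem[: -len(tgt_suffix)]] = f
    return {base: (src[base], tgt[base]) for base in src if base in tgt}

files = ["query1_src.txt", "query1_tgt.txt", "query2_src.txt", "notes.txt"]
print(pair_query_files(files))
# -> {'query1': ('query1_src.txt', 'query1_tgt.txt')}
```

Here `query2_src.txt` is dropped because it has no target counterpart, and `notes.txt` is dropped because it carries neither suffix.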

  3. The wrapper query is assumed to produce correct results, and may return one or more rows from the data source.

Output

Output is a single CSV file under the directory provided in the conf file, or under the generated default path.

  • Result rows can be distinguished by the filename column.
  • The CSV file holds the set of columns generated by the wrapper query plus other static columns added by the app.
    Headers for the dynamic columns match the wrapper query's output.
  • For the columns derived from the source and target queries:
    • If the number of source columns and target columns differs, a 'count mismatch' message is appended to the result file.
    • Extra columns in the target query's output are ignored.
    • If some columns are present in the source but not in the target, a message listing those missing column names is appended to the result file.
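The three column-comparison cases above can be summarized in a small sketch. This is an assumed reconstruction of the documented behavior (the function name and message texts are hypothetical):

```python
def column_check_message(src_cols, tgt_cols):
    """Return the message appended to the result file, or '' if columns match.

    Assumed logic matching the three documented cases: count mismatch,
    source columns missing from target, and ignored extra target columns.
    """
    if len(src_cols) != len(tgt_cols):
        return "count mismatch"
    missing = [c for c in src_cols if c not in tgt_cols]
    if missing:
        return "columns missing in target: " + ", ".join(missing)
    return ""  # extra target-only columns would simply be ignored

print(column_check_message(["id", "amt"], ["id", "amt", "extra"]))
# -> count mismatch
print(column_check_message(["id", "amt"], ["id", "qty"]))
# -> columns missing in target: amt
```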