Recon Test Utility App - ja-guzzle/guzzle_docs GitHub Wiki

Command To Run App

spark-submit --class com.justanalytics.guzzle.ext.recontest.Main --master yarn {/Path/To/Jar/ReconTest.jar} \
"input-dir=/directory_path/to/source_target_query_files/input" \
"output-dir=/directory_path/to/store_result/outputs" \
"log-dir=/directory_path/to/store_log/logs" \
"run-id=example_job_20200316_1"

Input / Arguments

  1. * Path of the folder containing the source and target query files. ["input-dir=/directory_path/to/source_target_files/"]
  2. Path of the configuration file. ["conf-file=/file_path/to/config.yml"] --- [Default: the first YAML file under the "input-dir" (1st arg) path]
  3. Path of the output directory. ["output-dir=/directory_path/to/store_result/"] --- [Default: the "outputs" directory under the "input-dir" (1st arg) path]
  4. Path of the log directory. ["log-dir=/directory_path/to/store_log/"] --- [Default: the "logs" directory under the "input-dir" (1st arg) path]
  5. ID for the job, used as the output CSV filename. ["run-id=job_20200313_1"] --- [Default: result_<timestamp>.csv under the "output-dir" (3rd arg) path]

Only the argument marked with * is mandatory; default values are used for the others.
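The argument-defaulting rules above can be sketched as follows. This is an illustrative Python approximation, not the app's actual Scala code; the `parse_args` function and the timestamp-based default run-id format are assumptions.

```python
import os
import time

def parse_args(argv):
    """Parse key=value arguments and apply the documented defaults.

    Hypothetical sketch: the real Main class may parse differently.
    """
    args = dict(a.split("=", 1) for a in argv if "=" in a)
    if "input-dir" not in args:
        raise ValueError("input-dir is mandatory")
    base = args["input-dir"]
    # Defaults are derived from input-dir, as described above.
    args.setdefault("output-dir", os.path.join(base, "outputs"))
    args.setdefault("log-dir", os.path.join(base, "logs"))
    args.setdefault("run-id", "result_%d" % int(time.time()))
    return args

print(parse_args(["input-dir=/data/recon/input", "run-id=job_1"]))
```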

Configuration YAML File

Format

conf:
  processing_cores: 0
  source_file_suffix: _src
  target_file_suffix: _tgt
  params:
    db: pv_test1
    tb1: src_table
    tb2: tgt_table
  wrapper_query: | # Use '|' for multiline strings and start the string on a new line, indented two spaces
    WITH
    src AS (<<src_query>>),
    tgt AS (<<tgt_query>>),
    src_cnt AS (SELECT <<src_columns>>, count(*) cnt FROM src GROUP BY <<src_columns>>),
    tgt_cnt AS (SELECT <<tgt_columns>>, count(*) cnt FROM tgt GROUP BY <<tgt_columns>>)
    SELECT a.*, case when src_minus_tgt_cnt + tgt_minus_src_cnt = 0 and src_cnt = tgt_cnt then 'Y' else 'N' END AS recon_check FROM 
    (SELECT
      (select count(*) from (select * from src_cnt MINUS select * from tgt_cnt)) AS src_minus_tgt_cnt,
      (select count(*) from (select * from tgt_cnt MINUS select * from src_cnt)) AS tgt_minus_src_cnt,
      (select count(*) from src) AS src_cnt,
      (select count(*) from tgt) AS tgt_cnt
    ) a

  1. processing_cores
    Number of cores to use when running recon queries in parallel.
    If this setting is not specified, or its value is less than 1 or greater than the available cores, the app uses the maximum available cores.
  2. source_file_suffix
    Suffix that identifies source query files in the input directory.
  3. target_file_suffix
    Suffix that identifies target query files in the input directory.
  4. params
    Key-value pairs made available to Groovy expressions in the query files.
  5. wrapper_query
    The final query that is executed against the Hive database (the only supported engine as of now).
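The `params` mechanism can be illustrated with a small sketch. The app evaluates Groovy expressions; the Python code and the `${name}` placeholder syntax below are assumptions made only to show how key-value pairs from the conf file could be substituted into a query file.

```python
import re

def substitute_params(query, params):
    """Replace ${key} occurrences in a query with values from params.

    Assumed syntax: the real app uses Groovy expression evaluation,
    which is more general than this simple substitution.
    """
    return re.sub(r"\$\{(\w+)\}", lambda m: str(params[m.group(1)]), query)

# params block from the example conf above
params = {"db": "pv_test1", "tb1": "src_table", "tb2": "tgt_table"}
src_query = "SELECT id, amount FROM ${db}.${tb1}"
print(substitute_params(src_query, params))
# -> SELECT id, amount FROM pv_test1.src_table
```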

Note: DO NOT use ordinal values in the GROUP BY clause (e.g. GROUP BY 2, 4); use column names instead.

The following static placeholders can be used in the wrapper query as required.

  • <<src_query>>  (replaced with the source file's query)
  • <<tgt_query>>  (replaced with the target file's query)
  • <<src_columns>>  (replaced with a comma-separated column list derived from the source query file)
  • <<tgt_columns>>  (replaced with a comma-separated column list derived from the target query file)
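Placeholder expansion amounts to plain string replacement. The sketch below is illustrative only (the function name and the exact mechanics are assumptions, not the app's source):

```python
def build_final_query(wrapper, src_query, tgt_query, src_columns, tgt_columns):
    """Expand the four static placeholders into the wrapper query.

    Illustrative sketch of the substitution described above.
    """
    replacements = {
        "<<src_query>>": src_query,
        "<<tgt_query>>": tgt_query,
        "<<src_columns>>": ", ".join(src_columns),
        "<<tgt_columns>>": ", ".join(tgt_columns),
    }
    for placeholder, value in replacements.items():
        wrapper = wrapper.replace(placeholder, value)
    return wrapper

wrapper = "WITH src AS (<<src_query>>) SELECT <<src_columns>> FROM src"
print(build_final_query(wrapper,
                        "SELECT id, amt FROM t1", "SELECT id, amt FROM t2",
                        ["id", "amt"], ["id", "amt"]))
# -> WITH src AS (SELECT id, amt FROM t1) SELECT id, amt FROM src
```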

Assumptions

  1. The host machine running the application must have Spark configured with Hive metastore settings.

  2. Source and target query files must be available together under the 'input-dir' path, differing only in their suffix. (e.g. query1_src.txt, query1_tgt.txt)
    Currently only text-format files are supported.
    Files that,

    • are not part of a complete source-target pair, or
    • do not carry a suffix configured in the conf file

    will be ignored.
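The pairing rule above can be sketched as follows. This is an assumed reconstruction of the described behavior, not the app's actual code:

```python
from pathlib import Path

def pair_query_files(files, src_suffix="_src", tgt_suffix="_tgt"):
    """Return {base_name: (src_file, tgt_file)} for complete pairs only.

    Files without a configured suffix, or without a matching
    counterpart, are ignored, as described above.
    """
    src, tgt = {}, {}
    for f in files:
        stem = Path(f).stem          # e.g. "query1_src"
        if stem.endswith(src_suffix):
            src[stem[: -len(src_suffix)]] = f
        elif stem.endswith(tgt_suffix):
            tgt[stem[: -len(tgt_suffix)]] = f
    return {base: (src[base], tgt[base]) for base in src if base in tgt}

files = ["query1_src.txt", "query1_tgt.txt", "query2_src.txt", "notes.txt"]
print(pair_query_files(files))
# -> {'query1': ('query1_src.txt', 'query1_tgt.txt')}
```

Here `query2_src.txt` is dropped because it has no target counterpart, and `notes.txt` is dropped because it carries neither suffix.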

  3. The wrapper query is assumed to produce correct results, and may return one or more rows from the data source.

Output

Output is a single CSV file under the directory provided in the conf file, or under the generated default path.

  • Result rows can be distinguished by the filename column.
  • The CSV file holds the set of columns generated by the wrapper query plus other static columns added by the app.
    Headers for the dynamic columns match the wrapper query's output.
  • For the columns derived from the source and target queries:
    • If the number of source columns and target columns differs, a 'count mismatch' message is appended to the result file.
    • Extra columns in the target query's output are ignored.
    • If some columns are present in the source but not in the target, a message listing those missing column names is appended to the result file.
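The three column-comparison cases above can be summarized in a small sketch. This is an assumed reconstruction of the documented behavior (the function name and message texts are hypothetical):

```python
def column_check_message(src_cols, tgt_cols):
    """Return the message appended to the result file, or '' if columns match.

    Assumed logic matching the three documented cases: count mismatch,
    source columns missing from target, and ignored extra target columns.
    """
    if len(src_cols) != len(tgt_cols):
        return "count mismatch"
    missing = [c for c in src_cols if c not in tgt_cols]
    if missing:
        return "columns missing in target: " + ", ".join(missing)
    return ""  # extra target-only columns would simply be ignored

print(column_check_message(["id", "amt"], ["id", "amt", "extra"]))
# -> count mismatch
print(column_check_message(["id", "amt"], ["id", "qty"]))
# -> columns missing in target: amt
```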