Recon Test Utility App - ja-guzzle/guzzle_docs GitHub Wiki
```shell
spark-submit --class com.justanalytics.guzzle.ext.recontest.Main --master yarn {/Path/To/Jar/ReconTest.jar} \
  "input-dir=/directory_path/to/source_target_query_files/input" \
  "output-dir=/directory_path/to/store_result/outputs" \
  "log-dir=/directory_path/to/store_log/logs" \
  "run-id=example_job_20200316_1"
```
- * Path of the folder that contains the source and target query files. ["input-dir=/directory_path/to/source_target_files/"]
- Path of the configuration file. ["conf-file=/file_path/to/config.yml"] --- [Default: the first YAML file under the "input-dir" (1st arg) path]
- Path of the output directory. ["output-dir=/directory_path/to/store_result/"] --- [Default: the "outputs" directory under the "input-dir" (1st arg) path]
- Path of the log directory. ["log-dir=/directory_path/to/store_log/"] --- [Default: the "logs" directory under the "input-dir" (1st arg) path]
- ID for the job, used as the output CSV filename. ["run-id=job_20200313_1"] --- [Default: result_timestamp.csv under the "output-dir" (3rd arg) path]

Only the argument marked with * is mandatory; default values are used for the others.
```yaml
conf:
  processing_cores: 0
  source_file_suffix: _src
  target_file_suffix: _tgt
  params:
    db: pv_test1
    tb1: src_table
    tb2: tgt_table
  # Use the '|' sign for multiline strings and start the string on a new line, indented
  wrapper_query: |
    WITH src AS (<<src_query>>),
         tgt AS (<<tgt_query>>),
         src_cnt AS (SELECT <<src_columns>>, count(*) cnt FROM src GROUP BY <<src_columns>>),
         tgt_cnt AS (SELECT <<tgt_columns>>, count(*) cnt FROM tgt GROUP BY <<tgt_columns>>)
    SELECT a.*,
           CASE WHEN src_minus_tgt_cnt + tgt_minus_src_cnt = 0 AND src_cnt = tgt_cnt
                THEN 'Y' ELSE 'N' END AS recon_check
    FROM (SELECT (SELECT count(*) FROM (SELECT * FROM src_cnt MINUS SELECT * FROM tgt_cnt)) AS src_minus_tgt_cnt,
                 (SELECT count(*) FROM (SELECT * FROM tgt_cnt MINUS SELECT * FROM src_cnt)) AS tgt_minus_src_cnt,
                 (SELECT count(*) FROM src) AS src_cnt,
                 (SELECT count(*) FROM tgt) AS tgt_cnt) a
```
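The recon logic of the wrapper query can be read as: group both result sets, diff the grouped sets in both directions, and compare totals. A minimal sketch of that check in Python (illustrative only, not the app's code; the row values are hypothetical):

```python
from collections import Counter

def recon_check(src_rows, tgt_rows):
    """Return 'Y' when both sides hold exactly the same rows with the same
    per-row counts and the same total count, else 'N' (mirrors recon_check
    in the wrapper query)."""
    src_cnt = Counter(src_rows)  # rows with their counts (src_cnt CTE)
    tgt_cnt = Counter(tgt_rows)  # rows with their counts (tgt_cnt CTE)
    src_minus_tgt = len(set(src_cnt.items()) - set(tgt_cnt.items()))
    tgt_minus_src = len(set(tgt_cnt.items()) - set(src_cnt.items()))
    same_totals = len(src_rows) == len(tgt_rows)
    return 'Y' if src_minus_tgt + tgt_minus_src == 0 and same_totals else 'N'

print(recon_check([('a', 1), ('b', 2)], [('b', 2), ('a', 1)]))  # Y
print(recon_check([('a', 1)], [('a', 1), ('a', 1)]))            # N
```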

---
- processing_cores: number of cores to use when running in parallel. If this conf is not specified, or its value is less than 1 or greater than the available cores, the app uses the maximum available cores.
- source_file_suffix: suffix used for source query files in the input directory.
- target_file_suffix: suffix used for target query files in the input directory.
- params: key-value pairs for Groovy expressions.
- wrapper_query: the final query that is executed against the Hive database (as of now).
  Note: DO NOT use ordinal values in the GROUP BY clause [e.g. GROUP BY 2, 4, ...].
  The following static placeholders can be used in the wrapper query as required:
  - <<src_query>> (replaced with the query from the source file)
  - <<tgt_query>> (replaced with the query from the target file)
  - <<src_columns>> (replaced with a comma-separated column list derived from the source query file)
  - <<tgt_columns>> (replaced with a comma-separated column list derived from the target query file)
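The placeholder mechanism amounts to plain string substitution into the wrapper query. A sketch of that idea (illustrative only, not the app's code; the queries and column lists are hypothetical examples of what would be read from the query files):

```python
# Hypothetical wrapper query containing the static placeholders.
wrapper_query = (
    "WITH src AS (<<src_query>>), tgt AS (<<tgt_query>>) "
    "SELECT <<src_columns>> FROM src"
)

# Values that would come from the _src/_tgt query files and their parsed columns.
substitutions = {
    "<<src_query>>": "SELECT id, amount FROM pv_test1.src_table",
    "<<tgt_query>>": "SELECT id, amount FROM pv_test1.tgt_table",
    "<<src_columns>>": "id, amount",  # derived from the source query file
    "<<tgt_columns>>": "id, amount",  # derived from the target query file
}

final_query = wrapper_query
for placeholder, value in substitutions.items():
    final_query = final_query.replace(placeholder, value)

print(final_query)
```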

---
The host machine running the application must have Spark configured with the Hive metastore settings.

---
- Source and target query files must both be present under the 'input-dir' path; only their suffixes differ (e.g. query1_src.txt, query1_tgt.txt).
- Currently, only text-format files are supported.
- Files are ignored if they:
  - are not part of a source-target pair, or
  - do not have a suffix specified in the conf file.
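The pairing rule above can be sketched as follows (illustrative only, not the app's code; the filenames are hypothetical):

```python
import os

def pair_query_files(filenames, src_suffix="_src", tgt_suffix="_tgt"):
    """Pair files whose stems match after removing the configured suffixes;
    everything unpaired or unsuffixed is ignored."""
    def stems(suffix):
        return {
            os.path.splitext(f)[0][: -len(suffix)]
            for f in filenames
            if os.path.splitext(f)[0].endswith(suffix)
        }
    paired = stems(src_suffix) & stems(tgt_suffix)
    pairs = [(s + src_suffix + ".txt", s + tgt_suffix + ".txt")
             for s in sorted(paired)]
    ignored = [f for f in filenames if not any(f in p for p in pairs)]
    return pairs, ignored

pairs, ignored = pair_query_files(
    ["query1_src.txt", "query1_tgt.txt", "query2_src.txt", "notes.txt"])
print(pairs)    # [('query1_src.txt', 'query1_tgt.txt')]
print(ignored)  # ['query2_src.txt', 'notes.txt']
```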

---
The wrapper query produces the results; it may fetch one or more rows from the data source.
- Result rows can be distinguished by the filename column.
- The CSV file holds the set of columns generated by the wrapper query plus other static columns added by the app.
- Headers for the dynamic columns match the wrapper query's output, for the columns derived from the source query and target query.
- If the number of source columns and target columns differ, a 'count mismatch' message will be appended to the result file.
- Extra columns in the target query's output will be ignored.
- If some columns are present in the source but not in the target, a message listing the column names missing from the target query's output will be appended to the result file.
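The column checks above can be sketched as follows (illustrative only, not the app's code; the column names and the exact message wording are hypothetical):

```python
def column_check_message(src_columns, tgt_columns):
    """Flag a count mismatch and list source columns missing from the
    target; extra target-only columns are silently ignored."""
    messages = []
    if len(src_columns) != len(tgt_columns):
        messages.append("count mismatch")
    missing = [c for c in src_columns if c not in tgt_columns]
    if missing:
        messages.append("missing in target: " + ", ".join(missing))
    return "; ".join(messages)

print(column_check_message(["id", "amount", "dt"], ["id", "amount"]))
# count mismatch; missing in target: dt
print(column_check_message(["id"], ["id", "extra"]))
# count mismatch
```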