The `etl_qa_run_pipeline()` function is designed to identify data quality issues before performing data analysis. To maximize its effectiveness, it is recommended to use this tool throughout the ETL pipeline. It can be run after extraction to ensure that the data we received is as expected and has not been corrupted during transmission. After transformation, it can check for errors introduced by coding mistakes. Finally, after loading the data to its near-final destination, it can be run again, perhaps by an analyst familiar with the data, to check for any remaining data quality issues.
In essence, the function draws data from SQL Server, from a data.table or data.frame in R's memory, or via the `rads::get_data_xxx` functions. It identifies missing data and assesses changes in numeric statistics and categorical frequency distributions over time. Results are provided as tables that can be filtered and reviewed (and easily read into Tableau) and as PDF graphs that help identify changes over time. Optionally, it can check whether variables align with CHI (Community Health Indicators) standards.
The function returns a named list containing four items:

- `config`: Configuration settings used for the analysis
- `initial`: Initial ETL QA results
- `final`: Final ETL QA data.table summaries that are exported and used in plots
- `exported`: File paths for exported tables and plots
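For example, assuming the result has been saved to an object named `myresult` (a hypothetical name used here for illustration), the returned list can be inspected like this:

```r
# 'myresult' is a hypothetical object returned by etl_qa_run_pipeline()
names(myresult)               # "config" "initial" "final" "exported"
str(myresult, max.level = 1)  # peek at the four top-level items
myresult$exported             # paths to the exported Excel workbook and PDF files
```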
Here is a peek at the output generated by the example code that will be discussed in detail below.
Figure 1: This graph shows the percentage of observations (i.e., rows) that are missing at each time period.
Figure 2a: This graph shows basic statistics for numeric variables at each time period.
Figure 2b: This graph shows basic statistics for date variables at each time period. Note that datetime variables, such as `POSIXct` and `POSIXt` in R and `datetime`, `datetime2`, and `smalldatetime` in SQL, will be assessed as dates.
Figure 2c: This graph shows the proportion of observations with a given value at each time period. It displays the top eight most frequent values PLUS missing values (dotted line) PLUS all other values combined and labeled 'Other values'. Note that numeric and date variables with six or fewer distinct values will be treated as categorical.
As you can see in Figure 2c, the proportion lines can overlap, and it can be difficult to identify all the differences that might be of interest. For a more detailed exploration of the data, you can view the tables stored in the exported Excel file (which are also saved in the `final` item of the returned list). Here is a snapshot of each of the tables:
Here is a sample of the 'missingness' worksheet saved to the Excel file. You can access the same table in R by typing something like `myresult$final$missingness`. Per the default settings (which can be adjusted), it flags >= 3% absolute changes in the proportion missing between adjacent time periods; a short sketch for pulling out the flagged rows follows the table.
| time_period | varname | nrow | proportion | abs_change |
|---|---|---|---|---|
| 2013 | birthplace_city | 87829 | 1.000 | NA |
| 2014 | birthplace_city | 89858 | 1.000 | NA |
| 2015 | birthplace_city | 90371 | 1.000 | NA |
| 2016 | birthplace_city | 91756 | 1.000 | NA |
| 2017 | birthplace_city | 1347 | 0.015 | -98.5% |
| 2018 | birthplace_city | 1398 | 0.016 | NA |
| 2019 | birthplace_city | 1365 | 0.016 | NA |
| 2020 | birthplace_city | 1423 | 0.017 | NA |
| 2021 | birthplace_city | 1595 | 0.019 | NA |
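If you want to isolate just the flagged rows for review, here is a minimal data.table sketch. It assumes, based on the sample above, that `abs_change` is populated only when a change crossed `abs_threshold`, and it uses the hypothetical `myresult` object introduced earlier:

```r
library(data.table)

# hypothetical: the missingness summary from a saved etl_qa_run_pipeline() result
missing_dt <- myresult$final$missingness

# keep only the flagged rows, i.e., those where abs_change is not missing
flagged_missing <- missing_dt[!is.na(abs_change)]
flagged_missing[order(varname, time_period)]
```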
Similarly, here is a sample from the 'values' worksheet (`myresult$final$values`). The default settings flag rows with a >= 2% relative change in the mean value between adjacent years; a similar filtering sketch follows the table.
| time_period | varname | value | count | proportion | abs_proportion_change | vartype | mean | median | min | max | rel_mean_change | rel_median_change | median_date | min_date | max_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020 | mother_date_of_birth | NA | NA | NA | NA | Date | NA | NA | NA | NA | NA | NA | 1989-12-03 | 1955-09-03 | 2007-07-07 |
| 2021 | mother_date_of_birth | NA | NA | NA | NA | Date | NA | NA | NA | NA | NA | NA | 1990-09-11 | 1958-02-28 | 2010-01-03 |
| 2022 | mother_date_of_birth | NA | NA | NA | NA | Date | NA | NA | NA | NA | NA | NA | 1991-07-24 | 1965-09-28 | 2010-04-23 |
| 2013 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 0.180 | 0 | 0 | 9 | NA | NA | NA | NA | NA |
| 2014 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 0.188 | 0 | 0 | 8 | 4.6% | NA | NA | NA | NA |
| 2015 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 0.190 | 0 | 0 | 8 | NA | NA | NA | NA | NA |
| 2016 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 0.190 | 0 | 0 | 7 | NA | NA | NA | NA | NA |
| 2017 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 0.750 | 1 | 0 | 9 | 295% | Inf% | NA | NA | NA |
| 2018 | num_prev_cesarean | NA | NA | NA | NA | Continuous | 1.173 | 1 | 0 | 10 | 56.4% | NA | NA | NA | NA |
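As with the missingness table, the flagged rows can be pulled out in R. This is a minimal sketch assuming, based on the sample above, that `rel_mean_change` is populated only when the relative change crossed `rel_threshold`, and again using the hypothetical `myresult` object:

```r
library(data.table)

# hypothetical: the 'values' summary from a saved etl_qa_run_pipeline() result
values_dt <- myresult$final$values

# continuous variables whose mean changed enough between adjacent periods to be flagged
values_dt[vartype == 'Continuous' & !is.na(rel_mean_change),
          .(time_period, varname, mean, median, min, max, rel_mean_change)]
```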
Finally, if you set `check_chi = TRUE`, your Excel workbook will have a sheet called 'chi_standards' (`myresult$final$chi_standards`). It provides a simple '*' to indicate when there is a misalignment between the CHI standard and your data; a sketch for isolating those rows follows the table.
| chi_year | varname | group | your_data | chi | problem |
|---|---|---|---|---|---|
| 2019 | chi_race_7 | Black | 1 | 1 | NA |
| 2020 | chi_race_7 | Black | 1 | 1 | NA |
| 2021 | chi_race_7 | Black | 1 | 1 | NA |
| 2022 | chi_race_7 | Black | 1 | 1 | NA |
| NA | chi_race_7 | Hispanic | 0 | 1 | * |
| 2013 | chi_race_7 | Multiple | 1 | 1 | NA |
| 2014 | chi_race_7 | Multiple | 1 | 1 | NA |
| 2015 | chi_race_7 | Multiple | 1 | 1 | NA |
| 2016 | chi_race_7 | Multiple | 1 | 1 | NA |
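To limit your review to only the rows where a misalignment was flagged, a minimal data.table sketch (again assuming the hypothetical `myresult` object) could be:

```r
library(data.table)

# hypothetical: the CHI standards comparison from a saved etl_qa_run_pipeline() result
chi_dt <- myresult$final$chi_standards

# keep only the rows where the '*' flag marks a misalignment with the CHI standard
chi_dt[!is.na(problem)]
```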
Now that you've learned what this function can do, let's learn how to use it!
Start by loading the package:

```r
library(apde)
```
Since this function allows the user to QA data from three different data sources that need different specifications, understanding the arguments is not as simple as typing `args(etl_qa_run_pipeline)`. Here are all the possible arguments as described in the help file (`?etl_qa_run_pipeline`):
- `data_source_type`: Character string specifying the type of data source. Must be one of `'r_dataframe'`, `'sql_server'`, or `'rads'`.
- `connection`: A DBIConnection object. Required only when `data_source_type = 'sql_server'`.
- `data_params`: List of data-related parameters specific to the data source. Not all parameters are needed for all data sources. Please review the examples for details.
  - `check_chi`: Logical vector of length 1. When `check_chi = TRUE`, the function will add any available CHI related variables to `data_params$cols` and will assess whether their values align with the standards in `rads.data::misc_chi_byvars`. Default is `check_chi = FALSE`.
  - `cols`: Character vector specifying the column names to analyze, e.g., `cols = c('race4', 'birth_weight_grams', 'birthplace_city')`.
  - `time_range`: Character vector of length 2 specifying the start and end of the time range, e.g., `time_range = c(2015, 2024)`.
  - `time_var`: Character string specifying the time interval variable, e.g., `time_var = 'chi_year'`.
  - `data`: Name of a data.frame or data.table that you want to assess with this function, e.g., `data = myDataTable`. Required only when `data_source_type = 'r_dataframe'`.
  - `function_name`: Character string specifying the relevant `rads::get_data_xxx` function, e.g., `function_name = 'get_data_birth'`. Required only when `data_source_type = 'rads'`.
  - `kingco`: Logical vector of length 1. Identifies whether you want to limit the data to King County. Required only when `data_source_type = 'rads'`. Default is `kingco = TRUE`.
  - `version`: Character string specifying either `'final'` or `'stage'`. Required only when `data_source_type = 'rads'`. Default is `version = 'stage'`.
  - `schema_table`: The name of the schema and table to be accessed within the SQL Server connection. Must be in the form `myschema.mytable`, with a period as a separator, e.g., `schema_table = 'birth.stage_analytic'`. Required only when `data_source_type = 'sql_server'`.
- `output_directory`: Character string specifying the directory where output files will be saved. If `NULL`, the current working directory is used. Default is `output_directory = NULL`.
- `digits_mean`: Integer specifying the number of decimal places for rounding the reported mean, median, min, and max. Default is `digits_mean = 3`.
- `digits_prop`: Integer specifying the number of decimal places for rounding proportions. Default is `digits_prop = 3`.
- `abs_threshold`: Numeric threshold for flagging absolute percentage changes in proportions. Permissible range is [0, 100]. Default is `abs_threshold = 3`.
- `rel_threshold`: Numeric threshold for flagging relative percentage changes in means and medians. Permissible range is [0, 100]. Default is `rel_threshold = 2`.
As stated above, this function can QA data in SQL Server, in a local data.table/data.frame, or from `rads`. Here we provide an example of each method to QA the same data source. Since all three methods provide identical output, we will only explore the SQL Server QA in detail, knowing that it generalizes to the other two.
```r
myconnection <- rads::validate_hhsaw_key()

qaSQL <- etl_qa_run_pipeline(
  data_source_type = 'sql_server',
  connection = myconnection,
  data_params = list(
    schema_table = 'birth.final_analytic',
    time_var = 'chi_year',
    time_range = c(2013, 2022),
    cols = c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city',
             'num_prev_cesarean', 'mother_date_of_birth'),
    check_chi = TRUE
  ),
  output_directory = tempdir()
)
```
```
Creating config object ... running etl_qa_setup_config()
Analyzing data ... running etl_qa_initial_results()
😦👿🤬⚠
The following varname and group combinations exist in the rads.data::misc_chi_byvars
standards but are missing from your dataset. Please ensure your dataset complies with
the CHI standard.

|     varname|group        |note                                                       |
|-----------:|:------------|:----------------------------------------------------------|
| chi_race_7 |Hispanic     |It's OK! A race variable cannot also represent ethnicity.  |
|       race3|Hispanic     |It's OK! A race variable cannot also represent ethnicity.  |
|       race3|Non-Hispanic |It's OK! A race variable cannot also represent ethnicity.  |

Preparing results ... running etl_qa_final_results()
Visualizing data ... running etl_qa_export_results()
Finished! Check your output_directory: C:\Users\DCOLOM~1.KC\AppData\Local\Temp\Rtmp27aLWN
```
Since we set `check_chi = TRUE`, the function pulled in all available CHI related variables and compared them to the known standards. In this case, it identified three deviations between the birth data and the CHI standards. However, these are known issues related to the conception of the CHI variables rather than to a data processing problem, so there is a `note` letting the user know not to worry about the discrepancy. I've hidden this message in the examples that follow to simplify this vignette.
The final statement asks the user to check their `output_directory`, which is `tempdir()`. Let's see what is saved there:

```r
list.files(tempdir())

[1] "birth.final_analytic_qa_2024_09_30.xlsx"
[2] "birth.final_analytic_qa_missing_2024_09_30.pdf"
[3] "birth.final_analytic_qa_values_2024_09_30.pdf"
```
- `birth.final_analytic_qa_missing_2024_09_30.pdf` contains graphs like the one shown in Figure 1.
- `birth.final_analytic_qa_values_2024_09_30.pdf` contains graphs like those in Figures 2a, 2b, and 2c.
- `birth.final_analytic_qa_2024_09_30.xlsx` contains three worksheets, one for each of the three tables given in the Tables section above; a sketch for reading these worksheets back into R follows below.
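If you (or an analyst reviewing your work) would rather start from the exported workbook than from the returned R object, the worksheets can be read back into R. This is a minimal sketch assuming the openxlsx package is installed and that the file name matches the one exported above:

```r
library(openxlsx)

# hypothetical path: the workbook written to output_directory (tempdir() in this example)
wb_path <- file.path(tempdir(), 'birth.final_analytic_qa_2024_09_30.xlsx')

getSheetNames(wb_path)  # should list the missingness, values, and chi_standards worksheets
missingness <- read.xlsx(wb_path, sheet = 'missingness')
```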
We also saved the output as the object `qaSQL`. Let's confirm that what was written about the returned object in the Sneak peek section above was correct:

```r
# confirm it is a list
inherits(qaSQL, 'list')

[1] TRUE

# get the names of the objects in the returned list
names(qaSQL)

[1] "config"   "initial"  "final"    "exported"

# get the names of the returned tables (which correspond to those in the Excel file)
names(qaSQL$final)

[1] "missingness"   "values"        "chi_standards"
```
Here is the same QA performed on a local data.frame/data.table, using data downloaded with `rads`:

```r
birth_data <- rads::get_data_birth(year = c(2013:2022),
                                   kingco = F,
                                   version = 'final')

qaDF <- etl_qa_run_pipeline(
  data_source_type = 'r_dataframe',
  data_params = list(
    data = birth_data,
    time_var = 'chi_year',
    time_range = c(2013, 2022),
    cols = c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city',
             'num_prev_cesarean', 'mother_date_of_birth'),
    check_chi = TRUE
  ),
  output_directory = tempdir()
)
```
And here is the same QA performed by pulling the data directly via `rads`:

```r
qaRADS <- etl_qa_run_pipeline(
  data_source_type = 'rads',
  data_params = list(
    function_name = 'get_data_birth',
    time_var = 'chi_year',
    time_range = c(2013, 2022),
    cols = c('chi_age', 'race4', 'birth_weight_grams', 'birthplace_city',
             'num_prev_cesarean', 'mother_date_of_birth'),
    version = 'final',
    kingco = FALSE,
    check_chi = TRUE
  ),
  output_directory = tempdir()
)
```
As promised, all three methods return identical final tables:

```r
identical(qaDF$final, qaSQL$final)

[1] TRUE

identical(qaDF$final, qaRADS$final)

[1] TRUE
```
Remember to consult the function documentation for more detailed information on usage and parameters. Happy coding!
– Updated by dcolombara, 2024-09-30