# 01.Association02.Inferential tests - sporedata/researchdesigneR Wiki

## 1. Use cases: in which situations should I use this method?

Inferential tests are used to compare two variables (or one variable against a population value), either as an initial exploratory analysis or to explore unadjusted relations. Unadjusted relations are useful because they show "the world as it is" rather than exploring causes.

## 2. Input: what kind of data does the method require?

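1. Cross-sectional or longitudinal data
2. Outcome and predictor variables

• Frequentist tests are chosen according to the type and distribution of the variables being compared. Common ones include:

• Two-sample t-test - outcome close to a normal distribution, dichotomous risk factor, patients coming from two distinct samples
• Paired t-test - paired differences close to a normal distribution, same patients measured twice
• One-sample t-test - outcome close to a normal distribution, compared against a reference population value
• Chi-square test - categorical outcome and categorical predictor
• Correlation test - continuous outcome and continuous predictor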
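• Mock dataset

```r
library(fabricatr)

# Simulate 1000 patients with demographic and quality-of-life (qol) variables
patients <- fabricate(
  N = 1000,
  gender = draw_binary(0.5, N = N),
  qol = round(runif(N, 45, 90)),
  age = round(runif(N, 18, 85)),
  # qol is drawn from 45 to 90, so the qol < 40 branch is never used here
  prepost = draw_binary(prob = ifelse(qol < 40, 0.4, 0.7), N = N),
  rural = draw_binary(prob = ifelse(qol < 55, 0.3, 0.9), N = N)
)
```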
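The common frequentist tests listed in this section are all available in base R. The sketch below is only an illustration: it simulates its own small dataset with base R instead of fabricatr so that it runs without extra packages, and the variable names (`group`, `qol_followup`) and all numeric values are assumptions, not part of the mock dataset above.

```r
# Simulated data in base R (hypothetical values, for illustration only)
set.seed(42)
n <- 200
group <- rbinom(n, 1, 0.5)                            # dichotomous predictor
qol   <- rnorm(n, mean = 65 + 5 * group, sd = 10)     # roughly normal outcome
qol_followup <- qol + rnorm(n, mean = 2, sd = 5)      # second measurement, same patients
rural <- rbinom(n, 1, ifelse(group == 1, 0.6, 0.4))   # second categorical variable
age   <- round(runif(n, 18, 85))                      # continuous variable

# Two-sample t-test: continuous outcome across two independent groups
two_sample <- t.test(qol ~ group)

# Paired t-test: two measurements on the same patients
paired <- t.test(qol_followup, qol, paired = TRUE)

# One-sample t-test: sample mean against a reference population value
one_sample <- t.test(qol, mu = 65)

# Chi-square test: two categorical variables
chi_sq <- chisq.test(table(group, rural))

# Pearson correlation test: two continuous variables
corr <- cor.test(age, qol, method = "pearson")

# Each result is an "htest" object; collect the p values
p_values <- sapply(list(two_sample, paired, one_sample, chi_sq, corr),
                   function(x) x$p.value)
```

Each call returns an `htest` object, so the estimate, confidence interval, and p value can be read with `$estimate`, `$conf.int`, and `$p.value` where the test provides them.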

## 3. Algorithm: how does the method work?

### Model mechanics

• Bayesian inferential tests start from a prior belief (in clinical research, often mildly informative), which is updated by the data to generate a posterior belief.
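The prior-to-posterior update can be illustrated with the simplest closed-form case, a Beta prior on a proportion combined with binomial data. This is a minimal sketch under assumed numbers (a Beta(2, 2) prior and 30 events in 100 patients are hypothetical); packages such as rstan and brms instead approximate the posterior by sampling, which this example does not show.

```r
# Beta-binomial update: prior belief about a proportion, updated by data
# All numbers are hypothetical and chosen for illustration only
prior_a <- 2
prior_b <- 2          # mildly informative prior centered on 0.5
events  <- 30
n       <- 100        # observed data: 30 events in 100 patients

# Conjugate update: the posterior is Beta(a + events, b + non-events)
post_a <- prior_a + events
post_b <- prior_b + (n - events)

# Posterior mean and 95% credible interval
post_mean <- post_a / (post_a + post_b)
cred_int  <- qbeta(c(0.025, 0.975), post_a, post_b)
```

The posterior mean sits between the prior mean (0.5) and the observed proportion (0.3), and moves closer to the data as the sample grows.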

#### Data science packages

• rstan [1] and brms [2] for Bayesian methods.
• Frequentist t-tests, chi-square tests, and correlation tests are available in base R, as well as in dozens of other packages.
• epiR for the Population Attributable Fraction [3].

## 4. Output: how do I interpret this method's results?

For each of the tests below, there is a frequentist as well as a Bayesian version of the test.

• T-tests - compare a continuous variable across two groups. A t-test requires a continuous outcome variable and a dichotomous (yes/no) predictor.
• Chi-square tests - compare two categorical variables. A chi-square test requires a categorical outcome variable and a categorical predictor.
• Correlation tests - compare two continuous variables.
• Standardized Mean Difference (SMD) - compares a continuous variable across two groups, expressing the difference in standard deviation units. We use the following guidelines when interpreting SMD magnitude: 0.2 corresponds to a small effect, 0.5 to a medium effect, and 0.8 to a large effect [4].
• Population Attributable Fraction (PAF) - estimates the contribution of an individual risk factor to the burden of disease [5].
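The last two measures can be computed by hand in a few lines. The sketch below is an assumption-laden illustration rather than the wiki's own workflow: `smd()` and `paf_levin()` are hypothetical helper names, the SMD uses a pooled standard deviation (Cohen's d), and the PAF uses Levin's formula from an exposure prevalence and a relative risk. In practice the epiR package cited above would be used for PAF.

```r
# Standardized mean difference (Cohen's d with a pooled standard deviation)
# x: continuous variable; g: group indicator coded 0/1
smd <- function(x, g) {
  x1 <- x[g == 1]
  x0 <- x[g == 0]
  n1 <- length(x1)
  n0 <- length(x0)
  pooled_sd <- sqrt(((n1 - 1) * var(x1) + (n0 - 1) * var(x0)) / (n1 + n0 - 2))
  (mean(x1) - mean(x0)) / pooled_sd
}

# Population attributable fraction, Levin's formula
# p_exposed: prevalence of the risk factor; rr: relative risk
paf_levin <- function(p_exposed, rr) {
  p_exposed * (rr - 1) / (1 + p_exposed * (rr - 1))
}

# Hypothetical data: two groups of 500 whose true SMD is 0.5
set.seed(1)
g <- rep(c(0, 1), each = 500)
x <- rnorm(1000, mean = 60 + 5 * g, sd = 10)
smd_estimate <- smd(x, g)

# Hypothetical risk factor: 30% exposed, relative risk of 2
paf_estimate <- paf_levin(p_exposed = 0.3, rr = 2)
```

With these inputs the SMD estimate should fall near 0.5, a medium effect under the guidelines above, and the PAF equals 0.3 / 1.3, or about 23% of the disease burden.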

### Typical tables and plots and corresponding text description

• Table one

```r
# `Data` is the analysis data frame; it must contain the columns listed below
vars <- c("age", "gender", "qol", "Diabetes")
strata <- c("Cancer")
sdatools::tableOne(Data, vars, strata)
```

• Table description: `r table_nums("tableOne", "Sample description.", "cite")` displays a description of the study sample. We present a comparison between patients with and without a cancer diagnosis. Our total sample included 1000 patients, 673 with a cancer diagnosis and 327 without. The sample's mean age was 50.3 (± 19) years, and most patients were male (60.4%). Compared to patients with a cancer diagnosis, those without a cancer diagnosis presented a higher incidence of diabetes (76.7 vs. 80, p = 0.01).

• Table description template: `r table_nums("tableOne", "Sample description.", "cite")` displays a description of the study sample. We also present a comparison between patients in ({{group| Insert the project arms, for example cancer diagnosis and no cancer diagnosis}}). Our total sample included ({{Samplenumber| Insert the sample number, for example 1000}}), ({{Samplenumber| Insert the first group sample number, for example 302}}) who underwent ({{First group| Insert the first group}}) and ({{Second group sample number| Insert the second group sample number, for example 698}}) who underwent ({{Secondintervention| Insert the second group}}). The sample's mean age was ({{Mean age| Insert the mean age}}).

• Explanatory analysis

```r
outcomes <- c("qol")
predictors <- c("gender", "age")
confounders <- c()   # no confounders: the relations are unadjusted

expanalysis <- sdatools::ExplanatoryAnalysis(
  data, predictors, confounders, outcomes,
  split_predictors = TRUE,
  preprocess_missing = FALSE,
  preprocess_linear_combos = FALSE,
  preprocess_nzv = FALSE,
  preprocess_high_correlation = FALSE,
  labels = NULL
)

# Predicted means, transposed for display
knitr::kable(t(sdatools::predictedMeans(expanalysis)))
```

• Table description: A multiple regression analysis was carried out between sociodemographic variables and qol. Qol was significantly associated with age (p = 0.036) and diabetes diagnosis (p < 0.001).

• Plots

• Box plot

```r
sdatools::boxPlot(patients, "age", strata)
```

• Scatter plot

```r
sdatools::scatterPlot(patients, "age", "qol")
```

• Bar plot

```r
sdatools::barPlot(patients, "Cancer", "qol")
```

• Stacked bar plot

```r
sdatools::stackedBarPlot(patients, "Cancer", "qol")
```

• Pirate plot

```r
sdatools::piratePlot(patients, "Cancer", "qol")
```

• t-test: [group 1] presented a significantly smaller [outcome] than [group 2] ([mean 1] vs. [mean 2], [p value]).

• Chi-square test: a higher frequency of [var 1] was significantly associated with a higher frequency of [var 2] ([p value]).

• Pearson correlation test: an increase/decrease in [var 1] was significantly correlated with an increase/decrease in [var 2] ([p value]).
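The t-test sentence template can be filled in directly from a fitted test. The helper below, `report_ttest()`, is a hypothetical name introduced only for this sketch. It assumes a two-level grouping variable and a 0.05 significance threshold, and it falls back to a neutral sentence when the difference is not significant. The data are simulated.

```r
# Fill the t-test reporting template from a fitted two-sample t-test
# outcome: numeric vector; group: two-level grouping variable
report_ttest <- function(outcome, group, outcome_label, alpha = 0.05) {
  group <- factor(group)
  stopifnot(nlevels(group) == 2)
  fit   <- t.test(outcome ~ group)
  means <- unname(fit$estimate)   # group means, in factor-level order
  p_txt <- if (fit$p.value < 0.001) "p < 0.001" else sprintf("p = %.3f", fit$p.value)
  if (fit$p.value < alpha) {
    direction <- if (means[1] < means[2]) "smaller" else "larger"
    sprintf("%s presented a significantly %s %s than %s (%.1f vs. %.1f, %s).",
            levels(group)[1], direction, outcome_label, levels(group)[2],
            means[1], means[2], p_txt)
  } else {
    sprintf("%s did not differ significantly between %s and %s (%.1f vs. %.1f, %s).",
            outcome_label, levels(group)[1], levels(group)[2],
            means[1], means[2], p_txt)
  }
}

# Hypothetical example: two arms of 100 patients each
set.seed(7)
arm <- rep(c("Control", "Treatment"), each = 100)
qol <- rnorm(200, mean = ifelse(arm == "Treatment", 70, 60), sd = 10)
sentence <- report_ttest(qol, arm, "quality of life")
```

Generating the sentence from the test object keeps the reported means and p value in sync with the analysis when the data change.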

a. Variable order: Always follow this order when presenting variables (describing in methods or in tables from results):
1. Sociodemographic variables (age, education, gender, etc).
2. Social determinants of health
3. Comorbidities
4. Clinical variables (diagnosis, etc)
5. Outcomes

b. Univariate and bivariate analyses should be presented before modeling. For example, Kaplan-Meier plots should go before results from Cox proportional hazards models.

### Associated concepts

Inferential tests help suggest explanations for situations or phenomena observed in the clinic. They also allow conclusions and inferences to be drawn from sample data, such as data collected in surveys or observed in clinical trials.

### Mock conclusions or most frequent format for conclusions reached at the end of a typical analysis.

• Frequentist (traditional or non-Bayesian) tests will often provide a p value along with 95% confidence intervals (CIs). The interpretation of p values is complex: a p value is the probability of observing data at least as extreme as those observed if the null hypothesis were true. It is neither the probability that the null hypothesis is true nor the probability that our actual hypothesis is correct. A confidence level represents the proportion of confidence intervals, computed over repeated samples, that would contain the true value of whatever you are trying to estimate, for example a mean difference between two samples.
• Bayesian tests make use of credible intervals. Given the data, the model, and the prior, a 95% credible interval contains the true value with 95% probability. This interpretation tends to be more intuitive and straightforward.
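The repeated-sampling reading of a confidence interval can be checked with a short simulation. The sketch below uses assumed values (a true mean of 65, samples of 50, and 2000 repetitions are all hypothetical) and counts how often the 95% CI from a one-sample t-test contains the true mean.

```r
# Coverage of the 95% confidence interval under repeated sampling
set.seed(123)
true_mean <- 65      # hypothetical true population mean
n_sims    <- 2000    # number of repeated samples

covered <- replicate(n_sims, {
  x  <- rnorm(50, mean = true_mean, sd = 10)   # one sample of 50 patients
  ci <- t.test(x)$conf.int                     # its 95% confidence interval
  ci[1] <= true_mean && true_mean <= ci[2]     # does it contain the truth?
})

# Proportion of intervals that contain the true mean
coverage <- mean(covered)
```

The proportion should come out close to 0.95. Any single interval either contains the true mean or does not, which is why the 95% refers to the procedure and not to one computed interval.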

## 5. SporeData-specific

### Data science functions

• sdatools::tableOne
• sdatools::boxPlot
• sdatools::scatterPlot
• sdatools::barPlot
• sdatools::stackedBarPlot
• sdatools::piratePlot
• sdatools::ExplanatoryAnalysis(data, predictors, confounders, outcomes, split_predictors = TRUE, preprocess_missing = FALSE, preprocess_linear_combos = FALSE, preprocess_nzv = FALSE, preprocess_high_correlation = FALSE, labels = NULL)

## SporeData data dictionaries

## References

[1] Stan Development Team. RStan: the R interface to Stan. R package version. 2016;2(1).
[2] Bürkner PC. Advanced Bayesian multilevel modeling with the R package brms. arXiv preprint arXiv:1705.11123. 2017 May 31.
[3] Stevenson M, Nunes T, Heuer C, Marshall J, Sanchez J, Thornton R, Reiczigel J, Robison-Cox J, Sebastiani P, Solymos P, Yoshida K. epiR: Tools for the analysis of epidemiological data. R package version 0.9-62.
[4] Faraone SV. Interpreting estimates of treatment effects: implications for managed care. P & T: A Peer-Reviewed Journal for Formulary Management. 2008;33(12):700-711.
[5] World Health Organization. Metrics: population attributable fraction (PAF).