01.Association08.Missing data - sporedata/researchdesigneR GitHub Wiki

1. Use cases: in which situations should I use this method?

Use this method whenever one or more variables in a dataset contain missing values that could affect the analysis.

2. Input: what kind of data does the method require?

A dataset with missing values for one or more variables.

3. Algorithm: how does the method work?

Model mechanics

Missing data can have multiple underlying mechanisms, such as missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

MAR (missing at random) means that the probability of a value being missing depends on a measured variable (there is an observed reason), such as age: older patients might skip specific questions more frequently [8].
MCAR (missing completely at random) means that the missing values do not depend on observed or unobserved measurements. MCAR essentially never happens in practice, because there is almost always a reason for a missing value: patients are unconscious and cannot answer, or they have lower educational levels and do not know the answer.
MNAR (missing not at random) means that the missing patterns depend on unmeasured variables (the probability of being missing varies for reasons that are unknown to us). Imputation is still possible under MNAR, for example with the miceMNAR package (https://github.com/cran/miceMNAR). A typical MNAR case is when the patient misses an entire appointment [8].
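To make the three mechanisms concrete, the following is a minimal simulation sketch in Python using only the standard library. The cohort, the variable names (`ages`, `scores`), and the missingness probabilities are all hypothetical and chosen only for illustration. It shows how a complete-case mean behaves when values are removed under each mechanism.

```python
import random
import statistics

random.seed(42)

# Hypothetical cohort: age in years and a score that rises with age.
n = 2000
ages = [random.uniform(20, 90) for _ in range(n)]
scores = [0.5 * age + random.gauss(0, 10) for age in ages]


def apply_missingness(values, probabilities):
    """Set each value to None with its own probability of being missing."""
    return [None if random.random() < p else v for v, p in zip(values, probabilities)]


# MCAR: every value has the same 30% chance of being missing.
mcar = apply_missingness(scores, [0.3] * n)

# MAR: missingness depends only on the observed variable age (older -> more missing).
mar = apply_missingness(scores, [0.6 if age > 65 else 0.1 for age in ages])

# MNAR: missingness depends on the score itself, which is unobserved when missing.
cutoff = statistics.median(scores)
mnar = apply_missingness(scores, [0.6 if s > cutoff else 0.1 for s in scores])


def observed_mean(values):
    """Mean of the non-missing entries (a complete-case estimate)."""
    return statistics.mean(v for v in values if v is not None)


true_mean = statistics.mean(scores)
for label, data in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(f"{label}: complete-case mean {observed_mean(data):.1f} vs true mean {true_mean:.1f}")
```

In this sketch the complete-case mean stays close to the true mean only under MCAR. Under MAR the bias could be corrected by a model that uses age, whereas under MNAR the observed data alone do not contain the information needed to correct it.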

The main difference between MAR and MNAR is whether the predictors of missingness are measured (MAR) or not measured (MNAR). MAR then becomes a matter of building predictive models (such as multiple imputation), while MNAR is typically handled with a joint model: two models solved simultaneously (through maximum likelihood), where one models the outcome and the other models the missingness [8].

In general, once missing data imputation is done, data scientists will run models with and without imputed values as a form of sensitivity analysis.

In multiple imputation, there are three main steps: imputation, analysis, and pooling, shown in the figure below. The mice package stores the results of each step in a specific class: mids, mira, and mipo [2].

Main steps used in multiple imputation [2]

The leftmost side of the picture indicates that the analysis starts with an observed, incomplete data set Y_obs. In general, the problem is that we cannot estimate Q (the quantity of scientific interest, for example a regression coefficient) from Y_obs without making unrealistic assumptions about the unobserved data. Multiple imputation is a general framework that creates several imputed versions of the data by replacing the missing values with plausible data values. These plausible values are drawn from a distribution specifically modeled for each missing entry. In mice, this task is done by the function mice(). The figure portrays m = 3 imputed data sets Y^(1), ..., Y^(3). The three imputed sets are identical for the non-missing data entries but differ in the imputed values. The magnitude of these differences reflects our uncertainty about what value to impute. The package has a special class for storing the imputed data: a multiply imputed data set of class mids.

The second step is to estimate Q on each imputed data set, typically by the method we would have used if the data had been complete. This is easy since all data are now complete. The model applied to Y^(1), ..., Y^(m) is generally identical. mice 2.9 contains a function with.mids() that performs this analysis; it supersedes lm.mids() and glm.mids(). The estimates Q̂^(1), ..., Q̂^(m) will differ from each other because their input data differ. It is important to realize that these differences are caused by our uncertainty about what value to impute. In mice, the analysis results are collectively stored as a multiply imputed repeated analysis within an R object of class mira.

The last step is to pool the m estimates Q̂^(1), ..., Q̂^(m) into one estimate Q̄ and estimate its variance. For quantities Q that are approximately normally distributed, we can calculate the mean over Q̂^(1), ..., Q̂^(m) and sum the within- and between-imputation variance according to the method outlined in Rubin (1987) [7]. The function pool() contains methods for pooling quantities by Rubin’s rules. The results of the function are stored as a multiply imputed pooled outcomes object of class mipo.
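The pooling step can be sketched in a few lines of Python. This is an illustrative re-implementation of Rubin's rules for a single scalar quantity, not the code used by mice's pool(). The function name `pool_rubin` and the example numbers are hypothetical.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors by Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and one variance per estimate")
    q_bar = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(variances)       # within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between    # total variance of the pooled estimate
    return q_bar, within, between, total


# Hypothetical results from m = 3 imputed data sets:
# one coefficient per data set and its squared standard error.
estimates = [1.0, 1.2, 1.1]
variances = [0.04, 0.05, 0.06]

q_bar, within, between, total = pool_rubin(estimates, variances)
print(f"pooled estimate = {q_bar:.3f}, standard error = {math.sqrt(total):.3f}")
```

The between-imputation term is what separates multiple imputation from single imputation: it widens the standard error to reflect the uncertainty about the imputed values.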

Describing in words

Describing in images

Describing with code

Breaking down equations
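The pooling step for an approximately normally distributed scalar quantity Q can be written out as follows. The symbols U^(l) (the squared standard error of the estimate from imputed data set l), B (between-imputation variance), and T (total variance) are notation introduced here for illustration; the estimates and their mean follow the notation used in the model mechanics section.

```latex
\begin{align}
\bar{Q} &= \frac{1}{m} \sum_{l=1}^{m} \hat{Q}^{(l)}
  && \text{pooled estimate} \\
\bar{U} &= \frac{1}{m} \sum_{l=1}^{m} U^{(l)}
  && \text{within-imputation variance} \\
B &= \frac{1}{m-1} \sum_{l=1}^{m} \left( \hat{Q}^{(l)} - \bar{Q} \right)^{2}
  && \text{between-imputation variance} \\
T &= \bar{U} + \left( 1 + \frac{1}{m} \right) B
  && \text{total variance}
\end{align}
```

The factor (1 + 1/m) accounts for the fact that the pooled estimate is based on a finite number m of imputations.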

Suggested companion methods

Missing data handling supports every other method whose results might be affected by missing values, and it should be used in those cases.

Learning materials

4. Output: how do I interpret this method's results?

Typical tables and plots and corresponding text description

There are multiple ways of looking at missing patterns when evaluating a dataset:

VIM - this package provides the percentage of missing values across different sections of the dataset [4] and [5].
Margin plot - compares the distribution of a variable between cases where another variable is observed and cases where it is missing; if the data are MCAR, the red and blue box plots should be similar [4] and [5].
X/Y plot - compares the original and imputed data [4] and [5].
Density plot - compares the densities of the imputed and observed values; if the data are MCAR, the red lines and blue lines should follow a similar path [4] and [5].
Strip plot - compares imputed and original values for multiple variables [4] and [5].
visdat - is used to look at missingness across multiple variables within an entire dataset [6].
visdat - is also used to look at the percentage of missing data for different variables using a lollipop plot [6].
Lollipop plot - is used to display missingness over time [6].
rpart - can be used to fit and display a regression tree predicting which variables and values are associated with missingness [6].
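The numbers behind several of the plots listed above are simple summaries: the percentage of missing values per variable and the frequency of each missingness pattern. The following is a minimal sketch in Python using only the standard library. The records and variable names are hypothetical, and the R packages listed above compute and plot these summaries directly.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"age": 71, "education": None, "pain_score": 4},
    {"age": 34, "education": "college", "pain_score": None},
    {"age": 80, "education": None, "pain_score": None},
    {"age": 52, "education": "high school", "pain_score": 2},
]
variables = ["age", "education", "pain_score"]

# Percentage of missing values per variable.
percent_missing = {
    var: 100 * sum(row[var] is None for row in records) / len(records)
    for var in variables
}

# Frequency of each missingness pattern (which variables are missing together).
patterns = Counter(
    tuple(var for var in variables if row[var] is None) for row in records
)

for var, pct in percent_missing.items():
    print(f"{var}: {pct:.0f}% missing")
for pattern, count in patterns.items():
    print(f"missing {pattern or 'nothing'}: {count} record(s)")
```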

Metaphors

Unrecorded data or unanswered questions are considered missing values, and data imputation can replace them with plausible values.
Several techniques can be used to deal with missing data: median imputation, prediction models, and KNN (k-nearest neighbor) imputation.
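Two of the techniques named above, median imputation and KNN imputation, can be sketched as follows in Python using only the standard library. The function names, the example rows, and the choice of k = 2 are hypothetical. The KNN sketch uses unscaled Euclidean distance for brevity, whereas a real analysis would usually standardize the predictors first.

```python
import statistics


def median_impute(values):
    """Replace each missing value (None) with the median of the observed values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]


def knn_impute(rows, target, predictors, k=2):
    """Fill a missing target with the mean target of the k nearest complete rows."""
    donors = [row for row in rows if row[target] is not None]
    completed = []
    for row in rows:
        filled = dict(row)
        if filled[target] is None:
            # Squared Euclidean distance on the predictors (unscaled, for brevity).
            nearest = sorted(
                donors,
                key=lambda d: sum((d[p] - row[p]) ** 2 for p in predictors),
            )[:k]
            filled[target] = statistics.mean(d[target] for d in nearest)
        completed.append(filled)
    return completed


# Hypothetical data: systolic blood pressure (sbp) is missing for the last patient.
rows = [
    {"age": 30, "bmi": 22.0, "sbp": 115.0},
    {"age": 35, "bmi": 24.0, "sbp": 120.0},
    {"age": 60, "bmi": 29.0, "sbp": 140.0},
    {"age": 65, "bmi": 31.0, "sbp": 150.0},
    {"age": 62, "bmi": 30.0, "sbp": None},
]

median_filled = median_impute([row["sbp"] for row in rows])
knn_filled = knn_impute(rows, target="sbp", predictors=["age", "bmi"], k=2)
print("median imputation:", median_filled[-1])
print("KNN imputation:", knn_filled[-1]["sbp"])
```

Median imputation ignores the other variables, while KNN borrows values from the most similar patients. Both are single-imputation methods, so unlike multiple imputation they do not carry the uncertainty about the imputed values into the final standard errors.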

Reporting guidelines

Multiple imputation for missing data [1].

5. SporeData-specific

Templates

Data science functions

Data science packages

mice [2] and [3].

General description

Clinical areas of interest

Variable categories

Linkage to other datasets

Limitations

Related publications

SporeData data dictionaries

References

[1] Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009 Jun 29;338.

[2] Buuren SV, Groothuis-Oudshoorn K. mice: Multivariate imputation by chained equations in R. Journal of statistical software. 2010:1-68.

[3] Buuren SV, Groothuis-Oudshoorn K. mice: Multivariate imputation by chained equations in R. Journal of statistical software. 2010:1-68.

[4] Alice M. [Imputing missing data with R; MICE package](https://datascienceplus.com/imputing-missing-data-with-r-mice-package/).

[5] A Solution to Missing Data: Imputation Using R

[6] Getting Started with naniar

[7] Rubin DB. Multiple imputation after 18+ years. Journal of the American Statistical Association. 1996;91(434):473-489.

[8] Galimard JE, Chevret S, Curis E, Resche-Rigon M. Heckman imputation models for binary or continuous MNAR outcomes and MAR predictors. BMC medical research methodology. 2018 Dec;18(1):1-3.