Data driven selection of an imputation algorithm - kreutz-lab/OmicsData GitHub Wiki

DIMA

Egert J, Brombacher E, Warscheid B, and Kreutz C. DIMA: Data-driven Selection of an Imputation Algorithm. J Proteome Res, 2021. doi: 10.1021/acs.jproteome.1c00119.

Short description

Missing value (MV) imputation is a crucial step in a proteomics analysis pipeline, and downstream analyses are strongly affected by it. To facilitate the choice of a well-performing imputation method, DIMA implements a concept for data-driven recommendation of an imputation algorithm. It combines learning the individual pattern of MVs in a given data set with testing many different imputation strategies, and suggests the best-performing algorithm for the specific input data.

Usage

An OmicsData object is created by

O = OmicsData(file);

where file can be a .txt, .xls, .xlsx, .csv or .mat file, or a numeric input. MaxQuant output tables, for example, can serve as file input.

A pre-processing step may be applied by

O = OmicsFilter(O,nacut,logflag,scaleflag);

Features with a higher proportion of MVs than the threshold $nacut$ are removed. The flags $logflag$ and $scaleflag$ can be logical or string inputs and define whether a log2 or log10 transformation and a median or mean normalization are performed.

DIMA is executed via

O = DIMA(O,[algorithms],[bio]);

Either the default imputation algorithms are used, or the user specifies them. DIMA currently comprises 30 imputation algorithms from 12 R packages. A fast version, which runs only the nine algorithms most frequently recommended across 142 PRIDE data sets, is available by setting algorithms = 'fast', but should be used with caution. The optional third input argument is a flag indicating whether additional biological information, if available in the input data file, should be taken into account. After applying DIMA, the suggested algorithm and the corresponding imputed data are stored in the OmicsData object:

algorithm = get(O,'DIMA');

data = get(O,'data');

Example

OmicsInit
O = OmicsData('proteinGroups.txt');
O = OmicsFilter(O,0.8,'log2');
O = DIMA(O);

Running this example produces the file 'proteinGroups_Imp.txt', which contains most of the information from the original file 'proteinGroups.txt' but with imputed intensities.
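To make the effect of the filtering step in this example more concrete, the following minimal sketch applies the same idea (nacut = 0.8 followed by a log2 transformation) to a small numeric matrix in plain MATLAB. It does not call the OmicsData toolbox and is not the code of OmicsFilter; the toy matrix X, the NaN encoding of MVs, the row-per-feature layout and all variable names are assumptions made for illustration only.

```matlab
% Toy intensity matrix: rows = features (e.g. proteins), columns = samples.
% Assumption: missing values (MVs) are encoded as NaN.
X = [1024 2048  NaN 4096  512;
      NaN  NaN  NaN  NaN  NaN;
      128  NaN  256  NaN  NaN];

nacut = 0.8;                        % maximum tolerated proportion of MVs per feature

mvFraction = mean(isnan(X), 2);     % proportion of MVs in each row
keep = mvFraction <= nacut;         % rows with MORE than nacut MVs are removed
Xf = log2(X(keep, :));              % log2 transformation of the remaining rows

fprintf('Kept %d of %d features\n', sum(keep), numel(keep));
```

In this toy matrix only the second feature, which has no observed value at all, exceeds the threshold and is dropped; the remaining MVs stay in the data and are left for the imputation step.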

Package installation

DIMA runs the imputation algorithms in R via Rcall, so all required R packages have to be installed beforehand:

# Packages installed from CRAN
packages <- c('Rtools','R.matlab','amap','mice','norm','Amelia','Hmisc','imputeLCMD','missForest','softImpute','VIM','rrcovNA','missMDA','mi','DMwR','GMSimpute')
for (pkg in packages){
   install.packages(pkg, dependencies=TRUE, repos='https://cran.rstudio.com/')
}
# Packages installed from Bioconductor
install.packages("BiocManager")
BiocManager::install("pcaMethods")
BiocManager::install("impute")
# The 'imputation' package is installed from source from the CRAN archive
install.packages("https://cran.r-project.org/src/contrib/Archive/imputation/imputation_1.3.tar.gz", repos=NULL, type='source')

Implementation

Images/DIMAcode.png

1. The pattern of missing values in the original data O is learned by logistic regression analysis.
2. A reference data set R with fewer MVs is constructed from the original data O; imputation performance is evaluated on R.
3. To generate patterns of missing data with a distribution similar to that in the original data, the logistic regression model of step 1 is applied to the reference data R. Bernoulli trials are performed to simulate different patterns of MVs.
4. Various imputation algorithms are applied to the reference data R with the simulated patterns of MVs and ranked by their root mean square error. The best-performing algorithm is the one with the lowest mean rank over all pattern simulations.
5. The best-performing imputation algorithm of step 4 is recommended as the imputation algorithm for the original data O, and imputation of O is performed.
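
The five steps above can be illustrated with a small, self-contained sketch in plain MATLAB. It is not the DIMA implementation: the data are simulated, the logistic regression uses a single hypothetical predictor (the observed mean intensity of a feature) fitted by Newton-Raphson iterations, the reference data R is simply taken as all features without MVs, and three trivial stand-in imputers (rowMean, colMean, globalMin) replace the R algorithms that DIMA calls. All variable names are assumptions made for illustration.

```matlab
rng(1);                                    % fixed seed so the toy run is reproducible

% Toy "original" data O (features x samples, log2 scale); NaN marks an MV.
% Low-intensity features are made to go missing more often.
nFeat = 200; nSamp = 6;
mu = 20 + 3*randn(nFeat,1);
O  = repmat(mu,1,nSamp) + 0.5*randn(nFeat,nSamp);
pTrue = 1 ./ (1 + exp(0.8*(mu - 18)));
O(rand(nFeat,nSamp) < repmat(pTrue,1,nSamp)) = NaN;
obs = ~isnan(O);
O   = O(any(obs,2),:);                     % drop features without any observation
obs = ~isnan(O);

% Step 1: logistic regression of missingness on the observed feature mean
% (assumption: one predictor only; fitted by Newton-Raphson iterations).
O0 = O; O0(~obs) = 0;
m  = sum(O0,2) ./ sum(obs,2);              % observed mean intensity per feature
mc = mean(m);                              % centering constant
Z  = [ones(numel(O),1), repmat(m - mc, nSamp, 1)];
y  = double(~obs(:));                      % 1 = missing, 0 = observed
b  = zeros(2,1);
for it = 1:25
    p = 1 ./ (1 + exp(-Z*b));
    w = p .* (1 - p);
    b = b + (Z' * (Z .* [w w])) \ (Z' * (y - p));
end

% Step 2: reference data R (assumption: only features without any MV).
R = O(all(obs,2),:);

% Step 3: apply the fitted model to R to obtain MV probabilities.
pMiss = 1 ./ (1 + exp(-(b(1) + b(2)*(mean(R,2) - mc))));
pMiss = repmat(pMiss, 1, nSamp);

% Stand-in imputers; inputs are the zero-filled data X0 and the MV mask M.
names = {'rowMean','colMean','globalMin'};
est = {@(X0,M) repmat(sum(X0,2)./sum(~M,2), 1, size(M,2)), ...
       @(X0,M) repmat(sum(X0,1)./sum(~M,1), size(M,1), 1), ...
       @(X0,M) repmat(min(X0(~M)), size(M))};

% Step 4: simulate MV patterns by Bernoulli trials, impute, rank by RMSE.
nSim = 20;
rmse = zeros(nSim, numel(est));
for s = 1:nSim
    M = rand(size(R)) < pMiss;             % one Bernoulli trial per entry
    M(all(M,2),1) = false;                 % keep at least one value per feature
    X0 = R; X0(M) = 0;
    for a = 1:numel(est)
        E = est{a}(X0, M);
        rmse(s,a) = sqrt(mean((E(M) - R(M)).^2));
    end
end
[~, order] = sort(rmse, 2);
[~, ranks] = sort(order, 2);               % rank 1 = lowest RMSE in a simulation
meanRank   = mean(ranks, 1);
[~, best]  = min(meanRank);

% Step 5: impute the original data with the recommended algorithm.
MO = isnan(O);
X0 = O; X0(MO) = 0;
E  = est{best}(X0, MO);
Oimp = O; Oimp(MO) = E(MO);
fprintf('Recommended: %s (mean rank %.2f)\n', names{best}, meanRank(best));
```

Ranking within each simulated pattern, rather than averaging the raw RMSE values, keeps a single unusually hard pattern from dominating the recommendation, which matches the mean-rank criterion described in step 4.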