Usage: Model Estimation & Analysis - GarrettJenkinson/informME GitHub Wiki
Command:
informME_run.sh [OPTIONS] MAT_FILES PHENO CHR_NUM
This step consists of two phases. During the first phase, informME learns the parameters of the Ising probability distribution by combining the methylation data matrices provided through the argument MAT_FILES (a comma-separated list) for chromosome number CHR_NUM. By default, the MAT_FILES are expected to be in a subdirectory of INTERDIR named after CHR_NUM. The output generated during this phase is also stored in a subdirectory of INTERDIR named after CHR_NUM. The output file name is the prefix PHENO followed by the suffix '_fit.mat' (e.g. if PHENO is 'normal' and CHR_NUM is 10, the output is stored as INTERDIR/chr10/normal_fit.mat). The file produced contains the following information:
- CpG distances
- CpG densities
- estimated alpha, beta, and gamma parameters of the Ising model
- initial and transition probabilities of the inhomogeneous Markov chain representation of the Ising model
- marginal probabilities at each CpG site
- the log partition function of the estimated Ising model
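The naming convention for the learning-phase output can be sketched as a small shell helper. This is only an illustration under the default directory layout described above: `fit_path` is a hypothetical function (not part of informME), and `/path/to/interdir` is a placeholder for your own INTERDIR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of informME): build the path where the
# learning phase is expected to write its output under the default layout,
# i.e. INTERDIR/chr<CHR_NUM>/<PHENO>_fit.mat.
fit_path() {
  local interdir="$1" pheno="$2" chr_num="$3"
  printf '%s/chr%s/%s_fit.mat\n' "$interdir" "$chr_num" "$pheno"
}

# Example from the text: PHENO is 'normal' and CHR_NUM is 10.
# The INTERDIR value is a placeholder.
fit_path "/path/to/interdir" normal 10
# prints /path/to/interdir/chr10/normal_fit.mat
```

If the modeling directory is changed with `-e|--estdir`, the path would presumably differ from the one built here.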
The second phase of this step analyzes the learned model by computing a number of statistical summaries of the methylation state, including probability distributions of methylation levels, mean methylation levels, and normalized methylation entropies, as well as mean-based and entropy-based classifications. This phase also computes entropic sensitivity indices, methylation sensitivity indices, as well as information-theoretic quantities associated with methylation channels, such as turnover ratios, channel capacities, and relative dissipated energies. The output generated during this phase is stored in the same directory as the output of the first phase, using the same prefix as before. However, the suffix is now '_analysis.mat' (e.g. following the previous example, the output file of this phase is stored as INTERDIR/chr10/normal_analysis.mat). This file contains the following information:
- the locations of the CpG sites within the genomic region
- numbers of CpG sites within the analysis subregions
- which analysis subregions are modeled and which are not
- estimated parameters of Ising model in genomic region
- methylation level probabilities in modeled subregions
- coarse methylation level probabilities
- mean methylation levels
- normalized methylation entropies
- entropic sensitivity indices
- methylation sensitivity indices
- turnover ratios
- channel capacities
- relative dissipated energies
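After a run, one way to confirm that both phases produced output is to look for the two files named above. The following is a minimal sketch that assumes the default directories; `check_outputs` is a hypothetical helper, not an informME command.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of informME): check that the '_fit.mat' and
# '_analysis.mat' files exist and are non-empty for one phenotype and
# chromosome, assuming the default INTERDIR/chr<CHR_NUM>/ layout.
check_outputs() {
  local interdir="$1" pheno="$2" chr_num="$3" missing=0 f
  for f in "${interdir}/chr${chr_num}/${pheno}_fit.mat" \
           "${interdir}/chr${chr_num}/${pheno}_analysis.mat"; do
    if [ -s "$f" ]; then
      echo "found:   $f"
    else
      echo "missing: $f"
      missing=1
    fi
  done
  # Return 0 only if both files were found
  return "$missing"
}
```

For the example in the text, this could be called as `check_outputs "$INTERDIR" normal 10`. A non-zero exit status indicates that at least one of the two files is absent or empty.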
NOTE1: We recommend taking advantage of the array feature available in SGE and SLURM based clusters to submit an individual job for each chromosome.
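As an illustration of the array approach on a SLURM cluster, a job script along the following lines could be used. It is a sketch under several assumptions: the sample names, the phenotype label, the array range 1-22 (human autosomes), and the thread count are placeholders, and any partition, memory, or time directives required by your cluster are left out.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=informME_run
#SBATCH --array=1-22
#SBATCH --cpus-per-task=5

# Placeholder inputs (assumptions, adjust to your data)
MAT_FILES="sample1,sample2"
PHENO="normal"
THREADS=5   # kept equal to --cpus-per-task above

# SLURM sets SLURM_ARRAY_TASK_ID for each array task; it is used here as
# the chromosome number. Fall back to 1 when run outside of SLURM.
CHR_NUM="${SLURM_ARRAY_TASK_ID:-1}"

echo "informME_run.sh -q $THREADS $MAT_FILES $PHENO $CHR_NUM"

# Only execute the command if informME_run.sh is on the PATH
if command -v informME_run.sh >/dev/null 2>&1; then
  informME_run.sh -q "$THREADS" "$MAT_FILES" "$PHENO" "$CHR_NUM"
else
  echo "informME_run.sh not found on PATH; command not executed" >&2
fi
```

Saved under a name of your choice, the script would be submitted once with `sbatch`, and SLURM would then start one task per chromosome in the array range.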
NOTE2: Here is the full help file from informME_run.sh:
Description:
This function learns the parameters of the Ising model and performs methylation analysis.
It estimates the parameters of the Ising probability distribution used to model
methylation within equally sized (in base pairs) non-overlapping regions of the genome.
The input is expected to be in INTERDIR, and the output is also stored in INTERDIR by
default. The output file produced by the learning phase contains the following
information for each genomic region used in model estimation:
o CpG distances
o CpG densities
o estimated alpha, beta, and gamma parameters of the Ising model
o initial and transition probabilities of the inhomogeneous Markov chain representation of
the Ising model
o marginal probabilities at each CpG site
o the log partition function of the estimated Ising model
The output file produced by the analysis phase contains the following information:
o the locations of the CpG sites within the genomic region
o numbers of CpG sites within the analysis subregions
o which analysis subregions are modeled and which are not
o estimated parameters of Ising model in genomic region
o methylation level probabilities in modeled subregions
o coarse methylation level probabilities
o mean methylation levels
o normalized methylation entropies
o entropic sensitivity indices
o methylation sensitivity indices
o turnover ratios
o channel capacities
o relative dissipated energies
Usage:
informME_run.sh [OPTIONS] MAT_FILES PHENO CHR_NUM
Mandatory arguments:
o MAT_FILES: list of methylation matrices to be modeled
o PHENO: prefix of output files (name of phenotype)
o CHR_NUM: chromosome to be processed
Options:
-h|--help help
-r|--refdir directory of reference genome and CpG location files (default: $REFGENEDIR)
-m|--matdir matrices directory (default: $INTERDIR)
-e|--estdir modeling directory (default: $INTERDIR)
-d|--outdir output analysis directory (default: $INTERDIR)
-q|--threads number of threads used (default: 1)
--tmpdir temporary directory (default: $SCRATCHDIR)
--time_limit maximum time (in minutes) allowed for each thread to complete (default: 60)
-l|--MATLICENSE path to MATLAB's license
Example:
* Running informME on chromosome 1 using 5 threads:
informME_run.sh -q 5 sample1 pheno_1 1
* Running informME on chromosome 1 using 5 threads and 3 samples pooled into one model:
informME_run.sh -q 5 sample2,sample3,sample4 pheno_2 1
Output:
MATLAB .mat file
Dependencies:
* MATLAB
* estimation.sh
* mergeEstimation.sh
* singleMethAnalysis.sh
* mergeSingleMethAnalysis.sh
Upstream:
getMatrices.sh
Downstream:
singleMethAnalysisToBed.sh
diffMethAnalysisToBed.sh
Authors:
Garrett Jenkinson
Jordi Abante