Usage: Methylation Data Matrix Generation - GarrettJenkinson/informME GitHub Wiki

Command:

getMatrices.sh [OPTIONS] BAM_FILE CHR_NUM

This step takes the BAM file BAM_FILE as input and generates the methylation data matrix for chromosome number CHR_NUM. By default, the file BAM_FILE and its associated index file (with extension .bai) is expected to be in BAMDIR, and the output file produced by this step is stored in a subdirectory in INTERDIR named after the chromosome number CHR_NUM. The output file preserves the prefix from the file BAM_FILE and the suffix '_matrices.mat' is appended to it (e.g. if BAM_FILE is normal_sample.bam and CHR_NUM is 10, then the output file is saved as INTERDIR/chr10/normal_sample_matrices.mat). The file produced contains the following information for each genomic region, which is subsequently used for model estimation:

  • data matrix with -1,0,1 values for methylation status

  • CpG locations broken down by region

NOTE1: We recommend taking advantage of the array feature available in SGE and SLURM based clusters to submit an individual job for each chromosome.

NOTE2: See reference [1], "Online Methods: Quality control and alignment" for our suggested preprocessing steps when generating a sorted, indexed, deduplicated BAM file to input to informME.

NOTE3: Here is the full help file for getMatrices.sh:

Description:
    This function takes a BAM file as input and generates a methylation data matrix for
    a given chromosome. The BAM file is expected to be in BAMDIR, while the output 
    file is stored in INTERDIR by default. These directories can be modified via 
    optional arguments that can be passed to the function. The output file produced 
    contains the following information for each genomic region which is subsequently 
    used for model estimation:
    o data matrix with -1,0,1 values for methylation status
    o CpG locations broken down by region

Usage:
    getMatrices.sh  [OPTIONS]  BAM_FILE CHR_NUM

Mandatory arguments:
    o BAM_FILE: BAM file for which the methylation data matrix will be generated
    o CHR_NUM: chromosome to be processed

Options:
    -h|--help           help
    -r|--refdir         directory of reference genome and CpG location files (default: $REFGENEDIR)
    -b|--bamdir         directory of BAM file (default: $BAMDIR)
    -d|--outdir         output directory (default: $INTERDIR)
    -q|--threads        number of threads used (default: 1)
    -t|--trim           number or vector of bases to be trimmed (default: 0)
    -c|--c_string       name convention for chromosomes: 0 => 'X'; 1 => 'chrX' (default: 1)
    -p|--paired_ends    Type of reads: 0 => single-ends;1 => paired-ends. Default: 1.
    --tmpdir     	directory of intermediate files (default: $SCRATCHDIR) 
    --time_limit     	maximum time (in minuttes) allowed for each thread to complete (default: 60)
    -l|--MATLICENSE     path to MATLAB's license

Examples:
    * Usage keeping default options (e.g., no trimming or multithreading):
        getMatrices.sh  sample_1.bam 1
    * Trimming 10 bases on each read and using 5 threads:
    	getMatrices.sh -q 5 -t 10  sample_1.bam 1
    * Trimming different bases on each read, using 5 threads, and 'chrX' naming convention:
    	getMatrices.sh -q 5 -t '[15,20]' -c 1  sample_1.bam 1

Output:
    MATLAB .mat file with suffix *_matrices.mat

Dependancies:
    * MATLAB
    * xargs
    * timeout
    * matrixFromBam.sh
    * mergeMatrices.sh

Upstream:
    fastaTCpG.sh

Downstream: 
    informME_run.sh

Authors:
    Garrett Jenkinson <[email protected]>
    Jordi Abante <[email protected]>