Usage: Reference Genome Analysis - GarrettJenkinson/informME GitHub Wiki

Command:

fastaToCpg.sh [OPTIONS] FASTA_FILE

This step analyzes the reference genome FASTA_FILE (in FASTA format) and produces one MATLAB MAT file, CpGlocationChr#.mat, per chromosome (# stands for the chromosome number). Each file is stored by default in REFGENEDIR and contains the following information:

  • location of CpG sites

  • CpG density for each CpG site

  • distance between neighboring CpG sites

  • location of the last CpG site in the chromosome

  • length of chromosome (in base pairs)
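To make the first and third items concrete, here is a minimal shell sketch on a made-up toy sequence. It is not how fastaToCpg.sh is implemented (that work is done in MATLAB); it only illustrates the 1-based convention described in the help file below, where a CpG site is located by the position of its C on the forward strand. The function name `cpg_positions` is hypothetical, and taking the distance as the difference between consecutive C positions is an assumption. CpG density is not computed here.

```shell
# Hypothetical helper: print the 1-based position of the C in every CG
# dinucleotide of a sequence, followed by the distance to the next CpG site
# (NA for the last site). Distance = difference of positions (an assumption).
cpg_positions() {
  printf '%s\n' "$1" | awk '{
    s = toupper($0)
    n = 0
    for (i = 1; i < length(s); i++)
      if (substr(s, i, 2) == "CG") pos[++n] = i
    for (k = 1; k <= n; k++)
      print pos[k], (k < n ? pos[k + 1] - pos[k] : "NA")
  }'
}

# Toy sequence, not taken from any real genome.
cpg_positions "ACGTTCGACCGG"
```

For this toy sequence the CG dinucleotides start at positions 2, 6 and 10, so the sketch prints `2 4`, `6 4` and `10 NA`, one pair per line.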

NOTE1: This step only needs to be completed one time for a given reference genome. Start analyzing samples at step D.2 if you have previously completed step D.1 for your sample's reference genome.
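If you are unsure whether this step has already been run for a reference genome, one quick way to check is to count the CpGlocationChr#.mat files in the output directory. The sketch below assumes the files were written to REFGENEDIR (that is, `-d` was not used to redirect them); the function name `count_cpg_mats` is hypothetical and not part of informME.

```shell
# Hypothetical helper: count files named CpGlocationChr*.mat in a directory.
count_cpg_mats() {
  n=0
  for f in "$1"/CpGlocationChr*.mat; do
    # The glob stays unexpanded when nothing matches, so test for existence.
    [ -e "$f" ] && n=$((n + 1))
  done
  echo "$n"
}

# Report on REFGENEDIR only if the variable is set in this shell.
dir="${REFGENEDIR:-}"
if [ -n "$dir" ]; then
  echo "$(count_cpg_mats "$dir") CpGlocationChr#.mat file(s) found in $dir"
fi
```

A count that matches the number of autosomes in your reference genome suggests the step was completed, but it does not prove that the files came from the same FASTA file.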

NOTE2: At this time, the statistical model of informME is designed to work only with autosomes, so the informME software will not model mitochondrial chromosomes, lambda spike-ins, partial contigs, sex chromosomes, etc. In addition, the reference FASTA file to which the BAM files were aligned is assumed to be sorted so that the autosomes come first and in the usual order: chr1,chr2,...,chrN.
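Before running fastaToCpg.sh, it can help to list the sequence names of the reference in file order and confirm by eye that the autosomes come first. The following is a minimal sketch, assuming standard FASTA headers (lines starting with `>`, with the name ending at the first whitespace). The function name `fasta_names` is hypothetical, and this does not replace examining the log file that the help text below refers to.

```shell
# Hypothetical helper: print sequence names in the order they appear in a
# FASTA file. The name is the header text after ">" up to the first whitespace.
fasta_names() {
  awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Demonstration on a tiny, made-up FASTA file.
tmp_fa="$(mktemp)"
printf '>chr1 toy\nACGT\n>chr2\nCGCG\n>chrX\nTTTT\n>chrM\nAAAA\n' > "$tmp_fa"
fasta_names "$tmp_fa"
rm -f "$tmp_fa"
```

On a real reference you would call `fasta_names /path/to/input.fa` and check that the list begins with chr1,chr2,...,chrN (or 1,2,...,N) before any other entries.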

NOTE3: Here is the full help file for fastaToCpg.sh:

Description:
    This function is used to analyze a reference genome in order to find and store the 
    locations of all CpG sites within each chromosome and compute the CpG densities at 
    each CpG site as well as the distances between neighboring CpG sites. A 1-based 
    coordinate system is used, in which the first base is assigned to position 1 and the 
    location of a CpG site is defined by the position of the C nucleotide on the forward 
    strand of the reference genome. Each MAT file produced will be stored by default in 
    REFGENEDIR, and it will contain the following information:
    o location of CpG sites
    o CpG density for each CpG site
    o distance between neighboring CpG sites
    o location of the last CpG site in the chromosome
    o length of chromosome (in base pairs)

Usage:
    fastaToCpg.sh  [OPTIONS]  FASTA_FILE

Mandatory argument:
    o FASTA_FILE: reference genome in FASTA format. The autosomes should come
                  first and in numeric order 1,2,3,... and they must be named
                  either 1,2,3,... or chr1,chr2,chr3,... Examine the log file
                  of this step to confirm compliance, and note the naming
                  scheme so that it can be specified downstream to
                  getMatrices.sh through its -c argument.

Options:
    -h|--help           help
    -d|--outdir         output directory (default: $REFGENEDIR)
    -l|--MATLICENSE     path to MATLAB's License

Examples:
    * Analyzing FASTA file /path/to/input.fa and storing output in REFGENEDIR: 
    	fastaToCpg.sh  /path/to/input.fa
    * Analyzing FASTA file /path/to/input.fa and storing output in directory /path/to/out: 
    	fastaToCpg.sh -d /path/to/out  /path/to/input.fa

Output:
    MATLAB .mat file for each entry in FASTA file

Dependencies:
    MATLAB

Upstream:
    NA

Downstream:
    getMatrices.sh

Authors:
    Garrett Jenkinson
    Jordi Abante