Usage: Reference Genome Analysis - GarrettJenkinson/informME GitHub Wiki
Command:
fastaToCpg.sh [OPTIONS] FASTA_FILE
This step analyzes the reference genome FASTA_FILE (in FASTA format) and produces a MATLAB MAT file CpGlocationChr#.mat for each chromosome, which is stored by default in REFGENEDIR, and contains the following information:
- location of CpG sites
- CpG density for each CpG site
- distance between neighboring CpG sites
- location of the last CpG site in the chromosome
- length of chromosome (in base pairs)
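To make the notion of a CpG location concrete: the help text further down states that coordinates are 1-based and that a CpG site is located at the position of its C nucleotide on the forward strand. The sketch below applies that convention to a toy FASTA record with `awk`. It is only an illustration and not how informME computes its MAT files; the function name `cpg_positions` and the toy sequence are assumptions made for this example, and the function assumes a file with a single sequence record.

```shell
#!/usr/bin/env bash
# Illustrative sketch only, not part of informME.
set -eu

# Hypothetical helper: print the 1-based position of the C in every CG
# dinucleotide of a FASTA file that holds a single sequence record.
cpg_positions() {
  awk '
    /^>/ { next }                      # skip the header line
    { seq = seq toupper($0) }          # join wrapped sequence lines
    END {
      for (i = 1; i < length(seq); i++)
        if (substr(seq, i, 2) == "CG") print i
    }
  ' "$1"
}

# Toy record (made up for this example): the sequence is ACGTTCGCGA.
toy=$(mktemp)
printf '>chr1\nACGTTCG\nCGA\n' > "$toy"
cpg_positions "$toy"
rm -f "$toy"
```

For the toy sequence ACGTTCGCGA this prints 2, 6 and 8, one per line. Note that the CpG at position 6 spans the line break in the toy file, which is why the sequence lines are joined before scanning.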
NOTE1: This step only needs to be completed one time for a given reference genome. Start analyzing samples at step D.2 if you have previously completed step D.1 for your sample's reference genome.
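One way to decide whether this step can be skipped is to look for existing CpGlocationChr#.mat files in the output directory. The sketch below assumes the default location, the REFGENEDIR environment variable; the helper name `have_cpg_mats` is hypothetical. It only checks that at least one such file exists and does not verify that every chromosome is covered or that the files belong to the right reference genome.

```shell
#!/usr/bin/env bash
# Illustrative sketch only, not part of informME.
set -eu

# Hypothetical helper: succeed if the given directory contains at least one
# file matching CpGlocationChr*.mat.
have_cpg_mats() {
  dir=$1
  for f in "$dir"/CpGlocationChr*.mat; do
    # If nothing matches, the glob stays literal and -e is false.
    if [ -e "$f" ]; then
      return 0
    fi
  done
  return 1
}

refdir=${REFGENEDIR:-}
if [ -n "$refdir" ] && have_cpg_mats "$refdir"; then
  echo "CpG location files found in $refdir; step D.1 may already be done."
else
  echo "No CpG location files found; run fastaToCpg.sh first."
fi
```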
NOTE2: At this time the statistical model of informME has been designed to work only with autosomes, so the informME software will not model mitochondrial chromosomes, lambda spike-ins, partial contigs, sex chromosomes, etc. Also, the reference FASTA file to which the BAM files have been aligned is assumed to be sorted so that the autosomes come first and in the usual order: chr1,chr2,...,chrN.
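Before running the step, the ordering assumption can be checked directly from the FASTA headers. The following is a minimal sketch, assuming header names are the first whitespace-delimited token after ">"; the helper names `fasta_names` and `autosomes_first` and the toy file are hypothetical. It accepts the naming schemes 1,2,3,... and chr1,chr2,chr3,... and does not check that a single scheme is used consistently.

```shell
#!/usr/bin/env bash
# Illustrative sketch only, not part of informME.
set -eu

# Hypothetical helper: print the sequence names of a FASTA file in file order.
fasta_names() {
  awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Hypothetical helper: succeed if the numeric names (with or without a "chr"
# prefix) come first and run 1,2,3,... without gaps, and no numeric name
# appears after a non-numeric one such as chrX or chrM.
autosomes_first() {
  fasta_names "$1" | awk '
    { name = $0; sub(/^chr/, "", name) }
    name ~ /^[0-9]+$/ { if (seen_other || name + 0 != ++n) bad = 1; next }
    { seen_other = 1 }
    END { exit (bad || n == 0) }
  '
}

# Toy FASTA (made up for this example).
toy=$(mktemp)
printf '>chr1\nACGT\n>chr2\nCGCG\n>chrX\nTTCG\n>chrM\nCGAA\n' > "$toy"
fasta_names "$toy"
if autosomes_first "$toy"; then
  echo "autosomes come first and in numeric order"
else
  echo "header order needs attention"
fi
rm -f "$toy"
```

On a real reference, `fasta_names /path/to/input.fa` also shows which naming scheme is in use, which the help text below says should be noted for the -c argument of getMatrices.sh.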
NOTE3: Here is the full help file for fastaToCpg.sh:
Description:
This function is used to analyze a reference genome in order to find and store the
locations of all CpG sites within each chromosome and compute the CpG densities at
each CpG site as well as the distances between neighboring CpG sites. A 1-based
coordinate system is used, in which the first base is assigned to position 1 and the
location of a CpG site is defined by the position of the C nucleotide on the forward
strand of the reference genome. Each MAT file produced will be stored by default in
REFGENEDIR, and it will contain the following information:
o location of CpG sites
o CpG density for each CpG site
o distance between neighboring CpG sites
o location of the last CpG site in the chromosome
o length of chromosome (in base pairs)
Usage:
fastaToCpg.sh [OPTIONS] FASTA_FILE
Mandatory argument:
o FASTA_FILE: reference genome in FASTA format. Should be ordered to have the
autosomes come first and in numeric order 1,2,3,... And they must
have naming scheme 1,2,3,... or chr1,chr2,chr3,... The log file
here should be examined to ensure compliance and the naming schemes
should be noted and specified downstream to getMatrices.sh through
its -c argument.
Options:
-h|--help help
-d|--outdir output directory (default: $REFGENEDIR)
-l|--MATLICENSE path to MATLAB's License
Examples:
* Analyzing FASTA file /path/to/input.fa and storing output in REFGENEDIR:
fastaToCpg.sh /path/to/input.fa
* Analyzing FASTA file /path/to/input.fa and storing output in directory /path/to/out:
fastaToCpg.sh -d /path/to/out /path/to/input.fa
Output:
MATLAB .mat file for each entry in FASTA file
Dependencies:
MATLAB
Upstream:
NA
Downstream:
getMatrices.sh
Authors:
Garrett Jenkinson
Jordi Abante