Calling CNVs - abyzovlab/CNVnator GitHub Wiki

Running CNVnator involves a few steps, outlined below. Chromosome names and lengths are parsed from the SAM/BAM file header. The -genome option overrides this default behavior.

EXTRACTING READ MAPPING FROM BAM/SAM FILES

        ./cnvnator [-genome name] -root out.root [-chrom name1 ...] -tree [file1.bam ...]

out.root -- output ROOT file. See the ROOT package documentation.

name1, name2, ... -- chromosome names

file1.bam, file2.bam, ... -- bam files

Chromosomes can be specified by name alone, e.g., X, or with the chr prefix, e.g., chrX. Multiple chromosomes can be specified, separated by spaces. If no chromosome is specified, read mapping is extracted for all chromosomes in the SAM/BAM file. Note that this requires a machine with a large amount of physical memory (7 GB). Extracting read mapping for subsets of chromosomes is a way around this issue. Note also that the ROOT file is not overwritten, so read mapping from successive runs accumulates in the same file, as the second example below illustrates. To have a correct q0 field for CNV calls (see below), one needs to use the -unique option when extracting read mapping from BAM/SAM files.

Example:

        ./cnvnator -root NA12878.root -chrom chr1 chr2 chr3  -tree NA12878_ali.bam

is equivalent to

        ./cnvnator -root NA12878.root -chrom 1 2 3           -tree NA12878_ali.bam

Example:

        ./cnvnator -root NA12878.root -chrom 4 5 6 -tree NA12878_ali.bam
        ./cnvnator -root NA12878.root -chrom 7 8 9 -tree NA12878_ali.bam

is equivalent to

        ./cnvnator -root NA12878.root -chrom 4 5 6 7 8 9 -tree NA12878_ali.bam
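
The batching shown in the second example can be scripted so that only a few chromosomes are held in memory at a time. The following is a minimal sketch, not part of CNVnator: `extract_in_batches` and the `CNVNATOR` variable are hypothetical names introduced here, and the sketch assumes chromosome names contain no whitespace.

```shell
# Path to the binary; override via the environment if it lives elsewhere.
CNVNATOR=${CNVNATOR:-./cnvnator}

# extract_in_batches ROOT BAM BATCH_SIZE CHROM...   (hypothetical helper)
# Runs the -tree step once per batch of BATCH_SIZE chromosomes,
# adding each batch to the same ROOT file.
extract_in_batches() {
    root=$1 bam=$2 size=$3
    shift 3
    batch="" count=0
    for chrom in "$@"; do
        batch="$batch $chrom"
        count=$((count + 1))
        if [ "$count" -eq "$size" ]; then
            # $batch is unquoted on purpose so it splits into separate names.
            "$CNVNATOR" -root "$root" -chrom $batch -tree "$bam" || return 1
            batch="" count=0
        fi
    done
    # Process the final, possibly shorter, batch.
    if [ "$count" -gt 0 ]; then
        "$CNVNATOR" -root "$root" -chrom $batch -tree "$bam" || return 1
    fi
}
```

For instance, `extract_in_batches NA12878.root NA12878_ali.bam 3 4 5 6 7 8 9` would issue the same two commands as the example above.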

GENERATING HISTOGRAM

        ./cnvnator [-genome name] -root file.root [-chrom name1 ...] -his bin_size [-d dir]

This step is not memory-intensive and can be done for all chromosomes at once, although it can also be done for a subset of chromosomes. Files with chromosome sequences are required and should reside in the running directory or in the directory specified by the -d option. Files should be named chr1.fa, chr2.fa, etc.
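
Since the histogram step needs one sequence file per chromosome, it may help to check for missing files before starting. A minimal sketch, assuming the chrN.fa naming described above; `missing_fasta` is a hypothetical helper, not a CNVnator feature.

```shell
# missing_fasta DIR CHROM...   (hypothetical helper)
# Prints the name of every expected chr<name>.fa file that is absent from DIR.
# Chromosomes may be given with or without the chr prefix.
missing_fasta() {
    dir=$1
    shift
    for chrom in "$@"; do
        name=${chrom#chr}          # strip a leading "chr" if present
        [ -f "$dir/chr$name.fa" ] || echo "chr$name.fa"
    done
}
```

Running `missing_fasta . 1 2 3` before the -his step would list any of chr1.fa, chr2.fa, chr3.fa that are not in the current directory; empty output means all were found.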

CALCULATING STATISTICS

        ./cnvnator -root file.root [-chrom name1 ...] -stat bin_size

This step must be completed before proceeding to partitioning and CNV calling.

RD SIGNAL PARTITIONING

        ./cnvnator -root file.root [-chrom name1 ...] -partition bin_size [-ngc]

The -ngc option specifies that the GC-corrected RD signal should not be used. Partitioning is the most time-consuming step. There are a few ways to enhance performance at this step.

CNV CALLING

        ./cnvnator -root file.root [-chrom name1 ...] -call bin_size [-ngc]

Calls are printed to STDOUT.
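
The steps that follow read mapping extraction can be chained so that each one runs only if the previous one succeeded, with the calls redirected from STDOUT to a file. This is a sketch under stated assumptions: `call_cnvs` and the `CNVNATOR` variable are hypothetical names, the same bin size is passed to every step, and the bin size in the usage line is an arbitrary placeholder rather than a recommendation.

```shell
# Path to the binary; override via the environment if it lives elsewhere.
CNVNATOR=${CNVNATOR:-./cnvnator}

# call_cnvs ROOT BIN_SIZE CALLS_FILE FASTA_DIR   (hypothetical helper)
# Runs histogram, statistics, partitioning, and calling in order,
# stopping at the first step that fails.
call_cnvs() {
    root=$1 bin=$2 calls=$3 fasta_dir=$4
    "$CNVNATOR" -root "$root" -his "$bin" -d "$fasta_dir" &&
    "$CNVNATOR" -root "$root" -stat "$bin" &&
    "$CNVNATOR" -root "$root" -partition "$bin" &&
    # Calls go to STDOUT, so capture them in a file.
    "$CNVNATOR" -root "$root" -call "$bin" > "$calls"
}
```

A possible invocation is `call_cnvs NA12878.root 100 NA12878.calls.txt .`, where `100` and the output file name are illustrative only.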

The output is as follows:

CNV_type coordinates CNV_size normalized_RD e-val1 e-val2 e-val3 e-val4 q0

normalized_RD -- read depth normalized to 1

e-val1 -- e-value calculated using t-test statistics

e-val2 -- e-value from the probability of RD values within the region being in the tails of a Gaussian distribution describing frequencies of RD values in bins

e-val3 -- same as e-val1 but for the middle of CNV

e-val4 -- same as e-val2 but for the middle of CNV

q0 -- fraction of reads mapped with zero mapping quality (q0)

To have correct output in the q0 field, one needs to use the -unique option when extracting read mapping from BAM/SAM files.
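
Since calls are plain text with the columns listed above, they can be filtered with standard tools. A minimal sketch, assuming the nine columns are whitespace-separated and appear in the order shown; `filter_calls` is a hypothetical helper and the thresholds in the usage line are placeholders, not recommended values.

```shell
# filter_calls MAX_EVAL1 MAX_Q0   (hypothetical helper)
# Reads calls on stdin and prints those whose e-val1 (column 5) and
# q0 (column 9) are both at or below the given thresholds.
filter_calls() {
    awk -v max_e="$1" -v max_q0="$2" '
        # Adding 0 forces numeric comparison, including values such as 1e-10.
        NF >= 9 && ($5 + 0) <= (max_e + 0) && ($9 + 0) <= (max_q0 + 0)
    '
}
```

For example, `filter_calls 0.05 0.5 < calls.txt` would keep only rows with e-val1 of at most 0.05 and a q0 fraction of at most 0.5, where `calls.txt` stands for the saved output of the -call step.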