Reconstructing contigs - CGRL-QB3-UCBerkeley/seqCapture GitHub Wiki

Introduction:

This module uses novoalign to align each individual's cleaned reads (produced by cleanpe) to that individual's in-target reference (generated by intarget). The idea is to retain only reads that map uniquely to the reference. GATK (McKenna et al. 2010) is then used to perform realignment around indels. Finally, the module uses SAMtools/BCFtools (Li et al. 2009) to generate individual consensus sequences by calling genotypes and incorporating ambiguous sites into the individual-specific assemblies. Certain filters are applied during this process; see below for details.
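
To make "incorporating ambiguous sites" concrete: a common convention is to write a heterozygous genotype call into the consensus as an IUPAC ambiguity code. The sketch below assumes that convention and diploid A/C/G/T calls only; the function names are hypothetical and this is an illustration, not the module's actual code.

```python
# IUPAC ambiguity codes for the six possible heterozygous base pairs.
IUPAC_HET = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def consensus_base(allele1: str, allele2: str) -> str:
    """Return the consensus character for one diploid genotype call.

    Homozygous calls return the base itself; heterozygous calls return
    the matching IUPAC ambiguity code. Only A/C/G/T alleles are handled.
    """
    a, b = allele1.upper(), allele2.upper()
    if a == b:
        return a
    return IUPAC_HET[frozenset((a, b))]


def consensus_sequence(genotypes) -> str:
    """Join per-site genotype calls (pairs of alleles) into one sequence."""
    return "".join(consensus_base(a, b) for a, b in genotypes)


# Four sites: hom A, het A/G, het C/T, hom G
print(consensus_sequence([("A", "A"), ("A", "G"), ("C", "T"), ("G", "G")]))
# prints: ARYG
```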

Command and options:

(seqCapture) $ seqCapture buildcontigs
Usage: seqCap buildcontigs [options]

Basic Options:
-t     INT     Target sequences could be one of the following: 
               1=individual exons; 
               2=cDNA (no UTR); 
               3=transcripts (including UTR); 
               4=random (such as UCEs, no need for exon identification) 
               [3]  
-a     DIR     Path to a folder with all intarget assemblies
               (AAA_targetedRegionAndFlanking.fasta);
-f     DIR     Path to a folder with all bed files
-b     DIR     A folder with all cleaned reads (AAA_1_final.fq, 
               AAA_2_final.fq, AAA_u_final.fq...)
-i     INT     Avg. Insert size [200];
-m     INT     memory limit (in MB) for the program, default 800;
               0 for unlimited [0]
-n     INT     number of threads [10]     
-d     INT     Minimum depth to keep a site, otherwise masked as an "N" [5]
-D     INT     Maximum depth to keep a site, otherwise masked as an "N" [100000]
-N     INT     INDEL filtering window [5]
-M     FLOAT   Discard a locus if M percent bases are Ns [0.8]
-c     INT     only keep concordant mapping for PE reads?
               1 = yes
               0 = no [1]
-r     INT     read length (bp) of original raw reads [100]
-s     INT     repeat Masking?
               1 = yes
               0 = no [0]

RepeatMasking Options: (use -T or -R)

-R     CHAR    Species used for repeatmasking. Some examples are: human, mouse, rattus,
               "ciona savignyi", arabidopsis, mammal, carnivore, rodentia, rat, cow, pig,
               cat, dog, chicken, fugu, danio, "ciona intestinalis", drosophila,
               anopheles, elegans, diatoaea, artiodactyl, rice, wheat, maize,
               "vertebrata metazoa"
-T     CHAR    Use a custom-built repetitive library (full path) for repeat masking;
               in this case do not use -R
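
The depth and missing-data filters (-d, -D and -M) can be pictured with a small sketch. It assumes the depth bounds are inclusive and that a locus is discarded once its fraction of Ns reaches the -M value; the page does not state the exact boundary behaviour, so both are assumptions, and the function names are hypothetical.

```python
def mask_by_depth(seq, depths, min_depth=5, max_depth=100000):
    """Replace every site whose depth is outside [min_depth, max_depth] with 'N'.

    Defaults mirror the documented -d [5] and -D [100000] values.
    Treating both bounds as inclusive is an assumption.
    """
    if len(seq) != len(depths):
        raise ValueError("sequence and depth list must have the same length")
    return "".join(
        base if min_depth <= depth <= max_depth else "N"
        for base, depth in zip(seq, depths)
    )


def keep_locus(seq, max_n_fraction=0.8):
    """Return True if the locus survives the -M filter (default 0.8).

    Assumption: a locus is discarded when its fraction of Ns is
    greater than or equal to max_n_fraction.
    """
    if not seq:
        return False
    n_fraction = seq.upper().count("N") / len(seq)
    return n_fraction < max_n_fraction


masked = mask_by_depth("ACGTACGTAC", [1, 2, 5, 6, 7, 8, 200000, 9, 3, 10])
print(masked)              # prints: NNGTACNTNC
print(keep_locus(masked))  # prints: True (4 of 10 bases are N)
```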

Prepare input for the run:

After cleanpe finishes, the cleaned reads of each sample are stored in the directory "cleaned_reads_dir", which is the input for this step.

(seqCapture) $ ls cleaned_reads_dir/
Sample1_1_final.fq Sample1_2_final.fq Sample1_u_final.fq Sample1.contam.out
Sample2_1_final.fq Sample2_2_final.fq Sample2_u_final.fq Sample2.contam.out
......
SampleN_1_final.fq SampleN_2_final.fq SampleN_u_final.fq SampleN.contam.out

Intarget assemblies generated by the intarget step:

(seqCapture) $ ls fasta/
Sample1_targetedRegionAndFlanking.fasta
Sample2_targetedRegionAndFlanking.fasta
......
SampleN_targetedRegionAndFlanking.fasta
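
Before starting a run it can help to confirm that every sample has both its cleaned reads and its in-target assembly. The following is a minimal sketch based only on the file-naming pattern shown in the listings above; the function name is hypothetical, and requiring all three read files (_1, _2 and _u) for every sample is an assumption.

```python
from pathlib import Path

# File-name suffixes taken from the directory listings on this page.
READ_SUFFIXES = ("_1_final.fq", "_2_final.fq", "_u_final.fq")
FASTA_SUFFIX = "_targetedRegionAndFlanking.fasta"


def missing_inputs(reads_dir, fasta_dir):
    """Map each sample name to the expected input files that are absent.

    Sample names are collected from both folders, so a sample that has
    reads but no assembly (or the reverse) is reported.
    """
    reads_dir, fasta_dir = Path(reads_dir), Path(fasta_dir)
    samples = set()
    for path in reads_dir.glob("*_final.fq"):
        for suffix in READ_SUFFIXES:
            if path.name.endswith(suffix):
                samples.add(path.name[: -len(suffix)])
    for path in fasta_dir.glob("*" + FASTA_SUFFIX):
        samples.add(path.name[: -len(FASTA_SUFFIX)])

    problems = {}
    for sample in sorted(samples):
        expected = [reads_dir / (sample + suffix) for suffix in READ_SUFFIXES]
        expected.append(fasta_dir / (sample + FASTA_SUFFIX))
        absent = [p.name for p in expected if not p.is_file()]
        if absent:
            problems[sample] = absent
    return problems
```

Calling missing_inputs("cleaned_reads_dir", "fasta") returns an empty dictionary when every sample is complete; otherwise each key is a sample name and each value lists the files that were not found.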

Usage examples:

Assemble every sample in "cleaned_reads_dir" and store the raw assemblies for each sample in "raw_assemblies_dir", using kmer lengths of 21, 33, 55, 77, 99, and 127 (-kmer 21,33,55,77,99,127) and allocating 10 CPUs (-np 10) for the SPAdes assemblies.

(seqCapture) $ seqCapture assemble -reads /path/to/cleaned_reads_dir/ -kmer 21,33,55,77,99,127 -out raw_assemblies_dir -np 10

Output:

In "raw_assemblies_dir" individual assemblies are stored:

(seqCapture) $ ls raw_assemblies_dir/
Sample1.fasta
Sample2.fasta
......
SampleN.fasta