De novo assembly phylogenetics - CGRL-QB3-UCBerkeley/seqCapture GitHub Wiki

Introduction:

This module takes the cleaned reads generated by cleanpe as input and assembles them with SPAdes. SPAdes builds a single final assembly by combining several kmer lengths (default: 21, 33, and 55). For phylogenetic datasets, each individual is assembled separately.

Command and options:


(seqCapture) $ seqCapture assemble

Usage: seqCapture assemble  [options]

Options: 

-reads    DIR             Directory with all sequence reads
-kmer     INT,INT,INT...  Kmer lengths chosen for SPAde assemblies
                          [21,33,55] (no space)
-lib      CHAR ...        Particular libraries to process? 
                          (e.g. AAA BBB CCC). If -lib is not 
                          used then process all libraries in
                          The folder (-reads)  
-out      CHAR            Directory where results will go
-np       INT             number of processors used for assembly

Prepare input for the run:

After cleanpe finishes, the cleaned reads for each sample are stored in the directory "cleaned_reads_dir", which is the input for this step.

(seqCapture) $ ls cleaned_reads_dir/
Sample1_1_final.fq Sample1_2_final.fq Sample1_u_final.fq Sample1.contam.out Sample2_1_final.fq  Sample2_2_final.fq Sample2_u_final.fq Sample2.contam.out ......  SampleN_1_final.fq  SampleN_2_final.fq SampleN_u_final.fq SampleN.contam.out
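Before starting a long assembly run, it can help to confirm that every sample has its full set of read files. The following is a minimal sketch, not part of seqCapture: the function name list_samples is hypothetical, and it assumes the <sample>_1_final.fq / <sample>_2_final.fq / <sample>_u_final.fq naming shown in the listing above.

```shell
# Hypothetical helper (not shipped with seqCapture).
# Prints one line per sample: <sample name> <TAB> ok|incomplete.
# Assumption: every sample is expected to have _1, _2 and _u read files,
# named as in the cleaned_reads_dir listing.
list_samples() {
    dir=$1
    for r1 in "$dir"/*_1_final.fq; do
        [ -e "$r1" ] || continue            # glob matched nothing
        name=$(basename "$r1" _1_final.fq)  # strip the suffix to get the sample name
        status=ok
        for suffix in _2_final.fq _u_final.fq; do
            [ -e "$dir/$name$suffix" ] || status=incomplete
        done
        printf '%s\t%s\n' "$name" "$status"
    done
}
```

Calling list_samples cleaned_reads_dir prints each sample name with a status, so samples marked "incomplete" can be checked by hand before assembly.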

Usage examples:

Assemble every sample in "cleaned_reads_dir" and store the raw assembly for each sample in "raw_assemblies_dir", using kmer lengths of 21, 33, 55, 77, 99, and 127 (-kmer 21,33,55,77,99,127) and allocating 10 CPUs (-np 10) to the SPAdes assemblies.

(seqCapture) $ seqCapture assemble -reads /path/to/cleaned_reads_dir/ -kmer 21,33,55,77,99,127 -out raw_assemblies_dir -np 10
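The -lib option listed in the help text restricts the run to selected libraries. The sketch below builds such a command and prints it instead of running it by default. It rests on assumptions: the wrapper name assemble_subset and the DRY_RUN variable are hypothetical, and the library names are assumed to be the sample prefixes of the read files (the help text only gives "AAA BBB CCC" as an example). Because -kmer is omitted, the default kmer lengths from the help text would apply.

```shell
# Hypothetical wrapper (not shipped with seqCapture).
# Usage: assemble_subset READS_DIR OUT_DIR LIB [LIB ...]
# Assumption: -lib accepts space-separated library names that match the
# sample prefixes of the read files.
assemble_subset() {
    reads_dir=$1
    out_dir=$2
    shift 2                                  # remaining arguments are library names
    set -- seqCapture assemble -reads "$reads_dir" -lib "$@" -out "$out_dir" -np 10
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"                            # dry run: only show the command
    else
        "$@"                                 # DRY_RUN=0: actually run it
    fi
}
```

For example, assemble_subset /path/to/cleaned_reads_dir/ raw_assemblies_dir Sample1 Sample2 prints the command for review; setting DRY_RUN=0 runs it.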

Output:

In "raw_assemblies_dir" individual assemblies are stored:

(seqCapture) $ ls raw_assemblies_dir/
Sample1.fasta Sample2.fasta ...... SampleN.fasta
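A quick sanity check on the results is to count the contigs in each assembly. This is a minimal sketch, assuming each contig in a SampleN.fasta file starts with a standard FASTA ">" header line; the function name count_contigs is hypothetical.

```shell
# Hypothetical helper (not shipped with seqCapture).
# Prints one line per assembly: <sample name> <TAB> <number of contigs>.
# Assumption: one '>' header line per contig, as in standard FASTA.
count_contigs() {
    dir=$1
    for fa in "$dir"/*.fasta; do
        [ -e "$fa" ] || continue             # glob matched nothing
        # grep -c exits non-zero when the count is 0, so guard with "|| true"
        n=$(grep -c '^>' "$fa" || true)
        printf '%s\t%s\n' "$(basename "$fa" .fasta)" "$n"
    done
}
```

Running count_contigs raw_assemblies_dir lists each sample with its contig count; a count of 0 points to an empty assembly that likely needs a second look.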