De novo assembly population genetics - CGRL-QB3-UCBerkeley/seqCapture GitHub Wiki

Introduction:

This module takes the cleaned reads generated by cleanpe as input and assembles them with SPAdes. SPAdes builds a single final assembly by combining the results from multiple kmer lengths (default: 21, 33 and 55). For population genetic projects, we recommend selecting several (6-8) high-quality samples (e.g. no degradation, long insert sizes, lots of data) that best represent the genetic polymorphism of all samples (for instance, one sample from each major population). Outgroup samples, if sequenced, should NOT be included in the assembly process. After each of these representative samples is assembled, the next module, intarget, compares each individual assembly against the targeted loci on which the probe design is based, merges the contigs derived from the targeted loci, and produces the best representation of each in-target assembly in the final pseudo-reference.

Regarding how many representative samples to choose for assembly: including too many samples in this step and merging their in-target assemblies afterwards considerably increases computation time and, potentially, assembly errors. Including too few samples, however, may not capture enough of the genomic complexity of the captures. There is a fine balance between the two. We recommend 6-8 samples for population genetic projects. If these samples are truly representative of the major populations and of high quality, the final merged assembly is usually of decent quality.

Command and options:

Usage: seqCapture assemble [options]

Options: 

-reads    DIR             Directory with all sequence reads
-kmer     INT,INT,INT...  Kmer lengths chosen for SPAdes assemblies
                          [21,33,55] (no space)
-lib      CHAR ...        Particular libraries to process? 
                          (e.g. AAA BBB CCC). If -lib is not 
                          used then process all libraries in
                          the folder (-reads)  
-out      CHAR            Directory where results will go
-np       INT             Number of processors used for assembly

Prepare input for the run:

After cleanpe has finished, the cleaned reads of each sample are stored in the directory "cleaned_reads_dir". Select representative samples for the project and copy their cleaned reads into a separate directory, named here "rep_samples_cleaned_reads_dir" (just an example; you can give the directory any name you like). The reads in this directory are the input for this step. In the example below, we choose Sample1, Sample3 and Sample15 as representatives of all samples and will build a reference from these three samples using intarget.

(seqCapture) $ ls rep_samples_cleaned_reads_dir/
Sample1_1_final.fq Sample1_2_final.fq Sample1_u_final.fq Sample1.contam.out Sample3_1_final.fq Sample3_2_final.fq
Sample3_u_final.fq Sample3.contam.out Sample15_1_final.fq Sample15_2_final.fq Sample15_u_final.fq Sample15.contam.out
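
The copy step described above can be scripted. The following is a minimal sketch, not part of seqCapture: `copy_rep_samples` is a hypothetical helper, and it assumes the cleanpe output files follow the naming seen in the listing (`<sample>_1_final.fq`, `<sample>_2_final.fq`, `<sample>_u_final.fq`, `<sample>.contam.out`). Whether the assemble step actually needs the `.contam.out` files is not stated here; they are copied only to mirror the listing.

```shell
# Hypothetical helper (not part of seqCapture): copy the cleaned reads of the
# chosen representative samples into their own directory.
# Usage: copy_rep_samples SRC_DIR DEST_DIR SAMPLE [SAMPLE ...]
copy_rep_samples() {
    src_dir=$1
    dest_dir=$2
    shift 2
    mkdir -p "$dest_dir"
    for sample in "$@"; do
        # Spell out each file name instead of globbing on "${sample}*",
        # so that "Sample1" does not also pick up files of "Sample15".
        for name in "${sample}_1_final.fq" "${sample}_2_final.fq" \
                    "${sample}_u_final.fq" "${sample}.contam.out"; do
            if [ -f "$src_dir/$name" ]; then
                cp "$src_dir/$name" "$dest_dir/"
            else
                echo "warning: $src_dir/$name not found, skipped" >&2
            fi
        done
    done
}

# Example call, using the directory and sample names from this page:
# copy_rep_samples cleaned_reads_dir rep_samples_cleaned_reads_dir Sample1 Sample3 Sample15
```

Listing the file names explicitly is a deliberate choice: a pattern such as `Sample1*` would silently copy Sample15 (and any other sample whose name starts with "Sample1") as well.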

Usage examples:

Assemble each sample in "rep_samples_cleaned_reads_dir" and store the raw assembly for each sample in "raw_assemblies_dir", using kmer lengths of 21, 33, 55, 77, 99, and 127 (-kmer 21,33,55,77,99,127) and allocating 10 CPUs (-np 10) for the SPAdes assemblies.

(seqCapture) $ seqCapture assemble -reads /path/to/rep_samples_cleaned_reads_dir/ -out raw_assemblies_dir -kmer 21,33,55,77,99,127 -np 10
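
Before launching a long assembly it can be worth checking the kmer list. SPAdes requires its k-mer sizes to be odd, smaller than 128, and listed in ascending order. The sketch below assumes that the value given to -kmer is handed to SPAdes unchanged, which this page does not state explicitly; `check_kmers` is a hypothetical helper and not part of seqCapture.

```shell
# Hypothetical helper (not part of seqCapture): validate a comma-separated
# kmer list against the SPAdes rules (odd, below 128, ascending).
# Returns 0 if the list is acceptable, 1 otherwise.
check_kmers() {
    old_ifs=$IFS
    IFS=,
    set -- $1            # split the argument on commas only
    IFS=$old_ifs
    [ "$#" -gt 0 ] || { echo "empty kmer list" >&2; return 1; }
    prev=0
    for k in "$@"; do
        # Reject anything that is not a plain integer; this also catches
        # spaces after a comma, matching the "(no space)" note in the options.
        case "$k" in
            ''|*[!0-9]*) echo "not a plain integer: '$k'" >&2; return 1 ;;
        esac
        [ $((k % 2)) -eq 1 ] || { echo "kmer $k is not odd" >&2; return 1; }
        [ "$k" -lt 128 ] || { echo "kmer $k is not below 128" >&2; return 1; }
        [ "$k" -gt "$prev" ] || { echo "kmer $k is not in ascending order" >&2; return 1; }
        prev=$k
    done
}

# Example: only start the assembly if the list passes the check.
# check_kmers 21,33,55,77,99,127 && \
#     seqCapture assemble -reads /path/to/rep_samples_cleaned_reads_dir/ \
#         -out raw_assemblies_dir -kmer 21,33,55,77,99,127 -np 10
```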

Output:

The individual assemblies are stored in "raw_assemblies_dir":

(seqCapture) $ ls raw_assemblies_dir/
Sample1.fasta Sample3.fasta Sample15.fasta
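
A quick sanity check on the output is to count the contigs in each assembly. This is a minimal sketch that assumes the files are ordinary FASTA with one `>` header line per contig; `count_contigs` is a hypothetical helper and not part of seqCapture.

```shell
# Hypothetical helper (not part of seqCapture): print "<sample><TAB><contigs>"
# for every .fasta file in the given directory.
count_contigs() {
    for fasta in "$1"/*.fasta; do
        [ -f "$fasta" ] || continue          # skip if the glob matched nothing
        # grep -c exits non-zero when the count is 0, hence the "|| true"
        n=$(grep -c '^>' "$fasta" || true)
        printf '%s\t%s\n' "$(basename "$fasta" .fasta)" "$n"
    done
}

# Example call, using the output directory from this page:
# count_contigs raw_assemblies_dir
```

A sample that reports 0 contigs, or far fewer than the others, is worth inspecting before moving on to intarget.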