Building intarget contigs phylogenetic dataset - CGRL-QB3-UCBerkeley/seqCapture GitHub Wiki

Introduction:

This step takes the individual assemblies produced by assemble, compares each raw assembly against the targeted loci, finds the contigs that best match those loci, and builds one intarget assembly for each sample.

Regarding targeted loci:

One important assumption made by this module is that each targeted locus is unique. That is to say, if you blast all these loci against themselves, each locus should match only itself. The module still works if this assumption is violated. You can either mask such regions using an option in this module, or deal with them later during the raw variant filtering stage. By default, this module produces a list of contigs and sites belonging to non-unique regions. When conducting variant filtering, you can choose to eliminate the sites present in this list. These regions act like paralogues, and alignments against them are problematic in both phylogenetic and population genetic analyses.

Again: making a non-redundant list of target loci will make your downstream processing of sequence capture data A LOT easier!
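
The uniqueness check described above can be sketched in a few lines of Python. This is a minimal illustration and not part of seqCapture: it assumes you have already blasted the target file against itself and saved the result in BLAST tabular format (-outfmt 6, where the first three columns are query ID, subject ID and percent identity). The function name non_unique_loci, the min_identity threshold and the example locus names are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def non_unique_loci(blast_tabular, min_identity=90.0):
    """Map each locus to the set of OTHER loci it hits in a self-BLAST.

    blast_tabular: text in BLAST tabular format (-outfmt 6).
    min_identity:  illustrative cutoff, not a seqCapture default.
    """
    hits = defaultdict(set)
    for row in csv.reader(io.StringIO(blast_tabular), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, identity = row[0], row[1], float(row[2])
        # A locus matching itself is expected; any other hit breaks uniqueness.
        if query != subject and identity >= min_identity:
            hits[query].add(subject)
    return dict(hits)


# Made-up self-BLAST output: locusA and locusB share a similar region.
example = "\n".join([
    "locusA\tlocusA\t100.00\t500\t0\t0\t1\t500\t1\t500\t0.0\t924",
    "locusA\tlocusB\t95.20\t250\t12\t0\t1\t250\t40\t289\t1e-110\t390",
    "locusB\tlocusB\t100.00\t600\t0\t0\t1\t600\t1\t600\t0.0\t1109",
    "locusB\tlocusA\t95.20\t250\t12\t0\t40\t289\t1\t250\t1e-110\t390",
    "locusC\tlocusC\t100.00\t450\t0\t0\t1\t450\t1\t450\t0.0\t832",
])

print(non_unique_loci(example))
```

An empty result means every locus matched only itself. Any locus that does appear is a candidate for removal or merging before you run this module.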

Command and options:

Usage: seqCapture intarget [options]

Basic options:

-t  FILE     Target sequence file in fasta format (.fasta or .fa)
-a  DIR      A folder with all final assemblies generated 
             by seqCapture assemble 
-m  FLOAT    How much should you cluster the targets and 
             assemblies at the get go [0.98]
-d  FLOAT    How much overlap is allowed between adjoining 
             assembled contigs mapping to the same target [0.3]
-p  INT      How similar does the assembled contig have to
             be to the target (note this is out of 100) [90]
-M  INT      Memory (in Mb) needed for cdhit [4096]
-T  INT      Number of threads used in blast and cdhit [10]
-E  FLOAT    Used in the initial BLAST step [1e-10]
-b  INT      Merging individual assemblies?
             1 = yes (for population genetic datasets)
             0 = no  (for phylogenetic datasets) [0]
-c  INT      Are the targeted loci from a mt genome (or most of it)?
             1 = yes  
             0 = no [0]    
-g  INT      For nuclear genes, retain flanking sequences or not
             1 = yes  
             0 = no [1]  
-L  INT      Min length cutoff in initial cdhit to keep 
             an assembled contig [200] 
-e  INT      Target sequences could be one of the following: 
             1=individual exons (one exon per gene); 
             2=cds (no UTR); 
             3=transcripts (including UTR); 
             4=random (such as UCEs, no need for exon identification) 
             [3]
-f  INT      +/- flanking bp you would like to add [100]
-s  FILE     Annotated transcripts generated by annotation 
             [required when -e is 1, 2, or 3]
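
As a rough illustration of how these options fit together, the sketch below assembles an intarget command line in Python, without running it. It uses only flags listed above and enforces the rule that -s is required when -e is 1, 2, or 3. The helper name intarget_command and the file and folder names (targets.fasta, assemblies/) are hypothetical placeholders, not names used by seqCapture.

```python
import shlex


def intarget_command(targets, assemblies, target_type=3, annotation=None,
                     merge=0, flank_bp=100, threads=10):
    """Build an argument list for `seqCapture intarget` (sketch only).

    Defaults mirror the values shown in the option list above.
    """
    if target_type not in (1, 2, 3, 4):
        raise ValueError("-e must be 1, 2, 3, or 4")
    # Exons, cds and transcripts need the annotated transcripts file.
    if target_type in (1, 2, 3) and annotation is None:
        raise ValueError("-s is required when -e is 1, 2, or 3")
    cmd = ["seqCapture", "intarget",
           "-t", targets, "-a", assemblies,
           "-e", str(target_type), "-b", str(merge),
           "-f", str(flank_bp), "-T", str(threads)]
    if annotation is not None:
        cmd += ["-s", annotation]
    return cmd


# Phylogenetic dataset (-b 0) with random targets such as UCEs (-e 4),
# so no annotation file is needed.
cmd = intarget_command("targets.fasta", "assemblies/", target_type=4)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once seqCapture is installed and on your PATH. Building the command as a list rather than a single string avoids quoting problems with paths that contain spaces.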