UCE phylogenetic projects - CGRL-QB3-UCBerkeley/seqCapture GitHub Wiki

Introduction:

A UCE targeted-loci file usually contains short probes. Some loci are represented by a single probe and others by several; in some custom designs, multiple tiled probes per locus are quite common. In that case, the probes from the same locus must be merged first (using a light assembler such as CAP3). If probes from the same locus do not overlap and therefore cannot be merged, join them with Ns (a long run of Ns is suggested). The end product is a file in which each locus is represented by exactly one sequence. In addition, when you run a self-BLAST of these UCE markers, each locus should match only itself. If any probe matches another locus with appreciable similarity (e.g. over 90%), both loci should be removed from the targeted loci file.
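The N-joining step for non-overlapping probes can be sketched in Python. This is only an illustration under stated assumptions: the header convention `<locus>_p<number>`, the helper names `read_fasta` and `join_probes_by_locus`, and the spacer length are all hypothetical and not part of seqCapture. Probes are joined in the order they appear in the file, so they are assumed to be listed in their order along the locus. Overlapping probes should still be merged with an assembler such as CAP3.

```python
from collections import OrderedDict


def read_fasta(lines):
    """Parse FASTA lines into an ordered {header: sequence} mapping."""
    records = OrderedDict()
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the record name.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def join_probes_by_locus(records, spacer_len=100, sep="_p"):
    """Join all probes of one locus into a single sequence separated by Ns.

    ASSUMPTION: probe headers look like '<locus><sep><number>', e.g. 'uce-12_p2'.
    Headers without the separator are treated as single-probe loci.
    """
    loci = OrderedDict()
    for name, seq in records.items():
        locus = name.rsplit(sep, 1)[0]
        loci.setdefault(locus, []).append(seq)
    spacer = "N" * spacer_len
    return OrderedDict((locus, spacer.join(seqs)) for locus, seqs in loci.items())


# Hypothetical toy input: two probes for uce-1, one probe for uce-2.
example = """>uce-1_p1
ACGT
>uce-1_p2
GGCC
>uce-2_p1
TTAA
""".splitlines()

merged = join_probes_by_locus(read_fasta(example), spacer_len=5)
for locus, seq in merged.items():
    print(">" + locus)
    print(seq)
```

In practice a much longer spacer than the 5 Ns used in this toy example would be chosen, in line with the suggestion above.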

Prepare input for the run:

A folder with raw assemblies for each sample.

For example, in this case all raw assemblies are in "raw_assemblies_dir":

(seqCapture) $ ls raw_assemblies_dir/
 Sample1.fasta Sample2.fasta ....... SampleN.fasta

Targeted loci in multi-fasta format. See the Introduction for more details.
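The self-BLAST check described in the Introduction can be scripted before the run. The following is a minimal sketch, assuming the self-BLAST was written in tabular format (BLAST `-outfmt 6`, whose first three columns are query ID, subject ID and percent identity). The function name `loci_to_drop`, the 90% default and the sample hits are illustrative assumptions, and alignment length is deliberately not considered here.

```python
def loci_to_drop(blast_lines, min_identity=90.0):
    """Return the set of loci that hit a *different* locus at >= min_identity.

    ASSUMPTION: input lines are BLAST tabular output (-outfmt 6), i.e.
    tab-separated with qseqid, sseqid, pident as the first three columns.
    """
    drop = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, pident = fields[0], fields[1], float(fields[2])
        # Self hits are expected; only cross-locus hits are a problem.
        if query != subject and pident >= min_identity:
            drop.update((query, subject))
    return drop


# Hypothetical self-BLAST hits for illustration only.
hits = [
    "uce-1\tuce-1\t100.00\t120\t0\t0\t1\t120\t1\t120\t1e-60\t222",
    "uce-1\tuce-7\t93.50\t110\t7\t0\t5\t114\t3\t112\t1e-40\t160",
    "uce-2\tuce-2\t100.00\t120\t0\t0\t1\t120\t1\t120\t1e-60\t222",
    "uce-3\tuce-2\t85.00\t60\t9\t0\t1\t60\t10\t69\t1e-10\t60",
]

to_drop = loci_to_drop(hits)
print(sorted(to_drop))
```

Both members of a matching pair are reported, since the Introduction recommends eliminating both loci from the targeted loci file.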

Usage examples:

This example treats the targets as random markers, the category that UCEs fall into (-e 4). In each individual raw assembly, contigs shorter than 150 bp are eliminated (-L 150); a match is only considered if a contig and a targeted locus share at least 80% sequence similarity (-p 80); up to 20% overlap is allowed between adjoining assembled contigs that map to the same target, and if the overlap is larger, one of them (usually the shorter one) is trimmed off (-d 20); 500 bp of flanking sequence on each side of each targeted locus is retained in the resulting bed file (-f 500); and 20 threads are used for the blast search and cd-hit-est clustering (-T 20).

(seqCapture) $ seqCapture intarget -t UCE_probes.fa -a /path/to/raw_assemblies_dir -L 150 -p 80 -d 20 -T 20 -e 4 -f 500

Output:

A directory named "In_target" is created in "raw_assemblies_dir". Under "In_target", two subdirectories are created: "bed/" and "fasta/".

In directory "fasta/", in-target assemblies in fasta format are available for each sample:

(seqCapture) $ ls fasta/
Sample1_targetedRegionAndFlanking.fasta
Sample2_targetedRegionAndFlanking.fasta
......
SampleN_targetedRegionAndFlanking.fasta

The contigs in XXX_targetedRegionAndFlanking.fasta are simply named "Contig1", "Contig2"... "ContigN". To link these names back to those in the original target file (UCE_probes.fa), see the file "UCE_probes.fa_rename_compared.txt", which is located in the same directory as "UCE_probes.fa".

In directory "bed/", several bed files are provided for each sample:

(seqCapture) $ ls bed/
Sample1_allcontig.bed
Sample1_flanking_ONLY.bed
Sample1_sites_to_remove.txt
Sample1_targeted_region_and_flanking.bed
Sample1_targeted_region.bed
Sample1_targetedRegionforExonCapEval.bed
[...files for other samples...]

"Sample1_targeted_region_and_flanking.bed" defines targeted and flanking regions (+-500bp around each target in the above usage example) in each assembled in-target contig;
"Sample1_flanking_ONLY.bed" defines only the flanking regions (+-500bp around each target in the above usage example) in each assembled in-target contig;
"Sample1_targeted_region.bed" defines only the targeted regions (without flanking sequence) in each assembled in-target contig;
"Sample1_targetedRegionforExonCapEval.bed" also defines targeted and flanking regions, as "Sample1_targeted_region_and_flanking.bed" does, but it is not filtered. This file should only be used for evaluating enrichment efficiency (later, in the evaluation step).
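How these region types relate to one another can be illustrated with a small Python sketch. It is not seqCapture's implementation: the function name `target_and_flank_intervals` is hypothetical, coordinates are assumed to be 0-based and half-open as in the BED convention, "targeted_region" is assumed to hold the target interval itself, and clipping the flanks at the contig ends is an assumption made for this example.

```python
def target_and_flank_intervals(contig_len, target_start, target_end, flank=500):
    """Return BED-style (start, end) intervals for one target on one contig.

    ASSUMPTIONS (not taken from seqCapture): 0-based half-open coordinates,
    and flanks that would run past a contig end are clipped to the contig.
    """
    if not 0 <= target_start < target_end <= contig_len:
        raise ValueError("target must lie within the contig")
    left = (max(0, target_start - flank), target_start)
    right = (target_end, min(contig_len, target_end + flank))
    # Drop empty flanks (target touching a contig end).
    flanks = [iv for iv in (left, right) if iv[0] < iv[1]]
    return {
        "targeted_region": (target_start, target_end),
        "flanking_only": flanks,
        "targeted_region_and_flanking": (left[0], right[1]),
        "allcontig": (0, contig_len),
    }


# A 2000 bp contig with a target at 800-1000 and -f 500.
full = target_and_flank_intervals(2000, 800, 1000, flank=500)
print(full)

# A shorter contig where both flanks are clipped by the contig ends.
clipped = target_and_flank_intervals(1200, 300, 700, flank=500)
print(clipped)
```

In the first case the combined region spans 300 to 1500; in the second the flanks are shortened because the contig ends within 500 bp of the target on both sides.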

The following two files can be ignored by phylogenetic projects:

"Sample1_sites_to_remove.txt" contains regions that are not unique within each in-target assembly.
"Sample1_allcontig.bed" defines start (basically 0) and end (basically the total length) of each contig.