HyRAD - CGRL-QB3-UCBerkeley/seqCapture GitHub Wiki

Introduction:

For projects lacking a reference genome, the original RAD libraries used for in-house probe synthesis should be sequenced, and the data from these libraries need to be analyzed with a RADseq analysis pipeline. Most importantly, the clustered markers need to be rigorously filtered for redundancy (e.g. using self-blast). Only unique loci should be included in the targeted loci fasta file. For this technique, do not worry about filtering too much; worry about not filtering enough.
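The redundancy filter described above can be sketched as follows. This is a minimal illustration, not part of seqCapture: it assumes the self-blast was run with BLAST tabular output (`-outfmt 6`, whose first four columns are query ID, subject ID, percent identity and alignment length), and the function names and the `min_identity` / `min_length` thresholds are hypothetical values chosen for the example.

```python
def redundant_loci(blast_lines, min_identity=80.0, min_length=50):
    """Return IDs of loci that hit a *different* locus in a self-blast.

    blast_lines: iterable of BLAST tabular (-outfmt 6) lines.
    The thresholds are illustrative assumptions, not seqCapture defaults.
    """
    redundant = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        identity, aln_length = float(fields[2]), int(fields[3])
        if query_id == subject_id:
            continue  # every locus matches itself; ignore self-hits
        if identity >= min_identity and aln_length >= min_length:
            # Strict policy: flag both members of a redundant pair.
            redundant.update((query_id, subject_id))
    return redundant


def unique_loci(all_ids, blast_lines, **thresholds):
    """Keep only loci with no qualifying hit to another locus."""
    bad = redundant_loci(blast_lines, **thresholds)
    return [locus for locus in all_ids if locus not in bad]


if __name__ == "__main__":
    hits = [
        "locus1\tlocus1\t100.0\t200\t0\t0\t1\t200\t1\t200\t1e-100\t370",
        "locus2\tlocus3\t95.0\t120\t6\t0\t1\t120\t5\t124\t1e-50\t190",
        "locus3\tlocus2\t95.0\t120\t6\t0\t5\t124\t1\t120\t1e-50\t190",
        "locus4\tlocus1\t85.0\t30\t4\t0\t1\t30\t10\t39\t1e-05\t40",
    ]
    print(unique_loci(["locus1", "locus2", "locus3", "locus4"], hits))
```

Dropping both members of a redundant pair is deliberately aggressive, in line with the advice that over-filtering is the safer mistake here.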

Prepare input for the run:

A folder with the raw assemblies for each sample. In this example, all raw assemblies are in "raw_assemblies_dir":

(seqCapture) $ ls raw_assemblies_dir/
Sample1.fasta
Sample2.fasta
.......
SampleN.fasta

Targeted loci in multi-fasta format (for example "HyRAD_markers.fa"). See Introduction for more details.

Usage examples:

Assuming random markers (UCEs fall in this category; -e 4); in each individual raw assembly, eliminating any contigs shorter than 150bp (-L 150); only considering a match if a contig and a targeted locus share at least 80% sequence similarity (-p 80); allowing up to 20% overlap between adjoining assembled contigs that map to the same target, and trimming off one of them (usually the shorter one) if the overlap is larger (-d 20); retaining +/-500bp flanking regions around each targeted locus in the resulting bed file (-f 500); using 20 threads in the blast search and cd-hit-est clustering (-T 20).

(seqCapture) $ seqCapture intarget -t HyRAD_markers.fa -t /path/to/raw_assemblies_dir -L 150 -p 80 -d 20 -T 20 -e 4 -f 500
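To make the effect of the length cutoff (-L 150) concrete, here is a small stand-alone sketch of that kind of filter. It is only an illustration of the idea and not seqCapture's own code; `read_fasta` and `drop_short_contigs` are hypothetical helper names.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the name.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def drop_short_contigs(records, min_len=150):
    """Keep contigs whose length is at least min_len (mirrors the idea of -L 150)."""
    return [(name, seq) for name, seq in records if len(seq) >= min_len]


if __name__ == "__main__":
    fasta = [
        ">contigA some description",
        "A" * 100,
        "C" * 60,   # contigA is 160bp across two lines
        ">contigB",
        "G" * 149,  # one base too short
    ]
    kept = drop_short_contigs(read_fasta(fasta))
    print([(name, len(seq)) for name, seq in kept])  # [('contigA', 160)]
```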

Output:

A directory named "In_target" is created in "raw_assemblies_dir". Under "In_target", two subdirectories are created: "bed/" and "fasta/".

In directory "fasta/", in-target assemblies in fasta format are available for each sample:

(seqCapture) $ ls fasta/
Sample1_targetedRegionAndFlanking.fasta
Sample2_targetedRegionAndFlanking.fasta
......
SampleN_targetedRegionAndFlanking.fasta
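
As a quick sanity check on this output, the number of in-target contigs per sample can be tallied. The sketch below relies only on the file-name suffix shown in the listing above; the function name `count_in_target_contigs` is a hypothetical helper, not part of seqCapture.

```python
from pathlib import Path

SUFFIX = "_targetedRegionAndFlanking.fasta"


def count_in_target_contigs(fasta_dir):
    """Map each sample name to the number of FASTA records in its in-target file."""
    counts = {}
    for path in sorted(Path(fasta_dir).glob("*" + SUFFIX)):
        sample = path.name[: -len(SUFFIX)]
        with path.open() as handle:
            # Each FASTA record starts with a '>' header line.
            counts[sample] = sum(1 for line in handle if line.startswith(">"))
    return counts


if __name__ == "__main__":
    # "fasta" is assumed to be the output subdirectory described above.
    for sample, n in count_in_target_contigs("fasta").items():
        print(f"{sample}\t{n}")
```

A sample with far fewer contigs than the others may deserve a closer look before downstream analysis.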

The contigs in XXX_targetedRegionAndFlanking.fasta are simply named "Contig1", "Contig2" ... "ContigN". To link these names back to those in the original target file ("HyRAD_markers.fa"), consult "HyRAD_markers.fa_rename_compared.txt", which is found in the same directory as "HyRAD_markers.fa". That said, how the original targeted loci are named does not seem to be very important.

In directory "bed/", several bed files are provided for each sample:

(seqCapture) $ ls bed/
Sample1_allcontig.bed
Sample1_flanking_ONLY.bed
Sample1_sites_to_remove.txt
Sample1_targeted_region_and_flanking.bed
Sample1_targeted_region.bed
Sample1_targetedRegionforExonCapEval.bed
[...files for other samples...]

"Sample1_targeted_region_and_flanking.bed" defines targeted and flanking regions (+-500bp around each target in the above usage example) in each assembled in-target contig;
"Sample1_flanking_ONLY.bed" defines flanking regions (+-500bp around each target in the above usage example) in each assembled in-target contig;
"Sample1_targeted_region.bed" defines the targeted regions only (without flanking regions) in each assembled in-target contig;
"Sample1_targetedRegionforExonCapEval.bed" also defines targeted and flanking regions like "Sample1_targeted_region_and_flanking.bed" does. However, it is not filtered. This file should only be used for evaluating enrichment efficiency (later in evaluation step);

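The relationship between the targeted-region, flanking-only, and targeted-plus-flanking intervals can be sketched as below. This is an assumption about how such intervals are typically derived, not a description of seqCapture's internals: it uses standard BED coordinates (0-based start, exclusive end), clips the flanks at the contig ends, and `flank_intervals` is a hypothetical helper name.

```python
def flank_intervals(contig, contig_len, start, end, flank=500):
    """Derive BED-style intervals around one target on one contig.

    start/end: 0-based, half-open target interval (BED convention).
    Returns (target_and_flanking, flanking_only), where flanking_only
    is a list holding the upstream and/or downstream flank.
    """
    left = max(0, start - flank)            # clip at contig start
    right = min(contig_len, end + flank)    # clip at contig end
    target_and_flanking = (contig, left, right)
    flanking_only = []
    if left < start:
        flanking_only.append((contig, left, start))
    if end < right:
        flanking_only.append((contig, end, right))
    return target_and_flanking, flanking_only


if __name__ == "__main__":
    # A 200bp target in the middle of a 2000bp contig, with -f 500.
    both, flanks = flank_intervals("Contig2", 2000, 800, 1000, flank=500)
    print(both)    # ('Contig2', 300, 1500)
    print(flanks)  # [('Contig2', 300, 800), ('Contig2', 1000, 1500)]
```

When a contig is shorter than the target plus 1000bp, the flanks are simply truncated at the contig ends, so the flanking regions in a real bed file can be shorter than the requested 500bp.
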
The following two files can be ignored by phylogenetic projects:

"Sample1_sites_to_remove.txt" contains regions that are not unique within each in-target assembly;
"Sample1_allcontig.bed" defines start (basically 0) and end (basically the total length) of each contig. Not very useful for downstream analysis.