Exon capture designed based on de novo assembled transcriptomes, popgen project - CGRL-QB3-UCBerkeley/seqCapture GitHub Wiki
Introduction:
This step takes representative assemblies produced by assemble
, compare each raw assembly against targeted loci, find contigs that best match the targeted loci and build one in-target assembly for each sample, and finally combine these in-target assemblies to choose the best representatives as the final (pseudo)reference genome for alignment.
Regarding targeted loci:
One important assumption by this module is that each targeted locus is unique. That is to say, if you blast all these loci against themselves, each locus should only match itself. If this assumption is violated, this module still works. You can either choose to masking such regions using an option in this module, or deal with them later during raw variant filtering stage. This module by default will produce a list of contigs and sites belonging to non-unique regions. When conducting variant filtering, you can choose to eliminate sites present in this list. These regions act like paralogues, and alignment against such regions are problematic in both phylogenetic and population genetic analyses.
Again: making a non-redundant list of target loci will make your downstream processing of sequence capture data A LOT easier!
Command and options:
Usage: seqCapture intarget [options]
Basic options:
-t FILE Target sequence file in fasta format (.fasta or .fa)
-a DIR A folder with all final assemlies generated
by seqCapture assemble
-m FLOAT How much should you cluster the targets and
assemblies at the get go [0.98]
-d FLOAT How much overlap is allowed between adjoining
assembled contigs mapping to the same target [0.3]
-p INT How similar does the assembled contig have to
be to the target (note this is out of 100) [90]
-M INT Memory (in Mb) needed for cdhit [4096]
-T INT Number of threads used in blast and cdhit [10]
-E FLOAT Used in the initial BLAST step [1e-10]
-b INT Merging individual assemblies?
1 = yes (for population genetic datasets)
0 = no (for phylogenetic datasets) [0]
-c INT Is the targeted loci from a mt genome (or most of it?)
1 = yes
0 = no [0]
-g INT For nuclear genes, retain flanking sequences or not
1 = yes
0 = no [1]
-L INT Min length cutoff in initial cdhit to keep
a assembled contig [200]
-e INT Target sequences could be one of the following:
1=individual exons (one exon per gene);
2=cds (no UTR);
3=transcripts (including UTR);
4=random (such as UCEs, no need for exon identification)
[3]
-f INT +/- flanking bp you would like to add [100]
-s FILE Annotated transcripts generated by annotation [required by e=1, 2, and 3]
Probe design:
Without any reference resources, exon captures can be designed using one annotated transcriptome assembly (Bi et al. 2012). The sequenced transcriptomes should be firstly assembled and annotated before used for probe design. Note that for population genetic projects, obtaining high quality cDNA libraries, assemblies and annotations are important especially for addressing questions related to detecting signals of natural selection. Typically I would recommend to target as many annotated genes as possible.
You may use individual exons or just use part of the transcript as markers for probe design. For most species, a closely related reference is lacking. In that case, identification of exon-exon boundaries within each transcript can be bioinformatically challenging. For simplicity, I suggest to use transcripts directly as markers in which exon-exon boundaries remain unidentified. Our pipelines will sort exon-exon boundaries out at intarget
step to identify targeted exonic regions and their flanking sequences, respectively.
In general the format of the targeted loci fasta file should be looking like the example below:
(seqCapture) $ less sp1_ref.fa
\>sp1_geneID1_gene1
ATGCATGC......
\>sp1_geneID2_gene2
GCATTCGC......
............
Note that the header of each marker should follow the naming scheme specified above as "CHAR1_CHAR2_CHAR3", where CHAR1 can be species name or contig name, CHAR2 must be a gene ID such as Ensembl Gene ID (e.g. ENSG00000010404), and CHAR3 can be gene name or others.
Prepare input for the run:
-
Targeted loci in multi-fasta format:
(seqCapture) $ less sp1_ref.fa
>sp1_geneID1_gene1 ATGCATGC......
>sp1_geneID2_gene2 GCATTCGC...... ............ ......
**Note: there must be only one copy of each exon/transcript in this reference. **
-
[OPTIONAL] If you are willing to define coding regions from each assembled contig, an annotated cds or a cdna or mrna fasta sequence file is needed (below using annotated cDNA file as an example): Annotated cDNA file contains one transcript per gene. However some modifications must be made before using it as input. First, you should only pick one longest transcript per gene. Second the header of each sequence needs to be modified as "geneID1 gsStartPos_geEndPos". "geneID" must match one of those in the targeted loci file (e.g. "sp1_geneID1_gene1"). "gsStartPos" means the gene begin at position StartPos. "geEndPos" means this cds ends by position at EndPos. For example:
(seqCapture) $ less sp1_cdna.fa >geneID5 gs51_ge1235 ATGGTGTGACCAGTCAGTAGC
This cdna has a gene name of geneID5 (so it can match "sp1_geneID5_gene1" in targeted loci file) and the coding sequence begins at position 51 and ends by position 1235 (this example cds has 1235-51+1 nt). Note: Because cDNA may contain 5' and 3' UTRs, therefore 1-51 could belong to 5' UTR and regions after 1235nt could belong to 3' UTR. But this is not crucial. It all depends on the quality of the assemblies and the annotation.
- A folder ("raw_assemblies_dir") with raw assemblies for selected samples to make a pseudo-reference (note I suggest to select 5-8 samples from a population. The chosen samples should have plenty of high quality data and must be modern specimens):
These raw assemblies are create by previous step assemble
:
(seqCapture) $ ls raw_assemblies_dir/
Sample1.fasta Sample2.fasta ....... SampleN.fasta
OPTION 1: we treat these markers as "random" markers and identification of coding and non-coding regions and sequences are not necessary:
Usage examples:
In this case these markers can just be treated as any random genomic markers. In the below example: Using sp1_ref.fa as a targeted loci reference file (-t sp1_ref.fa); assuming random markers (-e 4); In each individual raw assembly, eliminating any contigs shorter then 150bp (-L 150); only considering a match if a contig and a targeted locus shared at 75% sequencing similarity (-p 85); retaining 500bp+/- flanking regions around each targeted locus (-f 500); using 20 threads in blast search and cd-hit-est clustering (-T 20); assuming population genetic project: contigs derived from targets from each individual will be identified first and then combined to make a final reference for all samples (-b 1)
(seqCapture) $ seqCapture intarget -t sp1_ref.fa -t /path/to/raw_assemblies_dir -L 150 -p 85 -T 20 -e 4 -f 500 -b 1
Output:
A diretory named "In_target" is created in "raw_assemblies_dir". And under "In_target", another diretory "Final" is created: (seqCapture) $ ls Final/ combined_allcontig.bed combined_flanking_ONLY.bed combined_targeted_region_and_flanking.bed combined_targeted_region.bed combined_targetedRegionforExonCapEval.bed combined_targetedRegionAndFlanking.fasta
The contigs in combined_targetedRegionAndFlanking.fasta are simply named as "Contig1", "Contig2" ... "ContigN". If you want to link these names back to those in the original target ("sp1_ref.fa"), you can refer to this file "sp1_ref.fa_rename_compared.txt" which can be found in the same diretory where "sp1_ref.fa" is located. But, how the original targeted loci are named is not important.
"combined_targeted_region_and_flanking.bed" defines targeted and flanking regions (+-500bp around each target in the above usage example) in each assembled in-target contig;
"combined_flanking_ONLY.bed" defines flanking regions only (+-500bp around each target in the above usage example) in each assembled in-target contig;
"combined_targeted_region.bed" defines targeted regions only (+-500bp around each target in the above usage example) in each assembled in-target contig;
"combined_targetedRegionforExonCapEval.bed" also defines targeted and flanking regions like "combined_targeted_region_and_flanking.bed" does. However, it is not filtered. This file should only be used for evaluating enrichment efficiency (later in evaluation step);
"combined_sites_to_remove.txt" contains regions that are not unique in the intarget assemblies.
"combined_allcontig.bed" defines start (basically 0) and end (basically the total length) of each contig.
OPTION 2: we treat these markers as coding and identification of coding and non-coding is desired:
Usage examples:
In this case we assume that these markers are derived from genes and identification of exon/intron(flanking) is desired. In the below example: Using sp1_ref.fa as a targeted loci reference file (-t sp1_ref.fa); assuming cdna markers (-e 3); using cdna file (-s sp1_cdna.fa) to define cds. Explanation for other options see OPTION 1.
(seqCapture) $ seqCapture intarget -t sp1_ref.fa -t /path/to/raw_assemblies_dir -e 3 -s sp1_cdna.fa ... [other options] ...
Output:
A diretory named "In_target" is created in "raw_assemblies_dir". And under "In_target", another diretory "Final" is created: (seqCapture) $ ls Final/ combined_allcontig.bed combined_flanking_ONLY.bed combined_coding_and_flanking.bed combined_coding.bed combined_targetedRegionforExonCapEval.bed combined_targetedRegionAndFlanking.fasta
The contigs in combined_targetedRegionAndFlanking.fasta are simply named as "Contig1", "Contig2"... "ContigN". If you want to link these names back to those in the original target ("sp1_ref.fa"), you can go to this file "sp1_ref.fa_rename_compared.txt" which can be found in the same diretory where "sp1_ref.fa" is located. How the original targeted loci are named does not seem to be very important.
"combined_coding_and_flanking.bed" defines coding and flanking regions (+/-500bp around each target in the above usage example) in each assembled in-target contig;
"combined_flanking_ONLY.bed" defines flanking regions (+/-500bp around each target in the above usage example) in each assembled in-target contig;
"combined_coding.bed" defines coding regions (+-500bp around each target in the above usage example) in each assembled in-target contig;
"combined_targetedRegionforExonCapEval.bed" also defines coding and flanking regions like "combined_coding_and_flanking.bed" does. However, it is not filtered. This file should only be used for evaluating enrichment efficiency (later in evaluation step);
"combined_sites_to_remove.txt" contains regions that are not unique in the intarget assemblies.
"combined_allcontig.bed" defines start (basically 0) and end (basically the total length) of each contig.