Exon capture based on de novo transcriptome, no ref - CGRL-QB3-UCBerkeley/seqCapture GitHub Wiki

Introduction:

Probe design:

Without any reference resources, exon captures can be designed using annotated transcriptome assemblies (Bi et al. 2012). We suggest to use transcriptomes derived from species that are representative to the phylogenetics tree of interest (e.g. representative species from major clades of the tree) for probe design. For projects aiming at solving shallow phylogenetic breath (e.g. <2% in sequence divergence among major species), fewer or even one species may be sufficient to serve as a reference for probe design.

The sequenced transcriptomes should be firstly assembled and annotated. Note that for phylogenetic projects, obtaining the best quality cDNA libraries, assemblies and annotations are certainly desired, but are not very crucial. You probably don't need 10K independent markers for any projects. A few thousands to start may be sufficient for most phylogenetic questions. You also don't have to pursue full length transcripts and some fragmentation of the RNA/cDNA is not a problem. For the final annotated genes, a few hundred bp to 1kb per gene is enough for probe design.

You may use individual exons or just use part of the transcript as markers for probe design. For most species, a closely related reference is lacking. In that case, identification of exon-exon boundaries within each transcript can be bioinformatically challenging. For simplicity, I suggest to use transcripts directly as markers in which exon-exon boundaries remain unidentified. Our pipelines will sort exon-exon boundaries out at intarget step to identify targeted exonic regions and their flanking sequences, respectively.

For example if 3 species (sp1, sp2 and sp3) are used for probe design, then in combined targeted loci fasta file the sequences should be looking like the example below:

(seqCapture) $ less combined_ref.fa

\>sp1_geneID1_gene1
ATGCATGC......

\>sp2_geneID1_gene1
ATGCATCC......

\>sp3_geneID1_gene1
ATGCATTC......

\>sp1_geneID2_gene2
GCANNNGC......

\>sp2_geneID2_gene2
GCATAGGC......

\>sp3_geneID2_gene2
GCATTCGC......

............

Note that the header of each marker must follow the naming scheme specified above as "CHAR1_CHAR2_CHAR3", where CHAR1 can be species name or contig name, CHAR2 must be a gene ID such as Ensembl Gene ID (e.g. ENSG00000010404), and CHAR3 can be gene name or others. In the above example, sp1_geneID1_gene1, sp2_geneID1_gene1 and sp3_geneID1_gene1 are orthologous to each other; sp1_geneID2_gene2, sp2_geneID2_gene2 and sp3_geneID2_gene2 are orthologous to each other, and so forth. If markers are orthologous to each other, the middle gene ID (CHAR2) must be the same. CHAR1 and CHAR3 can be any labels. Furthermore, orthologous markers from different species (e.g. sp1_geneID1_gene1, sp2_geneID1_gene1 and sp3_geneID1_gene1) should have the exactly same length. If, their length differs by codon indels then Ns should be filled in.

From these combined targeted loci, a species-specific targeted loci fasta file can be created:

(seqCapture) $ less sp1_ref.fa

\>sp1_geneID1_gene1
ATGCATGC......

\>sp1_geneID2_gene2
GCATTTGC......

......

(seqCapture) $ less sp2_ref.fa

\>sp2_geneID1_gene1
ATGCATCC......

\>sp2_geneID2_gene2
GCATAGGC......

......

Prepare input for the run:

Targeted loci in multi-fasta format:

There are various ways of making a targeted loci reference file. Overall, I suggest use one of the species-specific targeted loci refernece (sp1_ref.fa, sp2_ref.fa OR sp3_ref.fa) for all samples to simplify this process. Alternatively you can combine markers (e.g. pick the orthologous copy of each target containing least amount of missing data) from these species-specific reference to make a mixed reference like below:

(seqCapture) $ less final_ref.fa

\>sp1_geneID1_gene1
ATGCATGC......

\>sp2_geneID2_gene2
GCATAGGC......

......

When phylogenetic distance across clades/groups/samples is very deep and the references sp1, sp2 and sp3 representing three distinctly distant clades, then you way wish to use different reference for different groups of samples. For example, sp1 is derived from clade1 and can be used as a refernece for the rest samples in clade1. You can run this step for each groups respectively and then combine output files afterwards.

**Note: no matter which reference to use, there must be only one copy of each exon in this reference. **

In the below demo I only show examples of using one of the species-specific reference, and in most cases, it is sufficient for searching targets from assembled contigs.

[OPTIONAL] If you are willing to define coding regions from each assembled contig, an annotated cds or a cdna or mrna fasta sequence file is needed (below using annotated cDNA file as an example):

Annotated cDNA file contains one transcript per gene. However some modifications must be made before using it as input. First, you should only pick one longest transcript per gene. Second the header of each sequence needs to be modified as "geneID1 gsStartPos_geEndPos". "geneID" must match one of those in the targeted loci file (e.g. "sp1_geneID1_gene1"). "gsStartPos" means the gene begin at position StartPos. "geEndPos" means this cds ends by position at EndPos. For example:

(seqCapture) $ less sp1_cdna.fa

\>geneID5 gs51_ge1235
ATGGTGTGACCAGTCAGTAGC

This cdna has a gene name of geneID5 (so it can match "sp1_geneID5_gene1" in targeted loci file) and the coding sequence begins at position 51 and ends by position 1235 (this example cds has 1235-51+1 nt). Note: Because cDNA may contain 5' and 3' UTRs, therefore 1-51 could belong to 5' UTR and regions after 1235bp could belong to 3' UTR. But this is not crucial. It all depends on the quality of the assemblies and the annotation.

A folder ("raw_assemblies_dir") with raw assemblies for each sample:

These raw assemblies are create by previous step assemble.

(seqCapture) $ ls raw_assemblies_dir/
Sample1.fasta Sample2.fasta ....... SampleN.fasta

OPTION 1: we treat these markers as "random" markers and identification of coding and non-coding regions and sequences are not necessary:

Usage examples:

In this case these markers can just be treated as any random genomic markers. In the below example: Using sp1_ref.fa as a targeted loci reference file (-t sp1_ref.fa); assuming random markers (-e 4); In each individual raw assembly, eliminating any contigs shorter then 150bp (-L 150); only considering a match if a contig and a targeted locus shared at 75% sequencing similarity (-p 75); retaining 500bp+/- flanking regions around each targeted locus (-f 500); using 20 threads in blast search and cd-hit-est clustering (-T 20).

(seqCapture) $ seqCapture intarget -t sp1_ref.fa -t /path/to/raw_assemblies_dir -L 150 -p 75 -T 20 -e 4 -f 500

Output:

A diretory named "In_target" is created in "raw_assemblies_dir". And under "In_target" two subdirectories are created: "bed/" and "fasta/"

In diretory "fasta/", in-target assemblies in fasta format are available for each sample:

(seqCapture) $ ls fasta/
Sample1_targetedRegionAndFlanking.fasta
Sample2_targetedRegionAndFlanking.fasta
......
SampleN_targetedRegionAndFlanking.fasta

The contigs in XXX_targetedRegionAndFlanking.fasta are simply named as "Contig1", "Contig2" ... "ContigN". If you want to link these names back to those in the original target ("sp1_ref.fa"), you can refer to this file "sp1_ref.fa_rename_compared.txt" which can be found in the same diretory where "sp1_ref.fa" is located. But, how the original targeted loci are named is not important.

In diretory "bed/", a few different bed files are provided for each sample

(seqCapture) $ ls bed/
Sample1_allcontig.bed
Sample1_flanking_ONLY.bed
Sample1_sites_to_remove.txt
Sample1_targeted_region_and_flanking.bed
Sample1_targeted_region.bed
Sample1_targetedRegionforExonCapEval.bed
[...files for other samples...]

"Sample1_targeted_region_and_flanking.bed" defines targeted and flanking regions (+-500bp around each target in the above usage example) in each assembled in-target contig;
"Sample1_flanking_ONLY.bed" defines flanking regions only (+-500bp around each target in the above usage example) in each assembled in-target contig;
"Sample1_targeted_region.bed" defines targeted regions only (+-500bp around each target in the above usage example) in each assembled in-target contig;
"Sample1_targetedRegionforExonCapEval.bed" also defines targeted and flanking regions like "Sample1_targeted_region_and_flanking.bed" does. However, it is not filtered. This file should only be used for evaluating enrichment efficiency (later in evaluation step);

The two files below can be ignored by phylogenetic projects:

"Sample1_sites_to_remove.txt" contains regions that are not unique in each intarget assemblies.
"Sample1_allcontig.bed" defines start (basically 0) and end (basically the total length) of each contig.

OPTION 2: we treat these markers as coding and identification of coding and non-coding is desired:

Usage examples:

In this case we assume that these markers are derived from genes and identification of exon/intron(flanking) is desired. In the below example: Using sp3_ref.fa as a targeted loci reference file (-t sp3_ref.fa); assuming cdna markers (-e 3); using cdna file (-s sp3_cdna.fa) to define cds. Explanation for other options see OPTION 1.

(seqCapture) $ seqCapture intarget -t sp3_ref.fa -t /path/to/raw_assemblies_dir  -e 3 -s sp3_cdna.fa ... [other options] ...

Output:

A diretory named "In_target" is created in "raw_assemblies_dir". And under "In_target" two subdirectories are created: "bed/" and "fasta/"

In diretory "fasta/", in-target assemblies in fasta format are available for each sample:

(seqCapture) $ ls fasta/
Sample1_targetedRegionAndFlanking.fasta
Sample2_targetedRegionAndFlanking.fasta
......
SampleN_targetedRegionAndFlanking.fasta

The contigs in XXX_targetedRegionAndFlanking.fasta are simply named as "Contig1", "Contig2"... "ContigN". If you want to link these names back to those in the original target ("sp3_ref.fa"), you can go to this file "sp3_ref.fa_rename_compared.txt" which can be found in the same diretory where "sp3_ref.fa" is located. How the original targeted loci are named does not seem to be very important.

In diretory "bed/", a few different bed files are provided for each sample

(seqCapture) $ ls bed/
Sample1_allcontig.bed
Sample1_coding_and_flanking.bed
Sample1_coding.bed
Sample1_flanking_ONLY.bed
Sample1_sites_to_remove.txt
Sample1_targetedRegionforExonCapEval.bed

[...files for other samples...]

"Sample1_coding_and_flanking.bed" defines coding and flanking regions (+/-500bp around each target in the above usage example) in each assembled in-target contig;
"Sample1_flanking_ONLY.bed" defines flanking regions (+/-500bp around each target in the above usage example) in each assembled in-target contig;
"Sample1_coding.bed" defines coding regions (+-500bp around each target in the above usage example) in each assembled in-target contig;
"Sample1_targetedRegionforExonCapEval.bed" also defines coding and flanking regions like "Sample1_coding_and_flanking.bed" does. However, it is not filtered. This file should only be used for evaluating enrichment efficiency (later in evaluation step);

The two files below can be ignored by phylogenetic projects:

"Sample1_sites_to_remove.txt" contains regions that are not unique in each intarget assemblies.
"Sample1_allcontig.bed" defines start (basically 0) and end (basically the total length) of each contig.