Exon capture with ref, phylogenetic projects - CGRL-QB3-UCBerkeley/seqCapture GitHub Wiki

Introduction:

Probe design:

Exon captures can be designed using a pre-existing reference genome. For phylogenetic projects, however, it is generally believed that probes should be designed based on ortholgous exons from multiple species reference. Species chosen should be representative to the phylogenetics tree of interest (e.g. representative species from major clades spanning a wide range of phylogenetic depth). For projects targeting on a shallow phylogenetic breath (e.g. <2% in sequence divergence), few or even one species could be sufficient as a reference for probe design. Because a reference genome is companioned with gff/gtf, selection of exons for each gene can relatively be an easy task (though filtering is never easy).

One orthologous exon should be chose per gene per species. If multiple exons are chosen per gene, they must be joined by Ns (e.g. 39Ns). However it is really unnecessary to pick more than one exon per gene. The key is to use as many independent markers as possible.

For example if 3 species (sp1, sp2 and sp3) are used for probe design, then in combined targeted loci fasta file the sequences should be looking like the example below:

(seqCapture) $ less combined_ref.fa

\>sp1_geneID1_gene1
ATGCATGC......

\>sp2_geneID1_gene1
ATGCATCC......

\>sp3_geneID1_gene1
ATGCATTC......

\>sp1_geneID2_gene2
GCANNNGC......

\>sp2_geneID2_gene2
GCATAGGC......

\>sp3_geneID2_gene2
GCATTCGC......

............

Note that the header of each marker must follow the naming scheme specified above as "CHAR1_CHAR2_CHAR3", where CHAR1 can be species name or contig name, CHAR2 must be a gene ID such as Ensembl Gene ID (e.g. ENSG00000010404), and CHAR3 can be gene names or others. In the above example, sp1_geneID1_gene1, sp2_geneID1_gene1 and sp3_geneID1_gene1 are orthologous to each other; sp1_geneID2_gene2, sp2_geneID2_gene2 and sp3_geneID2_gene2 are orthologous to each other, and so forth. If markers are orthologous to each other, the middle gene ID (CHAR2) must be the same. CHAR1 and CHAR3 can be any labels. Furthermore, orthologous exonic markers from different species (e.g. sp1_geneID1_gene1, sp2_geneID1_gene1 and sp3_geneID1_gene1) should have exactly the same length. If, their lengths differ by codon indels then Ns should be filled in.

From these combined targeted loci, a species-specific targeted loci fasta file can be created:

(seqCapture) $ less sp1_ref.fa

\>sp1_geneID1_gene1
 ATGCATGC......

\>sp1_geneID2_gene2
 GCATTTGC......

......


(seqCapture) $ less sp2_ref.fa

\>sp2_geneID1_gene1
ATGCATCC......

\>sp2_geneID2_gene2
GCATAGGC......

......

Prepare input for the run:

Targeted loci in multi-fasta format:

There are various ways of making a targeted loci reference file. Overall, I suggest use one of the species-specific targeted loci refernece (sp1_ref.fa, sp2_ref.fa OR sp3_ref.fa) for all samples to simplify this process. Alternatively you can combine markers (e.g. pick the orthologous copy of each marker containing least amount of missing data) from these species-specific reference to make a mixed reference like below:

(seqCapture) $ less final_ref.fa

\>sp1_geneID1_gene1
ATGCATGC......

\>sp2_geneID2_gene2
GCATAGGC......

......

When phylogenetic breath is very broad and the references sp1, sp2 and sp3 representing three distinctly distant clades, then you way wish to use different reference for different groups of samples. For example, sp1 is derived from clade1 and can be used as a refernece for the rest samples in clade1. You can run this step for each groups respectively and then combine output files afterwards.

**Note: no matter which reference to use, there must be only one copy of each orthologous locus in this reference. **

In the below demo I only show examples of using one of the species-specific reference, and in most cases, it is sufficient for searching targets from assembled contigs.

[OPTIONAL] If you are willing to define coding regions from each assembled contig, a cds or a cdna or mrna fasta sequence file is needed:

Cds file contains coding sequences of each gene and can be downloaded along with the reference genome. However some modifications must be made before using it as input. First, you should only pick one longest cds per gene. Second the header of each sequence needs to be modified as "geneID1_gs1_geEndPos". "geneID" must match one of those in the targeted loci file (e.g. "sp1_geneID1_gene1"). "gs1" means the gene begin at position 1. "geEndPos" means this cds ends by position at EndPos. Basically EndPos = length of the cds. For example:

(seqCapture) $ less sp1_cds.fa

\>geneID5 gs1_ge21
ATGGTGTGACCAGTCAGTAGC

This cds has a gene name of geneID5 (so it can matche "sp1_geneID5_gene1" in targeted loci file) and the coding sequence begins at position 1 and ends by position 21 (this example cds has 21 nt). This cds file is only needed when you want to separate coding regions and flanking regions for building contigs and alignment at next steps

Sequences in mRNA or cDNA fasta file may contains UTRs. The modifications that have to be made are similar to described above. The difference is that the "gs" and "ge" may no longer be "1" and "length of the sequence" for each transcript, respectively, becasue of the presence of 5' and 3' UTRs. Therefore you will have to define gs and ge for each mRNA or cDNA sequence and add them to their headers as "geneID1_gsStartPos_geEndPos".

A folder ("raw_assemblies_dir") with raw assemblies for each sample:

These raw assemblies are create by previous step assemble.

(seqCapture) $ ls raw_assemblies_dir/
Sample1.fasta Sample2.fasta ....... SampleN.fasta

OPTION 1: we treat these markers as "random" markers and identification of coding and non-coding is not necessary:

Usage examples:

In this case these markers can just be treated as any random genomic markers. In the below example: Using sp1_ref.fa as a targeted loci reference file (-t sp1_ref.fa); assuming random markers ( -e 4 ); In each individual raw assembly, eliminating any contigs shorter then 150bp (-L 150); only considering a match if a contig and a targeted locus shared at 80% sequencing similarity (-p 75); retaining 500bp+/- flanking regions around each targeted locus (-f 500); using 20 threads in blast search and cd-hit-est clustering (-T 20).

(seqCapture) $ seqCapture intarget -t sp1_ref.fa -t /path/to/raw_assemblies_dir -L 150 -p 75 -d 20  -e 4 -f 500

Output:

A diretory named "In_target" is created in "raw_assemblies_dir". And under "In_target" two subdirectories are created: "bed/" and "fasta/"

In diretory "fasta/", in-target assemblies in fasta format are available for each sample:

(seqCapture) $ ls fasta/
Sample1_targetedRegionAndFlanking.fasta
Sample2_targetedRegionAndFlanking.fasta
......
SampleN_targetedRegionAndFlanking.fasta

The contigs in XXX_targetedRegionAndFlanking.fasta are simply named as "Contig1", "Contig2"... "ContigN". If you want to link these names back to those in the original target ("sp1_ref.fa"), you can go to this file "sp1_ref.fa_rename_compared.txt" which can be found in the same diretory where "sp1_ref.fa" is located. But, how the original targeted loci are named does not seem to be very important.

In diretory "bed/", a few different bed files are provided for each sample

(seqCapture) $ ls bed/
Sample1_allcontig.bed
Sample1_flanking_ONLY.bed
Sample1_sites_to_remove.txt
Sample1_targeted_region_and_flanking.bed
Sample1_targeted_region.bed
Sample1_targetedRegionforExonCapEval.bed
[...files for other samples...]

"Sample1_targeted_region_and_flanking.bed" defines targeted and flanking regions (+-500bp around each target in the above usage example) in each assembled in-target contig;
"Sample1_flanking_ONLY.bed" defines flanking regions only (+-500bp around each target in the above usage example) in each assembled in-target contig;
"Sample1_targeted_region.bed" defines targeted regions only (+-500bp around each target in the above usage example) in each assembled in-target contig;
"Sample1_targetedRegionforExonCapEval.bed" also defines targeted and flanking regions like "Sample1_targeted_region_and_flanking.bed" does. However, it is not filtered. This file should only be used for evaluating enrichment efficiency (later in evaluation step);

The two files below can be ignored by phylogenetic projects:

"Sample1_sites_to_remove.txt" contains regions that are not unique in each intarget assemblies.
"Sample1_allcontig.bed" defines start (basically 0) and end (basically the total length) of each contig.

OPTION 2: we treat these markers as coding and identification of coding and non-coding is desired:

Usage examples:

In this case we assume that these markers are derived from genes and identification of exon/intron is desired. In the below example: Using sp3_ref.fa as a targeted loci reference file (-t sp3_ref.fa); assuming cds markers ( -e 2 ); using cds file (-s sp1_cds.fa).. Explanation for other options see OPTION 1.

(seqCapture) $ seqCapture intarget -t sp3_ref.fa -t /path/to/raw_assemblies_dir  -e 2 -s sp1_cds.fa ... [other options] ...

Output:

A diretory named "In_target" is created in "raw_assemblies_dir". And under "In_target" two subdirectories are created: "bed/" and "fasta/"

In diretory "fasta/", in-target assemblies in fasta format are available for each sample:

(seqCapture) $ ls fasta/
Sample1_targetedRegionAndFlanking.fasta
Sample2_targetedRegionAndFlanking.fasta
......
SampleN_targetedRegionAndFlanking.fasta

The contigs in XXX_targetedRegionAndFlanking.fasta are simply named as "Contig1", "Contig2"... "ContigN". If you want to link these names back to those in the original target ("sp3_ref.fa"), you can go to this file "sp3_ref.fa_rename_compared.txt" which can be found in the same diretory where "sp3_ref.fa" is located. How the original targeted loci are named does not seem to be very important.

In diretory "bed/", a few different bed files are provided for each sample

(seqCapture) $ ls bed/
Sample1_allcontig.bed
Sample1_coding_and_flanking.bed
Sample1_coding.bed
Sample1_flanking_ONLY.bed
Sample1_sites_to_remove.txt
Sample1_targetedRegionforExonCapEval.bed

[...files for other samples...]

"Sample1_coding_and_flanking.bed" defines coding and flanking regions (+-500bp around each target in the above usage example) in each assembled in-target contig;
"Sample1_flanking_ONLY.bed" defines flanking regions (+-500bp around each target in the above usage example) in each assembled in-target contig;
"Sample1_coding.bed" defines coding regions (+-500bp around each target in the above usage example) in each assembled in-target contig;
"Sample1_targetedRegionforExonCapEval.bed" also defines coding and flanking regions like "Sample1_coding_and_flanking.bed" does. However, it is not filtered. This file should only be used for evaluating enrichment efficiency (later in evaluation step);

The two files below can be ignored by phylogenetic projects:

"Sample1_sites_to_remove.txt" contains regions that are not unique in each intarget assemblies.
"Sample1_allcontig.bed" defines start (basically 0) and end (basically the total length) of each contig.