4B. Day 3. Genome and transcriptome annotation methods and strategies - bioinfokushwaha/Livestock_Genomics GitHub Wiki
Genome
The following steps are involved in genome annotation and enrichment:
- Gene prediction
- Sequence annotation, i.e. similarity search (BLAST annotation) and domain search (Pfam, InterPro domains)
- Gene ontology
- Pathways
Transcriptome
The following steps are involved in functional annotation and enrichment:
- Translation of transcriptome
- Sequence annotation, i.e. similarity search (BLAST annotation) and domain search (Pfam, InterPro domains)
- Gene ontology
- Pathways
Splitting of Assembly
Splitting the assembly into smaller sequence files is recommended to reduce the computation needed for each run
- By Unix command
- split-fasta script
mkdir Split_fasta && cd Split_fasta
split_fasta --basename=Assem -numseqs=500 ../Trinity.fasta
cd ..
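The "By Unix command" option listed above could be sketched with awk as follows. This is a minimal sketch, assuming a well-formed multi-FASTA file whose first line is a header; the file name demo.fasta, the prefix Assem_awk and the chunk size of 2 are illustrative only. For a real assembly one would set n=500 and point awk at Trinity.fasta.

```shell
# Work in a scratch directory so the demo does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Tiny stand-in for Trinity.fasta (made-up sequences).
printf '>t1\nATGC\n>t2\nGGCC\nTTAA\n>t3\nATAT\n>t4\nCCGG\n>t5\nGATTACA\n' > demo.fasta

# Start a new output file every n sequences; each line is written to
# the chunk that is current when the line is read.
awk -v n=2 -v prefix=Assem_awk '
  /^>/ {
    if (count % n == 0) {
      if (out != "") close(out)
      out = sprintf("%s.%03d.fasta", prefix, chunk++)
    }
    count++
  }
  { print > out }
' demo.fasta

# List the chunks that were written.
ls Assem_awk.*.fasta
```

Closing each finished chunk keeps the number of open files small, which matters when an assembly is split into many pieces.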
TransDecoder (Find Coding Regions Within Transcripts)
TransDecoder identifies coding regions in de novo RNA-Seq transcript assemblies
Step 1: extract the long open reading frames
nice -n 1 TransDecoder.LongOrfs -t Split_fasta/Assem.000.fasta
Step 2: predict the likely coding regions
nice -n 1 TransDecoder.Predict -t Split_fasta/Assem.000.fasta
Output
- transcripts.fasta.transdecoder.pep : peptide sequences for the final candidate ORFs; all shorter candidates within longer ORFs were removed.
- transcripts.fasta.transdecoder.cds : nucleotide sequences for coding regions of the final candidate ORFs
- transcripts.fasta.transdecoder.gff3 : positions within the target transcripts of the final selected ORFs
- transcripts.fasta.transdecoder.bed : bed-formatted file describing ORF positions, best to view in GenomeView/IGV.
Use the peptide or CDS file (transcripts.fasta.transdecoder.pep or transcripts.fasta.transdecoder.cds) as input for functional annotation
Online resources for Transcriptome Annotation
TRAPID: Rapid Analysis of Transcriptome Data
KOBAS 3.0: web server for gene/protein functional annotation
Blast Annotation
# Make a BLAST database for the similarity search
nice -n 1 makeblastdb -in uniprot_sprot.fasta -input_type fasta -dbtype prot -out swissprot -parse_seqids
# blastp similarity search against the database created above (-db must match the -out name)
nice -n 1 blastp -query transdecoder_dir/longest_orfs.pep -db swissprot -max_target_seqs 1 \
-outfmt 6 -evalue 1e-5 -num_threads 1 > blastp.outfmt6
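Tabular output (-outfmt 6) has one line per alignment (HSP), so a query can appear on more than one line even with -max_target_seqs 1. The sketch below keeps the highest-scoring line for each query and then lists the unique subject IDs. It runs on a small made-up file, demo.outfmt6; the file names and values are illustrative, and the column order is the default for -outfmt 6.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up stand-in for blastp.outfmt6. Default columns:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
printf 'orf1\tQ9NZ20\t85.0\t100\t15\t0\t1\t100\t1\t100\t1e-50\t180\n'  > demo.outfmt6
printf 'orf1\tQ9NZ20\t60.0\t40\t16\t0\t120\t160\t130\t170\t1e-08\t45\n' >> demo.outfmt6
printf 'orf2\tO00370\t70.0\t200\t60\t0\t1\t200\t5\t204\t1e-90\t310\n'  >> demo.outfmt6
printf 'orf3\tQ9NZ20\t55.0\t90\t40\t1\t3\t92\t10\t99\t1e-20\t95\n'    >> demo.outfmt6

# Sort by query, then by bit score (column 12) from high to low,
# and keep only the first line seen for each query.
sort -t "$(printf '\t')" -k1,1 -k12,12nr demo.outfmt6 \
  | awk -F '\t' '!seen[$1]++' > best_hits.outfmt6

# Unique subject IDs among the best hits.
cut -f 2 best_hits.outfmt6 | sort -u > best_ids.txt
cat best_ids.txt
```

Removing duplicates at this stage also avoids submitting the same accession several times to an ID mapping service later on.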
Pfam annotation
nice -n 1 hmmscan --cpu 6 --domtblout dom.tblout -o Ass.domtblout /bioinfo/DB/Pfam-A.hmm \
Assem.000.fasta.transdecoder.pep
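In the hmmscan command above, the per-domain table is the file given to --domtblout (dom.tblout). A minimal sketch of how such a table could be reduced to a list of query and Pfam domain pairs is shown below. It uses a made-up two-row table, demo.domtblout, and an independent E-value cutoff of 1e-5 that is an assumption for illustration, not a recommendation from this tutorial.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up stand-in for dom.tblout. Fields are whitespace-separated:
# 1 target name, 2 target accession, 4 query name, 13 independent E-value.
cat > demo.domtblout <<'EOF'
# target name accession tlen query name accession qlen E-value score bias # of c-Evalue i-Evalue score bias from to from to from to acc description
Pkinase PF00069.28 264 orf1.p1 - 300 1.1e-60 200.1 0.0 1 1 2.0e-63 1.2e-60 199.9 0.0 1 264 20 280 18 282 0.95 Protein kinase domain
zf-C2H2 PF00096.29 23 orf2.p1 - 150 0.02 12.0 1.0 1 2 0.001 0.03 11.5 0.5 1 23 40 62 40 62 0.90 Zinc finger, C2H2 type
EOF

# Skip comment lines, keep domains at or below the cutoff, and print
# query, Pfam accession, domain name and independent E-value.
awk '!/^#/ && ($13 + 0) <= 1e-5 { print $4 "\t" $2 "\t" $1 "\t" $13 }' \
  demo.domtblout > pfam_hits.tsv

cat pfam_hits.tsv
```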
Gene Ontology
1. Extract Swiss-Prot IDs from the blastp result
cut -f 2 blastp.outfmt6 > Swiss_ids.txt
cat Swiss_ids.txt
2. Paste these IDs into the [UniProt](https://www.uniprot.org/uploadlists/) batch ID mapping webpage
a. Provide your identifiers
Q9NZ20
O00370
Q9NTG1
Q02878
Q96GP6
Q96GP6
b. Select options
i) From [UniProtKB AC/ID] to [UniProtKB]
and Submit
3. Click on Columns (in Middle of page in row of filter)
Tick: Gene ontology (GO) and Gene ontology Ids
Tick Pathways
4. Select all rows and download as tab-separated text file
5. Open in Excel or in the terminal
Hint: use the less and cut commands
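Following the hint, the downloaded table could be reduced on the command line to one line per protein with its GO IDs. This is a minimal sketch on a made-up file, uniprot_demo.tsv; the column header "Gene Ontology IDs" is an assumption and should be checked against the header line of the real download, and the GO IDs and entry names below are placeholders, not real annotations.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up stand-in for the downloaded tab-separated table.
printf 'Entry\tEntry Name\tGene Ontology (GO)\tGene Ontology IDs\n'  > uniprot_demo.tsv
printf 'Q9NZ20\tNAME1\tterm a; term b\tGO:0000001; GO:0000002\n'     >> uniprot_demo.tsv
printf 'O00370\tNAME2\tterm c\tGO:0000003\n'                         >> uniprot_demo.tsv
printf 'Q02878\tNAME3\t\t\n'                                         >> uniprot_demo.tsv

# Find the GO ID column by its header, skip entries without GO IDs, and
# print the accession followed by its GO IDs separated by tabs.
awk -F '\t' -v want="Gene Ontology IDs" '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == want) col = i; next }
  col && $col != "" {
    ids = $col
    gsub(/; */, "\t", ids)
    print $1 "\t" ids
  }
' uniprot_demo.tsv > GO_per_gene.txt

cat GO_per_gene.txt
```

Looking the column up by name rather than by position keeps the command working when a different set of columns is ticked before downloading.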
Gene Ontology Enrichment
Download the dataset containing the GO terms and unzip it
1. Open [AgriGO](http://bioinfo.cau.edu.cn/agriGO/analysis.php)
a) Select analysis tool: Singular Enrichment Analysis (SEA)
b) Select the species: Customized annotation, and paste the GO terms of the differentially expressed (DE) genes
c) Select reference: Customized annotated reference, and paste all GO terms
d) Press Submit
e) Go through the enriched terms in the analysis and download the enriched GO terms
f) Click on the REVIGO icon to export the enriched terms directly, then press Start
g) Explore the different outputs of REVIGO
WEGO
1. Explore [WEGO](http://wego.genomics.org.cn/)
2. Upload Wego_Go.txt in Native format
3. Extract information for Biological Process, Molecular Function and Cellular Component
4. Divide the WEGO dataset into two parts and use WEGO for a comparative study
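One way to divide the dataset for step 4 is by a list of gene IDs, for example the genes of one condition versus all others. The sketch below assumes the annotation file has one line per gene, with the gene ID in the first tab-separated column followed by its GO IDs. The file names Wego_demo.txt, groupA_ids.txt, Wego_partA.txt and Wego_partB.txt, and all IDs, are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up annotation file: gene ID, then tab-separated GO IDs.
printf 'g1\tGO:0000001\tGO:0000002\n'  > Wego_demo.txt
printf 'g2\tGO:0000003\n'              >> Wego_demo.txt
printf 'g3\tGO:0000001\n'              >> Wego_demo.txt

# Gene IDs that belong to the first group (one per line).
printf 'g1\ng3\n' > groupA_ids.txt

# First pass: remember the group A IDs. Second pass: write each
# annotation line to part A if its ID was listed, otherwise to part B.
awk -F '\t' '
  NR == FNR { in_a[$1] = 1; next }
  { print > (($1 in in_a) ? "Wego_partA.txt" : "Wego_partB.txt") }
' groupA_ids.txt Wego_demo.txt

wc -l Wego_partA.txt Wego_partB.txt
```

The two resulting files can then be uploaded to WEGO as separate datasets for comparison.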
Pathway enrichment
1. Explore [KOBAS](http://kobas.cbi.pku.edu.cn/)
2. Upload Assem.000.fasta under the Gene-list Enrichment tab and click Run
3. Download the result after the job has completed
4. Extract the information for Term, ID, Input number, Background number, P-Value and Corrected P-Value, as given in the sample file in the Day6 folder
5. Run the R script to generate the pathway enrichment figure
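library(ggplot2)
# Read the tab-separated pathway table (columns used below: Pathways, Score, Genes, qvalue)
x <- read.table("sample.txt", sep = "\t", header = TRUE, na.strings = "NA")
png(filename = "pathways_Enrichment.png", width = 6, height = 6, units = "in", res = 600)
# Bubble plot: richness factor on the x axis, pathways ordered by gene count on the y axis
p <- ggplot(x, aes(x = Score, y = reorder(Pathways, Genes), size = Genes, colour = qvalue)) +
  geom_point() +
  expand_limits(x = 0) +
  theme(text = element_text(size = 12, colour = "black"), axis.text.y = element_text(colour = "black")) +
  labs(x = "Richness factor", y = "Pathways", colour = "P-adjust", size = "Genes")
print(p)  # explicit print so the plot is also drawn when the script is run with source()
dev.off()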
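The R script expects a table with the columns Pathways, Score, Genes and qvalue, while step 4 lists the fields Term, ID, Input number, Background number, P-Value and Corrected P-Value. A possible conversion is sketched below. The mapping is an assumption: Score is taken as Input number divided by Background number (matching the "Richness factor" axis label), Genes as Input number, and qvalue as Corrected P-Value. The input file kobas_demo.tsv and its values are made up, and the result is written to sample_demo.txt so that the real sample.txt is not overwritten.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up stand-in for the fields extracted in step 4 (tab-separated, with header).
printf 'Term\tID\tInput number\tBackground number\tP-Value\tCorrected P-Value\n' > kobas_demo.tsv
printf 'Pathway A\tid0001\t10\t100\t0.001\t0.01\n' >> kobas_demo.tsv
printf 'Pathway B\tid0002\t5\t20\t0.02\t0.04\n'    >> kobas_demo.tsv

# Write a new header, then one row per pathway with the ratio
# Input number / Background number as Score. Rows with a zero
# background are skipped to avoid division by zero.
awk -F '\t' '
  BEGIN { print "Pathways\tScore\tGenes\tqvalue" }
  NR > 1 && ($4 + 0) > 0 { printf "%s\t%.4f\t%d\t%s\n", $1, $3 / $4, $3, $6 }
' kobas_demo.tsv > sample_demo.txt

cat sample_demo.txt
```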