4B. Day 3. Genome and Transcriptome annotation methods and strategies - bioinfokushwaha/Livestock_Genomics GitHub Wiki


Genome


The following steps are involved in genome annotation and enrichment:

  • Gene prediction
  • Sequence annotation, i.e. similarity search (BLAST annotation) and domain search (Pfam, InterPro domains)
  • Gene ontology
  • Pathways

Transcriptome


The following steps are involved in functional annotation and enrichment:

  • Translation of the transcriptome
  • Sequence annotation, i.e. similarity search (BLAST annotation) and domain search (Pfam, InterPro domains)
  • Gene ontology
  • Pathways

Splitting of Assembly

Splitting the assembly into smaller sets of sequences is recommended to save computation. This can be done in two ways:

  1. By Unix command
  2. split-fasta script

mkdir Split_fasta && cd Split_fasta
split_fasta --basename=Assem -numseqs=500 ../Trinity.fasta
cd ..
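For option 1 (plain Unix commands), a minimal awk sketch is given below. It is an illustration only and not the split_fasta script: the demo file, the `demo_chunk.` prefix and the chunk size of 2 are assumptions chosen so the example runs on its own. For a real assembly one would pass Trinity.fasta and a larger `n` such as 500.

```shell
# Tiny demo assembly (hypothetical sequences); replace with Trinity.fasta in practice.
cat > demo_assembly.fasta <<'EOF'
>seq1
ATGCATGC
>seq2
GGGCCC
TTTAAA
>seq3
ATGAAATAG
>seq4
CCCGGG
>seq5
ATGTTTTGA
EOF

# Open a new output file every n headers; sequence lines follow their header.
awk -v n=2 -v prefix="demo_chunk." '
    /^>/ {
        if (count % n == 0) {
            if (out != "") close(out)
            out = sprintf("%s%03d.fasta", prefix, count / n)
        }
        count++
    }
    { print > out }
' demo_assembly.fasta

# Report the number of sequences in each chunk.
grep -c ">" demo_chunk.*.fasta
```

Multi-line records stay intact because a new file is only opened when a header line is seen.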

TransDecoder (Find Coding Regions Within Transcripts)

TransDecoder identifies coding regions in de novo RNA-Seq transcript assemblies

Step 1: extract the long open reading frames
nice -n 1 TransDecoder.LongOrfs -t Split_fasta/Assem.000.fasta

Step 2: predict the likely coding regions
nice -n 1 TransDecoder.Predict -t Split_fasta/Assem.000.fasta

Output

  • transcripts.fasta.transdecoder.pep : peptide sequences for the final candidate ORFs; all shorter candidates within longer ORFs were removed.
  • transcripts.fasta.transdecoder.cds : nucleotide sequences for the coding regions of the final candidate ORFs.
  • transcripts.fasta.transdecoder.gff3 : positions of the final selected ORFs within the target transcripts.
  • transcripts.fasta.transdecoder.bed : BED-formatted file describing ORF positions, best viewed in GenomeView or IGV.

Here transcripts.fasta stands for the name of the input file (Assem.000.fasta in the commands above). Use the .pep file (or the .cds file) to run the functional annotation.
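Before moving on to annotation, it can help to check what kind of ORFs were predicted. The sketch below counts ORF types in a peptide file. It assumes that each header carries a `type:<name>` field, as TransDecoder headers usually do; the demo file and its shortened headers are made up for illustration.

```shell
# Demo peptide file with simplified, hypothetical headers.
cat > demo_transdecoder.pep <<'EOF'
>t1.p1 ORF type:complete len:12 (+)
MKTAYIAKQRQ*
>t2.p1 ORF type:5prime_partial len:9 (+)
KLVFFAED*
>t3.p1 ORF type:complete len:10 (-)
MSTNPKPQR*
>t4.p1 ORF type:internal len:8 (+)
GAVLILLW
EOF

# Pull the value after "type:" from every header and count each type.
awk '
    /^>/ {
        if (match($0, /type:[^ ]+/)) {
            types[substr($0, RSTART + 5, RLENGTH - 5)]++
        }
    }
    END { for (t in types) print t "\t" types[t] }
' demo_transdecoder.pep | sort > demo_orf_types.tsv

cat demo_orf_types.tsv
```

A high share of partial or internal ORFs may indicate fragmented transcripts, which is worth knowing before interpreting the annotation.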

Online resources for Transcriptome Annotation

TRAPID: Rapid Analysis of Transcriptome Data

KOBAS 3.0: web server for gene/protein functional annotation

Blast Annotation

# Make a BLAST database for the similarity search
nice -n 1 makeblastdb -in uniprot_sprot.fasta -input_type fasta -dbtype prot -out swissprot -parse_seqids

# blastp similarity search; -db must match the -out name given to makeblastdb
nice -n 1 blastp -query transdecoder_dir/longest_orfs.pep -db swissprot -max_target_seqs 1 \
-outfmt 6 -evalue 1e-5 -num_threads 1 > blastp.outfmt6
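The tabular output (`-outfmt 6`) has twelve columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. Even with `-max_target_seqs 1` a query can appear on several lines, one per alignment segment. The sketch below keeps only the highest-scoring line per query. The demo rows, the made-up entry names and the 30% identity cut-off are assumptions for illustration, not values prescribed by this tutorial.

```shell
# Demo BLAST table (made-up rows); spaces are converted to tabs.
tr ' ' '\t' > demo_blastp.outfmt6 <<'EOF'
orf1.p1 sp|Q9NZ20|ENTRY_A 62.5 200 75 0 1 200 5 204 1e-80 250
orf1.p1 sp|Q9NZ20|ENTRY_A 40.0 50 30 0 210 259 300 349 1e-06 45.1
orf2.p1 sp|O00370|ENTRY_B 35.2 180 110 3 4 180 10 190 2e-20 90.5
orf3.p1 sp|Q02878|ENTRY_C 28.0 60 43 0 1 60 1 60 1e-06 33.0
EOF

# Keep rows with pident >= min_id, then the best bitscore (column 12) per query.
awk -F'\t' -v min_id=30 '
    ($3 + 0) >= (min_id + 0) && (!($1 in best) || ($12 + 0) > score[$1]) {
        best[$1] = $0
        score[$1] = $12 + 0
    }
    END { for (q in best) print best[q] }
' demo_blastp.outfmt6 | sort > demo_blastp.best.outfmt6

cat demo_blastp.best.outfmt6
```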

Pfam annotation

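# --domtblout writes the per-domain table (dom.tblout); -o captures the full text report (Ass.domtblout)
nice -n 1 hmmscan --cpu 6 --domtblout dom.tblout -o Ass.domtblout /bioinfo/DB/Pfam-A.hmm \
Assem.000.fasta.transdecoder.pep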
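The per-domain table is whitespace-delimited, with comment lines starting with `#`. In the hmmscan layout, column 1 is the Pfam family name, column 2 its accession, column 4 the query sequence and column 13 the independent E-value (i-Evalue) of the domain. The sketch below extracts confident domain hits into a simple table. The demo rows, their numbers and the 1e-5 cut-off are assumptions for illustration.

```shell
# Demo per-domain table (made-up values laid out like hmmscan --domtblout).
cat > demo.domtblout <<'EOF'
# target name    accession   tlen query name accession qlen E-value score bias # of c-Evalue i-Evalue score bias from to from to from to acc description
Phospholip_A2_1 PF00068.22 115 orf1.p1 - 200 1.2e-30 105.3 0.1 1 1 3.0e-34 1.5e-30 105.0 0.1 2 114 20 140 19 141 0.95 Phospholipase A2
Ribosomal_L6e PF01159.22 108 orf2.p1 - 180 4.0e-25 88.0 0.5 1 1 1.0e-28 5.0e-25 87.7 0.5 1 108 60 170 60 170 0.97 Ribosomal protein L6e
zf-C2H2 PF00096.29 23 orf2.p1 - 180 0.5 12.0 0.2 1 1 0.001 0.4 12.3 0.2 1 23 10 32 10 32 0.90 Zinc finger, C2H2 type
EOF

# Skip comments; print query, Pfam accession, family name and i-Evalue for hits under max_e.
awk -v max_e=1e-5 '
    /^#/ { next }
    ($13 + 0) <= (max_e + 0) { print $4 "\t" $2 "\t" $1 "\t" $13 }
' demo.domtblout > demo_pfam_hits.tsv

cat demo_pfam_hits.tsv
```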

Gene Ontology

1. Extract Swiss-Prot IDs from the blastp result
cut -f 2 blastp.outfmt6 > Swiss_ids.txt
cat Swiss_ids.txt

2. Paste these IDs into the [UniProt](https://www.uniprot.org/uploadlists/) batch ID mapping webpage
a. Provide your identifiers
Q9NZ20
O00370
Q9NTG1
Q02878
Q96GP6
Q96GP6

b. Select options
i) From [UniProtKB AC/ID] to [UniProtKB]
and Submit

3. Click on Columns (in the middle of the page, in the filter row)
Tick: Gene Ontology (GO) and Gene Ontology IDs
Tick: Pathways

4. Select all rows and download as tab-separated text file

5. Open the file in Excel or in the terminal
Hint: use the less and cut commands
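Two of the steps above can be sketched on the command line. The first part reduces the subject IDs from step 1 to unique accessions, since the same protein is often hit by several ORFs. The second part follows the hint in step 5 and turns the downloaded table into one accession and GO ID pair per line. The file names, demo rows, the `sp|ACCESSION|NAME` ID style and the column header `Gene Ontology IDs` are assumptions; the header in a real download may differ.

```shell
# Demo BLAST table (made-up rows); only the first three columns are shown.
tr ' ' '\t' > demo_hits.outfmt6 <<'EOF'
orf1.p1 sp|Q9NZ20|ENTRY_A 62.5
orf2.p1 sp|O00370|ENTRY_B 35.2
orf3.p1 Q96GP6 41.0
orf4.p1 Q96GP6 55.3
EOF

# Step 1 variant: keep the accession only and drop duplicates.
cut -f 2 demo_hits.outfmt6 \
    | awk -F'|' '{ print (NF >= 2 ? $2 : $1) }' \
    | sort -u > demo_Swiss_ids.txt

# Demo UniProt download (made-up rows; "|" stands in for the tab separator).
tr '|' '\t' > demo_uniprot.tsv <<'EOF'
Entry|Entry Name|Gene Ontology IDs
Q9NZ20|ENTRY_A|GO:0004623; GO:0005576
Q02878|ENTRY_B|GO:0003735
Q96GP6|ENTRY_C|
EOF

# Step 5: locate the GO column by its header, then print one pair per line.
awk -F'\t' '
    NR == 1 {
        for (i = 1; i <= NF; i++) if ($i == "Gene Ontology IDs") col = i
        next
    }
    col && $col != "" {
        n = split($col, ids, /; */)
        for (j = 1; j <= n; j++) print $1 "\t" ids[j]
    }
' demo_uniprot.tsv > demo_entry2go.tsv

cat demo_Swiss_ids.txt demo_entry2go.tsv
```

Looking the column up by name rather than by position keeps the command working when extra columns are ticked in step 3.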

Gene Ontology Enrichment

Suggested Reading

Download the dataset containing the GO terms and unzip it

1. Open [AgriGO](http://bioinfo.cau.edu.cn/agriGO/analysis.php)
a) Select the analysis tool: Singular Enrichment Analysis (SEA)
b) Select the species: Customized annotation, and paste the GO terms of the differentially expressed (DE) genes
c) Select the reference: Customized annotated reference, and paste all GO terms
d) Press Submit
e) Go through the enriched terms in the analysis and download the enriched GO terms
f) Click on the REVIGO icon to export the enriched terms directly, and press Start
g) Explore the different outputs of REVIGO

WEGO

Suggested Reading

1. Explore [WEGO](http://wego.genomics.org.cn/)
2. Upload Wego_Go.txt in Native format
3. Extract the information for Biological Process, Molecular Function and Cellular Component
4. Divide the WEGO data set into two parts and use WEGO for a comparative study
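For step 4, the data set can be divided on the command line. The sketch below assumes a file with one gene per line (gene ID followed by tab-separated GO IDs) and simply cuts it into a first and a second half for practice. The demo file and output names are hypothetical. In a real comparison the two parts would normally be two biologically meaningful gene sets, such as two conditions.

```shell
# Demo annotation file (made-up rows): one gene per line, ID then GO IDs.
tr ' ' '\t' > demo_Wego_Go.txt <<'EOF'
gene1 GO:0004623 GO:0005576
gene2 GO:0003735
gene3 GO:0005576
gene4 GO:0003735 GO:0005576
gene5 GO:0004623
EOF

# Number of lines, and the size of the first half (rounded up).
total=$(awk 'END { print NR }' demo_Wego_Go.txt)
half=$(( (total + 1) / 2 ))

# First half goes to part 1, the remainder to part 2.
head -n "$half" demo_Wego_Go.txt > demo_Wego_part1.txt
tail -n "+$((half + 1))" demo_Wego_Go.txt > demo_Wego_part2.txt

grep -c "" demo_Wego_part1.txt demo_Wego_part2.txt
```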

Pathway enrichment

Suggested Reading

1. Explore [KOBAS](http://kobas.cbi.pku.edu.cn/)
2. Upload Assem.000.fasta under the Gene-list enrichment tab and click Run
3. Download the result after the job has completed
4. Extract the Term, ID, Input number, Background number, P-Value and Corrected P-Value columns, as in the sample file in the Day6 folder
5. Run the R script below to generate the pathway enrichment figure

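library(ggplot2)

# Read the tab-separated table; expected columns: Pathways, Genes, Score, qvalue
x <- read.table("sample.txt", sep = "\t", header = TRUE, na.strings = "NA")

# Open a 6 x 6 inch PNG device at 600 dpi
png(filename = "pathways_Enrichment.png", width = 6, height = 6, units = "in", res = 600)

p <- ggplot(x, aes(x = Score, y = reorder(Pathways, Genes), size = Genes, colour = qvalue)) +
  geom_point() + expand_limits(x = 0) +
  theme(text = element_text(size = 12, colour = "black"), axis.text.y = element_text(colour = "black")) +
  labs(x = "Richness factor", y = "Pathways", colour = "P-adjust", size = "Genes")
print(p)  # explicit print so the plot is also drawn when the script is source()d
dev.off()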
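The R script reads a table with the columns Pathways, Genes, Score and qvalue, whereas step 4 lists the KOBAS columns Term, ID, Input number, Background number, P-Value and Corrected P-Value. One plausible way to bridge the two is sketched below. The mapping is an assumption, not taken from the sample file: Pathways = Term, Genes = Input number, Score = Input number / Background number (a richness factor) and qvalue = Corrected P-Value. The demo rows are made up.

```shell
# Demo KOBAS extract (made-up numbers; "|" stands in for the tab separator).
tr '|' '\t' > demo_kobas.tsv <<'EOF'
Term|ID|Input number|Background number|P-Value|Corrected P-Value
Ribosome|hsa03010|12|150|1.0e-06|2.0e-05
Fatty acid metabolism|hsa01212|5|50|3.0e-04|4.0e-03
EOF

# Write the header expected by the R script, then one row per pathway.
awk -F'\t' '
    NR == 1 { print "Pathways\tGenes\tScore\tqvalue"; next }
    ($4 + 0) > 0 { printf "%s\t%d\t%.3f\t%s\n", $1, $3, $3 / $4, $6 }
' demo_kobas.tsv > demo_sample.txt

cat demo_sample.txt
```

Rows with a background number of zero are skipped to avoid division by zero.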