4B. Day 3. Genome and Transcriptome annotation methods and strategies - bioinfokushwaha/Livestock_Genomics GitHub Wiki

Genome

The following steps are involved in genome annotation and enrichment:

Gene prediction
Sequence annotation, i.e., similarity search (BLAST annotation) and domain search (Pfam, InterPro domains)
Gene ontology
Pathways

Transcriptome

The following steps are involved in functional annotation and enrichment:

Translation of transcriptome
Sequence annotation, i.e., similarity search (BLAST annotation) and domain search (Pfam, InterPro domains)
Gene ontology
Pathways

Splitting of Assembly

Splitting the assembly into smaller batches of sequences is recommended to save computation time. It can be done in two ways:

By Unix command
split-fasta script

mkdir Split_fasta && cd Split_fasta
split_fasta --basename=Assem -numseqs=500 ../Trinity.fasta
cd ..
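The "By Unix command" route is not spelled out above. A minimal sketch of one way to do it with awk follows. The function name split_fasta_awk and the output naming (PREFIX.000.fasta, PREFIX.001.fasta, ...) are assumptions chosen to mirror the split_fasta output names used in this page, not part of the split_fasta script itself.

```shell
# split_fasta_awk INPUT N PREFIX
# Writes PREFIX.000.fasta, PREFIX.001.fasta, ... with at most N sequences each.
split_fasta_awk() {
  awk -v n="$2" -v prefix="$3" '
    /^>/ {
      # Open a new chunk file every n header lines.
      if (count % n == 0) {
        if (out != "") close(out)
        out = sprintf("%s.%03d.fasta", prefix, count / n)
      }
      count++
    }
    # Copy header and sequence lines into the current chunk.
    out != "" { print > out }
  ' "$1"
}

# Example call (run inside Split_fasta):
# split_fasta_awk ../Trinity.fasta 500 Assem
```

Multi-line sequences stay together because a new file is opened only when a header line (starting with ">") is seen.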

TransDecoder (Find Coding Regions Within Transcripts)

TransDecoder identifies coding regions in de novo RNA-Seq transcript assemblies

Step 1: extract the long open reading frames
nice -n 1 TransDecoder.LongOrfs -t Split_fasta/Assem.000.fasta

Step 2: predict the likely coding regions
nice -n 1 TransDecoder.Predict -t Split_fasta/Assem.000.fasta
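The two commands above process only the first chunk (Assem.000.fasta). A minimal sketch for repeating both steps over every chunk is given below, assuming the chunks sit in one directory and end in .fasta. The function name run_transdecoder_chunks is hypothetical; the TransDecoder calls are the same as in Step 1 and Step 2.

```shell
# run_transdecoder_chunks DIR
# Runs TransDecoder.LongOrfs and TransDecoder.Predict on every DIR/*.fasta file.
run_transdecoder_chunks() {
  for fa in "$1"/*.fasta; do
    [ -e "$fa" ] || continue          # the glob matched nothing
    if nice -n 1 TransDecoder.LongOrfs -t "$fa" &&
       nice -n 1 TransDecoder.Predict -t "$fa"; then
      echo "done: $fa"
    else
      # Report the failure and carry on with the next chunk.
      echo "TransDecoder failed for $fa" >&2
    fi
  done
}

# Example call:
# run_transdecoder_chunks Split_fasta
```

Predict is attempted only when LongOrfs succeeds for the same chunk, so a failed chunk does not stop the remaining ones.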

Output
transcripts.fasta.transdecoder.pep : peptide sequences for the final candidate ORFs; all shorter candidates within longer ORFs were removed.
transcripts.fasta.transdecoder.cds : nucleotide sequences for coding regions of the final candidate ORFs.
transcripts.fasta.transdecoder.gff3 : positions within the target transcripts of the final selected ORFs.
transcripts.fasta.transdecoder.bed : BED-formatted file describing ORF positions, best viewed in GenomeView or IGV.

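Use the peptide file (transcripts.fasta.transdecoder.pep) or the CDS file (transcripts.fasta.transdecoder.cds) as input for functional annotation. The output names start with the name of the input FASTA file, so for the commands above the peptide file is Assem.000.fasta.transdecoder.pep.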
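Before annotating, it can help to check how many predicted ORFs are complete and how many are partial. The sketch below tallies ORF types from the peptide FASTA headers. It assumes the headers carry a "type:" tag (for example type:complete or type:5prime_partial), which is how TransDecoder headers usually look; the function name orf_type_counts is hypothetical.

```shell
# orf_type_counts PEP_FASTA
# Prints "type<TAB>count" for each ORF type found in the header lines.
orf_type_counts() {
  awk '
    # Only header lines are inspected; the tag after "type:" is the ORF type.
    /^>/ && match($0, /type:[A-Za-z0-9_]+/) {
      n[substr($0, RSTART + 5, RLENGTH - 5)]++
    }
    END { for (t in n) print t "\t" n[t] }
  ' "$1" | sort
}

# Example call:
# orf_type_counts Assem.000.fasta.transdecoder.pep
```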

Online resources for Transcriptome Annotation

TRAPID: Rapid Analysis of Transcriptome Data

KOBAS 3.0: web server for gene/protein functional annotation

Blast Annotation

#Make blast database for similarity search
nice -n 1 makeblastdb -in uniprot_sprot.fasta -input_type fasta -dbtype prot -out swisport -parse_seqids

#blastp similarity search (-db must be the name given to makeblastdb with -out)
nice -n 1 blastp -query transdecoder_dir/longest_orfs.pep -db swisport -max_target_seqs 1 \
-outfmt 6 -evalue 1e-5 -num_threads 1 > blastp.outfmt6
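Even with -max_target_seqs 1, the tabular output can hold several lines per query, because one subject may align in more than one segment (HSP). A minimal sketch for keeping only the highest-scoring line per query is shown below. It relies on the default -outfmt 6 column order, where column 1 is the query ID and column 12 is the bit score; the function name best_hits is hypothetical.

```shell
# best_hits BLAST_TSV
# Keeps the line with the highest bit score (column 12) for each query (column 1).
best_hits() {
  # Sort by query, then by bit score descending; awk keeps the first line per query.
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k12,12nr "$1" |
    awk -F '\t' '!seen[$1]++'
}

# Example call:
# best_hits blastp.outfmt6 > blastp.best.outfmt6
```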

Pfam annotation

nice -n 1 hmmscan --cpu 6 --domtblout dom.tblout -o Ass.domtblout /bioinfo/DB/Pfam-A.hmm \
Assem.000.fasta.transdecoder.pep
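The file written by --domtblout (dom.tblout above) is a whitespace-separated table with comment lines that start with "#". A minimal sketch for reducing it to one line per significant domain hit follows. It uses the standard domtblout layout, in which column 1 is the Pfam name, column 2 the Pfam accession, column 4 the query protein and column 13 the independent E-value. The function name pfam_domains and the default cutoff of 1e-5 are assumptions.

```shell
# pfam_domains DOMTBLOUT [MAX_IEVALUE]
# Prints: protein, Pfam name, Pfam accession, i-Evalue (tab-separated).
pfam_domains() {
  awk -v max="${2:-1e-5}" '
    /^#/ { next }                                   # skip comment lines
    ($13 + 0) <= (max + 0) { print $4 "\t" $1 "\t" $2 "\t" $13 }
  ' "$1"
}

# Example call:
# pfam_domains dom.tblout 1e-5 > pfam_hits.tsv
```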

Gene Ontology

1. Extract Swiss-Prot IDs from the blastp result
less blastp.outfmt6 |cut -f 2 >Swiss_ids.txt
cat Swiss_ids.txt
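Depending on the BLAST+ version and on how the database was built, column 2 may hold a plain accession (Q9NZ20), a versioned one (Q9NZ20.1), or the full identifier (sp|Q9NZ20|PA2G3_HUMAN), and the same accession often appears more than once. The sketch below reduces all three forms to a unique list of bare accessions, which is what the ID mapping page expects. The function name swissprot_accessions is hypothetical, and the three ID forms are assumptions about what the file may contain.

```shell
# swissprot_accessions BLAST_TSV
# Prints the unique UniProt accessions found in column 2.
swissprot_accessions() {
  cut -f 2 "$1" |
    awk -F '|' '{
      id = (NF >= 2) ? $2 : $1      # sp|ACC|NAME -> ACC, otherwise keep as is
      sub(/\.[0-9]+$/, "", id)      # drop a trailing version such as ".1"
      print id
    }' |
    sort -u
}

# Example call:
# swissprot_accessions blastp.outfmt6 > Swiss_ids.txt
```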

2. Paste these IDs into the [UniProt](https://www.uniprot.org/uploadlists/) batch ID mapping webpage
a. Provide your identifiers
Q9NZ20
O00370
Q9NTG1
Q02878
Q96GP6
Q96GP6

b. Select options
i) From [UniProtKB AC/ID] to [UniProtKB]
and click Submit

3. Click on Columns (in the middle of the page, in the same row as the filters)
Tick: Gene ontology (GO) and Gene ontology IDs
Tick: Pathways

4. Select all rows and download as tab-separated text file

5. Open the file in Excel or in the terminal
Hint: use the less and cut commands
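cut selects columns by number, so the column positions have to be looked up first. A minimal sketch that selects columns by header name instead is given below. The function name tsv_columns is hypothetical, and the header names in the example ("Entry", "Gene Ontology IDs") and the file name are assumptions about the downloaded file; check the first line of your own download with less.

```shell
# tsv_columns FILE NAME1,NAME2,...
# Prints the named columns of a tab-separated file, header line included.
tsv_columns() {
  awk -F '\t' -v OFS='\t' -v want="$2" '
    NR == 1 {
      # Map each header name to its column position.
      n = split(want, names, ",")
      for (i = 1; i <= NF; i++) pos[$i] = i
      for (j = 1; j <= n; j++) {
        if (!(names[j] in pos)) {
          print "missing column: " names[j] > "/dev/stderr"
          exit 1
        }
        idx[j] = pos[names[j]]
      }
    }
    {
      line = $(idx[1])
      for (j = 2; j <= n; j++) line = line OFS $(idx[j])
      print line
    }
  ' "$1"
}

# Example call (file name is a placeholder):
# tsv_columns uniprot_download.tsv 'Entry,Gene Ontology IDs' | less -S
```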