ex801 - nibb-gitc/gitc2021mar-rnaseq GitHub Wiki
ex801 Functional annotation using similarity-based methods
Exercise 801: Functional annotation using homology search
Here we assume a case in which a well-annotated genome is available for a closely related model organism, and we perform annotation using homology search and related methods. The data used are transcriptome data from the yeast Saccharomyces eubayanus (GEO accession: GSE133146). This yeast is one of the ancestral species of S. pastorianus, the yeast used in lager beer production. S. pastorianus is thought to have arisen as a hybrid of this yeast and S. cerevisiae, and the aim of the original study was to investigate its origin. The original study targeted a Himalayan strain of S. eubayanus, which is considered close to the ancestral species that hybridized, and also sequenced its genome. Here, however, we map the reads to the publicly available genome of a different strain and carry out the analysis on the assumption that annotation is based mainly on comparison with S. cerevisiae. Note that this exercise covers annotation only; the enrichment analysis based on it is done in Exercise 803.
Data
Work on bias5. The directory ~/gitc/data/IU/yeast contains the following files.
file | contents | remarks |
---|---|---|
seub_genome.fa | S. eubayanus genome sequence | input data |
stringtie_merged.gtf | assembly result from stringtie | input data |
topTags.gene.txt | EdgeR result (transcript level) | input data; used in Exercise 803 |
topTags.gene.txt | EdgeR result (gene level) | input data; used in Exercise 803 |
seub_genes.pep | S. eubayanus translated gene sequences | intermediate result |
seub_genes6.pep | S. eubayanus translated gene sequences, reduced set | intermediate result (subset) |
scer_prot.fa | S. cerevisiae translated gene sequences | obtained from public data |
blastout.tab | S. eub x S. cer BLAST result | intermediate result |
seub_genes6.iprscan.tsv | InterProScan result for seub_genes6 | output (subset) |
seub_genes.emapper.annotations | EggNOG-mapper result for seub_genes | output; used in Exercise 803 |
We assume the following has already been done: RNA-seq reads were mapped to the genome sequence and assembled in a genome-guided manner, read counts were obtained for each transcript, and transcripts/genes showing significant expression changes were extracted with EdgeR. In other words, the first four files exist as the initial state, and annotation starts from there. Several intermediate result files are also provided so that steps along the way can be skipped. To avoid overwriting them, we recommend creating a separate directory, copying the files into it, and working there.
% cd ~/gitc/data/IU/
% mkdir ex1
% cp yeast/* ex1
% cd ex1
Step 1: Creating transcript sequences
stringtie produces a GTF file that records the coordinates of the transcripts but does not produce a sequence file, so the transcript sequences must first be generated from this GTF file and the genome sequence. This is done with the gffread command.
% gffread stringtie_merged.gtf -g seub_genome.fa -w seub_transcripts.fa
Step 2: Predicting coding regions (CDS) and creating translated sequences (to be skipped in the lecture)
Predict the CDSs from the transcript sequences just created. This is done with TransDecoder in two stages: extracting long ORFs, and then predicting CDSs from them.
% TransDecoder.LongOrfs -t seub_transcripts.fa
% TransDecoder.Predict -t seub_transcripts.fa --single_best_only --cpu 8
By default, when several non-overlapping CDSs are identified on a single transcript, each of them is output as a separate CDS. Because this makes post-processing somewhat cumbersome, we specify an option here that outputs only one CDS in such cases.
Three output files are created: seub_transcripts.fa.transdecoder.pep (amino acid sequences), and files with the suffixes .cds (nucleotide sequences) and .gff3 (coordinates of the CDSs on the transcripts). The amino acid sequence file is used in the subsequent analysis, but the gene names differ slightly from the transcript_id values in the original GTF file.
>DI49_1142.p1 GENE.DI49_1142~~DI49_1142.p1 ORF type:complete len:236 (-),score=50.10 DI49_1142:234-941(-)
MLPLIASRNRRPISLTIRKLFRTMSIVKGKPEEAKIVEARHVKDTSDCKWIGLQKIIYKD
PNGNEREWDSAVRTTRNSGGVDGIGILTILKYKDGKPDEILLQKQFRPPVEGVCIEMPAG
LIDAGEDVDTAALRELKEETGYKGKIISKSPTVFNDPGFTNTNLCLVTVEVDMSLPENQK
PVTQLEDNEFIECFSVELHKFPDEMVKLDQQGYKLDARVQNVAQGILMAKQYNIQ*
Since this will cause problems later, let us restore the names to the original transcript_id (this is possible because --single_best_only was specified earlier, which gives a one-to-one correspondence). It can be done with the string substitution command of seqkit (replace) as follows.
% seqkit replace -p '\.p[0-9]+ ' -r ' ' seub_transcripts.fa.transdecoder.pep | sed 's/\*//' > seub_genes.pep
Note that the original sequence file has a * at the end of each amino acid sequence to indicate the stop codon. Because this can cause problems with some software, it is removed at the same time with the sed command. After processing, the file looks like this.
>DI49_1142 Gene.6155::DI49_1142::g.6155 ORF type:complete len:236 (-) DI49_1142:234-941(-)
MLPLIASRNRRPISLTIRKLFRTMSIVKGKPEEAKIVEARHVKDTSDCKWIGLQKIIYKD
PNGNEREWDSAVRTTRNSGGVDGIGILTILKYKDGKPDEILLQKQFRPPVEGVCIEMPAG
LIDAGEDVDTAALRELKEETGYKGKIISKSPTVFNDPGFTNTNLCLVTVEVDMSLPENQK
PVTQLEDNEFIECFSVELHKFPDEMVKLDQQGYKLDARVQNVAQGILMAKQYNIQ
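If seqkit is not at hand, the same clean-up can be sketched with sed alone. The following is a minimal sketch on a toy record, not the procedure used in the lecture: the file names toy.transdecoder.pep and toy.pep and the shortened sequence are made up for illustration. Unlike the seqkit command, it rewrites only the ID at the start of the header line, and strips a `*` only at the end of a sequence line.

```shell
# Work in a scratch directory so that no real file is overwritten
workdir=$(mktemp -d)
cd "$workdir"

# Toy input imitating a TransDecoder .pep record (sequence shortened, hypothetical)
cat > toy.transdecoder.pep <<'EOF'
>DI49_1142.p1 GENE.DI49_1142~~DI49_1142.p1 ORF type:complete len:8 (-)
MLPLIAS*
EOF

# Header lines: drop the ".pN" suffix from the first token (the sequence ID).
# Other lines:  drop a trailing "*" (the stop-codon marker).
sed -E -e '/^>/s/^(>[^ ]+)\.p[0-9]+( |$)/\1\2/' \
       -e '/^>/!s/\*$//' toy.transdecoder.pep > toy.pep

cat toy.pep
```

The rest of the header is left untouched, so the original TransDecoder ID is still visible in the description if it is needed for cross-checking.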
Step 3: Homology search with BLAST (to be skipped in the lecture)
Using the amino acid sequences seub_genes.pep created in the previous step, run an all-against-all BLAST homology search against the amino acid sequences of the S. cerevisiae genome, scer_prot.fa (both sequence files are placed in the data directory mentioned above).
First, build a BLAST search database from scer_prot.fa.
% makeblastdb -in scer_prot.fa -dbtype prot -parse_seqids -out scer
As a result, several files whose names begin with scer are created. Use them to run the BLAST search.
% blastp -query seub_genes.pep -db scer -evalue 0.001 -outfmt "6 std stitle" -max_target_seqs 10 -num_threads 8 > blastout.tab
Step 4: Extracting best-hit relationships from the BLAST results
From the BLAST search results created in the previous step, extract only the single highest-scoring hit (the best hit) for each query sequence. This relies on the search results already being ordered by score within each query sequence: using the stable option (-s) and the unique option (-u) of the sort command preserves the original order and outputs only the first hit for each query sequence.
% sort -s -k 1,1 -u blastout.tab > blast_top.tab
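The command above depends on the hits of each query already being listed best first. As a hedged sketch, the ordering can instead be made explicit before de-duplicating, assuming the standard 12-column tabular layout in which column 11 is the E-value and column 12 the bit score. The toy rows and the file names toy_blast.tab and toy_top.tab are invented for illustration.

```shell
export LC_ALL=C                      # make sort behave the same in every locale
workdir=$(mktemp -d)
cd "$workdir"
tab=$(printf '\t')

# Toy rows in the 12 standard columns, deliberately NOT ordered best first
fmt='%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n'
printf "$fmt" \
  q1 sB 80.0 100 20 0 1 100 1 100 1e-20 150 \
  q1 sA 95.0 100  5 0 1 100 1 100 1e-50 300 \
  q2 sA 70.0 100 30 0 1 100 1 100 1e-10  90 > toy_blast.tab

# Order by query, then E-value ascending, then bit score descending;
# the second sort keeps only the first row of each query.
sort -t "$tab" -k 1,1 -k 11,11g -k 12,12nr toy_blast.tab \
  | sort -t "$tab" -s -k 1,1 -u > toy_top.tab

cut -f 1,2 toy_top.tab
```

Passing `-t` with a tab makes the column numbers robust even when a later column such as stitle contains spaces.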
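In ortholog identification based on pairwise genome comparison, accuracy can be improved by checking not only the one-directional best hit but also the bidirectional best hit. To do this, first extract the best hits in the reverse direction: reorder the results by score within each database sequence (S. cerevisiae), then extract unique sequences in the same way as in the previous command.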
% sort -s -k 2,2 -k 11,11g -k 12,12nr blastout.tab | sort -s -k 2,2 -u > blast_top_rev.tab
Then sort the best hits of both directions together and output the duplicated lines (uniq -d). This extracts the bidirectional best hits.
% sort blast_top.tab blast_top_rev.tab | uniq -d > blast_bbh.tab
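To see what the three commands do together, the following sketch runs them on a four-row toy table. All values and the toy_* file names are invented for illustration; the rows are already ordered best first within each query, as the text assumes for BLAST output.

```shell
export LC_ALL=C                      # make sort behave the same in every locale
workdir=$(mktemp -d)
cd "$workdir"

# Toy BLAST table: q1 hits sA and sB, q2 hits sA, q3 hits sB
fmt='%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n'
printf "$fmt" \
  q1 sA 95.0 100  5 0 1 100 1 100 1e-50 300 \
  q1 sB 80.0 100 20 0 1 100 1 100 1e-20 150 \
  q2 sA 70.0 100 30 0 1 100 1 100 1e-10  90 \
  q3 sB 90.0 100 10 0 1 100 1 100 1e-30 200 > toy_blast.tab

# Forward best hit: first row of each query
sort -s -k 1,1 -u toy_blast.tab > toy_top.tab

# Reverse best hit: best row of each database sequence
sort -s -k 2,2 -k 11,11g -k 12,12nr toy_blast.tab \
  | sort -s -k 2,2 -u > toy_top_rev.tab

# Rows present in both lists are the bidirectional best hits
sort toy_top.tab toy_top_rev.tab | uniq -d > toy_bbh.tab

cut -f 1,2 toy_bbh.tab
```

In this toy case q2 is dropped: its best hit is sA, but the best hit of sA is q1, so the pair is not reciprocal.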
Step 5: Homology search with DIAMOND (optional, but this is the one carried out in the lecture)
DIAMOND is somewhat less accurate than BLAST but overwhelmingly faster, and is therefore widely used for large-scale searches. Its usage is very similar to that of BLAST, with two differences to keep in mind: each command is invoked as a subcommand of diamond, and options are given with two hyphens. The second command is long, so scroll horizontally to see all of it. Note that the default E-value threshold in diamond is 0.001, so the --evalue option can be omitted if that value is acceptable.
% diamond makedb --in scer_prot.fa --db scer
% diamond blastp --query seub_genes.pep --db scer --max-target-seqs 10 --outfmt 6 qseqid sseqid pident evalue bitscore stitle --threads 4 --out diamondout.tab
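(Additional exercise) Running the commands above produces a result file named diamondout.tab. Using this result, extract the best hits and the bidirectional best hits as in the previous section. Note that fewer output columns are requested than in the BLAST search, so the column positions are shifted.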
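One way to work out the shifted positions is to number the field names given to --outfmt. The following is a small sketch; the output file name diamond_columns.txt is made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Field names exactly as passed to --outfmt 6 in the diamond command
fields="qseqid sseqid pident evalue bitscore stitle"

# Print "column-number <TAB> field-name" for each field
n=0
for f in $fields; do
  n=$((n + 1))
  printf '%d\t%s\n' "$n" "$f"
done > diamond_columns.txt

cat diamond_columns.txt
```

The numbers printed next to evalue and bitscore are the ones to use in place of 11 and 12 in the sort keys. Because stitle can contain spaces, giving sort a tab as the field separator is the safer choice.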
Step 6: Motif/domain search with InterProScan (optional; to be skipped in the lecture)
Motif/domain searches are often carried out together with homology searches as a way of obtaining more detailed clues for function prediction. InterProScan is widely used because it can search a wide range of databases at once with a single command. The search takes time, so if you want to try it, use the reduced set seub_genes6.pep, which contains only a small fraction of the query sequences. Note that InterProScan is managed as a module on bias5, so if it is not on your path, load it with the module add command.
% module add interproscan
% interproscan.sh -i seub_genes6.pep -b seub_genes6.iprscan -goterms -pa --cpu 4
The results are written to several files whose names begin with seub_genes6.iprscan. Of these, seub_genes6.iprscan.tsv is placed in the directory mentioned above.
Step 7: Ortholog search with EggNOG mapper
EggNOG mapper identifies orthologs of the query sequences using precomputed orthologous groups and phylogenetic tree information, and assigns annotations based on them. It uses DIAMOND as its search engine, but the search still takes time because the database is large. If you want to try it, use the reduced set seub_genes6.pep.
% emapper.py -i seub_genes6.pep -o seub_genes6 -m diamond --cpu 6
The result of running it on the full sequence set is provided as seub_genes.emapper.annotations. Use it to extract the text descriptions and the GO lists.
% cut -f1,8 seub_genes.emapper.annotations | grep -v ^# > seub_genes.emapper.tit
% cut -f1,13 seub_genes.emapper.annotations | grep -v ^# > seub_genes.emapper.go
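Downstream tools often expect one gene and GO term pair per line rather than a comma-separated list. The following is a hedged sketch of such a conversion on toy data. It assumes a two-column file like the .go file above, with GO terms separated by commas and `-` or an empty field meaning no annotation; the gene IDs, GO assignments, and file names toy.emapper.go and toy.gene2go.tab are invented for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Toy two-column input: gene ID <TAB> comma-separated GO terms ("-" = none)
printf 'g1\tGO:0005575,GO:0008150\n'  > toy.emapper.go
printf 'g2\t-\n'                     >> toy.emapper.go
printf 'g3\tGO:0003674\n'            >> toy.emapper.go

# Skip genes without GO terms and emit one "gene <TAB> GO" pair per line
awk -F '\t' '$2 != "" && $2 != "-" {
  n = split($2, go, ",")
  for (i = 1; i <= n; i++) print $1 "\t" go[i]
}' toy.emapper.go > toy.gene2go.tab

cat toy.gene2go.tab
```

Whether this long format is needed depends on the enrichment tool used later; check the column layout of your own emapper output before relying on it.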
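(Advanced exercise) Running large searches quickly on a cluster computer
Search speed can be increased by splitting the query and running the parts in parallel. If you would like to run a search on the full sequence set yourself, split the sequences and run it on the BIAS cluster system. A command for splitting sequences, split_seq.pl, is provided, so run it as shown here. The -BLOCK_SIZE option specifies the total sequence length that forms one unit of the split.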
% split_seq.pl -BLOCK_SIZE=10000 seub_genes.pep
The split sequence files are stored in a directory named query_seub_genes (in this case the input is split into about 300 files). A file named qsub_blast.sh is created as well. Among the commands at the end of that file, comment out blastp and uncomment emapper. (The other commands can be run in the same way by uncommenting them, but for blast and diamond the DB variable set above them must be given an appropriate name, here scer.)
#blastp -db $DB -query $INFILE -out $RESULT_OUT_DIR/blastp_$OUTFILE -num_threads $NCPUS -outfmt 6 -evalue 0.001
#diamond blastp --db $DB --query $INFILE --out $RESULT_OUT_DIR/diamond_$OUTFILE --threads $NCPUS
emapper.py -m diamond --no_annot --no_file_comments -i $INFILE -o $RESULT_OUT_DIR/emapper_$OUTFILE --cpu $NCPUS
#interproscan.sh -goterms -pa -i $INFILE -b $RESULT_OUT_DIR/iprscan_$OUTFILE -cpu $NCPUS
Run it with the following command.
% qsub qsub_blast.sh
The job status can be checked with qstat -u username (without -u, the jobs of all users are shown). When nothing is displayed any more, the run has finished. If it finishes too quickly, it has most likely failed. The standard error output of the executed commands is recorded in a file named blastjob.e#### created in the current directory, so refer to it to resolve the problem.
When the run completes successfully, the search results for each split query file are stored under the output_seub_genes directory. Concatenate the files there with the cat command into a single file in the current directory. In the case of EggNOG mapper, only the seed ortholog search has been done at this stage (because --no_annot was specified), so run emapper again on this result to assign the annotations.
% cat output_seub_genes/*.seed_orthologs > input.emapper.seed_orthologs
% emapper.py --annotate_hits_table input.emapper.seed_orthologs --no_file_comments -o seub_genes --cpu 10
The result is written to a file named seub_genes.emapper.annotations.