Ribo seq based ORF prediction - wdlingit/cop GitHub Wiki
This document describes steps for ORF prediction based on Ribo-seq evidence. This can be considered as a subsequent step following this and related with this.
Please raise an issue if there is any questions.
Suppose that root-ctr-0h-r1.bam
is a Ribo-seq BAM file sorted by names, and target.model
was made following this step. Java program rnaseq.GeneCoverageArray
can be used for collecting Ribo-seq footprints in genome coordinates for individual gene models.
ubuntu@test:~$ java -classpath riboFMC/jar/rackj.jar:riboFMC/jar/sam-1.89.jar rnaseq.GeneCoverageArray -GFF DB/target.model -M SAMun root-ctr-0h-r1.bam -O root-ctr-0h-r1 -multi true -direction true
program: GeneCoverageArray
canonical GFF filename (-GFF): DB/target.model
mapping method/filename (-M):
SAMun : root-ctr-0h-r1.bam
output prefix (-O): root-ctr-0h-r1
block join factor (-J): 2
identity cutoff (-ID): 0.95
use exon region (-exon): true
check by containing (-contain, FALSE for by intersecting): false
minimum overlap (-min): 8
check all alignment blocks (-ALL): true
include multi reads (-multi): true
care mapping direction (-direction): true
fill gaps between alignment blocks (-fill): false
#uniq reads: 1660545
#multi reads: 1450272
The coordinate file target.model
is storing genomic coordinates of gene models so some gene locus may have multiple gene models overlapping each other. In this example, option -multi
was set as true to accept multi reads because genomic uniq reads may overlap multiple gene models and be considered as multi in the computation. Accordingly, numbers of uniq reads and multi reads are about 3M, which is about the same as the number in the previous example.
The output file root-ctr-0h-r1.geneCoverage
is storing read depth of Ribo-seq reads following the gene model orientation (option -direction true
) for each mapped gene model. Note that the numeric arrays are following chromosome orientation.
#GeneID chr start readCnt CoverageArray format:.geneCoverage
AT1G01010.1 Chr1 3631 99 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
AT1G01020.1 Chr1 6788 16 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
AT1G01020.2 Chr1 6788 16 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
AT1G01020.3 Chr1 6788 16 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
AT1G01020.4 Chr1 6788 16 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
AT1G01020.5 Chr1 6788 16 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
NOTE: It would be a good idea to collect all Ribo-seq BAM files together for computing aggregated footprints. When the result is not satisfactory, a dummy .geneCoverage file should be considered.
ubuntu@test:~$ cat DB/target.model | perl -ne 'chomp; if($.==1){print "#GeneID\tchr\tstart\treadCnt\tCoverageArray\n"} if(/>/){ @t=split; $regionLen=$t[3]-$t[2]+1; @arr=(1) x $regionLen; print substr($t[0],1) . "\t$t[1]\t$t[2]\t1\t[" . join(", ",@arr) . "]\n" }' > target.dummy.geneCoverage
Arrays in the dummy file would be filled with all 1's.
#GeneID chr start readCnt CoverageArray
AT1G01010.1 Chr1 3631 1 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
AT1G01020.2 Chr1 6788 1 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
AT1G01020.6 Chr1 6788 1 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
AT1G01020.1 Chr1 6788 1 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
AT1G01020.3 Chr1 6788 1 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
AT1G01020.4 Chr1 6788 1 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
AT1G01020.5 Chr1 6788 1 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
With a .geneCoverage file and genome FASTA, the script RPFpeptidePredict.pl
in this repository can be used for ORF prediction. This script looks for ATGs from upstream 130bps (about 4x4x4x2) to downstream 130bps of regions evidenced by Ribo-seq (given by a .geneCoverage file), and reports predicted peptides from an ATG to its first stop codon.
ubuntu@test:~$ git clone https://github.com/wdlingit/cop.git
ubuntu@test:~$ ./cop/scripts/RPFpeptidePredict.pl
Usage: RPFpeptidePredict.pl <cdsCGFF> <genomeFasta> <RPFcoverageArray> <outFilename>
ubuntu@test:~$ ./cop/scripts/RPFpeptidePredict.pl DB/target.model DB/TAIR10_chr_all.fas target.dummy.geneCoverage predict.orf.tsv
ubuntu@test:~$ head -3 predict.orf.tsv
#geneID chr start stop RPF_start RPF_stop orfID AA_start AA_stop AA_seq
AT1G01010.1 Chr1 3631 5899 3631 5899 ORF_1 3760 5630 MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDPWNLRFQSKYKSRDAMWYFFSRRENNKGNRQSRTTVSGKWKLTGESVEVKDQWGFCSEGFRGKIGHKRVLVFLDGRYPDKTKSDWVIHEFHYDLLPEHQRTYVICRLEYKGDDADILSAYAIDPTPAFVPNMTSSAGSVVNQSRQRNSGSYNTYSEYDSANHGQQFNENSNIMQQQPLQGSFNPLLEYDFANHGGQWLSDYIDLQQQVPYLAPYENESEMIWKHVIEENFEFLVDERTSMQQHYSDHRPKKPVSGVLPDDSSDTETGSMIFEDTSSSTDSVGSSDEPGHTRIDDIPSLNIIEPLHNYKAQEQPKQQSKEKVISSQKSECEWKMAEDSIKIPPSTNTVKQSWIVLENAQWNYLKNMIIGVLLFISVISWIILVG*
AT1G01010.1 Chr1 3631 5899 3631 5899 ORF_2 4020 4555 MLCGTSSLVEKTTKGIDRAGQRFLVNGSLPENLLRSRTSGDFVVRAFVVRLVIKGFWCSSMEDTLTKPNLIGLSTSSTTTSYQNIRGHMSSADLSTRVMMRTFYLLMQ*
Just for example, we may replace DB/target.model (genome coordinates of transcripts of coding mRNAs, lnc_RNAs, and ncRNAs) with a file of a specific subset, say, ncRNAs.
ubuntu@test:~$ java -classpath riboFMC/jar/rackj.jar misc.ModelCGFF -GFF3 DB/Araport11_GFF3_genes_transposons.201606.gff -O ncRNA -GRE gene ncRNA exon -IP true
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): DB/Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, ncRNA, exon]]
intron preserving merged model (-IP)): true
representative list (-rep): null
output prefix (-O): ncRNA
ubuntu@test:~$ cat ncRNA.model | perl -ne 'chomp; if($.==1){print "#GeneID\tchr\tstart\treadCnt\tCoverageArray\n"} if(/>/){ @t=split; $regionLen=$t[3]-$t[2]+1; @arr=(1) x $regionLen; print substr($t[0],1) . "\t$t[1]\t$t[2]\t1\t[" . join(", ",@arr) . "]\n" }' > ncRNA.dummy.geneCoverage
ubuntu@test:~$ ./cop/scripts/RPFpeptidePredict.pl ncRNA.model DB/TAIR10_chr_all.fas ncRNA.dummy.geneCoverage ncRNA.orf.tsv
ubuntu@test:~$ head -3 ncRNA.orf.tsv
#geneID chr start stop RPF_start RPF_stop orfID AA_start AA_stop AA_seq
AT1G04425.1 Chr1 1194187 1196889 1194187 1196889 ORF_1 1194188 1194193 M*
AT1G04425.1 Chr1 1194187 1196889 1194187 1196889 ORF_2 1194194 1195450 METLKTLVSSCPSLFLHQIRRQNAANHKEYC*
It is rather reasonable to filter lines according to length of predicted AA sequence (last column).