Ribo seq based ORF prediction - wdlingit/cop GitHub Wiki

This document describes steps for ORF prediction based on Ribo-seq evidence. This can be considered as a subsequent step following this and related with this.

Please raise an issue if there is any questions.

Collect Ribo-seq footprints

Suppose that root-ctr-0h-r1.bam is a Ribo-seq BAM file sorted by names, and target.model was made following this step. Java program rnaseq.GeneCoverageArray can be used for collecting Ribo-seq footprints in genome coordinates for individual gene models.

ubuntu@test:~$ java -classpath riboFMC/jar/rackj.jar:riboFMC/jar/sam-1.89.jar rnaseq.GeneCoverageArray -GFF DB/target.model -M SAMun root-ctr-0h-r1.bam -O root-ctr-0h-r1 -multi true -direction true
program: GeneCoverageArray
canonical GFF filename (-GFF): DB/target.model
mapping method/filename (-M):
  SAMun : root-ctr-0h-r1.bam
output prefix (-O): root-ctr-0h-r1
block join factor (-J): 2
identity cutoff (-ID): 0.95
use exon region (-exon): true
check by containing (-contain, FALSE for by intersecting): false
minimum overlap (-min): 8
check all alignment blocks (-ALL): true
include multi reads (-multi): true
care mapping direction (-direction): true
fill gaps between alignment blocks (-fill): false

#uniq reads: 1660545
#multi reads: 1450272

The coordinate file target.model is storing genomic coordinates of gene models so some gene locus may have multiple gene models overlapping each other. In this example, option -multi was set as true to accept multi reads because genomic uniq reads may overlap multiple gene models and be considered as multi in the computation. Accordingly, numbers of uniq reads and multi reads are about 3M, which is about the same as the number in the previous example.

The output file root-ctr-0h-r1.geneCoverage is storing read depth of Ribo-seq reads following the gene model orientation (option -direction true) for each mapped gene model. Note that the numeric arrays are following chromosome orientation.

#GeneID chr     start   readCnt CoverageArray   format:.geneCoverage
AT1G01010.1     Chr1    3631    99      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
AT1G01020.1     Chr1    6788    16      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
AT1G01020.2     Chr1    6788    16      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
AT1G01020.3     Chr1    6788    16      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
AT1G01020.4     Chr1    6788    16      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
AT1G01020.5     Chr1    6788    16      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

NOTE: It would be a good idea to collect all Ribo-seq BAM files together for computing aggregated footprints. When the result is not satisfactory, a dummy .geneCoverage file should be considered.

ubuntu@test:~$ cat DB/target.model | perl -ne 'chomp; if($.==1){print "#GeneID\tchr\tstart\treadCnt\tCoverageArray\n"} if(/>/){ @t=split; $regionLen=$t[3]-$t[2]+1; @arr=(1) x $regionLen; print substr($t[0],1) . "\t$t[1]\t$t[2]\t1\t[" . join(", ",@arr) . "]\n" }' > target.dummy.geneCoverage

Arrays in the dummy file would be filled with all 1's.

#GeneID chr     start   readCnt CoverageArray
AT1G01010.1     Chr1    3631    1       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
AT1G01020.2     Chr1    6788    1       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
AT1G01020.6     Chr1    6788    1       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
AT1G01020.1     Chr1    6788    1       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
AT1G01020.3     Chr1    6788    1       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
AT1G01020.4     Chr1    6788    1       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
AT1G01020.5     Chr1    6788    1       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

ORF prediction

With a .geneCoverage file and genome FASTA, the script RPFpeptidePredict.pl in this repository can be used for ORF prediction. This script looks for ATGs from upstream 130bps (about 4x4x4x2) to downstream 130bps of regions evidenced by Ribo-seq (given by a .geneCoverage file), and reports predicted peptides from an ATG to its first stop codon.

ubuntu@test:~$ git clone https://github.com/wdlingit/cop.git

ubuntu@test:~$ ./cop/scripts/RPFpeptidePredict.pl
Usage: RPFpeptidePredict.pl <cdsCGFF> <genomeFasta> <RPFcoverageArray> <outFilename>

ubuntu@test:~$ ./cop/scripts/RPFpeptidePredict.pl DB/target.model DB/TAIR10_chr_all.fas target.dummy.geneCoverage predict.orf.tsv

ubuntu@test:~$ head -3 predict.orf.tsv
#geneID chr     start   stop    RPF_start       RPF_stop        orfID   AA_start        AA_stop AA_seq
AT1G01010.1     Chr1    3631    5899    3631    5899    ORF_1   3760    5630    MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDPWNLRFQSKYKSRDAMWYFFSRRENNKGNRQSRTTVSGKWKLTGESVEVKDQWGFCSEGFRGKIGHKRVLVFLDGRYPDKTKSDWVIHEFHYDLLPEHQRTYVICRLEYKGDDADILSAYAIDPTPAFVPNMTSSAGSVVNQSRQRNSGSYNTYSEYDSANHGQQFNENSNIMQQQPLQGSFNPLLEYDFANHGGQWLSDYIDLQQQVPYLAPYENESEMIWKHVIEENFEFLVDERTSMQQHYSDHRPKKPVSGVLPDDSSDTETGSMIFEDTSSSTDSVGSSDEPGHTRIDDIPSLNIIEPLHNYKAQEQPKQQSKEKVISSQKSECEWKMAEDSIKIPPSTNTVKQSWIVLENAQWNYLKNMIIGVLLFISVISWIILVG*
AT1G01010.1     Chr1    3631    5899    3631    5899    ORF_2   4020    4555    MLCGTSSLVEKTTKGIDRAGQRFLVNGSLPENLLRSRTSGDFVVRAFVVRLVIKGFWCSSMEDTLTKPNLIGLSTSSTTTSYQNIRGHMSSADLSTRVMMRTFYLLMQ*

Just for example, we may replace DB/target.model (genome coordinates of transcripts of coding mRNAs, lnc_RNAs, and ncRNAs) with a file of a specific subset, say, ncRNAs.

ubuntu@test:~$ java -classpath riboFMC/jar/rackj.jar misc.ModelCGFF -GFF3 DB/Araport11_GFF3_genes_transposons.201606.gff -O ncRNA -GRE gene ncRNA exon -IP true
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): DB/Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, ncRNA, exon]]
intron preserving merged model (-IP)): true
representative list (-rep): null
output prefix (-O): ncRNA

ubuntu@test:~$ cat ncRNA.model | perl -ne 'chomp; if($.==1){print "#GeneID\tchr\tstart\treadCnt\tCoverageArray\n"} if(/>/){ @t=split; $regionLen=$t[3]-$t[2]+1; @arr=(1) x $regionLen; print substr($t[0],1) . "\t$t[1]\t$t[2]\t1\t[" . join(", ",@arr) . "]\n" }' > ncRNA.dummy.geneCoverage

ubuntu@test:~$ ./cop/scripts/RPFpeptidePredict.pl ncRNA.model DB/TAIR10_chr_all.fas ncRNA.dummy.geneCoverage ncRNA.orf.tsv

ubuntu@test:~$ head -3 ncRNA.orf.tsv
#geneID chr     start   stop    RPF_start       RPF_stop        orfID   AA_start        AA_stop AA_seq
AT1G04425.1     Chr1    1194187 1196889 1194187 1196889 ORF_1   1194188 1194193 M*
AT1G04425.1     Chr1    1194187 1196889 1194187 1196889 ORF_2   1194194 1195450 METLKTLVSSCPSLFLHQIRRQNAANHKEYC*

It is rather reasonable to filter lines according to length of predicted AA sequence (last column).

⚠️ **GitHub.com Fallback** ⚠️