Riboseq multi stage mapping and filtering, part 4 - wdlingit/cop GitHub Wiki
Expression level computation
Following steps in Filtering and read demultiplexing, we have filtered and demultiplexed read alignment BAM files multiply1/*.name.bam
. To compute Riboseq expression levels of mRNA, lnc_RNA, and ncRNA, we observed the feature hierarchy inside Araport11 GFF3 and extract coordinates of mRNAs, lnc_RNAs, and ncRNAs. Coordinates were saved in file DB/base0.cgff
.
wdlin@comp01:somewhere/DB$ cat Araport11_GFF3_genes_transposons.Jun2016.gff.features
GffRoot
gene*
mRNA*
five_prime_UTR*
exon*
CDS*
three_prime_UTR*
protein*
lnc_RNA*
exon*
antisense_lncRNA*
exon*
miRNA_primary_transcript*
miRNA*
transcript_region*
exon*
tRNA*
exon*
antisense_RNA*
exon*
snoRNA*
exon*
ncRNA*
exon*
snRNA*
exon*
uORF*
rRNA*
exon*
transposable_element*
transposon_fragment*
pseudogene*
pseudogenic_transcript*
pseudogenic_exon*
pseudogenic_tRNA*
pseudogenic_exon*
transposable_element_gene*
mRNA*
exon*
wdlin@comp01:somewhere/DB$ java -classpath rackj.jar misc.ModelCGFF -GFF3 Araport11_GFF3_genes_transposons.Jun2016.gff \
-GRE gene mRNA:lnc_RNA:ncRNA five_prime_UTR:exon:three_prime_UTR:CDS \
-O base0 \
-IP true
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): /nas/Projects/DataSource/A.Thaliana/Araport11/Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [gene, mRNA:lnc_RNA:ncRNA, five_prime_UTR:exon:three_prime_UTR:CDS](/wdlingit/cop/wiki/gene,-mRNA:lnc_RNA:ncRNA,-five_prime_UTR:exon:three_prime_UTR:CDS)
intron preserving merged model (-IP)): true
representative list (-rep): null
output prefix (-O): base0
wdlin@comp01:somewhere/DB$ head base0.cgff
>AT1G01010 Chr1 3631 5899 +
3631 3913
3996 4276
4486 4605
4706 5095
5174 5326
5439 5899
>AT1G01020 Chr1 6788 9130 -
6788 7069
7157 7232
wdlin@comp01:somewhere/DB$ cat base0.cgff | grep ">" | wc -l
30320
The following perl oneliner was applied to generate commands that computes Riboseq expression levels for genes stored in DB/base0.cgff
.
wdlin@comp01:somewhere$ mkdir multiply1/log
wdlin@comp01:somewhere$ ls multiply1/*.name.bam | perl -ne '
chomp;
/(.+?)\./;
push @{$hash{$1}},$_;
if(eof){
for $key (sort keys %hash){
$key=~/(.+)\/(.+)/;
$cmd="java -classpath rackj.jar:sam-1.89.jar rnaseq.RPKMComputer -GFF DB/base0.cgff -M SAM ".join(" -M SAM ",@{$hash{$key}})." -O $key.undir -ID 0.90 -direction false -min 2 -ALL true -geneOnly true >> $1/log/$2.undir.log";
print "\nCMD: $cmd\n";
#system $cmd
}
}
'
CMD: java -classpath rackj.jar:sam-1.89.jar rnaseq.RPKMComputer -GFF DB/base0.cgff -M SAM multiply1/root-ctr-0h-r1.name.bam -O multiply1/root-ctr-0h-r1.undir -ID 0.90 -direction false -min 2 -ALL true -geneOnly true >> multiply1/log/root-ctr-0h-r1.undir.log
(...)
Points to be noticed for java program rnaseq.RPKMComputer
(taking the first command for example):
-O multiply1/root-ctr-0h-r1.undir
assigned output prefix so the output RPKM file would bemultiply1/root-ctr-0h-r1.undir.geneRPKM
.-ID 0.90
required alignments to be with identity (#matches/read length) >=0.9. Recalled that we applied identity threshold 0.95 in all the mapping stages so this option should be less effective.-direction false
asked not to consider read orientation and the gene orientation. If this option was set to betrue
, only reads with the same orientation with a gene would be counted. In our this example, the difference between the two settings was very small.-min 2
asked the minimum overlapping between alignment blocks and exon blocks (of genes) to be 2bps. Recalled that we mapped Riboseq reads to Araport11 transcripts and then transformed the alignments to transcripts to be alignments to the genome. Considering intronic gaps, it is possible that a Riboseq read nicely mapped to a transcript has a small alignment block that intersecting exons with only 2bps.-ALL true
asked to check all alignment blocks.-geneOnly true
take only reads mapping to genes as the denominator when computing RPKM values, but not all reads mapping to the genome. This setting was to avoid any non-filtered reads from unknown rRNA/tRNA regions to affect the RPKM computation.
As log files were saved in multiply1/log/*.undir.log
, the following perl oneliner reported us numbers of reads that were uniquely mapped/multiply mapped/mapped to mRNA-, lnc_RNA-, and ncRNA-genes.
wdlin@comp01:somewhere$ ls multiply1/log/*.undir.log | perl -ne '
chomp;
/.+\/(.+?)\./;
print "FILE: $1\n";
system "tail $_"
' | perl -ne '
chomp;
if(/^FILE: (\S+)/){
$file=$1
}elsif(/^#(\S+).+?(\d+)/){
$hash{$file}{$1}=$2
}
if(eof){
for $f (sort keys %hash){
print "$f"; for $c ("uniq","multi","mapped"){
print "\t$hash{$f}{$c}"
}
print "\n"
}
}
'
root-ctr-0h-r1 3205509 243267 3448776
root-ctr-0h-r2 2099518 212015 2311533
root-ctr-0h-r3 832936 78579 911515
root-ctr-0h-r4 3896959 259302 4156261
shoot-ctr-0h-r1 3749716 320559 4070275
shoot-ctr-0h-r2 2637257 268837 2906094
shoot-ctr-0h-r3 4979891 379868 5359759
shoot-ctr-0h-r4 1338573 130361 1468934
The next command computed pearson correlations between every two samples based on log-RPKM values.
wdlin@comp01:somewhere/multiply1$ RPKMcorrelation.pl -log *.undir.geneRPKM > correlation.log.xls