Riboseq multi stage mapping and filtering, part 4 - wdlingit/cop GitHub Wiki

Expression level computation

Following steps in Filtering and read demultiplexing, we have filtered and demultiplexed read alignment BAM files multiply1/*.name.bam. To compute Riboseq expression levels of mRNA, lnc_RNA, and ncRNA, we observed the feature hierarchy inside Araport11 GFF3 and extract coordinates of mRNAs, lnc_RNAs, and ncRNAs. Coordinates were saved in file DB/base0.cgff.

wdlin@comp01:somewhere/DB$ cat Araport11_GFF3_genes_transposons.Jun2016.gff.features
GffRoot
  gene*
    mRNA*
      five_prime_UTR*
      exon*
      CDS*
      three_prime_UTR*
      protein*
    lnc_RNA*
      exon*
    antisense_lncRNA*
      exon*
    miRNA_primary_transcript*
      miRNA*
    transcript_region*
      exon*
    tRNA*
      exon*
    antisense_RNA*
      exon*
    snoRNA*
      exon*
    ncRNA*
      exon*
    snRNA*
      exon*
    uORF*
    rRNA*
      exon*
  transposable_element*
    transposon_fragment*
  pseudogene*
    pseudogenic_transcript*
      pseudogenic_exon*
    pseudogenic_tRNA*
      pseudogenic_exon*
  transposable_element_gene*
    mRNA*
      exon*

wdlin@comp01:somewhere/DB$ java -classpath rackj.jar misc.ModelCGFF -GFF3 Araport11_GFF3_genes_transposons.Jun2016.gff \
                           -GRE gene mRNA:lnc_RNA:ncRNA five_prime_UTR:exon:three_prime_UTR:CDS \
                           -O base0 \
                           -IP true
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): /nas/Projects/DataSource/A.Thaliana/Araport11/Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [gene, mRNA:lnc_RNA:ncRNA, five_prime_UTR:exon:three_prime_UTR:CDS](/wdlingit/cop/wiki/gene,-mRNA:lnc_RNA:ncRNA,-five_prime_UTR:exon:three_prime_UTR:CDS)
intron preserving merged model (-IP)): true
representative list (-rep): null
output prefix (-O): base0

wdlin@comp01:somewhere/DB$ head base0.cgff
>AT1G01010      Chr1    3631    5899    +
3631    3913
3996    4276
4486    4605
4706    5095
5174    5326
5439    5899
>AT1G01020      Chr1    6788    9130    -
6788    7069
7157    7232

wdlin@comp01:somewhere/DB$ cat base0.cgff | grep ">" | wc -l
30320

The following perl oneliner was applied to generate commands that computes Riboseq expression levels for genes stored in DB/base0.cgff.

wdlin@comp01:somewhere$ mkdir multiply1/log

wdlin@comp01:somewhere$ ls multiply1/*.name.bam | perl -ne '
    chomp; 
    /(.+?)\./; 
    push @{$hash{$1}},$_; 
    if(eof){ 
        for $key (sort keys %hash){ 
            $key=~/(.+)\/(.+)/; 
            $cmd="java -classpath rackj.jar:sam-1.89.jar rnaseq.RPKMComputer -GFF DB/base0.cgff -M SAM ".join(" -M SAM ",@{$hash{$key}})." -O $key.undir -ID 0.90 -direction false -min 2 -ALL true -geneOnly true >> $1/log/$2.undir.log"; 
            print "\nCMD: $cmd\n"; 
            #system $cmd 
        } 
    }
'

CMD: java -classpath rackj.jar:sam-1.89.jar rnaseq.RPKMComputer -GFF DB/base0.cgff -M SAM multiply1/root-ctr-0h-r1.name.bam -O multiply1/root-ctr-0h-r1.undir -ID 0.90 -direction false -min 2 -ALL true -geneOnly true >> multiply1/log/root-ctr-0h-r1.undir.log
(...)

Points to be noticed for java program rnaseq.RPKMComputer (taking the first command for example):

  1. -O multiply1/root-ctr-0h-r1.undir assigned output prefix so the output RPKM file would be multiply1/root-ctr-0h-r1.undir.geneRPKM.
  2. -ID 0.90 required alignments to be with identity (#matches/read length) >=0.9. Recalled that we applied identity threshold 0.95 in all the mapping stages so this option should be less effective.
  3. -direction false asked not to consider read orientation and the gene orientation. If this option was set to be true, only reads with the same orientation with a gene would be counted. In our this example, the difference between the two settings was very small.
  4. -min 2 asked the minimum overlapping between alignment blocks and exon blocks (of genes) to be 2bps. Recalled that we mapped Riboseq reads to Araport11 transcripts and then transformed the alignments to transcripts to be alignments to the genome. Considering intronic gaps, it is possible that a Riboseq read nicely mapped to a transcript has a small alignment block that intersecting exons with only 2bps.
  5. -ALL true asked to check all alignment blocks.
  6. -geneOnly true take only reads mapping to genes as the denominator when computing RPKM values, but not all reads mapping to the genome. This setting was to avoid any non-filtered reads from unknown rRNA/tRNA regions to affect the RPKM computation.

As log files were saved in multiply1/log/*.undir.log, the following perl oneliner reported us numbers of reads that were uniquely mapped/multiply mapped/mapped to mRNA-, lnc_RNA-, and ncRNA-genes.

wdlin@comp01:somewhere$ ls multiply1/log/*.undir.log | perl -ne '
    chomp; 
    /.+\/(.+?)\./; 
    print "FILE: $1\n"; 
    system "tail $_"
' | perl -ne '
    chomp; 
    if(/^FILE: (\S+)/){ 
        $file=$1 
    }elsif(/^#(\S+).+?(\d+)/){ 
        $hash{$file}{$1}=$2 
    } 
    
    if(eof){ 
        for $f (sort keys %hash){ 
            print "$f"; for $c ("uniq","multi","mapped"){ 
                print "\t$hash{$f}{$c}" 
            } 
            print "\n" 
        } 
    }
'
root-ctr-0h-r1  3205509 243267  3448776
root-ctr-0h-r2  2099518 212015  2311533
root-ctr-0h-r3  832936  78579   911515
root-ctr-0h-r4  3896959 259302  4156261
shoot-ctr-0h-r1 3749716 320559  4070275
shoot-ctr-0h-r2 2637257 268837  2906094
shoot-ctr-0h-r3 4979891 379868  5359759
shoot-ctr-0h-r4 1338573 130361  1468934

The next command computed pearson correlations between every two samples based on log-RPKM values.

wdlin@comp01:somewhere/multiply1$ RPKMcorrelation.pl -log *.undir.geneRPKM > correlation.log.xls