Ribo seq read mapping, customization example - wdlingit/cop GitHub Wiki

This document describes a number of customization bioinformatics steps of Ribo-seq read mapping and expression level measurement, specifically for non-model organisms with non-enriched genome annotation, by taking Arabidopsis thaliana as an example. Refer the environment setting if needed.

Please raise an issue if there is any questions.

Assuming that a fastq.gz file of adapter removed reads of one single arabidopsis Ribo-seq sample to be processed. The aim is to remove tRNA, rRNA, snoRNA, snRNA, and miRNA reads first, and to take all other reads for counting.

Removing reads from tRNA, rRNA, snoRNA, snRNA, and miRNA

Java program misc.GffTree was adopted for parsing feature tree of the given GFF3 file. This usually helps me to decide how to extract sequences of specific genomic features.

ubuntu@ubuntu20:~/DB$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.GffTree -I Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]

Parsing Araport11_GFF3_genes_transposons.201606.gff (from the TAIR database) gives a text file named Araport11_GFF3_genes_transposons.201606.gff.features. This file tells parent-child relationships between genomic features, with asterisked features been assigned ID in the GFF3 file. For example, there are exon region(s) for lnc_RNA. So by concatenating exon region(s) of one lnc_RNA, we may have exact sequence regions of the lnc_RNA object.

ubuntu@ubuntu20:~/DB$ cat Araport11_GFF3_genes_transposons.201606.gff.features
GffRoot
  gene*
    mRNA*
      five_prime_UTR*
      exon*
      CDS*
      three_prime_UTR*
      protein*
    lnc_RNA*
      exon*
    antisense_lncRNA*
      exon*
    miRNA_primary_transcript*
      miRNA*
    transcript_region*
      exon*
    tRNA*
      exon*
    antisense_RNA*
      exon*
    snoRNA*
      exon*
    ncRNA*
      exon*
    snRNA*
      exon*
    uORF*
    rRNA*
      exon*
  transposable_element*
    transposon_fragment*
  pseudogene*
    pseudogenic_transcript*
      pseudogenic_exon*
    pseudogenic_tRNA*
      pseudogenic_exon*
  transposable_element_gene*
    mRNA*
      exon*

Java program misc.CanonicalGFF was adopted for extracting coordinates of tRNA, rRNA, snoRNA, snRNA, and miRNA. Our aim is to remove reads mapping to these RNA categories so we are going to extracting their sequences and mapping reads to them. Recall that in the features file we see exon under tRNA, rRNA, snoRNA, and snRNA. Also miRNA under miRNA_primary_transcript, where it was found that miRNA records are like mature miRNAs and miRNA_primary_transcript records are like precursors. So extracting exon regions for tRNA, rRNA, snoRNA and snRNA AND extracting miRNA_primary_transcript (but not miRNA) regions for miRNA_primary_transcript should be desirable. The two parameters of the -GE option are telling the parent-child combinations to be extracted.

ubuntu@ubuntu20:~/DB$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.CanonicalGFF -I Araport11_GFF3_genes_transposons.201606.gff -GE tRNA:snoRNA:snRNA:rRNA:miRNA_primary_transcript exon:miRNA_primary_transcript -O filter.cgff
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: CanonicalGFF
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
output filename (-O): filter.cgff
gene-exon feature pairs (-GE): [tRNA:snoRNA:snRNA:rRNA:miRNA_primary_transcript, exon:miRNA_primary_transcript](/wdlingit/cop/wiki/tRNA:snoRNA:snRNA:rRNA:miRNA_primary_transcript,-exon:miRNA_primary_transcript)
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
intron preserving (-IP)): false

The following example should show how misc.CanonicalGff works: coordinates of exon records under tRNA AT1G11010.1 were collected under AT1G11010.1 in the .cgff file.

ubuntu@ubuntu20:~/DB$ grep AT1G11010.1 Araport11_GFF3_genes_transposons.201606.gff
Chr1    Araport11       tRNA    3676607 3676691 38      +       .       ID=AT1G11010.1;Parent=AT1G11010;Note=tRNA-Met;computational_description=pre-tRNA-tRNA-Met (anticodon: CAT);Dbxref=gene:1005715222;Name=AT1G11010.1
Chr1    Araport11       exon    3676607 3676645 38      +       .       ID=AT1G11010:exon:1;Parent=AT1G11010.1;Name=AT1G11010:exon:1
Chr1    Araport11       exon    3676656 3676691 38      +       .       ID=AT1G11010:exon:2;Parent=AT1G11010.1;Name=AT1G11010:exon:2

ubuntu@ubuntu20:~/DB$ head -1472 filter.cgff | tail -n 3
>AT1G11010.1    Chr1    3676607 3676691 +
3676607 3676645
3676656 3676691

Base on these extracted coordinates, perl script SeqGen.pl was adopted for extracting sequences. After that, we applied bowtie2-build to make them into a bowtie2 database so that we can map Ribo-seq reads to them for filtering.

ubuntu@ubuntu20:~/DB$ SeqGen.pl filter.fasta TAIR10_chr_all.fas filter.cgff

ubuntu@ubuntu20:~/DB$ bowtie2-build filter.fasta filterBase

The following command is for: extracting the Ribo-seq fastq.gz file into a text FASTQ, mapping these Ribo-seq reads using the Mapping.pl script, and finally remove the extracted text FASTQ. Parameter meaning of Mapping.pl:

x: no meaning but needed
root-ctr-0h-r1.fastq: the input FASTQ file
root-ctr-0h-r1.filter.bam: the desired output BAM filename
MappingBowtie.pl: ask our bowtie2 wrapper to do the mapping
-target DB/filterBase: by taking DB/filterBase as the reference for mapping
-p 7: bowtie2 option for using 7 threads (the VM running the example got 8 vCPUs)
-k 30 --sensitive: bowtie2 option for reporting at most 30 alignments under the sensitive criterion. It should be notice that bowtie2 tends to report exactly one alignment without -k 30
--gbar 151: bowtie2 option for not opening gap for first 151bps in reads (and our reads are all shorter than 151bps)
-ID 0.95: wrapper option for filtering alignments by identity>=0.95 (identity=#matching bases/read length)
-SamTmp root-ctr-0h-r1.filter.tmp.sam: save non-mapping reads and reads with identity<0.95 into the specified file

ubuntu@ubuntu20:~$ gzip -dc src/root-ctr-0h-r1.fastq.gz > root-ctr-0h-r1.fastq; Mapping.pl x root-ctr-0h-r1.fastq root-ctr-0h-r1.filter.bam MappingBowtie.pl -target DB/filterBase -p 7 -k 30 --gbar 151 --sensitive -ID 0.95 -SamTmp root-ctr-0h-r1.filter.tmp.sam; rm root-ctr-0h-r1.fastq
Total query reads: 58082938
([Bowtie]29946_Query1.fasta) SAM ID filter (0.95) reads: 58082938 -> 26290949
[bam_sort_core] merging from 15 files and 1 in-memory blocks...
Total SAM output reads: 26290949

In so doing, root-ctr-0h-r1.filter.tmp.sam is storing Ribo-seq reads not mapping to those undesired RNAs. sam2fas.pl was adopted for translating SAM records into FASTA. The perl one-liner perl -ne is to retain reads with lengths between 25bps to 50bps. root-ctr-0h-r1.pass1.fasta is the read file we are going to process in the next step.

ubuntu@ubuntu20:~$ sam2fas.pl root-ctr-0h-r1.filter.tmp.sam root-ctr-0h-r1.pass0.fasta
(SAM to FASTA) root-ctr-0h-r1.filter.tmp.sam -> root-ctr-0h-r1.pass0.fasta

ubuntu@ubuntu20:~$ cat root-ctr-0h-r1.pass0.fasta | perl -ne 'chomp; if(/^>/){ $tmp=$_ }elsif(25<=length($_) && length($_)<=50){ print "$tmp\n$_\n" }' > root-ctr-0h-r1.pass1.fasta

Mapping reads to Araport11 transcriptome

Again, misc.CanonicalGFF was adopted for collecting coordinates of Araport11 transcriptome, SeqGen.pl was for extracting sequences, and bowtie2-build was for making them into a bowtie2 reference.

ubuntu@ubuntu20:~/DB$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.CanonicalGFF -I Araport11_GFF3_genes_transposons.201606.gff -O araport11.transcriptome.cgff -GE mRNA:lnc_RNA:antisense_lncRNA:transcript_region:tRNA:antisense_RNA:snoRNA:ncRNA:snRNA:rRNA five_prime_UTR:exon:CDS:three_prime_UTR -GE miRNA_primary_transcript miRNA_primary_transcript -GE pseudogenic_transcript:pseudogenic_tRNA pseudogenic_exon
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: CanonicalGFF
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
output filename (-O): araport11.transcriptome.cgff
gene-exon feature pairs (-GE): [mRNA:lnc_RNA:antisense_lncRNA:transcript_region:tRNA:antisense_RNA:snoRNA:ncRNA:snRNA:rRNA, five_prime_UTR:exon:CDS:three_prime_UTR], [miRNA_primary_transcript, miRNA_primary_transcript], [pseudogenic_transcript:pseudogenic_tRNA, pseudogenic_exon](/wdlingit/cop/wiki/mRNA:lnc_RNA:antisense_lncRNA:transcript_region:tRNA:antisense_RNA:snoRNA:ncRNA:snRNA:rRNA,-five_prime_UTR:exon:CDS:three_prime_UTR],-[miRNA_primary_transcript,-miRNA_primary_transcript],-[pseudogenic_transcript:pseudogenic_tRNA,-pseudogenic_exon)
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
intron preserving (-IP)): false

ubuntu@ubuntu20:~/DB$ SeqGen.pl araport11.transcriptome.fasta TAIR10_chr_all.fas araport11.transcriptome.cgff

ubuntu@ubuntu20:~/DB$ bowtie2-build araport11.transcriptome.fasta araport11.transcriptome

Again, we applied the Mapping.pl and the bowtie2 wrapper for read mapping. Similarily, only alignments with identity>=0.95 were accepted. Non-pass reads were saved in root-ctr-0h-r1.transcriptome.filtered.sam.

ubuntu@ubuntu20:~$ Mapping.pl x root-ctr-0h-r1.pass1.fasta root-ctr-0h-r1.transcriptome.bam MappingBowtie.pl -target DB/araport11.transcriptome -p 7 -k 30 --gbar 151 --sensitive -ID 0.95 -SamTmp root-ctr-0h-r1.transcriptome.filtered.sam
Total query reads: 31509683
([Bowtie]30237_Query1.fasta) SAM ID filter (0.95) reads: 31509683 -> 26711476
[bam_sort_core] merging from 13 files and 1 in-memory blocks...
Total SAM output reads: 26711476

Because root-ctr-0h-r1.transcriptome.bam was mad by mapping reads to the transcriptome, the mapping coordinates in the BAM file would be all corrsonding to transcripts but not the genome. The following command was applied to transferring transcriptome coordinates into genome coordinates.

ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar misc.AlignmentFilter2 -M SAMun root-ctr-0h-r1.transcriptome.bam -O /dev/stdout -quiet true -filter translation DB/araport11.transcriptome.cgff | psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin | samtools view -Sbo root-ctr-0h-r1.transcriptome.translated.bam -T DB/TAIR10_chr_all.fas /dev/stdin

Surely, reads without identity>=0.95 to the transcriptome were saved in root-ctr-0h-r1.transcriptome.filtered.sam and sam2fas.pl was applied to saving them into the FASTA file root-ctr-0h-r1.transcriptome.filtered.fasta.

ubuntu@ubuntu20:~$ sam2fas.pl root-ctr-0h-r1.transcriptome.filtered.sam root-ctr-0h-r1.transcriptome.filtered.fasta

Mapping reads to junction-spanning sequences

This is an optional step. Suggested if the genome got non-enriched transcriptome annotation.

If the genome to be studied got non-enriched genome annotation and there were sufficient RNAseq dataset for inferring non-database splicing junction, it would have been desirable to generate sequenes spanning junctions that are supported by RNAseq and mapping Ribo-seq reads to them. In so doing, it is possible to infer non-dabase junction with Ribo-seq evidence.

The following example is showing how to use java program special.JunctionCGFF to extract junction information supported by RNAseq data (in BAM files). It reads multiple BAM files of RNAseq datasets, excludes alignments with fragile blocks (shorter than 10bps), adopts splicing junctions supported by 10 or more reads, and avoiding defined splicing junction in araport11.transcriptome.cgff. For each splicing junction, flanking 49bps were included for sequence extraction.

ubuntu@ubuntu20:~/DB$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar special.JunctionCGFF -M SAM RNAseq_1.bam -M SAM RNAseq_2.bam -O RNAseqJunction -minB 10 -minDepth 10 -minSplicingPos 4 -flanking 49 -model araport11.transcriptome.cgff

The output is RNAseqJunction.cgff so we use it to extract sequences and make them into a bowtie2 reference.

ubuntu@ubuntu20:~/DB$ SeqGen.pl RNAseqJunction.fasta TAIR10_chr_all.fas RNAseqJunction.cgff

ubuntu@ubuntu20:~/DB$ bowtie2-build RNAseqJunction.fasta RNAseqJunction

Some commands are recommended for further checking. The following one is for making sure that every junction-spanning sequence is 98bp long. Avoiding junctions very close to chromosome ends.

ubuntu@ubuntu20:~/DB$ cat RNAseqJunction.fasta | perl -ne 'chomp; if(/^>(\S+)/){ $id=$1 }else{ $hash{$id}+=length($_) } if(eof){ for $key (keys %hash){ print "$hash{$key}\n" } }' | perl -ne 'chomp; $hash{$_}++; if(eof){ for $key (sort {$a<=>$b} keys %hash){ print "$key\t$hash{$key}\n" } }'
98      57991

The following perl one-liner is to generate coordinate file (RNAseqJunction.mid.cgff) that stores the center dimer for each junction-spanning sequence. In so doing, we are able to make sure mapping reads are passing the junctions in some later step.

ubuntu@ubuntu20:~/DB$ cat RNAseqJunction.fasta | perl -ne 'chomp; if(/^>(\S+)/){ $id=$1 }else{ $hash{$id}+=length($_) } if(eof){ for $key (sort {$a cmp $b} keys %hash){ print "$key\t$hash{$key}\n" } }' | perl -ne 'chomp; ($id,$len)=split; print ">$id\t$id\t".($len/2)."\t".($len/2+1)."\n".($len/2)."\t".($len/2+1)."\n"' > RNAseqJunction.mid.cgff

Similarily, we applied Mapping.pl and a bowtie2 wrapper for mapping reads from the last stage to these junction-spanning sequences. It was noticed that only 3270 reads were mapping to these junction-spanning sequences. This should be because of enriched annotation of arabidopsis.

ubuntu@ubuntu20:~$ Mapping.pl x root-ctr-0h-r1.transcriptome.filtered.fasta root-ctr-0h-r1.rnaseqJunction.bam MappingBowtie.pl -target DB/RNAseqJunction -p 7 -k 30 --gbar 151 --sensitive -ID 0.95 -SamTmp root-ctr-0h-r1.rnaseqJunction.filtered.sam
Total query reads: 4798207
([Bowtie]30434_Query1.fasta) SAM ID filter (0.95) reads: 4798207 -> 3270
Total SAM output reads: 3270

Again, alignment coordinates are corresponding to junction-spanning sequences. The following command is for making sure that these reads are spanning the junctions (covering center dimers of junction-spanning sequences) and transferrring read alignments into genome coordinates.

ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar misc.AlignmentFilter2 -M SAMun root-ctr-0h-r1.rnaseqJunction.bam -O /dev/stdout -Un root-ctr-0h-r1.rnaseqJunction.filtered2.sam -quiet true -filter gene -GFF DB/RNAseqJunction.mid.cgff -min 2 -filter translation DB/RNAseqJunction.cgff | psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin | samtools view -Sbo root-ctr-0h-r1.rnaseqJunction.translated.bam -T DB/TAIR10_chr_all.fas /dev/stdin

Take a short look at the translated BAM file. It was found that alignments are all containing D operators in their CIGAR strings.

ubuntu@ubuntu20:~$ samtools view root-ctr-0h-r1.rnaseqJunction.translated.bam | head -3
HWI-ST615:762:H9PFHADXX:1:1101:2712:29480_1:N:0:TTAGGC  16      Chr1    28896241        0       26M192D1M       *       0       0       AGTATTTAACTGGTGGTGGTGGGGACT     *   NM:i:192 XM:i:0  MD:Z:26^TTGTAAACCGGAGGAGGCGAGTAGTGCTTAACCGGTGGTGGTGGGGACTTGTAAACCGGAGGAGGGGAGTAGTATTTAACTGGTGGTGGTGGGGACTTGTAAACCGGAGGAGGGGAGTAATGCTTAACCGGTGGTGGGGGAGACTTGTAAACCGGAGGAGGGGAGTAGTGCTTAACCGGTGGTGGGGGAGAC1
HWI-ST615:762:H9PFHADXX:1:1101:2712:29480_1:N:0:TTAGGC  16      Chr1    28896337        0       26M264D1M       *       0       0       AGTATTTAACTGGTGGTGGTGGGGACT     *   NM:i:264 XM:i:0  MD:Z:26^TTGTAAACCGGAGGAGGGGAGTAATGCTTAACCGGTGGTGGGGGAGACTTGTAAACCGGAGGAGGGGAGTAGTGCTTAACCGGTGGTGGGGGAGACTTGTAAACCGGAGGAGGGGAGTAGTGCTTAACCGGTGGTGGGGGAGACTTGTAAACCGGAGGAGGGGAGTAGTGCTTAACCGGTGGTGGGGGAGACTTGTAAACCGGTGGTGGGGGAGACTTGTAAACCGGAGGAGGGGAGTAGTACTTAACCGGTGGTGGGGGAGAC1
HWI-ST615:762:H9PFHADXX:1:1101:2712:29480_1:N:0:TTAGGC  16      Chr1    28896241        0       26M144D1M       *       0       0       AGTATTTAACTGGTGGTGGTGGGGACT     *   NM:i:144 XM:i:0  MD:Z:26^TTGTAAACCGGAGGAGGCGAGTAGTGCTTAACCGGTGGTGGTGGGGACTTGTAAACCGGAGGAGGGGAGTAGTATTTAACTGGTGGTGGTGGGGACTTGTAAACCGGAGGAGGGGAGTAATGCTTAACCGGTGGTGGGGGAGAC1

Becase we did filtering twice (when Mapping.pl and when transferring coordinate), we will have to combine the two filtered SAM files into a FASTA.

ubuntu@ubuntu20:~$ cat root-ctr-0h-r1.rnaseqJunction.filtered.sam root-ctr-0h-r1.rnaseqJunction.filtered2.sam | sam2fas.pl /dev/stdin root-ctr-0h-r1.rnaseqJunction.filtered.fasta

Mapping to genome

Use the filtered FASTA from the last mapping stage. The following Mapping.pl command applied a blat wrapper for read mapping.Parameter meaning of Mapping.pl:

-split 7: fork 7 processes for running blat (blat is not threaded)
x: no meaning but needed
root-ctr-0h-r1.rnaseqJunction.filtered.fasta: input read file
root-ctr-0h-r1.genome.bam: the desired output BAM filename
MappingBlat.pl: ask our blat wrapper to do the mapping
-unmap: blat will not include unmapped reads in its PSL output, this wrapper option is for including unmapped records in the output SAM
-md: this wrapper option is for include MD tags in output SAM
-target DB/TAIR10_chr_all.fas: taking DB/TAIR10_chr_all.fas as the reference for mapping
-q=rna -t=dna: blat options for mapping rna queries to dna sequences

ubuntu@ubuntu20:~$ Mapping.pl -split 7 x root-ctr-0h-r1.rnaseqJunction.filtered.fasta root-ctr-0h-r1.genome.bam MappingBlat.pl -unmap -md -target DB/TAIR10_chr_all.fas -q=rna -t=dna
Split processing...done
Total query reads: 4795394
Loaded 119667750 letters in 7 sequences
Loaded 119667750 letters in 7 sequences
Loaded 119667750 letters in 7 sequences
Loaded 119667750 letters in 7 sequences
Loaded 119667750 letters in 7 sequences
Loaded 119667750 letters in 7 sequences
Loaded 119667750 letters in 7 sequences
Searched 21172266 bases in 685057 sequences
Searched 21314798 bases in 685057 sequences
Searched 21352320 bases in 685057 sequences
Searched 21164632 bases in 685057 sequences
Searched 21495564 bases in 685052 sequences
Searched 21345796 bases in 685057 sequences
Searched 21347274 bases in 685057 sequences
[bam_sort_core] merging from 1 files and 1 in-memory blocks...
Total SAM output reads: 4795394

Because Ribo-seq reads are short and might be noisily mapped to the genome, a number of filters were applied to alignments made by blat:

-filter ID 0.95: accepts only with identity>=0.95
-filter J 2 -filter minB 8: by connecting indel shorter than 2bps, all alignment blocks should be no shorter than 8bps
-filter ExactEnds 4: alignment ends (4bps each) should be matching bases
-filter indel deletion -mode maximum -max 1000: intronic gap cannot be longer than 1000bps

ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar misc.AlignmentFilter2 -M SAMun root-ctr-0h-r1.genome.bam -O /dev/stdout -Un root-ctr-0h-r1.genome.filtered.sam -quiet true -filter ID 0.95 -filter J 2 -filter minB 8 -filter J reset -filter ExactEnds 4 -filter indel deletion -mode maximum -max 1000 | samtools view -Sbo root-ctr-0h-r1.genome.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin

Finally, we merged the three BAM files together into root-ctr-0h-r1.merged.bam. Note that this file is sort-by-name.

ubuntu@ubuntu20:~$ samtools merge -fn root-ctr-0h-r1.merged.bam root-ctr-0h-r1.transcriptome.translated.bam root-ctr-0h-r1.rnaseqJunction.translated.bam root-ctr-0h-r1.genome.pass.bam
[W::bam_merge_core2] No @HD tag found.

The following command is for reporting accepted read counts at each stage. It should take time, especially when there are a number of samples. You may like to append | head -100 after samtools view $_ to see if the command is working.

ubuntu@ubuntu20:~$ ls *.pass.bam *.translated.bam | perl -ne 'chomp; /(.+?)\.(.+?)\./; print "FILE: $1 $2\n"; system "samtools view $_"' | perl -ne 'chomp; if(/^FILE:\s(\S+)\s(\S+)/){ $file=$1; $tag=$2; $files{$file}=1; $tags{$tag}=1; }else{ @t=split; $cnt{$file}{$tag}++ if $t[0] ne $last; $last=$t[0] } if(eof){ for $t (sort keys %tags){ print "\t$t" } print "\n"; for $f (sort keys %files){ print "$f"; for $t (sort keys %tags){ if(exists $cnt{$f}{$t}){ print "\t$cnt{$f}{$t}" }else{ print "\t0" } } print "\n" } }'
		genome  rnaseqJunction  transcriptome
root-ctr-0h-r1  293156  2813    26711476

Read counting and RPKM computation

In this corresponding study, we are interested in Ribo-seq expression levels in mRNA, lnc_RNA, and ncRNA. For this time, we don't apply the misc.CanonicalGFF program for coordinate extraction because mRNA records under transposable_element_gene also got exon children. To avoid mistakenly include those TE genes, we applied the misc.ModelCGFF program. The parameter -GRE gene mRNA:lnc_RNA:ncRNA five_prime_UTR:exon:three_prime_UTR:CDS is extracting records rooted under gene records so that TE genes would be avoided.

ubuntu@ubuntu20:~/DB$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.ModelCGFF -GFF3 Araport11_GFF3_genes_transposons.201606.gff -O target -GRE gene mRNA:lnc_RNA:ncRNA five_prime_UTR:exon:three_prime_UTR:CDS -IP true
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [gene, mRNA:lnc_RNA:ncRNA, five_prime_UTR:exon:three_prime_UTR:CDS](/wdlingit/cop/wiki/gene,-mRNA:lnc_RNA:ncRNA,-five_prime_UTR:exon:three_prime_UTR:CDS)
intron preserving merged model (-IP)): true
representative list (-rep): null
output prefix (-O): target

Taking target.cgff (containing exon coordinates of mRNA, lnc_RNA, and ncRNA) and the merged BAM file (note: sort-by-name) together, we are able to do read counting using rnaseq.RPKMComputer for collected genes (i.e.: mRNA, lnc_RNA, and ncRNA). The option -direction true is asking the program to do counting only if a read and a gene is in the same orientation. The option -geneOnly true is asking the program to take only gene-mapping reads for library size when computing RPKMs. This should avoid affection from reads mapping to other gene categories.

ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar rnaseq.RPKMComputer -GFF DB/target.cgff -M SAM root-ctr-0h-r1.merged.bam -O root-ctr-0h-r1 -direction true -geneOnly true
program: RPKMComputer
canonical GFF filename (-GFF): DB/target.cgff
mapping method/filename (-M):
  SAM : root-ctr-0h-r1.merged.bam
output prefix (-O): root-ctr-0h-r1
block join factor (-J): 2
identity cutoff (-ID): 0.95
use exon region (-exon): true
check by containing (-contain, FALSE for by intersecting): false
minimum overlap (-min): 8
check all alignment blocks (-ALL): true
care mapping direction (-direction): true
RPKM by gene-mapping reads only (-geneOnly): true
model filename (-model): null

#uniq reads: 3024178
#multi reads: 158571
#mapped reads: 3182749

Output files were all prefixed by root-ctr-0h-r1 and the .geneRPKM file is the most important for our purpose. This is a tab-delimited file and columns are for:

gene ID. All RPKMComputer computations with the same -GFF input will output tables containing the same number of rows and the same order of genes.
gene length in Kbps
read counts of this genes, uniq-reads + multi-reads
RPKM value based on read counts in the last column
ratio of multi-reads in read counts. That is, #uniq-read = #read*(1-multiRatio)

ubuntu@ubuntu20:~$ head -8 root-ctr-0h-r1.geneRPKM
#GeneID Length(Kbps)    #Reads  RPKM    multi/ALL       format:.geneRPKM
AT1G01010       1.688   99.0    18.427242958689508      0.0
AT1G01020       1.329   24.0    5.673928652533076       0.0
AT1G03987       0.272   0.0     0.0     0.0
AT1G01030       1.836   0.0     0.0     0.0
AT1G01040       6.279   63.0    3.152446145157382       0.0
AT1G01050       1.184   303.2253906671282       80.46582558074503       0.023828449
AT1G01060       2.763   45.0    5.117162852345587       0.0