Ribo seq read mapping, customization example - wdlingit/cop GitHub Wiki
This document describes a number of customization bioinformatics steps of Ribo-seq read mapping and expression level measurement, specifically for non-model organisms with non-enriched genome annotation, by taking Arabidopsis thaliana as an example. Refer the environment setting if needed.
Please raise an issue if there is any questions.
Assuming that a fastq.gz file of adapter removed reads of one single arabidopsis Ribo-seq sample to be processed. The aim is to remove tRNA, rRNA, snoRNA, snRNA, and miRNA reads first, and to take all other reads for counting.
Removing reads from tRNA, rRNA, snoRNA, snRNA, and miRNA
Java program misc.GffTree
was adopted for parsing feature tree of the given GFF3 file. This usually helps me to decide how to extract sequences of specific genomic features.
ubuntu@ubuntu20:~/DB$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.GffTree -I Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]
Parsing Araport11_GFF3_genes_transposons.201606.gff
(from the TAIR database) gives a text file named Araport11_GFF3_genes_transposons.201606.gff.features
. This file tells parent-child relationships between genomic features, with asterisked features been assigned ID
in the GFF3 file. For example, there are exon
region(s) for lnc_RNA
. So by concatenating exon
region(s) of one lnc_RNA
, we may have exact sequence regions of the lnc_RNA
object.
ubuntu@ubuntu20:~/DB$ cat Araport11_GFF3_genes_transposons.201606.gff.features
GffRoot
gene*
mRNA*
five_prime_UTR*
exon*
CDS*
three_prime_UTR*
protein*
lnc_RNA*
exon*
antisense_lncRNA*
exon*
miRNA_primary_transcript*
miRNA*
transcript_region*
exon*
tRNA*
exon*
antisense_RNA*
exon*
snoRNA*
exon*
ncRNA*
exon*
snRNA*
exon*
uORF*
rRNA*
exon*
transposable_element*
transposon_fragment*
pseudogene*
pseudogenic_transcript*
pseudogenic_exon*
pseudogenic_tRNA*
pseudogenic_exon*
transposable_element_gene*
mRNA*
exon*
Java program misc.CanonicalGFF
was adopted for extracting coordinates of tRNA, rRNA, snoRNA, snRNA, and miRNA. Our aim is to remove reads mapping to these RNA categories so we are going to extracting their sequences and mapping reads to them. Recall that in the features
file we see exon
under tRNA, rRNA, snoRNA, and snRNA. Also miRNA under miRNA_primary_transcript, where it was found that miRNA records are like mature miRNAs and miRNA_primary_transcript records are like precursors. So extracting exon regions for tRNA, rRNA, snoRNA and snRNA AND extracting miRNA_primary_transcript (but not miRNA) regions for miRNA_primary_transcript should be desirable. The two parameters of the -GE
option are telling the parent-child combinations to be extracted.
ubuntu@ubuntu20:~/DB$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.CanonicalGFF -I Araport11_GFF3_genes_transposons.201606.gff -GE tRNA:snoRNA:snRNA:rRNA:miRNA_primary_transcript exon:miRNA_primary_transcript -O filter.cgff
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: CanonicalGFF
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
output filename (-O): filter.cgff
gene-exon feature pairs (-GE): [tRNA:snoRNA:snRNA:rRNA:miRNA_primary_transcript, exon:miRNA_primary_transcript](/wdlingit/cop/wiki/tRNA:snoRNA:snRNA:rRNA:miRNA_primary_transcript,-exon:miRNA_primary_transcript)
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
intron preserving (-IP)): false
The following example should show how misc.CanonicalGff
works: coordinates of exon records under tRNA AT1G11010.1 were collected under AT1G11010.1 in the .cgff
file.
ubuntu@ubuntu20:~/DB$ grep AT1G11010.1 Araport11_GFF3_genes_transposons.201606.gff
Chr1 Araport11 tRNA 3676607 3676691 38 + . ID=AT1G11010.1;Parent=AT1G11010;Note=tRNA-Met;computational_description=pre-tRNA-tRNA-Met (anticodon: CAT);Dbxref=gene:1005715222;Name=AT1G11010.1
Chr1 Araport11 exon 3676607 3676645 38 + . ID=AT1G11010:exon:1;Parent=AT1G11010.1;Name=AT1G11010:exon:1
Chr1 Araport11 exon 3676656 3676691 38 + . ID=AT1G11010:exon:2;Parent=AT1G11010.1;Name=AT1G11010:exon:2
ubuntu@ubuntu20:~/DB$ head -1472 filter.cgff | tail -n 3
>AT1G11010.1 Chr1 3676607 3676691 +
3676607 3676645
3676656 3676691
Base on these extracted coordinates, perl script SeqGen.pl
was adopted for extracting sequences. After that, we applied bowtie2-build
to make them into a bowtie2 database so that we can map Ribo-seq reads to them for filtering.
ubuntu@ubuntu20:~/DB$ SeqGen.pl filter.fasta TAIR10_chr_all.fas filter.cgff
ubuntu@ubuntu20:~/DB$ bowtie2-build filter.fasta filterBase
The following command is for: extracting the Ribo-seq fastq.gz file into a text FASTQ, mapping these Ribo-seq reads using the Mapping.pl
script, and finally remove the extracted text FASTQ. Parameter meaning of Mapping.pl
:
x
: no meaning but neededroot-ctr-0h-r1.fastq
: the input FASTQ fileroot-ctr-0h-r1.filter.bam
: the desired output BAM filenameMappingBowtie.pl
: ask our bowtie2 wrapper to do the mapping-target DB/filterBase
: by takingDB/filterBase
as the reference for mapping-p 7
: bowtie2 option for using 7 threads (the VM running the example got 8 vCPUs)-k 30 --sensitive
: bowtie2 option for reporting at most 30 alignments under thesensitive
criterion. It should be notice that bowtie2 tends to report exactly one alignment without-k 30
--gbar 151
: bowtie2 option for not opening gap for first 151bps in reads (and our reads are all shorter than 151bps)-ID 0.95
: wrapper option for filtering alignments by identity>=0.95 (identity=#matching bases/read length)-SamTmp root-ctr-0h-r1.filter.tmp.sam
: save non-mapping reads and reads with identity<0.95 into the specified file
.
ubuntu@ubuntu20:~$ gzip -dc src/root-ctr-0h-r1.fastq.gz > root-ctr-0h-r1.fastq; Mapping.pl x root-ctr-0h-r1.fastq root-ctr-0h-r1.filter.bam MappingBowtie.pl -target DB/filterBase -p 7 -k 30 --gbar 151 --sensitive -ID 0.95 -SamTmp root-ctr-0h-r1.filter.tmp.sam; rm root-ctr-0h-r1.fastq
Total query reads: 58082938
([Bowtie]29946_Query1.fasta) SAM ID filter (0.95) reads: 58082938 -> 26290949
[bam_sort_core] merging from 15 files and 1 in-memory blocks...
Total SAM output reads: 26290949
In so doing, root-ctr-0h-r1.filter.tmp.sam
is storing Ribo-seq reads not mapping to those undesired RNAs. sam2fas.pl
was adopted for translating SAM records into FASTA. The perl one-liner perl -ne
is to retain reads with lengths between 25bps to 50bps. root-ctr-0h-r1.pass1.fasta
is the read file we are going to process in the next step.
ubuntu@ubuntu20:~$ sam2fas.pl root-ctr-0h-r1.filter.tmp.sam root-ctr-0h-r1.pass0.fasta
(SAM to FASTA) root-ctr-0h-r1.filter.tmp.sam -> root-ctr-0h-r1.pass0.fasta
ubuntu@ubuntu20:~$ cat root-ctr-0h-r1.pass0.fasta | perl -ne 'chomp; if(/^>/){ $tmp=$_ }elsif(25<=length($_) && length($_)<=50){ print "$tmp\n$_\n" }' > root-ctr-0h-r1.pass1.fasta
Mapping reads to Araport11 transcriptome
Again, misc.CanonicalGFF
was adopted for collecting coordinates of Araport11 transcriptome, SeqGen.pl
was for extracting sequences, and bowtie2-build
was for making them into a bowtie2 reference.
ubuntu@ubuntu20:~/DB$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.CanonicalGFF -I Araport11_GFF3_genes_transposons.201606.gff -O araport11.transcriptome.cgff -GE mRNA:lnc_RNA:antisense_lncRNA:transcript_region:tRNA:antisense_RNA:snoRNA:ncRNA:snRNA:rRNA five_prime_UTR:exon:CDS:three_prime_UTR -GE miRNA_primary_transcript miRNA_primary_transcript -GE pseudogenic_transcript:pseudogenic_tRNA pseudogenic_exon
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: CanonicalGFF
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
output filename (-O): araport11.transcriptome.cgff
gene-exon feature pairs (-GE): [mRNA:lnc_RNA:antisense_lncRNA:transcript_region:tRNA:antisense_RNA:snoRNA:ncRNA:snRNA:rRNA, five_prime_UTR:exon:CDS:three_prime_UTR], [miRNA_primary_transcript, miRNA_primary_transcript], [pseudogenic_transcript:pseudogenic_tRNA, pseudogenic_exon](/wdlingit/cop/wiki/mRNA:lnc_RNA:antisense_lncRNA:transcript_region:tRNA:antisense_RNA:snoRNA:ncRNA:snRNA:rRNA,-five_prime_UTR:exon:CDS:three_prime_UTR],-[miRNA_primary_transcript,-miRNA_primary_transcript],-[pseudogenic_transcript:pseudogenic_tRNA,-pseudogenic_exon)
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
intron preserving (-IP)): false
ubuntu@ubuntu20:~/DB$ SeqGen.pl araport11.transcriptome.fasta TAIR10_chr_all.fas araport11.transcriptome.cgff
ubuntu@ubuntu20:~/DB$ bowtie2-build araport11.transcriptome.fasta araport11.transcriptome
Again, we applied the Mapping.pl
and the bowtie2 wrapper for read mapping. Similarily, only alignments with identity>=0.95 were accepted. Non-pass reads were saved in root-ctr-0h-r1.transcriptome.filtered.sam
.
ubuntu@ubuntu20:~$ Mapping.pl x root-ctr-0h-r1.pass1.fasta root-ctr-0h-r1.transcriptome.bam MappingBowtie.pl -target DB/araport11.transcriptome -p 7 -k 30 --gbar 151 --sensitive -ID 0.95 -SamTmp root-ctr-0h-r1.transcriptome.filtered.sam
Total query reads: 31509683
([Bowtie]30237_Query1.fasta) SAM ID filter (0.95) reads: 31509683 -> 26711476
[bam_sort_core] merging from 13 files and 1 in-memory blocks...
Total SAM output reads: 26711476
Because root-ctr-0h-r1.transcriptome.bam
was mad by mapping reads to the transcriptome, the mapping coordinates in the BAM file would be all corrsonding to transcripts but not the genome. The following command was applied to transferring transcriptome coordinates into genome coordinates.
ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar misc.AlignmentFilter2 -M SAMun root-ctr-0h-r1.transcriptome.bam -O /dev/stdout -quiet true -filter translation DB/araport11.transcriptome.cgff | psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin | samtools view -Sbo root-ctr-0h-r1.transcriptome.translated.bam -T DB/TAIR10_chr_all.fas /dev/stdin
Surely, reads without identity>=0.95 to the transcriptome were saved in root-ctr-0h-r1.transcriptome.filtered.sam
and sam2fas.pl
was applied to saving them into the FASTA file root-ctr-0h-r1.transcriptome.filtered.fasta
.
ubuntu@ubuntu20:~$ sam2fas.pl root-ctr-0h-r1.transcriptome.filtered.sam root-ctr-0h-r1.transcriptome.filtered.fasta
Mapping reads to junction-spanning sequences
This is an optional step. Suggested if the genome got non-enriched transcriptome annotation.
If the genome to be studied got non-enriched genome annotation and there were sufficient RNAseq dataset for inferring non-database splicing junction, it would have been desirable to generate sequenes spanning junctions that are supported by RNAseq and mapping Ribo-seq reads to them. In so doing, it is possible to infer non-dabase junction with Ribo-seq evidence.
The following example is showing how to use java program special.JunctionCGFF
to extract junction information supported by RNAseq data (in BAM files). It reads multiple BAM files of RNAseq datasets, excludes alignments with fragile blocks (shorter than 10bps), adopts splicing junctions supported by 10 or more reads, and avoiding defined splicing junction in araport11.transcriptome.cgff
. For each splicing junction, flanking 49bps were included for sequence extraction.
ubuntu@ubuntu20:~/DB$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar special.JunctionCGFF -M SAM RNAseq_1.bam -M SAM RNAseq_2.bam -O RNAseqJunction -minB 10 -minDepth 10 -minSplicingPos 4 -flanking 49 -model araport11.transcriptome.cgff
The output is RNAseqJunction.cgff
so we use it to extract sequences and make them into a bowtie2 reference.
ubuntu@ubuntu20:~/DB$ SeqGen.pl RNAseqJunction.fasta TAIR10_chr_all.fas RNAseqJunction.cgff
ubuntu@ubuntu20:~/DB$ bowtie2-build RNAseqJunction.fasta RNAseqJunction
Some commands are recommended for further checking. The following one is for making sure that every junction-spanning sequence is 98bp long. Avoiding junctions very close to chromosome ends.
ubuntu@ubuntu20:~/DB$ cat RNAseqJunction.fasta | perl -ne 'chomp; if(/^>(\S+)/){ $id=$1 }else{ $hash{$id}+=length($_) } if(eof){ for $key (keys %hash){ print "$hash{$key}\n" } }' | perl -ne 'chomp; $hash{$_}++; if(eof){ for $key (sort {$a<=>$b} keys %hash){ print "$key\t$hash{$key}\n" } }'
98 57991
The following perl one-liner is to generate coordinate file (RNAseqJunction.mid.cgff
) that stores the center dimer for each junction-spanning sequence. In so doing, we are able to make sure mapping reads are passing the junctions in some later step.
ubuntu@ubuntu20:~/DB$ cat RNAseqJunction.fasta | perl -ne 'chomp; if(/^>(\S+)/){ $id=$1 }else{ $hash{$id}+=length($_) } if(eof){ for $key (sort {$a cmp $b} keys %hash){ print "$key\t$hash{$key}\n" } }' | perl -ne 'chomp; ($id,$len)=split; print ">$id\t$id\t".($len/2)."\t".($len/2+1)."\n".($len/2)."\t".($len/2+1)."\n"' > RNAseqJunction.mid.cgff
Similarily, we applied Mapping.pl
and a bowtie2 wrapper for mapping reads from the last stage to these junction-spanning sequences. It was noticed that only 3270 reads were mapping to these junction-spanning sequences. This should be because of enriched annotation of arabidopsis.
ubuntu@ubuntu20:~$ Mapping.pl x root-ctr-0h-r1.transcriptome.filtered.fasta root-ctr-0h-r1.rnaseqJunction.bam MappingBowtie.pl -target DB/RNAseqJunction -p 7 -k 30 --gbar 151 --sensitive -ID 0.95 -SamTmp root-ctr-0h-r1.rnaseqJunction.filtered.sam
Total query reads: 4798207
([Bowtie]30434_Query1.fasta) SAM ID filter (0.95) reads: 4798207 -> 3270
Total SAM output reads: 3270
Again, alignment coordinates are corresponding to junction-spanning sequences. The following command is for making sure that these reads are spanning the junctions (covering center dimers of junction-spanning sequences) and transferrring read alignments into genome coordinates.
ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar misc.AlignmentFilter2 -M SAMun root-ctr-0h-r1.rnaseqJunction.bam -O /dev/stdout -Un root-ctr-0h-r1.rnaseqJunction.filtered2.sam -quiet true -filter gene -GFF DB/RNAseqJunction.mid.cgff -min 2 -filter translation DB/RNAseqJunction.cgff | psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin | samtools view -Sbo root-ctr-0h-r1.rnaseqJunction.translated.bam -T DB/TAIR10_chr_all.fas /dev/stdin
Take a short look at the translated BAM file. It was found that alignments are all containing D operators in their CIGAR strings.
ubuntu@ubuntu20:~$ samtools view root-ctr-0h-r1.rnaseqJunction.translated.bam | head -3
HWI-ST615:762:H9PFHADXX:1:1101:2712:29480_1:N:0:TTAGGC 16 Chr1 28896241 0 26M192D1M * 0 0 AGTATTTAACTGGTGGTGGTGGGGACT * NM:i:192 XM:i:0 MD:Z:26^TTGTAAACCGGAGGAGGCGAGTAGTGCTTAACCGGTGGTGGTGGGGACTTGTAAACCGGAGGAGGGGAGTAGTATTTAACTGGTGGTGGTGGGGACTTGTAAACCGGAGGAGGGGAGTAATGCTTAACCGGTGGTGGGGGAGACTTGTAAACCGGAGGAGGGGAGTAGTGCTTAACCGGTGGTGGGGGAGAC1
HWI-ST615:762:H9PFHADXX:1:1101:2712:29480_1:N:0:TTAGGC 16 Chr1 28896337 0 26M264D1M * 0 0 AGTATTTAACTGGTGGTGGTGGGGACT * NM:i:264 XM:i:0 MD:Z:26^TTGTAAACCGGAGGAGGGGAGTAATGCTTAACCGGTGGTGGGGGAGACTTGTAAACCGGAGGAGGGGAGTAGTGCTTAACCGGTGGTGGGGGAGACTTGTAAACCGGAGGAGGGGAGTAGTGCTTAACCGGTGGTGGGGGAGACTTGTAAACCGGAGGAGGGGAGTAGTGCTTAACCGGTGGTGGGGGAGACTTGTAAACCGGTGGTGGGGGAGACTTGTAAACCGGAGGAGGGGAGTAGTACTTAACCGGTGGTGGGGGAGAC1
HWI-ST615:762:H9PFHADXX:1:1101:2712:29480_1:N:0:TTAGGC 16 Chr1 28896241 0 26M144D1M * 0 0 AGTATTTAACTGGTGGTGGTGGGGACT * NM:i:144 XM:i:0 MD:Z:26^TTGTAAACCGGAGGAGGCGAGTAGTGCTTAACCGGTGGTGGTGGGGACTTGTAAACCGGAGGAGGGGAGTAGTATTTAACTGGTGGTGGTGGGGACTTGTAAACCGGAGGAGGGGAGTAATGCTTAACCGGTGGTGGGGGAGAC1
Becase we did filtering twice (when Mapping.pl and when transferring coordinate), we will have to combine the two filtered SAM files into a FASTA.
ubuntu@ubuntu20:~$ cat root-ctr-0h-r1.rnaseqJunction.filtered.sam root-ctr-0h-r1.rnaseqJunction.filtered2.sam | sam2fas.pl /dev/stdin root-ctr-0h-r1.rnaseqJunction.filtered.fasta
Mapping to genome
Use the filtered FASTA from the last mapping stage. The following Mapping.pl
command applied a blat wrapper for read mapping.Parameter meaning of Mapping.pl
:
-split 7
: fork 7 processes for running blat (blat is not threaded)x
: no meaning but neededroot-ctr-0h-r1.rnaseqJunction.filtered.fasta
: input read fileroot-ctr-0h-r1.genome.bam
: the desired output BAM filenameMappingBlat.pl
: ask our blat wrapper to do the mapping-unmap
: blat will not include unmapped reads in its PSL output, this wrapper option is for including unmapped records in the output SAM-md
: this wrapper option is for include MD tags in output SAM-target DB/TAIR10_chr_all.fas
: takingDB/TAIR10_chr_all.fas
as the reference for mapping-q=rna -t=dna
: blat options for mapping rna queries to dna sequences
.
ubuntu@ubuntu20:~$ Mapping.pl -split 7 x root-ctr-0h-r1.rnaseqJunction.filtered.fasta root-ctr-0h-r1.genome.bam MappingBlat.pl -unmap -md -target DB/TAIR10_chr_all.fas -q=rna -t=dna
Split processing...done
Total query reads: 4795394
Loaded 119667750 letters in 7 sequences
Loaded 119667750 letters in 7 sequences
Loaded 119667750 letters in 7 sequences
Loaded 119667750 letters in 7 sequences
Loaded 119667750 letters in 7 sequences
Loaded 119667750 letters in 7 sequences
Loaded 119667750 letters in 7 sequences
Searched 21172266 bases in 685057 sequences
Searched 21314798 bases in 685057 sequences
Searched 21352320 bases in 685057 sequences
Searched 21164632 bases in 685057 sequences
Searched 21495564 bases in 685052 sequences
Searched 21345796 bases in 685057 sequences
Searched 21347274 bases in 685057 sequences
[bam_sort_core] merging from 1 files and 1 in-memory blocks...
Total SAM output reads: 4795394
Because Ribo-seq reads are short and might be noisily mapped to the genome, a number of filters were applied to alignments made by blat:
-filter ID 0.95
: accepts only with identity>=0.95-filter J 2 -filter minB 8
: by connecting indel shorter than 2bps, all alignment blocks should be no shorter than 8bps-filter ExactEnds 4
: alignment ends (4bps each) should be matching bases-filter indel deletion -mode maximum -max 1000
: intronic gap cannot be longer than 1000bps
.
ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar misc.AlignmentFilter2 -M SAMun root-ctr-0h-r1.genome.bam -O /dev/stdout -Un root-ctr-0h-r1.genome.filtered.sam -quiet true -filter ID 0.95 -filter J 2 -filter minB 8 -filter J reset -filter ExactEnds 4 -filter indel deletion -mode maximum -max 1000 | samtools view -Sbo root-ctr-0h-r1.genome.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin
Finally, we merged the three BAM files together into root-ctr-0h-r1.merged.bam. Note that this file is sort-by-name.
ubuntu@ubuntu20:~$ samtools merge -fn root-ctr-0h-r1.merged.bam root-ctr-0h-r1.transcriptome.translated.bam root-ctr-0h-r1.rnaseqJunction.translated.bam root-ctr-0h-r1.genome.pass.bam
[W::bam_merge_core2] No @HD tag found.
The following command is for reporting accepted read counts at each stage. It should take time, especially when there are a number of samples. You may like to append | head -100
after samtools view $_
to see if the command is working.
ubuntu@ubuntu20:~$ ls *.pass.bam *.translated.bam | perl -ne 'chomp; /(.+?)\.(.+?)\./; print "FILE: $1 $2\n"; system "samtools view $_"' | perl -ne 'chomp; if(/^FILE:\s(\S+)\s(\S+)/){ $file=$1; $tag=$2; $files{$file}=1; $tags{$tag}=1; }else{ @t=split; $cnt{$file}{$tag}++ if $t[0] ne $last; $last=$t[0] } if(eof){ for $t (sort keys %tags){ print "\t$t" } print "\n"; for $f (sort keys %files){ print "$f"; for $t (sort keys %tags){ if(exists $cnt{$f}{$t}){ print "\t$cnt{$f}{$t}" }else{ print "\t0" } } print "\n" } }'
genome rnaseqJunction transcriptome
root-ctr-0h-r1 293156 2813 26711476
Read counting and RPKM computation
In this corresponding study, we are interested in Ribo-seq expression levels in mRNA, lnc_RNA, and ncRNA. For this time, we don't apply the misc.CanonicalGFF
program for coordinate extraction because mRNA records under transposable_element_gene also got exon children. To avoid mistakenly include those TE genes, we applied the misc.ModelCGFF
program. The parameter -GRE gene mRNA:lnc_RNA:ncRNA five_prime_UTR:exon:three_prime_UTR:CDS
is extracting records rooted under gene records so that TE genes would be avoided.
ubuntu@ubuntu20:~/DB$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.ModelCGFF -GFF3 Araport11_GFF3_genes_transposons.201606.gff -O target -GRE gene mRNA:lnc_RNA:ncRNA five_prime_UTR:exon:three_prime_UTR:CDS -IP true
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [gene, mRNA:lnc_RNA:ncRNA, five_prime_UTR:exon:three_prime_UTR:CDS](/wdlingit/cop/wiki/gene,-mRNA:lnc_RNA:ncRNA,-five_prime_UTR:exon:three_prime_UTR:CDS)
intron preserving merged model (-IP)): true
representative list (-rep): null
output prefix (-O): target
Taking target.cgff
(containing exon coordinates of mRNA, lnc_RNA, and ncRNA) and the merged BAM file (note: sort-by-name) together, we are able to do read counting using rnaseq.RPKMComputer
for collected genes (i.e.: mRNA, lnc_RNA, and ncRNA). The option -direction true
is asking the program to do counting only if a read and a gene is in the same orientation. The option -geneOnly true
is asking the program to take only gene-mapping reads for library size when computing RPKMs. This should avoid affection from reads mapping to other gene categories.
ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar rnaseq.RPKMComputer -GFF DB/target.cgff -M SAM root-ctr-0h-r1.merged.bam -O root-ctr-0h-r1 -direction true -geneOnly true
program: RPKMComputer
canonical GFF filename (-GFF): DB/target.cgff
mapping method/filename (-M):
SAM : root-ctr-0h-r1.merged.bam
output prefix (-O): root-ctr-0h-r1
block join factor (-J): 2
identity cutoff (-ID): 0.95
use exon region (-exon): true
check by containing (-contain, FALSE for by intersecting): false
minimum overlap (-min): 8
check all alignment blocks (-ALL): true
care mapping direction (-direction): true
RPKM by gene-mapping reads only (-geneOnly): true
model filename (-model): null
#uniq reads: 3024178
#multi reads: 158571
#mapped reads: 3182749
Output files were all prefixed by root-ctr-0h-r1
and the .geneRPKM
file is the most important for our purpose. This is a tab-delimited file and columns are for:
- gene ID. All RPKMComputer computations with the same
-GFF
input will output tables containing the same number of rows and the same order of genes. - gene length in Kbps
- read counts of this genes, uniq-reads + multi-reads
- RPKM value based on read counts in the last column
- ratio of multi-reads in read counts. That is,
#uniq-read = #read*(1-multiRatio)
.
ubuntu@ubuntu20:~$ head -8 root-ctr-0h-r1.geneRPKM
#GeneID Length(Kbps) #Reads RPKM multi/ALL format:.geneRPKM
AT1G01010 1.688 99.0 18.427242958689508 0.0
AT1G01020 1.329 24.0 5.673928652533076 0.0
AT1G03987 0.272 0.0 0.0 0.0
AT1G01030 1.836 0.0 0.0 0.0
AT1G01040 6.279 63.0 3.152446145157382 0.0
AT1G01050 1.184 303.2253906671282 80.46582558074503 0.023828449
AT1G01060 2.763 45.0 5.117162852345587 0.0