ribo seq read mapping - wdlingit/cop GitHub Wiki
Methods and wrap scripts described here are specifically for Arabidopsis theliana, which has relatively enriched genome annotation. To deal with non-model organisms that might have a certain portion of splicing junctions unreveal in its genome annotation file, please refer here for a possible approach.
Please raise an issue if there is any questions.
NOTE: by date 20240617, a minor modification on downloading TAIR10 genome assembly was done that the download source was changed to Ensembl.
- wrap scripts, java programs, and prebuild bowtie2 index files for Arabidopsis theliana (TAIR10 genome + 2016ver Araport11 annotation): riboFMC.tar.gz
- setup.pl: evnironment checking script, would download TAIR10 assembly from TAIR FTP under user agreement
- filter.pl: wrap script for filtering, defaultly against riboFMC/db/filterBase (Araport11 tRNAs, miRNAs, snoRNAs, snRNAs, and rRNAs)
- mapping.pl: wrap script for mapping, defaultly against db/araport11.transcriptome (Araport11 transcriptome)
- counting.pl: wrap script for counting, deafultly against db/target.cgff (Araport11 mRNA(with CDS), lnc_RNA, and ncRNA)
- packaged files in riboFMC.tar.gz
- riboFMC/jar/rackj.jar: JAR file from https://sourceforge.net/projects/rackj/
- riboFMC/jar/sam-1.89.jar: JAR file from https://sourceforge.net/projects/picard/
- riboFMC/jar/*.txt: licenses and license related files
- directory riboFMC/db: prebuild bowtie2 index files and other genome coordinate files
- directory riboFMC/scripts: necessary scripts extracted from https://sourceforge.net/projects/rackj/
- Excel template for Z-test comoparison: CompareRPKMs.xlsx
Here is an example of running the wrap scripts using a fresh new ubuntu 20.40 virtual machine, with 4 vCPUs and 8GB ram. The first step is to download the package and extract it.
ubuntu@test:~$ wget https://maccu.project.sinica.edu.tw/20230330/riboFMC.tar.gz
--2023-03-31 02:28:05-- https://maccu.project.sinica.edu.tw/20230330/riboFMC.tar.gz
Resolving maccu.project.sinica.edu.tw (maccu.project.sinica.edu.tw)... 140.109.4.241
Connecting to maccu.project.sinica.edu.tw (maccu.project.sinica.edu.tw)|140.109.4.241|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 130601012 (125M) [application/octet-stream]
Saving to: 'riboFMC.tar.gz'
riboFMC.tar.gz 100%[========================================================================================>] 124.55M 299MB/s in 0.4s
2023-03-31 02:28:05 (299 MB/s) - 'riboFMC.tar.gz' saved [130601012/130601012]
ubuntu@test:~$ tar -zxvf riboFMC.tar.gz
(...)
Running the setup script riboFMC/setup.pl. Since java, bowtie2, samtools and perl module Bio::DB::Fasta are required, we installed them by installing corresponding ubuntu packages (openjdk-11-jre-headless bowtie2 samtools bioperl).
ubuntu@test:~$ riboFMC/setup.pl
Base directory: riboFMC
checking java...
It seems that no java installed
checking bowtie2...
It seems that no bowtie2 installed
checking samtools...
It seems that no samtools installed
checking perl module Bio::DB::Fasta...
It seems that perl module Bio::DB::Fasta is not installed
You may install it via intalling bioperl
riboFMC/db/TAIR10_chr_all.fas not existing or incorrect
Enter "yes" to download from Ensembl or Ctrl-C to exit
^C
ubuntu@test:~$ sudo apt install openjdk-11-jre-headless bowtie2 samtools bioperl
Running the setup script again. This time no messages saying something not installed. Enter "yes" to download TAIR10 genome sequences from Ensembl FTP site. Note that the downloaded version is "TAIR10" but not "TAIR10.1".
ubuntu@test:~$ tsocks riboFMC/setup.pl
Base directory: riboFMC
checking java...
checking bowtie2...
checking samtools...
checking perl module Bio::DB::Fasta...
riboFMC/db/TAIR10_chr_all.fas not existing or incorrect
Enter "yes" to download from Ensembl or Ctrl-C to exit
yes
Fixing sequence headers...done
build FASTA index...done
Assuming that file src/root-ctr-0h-r1.fastq.gz is of adapter-clean Ribo-seq FASTQ. Our first step is to remove reads matching to Araport11 tRNAs, miRNAs, snoRNAs, snRNAs, and rRNAs. The default bowtie2 index riboFMC/db/filterBase was built exactly using Araport11 tRNAs, miRNAs, snoRNAs, snRNAs, and rRNAs as input. The output FASTA gzip file root-ctr-0h-r1.fasta.gz should store reads that have no qualified alignments to Araport11 tRNAs, miRNAs, snoRNAs, snRNAs, and rRNAs.
ubuntu@test:~$ riboFMC/filter.pl
Usage: filter.pl [options] <inFASTQ> <outFASTA.gz>
-p <number of bowtie2 threads> (default 3)
-k <number of bowtie2 alignments per read> (default 30)
-x <bowtie2 index path> (default riboFMC/db/filterBase)
-minLen <minimum read length to keep> (default 25)
NOTE: <inFASTQ> can be a .gz file.
ubuntu@test:~$ riboFMC/filter.pl src/root-ctr-0h-r1.fastq.gz root-ctr-0h-r1.fasta.gz
Base directory: riboFMC
Executing command: bowtie2 -p 3 -k 30 --gbar 151 --sensitive -x riboFMC/db/filterBase -U src/root-ctr-0h-r1.fastq.gz | riboFMC/scripts/sam_filter.pl -ID 0.95 /dev/null /dev/stdout | riboFMC/scripts/sam2fas.pl -silent /dev/stdin /dev/stdout | perl -ne 'chomp; if(/^>/){ $tmp=$_ }elsif(25<=length($_) && length($_)<=50){ print "$tmp\n$_\n" }' | gzip -c > root-ctr-0h-r1.fasta.gz
Bowtie2 is running, and the following message is made by bowtie2:
58082938 reads; of these:
58082938 (100.00%) were unpaired; of these:
31150236 (53.63%) aligned 0 times
1382959 (2.38%) aligned exactly 1 time
25549743 (43.99%) aligned >1 times
46.37% overall alignment rate
Taking reads not mapping to the filter base, we mapped them to Araport11 transcriptome. The default bowtie2 index riboFMC/db/araport11.transcriptome was built exactly based on all Araport11 transcript sequences (no matter coding or non-coding), where the sequence coordinates are saved in riboFMC/db/araport11.transcriptome.cgff. In short, the mapping procedure is to map reads to Araport11 transcriptome and translate alignment coordinates from transcript coordinates to genome coordinates using infomation stored in riboFMC/db/araport11.transcriptome.cgff. Note that the output BAM file is non-sorted and read IDs inside should be in the same order as in the source FASTA file.
ubuntu@test:~$ riboFMC/mapping.pl
Usage: mapping.pl [options] <inFASTA> <outBAM>
-p <number of bowtie2 threads> (default 3)
-k <number of bowtie2 alignments per read> (default 30)
-x <bowtie2 index path> (default riboFMC/db/araport11.transcriptome)
-trans <translation coordinate> (default riboFMC/db/araport11.transcriptome.cgff)
-genome <genome FASTA> (default riboFMC/db/TAIR10_chr_all.fas)
NOTE: <inFASTA> can be a .gz file.
ubuntu@test:~$ riboFMC/mapping.pl root-ctr-0h-r1.fasta.gz root-ctr-0h-r1.bam
Executing command: bowtie2 -p 3 -k 30 --gbar 151 --sensitive -x riboFMC/db/araport11.transcriptome -f -U root-ctr-0h-r1.fasta.gz | java -classpath riboFMC/jar/rackj.jar:riboFMC/jar/sam-1.89.jar misc.AlignmentFilter2 -M SAMun /dev/stdin -O /dev/stdout -quiet true -filter ID 0.95 -filter translation riboFMC/db/araport11.transcriptome.cgff | riboFMC/scripts/psl2sam.pl riboFMC/db/TAIR10_chr_all.fas /dev/stdin | samtools view -Sbo root-ctr-0h-r1.bam -T riboFMC/db/TAIR10_chr_all.fas /dev/stdin
Bowtie2 is running, and the following message is made by bowtie2:
31509683 reads; of these:
31509683 (100.00%) were unpaired; of these:
3499134 (11.10%) aligned 0 times
14583820 (46.28%) aligned exactly 1 time
13426729 (42.61%) aligned >1 times
88.90% overall alignment rate
Finally, per-gene read counting was done based on the BAM file from the the last step against genome coordinate of selected genes. The default coordinate file riboFMC/db/target.cgff stores coordinates of Araport11 coding genes, lnc_RNAs, and ncRNAs.
ubuntu@test:~$ riboFMC/counting.pl
Usage: counting.pl [options] <inBAM> <outPrefix>
-count <counting coordinate> (default riboFMC/db/target.cgff)
NOTE: <outPrefix>.geneRPKM will store read counts and RPKM values.
ubuntu@test:~$ riboFMC/counting.pl root-ctr-0h-r1.bam root-ctr-0h-r1
Base directory: riboFMC
Executing command: java -classpath riboFMC/jar/rackj.jar:riboFMC/jar/sam-1.89.jar rnaseq.RPKMComputer -GFF riboFMC/db/target.cgff -M SAMun root-ctr-0h-r1.bam -O root-ctr-0h-r1 -direction true -geneOnly true -min 2
program: RPKMComputer
canonical GFF filename (-GFF): riboFMC/db/target.cgff
mapping method/filename (-M):
SAMun : root-ctr-0h-r1.bam
output prefix (-O): root-ctr-0h-r1
block join factor (-J): 2
identity cutoff (-ID): 0.95
use exon region (-exon): true
check by containing (-contain, FALSE for by intersecting): false
minimum overlap (-min): 2
check all alignment blocks (-ALL): true
care mapping direction (-direction): true
RPKM by gene-mapping reads only (-geneOnly): true
model filename (-model): null
#uniq reads: 3182831
#multi reads: 157133
#mapped reads: 3339964
The underlying java program generated a number of tab-delimited files prefixed with "root-ctr-0h-r1". For the purpose of measuring Ribo-seq expression level, the file to be used is "root-ctr-0h-r1.geneRPKM". This file contains five columns:
- gene ID
- sum of exon lengths (in Kbps)
- sum of uniq-reads and multi-reads
- RPKM
- ratio of multi-reads
Note that read counting of uniq-reads and multi-reads and computation of RPKMs are following the description as in PMID:18516045. Also note that .geneRPKM file will list genes in the same order if using the same counting coordinate file.
ubuntu@test:~$ head root-ctr-0h-r1.geneRPKM
#GeneID Length(Kbps) #Reads RPKM multi/ALL format:.geneRPKM
AT1G01010 1.688 105.0 18.62409040013435 0.0
AT1G01020 1.329 25.0 5.632137410051736 0.0
AT1G03987 0.272 0.0 0.0 0.0
AT1G01030 1.836 0.0 0.0 0.0
AT1G01040 6.279 67.0 3.1947915999569148 0.0
AT1G01050 1.184 351.41863837391304 88.86511423787786 0.021110544
AT1G01060 2.763 50.0 5.418103957986795 0.0
AT1G01070 1.692 62.261062367080214 11.017279882014764 0.06843864
AT1G04003 0.209 0.0 0.0 0.0
The following steps are suggest to use the template Excel file:
- Adjust numbers of control columns and treatment columns by inserting or deleting columns. Make sure Excel formulas in columns Pt and Pc are covering correct treatment and control columns respectively.
- All the .geneRPKM files from the counting step have the same order of genes. Copy and fill columns with gene information and RPKM values.
- Use Excel Autofill function to duplicate formulas in row 2 under columns Pt, Pc, ..., logFC to all rows.
The column Z-TEST is for Z-test p-values (PMID:10359602), and the column logFC is for log2-fold-change.