Riboseq multi stage mapping and filtering, part 2 - wdlingit/cop GitHub Wiki
The multi stage Riboseq read mapping procedure described in this page is based on one assumption: the existing genome annotation and corresponding RNAseq datasets provide a reliable transcriptome reference. Accordingly, mapping Riboseq reads to this transcriptome reference would be more precise than mapping Riboseq reads directly to the genome assembly. In the multi stage mapping procedure, read were firstly mapped to the transcriptome reference for precision and then mapped to the genome assembly for sensitivity.
We used to mark mapping stages with numbers. In this document, mapping stages are:
-
1-1
: exact matching to Araport11 transcripts -
1-2
: exact matching to splicing junctions inferred from RNAseq data -
2-1
: 95% identity matches to Ararpot11 transcripts, both ends (4bps) of the alignments must be exact matches -
2-2
: 95% identity matches to splicing junctions inferred from RNAseq data, both ends (4bps) of the alignments must be exact matches -
3-1
&3-2
: direct mapping to Arabidopsis genome with default and fine setting, both ends (4bps) of the alignments must be exact matches -
4-1
: 95% identity matches to Ararpot11 transcripts -
4-2
: 95% identity matches to splicing junctions inferred from RNAseq data -
5-1
&5-2
: direct mapping to Arabidopsis genome with default and fine setting - the final merge step
We also provide some real numbers when applying this multi stage mapping procedure on our Riboseq datasets.
Commands in this document are not only for multiplexed Riboseq reads. They can be applied to separate FASTQ files. Most perl oneliners will be attached with generated commands (lines started with "CMD:") from our real practices.
Recall that we have a Bowtie2 index Araport11.strand.model
in folder DB
. The following perl oneliner was to generate commands that execute our wrapper script Mapping.pl
for mapping.
wdlin@comp03:somewhere$ ls merged.fasta | perl -ne '
chomp;
/(.+?)\./;
$cmd="Mapping.pl x $_ $1.1-1.bam MappingBowtie.pl -target DB/Araport11.strand.model -p 20 -k 30 --gbar 151 --sensitive -ID 0.999 -SamTmp $1.1-1.unmapped.sam";
print "\nCMD: $cmd\n";
#system $cmd;
'
CMD: Mapping.pl x merged.fasta merged.1-1.bam MappingBowtie.pl -target DB/Araport11.strand.model -p 20 -k 30 --gbar 151 --sensitive -ID 0.999 -SamTmp merged.1-1.unmapped.sam
Points to be noticed:
- The generated command (the line started with
CMD:
) was takingmerged.fasta
as input andmerged.1-1.bam
as output. -
MappingBowtie.pl -target DB/Araport11.strand.model
: call child wrapper scriptMappingBowtie.pl
and takeDB/Araport11.strand.model
as the Bowtie2 index -
-p 20 -k 30 --gbar 151 --sensitive
: Bowtie2 parameters.-p 20
for using 20 threads.--sensitive
for Bowtie2's preset end-to-end alignment criteria.-k 30
for reporting at most 30 qualified alignments (defined by--sensitive
). Note that Bowtie2 tends to report only one alignment if this parameter was not set.--gbar 151
for asking Bowtie2 not to open any alignment gaps in the first 151bps of reads. Taking all Bowtie2 parameters together, the setting here was to ask Bowtie2 not to open any gap and report multiple qualified alignments. -
-ID 0.999 -SamTmp merged.1-1.unmapped.sam
: filtering function provided by the wrapper script. Save only alignments with identity (#matche/read length) no less than0.999
to the output BAM file, and save those unqualified alignments tomerged.1-1.unmapped.sam
. Note that our reads were all shorter than 51bps so-ID 0.999
was equal to exact matches. Also note that Bowtie2 reports also alignments of read that cannot be mapped. Somerged.1-1.bam
andmerged.1-1.unmapped.sam
should contain all reads inmerged.fasta
.
The alignments in merged.1-1.bam
were mapped to Araport11 transcripts. We applied the following perl oneliner to generate piped commands for transforming alignments to genomic coordinates.
wdlin@comp03:somewhere$ ls *.1-1.bam | perl -ne '
chomp; /(.+?)\./;
$cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAM $_ -O /dev/stdout -Un $1.1-1.filtered.sam -quiet true -filter translation DB/Araport11.strand.model | psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin | samtools view -Sbo $1.1-1.translated.bam -T DB/TAIR10_chr_all.fas /dev/stdin";
print "\nCMD: $cmd\n";
#system $cmd;
'
CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAM merged.1-1.bam -O /dev/stdout -Un merged.1-1.filtered.sam -quiet true -filter translation DB/Araport11.strand.model | psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin | samtools view -Sbo merged.1-1.translated.bam -T DB/TAIR10_chr_all.fas /dev/stdin
Description:
-
-M SAM merged.1-1.bam
was to ask the java programmisc.AlignmentFilter2
to readmerged.1-1.bam
as a SAM file. -
-O /dev/stdout
was to write alignment records (after coordinate transformation) to standard output so that the next part of the pipe can handle them. -
-Un merged.1-1.filtered.sam
: save any alignments that found no annotation for coordinate transformation to filemerged.1-1.filtered.sam
. Theoretically, there should be no reads cannot be transformed. -
-quiet true
was to ask the java programmisc.AlignmentFilter2
not to output program messages other than alignment records. -
-filter translation DB/Araport11.strand.model
: use thetranslate
function to translate alignments based on coordinates save inDB/Araport11.strand.model
. Recall thatDB/Araport11.strand.model
saves exon coordinates of all Araport11 transcripts. -
psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin
: Read PSL format input and translate them into SAM records with SAM header information. Note thatDB/TAIR10_chr_all.fas
is the genome assembly that we downloaded from EnsemblPlants. -
samtools view -Sbo merged.1-1.translated.bam -T DB/TAIR10_chr_all.fas /dev/stdin
: Read SAM format input and save them in a BAM file namedmerged.1-1.translated.bam
.
There are a number of reasons for us to perform this coordinate transforming step:
- We compute read counts for genes by computing intersections between alignments and exon intervals of genes in genomic coordinates.
- BAM files are usually much smaller than SAM files.
- SAMtools operations like
merge
andsort
accept only BAM files.
The next step was to collect those unqualified reads for the next stage.
wdlin@comp03:somewhere$ ls *.1-1.*.sam | perl -ne '
chomp;
if(/(.+?)_R\d\./){
$key=$1
}else{
/(.+?)\./;
$key=$1
}
push @{$hash{$key}},$_;
if(eof){
for $key (sort keys %hash){
$cmd="cat @{$hash{$key}} | sam2fas.pl /dev/stdin $key.1-1.tmp.fasta";
print "\nCMD: $cmd\n";
#system $cmd
}
}
'
CMD: cat merged.1-1.filtered.sam merged.1-1.unmapped.sam | sam2fas.pl /dev/stdin merged.1-1.tmp.fasta
The generated command was to take unqualified SAM records in merged.1-1.filtered.sam
and merged.1-1.unmapped.sam
and save these reads into merged.1-1.tmp.fasta
.
In addition to transcripts in the database, RNAseq datasets (especially those made under the same experiment conditions with Riboseq dataset) can provide a good reference for Riboseq read mapping. In this stage, we generated junction-spanning sequences where the junctions were not covered by the Araport11 transcriptome database. In so doing, mapping Riboseq reads to these sequences can help us to obtain alignments cannot be inferred by Araport11 transcriptome.
wdlin@comp03:somewhere$ ls RNAseq_folder/rnaseq_*.merged.bam | perl -ne '
chomp;
push @arr,$_;
if(eof){
$cmd="java -classpath rackj.jar:sam-1.89.jar special.JunctionCGFF -M SAM ".join(" -M SAM ",@arr)." -O DB/RNAseqJunction -minB 10 -minDepth 10 -minSplicingPos 4 -flanking 49 -model DB/Araport11.strand.model";
print "\nCMD: $cmd\n";
#system $cmd
}
'
wdlin@comp03:somewhere$ head DB/RNAseqJunction.cgff
>splice.1 Chr1 4228 4554
4228 4276
4506 4554
>splice.2 Chr1 7399 7612
7399 7447
7564 7612
>splice.3 Chr1 7780 7990
7780 7828
7942 7990
>splice.4 Chr1 7783 7990
In above block, the perl oneliner command was to list all RNAseq BAM files (sorted by name) and generate the command to compute all junctions in those RNAseq BAM file. Reported junction coordinates would be represented by two 49bp intervals (-flanking 49
, also see first few lines of DB/RNAseqJunction.cgff
), where each of them were supported by at least 10 reads (-minDepth 10
). Junctions available in the Araport11 database were removed (-model DB/Araport11.strand.model
).
Since these junctions unavailable in the Araport11 database were saved as coordinates of two intervals, we applied SeqGen.pl
to generate those 98bp junction-spanning sequences and built a Bowtie2 index, where the Bowtie2 index name is araport11.rnaJunction
.
wdlin@comp03:somewhere/DB$ SeqGen.pl araport11.rnaJunction.fasta TAIR10_chr_all.fas RNAseqJunction.cgff
wdlin@comp03:somewhere/DB$ bowtie2-build araport11.rnaJunction.fasta araport11.rnaJunction
The following perl oneliner was to make sure sequences were all 98bp long.
(check length distribution of junction sequences)
wdlin@comp03:somewhere$ cat DB/araport11.rnaJunction.fasta | perl -ne '
chomp;
if(/^>(\S+)/){
$id=$1
}else{
$hash{$id}+=length($_)
}
if(eof){
for $key (keys %hash){
print "$hash{$key}\n"
}
}
' | perl -ne '
chomp;
$hash{$_}++;
if(eof){
for $key (sort {$a<=>$b} keys %hash){
print "$key\t$hash{$key}\n"
}
}
'
98 57991
Generating these junction-spanning sequences was aimed to align junction-spanning Riboseq reads correctly. In order to unqualify those Riboseq reads mapping to junction-spanning sequences but not crossing the junctions, the following perl oneliner was applied to generate coordinates of mid points of junction-spanning seqeunces.
wdlin@comp03:somewhere$ cat DB/araport11.rnaJunction.fasta | perl -ne '
chomp;
if(/^>(\S+)/){
$id=$1
}else{
$hash{$id}+=length($_)
}
if(eof){
for $key (sort {$a cmp $b} keys %hash){
print "$key\t$hash{$key}\n"
}
}
' | perl -ne '
chomp;
($id,$len)=split;
print ">$id\t$id\t".($len/2)."\t".($len/2+1)."\n".($len/2)."\t".($len/2+1)."\n"
' > DB/araport11.rnaJunction.mid.cgff
wdlin@comp03:somewhere$ head DB/araport11.rnaJunction.mid.cgff
>splice.1 splice.1 49 50
49 50
>splice.10 splice.10 49 50
49 50
>splice.100 splice.100 49 50
49 50
>splice.1000 splice.1000 49 50
49 50
>splice.10000 splice.10000 49 50
49 50
The next three steps were very similar with what we had done in stage 1-1
.
wdlin@comp03:somewhere$ ls *.1-1.tmp.fasta | perl -ne '
chomp;
/(.+?)\./;
$cmd="Mapping.pl x $_ $1.1-2.bam MappingBowtie.pl -target DB/araport11.rnaJunction -p 20 -k 30 --gbar 151 --sensitive -ID 0.999 -SamTmp $1.1-2.tmp.sam";
print "\nCMD: $cmd\n";
#system $cmd;
'
CMD: Mapping.pl x merged.1-1.tmp.fasta merged.1-2.bam MappingBowtie.pl -target DB/araport11.rnaJunction -p 20 -k 30 --gbar 151 --sensitive -ID 0.999 -SamTmp merged.1-2.tmp.sam
The generated command was to take merged.1-1.tmp.fasta
(containing unqualified reads in stage 1-1
) as the input, mapping them to the junction-spanning sequences using Bowtie2 (MappingBowtie.pl -target DB/araport11.rnaJunction
). Again, we asked Bowtie2 not to open gaps and report multiple qualified alignments under its --sensitive
criteria. The wrapper options -ID 0.999 -SamTmp merged.1-2.tmp.sam
ensured only perfect match been saved in merged.1-2.bam
and unqualified records saved in merged.1-2.tmp.sam
.
wdlin@comp03:somewhere$ ls *.1-2.bam | perl -ne '
chomp;
/(.+?)\./;
$cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAM $_ -O /dev/stdout -Un $1.1-2.filtered.sam -quiet true -filter gene -GFF DB/araport11.rnaJunction.mid.cgff -min 2 -filter translation DB/RNAseqJunction.cgff | psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin | samtools view -Sbo $1.1-2.translated.bam -T DB/TAIR10_chr_all.fas /dev/stdin";
print "\nCMD: $cmd\n";
#system $cmd;
'
CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAM merged.1-2.bam -O /dev/stdout -Un merged.1-2.filtered.sam -quiet true -filter gene -GFF DB/araport11.rnaJunction.mid.cgff -min 2 -filter translation DB/RNAseqJunction.cgff | psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin | samtools view -Sbo merged.1-2.translated.bam -T DB/TAIR10_chr_all.fas /dev/stdin
The generated command was to take merged.1-2.bam
as the input, accept only alignments crossing mid points of junction-spanning sequences (-filter gene -GFF DB/araport11.rnaJunction.mid.cgff -min 2
), and transform accepted alignments to genomic coordinates (-filter translation DB/RNAseqJunction.cgff
). The outputted PSL format records were read by the psl2sam.pl
script and convered into SAM records. The last part of samtools view
was to convert SAM records into BAM. Here, unqualified reads would be saved in merged.1-2.filtered.sam
.
wdlin@comp03:somewhere$ ls *.1-2.*.sam | perl -ne '
chomp;
if(/(.+?)_R\d\./){
$key=$1
}else{
/(.+?)\./;
$key=$1
}
push @{$hash{$key}},$_;
if(eof){
for $key (sort keys %hash){
$cmd="cat @{$hash{$key}} | sam2fas.pl /dev/stdin $key.1-2.tmp.fasta";
print "\nCMD: $cmd\n";
#system $cmd
}
}
'
CMD: cat merged.1-2.filtered.sam merged.1-2.tmp.sam | sam2fas.pl /dev/stdin merged.1-2.tmp.fasta
The generated command was to take unqualified SAM records in merged.1-2.filtered.sam
and merged.1-2.tmp.sam
and save those reads in merged.1-2.tmp.fasta
.
2-1
: 95% identity matches to Ararpot11 transcripts, both ends (4bps) of the alignments must be exact matches
wdlin@comp03:somewhere$ ls *.1-2.tmp.fasta | perl -ne '
chomp;
/(.+?)\./;
$cmd="Mapping.pl x $_ $1.2-1.bam MappingBowtie.pl -target DB/Araport11.strand.model -p 20 -k 30 --gbar 151 --sensitive";
print "\nCMD: $cmd\n";
#system $cmd;
'
CMD: Mapping.pl x merged.1-2.tmp.fasta merged.2-1.bam MappingBowtie.pl -target DB/Araport11.strand.model -p 20 -k 30 --gbar 151 --sensitive
The generated command was to take merged.1-2.tmp.fasta
(containing unqualified reads in stage 1-2
) as the input and map them to Araport11 transcripts using Bowtie2 (MappingBowtie.pl -target DB/Araport11.strand.model
). Again, we asked Bowtie2 not to open gaps and report multiple qualified alignments under its --sensitive
criteria. This time, the wrapper options -ID
and -SamTmp
were not applied because some sophisticated alignment filtering were needed when we did not require perfect matches using Bowtie2. Note that all reads with and without alignment records were saved in the output merged.2-1.bam
file.
wdlin@comp03:somewhere$ ls *.2-1.bam | perl -ne '
chomp;
/(.+?)\./;
$cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun $_ -O /dev/stdout -Un $1.2-1.unmapped.sam -quiet true -filter translation DB/Araport11.strand.model | psl2samG.pl -md DB/TAIR10_chr_all.fas /dev/stdin | java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun /dev/stdin -O /dev/stdout -Un $1.2-1.filtered.sam -quiet true -filter ID 0.95 -filter ExactEnds 4 -filter ExactIntron 2 | samtools view -Sbo $1.2-1.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin";
print "\nCMD: $cmd\n";
#system $cmd;
'
CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun merged.2-1.bam -O /dev/stdout -Un merged.2-1.unmapped.sam -quiet true -filter translation DB/Araport11.strand.model | psl2samG.pl -md DB/TAIR10_chr_all.fas /dev/stdin | java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun /dev/stdin -O /dev/stdout -Un merged.2-1.filtered.sam -quiet true -filter ID 0.95 -filter ExactEnds 4 -filter ExactIntron 2 | samtools view -Sbo merged.2-1.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin
In the generated command, we piped four commands:
- The first
misc.AlignmentFilter2
: readmerged.2-1.bam
and transform alignment coordinates to the genome. Note that reads reported without alignment by Bowtie2 were saved tomerged.2-1.unmapped.sam
. Note that the coordinate transformation function always generate records in PSL format. -
psl2samG.pl
read PSL format records made by the firstmisc.AlignmentFilter2
and converted them into SAM format. - The second
misc.AlignmentFilter2
: qualified alignments in genomic coordinates with three criteria: (i) identity>=0.95 (#match/read length), (ii) the four base pairs at the two ends must be matches (-filter ExactEnds 4
), and (iii) for every intronic gap created by the coordinate transformation (recalled that gaps were not allowed in the Bowtie2 step), the two base pairs surrounding the gap must be matches (-filter ExactIntron 2
). Unqualified records were saved inmerged.2-1.filtered.sam
. -
samtools view
converted qualified SAM records into BAM filemerged.2-1.pass.bam
Filtering criteria in point 3 were made based on some observations when applying Bowtie2:
- In Bowtie2's end-to-end mode (for example,
--sensitive
), end bases might be mismatches. - Mismatches in bases surrounding intron junctions may mean some other isoforms not included in the reference.
And our last step for this stage was to save reads in merged.2-1.filtered.sam
and merged.2-1.unmapped.sam
into merged.2-1.tmp.fasta
for the next stage.
wdlin@comp03:somewhere$ ls *.2-1.*.sam | perl -ne '
chomp;
if(/(.+?)_R\d\./){
$key=$1
}else{
/(.+?)\./;
$key=$1
}
push @{$hash{$key}},$_;
if(eof){
for $key (sort keys %hash){
$cmd="cat @{$hash{$key}} | sam2fas.pl /dev/stdin $key.2-1.tmp.fasta";
print "\nCMD: $cmd\n";
#system $cmd
}
}
'
CMD: cat merged.2-1.filtered.sam merged.2-1.unmapped.sam | sam2fas.pl /dev/stdin merged.2-1.tmp.fasta
2-2
: 95% identity matches to splicing junctions inferred from RNAseq data, both ends (4bps) of the alignments must be exact matches
Commands for this stage are similar to those in 2-1
. The first one was to map reads in merged.2-1.tmp.fasta
to junction-spanning sequences.
wdlin@comp03:somewhere$ ls *.2-1.tmp.fasta | perl -ne '
chomp;
/(.+?)\./;
$cmd="Mapping.pl x $_ $1.2-2.bam MappingBowtie.pl -target DB/araport11.rnaJunction -p 20 -k 30 --gbar 151 --sensitive";
print "\nCMD: $cmd\n";
#system $cmd;
'
CMD: Mapping.pl x merged.2-1.tmp.fasta merged.2-2.bam MappingBowtie.pl -target DB/araport11.rnaJunction -p 20 -k 30 --gbar 151 --sensitive
Qualified alignments by checking if reads were crossing mid points of junction-spanning sequences. Also, required alignments containing exact matches at the two ends, as well as exact matches surrounding intronic gaps.
wdlin@comp03:somewhere$ ls *.2-2.bam | perl -ne '
chomp;
/(.+?)\./;
$cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun $_ -O /dev/stdout -Un $1.2-2.unmapped.sam -quiet true -filter gene -GFF DB/araport11.rnaJunction.mid.cgff -min 2 -filter translation DB/RNAseqJunction.cgff | psl2samG.pl -md DB/TAIR10_chr_all.fas /dev/stdin | java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun /dev/stdin -O /dev/stdout -Un $1.2-2.filtered.sam -quiet true -filter ID 0.95 -filter ExactEnds 4 -filter ExactIntron 2 | samtools view -Sbo $1.2-2.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin";
print "\nCMD: $cmd\n";
#system $cmd;
'
CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun merged.2-2.bam -O /dev/stdout -Un merged.2-2.unmapped.sam -quiet true -filter gene -GFF DB/araport11.rnaJunction.mid.cgff -min 2 -filter translation DB/RNAseqJunction.cgff | psl2samG.pl -md DB/TAIR10_chr_all.fas /dev/stdin | java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun /dev/stdin -O /dev/stdout -Un merged.2-2.filtered.sam -quiet true -filter ID 0.95 -filter ExactEnds 4 -filter ExactIntron 2 | samtools view -Sbo merged.2-2.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin
Saved reads of unqualified alignments in merged.2-2.filtered.sam
and merged.2-2.unmapped.sam
into merged.2-2.tmp.fasta
.
wdlin@comp03:somewhere$ ls *.2-2.*.sam | perl -ne '
chomp;
if(/(.+?)_R\d\./){
$key=$1
}else{
/(.+?)\./;
$key=$1
}
push @{$hash{$key}},$_;
if(eof){
for $key (sort keys %hash){
$cmd="cat @{$hash{$key}} | sam2fas.pl /dev/stdin $key.2-2.tmp.fasta";
print "\nCMD: $cmd\n";
#system $cmd
}
}
'
CMD: cat merged.2-2.filtered.sam merged.2-2.unmapped.sam | sam2fas.pl /dev/stdin merged.2-2.tmp.fasta
3-1
& 3-2
: direct mapping to Arabidopsis genome with default and fine setting, both ends (4bps) of the alignments must be exact matches
In stages 1-*
and 2-*
, reads were mapped to the transcriptome reference and than alignments were transformed into genomic coordinates. In stages 3-*
, reads were mapped to the genome assembly directly using BLAT.
The first perl oneliner was to generate a command that used the wrapper script Mapping.pl
for mapping.
wdlin@comp03:somewhere$ ls *.2-2.tmp.fasta | perl -ne '
chomp;
/(.+?)\./;
$cmd="Mapping.pl -split 20 x $_ $1.3-1.bam MappingBlat.pl -unmap -md -target DB/TAIR10_chr_all.fas -q=rna -t=dna";
print "\nCMD: $cmd\n";
#system $cmd;
'
CMD: Mapping.pl -split 20 x merged.2-2.tmp.fasta merged.3-1.bam MappingBlat.pl -unmap -md -target DB/TAIR10_chr_all.fas -q=rna -t=dna
Points to be noticed:
- input file
merged.2-2.tmp.fasta
contained all unqualified reads from the last stage. - wrapper option
-split 20
would split the input file into 20 shares and fork 20 processes (not threads) for running BLAT tasks. - child wrapper
MappingBlat.pl
was applied for calling BLAT. Its option-unmap
was to include unmapped reads in the output SAM/BAM file, where BLAT was not designed to include unmapped reads. The option-md
was to addMD
attributes into the SAM records.-target
specified the database file for BLAT mapping. -
-q=rna -t=dna
are BLAT options for mapping transcriptomic sequences to genome sequences.
Accordingly, merged.3-1.bam
saved all records (mapped and unmapped) for merged.2-2.tmp.fasta
. The next perl oneliner was to generate the filtering command to deal with merged.3-1.bam
.
wdlin@comp03:somewhere$ ls *.3-1.bam | perl -ne '
chomp;
/(.+?)\./;
$cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun $_ -O /dev/stdout -Un $1.3-1.filtered.sam -quiet true -filter ID 0.95 -filter J 2 -filter minB 8 -filter J reset -filter ExactEnds 4 -filter indel deletion -mode maximum -max 1000 | samtools view -Sbo $1.3-1.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin";
print "\nCMD: $cmd\n";
#system $cmd;
'
CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun merged.3-1.bam -O /dev/stdout -Un merged.3-1.filtered.sam -quiet true -filter ID 0.95 -filter J 2 -filter minB 8 -filter J reset -filter ExactEnds 4 -filter indel deletion -mode maximum -max 1000 | samtools view -Sbo merged.3-1.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin
Key points to be noticed:
- Input file here was
merged.3-1.bam
, which contained mapped and unmapped records.merged.3-1.filtered.sam
saved all unqualified records. - The filtering options were to require (i) alignments with identity>=0.95, (ii) alignment blocks (continuous mapping intervals in both query side and target side of the alignment) no shorter than 8bps, (iii) 4bp exact matches at the two ends of the alignments, and (iv) intronic gaps no greater than 1000bps (
-filter indel deletion -mode maximum -max 1000
).
Reasons for the filter criteria (above points 2(ii) and 2(iv)):
- BLAT tends to randomly find matches of size down to 6bps when the two ends of a read have no good alignment to the genome.
- Limiting the intronic gap to 1000bps should further ensure that only reliable alignments were accepted. 1000bps seems a reasonable upper bound of intron sizes for arabidopsis. This also reduced the possibility to have randomly mapped 8bp blocks at the two ends.
The last command of this stage was to collect reads without qualified alignments into merged.3-1.tmp.fasta
for the next stage.
wdlin@comp03:somewhere$ ls *.3-1.*.sam | perl -ne '
chomp;
if(/(.+?)_R\d\./){
$key=$1
}else{
/(.+?)\./;
$key=$1
}
push @{$hash{$key}},$_;
if(eof){
for $key (sort keys %hash){
$cmd="cat @{$hash{$key}} | sam2fas.pl /dev/stdin $key.3-1.tmp.fasta";
print "\nCMD: $cmd\n";
#system $cmd
}
}
'
CMD: cat merged.3-1.filtered.sam | sam2fas.pl /dev/stdin merged.3-1.tmp.fasta
The next three commands for 3-2
were very similar with those for 3-1
, except that the additional BLAT options -minMatch=1 -minScore=14 -tileSize=9
ensured a fine scanning of BLAT. Accordingly, this made the mapping step very time consuming. You may consider to skip this stage as it used to produce no more than 2% of effective alignments.
wdlin@comp03:somewhere$ ls *.3-1.tmp.fasta | perl -ne '
chomp;
/(.+?)\./;
$cmd="Mapping.pl -split 20 x $_ $1.3-2.bam MappingBlat.pl -unmap -md -target DB/TAIR10_chr_all.fas -q=rna -t=dna -minMatch=1 -minScore=14 -tileSize=9";
print "\nCMD: $cmd\n";
#system $cmd;
'
CMD: Mapping.pl -split 20 x merged.3-1.tmp.fasta merged.3-2.bam MappingBlat.pl -unmap -md -target DB/TAIR10_chr_all.fas -q=rna -t=dna -minMatch=1 -minScore=14 -tileSize=9
wdlin@comp03:somewhere$ ls *.3-2.bam | perl -ne '
chomp;
/(.+?)\./;
$cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun $_ -O /dev/stdout -Un $1.3-2.filtered.sam -quiet true -filter ID 0.95 -filter J 2 -filter minB 8 -filter J reset -filter ExactEnds 4 -filter indel deletion -mode maximum -max 1000 | samtools view -Sbo $1.3-2.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin";
print "\nCMD: $cmd\n";
#system $cmd;
'
CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun merged.3-2.bam -O /dev/stdout -Un merged.3-2.filtered.sam -quiet true -filter ID 0.95 -filter J 2 -filter minB 8 -filter J reset -filter ExactEnds 4 -filter indel deletion -mode maximum -max 1000 | samtools view -Sbo merged.3-2.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin
wdlin@comp02:somewhere$ ls *.3-2.*.sam | perl -ne '
chomp;
if(/(.+?)_R\d\./){
$key=$1
}else{
/(.+?)\./;
$key=$1
}
push @{$hash{$key}},$_;
if(eof){
for $key (sort keys %hash){
$cmd="cat @{$hash{$key}} | sam2fas.pl /dev/stdin $key.3-2.tmp.fasta";
print "\nCMD: $cmd\n";
#system $cmd
}
}
'
CMD: cat merged.3-2.filtered.sam | sam2fas.pl /dev/stdin merged.3-2.tmp.fasta
The three commands for this stage were very similar with those for stage 2-1
, except that
- The Bowtie2 option
--sensitive
was modified to--sensitive-local
.--sensitive-local
is a Bowtie2 preset option for local alignment. According to our observation,--sensitive-local
deals the two alignment ends better than--sensitive
when it is impossible to have exact matches at the two ends. - The filtering option
-filter ExactEnds 4
were remove because alignments with exact matches at the two ends were collected in previous stages. Also, Bowtie2 with option--sensitive-local
would tend to report alignments without exact matches at the two ends.
wdlin@comp02:somewhere$ ls *.3-2.tmp.fasta | perl -ne '
chomp;
/(.+?)\./;
$cmd="Mapping.pl x $_ $1.4-1.bam MappingBowtie.pl -target DB/Araport11.strand.model -p 20 -k 30 --gbar 151 --sensitive-local";
print "\nCMD: $cmd\n";
#system $cmd;
'
CMD: Mapping.pl x merged.3-2.tmp.fasta merged.4-1.bam MappingBowtie.pl -target DB/Araport11.strand.model -p 20 -k 30 --gbar 151 --sensitive-local
wdlin@comp02:somewhere$ ls *.4-1.bam | perl -ne '
chomp;
/(.+?)\./;
$cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun $_ -O /dev/stdout -Un $1.4-1.unmapped.sam -quiet true -filter ID 0.95 -filter translation DB/Araport11.strand.model | psl2samG.pl -md DB/TAIR10_chr_all.fas /dev/stdin | java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun /dev/stdin -O /dev/stdout -Un $1.4-1.filtered.sam -quiet true -filter ExactIntron 2 | samtools view -Sbo $1.4-1.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin";
print "\nCMD: $cmd\n";
#system $cmd
'
CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun merged.4-1.bam -O /dev/stdout -Un merged.4-1.unmapped.sam -quiet true -filter ID 0.95 -filter translation DB/Araport11.strand.model | psl2samG.pl -md DB/TAIR10_chr_all.fas /dev/stdin | java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun /dev/stdin -O /dev/stdout -Un merged.4-1.filtered.sam -quiet true -filter ExactIntron 2 | samtools view -Sbo merged.4-1.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin
wdlin@comp02:somewhere$ ls *.4-1.*.sam | perl -ne '
chomp;
if(/(.+?)_R\d\./){
$key=$1
}else{
/(.+?)\./;
$key=$1
}
push @{$hash{$key}},$_;
if(eof){
for $key (sort keys %hash){
$cmd="cat @{$hash{$key}} | sam2fas.pl /dev/stdin $key.4-1.tmp.fasta";
print "\nCMD: $cmd\n";
#system $cmd
}
}
'
CMD: cat merged.4-1.filtered.sam merged.4-1.unmapped.sam | sam2fas.pl /dev/stdin merged.4-1.tmp.fasta
The three commands for this stage were very similar with those for stage 2-2
. The two exceptions were --sensitive-local
for Bowtie2 and no -filter ExactEnds 4
for filtering, by similar reasons as described for stage 4-1
.
wdlin@comp02:somewhere$ ls *.4-1.tmp.fasta | perl -ne '
chomp;
/(.+?)\./;
$cmd="Mapping.pl x $_ $1.4-2.bam MappingBowtie.pl -target DB/araport11.rnaJunction -p 20 -k 30 --gbar 151 --sensitive-local";
print "\nCMD: $cmd\n";
#system $cmd;
'
CMD: Mapping.pl x merged.4-1.tmp.fasta merged.4-2.bam MappingBowtie.pl -target DB/araport11.rnaJunction -p 20 -k 30 --gbar 151 --sensitive-local
wdlin@comp02:somewhere$ ls *.4-2.bam | perl -ne '
chomp;
/(.+?)\./;
$cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun $_ -O /dev/stdout -Un $1.4-2.unmapped.sam -quiet true -filter gene -GFF DB/araport11.rnaJunction.mid.cgff -min 2 -filter translation DB/RNAseqJunction.cgff | psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin | java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun /dev/stdin -O /dev/stdout -Un $1.4-2.filtered.sam -quiet true -filter ID 0.95 -filter ExactIntron 2 | samtools view -Sbo $1.4-2.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin";
print "\nCMD: $cmd\n";
#system $cmd
'
CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun merged.4-2.bam -O /dev/stdout -Un merged.4-2.unmapped.sam -quiet true -filter gene -GFF DB/araport11.rnaJunction.mid.cgff -min 2 -filter translation DB/RNAseqJunction.cgff | psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin | java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun /dev/stdin -O /dev/stdout -Un merged.4-2.filtered.sam -quiet true -filter ID 0.95 -filter ExactIntron 2 | samtools view -Sbo merged.4-2.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin
wdlin@comp02:somewhere$ ls *.4-2.*.sam | perl -ne '
chomp;
if(/(.+?)_R\d\./){
$key=$1
}else{
/(.+?)\./;
$key=$1
}
push @{$hash{$key}},$_;
if(eof){
for $key (sort keys %hash){
$cmd="cat @{$hash{$key}} | sam2fas.pl /dev/stdin $key.4-2.tmp.fasta";
print "\nCMD: $cmd\n";
#system $cmd
}
}
'
CMD: cat merged.4-2.filtered.sam merged.4-2.unmapped.sam | sam2fas.pl /dev/stdin merged.4-2.tmp.fasta
Steps of these two stages were similar with those of 3-1
and 3-2
. The only exception is that the filtering option -filter ExactEnds 4
were removed, i.e., not requiring exact matches at the two ends. However, we didn't redo the mapping steps because the mapping processes using BLAT were time consuming. Considering that all alignments had already been done in steps 3-1
and 3-2
, all we need to do was to extract them from merged.3-1.tmp.bam
and merged.3-2.tmp.bam
.
wdlin@comp02:somewhere$ ls *.3-?.bam | perl -ne '
chomp;
/(.+?)\.(.+?)\./;
$cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun $_ -O /dev/stdout -quiet true -filter ID 0.95 -filter J 2 -filter minB 8 -filter J reset -filter indel deletion -mode maximum -max 1000 | samtools view -Sbo $1.$2.tmp.bam -T DB/TAIR10_chr_all.fas /dev/stdin";
print "\nCMD: $cmd\n";
#system $cmd
'
CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun merged.3-1.bam -O /dev/stdout -quiet true -filter ID 0.95 -filter J 2 -filter minB 8 -filter J reset -filter indel deletion -mode maximum -max 1000 | samtools view -Sbo merged.3-1.tmp.bam -T DB/TAIR10_chr_all.fas /dev/stdin
CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun merged.3-2.bam -O /dev/stdout -quiet true -filter ID 0.95 -filter J 2 -filter minB 8 -filter J reset -filter indel deletion -mode maximum -max 1000 | samtools view -Sbo merged.3-2.tmp.bam -T DB/TAIR10_chr_all.fas /dev/stdin
The two generated commands were doing almost the same filtering but not requiring exact matches at the two ends. That is, the two output BAM files, merged.3-1.tmp.bam
and merged.3-2.tmp.bam
were supersets of merged.3-1.pass.bam
and merged.3-2.pass.bam
, respectively.
Also, because the filtering criteria was relaxed, merged.3-1.tmp.bam
and merged.3-2.tmp.bam
might contain reads in common. To exclude reads have qualified alignments in merged.3-1.tmp.bam
from merged.3-2.tmp.bam
, we applied SplitSelector.pl
. To make this script work with large BAM files, be sure to follow these steps for installing some perl modules. In short, SplitSelector.pl
reads two lists and makes extractions from the second list based on the first list and options. In this example, the SplitSelector.pl
was to extract BAM records from merged.3-2.tmp0.bam
(renamed from merged.3-2.tmp.bam
) by excluding reads in BAM file merged.3-1.tmp.bam
(--reverse 1
for reversed selection). The output file was named merged.3-2.tmp.bam
. In so doing, we have updated merged.3-2.tmp.bam
contained no reads in common with ``merged.3-1.tmp.bam`.
wdlin@comp02:somewhere$ mv merged.3-2.tmp.bam merged.3-2.tmp0.bam
wdlin@comp02:somewhere$ SplitSelector.pl --split 10 --reverse 1 BAM merged.3-1.tmp.bam BAM merged.3-2.tmp0.bam merged.3-2.tmp.bam
wdlin@comp02:somewhere$ SplitSelector.pl
Usage: SplitSelector.pl [options] <listFile format {LIST|FASTA|FASTQ|SAM|BAM}> <listFile> <seqFile format {FASTA|FASTQ|SAM|BAM}> <seqFile> <outputFile>
Options:
--idBy list="<ID|SEQ>"
--idBy seq="<ID|SEQ|PROG \"...\">"
--space list="<w>" replace space characters in read IDs with <w> on listFile. (default: no replacement)
--space seq="<w>" replace space characters in read IDs with <w> on seqFile. (default: no replacement)
--slash list="<x>" replace slash characters (/) in read IDs with <x> on listFile. (default: no replacement)
--slash seq="<x>" replace slash characters (/) in read IDs with <x> on seqFile. (default: no replacement)
--idTrim <y> trim last <y> characters of sequence IDs for hash keys. (default: 0)
--noPair remove a hash key if two or more IDs generate this same key. (default: 0)
--mateOnly select only mate reads. (default: 0)
--reverse reverse selection. (default: 0)
--report <RAW|FASTA|ID|SEQ>
--split <z> split seqFile into <z> files for parallel selection (defalut: 1).
--noSplitSameID not to split same sequence ID (default: 1)
--tmpdir <dir> assign a directory to generate the temporary directory, for <z> greater than 1. (default: current working dir)
--dump dump lookup IDs only (default: no dump)
Recall that merged.4-2.tmp.fasta
was containing all reads not qualified in all above stages (from 1-*
to 4-*
), and we have merged.3-1.tmp.bam
and merged.3-2.tmp.bam
containing alignments made by BLAT without filtering on alignment ends. Applying SplitSelector.pl
again should let us obtain alignments for stages 5-1
and 5-2
without rerunning the time consuming BLAT processes.
wdlin@comp02:somewhere$ SplitSelector.pl --split 10 FASTA merged.4-2.tmp.fasta BAM merged.3-1.tmp.bam merged.5-1.pass.bam
wdlin@comp02:somewhere$ SplitSelector.pl --split 10 FASTA merged.4-2.tmp.fasta BAM merged.3-2.tmp.bam merged.5-2.pass.bam
Some stages might produce no alignments when the total number of Riboseq reads was low. The following perl oneliner was applied to see if any BAM file contains no alignments. In our this case, there is no such BAM files.
wdlin@comp02:somewhere$ ls *.pass.bam *.translated.bam | perl -ne '
chomp;
$cmd="samtools view $_ | head -1000 | sam_filter.pl -ID 1.1 -msg $_ /dev/null /dev/null"; system $cmd
' | perl -ne '
chomp;
/\((.+?)\).+reads: (\d+)/;
print "$1\t$2\n"
' | perl -ne '
chomp;
@t=split;
$cmd="rm $t[0]";
if($t[1]==0){
print "\nCMD: $cmd\n";
#system $cmd
}
'
The following perl oneliner was used to generate a command for combining BAM files of pass alignments. Note that all above steps produced BAM files sorted by name so that the -n
option was applied for samtools merge
.
wdlin@comp02:somewhere$ ls *.pass.bam *.translated.bam | perl -ne '
chomp;
/(.+?)\./;
push @{$hash{$1}},$_;
if(eof){
for $key (sort keys %hash){
$cmd="samtools merge -fn $key.merged.bam @{$hash{$key}}";
print "\nCMD: $cmd\n";
#system $cmd
}
}
'
CMD: samtools merge -fn merged.merged.bam merged.1-1.translated.bam merged.1-2.translated.bam merged.2-1.pass.bam merged.2-2.pass.bam merged.3-1.pass.bam merged.3-2.pass.bam merged.4-1.pass.bam merged.4-2.pass.bam merged.5-1.pass.bam merged.5-2.pass.bam
Here are Riboseq read counts in the multi stage mapping procedure from one of our study. The direct araport11 row provides numbers of mapped reads by mapping reads only to Araport11 transcriptome using the mapping approach in this page without applying its filtering step.
sample | root-ctr-0h-r1 | root-ctr-0h-r2 | root-ctr-0h-r3 | root-ctr-0h-r4 | shoot-ctr-0h-r1 | shoot-ctr-0h-r2 | shoot-ctr-0h-r3 | shoot-ctr-0h-r4 |
---|---|---|---|---|---|---|---|---|
#read (25~50bps) | 57,485,656 | 67,069,781 | 53,221,915 | 55,725,679 | 59,291,669 | 52,932,987 | 61,515,314 | 64,121,338 |
1-1 | 37,303,514 | 43,484,682 | 33,466,131 | 32,998,508 | 37,387,833 | 30,684,389 | 37,374,588 | 41,243,699 |
1-2 | 5,874 | 3,339 | 1,580 | 7,508 | 5,243 | 3,897 | 7,512 | 2,294 |
2-1 | 1,459,963 | 1,428,306 | 1,288,558 | 810,410 | 2,903,377 | 1,237,592 | 2,450,520 | 1,308,064 |
2-2 | 1,047 | 504 | 235 | 1,160 | 583 | 241 | 763 | 172 |
3-1 | 443,840 | 891,183 | 361,960 | 927,978 | 469,497 | 446,371 | 348,467 | 597,956 |
3-2 | 470,724 | 407,718 | 288,529 | 376,342 | 374,567 | 299,812 | 306,810 | 248,776 |
4-1 | 12,571,122 | 17,145,748 | 13,243,792 | 16,137,106 | 14,090,490 | 16,855,643 | 14,662,137 | 16,906,860 |
4-2 | 1,058 | 509 | 238 | 965 | 722 | 602 | 985 | 349 |
5-1 | 1,088,251 | 1,090,234 | 1,111,191 | 1,101,176 | 697,123 | 781,785 | 577,874 | 1,057,908 |
5-2 | 498,313 | 206,906 | 357,688 | 187,871 | 198,793 | 180,734 | 148,247 | 130,469 |
merged | 53,843,706 | 64,659,129 | 50,119,902 | 52,549,024 | 56,128,228 | 50,491,066 | 55,877,903 | 61,496,547 |
% | 94% | 96% | 94% | 94% | 95% | 95% | 91% | 96% |
direct araport11 | 53,233,842 | 63,164,986 | 49,378,730 | 50,914,378 | 55,501,811 | 49,702,987 | 55,370,073 | 60,523,342 |
% | 93% | 94% | 93% | 91% | 94% | 94% | 90% | 94% |