Riboseq multi stage mapping and filtering, part 2 - wdlingit/cop GitHub Wiki

Multi stage mapping

The multi stage Riboseq read mapping procedure described in this page is based on one assumption: the existing genome annotation and corresponding RNAseq datasets provide a reliable transcriptome reference. Accordingly, mapping Riboseq reads to this transcriptome reference would be more precise than mapping Riboseq reads directly to the genome assembly. In the multi stage mapping procedure, read were firstly mapped to the transcriptome reference for precision and then mapped to the genome assembly for sensitivity.

We used to mark mapping stages with numbers. In this document, mapping stages are:

  • 1-1: exact matching to Araport11 transcripts
  • 1-2: exact matching to splicing junctions inferred from RNAseq data
  • 2-1: 95% identity matches to Ararpot11 transcripts, both ends (4bps) of the alignments must be exact matches
  • 2-2: 95% identity matches to splicing junctions inferred from RNAseq data, both ends (4bps) of the alignments must be exact matches
  • 3-1 & 3-2: direct mapping to Arabidopsis genome with default and fine setting, both ends (4bps) of the alignments must be exact matches
  • 4-1: 95% identity matches to Ararpot11 transcripts
  • 4-2: 95% identity matches to splicing junctions inferred from RNAseq data
  • 5-1 & 5-2: direct mapping to Arabidopsis genome with default and fine setting
  • the final merge step

We also provide some real numbers when applying this multi stage mapping procedure on our Riboseq datasets.

Commands in this document are not only for multiplexed Riboseq reads. They can be applied to separate FASTQ files. Most perl oneliners will be attached with generated commands (lines started with "CMD:") from our real practices.

1-1: exact matching to Araport11 transcripts

Recall that we have a Bowtie2 index Araport11.strand.model in folder DB. The following perl oneliner was to generate commands that execute our wrapper script Mapping.pl for mapping.

wdlin@comp03:somewhere$ ls merged.fasta | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="Mapping.pl x $_ $1.1-1.bam MappingBowtie.pl -target DB/Araport11.strand.model -p 20 -k 30 --gbar 151 --sensitive -ID 0.999 -SamTmp $1.1-1.unmapped.sam"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd;
'

CMD: Mapping.pl x merged.fasta merged.1-1.bam MappingBowtie.pl -target DB/Araport11.strand.model -p 20 -k 30 --gbar 151 --sensitive -ID 0.999 -SamTmp merged.1-1.unmapped.sam

Points to be noticed:

  1. The generated command (the line started with CMD:) was taking merged.fasta as input and merged.1-1.bam as output.
  2. MappingBowtie.pl -target DB/Araport11.strand.model: call child wrapper script MappingBowtie.pl and take DB/Araport11.strand.model as the Bowtie2 index
  3. -p 20 -k 30 --gbar 151 --sensitive: Bowtie2 parameters. -p 20 for using 20 threads. --sensitive for Bowtie2's preset end-to-end alignment criteria. -k 30 for reporting at most 30 qualified alignments (defined by --sensitive). Note that Bowtie2 tends to report only one alignment if this parameter was not set. --gbar 151 for asking Bowtie2 not to open any alignment gaps in the first 151bps of reads. Taking all Bowtie2 parameters together, the setting here was to ask Bowtie2 not to open any gap and report multiple qualified alignments.
  4. -ID 0.999 -SamTmp merged.1-1.unmapped.sam: filtering function provided by the wrapper script. Save only alignments with identity (#matche/read length) no less than 0.999 to the output BAM file, and save those unqualified alignments to merged.1-1.unmapped.sam. Note that our reads were all shorter than 51bps so -ID 0.999 was equal to exact matches. Also note that Bowtie2 reports also alignments of read that cannot be mapped. So merged.1-1.bam and merged.1-1.unmapped.sam should contain all reads in merged.fasta.

The alignments in merged.1-1.bam were mapped to Araport11 transcripts. We applied the following perl oneliner to generate piped commands for transforming alignments to genomic coordinates.

wdlin@comp03:somewhere$ ls *.1-1.bam | perl -ne '
    chomp; /(.+?)\./; 
    $cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAM $_ -O /dev/stdout -Un $1.1-1.filtered.sam -quiet true -filter translation DB/Araport11.strand.model | psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin | samtools view -Sbo $1.1-1.translated.bam -T DB/TAIR10_chr_all.fas /dev/stdin"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd;
'

CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAM merged.1-1.bam -O /dev/stdout -Un merged.1-1.filtered.sam -quiet true -filter translation DB/Araport11.strand.model | psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin | samtools view -Sbo merged.1-1.translated.bam -T DB/TAIR10_chr_all.fas /dev/stdin

Description:

  1. -M SAM merged.1-1.bam was to ask the java program misc.AlignmentFilter2 to read merged.1-1.bam as a SAM file.
  2. -O /dev/stdout was to write alignment records (after coordinate transformation) to standard output so that the next part of the pipe can handle them.
  3. -Un merged.1-1.filtered.sam: save any alignments that found no annotation for coordinate transformation to file merged.1-1.filtered.sam. Theoretically, there should be no reads cannot be transformed.
  4. -quiet true was to ask the java program misc.AlignmentFilter2 not to output program messages other than alignment records.
  5. -filter translation DB/Araport11.strand.model: use the translate function to translate alignments based on coordinates save in DB/Araport11.strand.model. Recall that DB/Araport11.strand.model saves exon coordinates of all Araport11 transcripts.
  6. psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin: Read PSL format input and translate them into SAM records with SAM header information. Note that DB/TAIR10_chr_all.fas is the genome assembly that we downloaded from EnsemblPlants.
  7. samtools view -Sbo merged.1-1.translated.bam -T DB/TAIR10_chr_all.fas /dev/stdin: Read SAM format input and save them in a BAM file named merged.1-1.translated.bam.

There are a number of reasons for us to perform this coordinate transforming step:

  1. We compute read counts for genes by computing intersections between alignments and exon intervals of genes in genomic coordinates.
  2. BAM files are usually much smaller than SAM files.
  3. SAMtools operations like merge and sort accept only BAM files.

The next step was to collect those unqualified reads for the next stage.

wdlin@comp03:somewhere$ ls *.1-1.*.sam | perl -ne '
    chomp; 
    if(/(.+?)_R\d\./){ 
        $key=$1 
    }else{ 
        /(.+?)\./; 
        $key=$1 
    } 
    push @{$hash{$key}},$_; 

    if(eof){ 
        for $key (sort keys %hash){ 
            $cmd="cat @{$hash{$key}} | sam2fas.pl /dev/stdin $key.1-1.tmp.fasta"; 
            print "\nCMD: $cmd\n"; 
            #system $cmd 
        } 
    }
'

CMD: cat merged.1-1.filtered.sam merged.1-1.unmapped.sam | sam2fas.pl /dev/stdin merged.1-1.tmp.fasta

The generated command was to take unqualified SAM records in merged.1-1.filtered.sam and merged.1-1.unmapped.sam and save these reads into merged.1-1.tmp.fasta.

1-2: exact matching to splicing junctions inferred from RNAseq data

In addition to transcripts in the database, RNAseq datasets (especially those made under the same experiment conditions with Riboseq dataset) can provide a good reference for Riboseq read mapping. In this stage, we generated junction-spanning sequences where the junctions were not covered by the Araport11 transcriptome database. In so doing, mapping Riboseq reads to these sequences can help us to obtain alignments cannot be inferred by Araport11 transcriptome.

wdlin@comp03:somewhere$ ls RNAseq_folder/rnaseq_*.merged.bam | perl -ne '
    chomp; 
    push @arr,$_; 
    
    if(eof){ 
        $cmd="java -classpath rackj.jar:sam-1.89.jar special.JunctionCGFF -M SAM ".join(" -M SAM ",@arr)." -O DB/RNAseqJunction -minB 10 -minDepth 10 -minSplicingPos 4 -flanking 49 -model DB/Araport11.strand.model"; 
        print "\nCMD: $cmd\n"; 
        #system $cmd 
    }
'

wdlin@comp03:somewhere$ head DB/RNAseqJunction.cgff
>splice.1       Chr1    4228    4554
4228    4276
4506    4554
>splice.2       Chr1    7399    7612
7399    7447
7564    7612
>splice.3       Chr1    7780    7990
7780    7828
7942    7990
>splice.4       Chr1    7783    7990

In above block, the perl oneliner command was to list all RNAseq BAM files (sorted by name) and generate the command to compute all junctions in those RNAseq BAM file. Reported junction coordinates would be represented by two 49bp intervals (-flanking 49, also see first few lines of DB/RNAseqJunction.cgff), where each of them were supported by at least 10 reads (-minDepth 10). Junctions available in the Araport11 database were removed (-model DB/Araport11.strand.model).

Since these junctions unavailable in the Araport11 database were saved as coordinates of two intervals, we applied SeqGen.pl to generate those 98bp junction-spanning sequences and built a Bowtie2 index, where the Bowtie2 index name is araport11.rnaJunction.

wdlin@comp03:somewhere/DB$ SeqGen.pl araport11.rnaJunction.fasta TAIR10_chr_all.fas RNAseqJunction.cgff

wdlin@comp03:somewhere/DB$ bowtie2-build araport11.rnaJunction.fasta araport11.rnaJunction

The following perl oneliner was to make sure sequences were all 98bp long.

(check length distribution of junction sequences)
wdlin@comp03:somewhere$ cat DB/araport11.rnaJunction.fasta | perl -ne '
    chomp; 
    if(/^>(\S+)/){ 
        $id=$1 
    }else{ 
        $hash{$id}+=length($_) 
    } 
    if(eof){ 
        for $key (keys %hash){ 
            print "$hash{$key}\n" 
        } 
    }
' | perl -ne '
    chomp; 
    $hash{$_}++; 
    if(eof){ 
        for $key (sort {$a<=>$b} keys %hash){ 
            print "$key\t$hash{$key}\n" 
        }
    }
'
98      57991

Generating these junction-spanning sequences was aimed to align junction-spanning Riboseq reads correctly. In order to unqualify those Riboseq reads mapping to junction-spanning sequences but not crossing the junctions, the following perl oneliner was applied to generate coordinates of mid points of junction-spanning seqeunces.

wdlin@comp03:somewhere$ cat DB/araport11.rnaJunction.fasta | perl -ne '
    chomp; 
    if(/^>(\S+)/){ 
        $id=$1 
    }else{ 
        $hash{$id}+=length($_) 
    } 
    if(eof){ 
        for $key (sort {$a cmp $b} keys %hash){ 
            print "$key\t$hash{$key}\n" 
        } 
    }
' | perl -ne '
    chomp; 
    ($id,$len)=split; 
    print ">$id\t$id\t".($len/2)."\t".($len/2+1)."\n".($len/2)."\t".($len/2+1)."\n"
' > DB/araport11.rnaJunction.mid.cgff

wdlin@comp03:somewhere$ head DB/araport11.rnaJunction.mid.cgff
>splice.1       splice.1        49      50
49      50
>splice.10      splice.10       49      50
49      50
>splice.100     splice.100      49      50
49      50
>splice.1000    splice.1000     49      50
49      50
>splice.10000   splice.10000    49      50
49      50

The next three steps were very similar with what we had done in stage 1-1.

wdlin@comp03:somewhere$ ls *.1-1.tmp.fasta | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="Mapping.pl x $_ $1.1-2.bam MappingBowtie.pl -target DB/araport11.rnaJunction -p 20 -k 30 --gbar 151 --sensitive -ID 0.999 -SamTmp $1.1-2.tmp.sam"; 
    print "\nCMD: $cmd\n";
    #system $cmd;
'

CMD: Mapping.pl x merged.1-1.tmp.fasta merged.1-2.bam MappingBowtie.pl -target DB/araport11.rnaJunction -p 20 -k 30 --gbar 151 --sensitive -ID 0.999 -SamTmp merged.1-2.tmp.sam

The generated command was to take merged.1-1.tmp.fasta (containing unqualified reads in stage 1-1) as the input, mapping them to the junction-spanning sequences using Bowtie2 (MappingBowtie.pl -target DB/araport11.rnaJunction). Again, we asked Bowtie2 not to open gaps and report multiple qualified alignments under its --sensitive criteria. The wrapper options -ID 0.999 -SamTmp merged.1-2.tmp.sam ensured only perfect match been saved in merged.1-2.bam and unqualified records saved in merged.1-2.tmp.sam.

wdlin@comp03:somewhere$ ls *.1-2.bam | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAM $_ -O /dev/stdout -Un $1.1-2.filtered.sam -quiet true -filter gene -GFF DB/araport11.rnaJunction.mid.cgff -min 2 -filter translation DB/RNAseqJunction.cgff | psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin | samtools view -Sbo $1.1-2.translated.bam -T DB/TAIR10_chr_all.fas /dev/stdin"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd;
'

CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAM merged.1-2.bam -O /dev/stdout -Un merged.1-2.filtered.sam -quiet true -filter gene -GFF DB/araport11.rnaJunction.mid.cgff -min 2 -filter translation DB/RNAseqJunction.cgff | psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin | samtools view -Sbo merged.1-2.translated.bam -T DB/TAIR10_chr_all.fas /dev/stdin

The generated command was to take merged.1-2.bam as the input, accept only alignments crossing mid points of junction-spanning sequences (-filter gene -GFF DB/araport11.rnaJunction.mid.cgff -min 2), and transform accepted alignments to genomic coordinates (-filter translation DB/RNAseqJunction.cgff). The outputted PSL format records were read by the psl2sam.pl script and convered into SAM records. The last part of samtools view was to convert SAM records into BAM. Here, unqualified reads would be saved in merged.1-2.filtered.sam.

wdlin@comp03:somewhere$ ls *.1-2.*.sam | perl -ne '
    chomp; 
    if(/(.+?)_R\d\./){ 
        $key=$1 
    }else{ 
        /(.+?)\./; 
        $key=$1 
    } 
    push @{$hash{$key}},$_; 
    
    if(eof){ 
        for $key (sort keys %hash){ 
            $cmd="cat @{$hash{$key}} | sam2fas.pl /dev/stdin $key.1-2.tmp.fasta"; 
            print "\nCMD: $cmd\n"; 
            #system $cmd 
        } 
    }
'

CMD: cat merged.1-2.filtered.sam merged.1-2.tmp.sam | sam2fas.pl /dev/stdin merged.1-2.tmp.fasta

The generated command was to take unqualified SAM records in merged.1-2.filtered.sam and merged.1-2.tmp.sam and save those reads in merged.1-2.tmp.fasta.

2-1: 95% identity matches to Ararpot11 transcripts, both ends (4bps) of the alignments must be exact matches

wdlin@comp03:somewhere$ ls *.1-2.tmp.fasta | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="Mapping.pl x $_ $1.2-1.bam MappingBowtie.pl -target DB/Araport11.strand.model -p 20 -k 30 --gbar 151 --sensitive"; 
    print "\nCMD: $cmd\n";
    #system $cmd;
'

CMD: Mapping.pl x merged.1-2.tmp.fasta merged.2-1.bam MappingBowtie.pl -target DB/Araport11.strand.model -p 20 -k 30 --gbar 151 --sensitive

The generated command was to take merged.1-2.tmp.fasta (containing unqualified reads in stage 1-2) as the input and map them to Araport11 transcripts using Bowtie2 (MappingBowtie.pl -target DB/Araport11.strand.model). Again, we asked Bowtie2 not to open gaps and report multiple qualified alignments under its --sensitive criteria. This time, the wrapper options -ID and -SamTmp were not applied because some sophisticated alignment filtering were needed when we did not require perfect matches using Bowtie2. Note that all reads with and without alignment records were saved in the output merged.2-1.bam file.

wdlin@comp03:somewhere$ ls *.2-1.bam | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun $_ -O /dev/stdout -Un $1.2-1.unmapped.sam -quiet true -filter translation DB/Araport11.strand.model | psl2samG.pl -md DB/TAIR10_chr_all.fas /dev/stdin | java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun /dev/stdin -O /dev/stdout -Un $1.2-1.filtered.sam -quiet true -filter ID 0.95 -filter ExactEnds 4 -filter ExactIntron 2 | samtools view -Sbo $1.2-1.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd;
'

CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun merged.2-1.bam -O /dev/stdout -Un merged.2-1.unmapped.sam -quiet true -filter translation DB/Araport11.strand.model | psl2samG.pl -md DB/TAIR10_chr_all.fas /dev/stdin | java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun /dev/stdin -O /dev/stdout -Un merged.2-1.filtered.sam -quiet true -filter ID 0.95 -filter ExactEnds 4 -filter ExactIntron 2 | samtools view -Sbo merged.2-1.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin

In the generated command, we piped four commands:

  1. The first misc.AlignmentFilter2: read merged.2-1.bam and transform alignment coordinates to the genome. Note that reads reported without alignment by Bowtie2 were saved to merged.2-1.unmapped.sam. Note that the coordinate transformation function always generate records in PSL format.
  2. psl2samG.pl read PSL format records made by the first misc.AlignmentFilter2 and converted them into SAM format.
  3. The second misc.AlignmentFilter2: qualified alignments in genomic coordinates with three criteria: (i) identity>=0.95 (#match/read length), (ii) the four base pairs at the two ends must be matches (-filter ExactEnds 4), and (iii) for every intronic gap created by the coordinate transformation (recalled that gaps were not allowed in the Bowtie2 step), the two base pairs surrounding the gap must be matches (-filter ExactIntron 2). Unqualified records were saved in merged.2-1.filtered.sam.
  4. samtools view converted qualified SAM records into BAM file merged.2-1.pass.bam

Filtering criteria in point 3 were made based on some observations when applying Bowtie2:

  1. In Bowtie2's end-to-end mode (for example, --sensitive), end bases might be mismatches.
  2. Mismatches in bases surrounding intron junctions may mean some other isoforms not included in the reference.

And our last step for this stage was to save reads in merged.2-1.filtered.sam and merged.2-1.unmapped.sam into merged.2-1.tmp.fasta for the next stage.

wdlin@comp03:somewhere$ ls *.2-1.*.sam | perl -ne '
    chomp; 
    if(/(.+?)_R\d\./){ 
        $key=$1 
    }else{ 
        /(.+?)\./; 
        $key=$1 
    } 
    push @{$hash{$key}},$_; 
    
    if(eof){ 
        for $key (sort keys %hash){ 
            $cmd="cat @{$hash{$key}} | sam2fas.pl /dev/stdin $key.2-1.tmp.fasta"; 
            print "\nCMD: $cmd\n"; 
            #system $cmd 
        } 
    }
'

CMD: cat merged.2-1.filtered.sam merged.2-1.unmapped.sam | sam2fas.pl /dev/stdin merged.2-1.tmp.fasta

2-2: 95% identity matches to splicing junctions inferred from RNAseq data, both ends (4bps) of the alignments must be exact matches

Commands for this stage are similar to those in 2-1. The first one was to map reads in merged.2-1.tmp.fasta to junction-spanning sequences.

wdlin@comp03:somewhere$ ls *.2-1.tmp.fasta | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="Mapping.pl x $_ $1.2-2.bam MappingBowtie.pl -target DB/araport11.rnaJunction -p 20 -k 30 --gbar 151 --sensitive"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd;
'

CMD: Mapping.pl x merged.2-1.tmp.fasta merged.2-2.bam MappingBowtie.pl -target DB/araport11.rnaJunction -p 20 -k 30 --gbar 151 --sensitive

Qualified alignments by checking if reads were crossing mid points of junction-spanning sequences. Also, required alignments containing exact matches at the two ends, as well as exact matches surrounding intronic gaps.

wdlin@comp03:somewhere$ ls *.2-2.bam | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun $_ -O /dev/stdout -Un $1.2-2.unmapped.sam -quiet true -filter gene -GFF DB/araport11.rnaJunction.mid.cgff -min 2 -filter translation DB/RNAseqJunction.cgff | psl2samG.pl -md DB/TAIR10_chr_all.fas /dev/stdin | java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun /dev/stdin -O /dev/stdout -Un $1.2-2.filtered.sam -quiet true -filter ID 0.95 -filter ExactEnds 4 -filter ExactIntron 2 | samtools view -Sbo $1.2-2.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd;
'

CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun merged.2-2.bam -O /dev/stdout -Un merged.2-2.unmapped.sam -quiet true -filter gene -GFF DB/araport11.rnaJunction.mid.cgff -min 2 -filter translation DB/RNAseqJunction.cgff | psl2samG.pl -md DB/TAIR10_chr_all.fas /dev/stdin | java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun /dev/stdin -O /dev/stdout -Un merged.2-2.filtered.sam -quiet true -filter ID 0.95 -filter ExactEnds 4 -filter ExactIntron 2 | samtools view -Sbo merged.2-2.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin

Saved reads of unqualified alignments in merged.2-2.filtered.sam and merged.2-2.unmapped.sam into merged.2-2.tmp.fasta.

wdlin@comp03:somewhere$ ls *.2-2.*.sam | perl -ne '
    chomp; 
    if(/(.+?)_R\d\./){ 
        $key=$1 
    }else{ 
        /(.+?)\./; 
        $key=$1 
    } 
    push @{$hash{$key}},$_; 
    
    if(eof){ 
        for $key (sort keys %hash){ 
            $cmd="cat @{$hash{$key}} | sam2fas.pl /dev/stdin $key.2-2.tmp.fasta"; 
            print "\nCMD: $cmd\n"; 
            #system $cmd 
        } 
    }
'

CMD: cat merged.2-2.filtered.sam merged.2-2.unmapped.sam | sam2fas.pl /dev/stdin merged.2-2.tmp.fasta

3-1 & 3-2: direct mapping to Arabidopsis genome with default and fine setting, both ends (4bps) of the alignments must be exact matches

In stages 1-* and 2-*, reads were mapped to the transcriptome reference and than alignments were transformed into genomic coordinates. In stages 3-*, reads were mapped to the genome assembly directly using BLAT.

The first perl oneliner was to generate a command that used the wrapper script Mapping.pl for mapping.

wdlin@comp03:somewhere$ ls *.2-2.tmp.fasta | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="Mapping.pl -split 20 x $_ $1.3-1.bam MappingBlat.pl -unmap -md -target DB/TAIR10_chr_all.fas -q=rna -t=dna"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd;
'

CMD: Mapping.pl -split 20 x merged.2-2.tmp.fasta merged.3-1.bam MappingBlat.pl -unmap -md -target DB/TAIR10_chr_all.fas -q=rna -t=dna

Points to be noticed:

  1. input file merged.2-2.tmp.fasta contained all unqualified reads from the last stage.
  2. wrapper option -split 20 would split the input file into 20 shares and fork 20 processes (not threads) for running BLAT tasks.
  3. child wrapper MappingBlat.pl was applied for calling BLAT. Its option -unmap was to include unmapped reads in the output SAM/BAM file, where BLAT was not designed to include unmapped reads. The option -md was to add MD attributes into the SAM records. -target specified the database file for BLAT mapping.
  4. -q=rna -t=dna are BLAT options for mapping transcriptomic sequences to genome sequences.

Accordingly, merged.3-1.bam saved all records (mapped and unmapped) for merged.2-2.tmp.fasta. The next perl oneliner was to generate the filtering command to deal with merged.3-1.bam.

wdlin@comp03:somewhere$ ls *.3-1.bam | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun $_ -O /dev/stdout -Un $1.3-1.filtered.sam -quiet true -filter ID 0.95 -filter J 2 -filter minB 8 -filter J reset -filter ExactEnds 4 -filter indel deletion -mode maximum -max 1000 | samtools view -Sbo $1.3-1.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd;
'

CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun merged.3-1.bam -O /dev/stdout -Un merged.3-1.filtered.sam -quiet true -filter ID 0.95 -filter J 2 -filter minB 8 -filter J reset -filter ExactEnds 4 -filter indel deletion -mode maximum -max 1000 | samtools view -Sbo merged.3-1.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin

Key points to be noticed:

  1. Input file here was merged.3-1.bam, which contained mapped and unmapped records. merged.3-1.filtered.sam saved all unqualified records.
  2. The filtering options were to require (i) alignments with identity>=0.95, (ii) alignment blocks (continuous mapping intervals in both query side and target side of the alignment) no shorter than 8bps, (iii) 4bp exact matches at the two ends of the alignments, and (iv) intronic gaps no greater than 1000bps (-filter indel deletion -mode maximum -max 1000).

Reasons for the filter criteria (above points 2(ii) and 2(iv)):

  1. BLAT tends to randomly find matches of size down to 6bps when the two ends of a read have no good alignment to the genome.
  2. Limiting the intronic gap to 1000bps should further ensure that only reliable alignments were accepted. 1000bps seems a reasonable upper bound of intron sizes for arabidopsis. This also reduced the possibility to have randomly mapped 8bp blocks at the two ends.

The last command of this stage was to collect reads without qualified alignments into merged.3-1.tmp.fasta for the next stage.

wdlin@comp03:somewhere$ ls *.3-1.*.sam | perl -ne '
    chomp; 
    if(/(.+?)_R\d\./){ 
        $key=$1 
    }else{ 
        /(.+?)\./; 
        $key=$1 
    } 
    push @{$hash{$key}},$_; 
    
    if(eof){ 
        for $key (sort keys %hash){ 
            $cmd="cat @{$hash{$key}} | sam2fas.pl /dev/stdin $key.3-1.tmp.fasta"; 
            print "\nCMD: $cmd\n"; 
            #system $cmd 
        } 
    }
'

CMD: cat merged.3-1.filtered.sam | sam2fas.pl /dev/stdin merged.3-1.tmp.fasta

The next three commands for 3-2 were very similar with those for 3-1, except that the additional BLAT options -minMatch=1 -minScore=14 -tileSize=9 ensured a fine scanning of BLAT. Accordingly, this made the mapping step very time consuming. You may consider to skip this stage as it used to produce no more than 2% of effective alignments.

wdlin@comp03:somewhere$ ls *.3-1.tmp.fasta | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="Mapping.pl -split 20 x $_ $1.3-2.bam MappingBlat.pl -unmap -md -target DB/TAIR10_chr_all.fas -q=rna -t=dna -minMatch=1 -minScore=14 -tileSize=9"; 
    print "\nCMD: $cmd\n";
    #system $cmd;
'

CMD: Mapping.pl -split 20 x merged.3-1.tmp.fasta merged.3-2.bam MappingBlat.pl -unmap -md -target DB/TAIR10_chr_all.fas -q=rna -t=dna -minMatch=1 -minScore=14 -tileSize=9
wdlin@comp03:somewhere$ ls *.3-2.bam | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun $_ -O /dev/stdout -Un $1.3-2.filtered.sam -quiet true -filter ID 0.95 -filter J 2 -filter minB 8 -filter J reset -filter ExactEnds 4 -filter indel deletion -mode maximum -max 1000 | samtools view -Sbo $1.3-2.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd;
'

CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun merged.3-2.bam -O /dev/stdout -Un merged.3-2.filtered.sam -quiet true -filter ID 0.95 -filter J 2 -filter minB 8 -filter J reset -filter ExactEnds 4 -filter indel deletion -mode maximum -max 1000 | samtools view -Sbo merged.3-2.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin
wdlin@comp02:somewhere$ ls *.3-2.*.sam | perl -ne '
    chomp; 
    if(/(.+?)_R\d\./){ 
        $key=$1 
    }else{ 
        /(.+?)\./; 
        $key=$1 
    } 
    push @{$hash{$key}},$_; 
    
    if(eof){ 
        for $key (sort keys %hash){ 
            $cmd="cat @{$hash{$key}} | sam2fas.pl /dev/stdin $key.3-2.tmp.fasta"; 
            print "\nCMD: $cmd\n"; 
            #system $cmd 
        } 
    }
'

CMD: cat merged.3-2.filtered.sam | sam2fas.pl /dev/stdin merged.3-2.tmp.fasta

4-1: 95% identity matches to Ararpot11 transcripts

The three commands for this stage were very similar with those for stage 2-1, except that

  1. The Bowtie2 option --sensitive was modified to --sensitive-local. --sensitive-local is a Bowtie2 preset option for local alignment. According to our observation, --sensitive-local deals the two alignment ends better than --sensitive when it is impossible to have exact matches at the two ends.
  2. The filtering option -filter ExactEnds 4 were remove because alignments with exact matches at the two ends were collected in previous stages. Also, Bowtie2 with option --sensitive-local would tend to report alignments without exact matches at the two ends.
wdlin@comp02:somewhere$ ls *.3-2.tmp.fasta | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="Mapping.pl x $_ $1.4-1.bam MappingBowtie.pl -target DB/Araport11.strand.model -p 20 -k 30 --gbar 151 --sensitive-local"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd;
'

CMD: Mapping.pl x merged.3-2.tmp.fasta merged.4-1.bam MappingBowtie.pl -target DB/Araport11.strand.model -p 20 -k 30 --gbar 151 --sensitive-local
wdlin@comp02:somewhere$ ls *.4-1.bam | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun $_ -O /dev/stdout -Un $1.4-1.unmapped.sam -quiet true -filter ID 0.95 -filter translation DB/Araport11.strand.model | psl2samG.pl -md DB/TAIR10_chr_all.fas /dev/stdin | java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun /dev/stdin -O /dev/stdout -Un $1.4-1.filtered.sam -quiet true -filter ExactIntron 2 | samtools view -Sbo $1.4-1.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd
'

CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun merged.4-1.bam -O /dev/stdout -Un merged.4-1.unmapped.sam -quiet true -filter ID 0.95 -filter translation DB/Araport11.strand.model | psl2samG.pl -md DB/TAIR10_chr_all.fas /dev/stdin | java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun /dev/stdin -O /dev/stdout -Un merged.4-1.filtered.sam -quiet true -filter ExactIntron 2 | samtools view -Sbo merged.4-1.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin
wdlin@comp02:somewhere$ ls *.4-1.*.sam | perl -ne '
    chomp; 
    if(/(.+?)_R\d\./){ 
        $key=$1 
    }else{ 
        /(.+?)\./; 
        $key=$1 
    } 
    push @{$hash{$key}},$_; 
    
    if(eof){ 
        for $key (sort keys %hash){ 
            $cmd="cat @{$hash{$key}} | sam2fas.pl /dev/stdin $key.4-1.tmp.fasta"; 
            print "\nCMD: $cmd\n"; 
            #system $cmd 
        } 
    }
'

CMD: cat merged.4-1.filtered.sam merged.4-1.unmapped.sam | sam2fas.pl /dev/stdin merged.4-1.tmp.fasta

4-2`: 95% identity matches to splicing junctions inferred from RNAseq data

The three commands for this stage were very similar with those for stage 2-2. The two exceptions were --sensitive-local for Bowtie2 and no -filter ExactEnds 4 for filtering, by similar reasons as described for stage 4-1.

wdlin@comp02:somewhere$ ls *.4-1.tmp.fasta | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="Mapping.pl x $_ $1.4-2.bam MappingBowtie.pl -target DB/araport11.rnaJunction -p 20 -k 30 --gbar 151 --sensitive-local"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd;
'

CMD: Mapping.pl x merged.4-1.tmp.fasta merged.4-2.bam MappingBowtie.pl -target DB/araport11.rnaJunction -p 20 -k 30 --gbar 151 --sensitive-local
wdlin@comp02:somewhere$ ls *.4-2.bam | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun $_ -O /dev/stdout -Un $1.4-2.unmapped.sam -quiet true -filter gene -GFF DB/araport11.rnaJunction.mid.cgff -min 2 -filter translation DB/RNAseqJunction.cgff | psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin | java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun /dev/stdin -O /dev/stdout -Un $1.4-2.filtered.sam -quiet true -filter ID 0.95 -filter ExactIntron 2 | samtools view -Sbo $1.4-2.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd
'

CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun merged.4-2.bam -O /dev/stdout -Un merged.4-2.unmapped.sam -quiet true -filter gene -GFF DB/araport11.rnaJunction.mid.cgff -min 2 -filter translation DB/RNAseqJunction.cgff | psl2sam.pl -md DB/TAIR10_chr_all.fas /dev/stdin | java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun /dev/stdin -O /dev/stdout -Un merged.4-2.filtered.sam -quiet true -filter ID 0.95 -filter ExactIntron 2 | samtools view -Sbo merged.4-2.pass.bam -T DB/TAIR10_chr_all.fas /dev/stdin
wdlin@comp02:somewhere$ ls *.4-2.*.sam | perl -ne '
    chomp; 
    if(/(.+?)_R\d\./){ 
        $key=$1 
    }else{ 
        /(.+?)\./; 
        $key=$1 
    } 
    push @{$hash{$key}},$_; 
    
    if(eof){ 
        for $key (sort keys %hash){ 
            $cmd="cat @{$hash{$key}} | sam2fas.pl /dev/stdin $key.4-2.tmp.fasta"; 
            print "\nCMD: $cmd\n"; 
            #system $cmd 
        } 
    }
'

CMD: cat merged.4-2.filtered.sam merged.4-2.unmapped.sam | sam2fas.pl /dev/stdin merged.4-2.tmp.fasta

5-1 & 5-2: direct mapping to Arabidopsis genome with default and fine setting

Steps of these two stages were similar with those of 3-1 and 3-2. The only exception is that the filtering option -filter ExactEnds 4 were removed, i.e., not requiring exact matches at the two ends. However, we didn't redo the mapping steps because the mapping processes using BLAT were time consuming. Considering that all alignments had already been done in steps 3-1 and 3-2, all we need to do was to extract them from merged.3-1.tmp.bam and merged.3-2.tmp.bam.

wdlin@comp02:somewhere$ ls *.3-?.bam | perl -ne '
    chomp; 
    /(.+?)\.(.+?)\./; 
    $cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun $_ -O /dev/stdout -quiet true -filter ID 0.95 -filter J 2 -filter minB 8 -filter J reset -filter indel deletion -mode maximum -max 1000 | samtools view -Sbo $1.$2.tmp.bam -T DB/TAIR10_chr_all.fas /dev/stdin"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd
'

CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun merged.3-1.bam -O /dev/stdout -quiet true -filter ID 0.95 -filter J 2 -filter minB 8 -filter J reset -filter indel deletion -mode maximum -max 1000 | samtools view -Sbo merged.3-1.tmp.bam -T DB/TAIR10_chr_all.fas /dev/stdin

CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAMun merged.3-2.bam -O /dev/stdout -quiet true -filter ID 0.95 -filter J 2 -filter minB 8 -filter J reset -filter indel deletion -mode maximum -max 1000 | samtools view -Sbo merged.3-2.tmp.bam -T DB/TAIR10_chr_all.fas /dev/stdin

The two generated commands were doing almost the same filtering but not requiring exact matches at the two ends. That is, the two output BAM files, merged.3-1.tmp.bam and merged.3-2.tmp.bam were supersets of merged.3-1.pass.bam and merged.3-2.pass.bam, respectively.

Also, because the filtering criteria was relaxed, merged.3-1.tmp.bam and merged.3-2.tmp.bam might contain reads in common. To exclude reads have qualified alignments in merged.3-1.tmp.bam from merged.3-2.tmp.bam, we applied SplitSelector.pl. To make this script work with large BAM files, be sure to follow these steps for installing some perl modules. In short, SplitSelector.pl reads two lists and makes extractions from the second list based on the first list and options. In this example, the SplitSelector.pl was to extract BAM records from merged.3-2.tmp0.bam (renamed from merged.3-2.tmp.bam) by excluding reads in BAM file merged.3-1.tmp.bam (--reverse 1 for reversed selection). The output file was named merged.3-2.tmp.bam. In so doing, we have updated merged.3-2.tmp.bam contained no reads in common with ``merged.3-1.tmp.bam`.

wdlin@comp02:somewhere$ mv merged.3-2.tmp.bam merged.3-2.tmp0.bam

wdlin@comp02:somewhere$ SplitSelector.pl --split 10 --reverse 1 BAM merged.3-1.tmp.bam BAM merged.3-2.tmp0.bam merged.3-2.tmp.bam

wdlin@comp02:somewhere$ SplitSelector.pl
Usage:  SplitSelector.pl [options] <listFile format {LIST|FASTA|FASTQ|SAM|BAM}> <listFile> <seqFile format {FASTA|FASTQ|SAM|BAM}> <seqFile> <outputFile>
Options:
   --idBy list="<ID|SEQ>"
   --idBy seq="<ID|SEQ|PROG \"...\">"
   --space list="<w>"           replace space characters in read IDs with <w> on listFile. (default: no replacement)
   --space seq="<w>"            replace space characters in read IDs with <w> on seqFile. (default: no replacement)
   --slash list="<x>"           replace slash characters (/) in read IDs with <x> on listFile. (default: no replacement)
   --slash seq="<x>"            replace slash characters (/) in read IDs with <x> on seqFile. (default: no replacement)
   --idTrim <y>                 trim last <y> characters of sequence IDs for hash keys. (default: 0)
   --noPair                     remove a hash key if two or more IDs generate this same key. (default: 0)
   --mateOnly                   select only mate reads. (default: 0)
   --reverse                    reverse selection. (default: 0)
   --report <RAW|FASTA|ID|SEQ>
   --split <z>                  split seqFile into <z> files for parallel selection (defalut: 1).
   --noSplitSameID              not to split same sequence ID (default: 1)
   --tmpdir <dir>               assign a directory to generate the temporary directory, for <z> greater than 1. (default: current working dir)
   --dump                       dump lookup IDs only (default: no dump)

Recall that merged.4-2.tmp.fasta was containing all reads not qualified in all above stages (from 1-* to 4-*), and we have merged.3-1.tmp.bam and merged.3-2.tmp.bam containing alignments made by BLAT without filtering on alignment ends. Applying SplitSelector.pl again should let us obtain alignments for stages 5-1 and 5-2 without rerunning the time consuming BLAT processes.

wdlin@comp02:somewhere$ SplitSelector.pl --split 10 FASTA merged.4-2.tmp.fasta BAM merged.3-1.tmp.bam merged.5-1.pass.bam

wdlin@comp02:somewhere$ SplitSelector.pl --split 10 FASTA merged.4-2.tmp.fasta BAM merged.3-2.tmp.bam merged.5-2.pass.bam

Merge BAM files of pass alignments from all stages

Some stages might produce no alignments when the total number of Riboseq reads was low. The following perl oneliner was applied to see if any BAM file contains no alignments. In our this case, there is no such BAM files.

wdlin@comp02:somewhere$ ls *.pass.bam *.translated.bam | perl -ne '
    chomp; 
    $cmd="samtools view $_ | head -1000 | sam_filter.pl -ID 1.1 -msg $_ /dev/null /dev/null"; system $cmd
' | perl -ne '
    chomp; 
    /\((.+?)\).+reads: (\d+)/; 
    print "$1\t$2\n"
' | perl -ne '
    chomp; 
    @t=split; 
    $cmd="rm $t[0]"; 
    if($t[1]==0){ 
        print "\nCMD: $cmd\n"; 
        #system $cmd 
    }
'

The following perl oneliner was used to generate a command for combining BAM files of pass alignments. Note that all above steps produced BAM files sorted by name so that the -n option was applied for samtools merge.

wdlin@comp02:somewhere$ ls *.pass.bam *.translated.bam | perl -ne '
    chomp; 
    /(.+?)\./; 
    push @{$hash{$1}},$_; 
    if(eof){ 
        for $key (sort keys %hash){ 
            $cmd="samtools merge -fn $key.merged.bam @{$hash{$key}}"; 
            print "\nCMD: $cmd\n"; 
            #system $cmd 
        } 
    }
'

CMD: samtools merge -fn merged.merged.bam merged.1-1.translated.bam merged.1-2.translated.bam merged.2-1.pass.bam merged.2-2.pass.bam merged.3-1.pass.bam merged.3-2.pass.bam merged.4-1.pass.bam merged.4-2.pass.bam merged.5-1.pass.bam merged.5-2.pass.bam

Real numbers

Here are Riboseq read counts in the multi stage mapping procedure from one of our study. The direct araport11 row provides numbers of mapped reads by mapping reads only to Araport11 transcriptome using the mapping approach in this page without applying its filtering step.

sample root-ctr-0h-r1 root-ctr-0h-r2 root-ctr-0h-r3 root-ctr-0h-r4 shoot-ctr-0h-r1 shoot-ctr-0h-r2 shoot-ctr-0h-r3 shoot-ctr-0h-r4
#read (25~50bps) 57,485,656 67,069,781 53,221,915 55,725,679 59,291,669 52,932,987 61,515,314 64,121,338
1-1 37,303,514 43,484,682 33,466,131 32,998,508 37,387,833 30,684,389 37,374,588 41,243,699
1-2 5,874 3,339 1,580 7,508 5,243 3,897 7,512 2,294
2-1 1,459,963 1,428,306 1,288,558 810,410 2,903,377 1,237,592 2,450,520 1,308,064
2-2 1,047 504 235 1,160 583 241 763 172
3-1 443,840 891,183 361,960 927,978 469,497 446,371 348,467 597,956
3-2 470,724 407,718 288,529 376,342 374,567 299,812 306,810 248,776
4-1 12,571,122 17,145,748 13,243,792 16,137,106 14,090,490 16,855,643 14,662,137 16,906,860
4-2 1,058 509 238 965 722 602 985 349
5-1 1,088,251 1,090,234 1,111,191 1,101,176 697,123 781,785 577,874 1,057,908
5-2 498,313 206,906 357,688 187,871 198,793 180,734 148,247 130,469
merged 53,843,706 64,659,129 50,119,902 52,549,024 56,128,228 50,491,066 55,877,903 61,496,547
% 94% 96% 94% 94% 95% 95% 91% 96%
direct araport11 53,233,842 63,164,986 49,378,730 50,914,378 55,501,811 49,702,987 55,370,073 60,523,342
% 93% 94% 93% 91% 94% 94% 90% 94%
⚠️ **GitHub.com Fallback** ⚠️