Riboseq multi stage mapping and filtering, part 3 - wdlingit/cop GitHub Wiki

Filtering and read demultiplexing

Read demultiplexing

Following the Riboseq multiplexing steps and the multi stage mapping procedure, there are the following three files:

  1. merged.fasta: a FASTA file storing only nonredundant Riboseq reads,
  2. merged.freq: a frequency file storing read frequencies in all samples, and
  3. merged.merged.bam: the BAM file storing qualified alignments of merged.fasta

The following perl oneliner was applied to demultiplex alignments in merged.merged.bam and save SAM files int folder multiply0. In short, the perl oneliner read frequency information from merged.freq, duplicated alignments by different times according to the read frequencies in samples, and saved duplicated alignments into SAM files.

wdlin@comp02:somewhere$ mkdir multiply0

wdlin@comp02:somewhere$ samtools view merged.merged.bam | perl -ne '
    if($.==1){ 
        open(FILE,"<merged.freq"); 
        $line=<FILE>; 
        @samples=split(/\s+/,$line); 
        shift @samples; 
        for $sample (@samples){ 
            open($fh{$sample},">multiply0/$sample.sam"); 
        } 
        while($line=<FILE>){ 
            @s=split(/\s+/,$line); 
            $seq=shift @s; 
            for($idx=0;$idx<@s;$idx++){ 
                $freq{$seq}{$samples[$idx]}=$s[$idx] 
            } 
        } 
        close FILE 
    } 
    chomp; 
    @t=split; 
    $seq=$t[9]; 
    if($t[1]&16){ 
        $seq=reverse $seq; 
        $seq=~tr/ACGT/TGCA/ 
    } 
    $id=$t[0]; 
    for $sample (@samples){ 
        for $x (1..$freq{$seq}{$sample}){ 
            $t[0]="$id\_$x"; 
            $rec=join("\t",@t); 
            print {$fh{$sample}} "$rec\n" 
        } 
    }
'

Since SAM files were space consuming, we applied the following perl oneliner to generate commands to sort SAM records by read names and saved them into BAM files. After that, SAM files were removed.

wdlin@comp02:somewhere$ ls multiply0/*.sam | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="samtools view -Sbo /dev/stdout -T DB/TAIR10_chr_all.fas $_ | samtools sort -n -m 7G -@ 7 -o $1.name.bam /dev/stdin"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd;
'

wdlin@comp02:somewhere$ rm multiply0/*.sam

Here are some observations on the demultiplexing. The first read in multiply0/root-ctr-0h-r1.name.bam is seq2, which was multiplied by 10 times. Its sequence is AAAACGGGGGGGGCAAGTGTTCTTCGG by looking up merged.fasta. In the frequency file merged.freq, the frequency of sequence AAAACGGGGGGGGCAAGTGTTCTTCGG in root-ctr-0h-r1 is exactly 10.

wdlin@comp02:somewhere$ samtools view multiply0/root-ctr-0h-r1.name.bam | head -10
seq2_Freq=20_1  16      ChrM    362730  0       27M     *       0       0       CCGAAGAACACTTGCCCCCCCCGTTTT     *  NM:i:0   XM:i:0  MD:Z:27
seq2_Freq=20_2  16      ChrM    362730  0       27M     *       0       0       CCGAAGAACACTTGCCCCCCCCGTTTT     *  NM:i:0   XM:i:0  MD:Z:27
seq2_Freq=20_3  16      ChrM    362730  0       27M     *       0       0       CCGAAGAACACTTGCCCCCCCCGTTTT     *  NM:i:0   XM:i:0  MD:Z:27
seq2_Freq=20_4  16      ChrM    362730  0       27M     *       0       0       CCGAAGAACACTTGCCCCCCCCGTTTT     *  NM:i:0   XM:i:0  MD:Z:27
seq2_Freq=20_5  16      ChrM    362730  0       27M     *       0       0       CCGAAGAACACTTGCCCCCCCCGTTTT     *  NM:i:0   XM:i:0  MD:Z:27
seq2_Freq=20_6  16      ChrM    362730  0       27M     *       0       0       CCGAAGAACACTTGCCCCCCCCGTTTT     *  NM:i:0   XM:i:0  MD:Z:27
seq2_Freq=20_7  16      ChrM    362730  0       27M     *       0       0       CCGAAGAACACTTGCCCCCCCCGTTTT     *  NM:i:0   XM:i:0  MD:Z:27
seq2_Freq=20_8  16      ChrM    362730  0       27M     *       0       0       CCGAAGAACACTTGCCCCCCCCGTTTT     *  NM:i:0   XM:i:0  MD:Z:27
seq2_Freq=20_9  16      ChrM    362730  0       27M     *       0       0       CCGAAGAACACTTGCCCCCCCCGTTTT     *  NM:i:0   XM:i:0  MD:Z:27
seq2_Freq=20_10 16      ChrM    362730  0       27M     *       0       0       CCGAAGAACACTTGCCCCCCCCGTTTT     *  NM:i:0   XM:i:0  MD:Z:27

wdlin@comp02:somewhere$ cat merged.fasta | grep -A 1 seq2_Freq
>seq2_Freq=20
AAAACGGGGGGGGCAAGTGTTCTTCGG

wdlin@comp02:somewhere$ cat merged.freq | perl -ne '@t=split; print $_ if $.==1 || $t[0] eq "AAAACGGGGGGGGCAAGTGTTCTTCGG"'
SEQ     root-ctr-0h-r1  root-ctr-0h-r2  root-ctr-0h-r3  root-ctr-0h-r4  shoot-ctr-0h-r1 shoot-ctr-0h-r2 shoot-ctr-0h-r3     shoot-ctr-0h-r4
AAAACGGGGGGGGCAAGTGTTCTTCGG     10      5       3       2       0       0       0       0

Identify regions extensively covered by multi reads

For removing potential rRNA reads from Riboseq datasets, there are a number of options. For example, we can use a scanning tool like sortmerna, or removing reads matching to rRNA/tRNA regions. In addition to aforementioned database method, we found that genomic regions extensively mapped by multi reads (reads mapping to multiple locations in the genome) could be rRNA-related. In this section, we present the procedure to identify these regions.

The following perl oneliner was used to generate commands that select alignments of multi reads from all *.name.bam made in our last step. The selection was controlled by filter options -filter top true -filter multi -min 2 that retain reads if they had 2 or more the-same-best alignments. Selected alignments were saved in *.multi.bam files.

wdlin@comp02:somewhere$ ls multiply0/*.name.bam | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAM $_ -O /dev/stdout -quiet true -filter top true -filter multi -min 2 | samtools view -Sbo $1.multi.bam -T DB/TAIR10_chr_all.fas /dev/stdin"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd;
'

CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAM multiply0/root-ctr-0h-r1.name.bam -O /dev/stdout -quiet true -filter top true -filter multi -min 2 | samtools view -Sbo multiply0/root-ctr-0h-r1.multi.bam -T DB/TAIR10_chr_all.fas /dev/stdin
(...)

The next generated command was used to collect mapped regions of alignments in *.multi.bam. Output files were prefixed by MULTI.

wdlin@comp02:somewhere/multiply0$ ls *.multi.bam | perl -ne '
    chomp; 
    push @list,"-M SAM $_"; 
    if(eof){ 
        $cmd="java -classpath rackj.jar:sam-1.89.jar rnaseq.MappingInfoRecover ".join(" ",@list)." -O MULTI -multi true"; 
        print "\nCMD: $cmd\n"; 
        #system $cmd 
    }
'
CMD: java -classpath rackj.jar:sam-1.89.jar rnaseq.MappingInfoRecover -M SAM root-ctr-0h-r1.multi.bam -M SAM root-ctr-0h-r2.multi.bam -M SAM root-ctr-0h-r3.multi.bam -M SAM root-ctr-0h-r4.multi.bam -M SAM shoot-ctr-0h-r1.multi.bam -M SAM shoot-ctr-0h-r2.multi.bam -M SAM shoot-ctr-0h-r3.multi.bam -M SAM shoot-ctr-0h-r4.multi.bam -O MULTI -multi true

The next command was to combine collected mapped region with existing genome annotation. In this example, we combined regions mapped by multi reads with coordinates of gene loci generated generated in part 0 (in file ../DB/Araport11.strand.cgff). The output file MULTI.extdCGFF stored updated coordinates of gene loci. For example, by comparing MULTI.extdCGFF with the original file, it was found that AT1G02480 was extended by 3bps. Unannotated regions mapped by Riboseq multi reads were marked NOVEL_*.

wdlin@comp02:somewhere/multiply0$ java -classpath rackj.jar:sam-1.89.jar rnaseq.TranscriptomeRecover -GFF ../DB/Araport11.strand.cgff -model /dev/null -exon MULTI.exonInfo -intron MULTI.intronInfo -O MULTI
program: TranscriptomeRecover
canonical GFF filename (-GFF): ../DB/Araport11.strand.cgff
exon filename (-exon): MULTI.exonInfo
intron filename (-intron): MULTI.intronInfo
mated pair filename (-matePair): null
depth cutoff (-depth): 2
outprefix (-O): MULTI
model filename (-model): /dev/null
junction cutoff (-byReadSplice): 3 2 10

Decomposition process
Qualification process
Adjustment process
Intron boundry adjustment process
Reporting

wdlin@comp02:somewhere/multiply0$ head MULTI.extdCGFF
>AT1G01010      Chr1    3631    5899    +
3631    3913
3996    4276
4486    4605
4706    5095
5174    5326
5439    5899
>AT1G01020      Chr1    6788    9130    -
6788    7069
7157    7232

wdlin@comp02:somewhere/multiply0$ diff MULTI.extdCGFF ../DB/Araport11.strand.cgff | head
1174,1175c1174,1175
< >AT1G02480    Chr1    515491  515566  +
< 515491        515566
---
> >AT1G02480    Chr1    515494  515566  +
> 515494        515566
1362,1363c1362,1363
< >AT1G02760    Chr1    604397  604475  +
< 604397        604475
---

wdlin@comp02:somewhere/multiply0$  cat MULTI.extdCGFF | grep -A 3 NOVEL | head
>NOVEL_1        Chr1    1100205 1100241 +
1100205 1100241
>NOVEL_2        Chr1    1525258 1525309 +
1525258 1525309
>NOVEL_3        Chr1    6274207 6274235 +
6274207 6274235
>NOVEL_4        Chr1    8782580 8782636 +
8782580 8782636
>NOVEL_5        Chr1    8782641 8782674 +
8782641 8782674

The following perl oneliner was applied to generate commands that counting multi reads for all gene loci, based on the updated coordinates in MULTI.extdCGFF. Read count files we adopted were named *.geneRKM.

wdlin@comp02:somewhere/multiply0$ ls *.multi.bam | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="java -classpath rackj.jar:sam-1.89.jar rnaseq.RPKMComputer -GFF MULTI.extdCGFF -M SAM $_ -O $1.multi"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd
'

CMD: java -classpath rackj.jar:sam-1.89.jar rnaseq.RPKMComputer -GFF MULTI.extdCGFF -M SAM root-ctr-0h-r1.multi.bam -O root-ctr-0h-r1.multi
(...)

Accordingly, we applied the following perl oneliner to compute top loci that were mapped by most multi reads. It was found that the first five loci covered more than 95% of multi reads. NOTE: Example output in this block was from what we got when working with TAIR10 genome annotation. With Araport11, these NOVEL regions were annotated with some genes.

wdlin@comp01:somewhere/multiply0$ cat *.multi.geneRPKM | perl -ne '
    next if /^#/; 
    chomp; 
    @token=split; 
    $hash{$token[0]}+=$token[2]; 
    $sum+=$token[2]; 
    if(eof){ 
        for $key (sort {$hash{$b}<=>$hash{$a}} keys %hash){ 
            $add+=$hash{$key}/$sum; 
            print "$key\t$hash{$key}\t$add\n" 
        } 
    }
' | head -20
NOVEL_812       231953466.326005        0.348210588410695
NOVEL_116       231483020.326006        0.695714939139023
NOVEL_802       59070349.3364712        0.784391868291566
AT2G01021       58918537.1666666        0.872840895676514
NOVEL_810       58918537.1666666        0.961289923061462
NOVEL_1016      613508.211098237        0.962210926996282
NOVEL_1020      408452.940652156        0.962824100151464
NOVEL_1044      401538.747050331        0.96342689365826
NOVEL_1079      379128.702668979        0.963996045008877
NOVEL_1031      374496.258292516        0.964558242093159

So we applied the same perl oneliner again but just for picking the first five items. Finally, coordinates of regions extensively mapped by multi reads were saved in selected.cgff. For convenience, these regions were called multi-high regions.

wdlin@comp01:somewhere/multiply0$ cat *.multi.geneRPKM | perl -ne '
    next if /^#/; 
    chomp; @token=split; 
    $hash{$token[0]}+=$token[2]; 
    $sum+=$token[2]; 
    if(eof){ 
        for $key (sort {$hash{$b}<=>$hash{$a}} keys %hash){ 
            $add+=$hash{$key}/$sum; 
            print "$key\t$hash{$key}\t$add\n" 
        } 
    }
' | head -5 > multihigh.list

wdlin@comp01:somewhere/multiply0$ cat MULTI.extdCGFF | perl -ne '
    if($.==1){ 
        open(FILE,"<multihigh.list"); 
        while($line=<FILE>){ 
            chomp $line; 
            @s=split(/\s+/,$line); 
            $hash{$s[0]}=1 
        } 
        close FILE 
    } 
    if(/^>(\S+)/){ 
        $flag=0; 
        $flag=1 if exists $hash{$1} 
    }
    print if $flag
' > selected.cgff

wdlin@comp01:somewhere/multiply0$ head selected.cgff
>AT2G01021      Chr2    6116    8146    +
6116    8146
>NOVEL_116      Chr2    8214    9529    +
8214    9529
>NOVEL_802      Chr3    14192869        14193091        +
14192869        14193091
>NOVEL_810      Chr3    14200087        14202117        +
14200087        14202117
>NOVEL_812      Chr3    14202185        14203500        +
14202185        14203500

Read filtering

In our practice, we used to remove reads mapping to rRNA and tRNA, as well as reads mapping to multi-high regions. Our approach was to identify those regions, identify reads mapping to those regions, and remove those reads from merged.merged.bam. As multi-high regions were identified in the last session, we observed the feature hierarchy inside Araport11 GFF3 file and extract coordinates of rRNAs and tRNAs from Araport11 genome annotation.

wdlin@comp01:somewhere/DB$ cat Araport11_GFF3_genes_transposons.Jun2016.gff.features
GffRoot
  gene*
    mRNA*
      five_prime_UTR*
      exon*
      CDS*
      three_prime_UTR*
      protein*
    lnc_RNA*
      exon*
    antisense_lncRNA*
      exon*
    miRNA_primary_transcript*
      miRNA*
    transcript_region*
      exon*
    tRNA*
      exon*
    antisense_RNA*
      exon*
    snoRNA*
      exon*
    ncRNA*
      exon*
    snRNA*
      exon*
    uORF*
    rRNA*
      exon*
  transposable_element*
    transposon_fragment*
  pseudogene*
    pseudogenic_transcript*
      pseudogenic_exon*
    pseudogenic_tRNA*
      pseudogenic_exon*
  transposable_element_gene*
    mRNA*
      exon*

wdlin@comp01:somewhere/DB$ java -classpath rackj.jar:sam-1.89.jar misc.CanonicalGFF -I Araport11_GFF3_genes_transposons.201606.gff -O rRNAtRNA.cgff -GE tRNA:rRNA exon
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: CanonicalGFF
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
output filename (-O): rRNAtRNA.cgff
gene-exon feature pairs (-GE): [[tRNA:rRNA, exon]]
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
intron preserving (-IP)): false

wdlin@comp01:somewhere/DB$ head rRNAtRNA.cgff
>AT2G01010.1    Chr2    3706    5513    +
3706    5513
>AT2G01020.1    Chr2    5782    5945    +
5782    5945
>AT3G41768.1    Chr3    14197677        14199484        +
14197677        14199484
>AT3G41979.1    Chr3    14199753        14199916        +
14199753        14199916
>ATCG00920.1    ChrC    101012  102502  +
101012  102502

Note that tRNAs and rRNAs were both composed by exons in the Araport11 GFF3, so we applied option -GE tRNA:rRNA exon for extracting exon regions belonging to tRNAs and rRNAs and organized them under individual tRNAs and rRNAs. Coordinates were saved in file rRNAtRNA.cgff.

Next two commands were to collect reads mapping to tRNA/rRNA regions and multi-high regions, respectively. Collected alignments (reads) were saved in files merged.rRNAtRNA.bam and merged.multihigh.bam.

wdlin@comp01:somewhere$ java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 \
                        -M SAM merged.merged.bam \
                        -O /dev/stdout -quiet true \
                        -filter gene -GFF DB/rRNAtRNA.cgff | \
                        samtools view -Sbo merged.rRNAtRNA.test.bam -T DB/TAIR10_chr_all.fas /dev/stdin

wdlin@comp01:somewhere$ java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 \
                        -M SAM merged.merged.bam \
                        -O /dev/stdout -quiet true \
                        -filter gene -GFF multiply0/selected.cgff | \
                        samtools view -Sbo merged.multihigh.bam -T DB/TAIR10_chr_all.fas /dev/stdin

Collected alignments (reads) in the two files were merged into file merged.filterBase1.bam. Note that all the BAM file were sorted by name. Finally, the filtering on merged.merged.bam was done by applying SplitSelector.pl. The SplitSelector.pl command here was to take merged.filterBase1.bam as a list, iterate alignments in merged.merged.bam, and output alignments (reads) not in merged.filterBase1.bam (option --reverse 1). Accordingly, merged.filter1.bam was storing multiplex reads of all samples.

(necessary merge for filter bases)
wdlin@comp01:somewhere$ samtools merge -fn merged.filterBase1.bam merged.rRNAtRNA.bam merged.multihigh.bam

(filtering)
wdlin@comp01:somewhere$ SplitSelector.pl --split 10 --reverse 1 BAM merged.filterBase1.bam BAM merged.merged.bam merged.filter1.bam

Applying the same demultiplexing perl oneliner as the one at beginning of this document, and the same samtools command, demultiplexed alignment files were saved as multiply1/*.name.bam.

(extract alignments for each samples)
wdlin@comp01:somewhere$ samtools view merged.filter1.bam | perl -ne '
    if($.==1){ 
        open(FILE,"<merged.freq"); 
        $line=<FILE>; 
        @samples=split(/\s+/,$line); 
        shift @samples; 
        for $sample (@samples){ 
            open($fh{$sample},">multiply1/$sample.sam"); 
        } 
        while($line=<FILE>){ 
            @s=split(/\s+/,$line); 
            $seq=shift @s; 
            for($idx=0;$idx<@s;$idx++){ 
                $freq{$seq}{$samples[$idx]}=$s[$idx] 
            } 
        } 
        close FILE 
    } 
    chomp; 
    @t=split; 
    $seq=$t[9]; 
    if($t[1]&16){ 
        $seq=reverse $seq; 
        $seq=~tr/ACGT/TGCA/ 
    } 
    $id=$t[0]; 
    for $sample (@samples){ 
        for $x (1..$freq{$seq}{$sample}){ 
            $t[0]="$id\_$x"; 
            $rec=join("\t",@t); 
            print {$fh{$sample}} "$rec\n" 
        } 
    }
'

wdlin@comp01:somewhere$ ls multiply1/*.sam | perl -ne '
    chomp; 
    /(.+?)\./; 
    $cmd="samtools view -Sbo /dev/stdout -T DB/TAIR10_chr_all.fas $_ | samtools sort -n -m 7G -@ 7 -o $1.name.bam /dev/stdin"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd;
'

wdlin@comp01:somewhere$ rm multiply1/*.sam
⚠️ **GitHub.com Fallback** ⚠️