Riboseq multi stage mapping and filtering, part 3 - wdlingit/cop GitHub Wiki
Following the Riboseq multiplexing steps and the multi stage mapping procedure, there are the following three files:
-
merged.fasta
: a FASTA file storing only nonredundant Riboseq reads, -
merged.freq
: a frequency file storing read frequencies in all samples, and -
merged.merged.bam
: the BAM file storing qualified alignments ofmerged.fasta
The following perl oneliner was applied to demultiplex alignments in merged.merged.bam
and save SAM files int folder multiply0
. In short, the perl oneliner read frequency information from merged.freq
, duplicated alignments by different times according to the read frequencies in samples, and saved duplicated alignments into SAM files.
wdlin@comp02:somewhere$ mkdir multiply0
wdlin@comp02:somewhere$ samtools view merged.merged.bam | perl -ne '
if($.==1){
open(FILE,"<merged.freq");
$line=<FILE>;
@samples=split(/\s+/,$line);
shift @samples;
for $sample (@samples){
open($fh{$sample},">multiply0/$sample.sam");
}
while($line=<FILE>){
@s=split(/\s+/,$line);
$seq=shift @s;
for($idx=0;$idx<@s;$idx++){
$freq{$seq}{$samples[$idx]}=$s[$idx]
}
}
close FILE
}
chomp;
@t=split;
$seq=$t[9];
if($t[1]&16){
$seq=reverse $seq;
$seq=~tr/ACGT/TGCA/
}
$id=$t[0];
for $sample (@samples){
for $x (1..$freq{$seq}{$sample}){
$t[0]="$id\_$x";
$rec=join("\t",@t);
print {$fh{$sample}} "$rec\n"
}
}
'
Since SAM files were space consuming, we applied the following perl oneliner to generate commands to sort SAM records by read names and saved them into BAM files. After that, SAM files were removed.
wdlin@comp02:somewhere$ ls multiply0/*.sam | perl -ne '
chomp;
/(.+?)\./;
$cmd="samtools view -Sbo /dev/stdout -T DB/TAIR10_chr_all.fas $_ | samtools sort -n -m 7G -@ 7 -o $1.name.bam /dev/stdin";
print "\nCMD: $cmd\n";
#system $cmd;
'
wdlin@comp02:somewhere$ rm multiply0/*.sam
Here are some observations on the demultiplexing. The first read in multiply0/root-ctr-0h-r1.name.bam
is seq2
, which was multiplied by 10 times. Its sequence is AAAACGGGGGGGGCAAGTGTTCTTCGG
by looking up merged.fasta
. In the frequency file merged.freq
, the frequency of sequence AAAACGGGGGGGGCAAGTGTTCTTCGG
in root-ctr-0h-r1
is exactly 10.
wdlin@comp02:somewhere$ samtools view multiply0/root-ctr-0h-r1.name.bam | head -10
seq2_Freq=20_1 16 ChrM 362730 0 27M * 0 0 CCGAAGAACACTTGCCCCCCCCGTTTT * NM:i:0 XM:i:0 MD:Z:27
seq2_Freq=20_2 16 ChrM 362730 0 27M * 0 0 CCGAAGAACACTTGCCCCCCCCGTTTT * NM:i:0 XM:i:0 MD:Z:27
seq2_Freq=20_3 16 ChrM 362730 0 27M * 0 0 CCGAAGAACACTTGCCCCCCCCGTTTT * NM:i:0 XM:i:0 MD:Z:27
seq2_Freq=20_4 16 ChrM 362730 0 27M * 0 0 CCGAAGAACACTTGCCCCCCCCGTTTT * NM:i:0 XM:i:0 MD:Z:27
seq2_Freq=20_5 16 ChrM 362730 0 27M * 0 0 CCGAAGAACACTTGCCCCCCCCGTTTT * NM:i:0 XM:i:0 MD:Z:27
seq2_Freq=20_6 16 ChrM 362730 0 27M * 0 0 CCGAAGAACACTTGCCCCCCCCGTTTT * NM:i:0 XM:i:0 MD:Z:27
seq2_Freq=20_7 16 ChrM 362730 0 27M * 0 0 CCGAAGAACACTTGCCCCCCCCGTTTT * NM:i:0 XM:i:0 MD:Z:27
seq2_Freq=20_8 16 ChrM 362730 0 27M * 0 0 CCGAAGAACACTTGCCCCCCCCGTTTT * NM:i:0 XM:i:0 MD:Z:27
seq2_Freq=20_9 16 ChrM 362730 0 27M * 0 0 CCGAAGAACACTTGCCCCCCCCGTTTT * NM:i:0 XM:i:0 MD:Z:27
seq2_Freq=20_10 16 ChrM 362730 0 27M * 0 0 CCGAAGAACACTTGCCCCCCCCGTTTT * NM:i:0 XM:i:0 MD:Z:27
wdlin@comp02:somewhere$ cat merged.fasta | grep -A 1 seq2_Freq
>seq2_Freq=20
AAAACGGGGGGGGCAAGTGTTCTTCGG
wdlin@comp02:somewhere$ cat merged.freq | perl -ne '@t=split; print $_ if $.==1 || $t[0] eq "AAAACGGGGGGGGCAAGTGTTCTTCGG"'
SEQ root-ctr-0h-r1 root-ctr-0h-r2 root-ctr-0h-r3 root-ctr-0h-r4 shoot-ctr-0h-r1 shoot-ctr-0h-r2 shoot-ctr-0h-r3 shoot-ctr-0h-r4
AAAACGGGGGGGGCAAGTGTTCTTCGG 10 5 3 2 0 0 0 0
For removing potential rRNA reads from Riboseq datasets, there are a number of options. For example, we can use a scanning tool like sortmerna, or removing reads matching to rRNA/tRNA regions. In addition to aforementioned database method, we found that genomic regions extensively mapped by multi reads (reads mapping to multiple locations in the genome) could be rRNA-related. In this section, we present the procedure to identify these regions.
The following perl oneliner was used to generate commands that select alignments of multi reads from all *.name.bam
made in our last step. The selection was controlled by filter options -filter top true -filter multi -min 2
that retain reads if they had 2 or more the-same-best alignments. Selected alignments were saved in *.multi.bam
files.
wdlin@comp02:somewhere$ ls multiply0/*.name.bam | perl -ne '
chomp;
/(.+?)\./;
$cmd="java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAM $_ -O /dev/stdout -quiet true -filter top true -filter multi -min 2 | samtools view -Sbo $1.multi.bam -T DB/TAIR10_chr_all.fas /dev/stdin";
print "\nCMD: $cmd\n";
#system $cmd;
'
CMD: java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 -M SAM multiply0/root-ctr-0h-r1.name.bam -O /dev/stdout -quiet true -filter top true -filter multi -min 2 | samtools view -Sbo multiply0/root-ctr-0h-r1.multi.bam -T DB/TAIR10_chr_all.fas /dev/stdin
(...)
The next generated command was used to collect mapped regions of alignments in *.multi.bam
. Output files were prefixed by MULTI
.
wdlin@comp02:somewhere/multiply0$ ls *.multi.bam | perl -ne '
chomp;
push @list,"-M SAM $_";
if(eof){
$cmd="java -classpath rackj.jar:sam-1.89.jar rnaseq.MappingInfoRecover ".join(" ",@list)." -O MULTI -multi true";
print "\nCMD: $cmd\n";
#system $cmd
}
'
CMD: java -classpath rackj.jar:sam-1.89.jar rnaseq.MappingInfoRecover -M SAM root-ctr-0h-r1.multi.bam -M SAM root-ctr-0h-r2.multi.bam -M SAM root-ctr-0h-r3.multi.bam -M SAM root-ctr-0h-r4.multi.bam -M SAM shoot-ctr-0h-r1.multi.bam -M SAM shoot-ctr-0h-r2.multi.bam -M SAM shoot-ctr-0h-r3.multi.bam -M SAM shoot-ctr-0h-r4.multi.bam -O MULTI -multi true
The next command was to combine collected mapped region with existing genome annotation. In this example, we combined regions mapped by multi reads with coordinates of gene loci generated generated in part 0 (in file ../DB/Araport11.strand.cgff
). The output file MULTI.extdCGFF
stored updated coordinates of gene loci. For example, by comparing MULTI.extdCGFF
with the original file, it was found that AT1G02480
was extended by 3bps. Unannotated regions mapped by Riboseq multi reads were marked NOVEL_*
.
wdlin@comp02:somewhere/multiply0$ java -classpath rackj.jar:sam-1.89.jar rnaseq.TranscriptomeRecover -GFF ../DB/Araport11.strand.cgff -model /dev/null -exon MULTI.exonInfo -intron MULTI.intronInfo -O MULTI
program: TranscriptomeRecover
canonical GFF filename (-GFF): ../DB/Araport11.strand.cgff
exon filename (-exon): MULTI.exonInfo
intron filename (-intron): MULTI.intronInfo
mated pair filename (-matePair): null
depth cutoff (-depth): 2
outprefix (-O): MULTI
model filename (-model): /dev/null
junction cutoff (-byReadSplice): 3 2 10
Decomposition process
Qualification process
Adjustment process
Intron boundry adjustment process
Reporting
wdlin@comp02:somewhere/multiply0$ head MULTI.extdCGFF
>AT1G01010 Chr1 3631 5899 +
3631 3913
3996 4276
4486 4605
4706 5095
5174 5326
5439 5899
>AT1G01020 Chr1 6788 9130 -
6788 7069
7157 7232
wdlin@comp02:somewhere/multiply0$ diff MULTI.extdCGFF ../DB/Araport11.strand.cgff | head
1174,1175c1174,1175
< >AT1G02480 Chr1 515491 515566 +
< 515491 515566
---
> >AT1G02480 Chr1 515494 515566 +
> 515494 515566
1362,1363c1362,1363
< >AT1G02760 Chr1 604397 604475 +
< 604397 604475
---
wdlin@comp02:somewhere/multiply0$ cat MULTI.extdCGFF | grep -A 3 NOVEL | head
>NOVEL_1 Chr1 1100205 1100241 +
1100205 1100241
>NOVEL_2 Chr1 1525258 1525309 +
1525258 1525309
>NOVEL_3 Chr1 6274207 6274235 +
6274207 6274235
>NOVEL_4 Chr1 8782580 8782636 +
8782580 8782636
>NOVEL_5 Chr1 8782641 8782674 +
8782641 8782674
The following perl oneliner was applied to generate commands that counting multi reads for all gene loci, based on the updated coordinates in MULTI.extdCGFF
. Read count files we adopted were named *.geneRKM
.
wdlin@comp02:somewhere/multiply0$ ls *.multi.bam | perl -ne '
chomp;
/(.+?)\./;
$cmd="java -classpath rackj.jar:sam-1.89.jar rnaseq.RPKMComputer -GFF MULTI.extdCGFF -M SAM $_ -O $1.multi";
print "\nCMD: $cmd\n";
#system $cmd
'
CMD: java -classpath rackj.jar:sam-1.89.jar rnaseq.RPKMComputer -GFF MULTI.extdCGFF -M SAM root-ctr-0h-r1.multi.bam -O root-ctr-0h-r1.multi
(...)
Accordingly, we applied the following perl oneliner to compute top loci that were mapped by most multi reads. It was found that the first five loci covered more than 95% of multi reads. NOTE: Example output in this block was from what we got when working with TAIR10 genome annotation. With Araport11, these NOVEL
regions were annotated with some genes.
wdlin@comp01:somewhere/multiply0$ cat *.multi.geneRPKM | perl -ne '
next if /^#/;
chomp;
@token=split;
$hash{$token[0]}+=$token[2];
$sum+=$token[2];
if(eof){
for $key (sort {$hash{$b}<=>$hash{$a}} keys %hash){
$add+=$hash{$key}/$sum;
print "$key\t$hash{$key}\t$add\n"
}
}
' | head -20
NOVEL_812 231953466.326005 0.348210588410695
NOVEL_116 231483020.326006 0.695714939139023
NOVEL_802 59070349.3364712 0.784391868291566
AT2G01021 58918537.1666666 0.872840895676514
NOVEL_810 58918537.1666666 0.961289923061462
NOVEL_1016 613508.211098237 0.962210926996282
NOVEL_1020 408452.940652156 0.962824100151464
NOVEL_1044 401538.747050331 0.96342689365826
NOVEL_1079 379128.702668979 0.963996045008877
NOVEL_1031 374496.258292516 0.964558242093159
So we applied the same perl oneliner again but just for picking the first five items. Finally, coordinates of regions extensively mapped by multi reads were saved in selected.cgff
. For convenience, these regions were called multi-high regions.
wdlin@comp01:somewhere/multiply0$ cat *.multi.geneRPKM | perl -ne '
next if /^#/;
chomp; @token=split;
$hash{$token[0]}+=$token[2];
$sum+=$token[2];
if(eof){
for $key (sort {$hash{$b}<=>$hash{$a}} keys %hash){
$add+=$hash{$key}/$sum;
print "$key\t$hash{$key}\t$add\n"
}
}
' | head -5 > multihigh.list
wdlin@comp01:somewhere/multiply0$ cat MULTI.extdCGFF | perl -ne '
if($.==1){
open(FILE,"<multihigh.list");
while($line=<FILE>){
chomp $line;
@s=split(/\s+/,$line);
$hash{$s[0]}=1
}
close FILE
}
if(/^>(\S+)/){
$flag=0;
$flag=1 if exists $hash{$1}
}
print if $flag
' > selected.cgff
wdlin@comp01:somewhere/multiply0$ head selected.cgff
>AT2G01021 Chr2 6116 8146 +
6116 8146
>NOVEL_116 Chr2 8214 9529 +
8214 9529
>NOVEL_802 Chr3 14192869 14193091 +
14192869 14193091
>NOVEL_810 Chr3 14200087 14202117 +
14200087 14202117
>NOVEL_812 Chr3 14202185 14203500 +
14202185 14203500
In our practice, we used to remove reads mapping to rRNA and tRNA, as well as reads mapping to multi-high regions. Our approach was to identify those regions, identify reads mapping to those regions, and remove those reads from merged.merged.bam
. As multi-high regions were identified in the last session, we observed the feature hierarchy inside Araport11 GFF3 file and extract coordinates of rRNAs and tRNAs from Araport11 genome annotation.
wdlin@comp01:somewhere/DB$ cat Araport11_GFF3_genes_transposons.Jun2016.gff.features
GffRoot
gene*
mRNA*
five_prime_UTR*
exon*
CDS*
three_prime_UTR*
protein*
lnc_RNA*
exon*
antisense_lncRNA*
exon*
miRNA_primary_transcript*
miRNA*
transcript_region*
exon*
tRNA*
exon*
antisense_RNA*
exon*
snoRNA*
exon*
ncRNA*
exon*
snRNA*
exon*
uORF*
rRNA*
exon*
transposable_element*
transposon_fragment*
pseudogene*
pseudogenic_transcript*
pseudogenic_exon*
pseudogenic_tRNA*
pseudogenic_exon*
transposable_element_gene*
mRNA*
exon*
wdlin@comp01:somewhere/DB$ java -classpath rackj.jar:sam-1.89.jar misc.CanonicalGFF -I Araport11_GFF3_genes_transposons.201606.gff -O rRNAtRNA.cgff -GE tRNA:rRNA exon
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: CanonicalGFF
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
output filename (-O): rRNAtRNA.cgff
gene-exon feature pairs (-GE): [[tRNA:rRNA, exon]]
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
intron preserving (-IP)): false
wdlin@comp01:somewhere/DB$ head rRNAtRNA.cgff
>AT2G01010.1 Chr2 3706 5513 +
3706 5513
>AT2G01020.1 Chr2 5782 5945 +
5782 5945
>AT3G41768.1 Chr3 14197677 14199484 +
14197677 14199484
>AT3G41979.1 Chr3 14199753 14199916 +
14199753 14199916
>ATCG00920.1 ChrC 101012 102502 +
101012 102502
Note that tRNAs and rRNAs were both composed by exons in the Araport11 GFF3, so we applied option -GE tRNA:rRNA exon
for extracting exon regions belonging to tRNAs and rRNAs and organized them under individual tRNAs and rRNAs. Coordinates were saved in file rRNAtRNA.cgff
.
Next two commands were to collect reads mapping to tRNA/rRNA regions and multi-high regions, respectively. Collected alignments (reads) were saved in files merged.rRNAtRNA.bam
and merged.multihigh.bam
.
wdlin@comp01:somewhere$ java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 \
-M SAM merged.merged.bam \
-O /dev/stdout -quiet true \
-filter gene -GFF DB/rRNAtRNA.cgff | \
samtools view -Sbo merged.rRNAtRNA.test.bam -T DB/TAIR10_chr_all.fas /dev/stdin
wdlin@comp01:somewhere$ java -classpath rackj.jar:sam-1.89.jar misc.AlignmentFilter2 \
-M SAM merged.merged.bam \
-O /dev/stdout -quiet true \
-filter gene -GFF multiply0/selected.cgff | \
samtools view -Sbo merged.multihigh.bam -T DB/TAIR10_chr_all.fas /dev/stdin
Collected alignments (reads) in the two files were merged into file merged.filterBase1.bam
. Note that all the BAM file were sorted by name. Finally, the filtering on merged.merged.bam
was done by applying SplitSelector.pl
. The SplitSelector.pl
command here was to take merged.filterBase1.bam
as a list, iterate alignments in merged.merged.bam
, and output alignments (reads) not in merged.filterBase1.bam
(option --reverse 1
). Accordingly, merged.filter1.bam
was storing multiplex reads of all samples.
(necessary merge for filter bases)
wdlin@comp01:somewhere$ samtools merge -fn merged.filterBase1.bam merged.rRNAtRNA.bam merged.multihigh.bam
(filtering)
wdlin@comp01:somewhere$ SplitSelector.pl --split 10 --reverse 1 BAM merged.filterBase1.bam BAM merged.merged.bam merged.filter1.bam
Applying the same demultiplexing perl oneliner as the one at beginning of this document, and the same samtools command, demultiplexed alignment files were saved as multiply1/*.name.bam
.
(extract alignments for each samples)
wdlin@comp01:somewhere$ samtools view merged.filter1.bam | perl -ne '
if($.==1){
open(FILE,"<merged.freq");
$line=<FILE>;
@samples=split(/\s+/,$line);
shift @samples;
for $sample (@samples){
open($fh{$sample},">multiply1/$sample.sam");
}
while($line=<FILE>){
@s=split(/\s+/,$line);
$seq=shift @s;
for($idx=0;$idx<@s;$idx++){
$freq{$seq}{$samples[$idx]}=$s[$idx]
}
}
close FILE
}
chomp;
@t=split;
$seq=$t[9];
if($t[1]&16){
$seq=reverse $seq;
$seq=~tr/ACGT/TGCA/
}
$id=$t[0];
for $sample (@samples){
for $x (1..$freq{$seq}{$sample}){
$t[0]="$id\_$x";
$rec=join("\t",@t);
print {$fh{$sample}} "$rec\n"
}
}
'
wdlin@comp01:somewhere$ ls multiply1/*.sam | perl -ne '
chomp;
/(.+?)\./;
$cmd="samtools view -Sbo /dev/stdout -T DB/TAIR10_chr_all.fas $_ | samtools sort -n -m 7G -@ 7 -o $1.name.bam /dev/stdin";
print "\nCMD: $cmd\n";
#system $cmd;
'
wdlin@comp01:somewhere$ rm multiply1/*.sam