Riboseq multi stage mapping and filtering, part 1 - wdlingit/cop GitHub Wiki
Adapter removal and read multiplexing
Adapter removal
In early years, we used to receive Riboseq read datasets that were cleaned by adapter removal done by the sequencing facility. Recently, we applied Trimmomatic for adapter removal. We believe that it would be better to discuss with the sequencing facility for how to perform adapter removal for Riboseq datasets. Here we present one of our recent examples of adapter removal.
wdlin@comp03:somewhere$ cat linker.txt
>linker
CTGTAGGCACCATCAAT
wdlin@comp03:somewhere$ mkdir trimmo log
wdlin@comp03:somewhere$ ls src/*_R1.fastq.gz | perl -ne '
chomp;
/.+\/(.+?)\./;
$cmd="java -jar trimmomatic-0.39.jar SE -threads 20 -phred33 $_ trimmo/$1.fastq.gz ILLUMINACLIP:linker.txt:2:9:9 MINLEN:15 >> log/$1\_trimmolog 2>&1";
print "\nCMD: $cmd\n";
#system $cmd;
'
In this example, the linker sequence was suggested by the sequencing facility. Although the Riboseq dataset was made pair-ended, we picked only R1 files for trimming (ls src/*_R1.fastq.gz
) because R1 reads were following the original orientation. This was also suggested by the sequencing facility. In above block, the last command is a perl oneliner. This perl oneliner was to generate one Trimmomatic command per one _R1.fastq.gz
file in the src
folder. In the generated commands, Trimmomatic output files would be saved in trimmo
folder. The matching scoring part 2:9:9
was specifically set for the short linker sequence. We intended to put a sharp mark (#
) in front of system
. Remove it if you want those generated commands been executed.
wdlin@comp03:somewhere$ ls src/*_R1.fastq.gz | perl -ne 'chomp; /.+\/(.+?)\./; $cmd="java -jar trimmomatic-0.39.jar SE -threads 20 -phred33 $_ trimmo/$1.fastq.gz ILLUMINACLIP:linker.txt:2:9:9 MINLEN:15 >> log/$1\_trimmolog 2>&1"; print "\nCMD: $cmd\n"; #system $cmd;'
CMD: java -jar trimmomatic-0.39.jar SE -threads 20 -phred33 src/riboseq_root_ES_r4_R1.fastq.gz trimmo/riboseq_root_ES_r4_R1.fastq.gz ILLUMINACLIP:linker.txt:2:9:9 MINLEN:15 >> log/riboseq_root_ES_r4_R1_trimmolog 2>&1
(...)
Riboseq read multiplexing
According to our experiences, Riboseq reads used to be duplicated. It is worth doing read multiplexing/demultiplexing when the total number of reads in a project is large, especially because some of our mapping steps are time consuming. The read multiplexing technique is to save a frequency file recording which sequence is appearing how many times in which sample. The following perl oneliner is to read all .fastq.gz
files in trimmo
folder and save merged.freq
as the frequency file.
wdlin@comp03:somewhere$ ls -l trimmo/
total 9757672
-rwxrwxr-x 1 wdlin wdlin 1243733789 Jul 15 22:46 root-ctr-0h-r1.fastq.gz
-rwxrwxr-x 1 wdlin wdlin 1382459851 Jul 15 22:47 root-ctr-0h-r2.fastq.gz
-rwxrwxr-x 1 wdlin wdlin 1136664805 Jul 15 22:48 root-ctr-0h-r3.fastq.gz
-rwxrwxr-x 1 wdlin wdlin 1037868642 Jul 15 22:49 root-ctr-0h-r4.fastq.gz
-rwxrwxr-x 1 wdlin wdlin 1333517997 Jul 15 22:49 shoot-ctr-0h-r1.fastq.gz
-rwxrwxr-x 1 wdlin wdlin 1091000714 Jul 15 22:50 shoot-ctr-0h-r2.fastq.gz
-rwxrwxr-x 1 wdlin wdlin 1401192669 Jul 15 22:51 shoot-ctr-0h-r3.fastq.gz
-rwxrwxr-x 1 wdlin wdlin 1365370672 Jul 15 22:52 shoot-ctr-0h-r4.fastq.gz
wdlin@comp03:somewhere$ ls trimmo/*.fastq.gz | perl -ne '
chomp; /.+\/(.+?)\./;
print "FILE: $1\n"; $
cmd="gzip -dc $_";
system $cmd
' | perl -ne '
if(/^FILE: (\S+)/){
$file=$1;
$files{$file}=1;
$cnt=0
}else{
$cnt++;
if(($cnt%4)==2){
chomp;
$hash{$_}{$file}++
}
}
if(eof){
print "SEQ";
for $file (sort keys %files){
print "\t$file"
}
print "\n";
for $seq (keys %hash){
print "$seq";
for $file (sort keys %files){
if(exists $hash{$seq}{$file}){
print "\t$hash{$seq}{$file}"
}else{
print "\t0"
}
}
print "\n"
}
}
' > merged.freq
In this case, the perl oneliner took quite a while and used no more than 10GB memory. Here are first few lines of the frequency file.
wdlin@comp03:somewhere$ head -5 merged.freq
SEQ root-ctr-0h-r1 root-ctr-0h-r2 root-ctr-0h-r3 root-ctr-0h-r4 shoot-ctr-0h-r1 shoot-ctr-0h-r2 shoot-ctr-0h-r3 shoot-ctr-0h-r4
AGAATGTCTCTAAGGTTGCATAGAAGGA 0 0 0 0 0 0 1 0
AAAACGGGGGGGGCAAGTGTTCTTCGG 10 5 3 2 0 0 0 0
GTTACACTACTGATGCCCGCGTCGCGATA 0 0 1 0 0 0 0 0
GGGAACGCAACAATGTTGGGAAACGAT 2 2 0 2 0 0 0 0
The way to obtain read counts for each sample from the frequency file.
wdlin@comp03:somewhere$ cat merged.freq | perl -ne '
chomp;
@t=split;
if($.==1){
@header=@t;
shift @header;
next
}
shift @t;
for($i=0;$i<@t;$i++){
$hash{$i}+=$t[$i]
}
if(eof){
for $i (sort keys %hash){
print "$header[$i]\t$hash{$i}\n"
}
}
'
root-ctr-0h-r1 58082938
root-ctr-0h-r2 67623514
root-ctr-0h-r3 53329668
root-ctr-0h-r4 56012462
shoot-ctr-0h-r1 59738467
shoot-ctr-0h-r2 53278422
shoot-ctr-0h-r3 61826359
shoot-ctr-0h-r4 64248867
The way to get read length distribution for each sample from the frequency file.
wdlin@comp03:somewhere$ cat merged.freq | perl -ne '
chomp;
@t=split;
if($.==1){
@header=@t;
shift @header;
next
}
$len=length(shift @t);
for($i=0;$i<@t;$i++){
$hash{$len}{$header[$i]}+=$t[$i]
}
if(eof){
for $x (@header){
print "\t$x";
}
print "\n";
for $len (sort {$a<=>$b} keys %hash){
print "$len";
for $x (@header){
print "\t$hash{$len}{$x}";
}
print "\n";
}
}
' > summary.lenDist
wdlin@comp03:somewhere$ head summary.lenDist
root-ctr-0h-r1 root-ctr-0h-r2 root-ctr-0h-r3 root-ctr-0h-r4 shoot-ctr-0h-r1 shoot-ctr-0h-r2 shoot-ctr-0h-r3 shoot-ctr-0h-r4
15 0 1509 0 142 0 802 0 185
16 0 3511 0 224 0 2092 0 230
17 0 4815 0 441 0 4725 0 391
18 0 5887 0 609 0 7212 0 717
19 0 6926 0 1346 0 7511 0 962
20 23789 13131 2647 4812 18681 9551 14880 1884
21 57060 38401 3678 18906 42113 24438 32116 6832
22 81405 75339 8395 40002 64279 37983 46518 14290
23 112457 131009 22930 69868 113117 123383 72553 41770
Finally, generate a FASTA file of reads of length 25~50.
wdlin@comp03:somewhere$ cat merged.freq | perl -ne '
if($.==1){
print
}else{
@t=split;
print if length($t[0])>=25 && length($t[0])<=50
}
' > merged.freq.1
wdlin@comp03:somewhere$ cat merged.freq.1 | perl -ne '
next if $.==1;
$idx++;
chomp;
@t=split;
$seq=shift @t;
$sum=0;
for $x (@t){
$sum+=$x
}
print ">seq$idx\_Freq=$sum\n$seq\n"
' > merged.fasta
wdlin@comp03:somewhere$ head merged.fasta
>seq1_Freq=1
AGAATGTCTCTAAGGTTGCATAGAAGGA
>seq2_Freq=20
AAAACGGGGGGGGCAAGTGTTCTTCGG
>seq3_Freq=1
GTTACACTACTGATGCCCGCGTCGCGATA
>seq4_Freq=6
GGGAACGCAACAATGTTGGGAAACGAT
>seq5_Freq=1
GGTTAGTCGATCCTAAGAGTCATGATAACTCGACGGATCGCA
Our next part took only this merged.fasta
for Riboseq read mapping.