Riboseq multi stage mapping and filtering, part 1

Adapter removal and read multiplexing

Adapter removal

In earlier years, we used to receive Riboseq read datasets whose adapters had already been removed by the sequencing facility. More recently, we have applied Trimmomatic for adapter removal ourselves. We believe it is best to discuss with the sequencing facility how adapter removal should be performed for Riboseq datasets. Here we present one of our recent examples of adapter removal.

wdlin@comp03:somewhere$ cat linker.txt
>linker
CTGTAGGCACCATCAAT

wdlin@comp03:somewhere$ mkdir trimmo log

wdlin@comp03:somewhere$ ls src/*_R1.fastq.gz | perl -ne '
    chomp; 
    /.+\/(.+?)\./; 
    $cmd="java -jar trimmomatic-0.39.jar SE -threads 20 -phred33 $_ trimmo/$1.fastq.gz ILLUMINACLIP:linker.txt:2:9:9 MINLEN:15 >> log/$1\_trimmolog 2>&1"; 
    print "\nCMD: $cmd\n"; 
    #system $cmd;
'

In this example, the linker sequence was suggested by the sequencing facility. Although the Riboseq dataset was paired-end, we picked only the R1 files for trimming (ls src/*_R1.fastq.gz) because the R1 reads follow the original read orientation; this was also suggested by the sequencing facility. In the above block, the last command is a perl oneliner that generates one Trimmomatic command per _R1.fastq.gz file in the src folder. In the generated commands, the Trimmomatic output files are saved in the trimmo folder. The ILLUMINACLIP matching/scoring settings 2:9:9 were chosen specifically for the short linker sequence. Note that we intentionally placed a hash mark (#) in front of system; remove it if you want the generated commands to be executed.

wdlin@comp03:somewhere$ ls src/*_R1.fastq.gz | perl -ne 'chomp; /.+\/(.+?)\./; $cmd="java -jar trimmomatic-0.39.jar SE -threads 20 -phred33 $_ trimmo/$1.fastq.gz ILLUMINACLIP:linker.txt:2:9:9 MINLEN:15 >> log/$1\_trimmolog 2>&1"; print "\nCMD: $cmd\n"; #system $cmd;'

CMD: java -jar trimmomatic-0.39.jar SE -threads 20 -phred33 src/riboseq_root_ES_r4_R1.fastq.gz trimmo/riboseq_root_ES_r4_R1.fastq.gz ILLUMINACLIP:linker.txt:2:9:9 MINLEN:15 >> log/riboseq_root_ES_r4_R1_trimmolog 2>&1
(...)
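
As an alternative to removing the hash mark, the generated commands can simply be printed and piped to a shell. The following is only an equivalent sketch of the execution step, not the exact command we ran.

ls src/*_R1.fastq.gz | perl -ne '
    chomp;
    /.+\/(.+?)\./;
    # print the Trimmomatic command for this file and let the shell at the end of the pipe run it
    print "java -jar trimmomatic-0.39.jar SE -threads 20 -phred33 $_ trimmo/$1.fastq.gz ILLUMINACLIP:linker.txt:2:9:9 MINLEN:15 >> log/$1\_trimmolog 2>&1\n"
' | sh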

Riboseq read multiplexing

In our experience, Riboseq reads tend to be heavily duplicated. It is worth multiplexing/demultiplexing reads when the total number of reads in a project is large, especially because some of our mapping steps are time consuming. The read multiplexing technique is to save a frequency file that records how many times each sequence appears in each sample. The following perl oneliner reads all .fastq.gz files in the trimmo folder and saves merged.freq as the frequency file.

wdlin@comp03:somewhere$ ls -l trimmo/
total 9757672
-rwxrwxr-x 1 wdlin wdlin 1243733789 Jul 15 22:46 root-ctr-0h-r1.fastq.gz
-rwxrwxr-x 1 wdlin wdlin 1382459851 Jul 15 22:47 root-ctr-0h-r2.fastq.gz
-rwxrwxr-x 1 wdlin wdlin 1136664805 Jul 15 22:48 root-ctr-0h-r3.fastq.gz
-rwxrwxr-x 1 wdlin wdlin 1037868642 Jul 15 22:49 root-ctr-0h-r4.fastq.gz
-rwxrwxr-x 1 wdlin wdlin 1333517997 Jul 15 22:49 shoot-ctr-0h-r1.fastq.gz
-rwxrwxr-x 1 wdlin wdlin 1091000714 Jul 15 22:50 shoot-ctr-0h-r2.fastq.gz
-rwxrwxr-x 1 wdlin wdlin 1401192669 Jul 15 22:51 shoot-ctr-0h-r3.fastq.gz
-rwxrwxr-x 1 wdlin wdlin 1365370672 Jul 15 22:52 shoot-ctr-0h-r4.fastq.gz

wdlin@comp03:somewhere$ ls trimmo/*.fastq.gz | perl -ne '
    chomp; /.+\/(.+?)\./; 
    print "FILE: $1\n"; $
    cmd="gzip -dc $_"; 
    system $cmd
' | perl -ne '
    if(/^FILE: (\S+)/){ 
        $file=$1; 
        $files{$file}=1; 
        $cnt=0 
    }else{ 
        $cnt++; 
        if(($cnt%4)==2){ 
            chomp; 
            $hash{$_}{$file}++ 
        } 
    } 
    
    if(eof){ 
        print "SEQ"; 
        for $file (sort keys %files){ 
            print "\t$file" 
        } 
        print "\n"; 
        for $seq (keys %hash){ 
            print "$seq"; 
            for $file (sort keys %files){ 
                if(exists $hash{$seq}{$file}){ 
                    print "\t$hash{$seq}{$file}" 
                }else{ 
                    print "\t0" 
                } 
            } 
            print "\n" 
        } 
    }
' > merged.freq

In this case, the perl oneliner took quite a while and used no more than 10 GB of memory. Here are the first few lines of the frequency file.

wdlin@comp03:somewhere$ head -5 merged.freq
SEQ     root-ctr-0h-r1  root-ctr-0h-r2  root-ctr-0h-r3  root-ctr-0h-r4  shoot-ctr-0h-r1 shoot-ctr-0h-r2 shoot-ctr-0h-r3     shoot-ctr-0h-r4
AGAATGTCTCTAAGGTTGCATAGAAGGA    0       0       0       0       0       0       1       0
AAAACGGGGGGGGCAAGTGTTCTTCGG     10      5       3       2       0       0       0       0
GTTACACTACTGATGCCCGCGTCGCGATA   0       0       1       0       0       0       0       0
GGGAACGCAACAATGTTGGGAAACGAT     2       2       0       2       0       0       0       0

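Because every distinct sequence occupies exactly one data line in merged.freq, the degree of read duplication can be estimated by comparing the number of distinct sequences with the total read count. The following oneliner is just a quick sanity-check sketch, not part of the pipeline.

cat merged.freq | perl -ne '
    next if $.==1;
    @t=split;
    shift @t;        # drop the sequence column
    $distinct++;
    for $x (@t){
        $total+=$x
    }
    print "distinct: $distinct\ttotal: $total\n" if eof
'
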
The following oneliner obtains the total read count of each sample from the frequency file.

wdlin@comp03:somewhere$ cat merged.freq | perl -ne '
    chomp; 
    @t=split; 
    if($.==1){ 
        @header=@t; 
        shift @header; 
        next 
    } 
    shift @t; 
    for($i=0;$i<@t;$i++){ 
        $hash{$i}+=$t[$i] 
    } 
    
    if(eof){ 
        for $i (sort keys %hash){ 
            print "$header[$i]\t$hash{$i}\n" 
        } 
    }
'
root-ctr-0h-r1  58082938
root-ctr-0h-r2  67623514
root-ctr-0h-r3  53329668
root-ctr-0h-r4  56012462
shoot-ctr-0h-r1 59738467
shoot-ctr-0h-r2 53278422
shoot-ctr-0h-r3 61826359
shoot-ctr-0h-r4 64248867
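
These totals should match the numbers of reads in the trimmed fastq.gz files (four lines per read). A possible cross-check, again only a sketch and not part of the pipeline:

ls trimmo/*.fastq.gz | perl -ne '
    chomp;
    /.+\/(.+?)\./;
    $name=$1;
    # count lines of the decompressed fastq and divide by 4 to get the read count
    $lines=`gzip -dc $_ | wc -l`;
    print "$name\t".($lines/4)."\n"
'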

The following oneliner computes the read length distribution of each sample from the frequency file.

wdlin@comp03:somewhere$ cat merged.freq | perl -ne '
    chomp; 
    @t=split; 
    if($.==1){ 
        @header=@t; 
        shift @header; 
        next 
    } 
    $len=length(shift @t); 
    for($i=0;$i<@t;$i++){ 
        $hash{$len}{$header[$i]}+=$t[$i] 
    } 
    
    if(eof){ 
        for $x (@header){ 
            print "\t$x"; 
        } 
        print "\n"; 
        for $len (sort {$a<=>$b} keys %hash){ 
            print "$len"; 
            for $x (@header){ 
                print "\t$hash{$len}{$x}"; 
            } 
            print "\n"; 
        } 
    }
' > summary.lenDist

wdlin@comp03:somewhere$ head summary.lenDist
        root-ctr-0h-r1  root-ctr-0h-r2  root-ctr-0h-r3  root-ctr-0h-r4  shoot-ctr-0h-r1 shoot-ctr-0h-r2 shoot-ctr-0h-r3     shoot-ctr-0h-r4
15      0       1509    0       142     0       802     0       185
16      0       3511    0       224     0       2092    0       230
17      0       4815    0       441     0       4725    0       391
18      0       5887    0       609     0       7212    0       717
19      0       6926    0       1346    0       7511    0       962
20      23789   13131   2647    4812    18681   9551    14880   1884
21      57060   38401   3678    18906   42113   24438   32116   6832
22      81405   75339   8395    40002   64279   37983   46518   14290
23      112457  131009  22930   69868   113117  123383  72553   41770
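
Before filtering by read length, it may be worth checking what fraction of each sample falls into the 25~50 range that will be kept in the next step. A possible sketch based on summary.lenDist (not part of the pipeline):

cat summary.lenDist | perl -ne '
    chomp;
    @t=split;
    if($.==1){
        @header=@t;
        next
    }
    $len=shift @t;
    for($i=0;$i<@t;$i++){
        $total{$header[$i]}+=$t[$i];
        $kept{$header[$i]}+=$t[$i] if $len>=25 && $len<=50
    }
    if(eof){
        # report the kept fraction per sample
        for $x (@header){
            printf "%s\t%.4f\n", $x, $kept{$x}/$total{$x}
        }
    }
'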

Finally, generate a FASTA file of the reads with lengths in the 25~50 range.

wdlin@comp03:somewhere$ cat merged.freq | perl -ne '
    if($.==1){ 
        print 
    }else{ 
        @t=split; 
        print if length($t[0])>=25 && length($t[0])<=50 
    }
' > merged.freq.1

wdlin@comp03:somewhere$ cat merged.freq.1 | perl -ne '
    next if $.==1; 
    $idx++; 
    chomp; 
    @t=split; 
    $seq=shift @t; 
    $sum=0; 
    for $x (@t){ 
        $sum+=$x 
    } 
    print ">seq$idx\_Freq=$sum\n$seq\n"
' > merged.fasta

wdlin@comp03:somewhere$ head merged.fasta
>seq1_Freq=1
AGAATGTCTCTAAGGTTGCATAGAAGGA
>seq2_Freq=20
AAAACGGGGGGGGCAAGTGTTCTTCGG
>seq3_Freq=1
GTTACACTACTGATGCCCGCGTCGCGATA
>seq4_Freq=6
GGGAACGCAACAATGTTGGGAAACGAT
>seq5_Freq=1
GGTTAGTCGATCCTAAGAGTCATGATAACTCGACGGATCGCA
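
As a quick consistency check, the sum of all Freq values in merged.fasta should equal the column sums of merged.freq.1, i.e., the number of reads surviving the 25~50 length filter. A possible sketch:

cat merged.fasta | perl -ne '
    # accumulate the Freq value of every FASTA header
    $total+=$1 if /^>.+_Freq=(\d+)/;
    print "total reads represented: $total\n" if eof
'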

The next part takes only this merged.fasta as the input for Riboseq read mapping.
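
After mapping, the per-sample read counts behind every mapped sequence can be recovered (demultiplexed) from merged.freq.1, because seqN in merged.fasta corresponds to the N-th data line of merged.freq.1. The following is only a rough sketch of the idea; it assumes a hypothetical file mapped.ids that lists the mapped read names (e.g. seq123_Freq=5), one per line, and counts each mapped sequence once.

cat mapped.ids merged.freq.1 | perl -ne '
    chomp;
    if(/^seq(\d+)/){
        # a read name from the hypothetical mapped.ids file
        $wanted{$1}=1
    }elsif(/^SEQ\t/){
        # header line of merged.freq.1
        @header=split;
        shift @header
    }else{
        # data line of merged.freq.1, numbered exactly like seqN in merged.fasta
        $idx++;
        if(exists $wanted{$idx}){
            @t=split;
            shift @t;
            for($i=0;$i<@t;$i++){
                $hash{$i}+=$t[$i]
            }
        }
    }
    if(eof){
        for $i (sort {$a<=>$b} keys %hash){
            print "$header[$i]\t$hash{$i}\n"
        }
    }
'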