Preprocessing - Bioinformatics-Institute/transcriptomics_WBC GitHub Wiki

RNA-seq Flowchart - Module 3

1-v. Preprocessing

Use Flexbar to trim sequence adapter from the read FASTQ files. The output of this step will be adaptor-trimmed FASTQ files for each data set.

Refer to the Flexbar project and manual for a more detailed explanation:

Flexbar basic usage:

#flexbar -r reads [-t target] [-b barcodes] [-a adapters] [options]

Extra options specified below:

'--adapter-min-overlap 7' requires a minimum of 7 bases to match the adapter
'--adapter-trim-end RIGHT' uses a trimming strategy to remove the adapter from the 3' end or RIGHT end of the read
'--max-uncalled 300' allows as many as 300 uncalled or N bases (MiSeq read lengths can be 300bp)
'--min-read-length' the minimum read length allowed after trimming is 25bp. TopHat has a default segment length of 25bp. The smallest read size should be at least one segment length.
'--threads 8 ' use 8 threads
'--zip-output GZ' the input FASTQ files are gzipped so we will output gzipped FASTQ to save space
'--adapters ' define the path to the adapter FASTA file to trim
'--reads ' define the path to the read 1 FASTQ file of reads
'--reads2 ' define the path to the read 2 FASTQ file of reads
'--target ' a base path for the output files. The value will _1.fastq.gz and _2.fastq.gz for read 1 and read 2 respectively
'--pre-trim-left' trim a fixed number of bases at left read end. For example, to trim 5 bases at the left side of reads: --pre-trim-left 5
'--pre-trim-right' trim a fixed number of bases at right read end. For example, to trim 5 bases at the right side of reads: --pre-trim-right 5
'--pre-trim-phred' trim based on phred quality value to deal with higher error rates towards the end of reads. For example, to trim the 3' end until quality offset value 30 or higher is reached, specify: --pre-trim-phred 30

Flexbar adaptor trimming

We are simply going to remove adaptors and not use the trimming function (which will be done later). First, set up some directories for output

    cd $RNAWORKING/
    mkdir adaptor_trimmed

We can run flexbar on single files or two paired files simultaneously. To run on two paired files, we would run:

# flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters $RNARAW/illumina_multiplex.fa --threads 2 --zip-output GZ --reads $RNARAW/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz --reads2 $RNARAW/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz --target adaptor_trimmed/UHR_R1

This would produce two files in the adaptor_trimmed directory, prefixed with UHR_R1 and with _1.fastq.gz or _2.fastq.gz at the end, depending on whether it was the left or right read. However, we are going to use a loop, to run the command on all 12 files simultaneously. The files also end up with much shorter and simpler names.

    for rep in 1 2 3
    do 
        flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters $RNARAW/illumina_multiplex.fa --threads 2 --zip-output GZ --reads $RNARAW/HBR_Rep${rep}_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz --reads2 $RNARAW/HBR_Rep${rep}_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz --target adaptor_trimmed/HBR_R${rep}
    done

    for rep in 1 2 3
    do 
        flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters $RNARAW/illumina_multiplex.fa --threads 2 --zip-output GZ --reads $RNARAW/UHR_Rep${rep}_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz --reads2 $RNARAW/UHR_Rep${rep}_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz --target adaptor_trimmed/UHR_R${rep}
    done

SolexaQA dynamictrim and lengthsort

We can use SolexaQA++ to trim reads by quality, remove reads below a minimum length, and discard reads without a mate. We will run two subprograms of SolexaQA++: dynamictrim and lengthsort.

This will run SolexaQA++ dynamictrim on the files we previously adaptor trimmed, and put the outputs in $RNAWORKING/trim.

    cd $RNAWORKING
    mkdir trim
    SolexaQA++ dynamictrim adaptor_trimmed/*.fastq.gz -d trim
    ls trim

Now we will run SolexaQA++ lengthsort on the trimmed output files. Lengthsort (by default) removes reads that are below 25 bp, and removes unpaired reads (to a separate file, so you don't lose them). SolexaQA++ lengthsort requires both the left and right reads, but we can run the program on all the files using a nested loop.

    mkdir $RNAWORKING/sort
    for sample in HBR UHR 
    do
        for rep in 1 2 3
        do 
         SolexaQA++ lengthsort trim/${sample}_R${rep}_1.fastq.trimmed.gz trim/${sample}_R${rep}_2.fastq.trimmed.gz -d sort
        done
    done
    ls sort

The above command takes a bit of explaining. There are two sets of six sample files, prefixed with "UHR" and "HBR" (because they are from different tissue types). Within this set of 6 there are 3 reps, with the text "R1/2/3" in the filename. Within each rep there are left and right reads. So we can refer to the file using variables to represent the prefix and the rep number. Instead of the file:

    #HBR_R1_1.fastq.trimmed.gz

We can call 6 files by using:

    #${sample}_R${rep}_1.fastq.trimmed.gz

..in a loop.

These files have now had adaptors removed, have been trimmed by quality, had reads <25bp and unpaired reads removed. These files will be used for downstream analysis. However you've probably noticed the names becoming long and unwieldy again. Long names can be useful; for example, the name tells us which ERCC mix was used. We're also leaving a trail of intermediate files and folders behind us (which can be useful in some circumstances). Let's get our final files and put them in a separate directory, and clean up behind us:

    mkdir final
    for sample in HBR UHR
    do
        for rep in 1 2 3
        do 
        mv sort/${sample}_R${rep}_1.fastq.trimmed.paired.gz final/${sample}_R${rep}_1.fastq.gz
        mv sort/${sample}_R${rep}_2.fastq.trimmed.paired.gz final/${sample}_R${rep}_2.fastq.gz
        done
    done
    rm -r $RNAWORKING/adaptor_trimmed
    rm -r $RNAWORKING/trim
    rm -r $RNAWORKING/sort

Previous Section	This Section	Next Section
Data QC	Preprocessing	Alignment