Preprocessing - Bioinformatics-Institute/transcriptomics_WBC GitHub Wiki
1-v. Preprocessing
Use Flexbar to trim sequence adapter from the read FASTQ files. The output of this step will be adaptor-trimmed FASTQ files for each data set.
Refer to the Flexbar project and manual for a more detailed explanation:
Flexbar basic usage:
#flexbar -r reads [-t target] [-b barcodes] [-a adapters] [options]
Extra options specified below:
- '--adapter-min-overlap 7' requires a minimum of 7 bases to match the adapter
- '--adapter-trim-end RIGHT' uses a trimming strategy to remove the adapter from the 3' end or RIGHT end of the read
- '--max-uncalled 300' allows as many as 300 uncalled or N bases (MiSeq read lengths can be 300bp)
- '--min-read-length' the minimum read length allowed after trimming is 25bp. TopHat has a default segment length of 25bp. The smallest read size should be at least one segment length.
- '--threads 8 ' use 8 threads
- '--zip-output GZ' the input FASTQ files are gzipped so we will output gzipped FASTQ to save space
- '--adapters ' define the path to the adapter FASTA file to trim
- '--reads ' define the path to the read 1 FASTQ file of reads
- '--reads2 ' define the path to the read 2 FASTQ file of reads
- '--target ' a base path for the output files. The value will _1.fastq.gz and _2.fastq.gz for read 1 and read 2 respectively
- '--pre-trim-left' trim a fixed number of bases at left read end. For example, to trim 5 bases at the left side of reads: --pre-trim-left 5
- '--pre-trim-right' trim a fixed number of bases at right read end. For example, to trim 5 bases at the right side of reads: --pre-trim-right 5
- '--pre-trim-phred' trim based on phred quality value to deal with higher error rates towards the end of reads. For example, to trim the 3' end until quality offset value 30 or higher is reached, specify: --pre-trim-phred 30
Flexbar adaptor trimming
We are simply going to remove adaptors and not use the trimming function (which will be done later). First, set up some directories for output
cd $RNAWORKING/
mkdir adaptor_trimmed
We can run flexbar on single files or two paired files simultaneously. To run on two paired files, we would run:
# flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters $RNARAW/illumina_multiplex.fa --threads 2 --zip-output GZ --reads $RNARAW/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz --reads2 $RNARAW/UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz --target adaptor_trimmed/UHR_R1
This would produce two files in the adaptor_trimmed directory, prefixed with UHR_R1 and with _1.fastq.gz or _2.fastq.gz at the end, depending on whether it was the left or right read. However, we are going to use a loop, to run the command on all 12 files simultaneously. The files also end up with much shorter and simpler names.
for rep in 1 2 3
do
flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters $RNARAW/illumina_multiplex.fa --threads 2 --zip-output GZ --reads $RNARAW/HBR_Rep${rep}_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz --reads2 $RNARAW/HBR_Rep${rep}_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz --target adaptor_trimmed/HBR_R${rep}
done
for rep in 1 2 3
do
flexbar --adapter-min-overlap 7 --adapter-trim-end RIGHT --adapters $RNARAW/illumina_multiplex.fa --threads 2 --zip-output GZ --reads $RNARAW/UHR_Rep${rep}_ERCC-Mix1_Build37-ErccTranscripts-chr22.read1.fastq.gz --reads2 $RNARAW/UHR_Rep${rep}_ERCC-Mix1_Build37-ErccTranscripts-chr22.read2.fastq.gz --target adaptor_trimmed/UHR_R${rep}
done
SolexaQA dynamictrim and lengthsort
We can use SolexaQA++ to trim reads by quality, remove reads below a minimum length, and discard reads without a mate. We will run two subprograms of SolexaQA++: dynamictrim and lengthsort.
This will run SolexaQA++ dynamictrim on the files we previously adaptor trimmed, and put the outputs in $RNAWORKING/trim.
cd $RNAWORKING
mkdir trim
SolexaQA++ dynamictrim adaptor_trimmed/*.fastq.gz -d trim
ls trim
Now we will run SolexaQA++ lengthsort on the trimmed output files. Lengthsort (by default) removes reads that are below 25 bp, and removes unpaired reads (to a separate file, so you don't lose them). SolexaQA++ lengthsort requires both the left and right reads, but we can run the program on all the files using a nested loop.
mkdir $RNAWORKING/sort
for sample in HBR UHR
do
for rep in 1 2 3
do
SolexaQA++ lengthsort trim/${sample}_R${rep}_1.fastq.trimmed.gz trim/${sample}_R${rep}_2.fastq.trimmed.gz -d sort
done
done
ls sort
The above command takes a bit of explaining. There are two sets of six sample files, prefixed with "UHR" and "HBR" (because they are from different tissue types). Within this set of 6 there are 3 reps, with the text "R1/2/3" in the filename. Within each rep there are left and right reads. So we can refer to the file using variables to represent the prefix and the rep number. Instead of the file:
#HBR_R1_1.fastq.trimmed.gz
We can call 6 files by using:
#${sample}_R${rep}_1.fastq.trimmed.gz
..in a loop.
These files have now had adaptors removed, have been trimmed by quality, had reads <25bp and unpaired reads removed. These files will be used for downstream analysis. However you've probably noticed the names becoming long and unwieldy again. Long names can be useful; for example, the name tells us which ERCC mix was used. We're also leaving a trail of intermediate files and folders behind us (which can be useful in some circumstances). Let's get our final files and put them in a separate directory, and clean up behind us:
mkdir final
for sample in HBR UHR
do
for rep in 1 2 3
do
mv sort/${sample}_R${rep}_1.fastq.trimmed.paired.gz final/${sample}_R${rep}_1.fastq.gz
mv sort/${sample}_R${rep}_2.fastq.trimmed.paired.gz final/${sample}_R${rep}_2.fastq.gz
done
done
rm -r $RNAWORKING/adaptor_trimmed
rm -r $RNAWORKING/trim
rm -r $RNAWORKING/sort
Previous Section | This Section | Next Section |
---|---|---|
Data QC | Preprocessing | Alignment |