Lab: Trimming reads - statonlab/EPP_575_RNASeq_Workshop GitHub Wiki
Trimming Reads
Today we'll use the program Skewer to trim portions of sequences that have low Q scores (per-base, Phred-like quality scores). Skewer also has an effective algorithm for removing contaminating adapter sequences.
Helpful links:
Note:
To illustrate the process of adapter removal, this lab is going to use a secondary dataset for purely illustrative purposes.
The first paired-end reads sample (ERR3310285) from the following project will be used for this lab's trimming exercises:
https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-7933/samples/
Samples were subsetted to include only the first one million sequences in each file. These samples are for today's lab only, if you're interested in applying trimming to the actual course data, the relevant methodology will be provided at the end of this page in the Appendix section.
An overview of how we'll manage the Arabidopsis data alongside the example data in our directories is shown below:
raw_data/
SRR17062759_1.fastq
SRR17062759_2.fastq
trimming_example/
example_1.fastq
example_2.fastq
truseq_adapters.fa
analysis/
1_fastqcRaw/
SRR17062759_1.fastQC/
SRR17062759_1.fastQC.out
EPP575_raw_multiqc_report.html
trimming_example/
example_1.fastQC.out
example_2.fastQC.out
2a_trimming/
trimming_example/
example.trim_output
example-trimmed-pair1_fastqc.html
example-trimmed-pair2_fastqc.html
2b_fastqcTrimmed/
SRR17062759-trimmed-pair1.fastQC/
SRR17062759-trimmed-pair2.fastQC/
SRR17062759_trimmed_1.fastQC.out
SRR17062759_trimmed_2.fastQC.out
trimming_example/
example_summary_trimmed/
example_trimmed_1.fastQC.out
example_trimmed_2.fastQC.out
Steps:
1. Consider the metadata for your sequences and their library prep
Original publication for today's example dataset:
https://onlinelibrary.wiley.com/doi/full/10.1111/tpj.14776
Illumina provides all adapter sequences: https://support-docs.illumina.com/SHARE/AdapterSeq/illumina-adapter-sequences.pdf
Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
2. Prepare your project directory
Go to your personal project directory.
Change to the raw_data directory in your analysis directory, then create a trimming_example subdirectory:
# go to your personal analysis directory
cd raw_data
mkdir trimming_example
cd trimming_example
Create symbolic links to the example datasets:
ln -s /pickett_shared/teaching/EPP575_Jan2022/raw_data/trimming_example/example_* ./
cp /pickett_shared/teaching/EPP575_Jan2022/raw_data/trimming_example/truseq_adapters.fa ./
3. Explore the files manually for adapters and quality
First, let's manually look to see if Illumina adapter sequences are present. Output the first 10,000 sequences of the example file using the head command and pipe this into the lightweight file viewer less
head -n 40000 example_1.fastq | less
Why are we looking at the first 40,000 lines of the file?
The screen should now display the reads as in a text viewer. Type the backslash key '/' to enter less's search mode and copy-paste the "Read 1" sequence from Illumina (in step 1 above)...
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
...then press Enter to search the file for this adapter. Press 'n' to progress forward through the file and 'Shift + n' to progress backwards.
Congratulations! You've found adapter contamination! Also note that most Q scores are either "I" or "F", corresponding with Phred scores 41 or 38, respectively.
Are there any trends in where the "I"s and "F"s are positioned throughout the sequences?
Press 'q' to escape the text viewer. Let's allow actual bioinformatic tools to perform the work for us.
4. Check QC before trimming
Go to your analysis directory from the last lab, change to the 1_fastqcRaw subdirectory, then make a subdirectory for this example:
# go to your personal analysis directory
cd 1_fastqcRaw
mkdir trimming_example
cd trimming_example
If it isn't already loaded from the previous lab section, make fastqc available:
spack load [email protected]%[email protected]
Run the following:
mkdir example_summary
fastqc -o example_summary ../../../raw_data/trimming_example/example_1.fastq >& example_1.fastQC.out
fastqc -o example_summary ../../../raw_data/trimming_example/example_2.fastq >& example_2.fastQC.out
Note the use of "../" to give the relative path to our input data.
5. Run Skewer
Now that we've seen the quality metrics for our data, we'll clean it up using Skewer. Change back to your analysis directory and make a new directory for trimming along with a subdirectory for this example.
# go to your personal analysis directory
mkdir 2a_trimming
mkdir 2a_trimming/trimming_example
cd 2a_trimming/trimming_example
Skewer is saved locally, so we won't need spack this time.
To run Skewer on the example files:
/sphinx_local/software/skewer/skewer \
-x ../../../raw_data/trimming_example/truseq_adapters.fa \
-l 30 \
../../../raw_data/trimming_example/example_1.fastq \
../../../raw_data/trimming_example/example_2.fastq \
-o example \
>& example.trim_output
To run Skewer on SRR17062759:
cd ..
/sphinx_local/software/skewer/skewer \
-x /sphinx_local/software/Trimmomatic-0.39/adapters/all.fa \
-l 30 \
../../raw_data/SRR17062759_1.fastq \
../../raw_data/SRR17062759_2.fastq \
-o SRR17062759 \
>& SRR17062759.trim_output
6. Check QC after trimming
Change back to the analysis directory and create the following directories:
# go to analysis directory
mkdir 2b_fastqcTrimmed
mkdir 2b_fastqcTrimmed/trimming_example
cd 2b_fastqcTrimmed/trimming_example
mkdir example_summary_trimmed
fastqc -o example_summary_trimmed ../../2a_trimming/trimming_example/example-trimmed-pair1.fastq >& example_trimmed_1.fastQC.out
fastqc -o example_summary_trimmed ../../2a_trimming/trimming_example/example-trimmed-pair2.fastq >& example_trimmed_2.fastQC.out
Now run fastqc again on the "real" dataset:
cd ..
mkdir SRR17062759-trimmed-pair1.fastQC
mkdir SRR17062759-trimmed-pair2.fastQC
fastqc -o SRR17062759-trimmed-pair1.fastQC ../../2a_trimming/SRR17062759-trimmed-pair1.fastq >& SRR17062759_trimmed_1.fastQC.out
fastqc -o SRR17062759-trimmed-pair2.fastQC ../../2a_trimming/SRR17062759-trimmed-pair2.fastq >& SRR17062759_trimmed_2.fastQC.out
7. Prepare raw dataset for tomorrow's lab:
For the sake of illustration and computation, we used a small dataset for this lab. Continuing with the Arabidopsis data, we'll use pre-trimmed files for all downstream steps. Create symbolic links in your 2a_trimming folder by running the following command:
cd 2a_trimming
ln -s /pickett_shared/teaching/EPP575_Jan2022/final_outputs/2a_trimming/*.fastq ./
If you'd like to follow the exact steps used in preparing these files, see the Appendix below.
Appendix
The following settings were used to trim the Arabidopsis data using Skewer. For now, we already have these completed files in our analysis directory, but the data can be processed automatically using a for-loop (we'll discuss this on Thursday).
for R1 in /pickett_shared/teaching/EPP575_Jan2022/raw_data/*_1.fastq
do
R2=`sed 's/_1/_2/' <(echo $R1)`
BASE=$( basename $R1 | sed 's/_1.fastq//g')
echo "R1 $R1"
echo "R2 $R2"
echo "BASE $BASE"
/sphinx_local/software/skewer/skewer \
-x /sphinx_local/software/Trimmomatic-0.39/adapters/all.fa \
-l 30 \
$R1 \
$R2 \
-o $BASE \
>& $BASE.trim_output
done