Lab: Trimming reads - statonlab/EPP_575_RNASeq_Workshop Wiki

Trimming Reads

Today we'll use the program Skewer to trim portions of sequences that have low Q scores (per-base, Phred-like quality scores). Skewer also has an effective algorithm for removing contaminating adapter sequences.

Helpful links:

Note:

To illustrate the process of adapter removal, this lab is going to use a secondary dataset for purely illustrative purposes.

The first paired-end reads sample (ERR3310285) from the following project will be used for this lab's trimming exercises:
https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-7933/samples/

Samples were subsetted to include only the first one million sequences in each file. These samples are for today's lab only, if you're interested in applying trimming to the actual course data, the relevant methodology will be provided at the end of this page in the Appendix section.

An overview of how we'll manage the Arabidopsis data alongside the example data in our directories is shown below:


raw_data/
    SRR17062759_1.fastq
    SRR17062759_2.fastq
    trimming_example/
        example_1.fastq
        example_2.fastq
        truseq_adapters.fa

analysis/
    1_fastqcRaw/
        SRR17062759_1.fastQC/
            SRR17062759_1.fastQC.out
            EPP575_raw_multiqc_report.html
        trimming_example/
            example_1.fastQC.out
            example_2.fastQC.out

    2a_trimming/
        trimming_example/
            example.trim_output
            example-trimmed-pair1_fastqc.html
            example-trimmed-pair2_fastqc.html

    2b_fastqcTrimmed/
        SRR17062759-trimmed-pair1.fastQC/
        SRR17062759-trimmed-pair2.fastQC/
        SRR17062759_trimmed_1.fastQC.out
        SRR17062759_trimmed_2.fastQC.out
        trimming_example/
            example_summary_trimmed/
            example_trimmed_1.fastQC.out
            example_trimmed_2.fastQC.out

Steps:

1. Consider the metadata for your sequences and their library prep

Original publication for today's example dataset:
https://onlinelibrary.wiley.com/doi/full/10.1111/tpj.14776

Illumina provides all adapter sequences: https://support-docs.illumina.com/SHARE/AdapterSeq/illumina-adapter-sequences.pdf

Read 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA

Read 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

2. Prepare your project directory

Go to your personal project directory.

Change to the raw_data directory in your analysis directory, then create a trimming_example subdirectory:

# go to your personal analysis directory

cd raw_data
mkdir trimming_example
cd trimming_example

Create symbolic links to the example datasets:

ln -s /pickett_shared/teaching/EPP575_Jan2022/raw_data/trimming_example/example_* ./
cp /pickett_shared/teaching/EPP575_Jan2022/raw_data/trimming_example/truseq_adapters.fa ./

3. Explore the files manually for adapters and quality

First, let's manually look to see if Illumina adapter sequences are present. Output the first 10,000 sequences of the example file using the head command and pipe this into the lightweight file viewer less

head -n 40000 example_1.fastq | less

Why are we looking at the first 40,000 lines of the file?

The screen should now display the reads as in a text viewer. Type the backslash key '/' to enter less's search mode and copy-paste the "Read 1" sequence from Illumina (in step 1 above)...

AGATCGGAAGAGCACACGTCTGAACTCCAGTCA

...then press Enter to search the file for this adapter. Press 'n' to progress forward through the file and 'Shift + n' to progress backwards.

Congratulations! You've found adapter contamination! Also note that most Q scores are either "I" or "F", corresponding with Phred scores 41 or 38, respectively.

Are there any trends in where the "I"s and "F"s are positioned throughout the sequences?

Press 'q' to escape the text viewer. Let's allow actual bioinformatic tools to perform the work for us.

4. Check QC before trimming

Go to your analysis directory from the last lab, change to the 1_fastqcRaw subdirectory, then make a subdirectory for this example:

# go to your personal analysis directory

cd 1_fastqcRaw
mkdir trimming_example
cd trimming_example

If it isn't already loaded from the previous lab section, make fastqc available:

spack load [email protected]%[email protected]

Run the following:

mkdir example_summary
fastqc -o example_summary ../../../raw_data/trimming_example/example_1.fastq >& example_1.fastQC.out
fastqc -o example_summary ../../../raw_data/trimming_example/example_2.fastq >& example_2.fastQC.out

Note the use of "../" to give the relative path to our input data.

5. Run Skewer

Now that we've seen the quality metrics for our data, we'll clean it up using Skewer. Change back to your analysis directory and make a new directory for trimming along with a subdirectory for this example.

# go to your personal analysis directory

mkdir 2a_trimming
mkdir 2a_trimming/trimming_example
cd 2a_trimming/trimming_example

Skewer is saved locally, so we won't need spack this time.

To run Skewer on the example files:

/sphinx_local/software/skewer/skewer \
  -x ../../../raw_data/trimming_example/truseq_adapters.fa \
  -l 30 \
  ../../../raw_data/trimming_example/example_1.fastq \
  ../../../raw_data/trimming_example/example_2.fastq \
  -o example \
  >& example.trim_output

To run Skewer on SRR17062759:

cd ..

/sphinx_local/software/skewer/skewer \
  -x /sphinx_local/software/Trimmomatic-0.39/adapters/all.fa \
  -l 30 \
  ../../raw_data/SRR17062759_1.fastq \
  ../../raw_data/SRR17062759_2.fastq \
  -o SRR17062759 \
  >& SRR17062759.trim_output

6. Check QC after trimming

Change back to the analysis directory and create the following directories:

# go to analysis directory

mkdir 2b_fastqcTrimmed
mkdir 2b_fastqcTrimmed/trimming_example
cd 2b_fastqcTrimmed/trimming_example

mkdir example_summary_trimmed
fastqc -o example_summary_trimmed ../../2a_trimming/trimming_example/example-trimmed-pair1.fastq >& example_trimmed_1.fastQC.out
fastqc -o example_summary_trimmed ../../2a_trimming/trimming_example/example-trimmed-pair2.fastq >& example_trimmed_2.fastQC.out

Now run fastqc again on the "real" dataset:

cd ..

mkdir SRR17062759-trimmed-pair1.fastQC
mkdir SRR17062759-trimmed-pair2.fastQC
fastqc -o SRR17062759-trimmed-pair1.fastQC ../../2a_trimming/SRR17062759-trimmed-pair1.fastq >& SRR17062759_trimmed_1.fastQC.out
fastqc -o SRR17062759-trimmed-pair2.fastQC ../../2a_trimming/SRR17062759-trimmed-pair2.fastq >& SRR17062759_trimmed_2.fastQC.out

7. Prepare raw dataset for tomorrow's lab:

For the sake of illustration and computation, we used a small dataset for this lab. Continuing with the Arabidopsis data, we'll use pre-trimmed files for all downstream steps. Create symbolic links in your 2a_trimming folder by running the following command:

cd 2a_trimming
ln -s /pickett_shared/teaching/EPP575_Jan2022/final_outputs/2a_trimming/*.fastq ./

If you'd like to follow the exact steps used in preparing these files, see the Appendix below.

Appendix

The following settings were used to trim the Arabidopsis data using Skewer. For now, we already have these completed files in our analysis directory, but the data can be processed automatically using a for-loop (we'll discuss this on Thursday).

for R1 in /pickett_shared/teaching/EPP575_Jan2022/raw_data/*_1.fastq
do
	R2=`sed 's/_1/_2/' <(echo $R1)`
	BASE=$( basename $R1 | sed 's/_1.fastq//g')
	echo "R1 $R1"
	echo "R2 $R2"
	echo "BASE $BASE"

	/sphinx_local/software/skewer/skewer \
		-x /sphinx_local/software/Trimmomatic-0.39/adapters/all.fa \
		-l 30 \
		$R1 \
		$R2 \
		-o $BASE \
		>& $BASE.trim_output
done