Lab: Short read variation I ‐ Read QC and trimming - mestato/EPP622_2024 GitHub Wiki

1. Finding and assigning fastq data files

The Solenopsis invicta RAD data we are using for the labs comes from this project. We have already downloaded all the reads, which can be found here -

/pickett_sphinx/teaching/EPP622_2024/raw_data/solenopsis_invicta

Please see the table below: every student has been assigned one read file (identified by its accession) for the duration of the short read variation I unit.

| Student | Read Accession | Total # Reads | Location |
| --- | --- | --- | --- |
| Sanskriti Acharya | SRR6922148 | 1281077 | Oglethorpe Co, GA |
| Aditi | SRR6922294 | 1405770 | Oglethorpe Co, GA |
| Maria Caballero Aragon | SRR6922306 | 1541579 | Oglethorpe Co, GA |
| Charles Dawe | SRR6922308 | 2987905 | Oglethorpe Co, GA |
| Mengling He | SRR6922451 | 1731620 | Oglethorpe Co, GA |
| Rebecca Kraus | SRR6922454 | 2897067 | Oglethorpe Co, GA |
| Stefanie Menezes De Moura | SRR6922449 | 3649330 | Oglethorpe Co, GA |
| Marissa Nufer | SRR6922311 | 258861 | Oglethorpe Co, GA |
| Alina Pokhrel | SRR6922354 | 1344896 | Pascagoula, MS |
| Andrew Reed | SRR6922399 | 1091567 | Pascagoula, MS |
| Patrick Sisler | SRR6922194 | 1100979 | Alejandra, Argentina |
| Hannah Teddleton | SRR6922233 | 1216598 | Alejandra, Argentina |
| James Ulmer | SRR6922241 | 1148981 | Alejandra, Argentina |
| Erin Van Berkel | SRR6922315 | 1017635 | Alejandra, Argentina |
| Makhali Voss | SRR6922318 | 1592618 | Alejandra, Argentina |
| Katie Wood | SRR6922319 | 2199106 | Alejandra, Argentina |
| Meg Staton | SRR6922321 | 1696637 | Alejandra, Argentina |
| Alysson Dekovich | SRR6922446 | 991054 | Alejandra, Argentina |
| Beant Kapoor | SRR6922447 | 1957684 | Alejandra, Argentina |

NOTE - Please let us know if you are having any trouble with your assigned fastq files. We have downloaded some extra fastq files just in case.

NOTE - Wherever you see something like <your subset> in a code block, do not copy and paste it verbatim, because it will not work as written. In the coding world, anything within <> is a placeholder that you have to replace with your own value.
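A quick sanity check on your assigned file is to count its reads and compare the result with the Total # Reads column above. The sketch below assumes a standard uncompressed FASTQ with exactly four lines per record (no wrapped sequence lines). The `count_reads` helper and the tiny `demo.fastq` file are made up for illustration; on Sphinx you would point the function at your own file in the raw data directory.

```shell
# Hypothetical helper: count reads in an uncompressed, 4-line-per-record FASTQ file.
count_reads() {
    local n_lines
    n_lines=$(wc -l < "$1")
    echo $(( n_lines / 4 ))
}

# Tiny made-up FASTQ with two reads, only to show the helper working.
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$demo_dir/demo.fastq"

count_reads "$demo_dir/demo.fastq"   # prints 2
```

If the number does not match the table, that is a good reason to let us know before moving on.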

2. Setting up a personal directory

Go to the analyses directory within the EPP 622 course directory:

/pickett_sphinx/teaching/EPP622_2024/analyses

...and make a personal analysis folder. For example:

mkdir <your user id goes here>
cd <your user id goes here>

3. Running fastqc

Now, let's make a directory where we will run fastqc:

mkdir 1_fastqc
cd 1_fastqc

We can create a soft link (symbolic link) to the raw data:

ln -s ../../../raw_data/solenopsis_invicta/<your subset>.fastq .
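A typo in a relative path produces a dangling link without any error message, so it can be worth confirming that the link resolves to a real file before running anything on it. The sketch below uses made-up directory and file names inside a temporary folder so that it can run anywhere; on Sphinx the same `[ -e ... ]` and `readlink` checks would be applied to your own link.

```shell
# Build a throwaway layout that mimics raw_data/ and a personal analysis folder.
# All names here are hypothetical.
work=$(mktemp -d)
mkdir -p "$work/raw_data" "$work/analyses/student/1_fastqc"
printf '@read1\nACGT\n+\nIIII\n' > "$work/raw_data/demo.fastq"

cd "$work/analyses/student/1_fastqc"
ln -s ../../../raw_data/demo.fastq .            # correct depth: link resolves
ln -s ../../raw_data/demo.fastq broken.fastq    # wrong depth: dangling link

# -L: is it a symlink?  -e: does the target actually exist?
check_link() {
    if [ -L "$1" ] && [ -e "$1" ]; then
        echo "OK: $1 -> $(readlink "$1")"
    else
        echo "BROKEN: $1"
    fi
}

check_link demo.fastq
check_link broken.fastq
```

Running `ls -l` in the directory gives similar information, since it shows where each link points.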

Let's load fastqc:

spack load fastqc

Let's run the program now. Since we are all sharing the same computing resource, we will run fastqc on just one forward-read fastq file -

fastqc <your subset>.fastq

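This program outputs results in .zip and .html formats. We can't view the HTML report on Sphinx, so we'll need to copy it to our own devices. Run the following from a terminal on your own computer, not on Sphinx (the backslash keeps your local shell from expanding the * before it is sent to the server):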

scp <your_username>@sphinx.ag.utk.edu:/pickett_sphinx/teaching/EPP622_2024/analyses/<your_username>/1_fastqc/\*html .
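The HTML report needs a browser, but a rough pass/fail overview can still be had on the server. FastQC's .zip output is expected to contain a tab-separated `summary.txt` with one line per module (status, module name, file name); that layout is an assumption of the sketch below, as are the helper name and the made-up demo file.

```shell
# Hypothetical helper: tally PASS/WARN/FAIL in a FastQC-style summary.txt
# (assumed layout: status <TAB> module name <TAB> file name).
summarize_fastqc() {
    awk -F'\t' '{ n[$1]++ }
        END { printf "PASS=%d WARN=%d FAIL=%d\n", n["PASS"], n["WARN"], n["FAIL"] }' "$1"
}

# Made-up summary.txt for illustration. A real one could be printed with e.g.
#   unzip -p <your subset>_fastqc.zip '*/summary.txt'
demo=$(mktemp)
printf 'PASS\tBasic Statistics\tdemo.fastq\n'            >  "$demo"
printf 'WARN\tPer base sequence content\tdemo.fastq\n'   >> "$demo"
printf 'FAIL\tAdapter Content\tdemo.fastq\n'             >> "$demo"
printf 'PASS\tPer base sequence quality\tdemo.fastq\n'   >> "$demo"

summarize_fastqc "$demo"

# List only the modules that did not pass
awk -F'\t' '$1 != "PASS" { print $1 ": " $2 }' "$demo"
```

This is only a triage step; the plots in the HTML report are still the place to judge why a module warned or failed.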

Note

What if you want to quality check a bunch of fastq files and generate a neat report to share with your collaborators? That's where MultiQC comes into play. It takes the .zip files created by fastqc as input and generates a single interactive report for all the samples. I have generated a report, which is available here -

/pickett_sphinx/teaching/EPP622_2024/analyses/multiqc/output/s_invicta_GBS_multiqc.html

Please scp this file to your local computer and open it.

4. Running Skewer

Skewer is a fast and accurate adapter trimmer for next-generation sequencing paired-end reads. It has several features, such as detecting and removing adapter sequences and trimming sequences based on Phred quality scores. Now, go to your personal analyses directory and make a new directory -

cd /pickett_sphinx/teaching/EPP622_2024/analyses/<your user id goes here>
mkdir 2_skewer
cd 2_skewer

Soft link the raw data files here, too (the space is free!)

ln -s ../../../raw_data/solenopsis_invicta/<your subset>.fastq .

Skewer is installed locally on Sphinx, so we won't have to use Spack to load it this time.

/sphinx_local/software/skewer/skewer \
    -t 2 \
    -l 95 \
    -x AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG \
    -Q 30 <your subset>.fastq \
    -o <outfile name>

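-t stands for the number of threads used by this command
-l stands for the minimum length of sequence we want to keep in our analyses
-x is the adapter sequence to look for and remove
-Q is the minimum mean quality score (Phred score) of the sequence (across the entire read length)
-o is the prefix for the output files; Skewer appends -trimmed.fastq to it, so using <your subset> as the prefix gives <your subset>-trimmed.fastq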

Note: Here we use Q 30 as an illustrative example because the data is already very high quality. In some instances, Q 30 may be considered on the more strict end of trimming thresholds.
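To make the threshold concrete: a Phred score Q corresponds to an error probability of 10^(-Q/10), so Q 30 means roughly one expected error per 1,000 bases. The sketch below converts scores to probabilities and computes the mean Phred score of each read in a FASTQ file. It assumes Phred+33 quality encoding, uses hypothetical helper names and made-up reads, and only illustrates the arithmetic; Skewer's internal calculation may differ in detail.

```shell
# Error probability for a Phred score: P = 10^(-Q/10)
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Mean Phred score per read, assuming Phred+33 encoded quality strings.
mean_quality() {
    awk 'BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
         NR % 4 == 1 { name = substr($1, 2) }          # read name without "@"
         NR % 4 == 0 {
             if (length($0) == 0) next                 # skip empty quality lines
             sum = 0
             for (i = 1; i <= length($0); i++) sum += ord[substr($0, i, 1)] - 33
             printf "%s\t%.1f\n", name, sum / length($0)
         }' "$1"
}

phred_to_prob 30   # prints 0.001
phred_to_prob 20   # prints 0.01

# Two made-up reads: "I" encodes Q40 and "#" encodes Q2 under Phred+33.
demo=$(mktemp)
printf '@good\nACGT\n+\nIIII\n@poor\nACGT\n+\nII##\n' > "$demo"
mean_quality "$demo"
```

In this toy example the read named `good` has a mean of 40.0 and `poor` has a mean of 21.0, so only the first would clear a mean-quality cutoff of 30.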

Say you wanted to trim all the files using a for loop. Here is an example of how to do that:

for f in *.fastq
do
	# Strip the .fastq extension to get the output prefix
	BASE=$(basename "$f" .fastq)
	echo "$BASE"

	/sphinx_local/software/skewer/skewer \
	-t 2 -l 95 -Q 30 \
	-x AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG \
	"$f" -o "$BASE"
done
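One caveat with globbing for fastq files: once Skewer has run, its -trimmed.fastq outputs sit in the same directory and would be picked up on a second pass. A dry run that only prints the commands is a cheap way to check what a loop will do before committing shared compute time. The sketch below uses made-up file names in a temporary directory, a hypothetical `dry_run_trim` function, and a shortened placeholder for the adapter; it echoes the Skewer command instead of running it.

```shell
# Hypothetical dry run: print one Skewer command per raw fastq file in a directory.
# The function body runs in a subshell, so the cd does not affect the caller.
dry_run_trim() (
    cd "$1" || exit 1
    for f in *.fastq
    do
        # Skip files that are already Skewer output
        case "$f" in
            *-trimmed.fastq) continue ;;
        esac

        BASE=$(basename "$f" .fastq)   # strip the .fastq suffix

        # Print the command rather than executing it
        echo "skewer -t 2 -l 95 -Q 30 -x <adapter> $f -o $BASE"
    done
)

demo_dir=$(mktemp -d)
touch "$demo_dir/sampleA.fastq" "$demo_dir/sampleB.fastq" "$demo_dir/sampleA-trimmed.fastq"

dry_run_trim "$demo_dir"
```

Only `sampleA.fastq` and `sampleB.fastq` produce a command here; the already-trimmed file is skipped.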

5. Run fastqc on trimmed files - DIY

Now that we have trimmed our sequence file, let's check its quality using fastqc. Since we have already loaded fastqc using Spack, we don't have to do that again.

fastqc <your subset>-trimmed.fastq
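Alongside the FastQC report, it can be informative to see what fraction of reads survived trimming. The sketch below assumes both files are plain four-line-per-record FASTQ; the `reads_retained` helper and the demo files are hypothetical, and on Sphinx the two arguments would be your raw link and your trimmed output.

```shell
# Hypothetical helper: report reads before/after trimming and the percentage kept.
reads_retained() {
    local raw trimmed
    raw=$(( $(wc -l < "$1") / 4 ))
    trimmed=$(( $(wc -l < "$2") / 4 ))
    awk -v r="$raw" -v t="$trimmed" \
        'BEGIN { printf "raw=%d trimmed=%d retained=%.1f%%\n", r, t, (r ? 100 * t / r : 0) }'
}

demo_dir=$(mktemp -d)
# Made-up data: 4 raw reads, 3 of which survive trimming
for i in 1 2 3 4; do printf '@r%s\nACGT\n+\nIIII\n' "$i"; done > "$demo_dir/demo.fastq"
for i in 1 2 3;   do printf '@r%s\nACGT\n+\nIIII\n' "$i"; done > "$demo_dir/demo-trimmed.fastq"

reads_retained "$demo_dir/demo.fastq" "$demo_dir/demo-trimmed.fastq"
# prints: raw=4 trimmed=3 retained=75.0%
```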