Lab: Short read variation I ‐ Read QC and trimming

1. Finding and assigning fastq data files

The Solenopsis invicta RAD data we are using for the labs comes from this project. We have already downloaded all the reads, which can be found here -

/pickett_sphinx/teaching/EPP622_2024/raw_data/solenopsis_invicta

Please check the table below: every student has been assigned one read accession (one fastq file) for the duration of the Short read variation I unit.

Student	Read Accession	Total # Reads	Location
Sanskriti Acharya	SRR6922148	1281077	Oglethorpe Co, GA
Aditi	SRR6922294	1405770	Oglethorpe Co, GA
Maria Caballero Aragon	SRR6922306	1541579	Oglethorpe Co, GA
Charles Dawe	SRR6922308	2987905	Oglethorpe Co, GA
Mengling He	SRR6922451	1731620	Oglethorpe Co, GA
Rebecca Kraus	SRR6922454	2897067	Oglethorpe Co, GA
Stefanie Menezes De Moura	SRR6922449	3649330	Oglethorpe Co, GA
Marissa Nufer	SRR6922311	258861	Oglethorpe Co, GA
Alina Pokhrel	SRR6922354	1344896	Pascagoula, MS
Andrew Reed	SRR6922399	1091567	Pascagoula, MS
Patrick Sisler	SRR6922194	1100979	Alejandra, Argentina
Hannah Teddleton	SRR6922233	1216598	Alejandra, Argentina
James Ulmer	SRR6922241	1148981	Alejandra, Argentina
Erin Van Berkel	SRR6922315	1017635	Alejandra, Argentina
Makhali Voss	SRR6922318	1592618	Alejandra, Argentina
Katie Wood	SRR6922319	2199106	Alejandra, Argentina
Meg Staton	SRR6922321	1696637	Alejandra, Argentina
Alysson Dekovich	SRR6922446	991054	Alejandra, Argentina
Beant Kapoor	SRR6922447	1957684	Alejandra, Argentina
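
If you want to confirm that your file matches the Total # Reads column, one quick check is to count its lines and divide by four. This is only a minimal sketch: it assumes the files are uncompressed, standard four-line fastq records, and `count_reads` is a helper name made up for this example, not part of any installed tool.

```shell
# Hypothetical helper: count reads in an uncompressed fastq file,
# assuming every record is exactly four lines long.
count_reads() {
    n_lines=$(wc -l < "$1")
    echo $(( n_lines / 4 ))
}

# Example (replace the placeholder with your own file name):
# count_reads /pickett_sphinx/teaching/EPP622_2024/raw_data/solenopsis_invicta/<your subset>.fastq
```

If the number differs from the table, that is a good reason to let us know before going further.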

NOTE - Please let us know if you are having any trouble with your assigned fastq files. We have downloaded some extra fastq files just in case.

NOTE - Wherever you see something like <your subset> in a code block, do not copy and paste it as is, because it will not work. In the coding world, anything within <> is a placeholder that you have to replace with your own value.

2. Setting up a personal directory

Go to the analyses directory within the EPP 622 course directory:

/pickett_sphinx/teaching/EPP622_2024/analyses

...and make a personal analysis folder. For example:

mkdir <your user id goes here>
cd <your user id goes here>

3. Running fastqc

Now, let's make a directory where we will run fastqc:

mkdir 1_fastqc
cd 1_fastqc

We can create a soft link (symbolic link) to the raw data

ln -s ../../../raw_data/solenopsis_invicta/<your subset>.fastq .
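
A relative soft link only works if it is created from the right directory, so it can be worth checking that the link actually resolves before running anything on it. The sketch below is one way to do that; `check_link` is a hypothetical helper name, and it assumes `readlink` is available, as it is on most Linux systems.

```shell
# Hypothetical helper: report whether a path is a soft link
# whose target actually exists.
check_link() {
    if [ -L "$1" ] && [ -e "$1" ]; then
        echo "OK: $1 -> $(readlink "$1")"
    else
        echo "BROKEN or missing: $1"
        return 1
    fi
}

# Example (replace the placeholder with your own file name):
# check_link <your subset>.fastq
```

A broken link usually means the `ln -s` command was run from a different directory than `1_fastqc`, or the file name was mistyped.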

Let's load fastqc:

spack load fastqc

Let's run the program now. Since we are all sharing the same computing resource, we will run fastqc on just one forward-read fastq file -

fastqc <your subset>.fastq

This program outputs results in .zip and .html formats. We can't inspect them on Sphinx, so we'll need to copy them to our own devices. Run the following command from a terminal on your own computer, not on Sphinx:

scp <your_username>@sphinx.ag.utk.edu:/pickett_sphinx/teaching/EPP622_2024/analyses/<your_username>/1_fastqc/\*html .

Note

What if you want to quality check a bunch of fastq files and generate a neat report to share with your collaborators? That's where MultiQC comes into play. It takes the .zip files created by fastqc as input and generates a single interactive report for all the samples. I have generated a report, which is located here -

/pickett_sphinx/teaching/EPP622_2024/analyses/multiqc/output/s_invicta_GBS_multiqc.html

Please scp this file to your local computer and open it.

4. Running Skewer

Skewer is a fast and accurate adapter trimmer for next-generation sequencing paired-end reads. Its features include detecting and removing adapter sequences and trimming reads based on Phred quality scores. Now, go to your personal analyses directory and make a new directory -

cd /pickett_sphinx/teaching/EPP622_2024/analyses/<your user id goes here>
mkdir 2_skewer
cd 2_skewer

Soft link the raw data files here, too (the space is free!)

ln -s ../../../raw_data/solenopsis_invicta/<your subset>.fastq .

Skewer is installed locally on Sphinx, so we won't have to use Spack to load it this time.

/sphinx_local/software/skewer/skewer \
    -t 2 \
    -l 95 \
    -x AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG \
    -Q 30 <your subset>.fastq \
    -o <outfile name>

-t is the number of threads used by this command
-l is the minimum length a read must have after trimming to be kept in our analyses
-x is the adapter sequence to detect and remove
-Q is the minimum mean quality score (Phred score) of the read (across the entire read length)
-o is the base name for the output files

Note: Here we use -Q 30 as an illustrative example because the data are already very high quality. For other datasets, Q 30 may be on the stricter end of trimming thresholds.
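
To get a feel for what a threshold like Q 30 means, recall that a Phred score Q corresponds to an estimated base-call error probability of 10^(-Q/10). The sketch below converts a score to that probability with awk; `phred_to_error` is a hypothetical helper name used only for this illustration.

```shell
# Hypothetical helper: convert a Phred quality score to the
# estimated probability that a base call is wrong, P = 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4g\n", 10 ^ (-q / 10) }'
}

phred_to_error 20   # prints 0.01  (1 error in 100 bases)
phred_to_error 30   # prints 0.001 (1 error in 1000 bases)
```

So requiring a mean quality of 30 asks for reads with roughly one expected error per thousand bases on average, which is why it can be strict for lower-quality data.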

Say you wanted to trim all the files using a for loop. Here is an example of how to do that:

for f in *fastq
do
	# Strip the .fastq suffix to get the output base name
	BASE=$(basename "$f" .fastq)
	echo "$BASE"

	# Same settings as the single-file command above
	/sphinx_local/software/skewer/skewer \
	-t 2 -l 95 -Q 30 \
	-x AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG \
	"$f" -o "$BASE"
done
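
After trimming, it is useful to see how many reads survived the length and quality filters. The sketch below compares read counts between a raw file and its trimmed counterpart. It assumes uncompressed four-line fastq records and that the trimmed file is named `<base name>-trimmed.fastq`; `reads_in_fastq` and `report_kept` are hypothetical helper names for this example.

```shell
# Hypothetical helper: number of reads in an uncompressed fastq file
# (four lines per record assumed).
reads_in_fastq() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Hypothetical helper: compare a raw file ($1) with its trimmed output ($2).
report_kept() {
    raw=$(reads_in_fastq "$1")
    kept=$(reads_in_fastq "$2")
    echo "$1: kept $kept of $raw reads"
}

# Example (replace the placeholders with your own file names):
# report_kept <your subset>.fastq <your subset>-trimmed.fastq
```

A large drop in read count would suggest that the -l or -Q settings are too strict for that sample.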

5. Run fastqc on trimmed files - DIY

Now that we have trimmed our sequence file, let's check its quality using fastqc. Since we already loaded fastqc using Spack, we don't have to do that again.

fastqc <your subset>-trimmed.fastq
