Lab: Short read variation I ‐ Read QC and trimming - mestato/EPP622_2024 GitHub Wiki
The Solenopsis invicta RAD data we are using for the labs comes from this project. We have already downloaded all the reads, which are located here:
/pickett_sphinx/teaching/EPP622_2024/raw_data/solenopsis_invicta
Every student has been assigned one read file (accession) for the duration of the Short read variation I unit. Please find yours in the table below.
Student | Read Accession | Total # Reads | Location |
---|---|---|---|
Sanskriti Acharya | SRR6922148 | 1281077 | Oglethorpe Co, GA |
Aditi | SRR6922294 | 1405770 | Oglethorpe Co, GA |
Maria Caballero Aragon | SRR6922306 | 1541579 | Oglethorpe Co, GA |
Charles Dawe | SRR6922308 | 2987905 | Oglethorpe Co, GA |
Mengling He | SRR6922451 | 1731620 | Oglethorpe Co, GA |
Rebecca Kraus | SRR6922454 | 2897067 | Oglethorpe Co, GA |
Stefanie Menezes De Moura | SRR6922449 | 3649330 | Oglethorpe Co, GA |
Marissa Nufer | SRR6922311 | 258861 | Oglethorpe Co, GA |
Alina Pokhrel | SRR6922354 | 1344896 | Pascagoula, MS |
Andrew Reed | SRR6922399 | 1091567 | Pascagoula, MS |
Patrick Sisler | SRR6922194 | 1100979 | Alejandra, Argentina |
Hannah Teddleton | SRR6922233 | 1216598 | Alejandra, Argentina |
James Ulmer | SRR6922241 | 1148981 | Alejandra, Argentina |
Erin Van Berkel | SRR6922315 | 1017635 | Alejandra, Argentina |
Makhali Voss | SRR6922318 | 1592618 | Alejandra, Argentina |
Katie Wood | SRR6922319 | 2199106 | Alejandra, Argentina |
Meg Staton | SRR6922321 | 1696637 | Alejandra, Argentina |
Alysson Dekovich | SRR6922446 | 991054 | Alejandra, Argentina |
Beant Kapoor | SRR6922447 | 1957684 | Alejandra, Argentina |
NOTE - Please let us know if you are having any trouble with your assigned fastq files. We have downloaded some extra fastq files just in case.
NOTE - Wherever you see something like <your subset> in a code block, please do not copy and paste it verbatim, as it will not work. In the coding world, anything within <> is a placeholder: you have to replace it (angle brackets included) with your own value.
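One way to avoid retyping a placeholder is to store your accession in a shell variable once and reuse it. This is only a minimal sketch: SRR6922148 is simply the first accession in the table above, and the variable names SUBSET and RAW are arbitrary choices, not something the lab requires.

```shell
# Store your accession once (SRR6922148 is just the first row of the table;
# substitute your own). SUBSET and RAW are arbitrary, hypothetical names.
SUBSET=SRR6922148

# Build file names from the variable instead of typing <your subset> by hand.
RAW="${SUBSET}.fastq"
echo "$RAW"
```

With this in place, a command written as `<your subset>.fastq` in the lab could be typed as `"$RAW"`.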
Go to the analyses directory within the EPP 622 course directory:
/pickett_sphinx/teaching/EPP622_2024/analyses
...and make a personal analysis folder. For example:
mkdir <your user id goes here>
cd <your user id goes here>
Now, let's make a directory where we will run fastqc:
mkdir 1_fastqc
cd 1_fastqc
We can create a soft link (symbolic link) to the raw data:
ln -s ../../../raw_data/solenopsis_invicta/<your subset>.fastq .
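A relative symbolic link is resolved from the directory that holds the link, so a typo in the path produces a dangling link without any error message. The sketch below rebuilds the same directory depth in a scratch folder and checks that the link resolves. The toy file and folder names are made up for illustration; only the `../../../raw_data/solenopsis_invicta/` pattern comes from the lab.

```shell
# Toy layout in a scratch directory so nothing real is touched.
scratch=$(mktemp -d)
mkdir -p "$scratch/raw_data/solenopsis_invicta" "$scratch/analyses/me/1_fastqc"
printf '@r1\nACGT\n+\nIIII\n' > "$scratch/raw_data/solenopsis_invicta/toy.fastq"

# Same relative path depth as in the lab: 1_fastqc -> me -> analyses -> course root.
cd "$scratch/analyses/me/1_fastqc"
ln -s ../../../raw_data/solenopsis_invicta/toy.fastq .

# -L: the name is a symlink; -e: the file it points to actually exists.
if [ -L toy.fastq ] && [ -e toy.fastq ]; then
    link_status="ok"
else
    link_status="broken"
fi
echo "toy.fastq: $link_status"
```

Running the same two tests on your own `<your subset>.fastq` link is a quick way to confirm it before starting fastqc.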
Let's load fastqc:
spack load fastqc
Let's run the program now. Since we are all sharing the same computing resource, we will run fastqc on just one forward read fastq file:
fastqc <your subset>.fastq
This program outputs results in .zip and .html formats. We can't view the .html report on Sphinx, so we'll need to copy it to our own devices. Run the following from a terminal on your own computer, not on Sphinx:
scp <your_username>@sphinx.ag.utk.edu:/pickett_sphinx/teaching/EPP622_2024/analyses/<your_username>/1_fastqc/\*html .
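The HTML report needs a browser, but the .zip also contains a tab-separated summary.txt (status, module name, file name) that can be read directly in the terminal. The sketch below uses a mock three-line summary with invented results so that it runs anywhere. On real output you could stream the file with something like `unzip -p <your subset>_fastqc.zip '*/summary.txt'`, assuming unzip is available on Sphinx.

```shell
# Mock summary.txt in FastQC's three-column layout: STATUS<TAB>module<TAB>file.
# These rows are invented for illustration; they are not real results.
summary=$(mktemp)
printf 'PASS\tBasic Statistics\ttoy.fastq\n'          >  "$summary"
printf 'WARN\tPer base sequence content\ttoy.fastq\n' >> "$summary"
printf 'FAIL\tAdapter Content\ttoy.fastq\n'           >> "$summary"

# Keep only the modules that did not pass.
flagged=$(awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }' "$summary")
echo "$flagged"
```

This gives a quick look at which modules deserve attention before you open the full report.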
Note
What if you want to quality-check a bunch of fastq files and generate a neat report to share with your collaborators? That's where MultiQC comes into play. It takes the .zip files created by fastqc as input and generates a single interactive report for all the samples. I have generated a report, which is located here:
/pickett_sphinx/teaching/EPP622_2024/analyses/multiqc/output/s_invicta_GBS_multiqc.html
Please scp this file to your local computer and open it.
Skewer is a fast and accurate adapter trimmer for next-generation sequencing paired-end reads. Its features include detecting and removing adapter sequences and trimming sequences based on Phred quality scores. Now, go to your personal analysis directory and make a new directory:
cd /pickett_sphinx/teaching/EPP622_2024/analyses/<your user id goes here>
mkdir 2_skewer
cd 2_skewer
Soft link the raw data files here, too (a symbolic link takes up essentially no disk space!):
ln -s ../../../raw_data/solenopsis_invicta/<your subset>.fastq .
Skewer is installed locally on Sphinx, so we don't have to use Spack to load it this time.
/sphinx_local/software/skewer/skewer \
-t 2 \
-l 95 \
-x AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG \
-Q 30 \
<your subset>.fastq \
-o <outfile name>
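-t is the number of threads used by this command
-l is the minimum length of sequence we want to keep in our analyses
-x is the adapter sequence to detect and remove
-Q is the minimum mean quality score (Phred score) of the sequence (across the entire read length)
-o is the base name for the output files; Skewer appends -trimmed.fastq to it, so use <your subset> here if you want the file names in the later steps to match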
Note: Here we use -Q 30 as an illustrative example because the data is already very high quality. In other settings, Q 30 may be considered to be on the stricter end of trimming thresholds.
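To make the -Q threshold concrete, the sketch below computes the mean Phred score of a single quality string. It assumes Phred+33 encoding (each character's ASCII code minus 33), which is the common convention for recent Illumina data but is an assumption about these files. mean_phred is a hypothetical helper name, and this illustrates the arithmetic only, not how Skewer implements it internally.

```shell
# Mean Phred score of one FASTQ quality string, ASSUMING Phred+33 encoding.
# mean_phred is a hypothetical helper for illustration.
mean_phred() {
    printf '%s\n' "$1" | awk '
        # Build a character -> ASCII code lookup table for printable characters.
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        {
            sum = 0
            n = length($0)
            # Each base quality is the ASCII code of its character minus 33.
            for (i = 1; i <= n; i++) sum += ord[substr($0, i, 1)] - 33
            printf "%.1f\n", sum / n
        }'
}

mean_phred 'IIII'   # "I" is ASCII 73, so every base is Q40 -> 40.0
mean_phred 'II##'   # "#" is ASCII 35 (Q2), so the mean is (40+40+2+2)/4 -> 21.0
```

Under this arithmetic, the second toy read would fall below a mean-quality threshold of 30 even though half of its bases are excellent.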
Say you wanted to trim all the files using a for loop. Here is an example of how to do that:
for f in *.fastq
do
    # Strip the .fastq extension to get the output base name
    BASE=$(basename "$f" .fastq)
    echo "$BASE"
    /sphinx_local/software/skewer/skewer \
        -t 2 -l 95 -Q 30 \
        -x AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG \
        "$f" -o "$BASE"
done
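Before launching a loop like this on real data, it can help to do a dry run that prints each command instead of executing it. The sketch below does that in a scratch directory with empty stand-in files named after two accessions from the table. `<adapter>` stands for the adapter sequence used above, and the -trimmed.fastq suffix is the output name this lab uses in the next step.

```shell
# Scratch directory with empty stand-in files; no real data is touched.
scratch=$(mktemp -d)
cd "$scratch"
touch SRR6922148.fastq SRR6922294.fastq

for f in *.fastq
do
    # basename with a suffix argument removes only a trailing ".fastq".
    base=$(basename "$f" .fastq)
    # echo prints the command that would run, plus the expected output file.
    echo "skewer -t 2 -l 95 -Q 30 -x <adapter> $f -o $base  ->  ${base}-trimmed.fastq"
done > plan.txt

cat plan.txt
```

Once the printed plan looks right, removing the echo (and restoring the full Skewer path and adapter) turns it back into the real loop.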
Now that we have trimmed our sequence file, let's check its quality using fastqc. Since we already loaded fastqc using Spack, we don't have to do that again.
fastqc <your subset>-trimmed.fastq
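Alongside the fastqc report, a quick read count before and after trimming shows how many reads the length and quality filters removed. The sketch below assumes standard four-line FASTQ records (no wrapped sequence lines) and uses toy files in place of `<your subset>.fastq` and `<your subset>-trimmed.fastq`; count_reads is a hypothetical helper name.

```shell
# Count reads in a FASTQ file, ASSUMING the usual 4 lines per record.
# count_reads is a hypothetical helper for illustration.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Toy files standing in for the raw and trimmed reads.
scratch=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n####\n' > "$scratch/toy.fastq"
printf '@r1\nACGT\n+\nIIII\n'                      > "$scratch/toy-trimmed.fastq"

before=$(count_reads "$scratch/toy.fastq")
after=$(count_reads "$scratch/toy-trimmed.fastq")
echo "reads before: $before, after: $after"
```

On your real raw file, the "before" count should correspond to the Total # Reads column in the table at the top of this page.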