05. Let's Talk About Reads and Read Quality - davidaray/Genomes-and-Genome-Evolution GitHub Wiki
NOTE: This tutorial assumes you've completed everything in the previous tutorials. In particular, you should know the how to navigate around a linux operating system and what a singularity container is and how to bind host directories.
GETTING SOME DATA
We are going to assemble the data from three bacteria, Helicobacter pylori, Streptococcus pyogenes, and Neisseria gonorrhoeae. I’ve identified and downloaded appropriate raw sequence data for you to use. You will need to copy that data to your own working directory after creating that directory. For this exercise, it’s best to use the larger storage available in your ‘scratch’ directory. Follow these instructions as closely as possible so that all our efforts are comparable, and I can easily find the files you create if troubleshooting is needed.
Navigate to your scratch directory using cd
and type:
cd gge2024/data
The following commands will copy the data you need to an appropriate directory. NOTE: the period here is important! It describes where to put the copied files (your present working directory).
cp -r /lustre/scratch/daray/gge2024/data/helicobacterpylori .
cp -r /lustre/scratch/daray/gge2024/data/neisseriagonorrhoeae .
cp -r /lustre/scratch/daray/gge2024/data/streptococcuspyogenes .
Look at the folders that are in your new data directory.
ls -lhrt
You'll see there are three directories, one for each bacterial species.
You can also look in the subdirectory for each species. For example,
ls -lhrt helicobacterpylori
In that directory, there are four files. Each is a compressed fastq file,'fastq.gz'. The '.gz' just means that the sequence data are compressed using a utility call gzip
. We will see what these reads actually look like in another exercise.
COUNTING AND VIEWING YOUR READS
From here down is your assignment, answer any questions and upload to Assignment 5 - Read quality
Fastq is a file format for storing sequence data and the quality of that data. The lines below were blatantly stolen from Wikipedia:
-
Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
-
Line 2 is the raw sequence letters.
-
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
-
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence. Each quality value corresponds with one of the nucleotide values in the second line.
So, with that information you can deduce that there are four lines of information for each sequence read. That means you can find out how many sequence reads are in each of the files.
cd /lustre/scratch/[eraider]/gge2024/data/helicobacterpylori/
zcat is a version of the cat command that works on gzipped files.
zcat short_reads_1.fastq.gz | wc -l
zcat short_reads_2.fastq.gz | wc -l
zcat long_reads_low_depth.fastq.gz | wc -l
zcat long_reads_high_depth.fastq.gz | wc -l
-
How many reads are in each file? Remember, there are four lines per read.
-
The number of reads in the two short read files should be identical. Why?
-
Compare the number of reads in all four files to one another.
-
Look at the filenames of the long reads. What do the descriptors "low_depth" and "high_depth" mean?
A FASTQ file containing a single sequence might look like this:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+SEQ_ID
!''((((+))%%%++)(%%%%).1*-+*''))**55CCF>>>>>>CCCCCCC65
The two SEQ_ID lines have obvious meanings. The second line is the DNA sequence of the read as determined by the sequencing software. The fourth line shows the quality score of each sequenced base.
The byte representing quality runs from 0x21 (lowest quality; '!' in ASCII) to 0x7e (highest quality; '~' in ASCII). You can find a chart of the probabilities or the correct base at this site. There are actually two tables there. I've copied the one relevant to us here:
In general, the data above aren't of the highest quality as demonstrated by the fact that most of the characters in the quality line are on the left side of that spectrum.
You can view the sequence data using a simple command that allows you to temporarily unzip a file and look at the first few lines of it by combining the zcat
and head
commands. zcat
will decompress the file temporarily and you know what the head
command does from an earlier exercise.
Execute the following two commands.
zcat short_reads_1.fastq.gz | head
zcat short_reads_2.fastq.gz | head
- Are your data of pretty good quality compared to the example above? Why or why not?
Now execute these two commands.
zcat long_reads_low_depth.fastq.gz | head
zcat long_reads_high_depth.fastq.gz | head
- You should have noticed a substantial difference in the output to the screen. Explain the differences between what you see when executing these commands for the short and long read files. You may need to scroll up and down a bit.
QUALITY ASSESSMENT
You did a quick visual assessment of the quality of your reads above but there are better, more thorough ways. Indeed, after getting your raw data for any project in the real world, assessing quality would be your first step. And you would NOT want to do this by taking the time to actually look at the fourth line of every sequencing read.
FastQC is a popular tool from Babraham Institute Bioinformatics Group used for quality assessment of sequencing data. Most Bioinformatics pipelines will use FastQC, or similar tools in the first stage of the analysis. The documentation for FastQC will help you to interpret the plots and stats produced by the tool. A traffic light system is used to alert the user’s attention to possible issues. However, it is worth bearing in mind that the tool is blind to the particular type of sequencing you are performing (i.e. whole-genome, ChIP-seq, RNA-seq), so some warnings might be expected due to the nature of your experiment.
For this, we will use the fastqc conda environment I've provided for you through the Singularity container. If you want to try and install fastqc on your own, please do so. It's a good exercise for you if you plan to continue using bioinformatics. However, it's not necessary for the exercise.
FASTQC
Let's analyze your reads using an interactive session with 10 processors and then navigate to your gge2024 folder, and the folder containing the data for H. pylori.
interactive -p nocona -c 10
cd /lustre/scratch/[eraider]/gge2024/data/helicobacterpylori
Now, let's set up the variables that you will need to tell Singularity where to find all of your data files and the container.
WORKDIR=/lustre/scratch/[eraider]/gge2024/data/helicobacterpylori
GGE_CONTAINER=/lustre/scratch/[eraider]/gge2024/container
You can now run the quality assesment with fastqc.
singularity exec -B $WORKDIR:$WORKDIR $GGE_CONTAINER/gge_container_v4.sif fastqc -t 10 *.fastq.gz
Remember, this command does two things. 1) It executes thye Singularity container and 2) binds the directory so any files in that host directory can now be accessed by the container operating system and the container operating system can write output to the host gge directory. Any time you see $WORKDIR:$WORKDIR
it means that the container is reading the directories as source:output.
If everything worked correctly you should see something like this on your screen.
application/gzip
application/gzip
Started analysis of long_reads_high_depth.fastq.gz
application/gzip
application/gzip
Started analysis of long_reads_low_depth.fastq.gz
Analysis complete for long_reads_low_depth.fastq.gz
Approx 10% complete for long_reads_high_depth.fastq.gz
Approx 20% complete for long_reads_high_depth.fastq.gz
Approx 35% complete for long_reads_high_depth.fastq.gz
Approx 45% complete for long_reads_high_depth.fastq.gz
Approx 55% complete for long_reads_high_depth.fastq.gz
Approx 65% complete for long_reads_high_depth.fastq.gz
Approx 80% complete for long_reads_high_depth.fastq.gz
Approx 90% complete for long_reads_high_depth.fastq.gz
Started analysis of short_reads_1.fastq.gz
Analysis complete for long_reads_high_depth.fastq.gz
Approx 5% complete for short_reads_1.fastq.gz
Started analysis of short_reads_2.fastq.gz
Approx 10% complete for short_reads_1.fastq.gz
Approx 5% complete for short_reads_2.fastq.gz
Approx 15% complete for short_reads_1.fastq.gz
and so on....
Now, let's break down what you actually did just now.
Type singularity exec $GGE_CONTAINER fastqc -h
and you'll get a list of the commands available to you from the program fastqc. Using that information, we can parse out what the command fastqc -t 10 *.fastq.gz
actually does.
-
fastqc
simply tells the computer to use that program. -
-t 10
tells the program that there are 10 processors available to do the job that's been asked of it. -
*.fastq.gz
uses the '*' wildcard to tell the program to analyze anything that ends in '.fastq.gz'. Because you didn't specify any other path, the program assumes it should look for the files in whatever directory you're in at the moment. In our case, we're in/mnt/data/helicobacterpylori/
aka the host directory,/lustre/scratch/[eraider]/gge2024/data/helicobacterpylori/
.
Use a Bitvise sftp window to navigate to your output, which will be located in the same directory. Notice that in addition to the original .gz files, there are now .html and .zip files. The .html files use data in the .zip files to generate viewable browser windows with information on your sequence reads. On a PC, they're easy to see, usually by just double clicking on the .html filename.
Details on interpreting the output .html files can be found at this site. At the same site, you can find examples of good Illumina data, bad Illumina data, PacBio data, and other types of reads.
Assuming we've talked about the various types of sequencing technologies and their pros and cons by this point in the class, you should be able to explain the relative quality assessments when comparing your paired-end (Illumina) reads and the Nanopore data.
-
Write out that explanation.
-
What are the longest and shortest reads from each of the four read sets? How many total reads are in each of the four read sets? Did you get the same answer as you did by calculating it with
zcat
andwc
?
In the real world you may end up dealing with many, many data files. For example, for one RADSeq project we had hundreds of raw read files to examine. Evaluating each one individually would be a real pain. Fortunately, there is a better way, multiqc.
If you go back to your helicobacterpylori directory and run the following
singularity exec -B $WORKDIR:$WORKDIR $GGE_CONTAINER/gge_container_v4.sif multiqc . --dirs -o qc/
It will search for any and all quality reports that exist and produce a single .html that will allow multiple comparisons at once. That file will be located in a directory called QC.
-
Using the report, tell me about how many reads from each of the four read sets are duplicates. Provide your answer in percentages.
-
Would you consider these reads to be GC-rich or GC-poor?
An excellent tour of the output is available on youtube and the documentation is at this site.
Fortunately, there are no major problems with this data.
** These data were obtained from links at https://github.com/rrwick/Unicycler/blob/main/sample_data/README.md