Deprecated 05. Let's Talk About Reads and Quality Control - davidaray/Genomes-and-Genome-Evolution GitHub Wiki

NOTE: This tutorial assumes you've completed everything in the previous tutorials. In particular, you should know the how to navigate around a linux operating system and what a singularity container is and how to bind host directories.

GETTING SOME DATA

We are going to assemble the data from three bacteria, Helicobacter pylori, Streptococcus pyogenes, and Neisseria gonorrhoeae. I’ve identified and downloaded appropriate raw sequence data for you to use. You will need to copy that data to your own working directory after creating that directory. For this exercise, it’s best to use the larger storage available in your ‘scratch’ directory. Follow these instructions as closely as possible so that all our efforts are comparable, and I can easily find the files you create if troubleshooting is needed.

Navigate to your scratch directory using cd and type:

cd gge2024/data

The following commands will copy the data you need to an appropriate directory. NOTE: the period here is important! It describes where to put the copied files (your present working directory).

cp -r /lustre/scratch/daray/gge2024/data/helicobacterpylori .

cp -r /lustre/scratch/daray/gge2024/data/neisseriagonorrhoeae .

cp -r /lustre/scratch/daray/gge2024/data/streptococcuspyogenes .

Look at the folders that are in your new data directory.

ls -lhrt

You'll see there are three directories, one for each bacterial species.

You can also look in the subdirectory for each species. For example,

ls -lhrt helicobacterpylori

In that directory, there are four files. Each is a compressed fastq file,'fastq.gz'. The '.gz' just means that the sequence data are compressed using a utility call gzip. We will see what these reads actually look like in another exercise.

COUNTING AND VIEWING YOUR READS

From here down is your assignment, answer any questions and upload to Assignment 4 - Read quality

Fastq is a file format for storing sequence data and the quality of that data. The lines below were blatantly stolen from Wikipedia:

Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence. Each quality value corresponds with one of the nucleotide values in the second line.

So, with that information you can deduce that there are four lines of information for each sequence read. That means you can find out how many sequence reads are in each of the files.

cd /lustre/scratch/[eraider]/gge2023/data/helicobacterpylori/

zcat is a version of the cat command that works on gzipped files.

zcat short_reads_1.fastq.gz | wc -l

zcat short_reads_2.fastq.gz | wc -l

zcat long_reads_low_depth.fastq.gz | wc -l

zcat long_reads_high_depth.fastq.gz | wc -l

How many reads are in each file? Remember, there are four lines per read.
The number of reads in the two short read files should be identical. Why?
Compare the number of reads in all four files to one another.
Look at the filenames of the long reads. What do the descriptors "low_depth" and "high_depth" mean?

A FASTQ file containing a single sequence might look like this:

@SEQ_ID

GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

+SEQ_ID

!''((((+))%%%++)(%%%%).1*-+*''))**55CCF>>>>>>CCCCCCC65

The two SEQ_ID lines have obvious meanings. The second line is the DNA sequence of the read as determined by the sequencing software. The fourth line shows the quality score of each sequenced base.

The byte representing quality runs from 0x21 (lowest quality; '!' in ASCII) to 0x7e (highest quality; '~' in ASCII). You can find a chart of the probabilities or the correct base at this site. There are actually two tables there. I've copied the one relevant to us here:

In general, the data above aren't of the highest quality as demonstrated by the fact that most of the characters in the quality line are on the left side of that spectrum.

You can view the sequence data using a simple command that allows you to temporarily unzip a file and look at the first few lines of it by combining the zcat and head commands. zcat will decompress the file temporarily and you know what the head command does from an earlier exercise.

Execute the following two commands.

zcat short_reads_1.fastq.gz | head

zcat short_reads_2.fastq.gz | head

Are your data of pretty good quality compared to the example above? Why or why not?

Now execute these two commands.

zcat long_reads_low_depth.fastq.gz | head

zcat long_reads_high_depth.fastq.gz | head

You should have noticed a substantial difference in the output to the screen. Explain the differences between what you see when executing these commands for the short and long read files. You may need to scroll up and down a bit.

QUALITY ASSESSMENT

You did a quick visual assessment of the quality of your reads above but there are better, more thorough ways. Indeed, after getting your raw data for any project in the real world, assessing quality would be your first step. And you would NOT want to do this by taking the time to actually look at the fourth line of every sequencing read.

FastQC is a popular tool from Babraham Institute Bioinformatics Group used for quality assessment of sequencing data. Most Bioinformatics pipelines will use FastQC, or similar tools in the first stage of the analysis. The documentation for FastQC will help you to interpret the plots and stats produced by the tool. A traffic light system is used to alert the user’s attention to possible issues. However, it is worth bearing in mind that the tool is blind to the particular type of sequencing you are performing (i.e. whole-genome, ChIP-seq, RNA-seq), so some warnings might be expected due to the nature of your experiment.

For this, we will use the fastqc conda environment I've provided for you through the Singularity container. If you want to try and install fastqc on your own, please do so. It's a good exercise for you if you plan to continue using bioinformatics. However, it's not necessary for the exercise.

Optional installation of fastqc and multiqc if you're interested.

interactive -p nocona -c 1

. ~/conda/etc/profile.d/conda.sh

conda activate

conda create --name fastqc

conda activate fastqc

conda install -c bioconda fastqc

You will also use another package called multiqc. Let's get that installed and use it.

conda install multiqc

Make sure that both fastqc and multiqc have been properly installed by typing conda list. There you will see a list of all the packages that were installed in your fastqc conda environment.

End optional installation instructions

Let's analyze your reads using an interactive session with 10 processors and then navigate to your gge2024 folder.

interactive -p nocona -c 10
WORKDIR=/lustre/scratch/[eraider]/gge2024
cd $WORKDIR

Now, let's tell Singularity where to find all of your data files by binding the appropriate directory to the container.

`singularity shell -B $WORKDIR:$WORKDIR container/gge_container_v4.sif`

Remember, this command does two things. 1) It starts a Singularity shell so that you can interact with the operating system that it represents and 2) binds your gge host directory to an imaginary location within the container with the same name as your gge2024 directory. Any files in that host directory can now be accessed by the container operating system and the container operating system can write output to the host gge directory. Any time you see $WORKDIR and are using the container, it means the same thing as /lustre/scratch/[eraider]/gge2024.

The prompt should now look like this:

Singularity>

Next, let's navigate to the folder where we have stored the sequencing reads for Helicobacter pylori.

cd $WORKDIR/data/helicobacterpylori/

If you list the files using ls, you should see the four fastq.gz files discussed in the text above for this species.

Now, we will finally use the program fastqc to actually analyze the sequence reads in each file.

fastqc -t 10 *.fastq.gz

There should be something like this on your screen.

Singularity> fastqc -t 10 *.fastq.gz
Started analysis of long_reads_high_depth.fastq.gz
Started analysis of long_reads_low_depth.fastq.gz
Started analysis of short_reads_1.fastq.gz
Analysis complete for long_reads_low_depth.fastq.gz
Started analysis of short_reads_2.fastq.gz
Approx 10% complete for long_reads_high_depth.fastq.gz
Approx 5% complete for short_reads_1.fastq.gz
Approx 5% complete for short_reads_2.fastq.gz
Approx 20% complete for long_reads_high_depth.fastq.gz
Approx 10% complete for short_reads_1.fastq.gz
Approx 35% complete for long_reads_high_depth.fastq.gz
and so on....

Now, let's break down what you actually did just now.

Type fastqc -h and you'll get a list of the commands available to you from the program fastqc. Using that information, we can parse out what the command fastqc -t 10 *.fastq.gz actually does.

fastqc simply tells the computer to use that program.

-t 10 tells the program that there are 10 processors available to do the job that's been asked of it.

and

*.fastq.gz uses the '*' wildcard to tell the program to analyze anything that ends in '.fastq.gz'. Because you didn't specify any other path, the program assumes it should look for the files in whatever directory you're in at the moment. In our case, we're in $WORKDIR/data/helicobacterpylori/ aka the host directory, $WORKDIR/data/helicobacterpylori/.

Use a Bitvise sftp window to navigate to your output, which will be located in the same directory. Notice that in addition to the original .gz files, there are now .html and .zip files. The .html files use data in the .zip files to generate viewable browser windows with information on your sequence reads. On a PC, they're easy to see, usually by just double clicking on the .html filename.

Details on interpreting the output .html files can be found at this site. At the same site you can find examples of good Illumina data, bad Illumina data, PacBio data, and other types of reads.

Assuming we've talked about the various types of sequencing technologies and their pros and cons by this point in the class, you should be able to explain the relative quality assessments when comparing your paired-end (Illumina) reads and the Nanopore data.

Write out that explanation.
What are the longest and shortest reads from each of the four read sets? How many total reads are in each of the four read sets? Did you get the same answer as you did by calculating it with zcat and wc?

In the real world you may end up dealing with many, many data files. For example, for one RADSeq project we had hundreds of raw read files to examine. Evaluating each one individually would be a real pain. Fortunately, there is a better way, multiqc.

If you go back to your helicobacterpylori directory and run the following

multiqc . --dirs -o qc/

It will search for any and all quality reports that exist and produce a single .html that will allow multiple comparisons at once. That file will be located in a directory called QC.

Using the report, tell me about how many reads from each of the four read sets are duplicates. Provide your answer in percentages.
Would you consider these reads to be GC-rich or GC-poor?

An excellent tour of the output is available on youtube and the documentation is at this site.

Fortunately, there are no major problems with this data.

** These data were obtained from links at https://github.com/rrwick/Unicycler/blob/main/sample_data/README.md