Deprecated 07. Basic Assembly Statistics and Comparing Assemblies - davidaray/Genomes-and-Genome-Evolution GitHub Wiki

GENERATING BASIC ASSEMBLY STATS

To explore the assembly that was generated in the previous exercise, we will use a utility from another assembly called Abyss. Install it in your unicycler conda environment after getting an interactive session.

Every previous time I've taught this class, the abyss installation that was created within unicycler has worked. For some reason, it isn't this year.

To get around that problem, we'll use an older version of abyss. To install it, activate your unicycler environment and then install the software using

conda install -c "bioconda/label/cf201901" abyss

[ Tangent alert! There is actually another option but we haven't talked about how to deal with module load commands. If you think you will need to use Abyss in your actual research and truly want to use the most recent version, you can skip using conda and add this line to your scripts.

module load gcc/9.2.0 openmpi/4.0.4 abyss/2.3.1 bwa/0.7.17 samtools/1.11

For this class, the conda option should work. End tangent]

Information on Abyss is available at https://github.com/bcgsc/abyss. We’ll be using abyss-fac, which will generate assembly contiguity statistics.

After the installation is complete, cd into the output directory of the assembly you ran in the previous exercise.

That should just be

cd /lustre/scratch/[eraider]/gge2023/unicycler/hp_pe_only.

Now run abyss-fac to generate an output table.

abyss-fac assembly.fasta |tee hp_pe_only-stats.tab

FYI, tee allows you to have the output from a program projected to the display and also to an output file.

The table will be output to the screen and to a table (.tab file) in that directory. What do the column headers in the table tell us?

Category	Description
n	Total number of sequences
n:500	Number of sequences at least 500 bp
L50	Number of sequences at least the N50 size
min	The size of the smallest sequence
N75	At least 75% of the assembly is in sequences of the N75 size or larger
N50	At least half the assembly is in sequences of the N50 size or larger
N20	At least 25% of the assembly is in sequences of the N25 size or larger
E-size	The sum of the square of the sequence sizes divided by the assembly size
max	The size of the largest sequence
sum	The sum of the sequence sizes
name	The file name of the assembly

Upload answers to the following to Assignment 6 - Assembly stats on Blackboard. Google is your friend.

What is the N50 for this assembly and what does that statistic mean?
How big is the longest contig in the assembly? How does that compare to the total size of the assembly? What proportion of the assembly is represented by that one contig?
How many contigs make up this assembly? Knowing what you know about most bacterial genomes, is this the best possible assembly for a genome like this one? How could you find out?

GENERATING ADDITIONAL ASSEMBLIES

Now that you know how to generate a table for an assembly, let's generate more assemblies for comparison.

During Exercise 05 you ran your assemblies interactively, which made it necessary to keep your session open and running until it finished. Annoying. That's fine for short jobs, but for long jobs, its not realistic. For this exercise you are going to use submission scripts to submit your jobs to the queue so that they will run without you needing to watch it constantly. In fact, you already used one in exercise 03 when you were learning to use the queue, so we will use that script as a starting point this week.

Lets keep things organized and make a directory to keep all our scripts in one place and move into that directory.

mkdir /lustre/scratch/[eraider]/gge2023/bin

cd /lustre/scratch/[eraider]/gge2023/bin

Copy the script you ran in exercise 03 into your scripts directory. It should be in a folder called gge in your home directory. If not, you can copy it from /home/daray/counter.sh.

cp ~/gge/counter.sh .

I'll guide you through setting up the scripts for the first two and then you're on your own to follow the instructions in generating the others. Feel free to work with partners or on your own.

In the last exercise, you generated an assembly for H. pylori using only short (Illumina) paired-end reads. Your first task in this exercise is to complete a new assembly for H. pylori using the paired-end reads again but adding the low coverage long reads. In abyss, you will use the -l option as described below to include the long reads. Before you get started, note that the output folder is called ‘hp_long_low’.

Lets get the submission script set up. Open counter.sh in a text editor and make some changes (using nano as described in Exercise 01 is easiest). If you need a refresher on what the first few lines of the submission script do, review Exercise 03.

Change the job-name from "counter" to "hp_long_low". Also change the number of processors to be used to 12.

For this script to run correctly, the first thing you need to do is get the script to activate your unicycler conda environment. So, add the lines that will do that.

. ~/conda/etc/profile.d/conda.sh
conda activate unicycler

This is a good time to talk about documentation. If you start writing code on a regular basis, you'll find that you often go back to your code a month afterward and you'll wonder, "What was that line for?" Now is the time to get into the habit of annotating your code with comments '#'. For example, at this point, I'd make this script read like so:

# These lines activate conda and my unicycler environment
. ~/conda/etc/profile.d/conda.sh
conda activate unicycler

Delete the lines that contain the old instructions (these are the lines at the bottom that do not start with a #).

Write the new instructions. First, instruct the computer to change to your working directory for this script (where you want your output files to go) by adding the following line to the script.

# Navigate to your working directory
cd /lustre/scratch/[eraider]/gge2023/unicycler

Next, add the unicycler command below for your assembly.

# Run unicycler on H. pylori using the paired short reads and the low depth, long reads
unicycler \
-1 ../data/helicobacterpylori/short_reads_1.fastq.gz \
-2 ../data/helicobacterpylori/short_reads_2.fastq.gz \
-l ../data/helicobacterpylori/long_reads_low_depth.fastq.gz \
-o hp_long_low \
--keep 3 \
-t 10

A few notes here. First, that end of line \ allows you to split single commands into separate lines. This makes things easier to read if your terminal is not wide enough to see the entire line. But, you can't have ANYTHING (no spaces, no tabs, no characters) after that \ or it will not work. Second, note the difference between "-1" and "-l". They look the same but aren't.

The final script should look something like this (with appropriate changes specific to you):

#!/bin/bash
#SBATCH --job-name=hp_long_low
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12

# These lines activate conda and my unicycler environment
. ~/conda/etc/profile.d/conda.sh
conda activate unicycler

# Navigate to your working directory
cd /lustre/scratch/[eraider]/gge2023/unicycler

# Run unicycler on H. pylori using the paired short reads and the low depth, long reads
unicycler \
-1 ../data/helicobacterpylori/short_reads_1.fastq.gz \
-2 ../data/helicobacterpylori/short_reads_2.fastq.gz \
-l ../data/helicobacterpylori/long_reads_low_depth.fastq.gz \
-o hp_long_low \
--keep 3 \
-t 10

Save this edited script as "hp_long_low.sh".

Create a submission script for a third assembly for H. pylori using the paired end reads and the high coverage long reads instead of the low coverage long reads. Call the output folder ‘hp_long_high’ and name your submission script "hp_long_high.sh".

# Run unicycler on H. pylori using the paired short reads and the high depth, long reads
unicycler \
-1 ../data/helicobacterpylori/short_reads_1.fastq.gz \
-2 ../data/helicobacterpylori/short_reads_2.fastq.gz \
-l ../data/helicobacterpylori/long_reads_high_depth.fastq.gz \
-o hp_long_high \
--keep 3 \
-t 10

Remember to change the --job-name line to "hp_long_high".

Submit both of your scripts to the queue.

sbatch hp_long_high.sh sbatch hp_long_low.sh

To check on your job, use squeue.

squeue -u [eraider]

Now, this is where you are going to have to think a little. You've now either completed or submitted jobs for three different assemblies of H. pylori.

with only short reads
with short reads and long reads but at low sequencing depth
with short reads and long reads at high sequencing depth

Create submission scripts to generate the same three types of assemblies for N. gonorrhoeae and S. pyogenes.

Some hints:

Make sure you're using the correct paths to the sequence data for each organism.
Use the following directory names for your output files (ng = N. gonorrhoeae & sg = S. pyogenes: ‘ng_pe_only’, ‘sp_pe_only’, ‘ng_long_low’, ‘ng_long_high’, ‘sp_long_low’, ‘sp_long_high’.
Make sure you move into the unicycler directory at the beginning of each script. Otherwise, you will end up with files and folders scattered willy-nilly around HPCC.
Do NOT run any of these on the head node.
Use as many submission scripts as you need to get the jobs done. There's a limit but you will not reach it for this exercise.
Some of these will take a substantial amount of computation. Much more than the simple assembly that we started with. Budget your time accordingly and get your jobs submitted quickly because your job may wait in the queue a while before starting. There is a deadline for you to complete this exercise and there's still a lot to do.

Once all your assemblies are done, generate the contiguity statistics for each one as described early on in this exercise. Using that information, complete the following tasks and answer the questions associated with them.

Compile a summary table that includes the statistics for all three assemblies for each genome. In other words, generate a single table for H pylori, another table for N. gonorrhoeae, and one final table for S. pyogenes and answer the following questions. It would probably be best to use your skills in Excel for this and to graph the results when necessary. If you're a person who knows R or some other statistical package, go for it but don't make this more complicated than it needs to be. In all cases, the independent variables (e.g. hp_pe_only, hp_long_low, and hp_long_high) belong on the X-axis.

Generally, how does adding long reads to the assembly impact N50? The number of fragments into which the assembly is divided?
What does it mean that the final N. gonorrhoeae ng_long_high assembly consists of two scaffolds rather than just one?
There is a spot in the .log file where you can get a good idea of the insert sizes for the short-read libraries. Find it and tell me the likely maximum and minimum insert sizes for the libraries. What do you estimate to be the probable target insert size? Describe how you think the insert size range was determined.

Upload your responses to Assignment 6 - Assembly stats.

Acknowledgements

Shaun Jackman (https://sjackman.ca/) was helpful in generating this tutorial.