09a. WGS Assembly Using ABYSS - davidaray/Genomes-and-Genome-Evolution GitHub Wiki
Unicycler is designed to be used for bacterial genomes. A second assembler commonly used for prokayotic and eukaryotic genomes is ABYSS (https://github.com/bcgsc/abyss). It uses paired-end and long reads much like Unicycler does, but it requires a little more massaging. However, it’s useful because it can be more finely tuned and can be used for non-bacterial genomes. For now, though, we’re going to use the bacterial data from the Unicycler exercise.
Don't wait until the last minute to start this exercise. This exercise involves multiple genome assembly processes, some of which take quite a while.
But first, we need to get a working script.
Log in to HPCC as usual. Once you have a terminal, complete the following commands:
mkdir -p /lustre/scratch/[eraider]/gge2024/abyss/hpylori
cp /lustre/scratch/daray/gge2024/bin/abyss_sbatch_hp_9a.sh /lustre/scratch/[eraider]/gge2024/abyss/hpylori
cd /lustre/scratch/[eraider]/gge2024/abyss/hpylori
Reminder: The script you just copied has the text <eraider>
in it at several places. You need to replace that text with your actual eraider id.
Open the file in a text editor using nano abyss_sbatch_hp_9a.sh
.
You should see something like the text below. Replace "[eraider]" replaced with your eraider id:
#!/bin/bash
#SBATCH --job-name=abyss_hp
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=36
#Define you working directory and your data directory
WORKDIR=/lustre/scratch/[eraider]/gge2024/abyss/hpylori_9a
DATADIR=/lustre/scratch/[eraider]/gge2024/data/helicobacterpylori
#Module load abyss and related components
module load gcc/9.2.0 openmpi/4.0.4 abyss/2.3.1 bwa/0.7.17 samtools/1.11
#Create a list of k-mers to test assemblies
LIST="26 36 46 56 66 76 86 96"
#For every k-mer in that list, build an assembly in a dedicated directory
## For every value of k in the LIST
## make a directory in the gge2024/abyss directory tagged with the value of k-mer
## go to that directory
## run abyss pe using the singularity container and with all short reads and the high depth long reads using 36 processors and the value of k. Call the assembly "hp_hybrid_k"$k.
for k in $LIST
do mkdir -p $WORKDIR/"hp_hybrid_k"$k
cd $WORKDIR/"hp_hybrid_k"$k
abyss-pe np=36 k=$k name="hp_hybrid_k"$k \
in="$DATADIR/short_reads_1.fastq.gz $DATADIR/short_reads_2.fastq.gz" \
longa="$DATADIR/long_reads_high_depth.fastq.gz" \
long="longa"
done
Note that this time we're not using Singularity. There's something off with the abyss installation in our container that I need to fix. Instead, we're using the abyss installation already on HPCC.
So, what will this script do? Essentially, it will create 8 assemblies using the same data as you used with Unicycler for the 'hp_long_high' assembly. Each assembly will use a different value of k. In other words, a different set of k-mers. One of the things I didn’t like about Unicycler is that it didn’t keep the various assemblies that were generated when using all of the tested k-mers. We’re going to try to see what those assemblies would have looked like if they were available.
To run this script, simply type
sbatch abyss_sbatch_hp_9a.sh
The final assembly to be used in downstream analyses is "hp_hybrid_k[whatever the k is]-long-scaffs.fa".
When the run is finished (use squeue -u [eraider]
to check if your job is still running), complete the following tasks.
- Generate and gather the various .tab files that depict the basic stats for each run and generate a table containing all of the data. Include the table as your answer to this question.
Your table should have the following structure:
n | n:500 | L50 | min | N75 | N50 | N25 | E-size | max | sum | assembly name |
---|---|---|---|---|---|---|---|---|---|---|
hp_hybrid_26-long-scaffs.fa | ||||||||||
hp_hybrid_36-long-scaffs.fa | ||||||||||
hp_hybrid_46-long-scaffs.fa | ||||||||||
hp_hybrid_56-long-scaffs.fa | ||||||||||
hp_hybrid_66-long-scaffs.fa | ||||||||||
hp_hybrid_76-long-scaffs.fa | ||||||||||
hp_hybrid_86-long-scaffs.fa | ||||||||||
hp_hybrid_96-long-scaffs.fa |
-
Using whatever tools you have at your disposal (Excel, R, etc.), plot N50 vs k-mer (the dependent variable N50 on the Y, and the independent variable k-mer on the X) for the scaffolds. Same as we did for the assemblies in a previous exercise.
-
Do the same for N50 vs number of contigs greater than 500 bp and N50 vs total assembly size.
-
Describe any trends that are obvious from the three plots.
-
Which assembly would you choose as the ‘best’ one for H. pylori? Explain why you selected that assembly.
-
Do the same set of assemblies for N. gonorrhoeae and S. pyogenes. Then perform the same post-assembly analyses. As you write your scripts make sure to use the appropriate folder names and abbreviations when needed. For example, any instance of 'helicobacterpylori' will need to be replaced with 'neisseriagonorrhoeae' or 'streptococcuspyogenes', depending on which species you're dealing with. Also watch our for instances of 'hp' or 'hpylori' and replace accordingly.
a. While you wait, predict what you should see for general trends in the N50, n:500, and total size plots. Will they follow the same patterns?
b. Did they?
-
Plot kmer size vs total assembly size (sum) for all three and compare them. What do you find?
Upload all your information to Assignment 9 - Abyss as a single Word document.