09. WGS Assembly Using ABYSS - davidaray/Genomes-and-Genome-Evolution GitHub Wiki

NOTE for 2025 - Use HPCC installed version of Abyss. The conda installation won't work.

Unicycler is designed to be used for bacterial genomes. A second assembler commonly used for prokayotic and eukaryotic genomes is ABYSS (https://github.com/bcgsc/abyss). It uses paired-end and long reads much like Unicycler does, but it requires a little more massaging. However, it’s useful because it can be more finely tuned and can be used for non-bacterial genomes. For now, though, we’re going to use the bacterial data from the Unicycler exercise.

Don't wait until the last minute to start this exercise. This exercise involves multiple genome assembly processes, some of which take quite a while.

USING ABYSS

But first, we need to get a working script.

Log in to HPCC as usual. Once you have a terminal, complete the following commands:

mkdir -p /lustre/scratch/[eraider]/gge2024/abyss/hpylori

cp /lustre/scratch/daray/gge2024/bin/abyss_sbatch.sh /lustre/scratch/[eraider]/gge2024/abyss/hpylori

cd /lustre/scratch/[eraider]/gge2024/abyss/hpylori

Reminder: The script you just copied has the text <eraider> in it at several places. You need to replace that text with your actual eraider id. To do that type the following. Don't replace <eraider> with your username in this one instance. Do replace YOUR ACTUAL ERAIDER ID with your actual eraider id.

To do this we will use a utility called sed, stream editor. While sed has many, many functions, I mostly use it to do replacements in files without opening them. To do that the basic syntax is as follows: 'sed "s|text to replace|replacement text|g" target file >output file'.

sed "s|<eraider>|YOUR ACTUAL ERAIDER|g" abyss_sbatch.sh >abyss_sbatch_hp.sh

For example, for me the command would be: sed 's|<eraider>|daray|g' abyss_sbatch.sh >abyss_sbatch_hp.sh Note the addition of hp to indicate H. pylori.

Open the file in a text editor or using less abyss_sbatch_hp.sh.

You should see something like this but with '' replaced with your eraider id:

#!/bin/bash
#SBATCH --job-name=abyss
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=36

#Migrate to your working directory
cd /lustre/scratch/<eraider>/gge2024/abyss/hpylori

#If Abyss not installed using conda uncomment the next line:
#Module load abyss and related components
#module load gcc/9.2.0 openmpi/4.0.4 abyss/2.3.1 bwa/0.7.17 samtools/1.11

#Set up your path variables
WORKDIR=/lustre/scratch/<eraider>/gge2024
GGE_CONTAINER=/lustre/scratch/<eraider>/gge2024/container

#Create a list of k-mers to test assemblies
LIST="26 36 46 56 66 76 86 96"

#For every k-mer in that list, build an assembly in a dedicated directory
## For every value of k in the LIST
## make a directory in the gge2024/abyss directory tagged with the value of k-mer
## go to that directory
## run abyss pe using the singularity container and with all short reads and the high depth long reads using 36 processors and the value of k. Call the assembly "hp_hybrid_k"$k.
for k in $LIST
	do mkdir -p $WORKDIR/abyss/"hp_hybrid_k"$k
	cd $WORKDIR/abyss/"hp_hybrid_k"$k
	singularity exec -B $WORKDIR:$WORKDIR $GGE_CONTAINER/gge_container_v4.sif \
	abyss-pe np=36 k=$k name="hp_hybrid_k"$k \
	in='/lustre/scratch/<eraider>/gge2024/data/helicobacterpylori/short_reads_1.fastq.gz /lustre/scratch/<eraider>/gge2024/data/helicobacterpylori/short_reads_2.fastq.gz' \
	longa='/lustre/scratch/<eraider>/gge2024/data/helicobacterpylori/long_reads_high_depth.fastq.gz' \
	long='longa' 
done

So, what will this script do? Essentially, it will create 8 assemblies using the same data as you used with Unicycler for the 'hp_long_high' assembly. Each assembly will use a different value of k. In other words, a different set of k-mers. One of the things I didn’t like about Unicycler is that it didn’t keep the various assemblies that were generated when using all of the tested k-mers. We’re going to try to see what those assemblies would have looked like if they were available.

To run this script, simply type

sbatch abyss_sbatch_hp.sh

The final assembly to be used in downstream analyses is "hp_hybrid_k[whatever the k is]-scaffolds.fa".

FOR YOU TO DO

When the run is finished (use squeue -u [eraider] to check if your job is still running), complete the following tasks.

  1. Generate and gather the various .tab files that depict the basic stats for each run and generate a table containing all of the data. Include the table as your answer to this question.

Your table should have the following structure:

n n:500 L50 min N75 N50 N25 E-size max sum assembly name
hp_hybrid_26
hp_hybrid_36
hp_hybrid_46
hp_hybrid_56
hp_hybrid_66
hp_hybrid_76
hp_hybrid_86
hp_hybrid_96
  1. Using whatever tools you have at your disposal (Excel, R, etc.), plot N50 vs k-mer (the dependent variable N50 on the Y, and the independent variable k-mer on the X) for the scaffolds. Same as we did for the assemblies in a previous exercise.

  2. Do the same for N50 vs number of contigs greater than 500 bp and total assembly size.

  3. Describe any trends that are obvious from the three plots.

  4. Which assembly would you choose as the ‘best’ one for H. pylori? Explain why you selected that assembly.

  5. Do the same set of assemblies for N. gonorrhoeae and S. pyogenes. Then perform the same post-assembly analyses. As you write your scripts make sure to use the appropriate folder names and abbreviations when needed.

    a. While you wait, predict what you should see for general trends in the N50, n:500, and total size plots. Will they follow the same patterns?

    b. Did they?

Upload all your information to Assignment 9 - Abyss as a single Word document.

⚠️ **GitHub.com Fallback** ⚠️