09. WGS Assembly Using ABYSS - davidaray/Genomes-and-Genome-Evolution GitHub Wiki
Unicycler is designed to be used for bacterial genomes. A second assembler commonly used for prokayotic and eukaryotic genomes is ABYSS (https://github.com/bcgsc/abyss). It uses paired-end and long reads much like Unicycler does, but it requires a little more massaging. However, it’s useful because it can be more finely tuned and can be used for non-bacterial genomes. For now, though, we’re going to use the bacterial data from the Unicycler exercise.
Don't wait until the last minute to start this exercise. This exercise involves multiple genome assembly processes, some of which take quite a while.
But first, we need to get a working script.
Log in to HPCC as usual. Once you have a terminal, complete the following commands:
mkdir -p /lustre/scratch/[eraider]/gge2024/abyss/hpylori
cp /lustre/scratch/daray/gge2024/bin/abyss_sbatch.sh /lustre/scratch/[eraider]/gge2024/abyss/hpylori
cd /lustre/scratch/[eraider]/gge2024/abyss/hpylori
Reminder: The script you just copied has the text <eraider>
in it at several places. You need to replace that text with your actual eraider id. To do that type the following. Don't replace <eraider>
with your username in this one instance. Do replace YOUR ACTUAL ERAIDER ID with your actual eraider id.
To do this we will use a utility called sed
, stream editor. While sed has many, many functions, I mostly use it to do replacements in files without opening them. To do that the basic syntax is as follows: 'sed "s|text to replace|replacement text|g" target file >output file'.
sed "s|<eraider>|YOUR ACTUAL ERAIDER|g" abyss_sbatch.sh >abyss_sbatch_hp.sh
For example, for me the command would be: sed 's|<eraider>|daray|g' abyss_sbatch.sh >abyss_sbatch_hp.sh
Note the addition of hp to indicate H. pylori.
Open the file in a text editor or using less abyss_sbatch_hp.sh
.
You should see something like this but with '' replaced with your eraider id:
#!/bin/bash
#SBATCH --job-name=abyss
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=36
#Migrate to your working directory
cd /lustre/scratch/<eraider>/gge2024/abyss/hpylori
#If Abyss not installed using conda uncomment the next line:
#Module load abyss and related components
#module load gcc/9.2.0 openmpi/4.0.4 abyss/2.3.1 bwa/0.7.17 samtools/1.11
#Set up your path variables
WORKDIR=/lustre/scratch/<eraider>/gge2024
GGE_CONTAINER=/lustre/scratch/<eraider>/gge2024/container
#Create a list of k-mers to test assemblies
LIST="26 36 46 56 66 76 86 96"
#For every k-mer in that list, build an assembly in a dedicated directory
## For every value of k in the LIST
## make a directory in the gge2024/abyss directory tagged with the value of k-mer
## go to that directory
## run abyss pe using the singularity container and with all short reads and the high depth long reads using 36 processors and the value of k. Call the assembly "hp_hybrid_k"$k.
for k in $LIST
do mkdir -p $WORKDIR/abyss/"hp_hybrid_k"$k
cd $WORKDIR/abyss/"hp_hybrid_k"$k
singularity exec -B $WORKDIR:$WORKDIR $GGE_CONTAINER/gge_container_v4.sif \
abyss-pe np=36 k=$k name="hp_hybrid_k"$k \
in='/lustre/scratch/<eraider>/gge2024/data/helicobacterpylori/short_reads_1.fastq.gz /lustre/scratch/<eraider>/gge2024/data/helicobacterpylori/short_reads_2.fastq.gz' \
longa='/lustre/scratch/<eraider>/gge2024/data/helicobacterpylori/long_reads_high_depth.fastq.gz' \
long='longa'
done
So, what will this script do? Essentially, it will create 8 assemblies using the same data as you used with Unicycler for the 'hp_long_high' assembly. Each assembly will use a different value of k. In other words, a different set of k-mers. One of the things I didn’t like about Unicycler is that it didn’t keep the various assemblies that were generated when using all of the tested k-mers. We’re going to try to see what those assemblies would have looked like if they were available.
To run this script, simply type
sbatch abyss_sbatch_hp.sh
The final assembly to be used in downstream analyses is "hp_hybrid_k[whatever the k is]-scaffolds.fa".
When the run is finished (use squeue -u [eraider]
to check if your job is still running), complete the following tasks.
- Generate and gather the various .tab files that depict the basic stats for each run and generate a table containing all of the data. Include the table as your answer to this question.
Your table should have the following structure:
n | n:500 | L50 | min | N75 | N50 | N25 | E-size | max | sum | assembly name |
---|---|---|---|---|---|---|---|---|---|---|
hp_hybrid_26 | ||||||||||
hp_hybrid_36 | ||||||||||
hp_hybrid_46 | ||||||||||
hp_hybrid_56 | ||||||||||
hp_hybrid_66 | ||||||||||
hp_hybrid_76 | ||||||||||
hp_hybrid_86 | ||||||||||
hp_hybrid_96 |
-
Using whatever tools you have at your disposal (Excel, R, etc.), plot N50 vs k-mer (the dependent variable N50 on the Y, and the independent variable k-mer on the X) for the scaffolds. Same as we did for the assemblies in a previous exercise.
-
Do the same for N50 vs number of contigs greater than 500 bp and total assembly size.
-
Describe any trends that are obvious from the three plots.
-
Which assembly would you choose as the ‘best’ one for H. pylori? Explain why you selected that assembly.
-
Do the same set of assemblies for N. gonorrhoeae and S. pyogenes. Then perform the same post-assembly analyses. As you write your scripts make sure to use the appropriate folder names and abbreviations when needed.
a. While you wait, predict what you should see for general trends in the N50, n:500, and total size plots. Will they follow the same patterns?
b. Did they?
Upload all your information to Assignment 9 - Abyss as a single Word document.