Deprecated 09. WGS Assembly Using ABYSS - davidaray/Genomes-and-Genome-Evolution GitHub Wiki

Unicycler is designed to be used for bacterial genomes. A second assembler commonly used for prokayotic and eukaryotic genomes is ABYSS (https://github.com/bcgsc/abyss). It uses paired-end and long reads much like Unicycler does, but it requires a little more massaging. However, it’s useful because it can be more finely tuned and can be used for non-bacterial genomes. For now, though, we’re going to use the bacterial data from the Unicycler exercise.

Don't wait until the last minute to start this exercise. This exercise involves multiple genome assembly processes, some of which take quite a while.

USING ABYSS

But first, we need to get you some data and a working script.

mkdir -p /lustre/scratch/[eraider]/gge2023/abyss/hpylori

cp /lustre/scratch/daray/gge2023/bin/abyss_sbatch.sh /lustre/scratch/[eraider]/gge2023/abyss/hpylori

cd /lustre/scratch/[eraider]/gge2023/abyss/hpylori

Reminder: The script you just copied has the text [eraider] in it at several places. You need to replace that text with your actual eraider id. To do that type the following. Don't replace [eraider] with your username in this one instance. Do replace YOUR ACTUAL ERAIDER ID with your actual eraider id.

To do this we will use a utility called sed, stream editor. While sed has many, many functions, I mostly use it to do replacements in files without opening them. To do that the basic syntax is as follows: 'sed "s/text to replace/replacement text/g" target file >output file'.

sed "s|\[eraider\]|YOUR ACTUAL ERAIDER|g" abyss_sbatch.sh >abyss_sbatch_hp.sh

For example, for me the command would be: sed 's|\[eraider\]|daray|g' abyss_sbatch.sh >abyss_sbatch_hp.sh Note the addition of hp to indicate H. pylori.

Open the file in a text editor or using less abyss_sbatch_hp.sh.

You should see something like this but with '[eraider]' replaced with your eraider id:

#!/bin/bash
#SBATCH --job-name=abyss
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=36

#Migrate to your working directory
cd /lustre/scratch/[eraider]/gge2023/abyss/hpylori

#Activate Abyss using module load
#Module load abyss and related components
module load gcc/9.2.0 openmpi/4.0.4 abyss/2.3.1 bwa/0.7.17 samtools/1.11

#Create a list of k-mers to test assemblies
LIST="26 36 46 56 66 76 86 96"

#For every k-mer in that list, build an assembly in a dedicated directory
for k in $LIST
        do mkdir "hp_hybrid_k"$k
        cd "hp_hybrid_k"$k
        abyss-pe np=36 k=$k name="hp_hybrid_k"$k \
        in='/lustre/scratch/[eraider]/gge2023/data/helicobacterpylori/short_reads_1.fastq.gz /lustre/scratch/[eraider]/gge2023/data/helicobacterpylori/short_reads_2.fastq.gz' \
        long='longa' \
        longa='/lustre/scratch/[eraider]/gge2023/data/helicobacterpylori/long_reads_high_depth.fastq.gz'
        cd ../
done

So, what will this script do? Essentially, it will create 8 assemblies using the same data as you used with Unicycler for the 'hp_long_high' assembly. Each assembly will use a different value of k. In other words, a different set of k-mers. One of the things I didn’t like about Unicycler is that it didn’t keep the various assemblies that were generated when using all of the tested k-mers. We’re going to try to see what those assemblies would have looked like if they were available.

To run this script, simply type

sbatch abyss_sbatch_hp.sh

The final assembly to be used in downstream analyses is "hp_hybrid_k[whatever the k is]-scaffolds.fa".

FOR YOU TO DO

When the run is finished (use squeue -u [eraider] to check if your job is still running), complete the following tasks.

Generate and gather the various .tab files that depict the basic stats for each run and generate a table containing all of the data. Include the table as your answer to this question.

One way to do this without having to download each file individually is as follows:

. ~/conda/etc/profile.d/conda.sh
conda activate unicycler
cd /lustre/scratch/[eraider]/gge2023/abyss/hpylori
for k in {26..96..10}; do abyss-fac "hp_hybrid_k"$k/"hp_hybrid_k"$k"-scaffolds.fa" |tee "hp_hybrid_k"$k"-stats.tab"; done
head hp_hybrid_k26-stats.tab >hp_hybrid_table.tab
for k in {36..96..10}; do echo $(tail -1 hp_hybrid_k${k}-stats.tab) >>hp_hybrid_table.tab; done
sed -i -e 's/  */\t/g' hp_hybrid_table.tab

If you did everything correctly, you should now have a file called 'hp_hybrid_table.tab' that contains all of the data from each genome in a single table.

Using whatever tools you have at your disposal (Excel, R, etc.), plot N50 vs k-mer (the dependent variable N50 on the Y, and the independent variable k-mer on the X) for the scaffolds. Same as we did for the assemblies in a previous exercise.
Do the same for N50 vs number of contigs greater than 500 bp and total assembly size.
Describe any trends that are obvious from the three plots.
Which assembly would you choose as the ‘best’ one for H. pylori? Explain why you selected that assembly.
Do the same set of assemblies for N. gonorrhoeae and S. pyogenes. Then perform the same post-assembly analyses. As you write your scripts make sure to use the appropriate folder names and abbreviations when needed.

a. While you wait, predict what you should see for general trends in the N50, n:500, and total size plots. Will they follow the same patterns?

b. Did they?

Upload all your information to Assignment 9 - Abyss as a single Word document.