13. ABYSS on a C. elegans Genome - davidaray/Genomes-and-Genome-Evolution GitHub Wiki

Even the smallest eukaryotic genomes are substantially larger and more complex than most bacterial genomes, so assembling them requires more data and more computing power. In fact, my attempts to assemble the relatively simple data set you’ll be using failed on Quanah for lack of memory. Each node on Nocona has only 512 GB of RAM. Fortunately, there is another set of nodes, xlquanah, with 1.5 TB of RAM each. However, access to those nodes is limited and we won't be using them for this class. If you plan to assemble a large eukaryotic genome, consider using that resource.

RUNNING THE C. ELEGANS ASSEMBLY

The command line is a little more complicated for this data set because we now have multiple short read libraries and we also need additional computational resources. This run will also take substantially longer than previous exercises. Indeed, it’s unlikely to finish in a few hours. So, you’re going to take advantage of the queue to get this and several other assemblies completed.

Enter the following to get the appropriate submission script for this work.

mkdir -p /lustre/scratch/[eraider]/gge2024/abyss/celegans

cd /lustre/scratch/[eraider]/gge2024/abyss/celegans

cp /lustre/scratch/daray/gge2024/bin/abyss_celegans_k36.sh .

View the new script using any of the methods we've covered. You'll notice that we're using 'module load' instead of conda this time. When I worked on this over the summer, Abyss would run in some instances but not in others. I found it better to just use the HPCC-installed version of Abyss to get around this problem. You can see the change in that I've replaced the conda activation lines with 'module load' lines.

#!/bin/bash
#SBATCH --job-name=celegansk36
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=64

#The next line module loads abyss and related tools.
module load gcc/9.2.0 openmpi/4.0.4 abyss/2.3.1 bwa/0.7.17 samtools/1.11

#Assign a value to the k-mer size
k=36

#Define your working directory path
WORKDIR=/lustre/scratch/[eraider]/gge2024/abyss/celegans/"celegans-k"$k

#Identify your data directory, which should already have been created. 
DATADIR=/lustre/scratch/[eraider]/gge2024/data/celegans
#DATADIR is now a variable that can be used to signal this directory rather than typing the path over and over.

#Go to your abyss directory, which should already have been created
cd /lustre/scratch/[eraider]/gge2024/abyss/celegans

#Create your assembly directory
mkdir -p $WORKDIR
#Go to your working directory
cd $WORKDIR

###Line-by-line explanation of the next command
#Run abyss-pe with a k-mer size of 'k' and using 64 processors.
#Output files will be named with the prefix 'cehybridk'<kmersize>
#Use two paired end libraries, a and b.
#Use two long read libraries, a and b.
#Define paired end library a, pea, and give the location of the files.
#Define paired end library b, peb, and give the location of the files.
#Define long read library a, longa, and give the location of the files.
#Define long read library b, longb, and give the location of the files.
abyss-pe k=$k np=64 name="cehybridk"$k \
        lib='pea peb' \
        long='longa longb' \
        pea="$DATADIR/H9_S5_L001_R1_001.fastq.gz
        $DATADIR/H9_S5_L001_R2_001.fastq.gz" \
        peb="$DATADIR/N2_A15_USD16081291_HG2G3ALXX_L8_1.fastq.gz
        $DATADIR/N2_A15_USD16081291_HG2G3ALXX_L8_2.fastq.gz" \
        longa="$DATADIR/nanopore_him9.tar.gz" \
        longb="$DATADIR/nanopore_N2.tar.gz"

The abyss-pe options are basically the same as when you ran the bacterial genome assembly, but notice the addition of ‘np’, which tells the system how many processors to use. Also notice that two sets of paired-end reads and two sets of long reads are being used.

Submit the job to the queue after making the necessary alteration to the paths and keep track so that you know when it's done.
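A minimal sketch of that step, assuming your eRaider ID is the same as your login name (`$USER`) and that the script is in your current directory. The helper name `fill_eraider` is hypothetical and only for illustration; `sbatch` and `squeue` are the standard Slurm commands.

```bash
#!/bin/bash
# Hypothetical helper: replace every literal "[eraider]" placeholder in a
# script with a user ID. Defaults to $USER, which is ASSUMED to match your
# eRaider ID; pass the ID as the second argument if it does not.
fill_eraider() {
    local script="$1"
    local id="${2:-$USER}"
    sed -i "s/\[eraider\]/${id}/g" "$script"
}

# Only act if the submission script is present in the current directory.
if [ -f abyss_celegans_k36.sh ]; then
    fill_eraider abyss_celegans_k36.sh
    sbatch abyss_celegans_k36.sh   # submit the job; Slurm prints the job ID
    squeue -u "$USER"              # list your pending and running jobs
fi
```

Given the `--output` and `--error` lines in the header, progress and error messages should land in files named `celegansk36.<jobid>.out` and `celegansk36.<jobid>.err` in the directory you submitted from.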

FOR YOU TO DO

Keep in mind that these genome assemblies will take some time. It took under an hour to assemble the H. pylori genome, but this is a much more complex genome with a LOT more data, so everything associated with this exercise will take longer. Budget your time accordingly. I recommend starting as soon as possible for one reason. Look at the 'ntasks' line of the header. You're requesting 64 processors. This is a lot for one node. Your script will likely sit in the queue for a while as the scheduler waits for that number of processors to become available.

  1. Modify the script to have it generate assemblies using k-mers of 56 and 96. Run those assemblies. The final assembly file you're looking for is "cehybridk[whatever]-scaffolds.fa". If you sort your files by 'Date Modified', it will be one of the last ones to be produced.

  2. Generate the basic statistics for each assembly using the methods you learned in exercise 7.

  3. Which of these assemblies is 'best' if you only consider the genome size estimate obtained from the previous exercise? What's your reasoning for your answer?

  4. Which of the assemblies is 'best' if you consider only N50? Explain your reasoning.
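One possible way to produce the scripts for step 1 is sketched below. It assumes the k=36 script still contains the lines `k=36` and `#SBATCH --job-name=celegansk36` exactly as shown earlier. The helper name `make_k_variant` and the output file names are hypothetical. Editing copies by hand in a text editor works just as well.

```bash
#!/bin/bash
# Hypothetical helper: write a copy of the k=36 submission script with a
# different k-mer size and a matching job name.
make_k_variant() {
    local src="$1"   # path to abyss_celegans_k36.sh
    local k="$2"     # new k-mer size, e.g. 56
    # Refuse k=36 so the source script is never overwritten.
    [ "$k" != "36" ] || { echo "k=36 is the source script" >&2; return 1; }
    local dest
    dest="$(dirname "$src")/abyss_celegans_k${k}.sh"
    sed -e "s/^k=36[[:space:]]*\$/k=${k}/" \
        -e "s/^#SBATCH --job-name=celegansk36[[:space:]]*\$/#SBATCH --job-name=celegansk${k}/" \
        "$src" > "$dest"
    echo "$dest"     # report the path of the new script
}

# Only act if the k=36 script is present in the current directory.
if [ -f abyss_celegans_k36.sh ]; then
    for k in 56 96; do
        new_script=$(make_k_variant abyss_celegans_k36.sh "$k")
        echo "Wrote $new_script"
    done
fi
```

Because the working directory and the `name=` prefix in the script are both built from `$k`, changing that one variable is enough to keep each assembly's output separate.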

Copy/paste all four of the scripts you created/modified, along with your answers to questions 3 & 4, into a single Word document and submit it to Blackboard under Assignment 13 - Abyss 2.
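As a reminder of what N50 measures when you answer question 4, here is a rough sketch of the calculation. It is not the method from exercise 7: the function name `n50` is hypothetical, and it counts every sequence in the file, so its result may differ from tools that ignore short scaffolds.

```bash
#!/bin/bash
# Hypothetical helper: print the N50 of a FASTA file, i.e. the length of the
# sequence at which the running total of lengths (longest first) reaches
# half of the total assembly length.
n50() {
    # Step 1: print one length per FASTA record (handles wrapped sequences).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Step 2: sort lengths from longest to shortest.
    sort -rn |
    # Step 3: walk down the list until half the total length is covered.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) { print lens[i]; exit }
             }
         }'
}

# Example use, only if a finished assembly is present in this directory.
if [ -f cehybridk36-scaffolds.fa ]; then
    n50 cehybridk36-scaffolds.fa
fi
```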
