Lab 02 Genome Assembly - jfgout/AppliedGenomics GitHub Wiki

Assembling a genome

Our goal is to assemble a genome from some Illumina paired-end sequencing reads. To do so, we will be using a software named spades (http://cab.spbu.ru/software/spades/). In this lab, you will familiarize yourself with handling files and running programs on the supercomputer before finally assembling a genome!

1. Let's look at the data we'll be using...

The sequencing reads are in these two files: /mnt/scratch/goutfs/CMB8013/GenomeAssembly/SRR8420302_1.fastq and /mnt/scratch/goutfs/CMB8013/GenomeAssembly/SRR8420302_2.fastq

Use the "tabulation" auto-completion trick as much as possible to type the following command:
cat /mnt/scratch/goutfs/CMB8013/GenomeAssembly/SRR8420302_1.fastq

The command cat displays a file on the screen. However, because SRR8420302_1.fastq is a BIG file, displaying all the lines on the screen is not very convenient (and it can take a lot of time). You can interrupt the long ongoing cat command with the key combination: ctrl+c (yes, this means that ctrl+c will NOT work as a shortcut for "copy" in the terminal).

When looking at large text files, you can use the command less. Type the following command:
less /mnt/scratch/goutfs/CMB8013/GenomeAssembly/SRR8420302_1.fastq
and navigate through the file using the up/down arrows (as well as page up/page down and space bar). Simply type q to exit.

So, how big is this file? We can use the command ls to find out. Type the following (again, using the tabulation auto-completion as much as possible):
ls -l /mnt/scratch/goutfs/CMB8013/GenomeAssembly
(note, this is exactly equivalent to ls -l /mnt/scratch/goutfs/CMB8013/GenomeAssembly/ - the extra slash at the end of the last folder's name is optional)

You can see that the file size is 453,920,598 octets. To display the file size in a more "human readable" unit, you can use the following:
ls -lh /mnt/scratch/goutfs/CMB8013/GenomeAssembly

To see all the possible usages of the ls command, type the following: man less
man stands for "manual" and is the standard way to obtain help on a command in the shell.

Use the command head to display only the first 10 lines of a file. Read the manual page for head and find the correct syntax to display only the first 4 lines of the SRR8420302_1.fastq file.

What type of information is present on the different lines that are currently displayed?

Now, display the first 8 lines of SRR8420302_1.fastq and then SRR8420302_2.fastq. Notice how paired reads share the same identifiers and are in the same order between the two files.

Use the manual to find out how to use the wc command to count how many lines are in the SRR8420302_1.fastq file. How many reads does this correspond to?

2. Getting ready to assemble the genome.

Before we can use the software Spades, you will need to type the following command:
module load Spades
Note that if you log out and then log back in to the server, you will need to type this command again before using spades (the effect of this command lasts only as long as your current session on the server - if you are curious about what this "module" command does, you can read more about it here: http://modules.sourceforge.net/ However, for the purpose of this class, you only need to remember to use this command before using spades or any over software listed under module avail).

Look into the online spades manual (https://ablab.github.io/spades/) to find the correct syntax to run spades with two files (paired-end reads from one Illumina library).

Alternatively, you can types: spades.py --help to display the help message in the terminal.

To help you find the correct syntax, here are a few points to consider:

  • Are the reads stored in file pairs or interleaved files?
  • Which file contains the left reads and which file contains the right reads?
  • Where should the results be stored? (see the "-o" option under Basic options paragraph)

If you have the correct command line, spades should start working and your screen will look like this:

Congratulations, you are on your way to assembling a genome! However, after a few minutes (before spades is finished assembling the genome), you should see this error message:

** Message from admin_lugh@lughhead (as root) on at 13:08 ... Warning, your process 'spades-hammer' was killed because it was taking up too many resources on the head node. Please, use the job submission system. EOF**

So, let's use the job submission system to run the spades assembly command!

3. Submitting jobs on the server

Commands like ls, cat, head, ... take (milli)seconds to run. However, a command like the spades assembly we are trying to perform can request the full attention of the computer's processor(s) for several minutes/hours. As a consequence, if several users are trying to run such intensive commands at the same time, it can result in a very slow system.
To avoid this problem (so that you don't have to wait 10 seconds to see the result of a simple ls command), jobs (a job = a series of commands) that are expected to take more than a minute or so to run should be submitted via the job submission process. Here is how it is done:

3.1 Preparing the submission.

Create a file in which you will write all the commands to be run for this job. Remember, you can create an empty file with the command touch
For example: touch assembly.sh

Then, edit the file with gedit or nano (or any other text editor available on lugh) to contain the following text:


#!/bin/bash

cd [path to your working directory]
module load Spades
[your spades command]


3.2 Submitting the job

To submit your job, after saving your job file, type the following command:
sbatch -p gout [your job file]

Things might run a bit faster when using the regular queue instead of the "gout" node: sbatch [your job file]

The command sbatch submits a script file to the job submission system and the option -p gout specifies that it should run on the gout partition.

You can check the status of your job(s) by typing squeue

To see only the jobs that you submitted (and ignore those from other users): squeue -u [your login name]

To obtain more details on a specific job: scontrol show job [jobID]

3.3 Checking the results

Once your job is not listed anymore by the squeue command, its execution is finished. You can check for potential error messages in two files (slurm[jobNumber].out and slurm[jobNumber].err) automatically created in the current folder.

Now, check if Spades successfully produced a genome assembly.

4. Exercise: Checking the quality of the genome assembly.

Use R and the library Biostrings to load the assembly fasta file and obtain statistics about the genome assembly produced by spades.

library(Biostrings) genome <- readDNAStringSet("scaffolds.fasta")

You can obtain the length of all the scaffolds with:

width(genome)

Check the documentation for the function N50 (provided with the Biostrings package)

Let's write our own version of the N50 function (my_n50). We will implement 2 different versions:

  1. Using the sort function provided by R.

Here is a solution for this exercise:

myN50 <- function(vLengths){
# Let's start by sorting the vector of sizes in decreasing order (= largest scaffold first)
vlo <- sort(vLengths, decreasing = T)

# Target size represents half of the total assembly size
targetSize <- sum(vlo) / 2

# "total" will keep track of the cumulative size of the first i scaffolds
total <- vlo[1]
i <- 1
while( total < targetSize ){
i <- i + 1
total <- total + vlo[i]
}
n50 <- vLengths[i]
n50
}

  1. With our own my_sort sorting function.

We will cover (in class) 3 different algorithms to sort a vector:

  • selection sort

my_selectionSort <- function(v){
# We start by creating an empty vector of the same size as the vector to sort
vSorted <- rep(NA, length(v))
nbToDo <- length(v)
nbDone <- 0

# This loop will perform as many iterations as there are elements in the vector.
# At each iteration, it will select the largest value in the vector and add it
# to the new sorted vector (and remove it from the old vector).
while(nbDone < nbToDo){
nbDone <- nbDone + 1
wm <- which.max(v)
vSorted[nbDone] <- v[wm]
v[wm] <- NA
}
vSorted
}

  • insertion sort

myInsertionSort <- function(v){
nbElements <- length(v)
for(i in (2:nbElements)){
currentValue <- v[i]
j <- i-1
# Searching for the position where the 'currentValue' should be inserted.
while(j > 0 && v[j] < currentValue){
j <- j - 1
}
if( j < (i-1) ){
v <- append(v[-i], value = currentValue, after = j)
}
}
v
}

  • bubble sort

myBubbleSort <- function(v){
nbSwap <- 1
nbElements <- length(v)
while(nbSwap > 0){
nbSwap <- 0
for(i in (2:nbElements)){
if( v[i-1] > v[i] ){
vtmp <- v[i-1]
v[i-1] <- v[i]
v[i] <- vtmp
nbSwap <- nbSwap + 1
}
}
}
v
}

Make sure to compare efficiency of these different algorithms.

5. Improving the quality of the genome assembly.

The folder /mnt/scratch/goutfs/CMB8013/GenomeAssembly/PacBio/ contains one file with PacBio sequencing reads. What is the main difference between PacBio and Illumina reads?

These PacBio reads were obtained from the same organism as the Illumina reads used earlier. Find how to supply both the Illumina and PacBio reads to spades and get a new genome assembly that combines both types of data.

Use R to compare the two genome assemblies and discuss the results.

6. Going the extra mile.

For larger projects, you might have to use more than one pair of sequencing read files. When the number of files to use is large, writing the corresponding command can become challenging. The solution is to write a script which will automatically generate the corresponding command. The goal of this (optional) exercise is to write a script that will automatically generate the job file with the spades command for any list of paired-end reads files present in a given folder.

An example of this is available in: /mnt/scratch/AppliedGenomics/GenomeAssembly/ExtraMile/

If you are unsure what your spades command should look like, you can look at the file: /mnt/scratch/goutfs/CMB8013/GenomeAssembly/ExtraMile/run-spades.pbs

Here is the general recipe to follow for your script, and the name of some functions you could use:

  1. You need to pass at least two arguments to your script: the path to the folder containing the reads and the path the folder where you want to store the results from spades. You might also pass the name of the PBS file as an argument.

  2. Write the "header" for your PBS file (the part that will remain the same, no matter how many/which sequencing reads file you are working on).

  3. Create a variable which you will initialize with the start of the spades command. You will then append to the end of this variable for each pair of file...

  4. Search for all the files named "*_1.fastq" in the folder that was passed as an argument (lookup the function "find" and how to iterate through the results from "find" in bash)

4.a For each of these files, create a variable which contains the name of the paired file (one way is simply to substitute "_1.fastq" with "_2.fastq" in the file name...)

4.b Append this pair of files to your growing spades command.

  1. Output your full spades command into the PBS file.
⚠️ **GitHub.com Fallback** ⚠️