06. Simple Bacterial Genome Assembly - davidaray/Genomes-and-Genome-Evolution GitHub Wiki

To introduce you to genome assemblies, we’ll start with simple genomes from bacteria. Unicycler is a relatively user-friendly bacterial genome assembler that we’ll use to demonstrate some basic aspects and complexities of genome assembly. You should take some time to read the documentation for the software while your assemblies are working (https://github.com/rrwick/Unicycler). As with fastqc, I have installed a working version of Unicycler into the singularity container we're using for this class.

PREPARING YOUR DIRECTORIES

cd /lustre/scratch/[eraider]

mkdir -p gge2024/unicycler

Down the road, we will also be performing assemblies with Abyss and Flye, so go ahead and make them.

mkdir -p gge2024/abyss

mkdir -p gge2024/flye

RUNNING A BASIC ASSEMBLY

You’ll get started with a simple assembly of Helicobacter pylori using only the lllumina paired end reads. Assuming you’re returning to this after having done something else, it’s necessary to reconstitute the appropriate environment to get this to work. Notice that this time, we’re getting an interactive session with 8 processors. Unicycler’s default is to use 8 processors. So, we need to have them available to us. If you're still using a previous session with fewer processors, exit that and enter a new one.

interactive -p nocona -c 8

As done in the previous tutorial, let's set up the variables needed to run this genome assembly.

WORKDIR="/lustre/scratch/[eraider]/gge2024"
GGE_CONTAINER="/lustre/scratch/[eraider]/gge2024/container"

Now direct to the folder where the results of this run will be generated.

cd /lustre/scratch/[eraider]/gge2024/unicycler

NOTE: The step below WILL take a little while. I plan to start it early enough in the allotted time that it won't be a problem. But, if you start this exercise in class, it will probably not finish before you need to leave. Don't leave your job running in the classroom. Enter CTRL+C to stop the job and restart it using the same commands when you get home (or wherever).

The following will start a Unicycler run using paired end data from the data directories you created earlier. It will generate a new directory with the assembly and precursor files in it. The run may take a while. Make sure you don’t close the terminal window while it runs. That will stop the run. Note: you've probably noticed that most commands, you can just copy and paste. This one sometimes gives a lot of people trouble so you may want to type it in.

singularity exec -B $WORKDIR:$WORKDIR \
$GGE_CONTAINER/gge_container_v4.sif \
 unicycler \
-1 "$WORKDIR"/data/helicobacterpylori/short_reads_1.fastq.gz \
-2 "$WORKDIR"/data/helicobacterpylori/short_reads_2.fastq.gz \
-o "$WORKDIR"/unicycler/hp_pe_only --keep 3 -t 8 --verbosity 2

A few notes here. First, that end of line \ allows you to split single commands into separate lines. This makes things easier to read if your terminal is not wide enough to see the entire line. But, you can't have ANYTHING (no spaces, no tabs, no characters) after that \ or it will not work.

Let's break down this command line by looking at the various options invoked. Each option tells the software to do something.

  • singularity exec -B $WORKDIR:$WORKDIR - starts Singularity and binds the paths in the host to the container
  • $GGE_CONTAINER/gge_container_v4.sif unicycler - execute the unicycler command from the container
  • -1 and -2 - Tells the software the names and locations of the reads to be used. Notice that these are relative paths to the bound ($WORKDIR) host directory.
  • -o "$WORKDIR"/unicycler/hp_pe_only - Set up a directory in $WORKDIR where the output will be found. The name of the folder is descriptive.
    • 'hp' = Helicobacter pylori
    • 'pe_only' = We're only using the Illumina paired-end reads for this assembly and no long reads.
  • --keep 3 - Tells the program to retain all the intermediate files that will be produced during the assembly. The default setting is to delete most of them but I want to demonstrate some things with that information.
  • -t 8 - 8 processors are available to use via our interactive session. It should use all 8 of them.
  • --verbosity 2 - Output most of the things it's doing to the terminal screen. This last option essentially outputs to the screen the 'log' file that you'll be using later.

Let's go into the directory where your assembly will end up. You can't do it right now because there is screen output from Unicycler. The only way to enter that directory would be to stop the run and we don't want to do that.

So, get ANOTHER terminal window and type,

cd /lustre/scratch/[eraider]/gge2024/unicycler/hp_pe_only

You can also view the files being generated using the Bitvise SFTP window.

Depending on how quickly you do all of this, there may or may not be many files and/or subdirectories visible. This assembly may take a while and it may appear as though nothing is happening for long periods of time. Don't close the terminal where it's working, though. You'll kill the job and have to start over!!!!

It would be helpful to read the Unicycler documentation while you wait. https://github.com/rrwick/Unicycler. The run may take up to an hour but probably less if you're using the right number of processors. When I tested this exercise using 10 processors, it took 24 minutes from hitting enter on the 'unicycler' command to creation of the final files.

#------------------------------------------------------------------------------#

FOR YOU TO DO

#------------------------------------------------------------------------------#

Use the information you read in the Unicycler documentation and in the output directory to answer these questions.

  1. What relationship do 'tangles' and 'repeats' have with one another in a genome assembly?

  2. How does the above question relate to assembly 'graphs'?

  3. What is a dead end? How many dead ends are present in a good bacterial assembly? Why?

  4. Notice in the 'Method: long-read-only assembly' section of the documentation, there is reference to racon. Find out what racon is used for. Based on what we've discussed in class, why is this necessary in this section of the manual but not in the Illumina-only assembly section?

  5. What is direct long read bridging?

  6. Several output files were created for your assembly. Which ones are graphs?

  7. What factors do you think determine how long the assembly takes to run? In your own words, explain what the influence of these factors might be.

  8. For this exercise, did we use long reads, short reads, or both? Explain how you determined your answer.

One of the really good things about Unicycler is the very well-documented unicycler.log file. Look at it whatever way you like but you'll need the run to be complete to answer the last questions.

  1. How many kmers were tried in Unicycler's attempt to generate an assembly? What were they and which one was considered the best? Use google to define 'kmer'.

  2. Using the log, reiterate the steps that Unicycler took here, using a few words to explain what each step does.

Upload your answers to Assignment 6 - Simple assembly on Blackboard.