Deprecated 13. Long Read Assembly with HASLR

Long-read data are the wave of the future and resolve many of the contiguity problems that arise because of repeats and other ambiguous regions of the genome. I know some of the graduate students in this department, and possibly this class, are using long-read data for actual genome assemblies in their research. In the past two years, my lab has been involved in at least 16 projects that use either Nanopore or PacBio data. There are various types of long-read data and, just as with short-read data, there are quite a few software packages to try. You've already used Unicycler and ABySS, but Unicycler is specialized for bacterial genomes and ABySS is just kind of clunky and doesn't always give great assemblies. For this exercise, we'll be using HASLR (Fast Hybrid Assembly of Long Reads) to generate a hybrid assembly of Drosophila mauritiana. HASLR is a relatively new package that promises fast and accurate assemblies.

The manuscript associated with this tool kindly provides a simplified flowchart describing the pipeline. Note that this is hybrid assembly approach 3 from the lecture in class. Check the slides to remind yourself what that means.

[Figure: HASLR pipeline flowchart]

You should read up on HASLR to get an idea of what the package does and how it works. Resources are here:

https://www.cell.com/iscience/pdf/S2589-0042(20)30577-0.pdf

https://github.com/vpc-ccg/haslr

THE DATA

We'll be using data from the publication "Highly Contiguous Genome Assemblies of 15 Drosophila Species Generated Using Nanopore Sequencing." The manuscript describes data from many species, but we'll only be using data from one.

So far in this class, I've been providing you with data for each project but it wouldn’t hurt for you to see how to get data that others have generated from one of many databases that store it. In this case, we'll be using the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI). This is the same general database you used to get the reference assembly for Helicobacter pylori in a previous exercise. In this case, we're just going to the section that houses raw read data associated with publications and entries.

Get started as always by getting set up.

interactive -p nocona

mkdir /lustre/scratch/[eraider]/gge2022/data/dmauritiana

cd /lustre/scratch/[eraider]/gge2022/data/dmauritiana

To get the data, we're going to use the SRA Toolkit (sra-toolkit). The documentation for these tools is on the sra-tools GitHub repository (https://github.com/ncbi/sra-tools/wiki).

Unfortunately, while there is a conda installation for the sra-toolkit, it's old and doesn't have the tool we need. So, we're just going to use my installation. That means doing things a little differently, but it should seem at least somewhat familiar given previous exercises.

There are several SRA tools, but we're going to use fasterq-dump. It takes runs in the native SRA format (.sra) and downloads and converts them to .fastq in one step. It will also split any paired-end read data into two fastq files for you automatically. The first step just makes life a little easier by storing the directory where I have sra-toolkit installed in a shell variable. While still in your 'dmauritiana' directory:

SRA=/lustre/work/daray/software/sratoolkit.2.9.6-1-centos_linux64/bin

The previous command just sets up a variable called SRA that now represents the path to the scripts you need. Note the inclusion of the "$" symbols on the lines below.

The following two commands will download the data for this project. The first targets short, paired-end reads (two files) and the second targets long Nanopore reads (a single file), so this could take more than a few minutes. You should probably use two terminals for this, with an interactive session in each.

$SRA/fasterq-dump SRR6425993

$SRA/fasterq-dump SRR7167952
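If the downloads are crawling, fasterq-dump can run multi-threaded and show its progress. A minimal example (these are standard sra-toolkit flags; adjust the thread count to your interactive session):

$SRA/fasterq-dump --threads 4 --progress SRR6425993

By default, fasterq-dump splits paired-end runs, so SRR6425993 should yield two files (_1 and _2) while the Nanopore run yields one.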

Once everything is downloaded, this last line just compresses everything so that it takes up less space.

gzip *.fastq
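It's worth sanity-checking the downloads before moving on. Each fastq record is four lines, so dividing the line count by four gives the read count. For example, assuming fasterq-dump named the paired files as usual:

echo $(( $(zcat SRR6425993_1.fastq.gz | wc -l) / 4 ))

Run the same check on SRR6425993_2.fastq.gz; the two paired-end files should report identical read counts.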

RUNNING THE LONG READ ASSEMBLY

As you've likely noticed, we've been running assemblies for genomes that are increasingly complex. That means they take more data and they take longer to run. This one is the largest we've attempted so far. Indeed, while you could run it directly in a terminal, I don't recommend it. Instead, we will replicate the last exercise in which you used a submission script. I've already written one for you to modify as necessary. Get it as follows:

mkdir -p /lustre/scratch/[eraider]/gge2022/haslr/bin

Note that in this case, we are creating a separate directory for our submission scripts. This can be a good practice, especially if the output of your process is a lot of files and directories. Having a dedicated bin directory allows you to avoid losing your run scripts in a morass of output files.

cd /lustre/scratch/[eraider]/gge2022/haslr/bin

cp /lustre/scratch/daray/gge2022/haslr/bin/haslr_sbatch.sh .

As usual, you will need to edit it to have your eraider id in the correct spot. You will also need to get the haslr software set up in a conda environment.

. ~/conda/etc/profile.d/conda.sh

conda activate

conda create --name haslr

conda activate haslr

conda install -c bioconda haslr

Get it running using sbatch haslr_sbatch.sh.
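If you're curious what the submission script is wrapping, the heart of it is a single haslr.py call. Here is a minimal sketch based on the usage shown in the HASLR README; the SBATCH settings, file names, and the ~130m genome-size estimate are my assumptions, not the contents of the script you copied:

#!/bin/bash
#SBATCH --job-name=haslr
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=32

# Activate the conda environment created above
. ~/conda/etc/profile.d/conda.sh
conda activate haslr

# -g is a rough genome-size guess; -l is the Nanopore run, -s the Illumina pair.
# If HASLR complains about gzipped input, gunzip the reads first.
haslr.py -t 32 -o dMau_haslr -g 130m -l SRR7167952.fastq.gz -x nanopore -s SRR6425993_1.fastq.gz SRR6425993_2.fastq.gz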

What HASLR does now is generate a series of jobs that it will submit to the queue itself. If you plan to use HASLR for your own research, you should definitely read all of the documentation. When all of those jobs are finished, you should have a complete assembly. This will take many hours and, depending on your genome size, may take weeks. For example, over the summer I attempted an assembly of a gopher genome (~2 Gb) and finally gave up because it was taking too long; we ended up going with a different assembler. But HASLR looks promising and works fine for smaller eukaryotic genomes, so I wanted to include it here.

The final assembly consists of a file called 'dMau_haslr_final.fa'.
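A fast smoke test before polishing is to count the contigs:

grep -c ">" dMau_haslr_final.fa

Quast, which you'll run at the end of this exercise, will give you the real metrics.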

POLISHING THE ASSEMBLY

After the assembly is complete, researchers will often polish assemblies like this one because Nanopore data, like PacBio data, are error-heavy (12-15%). If you sequence to high enough coverage, that problem can be overcome because the errors are thought to be random. However, deep coverage is expensive, and one alternative is to use cheaper, more accurate Illumina data to correct any errors that may have crept in.

As you're finding, this can be done with multiple tools. There are quite a few polishers: Nanopolish, Pilon, Racon, etc. For this work, we'll use POLCA, which is part of the MaSuRCA genome assembly package. I chose it because it's both easy to use and does a good job. If I had to choose another, it would be Pilon, but you've already run a Pilon analysis as part of Unicycler. This process takes many hours, even using 36 processors. For that reason, I've put together a submission script to do the job.

Get the software.

. ~/conda/etc/profile.d/conda.sh

conda activate

conda create --name masurca

conda activate masurca

conda install -c bioconda masurca bwa

Now, get the script you are to run.

cd /lustre/scratch/[eraider]/gge2022/haslr/bin

cp /lustre/scratch/daray/gge2022/haslr/bin/polca_haslr_sbatch.sh .

Run it with sbatch polca_haslr_sbatch.sh.

(NOTE: if you receive a "polca.sh not found" error when you run this script, download and use /lustre/scratch/jkorstia/gge2022/haslr/bin/polca_haslr_sbatch_IF_ERROR.sh, which has the full path to my installation of polca.sh and seems to bypass the error some of your fellow students received.)

All of these polishers do essentially the same thing. They take an assembly that was created using long but error-prone reads and map more accurate but shorter (and usually more plentiful) reads to that assembly using a tool like bowtie2 or bwa. Then, using the more accurate reads, they try to find potential errors in the assembly and correct them. FYI, this step took a couple of hours when I tried it using 36 processors; take that into account when planning your work. The final, polished assembly should be named 'dMau_haslr_final.fa.PolcaCorrected.fa'.
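For reference, the core of the polca submission script is a single polca.sh call. A minimal sketch, assuming the file names from earlier steps (the script you copied may differ in its details):

polca.sh -a dMau_haslr_final.fa -r 'SRR6425993_1.fastq.gz SRR6425993_2.fastq.gz' -t 36 -m 1G

Here -a is the assembly to polish, -r is the quoted list of Illumina read files, -t is the thread count, and -m is the memory allotted per thread for sorting the alignments.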

FOR YOU TO DO

Use Quast to compare the polished and unpolished genome assemblies to the polished assembly generated in the paper from which we got the data. The first thing you'll need to do is get the published assembly using:

wget https://github.com/danrdanny/Drosophila15GenomesProject/raw/master/assembledGenomes/Dmau.pass.minimap2.racon.x3.pilon.x3.fasta.gz

Then, you'll need to generate an appropriate submission script to compare the three versions. This time, I'm going to ask you to generate your own script. Use the previous runs of quast as a guide to creating your own.
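As a reminder of the general shape (not the full submission script, which is yours to write), a multi-assembly quast comparison looks something like this; the output directory name is arbitrary, and quast accepts gzipped fasta:

quast.py -t 8 -o quast_dMau dMau_haslr_final.fa dMau_haslr_final.fa.PolcaCorrected.fa Dmau.pass.minimap2.racon.x3.pilon.x3.fasta.gz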

  1. First, compare and contrast our assembly process to the one used in the manuscript (https://www.g3journal.org/content/8/10/3131.long). The assembly generated for the manuscript followed this procedure:

"Minimap2 (Version 2.1-r311) (Li 2018) and miniasm (version 0.2-r168-dirty) (Li 2016) were used to assemble either all reads or only those reads with quality scores ≥7 (Table 3). Polishing was completed using either Racon (Vaser et al. 2017) or Pilon (Walker et al. 2014) alone or in combination (Table 4). Dot plots were generated with nucmer and mummerplot (Delcher et al. 1999); for clarity, only contigs or scaffolds >100 kb were plotted. Assemblies using only reads that passed quality filter were subjected to either four iterations of polishing with Racon, six iterations of polishing with Pilon, or three iterations of polishing with Racon followed by three iterations of polishing with Pilon."

  2. Create your script to run quast, make sure it runs to completion, and then upload the path to your quast directory to Assignment 13 - HASLR assembly.

  3. Which of the assemblies you generated is better? Explain your answer using information from the Quast results.

  4. Just knowing how to run a script is something anyone can do. Knowing the details of what that script does is important if you ever need to troubleshoot. And we ALWAYS need to troubleshoot. For that reason, I've intentionally left haslr_sbatch.sh, polca_haslr_sbatch.sh, and the quast scripts undocumented. Your job is to provide a comment telling a reader what each line does. Document the haslr and polca scripts by adding a comment before any line that has a command. For example, in the haslr_sbatch.sh script there is this line: find . -type f -name "asm.final.fa" -exec cp "{}" dMau_haslr_final.fa \;. You could document the purpose of that line by adding a comment line just above it as follows:

#This line searches for the final assembly by name (see documentation at https://github.com/vpc-ccg/haslr), and copies that file to the current directory, renaming it as dMau_haslr_final.fa.
find . -type f -name "asm.final.fa" -exec cp "{}" dMau_haslr_final.fa \;

This may take a little exercising of your google-fu skills.

Save your files and upload those files to 'Assignment 13 - HASLR assembly'.