Deprecated 13. Long Read Assembly and Polishing - davidaray/Genomes-and-Genome-Evolution GitHub Wiki
Long-read data are the wave of the future and resolve many of the contiguity problems that arise because of repeats and other ambiguous regions of the genome. I know some of the graduate students in this department, and possibly this class, are using long-read data for actual genome assemblies in their research. In the past year, we've been involved in at least 16 projects that use either Nanopore or PacBio data. There are various types of long-read data and, just as with short-read data, there are quite a few software packages to try. We'll be using Canu on a simple assembly of Drosophila mauritiana. In this case, we'll assemble using nanopore data and polish the assembly with Illumina data, a common practice these days.
We'll be using data from the publication *Highly Contiguous Genome Assemblies of 15 Drosophila Species Generated Using Nanopore Sequencing*. The manuscript describes data from many species, but we'll only be using data from one.
So far in this class, I've been providing you with data for each project, but it wouldn't hurt for you to see how to get data that others have generated from one of the many databases that store it. In this case, we'll be using the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI). This is the same general database you used to get the reference assembly for Helicobacter pylori in a previous exercise. This time, we're just going to the section that houses raw read data associated with publications and entries.
Get started as always by getting set up.
interactive -p nocona
mkdir /lustre/scratch/<eraider>/gge2021/data/dmauritiana
cd /lustre/scratch/<eraider>/gge2021/data/dmauritiana
To get the data, we're going to use a set of tools from the SRA Toolkit (sra-toolkit). Each of the individual tools is described in the sra-tools documentation.
Unfortunately, while there is a conda installation for the sra-toolkit, it's old and doesn't include the tool we need. So, we're just going to use my installation. That means doing things a little differently, but it should seem at least a little familiar given previous exercises.
There are several SRA tools, but we're going to use fasterq-dump. It will take the native SRA format, .sra, and download and convert the files to .fastq simultaneously. It will also split any paired-end read data into two fastq files for you automatically. The first step just makes life a little easier by storing the directory where I have sra-toolkit installed in a shell variable. While still in your 'dmauritiana' directory:
SRA=/lustre/work/daray/software/sratoolkit.2.9.6-1-centos_linux64/bin
The previous command just sets up a variable called SRA that now represents the path to the tools you need. Note the inclusion of the "$" symbol on the lines below; it expands the variable back into that path.
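If you want to see the expansion for yourself, this quick sketch shows how the variable turns into the full path of a tool:

```shell
# SRA holds the path to the sra-toolkit binaries; prefixing a tool name
# with "$SRA/" expands to that tool's full path.
SRA=/lustre/work/daray/software/sratoolkit.2.9.6-1-centos_linux64/bin
echo "$SRA/fasterq-dump"
```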
The following two commands will download the data for this project. The first targets short, paired-end reads (two files) and the second targets long, nanopore reads (a single file). So, this could take more than a few minutes. You should probably use two terminals for this. Use an interactive session for both.
$SRA/fasterq-dump SRR6425993
$SRA/fasterq-dump SRR7167952
Once everything is downloaded, this last line just compresses everything so that it takes up less space.
gzip *.fastq
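Once the files are compressed, it's worth a quick sanity check. A fastq record is always four lines, so dividing the line count by four gives the read count. The sketch below demonstrates on a tiny made-up file; the same two commands work on your real SRR*.fastq.gz files:

```shell
# Create a small two-read example fastq and compress it (a stand-in for
# the real downloaded data).
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > example.fastq.gz
# Count reads: total lines divided by 4 (fastq uses 4 lines per record).
reads=$(( $(gunzip -c example.fastq.gz | wc -l) / 4 ))
echo "example.fastq.gz: $reads reads"
rm example.fastq.gz
```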
The Canu documentation is deceptively simple. However, as you will find if you work on unix systems often, different queueing systems exist and sometimes they’re easy, sometimes they’re not. We have the latter case and it took a little work with HPCC to get Canu to work on our system but it now does.
As you've likely noticed, we've been running assemblies for genomes that are increasingly complex. That means they take more data and they take longer to run. This one is the largest we've attempted so far. Indeed, while you could run it directly in a terminal, I don't recommend it. Instead, we will replicate the last exercise in which you used a submission script. I've already written one for you to modify as necessary. Get it as follows:
mkdir /lustre/scratch/<eraider>/gge2021/canu/bin
Note that in this case, we are creating a separate directory for our submission scripts. This can be a good practice, especially if the output of your process is a lot of files and directories. Having a dedicated bin directory allows you to avoid losing your run scripts in a morass of output files.
cd /lustre/scratch/<eraider>/gge2021/canu/bin
cp /lustre/scratch/daray/gge2021/bin/canu_sbatch.sh .
As usual, you will need to edit it to have your eraider id in the correct spot.
Get it running using sbatch canu_sbatch.sh.
What Canu does now is generate a series of jobs that it will submit to the queue itself. If you plan to use Canu for your own research, you should read all of the documentation. When all of those jobs are finished, you should have a complete assembly. This will take many hours and, depending on your genome size, may take weeks. For example, over the summer, I attempted an assembly of a gopher genome (~2 Gb) and I finally just gave up because it was taking too long. We ended up going with a different assembler. But Canu is relatively common, so I want to include it and it works fine for smaller eukaryotic genomes.
The final assembly consists of a file called 'dMau_nanopore.contigs.fasta'. You can see how it gets that name by viewing the sbatch script.
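For orientation, the heart of a submission script like this is a single canu call. The values below are illustrative assumptions (a rough genomeSize estimate, made-up directory name), not a copy of canu_sbatch.sh; the command is composed as a string so you can see how the -p prefix produces the 'dMau_nanopore.contigs.fasta' name:

```shell
# All values here are assumptions for illustration, not from canu_sbatch.sh.
PREFIX=dMau_nanopore          # -p: output prefix, hence <prefix>.contigs.fasta
OUTDIR=dmau_canu              # -d: working/output directory
GSIZE=150m                    # rough genome-size estimate for D. mauritiana
READS=SRR7167952.fastq.gz     # the nanopore reads downloaded earlier
CMD="canu -p $PREFIX -d $OUTDIR genomeSize=$GSIZE -nanopore-raw $READS"
echo "$CMD"
```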
After the assembly is complete, researchers will often polish assemblies like this one because Nanopore data, like PacBio data, is error-heavy (12-15%). If you sequence to a high enough coverage, that problem can be overcome because the errors are thought to be random. However, that's expensive, and one alternative is to use cheaper and more accurate Illumina data to correct any errors that may have crept in.
As you're finding, this can be done with multiple tools. There are quite a few polishers: Nanopolish, pilon, racon, etc. For this work, we'll use polca, which is part of the MaSuRCA genome assembly package. I chose it because it's both easy to use and does a good job. If I had to choose another, it would be pilon, but you've already run a pilon analysis as part of Unicycler. This process takes many hours, even using 36 processors. For that reason, I've put together a submission script to do the job.
Get the software.
. ~/conda/etc/profile.d/conda.sh
conda activate
conda create --name masurca
conda activate masurca
conda install -c bioconda masurca
Now, get the script you are to run.
cd /lustre/scratch/<eraider>/gge2021/canu/bin
cp /lustre/scratch/daray/gge2021/bin/polca_qsub.sh .
Run it.
All of these polishers do essentially the same thing. They take an assembly that was created using long but error-prone reads and map more accurate but shorter (and usually more plentiful) reads to that assembly using any of several tools like bowtie2 or bwa. Then, using the more accurate reads, they try to find potential errors in the assembly and correct them. FYI, this step took around 7 hours when I tried it using 36 processors. Take that into account when planning your work.
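The polca run inside polca_qsub.sh likely boils down to a single invocation like the one sketched below. The file names and thread count are assumptions based on this exercise (the Illumina read pair and the Canu output), composed as a string for inspection rather than copied from the actual script:

```shell
# Assumed inputs: the Canu assembly and the paired-end Illumina reads.
ASM=dMau_nanopore.contigs.fasta
R1=SRR6425993_1.fastq.gz
R2=SRR6425993_2.fastq.gz
# polca.sh ships with the masurca package; -a assembly, -r reads, -t threads.
CMD="polca.sh -a $ASM -r '$R1 $R2' -t 36"
echo "$CMD"
```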
Finally, compare the polished and unpolished genomes using quast. This time, I'm going to ask you to generate your own script. Use the previous runs as a guide to creating your own. Then do the exercise below.
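As a starting point for your own script, a quast comparison typically takes both assemblies on one command line. The polished-assembly file name below is an assumption (polca usually appends a suffix to the input name), so check it against your actual output before using anything like this:

```shell
# Assumed file names -- verify against your own run.
UNPOLISHED=dMau_nanopore.contigs.fasta
POLISHED=dMau_nanopore.contigs.fasta.PolcaCorrected.fa
CMD="quast.py -t 4 -o quast_dmau $UNPOLISHED $POLISHED"
echo "$CMD"
```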
- Create your script, make sure it runs to completion and then upload the path to your quast directory to Assignment 13 - Canu assembly.
Just knowing how to run a script is something anyone can do. Knowing the details of what that script does is important if you ever need to troubleshoot. And we ALWAYS need to troubleshoot. For that reason, I've intentionally left the canu_sbatch.sh, polca_qsub.sh, and quast scripts undocumented. Your job is to provide a comment for each line of those scripts telling a reader what each line does. Start at line 14 and document every line after '#$ -P quanah'. I did leave the abyss_qsub.sh script from a previous exercise documented for you to use as an example.
- Document the canu and polca scripts by commenting before any line that has a command. For example, in the canu_sbatch.sh script there is this line: CANU_PATH=/lustre/work/daray/software/canu-1.8/Linux-amd64/bin. You could document the purpose of that line by adding a comment line just above it as follows:
#This line defines the path to the canu-1.8 software.
CANU_PATH=/lustre/work/daray/software/canu-1.8/Linux-amd64/bin
Now, do the same for all of the lines below '#$ -P quanah' in both scripts. Save your files and upload those files to 'Assignment 13 - Canu assembly'.