15. Long Read Genome Assembly with Flye - davidaray/Genomes-and-Genome-Evolution GitHub Wiki
Flye is a de novo assembler for single molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PacBio or Nanopore reads as input and outputs polished contigs. Note that last part. Given high enough coverage, Flye polishes the assembly for you, relying on the assumption that errors in the reads are random rather than systematic.
We're using this package as a comparison to Abyss and will be assembling the same genome using the same data. The difference is in the assembly strategy. Abyss assembles using the short reads, uses those short reads to correct the long reads, and then uses the long reads to assemble the short-read contigs into scaffolds. This causes the choice of k-mer size to have a major impact on the quality of the assembly. Flye works differently. It uses the long reads directly in building the assembly and attempts to polish the final assembly using only that data. Optionally, you can polish the final assembly with the accurate Illumina reads afterward. Let's see what differences are apparent.
To install the software, you’ll need to do the following.
Log in and get an interactive session so that you’re not working on the head node.
interactive -p nocona
There are multiple ways to install Flye. The easiest is using conda, but if that doesn't work, you can use the GitHub repository as follows:
cd ~
git clone https://github.com/fenderglass/Flye
cd Flye
make
At this point, the file you invoke to use Flye should be ~/Flye/bin/flye. To use that file, the easiest thing to do is to get it into your path whenever you want to use it:
export PATH=$PATH:~/Flye/bin
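Before submitting a job, it can help to confirm that the shell actually finds the executable. The following is a minimal sketch, assuming Flye was cloned and built in ~/Flye exactly as above; the variable name FLYE_BIN is only illustrative.

```shell
# Assumes Flye was cloned and built in ~/Flye as described above.
FLYE_BIN="$HOME/Flye/bin"

# Append the directory to PATH only if it is not already listed.
case ":$PATH:" in
  *":$FLYE_BIN:"*) ;;
  *) export PATH="$PATH:$FLYE_BIN" ;;
esac

# Check that the shell can now find the executable.
if command -v flye >/dev/null 2>&1; then
  flye --version
else
  echo "flye not found; check that the build in ~/Flye finished" >&2
fi
```

Because export only affects the current shell, the PATH line has to be repeated in each new session or placed inside any submission script that calls flye.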
Assignment 12 - Flye assembly
- Write a submission script to run Flye on the same C. elegans long read data we used for the Abyss assembly. It should run using 64 processors on nocona and two polishing iterations. You can read the documentation for Flye to get hints but the basic command will be along these lines:
flye --nano-raw <read files> --genome-size <estimated genome size (for example, 5m or 2.6g)> --out-dir <path/to/output/dir> --threads <int> --iterations <int>
If you want more information on other possible options, check the Flye GitHub usage page: https://github.com/fenderglass/Flye/blob/flye/docs/USAGE.md
For this exercise, we want to use all of the nanopore data. To accomplish that, I've combined the two sets of nanopore reads into a single file located at /lustre/scratch/daray/gge2023/data/celegans/combined_nanopore_celegans.fasta.
- Run the assembly in a 'flye/celegans' folder in your gge2023 directory, /lustre/scratch/[eraider]/gge2023/flye/celegans, so that I can find the assembly.fa file.
- Using the benchmarking information from the Flye documentation web page, predict how long it will take the assembly to complete.
- Using the timestamps on the .err file from your submission and the assembly.fa, determine how long the assembly actually took. Explain any differences you notice.
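As a starting point for the submission script, here is one possible sketch rather than a definitive answer. It writes a SLURM script to the current directory. The script name flye_celegans.sh, the job name, and the log file patterns are hypothetical choices. It also assumes that $USER matches your eraider ID and that 100m is an acceptable rough genome size for C. elegans; check both before submitting.

```shell
# Write a hypothetical SLURM submission script to the current directory.
# The quoted 'EOF' keeps variables unexpanded until the job itself runs.
cat > flye_celegans.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=flye_celegans
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=64

# Make the locally built Flye visible inside the job.
export PATH=$PATH:~/Flye/bin

# Assumption: $USER is the same as your eraider ID.
WORKDIR=/lustre/scratch/$USER/gge2023/flye/celegans
READS=/lustre/scratch/daray/gge2023/data/celegans/combined_nanopore_celegans.fasta

mkdir -p "$WORKDIR"

# Assumption: 100m is a rough C. elegans genome size; adjust if needed.
flye --nano-raw "$READS" \
     --genome-size 100m \
     --out-dir "$WORKDIR" \
     --threads 64 \
     --iterations 2
EOF

echo "Wrote flye_celegans.sh; submit it with: sbatch flye_celegans.sh"
```

The thread count matches the 64 processors requested from the scheduler, and --iterations 2 gives the two polishing iterations the assignment asks for.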
Now, let's compare this assembly to the best one from Abyss.
Copy and then modify the QUAST submission script that you used when comparing the three Abyss assemblies of C. elegans. Modify it so that you are also comparing the Flye assembly to those three.
- Using the output, identify the major differences between them when the reference assembly is ignored. Which assembly gets the best scores for largest contig and N50? Using this information, explain why long-read assembly is considered the new state of the art.
- Use Bandage to examine the assembly graph file. Add a screenshot of that image to your submitted assignment.
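To make the largest contig and N50 values in the QUAST report more concrete, the sketch below computes both directly from a FASTA file. It is only a sanity check under simple assumptions (plain FASTA, no filtering of short contigs), not a replacement for QUAST, so its numbers may differ slightly from the report. The function name asm_stats and the file toy.fa are made up for illustration.

```shell
# asm_stats FILE: print contig count, largest contig, and N50 for a FASTA file.
asm_stats() {
  awk '
    /^>/ { if (len) print len; len = 0; next }   # header: emit previous contig length
         { len += length($0) }                   # sequence line: extend current contig
    END  { if (len) print len }
  ' "$1" | sort -rn | awk '
         { lens[NR] = $1; total += $1 }
    END  {
      half = total / 2
      # Walk contigs from longest to shortest until half the bases are covered.
      for (i = 1; i <= NR; i++) {
        run += lens[i]
        if (run >= half) { n50 = lens[i]; break }
      }
      printf "contigs=%d largest=%d N50=%d\n", NR, lens[1], n50
    }
  '
}

# Tiny made-up example: contigs of 8, 6, 4, and 2 bases (total 20).
printf '>a\nACGTACGT\n>b\nACGTAC\n>c\nACGT\n>d\nAC\n' > toy.fa
asm_stats toy.fa
# prints: contigs=4 largest=8 N50=6
```

In the toy example, half of the 20 total bases is 10. The longest contig alone covers 8, and adding the 6-base contig passes 10, so the N50 is 6.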
Combine any scripts you created for this exercise and any answers to questions into a single Word document. Upload that document to Blackboard under the appropriately labeled exercise.