Requirements - rrwick/Perfect-bacterial-genome-tutorial GitHub Wiki

Skills

This tutorial was written with the assumption that you have basic Linux/Unix/Mac CLI skills: you should be able to navigate directories, run tools, use the pipe, etc. It also assumes familiarity with bioinformatics file formats like FASTA and FASTQ.

Here's a one-liner I made up as a skill test:

cat *.fasta | grep '>' | sort > headers

Can you follow the logic and understand that command? It takes the contents of a bunch of FASTA files (cat *.fasta), filters for the header lines (grep '>'), alphabetises the results (sort) and puts them in a file called headers. If that was clear to you, then you should be good to go! However, if that command looks like an incomprehensible foreign language, then you might find this tutorial difficult.
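If you'd like to try that command on something small first, here's a minimal sketch that builds it up one stage at a time. The file names and sequences are made up for illustration, and everything happens in a scratch directory.

```shell
# Make a scratch directory with two tiny, made-up FASTA files.
workdir=$(mktemp -d)
printf '>contig_2\nACGT\n>contig_1\nGGCC\n' > "$workdir/a.fasta"
printf '>plasmid_1\nTTAA\n' > "$workdir/b.fasta"

# Stage 1: the contents of all FASTA files (headers and sequences).
cat "$workdir"/*.fasta

# Stage 2: keep only the header lines.
cat "$workdir"/*.fasta | grep '>'

# Stage 3: sort the headers and save them to a file called headers.
cat "$workdir"/*.fasta | grep '>' | sort > "$workdir/headers"

# The file should now hold >contig_1, >contig_2 and >plasmid_1, in that order.
cat "$workdir/headers"
```

Note the quotes around '>' in the grep stage: without them, the shell would treat > as output redirection.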

Data

You'll need a good hybrid read set of both Illumina and ONT reads to assemble. Visit the Sample data page to download suitable S. aureus data. The easy and medium versions of the tutorial assume that you are using the R10.4 and Illumina reads from this sample data set. The hard version is more general and can be done with any good read set.

What do I mean by 'good'? In order to get a perfect bacterial genome assembly, your reads should be deep: ideally 200× or more for both Illumina and ONT. And your ONT reads should be long (ideally an N50 of 15 kbp or more). If your data doesn't meet this high standard, you can certainly still follow the tutorial, but it might be harder, and you might not be able to assemble your genome to zero-error perfection.
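To check a read set against those numbers, something like the following sketch could work. The function names read_stats and mean_depth are hypothetical (they are not part of any tool in this tutorial), and the code assumes an uncompressed FASTQ file with exactly four lines per read. Mean depth is simply total bases divided by genome size, and the N50 is the read length at which the longest reads add up to at least half of all bases.

```shell
# Print read count, total bases and read N50 for an uncompressed FASTQ file.
# Assumes four lines per record (sequences are not line-wrapped).
read_stats() {
    awk 'NR % 4 == 2 { print length($0) }' "$1" |
        sort -rn |
        awk '{ len[NR] = $1; total += $1 }
             END {
                 running = 0
                 n50 = 0
                 # Walk from longest to shortest until half the bases are covered.
                 for (i = 1; i <= NR; i++) {
                     running += len[i]
                     if (running >= total / 2) { n50 = len[i]; break }
                 }
                 printf "reads=%d bases=%.0f n50=%d\n", NR, total, n50
             }'
}

# Mean depth = total bases / genome size (both in bp), rounded to a whole number.
mean_depth() {
    awk -v bases="$1" -v genome="$2" 'BEGIN { printf "%.0f\n", bases / genome }'
}
```

For example, assuming a genome of roughly 2.8 Mbp, mean_depth 560000000 2800000 prints 200, so about 560 Mbp of reads would be needed to reach 200× depth. If your reads are gzipped, decompress them to a plain FASTQ file before running read_stats.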

Software

You'll need quite a few command-line tools installed to do this tutorial:

Read alignment: minimap2, BWA and Samtools
Read QC: Filtlong and fastp
Consensus long-read assembler: Trycycler
- Trycycler has a few other software requirements; see the software requirements page of its documentation.
Required long-read assemblers: Flye, Raven and miniasm/Minipolish
Optional long-read assemblers: Canu, NECAT, NextDenovo/NextPolish, Redbean and Shasta
Long-read polisher: Medaka
Required short-read polishers: Polypolish, POLCA and ropebwt2/FMLRC2
Optional short-read polishers: ntEdit, HyPo and NextPolish
Sequence file manipulation: seqtk
Alignment visualisation: IGV
A text editor that can handle large (multi-megabyte) files:
- Sublime Text and Atom are good choices if you want a GUI editor.
- Command-line editors like Vim and Emacs are also appropriate.
A phylogeny-viewer, such as FigTree
Optional tools for reference-free assembly assessment: ALE and Prodigal

As every bioinformatician knows, installing software can often be the hardest part! I usually install tools in one of the following ways:

Use a package/environment manager, e.g. apt (Ubuntu), Homebrew (Mac) or conda/Bioconda (all platforms). This is an easy option, assuming your tool is available from that manager. Conda is also good for organising software installations, e.g. you could make a conda environment called assembly with this tutorial's tools (see below).
Download a pre-compiled binary. This is a great option if the tool provides a ready-to-use executable file appropriate for your OS and CPU. You can often find these in the 'Releases' section of the tool's GitHub page. I then copy the executable file to a location that's in my PATH environment variable (e.g. ~/.local/bin) so it's easy to run.
Build from source. This is sometimes hard, especially for tools with lots of dependencies. I usually only resort to this when the above options don't work or when I need the very latest version of a tool.
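For the pre-compiled binary route, the copy-to-PATH step might look like this minimal sketch. The function name install_binary is hypothetical, and ~/.local/bin is just the default location used here; any directory in your PATH would do.

```shell
# Copy an executable into a bin directory (default: ~/.local/bin),
# make it executable, and report whether that directory is in PATH.
install_binary() {
    exe=$1
    bin_dir=${2:-"$HOME/.local/bin"}
    mkdir -p "$bin_dir"
    cp "$exe" "$bin_dir/"
    chmod +x "$bin_dir/$(basename "$exe")"
    case ":$PATH:" in
        *":$bin_dir:"*)
            echo "$bin_dir is already in PATH" ;;
        *)
            echo "add this to your shell profile: export PATH=\"$bin_dir:\$PATH\"" ;;
    esac
}
```

You would call it with the path to the downloaded executable, optionally followed by a destination directory.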

Conda environment

I tried to make a conda environment with all of the command-line tools listed above, and I nearly succeeded. Grab the environment.yml file from this repo and then create an environment using this command:

conda env create --file environment.yml --name assembly

(You can substitute mamba for conda in that command to run it faster.)

A few tools are not included: NextDenovo/NextPolish, HyPo and ALE, because these had conflicts with other tools, so you might want to install them separately. Also note that there are a few different tools out there named 'ALE', so be sure to grab the correct one!

Also, the miniasm_and_minipolish.sh script won't be installed by conda, so you might want to download that one yourself, make it executable and put it in an appropriate location (e.g. your conda environment's bin directory).
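Once everything is installed, a quick way to see what's still missing is to ask the shell whether each command can be found. This is only a sketch: check_tools is a hypothetical helper name, and you'd pass it whichever command names you want to check.

```shell
# Report which of the given commands can be found in PATH.
# Returns non-zero if any of them are missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" > /dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

After running conda activate assembly, you could try check_tools minimap2 bwa samtools filtlong fastp trycycler flye raven seqtk miniasm_and_minipolish.sh (a partial list, not every tool above). When a conda environment is active, its bin directory is "$CONDA_PREFIX/bin", which is one place to put scripts that conda doesn't install for you.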

Hardware and OS

Since bacterial genomes are small, none of the steps in this tutorial require a high-spec computer. I was able to do the entire process on my MacBook (M1 Pro, 10 CPU cores), where it took about 1.5 hours and used about 4 GB of RAM. However, some of the steps can take advantage of many threads (e.g. read alignment) or can be run simultaneously (e.g. assemblies for Trycycler), so if you have access to a large server with lots of CPUs, it can be faster.
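To take advantage of the threads you have, one approach is to count the available CPU cores once and pass that number to each tool's thread option. The sketch below assumes getconf is available (it is on typical Linux and macOS systems) and falls back to a single thread otherwise. The file names in the commented-out minimap2 line are placeholders.

```shell
# Count the available CPU cores, falling back to 1 if that fails.
threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "Using $threads threads"

# Example of passing it on (placeholder file names, so left commented out):
# minimap2 -t "$threads" -a -x map-ont assembly.fasta reads.fastq > alignments.sam
```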

Regarding CPU type and OS, an x86 Linux machine will probably be easiest for software installation (generally true in bioinformatics). An older x86 Mac is also a good option. The tutorial is possible on newer Apple-silicon Macs, but some pieces of software will take troubleshooting to install – hopefully this will get easier as tools are updated with better Apple silicon support. I haven't tried this tutorial on Windows, but I presume that it would work via the Windows Subsystem for Linux (WSL).