Microbiome Helper 2 beginner microbiome analysis - LangilleLab/microbiome_helper GitHub Wiki
Authors: Robyn Wright
Modifications by: -
Getting started with microbiome analyses can be fairly overwhelming - there's a lot of information out there, and much of it assumes that you already have a computer to work with, know what kind of data you'll be working with, know roughly what you'll need to do, and so on. From our own experiences (and from running enough workshops where we teach microbiome analysis to beginners), we know this isn't always the case. We've aimed to provide all of the information that you'll need to get started with your own analyses, but if anything isn't clear then please feel free to post on the Issues page at the top (making sure you're specific about which page and step you're unclear on!).
When we're talking about microbiome analyses, we're usually talking about the characterisation of a microbial community (human gut, ocean, soil, etc.) using DNA/RNA sequencing techniques, and we're usually sequencing them using one of two methods:
- Amplicon or marker gene sequencing (16S, 18S, ITS, etc.):
  - "Amplicon" - a piece of DNA or RNA that is the result of amplification via PCR
  - Sequences a universal gene/barcode
  - Used to identify the taxa within a sample (or the diversity within a function, e.g. nifH)
  - Restricted to organisms containing the gene targeted by the amplicon primers - host contamination isn't usually a big issue
- Metagenome/metatranscriptome sequencing:
  - Sequences everything in the sample
  - Answers both "who is there?" and "what are they doing?"
  - Host contamination or high diversity can mean really high read depths are needed
  - Comparison with databases is more complicated, and there are many more "unknowns"
The choice of method will often come down to cost: amplicon sequencing is likely to cost ~$20 (CAD) per sample (Illumina, 2x300bp) or ~$30 per sample (PacBio, long reads), while metagenome sequencing costs ~$120-2,000 per sample (prices taken from IMR 2026 pricing). As mentioned above, though, high host contamination can change this - in some human body sites it is expected that >99% of the DNA in a sample is human and <1% microbial, and in these cases the cost of sequencing deeply enough to get adequate coverage of the microbial community may make metagenome sequencing unfeasible. It will also depend on what you want to know. If you just want an overview of the different taxa that are present then amplicon sequencing is likely to be plenty, but if you also want to know which functions are present in those taxa then you will likely need to do metagenome sequencing (metatranscriptome sequencing will tell you about the active members of the community). There are a lot of other factors to consider when choosing a sequencing method, too, and it is not our aim here to cover them comprehensively, but if you would like help then we do have a "wizard"/guide that you can access from the IMR website.
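To make the host-contamination point concrete, here is a back-of-the-envelope sketch. The 1% microbial fraction is the example figure from the text above; the target of one million microbial reads is an arbitrary illustration, not a recommendation:

```shell
# Illustrative only: if just 1% of reads in a sample are microbial, how deep
# do we need to sequence to end up with 1,000,000 microbial reads?
microbial_percent=1
target_microbial_reads=1000000
total_reads_needed=$(( target_microbial_reads * 100 / microbial_percent ))
echo "$total_reads_needed"   # 100000000 - i.e. 100M total reads
```

A hundredfold increase in required depth is why heavily host-contaminated samples can push metagenome sequencing costs out of reach.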
Another thing that is worth keeping in mind is how you will analyse the sequencing data that you obtain. The methods and workflows used for amplicon sequencing data are much more mature (mainly due to the cost of sequencing being much lower for a long time) and aren't so dependent on the environment that you are looking at - I would usually advise that someone works through an amplicon sequencing data analysis before working on metagenome samples (we have instructions on obtaining tutorial data for both). It is also worth considering where you will store and analyse the sequencing data. We are working on putting together some more comprehensive guidance on this, but just as an indication, I'll give some typical compressed (zipped) file sizes for different types of sequencing along with how many sequences they contain:
- Illumina 16S rRNA gene V4-V5 amplicon sequencing: 48MB (20MB forward and 28MB reverse reads), 144,888 paired-end reads
- PacBio 16S rRNA gene full length sequencing: 146MB, 306,468 reads
- Illumina metagenome: 825MB (407MB forward and 418MB reverse reads), 4,626,102 paired-end reads
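If you want to check how many reads are in one of your own files, a FASTQ record is always four lines (header, sequence, separator, quality scores), so the read count is simply the line count divided by four. A minimal sketch - the tiny two-read file generated here is just a stand-in for your own file:

```shell
# Generate a tiny two-read FASTQ file as a stand-in for real data.
printf '@read1\nACGT\n+\nIIII\n@read2\nTTTT\n+\nIIII\n' | gzip > sample_R1.fastq.gz

# Each FASTQ record is 4 lines, so reads = lines / 4.
lines=$(gzip -cd sample_R1.fastq.gz | wc -l)
echo $(( lines / 4 ))   # 2
```

For paired-end data, the forward and reverse files should contain the same number of reads, so counting one of them is enough.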
I am not a computer scientist and I am therefore not aiming to give a comprehensive overview of what any of these things mean, but Google is your friend :)
In short, there are three key things to consider when looking into the computing that you need:
- RAM (Random Access Memory) - this is the most difficult part to estimate, but it is the key thing that will separate the computer you'll need for most analyses from your personal computer. As a general rule, you'll need ~16-64GB RAM for amplicon analyses and much more (>100GB) for all but the tiniest metagenome sequencing files. I'll give some more guidance below.
- Storage space - while you can eventually submit your sequencing data to an archiving service like NCBI/ENA, you will still need to store it while you analyse it. You can estimate this based on the file sizes I gave above and the number of reads you have asked for in your sequencing. You will also need enough space for lots of intermediate files. For example, for a recent analysis that I did on PacBio full-length 16S rRNA genes:
  - 12 samples, 2,622,979 total reads
  - Raw data: 1.3GB
  - All analysis files created: 3.3GB
  - We are aiming to give some more comprehensive details on this eventually, but as a ballpark, 3-5x the size of your raw data is likely to be sufficient.
- Operating system - most bioinformatic analyses are optimised for Linux (which means they tend to work fine on a Mac for the most part, too). Some programs will only run on certain systems (while other common, well-supported ones will work on any), so if there is anything in particular that you know you want to run on your samples, it is worth checking its system requirements (both operating system and version) before sequencing your samples.
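Before committing to an analysis, it's worth checking what the machine you're on actually has. On a Linux server, something like the following works (the exact commands vary between systems, and the raw data size below is just the PacBio example from above, so treat this as a sketch):

```shell
# Total RAM on the machine (Linux-specific file):
grep MemTotal /proc/meminfo

# Free disk space on the filesystem holding the current directory:
df -h .

# Operating system and kernel version:
uname -sr

# Ballpark working-space estimate: 3-5x the raw data size.
raw_mb=1300   # e.g. 1.3GB of raw PacBio data
echo "plan for roughly $(( raw_mb * 3 ))-$(( raw_mb * 5 )) MB of space"
```

If those numbers fall short of the guidance above, that's your cue to look into a server or cloud instance rather than running locally.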
Your institution may well already have access to a server for analysis, and some countries also have national computing resources that are accessible to any researcher within them (for example, Canada has the Digital Research Alliance of Canada - previously Compute Canada - which I believe any researcher in Canada can apply to use). It is worth speaking to someone at your institution about these options first, but one thing you should ask about is how much freedom you will have to install the programs you need (or how you would go about requesting that they be added). Because this can be a bit of a pain, people will often go with their own options where feasible. An option that we have now used quite a bit (and that avoids needing to buy a server yourself) is Amazon Web Services (AWS) instances. These can be a bit overwhelming for beginners, but we have provided some guidance on setting up an AWS instance for your analyses that already has the required programs installed. Essentially, these allow you to pay for use of Amazon's servers: you choose the computing resources you need and pay for the time that you use them (which also means that you need to remember to stop an instance when it's not in use, so you don't get charged for time you're not using!).
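As a hedged example of the "remember to stop it" point: with the AWS command-line client installed and configured, an instance can be stopped from the terminal. The instance ID below is a made-up placeholder (yours is shown in the EC2 console), and the command is only written to a file and syntax-checked here, since actually running it requires real credentials:

```shell
# Placeholder instance ID - replace with your own from the EC2 console.
cat > stop_instance.sh <<'EOF'
aws ec2 stop-instances --instance-ids i-0abcd1234example
EOF

# Syntax-check only; running it for real needs configured AWS credentials.
bash -n stop_instance.sh && echo "syntax OK"
```

Stopping (rather than terminating) an instance keeps its storage, so you can start it again later and carry on where you left off.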
We're eventually hoping to be able to act as a "middle man" for these instances and to be able to provide you with an instance that is already set up for use and doesn't require you to launch it for yourself. If this is something that you're interested in then please reach out to me at robyn[dot]wright[at]dal[dot]ca and I can let you know where we're at in this process.
I will also note that if this all sounds way too overwhelming, you think you're only going to do one microbiome analysis ever, you don't want to learn the details yourself, etc etc etc, then there are options that avoid needing to learn to use the command line. We have not investigated them a lot ourselves as we tend to want more flexibility and to know exactly what we're doing, but one that seems to be popular is MicrobiomeAnalyst.
As mentioned, we do hope to at some point have an overview of the full computing resources required/used for various analyses. For now, I'll note that RAM is typically the limiting factor in most analyses, and there is usually a limiting step for this:
- Amplicon: the most intensive steps are amplicon sequence variant denoising, taxonomic assignment, and construction of a phylogenetic tree/insertion of sequences into an existing tree.
  - 64GB RAM is likely to be sufficient for most analyses, and you may be able to get away with less than this.
  - Denoising will typically depend on the dataset size; if this is a limiting factor for you, you will likely be able to get around it by running subsets of your samples at a time.
  - Taxonomic assignment using QIIME2 typically requires 32GB RAM.
  - Phylogenetic tree construction will depend on the number of sequences you have as well as the algorithm that you use. 10,000-100,000 sequences likely require 16-64GB RAM.
  - Phylogenetic tree insertion likely requires 16-32GB RAM.
- Metagenome: this will vary massively depending on what exactly you want to do, but in a read-based analysis, taxonomic assignment (comparison of your sequences against a large database of all known organisms) is likely to be the most intensive step. In MAG-based analyses there are several steps that could be very RAM-intensive, although there are often options for these steps to take longer but use less RAM. You are still likely to need several hundred GB of RAM for a reasonably sized metagenome project.
  - Taxonomic assignment will usually depend on the database that you use for comparison. We've found previously that the bigger the database you can use, the more reliable your results will be. The database that we typically use for Kraken2 (all NCBI sequences) requires ~800GB of RAM.
Almost all bioinformatic analyses are going to require some knowledge of how to use a server/computer and run things from the command line. We've tried our best to provide tutorials for this, but some of this will depend on your specific system. Once you have a basic understanding of this you should be fine to follow through most of the workflows that we have on the main Microbiome Helper 2 page.
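As a taste of what "the command line" means in practice, these are the kinds of everyday commands the tutorials assume (the directory names here are just examples):

```shell
pwd                           # print which directory you're currently in
ls -lh                        # list files with human-readable sizes
mkdir -p analysis/raw_data    # make a project directory (and any parents)
cd analysis                   # move into it
```

If these already feel familiar, you should have no trouble with the workflows; if not, the command-line introduction linked below covers them and more.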
If you are completely new to this, we recommend following our brief introduction to the command line, which should give a good overview that is compatible with most servers/computers. I'd also recommend running through some of the workflows with the tutorial data before you use your own data. This will help to familiarise you with the steps that you'll need to take, and has the added bonus that you know everything should work with this data - DNA sequencing isn't perfect, and not knowing whether you're doing something wrong or there is something wrong with your data can be frustrating! It's therefore useful to run through with data that you know should work. If you want to do this, I'd recommend the following order:
- Download the tutorial data. Choose here between the Illumina short-read or PacBio long-read amplicon/marker gene data (or do both!)
- Follow the QIIME2 marker gene workflow, making sure that you follow the right instructions depending on whether you have used the short/Illumina or long/PacBio reads.
- Follow the basic statistics and visualisation workflow in QIIME2
- Optional: Follow the PICRUSt2 functional prediction workflow
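To give a flavour of what the QIIME2 workflow looks like, here is a sketch of the very first step, importing paired-end reads. The file names are placeholders, and the workflow itself gives the real commands to use; since QIIME2 probably isn't installed on the machine you're reading this on, the command is just written to a script and syntax-checked:

```shell
# Sketch only - see the QIIME2 marker gene workflow for the real commands.
cat > 01_import.sh <<'EOF'
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest.tsv \
  --input-format PairedEndFastqManifestPhred33V2 \
  --output-path demux.qza
EOF

bash -n 01_import.sh && echo "syntax OK"
```

Everything in QIIME2 follows this same `qiime PLUGIN ACTION --option value` shape, which is part of why it's a gentle place to start.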
If you are planning to do your own metagenomics analyses, I'd still recommend working through the above to practice using the command line and get the hang of the basic steps. But then:
- Download the tutorial data. There is again the option between short or long reads, and we have subsampled these files so that you would be able to run through all of the steps in a reasonable time frame.
- Taxonomic annotation of reads using Kraken2
- Functional annotation of reads using MMSeqs2
- Optional: annotation of AMR genes using CARD RGI
- MAG assembly, binning, and curation with Anvi'o
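As an indication of what read-based taxonomic annotation looks like, here is a hedged Kraken2 sketch - the database path and sample names are placeholders, and it is syntax-checked only, since Kraken2 and its database won't be available on the machine you're reading this on:

```shell
# Sketch only - paths and sample names are placeholders.
# Adding --memory-mapping avoids loading the whole database into RAM
# (slower, but one of the "take longer but use less RAM" options above).
cat > 02_kraken2.sh <<'EOF'
kraken2 --db /path/to/kraken2_database \
  --paired sample_R1.fastq.gz sample_R2.fastq.gz \
  --gzip-compressed --threads 4 \
  --output sample.kraken --report sample.kreport
EOF

bash -n 02_kraken2.sh && echo "syntax OK"
```

The per-read assignments go to the `--output` file and the per-taxon summary to the `--report` file; the report is usually what you'll carry forward into downstream statistics.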
As with any analysis, it's really important to keep track of what you have done. I talk about this a little in the intro to command line/getting started on a Linux server, but I think this is worth stating several times.
We do not plan on these pages going anywhere, and you can see the version history at the top of each wiki page, but it can be really difficult to know exactly which version of a protocol you followed or which version of a package you used when you're writing up your methods months after finishing an analysis. It can also be really useful, when you're trying to modify commands that you've previously run, to have saved them somewhere that you can edit them. If you try using something like Microsoft Word or TextEdit then you'll find that this can be really irritating - it auto-corrects things and capitalises words that you don't want to capitalise (coding and commands are usually case-sensitive), and even punctuation can change to a version that isn't what a program expects. For example, compare a straight quotation mark (") with a curly one (“).
The straight one is what you get typing directly into the console using your keyboard, and the curly one is what you get typing in Microsoft Word and then pasting into the console. These things would be minor if you were writing a text document, but a command that includes them will give you an error. Error messages can take a long time to learn to decode efficiently, and spotting these errors takes practice, too! Word processors will also do things like "correcting" `--` (two hyphens, used for many command-line options) to a single dash.
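You can see the difference for yourself in a terminal: straight ASCII quotes are shell syntax, while the curly "smart" quotes a word processor substitutes are just ordinary characters that get passed straight through (or trigger errors in programs expecting straight quotes):

```shell
echo "hello"    # straight quotes are shell syntax: prints hello
echo “hello”    # curly quotes are literal characters: prints “hello”
```

The two lines look almost identical on screen, which is exactly why this class of error is so hard to spot by eye.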
I personally like to use RStudio and R notebooks for keeping track of the code that I run, and use them like I would a lab notebook. You can see some information on them here, but essentially these allow you to have multiple "chunks" of code, which will specify which coding language they are written in (e.g., bash for Linux command line, R, or Python), and you can make notes around these. We will be providing some of these to go with some of the workflows on this repository. The really nice thing about R notebooks is that you can also "export" them to a HTML document at the end - this allows anyone to view them with the code that you have used to generate each part of the analysis, so they are great for ensuring that your analysis is reproducible.
These two screenshots are both just sections of the same notebook, and you can see that I have multiple tabs open for different projects along the top of my screen. In the first screenshot, you can see that this is a bash code chunk and the eval=FALSE is telling R not to run this and therefore not to include the output from this in the final HTML document - this is really useful if you have parts of code that would take a long time to run, or you tried out but didn't include in your final analysis.
In the second screenshot, these are R chunks. In the first one, I've "commented out" the code - I've added # before the code. You can always use the # in R/Python/bash to either add comments that won't be run, or parts of code that won't be run. The next part is being run and shows the packages being used for this analysis.
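The same commenting behaviour holds at the command line, which makes `#` a handy way to keep a command in your notes without actually running it:

```shell
# echo "this line is commented out and does not run"
echo "this line runs"    # inline comments after a command work too
```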
This final screenshot shows how the R notebook appears when converted to an HTML document at the end - you can see that the output of each line of code is shown alongside the code itself, and you can choose whether to hide or show this code by default. We'll be adding an introduction to R notebooks soon, so check for this in the sidebar!
We started collecting together some resources as we came across them a couple of years ago, and you can find these here. We have also taught many bioinformatics workshops, and several of these post the lectures to YouTube afterwards.
I will link the 2024 and 2025 CBW lectures below, but it's always worth searching to see if there are more recent versions.
Marker Gene Analysis
CBW 2025 module 1: Introduction
CBW 2025 module 2: Marker Gene Profiling
CBW 2025 module 3: Microbiome Statistics and Visualisations
CBW 2025 module 4: Functional Prediction and Additional Analyses
CBW 2024 module 1: Introduction to Sequencing Data Analysis
CBW 2024 module 2: Marker Gene Profiling
CBW 2024 module 3: Microbiome Statistics and Visualizations
CBW 2024 module 4: Functional Prediction and Additional Analyses
Metagenome Analysis
CBW 2025 module 1: Introduction to Metagenomics and Read-Based Profiling
CBW 2025 module 2: Metagenomic Assembly and Binning
CBW 2025 module 3: Assigning Functions
CBW 2025 module 4: Statistics, Visualisation and Finding Functional Significance
CBW 2024 module 1: Introduction to MGS and Read-Based Profiling
CBW 2024 module 2: MAG Assembly and Binning
CBW 2024 module 3: Metagenomic Functional Annotation
CBW 2024 module 4: Advanced Microbiome Statistics