Microbiome Helper 2 Metagenomics initial Steps - LangilleLab/microbiome_helper GitHub Wiki
Authors: Robyn Wright,
Modifications by: NA
Based on initial versions by: NA
Please note: We are still testing/developing this so use with caution :) Please see the Metagenomics SOP v3 in the right hand side bar for our current version. If you want to use the below commands be sure to keep track of them locally.
- The below options are not necessarily best for your data; it is important to explore different options, especially at the read QC stage.
- You should check that only the FASTQs of interest are being specified at each step (e.g. with the
ls
command). If you are re-running commands multiple times or using optional steps the input filenames and/or folders could differ. - The tool
parallel
comes up several times in this workflow. Be sure to check out our tutorial on this tool here. - You can always run parallel with the
--dry-run
option to see what commands are being run to double-check they are doing what you want!
- This workflow starts with raw paired-end sequencing data in demultiplexed FASTQ format assumed to be located within a folder called
raw_data
. - If you do not have your own data and would like to run this, you can use some of our tutorial data from here.
-
concat_lanes.pl
script (here) fastqc
multiqc
kneaddata
bowtie2
Trimmomatic
trf
parallel
You can see how we set up a conda environment called kneaddata-v0.12.2
for carrying out these steps here.
Concatenate multiple lanes of sequencing together (e.g. for NextSeq550 data; NS2000 does not need this). If you do NOT do this step, remember to change cat_lanes
to raw_data
in the further commands below that have assumed you are working with the most common type of metagenome data.
concat_lanes.pl raw_data/* -o cat_lanes -p 4
Run fastqc
to allow manual inspection of the quality of sequences.
mkdir fastqc_out
fastqc -t 4 cat_lanes/* -o fastqc_out/
And run multiqc
on this output.
multiqc fastqc_out/ -o multiqc
This allows you to look at the fastqc
output for all files together, and will generate a html file in multiqc/multiqc_report.html
.
Note that these steps do not filter out any low quality reads, they simply allow you to look at the quality of your samples. The aim of this step is to see whether anything has gone wrong with your samples; the subsequent kneaddata
steps will allow you to remove low quality reads.
Stitch paired-end reads together with PEAR (summary of stitching results are written to "pear_summary_log.txt"). Note: it is important to check the % of reads assembled. It may be better to concatenate the forward and reverse reads together if the assembly % is too low (see step 2.2).
run_pear.pl -p 4 -o stitched_reads cat_lanes/*
If you don't stitch your reads together at this step you will need to unzip your FASTQ files before continuing with the below commands.
Use kneaddata
to run pre-processing tools. First Trimmomatic is run to remove low quality sequences. Then Bowtie2 is run to screen out contaminant sequences. Below we are screening out reads that map to the human or PhiX genomes. Note KneadData is being run below on all unstitched FASTQ pairs with parallel
, you can see our quick tutorial on this tool here. For a detailed breakdown of the options in the below command see this page. The forward and reverse reads will be specified by "_1" and "_2" in the output files, ignore the "R1" in each filename. Note that the \
characters at the end of each line are just to split the command over multiple lines to make it easier to read.
Depending on what you are sequencing, you will need to choose a different Bowtie2 database. This should contain the PhiX genome and, if you are looking at a host-associated microbiome, the host genome (or something close to it, if none is available). There are several available on the Bowtie2 website here, or we have some optional instructions on constructing one here.
This is an example for creating a database with the chicken Gallus gallus and PhiX genomes.
First, download the NCBI RefSeq assembly summary:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt
Have a look in this file. You'll see that it contains information on all of the genomes in NCBI RefSeq - we often need reference genomes for constructing various kinds of databases, so it's useful to get familiar with what is available on NCBI. You'll see that each assembly/accession has information on the NCBI taxonomy ID, whether it's a complete genome, whether it's a reference or representative genome, and a link to where we can find the associated files. Because the samples we are using are from chickens, we're going to look for the files that are associated with the chicken Gallus gallus:
grep "Gallus gallus" assembly_summary_refseq.txt
The first one has "reference genome" in the first column, so that's what we'll use. Copy the link that starts https://ftp.ncbi/.....
into a new browser window. You will see a bunch of different file types, so find the one that ends _genomic.fna.gz
, right click, and click "Copy Link Address". Now use the wget
command to download this to your instance.
Now we're going to do the same with PhiX. PhiX is added as a sequencing control and is often removed before we're given back our samples, but it's good to make sure.
- Use grep to search for "phiX174"
- Go to the link that you find
- Download the _genomic.fna.gz file using wget
Now we'll combine these two files into one:
cat GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_genomic.fna.gz GCF_000819615.1_ViralProj14015_genomic.fna.gz > chicken_Gallusgallus_phiX174.fna.gz
And finally, we'll build our Bowtie2 database:
mkdir bowtie2_db
bowtie2-build chicken_Gallusgallus_phiX174.fna.gz bowtie2_db/chicken_Gallusgallus_phiX174 --threads 1
This will take about ten minutes to run.
Run Bowtie2:
parallel -j 1 --link --eta 'kneaddata -i {1} -i {2} -o kneaddata_out/ \
-db BOWTIE2_DB --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
::: cat_lanes/*_R1.fastq ::: cat_lanes/*_R2.fastq
Make sure that you replace
BOWTIE2_DB
with the name of the database that you are using! This should have no file extension, so if my folder containing my database was namedbowtie2_databases
had the following files:GRCh38_PhiX.1.bt2 GRCh38_PhiX.2.bt2 GRCh38_PhiX.3.bt2 GRCh38_PhiX.4.bt2 GRCh38_PhiX.rev.1.bt2 GRCh38_PhiX.rev.2.bt2
Then you would replaceBOWTIE2_DB
withbowtie2_databases/GRCh38_PhiX
If you have unpaired reads, as you would when using the tutorial data, you will need to use the
-un
flag instead of-i
.
If you have installed
trimmomatic
via conda/mamba then you won't need to use the--trimmomatic
flag.
Optional: If you stitched the F+R reads together above, then use the following version.
parallel -j 1 --link 'kneaddata -un {} -o kneaddata_out/ \
-db BOWTIE2_DB --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
-t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
::: stitched_reads/*.assembled.fastq
Clean up the output directory (helps downstream commands) by moving the discarded sequences to a subfolder:
mkdir kneaddata_out/contam_seq
mkdir kneaddata_out/unmatched_seq
mv kneaddata_out/*_contam*.fastq kneaddata_out/contam_seq
mv kneaddata_out/*_unmatched*.fastq kneaddata_out/unmatched_seq
Note that you will not need to run the
unmatched_seq
commands if you only have unpaired reads.
You can produce a logfile summarizing the kneadData output with this command:
kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.txt
If you are having issues with running Kneaddata and all of your reads are being output as unmatched despite the fastq headers in both files matching, you may want to try using an older version of Kneaddata. It seems that this issue has been around for a while, and that the simplest fix is to use an older version of Kneaddata. I have found that version 0.7.4 works as expected.
If you did not stitch your paired-end reads together with PEAR, then you can concatenate the forward and reverse FASTQs together now. Note this is done after quality filtering so that both reads in a pair are either discarded or retained. It's important to only specify the "paired" FASTQs outputted by kneaddata
in the below command.
concat_paired_end.pl -p 4 --no_R_match -o cat_reads kneaddata_out/*_paired_*.fastq
You should check over the commands that are printed to screen to make sure the correct FASTQs are being concatenated together.
Now that you've carried out the initial steps, you can carry out taxonomic or functional annotation, or assemble MAGs:
- Taxonomic annotation
- Functional annotation
- MAG assembly