Microbiome Helper 2 Metagenomics initial Steps - LangilleLab/microbiome_helper GitHub Wiki

Authors: Robyn Wright,
Modifications by: NA
Based on initial versions by: NA

Please note: We are still testing/developing this so use with caution :) Please see the Metagenomics SOP v3 in the right hand side bar for our current version. If you want to use the below commands be sure to keep track of them locally.

Initial points

  • The below options are not necessarily best for your data; it is important to explore different options, especially at the read QC stage.
  • You should check that only the FASTQs of interest are being specified at each step (e.g. with the ls command). If you are re-running commands multiple times or using optional steps the input filenames and/or folders could differ.
  • The tool parallel comes up several times in this workflow. Be sure to check out our tutorial on this tool here.
  • You can always run parallel with the --dry-run option to see what commands are being run to double-check they are doing what you want!

Requirements

  • This workflow starts with raw paired-end sequencing data in demultiplexed FASTQ format assumed to be located within a folder called raw_data.
  • If you do not have your own data and would like to run this, you can use some of our tutorial data from here.
  • concat_lanes.pl script (here)
  • fastqc
  • multiqc
  • kneaddata
  • bowtie2
  • Trimmomatic
  • trf
  • parallel

You can see how we set up a conda environment called kneaddata-v0.12.2 for carrying out these steps here.

1. First Steps

1.1 Join lanes together (only some Illumina)

Concatenate multiple lanes of sequencing together (e.g. for NextSeq550 data; NS2000 does not need this). If you do NOT do this step, remember to change cat_lanes to raw_data in the further commands below that have assumed you are working with the most common type of metagenome data.

concat_lanes.pl raw_data/* -o cat_lanes -p 4

1.2 Inspect read quality

Run fastqc to allow manual inspection of the quality of sequences.

mkdir fastqc_out
fastqc -t 4 cat_lanes/* -o fastqc_out/

And run multiqc on this output.

multiqc fastqc_out/ -o multiqc

This allows you to look at the fastqc output for all files together, and will generate a html file in multiqc/multiqc_report.html.

Note that these steps do not filter out any low quality reads, they simply allow you to look at the quality of your samples. The aim of this step is to see whether anything has gone wrong with your samples; the subsequent kneaddata steps will allow you to remove low quality reads.

1.3 (Optional) Stitch F+R reads

Stitch paired-end reads together with PEAR (summary of stitching results are written to "pear_summary_log.txt"). Note: it is important to check the % of reads assembled. It may be better to concatenate the forward and reverse reads together if the assembly % is too low (see step 2.2).

run_pear.pl -p 4 -o stitched_reads cat_lanes/*

If you don't stitch your reads together at this step you will need to unzip your FASTQ files before continuing with the below commands.

2. Read Quality-Control and Contaminant Screens

2.1 Running KneadData

Use kneaddata to run pre-processing tools. First Trimmomatic is run to remove low quality sequences. Then Bowtie2 is run to screen out contaminant sequences. Below we are screening out reads that map to the human or PhiX genomes. Note KneadData is being run below on all unstitched FASTQ pairs with parallel, you can see our quick tutorial on this tool here. For a detailed breakdown of the options in the below command see this page. The forward and reverse reads will be specified by "_1" and "_2" in the output files, ignore the "R1" in each filename. Note that the \ characters at the end of each line are just to split the command over multiple lines to make it easier to read.

Depending on what you are sequencing, you will need to choose a different Bowtie2 database. This should contain the PhiX genome and, if you are looking at a host-associated microbiome, the host genome (or something close to it, if none is available). There are several available on the Bowtie2 website here, or we have some optional instructions on constructing one here.

Construct Bowtie2 database

This is an example for creating a database with the chicken Gallus gallus and PhiX genomes.

First, download the NCBI RefSeq assembly summary:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

Have a look in this file. You'll see that it contains information on all of the genomes in NCBI RefSeq - we often need reference genomes for constructing various kinds of databases, so it's useful to get familiar with what is available on NCBI. You'll see that each assembly/accession has information on the NCBI taxonomy ID, whether it's a complete genome, whether it's a reference or representative genome, and a link to where we can find the associated files. Because the samples we are using are from chickens, we're going to look for the files that are associated with the chicken Gallus gallus:

grep "Gallus gallus" assembly_summary_refseq.txt

The first one has "reference genome" in the first column, so that's what we'll use. Copy the link that starts https://ftp.ncbi/..... into a new browser window. You will see a bunch of different file types, so find the one that ends _genomic.fna.gz, right click, and click "Copy Link Address". Now use the wget command to download this to your instance.

Now we're going to do the same with PhiX. PhiX is added as a sequencing control and is often removed before we're given back our samples, but it's good to make sure.

  1. Use grep to search for "phiX174"
  2. Go to the link that you find
  3. Download the _genomic.fna.gz file using wget

Now we'll combine these two files into one:

cat GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_genomic.fna.gz GCF_000819615.1_ViralProj14015_genomic.fna.gz > chicken_Gallusgallus_phiX174.fna.gz

And finally, we'll build our Bowtie2 database:

mkdir bowtie2_db
bowtie2-build chicken_Gallusgallus_phiX174.fna.gz bowtie2_db/chicken_Gallusgallus_phiX174 --threads 1

This will take about ten minutes to run.

Run Bowtie2:

parallel -j 1 --link --eta 'kneaddata -i {1} -i {2} -o kneaddata_out/ \
    -db BOWTIE2_DB --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
    -t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
    --bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
    ::: cat_lanes/*_R1.fastq ::: cat_lanes/*_R2.fastq

Make sure that you replace BOWTIE2_DB with the name of the database that you are using! This should have no file extension, so if my folder containing my database was named bowtie2_databases had the following files: GRCh38_PhiX.1.bt2 GRCh38_PhiX.2.bt2 GRCh38_PhiX.3.bt2 GRCh38_PhiX.4.bt2 GRCh38_PhiX.rev.1.bt2 GRCh38_PhiX.rev.2.bt2 Then you would replace BOWTIE2_DB with bowtie2_databases/GRCh38_PhiX

If you have unpaired reads, as you would when using the tutorial data, you will need to use the -un flag instead of -i.

If you have installed trimmomatic via conda/mamba then you won't need to use the --trimmomatic flag.

Optional: If you stitched the F+R reads together above, then use the following version.

parallel -j 1 --link 'kneaddata -un {} -o kneaddata_out/ \
    -db BOWTIE2_DB --trimmomatic /usr/local/prg/Trimmomatic-0.36/ \
    -t 4 --trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
    --bowtie2-options "--very-sensitive --dovetail" --remove-intermediate-output' \
    ::: stitched_reads/*.assembled.fastq

Clean up the output directory (helps downstream commands) by moving the discarded sequences to a subfolder:

mkdir kneaddata_out/contam_seq
mkdir kneaddata_out/unmatched_seq
mv kneaddata_out/*_contam*.fastq kneaddata_out/contam_seq
mv kneaddata_out/*_unmatched*.fastq kneaddata_out/unmatched_seq

Note that you will not need to run the unmatched_seq commands if you only have unpaired reads.

You can produce a logfile summarizing the kneadData output with this command:

kneaddata_read_count_table --input kneaddata_out --output kneaddata_read_counts.txt

If you are having issues with running Kneaddata and all of your reads are being output as unmatched despite the fastq headers in both files matching, you may want to try using an older version of Kneaddata. It seems that this issue has been around for a while, and that the simplest fix is to use an older version of Kneaddata. I have found that version 0.7.4 works as expected.

2.2 Concatenate unstitched output (omit if data stitched above)

If you did not stitch your paired-end reads together with PEAR, then you can concatenate the forward and reverse FASTQs together now. Note this is done after quality filtering so that both reads in a pair are either discarded or retained. It's important to only specify the "paired" FASTQs outputted by kneaddata in the below command.

concat_paired_end.pl -p 4 --no_R_match -o cat_reads kneaddata_out/*_paired_*.fastq 

You should check over the commands that are printed to screen to make sure the correct FASTQs are being concatenated together.

Next steps

Now that you've carried out the initial steps, you can carry out taxonomic or functional annotation, or assemble MAGs:

  • Taxonomic annotation
  • Functional annotation
  • MAG assembly
⚠️ **GitHub.com Fallback** ⚠️