Metagenomics (IMPACTT December 2022) Assembly

This tutorial is for the 2022 IMPACTT bioinformatics workshop running from December 6-7th.

Authors: Robyn Wright and Morgan Langille

This tutorial has been written by Robyn Wright and Morgan Langille for the December 2022 IMPACTT workshop.

Table of Contents

1. Co-assembly of reads into contigs

2. Mapping of reads onto contigs

3. Binning of contigs

4. Quality control and assignment of taxonomy to bins

Introduction

The main goal of this tutorial is to introduce students to the assembly of genomes from metagenomic reads (Metagenome Assembled Genomes/MAGs). As with assigning taxonomy to metagenomic reads, there is not a one-size-fits-all pipeline for assembling MAGs. MAG assembly is incredibly computationally intensive, and the approach that we are taking here is therefore a very "bare bones" approach designed to demonstrate the main steps involved and give you some familiarity with the methods used. At the end of this tutorial we've provided a few other pipelines for MAG assembly that you may wish to look into if you are looking to assemble MAGs with your own metagenome data.

Throughout this tutorial there will be questions that are aimed at helping students understand the various steps in metagenome assembly and binning. The answers are found on this page (note that we are not marking your answers, they are just to make sure that you understand what you are doing).

This tutorial is based on our Metagenomics SOP and we are also working on a list of resources and papers that may be useful for beginners (or those that are not beginners!) in the microbiome field here.

Teaching Objectives

  1. Learn how to assemble contigs from pre-processed metagenomic reads
  2. Learn how to generate abundance profiles of reads within these contigs
  3. Learn how to bin these contigs
  4. Learn about some common approaches for assessing the quality of the bins and taxonomic assignment

Bioinformatic tool citations

Properly citing bioinformatic tools is important to ensure that developers are getting the proper recognition. If you use any of the commands in this tutorial for your own work be sure to cite the relevant tools listed below!

Activate conda environment

As before, we have included a conda environment for this tutorial that contains all of the required bioinformatic tools!

You can activate this environment using the following command:

conda activate assembly

When you are finished this tutorial you can deactivate the conda environment using:

conda deactivate

1. Co-assembly of reads into contigs

The first step in the assembly of MAGs is the assembly of the shorter reads from sequencing into contigs. A contig is a contiguous sequence assembled from a set of sequence fragments, and there are different bioinformatic methods for assembling our reads into contigs. For this part of the tutorial we are continuing to use the same samples as we used for taxonomic annotation, and these are all short reads of approximately 100 base pairs each. We are using a tool called MEGAHIT for this assembly because it allows co-assembly of the reads from multiple samples. This means that rather than the assembly being performed separately for each sample, it is performed on all of the reads from all of the samples at the same time. In this tutorial, we will co-assemble all of the samples together, but in your own analyses you should think about what makes the most sense. Are you expecting microbial taxa to overlap between different samples? Would it make sense to find similar genomes in multiple samples? If you answered "yes" to those questions, then it might be worth thinking about co-assembly. If not, then it is probably worth assembling each sample separately. You may also want to consider assembling the samples from different treatments separately.

Prior to assembling your samples you would usually run quality checks and remove potentially contaminating sequences, but seeing as we already did that during the Taxonomic annotation tutorial, we can just use the same files.

First make the directory and copy across the reads that we processed in the taxonomic tutorial, as well as a file containing all of the sample IDs:

mkdir assembly
cp -r taxonomy/kneaddata_out assembly
cd assembly
cp /home/ubuntu/CourseData/Module4/Data/sample_ids.txt .

Now we'll run MEGAHIT on our samples. The first two lines below build comma-separated lists of the forward (_1) and reverse (_2) read files, which is the format that MEGAHIT expects for co-assembly:

R1=$( ls kneaddata_out/*_1.fastq | tr '\n' ',' | sed 's/,$//' )
R2=$( ls kneaddata_out/*_2.fastq | tr '\n' ',' | sed 's/,$//' )
megahit -1 $R1 \
        -2 $R2 \
        --min-contig-len 1000 \
        --num-cpu-threads 4 \
        --presets meta-large \
        --memory 0.8 \
        -o megahit_out \
        --verbose

The arguments here are:

  • -1 - The forward reads
  • -2 - The reverse reads
  • --min-contig-len - The minimum length in base pairs for contigs that we want to use - 1000 is about the minimum length that you will ever want to use, but sometimes people might increase this to e.g. 2500 or 5000
  • --num-cpu-threads - The number of threads to use
  • --presets - This is for a set of parameters needed within MEGAHIT, and meta-large is suggested for large and complex metagenomes
  • --memory - The amount of available memory that we want to allow MEGAHIT to use - this means it will use up to 80% of the memory available. Keeping this below 100% just means that we would still be able to use the Amazon instance for other things, and could be important if you're sharing a server with other people that might also need to be carrying out some work!
  • -o - The output folder name
  • --verbose - MEGAHIT will print out what it is doing

The main output at this point is a fasta file containing the contigs megahit_out/final.contigs.fa. You can take a look at this with the less command if you like, and we can also count the number of contigs that we have with grep -c ">" megahit_out/final.contigs.fa.
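
For example, to page through the contigs and then count them (these are the same two commands mentioned above):

less megahit_out/final.contigs.fa
grep -c ">" megahit_out/final.contigs.fa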

Question 1: How many contigs are there in the megahit_out/final.contigs.fa file?

2. Mapping of reads onto contigs

The next step in the creation of MAGs is to map the reads in our samples onto our contigs. During the taxonomic annotation tutorial, Bowtie2 was used within Kneaddata to map the reads in our samples against the sequences of known contaminants. Now we'll use Bowtie2 to map the reads onto our contigs so that we can create abundance profiles of the contigs.

First of all, we'll make a Bowtie2 database with the contigs:

bowtie2-build megahit_out/final.contigs.fa megahit_out/final.contigs

In this command, we simply give Bowtie2 the location of the fasta file that we want to build the database from, as well as the name for the database files (megahit_out/final.contigs).

And we'll make a folder for our mapping files to go in:

mkdir bam_files

Now we will map the reads from each sample to the contigs to generate abundance profiles:

for SAMPLE in `awk '{print $1}' sample_ids.txt`
do

    echo "Sample "$SAMPLE

    # do the bowtie mapping to get the SAM file:
    bowtie2 --threads 4 \
            -x megahit_out/final.contigs \
            -1 "kneaddata_out/"$SAMPLE"_R1_subsampled_kneaddata_paired_1.fastq" \
            -2 "kneaddata_out/"$SAMPLE"_R1_subsampled_kneaddata_paired_2.fastq" \
            --no-unal \
            -S bam_files/$SAMPLE.sam

    # convert the resulting SAM file to a BAM file:
    samtools view -F 4 -bS bam_files/$SAMPLE.sam > bam_files/$SAMPLE-RAW.bam

    # sort and index the BAM file:
    samtools sort bam_files/$SAMPLE-RAW.bam -o bam_files/$SAMPLE.bam
    samtools index bam_files/$SAMPLE.bam

    # remove temporary files:
    rm bam_files/$SAMPLE.sam bam_files/$SAMPLE-RAW.bam

done

This command is called a for loop. There are versions of for loops in most computing languages, and they are really useful for carrying out the same set of operations on multiple files. In this case, for each of our samples, we first use Bowtie2 to map the reads to the contigs and output a SAM file containing this information. Next, the SAM file is converted to a BAM file. This BAM file is then sorted and indexed, and finally, we remove the intermediate files. While we could probably achieve this using parallel, it starts to get a little more complicated to use parallel to carry out multiple commands one after the other, and it is easier to see what we are doing when we have the commands in a for loop, like this. Neither approach is necessarily "wrong" or "right" - they are just two different approaches.
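
Bowtie2 prints an alignment summary (including the "overall alignment rate") to the terminal for each sample as the loop runs, which is the easiest place to find the answer to Question 2 below. If that output has already scrolled past, a rough cross-check from the files themselves is possible (a sketch, assuming the file naming used in the loop above; note that the BAM files only contain mapped reads because of --no-unal and -F 4):

# number of mapped read records kept in the BAM
samtools view -c bam_files/HSM6XRQY.bam
# total number of input reads (forward + reverse; each fastq record is four lines)
cat kneaddata_out/HSM6XRQY_R1_subsampled_kneaddata_paired_*.fastq | wc -l | awk '{print $1/4}'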

Question 2: What proportion of the reads from sample HSM6XRQY were mapped to the contigs?

3. Binning of contigs

Next we will bin - or cluster - the contigs to create genome "bins", or MAGs. To do this, we typically use information on the coverage, or abundance, of contigs within samples (that we generated in step two) and the binning algorithms will usually identify patterns in nucleotide composition or k-mer frequencies in order to group together contigs that they think are likely to have originated from the same genome. We are using CONCOCT for this, but as with every step of this pipeline, there are multiple tools that are capable of carrying this out and may be better or worse in different circumstances.

The first step is to cut contigs into smaller parts, with a maximum length of 10,000 base pairs, which is aimed at mitigating the effect of potential local assembly errors:

conda activate concoct_env
cut_up_fasta.py megahit_out/final.contigs.fa -c 10000 -o 0 --merge_last -b contigs_10K.bed > contigs_10K.fa

This script takes:

  • The final contigs file
  • -c - The maximum contig length (10,000)
  • -o - The overlap size allowed between contigs
  • --merge_last - This means that the final piece of each cut-up contig (the part shorter than 10,000 base pairs left over after cutting) will be merged onto the end of the last chunk rather than being kept as a separate fragment
  • -b - The bedfile to be created with the locations of the original contigs in relation to the new contigs
  • The output file for the cut up contigs

Now we will get the coverage of these new, cut up contigs in our samples:

concoct_coverage_table.py contigs_10K.bed bam_files/*.bam > coverage_table.tsv

This script takes the mapping information from the BAM files and converts it into abundance values for each contig within each sample.

Next we run CONCOCT, which bins these cut up contigs based on the coverage/abundance information and nucleotide frequency mentioned above:

concoct --composition_file contigs_10K.fa --coverage_file coverage_table.tsv -b concoct_output/ --threads 4

This is taking:

  • --composition_file - The file containing our cut up contigs
  • --coverage_file - The file containing information on the coverage/abundance of these contigs
  • -b - A folder to write the results to
  • --threads - The number of threads to use

Now that we have information on all of the bins/clusters, we can merge the cut up contig clustering back into the original, potentially longer, contigs:

merge_cutup_clustering.py concoct_output/clustering_gt1000.csv > concoct_output/clustering_merged.csv

This simply takes the CONCOCT clustering of the cut-up contigs and merges it back into a clustering of the original contigs, which is written to a new file.

Finally, we can use this clustering information to take the contig sequences and group together those that should be in the same bin/MAG:

mkdir concoct_output/fasta_bins
extract_fasta_bins.py megahit_out/final.contigs.fa concoct_output/clustering_merged.csv --output_path concoct_output/fasta_bins

This just takes the original MEGAHIT contig fasta file and the merged CONCOCT clustering, and writes one fasta file per bin to the output folder. If we take a look in the concoct_output/fasta_bins folder then we should see some fasta files.

Question 3: How many fasta files/bins are there in the fasta_bins folder?

Take a look at the concoct_output/fasta_bins folder with the ls -lh command and find the largest file in the folder. Try viewing it with the less command and counting its contigs with grep -c ">" path_to_file.
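
For example (the bin numbers are assigned by CONCOCT and will differ between runs, so substitute the name of the largest file that ls -lh reports):

ls -lh concoct_output/fasta_bins/
# replace LARGEST_BIN with the number of the largest file from the ls -lh output above
grep -c ">" concoct_output/fasta_bins/LARGEST_BIN.fa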

Question 4: How many contigs are there in the largest fasta file?

4. Quality control and assignment of taxonomy to bins

Now that we've made our MAGs, we probably want to assess their quality as well as have a look at their taxonomy. To do this, we are going to use CheckM. CheckM uses a database of ubiquitous, single-copy marker genes to assess the quality of the MAGs using two main metrics, completeness and contamination. Completeness and contamination (also called redundancy) are both important when we're assessing the quality of the MAGs that we've assembled. Both scores are calculated based on the presence of those ubiquitous, single-copy marker genes - ones that all bacteria are known to possess. Completeness is a prediction of how complete the genome is likely to be, i.e. whether it possesses a copy of each of the marker genes that are expected. Contamination/redundancy is a measure of whether marker genes that are expected to be present in only a single copy are duplicated. Typically speaking, a MAG that has >50% completeness and <10% contamination/redundancy is considered to be reasonable, although >90% completeness is desirable. There is a good explanation of completeness and redundancy here.

First, we'll run CheckM:

checkm lineage_wf concoct_output/fasta_bins/ -x .fa -t 4 MAGs-CHECKM-WF

In this command, we're telling CheckM to run its lineage_wf on all of the .fa files within the concoct_output/fasta_bins/ folder. It will do this using 4 threads (-t 4) and will write the output to the MAGs-CHECKM-WF folder. Internally, CheckM will: (1) insert each of the MAGs into a reference genome tree; (2) use the tool Prodigal to identify genes within each of the fasta bins and identify the lineage-specific marker genes for each of the genomes; (3) analyze whether the lineage-specific marker genes are found in each of the genomes; and (4) summarize the quality of each genome bin/MAG.

You can find this output in MAGs-CHECKM-WF/storage/bin_stats_ext.tsv, but we'll just run the following script to get this into a more easily readable format:

ln -s ~/CourseData/Module4/Data/helper_scripts/ .
conda activate assembly
python helper_scripts/sort_checkm_output.py
less MAGs-CHECKM-summary.tsv

You can open this file either using the less command, or by downloading it via your browser and opening it in Excel, or something similar (which will probably be much easier to visualise).

From the output of this, you will see that in the "Marker lineage" column, each of the MAGs (bins) says "root (UID1)". This means that CheckM wasn't able to classify them taxonomically. You will also see in the "Completeness" column that they all say "0.00", so we can tell that the MAGs weren't classified taxonomically because they didn't have any of the marker genes that CheckM looks for. This isn't really too surprising, because we typically need a high sequencing depth in order to assemble MAGs from our reads, and these samples were sub-sampled to only 25,000 reads per sample. We wouldn't have been able to run this tutorial on the full samples on the Amazon instances that we are using, but I already ran through the same steps on the full samples on our lab server, so we'll go ahead and link the output into our current directory:

ln -s ~/CourseData/Module4/Data/full_MAG_output/

If you take a look in the full_MAG_output directory using the ls command then you should see the same files as you created with the CheckM commands above. If you take a look in the new MAGs-CHECKM-summary.tsv file with the less command, can you see some MAGs that have a taxonomic classification, and that have a completeness of above zero?
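
For example (assuming the full-sample summary file sits inside the linked full_MAG_output directory, alongside the other CheckM output):

ls full_MAG_output/
less full_MAG_output/MAGs-CHECKM-summary.tsv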

In the full_MAG_output directory you should also see another folder called gtdbtk_out. You might have noticed in the CheckM output that most of the taxonomic classifications were only at the kingdom (k__Bacteria) or the order (e.g. o__Clostridiales) level, and this is because CheckM only contains a limited number of genomes for comparison. Another tool, called the Genome Taxonomy Database toolkit (gtdbtk), contains many more genomes and is good for classifying MAGs to the lowest level possible. It requires more memory to run than we have available, but I already ran it on the MAGs made from the full samples like so:

gtdbtk classify_wf --genome_dir concoct_output/fasta_bins/ --extension fa --out_dir gtdbtk_out --cpus 24
gtdbtk de_novo_wf --genome_dir concoct_output/fasta_bins/ --bacteria --extension fa --out_dir gtdbtk_out_tree --cpus 24 --outgroup_taxon p__Altarchaeota

Note that this won't work if you try to run it on the Amazon instances, but the first of these commands classifies all of our MAGs, and the second inserts the MAGs into an existing phylogenetic tree that contains all genomes that are in the Genome Taxonomy Database.

The main output from classify_wf is gtdbtk_out/gtdbtk.bac120.summary.tsv, which you can look at with the less command. You should see that you have some species level classifications for the MAGs here. We'll go ahead and combine this with the CheckM output so that we can compare the two:

python helper_scripts/combine_checkm_gtdb_output.py

You can either look at the output file, summary-CHECKM-gtdbtk.tsv using the less command, or you might like to download it using the file browser. The python script has combined the output for both CheckM and GTDB and has also sorted the MAGs so that the most complete ones are at the top. You should see that the most complete MAG, number 115, has a contamination of 1.45%.
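
If you would rather count than scroll, something like the following can tally the MAGs passing a given completeness/contamination threshold for Question 5 below (a sketch only - the column numbers used here are assumptions, so check the header first and adjust them to wherever the completeness and contamination columns actually sit):

# print the header fields with their column numbers
head -1 summary-CHECKM-gtdbtk.tsv | tr '\t' '\n' | cat -n
# count MAGs with >90% completeness and <10% contamination, assuming completeness is column 3 and contamination is column 4
awk -F'\t' 'NR>1 && $3>90 && $4<10' summary-CHECKM-gtdbtk.tsv | wc -l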

Question 5: If you were keeping only the MAGs with >90% completeness and <10% contamination/redundancy, how many MAGs would you be left with? How about if you kept the MAGs with >50% completeness and <10% contamination/redundancy?

Finally, we'll visualise the MAGs that are >50% complete with <10% redundancy. For this we'll use a script that's modified from one that I used for a different project:

python helper_scripts/plot_mag_tree.py 

Question 6: Take a look at the most complete MAGs and the Kraken2/Bracken output. Do the most complete MAGs make sense to you based on the community profiles output by Kraken2/Bracken? Are there any taxa missing? If there are, why do you think this might be?

There are some other statistics on this plot, which I'll give a brief description of here:

  • GC (%) - The GC, or guanine-cytosine, content is the percentage of bases in the MAG contigs that are either guanine or cytosine. This content can vary both between species as well as within individual genomes, and a number of factors are expected to contribute to this variation.
  • Genome size (Mb) - This is the size of the bin/MAG/genome in megabases (Mb). Typical bacterial genome sizes vary between roughly 0.6 to 8 Mb.
  • # contigs / Longest contig / N50 (contigs) - These metrics will typically be looked at together when assessing the quality of an assembly. The N50 is the sequence length of the shortest contig that makes up 50% of the total assembly length; the larger this is, the better, as it means that we have fewer gaps within the total sequence of the MAG and that there are typically more overlaps between the contigs. Likewise with the longest contig: the longer, the better. When we look at the number of contigs, a smaller number is often better. If we are able to assemble a genome from fewer contigs then they are usually longer and we tend to get better assemblies (higher completeness and lower contamination/redundancy), rather than having lots of short contigs that lead to a fragmented assembly. If you think of the original genome of a bacterium as being like a jigsaw puzzle, then it will be much easier to put the puzzle back together if we have fewer, larger pieces without lots of duplicated, smaller parts.
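
As a worked example of N50: for contigs of 40, 30, 20 and 10 kb (100 kb in total), the cumulative length first reaches half of the total (50 kb) at the 30 kb contig, so the N50 is 30 kb. If you want to calculate these statistics for one of your own bins, a rough sketch using only standard command-line tools is shown below (the bin path is just an example - substitute any file from concoct_output/fasta_bins/):

# print the length of each contig in the bin, sort longest-first, then report the contig count, longest contig and N50
awk '/^>/{if (len) print len; len=0; next} {len += length($0)} END{if (len) print len}' concoct_output/fasta_bins/0.fa |
    sort -rn |
    awk '{lens[NR]=$1; total+=$1} END{cum=0; for (i=1; i<=NR; i++) {cum+=lens[i]; if (cum >= total/2) {print "contigs:", NR, "longest:", lens[1], "N50:", lens[i]; break}}}'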

Question 7: Are there any MAGs with unusually small or large genome sizes? How do you think the completion or contamination/redundancy contributes to this?

Other approaches and pipelines

Note that we haven't touched on functional analysis of MAGs at all here. There are a bunch of different things that we might be interested in, e.g. general metabolism, antibiotic resistance or virulence genes present in the MAGs, etc. Also, as mentioned previously, the approach we've used here is very "bare bones" and there are several more comprehensive pipelines and ways that we could refine (remove some of the contigs in order to reduce the contamination/redundancy) the bins/MAGs that we get at the end. These are some other tools that we recommend taking a look at:

  • We typically use Anvi'o for MAG assembly and binning, and have previously modified this pipeline for our own projects. We're currently working on an Anvi'o Standard Operating Procedure (SOP) for Microbiome Helper and will update this when it is live.
  • We have found PathoFact to be useful for identifying AMR genes, toxins and virulence factors in MAGs
  • It is worth noting that a hybrid assembly of long and short metagenomic reads often yields better assembly statistics - in this kind of approach the long reads will often span multiple genomic regions while short reads tend to be more accurate than long reads. This means that the long reads can be used as a "scaffold" for the short reads to be aligned to, and the short reads can be used to correct errors in the long reads. There are multiple bioinformatic methods being developed for this type of assembly, and we are also working on a SOP for this.

Bonus tutorial

If you've finished working through everything here and would like to do something extra, then we also have a functional annotation tutorial that you can work through if you would like to.

In the previous workshop, we did both taxonomic and functional annotation, so this is a follow-on from the taxonomic annotation tutorial that you have already done. You can find the original tutorial here, and I've modified it so that it works here now.

First of all, download the additional data needed:

cd ~/workspace
wget -O metagenome_tutorial.tar.gz https://www.dropbox.com/sh/cn06suaq77di9sf/AADpERJGxme7xfON-rLoOE2Oa/metagenome_tutorial.tar.gz?dl=1
tar -xvf metagenome_tutorial.tar.gz
rm metagenome_tutorial.tar.gz
cd metagenome_tutorial
tar -xvf MMSeqs2_db.tar.gz
cd ..

Teaching Objectives

  1. Learn how to generate functional profiles using MMseqs2 in combination with Kraken2.
  2. Visualize the resulting taxonomic and functional profiles using JARRVIS.

Bioinformatic tool citations

Properly citing bioinformatic tools is important to ensure that developers are getting the proper recognition. If you use any of the commands in this tutorial for your own work be sure to cite the relevant tools listed below!

1. Determining functional profiles using MMseqs2

Now that we have an idea of the taxonomic profiles of our samples we can also look at their functional potential. To do this we will be using MMseqs2 along with a few scripts developed by our lab to connect the functional annotations to the taxonomic annotations.

MMseqs2 works by taking sequenced reads, translating them into protein and then mapping them against a protein database. In this case we will be using UniRef90, a large protein database clustered at 90% identity. Our lab has developed a set of scripts to run this tool and use the information to assign both a taxonomic ID and a functional ID to each read within our samples.

Activate conda environment

As before, we've installed all of the tools that we'll need for this part of the tutorial into an Anaconda environment, which we can activate with the following command:

conda activate functional

Job file generation

One way of running multiple commands in the terminal one after another is to create a job file. This job file contains all of the commands of interest along with all of the parameters needed to run those commands. We can use the following command to create a job file for each sample that we would like to process with MMseqs2:

mkdir functional 
cd functional
ln -s ~/CourseData/Module4/Data/Functional_Helper_Scripts/ .
ln -s ~/workspace/taxonomy/cat_reads/ .
/usr/bin/perl Functional_Helper_Scripts/run_TaxonomyFunctionSearchMegan.pl --mapmethod mmseqs \
--db ~/workspace/metagenome_tutorial/MMSeqs2_db/mmseqsUniref90DB -p 4 -o mmseqs_U90_out cat_reads/*fastq

This will give a warning about not having given a MEGAN path, but that's fine because we don't want to run MEGAN!

This should create ten separate job files, one for each sample. While you don't need to look in these files, it's always a good idea to understand what your bioinformatic programs are doing! Let's take a look at one of the job files using the less command. You should see three lines, each containing a separate command:

mmseqs createdb cat_reads/CSM7KOMH.fastq mmseqs_U90_out/mmseqs-CSM7KOMHqueryDB

This command creates an MMseqs2 database from the input fastq file. The creation of this database is necessary for MMseqs2 as it vastly increases the speed at which translated DNA sequences can be mapped against a protein database.

mmseqs search mmseqs_U90_out/mmseqs-CSM7KOMHqueryDB MMSeqs2_db/mmseqsUniref90DB mmseqs_U90_out/mmseqs-CSM7KOMHresultDB tmp --db-load-mode 3 --threads 4 --max-seqs 25 -s 1 -a -e 1e-5 > /dev/null 2>&1

This command is the real meat of the job file: it searches the freshly created sample database against the provided UniRef90 protein database. There are a number of parameters in this command:

  • --db-load-mode 3 - This parameter tells MMSeqs how to deal with loading the database into memory. For more information you can check out this page. However, setting this parameter to 3 helps when running MMSeqs on a cluster environment.
  • --threads - The number of processors we want MMSeqs to use during the search
  • --max-seqs 25 - This indicates that we want MMSeqs to output at maximum 25 hits for each sequence
  • -s 1 - This indicates the sensitivity that we want MMSeqs to run at. Increasing this number will lower the speed at which mmseqs runs but will increase its sensitivity. For well-explored environments such as the human gut, a setting of 1 should suffice.
  • -a - This indicates that we want our results to output backtraces for each sequence match. These are needed to convert the resulting MMSeqs file into a usable file format.
  • -e 1e-5 - This indicates that we only want to keep matches that are below an E-value of 1e-5 (E-values are a measure of how well two sequences match one another, and the closer they are to zero, the better the match is).
  • > /dev/null 2>&1 This final part of the command allows us to run the command without having too much text printed to our screen.

The final command allows us to convert the resulting file from the MMSeqs2 format into one that is more usable:

mmseqs convertalis mmseqs_U90_out/mmseqs-CSM7KOMHqueryDB MMSeqs2_db/mmseqsUniref90DB mmseqs_U90_out/mmseqs-CSM7KOMHresultDB mmseqs_U90_out/mmseqs-CSM7KOMH-s1.m8 --db-load-mode 2 > /dev/null 2>&1

This command is similar and takes as input the query database we made with our first command, the UniRef90 database we searched against, and the resulting file from our search command. It will output the file mmseqs_U90_out/mmseqs-CSM7KOMH-s1.m8.

We would run these job files with the following command:

for JOB in mmseqs-*.sh; do sh $JOB; done

This loop runs each of the job files that start with mmseqs- and end with .sh, one after the other.

Unfortunately the AWS instances we are using do not have enough memory to run the above three commands in a timely manner (they take over an hour on this AWS configuration). As such, I have already provided the expected output from these commands, and you can copy it to your working directory using the following commands:

cp ~/workspace/metagenome_tutorial/mmseqs_U90_out.tar.gz .
tar -xvf mmseqs_U90_out.tar.gz

(If you already started running the MMseqs2 commands then you can stop them by pressing Ctrl+C.)

All of these commands made a lot of files that we are actually not interested in! We will go ahead and remove all of the files except for those that end in .m8. This will help us save significantly on the amount of hard drive space that is being taken up by these files:

mkdir mmseqs_m8_files
mv mmseqs_U90_out/*.m8 mmseqs_m8_files/
rm -r mmseqs_U90_out

Let's take a quick look at one of the files we just moved into the directory mmseqs_m8_files using the less command.

less mmseqs_m8_files/mmseqs-CSM7KOMH-s1.m8

What you will see is a file in BLAST tabular format, with the following columns:

  • Column 0 - query sequence ID
  • Column 1 - subject (database) sequence ID
  • Column 2 - percent identity
  • Column 3 - alignment length
  • Column 4 - number of mismatches
  • Column 5 - number of gap openings
  • Column 6 - start on the query sequence
  • Column 7 - end on the query sequence
  • Column 8 - start on the database sequence
  • Column 9 - end on the database sequence
  • Column 10 - E-value: the expectation that this alignment is random, given the length of the sequence and the length of the database
  • Column 11 - bit score: the score of the alignment itself

Question 1: How many protein sequences did the sequence CB1APANXX170426:1:1101:2664:65700/2 align with in the sample CSM7KOMH? Which alignment/alignments have the lowest E-value/highest bitscore?
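
One way to tackle Question 1 is to pull out every hit for that read and sort them by E-value (note that the table above numbers the columns from 0, so for sort and awk the E-value is field 11 and the bit score is field 12):

# -g sorts general numeric values, which handles E-values written in scientific notation such as 1e-5
grep "CB1APANXX170426:1:1101:2664:65700/2" mmseqs_m8_files/mmseqs-CSM7KOMH-s1.m8 | sort -g -k11,11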

The next step we need to take is to get the name of the protein sequence that had the best alignment for each sequence read in our samples. We can achieve this by running the command:

mkdir mmseqs_U90_out_tophit
python Functional_Helper_Scripts/pick_uniref_top_hit.py --unirefm8Dir mmseqs_m8_files --output_path mmseqs_U90_out_tophit

Now that we have the best protein match for each of the sequences in our samples, we can begin creating our final data table containing the stratified abundance of each function in our samples. Because we (in the Langille lab) are continually updating some of these scripts, and the ones that we are using here were only recently developed, there are a few different steps that we'll take to get the files into the format that we want them in. Often as we develop things in bioinformatics, we'll try out a lot of different things, and as the protocols mature we can consolidate them into a single script. We have just about reached that point here, but we haven't consolidated things yet, so we'll run these few steps.

First we need to generate an input file that contains information about where all of the results we have generated are located. We can generate this table using the following commands:

  • The 1st step is to generate the sample tags; we can use either the concatenated fastq files or the kraken output for extracting the sample tag IDs: ls -1 ~/workspace/taxonomy/kraken2_outraw/*_0.1_RefSeqCompleteV205.kraken.txt |sed -e 's/.*\///g'|sed -e 's/\_0.1_RefSeqCompleteV205\.kraken\.txt//g' > sample-tags.txt

  • Next we need the paths to the uniref functional files (parsed to get only the tophit) and the filetype (refseq, uniref or COG): ls -1 mmseqs_U90_out_tophit/* | sed -e 's/$/\tuniref/g' > uniref-hits-list.txt

  • Similarly, we would need the paths to the kraken2 classification files and the corresponding filetypes (note that here we're taking the kraken2 output from the RefSeqCompleteV205 database run with a confidence threshold of 0.1): ls -1 ~/workspace/taxonomy/kraken2_outraw/*_0.1_RefSeqCompleteV205.kraken.txt | sed -e 's/$/\tkraken2/g' > kraken2-results-list.txt

  • Finally, we need the list of the m8 output files for each sample: ls -1 mmseqs_m8_files/* > m8-list.txt

  • The last step is to paste all of these files together: paste sample-tags.txt kraken2-results-list.txt uniref-hits-list.txt m8-list.txt > multi-sample-outfiles-w-m8.txt

It is always good practice to check the files that you've just made - if something went wrong then you want to pick up on this as soon as possible rather than trying to work out where something went wrong later on!
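
For example, a quick check here is that each of the list files has one line per sample and that the pasted master file contains the columns you expect:

wc -l sample-tags.txt kraken2-results-list.txt uniref-hits-list.txt m8-list.txt multi-sample-outfiles-w-m8.txt
head -1 multi-sample-outfiles-w-m8.txt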

Now that we have this master file we can pass this information into the helper script to add all of it together for our samples:

ln -s ~/workspace/metagenome_tutorial/MMSeqs2_db/*.pbz2 .
python Functional_Helper_Scripts/parse_TaxonomyFunction.py --multisample multi-sample-outfiles-w-m8.txt --outputf Workshop_strat_matrix_RPKM.txt --stratified Y --map2EC Y

The first command (the symbolic link) grabs all of the databases that contain information about the length of each gene in our UniRef90 protein database. This will be important for normalizing the abundance of each functional classification.

The second command will generate a final stratified data matrix that shows the abundance of each EC number stratified by the different taxonomic classifications within the sample. This script also normalizes the abundances of each of these ECs into reads per kilobase per million mapped reads (RPKM). This abundance metric is the number of reads that mapped to that EC number per million reads mapped within the sample, divided by the gene length (in kilobases) of that EC. It's important that the abundances are normalized by gene length, or there would be an unfair bias toward longer genes due to the higher chance of them being sequenced. We can also run the same command as above without the --stratified Y option to generate functional profiles that are not broken down by the contributing taxa.
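
As a purely hypothetical worked example of that normalisation (assuming, as in the standard RPKM definition, that the gene length is expressed in kilobases): 500 reads mapping to an EC whose gene is 1.5 kb long, in a sample with 2 million mapped reads, gives 500 / (1.5 × 2) ≈ 167 RPKM, e.g.:

echo "500 / (1.5 * 2)" | bc -l    # ≈ 166.7 RPKM for this made-up example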

Wait a minute, my command ended with the output "Killed"!

Do not panic, this was expected: the AWS instance we are running on does not have enough memory to load the database that contains the lengths of each gene in our UniRef90 database. However, I have already generated the expected output from this command, and we can copy it over to our working directory:

cp ~/workspace/metagenome_tutorial/other_files/Workshop_strat_matrix_RPKM.txt .

Let's take a look at this file with the less command:

less Workshop_strat_matrix_RPKM.txt

Question 2: What is the RPKM abundance of EC:6.3.5.2 contributed by Bacteroides intestinalis in the sample CSM79HR8?

2. Adding descriptions and calculating pathway abundances using PICRUSt2

Using scripts from PICRUSt2, we can add descriptions (i.e. the names for each EC number) or generate stratified pathway abundances. Before we can do this we need to make some minor formatting adjustments: the first sed command below replaces the "|" characters with tabs, and the second adds a "sequence" column name to the header line:

sed -e "s/|/\t/g" Workshop_strat_matrix_RPKM.txt > Workshop_strat_matrix_RPKM_fixed.txt
sed -e "s/function/function\tsequence/g" Workshop_strat_matrix_RPKM_fixed.txt > Workshop_strat_matrix_RPKM_fixed_final.txt

Now we will activate the PICRUSt2 conda environment:

conda activate picrust2

We can add descriptions using the following command:

add_descriptions.py -i Workshop_strat_matrix_RPKM_fixed_final.txt -m EC -o Workshop_strat_matrix_RPKM_fixed_des.txt

Question 3: What is the name of the enzyme with the EC number EC:6.3.5.2?

Next we will generate stratified pathway abundances using the following command:

pathway_pipeline.py -i Workshop_strat_matrix_RPKM_fixed_final.txt -p 4 -o pathways_out --wide_table

The results of this command will be in the pathways_out folder.

Question 4: How many total pathways were identified?

Question 5: How many taxa contribute to the pathway PANTO-PWY?

3. Visualising with JARRVIS

Now we'll visualise the differences between our samples using a new tool, JARRVIS, that is currently being developed in the Langille lab (it's currently in beta testing!). We will just need to run a few commands to get our data tables into the format that the tool is expecting.

In some of the above steps, we generated output tables that are stratified by taxa as well as by function. Now, we want to further stratify these so that they are separated by sample with only a single abundance column. We also want to change the taxonomy that we have so that instead of each just being listed as a species name, it will have all levels of the taxonomy.

First, we'll add together two of the files that we've already made to make an input file, and then remove the last column of this file:

paste sample-tags.txt kraken2-results-list.txt > input_for_taxonomy_abundance_script_int.txt
cut -d$'\t' -f1-2 input_for_taxonomy_abundance_script_int.txt > input_for_taxonomy_abundance_script.txt

Now we can run the script to get all of the taxonomy levels for our output:

conda activate functional
python Functional_Helper_Scripts/add_6levelsTaxonomy_to_Kraken2Results.py --taxafilelist input_for_taxonomy_abundance_script.txt --outfile kraken2_abundance_matrix-6level-tax.txt

This should print some information about what has been created to the screen, and we can take a look at the kraken2_abundance_matrix-6level-tax.txt file with the less command. Most of the information in this file isn't needed for what we're doing (but as mentioned above, this isn't yet a very mature pipeline!).

Next, we'll convert our abundance file into the format that is needed for the next script. We'll use the stratified pathway abundance file here because, with fewer pathways than EC numbers, it will be simpler to view, although this kind of visualisation can also be done with the EC numbers file (Workshop_strat_matrix_RPKM_fixed_final.txt):

gunzip pathways_out/path_abun_strat.tsv.gz
python Functional_Helper_Scripts/convert_picrust2_to_stratified_output_format.py --filename pathways_out/path_abun_strat.tsv --outfilename stratified-file-formatted.tsv

Finally, we'll run another script to convert this table to the right format for visualisation:

python Functional_Helper_Scripts/convert_stratifiedRpkm_to_SankeyFormat.py --StratFileName stratified-file-formatted.tsv --taxaAbundFile kraken2_abundance_matrix-6level-tax.txt --outfile pathways-stratified-SankeyFormat.txt

If you take a look at this file with the less command, then it should look something like this:

Sample  Genus   Gene    Contribution
CSM79HR8        p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__NA;g__NA;s__NA      PANTO-PWY       0.0
CSM79HR8        p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__NA PANTO-PWY       0.0
CSM79HR8        p__NA;c__NA;o__NA;f__NA;g__NA;s__NA     PANTO-PWY       0.0
CSM79HR8        p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides caccae PWY-6700        0.0
CSM79HR8        p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides uniformis      PWY-6700        0.0

We'll just copy across the metadata file to our directory:

cp ~/workspace/metagenome_tutorial/mgs_metadata.txt .

You can now download both the mgs_metadata.txt and the pathways-stratified-SankeyFormat.txt files to your computer. You can right click them and select "Save link as..." (or similar) to download them.

Open RStudio on the server and run:

library(shiny)
runGist("943ff5fdbd94815cc27f302d9f56ff0b")

A window should pop up and you can continue with the tutorial.

You can now select the pathways-stratified-SankeyFormat.txt file for the first box and the mgs_metadata.txt file for the second box. On the "Metadata categories" dropdown you should choose "disease_state", and after a few seconds, it should automatically come up with a plot. If it initially looks like you have an error message, don't panic - you just need to choose the disease_state category before the tool can visualise the data.

You should now see "CD" (Crohn's disease) and "nonIBD" on the left hand side, taxa names in the middle and pathway names on the right. Hovering over the lines will highlight them and give you information on what they connect.

Note that there is an RPKM threshold filter of 10 currently applied - usually it is a good idea to filter out very low abundance pathways/taxa as these can add a lot of confusion to visualisations, but as we are working with small, sub-sampled samples here, it's not really necessary to apply this as everything fits on fine as it is.

Question 6: Which samples is the Streptococcus genus found in?

Question 7: Are there any pathways that are only found in one sample and not the other?