Microbiome Helper 2 Getting a phylogenetic taxonomic tree (metagenome) - LangilleLab/microbiome_helper GitHub Wiki

Authors: Robyn Wright Modifications by: NA

Please note: We are still testing/developing this so use with caution :)

Introduction

While with marker gene analyses, we're often able to insert our reads into a phylogenetic tree (or build a phylogenetic tree de novo), the same isn't often the case with read-based metagenome analyses.

One way around this is to use reference genomes (and phylogenetic trees) for the taxa that we have identified in our samples. We previously did this for Figure 1 in this paper. We did this by matching the NCBI taxonomy ID's from our Kraken2 classifications to the GTDB bacterial and archaeal phylogenetic trees. This page will give instructions on doing this.

Note that we recommend using these trees for visualisation purposes only as their accuracy will depend on the accuracy of your taxonomic assignments!

Prerequisites

It is assumed that you have either already run our Kraken2/Bracken workflow and have a Bracken output file that contains NCBI taxonomy ID's. We may consider developing this script to be more flexible as to the input type in the future, if that is of interest.

1. Get GTDB files

wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release226/226.0/ar53_r226.tree
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release226/226.0/bac120_r226.tree
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release226/226.0/ar53_metadata_r226.tsv.gz
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release226/226.0/bac120_metadata_r226.tsv.gz

Unzip:

gunzip *metadata_r226.tsv.gz

Activate environment with needed programs:

mamba activate get_tree

2. Filter taxa and save reduced trees

First get the script:

wget http://kronos.pharmacology.dal.ca/public_files/MH2/scripts/filter_gtdb_tree_on_bracken_output.py

Now make an output folder for these files, and run the script:

mkdir gtdb_filtered
python filter_gtdb_tree_on_bracken_output.py      --bracken bracken_out_merged/merged_output.species.bracken \
                                                  --outprefix gtdb_filtered/CAMI_marine_GTDB_tree \
                                                  --archaea ar53_metadata_r226.tsv \
                                                  --bacteria bac120_metadata_r226.tsv \
                                                  --arc_tree ar53_r226.tree \
                                                  --bac_tree bac120_r226.tree

We're giving this script:

--bracken: the merged (species-level - this is important!) Bracken output file
--outprefix: the prefix to give out output files (this can optionally include a folder within it)
--archaea: the path to the GTDB archaea metadata file (that we downloaded and unzipped above)
--bacteria: the path to the GTDB bacteria metadata file (that we downloaded and unzipped above)
--arc_tree: the path to the GTDB archaea tree file (that we downloaded and unzipped above)
--bac_tree: the path to the GTDB bacteria tree file (that we downloaded and unzipped above)

The output that you'll get from this will be four files:

outprefix_ar53_r226_bracken.tree: the GTDB archaeal tree file filtered to only include the taxa that match the NCBI taxonomy ID's in your Bracken output
outprefix_bac120_r226_bracken.tree: the GTDB bacterial tree file filtered to only include the taxa that match the NCBI taxonomy ID's in your Bracken output
outprefix_bracken_merged_output.species.bracken: the same file as you gave above with --bracken, but with three extra columns: gtdb_accession, gtdb_taxonomy and gtdb_domain. These tell you the GTDB genome accession (so you can match up with the names in the filtered tree files), the full GTDB taxonomy, and the domain that they match (so you know which tree file to use), respectively.
outprefix_fraction_reads_remaining_in_GTDB_tree.tsv: a file that gives a summary of the proportion of your reads that are represented within these two trees