StrainPhlAn3 - biobakery/biobakery GitHub Wiki
StrainPhlAn 3.0 tutorial
StrainPhlAn is a tool for strain-level resolution of species across a large sample sets, based on single nucleotide polymorphisms (SNPs) within conserved and unique species marker genes. The first step in the StrainPhlAn workflow is to run MetaPhlAn 3.0.
- For information on the first generation of StrainPhlAn, see the StrainPhlAn 1.0 tutorial.
- For information on MetaPhlAn 3.0, see the MetaPhlAn User Manuals.
- Please direct questions to the bioBakery forum.
- For additional information, see the StrainPhlAn User Manual.
Table of contents
Overview
The following figure shows the workflow of StrainPhlAn.
Installation
NOTE if you are running StrainPhlAn in bioBakery (either locally in the VM or in Google Cloud) than you do not need to install StrainPhlAn because StrainPhlAn and its dependencies are already preinstalled. You can then proceed to the How to run section.
With MetaPhlAn 3.0
StrainPhlan 3.0 must be installed together with MetaPhlan 3.0. You can found the installation instructions here.
How to run
Follow these steps to learn how to process a set of 25 samples with
StrainPhlAn. For the purpose of this tutorial, we will use the following
set of 25 input files that have been subsampled for rapid analysis.
These 25 files are stool microbiome shotgun sequence files from 14
different projects,and 9 different countries. StrainPhlAn can take as
input .fasta
, .fastq
, .fasta.gz
, and .fastq.gz
files. Download
here
the set of 25 demo files to get started on the tutorial.
Sample | Study | Country |
---|---|---|
CMD64776337ST-21-0 | ZellerG_2014 | DEU |
G78505 | VatanenT_2016 | RUS |
G88884 | SchirmerM_2016 | NLD |
G88970 | SchirmerM_2016 | NLD |
G89027 | SchirmerM_2016 | NLD |
H2M514903 | LiJ_2017 | CHN |
H3M518116 | LiJ_2017 | CHN |
HD-1 | QinN_2014 | CHN |
HD-5 | QinN_2014 | CHN |
HV-6 | QinN_2014 | CHN |
LD-48 | QinN_2014 | CHN |
M1.42.ST | BritoIL_2016 | FJI |
M2.48.ST | BritoIL_2016 | FJI |
M2.58.ST | BritoIL_2016 | FJI |
N021 | WenC_2017 | CHN |
RA023 | WenC_2017 | CHN |
S353 | KarlssonFH_2013 | SWE |
SID530054 | FengQ_2015 | AUT |
SRS011302 | HMP_2012 | USA |
SZAXPI003417-4 | YuJ_2015 | CHN |
T2D-025 | QinJ_2012 | CHN |
T2D-063 | QinJ_2012 | CHN |
T2D-105 | QinJ_2012 | CHN |
W1.27.ST | BritoIL_2016 | FJI |
YSZC12003_36795 | XieH_2016 | GBR |
- Why is it more informative to perform strain tracking on a collection of related metagenomes, rather than a single metagenome, or metagenomes from extremely different environments?
There are four steps to run the subsequent StrainPhlAn analysis.
Step 1: Run MetaPhlAn
The first step is to run MetaPhlAn 3.0 to obtain the sam output files. The sam files contain the alignment information from mapping the reads of each sample against the MetaPhlAn3 marker database. For that, run the following Script:
mkdir -p sams
mkdir -p bowtie2
mkdir -p profiles
for f in input/*
do
echo "Running metaphlan 3.0 on ${f}"
bn=$(basename ${f%fastq.bz2})
metaphlan $f --input_type fastq -s sams/${bn}.sam.bz2 --bowtie2out bowtie2/${bn}.bowtie2.bz2 -o profiles/${bn}_profile.tsv
done
If you would like to run with multiple cores, add the option --nproc
.
The sam output files generated from the commands above, can be downloaded here.
- Why can't this step use the "simple" bowtie2 intermediate files that MetaPhlAn saves by default?
Step 2: Run sample_to_markers
Providing the sam output files, from the prior step, to the sample_to_markers script will generate a marker file for each sample. The marker files contain the consensus of unique marker genes for each species found in the sample, which will be used for SNP profiling. Run the command below to generate the marker files in your current working directory:
mkdir consensus_markers
sample2markers.py -i sams_simplified/*.sam.bz2 -o consensus_markers --nproc 4
If you would like to run with multiple cores, add the option --nproc
.
The marker output files generated from the commands above, can be downloaded if you're in a hurry here.
- Try looking at the .pkl files using less. What happens, and why? What information is contained in these files, and why not store it as plain text?
Step 3: Generate trees from alignments
Run StrainPhlAn to generate alignments and then trees, providing the sample marker files and the clade reference marker file. Also provide some genome sequence for E. rectale, for comparative purposes, which can be downloaded here:
mkdir clade_markers
mkdir output
extract_markers.py -c s__Eubacterium_rectale -o clade_markers
strainphlan -s consensus_markers/*.pkl -m clade_markers/s__Eubacterium_rectale.fna -r reference_genomes/*.fna -o output -c s__Eubacterium_rectale --phylophlan_mode fast --nproc 4
If you're in a hurry, the marker reference file can be also downloaded from here.
The tree generated from the command above, can be downloaded from the following link:
The alignment file generated from the command above, can be downloaded from the following link:
- Explain the contents of the s__Eubacterium_rectale.StrainPhlAn3_concatenated.aln file.
- Explain the format and contents of the phylogenetic tree output file.
- Explain the format and contents of the multiple alignment output file.
Visualize results
The sections that follow cover some of the common methods to visualize
the results of a StrainPhlAn anlaysis. We will use the ETE Toolkit for
quick visualization online. We will also visualize StrainPhlAn trees
with the GraPhlAn
and the
ggtree
R package. NOTE that GraPhlAn and ggtree packages are already
preinstalled in bioBakery so you don't need to install them. We will be working from the output
directory.
A list of tree editors for exploring phylogenetic trees can be found here.
Multiple sequence alignment files can be explored with the R package msa and interactively with Jalview.
Visualization with ETE
The tree and multiple marker sequence alignment file from the previous tutorial steps can be input together for an online visualization using the ETE Toolkit.
In your browser, go to the ETE Toolkit Phylogenetic tree
viewer. Upload the tree and merged
marker alignment file from the previous tutorial steps. Select
Condensed format
from the Alignment image type drop down menu. Click
on View Tree!
.
The tree generated will appear as follows:
Visualization with GraPhlAn
Step 1: Install GraPhlAn with Homebrew: :
brew tap biobakery/biobakery
brew install graphlan
This will install GraPhlAn and all of its dependencies.
Alternatively, you can manually install GraPhlAn from source by downloading GraPhlAn and then installing the GraPhlAn dependencies (numpy, pandas, biopython, scipy, and matplotlib). NOTE that if you are running this tutorial in the bioBakery VM then you do not need to install GraPhlAn because GraPhlAn and its dependencies are already preinstalled.
Step 2: Visualize the results
First add the metadata to your tree. Download the metadata.txt file for the steps that follow, which should be run from within the output folder. :
add_metadata_tree.py --ifn_trees RAxML_bestTree.s__Eubacterium_rectale.StrainPhlAn3.tre --ifn_metadata ../metadata.txt
Next provide the tree with metadata to create a dendrogram. :
plot_tree_graphlan.py --ifn_tree RAxML_bestTree.s__Eubacterium_rectale.StrainPhlAn3.tre.metadata --colorized_metadata Country --leaf_marker_size 60 --legend_marker_size 60
The resulting image is shown below:
Visualization with ggtree
Follow these steps to create dendrograms and an ordination plot using StrainPhlAn output files, ggtree and distmat.
Step 1: Install R and the following dependencies:
- optparse
- ggplot2
- ggtree
- RColorBrewer
- vegan
Step 2: Generate dendrogram
Run the following script from within the output folder, to generate two dendrograms, providing the output tree from StrainPhlAn along with a metadata file. Download the metadata.txt file for the steps that follow. :
strainphlan_ggtree_vis.R RAxML_bestTree.s__Eubacterium_rectale.StrainPhlAn3.tre ../metadata.txt s__Eubacterium_rectale.StrainPhlAn3_concatenated.aln strainphlan_tree_1.png strainphlan_tree_2.png
The dendrogram created will look like the following images:
Here we are decorating the dendrogram with Country metadata.
The second dendrogram is also decorated with a slice of the multiple sequence alignment of unique gene marker sequences, where colors in the alignment reveal SNPs.
Step 3: Generate an ordination plot.
Next, we can use the multiple sequence alignment file to generate a
phylogenetic distance matrix that contains the pairwise nucleotide
substitution rate between strains. We will use the distmat
function
from the EMBOSS package. NOTE that if you are running the analysis
in bioBakery then you do not need to install distmat
because EMBOSS is
already installed. Otherwise you can generate a distance matrix using
the online implementation of
distmat. Once we
have the distance matrix, we will use it to perform PCoA (Principal
Coordinate Analysis).
Enter this command from your terminal: :
distmat
Following this command provide the MSA input file.
Create a distance matrix from a multiple sequence alignment
Input aligned sequence set: s__Eubacterium_rectale.StrainPhlAn3_concatenated.aln
When prompted for the correction method, enter 2
(Kimura 2-parameter).
When prompted for output file, press the Enter
key to accept the
default name [strainphlan3_concatenated.distmat]
.
The values in the distance matrix represent the nucleotide substitution rate, i.e. the number of substitutions per 100 nucleotides of sequence between a given pair of samples.
- How does the Kimura 2-parameter distance differ from counting the number (or percentage) of non-identical sites in an alignment of two nucleotide sequences?
- Given that 97% nucleotide identity is a common threshold for defining that two genomes derive from the same species, are the distances that you're measuring here reasonable for strains of the same species?
Now run the following command to generate an ordination plot from the distance matrix: :
strainphlan_ordination_vis.R strainphlan3_concatenated.distmat ../metadata.txt strainphlan_ordination.png
The plot created will look like that below:
- What do you observe about the relative similarity of strains in this visualizations?
- In particular, how do strains from the same country compare to those from other countries?
- How does the reference strain compare to the metagenomic strains?
- How might you assign a measure of statistical confidence to your answers above?
- Which visualization method do you prefer? What are the strengths and limitations of the different visualization methods?