StrainPhlAn3 - biobakery/biobakery GitHub Wiki

StrainPhlAn 3.0 tutorial

StrainPhlAn is a tool for strain-level resolution of species across a large sample sets, based on single nucleotide polymorphisms (SNPs) within conserved and unique species marker genes. The first step in the StrainPhlAn workflow is to run MetaPhlAn 3.0.

For information on the first generation of StrainPhlAn, see the StrainPhlAn 1.0 tutorial.
For information on MetaPhlAn 3.0, see the MetaPhlAn User Manuals.
Please direct questions to the bioBakery forum.
For additional information, see the StrainPhlAn User Manual.

Table of contents

Overview
Installation
- With MetaPhlAn 3.0
How to run
Visualize results

Overview

The following figure shows the workflow of StrainPhlAn.

Installation

NOTE if you are running StrainPhlAn in bioBakery (either locally in the VM or in Google Cloud) than you do not need to install StrainPhlAn because StrainPhlAn and its dependencies are already preinstalled. You can then proceed to the How to run section.

With MetaPhlAn 3.0

StrainPhlan 3.0 must be installed together with MetaPhlan 3.0. You can found the installation instructions here.

How to run

Follow these steps to learn how to process a set of 25 samples with StrainPhlAn. For the purpose of this tutorial, we will use the following set of 25 input files that have been subsampled for rapid analysis. These 25 files are stool microbiome shotgun sequence files from 14 different projects,and 9 different countries. StrainPhlAn can take as input .fasta, .fastq, .fasta.gz, and .fastq.gz files. Download here the set of 25 demo files to get started on the tutorial.

Sample	Study	Country
CMD64776337ST-21-0	ZellerG_2014	DEU
G78505	VatanenT_2016	RUS
G88884	SchirmerM_2016	NLD
G88970	SchirmerM_2016	NLD
G89027	SchirmerM_2016	NLD
H2M514903	LiJ_2017	CHN
H3M518116	LiJ_2017	CHN
HD-1	QinN_2014	CHN
HD-5	QinN_2014	CHN
HV-6	QinN_2014	CHN
LD-48	QinN_2014	CHN
M1.42.ST	BritoIL_2016	FJI
M2.48.ST	BritoIL_2016	FJI
M2.58.ST	BritoIL_2016	FJI
N021	WenC_2017	CHN
RA023	WenC_2017	CHN
S353	KarlssonFH_2013	SWE
SID530054	FengQ_2015	AUT
SRS011302	HMP_2012	USA
SZAXPI003417-4	YuJ_2015	CHN
T2D-025	QinJ_2012	CHN
T2D-063	QinJ_2012	CHN
T2D-105	QinJ_2012	CHN
W1.27.ST	BritoIL_2016	FJI
YSZC12003_36795	XieH_2016	GBR

Why is it more informative to perform strain tracking on a collection of related metagenomes, rather than a single metagenome, or metagenomes from extremely different environments?

There are four steps to run the subsequent StrainPhlAn analysis.

Step 1: Run MetaPhlAn

The first step is to run MetaPhlAn 3.0 to obtain the sam output files. The sam files contain the alignment information from mapping the reads of each sample against the MetaPhlAn3 marker database. For that, run the following Script:

mkdir -p sams
mkdir -p bowtie2
mkdir -p profiles
for f in input/*
do
    echo "Running metaphlan 3.0 on ${f}"
    bn=$(basename ${f%fastq.bz2})
    metaphlan $f --input_type fastq -s sams/${bn}.sam.bz2 --bowtie2out bowtie2/${bn}.bowtie2.bz2 -o profiles/${bn}_profile.tsv 
done

If you would like to run with multiple cores, add the option --nproc.

The sam output files generated from the commands above, can be downloaded here.

Why can't this step use the "simple" bowtie2 intermediate files that MetaPhlAn saves by default?

Step 2: Run sample_to_markers

Providing the sam output files, from the prior step, to the sample_to_markers script will generate a marker file for each sample. The marker files contain the consensus of unique marker genes for each species found in the sample, which will be used for SNP profiling. Run the command below to generate the marker files in your current working directory:

 mkdir consensus_markers
 sample2markers.py -i sams_simplified/*.sam.bz2 -o consensus_markers --nproc 4

If you would like to run with multiple cores, add the option --nproc.

The marker output files generated from the commands above, can be downloaded if you're in a hurry here.

Try looking at the .pkl files using less. What happens, and why? What information is contained in these files, and why not store it as plain text?

Step 3: Generate trees from alignments

Run StrainPhlAn to generate alignments and then trees, providing the sample marker files and the clade reference marker file. Also provide some genome sequence for E. rectale, for comparative purposes, which can be downloaded here:

 mkdir clade_markers
 mkdir output
 extract_markers.py -c s__Eubacterium_rectale -o clade_markers
 strainphlan -s consensus_markers/*.pkl -m clade_markers/s__Eubacterium_rectale.fna -r reference_genomes/*.fna -o output -c s__Eubacterium_rectale --phylophlan_mode fast --nproc 4

If you're in a hurry, the marker reference file can be also downloaded from here.

The tree generated from the command above, can be downloaded from the following link:

RAxML_bestTree.s__Eubacterium_rectale.StrainPhlAn3.tre

The alignment file generated from the command above, can be downloaded from the following link:

s__Eubacterium_rectale.StrainPhlAn3_concatenated.aln

Explain the contents of the s__Eubacterium_rectale.StrainPhlAn3_concatenated.aln file.
Explain the format and contents of the phylogenetic tree output file.
Explain the format and contents of the multiple alignment output file.

Visualize results

The sections that follow cover some of the common methods to visualize the results of a StrainPhlAn anlaysis. We will use the ETE Toolkit for quick visualization online. We will also visualize StrainPhlAn trees with the GraPhlAn and the ggtree R package. NOTE that GraPhlAn and ggtree packages are already preinstalled in bioBakery so you don't need to install them. We will be working from the output directory.

A list of tree editors for exploring phylogenetic trees can be found here.

Multiple sequence alignment files can be explored with the R package msa and interactively with Jalview.

Visualization with ETE

The tree and multiple marker sequence alignment file from the previous tutorial steps can be input together for an online visualization using the ETE Toolkit.

In your browser, go to the ETE Toolkit Phylogenetic tree viewer. Upload the tree and merged marker alignment file from the previous tutorial steps. Select Condensed format from the Alignment image type drop down menu. Click on View Tree!.

The tree generated will appear as follows:

Visualization with GraPhlAn

Step 1: Install GraPhlAn with Homebrew: :

 brew tap biobakery/biobakery
 brew install graphlan

This will install GraPhlAn and all of its dependencies.

Alternatively, you can manually install GraPhlAn from source by downloading GraPhlAn and then installing the GraPhlAn dependencies (numpy, pandas, biopython, scipy, and matplotlib). NOTE that if you are running this tutorial in the bioBakery VM then you do not need to install GraPhlAn because GraPhlAn and its dependencies are already preinstalled.

Step 2: Visualize the results

First add the metadata to your tree. Download the metadata.txt file for the steps that follow, which should be run from within the output folder. :

 add_metadata_tree.py --ifn_trees RAxML_bestTree.s__Eubacterium_rectale.StrainPhlAn3.tre --ifn_metadata ../metadata.txt

Next provide the tree with metadata to create a dendrogram. :

 plot_tree_graphlan.py --ifn_tree RAxML_bestTree.s__Eubacterium_rectale.StrainPhlAn3.tre.metadata  --colorized_metadata Country --leaf_marker_size 60 --legend_marker_size 60

The resulting image is shown below:

Visualization with ggtree

Follow these steps to create dendrograms and an ordination plot using StrainPhlAn output files, ggtree and distmat.

Step 1: Install R and the following dependencies:

optparse
ggplot2
ggtree
RColorBrewer
vegan

Step 2: Generate dendrogram

Run the following script from within the output folder, to generate two dendrograms, providing the output tree from StrainPhlAn along with a metadata file. Download the metadata.txt file for the steps that follow. :

 strainphlan_ggtree_vis.R RAxML_bestTree.s__Eubacterium_rectale.StrainPhlAn3.tre ../metadata.txt s__Eubacterium_rectale.StrainPhlAn3_concatenated.aln strainphlan_tree_1.png strainphlan_tree_2.png

The dendrogram created will look like the following images:

Here we are decorating the dendrogram with Country metadata.

The second dendrogram is also decorated with a slice of the multiple sequence alignment of unique gene marker sequences, where colors in the alignment reveal SNPs.

Step 3: Generate an ordination plot.

Next, we can use the multiple sequence alignment file to generate a phylogenetic distance matrix that contains the pairwise nucleotide substitution rate between strains. We will use the distmat function from the EMBOSS package. NOTE that if you are running the analysis in bioBakery then you do not need to install distmat because EMBOSS is already installed. Otherwise you can generate a distance matrix using the online implementation of distmat. Once we have the distance matrix, we will use it to perform PCoA (Principal Coordinate Analysis).

Enter this command from your terminal: :

 distmat

Following this command provide the MSA input file.

Create a distance matrix from a multiple sequence alignment
Input aligned sequence set: s__Eubacterium_rectale.StrainPhlAn3_concatenated.aln

When prompted for the correction method, enter 2 (Kimura 2-parameter).

When prompted for output file, press the Enter key to accept the default name [strainphlan3_concatenated.distmat].

The values in the distance matrix represent the nucleotide substitution rate, i.e. the number of substitutions per 100 nucleotides of sequence between a given pair of samples.

How does the Kimura 2-parameter distance differ from counting the number (or percentage) of non-identical sites in an alignment of two nucleotide sequences?
Given that 97% nucleotide identity is a common threshold for defining that two genomes derive from the same species, are the distances that you're measuring here reasonable for strains of the same species?

Now run the following command to generate an ordination plot from the distance matrix: :

 strainphlan_ordination_vis.R strainphlan3_concatenated.distmat ../metadata.txt strainphlan_ordination.png

The plot created will look like that below:

What do you observe about the relative similarity of strains in this visualizations?
In particular, how do strains from the same country compare to those from other countries?
How does the reference strain compare to the metagenomic strains?
How might you assign a measure of statistical confidence to your answers above?
Which visualization method do you prefer? What are the strengths and limitations of the different visualization methods?