Phylogenetic trees: from qiime2 to R - UofS-EcoTox-Lab/Phylogenetic-Trees GitHub Wiki
Here are the steps I used to create the tree using qiime2-2020.2
First, download the latest qiime2-friendly tree derived from GreenGeenes:
wget -O "sepp-refs-gg-13-8.qza" "https://data.qiime2.org/2019.10/common/sepp-refs-gg-13-8.qza"
You'll have to make sure you've put your files in the correct directory and have qiime2 installed and activated.
Here's the blurb from the qiime2 library:
The move from OTU-based to sOTU-based analysis, while providing additional resolution, also introduces computational challenges. We demonstrate that one popular method of dealing with sOTUs (building a de novo tree from the short sequences) can provide incorrect results in human gut metagenomic studies and show that phylogenetic placement of the new sequences with SEPP resolves this problem while also yielding other benefits over existing methods.
qiime fragment-insertion sepp \
--i-representative-sequences rep-seqs.qza \
--i-reference-database sepp-refs-gg-13-8.qza \
--o-tree insertion-tree.qza \
--o-placements insertion-placements.qza
Export the tree
qiime tools export \
--input-path $PWD/insertion-tree.qza \
--output-path exported-tree
Now we move into R to generate our phylogenetic tree plotted with traits (relative abundance here).
Load the required packages
library(vegan) # Relative abundances
library(scales) # Transparency in plotting
library(ape) # drop.tip
library(adephylo) # Required by phylosignal
library(phylobase) # phylo4d
library(phylosignal) # dotplot.phylo4d
Cleanliness is godliness
rm(list = ls())
Read in the tree and make sure ASV identifiers starting with a numeric have an "X" prepended to match the community data. The "X" get prepended when strings that begin with a number are used for column headings in R. They may have fixed this in the latest R release, but in case you run into this issue, here's how you fix it.
root.tree <- read.tree("insertion-tree.nwk")
x <- root.tree$tip.label
x[grepl("^[:digit:](/UofS-EcoTox-Lab/Phylogenetic-Trees/wiki/:digit:)", x)] <- paste("X", x[grepl("^[:digit:](/UofS-EcoTox-Lab/Phylogenetic-Trees/wiki/:digit:)", x)], sep = "")
root.tree$tip.label <- x
internal standard added, joined using VSEARCH, quality-filtered using q-score-joined, denoised using deblur with a trim length of 400, and classified with classify-sklearn using a custom classifier trimmed to the V3-V5 hypervariable region.
These are three communities I'm interested in. They are bacterial 16S rRNA paired sequences from the V3-V5 region with anThese communities were read into R and corrected relative to the internal standard to allow gene-abundance comparisons among samples. There was also a lot of data filtering to facilitate robust inferences from the data, but hopefully you can read about that soon in our forthcoming paper.
selbal output that indicates balances of bacteria that are linked to canola yield performance, and three data frames that indicate that samples we are interested in.
Read in the data. This consists of a filtered 16S rRNA abundance table (above), the associated taxonomy, acan.asv <- read.csv("asv.root.reps_dupes_merged.prev_and_abun_filt.csv", row.names = 1)
can.tax <- read.csv("tax.root.reps_dupes_merged.prev_and_abun_filt.csv", row.names = 1)
Bal.Tab.full.w3.imp1 <- read.csv("Bal.Tab.full.imp1.w3.csv", row.names = 1)
Bal.Tab.full.w6.imp1 <- read.csv("Bal.Tab.full.imp1.w6.csv", row.names = 1)
Bal.Tab.full.w9.imp1 <- read.csv("Bal.Tab.full.imp1.w9.csv", row.names = 1)
w3.rows <- read.csv("w3.rows.csv", row.names = 1, header = T)
w6.rows <- read.csv("w6.rows.csv", row.names = 1, header = T)
w9.rows <- read.csv("w9.rows.csv", row.names = 1, header = T)
Calculate relative abundance for sizing the dots in our dotplot later.
can.asv.rel0 <- decostand(can.asv, method = "total")*100 range(rowSums(can.asv.rel0)) # Double check that everything sums to 100%
Tidy the selbal outputs.
names(Bal.Tab.full.w3.imp1) <- c("taxa","group")
names(Bal.Tab.full.w6.imp1) <- c("taxa","group")
names(Bal.Tab.full.w9.imp1) <- c("taxa","group")
rownames(Bal.Tab.full.w3.imp1) <- Bal.Tab.full.w3.imp1$taxa
rownames(Bal.Tab.full.w6.imp1) <- Bal.Tab.full.w6.imp1$taxa
rownames(Bal.Tab.full.w9.imp1) <- Bal.Tab.full.w9.imp1$taxa
Tidy the tree (prepend "X"s).
x <- root.tree$tip.label # Extract the ASV (row) names
x[grepl("^[:digit:](/UofS-EcoTox-Lab/Phylogenetic-Trees/wiki/:digit:)", x)] <- paste("X", x[grepl("^[:digit:](/UofS-EcoTox-Lab/Phylogenetic-Trees/wiki/:digit:)", x)], sep = "") # This will prepend an "X" to any name that begins with a number
root.tree$tip.label <- x # Assign the new ASV names to the taxonomy df
Drop the tips that aren't in the dataset - should change from 211,013 tips to 63.
w369.tree <- drop.tip(root.tree, root.tree$tip.label[-match(unique(c(Bal.Tab.full.w3.imp1$taxa,
Bal.Tab.full.w6.imp1$taxa,
Bal.Tab.full.w9.imp1$taxa)),
root.tree$tip.label)])
Tidy the taxonomy (prepend "X"s) and subset to the 63 ASVs of interest.
Do the taxonomy ASV identifiers match the abundance identifiers?
identical(names(can.asv), rownames(can.tax)) # Nope, b/c of the prepended "X" issue.
x <- rownames(can.tax) # Extract the ASV (row) names
x[grepl("^[:digit:](/UofS-EcoTox-Lab/Phylogenetic-Trees/wiki/:digit:)", x)] <- paste("X", x[grepl("^[:digit:](/UofS-EcoTox-Lab/Phylogenetic-Trees/wiki/:digit:)", x)], sep = "") # This will prepend an "X" to any name that begins with a number
rownames(can.tax) <- x # Assign the new ASV names to the taxonomy df
identical(names(can.asv), rownames(can.tax)) # Now the ASV names in the abundance and taxonomy files match
w369.tax <- can.tax[w369.tree$tip.label,] # Subset the taxonomy to the 63 ASVs
Here we will generate a df that specifies the groups we will plot for each of our three communities.
This the is numerator and denominator determined through selbal.
w369.bal.tab <- merge(Bal.Tab.full.w3.imp1,
Bal.Tab.full.w6.imp1,
by = 0,
all = T) # Merge the first two communities by row names
rownames(w369.bal.tab) <- w369.bal.tab$Row.names # Assign the row names
w369.bal.tab$Row.names <- NULL # Remove the 'Row.names' column autogenerated during the merge
w369.bal.tab <- merge(w369.bal.tab,
Bal.Tab.full.w9.imp1,
by = 0,
all = T) # Now merge the first two communities with the third by row names
rownames(w369.bal.tab) <- w369.bal.tab$Row.names # Assign the row names
Remove the columns that are no longer needed
w369.bal.tab$Row.names <- w369.bal.tab$taxa.x <- w369.bal.tab$taxa.y <- w369.bal.tab$taxa <- NULL
names(w369.bal.tab) <- c("w3.group",
"w6.group",
"w9.group") # Assign more useful column names
Subset relative abundance df to include only the 63 of interest
w369.rel0 <- can.asv.rel0[,unique(c(Bal.Tab.full.w3.imp1$taxa,
Bal.Tab.full.w6.imp1$taxa,
Bal.Tab.full.w9.imp1$taxa))] # This generates a list of unique ASV names and uses that to subset the relative abundance df by
Subset the relative abundance df to include only the samples of interest
w3.relabu <- w369.rel0[w3.rows$x,]
w6.relabu <- w369.rel0[w6.rows$x,]
w9.relabu <- w369.rel0[w3.rows$x,]
Calculate mean relative abundances for each taxa within each community
w3.relabu.means <- as.data.frame(colMeans(w3.relabu))
w6.relabu.means <- as.data.frame(colMeans(w6.relabu))
w9.relabu.means <- as.data.frame(colMeans(w9.relabu))
Check that the row names match so you can cbind the mean abundance data together
identical(rownames(w3.relabu.means),
rownames(w6.relabu.means)) # TRUE
identical(rownames(w3.relabu.means),
rownames(w9.relabu.means)) # TRUE
w369.relabu.means <- cbind.data.frame(w3.relabu.means,
w6.relabu.means,
w9.relabu.means) # Merging the data
names(w369.relabu.means) <- c("w3.comm",
"w6.comm",
"w9.comm") # Useful column names
Add taxonomy to the mean abundance df
w369.tax.rel.merge <- merge(w369.tax,
w369.relabu.means,
by = 0) # Merge by row names
rownames(w369.tax.rel.merge) <- w369.tax.rel.merge$Row.names # Assign ASV ids as row names
w369.tax.rel.bal.merge <- merge(w369.tax.rel.merge,
w369.bal.tab, by = 0) # Now merge this with the grouping (num/den) information
rownames(w369.tax.rel.bal.merge) <- w369.tax.rel.bal.merge$Row.names # Assign row names
w369.tax.rel.bal.merge$Row.names <- w369.tax.rel.bal.merge$Row.names <- NULL # Remove the Row.names column
w369.tax.rel.bal.merge <- w369.tax.rel.bal.merge[w369.tree$tip.label,] # Ensure this merged df is in the same order as the tree
identical(rownames(w369.tax.rel.bal.merge), w369.tree$tip.label) # Double-checking the above
Here we'll make a dataframe that will contain a number of plotting parameters (color, etc.)
Convert the grouping column to character.
w369.tax.rel.bal.merge$w3.group <- as.character(w369.tax.rel.bal.merge$w3.group)
w369.tax.rel.bal.merge$w6.group <- as.character(w369.tax.rel.bal.merge$w6.group)
w369.tax.rel.bal.merge$w9.group <- as.character(w369.tax.rel.bal.merge$w9.group)
Make empty abundance columns for each group within each community.
w369.tax.rel.bal.merge$w3.abu.den <- 0
w369.tax.rel.bal.merge$w6.abu.den <- 0
w369.tax.rel.bal.merge$w9.abu.den <- 0
w369.tax.rel.bal.merge$w3.abu.num <- 0
w369.tax.rel.bal.merge$w6.abu.num <- 0
w369.tax.rel.bal.merge$w9.abu.num <- 0
Assign abundances based on group.
w369.tax.rel.bal.merge$w3.abu.den[which(w369.tax.rel.bal.merge$w3.group == "DEN")] <- w369.tax.rel.bal.merge$w3.comm[which(w369.tax.rel.bal.merge$w3.group == "DEN")]
w369.tax.rel.bal.merge$w3.abu.num[which(w369.tax.rel.bal.merge$w3.group == "NUM")] <- w369.tax.rel.bal.merge$w3.comm[which(w369.tax.rel.bal.merge$w3.group == "NUM")]
w369.tax.rel.bal.merge$w6.abu.den[which(w369.tax.rel.bal.merge$w6.group == "DEN")] <- w369.tax.rel.bal.merge$w6.comm[which(w369.tax.rel.bal.merge$w6.group == "DEN")]
w369.tax.rel.bal.merge$w6.abu.num[which(w369.tax.rel.bal.merge$w6.group == "NUM")] <- w369.tax.rel.bal.merge$w6.comm[which(w369.tax.rel.bal.merge$w6.group == "NUM")]
w369.tax.rel.bal.merge$w9.abu.den[which(w369.tax.rel.bal.merge$w9.group == "DEN")] <- w369.tax.rel.bal.merge$w9.comm[which(w369.tax.rel.bal.merge$w9.group == "DEN")]
w369.tax.rel.bal.merge$w9.abu.num[which(w369.tax.rel.bal.merge$w9.group == "NUM")] <- w369.tax.rel.bal.merge$w9.comm[which(w369.tax.rel.bal.merge$w9.group == "NUM")]
Create empty color columns to use for plotting relative abundances.
w369.tax.rel.bal.merge$w3.col <- as.character(NA)
w369.tax.rel.bal.merge$w6.col <- as.character(NA)
w369.tax.rel.bal.merge$w9.col <- as.character(NA)
Assign colors abundances > 0.
w369.tax.rel.bal.merge$w3.col[w369.tax.rel.bal.merge$w3.group == "DEN"] <- "#b94663"
w369.tax.rel.bal.merge$w3.col[w369.tax.rel.bal.merge$w3.group == "NUM"] <- "#5794d7"
w369.tax.rel.bal.merge$w6.col[w369.tax.rel.bal.merge$w6.group == "DEN"] <- "#b94663"
w369.tax.rel.bal.merge$w6.col[w369.tax.rel.bal.merge$w6.group == "NUM"] <- "#5794d7"
w369.tax.rel.bal.merge$w9.col[w369.tax.rel.bal.merge$w9.group == "DEN"] <- "#b94663"
w369.tax.rel.bal.merge$w9.col[w369.tax.rel.bal.merge$w9.group == "NUM"] <- "#5794d7"
Colors for abundances == 0.
w369.tax.rel.bal.merge$w3.col[is.na(w369.tax.rel.bal.merge$w3.col)] <- "white"
w369.tax.rel.bal.merge$w6.col[is.na(w369.tax.rel.bal.merge$w6.col)] <- "white"
w369.tax.rel.bal.merge$w9.col[is.na(w369.tax.rel.bal.merge$w9.col)] <- "white"
Extract for plotting.
w3.dot.col <- w369.tax.rel.bal.merge$w3.col
w6.dot.col <- w369.tax.rel.bal.merge$w6.col
w9.dot.col <- w369.tax.rel.bal.merge$w9.col
Add unique genus column (numbers appended).
w369.tax.rel.bal.merge$genus2 <- make.unique(w369.tax.rel.bal.merge$genus) # Unique identical(rownames(w369.tax.rel.bal.merge), w369.tree$tip.label) # TRUE
Make a single abundance column for ease of plotting.
w369.tax.rel.bal.merge$w3.abundance <- c(w369.tax.rel.bal.merge$w3.abu.num + w369.tax.rel.bal.merge$w3.abu.den)
w369.tax.rel.bal.merge$w6.abundance <- c(w369.tax.rel.bal.merge$w6.abu.num + w369.tax.rel.bal.merge$w6.abu.den)
w369.tax.rel.bal.merge$w9.abundance <- c(w369.tax.rel.bal.merge$w9.abu.num + w369.tax.rel.bal.merge$w9.abu.den)
Make a column to size the points to best encompass the range of abundances.
pt.cex <- abs(c(abs(w369.tax.rel.bal.merge$w3.abundance),abs(w369.tax.rel.bal.merge$w6.abundance),abs(w369.tax.rel.bal.merge$w9.abundance)))
pt.cex[pt.cex>0] <- rescale(pt.cex[pt.cex>0], to = c(1, 8)) # pt.cex size range from 1 to 8
Extract for plotting
w3.pt.cex <- pt.cex[1:63]
w6.pt.cex <- pt.cex[64:126]
w9.pt.cex <- pt.cex[127:189]
Make the phylo4d object with the pruned tree and plotting parameter df. The parameter df can only include numeric columns, otherwise you'll get this error: "Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric" when plotting the dotplot.
p4d <- phylo4d(w369.tree, w369.tax.rel.bal.merge[c(24:26)])
Set the y-values for plotting locations that will appear on concentric circles laterwith 'data.xlim = c(0,1000)'
p4d@data$w3.abundance <- c(0.2, rep(c(0.01,0.001), 31))
p4d@data$w6.abundance <- c(0.2, rep(c(0.01,0.001), 31))
p4d@data$w9.abundance <- c(0.2, rep(c(0.01,0.001), 31))
Generate a column of colors to use for the tree edges.
edge.col <- rep("black", Nedge(w369.tree))
edge.col[1:81] <- "#ca7b0a" # Proteobacteria
edge.col[84:101] <- "#26b223" # Actinobacteria
edge.col[102] <- "#3f62fb" # Chloroflexi
edge.col[103:105] <- "#f806a0" # Acidobacteria
edge.col[106:122] <- "#a53ec7" # Bacteroidetes
Generate colors to use for the genera labels.
w369.tax.rel.bal.merge$phy.col <- NA
w369.tax.rel.bal.merge$phy.col[w369.tax.rel.bal.merge$phylum == "Proteobacteria"] <- "#ca7b0a"
w369.tax.rel.bal.merge$phy.col[w369.tax.rel.bal.merge$phylum == "Bacteroidetes"] <- "#a53ec7"
w369.tax.rel.bal.merge$phy.col[w369.tax.rel.bal.merge$phylum == "Acidobacteria"] <- "#f806a0"
w369.tax.rel.bal.merge$phy.col[w369.tax.rel.bal.merge$phylum == "Actinobacteria"] <- "#26b223"
w369.tax.rel.bal.merge$phy.col[w369.tax.rel.bal.merge$phylum == "Chloroflexi"] <- "#3f62fb"
Rename some of the long genera to make better use of the space.
w369.tax.rel.bal.merge$genus[w369.tax.rel.bal.merge$genus == "Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium"] <- "Allorhizobium-N-P-R"
w369.tax.rel.bal.merge$genus[w369.tax.rel.bal.merge$genus == "Burkholderia-Caballeronia-Paraburkholderia"] <- "Burkholderia-C-P"
w369.tax.rel.bal.merge$genus[w369.tax.rel.bal.merge$genus == "uncultured forest soil bacterium"] <- "uncultured"
And now we can use all that work above to make a pretty tree.
Here's a breakdown of what each parameter in the dotplot.phylo4d does:
edge.color
sets the color of the branches in the tree. I first labelled the branches sequentially (see last line of code below) and then used that information to assign colors based on phylum (see above). The colors were chosen using I Want Hue.data.xlim
assigns an x-limit (0,1000) that goes well-beyond the range of the data (~0 to 8). This makes the dots appear aligned with the concentric circles in the plots. Otherwise, the relative abundance will be used as the x-coordinate and the dots are staggered and not visually appealing.edge.color
is used to color the tree branches. Here I've colored them according to phylum.dot.cex
is the point size. Here I used a scaled relative abundance for size.dot.col
is the point color (made above).grid.horizontal
displays the dashed grid if set toTRUE
.grid.col
sets the color of the above grid.show.data.axis
will show the data axis if set toTRUE
.show.tip
will show the tip labels (genera here) if set toTRUE
.show.trait
will show the trait names if set toTRUE
. IfTRUE
, will populate the labels using the specified strings intrait.labels
below.tip.col
specifies the color of the text used to label the branch tips (genera).tip.cex
can be used to change the font size used in the branch tip labels.tip.labels
uses the specified string to label the branch tips.trait.cex
will adjust the size of the axis labels for the traits.tree.type
sets the style of tree. There are a few different options. Read the help file for more on those.tree.ladderize
to right-ladderize the tree.tree.open.angle
is used to specify how wide the opening of the tree is where the trait labels are. Here I've set it to 15°.tree.ratio
is a numeric value in [0, 1] giving the proportion of width of the figure for the tree.trait.bg.col
can be used to specify the background color for the trait plotting area. I've plotted three traits here and so could choose three colors in theory.trait.bg.col = rep("white",3)
could be changed totrait.bg.col = "white"
, but the 3 is there as a reminder of the number of color options you could choose.
# Export at 8 x 8
dotplot.phylo4d(p4d,
edge.color = edge.col,
data.xlim = c(0,1000),
dot.cex = cbind(w3.pt.cex,w6.pt.cex,w9.pt.cex),
dot.col = cbind(w3.dot.col,w6.dot.col,w9.dot.col),
grid.horizontal = TRUE,
grid.col = "gray80",
show.data.axis = FALSE,
show.tip = TRUE,
show.trait = TRUE,
tip.col = w369.tax.rel.bal.merge$phy.col,
tip.cex = 0.75,
tip.labels = w369.tax.rel.bal.merge$genus,
trait.cex = 0.75,
trait.labels = c("L","F","S"),
tree.type = "fan",
tree.ladderize = TRUE,
tree.open.angle = 15,
tree.ratio = 0.15,
trait.bg.col = "white")
Add polygon to highlight shared taxa between the F and S communities.
polygon(x = c(0.25,0.35,0.53,0.43),
y = c(1.28,1.26,1.989,2.01),
col = alpha("gray50",0.3), border = NA)
Add polygon to highlight shared taxa between the L and S communities.
polygon(x = c(0.025,-0.075,-0.115,-0.015),
y = c(-0.76,-0.76,-2.05,-2.05),
col = alpha("gray50",0.3), border = NA)
Add relative abundance legend.
text(-3.45,2.75,"Relative abundance", font = 2, cex = 0.9)
legend("topleft", legend = "10%", bty = "n", cex = 0.75, col = "gray50", pch = 16, pt.cex = 8, inset = c(0.04,0.08), x.intersp = 3)
legend("topleft", legend = "5%", bty = "n", cex = 0.75, col = "gray50", pch = 16, pt.cex = 4, inset = c(0.04,0.165), x.intersp = 3)
legend("topleft", legend = "1%", bty = "n", cex = 0.75, col = "gray50", pch = 16, pt.cex = 0.8, inset = c(0.04,0.220), x.intersp = 3)
Add numerator/denominator legend.
text(-3.82,1.2,"Balance", font = 2, cex = 0.9)
legend("topleft", legend = c("Numerator","Denominator"), bty = "n", cex = 0.75, col = c("#5794d7","#b94663"), pch = 16, pt.cex = 1.75, inset = c(0.021,0.3))
Add edge legend.
legend("bottomright", legend = c("Proteobacteria","Actinobacteria","Chloroflexi","Acidobacteria","Bacteroidetes"), bty = "n", cex = 0.75, col = c("#ca7b0a","#a53ec7","#f806a0","#26b223","#3f62fb"), pch = 15, pt.cex = 1.75, inset = c(0.015,0))
I used the following to identify which edges belonged to which phylum. Then used that information to add the colors (see 'edge.col' above).
edgelabels(seq(1,Nedge(w369.tree)), frame = "n", bg = NA, font = 3)