Phylogenetic trees: from qiime2 to R - UofS-EcoTox-Lab/Phylogenetic-Trees GitHub Wiki

Here are the steps I used to create the tree using qiime2-2020.2

First, download the latest qiime2-friendly tree derived from GreenGeenes:

wget -O "sepp-refs-gg-13-8.qza" "https://data.qiime2.org/2019.10/common/sepp-refs-gg-13-8.qza"

You'll have to make sure you've put your files in the correct directory and have qiime2 installed and activated.

Here's the blurb from the qiime2 library:

The move from OTU-based to sOTU-based analysis, while providing additional resolution, also introduces computational challenges. We demonstrate that one popular method of dealing with sOTUs (building a de novo tree from the short sequences) can provide incorrect results in human gut metagenomic studies and show that phylogenetic placement of the new sequences with SEPP resolves this problem while also yielding other benefits over existing methods.

qiime fragment-insertion sepp \
	--i-representative-sequences rep-seqs.qza \
	--i-reference-database sepp-refs-gg-13-8.qza \
	--o-tree insertion-tree.qza \
	--o-placements insertion-placements.qza

Export the tree

qiime tools export \
	--input-path $PWD/insertion-tree.qza \
	--output-path exported-tree

Now we move into R to generate our phylogenetic tree plotted with traits (relative abundance here).

Load the required packages

library(vegan)        # Relative abundances
library(scales)       # Transparency in plotting
library(ape)          # drop.tip
library(adephylo)     # Required by phylosignal
library(phylobase)    # phylo4d
library(phylosignal)  # dotplot.phylo4d

Cleanliness is godliness rm(list = ls())

Read in the tree and make sure ASV identifiers starting with a numeric have an "X" prepended to match the community data. The "X" get prepended when strings that begin with a number are used for column headings in R. They may have fixed this in the latest R release, but in case you run into this issue, here's how you fix it.

root.tree <- read.tree("insertion-tree.nwk")
x <- root.tree$tip.label
x[grepl("^[:digit:](/UofS-EcoTox-Lab/Phylogenetic-Trees/wiki/:digit:)", x)] <- paste("X", x[grepl("^[:digit:](/UofS-EcoTox-Lab/Phylogenetic-Trees/wiki/:digit:)", x)], sep = "")
root.tree$tip.label <- x

These are three communities I'm interested in. They are bacterial 16S rRNA paired sequences from the V3-V5 region with an internal standard added, joined using VSEARCH, quality-filtered using q-score-joined, denoised using deblur with a trim length of 400, and classified with classify-sklearn using a custom classifier trimmed to the V3-V5 hypervariable region.

These communities were read into R and corrected relative to the internal standard to allow gene-abundance comparisons among samples. There was also a lot of data filtering to facilitate robust inferences from the data, but hopefully you can read about that soon in our forthcoming paper.

Read in the data. This consists of a filtered 16S rRNA abundance table (above), the associated taxonomy, a selbal output that indicates balances of bacteria that are linked to canola yield performance, and three data frames that indicate that samples we are interested in.

can.asv <- read.csv("asv.root.reps_dupes_merged.prev_and_abun_filt.csv", row.names = 1)
can.tax <- read.csv("tax.root.reps_dupes_merged.prev_and_abun_filt.csv", row.names = 1)
Bal.Tab.full.w3.imp1 <- read.csv("Bal.Tab.full.imp1.w3.csv", row.names = 1)
Bal.Tab.full.w6.imp1 <- read.csv("Bal.Tab.full.imp1.w6.csv", row.names = 1)
Bal.Tab.full.w9.imp1 <- read.csv("Bal.Tab.full.imp1.w9.csv", row.names = 1)
w3.rows <- read.csv("w3.rows.csv", row.names = 1, header = T)
w6.rows <- read.csv("w6.rows.csv", row.names = 1, header = T)
w9.rows <- read.csv("w9.rows.csv", row.names = 1, header = T)

Calculate relative abundance for sizing the dots in our dotplot later.

can.asv.rel0 <- decostand(can.asv, method = "total")*100 range(rowSums(can.asv.rel0)) # Double check that everything sums to 100%

Tidy the selbal outputs.

names(Bal.Tab.full.w3.imp1) <- c("taxa","group")
names(Bal.Tab.full.w6.imp1) <- c("taxa","group")
names(Bal.Tab.full.w9.imp1) <- c("taxa","group")
rownames(Bal.Tab.full.w3.imp1) <- Bal.Tab.full.w3.imp1$taxa
rownames(Bal.Tab.full.w6.imp1) <- Bal.Tab.full.w6.imp1$taxa
rownames(Bal.Tab.full.w9.imp1) <- Bal.Tab.full.w9.imp1$taxa

Tidy the tree (prepend "X"s).

x <- root.tree$tip.label                                                          # Extract the ASV (row) names
x[grepl("^[:digit:](/UofS-EcoTox-Lab/Phylogenetic-Trees/wiki/:digit:)", x)] <- paste("X", x[grepl("^[:digit:](/UofS-EcoTox-Lab/Phylogenetic-Trees/wiki/:digit:)", x)], sep = "")  # This will prepend an "X" to any name that begins with a number
root.tree$tip.label <- x                                                          # Assign the new ASV names to the taxonomy df

Drop the tips that aren't in the dataset - should change from 211,013 tips to 63.

w369.tree <- drop.tip(root.tree, root.tree$tip.label[-match(unique(c(Bal.Tab.full.w3.imp1$taxa, 
                                                                     Bal.Tab.full.w6.imp1$taxa, 
                                                                     Bal.Tab.full.w9.imp1$taxa)), 
                                                            root.tree$tip.label)])

Tidy the taxonomy (prepend "X"s) and subset to the 63 ASVs of interest.

Do the taxonomy ASV identifiers match the abundance identifiers?

identical(names(can.asv), rownames(can.tax))                                      # Nope, b/c of the prepended "X" issue.
x <- rownames(can.tax)                                                            # Extract the ASV (row) names
x[grepl("^[:digit:](/UofS-EcoTox-Lab/Phylogenetic-Trees/wiki/:digit:)", x)] <- paste("X", x[grepl("^[:digit:](/UofS-EcoTox-Lab/Phylogenetic-Trees/wiki/:digit:)", x)], sep = "")  # This will prepend an "X" to any name that begins with a number
rownames(can.tax) <- x                                                            # Assign the new ASV names to the taxonomy df
identical(names(can.asv), rownames(can.tax))                                      # Now the ASV names in the abundance and taxonomy files match
w369.tax <- can.tax[w369.tree$tip.label,]                                         # Subset the taxonomy to the 63 ASVs

Here we will generate a df that specifies the groups we will plot for each of our three communities.

This the is numerator and denominator determined through selbal.

w369.bal.tab <- merge(Bal.Tab.full.w3.imp1, 
                      Bal.Tab.full.w6.imp1, 
                      by = 0, 
                      all = T)                      # Merge the first two communities by row names
rownames(w369.bal.tab) <- w369.bal.tab$Row.names    # Assign the row names
w369.bal.tab$Row.names <- NULL                      # Remove the 'Row.names' column autogenerated during the merge
w369.bal.tab <- merge(w369.bal.tab,
                      Bal.Tab.full.w9.imp1, 
                      by = 0, 
                      all = T)                      # Now merge the first two communities with the third by row names
rownames(w369.bal.tab) <- w369.bal.tab$Row.names    # Assign the row names

Remove the columns that are no longer needed

w369.bal.tab$Row.names <- w369.bal.tab$taxa.x <- w369.bal.tab$taxa.y <- w369.bal.tab$taxa <- NULL
names(w369.bal.tab) <- c("w3.group",
                         "w6.group",
                         "w9.group")                # Assign more useful column names

Subset relative abundance df to include only the 63 of interest

w369.rel0 <- can.asv.rel0[,unique(c(Bal.Tab.full.w3.imp1$taxa, 
                                    Bal.Tab.full.w6.imp1$taxa, 
                                    Bal.Tab.full.w9.imp1$taxa))] # This generates a list of unique ASV names and uses that to subset the relative abundance df by

Subset the relative abundance df to include only the samples of interest

w3.relabu <- w369.rel0[w3.rows$x,]
w6.relabu <- w369.rel0[w6.rows$x,]
w9.relabu <- w369.rel0[w3.rows$x,]

Calculate mean relative abundances for each taxa within each community

w3.relabu.means <- as.data.frame(colMeans(w3.relabu))
w6.relabu.means <- as.data.frame(colMeans(w6.relabu))
w9.relabu.means <- as.data.frame(colMeans(w9.relabu))

Check that the row names match so you can cbind the mean abundance data together

identical(rownames(w3.relabu.means), 
          rownames(w6.relabu.means))                    # TRUE
identical(rownames(w3.relabu.means), 
          rownames(w9.relabu.means))                    # TRUE
w369.relabu.means <- cbind.data.frame(w3.relabu.means, 
                                      w6.relabu.means, 
                                      w9.relabu.means)  # Merging the data
names(w369.relabu.means) <- c("w3.comm",
                              "w6.comm",
                              "w9.comm")                # Useful column names

Add taxonomy to the mean abundance df

w369.tax.rel.merge <- merge(w369.tax, 
                            w369.relabu.means, 
                            by = 0)                                           # Merge by row names
rownames(w369.tax.rel.merge) <- w369.tax.rel.merge$Row.names                  # Assign ASV ids as row names
w369.tax.rel.bal.merge <- merge(w369.tax.rel.merge, 
                                w369.bal.tab, by = 0)                         # Now merge this with the grouping (num/den) information
rownames(w369.tax.rel.bal.merge) <- w369.tax.rel.bal.merge$Row.names          # Assign row names
w369.tax.rel.bal.merge$Row.names <- w369.tax.rel.bal.merge$Row.names <- NULL  # Remove the Row.names column
w369.tax.rel.bal.merge <- w369.tax.rel.bal.merge[w369.tree$tip.label,]        # Ensure this merged df is in the same order as the tree
identical(rownames(w369.tax.rel.bal.merge), w369.tree$tip.label)              # Double-checking the above

Here we'll make a dataframe that will contain a number of plotting parameters (color, etc.)

Convert the grouping column to character.

w369.tax.rel.bal.merge$w3.group <- as.character(w369.tax.rel.bal.merge$w3.group)
w369.tax.rel.bal.merge$w6.group <- as.character(w369.tax.rel.bal.merge$w6.group)
w369.tax.rel.bal.merge$w9.group <- as.character(w369.tax.rel.bal.merge$w9.group)

Make empty abundance columns for each group within each community.

w369.tax.rel.bal.merge$w3.abu.den <- 0
w369.tax.rel.bal.merge$w6.abu.den <- 0
w369.tax.rel.bal.merge$w9.abu.den <- 0
w369.tax.rel.bal.merge$w3.abu.num <- 0
w369.tax.rel.bal.merge$w6.abu.num <- 0
w369.tax.rel.bal.merge$w9.abu.num <- 0

Assign abundances based on group.

w369.tax.rel.bal.merge$w3.abu.den[which(w369.tax.rel.bal.merge$w3.group == "DEN")] <- w369.tax.rel.bal.merge$w3.comm[which(w369.tax.rel.bal.merge$w3.group == "DEN")]
w369.tax.rel.bal.merge$w3.abu.num[which(w369.tax.rel.bal.merge$w3.group == "NUM")] <- w369.tax.rel.bal.merge$w3.comm[which(w369.tax.rel.bal.merge$w3.group == "NUM")]
w369.tax.rel.bal.merge$w6.abu.den[which(w369.tax.rel.bal.merge$w6.group == "DEN")] <- w369.tax.rel.bal.merge$w6.comm[which(w369.tax.rel.bal.merge$w6.group == "DEN")]
w369.tax.rel.bal.merge$w6.abu.num[which(w369.tax.rel.bal.merge$w6.group == "NUM")] <- w369.tax.rel.bal.merge$w6.comm[which(w369.tax.rel.bal.merge$w6.group == "NUM")]
w369.tax.rel.bal.merge$w9.abu.den[which(w369.tax.rel.bal.merge$w9.group == "DEN")] <- w369.tax.rel.bal.merge$w9.comm[which(w369.tax.rel.bal.merge$w9.group == "DEN")]
w369.tax.rel.bal.merge$w9.abu.num[which(w369.tax.rel.bal.merge$w9.group == "NUM")] <- w369.tax.rel.bal.merge$w9.comm[which(w369.tax.rel.bal.merge$w9.group == "NUM")]

Create empty color columns to use for plotting relative abundances.

w369.tax.rel.bal.merge$w3.col <- as.character(NA)
w369.tax.rel.bal.merge$w6.col <- as.character(NA)
w369.tax.rel.bal.merge$w9.col <- as.character(NA)

Assign colors abundances > 0.

w369.tax.rel.bal.merge$w3.col[w369.tax.rel.bal.merge$w3.group == "DEN"] <- "#b94663"
w369.tax.rel.bal.merge$w3.col[w369.tax.rel.bal.merge$w3.group == "NUM"] <- "#5794d7"
w369.tax.rel.bal.merge$w6.col[w369.tax.rel.bal.merge$w6.group == "DEN"] <- "#b94663"
w369.tax.rel.bal.merge$w6.col[w369.tax.rel.bal.merge$w6.group == "NUM"] <- "#5794d7"
w369.tax.rel.bal.merge$w9.col[w369.tax.rel.bal.merge$w9.group == "DEN"] <- "#b94663"
w369.tax.rel.bal.merge$w9.col[w369.tax.rel.bal.merge$w9.group == "NUM"] <- "#5794d7"

Colors for abundances == 0.

w369.tax.rel.bal.merge$w3.col[is.na(w369.tax.rel.bal.merge$w3.col)] <- "white"
w369.tax.rel.bal.merge$w6.col[is.na(w369.tax.rel.bal.merge$w6.col)] <- "white"
w369.tax.rel.bal.merge$w9.col[is.na(w369.tax.rel.bal.merge$w9.col)] <- "white"

Extract for plotting.

w3.dot.col <- w369.tax.rel.bal.merge$w3.col
w6.dot.col <- w369.tax.rel.bal.merge$w6.col
w9.dot.col <- w369.tax.rel.bal.merge$w9.col

Add unique genus column (numbers appended).

w369.tax.rel.bal.merge$genus2 <- make.unique(w369.tax.rel.bal.merge$genus) # Unique identical(rownames(w369.tax.rel.bal.merge), w369.tree$tip.label) # TRUE

Make a single abundance column for ease of plotting.

w369.tax.rel.bal.merge$w3.abundance <- c(w369.tax.rel.bal.merge$w3.abu.num + w369.tax.rel.bal.merge$w3.abu.den)
w369.tax.rel.bal.merge$w6.abundance <- c(w369.tax.rel.bal.merge$w6.abu.num + w369.tax.rel.bal.merge$w6.abu.den)
w369.tax.rel.bal.merge$w9.abundance <- c(w369.tax.rel.bal.merge$w9.abu.num + w369.tax.rel.bal.merge$w9.abu.den)

Make a column to size the points to best encompass the range of abundances.

pt.cex <- abs(c(abs(w369.tax.rel.bal.merge$w3.abundance),abs(w369.tax.rel.bal.merge$w6.abundance),abs(w369.tax.rel.bal.merge$w9.abundance)))
pt.cex[pt.cex>0] <- rescale(pt.cex[pt.cex>0], to = c(1, 8))                   # pt.cex size range from 1 to 8

Extract for plotting

w3.pt.cex <- pt.cex[1:63]
w6.pt.cex <- pt.cex[64:126]
w9.pt.cex <- pt.cex[127:189]

Make the phylo4d object with the pruned tree and plotting parameter df. The parameter df can only include numeric columns, otherwise you'll get this error: "Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric" when plotting the dotplot.

p4d <- phylo4d(w369.tree, w369.tax.rel.bal.merge[c(24:26)])

Set the y-values for plotting locations that will appear on concentric circles laterwith 'data.xlim = c(0,1000)'

p4d@data$w3.abundance <- c(0.2, rep(c(0.01,0.001), 31))
p4d@data$w6.abundance <- c(0.2, rep(c(0.01,0.001), 31))
p4d@data$w9.abundance <- c(0.2, rep(c(0.01,0.001), 31))

Generate a column of colors to use for the tree edges.

edge.col <- rep("black", Nedge(w369.tree))
edge.col[1:81] <- "#ca7b0a"    # Proteobacteria
edge.col[84:101] <- "#26b223"  # Actinobacteria
edge.col[102] <- "#3f62fb"     # Chloroflexi
edge.col[103:105] <- "#f806a0" # Acidobacteria
edge.col[106:122] <- "#a53ec7" # Bacteroidetes

Generate colors to use for the genera labels.

w369.tax.rel.bal.merge$phy.col <- NA
w369.tax.rel.bal.merge$phy.col[w369.tax.rel.bal.merge$phylum == "Proteobacteria"] <- "#ca7b0a"
w369.tax.rel.bal.merge$phy.col[w369.tax.rel.bal.merge$phylum == "Bacteroidetes"] <- "#a53ec7"
w369.tax.rel.bal.merge$phy.col[w369.tax.rel.bal.merge$phylum == "Acidobacteria"] <- "#f806a0"
w369.tax.rel.bal.merge$phy.col[w369.tax.rel.bal.merge$phylum == "Actinobacteria"] <- "#26b223"
w369.tax.rel.bal.merge$phy.col[w369.tax.rel.bal.merge$phylum == "Chloroflexi"] <- "#3f62fb"

Rename some of the long genera to make better use of the space.

w369.tax.rel.bal.merge$genus[w369.tax.rel.bal.merge$genus == "Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium"] <- "Allorhizobium-N-P-R"
w369.tax.rel.bal.merge$genus[w369.tax.rel.bal.merge$genus == "Burkholderia-Caballeronia-Paraburkholderia"] <- "Burkholderia-C-P"
w369.tax.rel.bal.merge$genus[w369.tax.rel.bal.merge$genus == "uncultured forest soil bacterium"] <- "uncultured"

And now we can use all that work above to make a pretty tree.

Here's a breakdown of what each parameter in the dotplot.phylo4d does:

edge.color sets the color of the branches in the tree. I first labelled the branches sequentially (see last line of code below) and then used that information to assign colors based on phylum (see above). The colors were chosen using I Want Hue.
data.xlim assigns an x-limit (0,1000) that goes well-beyond the range of the data (~0 to 8). This makes the dots appear aligned with the concentric circles in the plots. Otherwise, the relative abundance will be used as the x-coordinate and the dots are staggered and not visually appealing.
edge.color is used to color the tree branches. Here I've colored them according to phylum.
dot.cex is the point size. Here I used a scaled relative abundance for size.
dot.col is the point color (made above).
grid.horizontal displays the dashed grid if set to TRUE.
grid.col sets the color of the above grid.
show.data.axis will show the data axis if set to TRUE.
show.tip will show the tip labels (genera here) if set to TRUE.
show.trait will show the trait names if set to TRUE. If TRUE, will populate the labels using the specified strings in trait.labels below.
tip.col specifies the color of the text used to label the branch tips (genera).
tip.cex can be used to change the font size used in the branch tip labels.
tip.labels uses the specified string to label the branch tips.
trait.cex will adjust the size of the axis labels for the traits.
tree.type sets the style of tree. There are a few different options. Read the help file for more on those.
tree.ladderize to right-ladderize the tree.
tree.open.angle is used to specify how wide the opening of the tree is where the trait labels are. Here I've set it to 15°.
tree.ratio is a numeric value in [0, 1] giving the proportion of width of the figure for the tree.
trait.bg.col can be used to specify the background color for the trait plotting area. I've plotted three traits here and so could choose three colors in theory. trait.bg.col = rep("white",3) could be changed to trait.bg.col = "white", but the 3 is there as a reminder of the number of color options you could choose.

# Export at 8 x 8
dotplot.phylo4d(p4d, 
                edge.color = edge.col,
                data.xlim = c(0,1000),
                dot.cex = cbind(w3.pt.cex,w6.pt.cex,w9.pt.cex),
                dot.col = cbind(w3.dot.col,w6.dot.col,w9.dot.col),
                grid.horizontal = TRUE, 
                grid.col = "gray80",
                show.data.axis = FALSE,
                show.tip = TRUE, 
                show.trait = TRUE,
                tip.col = w369.tax.rel.bal.merge$phy.col, 
                tip.cex = 0.75,
                tip.labels = w369.tax.rel.bal.merge$genus,
                trait.cex = 0.75,
                trait.labels = c("L","F","S"),
                tree.type = "fan", 
                tree.ladderize = TRUE,
                tree.open.angle = 15, 
                tree.ratio = 0.15,
                trait.bg.col = "white")

Add polygon to highlight shared taxa between the F and S communities.

polygon(x = c(0.25,0.35,0.53,0.43), 
        y = c(1.28,1.26,1.989,2.01),
        col = alpha("gray50",0.3), border = NA)

Add polygon to highlight shared taxa between the L and S communities.

polygon(x = c(0.025,-0.075,-0.115,-0.015), 
        y = c(-0.76,-0.76,-2.05,-2.05),
        col = alpha("gray50",0.3), border = NA)

Add relative abundance legend.

text(-3.45,2.75,"Relative abundance", font = 2, cex = 0.9)
legend("topleft", legend = "10%", bty = "n", cex = 0.75, col = "gray50", pch = 16, pt.cex = 8, inset = c(0.04,0.08), x.intersp = 3)
legend("topleft", legend = "5%",  bty = "n", cex = 0.75, col = "gray50", pch = 16, pt.cex = 4, inset = c(0.04,0.165), x.intersp = 3)
legend("topleft", legend = "1%",  bty = "n", cex = 0.75, col = "gray50", pch = 16, pt.cex = 0.8, inset = c(0.04,0.220), x.intersp = 3)

Add numerator/denominator legend.

text(-3.82,1.2,"Balance", font = 2, cex = 0.9)
legend("topleft", legend = c("Numerator","Denominator"), bty = "n", cex = 0.75, col = c("#5794d7","#b94663"), pch = 16, pt.cex = 1.75, inset = c(0.021,0.3))

Add edge legend.

legend("bottomright", legend = c("Proteobacteria","Actinobacteria","Chloroflexi","Acidobacteria","Bacteroidetes"), bty = "n", cex = 0.75, col = c("#ca7b0a","#a53ec7","#f806a0","#26b223","#3f62fb"), pch = 15, pt.cex = 1.75, inset = c(0.015,0))

I used the following to identify which edges belonged to which phylum. Then used that information to add the colors (see 'edge.col' above).

edgelabels(seq(1,Nedge(w369.tree)), frame = "n", bg = NA, font = 3)

Rplot