Importing BIOM files - StevenMamet/Working-with-biom-files GitHub Wiki

You'll have to do the steps below if you haven't installed any of the packages before. Once you've completed the steps below, you shouldn't have to repeat them unless perhaps you've updated your R installation.

If you run into "non-zero exit status" for fields (MACOSX problem), you'll need to download and install this.

Here I've downloaded the fields package to my downloads folder:

install.packages("~/Downloads/fields_10.0.tar.gz", repos = NULL, type="source")

install.packages("BiocManager")
BiocManager::install(version="3.10")
BiocManager::install("ALDEx2", version = "3.8")
install.packages('devtools')
library(devtools)
devtools::install_github('ggloor/CoDaSeq/CoDaSeq')
BiocManager::install('phyloseq')
BiocManager::install('DESeq2')
BiocManager::install('biomformat')

Once the above have been done once, you only need to load the packages now in each new session:

library(biomformat) # Working with biom files
library(ade4)       # Required by phyloseq
library(phyloseq)   # Used to manipulate phylo data
library(tidyverse)  # Load the tidy packages   
library(stringr)    # Helpful for working with strings

rm(list = ls())

1. Read in the biom file and extract abundance, sample, and taxonomy information

This is the file you made in QIIME2 following my P2IRC tutorial:

canola.biom <- read_biom("~/Dropbox/School/School/Teaching/PLSC898.3/feature-table.biom")

You may get the error message below. Don't worry about it.

Warning message:
In strsplit(conditionMessage(e), "\n") :
  input string 1 is invalid in this locale

Extract the abundance data (returns ASV by sample data frame)

can.asv <- data.frame(as.matrix(biom_data(canola.biom)), stringsAsFactors = F)

Process the sample information. In this case, we can extract everything we need from the column names:

can.sam <- data.frame(names(can.asv), stringsAsFactors = F)

Extract all the columns to populate the sample df. Since we're working with a subset of the data here, not all of these items will be present, but at least you get the gist:

Sample.Code <- can.sam$names.can.asv.                 # Sample code (P2IRC identifier)
SampleType <- substr(Sample.Code,1,4)                 # Sample type (ORIG/CHCK)
Year <- substr(Sample.Code,5,5)                       # Year (2016/2017)
Site <- rep(NA,10)                                    # Site (S/L/M)
RootSoil <- substr(Sample.Code,7,7)                   # Root or soil 
Plot <- substr(Sample.Code,8,11)                      # Plot number
Week <- substr(Sample.Code,14,15)                     # Week
Field.Duplicate <- substr(Sample.Code,12,12)          # Extract field dupes
CollectionType <- substr(Sample.Code,13,13)           # Collection (C01,C02,C03 = week 3,6,9) or weekly (weeks 1-10)
Extraction.Duplicate <- substr(Sample.Code,16,16)     # Extract extraction dupes
PCR.Duplicate <- substr(Sample.Code,17,17)            # Extract PCR dupes
Sequencer.Replicate <- substr(Sample.Code,18,18)      # Extract sequencer reps
Sequencer.Plate.Or.Site <- substr(Sample.Code,19,19)  # Extract sequencer plate (2016) or site (2017)

Make a new df:

can.sam2 <- cbind.data.frame(SampleID = Sample.Code,
                                 Year = Year,
                                 Site = Site,
                                 RootSoil = RootSoil,
                                 Plot = Plot,
                                 Week = Week,
                                 Field.Duplicate = Field.Duplicate,
                                 CollectionType = CollectionType,
                                 Extraction.Duplicate = Extraction.Duplicate,
                                 PCR.Duplicate = PCR.Duplicate,
                                 Sequencer.Replicate = Sequencer.Replicate,
                                 Sequencer.Plate.Or.Site = Sequencer.Plate.Or.Site,
                                 stringsAsFactors = F)

Assign the sites and tidy that column up. This is unnecessary using this small subset of data, but I'll include it here for a teachable moment:

can.sam2$Site <- can.sam2$Sequencer.Plate.Or.Site     # The sample ids differ b/w 2016 and 2017 - use this to assign sites
can.sam2$Site[can.sam2$Site == "a" | can.sam2$Site == "b" | can.sam2$Site == "c"] <- "L" # All 2016 samples had the sequencer plate appended to the sample id. Recode these to the correct site
can.sam2 <- droplevels.data.frame(can.sam2)           # Remove any artifact factor levels
can.sam2$Year[can.sam2$Year == 1] <- 2016             # Recode year 1 to something more descriptive
can.sam2$Year[can.sam2$Year == 2] <- 2017             # Recode year 2 to something more descriptive
can.sam2$SiteYearLinePlotWeekRootSoil <- paste(can.sam2$Site, can.sam2$Year, can.sam2$CanolaLine, can.sam2$Plot, can.sam2$Week, can.sam2$RootSoil, sep = ".")          # Create a merging index to merge duplicate samples later

Process the taxonomic information. Here we didn't attach the taxonomy information to the biom file, so it is imported as a separate file:

taxa <- read.table(file = "~/Dropbox/School/School/Teaching/PLSC898.3/taxonomy.tsv", sep = "\t")
names(taxa) <- c("FeatureID","taxonomy","confidence")
taxa2 <- taxa %>% separate(taxonomy, into = c("kingdom","phylum","class","order","family","genus","species","subspecies"), sep = ";")

After the last step, you'll get this error:

Warning messages:
1: Expected 8 pieces. Additional pieces discarded in 13 rows [23, 39, 42, 48, 52, 53, 77, 83, 101, 105, 109, 111, 114]. 
2: Expected 8 pieces. Missing pieces filled with `NA` in 115 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

Don't worry, the data are still fine. We've just ignored the taxonomy information below species.

Remove the superfluous characters in each name ("D_0__", "D_1__", etc.)

Kingdom <- gsub("D_0__", "", taxa2$kingdom)
Phylum <- gsub("D_1__", "", taxa2$phylum)
Class <- gsub("D_2__", "", taxa2$class)
Order <- gsub("D_3__", "", taxa2$order)
Family <- gsub("D_4__", "", taxa2$family)
Genus <- gsub("D_5__", "", taxa2$genus)
Species <- gsub("D_6__", "", taxa2$species)

Combine each taxonomic level into a matrix:

taxa3 <- as.matrix(cbind(kingdom = Kingdom, phylum = Phylum, class = Class, order = Order, family = Family, genus = Genus, species = Species))

Eventually, when using phyloseq to manipulate the data, the subset_taxa command will remove rows with the specified taxonomy, but also NAs (see here for a discussion). We don't want this. So recode the NAs to "unclassified" here to avoid removing extra rows.

can.tax <- taxa3                           # Make a new object
can.tax[is.na(can.tax)] <- "unclassified"  # Replace NAs with "unclassified"
row.names(can.tax) <- row.names(can.asv)   # Assign the ASV ids as the row names

Check that the row and column names match up. phyloseq uses these to merge the data:

identical(rownames(can.asv),rownames(can.tax)) # Abundance row names match taxonomy row names
rownames(can.sam2) <- can.sam2[,1]             # Assign sample ids as the row names for the sample df
identical(names(can.asv),rownames(can.sam2))   # Abundance df column names match row names of the sample df

2. Combine abundance, sample, and taxonomy data into a phyloseq object

Make phyloseq class. Double check to see if the taxonomy columns are capitalized or not. This will affect the coding below. During DNA extraction, we added an internal standard to account for differential abundances of microbes across samples. The processing pipeline needs to incorporate this information:

can_all.asv <- phyloseq(otu_table(can.asv, taxa_are_rows = TRUE), 
                        sample_data(can.sam2), 
                        tax_table(can.tax))                                         # Make the phyloseq object
can_all.asv <- subset_taxa(can_all.asv, kingdom != "Archaea")                       # Remove archaea
can_all.asv <- subset_taxa(can_all.asv, class != "Chloroplast")                     # Remove chloroplasts
can_all.asv <- subset_taxa(can_all.asv, order != "Chloroplast")                     # Remove chloroplasts
can_all.no_mito <- subset_taxa(can_all.asv, family != "Mitochondria")               # Remove mitochondrial contaminant
can_all.fischeri <- subset_taxa(can_all.no_mito, genus == "Aliivibrio" )            # Collect Aliivibrio internal standard
can_all.no_mito.no_fischeri <- subset_taxa(can_all.no_mito, genus != "Aliivibrio")  # Remove internal standard from data set

3. Use the internal standard to correct abundances

Here are the A. fischeri calculations:

fischeri <- as.vector(sample_sums(can_all.fischeri))  # Create vector of internal standard
wi <- 1e-10                                           # Weight of internal standard gDNA added to the samples
gi <- 4.49e-15                                        # Weight of genome of the internal standard
ci <- 8                                               # 16S copy number of the internal standard
adj <- wi/gi*ci                                       # Adjustment index
fischeri.1 <- adj/fischeri                            # Divide the index by sample sums of A. fischeri
fischeri.1[is.infinite(fischeri.1)] <- 0              # To infinity and beyond...is what we want to avoid

Prepare the new phyloseq object:

asv.table <- as.matrix(can_all.no_mito.no_fischeri@otu_table)          # Export ASV abundance table as matrix
asv.table.norm <- as.data.frame(sweep(asv.table, 2, fischeri.1, `*`))  # This divides the abundance matrix by the vector to normalize the data
tax.table.norm <- as.matrix(can_all.no_mito.no_fischeri@tax_table)     # This extracts the taxonomic information exlcuding mitochondria & aliivibrio
can_all.norm <- phyloseq(otu_table(asv.table.norm, taxa_are_rows = TRUE), 
                         sample_data(can.sam2), 
                         tax_table(tax.table.norm))                    # Recreating a phyloseq object

4. OPTIONAL. Merge duplicate samples

Use the "SiteYearLinePlotWeekRootSoil" column created earlier to merge duplicates. Note that duplicates should only be merged if they do not differ significantly in composition. You could evaluate this using an ordination a priori.

can_all.merge <- merge_samples(can_all.norm, "SiteYearLinePlotWeekRootSoil", fun = sum) # This step will sum everything (factors become NAs, weeks, lines, etc. get summed. Need to fix later)
can_all.no.dup <- can_all.merge

Extract the sample information to correct the loss of string information above. Also extract the abundance and taxonomy information to merge into a phyloseq object later:

can.asv.root.0 <- as.matrix(can_all.no.dup@otu_table)
can.tax.root.0 <- as.matrix(can_all.no.dup@tax_table)
can.sam.root.0 <- data.frame(can_all.no.dup@sam_data)

Sample clean-up:

Sample.ID <- c(rownames(can.sam.root.0))             # Extract the sample information contained within the row names
split.df <- unlist(strsplit(Sample.ID,"[.]"))        # Separate the above at each `.`
Site <- substr(Sample.ID,1,1)                        # Extract site
Year <- substr(Sample.ID,3,6)                        # Extract year
CanolaLine <- split.df[seq(3,length(split.df),6)]    # Extract genotype
Plot <- split.df[seq(4,length(split.df),6)]          # Extract plot (deprecated)
Week <- split.df[seq(5,length(split.df),6)]          # Extract week
RootSoil <- split.df[seq(6,length(split.df),6)]      # Extract root/soil sample info
SiteYearPlot <- paste(Site, Year, Plot, sep = ".")   # Create a new index for a rainy day
can.sam.root.1 <- cbind.data.frame(Site = Site, Year = Year, Plot = Plot, CanolaLine = CanolaLine, Week = Week, RootSoil = RootSoil, SiteYearPlot = SiteYearPlot)               # Make the new df
rownames(can.sam.root.1) <- Sample.ID                # Assign row names
can_all.no.dup1 <- phyloseq(otu_table(can.asv.root.0, taxa_are_rows = FALSE), 
                            sample_data(can.sam.root.1), 
                            tax_table(can.tax.root.0)) # Recreating a phyloseq object

5. OPTIONAL. Final processing

Subset to root communities:

can_all.root <- subset_samples(can_all.no.dup2, RootSoil == "R")

Removes singletons and doubletons, then zero-sum samples:

can_all.root <- prune_taxa(taxa_sums(can_all.root) > 2, can_all.root) can_all.root <- prune_samples(sample_sums(can_all.root) > 2, can_all.root)

Create root matrices for further analysis in R:

can.asv.root.table <- as.matrix(can_all.root@otu_table)
can.tax.root.table <- as.matrix(can_all.root@tax_table)
can.sam.root.table <- data.frame(can_all.root@sam_data)