Reading HDF5 with R - Golob-Minot/geneshot GitHub Wiki

Reading Geneshot output using R

Author: Kristopher Kerns (2020)

The output from Geneshot is encoded in HDF5 format (as well as flat text files). One major advantage of HDF5 is that it is self-documenting and has a highly standardized format which makes it easy to read tables without parsing strings directly. However, one drawback is that it can be difficult to read in easily using R. To help with this, I have put together some code snippets showing how to use the reticulate library to read data into R, while using Python for its HDF5 parsing library.

Requirements

Install Python3 with the pandas library. Keep track of the path to your Python installation.

Example

# Install the reticulate package

install.packages("devtools")
devtools::install_github("rstudio/reticulate")
library("reticulate")

# Example of how reticulate can be used to execute Python code
os <- import("os")
os$listdir(".")

# Use the path to Python on your system
use_python("/usr/path/to/python")

# Create and use a new virtual enviornment named "r-reticulate"
use_virtualenv("r-reticulate") 

# Install packages required to read hdf5 data
py_install("pandas")
py_install("scipy")
py_install("pytables")
py_install("h5py")
py_install("rpy2")

# Import those packages into the environment
h5py <- import("h5py")
rpy2 <- import("rpy2")
rpy2$robjects
rpy2_ro <- import("rpy2.robjects")
rpy2_pandas2ri <- import("rpy2.robjects.pandas2ri")
rpy2_pandas2ri$py2rpy
pd <- import("pandas", convert = FALSE)
np <- import("numpy", convert = FALSE)

# Using the pandas read_hdf function (pd.read_hdf('file_path', key='your_group')

#Geneshot outputs files ("keys")
#'/manifest'                          '/ref/taxonomy'
#'/summary/all'                       '/ordination/pca'
#'/summary/breakaway'                 '/ordination/tsne'
#'/summary/experiment'                '/annot/gene/all'
#'/summary/genes_aligned'             '/annot/gene/cag'
#'/summary/genes_assembled'           '/annot/gene/eggnog'
#'/summary/readcount'                 '/annot/gene/tax'
#'/stats/cag/corncob'                 '/annot/cag/all'
#'/stats/cag/corncob_wide'
#'/abund/gene/wide'
#"/abund/cag/wide'

# Create an R data.frame for each of the geneshot outputs using the py_to_r function from the reticulate package
key <- pd$read_hdf("/path/to/geneshot.results.hdf5", key = "key", mode ='r+')
key_df <- py_to_r(key)
key_df

# Example
corncob_wide <- pd$read_hdf("/path/to/geneshot.results.hdf5", key = "/stats/cag/corncob_wide", mode ='r+')
corncob_wide_df <- py_to_r(corncob_wide)
corncob_wide_df

# To look at feathered gene and CAG abundance outputs use the feather package
BiocManager::install("feather")
library("feather")

cag_abund <- read_feather("/path/to/abund/CAG.abund.feather")
cag_abund

gene_abund <- read_feather("/path/to/abund/gene.abund.feather")
gene_abund

#once these outputs are converted to R data.frames you can utilize your favorite R packages to further explore your results!
#happy hunting