Reading HDF5 with R - Golob-Minot/geneshot GitHub Wiki
Reading Geneshot output using R
Author: Kristopher Kerns (2020)
The output from Geneshot is encoded in HDF5 format (as well as flat
text files). One major advantage of HDF5 is that it is self-documenting
and highly standardized, which makes it easy to read tables without
parsing strings directly. One drawback is that HDF5 files can be awkward
to read from R. To help with this, I have put together some code snippets
showing how to use the reticulate package to read the data into R, with
Python and its HDF5 parsing libraries doing the file access.
Requirements
Install Python 3 with the pandas library. Keep track of the path to
your Python installation.
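If you are not sure where Python lives on your system, the following is a minimal sketch for finding it from R. It uses only base R; the variable names `python_path` and `has_pandas` are my own, and it assumes that a `python3` or `python` executable is on your PATH (a plain `python` may still be Python 2 on older systems, so check the result).

```r
# Look up candidate Python executables on the PATH (base R only).
python_path <- Sys.which(c("python3", "python"))
python_path <- unname(python_path[nzchar(python_path)])
print(python_path)

# Ask the first interpreter found whether pandas can be imported.
# An exit status of 0 means the import succeeded.
has_pandas <- NA
if (length(python_path) > 0) {
  status <- system2(python_path[1], c("-c", shQuote("import pandas")),
                    stdout = FALSE, stderr = FALSE)
  has_pandas <- identical(as.integer(status), 0L)
}
print(has_pandas)
```

The first path printed is the one to pass to reticulate in the example that follows.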
Example
# Install and load the reticulate package
install.packages("devtools")
devtools::install_github("rstudio/reticulate")
library("reticulate")

# Create a virtual environment named "r-reticulate" from the Python on your
# system, then select it. Do this before the first import(): once Python has
# been initialized, reticulate cannot switch to another interpreter.
virtualenv_create("r-reticulate", python = "/usr/path/to/python")
use_virtualenv("r-reticulate", required = TRUE)

# Example of how reticulate can be used to execute Python code
os <- import("os")
os$listdir(".")

# Install the Python packages required to read HDF5 data.
# PyTables is published on PyPI as "tables" ("pytables" is its conda name).
py_install("pandas")
py_install("scipy")
py_install("tables")
py_install("h5py")
py_install("rpy2")
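
# Import those packages into the environment.
# convert = FALSE keeps pandas and numpy results as Python objects until
# py_to_r() is called on them explicitly.
h5py <- import("h5py")
pd <- import("pandas", convert = FALSE)
np <- import("numpy", convert = FALSE)

# Optional: rpy2 is not used by the py_to_r() conversion below.
rpy2 <- import("rpy2")
rpy2_ro <- import("rpy2.robjects")
rpy2_pandas2ri <- import("rpy2.robjects.pandas2ri")
rpy2$robjects
rpy2_pandas2ri$py2rpy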

# Tables are read with the pandas read_hdf function:
#   pd.read_hdf('file_path', key='your_group')

# Geneshot output tables ("keys"):
# '/manifest'                 '/ref/taxonomy'
# '/summary/all'              '/ordination/pca'
# '/summary/breakaway'        '/ordination/tsne'
# '/summary/experiment'       '/annot/gene/all'
# '/summary/genes_aligned'    '/annot/gene/cag'
# '/summary/genes_assembled'  '/annot/gene/eggnog'
# '/summary/readcount'        '/annot/gene/tax'
# '/stats/cag/corncob'        '/annot/cag/all'
# '/stats/cag/corncob_wide'
# '/abund/gene/wide'
# '/abund/cag/wide'

# Create an R data.frame from any Geneshot table with reticulate's py_to_r().
# Replace "key" with one of the keys listed above; mode = "r" opens the file read-only.
key <- pd$read_hdf("/path/to/geneshot.results.hdf5", key = "key", mode = "r")
key_df <- py_to_r(key)
key_df

# Example: read the wide corncob results table
corncob_wide <- pd$read_hdf("/path/to/geneshot.results.hdf5", key = "/stats/cag/corncob_wide", mode = "r")
corncob_wide_df <- py_to_r(corncob_wide)
corncob_wide_df

# To look at the Feather-format gene and CAG abundance outputs, use the feather package
install.packages("feather")
library("feather")

cag_abund <- read_feather("/path/to/abund/CAG.abund.feather")
cag_abund

gene_abund <- read_feather("/path/to/abund/gene.abund.feather")
gene_abund

# Once these outputs are converted to R data.frames, you can use your favorite R packages to explore your results further.
# Happy hunting!
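
The key names in the example above were typed in by hand, but they can also be read from the results file itself. Here is a rough sketch of two helpers for that, assuming the tables were written by pandas (which `read_hdf` requires anyway). The function names `list_hdf5_keys` and `read_hdf5_tables` are my own and are not part of Geneshot or reticulate; the calls they make (`pandas.HDFStore`, its `keys()` and `close()` methods, and `pandas.read_hdf`) are standard pandas.

```r
# Hypothetical helper: list the table keys stored in an HDF5 file.
list_hdf5_keys <- function(path) {
  pd <- reticulate::import("pandas", convert = FALSE)
  store <- pd$HDFStore(path, mode = "r")   # open read-only
  on.exit(store$close(), add = TRUE)       # always release the file handle
  unlist(reticulate::py_to_r(store$keys()))
}

# Hypothetical helper: read several keys into a named list of R data.frames.
read_hdf5_tables <- function(path, keys) {
  pd <- reticulate::import("pandas", convert = FALSE)
  tables <- lapply(keys, function(k) {
    reticulate::py_to_r(pd$read_hdf(path, key = k, mode = "r"))
  })
  names(tables) <- keys
  tables
}

# Usage (placeholder path, as in the example above):
# keys <- list_hdf5_keys("/path/to/geneshot.results.hdf5")
# summaries <- read_hdf5_tables("/path/to/geneshot.results.hdf5",
#                               c("/manifest", "/summary/all"))
```

Calling reticulate through `reticulate::` keeps the helpers usable even if `library("reticulate")` has not been run in the current session. The large abundance tables can take a while to convert, so it may be worth reading only the keys you need.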