Data Manager Overview - nshawen/RTOUtil GitHub Wiki

Introduction

This data manager package focuses on two main components of analysis pipelines for wearables in healthcare:

A hierarchy (Participant -> Session -> Event -> ...) describing the provenance of data and metadata. Project-relevant data objects can be attached to objects at any level of the hierarchy. Additional data objects can be used for signal processing steps (filtering, etc.).
A structured library of features, used to condense the raw information for use in statistical modeling or reporting. Features can easily be incorporated into a project and aggregated at any level of the project hierarchy. Custom feature classes can also be added and designed so that they are computed only from appropriate data sources.

This package is not designed to provide production-level, fast pipelines. Rather, it is intended to serve as a flexible tool for developing custom research pipelines with minimal upfront work, especially for novice Python programmers. The goal is for the user to focus on high-level definition of the data structure of the project, desired processing and feature extraction steps, and modeling inputs/outputs instead of spending time developing a full analysis pipeline from scratch.

Data Hierarchy

All organization starts at the level of the Project, a class meant to encapsulate all lower levels and any information that is general to the entire project (e.g. the types of data sources, data storage locations, allowed types of participant cohorts). Below the Project is the Participant, representing an individual from whom data is collected as part of the Project it belongs to. Metadata such as demographic characteristics can be managed at this level. Each Participant has one or more Session objects, each of which represents a timepoint at which data was collected. Metadata such as assessment scores, along with raw data signals, can be aggregated at this level. Within a Session, Event objects can be added, each of which refers to a specific subset of the timespan covered by the parent Session. An Event may reference either a time range within the Session or a single timestamp. Event objects are useful for organizing data signals and metadata related to scripted tasks or event markings during a larger data recording session.
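The nesting described above can be sketched with plain Python dataclasses. This is only an illustration of the Project -> Participant -> Session -> Event structure, not the package's actual API: every class, attribute, and method name below (`add_participant`, `add_session`, `add_event`, `duration`, and so on) is an assumption made for the example, as is the choice of seconds as the time unit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Event:
    """A subset of the parent Session's timespan (hypothetical sketch)."""
    name: str
    start: float                 # seconds from session start (assumed unit)
    end: Optional[float] = None  # None marks a single-timestamp event


@dataclass
class Session:
    """One timepoint at which data was collected."""
    name: str
    duration: float              # session length in seconds (assumed unit)
    metadata: Dict[str, object] = field(default_factory=dict)
    events: List[Event] = field(default_factory=list)

    def add_event(self, event: Event) -> Event:
        # An Event may only cover time that lies inside its parent Session.
        last = event.start if event.end is None else event.end
        if not 0 <= event.start <= last <= self.duration:
            raise ValueError(f"{event.name} falls outside session {self.name}")
        self.events.append(event)
        return event


@dataclass
class Participant:
    """An individual from whom data is collected."""
    participant_id: str
    metadata: Dict[str, object] = field(default_factory=dict)
    sessions: List[Session] = field(default_factory=list)

    def add_session(self, session: Session) -> Session:
        self.sessions.append(session)
        return session


@dataclass
class Project:
    """Top level: holds project-wide settings and all participants."""
    name: str
    cohorts: List[str]           # allowed participant cohorts
    participants: Dict[str, Participant] = field(default_factory=dict)

    def add_participant(self, participant: Participant, cohort: str) -> Participant:
        # Project-wide rules (here: allowed cohorts) are enforced at this level.
        if cohort not in self.cohorts:
            raise ValueError(f"unknown cohort: {cohort}")
        participant.metadata["cohort"] = cohort
        self.participants[participant.participant_id] = participant
        return participant


project = Project(name="demo_study", cohorts=["control", "patient"])
participant = project.add_participant(
    Participant("P001", metadata={"age": 62}), cohort="patient"
)
baseline = participant.add_session(
    Session("baseline", duration=600.0, metadata={"assessment_score": 14})
)
baseline.add_event(Event("walk_test", start=30.0, end=90.0))  # time range
baseline.add_event(Event("button_press", start=120.0))        # single timestamp
```

Keeping the range check inside `Session.add_event` is one possible design choice: it makes the "Event is a subset of its Session" rule hard to violate by accident, at the cost of requiring the session duration to be known up front.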

At either the level of Session or Event, Data objects describing raw or processed data signals can be aggregated. The package contains a library of raw and processed data types, but custom ones may be easily added. Each Data object may then have individual Feature objects added to it, each describing an aspect of the parent Data as a single number. Again, a library of pre-defined features is included in the package, but custom features can be added. The desired Data and Feature types for a given project can be defined at the level of the parent Project object.
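As a rough sketch of the Data and Feature relationship, the example below reduces a signal to single numbers and restricts one feature to an appropriate data source. The class names (`AccelerometerData`, `HeartRateData`, `MeanValue`, `RestingHeartRate`), the `valid_sources` attribute, and the `add_to` method are all hypothetical and are not taken from the package. Defining resting heart rate as the minimum sample is likewise just a placeholder calculation.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Tuple, Type


@dataclass
class Data:
    """A raw or processed signal, plus the features computed from it."""
    samples: List[float]
    features: Dict[str, float] = field(default_factory=dict)


class AccelerometerData(Data):
    """Hypothetical data type for accelerometer signals."""


class HeartRateData(Data):
    """Hypothetical data type for heart rate signals."""


class Feature:
    """Base class: describes one aspect of a Data object as a single number."""
    name = "feature"
    valid_sources: Tuple[Type[Data], ...] = (Data,)  # accepted data types

    def compute(self, data: Data) -> float:
        raise NotImplementedError

    def add_to(self, data: Data) -> float:
        # Refuse to compute a feature from an inappropriate data source.
        if not isinstance(data, self.valid_sources):
            raise TypeError(
                f"{self.name} cannot be computed from {type(data).__name__}"
            )
        value = float(self.compute(data))
        data.features[self.name] = value
        return value


class MeanValue(Feature):
    """Generic feature: valid for any Data type."""
    name = "mean"

    def compute(self, data: Data) -> float:
        return mean(data.samples)


class RestingHeartRate(Feature):
    """Custom feature restricted to heart rate data (placeholder definition)."""
    name = "resting_hr"
    valid_sources = (HeartRateData,)

    def compute(self, data: Data) -> float:
        return min(data.samples)


heart_rate = HeartRateData(samples=[62.0, 58.0, 71.0])
MeanValue().add_to(heart_rate)
RestingHeartRate().add_to(heart_rate)

accel = AccelerometerData(samples=[0.1, 0.3])
try:
    RestingHeartRate().add_to(accel)
    rejected = False
except TypeError:
    rejected = True  # the feature is not defined for accelerometer data
```

Storing each computed value on the parent `Data` object mirrors the description above, where Feature objects are added to the Data they describe. A custom feature only needs a `name`, a `compute` method, and optionally a narrower `valid_sources`.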

This object hierarchy allows for data or features to be easily organized at any of the levels and combined with metadata for modeling and statistical analyses. The generalized structure allows for flexible adaptations to modeling targets without significant changes to upstream code.
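To illustrate how features and metadata from different levels might be combined for modeling, the sketch below flattens a small hierarchy into one row per session. It uses nested dictionaries rather than the package's own classes, and the function name `build_rows`, the keys, and the sample values are all invented for the example.

```python
from typing import Dict, List


def build_rows(participants: List[dict]) -> List[Dict[str, object]]:
    """Flatten the hierarchy into one row per session (hypothetical helper)."""
    rows = []
    for participant in participants:
        for session in participant["sessions"]:
            row: Dict[str, object] = {
                "participant_id": participant["id"],
                "session": session["name"],
            }
            row.update(participant["metadata"])  # e.g. demographics
            row.update(session["metadata"])      # e.g. assessment scores
            row.update(session["features"])      # single-number features
            rows.append(row)
    return rows


# Invented sample values, for illustration only.
participants = [
    {
        "id": "P001",
        "metadata": {"age": 62, "cohort": "patient"},
        "sessions": [
            {
                "name": "baseline",
                "metadata": {"assessment_score": 14},
                "features": {"accel_rms": 0.42},
            },
            {
                "name": "followup",
                "metadata": {"assessment_score": 9},
                "features": {"accel_rms": 0.55},
            },
        ],
    },
]

rows = build_rows(participants)

# Changing the modeling target means selecting a different column;
# the code that builds the hierarchy stays the same.
inputs = [row["accel_rms"] for row in rows]
targets = [row["assessment_score"] for row in rows]
```

Because each row carries metadata from every level above it, switching the target from an assessment score to, say, cohort membership requires no change to the upstream steps that populate the hierarchy.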