Minutes Data Working Group 26 Apr 2020 - Project-MONAI/MONAI GitHub Wiki
Agenda
Review and refine the committee scope
Task focus: identify the state of the art, and define best practices
Plan meeting cadence
Attendees
Invited: Brad Genereaux (NVIDIA), Michael Götz (DKFZ), Carole Sudre (KCL), Stephen Aylward (Kitware), Ben Murray (KCL), Wenqi Li (NVIDIA), Jorge Cardoso (KCL), Prerna Dogra (NVIDIA)
Notes
What is the state of the art in biomarkers?
Preprocessing - trying to make MONAI a “Rosetta Stone” that speaks every library, so that other libraries can be brought in. Enabling thoughts so far, but not much further
Pre-processing pipeline built around the NIfTI format - data array / affine, and spacing / orientation.
Non-imaging data is classification labels - no principled way to capture this
Bounding boxes are being added
NLP / communication in EHR is a different problem space - combining imaging and non-imaging data
Are there libraries that can be interfaced with today to bring pipelines in?
NIfTI is well suited to storing imaging data, but not other data types
I/O Working Group - getting things off disk and storing them in memory
File structures - xarray in Python is an emerging standard
Preliminary support, not updating meta information
Connecting the data with the physical representation
Healthcare metadata - FHIR can be used, maybe CDA
But what are the use cases? It can be really broad, but too broad = maybe not very useful
Use cases for people using MONAI
Bringing tabular data into MONAI
Structured data - ontology tagged, data dictionary and key values
Structured data relating to images - coordinates, bounding boxes, imaging elements
Unstructured data - free text (NLP), 1D or 2D signals, heart rate, waveforms, movement rates
How do we make it easy for developers to feed the data to MONAI?
Design implications for the types of data being brought in
Use cases with data elements
Neuroimaging
COPD
Gene expression
Ultrasound guided intervention
How do we connect data together? Imaging and non-imaging data
E.g., a table or data dictionary element
Define the data types of these elements
A data manifest?
Creating a way to represent “subjects”
Do we pass along a “bag of content”, or do we need to indicate “prominence” of content?
Making it too complex makes it unusable
Leveraging DICOM specifications like TID 1500, while perhaps the most descriptive option, might not work well for developers early on
Starting with something simple but supporting more complex transactions is ideal
“Minimal data set” - if the use case demands “just pass along an image and a label”, that should be sufficient - developers shouldn’t have to specify unnecessary content
Flexible data loader -> standardized internal representations
Are there numpy representations that are used in other domains?
What do we trust exists when things get passed along?
We should support “if someone passes in a set of JPGs”
Impact of digital pathology as an example - there are open challenges in the interoperability space
A content manifest to define the content
Reproducibility
Provenance - where the data came from
Notion should be shared with the challenge group
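Illustration (not discussed in the meeting, not a MONAI API): a minimal sketch of how the “subject”, “minimal data set” and content manifest ideas above could fit together. All names here (Subject, Provenance, build_manifest and their fields) are hypothetical and chosen only for this example. An image and a label are enough to describe a subject; tabular data, signals, free text and provenance are optional.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, Dict, List, Optional


@dataclass
class Provenance:
    """Where the data came from (hypothetical fields)."""
    source: str                   # e.g. site or dataset name
    created_by: str = "unknown"   # tool or person that produced the entry


@dataclass
class Subject:
    """One subject. Only an identifier and an image path are required."""
    subject_id: str
    image: str                                       # path to NIfTI, JPG, ...
    label: Optional[Any] = None                      # e.g. a classification label
    tabular: Dict[str, Any] = field(default_factory=dict)          # key/value data
    signals: Dict[str, List[float]] = field(default_factory=dict)  # 1D signals
    free_text: Optional[str] = None                  # unstructured notes
    provenance: Optional[Provenance] = None


def _is_set(value: Any) -> bool:
    """True if an optional field carries content (not None, not an empty dict)."""
    return value is not None and value != {}


def build_manifest(subjects: List[Subject]) -> Dict[str, Any]:
    """Describe the content: which fields are used by at least one subject."""
    optional = ("label", "tabular", "signals", "free_text", "provenance")
    present = sorted(
        name for name in optional
        if any(_is_set(getattr(s, name)) for s in subjects)
    )
    return {
        "num_subjects": len(subjects),
        "fields_present": ["image"] + present,
        "subjects": [asdict(s) for s in subjects],
    }


# "Just pass along an image and a label" is sufficient.
minimal = Subject(subject_id="s001", image="s001.jpg", label=1)

# A richer subject adds non-imaging data and provenance.
rich = Subject(
    subject_id="s002",
    image="s002.nii.gz",
    label=0,
    tabular={"age": 63},
    signals={"heart_rate": [71.0, 72.5]},
    provenance=Provenance(source="site-A"),
)

manifest = build_manifest([minimal, rich])
print(json.dumps(manifest, indent=2))
```

In this sketch the manifest only records which fields are present, so a consumer knows what it can trust to exist when the data is passed along; it does not rank the “prominence” of any field.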
Action items
Brad; Doodle poll
Brad; manifest
All; collect use cases -> and types of inputs and outputs (EHR, signals, sensors, genetic data - few markers)
Carole; put together a template and share with the team