Minutes Data Working Group 26 Apr 2020 - Project-MONAI/MONAI GitHub Wiki

Agenda

Review and refine the committee scope
Task focus: identify the state of the art, and define best practices
Plan meeting cadence

Attendees

Invited: Brad Genereaux (NVIDIA), Michael Götz (DKFZ), Carole Sudre (KCL), Stephen Aylward (Kitware), Ben Murray (KCL), Wenqi Li (NVIDIA), Jorge Cardoso (KCL), Prerna Dogra (NVIDIA)

Notes

What is the state of the art in biomarkers?
- Preprocessing - trying to make MONAI a “Rosetta stone” that speaks to every library, so that other libraries can be brought in. There are enabling thoughts, but not much further
- Pre-processing pipeline built around NIFTI format - data array / offline and spacing orientation.
- Non-imaging data is currently just classification labels - there is no principled way to capture it
- Bounding-box support is being added
- NLP / communication in EHR is a different problem space - combining imaging and non-imaging data
- Are there libraries that can be interfaced with today to bring pipelines in?
- NIFTI is well-prepared for storing imaging data, but not for other data types
I/O Working Group - getting things off disk and storing them in memory
File structures - xarray in Python is an emerging standard
- Preliminary support; meta information is not updated
- Connecting the data with the physical representation
Healthcare metadata - FHIR can be used, maybe CDA
- But what are the use cases? It can be really broad, but too broad = maybe not very useful
Use cases for people using MONAI
- Bringing tabular data into MONAI
- Structured data - ontology tagged, data dictionary and key values
- Structured data relating to images - coordinates, bounding boxes, imaging elements
- Unstructured data - free text (NLP), 1D or 2D signals, heart rate, waveforms, movement rates
How do we make it easy for developers to feed the data to MONAI?
Design implications for the types of data being brought in
Use cases with data elements
- Neuroimaging
- COPD
  - Gene expression
- Ultrasound guided intervention
How do we connect data together? Imaging and non-imaging data
- E.g., a table or data dictionary element
- Define the data types of these elements
- A data manifest?
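One way to picture the manifest idea above is a small dictionary that lists each element for a subject together with its declared type. This is only a sketch for discussion: the key names, type names, file path and the `validate` helper are all hypothetical and are not part of MONAI.

```python
import json

# Hypothetical manifest layout; none of these keys or values come from MONAI.
manifest = {
    "subject_id": "subject-001",
    "elements": [
        {"name": "t1_image", "type": "image", "path": "subject-001/t1.nii.gz"},
        {"name": "age", "type": "float", "value": 63.0},
        {"name": "diagnosis", "type": "category", "value": "COPD"},
    ],
}

# Map each declared non-imaging type to the Python type its value must have.
VALUE_TYPES = {"float": float, "category": str}


def validate(manifest):
    """Return a list of problems; an empty list means the manifest is consistent."""
    problems = []
    for element in manifest["elements"]:
        kind = element["type"]
        if kind == "image":
            # Imaging elements are referenced by path rather than stored inline.
            if "path" not in element:
                problems.append(f"{element['name']}: image element needs a path")
        elif kind in VALUE_TYPES:
            if not isinstance(element.get("value"), VALUE_TYPES[kind]):
                problems.append(f"{element['name']}: value is not a {kind}")
        else:
            problems.append(f"{element['name']}: unknown type {kind!r}")
    return problems


print(json.dumps(manifest, indent=2))
print(validate(manifest))
```

Declaring the type next to each element keeps imaging and non-imaging data in one place while still letting a loader decide how to read each one.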
Creating a way to represent “subjects”
Do we pass along a “bag of content”, or do we need to indicate “prominence” of content?
- Making it too complex makes it unusable
- Leveraging DICOM specifications like TID 1500 might be the most descriptive option, but might not work well for developers early on
- Starting with something simple but supporting more complex transactions is ideal
“Minimal data set” - if the use case demands “just pass along an image and a label”, that should be sufficient - developers shouldn’t have to specify unnecessary content
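The “minimal data set” and “bag of content” ideas can be sketched together as a subject record where only the image and label are required and everything else is optional. The `Subject` class and its field names are hypothetical, chosen for illustration only; they do not describe an existing MONAI type.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Subject:
    """Hypothetical subject record: image and label are required, the rest is optional."""

    image: str  # path or identifier of the image
    label: int  # classification label
    extras: Dict[str, Any] = field(default_factory=dict)  # optional "bag of content"


# The minimal case: just pass along an image and a label.
minimal = Subject(image="scan_001.nii.gz", label=1)

# A richer case adds non-imaging data without changing the required fields.
richer = Subject(
    image="scan_002.nii.gz",
    label=0,
    extras={"age": 63.0, "bounding_boxes": [[10, 20, 50, 60]]},
)

print(minimal)
print(richer)
```

Whether the optional content needs an explicit notion of “prominence”, rather than a flat dictionary, is left open here, as it was in the discussion.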
Flexible data loader -> standardized internal representations
- Are there numpy representations that are used in other domains?
What do we trust exists when things get passed along?
We should support “if someone passes in a set of JPGs”
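As a rough illustration of a flexible loader feeding a standardized internal representation, the sketch below accepts a bare list of JPG paths, (image, label) pairs, or dictionaries, and turns each into the same record shape. The `to_records` function and the record keys are assumptions made for this example, not an existing MONAI interface, and no files are actually read.

```python
from pathlib import Path


def to_records(items):
    """Normalize mixed inputs into a list of dicts with at least "image" and "label"."""
    records = []
    for item in items:
        if isinstance(item, (str, Path)):
            # A bare path, e.g. one file from a set of JPGs: no label is known.
            record = {"image": str(item), "label": None}
        elif isinstance(item, (tuple, list)) and len(item) == 2:
            # An (image, label) pair.
            record = {"image": str(item[0]), "label": item[1]}
        elif isinstance(item, dict) and "image" in item:
            # A dictionary: keep any extra keys, default the label to None.
            record = {"label": None, **item}
            record["image"] = str(record["image"])
        else:
            raise TypeError(f"unsupported input: {item!r}")
        records.append(record)
    return records


records = to_records(
    ["a.jpg", ("b.jpg", 1), {"image": "c.jpg", "label": 0, "age": 63}]
)
for record in records:
    print(record)
```

Downstream code can then trust that every record has an "image" and a "label" key, which is one possible answer to the question of what is guaranteed to exist when data gets passed along.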
Impact of digital pathology as an example - there are open challenges in the interoperability space
A content manifest to define the content
- Reproducibility
- Provenance - where the data came from
- Notion should be shared with the challenge group
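For the reproducibility and provenance points, one simple option is for each manifest entry to record where the content came from along with a checksum, so a later run can confirm it is using the same bytes. This is a sketch under that assumption; the entry fields, the `manifest_entry` and `verify` helpers, and the source name are hypothetical.

```python
import hashlib


def manifest_entry(name, content, source):
    """Build a hypothetical manifest entry recording provenance and a checksum."""
    return {
        "name": name,
        "source": source,  # provenance: where the data came from
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
    }


def verify(entry, content):
    """Return True if the content still matches the checksum in the entry."""
    return hashlib.sha256(content).hexdigest() == entry["sha256"]


# Placeholder bytes stand in for the contents of a real file.
data = b"example bytes standing in for a file"
entry = manifest_entry("scan_001", data, source="hypothetical-site-A")

print(entry)
print(verify(entry, data))
print(verify(entry, data + b"!"))
```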

Action items

Brad: Doodle poll
Brad: manifest
All: collect use cases and the types of inputs and outputs (EHR, signals, sensors, genetic data - few markers)
Carole: put together a template and share with the team