Discuss desired data structure pipeline from MONAI 0.2 onward
Check in with the Data Working Group on additional scope needing definition
Minutes
Brad to prepare content for next week’s MONAI board meeting on activities / motion to adopt
Update from joint discussion with Evaluation, Reproducibility & Benchmarking workgroup
Effort to get challenge data structured properly for sharing
Data Working Group looks at the “what” of the data coming in
Difficult to isolate pure data properties related to the experiments
E.g., although the benchmarking group is looking at this problem via cross-validation, some data fields like “type of scanner” might fall outside the workgroup’s scope
Can DICOM help?
Potentially - there are fields that could be populated - but this isn’t a solution for all kinds of data - what about when we leave DICOM?
E.g., what about when it is NIfTI - no structured metadata is included, and there’s only a free-text header field limited to 80 characters.
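To illustrate the constraint, a minimal sketch of packing free-text metadata into an 80-byte field, as in the NIfTI-1 `descrip` header (the helper and the metadata string here are illustrative, not a full NIfTI writer):

```python
# Illustrative sketch: the NIfTI-1 header reserves a single free-text
# field ("descrip") of 80 bytes -- any richer metadata must be squeezed
# into it and is silently truncated.

DESCRIP_BYTES = 80  # size of the NIfTI-1 descrip field

def pack_descrip(text: str) -> bytes:
    """Encode free text into a fixed 80-byte, NUL-padded field."""
    raw = text.encode("ascii", errors="replace")[:DESCRIP_BYTES]
    return raw.ljust(DESCRIP_BYTES, b"\x00")

meta = ("scanner=Siemens 3T; protocol=T1w MPRAGE; site=hospital-A; "
        "subject=anon-0042; notes=motion artifacts in slices 10-14")
packed = pack_descrip(meta)
print(len(packed))  # always 80
print(packed.rstrip(b"\x00").decode("ascii"))  # note the lost tail
```

Anything past the 80th byte (here, the acquisition notes) is simply gone, which is why experiment-level metadata cannot live in the NIfTI header alone.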
Compilation task should give researchers the ability to extract data painlessly
Reproducible IO
I/O Working Group looking at emitting a file that can be used to repeat a training session
Look at MLflow (https://mlflow.org/) - it captures the entire environment and workflow down to the network architecture
Some networks may not need to be reproducible
Also consider factors that affect reproducibility like hardware and drivers
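As a sketch of the kind of environment capture MLflow performs (a hand-rolled stand-in using only the standard library, not MLflow’s API), one can snapshot the software and hardware facts that affect reproducibility:

```python
import json
import platform
import sys

def capture_environment() -> dict:
    """Snapshot factors that can affect run-to-run reproducibility."""
    return {
        "python": sys.version.split()[0],  # interpreter version
        "platform": platform.platform(),   # OS and kernel
        "machine": platform.machine(),     # CPU architecture
        "processor": platform.processor(), # may be empty on some OSes
        # A real pipeline would also record GPU model, driver version,
        # CUDA/cuDNN versions, and pinned package versions.
    }

snapshot = capture_environment()
print(json.dumps(snapshot, indent=2))
```

Writing such a snapshot out alongside a training run is the minimum needed to later ask whether a hardware or driver difference explains a non-reproducible result.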
Compilation pipeline - is it possible to detect when data was transformed “differently” from how someone else transformed the data?
E.g., a warning on "damaging the data"
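One way to detect a divergent transform (a hypothetical fingerprinting scheme, not an existing MONAI feature) is to hash the transformed output and compare it against a reference hash published with the experiment:

```python
import hashlib
import warnings

def fingerprint(data: bytes) -> str:
    """Stable hash of transformed data, used as a reference fingerprint."""
    return hashlib.sha256(data).hexdigest()

def check_transform(data: bytes, reference: str) -> bool:
    """Warn when data was transformed differently than the reference run."""
    actual = fingerprint(data)
    if actual != reference:
        warnings.warn(
            f"transform mismatch: {actual[:12]} != {reference[:12]} "
            "-- data may have been damaged"
        )
        return False
    return True

original = b"\x00\x01\x02\x03"
ref = fingerprint(original)
check_transform(original, ref)           # same transform: silent
check_transform(b"\x00\x02\x04\x06", ref)  # altered data: warns
```

This only flags byte-level divergence; tolerating benign differences (e.g. float rounding) would need a looser comparison than an exact hash.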
MSD in MONAI 0.2
Random seed feature - training data is separated into validation / cross-validation splits
Implemented "import MSD dataset" - this pulls the data and parses the JSON
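A minimal sketch of both points - parsing an MSD-style dataset.json and deriving a deterministic validation split from a random seed (the JSON fragment and the split helper are illustrative, not MONAI’s implementation):

```python
import json
import random

# An MSD-style dataset.json fragment ("training" holds image/label pairs).
DATASET_JSON = """
{
  "name": "Task09_Spleen",
  "training": [
    {"image": "imagesTr/spleen_01.nii.gz", "label": "labelsTr/spleen_01.nii.gz"},
    {"image": "imagesTr/spleen_02.nii.gz", "label": "labelsTr/spleen_02.nii.gz"},
    {"image": "imagesTr/spleen_03.nii.gz", "label": "labelsTr/spleen_03.nii.gz"},
    {"image": "imagesTr/spleen_04.nii.gz", "label": "labelsTr/spleen_04.nii.gz"}
  ]
}
"""

def split_training(records, val_frac=0.25, seed=0):
    """Deterministically separate training records into train/val sets."""
    rng = random.Random(seed)  # fixed seed -> reproducible split
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_frac))
    return shuffled[n_val:], shuffled[:n_val]

records = json.loads(DATASET_JSON)["training"]
train, val = split_training(records, val_frac=0.25, seed=42)
print(len(train), len(val))  # 3 1
```

Because the seed fully determines the shuffle, re-running with the same seed reproduces the exact same train/validation partition.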