Data Set & Preprocessing Methods - TobiasSchmidtDE/DeepL-MedicalImaging GitHub Wiki

Dataset

CheXpert dataset
- 224,316 chest X-ray images
- 65,240 patients
- 14 observations
  - Atelectasis, Consolidation, Pneumothorax, Edema, Pleural Effusion, Pneumonia, Pleural Other, Cardiomegaly, Lung Lesion, Lung Opacity, Enlarged Cardiomediastinum, Fracture, Support Devices, No Finding
- frontal and lateral radiographs available
- labels are extracted from radiology reports via rule-based text extractor
- each label can be positive (1), negative (0) or uncertain (u)
- validation set of 200 random samples from 200 patients that has been individually annotated by three board-certified radiologists
ChestX-ray14 - NIH Chest X-ray Dataset of 14
- 112,120 frontal-view chest X-ray images
- 30,805 patients
- image size ~ 1024x1024px
- 14 text-mined disease image labels
  - Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural_thickening, Cardiomegaly, Nodule, Mass, Hernia, plus No Finding
- additional meta information:
  - patient age & gender
  - bounding boxes for ~1000 images
PLCO
- 185,421 chest X-ray images
- 56,071 patients
- image size ~ 2500 x 2100 px
- 12 disease image labels
  - Granuloma, Scarring, COPD, Hilar Abnorm, Infiltration, Pneumothorax, Fibrosis, Effusion, Pleural Thickening, Nodule, Mass, Hernia, plus No Finding
MIMIC-CXR Database
- MIMIC-CXR-JPG v2.0.0, a large dataset of 377,110 chest X-rays associated with 227,827 imaging studies sourced from the Beth Israel Deaconess Medical Center between 2011 and 2016.
- Labels were produced with the same auto-labeler as in CheXpert
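
Because CheXpert (and MIMIC-CXR, which uses the same labeler) has three-valued labels, a training pipeline has to decide what to do with the uncertain ones. The following is a minimal sketch of one way to do that, assuming the CSV encoding of the public CheXpert release (1.0 = positive, 0.0 = negative, -1.0 = uncertain, empty cell = observation not mentioned); check this against the downloaded train.csv. The function name `parse_label` and its `uncertain_as` / `blank_as` parameters are hypothetical and not part of this project.

```python
from typing import Optional


def parse_label(cell: str,
                uncertain_as: Optional[float] = 0.0,
                blank_as: Optional[float] = 0.0) -> Optional[float]:
    """Map one raw CSV cell to a training target.

    Assumed encoding: "1.0" positive, "0.0" negative, "-1.0" uncertain,
    "" not mentioned. uncertain_as / blank_as are hypothetical policy
    knobs: pass 0.0 or 1.0 to force a value, or None to mean "ignore".
    """
    text = cell.strip()
    if text == "":
        return blank_as
    value = float(text)
    if value == -1.0:
        return uncertain_as
    if value in (0.0, 1.0):
        return value
    raise ValueError(f"unexpected label value: {cell!r}")


# Illustrative row, not real data: treat uncertain labels as positive.
row = {"Atelectasis": "1.0", "Edema": "-1.0", "Fracture": "", "Pneumonia": "0.0"}
targets = {name: parse_label(cell, uncertain_as=1.0) for name, cell in row.items()}
print(targets)
```

Which policy works best (uncertain as 0, as 1, or ignored) is an empirical question and may differ per observation.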

Dataset Downloads

For ease of access all datasets can be downloaded from our GCP storage instance "idp-datasets":

DO NOT SHARE THESE PUBLICLY, SHARE OFFICIAL LINKS INSTEAD!

Chexpert Full Dataset - Size: 10.7 GB (Our download link) (Official Website)
- Full copy of the CheXpert dataset.
- Each row in train.csv and eval.csv has a "Path" field that points to the corresponding image file, relative to the downloaded dataset folder.
Chexpert Dev Dataset - Size: 1.09 GB (Our download link)
- This is a subset of the original (full) CheXpert dataset, intended to speed up development of preprocessing methods and network architectures before running them on the whole dataset for the actual training.
- Sampled by choosing 6400 patients at random and selecting all of their data (train and test data are both in the train folder).
- Automatically generated labels (as described in the CheXpert paper) are available in "train.csv" and "test.csv".
- Eval data is the same as in the full dataset (only 200 samples), but labeled by hand, using the majority vote of 3 radiologists.
- Each row in train.csv, test.csv and eval.csv has a "Path" field that points to the corresponding image file, relative to the downloaded dataset folder.
Chestxray14-NIH-1024 Dataset - Size: 41.99 GB (Our download link) (Official Website) (Official download link)
- Detailed description can be found in subfolder meta/info
- Labels can be found in subfolder meta/data/labels.csv
- All images of this dataset are of size 1024x1024
Chestxray14-NIH-512 Dataset - Size: 10.22 GB (Our download link)
- Same as Chestxray14-NIH-1024 Dataset
- All images of this dataset are of size 512x512
Chestxray14-NIH-256 Dataset - Size: 2.86 GB (Our download link)
- Same as Chestxray14-NIH-1024 Dataset
- All images of this dataset are of size 256x256
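
For the CheXpert downloads above, the "Path" column and the patient-level sampling of the dev set can be sketched as follows. This is an illustration under assumptions, not the script that was used to build the dev set: it assumes each "Path" value contains a folder component starting with `patient`, and the helper names (`patient_id`, `sample_patients`, `resolve`) as well as the folder `/data/chexpert-dev` are hypothetical. It uses only the standard library so it runs without extra dependencies.

```python
import csv
import io
import random
from pathlib import Path


def patient_id(rel_path: str) -> str:
    """Return the 'patientXXXXX' component of a "Path" entry.

    Assumption: one path component starts with 'patient', e.g.
    'train/patient00001/study1/view1_frontal.jpg'.
    """
    for part in rel_path.split("/"):
        if part.startswith("patient"):
            return part
    raise ValueError(f"no patient component in {rel_path!r}")


def sample_patients(rows, n_patients, seed=0):
    """Keep every row that belongs to n_patients randomly chosen patients."""
    ids = sorted({patient_id(r["Path"]) for r in rows})
    chosen = set(random.Random(seed).sample(ids, n_patients))
    return [r for r in rows if patient_id(r["Path"]) in chosen]


def resolve(dataset_root, row):
    """Join the relative "Path" entry onto the downloaded dataset folder."""
    return Path(dataset_root) / row["Path"]


# Tiny in-memory stand-in for train.csv (illustrative values only).
demo_csv = io.StringIO(
    "Path,Edema\n"
    "train/patient00001/study1/view1_frontal.jpg,1.0\n"
    "train/patient00001/study2/view1_frontal.jpg,0.0\n"
    "train/patient00002/study1/view1_frontal.jpg,-1.0\n"
    "train/patient00003/study1/view1_frontal.jpg,0.0\n"
)
rows = list(csv.DictReader(demo_csv))
subset = sample_patients(rows, n_patients=2, seed=0)
image_paths = [resolve("/data/chexpert-dev", r) for r in subset]
```

Sampling by patient rather than by image keeps all studies of one patient together, which avoids the same patient appearing in both a development subset and a held-out split.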

Dataset Combination

"Multi-task Learning for Chest X-ray Abnormality Classification on Noisy Labels" proposes combining the PLCO and ChestX-ray14 datasets by treating all abnormalities individually. This defines D = 12 + 14 = 26 classes for the network, and gradients are only computed for the labels of the dataset an image comes from.

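The idea of only computing gradients for the labels of an image's source dataset can be sketched as a masked binary cross-entropy. This is a minimal illustration, not the paper's implementation: the index layout (PLCO classes first, ChestX-ray14 classes second), the averaging over active classes, and the names `SLICES` and `masked_bce` are assumptions made for the example.

```python
import math

N_PLCO, N_NIH = 12, 14      # class counts from the text
D = N_PLCO + N_NIH          # D = 26 combined classes
# Assumption: PLCO classes occupy indices 0..11, ChestX-ray14 classes 12..25.
SLICES = {"plco": range(0, N_PLCO), "nih": range(N_PLCO, D)}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def masked_bce(logits, targets, source):
    """Binary cross-entropy over the classes of `source` only.

    Returns (loss, grad) with grad[i] = dLoss/dlogits[i]. Entries that
    belong to the other dataset's classes stay exactly 0.0, so those
    outputs receive no update from this image.
    """
    active = SLICES[source]
    grad = [0.0] * D
    loss = 0.0
    for i in active:
        p = sigmoid(logits[i])
        y = targets[i]
        loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
        grad[i] = (p - y) / len(active)
    return loss / len(active), grad


# One ChestX-ray14 image with a single positive label (illustrative values).
logits = [0.0] * D
targets = [0.0] * D
targets[N_PLCO] = 1.0
loss, grad = masked_bce(logits, targets, "nih")
```

In a deep learning framework the same effect is usually obtained by multiplying the per-class loss with a 0/1 mask before reducing it.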
Normalization

Problem: image appearance varies widely depending on the acquisition source, the radiation dose, and proprietary non-linear postprocessing, none of which is recorded in the metadata.

Options:

Generic solution for radiographs: multi-scale contrast enhancement/leveling techniques, as proposed in "Localized energy-based normalization of medical images: Application to chest radiography" and in "Multiscale contrast enhancement for radiographies: Laplacian pyramid versus fast wavelet transform".
Dynamic windowing: an efficient method for dynamically windowing each image (i.e., adjusting the brightness and contrast via a linear transformation of the image intensities) is proposed in "Multi-task Learning for Chest X-ray Abnormality Classification on Noisy Labels", where it was specifically chosen over the generic solution. The authors report that it improved performance on the diagnostic task and that, in addition to better generalization, it reduced training time by a factor of 2-3 on average.
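
To make "windowing via a linear transformation of the image intensities" concrete, here is a minimal sketch. It is not the procedure from the paper: choosing the window from intensity percentiles is an assumption made for illustration, and `window_image`, `low_pct` and `high_pct` are hypothetical names and defaults. It is written in plain Python so it runs without dependencies; a real pipeline would more likely operate on arrays.

```python
def window_image(pixels, low_pct=1.0, high_pct=99.0):
    """Linearly rescale intensities so the [low, high] window maps to [0, 1].

    pixels: 2D list of numeric intensities.
    low_pct / high_pct: percentiles that define the window (hypothetical
    defaults, not values taken from the paper).
    Values outside the window are clipped to 0.0 or 1.0.
    """
    flat = sorted(v for row in pixels for v in row)

    def percentile(p):
        # Nearest-rank percentile on the sorted intensities.
        idx = round(p / 100.0 * (len(flat) - 1))
        return flat[idx]

    lo, hi = percentile(low_pct), percentile(high_pct)
    if hi <= lo:
        # Constant image: nothing to window.
        return [[0.0 for _ in row] for row in pixels]
    scale = 1.0 / (hi - lo)
    return [[min(1.0, max(0.0, (v - lo) * scale)) for v in row]
            for row in pixels]


# Illustrative 2x2 "image" with one very bright pixel.
image = [[100, 200], [300, 4000]]
windowed = window_image(image)
```

Because the window is derived per image, two scans of the same anatomy with different overall brightness end up on a comparable intensity scale.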