Data Set & Preprocessing Methods - TobiasSchmidtDE/DeepL-MedicalImaging GitHub Wiki

Dataset

  • CheXpert dataset
    • 224,316 chest X-ray images
    • 65,240 patients
    • 14 observations
      • Atelectasis, Consolidation, Pneumothorax, Edema, Pleural Effusion, Pneumonia, Pleural Other, Cardiomegaly, Lung Lesion, Lung Opacity, Enlarged Cardiomediastinum, Fracture, Support Devices, No Finding
    • frontal and lateral radiographs available
    • labels are extracted from radiology reports via a rule-based text extractor
    • each label can be positive (1), negative (0) or uncertain (u)
    • validation set of 200 random samples from 200 patients, each individually annotated by three board-certified radiologists
  • ChestX-ray14 - NIH Chest X-ray Dataset of 14
    • 112,120 frontal-view chest X-ray images
    • 30,805 patients
    • image size ~ 1024x1024px
    • 14 text-mined disease image labels
      • Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural_thickening, Cardiomegaly, Nodule, Mass, Hernia, plus No Finding
    • additional meta information:
      • patient age & gender
      • bounding boxes for ~1000 images
  • PLCO
    • 185,421 chest X-ray images
    • 56,071 patients
    • image size ~ 2500 x 2100 px
    • 12 disease image labels
      • Granuloma, Scarring, COPD, Hilar Abnorm, Infiltration, Pneumothorax, Fibrosis, Effusion, Pleural Thickening, Nodule, Mass, Hernia, plus No Finding
  • MIMIC-CXR Database
    • MIMIC-CXR-JPG v2.0.0, a large dataset of 377,110 chest X-rays associated with 227,827 imaging studies sourced from the Beth Israel Deaconess Medical Center between 2011 and 2016.
    • Uses the same auto-labeler as CheXpert

Dataset Downloads

For ease of access, all datasets can be downloaded from our GCP storage instance "idp-datasets":

DO NOT SHARE THESE PUBLICLY, SHARE OFFICIAL LINKS INSTEAD!

  • CheXpert Full Dataset - Size: 10.7 GB (Our download link) (Official Website)
    • Full copy of the CheXpert dataset.
    • Each entry in train.csv and eval.csv has a "Path" field that points to the corresponding image file, relative to the downloaded dataset folder.
  • CheXpert Dev Dataset - Size: 1.09 GB (Our download link)
    • A subset of the original (full) CheXpert dataset, intended to speed up development of preprocessing methods and network architectures before running them on the whole dataset for the actual training.
    • Sampled by choosing 6400 patients at random and selecting all their data (train and test data are both in the train folder).
    • Automatically generated labels (as described in the CheXpert paper) are available in "train.csv" and "test.csv".
    • Eval data is the same as in the full dataset (only 200 samples), all labeled by hand using a majority vote of 3 radiologists.
    • Each entry in train.csv, test.csv and eval.csv has a "Path" field that points to the corresponding image file, relative to the downloaded dataset folder.
  • Chestxray14-NIH-1024 Dataset - Size: 41.99 GB (Our download link) (Official Website) (Official download link)
    • A detailed description can be found in the subfolder meta/info
    • Labels can be found in meta/data/labels.csv
    • All images of this dataset are of size 1024x1024
  • Chestxray14-NIH-512 Dataset - Size: 10.22 GB (Our download link)
    • Same as Chestxray14-NIH-1024 Dataset
    • All images of this dataset are of size 512x512
  • Chestxray14-NIH-256 Dataset - Size: 2.86 GB (Our download link)
    • Same as Chestxray14-NIH-1024 Dataset
    • All images of this dataset are of size 256x256
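
The "Path" convention described for the CheXpert CSV files can be resolved with a few lines of standard-library Python. This is only a minimal sketch: the function names `resolve_rows` and `resolve_csv` and the folder name used in the usage line are hypothetical, and the only assumption taken from the description above is that each CSV has a "Path" column holding a path relative to the downloaded dataset folder.

```python
import csv
from pathlib import Path


def resolve_rows(dataset_root, csv_lines, path_column="Path"):
    """Join each row's relative image path onto the dataset folder."""
    root = Path(dataset_root)
    return [root / row[path_column] for row in csv.DictReader(csv_lines)]


def resolve_csv(dataset_root, csv_name="train.csv"):
    """Open a label CSV inside the dataset folder and resolve its paths."""
    root = Path(dataset_root)
    with open(root / csv_name, newline="") as handle:
        return resolve_rows(root, handle)
```

For example, `resolve_csv("chexpert_dev", "test.csv")` would return one `Path` per row of test.csv, assuming the dev dataset was unpacked into a folder called `chexpert_dev`.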

Dataset Combination

"Multi-task Learning for Chest X-ray Abnormality Classification on Noisy Labels" proposes combining the PLCO and ChestX-ray14 datasets by treating the abnormalities of each dataset as separate classes, which gives D = 12 + 14 = 26 classes for the network. For each image, gradients are computed only for the labels of the dataset the image comes from.
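
One way to picture the "only compute gradients for the source dataset's labels" idea is a binary cross-entropy whose per-class terms are multiplied by a 0/1 mask. The NumPy sketch below is an illustration, not the paper's implementation: the names `dataset_mask` and `masked_bce`, the source keys "plco" and "chestxray14", and the ordering of the 26 outputs (PLCO classes first) are all assumptions made for this example.

```python
import numpy as np

NUM_PLCO, NUM_NIH = 12, 14  # class counts from the text, D = 26


def dataset_mask(source):
    """Return a 0/1 vector marking the labels that belong to `source`."""
    mask = np.zeros(NUM_PLCO + NUM_NIH, dtype=np.float64)
    if source == "plco":
        mask[:NUM_PLCO] = 1.0  # assumed layout: PLCO classes come first
    elif source == "chestxray14":
        mask[NUM_PLCO:] = 1.0
    else:
        raise ValueError(f"unknown source dataset: {source!r}")
    return mask


def masked_bce(y_true, y_prob, mask, eps=1e-7):
    """Binary cross-entropy averaged over the unmasked labels only."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    per_class = -(y_true * np.log(y_prob)
                  + (1.0 - y_true) * np.log(1.0 - y_prob))
    # Masked-out labels contribute exactly zero to the loss, so in an
    # autodiff framework they would also receive zero gradient.
    return float((per_class * mask).sum() / max(mask.sum(), 1.0))
```

Because the masked terms are multiplied by zero, predictions for the other dataset's classes have no effect on the loss value for that image.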

Normalization

Problem: image appearance varies widely depending on the acquisition source, the radiation dose and proprietary non-linear postprocessing, none of which is recorded in the metadata.

Options: