RSNA-STR Pulmonary Embolism Detection - RSNA/AI-Challenge-Data GitHub Wiki

RSNA assembled this dataset in 2020 for the RSNA STR Pulmonary Embolism Detection AI Challenge (https://www.kaggle.com/c/rsna-str-pulmonary-embolism-detection/). With more than 12,000 CT pulmonary angiography (CTPA) studies contributed by five international research centers, it is the largest publicly available annotated PE dataset. RSNA collaborated with the Society of Thoracic Radiology to recruit more than 80 expert thoracic radiologists who labeled the dataset with detailed clinical annotations.

Description

The dataset is contained in a Zip archive that includes both DICOM image files (.dcm) and a tabular annotation file (.csv). A detailed description of the dataset is provided in E Colak, FC Kitamura, SB Hobbs, et al. "The RSNA Pulmonary Embolism CT Dataset." Radiology: Artificial Intelligence 2021;3:2 (https://pubs.rsna.org/doi/full/10.1148/ryai.2021200254).

License

You may access and use these de-identified imaging datasets and annotations (“the data”) for non-commercial purposes only, including academic research and education, as long as you agree to abide by the following provisions:

  • Not to make any attempt to identify or contact any individual(s) who may be the subjects of the data.
  • If you share or re-distribute the data in any form, to include a citation to the “RSNA-STR Pulmonary Embolism CT (RSPECT) Dataset, Copyright RSNA, 2020” as follows: E Colak, FC Kitamura, SB Hobbs, et al. "The RSNA Pulmonary Embolism CT Dataset." Radiology: Artificial Intelligence 2021;3:2 (DOI: 10.1148/ryai.2021200254).

Tutorial

Data Overview

The goal of this competition was to predict the presence and characteristics of pulmonary embolism in CTPA studies. The competition was inference-only, meaning that submitted kernels did not have access to the training set.

The private test set is approximately 3x larger than the public test set (230 GB vs. 70 GB). The training set includes 7,279 studies, the public test set 650, and the private test set 1,517.

Files

You will need the training and test images, as well as train.csv and test.csv. The images are grouped in directories by study and series. They are in DICOM format, and contain additional metadata that may be relevant to the competition. Each image has a unique identifier - SOPInstanceUID.

The location for each image is given by: `<StudyInstanceUID>/<SeriesInstanceUID>/<SOPInstanceUID>.dcm`.
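Given that layout, an image's path can be assembled from its three UIDs. A minimal sketch (the helper name, the example UIDs, and the use of pydicom for reading are illustrative assumptions, not part of the official documentation):

```python
from pathlib import Path

def image_path(root: str, study_uid: str, series_uid: str, sop_uid: str) -> Path:
    """Build <root>/<StudyInstanceUID>/<SeriesInstanceUID>/<SOPInstanceUID>.dcm."""
    return Path(root) / study_uid / series_uid / f"{sop_uid}.dcm"

p = image_path("train", "1.2.826.0.1", "1.2.826.0.2", "1.2.826.0.3")

# The file itself could then be read with pydicom, which exposes both the
# pixel data and the DICOM metadata mentioned above, e.g.:
#   import pydicom
#   ds = pydicom.dcmread(p)
```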

The data provided by the host for this competition and made available below is the RSNA-STR PE CT (RSPECT) dataset. Use of the dataset for non-commercial and/or academic purposes is permitted with citation.

Data Format

train.csv contains the three UIDs noted above, plus a number of labels. Some are targets that require predictions, and some are informational; both kinds are noted below in Data Fields.

test.csv contains only the three UIDs.
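As a sketch of how the tabular files can be inspected, the label columns are simply everything other than the three UID columns, and exam-level labels are constant within a study (the toy rows below are invented for illustration; pandas usage is an assumption):

```python
import io
import pandas as pd

# Toy excerpt mimicking train.csv: the three UIDs plus two of the labels.
toy_train_csv = io.StringIO(
    "StudyInstanceUID,SeriesInstanceUID,SOPInstanceUID,pe_present_on_image,negative_exam_for_pe\n"
    "s1,ser1,img1,1,0\n"
    "s1,ser1,img2,0,0\n"
    "s2,ser2,img3,0,1\n"
)
train = pd.read_csv(toy_train_csv)

# Everything that is not a UID is a label (target or informational).
uid_cols = ["StudyInstanceUID", "SeriesInstanceUID", "SOPInstanceUID"]
label_cols = [c for c in train.columns if c not in uid_cols]

# Exam-level labels repeat on every image row of a study, so they can be
# reduced to one value per study:
exam_labels = train.groupby("StudyInstanceUID")["negative_exam_for_pe"].first()
```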

Predictions

Competitors predicted a number of labels, at both the image and study level. Note that some labels are logically mutually exclusive.

File Descriptions

  • test - all test images
  • train - all training images (note that submission kernels did NOT have access to this set of images, so competitors built models elsewhere and incorporated them into their submissions)
  • sample_submission.csv - contains a row for each UID+label combination that requires a prediction. It therefore has a row for each image (for which competitors predicted the existence of a pulmonary embolism within the image) and a row for each study+label combination that required a study-level prediction.
  • train.csv - contains UIDs and all labels.
  • test.csv - contains UIDs.
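The row structure of the submission file can be sketched as follows: one row per image for the image-level prediction, plus one row per exam-level label for each study. The exact id format here (`<StudyInstanceUID>_<label>` for exam-level rows, the bare SOPInstanceUID for image-level rows) is an assumption based on common Kaggle conventions; verify it against sample_submission.csv itself:

```python
# Exam-level target labels, as listed under Data Fields below.
EXAM_LABELS = [
    "negative_exam_for_pe", "indeterminate",
    "rv_lv_ratio_gte_1", "rv_lv_ratio_lt_1",
    "leftsided_pe", "rightsided_pe", "central_pe",
    "chronic_pe", "acute_and_chronic_pe",
]

def submission_rows(study_uid, image_uids, image_probs, exam_probs):
    """Yield (id, prediction) rows: one per image, one per exam-level label.

    Assumption: image-level ids are the bare SOPInstanceUID and exam-level
    ids are "<StudyInstanceUID>_<label>"; check sample_submission.csv.
    """
    for sop_uid, p in zip(image_uids, image_probs):
        yield (sop_uid, p)                              # pe_present_on_image
    for name in EXAM_LABELS:
        yield (f"{study_uid}_{name}", exam_probs[name])  # exam-level labels

rows = list(submission_rows("s1", ["img1", "img2"], [0.9, 0.1],
                            {name: 0.5 for name in EXAM_LABELS}))
```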

Data Fields

  • StudyInstanceUID - unique ID for each study (exam) in the data.
  • SeriesInstanceUID - unique ID for each series within the study.
  • SOPInstanceUID - unique ID for each image within the study (and data).
  • pe_present_on_image - image-level, notes whether any form of PE is present on the image.
  • negative_exam_for_pe - exam-level, indicates whether the study is negative for PE, i.e. no image in the study has PE present.
  • qa_motion - informational, indicates whether radiologists noted an issue with motion in the study.
  • qa_contrast - informational, indicates whether radiologists noted an issue with contrast in the study.
  • flow_artifact - informational
  • rv_lv_ratio_gte_1 - exam-level, indicates whether the RV/LV ratio present in the study is >= 1
  • rv_lv_ratio_lt_1 - exam-level, indicates whether the RV/LV ratio present in the study is < 1
  • leftsided_pe - exam-level, indicates that there is PE present on the left side of the images in the study
  • chronic_pe - exam-level, indicates that the PE in the study is chronic
  • true_filling_defect_not_pe - informational, indicates a defect that is NOT PE
  • rightsided_pe - exam-level, indicates that there is PE present on the right side of the images in the study
  • acute_and_chronic_pe - exam-level, indicates that the PE present in the study is both acute AND chronic
  • central_pe - exam-level, indicates that there is PE present in the center of the images in the study
  • indeterminate - exam-level, indicates that, while the study is not negative for PE, an ultimate set of exam-level labels could not be created due to QA issues

The image below is a flowchart outlining the relationships between labels. Note that four labels in the training set are purely informational and require no predictions: QA Contrast, QA Motion, True filling defect not PE, and Flow artifact. They are not scored, but are meant to be used as helpers. Also note that Acute PE is not an explicit label; it is implied by the absence of both Chronic PE and Acute and Chronic PE.

[Figure: PE label hierarchy flowchart]

Note that predictions were required to adhere to the expected label hierarchy defined in this diagram, and the host verified that prospective winners did not make conflicting label predictions. The requirements that submissions were held to are specified by the host in this post, and the code used to check compliance with these requirements is available in this notebook.
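A few of the hierarchy rules can be sanity-checked programmatically. The sketch below encodes some mutual-exclusivity constraints implied by the label descriptions above; these are assumptions drawn from this page, not the host's authoritative rule set (which is in the linked post and notebook):

```python
def exam_label_problems(ex: dict) -> list:
    """Return a list of rule violations for one exam's binarized labels.

    Checks a handful of constraints implied by the label hierarchy;
    the authoritative checker is the host's notebook.
    """
    problems = []
    positive = not ex["negative_exam_for_pe"] and not ex["indeterminate"]
    if ex["negative_exam_for_pe"] and ex["indeterminate"]:
        problems.append("negative and indeterminate are mutually exclusive")
    if ex["rv_lv_ratio_gte_1"] and ex["rv_lv_ratio_lt_1"]:
        problems.append("RV/LV >= 1 and RV/LV < 1 are mutually exclusive")
    if ex["chronic_pe"] and ex["acute_and_chronic_pe"]:
        problems.append("chronic and acute-and-chronic are mutually exclusive")
    if positive and not (ex["leftsided_pe"] or ex["rightsided_pe"] or ex["central_pe"]):
        problems.append("a positive exam needs at least one location label")
    return problems

# A consistent positive exam: left-sided PE with RV/LV >= 1.
ok = {"negative_exam_for_pe": 0, "indeterminate": 0,
      "rv_lv_ratio_gte_1": 1, "rv_lv_ratio_lt_1": 0,
      "leftsided_pe": 1, "rightsided_pe": 0, "central_pe": 0,
      "chronic_pe": 0, "acute_and_chronic_pe": 0}
```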

Download

Medical Imaging Resource for AI (MIRA)
