i2b2 UTHealth - sporedata/researchdesigneR GitHub Wiki

General description

The 2014 i2b2/UTHealth dataset is a widely used corpus in clinical natural language processing (NLP). It was released for the 2014 i2b2/UTHealth shared task, which covered two problems: automatic de-identification of clinical narratives, and identification of risk factors for coronary artery disease (CAD) in those narratives. The dataset has been instrumental in advancing NLP for healthcare applications, particularly for recognizing important medical concepts and understanding the narrative structure of clinical texts.

It remains a valuable resource for clinical text mining, both for building tools that handle sensitive healthcare data and for extracting meaningful medical information from unstructured clinical narratives.

Dataset Categories

The dataset consists of clinical discharge summaries and medical progress notes drawn from patient electronic health records (EHRs). The documents contain detailed information about patient diagnoses, treatments, and medical histories. The dataset is annotated with:

  1. De-identification tags: Identifying information such as patient names, addresses, phone numbers, and dates is marked, allowing participants to train models that anonymize sensitive data.
  2. Risk factor annotations: Specific annotations highlight the presence of various CAD-related risk factors, including lifestyle factors (e.g., smoking), clinical findings (e.g., hyperlipidemia), and family history of heart disease.
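
To make the de-identification tags concrete, the sketch below shows one way character-offset annotations could be applied to mask identifying text. It is an illustration only: the `PhiSpan` class, the `redact` function, the label names, and the sample note are hypothetical and do not reproduce the dataset's actual file format or tag set.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PhiSpan:
    """Hypothetical stand-off annotation for one piece of identifying text."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # illustrative category, e.g. "NAME", "DATE", "PHONE"


def redact(text: str, spans: list[PhiSpan]) -> str:
    """Replace each annotated span with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(text[cursor:span.start])  # keep the text before the span
        pieces.append(f"[{span.label}]")        # mask the span itself
        cursor = span.end
    pieces.append(text[cursor:])                # keep the remaining text
    return "".join(pieces)


# Invented example note; not taken from the dataset.
note = "Mr. Smith was seen on 03/14/2090. Call 555-0100."
spans = [PhiSpan(4, 9, "NAME"), PhiSpan(22, 32, "DATE"), PhiSpan(39, 47, "PHONE")]
print(redact(note, spans))
```

A trained de-identification model would predict spans like these; the masking step itself is then straightforward string manipulation.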

Limitations

  1. Negation detection: Identifying when a risk factor is negated in the text is a significant challenge (e.g., "no history of diabetes" versus "has a history of diabetes").
  2. Temporal reasoning: Some conditions may only be relevant at specific points in time, requiring temporal reasoning capabilities to extract relevant risk factors accurately.
  3. Ambiguity and complexity of clinical language: Clinical narratives are often unstructured and include abbreviations, jargon, and complex syntactic structures. Parsing this information accurately is non-trivial.
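
The negation problem in item 1 can be illustrated with a small rule-based check in the spirit of NegEx-style cue matching. This is a minimal sketch under stated assumptions: the cue list, the `is_negated` function, and the five-token window are illustrative choices, not part of the dataset or the shared task tooling, and a rule this simple will miss many real cases.

```python
import re

# Illustrative cue list; real negation lexicons are much larger.
NEGATION_CUES = ("no", "not", "denies", "without", "negative for")


def is_negated(sentence: str, concept: str, window: int = 5) -> bool:
    """Return True if a negation cue occurs within `window` tokens before `concept`."""
    lowered = sentence.lower()
    position = lowered.find(concept.lower())
    if position == -1:
        raise ValueError(f"concept {concept!r} not found in sentence")
    # Tokens that precede the concept, limited to the last `window` of them.
    preceding = re.findall(r"[a-z']+", lowered[:position])[-window:]
    context = " ".join(preceding)
    return any(
        re.search(rf"\b{re.escape(cue)}\b", context) is not None
        for cue in NEGATION_CUES
    )


print(is_negated("No history of diabetes.", "diabetes"))
print(is_negated("Has a history of diabetes.", "diabetes"))
```

The first call reports the mention as negated and the second does not. Temporal qualifiers and ambiguous abbreviations (items 2 and 3) are not handled here and generally need more than keyword rules.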

Related publications