ctakes smoking status - apache/ctakes GitHub Wiki

This is version 1.2 of the cTAKES smoking status annotator. The pipeline has been tested on flat text files and CDA documents. The provided sample uses a bar-delimited (|) file that holds multiple records (and multiple patients) per file. The dictionary lookup annotator is limited to the smoking status dictionaries provided in the '/resources/ss/data/*.dictionary' files.
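As an illustration of the bar-delimited input, here is a minimal sketch of splitting such a file into records. The field layout (`<recordId>|<note text>`) and the names `BarDelimitedRecords` and `Record` are assumptions made for this example; the layout of the shipped sample file is not described on this page and may differ. Note that `|` is a regex metacharacter in Java, so it has to be escaped before splitting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class BarDelimitedRecords {
    /** Hypothetical record layout: "<recordId>|<note text>". */
    static final class Record {
        final String id;
        final String text;

        Record(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    // "|" means alternation in a regex, so it must be escaped.
    private static final Pattern BAR = Pattern.compile("\\|");

    /** Splits each non-blank line at the first bar only, so bars inside the note text survive. */
    static List<Record> parse(List<String> lines) {
        List<Record> records = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            String[] parts = BAR.split(line, 2);
            if (parts.length < 2) {
                continue; // no delimiter found: not a record under the assumed layout
            }
            records.add(new Record(parts[0].trim(), parts[1].trim()));
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "doc1|Patient smokes one pack per day.",
                "doc2|Denies tobacco use.");
        for (Record r : parse(lines)) {
            System.out.println(r.id + " -> " + r.text);
        }
    }
}
```

Limiting the split to two parts is a design choice for free-text notes: any further bars are kept as part of the note rather than treated as extra fields.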

The smoking status pipeline classifies patient records into five predetermined categories: past smoker (P), current smoker (C), smoker (S), non-smoker (N), and unknown (U). The definition of smoking status was adapted from the I2B2 Natural Language Processing Challenges for Clinical Records.

  • PAST SMOKER (P): A patient whose record asserts either that they are a past smoker or that they were a smoker a year or more ago but have not smoked for at least one year.
  • CURRENT SMOKER (C): A patient whose record asserts that they are a current smoker (or that they smoked without indicating that they stopped more than a year ago) or that they were a smoker within the past year.
  • SMOKER (S): A patient who is either a CURRENT or a PAST smoker but whose medical record does not provide enough information to classify the patient as either a CURRENT or a PAST smoker.
  • NON-SMOKER (N): A patient whose record indicates that they have never smoked.
  • UNKNOWN (U): The patient's record does not mention anything about smoking.
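
The five categories and their one-letter codes can be captured in a small lookup type. The sketch below is only an illustration: the enum `SmokingStatus` and its `fromCode` helper are hypothetical names introduced here and are not taken from the cTAKES source.

```java
import java.util.Arrays;
import java.util.Optional;

/** The five categories listed above, keyed by their one-letter codes (hypothetical type). */
enum SmokingStatus {
    PAST_SMOKER("P"),
    CURRENT_SMOKER("C"),
    SMOKER("S"),
    NON_SMOKER("N"),
    UNKNOWN("U");

    private final String code;

    SmokingStatus(String code) {
        this.code = code;
    }

    String code() {
        return code;
    }

    /** Looks up a category by code, ignoring case and surrounding whitespace. */
    static Optional<SmokingStatus> fromCode(String code) {
        if (code == null) {
            return Optional.empty();
        }
        String normalized = code.trim();
        return Arrays.stream(values())
                .filter(s -> s.code.equalsIgnoreCase(normalized))
                .findFirst();
    }
}

class SmokingStatusDemo {
    public static void main(String[] args) {
        // Print each code next to its category name.
        for (SmokingStatus s : SmokingStatus.values()) {
            System.out.println(s.code() + " = " + s);
        }
    }
}
```

Returning an `Optional` keeps an unrecognized code distinct from the UNKNOWN (U) category, which has a specific meaning: the record does not mention smoking.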

Annotation Engines

Uses a support vector machine (SVM) for smoking status classification.

Source class: PcsClassifier
Source package: org.apache.ctakes.smokingstatus.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase

| Parameter | Description | Class | Required | Default |
|---|---|---|---|---|
| KeyWordsPath | Path to file containing key words. | String | Yes | |
| ModelPath | Path to file containing the model. | String | Yes | |
| StopWordsPath | Path to file containing stop words. | String | Yes | |
| CaseSensitive | yes/no for case sensitivity. | String | No | yes |
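
The parameter table can be read as a small configuration contract: three required path parameters and one optional flag that defaults to yes. The following sketch checks a plain parameter map against that contract. The class `PcsClassifierParams`, the map-based representation, and the placeholder paths are assumptions for illustration only; the annotator itself receives its parameters through UIMA configuration, which is not shown here.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class PcsClassifierParams {
    // Parameters marked "Required: Yes" in the table.
    static final String[] REQUIRED = {"KeyWordsPath", "ModelPath", "StopWordsPath"};

    /** Returns the supplied parameters with the documented default filled in. */
    static Map<String, String> resolve(Map<String, String> supplied) {
        Map<String, String> resolved = new LinkedHashMap<>();
        for (String name : REQUIRED) {
            String value = supplied.get(name);
            if (value == null || value.trim().isEmpty()) {
                throw new IllegalArgumentException("Missing required parameter: " + name);
            }
            resolved.put(name, value);
        }
        // CaseSensitive is optional; the table gives "yes" as its default.
        resolved.put("CaseSensitive", supplied.getOrDefault("CaseSensitive", "yes"));
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> supplied = new HashMap<>();
        // Placeholder paths, not real cTAKES resource locations.
        supplied.put("KeyWordsPath", "path/to/keywords.txt");
        supplied.put("ModelPath", "path/to/model.file");
        supplied.put("StopWordsPath", "path/to/stopwords.txt");
        System.out.println(resolve(supplied));
    }
}
```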