ctakes smoking status - apache/ctakes GitHub Wiki
This is version 1.2 of the cTAKES smoking status annotator. The pipeline has been tested on flat text files and CDA documents. The sample provided uses a bar (|) delimited file with multiple records (and patients) per file. The dictionary lookup annotator is limited to smoking status dictionaries provided in the '/resources/ss/data/*.dictionary' files.
The smoking status pipeline processes patient records into five pre-determined categories - past smoker (P), current smoker (C), smoker (S), non-smoker (N), and unknown (U). The definition of smoking status was adapted from I2B2 Natural Language Processing Challenges for Clinical Records.
- PAST SMOKER (P): A patient whose record asserts either that they are a past smoker or that they were a smoker a year or more ago but who have not smoked for at least one year.
- CURRENT SMOKER (C): A patient whose record asserts that they are a current smoker (or that they smoked without indicating that they stopped more than a year ago) or that they were a smoker within the past year.
- SMOKER (S): A patient who is either a CURRENT or a PAST smoker but, whose medical record does not provide enough information to classify the patient as either a CURRENT or a PAST smoker.
- NON-SMOKER (N): A patient whose record indicates that they have never smoked.
- UNKNOWN (U): The patient's record does not mention anything about smoking.
Uses SVM for smoking status classification.
Source class: PcsClassifier
Source package: org.apache.ctakes.smokingstatus.ae
Parent class: org.apache.uima.fit.component.JCasAnnotator_ImplBase
Parameter | Description | Class | Required | Default |
---|---|---|---|---|
KeyWordsPath | Path to file containing key words. | String | Yes | |
ModelPath | Path to file containing the model. | String | Yes | |
StopWordsPath | Path to file containing stop words. | String | Yes | |
CaseSensitive | yes/no for case sensitivity. | String | No | yes |