10. Natural language processing / 01. Extraction and classification - sporedata/researchdesigneR GitHub Wiki

1. Use cases: in which situations should I use this method?

  1. Natural language processing (NLP) is used to extract, classify, and automate information expressed in a natural language, oftentimes transforming the free text into columns in a traditional dataset. Examples of medical free text include radiology and pathology reports, admission and discharge summaries, surgical reports, and reports containing laboratory test results.

  2. NLP is often applied to data scraped from public sources, such as public webpages and Twitter.

  3. Generation of summaries compliant with international reporting guidelines - see "Development and Validation of a Natural Language Processing Tool to Generate the CONSORT Reporting Checklist for Randomized Clinical Trials".

  4. Phenotyping of clinical trial eligibility criteria - see "Towards phenotyping of clinical trial eligibility criteria".

  5. NLP with spaCy is useful for extraction and classification[3], as the quantitative component of some mixed methods studies, for tasks such as sentiment analysis[2], and perhaps even for summarization. spaCy has the concept of patterns, which can combine several approaches at once: rules (including NegEx), NER (pre-trained models that identify people and a number of other common entity types), POS tagging (which identifies the syntactic function of words, such as verbs and nouns, and is especially useful for homonyms like "go", which can be a verb, a board game, or a programming language), and machine learning (including BERT), among other approaches.
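
As a minimal sketch of the rule-based layer mentioned above, the following uses only Python's standard library (not spaCy) to match concepts with a regular expression, apply a simplified NegEx-style negation check, and return one row per report. The concept list, the negation triggers, the 40-character window, and the name `extract_row` are illustrative assumptions; this is not the published NegEx algorithm or its trigger set.

```python
import re

# Hypothetical concept and trigger lists, chosen only for illustration.
CONCEPTS = re.compile(r"\b(pneumothorax|pleural effusion)\b", re.IGNORECASE)
NEGATION_TRIGGERS = re.compile(
    r"\b(?:no evidence of|negative for|without|denies|no)\b", re.IGNORECASE
)
WINDOW = 40  # assumed number of characters searched before each match


def extract_row(report_id, text):
    """Turn one free-text report into a row with one column per concept."""
    row = {
        "report_id": report_id,
        "pneumothorax": "not_mentioned",
        "pleural_effusion": "not_mentioned",
    }
    for match in CONCEPTS.finditer(text):
        column = match.group(1).lower().replace(" ", "_")
        # Keep only the text before the match, cut at the last sentence or clause break.
        preceding = text[max(0, match.start() - WINDOW):match.start()]
        preceding = re.split(r"[.;]", preceding)[-1]
        negated = NEGATION_TRIGGERS.search(preceding) is not None
        # The first mention of each concept decides the column value.
        if row[column] == "not_mentioned":
            row[column] = "negated" if negated else "affirmed"
    return row


reports = {
    1: "No evidence of pneumothorax. Small left pleural effusion is present.",
    2: "Patient denies chest pain; pleural effusion unchanged.",
}
for report_id, text in reports.items():
    print(extract_row(report_id, text))
```

Each returned dictionary corresponds to one row of the target dataset, with the concept columns taking the values `affirmed`, `negated`, or `not_mentioned`. A fixed character window is a deliberate simplification; a token-based scope, as in NegEx implementations, would usually be more robust.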

2. Input: what kind of data does the method require?

  • Free text containing medical information

3. Algorithm: how does the method work?

Model mechanics

  • Extraction and classification using NLP transform free text from clinical narratives into spreadsheet-like data (patients in rows and variables in columns). Developing these tools requires direct interaction with clinical experts to convert their knowledge, which is often tacit, into a set of explicit pattern-matching rules and machine learning algorithms.

  • One commonly used method for extraction and classification is the noisy silver standard methodology. It starts with regular expressions (RegEx) and negation detection (NegEx) to search for the concepts of interest. The concepts matched by these rules are then treated as a silver standard and explored further with machine learning approaches, often Recurrent Neural Networks (RNN). The RegEx/NegEx and RNN stages are repeated in cycles, and the overall performance of the extraction and classification improves with each cycle, ultimately reaching predictive performance, including precision, equivalent to that obtained through traditional methods (a formal corpus followed by machine learning only).

  • When the number of concepts in the free text database is not large enough to train an RNN, only the initial stage (RegEx and NegEx) is used.

  • In NLP, text preprocessing is the first step in building a model. It cleans the text data and prepares it to be fed into the model. Text preprocessing involves several steps and techniques, all of them aimed at cleaning the text and removing noise, such as:

    1. Expand contractions
    2. Lowercase
    3. Remove punctuation
    4. Remove words containing digits
    5. Remove stopwords
    6. Rephrase text
    7. Stemming and lemmatization
    8. Remove extra whitespace
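
Most of the preprocessing steps listed above can be sketched with Python's standard library alone. The contraction table, the stopword set, and the name `preprocess` are small illustrative assumptions, not a recommended vocabulary. Rephrasing and stemming/lemmatization are left out because they typically rely on a dedicated NLP library. Lowercasing is applied before contraction expansion here only so that the lookup table can stay lowercase.

```python
import string

# Tiny illustrative lookup tables; real projects would use fuller lists.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "doesn't": "does not",
    "it's": "it is",
}
# Negation words such as "no" and "not" are deliberately kept out of the
# stopword set, because negation rules depend on them.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it", "was"}


def preprocess(text):
    """Apply a simplified version of the cleaning steps to one string."""
    # Lowercase first so the contraction table matches regardless of case.
    text = text.lower()
    # Expand contractions using the small lookup table.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace (this also discards repeated spaces),
    # then drop tokens containing digits and stopwords.
    tokens = [t for t in text.split() if not any(c.isdigit() for c in t)]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)


print(preprocess("The patient doesn't  report pain.  BP was 120/80 on day3, and it's stable!"))
```

Note that aggressive cleaning can discard clinically meaningful content (the blood pressure value in the example is removed along with other digit-containing tokens), so which steps to apply depends on the variables being extracted.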

Reporting guidelines

Data science packages

Suggested companion methods

Learning materials

  1. Books

  2. Articles

5. SporeData-specific

Templates

References

[1] Spasic I, Nenadic G. Clinical Text Data in Machine Learning: Systematic Review. JMIR Medical Informatics. 2020;8(3):e17984.

[2] Ayre K, Bittar A, Kam J, Verma S, Howard LM, Dutta R. Developing a Natural Language Processing tool to identify perinatal self-harm in electronic healthcare records.

[3] Nobel JM, Puts S, Bakers FC, Robben SGF, Dekker ALAJ. Natural Language Processing in Dutch Free Text Radiology Reports: Challenges in a Small Language Area Staging Pulmonary Oncology.

[4] Panahi M, Talby D. Comparing the Functionality of Open Source NLP Libraries.

[5] Kaya MA. SpaCy or Spark NLP - A Benchmarking Comparison.

[6] Mayer B, Arnold J, Begoli E, Rush E, Drewry M, Brown K, Ponce E, Srinivas S. Evaluating Text Analytic Frameworks for Mental Health Surveillance. In: 2018 IEEE 34th International Conference on Data Engineering Workshops (ICDEW) 2018 Apr 16 (pp. 39-47). IEEE.

[7] Ellafi SA. Comparing production-grade NLP libraries: Training Spark-NLP and spaCy pipelines.
