10. Natural language processing / 01. Extraction and classification - sporedata/researchdesigneR GitHub Wiki
Natural language processing (NLP) is used to extract, classify, and automate information expressed in natural language, often transforming free text into columns in a traditional dataset. Examples of medical free text include radiology and pathology reports, admission and discharge summaries, surgical reports, and reports containing laboratory test results.
NLP pipelines often process data scraped from public sources such as webpages and Twitter.
NLP can also generate summaries compliant with international reporting guidelines; see, for example, Development and Validation of a Natural Language Processing Tool to Generate the CONSORT Reporting Checklist for Randomized Clinical Trials.
NLP with spaCy is useful for extraction and classification [3], for the quantitative component of some mixed methods studies, for tasks such as sentiment analysis [2], and even for summarization. spaCy has the concept of patterns, which can simultaneously combine multiple different approaches: rules (including NegEx), named entity recognition (NER, pre-trained models that identify people and a number of other common entities), part-of-speech (POS) tagging (which identifies the syntactic function of words - verbs, nouns, etc. - and is incredibly useful with homonyms like "go", which can be a verb, a Japanese board game, or a programming language), and machine learning (including BERT), among other approaches.
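As a minimal sketch of spaCy's token-pattern matching, the snippet below uses a blank English pipeline (tokenizer only, so no pre-trained model download is needed) and a toy NegEx-style rule; the pattern and the example sentence are illustrative assumptions, not rules from this wiki.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, no trained NER/POS components required.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Toy rule: a negation cue ("no"/"denies") followed, within one optional
# token, by "fever" - a stand-in for NegEx-style negation detection.
matcher.add("NEG_FEVER", [
    [{"LOWER": {"IN": ["no", "denies"]}}, {"OP": "?"}, {"LOWER": "fever"}],
])

doc = nlp("Patient denies any fever but reports chest pain.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # prints: denies any fever
```

With a trained pipeline (e.g. en_core_web_sm), the same pattern syntax can reference POS tags and entity labels instead of raw token text.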
- Free text containing medical information
Extraction and classification using NLP transform free text from clinical narratives into spreadsheet-like data (patients in rows, variables in columns). Developing these tools requires direct interaction with clinical experts to convert their knowledge, which is often tacit, into a set of explicit pattern-matching rules and machine learning algorithms.
One commonly used method for extraction and classification is the noisy silver standard methodology. It starts with regular expressions (RegEx) and negation detection (NegEx) to search for the concepts of interest. The concepts matched by these rules are then treated as a silver standard and further explored with machine learning approaches, often Recurrent Neural Networks (RNNs). The RegEx/NegEx and RNN cycle is repeated over time, and the overall performance of the extraction and classification improves with each cycle, ultimately reaching predictive performance equivalent to the levels obtained through traditional methods (a formally annotated corpus followed by machine learning alone).
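The rule-based first stage can be sketched as below; the regular expressions, the negation window, and the report snippets are illustrative assumptions, not rules from any validated clinical pipeline.

```python
import re

# Concept rule: mention of "pneumothorax" anywhere in the report.
CONCEPT = re.compile(r"\bpneumothorax\b", re.IGNORECASE)

# NegEx-style rule: a negation cue within ~40 characters before the concept,
# inside the same sentence (no period in between).
NEGATION = re.compile(
    r"\b(no|without|denies|negative for)\b[^.]{0,40}\bpneumothorax\b",
    re.IGNORECASE,
)

def silver_label(report: str) -> int:
    """1 = concept asserted, 0 = concept absent or negated."""
    if NEGATION.search(report):
        return 0
    return 1 if CONCEPT.search(report) else 0

reports = [
    "Small apical pneumothorax on the right.",
    "No evidence of pneumothorax or effusion.",
    "Lungs are clear.",
]
labels = [silver_label(r) for r in reports]
print(labels)  # prints: [1, 0, 0]
```

These weak (noisy) labels would then serve as training targets for the RNN stage, and disagreements between the rules and the model drive the next refinement cycle.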
In situations where the number of concepts in the free-text database is not large enough for an RNN, only the initial stage (RegEx and NegEx) is used.
In NLP, text preprocessing is the first step in building a model: it cleans the text data and prepares it to be fed into the model. There are several preprocessing steps/techniques (explained in detail here), all of them focused on cleaning the text and making it noise-free, such as:
- Expand contractions
- Lowercase
- Remove punctuation
- Remove digits and words containing digits
- Remove stopwords
- Rephrase text
- Stemming and lemmatization
- Remove white spaces
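Several of the steps above can be sketched with only the standard library; the contraction map and stopword list below are small illustrative samples, not complete linguistic resources, and stemming/lemmatization would require an NLP library such as NLTK or spaCy.

```python
import re
import string

# Illustrative samples only - real pipelines use full resources.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
STOPWORDS = {"the", "a", "an", "of", "and", "is", "it", "was", "in", "on"}

def preprocess(text: str) -> list[str]:
    text = text.lower()                                   # lowercase
    for short, full in CONTRACTIONS.items():              # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"\w*\d\w*", " ", text)                 # drop digits / words with digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = text.split()                                 # also collapses white space
    return [t for t in tokens if t not in STOPWORDS]      # remove stopwords

print(preprocess("It's a mass of 3cm in the LEFT   lung."))
# prints: ['mass', 'left', 'lung']
```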
- Common packages for natural language processing
- Parsr is a tool for cleaning, parsing, and extraction that outputs files in JSON, Markdown, CSV/Pandas DF, or txt format.
- iHateRegex has regular expression scripts for common concepts.
- BeautifulSoup (Python) vs. rvest (R), two common choices for web scraping and HTML parsing.
- Interface with Google Cloud Document AI API is an R package for Google Document AI, a powerful server-based OCR processor with support for over 60 languages. The package provides an interface for the Document AI API and comes with additional tools for output file parsing and text reconstruction.
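As a minimal sketch of HTML parsing with BeautifulSoup (rvest plays a similar role in R), the snippet below extracts report text from an invented HTML fragment; the tag names and class attributes are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Invented HTML fragment standing in for a scraped page.
html = """
<html><body>
  <div class="report"><h2>Radiology</h2><p>No acute findings.</p></div>
  <div class="report"><h2>Pathology</h2><p>Benign tissue.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect the paragraph text of every report block.
texts = [div.p.get_text() for div in soup.find_all("div", class_="report")]
print(texts)  # prints: ['No acute findings.', 'Benign tissue.']
```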
Books
Articles
- Common references for natural language processing
- Clinical Text Data in Machine Learning: Systematic Review [1].
- Comparing the Functionality of Open Source NLP Libraries [4]. This post shortlists open-source NLP libraries to enable users to choose the right open-source NLP library to build production-grade software.
- SpaCy or Spark NLP — A Benchmarking Comparison [5]. This article runs a realistic NLP scenario to compare Spark NLP and spaCy, two leading NLP libraries.
- Evaluating Text Analytic Frameworks for Mental Health Surveillance [6]. This article assesses scalable storage solutions based on fault tolerance, performance, and scalability, highlighting the current approach to evaluation, the preliminary findings, and the work in progress toward a more robust text analysis pipeline.
- Comparing production-grade NLP libraries: Training Spark-NLP and spaCy pipelines [7]. This article provides a step-by-step guide to initialize the libraries, load the data, and train a tokenizer model using Spark-NLP and spaCy.
[1] Spasic I, Nenadic G. Clinical Text Data in Machine Learning: Systematic Review. JMIR Medical Informatics. 2020;8(3):e17984.
[2] Ayre K, Bittar A, Kam J, Verma S, Howard LM, Dutta R. Developing a Natural Language Processing tool to identify perinatal self-harm in electronic healthcare records.
[3] Nobel JM, Puts S, Bakers FC, Robben SGF, Dekker ALAJ. Natural Language Processing in Dutch Free Text Radiology Reports: Challenges in a Small Language Area Staging Pulmonary Oncology.
[4] Panahi M, Talby D. Comparing the Functionality of Open Source NLP Libraries
[5] Kaya MA. SpaCy or Spark NLP — A Benchmarking Comparison
[6] Mayer B, Arnold J, Begoli E, Rush E, Drewry M, Brown K, Ponce E, Srinivas S. Evaluating Text Analytic Frameworks for Mental Health Surveillance. In: 2018 IEEE 34th International Conference on Data Engineering Workshops (ICDEW); 2018 Apr 16. p. 39-47. IEEE.
[7] Ellafi SA. Comparing production-grade NLP libraries: Training Spark-NLP and spaCy pipelines