2. Run scripts

The scripts described below are located in /src/features. Instructions for running each script, along with its file inputs and outputs, are described in the script itself.

It is recommended that you examine what each operation is doing, because you will need to build match files customized to your site and subject areas. Scripts 01, 02, and 05 can be run automatically from the command line, for pickup by your reporting tool(s), once you have built and tuned your match files over multiple months. But customizing the match files to your circumstances can take a good bit of manual work.

The pilot project used the Spyder editor, so the scripts are divided into cells with #%% markers. During the pilot, most scripts were run one cell at a time. Because new matches are sometimes added automatically to the matching files, be careful about running an entire script at once: doing so can have unintended effects on your matching files and future dataset joins, and can inflate search counts.
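As a minimal illustration (the file names and variables here are hypothetical, not the repo's), a script laid out for Spyder looks like the sketch below; each #%% cell can be run on its own, so you can inspect intermediate results before anything is written to a matching file:

```python
#%% Load this month's search-query log (hypothetical file name)
import pandas as pd

queries = pd.read_excel("search_queries_this_month.xlsx")

#%% Inspect the data before doing anything that writes to disk
print(queries.head())
print(len(queries), "rows")

#%% Run this cell only when you are satisfied with the matches,
# because it would append rows to the persistent matching file
# new_matches.to_excel("PastMatches.xlsx", index=False)
```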

00_StartNewProject.py

Fuzzy matching and a spreadsheet program are used in the first run to build a list of terms SPECIFIC TO YOUR SITE, which would be handled poorly by the processes in later scripts, including:

  1. Your "brands" - Programs, product and service names, in the forms that people search for them
  2. Person names, whether staff, authors, historical figures, fictional characters, etc.
  3. Anything else you don't want the later scripts to tag. For example, if your organization is using a generic word as an acronym, you probably want these occurrences tagged as your product, rather than the generic term.

The next script will have similar operations, so during every run after the first you can continue to build this information. This is required to collect the type of "training data" that machine learning will use to make term classification more automated in the future. This is the best way (so far) to classify the many ways your customers search for your products and product-related terminology.

The script puts your highest-frequency terms into buckets/clusters, so you can use Excel to add two levels of aggregation that match the UMLS framework: Preferred Term (the preferred version of the term) and Semantic Type (a mid-level description).
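A minimal sketch of this kind of fuzzy bucketing, assuming the rapidfuzz package (the repo's own script may use a different library, thresholds, and grouping logic); it groups high-frequency queries whose strings are close matches, so the buckets can be exported to a spreadsheet and assigned a Preferred Term and Semantic Type by hand:

```python
# Sketch: cluster high-frequency queries by fuzzy string similarity.
# Assumes the rapidfuzz package; thresholds and queries are illustrative.
from rapidfuzz import fuzz

queries = ["pubmed", "pub med", "pubmd", "genbank", "gen bank", "blast"]

threshold = 85          # similarity score (0-100) required to join a bucket
buckets = []            # each bucket is a list of similar query strings

for q in queries:
    for bucket in buckets:
        # Compare against the first (seed) term of each existing bucket
        if fuzz.ratio(q, bucket[0]) >= threshold:
            bucket.append(q)
            break
    else:
        buckets.append([q])  # no close match found; start a new bucket

for b in buckets:
    print(b)
# Expected grouping: the pubmed variants, the genbank variants, blast alone
```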

You can re-run the fuzzy-match process until you believe you have categorized what you need. Over time this will lighten the manual work in later steps with more accurate assignments and machine learning.

01_CleanBuildMatch.py

Import search queries from Google Analytics, clean them up, match query entries against historical files, do some custom matching, and iterate as appropriate. Instructions inside the script walk you through actions that should be conducted in a specific sequence for the best matching results; this includes early work to pull out your "brands," staff names, names of organizational units, etc., because matching site-specific terms against generic vocabularies can produce improper or unfortunate tags.
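A minimal sketch of the clean-and-match idea, assuming a pandas workflow and hypothetical file and column names (the script's actual cleaning rules and match-file layout will differ):

```python
# Sketch of the clean-then-match step; file and column names are assumptions.
import pandas as pd

# Queries exported from Google Analytics, one row per search term
log = pd.read_excel("ga_search_queries.xlsx")            # hypothetical export
log["AdjustedQueryTerm"] = (
    log["Search Term"]
    .str.lower()
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)                # collapse whitespace
)

# Previously categorized terms; acts as the historical match file
past = pd.read_excel("PastMatches.xlsx")                 # hypothetical layout

# Left join: rows that pick up a PreferredTerm are already handled;
# everything else goes on to the custom and UMLS matching steps
merged = log.merge(past, how="left", on="AdjustedQueryTerm")
unmatched = merged[merged["PreferredTerm"].isna()]
print(f"{len(unmatched)} of {len(merged)} queries still need matching")
```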

02_ForUmlsLicenseHolders.py

You can skip this step if you don't have a (free) UMLS license and don't want to get one now. This step can be integrated later if you find you aren't satisfied with the percentage of search volume you are able to tag. The pilot project aimed to tag 80 percent of total search volume, but reached only ~75 percent, even with this step included.

This script runs unmatched search queries against the UMLS Metathesaurus REST API, which has many vocabularies and a lexical toolset specific to biomedicine. The script uses normalized string matching, which is conservative enough that almost all of the matches returned can be assumed to be correct. Some clean-up will still be needed later in your PastMatches file.
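A minimal sketch of a normalized-string lookup, assuming the UTS REST API's current API-key authentication and search endpoint (verify the endpoint and parameters against NLM's current documentation before relying on this):

```python
# Sketch of a normalized-string lookup against the UMLS search API.
# Endpoint, parameters, and response handling are assumptions to verify
# against the UTS REST API documentation.
import requests

UTS_SEARCH = "https://uts-ws.nlm.nih.gov/rest/search/current"
API_KEY = "your-umls-api-key"        # from your (free) UMLS/UTS profile

def lookup_normalized(term):
    """Return the first UMLS match for a query term, or None."""
    params = {
        "string": term,
        "searchType": "normalizedString",   # conservative matching mode
        "apiKey": API_KEY,
    }
    resp = requests.get(UTS_SEARCH, params=params, timeout=30)
    resp.raise_for_status()
    results = resp.json()["result"]["results"]
    if not results or results[0]["ui"] == "NONE":
        return None
    return {"cui": results[0]["ui"], "preferred_name": results[0]["name"]}

print(lookup_normalized("myocardial infarction"))
```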

(Skip over 03 and 04 in the MVP; see bottom of this page for more.)

05_TagAndFinalize.py

The current advice is to skip from either script 01 or 02 to this script, 05, as the basic package / Minimum Viable Product / proof of concept. This script adds columns that provide the top-down view in reporting, by cleaning up the Semantic Type assignments and adding the Semantic Group assignments. If you have Custom Topics you can add them here; the examples here are opioids, vaping, and Coronavirus. You can do this for anything you can build a synonym file for.
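A minimal sketch of the two ideas in this step, using hypothetical column names, a partial Semantic Type to Semantic Group lookup, and invented synonym lists (the script's actual mappings and synonym files are much larger):

```python
# Sketch: roll Semantic Types up to Semantic Groups and tag Custom Topics.
# The mapping and synonym entries below are illustrative, not the full sets.
import pandas as pd

tagged = pd.DataFrame({
    "AdjustedQueryTerm": ["fentanyl", "juul pods", "covid vaccine", "liver"],
    "SemanticType": ["Pharmacologic Substance", "Manufactured Object",
                     "Immunologic Factor", "Body Part, Organ, or Organ Component"],
})

# Partial Semantic Type -> Semantic Group lookup (UMLS publishes the full map)
semtype_to_group = {
    "Pharmacologic Substance": "Chemicals & Drugs",
    "Immunologic Factor": "Chemicals & Drugs",
    "Body Part, Organ, or Organ Component": "Anatomy",
    "Manufactured Object": "Objects",
}
tagged["SemanticGroup"] = tagged["SemanticType"].map(semtype_to_group)

# Custom Topics: tag any query containing a synonym; synonyms are invented here
custom_topics = {
    "Opioids": ["fentanyl", "oxycodone", "opioid"],
    "Vaping": ["vape", "vaping", "juul", "e-cigarette"],
    "Coronavirus": ["covid", "coronavirus", "sars-cov-2"],
}

def tag_custom_topic(query):
    for topic, synonyms in custom_topics.items():
        if any(s in query for s in synonyms):
            return topic
    return None

tagged["CustomTopic"] = tagged["AdjustedQueryTerm"].apply(tag_custom_topic)
print(tagged)
```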

An Excel-format file, taggedLog, containing the log and summary information, can be viewed in a spreadsheet program or imported into a Tableau workbook. A second file, BiggestMovers, can be used to report on search trends over time.

06_Integrate.py

This script joins the new data to your old data (if any). For TaggedLogAllMonths.xlsx, the script updates total search counts on existing query rows and adds new rows for new queries; this file feeds the Tableau discovery UI. BiggestMovers.xlsx adds rows the same way, but a new column is appended for every month of data, allowing time-series analysis across months.
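A minimal sketch of the join logic for the cumulative log, with hypothetical file and column names; the script's actual behavior (for example, whether counts are replaced or accumulated, and which columns form the grouping key) may differ:

```python
# Sketch: merge a new month of tagged queries into the cumulative workbook.
# File names, column names, and the summing behavior are assumptions.
import pandas as pd

all_months = pd.read_excel("TaggedLogAllMonths.xlsx")    # existing data, if any
new_month = pd.read_excel("taggedLog.xlsx")              # output of script 05

# Existing queries have their counts combined; new queries become new rows
combined = (
    pd.concat([all_months, new_month], ignore_index=True)
      .groupby(["AdjustedQueryTerm", "PreferredTerm", "SemanticGroup"],
               as_index=False)["TotalSearchFreq"].sum()
)
combined.to_excel("TaggedLogAllMonths.xlsx", index=False)

# BiggestMovers instead keeps one count column per month (e.g. "2020-03"),
# so trends can be charted across months.
```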

You will need to edit this file before running.

Scripts 03 and 04

Skip over 03 and 04 if you want a Minimum Viable Product (MVP) / proof of concept example project, because scripts 03 and 04 are still under development. Description:

  • 03 is meant to generate tagging suggestions, using the CSpell consumer health spelling tool to suggest corrections for terms that may be misspelled, and the MetaMap Lite API, which can provide a verbose list of possible tagging matches. It turned out in the pilot project that MOST of the unmatched terms are NOT misspelled, but are fragments from the organization's programs, products, services, or people. CSpell and MetaMap Lite can be configured for more accurate matching, but configuring these tools will NOT result in capturing all searches.
  • 04 uses a local Python-Django browser interface to support fast manual tag selection for ~10 "top tags," with the help of a local NoSQL database.

In this codeathon repo is a Linux-based Flask UI; the 2018 hackathon repo (referenced at the bottom of this repo's README) had a Python-Django tagging UI. A new version, encompassing CSpell and MetaMap Lite, should be created to operate on the combined-months file that results from 06, and a machine learning component should be added. That way a person could upload the two new log files into a browser interface, scripts 01, 02, 05, and 06 could be run automatically from the browser, and THEN a helper UI could assist a person in tagging what has been left untagged in the multi-month file.