Find regular expression named group matches - lvphj/epydemiology GitHub Wiki

phjFindRegexNamedGroups()

import numpy as np
import pandas as pd
import re
import epydemiology as epy

df = epy.phjFindRegexNamedGroups(phjDF,
                                 phjDescriptorVarName,
                                 phjNamedGroupRegexStr,
                                 phjSeparateRegexGroups = False,
                                 phjNumberMatchesVarName = 'numberMatches',
                                 phjMatchedGroupVarName = 'matchedgroup',
                                 phjUnclassifiedStr = 'unclassified',
                                 phjMultipleMatchStr = 'multiple',
                                 phjCleanup = False,
                                 phjPrintResults = False)

Description

This function takes a column of text and uses a regex with named groups to determine which group each text entry matches.
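
As a rough illustration of the idea (not the library's own code), the sketch below classifies a few strings with a named-group regex using only the standard re module. The regex, the classify() helper and the sample strings are all hypothetical; the 'unclassified' label simply reuses the default value of phjUnclassifiedStr.

```python
import re

# Hypothetical named-group regex: each named group describes one category.
# Inner groups are non-capturing so that only the named groups are reported.
regex = re.compile(r'(?P<dog>\b(?:dog|puppy)\b)|(?P<cat>\b(?:cat|kitten)\b)',
                   flags=re.I)


def classify(text, unclassified='unclassified'):
    """Return the name of the named group that matched, or a fallback label."""
    m = regex.search(text)
    # Match.lastgroup holds the name of the named group that took part in the match.
    return m.lastgroup if m else unclassified


labels = [classify(t) for t in ['Black puppy', 'Tabby kitten', 'Hamster']]
print(labels)  # ['dog', 'cat', 'unclassified']
```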

Function parameters

  1. phjDF

  2. phjDescriptorVarName

  3. phjNamedGroupRegexStr

  4. phjSeparateRegexGroups (default = False)

    When set to False (the default), the regex (potentially containing several named groups) is run as a single regex. A string that could match regexes in several groups is then reported against only one named group, the first that matches. When the named-group regex is generated using the phjCreateNamedGroupRegex() function, the order of the named groups can be changed by identifying a column that contains a group order identifier (phjIDVarName). Ordering the named groups in this way ensures that commonly matched regexes are tried first, which is usually sensible. However, when phjSeparateRegexGroups is set to True, each named group in the regex is run separately and all potential matches are identified.

  5. phjNumberMatchesVarName (default = 'numberMatches')

  6. phjMatchedGroupVarName (default = 'matchedgroup')

  7. phjUnclassifiedStr (default = 'unclassified')

  8. phjMultipleMatchStr (default = 'multiple')

  9. phjCleanup (default = False)

  10. phjPrintResults (default = False)
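
The two behaviours described for phjSeparateRegexGroups can be sketched with the standard re module alone. This is an illustration under stated assumptions, not the library's implementation: the group_patterns dictionary and the match_single() and match_separate() helpers are hypothetical, and the 'unclassified' and 'multiple' labels simply reuse the default strings shown in the function signature.

```python
import re

# Hypothetical group patterns, listed in the order they appear in the regex.
group_patterns = {
    'dog': r'\b(?:dog|puppy)\b',
    'cat': r'\b(?:cat|kitten)\b',
}

# Single combined regex: one alternation of named groups.
combined = re.compile(
    '|'.join('(?P<{}>{})'.format(name, pat) for name, pat in group_patterns.items()),
    flags=re.I,
)


def match_single(text):
    """Run the combined regex once; only one named group is ever reported."""
    m = combined.search(text)
    return m.lastgroup if m else 'unclassified'


def match_separate(text):
    """Run each group's pattern on its own and count how many groups match."""
    hits = [name for name, pat in group_patterns.items()
            if re.search(pat, text, flags=re.I)]
    if not hits:
        return 0, 'unclassified'
    if len(hits) > 1:
        return len(hits), 'multiple'
    return 1, hits[0]


text = 'dog chasing a cat'
print(match_single(text))    # dog
print(match_separate(text))  # (2, 'multiple')
```

Running the groups separately costs one regex search per group, but it exposes ambiguous descriptions that the single combined regex would silently assign to one group.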

Exceptions raised

  1. AssertionError

    An AssertionError is raised if function arguments are entered incorrectly.

  2. re.error

    A re.error is raised if the regex fails to compile.
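
A regex can be checked before it is passed to the function by compiling it with the standard re module. The sketch below is only a suggestion; the try_compile() helper and the example patterns are hypothetical.

```python
import re


def try_compile(pattern):
    """Return the compiled regex, or None if the pattern is invalid."""
    try:
        return re.compile(pattern)
    except re.error as err:
        # re.error carries a message describing what went wrong and where.
        print('Invalid regex: {}'.format(err))
        return None


bad = try_compile(r'(?P<dog>dog|puppy')    # missing closing parenthesis
good = try_compile(r'(?P<dog>dog|puppy)')  # valid named group

if good is not None:
    # groupindex maps each named group to its group number.
    print(dict(good.groupindex))  # {'dog': 1}
```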

Returns

Other notes

Example