Find regular expression named group matches - lvphj/epydemiology GitHub Wiki
phjFindRegexNamedGroups()
import numpy as np
import pandas as pd
import re
import epydemiology as epy
df = epy.phjFindRegexNamedGroups(phjDF,
                                 phjDescriptorVarName,
                                 phjNamedGroupRegexStr,
                                 phjSeparateRegexGroups = False,
                                 phjNumberMatchesVarName = 'numberMatches',
                                 phjMatchedGroupVarName = 'matchedgroup',
                                 phjUnclassifiedStr = 'unclassified',
                                 phjMultipleMatchStr = 'multiple',
                                 phjCleanup = False,
                                 phjPrintResults = False)
Description
This function takes a column of text and uses a regex with named groups to determine the group to which the text best fits.
Function parameters
- phjDF
- phjDescriptorVarName
- phjNamedGroupRegexStr
- phjSeparateRegexGroups (default = False)
  When set to False (the default), the regex (potentially containing several named groups) is run as a single regex. A string that could match the regexes in several groups is ultimately matched only against the first named group. When the named-group regex is generated using the phjCreateNamedGroupRegex() function, the order of the named groups can be changed by identifying a column that contains a group order identifier (phjIDVarName). Ordering the named groups so that commonly matched regexes are tried first probably makes sense. When phjSeparateRegexGroups is set to True, however, each named group in the regex is run separately and all potential matches are identified.
- phjNumberMatchesVarName (default = 'numberMatches')
- phjMatchedGroupVarName (default = 'matchedgroup')
- phjUnclassifiedStr (default = 'unclassified')
- phjMultipleMatchStr (default = 'multiple')
- phjCleanup (default = False)
- phjPrintResults (default = False)
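The difference between running the regex as a single expression and running each named group separately can be illustrated with the standard re module alone. This is a minimal sketch, not the function's actual implementation: the group names, the patterns, and the helper functions (phjGroupsSingle, phjGroupsSeparate, phjClassify) are hypothetical, and the way the match count and the 'unclassified' and 'multiple' labels are assigned is an assumption based on the parameter names.

```python
import re

# Hypothetical group patterns, listed in priority order (not from the library).
phjGroupPatterns = {
    'terrier': r'.*terrier',
    'yorkshire': r'.*yorkshire',
}

# Combine the patterns into one regex with a named group per pattern.
phjCombinedRegex = '|'.join('(?P<{}>{})'.format(name, pattern)
                            for name, pattern in phjGroupPatterns.items())


def phjGroupsSingle(text):
    """Run the combined regex once; only the first alternative that matches is reported."""
    match = re.search(phjCombinedRegex, text, flags=re.IGNORECASE)
    if match is None:
        return []
    return [name for name, value in match.groupdict().items() if value is not None]


def phjGroupsSeparate(text):
    """Run each group's pattern on its own so that every matching group is reported."""
    return [name for name, pattern in phjGroupPatterns.items()
            if re.search(pattern, text, flags=re.IGNORECASE)]


def phjClassify(text, separate=False, unclassified='unclassified', multiple='multiple'):
    """Return (number of matches, label). The labelling rule is an assumption."""
    matched = phjGroupsSeparate(text) if separate else phjGroupsSingle(text)
    if not matched:
        return 0, unclassified
    if len(matched) > 1:
        return len(matched), multiple
    return 1, matched[0]


print(phjClassify('Yorkshire Terrier'))                 # single regex: first group wins
print(phjClassify('Yorkshire Terrier', separate=True))  # separate groups: both match
print(phjClassify('Poodle'))                            # no group matches
```

In this sketch, the single-regex call assigns 'Yorkshire Terrier' to the terrier group because that alternative is tried first, whereas the separate-group call finds two matching groups and labels the string as a multiple match.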
Exceptions raised
- AssertionError
  An AssertionError is raised if function arguments are entered incorrectly.
- re.error
  A re.error is raised if the regex fails to compile.
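A regex string can be checked before it is passed to the function by compiling it with the standard re module and catching re.error. The sketch below is illustrative only; the helper name phjRegexCompiles and the example patterns are hypothetical and are not part of the library.

```python
import re


def phjRegexCompiles(phjRegexStr):
    """Return True if the regex string compiles, otherwise report the error and return False."""
    try:
        re.compile(phjRegexStr)
    except re.error as e:
        print('Regex failed to compile: {}'.format(e))
        return False
    return True


# A well-formed named-group regex compiles.
print(phjRegexCompiles(r'(?P<terrier>terrier)|(?P<collie>collie)'))

# A missing closing bracket raises re.error.
print(phjRegexCompiles(r'(?P<terrier>terrier'))

# Reusing a group name within one regex also raises re.error.
print(phjRegexCompiles(r'(?P<dog>terrier)|(?P<dog>collie)'))
```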