Create a named group regex from individual regexes - lvphj/epydemiology GitHub Wiki
import numpy as np
import pandas as pd
import re
import epydemiology as epy
myRegex = epy.phjCreateNamedGroupRegex(phjDF,
                                       phjGroupVarName,
                                       phjRegexVarName,
                                       phjIDVarName = None,
                                       phjRegexPreCompile = False,
                                       phjPrintResults = False)
Individual regexes may be stored in a database table or a spreadsheet, and several regexes may define a group. As an example, consider several regexes being used to define the groups dog and cat as follows:
category    regex
dog         (?:dog)
            (?:canine)
            (?:k9)
cat         (?:cat)
            (?:feline)
To determine whether a string is dog or cat related, a regex that combines all the options could be created:
(?P<dog>(?:dog)|(?:canine)|(?:k9))|(?P<cat>(?:cat)|(?:feline))
This may not be the most efficient regex for the task, but it means that meaningful regexes can be readily stored and modified in a table format.
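As a minimal sketch of how such a combined regex might be used with Python's standard re module, the name of the matching group can be read from the match object. The helper name classify is hypothetical and is not part of epydemiology.

```python
import re

# Combined named-group regex from the example above.
pattern = re.compile(r"(?P<dog>(?:dog)|(?:canine)|(?:k9))|(?P<cat>(?:cat)|(?:feline))")

def classify(text):
    """Hypothetical helper: return the name of the group that matched, or None."""
    match = pattern.search(text)
    # lastgroup holds the name of the last capturing group that took part in the match.
    return match.lastgroup if match else None

print(classify("canine distemper"))
print(classify("feline leukaemia"))
print(classify("rabbit"))
```

Because the regexes contain no word boundaries, a string such as "category" would also be classified as cat; stricter regexes can be stored in the table if that matters.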
The phjCreateNamedGroupRegex() function combines the series of regexes stored in a table or a spreadsheet into a single named-group regex and returns either the full string or the compiled regex.
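The idea of combining the rows can be sketched with a pandas groupby. This is an illustration under stated assumptions and not the library's actual implementation: the function name combine_named_groups and its arguments are hypothetical, and the sketch produces a single-line string with no extra whitespace.

```python
import pandas as pd

def combine_named_groups(df, group_col, regex_col, id_col=None):
    """Hypothetical sketch: join regexes within each group, then join the named groups."""
    if id_col is not None:
        # Assumption: groups are ordered by ascending ID; a stable sort keeps row order within a group.
        df = df.sort_values(id_col, kind="stable")
    # sort=False keeps the groups in order of first appearance.
    grouped = df.groupby(group_col, sort=False)[regex_col].agg("|".join)
    return "|".join(f"(?P<{name}>{alts})" for name, alts in grouped.items())

df = pd.DataFrame({
    "id": [2, 2, 2, 1, 1],
    "group": ["dog", "dog", "dog", "cat", "cat"],
    "regex": ["(?:dog)", "(?:canine)", "(?:k9)", "(?:cat)", "(?:feline)"],
})
combined = combine_named_groups(df, "group", "regex", id_col="id")
print(combined)
```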
- phjDF
  The dataframe containing the individual regexes and group names.
- phjGroupVarName
  Name of the variable containing the group names.
- phjRegexVarName
  Name of the variable containing the non-capturing regexes.
- phjIDVarName (default = None)
  Name of the variable containing an index number for the groups. The ID sets the order in which the group names appear in the final regex. If the group names and ID variables do not match, a warning will be generated.
- phjRegexPreCompile (default = False)
  If set to True, the function will return a compiled regex.
- phjPrintResults (default = False)
  If set to True, the function will print information to screen as it proceeds.
- AssertionError
  An AssertionError is raised if function arguments are entered incorrectly.
- pd.core.groupby.DataError
  A pd.core.groupby.DataError is raised if there is a problem with the groupby method that is used to aggregate regexes based on the group variable.
- re.error
  A re.error is raised if phjRegexPreCompile is set to True and the regex fails to compile.
By default (phjRegexPreCompile = False), the function returns the combined regex as a string. If phjRegexPreCompile is set to True, the function returns a compiled regex object.
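When the string form is returned, it can be compiled later with the standard re module. The sketch below assumes the multi-line layout shown in the example output on this page (the exact indentation is an assumption) and reuses the flags that appear in the compiled example, re.IGNORECASE and re.VERBOSE. The VERBOSE flag matters because the string contains newlines and indentation that would otherwise be treated as literal characters.

```python
import re

# String laid out as in the example output on this page (indentation width is assumed).
regex_str = (
    "(?P<cat>\n"
    "    (?:cat)|\n"
    "    (?:feline))|\n"
    "(?P<dog>\n"
    "    (?:dog)|\n"
    "    (?:canine)|\n"
    "    (?:k9))"
)

# VERBOSE ignores the layout whitespace; IGNORECASE lets 'K9' match '(?:k9)'.
compiled = re.compile(regex_str, flags=re.IGNORECASE | re.VERBOSE)
match = compiled.search("K9 unit")
print(match.lastgroup)

# Without VERBOSE the newlines and spaces are literal, so nothing matches here.
plain = re.compile(regex_str)
print(plain.search("k9 unit"))
```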
None.
import numpy as np
import pandas as pd
import re
import epydemiology as epy
df = pd.DataFrame({'id':[2,2,2,1,1],
                   'group':['dog','dog','dog','cat','cat'],
                   'regex':['(?:dog)','(?:canine)','(?:k9)','(?:cat)','(?:feline)']})
print("Dataframe\n---------")
print(df)
# Return the combined regex as a string (phjRegexPreCompile = False)
regexStr = epy.phjCreateNamedGroupRegex(phjDF = df,
                                        phjGroupVarName = 'group',
                                        phjRegexVarName = 'regex',
                                        phjIDVarName = 'id',
                                        phjRegexPreCompile = False,
                                        phjPrintResults = False)
print("\nCombined Regex string\n---------------------")
print(regexStr)
This returns the following output:
Dataframe
---------
group id regex
0 dog 2 (?:dog)
1 dog 2 (?:canine)
2 dog 2 (?:k9)
3 cat 1 (?:cat)
4 cat 1 (?:feline)
Combined Regex string
---------------------
(?P<cat>
(?:cat)|
(?:feline))|
(?P<dog>
(?:dog)|
(?:canine)|
(?:k9))
df = pd.DataFrame({'id':[2,2,2,1,1],
                   'group':['dog','dog','dog','cat','cat'],
                   'regex':['(?:dog)','(?:canine)','(?:k9)','(?:cat)','(?:feline)']})
print("Dataframe\n---------")
print(df)
# Return a compiled regex object (phjRegexPreCompile = True)
myCompiledRegexObj = epy.phjCreateNamedGroupRegex(phjDF = df,
                                                  phjGroupVarName = 'group',
                                                  phjRegexVarName = 'regex',
                                                  phjIDVarName = 'id',
                                                  phjRegexPreCompile = True,
                                                  phjPrintResults = False)
print("\nCompiled Regex object\n---------------------")
print(myCompiledRegexObj)
This returns the following output:
Dataframe
---------
id group regex
0 2 dog (?:dog)
1 2 dog (?:canine)
2 2 dog (?:k9)
3 1 cat (?:cat)
4 1 cat (?:feline)
Compiled Regex object
---------------------
re.compile('(?P<cat>\n (?:cat)|\n (?:feline))|\n(?P<dog>\n (?:dog)|\n (?:canine)|\n (?:k9))', re.IGNORECASE|re.VERBOSE)
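A named-group regex of this kind could then be applied to a column of free text. The following sketch uses pandas Series.str.extract, which returns one column per named group. The pattern is written out by hand rather than produced by phjCreateNamedGroupRegex(), and the notes data are made up for illustration.

```python
import re
import pandas as pd

# Hand-written equivalent of the combined regex; the sample text is made-up data.
pattern = r"(?P<cat>(?:cat)|(?:feline))|(?P<dog>(?:dog)|(?:canine)|(?:k9))"
notes = pd.Series(["Canine parvovirus", "feline asthma", "rabbit"])

# One column per named group; rows without a match in a group hold a missing value.
hits = notes.str.extract(pattern, flags=re.IGNORECASE)
print(hits)
```

Each row has at most one non-missing column, because the named groups are alternatives within a single regex.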