Create a named group regex from individual regexes - lvphj/epydemiology GitHub Wiki

phjCreateNamedGroupRegex()

import numpy as np
import pandas as pd
import re
import epydemiology as epy

myRegex = epy.phjCreateNamedGroupRegex(phjDF,
                                       phjGroupVarName,
                                       phjRegexVarName,
                                       phjIDVarName = None,
                                       phjRegexPreCompile = False,
                                       phjPrintResults = False)

Description

Individual regexes may be stored in a database table or a spreadsheet and several regexes may define a group. As an example, consider several regexes being used to define the groups dog and cat as follows:

category        regex
     dog      (?:dog)
           (?:canine)
               (?:k9)
     cat      (?:cat)
           (?:feline)

To determine whether a string is dog or cat related, a regex that combines all the options could be created:

(?P<dog>(?:dog)|(?:canine)|(?:k9))|(?P<cat>(?:cat)|(?:feline))

This may not be the most efficient regex for the task, but it means that meaningful individual regexes can be stored and edited in a table and combined only when needed.
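As a brief illustration of how such a combined regex can be used, the sketch below compiles the pattern shown above with the standard `re` module and reports which named group matched. The `classify` helper and the use of `re.IGNORECASE` are assumptions made for this example, not part of epydemiology.

```python
import re

# The combined named-group regex from the description above.
# re.IGNORECASE is an assumption made for this illustration.
pattern = re.compile(
    r"(?P<dog>(?:dog)|(?:canine)|(?:k9))|(?P<cat>(?:cat)|(?:feline))",
    re.IGNORECASE,
)

def classify(text):
    """Return the name of the group that matched first in text, or None."""
    match = pattern.search(text)
    return match.lastgroup if match else None

print(classify("Canine distemper"))   # dog
print(classify("feline leukaemia"))   # cat
print(classify("rabbit"))             # None
```

Because the individual regexes are unanchored, they also match inside longer words (for example, "cat" inside "category"), so word boundaries may be worth adding to the stored regexes where that matters.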

The phjCreateNamedGroupRegex() function combines the series of regexes stored in a table or a spreadsheet into a single named-group regex and returns either the full string or the compiled regex.
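The combining step can be pictured with a minimal pure-Python sketch. This is not the library's implementation: the `combine_named_groups` helper and the list-of-tuples input are hypothetical stand-ins for the dataframe, and the result is written on a single line rather than in the multi-line layout the function produces.

```python
# Hypothetical rows as (group, regex) pairs, mirroring the dog/cat table.
rows = [
    ("dog", "(?:dog)"),
    ("dog", "(?:canine)"),
    ("dog", "(?:k9)"),
    ("cat", "(?:cat)"),
    ("cat", "(?:feline)"),
]

def combine_named_groups(rows):
    """Join the regexes of each group with '|' and wrap each group in (?P<name>...)."""
    grouped = {}
    for group, regex in rows:
        # dicts keep insertion order, so groups appear in first-seen order
        grouped.setdefault(group, []).append(regex)
    return "|".join(
        "(?P<{}>{})".format(name, "|".join(parts))
        for name, parts in grouped.items()
    )

combined = combine_named_groups(rows)
print(combined)
# (?P<dog>(?:dog)|(?:canine)|(?:k9))|(?P<cat>(?:cat)|(?:feline))
```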

Function parameters

  1. phjDF

    The dataframe containing the individual regexes and group names.

  2. phjGroupVarName

    Name of the variable containing the group names.

  3. phjRegexVarName

    Name of variable containing non-capturing regexes.

  4. phjIDVarName (default = None)

    Name of variable containing an index number for groups. The ID sets the order in which the group names appear in the final regex. If the group names and ID variables do not match, a warning will be generated.

  5. phjRegexPreCompile (default = False)

    If set to True, the function will return a compiled regex.

  6. phjPrintResults (default = False)

    If set to True, the function will print information to screen as it proceeds.
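The role of the ID variable can be sketched as follows. This is only an illustration under stated assumptions: "do not match" is interpreted here as the IDs and group names not mapping one-to-one, and the `ordered_groups` helper and tuple input are hypothetical, not the library's code.

```python
import warnings

# Hypothetical (id, group) pairs, as in the example dataframe further down.
rows = [(2, "dog"), (2, "dog"), (2, "dog"), (1, "cat"), (1, "cat")]

def ordered_groups(rows):
    """Return group names sorted by ID, warning if IDs and groups are inconsistent."""
    ids_per_group = {}
    groups_per_id = {}
    for id_, group in rows:
        ids_per_group.setdefault(group, set()).add(id_)
        groups_per_id.setdefault(id_, set()).add(group)
    # Assumed meaning of a mismatch: a group with several IDs, or an ID shared by groups.
    consistent = (
        all(len(ids) == 1 for ids in ids_per_group.values())
        and all(len(groups) == 1 for groups in groups_per_id.values())
    )
    if not consistent:
        warnings.warn("Group names and IDs do not map one-to-one.")
    return sorted(ids_per_group, key=lambda g: min(ids_per_group[g]))

print(ordered_groups(rows))   # ['cat', 'dog']
```

With these IDs, cat (ID 1) is placed before dog (ID 2), which is the order seen in the example output on this page.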

Exceptions raised

  1. AssertionError

    An AssertionError is raised if function arguments are entered incorrectly.

  2. pd.core.groupby.DataError

    A pd.core.groupby.DataError is raised if there is a problem with the groupby method that is used to aggregate the regexes by group variable.

  3. re.error

    A re.error is raised if phjRegexPreCompile is set to True and the combined regex fails to compile.
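To show what a compile failure looks like, the sketch below triggers and catches `re.error` using only the standard `re` module. The `try_compile` helper and the malformed pattern are made up for illustration; how a caller chooses to handle the exception is left open.

```python
import re

def try_compile(pattern):
    """Return a compiled regex, or None if the pattern is not valid."""
    try:
        return re.compile(pattern, re.IGNORECASE | re.VERBOSE)
    except re.error as err:
        print("Regex failed to compile:", err)
        return None

# A stored regex with an unbalanced parenthesis breaks the combined pattern.
bad_pattern = "(?P<dog>(?:dog)|(?:canine"
good_pattern = "(?P<dog>(?:dog)|(?:canine))"

print(try_compile(bad_pattern) is None)       # True
print(try_compile(good_pattern) is not None)  # True
```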

Returns

By default (phjRegexPreCompile = False), the function returns the combined regex as a string. If phjRegexPreCompile is set to True, the function returns a compiled regex object.

Other notes

None.

Example

phjRegexPreCompile parameter set to False

import numpy as np
import pandas as pd
import re
import epydemiology as epy

df = pd.DataFrame({'id':[2,2,2,1,1],
                   'group':['dog','dog','dog','cat','cat'],
                   'regex':['(?:dog)','(?:canine)','(?:k9)','(?:cat)','(?:feline)']})

print("Dataframe\n---------")
print(df)

regexStr = epy.phjCreateNamedGroupRegex(phjDF = df,
                                        phjGroupVarName = 'group',
                                        phjRegexVarName = 'regex',
                                        phjIDVarName = 'id',
                                        phjRegexPreCompile = False,
                                        phjPrintResults = False)

print("\nCombined Regex string\n---------------------")
print(regexStr)

This returns the following output:

Dataframe
---------
   id group       regex
0   2   dog     (?:dog)
1   2   dog  (?:canine)
2   2   dog      (?:k9)
3   1   cat     (?:cat)
4   1   cat  (?:feline)

Combined Regex string
---------------------
(?P<cat>
    (?:cat)|
    (?:feline))|
(?P<dog>
    (?:dog)|
    (?:canine)|
    (?:k9))
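Because the returned string contains line breaks and indentation, it presumably needs the `re.VERBOSE` flag if it is compiled by hand. The sketch below, which uses only the standard `re` module and a copy of the string printed above, compares compiling with and without that flag. The variable names are illustrative.

```python
import re

# The multi-line string as printed above.
regex_str = """(?P<cat>
    (?:cat)|
    (?:feline))|
(?P<dog>
    (?:dog)|
    (?:canine)|
    (?:k9))"""

# With re.VERBOSE, whitespace in the pattern is ignored.
verbose = re.compile(regex_str, re.IGNORECASE | re.VERBOSE)
# Without it, the newlines and spaces must appear literally in the text.
literal = re.compile(regex_str, re.IGNORECASE)

print(verbose.search("K9 unit").lastgroup)   # dog
print(literal.search("K9 unit"))             # None
```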

phjRegexPreCompile parameter set to True

df = pd.DataFrame({'id':[2,2,2,1,1],
                   'group':['dog','dog','dog','cat','cat'],
                   'regex':['(?:dog)','(?:canine)','(?:k9)','(?:cat)','(?:feline)']})

print("Dataframe\n---------")
print(df)

myCompiledRegexObj = epy.phjCreateNamedGroupRegex(phjDF = df,
                                                  phjGroupVarName = 'group',
                                                  phjRegexVarName = 'regex',
                                                  phjIDVarName = 'id',
                                                  phjRegexPreCompile = True,
                                                  phjPrintResults = False)

print("\nCompiled Regex object\n---------------------")
print(myCompiledRegexObj)

This returns the following output:

Dataframe
---------
   id group       regex
0   2   dog     (?:dog)
1   2   dog  (?:canine)
2   2   dog      (?:k9)
3   1   cat     (?:cat)
4   1   cat  (?:feline)

Compiled Regex object
---------------------
re.compile('(?P<cat>\n    (?:cat)|\n    (?:feline))|\n(?P<dog>\n    (?:dog)|\n    (?:canine)|\n    (?:k9))', re.IGNORECASE|re.VERBOSE)
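A compiled object of this kind can be inspected and applied with standard `re` methods. The sketch below rebuilds an equivalent object directly with `re.compile` so that it runs without epydemiology. The sample texts and the `label` helper are assumptions made for illustration.

```python
import re

# Equivalent to the compiled object printed above.
compiled = re.compile(
    '(?P<cat>\n    (?:cat)|\n    (?:feline))|\n'
    '(?P<dog>\n    (?:dog)|\n    (?:canine)|\n    (?:k9))',
    re.IGNORECASE | re.VERBOSE,
)

# groupindex maps each group name to its position in the pattern.
print(dict(compiled.groupindex))   # {'cat': 1, 'dog': 2}

def label(text):
    """Return the name of the matching group, or None if nothing matches."""
    match = compiled.search(text)
    return match.lastgroup if match else None

texts = ["Feline leukaemia", "canine parvovirus", "rabbit"]
labels = [label(t) for t in texts]
print(labels)   # ['cat', 'dog', None]
```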