Reverse map a categorical variable based on dictionary values - lvphj/epydemiology GitHub Wiki
A function to map a variable into standardised categories (defined as keys of a dictionary) based on lists of values, which can be regular expressions.
def phjReverseMap(phjDF,
                  phjMappingDict,
                  phjCategoryVarName,
                  phjMappedVarName = 'mapped_cat',
                  phjUnmapped = np.nan,
                  phjTreatAsRegex = False,
                  phjDropPreExisting = False,
                  phjPrintResults = False)

This function can be used to categorise a variable into standardised category names. The category names are defined as dictionary keys, and the allowed variants for each category are given as a list in the corresponding dictionary value.
- phjDF
  Pandas dataframe containing a variable to be categorised.
- phjMappingDict
  Dictionary where keys represent standardised category names and values are lists of allowable values (or regular expressions) for each category.
- phjCategoryVarName
  Name of the column in the phjDF dataframe that should be recategorised.
- phjMappedVarName (default = 'mapped_cat')
  Name of the new column that will be created in the phjDF dataframe to contain the new standardised category values.
- phjUnmapped (default = np.nan)
  Value to use if an existing value in the phjCategoryVarName column is not matched by any values in phjMappingDict.
- phjTreatAsRegex (default = False)
  Indicates whether values in phjMappingDict should be treated as regular expressions. When phjTreatAsRegex is set to True, the function calls the phjFindRegexNamesGroups() function with the phjSeparateRegexGroups argument set to True. This ensures that all possible matches are identified.
- phjDropPreExisting (default = False)
  Indicates whether columns that will be created by the function should be dropped if they already exist in phjDF.
- phjPrintResults (default = False)
  Indicates whether intermediate results should be printed to screen as the function progresses.
AssertionError
An AssertionError is raised if any of the parameters passed to the function is of the incorrect type.
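Because incorrect parameter types surface as an AssertionError, a caller may prefer to check the mapping dictionary before passing it in. The following is a minimal sketch of such a check using only the standard library. The helper name check_mapping_dict and its checks are assumptions made for illustration; they are not part of epydemiology and may differ from the checks the function itself performs.

```python
import re


def check_mapping_dict(mapping, treat_as_regex=False):
    """Hypothetical pre-flight check for a {category: [values]} dictionary."""
    assert isinstance(mapping, dict), "mapping must be a dictionary"
    for category, values in mapping.items():
        assert isinstance(category, str), "category names must be strings"
        assert isinstance(values, list), "each dictionary value must be a list"
        for value in values:
            assert isinstance(value, str), "each allowed value must be a string"
            if treat_as_regex:
                # re.compile raises re.error if the pattern is malformed.
                re.compile(value)
    return True


d = {'dog': ['(?:(?:dog+))', '(?:can*ine)'],
     'cat': ['(?:cat+)', '(?:fel+ine?)']}
ok = check_mapping_dict(d, treat_as_regex=True)
print(ok)
```

Note that assert statements are skipped when Python runs with the -O flag, so a check like this is a development aid and not a guarantee.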
Pandas dataframe containing additional columns created by the function. If phjTreatAsRegex is False, the returned dataframe has all original columns plus one additional column, named by phjMappedVarName, which contains the mapped categories. If phjTreatAsRegex is True, the returned dataframe has all original columns plus a column for each key in phjMappingDict (containing any regex matches), a column named numberMatches (indicating how many regexes were matched), and a column named by the phjMappedVarName parameter (containing either the name of the mapped group or, if there were no matches, the value defined by phjUnmapped).
None
Example 1 - exact string matches
import pandas as pd
import epydemiology as epy

# Example dataframe with misspelt and variant category names
myDF = pd.DataFrame({'id':[1,2,3,4,5,6,7],
                     'var':['dogg','canine','cannine','catt','felin','cot','feline']})
print(myDF)

# Keys are the standardised categories; values list the exact strings to map
d = {'dog':['dogg','canine','cannine'],
     'cat':['catt','felin','feline']}

myDF = epy.phjReverseMap(phjDF = myDF,
                         phjMappingDict = d,
                         phjCategoryVarName = 'var',
                         phjMappedVarName = 'new',
                         phjUnmapped = 'missing',
                         phjDropPreExisting = True,
                         phjTreatAsRegex = False,
                         phjPrintResults = True)

Produces the following output:
id var
0 1 dogg
1 2 canine
2 3 cannine
3 4 catt
4 5 felin
5 6 cot
6 7 feline
Reversed dictionary
{'felin': 'cat', 'cannine': 'dog', 'dogg': 'dog', 'canine': 'dog', 'feline': 'cat', 'catt': 'cat'}
id var new
0 1 dogg dog
1 2 canine dog
2 3 cannine dog
3 4 catt cat
4 5 felin cat
5 6 cot missing
6 7 feline cat
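The "Reversed dictionary" printed in the output above suggests that exact matching works by inverting the mapping so that each allowed value points to its category. The following is a minimal standard-library sketch of that idea, not the library's implementation. The function names reverse_mapping and map_values are hypothetical, and raising an error for a value listed under two categories is an assumption.

```python
def reverse_mapping(mapping):
    """Invert {category: [values]} into {value: category}."""
    reversed_map = {}
    for category, values in mapping.items():
        for value in values:
            # Assumption: a value listed under two categories is treated as an error.
            if value in reversed_map:
                raise ValueError(f"{value!r} is listed under more than one category")
            reversed_map[value] = category
    return reversed_map


def map_values(values, mapping, unmapped=None):
    """Return the category for each value, or `unmapped` where nothing matches."""
    reversed_map = reverse_mapping(mapping)
    return [reversed_map.get(value, unmapped) for value in values]


d = {'dog': ['dogg', 'canine', 'cannine'],
     'cat': ['catt', 'felin', 'feline']}
observed = ['dogg', 'canine', 'cannine', 'catt', 'felin', 'cot', 'feline']
mapped = map_values(observed, d, unmapped='missing')
print(mapped)
```

In pandas, a comparable lookup can be written as myDF['var'].map(reversed_map).fillna('missing'), since Series.map leaves values absent from the dictionary as NaN.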
Example 2 - regexes
# Example dataframe with misspelt and variant category names
myDF = pd.DataFrame({'id':[1,2,3,4,5,6,7],
                     'var':['dogg','canine','cannine','catt','felin','cot','feline']})
print(myDF)

# Keys are the standardised categories; values list regular expressions to match
d = {'dog':['(?:(?:dog+))','(?:can*ine)'],
     'cat':['(?:cat+)','(?:fel+ine?)']}

myDF = epy.phjReverseMap(phjDF = myDF,
                         phjMappingDict = d,
                         phjCategoryVarName = 'var',
                         phjMappedVarName = 'new',
                         phjUnmapped = 'missing',
                         phjDropPreExisting = True,
                         phjTreatAsRegex = True,
                         phjPrintResults = True)

Produces the following output:
id var
0 1 dogg
1 2 canine
2 3 cannine
3 4 catt
4 5 felin
5 6 cot
6 7 feline
Full Regex string
(?P<cat>
(?:cat+)|
(?:fel+ine?))|
(?P<dog>
(?:(?:dog+))|
(?:can*ine))
cat ... done
dog ... done
Table of number of group matches identified per description term
Frequency
Number of matches
0 1
1 6
id var cat dog numberMatches matchedgroup
0 1 dogg NaN dogg 1 dog
1 2 canine NaN canine 1 dog
2 3 cannine NaN cannine 1 dog
3 4 catt catt NaN 1 cat
4 5 felin felin NaN 1 cat
5 6 cot NaN NaN 0 missing
6 7 feline feline NaN 1 cat
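The per-category columns and the numberMatches column above can be illustrated with a short standard-library sketch that tests each category's regular expressions separately and counts how many categories matched. This is not the library's implementation. The function names match_groups and map_by_regex are hypothetical, and taking the first matching category when several match is an assumption, because the documentation only describes the no-match case.

```python
import re


def match_groups(value, mapping):
    """Search `value` against each category's regexes, one category at a time."""
    matches = {}
    for category, patterns in mapping.items():
        # Join one category's alternatives into a single named group.
        # The category name must be a valid Python identifier to serve as a group name.
        group_regex = re.compile('(?P<{}>{})'.format(category, '|'.join(patterns)))
        found = group_regex.search(value)
        matches[category] = found.group(category) if found else None
    return matches


def map_by_regex(value, mapping, unmapped=None):
    """Return (matches per category, number of matching categories, mapped category)."""
    matches = match_groups(value, mapping)
    matched = [category for category, text in matches.items() if text is not None]
    # Assumption: if several categories match, the first one is used.
    mapped = matched[0] if matched else unmapped
    return matches, len(matched), mapped


d = {'dog': ['(?:(?:dog+))', '(?:can*ine)'],
     'cat': ['(?:cat+)', '(?:fel+ine?)']}
observed = ['dogg', 'canine', 'cannine', 'catt', 'felin', 'cot', 'feline']
results = [map_by_regex(value, d, unmapped='missing') for value in observed]
for value, (matches, number_matches, mapped) in zip(observed, results):
    print(value, matches, number_matches, mapped)
```

Searching each category separately, instead of using one combined pattern, means a value that satisfies regexes from two categories is reported with a count of 2 and not silently assigned to whichever alternative appears first in the combined pattern.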