Retrieve unique values from dataframes - lvphj/epydemiology GitHub Wiki
A function to retrieve unique values from one or more data frames
phjRetrieveUniqueFromMultiDataFrames()
myDF = epy.phjRetrieveUniqueFromMultiDataFrames(phjDFList,
                                                phjVarNameList,
                                                phjSort = True,
                                                phjPrintResults = False)
Description
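This function takes a list of dataframes and returns a single dataframe containing the unique values found in the listed variables. When more than one variable is listed, each unique combination of values across those variables is returned as one row.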
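The behaviour described above can be approximated with plain pandas. The following is a minimal sketch only, not the epydemiology implementation: the name `unique_from_frames` and its arguments are hypothetical, and it assumes the function is broadly equivalent to stacking the selected columns, dropping duplicate rows and sorting.

```python
import pandas as pd


def unique_from_frames(df_list, var_names, sort=True):
    """Hypothetical stand-in, not the epydemiology implementation."""
    # Keep only the requested columns from each dataframe and stack them.
    stacked = pd.concat([df[var_names] for df in df_list], ignore_index=True)
    # Drop repeated rows; with several columns, a row is a combination of values.
    unique = stacked.drop_duplicates()
    if sort:
        # Sort by the columns in the order in which they were listed.
        unique = unique.sort_values(by=var_names)
    # Renumber the rows from zero.
    return unique.reset_index(drop=True)


df1 = pd.DataFrame({'m': [1, 2, 3, 4, 5, 6],
                    'n': ['a', 'b', 'c', 'd', 'e', 'f']})
df2 = pd.DataFrame({'m': [2, 5, 7, 8],
                    'n': ['b', 'e', 'g', 'h']})

result = unique_from_frames([df1, df2], ['m', 'n'])
print(result)
```

Rows that appear in both dataframes, such as (2, 'b'), are kept only once in the result.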
Function parameters
- phjDFList
  List of Pandas dataframes from which unique values will be extracted. A single dataframe can also be passed.
- phjVarNameList
  List of variable names from which unique values will be extracted. A single variable name can also be passed. Each variable name must exist in every dataframe passed in phjDFList.
- phjSort (default = True)
  Sort the values in the returned dataframe. Sorting is performed using the variables in the order given in phjVarNameList.
- phjPrintResults (default = False)
  Print intermediate results: the unique values found in each dataframe and the final dataframe of unique values.
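The parameter rules above (a single dataframe or a single variable name is accepted, and every variable must exist in every dataframe) could be enforced along the following lines. This is a sketch under stated assumptions: `normalise_inputs` is a hypothetical helper, and raising `ValueError` is an illustrative choice rather than the documented behaviour of the library.

```python
import pandas as pd


def normalise_inputs(df_list, var_names):
    """Hypothetical helper; the real function's checks may differ."""
    # Accept a single dataframe as well as a list of dataframes.
    if isinstance(df_list, pd.DataFrame):
        df_list = [df_list]
    # Accept a single variable name as well as a list of names.
    if isinstance(var_names, str):
        var_names = [var_names]
    # Every variable must exist in every dataframe (assumed check).
    for position, df in enumerate(df_list):
        missing = [name for name in var_names if name not in df.columns]
        if missing:
            raise ValueError(
                f"Dataframe at position {position} lacks columns: {missing}")
    return list(df_list), list(var_names)


frames, names = normalise_inputs(pd.DataFrame({'a': [1, 2]}), 'a')
print(len(frames), names)
```

Wrapping single values in lists at the start means the rest of the processing only has to handle the list case.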
Exceptions raised
None
Returns
Pandas dataframe containing unique values.
Other notes
None.
Example
An example of the function in use is given below:
Single dataframe
import pandas as pd
import epydemiology as epy

phjTempDF = pd.DataFrame({'a':[1,2,3,4,5,6,1,2,3,4,5,6],
                          'b':['a','b','c','d','e','f','a','b','w','d','e','f']})
print('Single variable')
print('---------------')
# A single dataframe wrapped in a list and a single variable name
phjOutDF = epy.phjRetrieveUniqueFromMultiDataFrames(phjDFList = [phjTempDF],
                                                    phjVarNameList = 'a',
                                                    phjSort = True,
                                                    phjPrintResults = True)
print('\n')
print('Multiple variables')
print('------------------')
# A single dataframe passed directly and a list of variable names
phjOutDF = epy.phjRetrieveUniqueFromMultiDataFrames(phjDFList = phjTempDF,
                                                    phjVarNameList = ['a','b'],
                                                    phjSort = True,
                                                    phjPrintResults = True)
To give results:
Single variable
---------------
Unique values in dataframe at position 0
   a
0  1
1  2
2  3
3  4
4  5
5  6
Dataframe of unique values from all dataframes
   a
0  1
1  2
2  3
3  4
4  5
5  6
Multiple variables
------------------
Unique values in dataframe at position 0
   a  b
0  1  a
1  2  b
2  3  c
3  4  d
4  5  e
5  6  f
8  3  w
Dataframe of unique values from all dataframes
   a  b
0  1  a
1  2  b
2  3  c
3  3  w
4  4  d
5  5  e
6  6  f
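In the multiple-variable output above, the value 3 appears twice in column a because uniqueness applies to the combination of a and b, and the per-dataframe listing keeps the original row label 8 until the final result is sorted and renumbered. The snippet below reproduces this with plain pandas; it is an illustration of that behaviour using `drop_duplicates`, not the library's own code.

```python
import pandas as pd

phjTempDF = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
                          'b': ['a', 'b', 'c', 'd', 'e', 'f',
                                'a', 'b', 'w', 'd', 'e', 'f']})

# Unique (a, b) combinations; the original row labels are retained,
# so the row (3, 'w') still carries label 8.
per_frame = phjTempDF[['a', 'b']].drop_duplicates()
print(per_frame.index.tolist())

# Sort by 'a' then 'b' and renumber the rows from zero.
combined = per_frame.sort_values(by=['a', 'b']).reset_index(drop=True)
print(combined)
```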
Multiple dataframes
df1 = pd.DataFrame({'m':[1,2,3,4,5,6],
                    'n':['a','b','c','d','e','f']})
df2 = pd.DataFrame({'m':[2,5,7,8],
                    'n':['b','e','g','h']})
phjOutDF = epy.phjRetrieveUniqueFromMultiDataFrames(phjDFList = [df1,df2],
                                                    phjVarNameList = ['m','n'],
                                                    phjSort = True,
                                                    phjPrintResults = True)
To give results:
Unique values in dataframe at position 0
   m  n
0  1  a
1  2  b
2  3  c
3  4  d
4  5  e
5  6  f
Unique values in dataframe at position 1
   m  n
0  2  b
1  5  e
2  7  g
3  8  h
Dataframe of unique values from all dataframes
   m  n
0  1  a
1  2  b
2  3  c
3  4  d
4  5  e
5  6  f
6  7  g
7  8  h