Retrieve unique values from dataframes - lvphj/epydemiology GitHub Wiki

A function to retrieve unique values from one or more dataframes

phjRetrieveUniqueFromMultiDataFrames()

myDF = epy.phjRetrieveUniqueFromMultiDataFrames(phjDFList,
                                                phjVarNameList,
                                                phjSort = True,
                                                phjPrintResults = False)

Description

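This function takes a list of dataframes and returns a single dataframe containing the unique values found in the listed variables (columns). When more than one variable is listed, each unique combination of values across those variables is returned as one row.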
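The behaviour described above can be approximated with plain pandas. The following is a minimal sketch only, not the epydemiology implementation: the name `unique_rows_sketch` and its arguments are hypothetical, and it assumes the library's result is equivalent to stacking the selected columns, dropping duplicate rows and sorting.

```python
import pandas as pd


def unique_rows_sketch(df_list, var_names, sort=True):
    """Hypothetical stand-in; not the epydemiology implementation."""
    # Accept a single dataframe or a single column name as well as lists.
    if isinstance(df_list, pd.DataFrame):
        df_list = [df_list]
    if isinstance(var_names, str):
        var_names = [var_names]

    # Stack the requested columns from every dataframe, then drop repeated rows.
    stacked = pd.concat([df[var_names] for df in df_list], ignore_index=True)
    unique = stacked.drop_duplicates()

    # Sort by the columns in the order given, then renumber the index.
    if sort:
        unique = unique.sort_values(by=var_names)
    return unique.reset_index(drop=True)


df1 = pd.DataFrame({'m': [1, 2, 3], 'n': ['a', 'b', 'c']})
df2 = pd.DataFrame({'m': [2, 4], 'n': ['b', 'd']})
result = unique_rows_sketch([df1, df2], ['m', 'n'])
print(result)
```

Here the row (2, 'b') occurs in both inputs but appears only once in `result`.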

Function parameters

  1. phjDFList

    List containing Pandas dataframes from which unique values will be extracted. A single dataframe can also be passed.

  2. phjVarNameList

    List of variable names (column names) from which unique values will be extracted. A single variable name may also be passed. Every variable name must exist in every dataframe passed in phjDFList.

  3. phjSort (default = True)

    If True, sort the rows of the returned dataframe. Sorting is performed on the variables in the order given in phjVarNameList.

  4. phjPrintResults (default = False)

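    If True, print intermediate and final results: the unique values found in each dataframe (identified by its position in phjDFList), followed by the combined dataframe of unique values.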
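Because every name in phjVarNameList must exist in every dataframe, it can be useful to check the inputs before calling the function. The helper below is a hypothetical sketch for illustration (`find_missing_columns` is not part of epydemiology); it reports, by position in the list, which dataframes lack which columns.

```python
import pandas as pd


def find_missing_columns(df_list, var_names):
    """Hypothetical helper: map dataframe position to its missing column names."""
    missing = {}
    for position, df in enumerate(df_list):
        absent = [name for name in var_names if name not in df.columns]
        if absent:
            missing[position] = absent
    return missing


df1 = pd.DataFrame({'m': [1, 2], 'n': ['a', 'b']})
df2 = pd.DataFrame({'m': [3, 4]})  # deliberately has no 'n' column
problems = find_missing_columns([df1, df2], ['m', 'n'])
print(problems)
```

An empty dictionary means all listed columns are present in all dataframes.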

Exceptions raised

None

Returns

Pandas dataframe containing the unique values (or unique combinations of values) for the requested variables.

Other notes

None.

Example

Examples of the function in use are given below:

Single dataframe

import pandas as pd
import epydemiology as epy

phjTempDF = pd.DataFrame({'a':[1,2,3,4,5,6,1,2,3,4,5,6],
                          'b':['a','b','c','d','e','f','a','b','w','d','e','f']})

print('Single variable')
print('---------------')

phjOutDF = epy.phjRetrieveUniqueFromMultiDataFrames(phjDFList = [phjTempDF],
                                                    phjVarNameList = 'a',
                                                    phjSort = True,
                                                    phjPrintResults = True)
 
print('\n')
print('Multiple variables')
print('------------------')

phjOutDF = epy.phjRetrieveUniqueFromMultiDataFrames(phjDFList = phjTempDF,
                                                    phjVarNameList = ['a','b'],
                                                    phjSort = True,
                                                    phjPrintResults = True)

To give results:

Single variable
---------------
Unique values in dataframe at position 0
   a
0  1
1  2
2  3
3  4
4  5
5  6


Dataframe of unique values from all dataframes
   a
0  1
1  2
2  3
3  4
4  5
5  6


Multiple variables
------------------
Unique values in dataframe at position 0
   a  b
0  1  a
1  2  b
2  3  c
3  4  d
4  5  e
5  6  f
8  3  w


Dataframe of unique values from all dataframes
   a  b
0  1  a
1  2  b
2  3  c
3  3  w
4  4  d
5  5  e
6  6  f
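The value 3 in column a appears twice in the multiple-variable result because uniqueness is judged on the combination of a and b, not on each column separately. This can be illustrated with plain pandas on the same data; the sketch uses `drop_duplicates` directly and makes no claim about how the library does it internally.

```python
import pandas as pd

phjTempDF = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
                          'b': ['a', 'b', 'c', 'd', 'e', 'f',
                                'a', 'b', 'w', 'd', 'e', 'f']})

# Uniqueness is judged on whole rows of the selected columns.
pairs = phjTempDF[['a', 'b']].drop_duplicates()

# Both (3, 'c') and (3, 'w') survive because they are different combinations.
rows_with_3 = pairs[pairs['a'] == 3]
print(rows_with_3)
```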

Multiple dataframes

df1 = pd.DataFrame({'m':[1,2,3,4,5,6],
                    'n':['a','b','c','d','e','f']})
 
df2 = pd.DataFrame({'m':[2,5,7,8],
                    'n':['b','e','g','h']})

phjOutDF = epy.phjRetrieveUniqueFromMultiDataFrames(phjDFList = [df1,df2],
                                                    phjVarNameList = ['m','n'],
                                                    phjSort = True,
                                                    phjPrintResults = True)

To give results:

Unique values in dataframe at position 0
   m  n
0  1  a
1  2  b
2  3  c
3  4  d
4  5  e
5  6  f


Unique values in dataframe at position 1
   m  n
0  2  b
1  5  e
2  7  g
3  8  h


Dataframe of unique values from all dataframes
   m  n
0  1  a
1  2  b
2  3  c
3  4  d
4  5  e
5  6  f
6  7  g
7  8  h