Generate or select a matched or unmatched case control dataset - lvphj/epydemiology GitHub Wiki

Python functions to randomly select and generate matched or unmatched case-control datasets (without replacement) from data in Pandas dataframes.

phjGenerateCaseControlDataset()

import pandas as pd
import epydemiology as epy

df = epy.phjGenerateCaseControlDataset(phjAllDataDF,
                                       phjConsultationIDVarName,
                                       phjPatientIDVarName,
                                       phjCasesDF,
                                       phjMatchingVariablesList = None,
                                       phjControlsPerCaseInt = 1,
                                       phjScreeningRegexStr = None,
                                       phjScreeningRegexPathAndFileName = None,
                                       phjFreeTextVarName = None,
                                       phjControlType = 'consultation',
                                       phjConsultationDateVarName = None,
                                       phjAggDict = None,
                                       phjPrintResults = False)

phjSelectCaseControlDataset()

import pandas as pd
import epydemiology as epy

df = epy.phjSelectCaseControlDataset(phjCasesDF,
                                     phjPotentialControlsDF,
                                     phjUniqueIdentifierVarName,
                                     phjMatchingVariablesList = None,
                                     phjControlsPerCaseInt = 1,
                                     phjPrintResults = False)

Description

These two functions are closely related. In fact, the phjGenerateCaseControlDataset() function calls the phjSelectCaseControlDataset() function as part of the data selection process, but both functions can be used independently. In essence, phjSelectCaseControlDataset() takes two dataframes as arguments, one containing all the data that can be used as potential controls and the other containing a list of cases, and returns a skeletal, bare-bones dataframe containing the unique IDs of cases and controls together with case/control status and, for matched controls, the group membership. This dataframe needs to be merged with the original data to retrieve all the original variables. The phjGenerateCaseControlDataset() function, in contrast, seeks to automate the entire process. It takes a dataframe of all data and a list of cases and returns either a consultation-based or patient-based case-control dataset that contains all the original variables.

General workflow

These functions were written to streamline a commonly-encountered workflow in our research group, namely the need to randomly select matched or unmatched controls from a large dataset, having screened and confirmed the identification of cases. The controls that are selected to go with cases could be either consultation controls (i.e. a random selection of consultations from any animals not represented in the cases dataset) or patient controls (i.e. a random selection of animals that are not represented in the case dataset). In the latter case, consultation-specific information needs to be collapsed on patient ID to produce patient-based information.

It is assumed the cases and controls are ideally* stored in the same flat-file dataframe having the following basic structure:

| consultID |       date | patientID | match | freetext | var2 | var3 |
|-----------|------------|-----------|-------|----------|------|------|
|      1001 | 2017-01-23 |      7324 |  catA |        a |  454 |  low |
|      1002 | 2017-01-25 |      7324 |  catB |        b |  345 |  low |
|      1003 | 2017-01-29 |      7324 |  catA |        c |  879 |  low |
|      1004 | 2017-02-05 |      9767 |  catB |        a |  276 |  mid |
|      1005 | 2017-02-11 |      9767 |  catB |        b |  478 |  mid |
|      1006 | 2017-02-28 |      3452 |  catA |        c |  222 |  mid |
|      1007 | 2017-03-23 |      5322 |  catA |        a |  590 |   hi |
|      1008 | 2017-03-23 |      5322 |  catB |        b |  235 |   hi |
|      1009 | 2017-04-02 |      5322 |  catB |        c |  657 |   hi |
etc.

Cases that are not part of the whole dataset can be provided as a dataframe but the extra columns in the dataframe (i.e. other variables) need to be the same as the columns in the dataset containing 'all' the data from which the controls will be selected.
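The column requirement above can be checked before any selection is attempted. The following is a minimal sketch using plain pandas; the dataframe names and values are hypothetical and the check is not part of the epydemiology functions themselves.

```python
import pandas as pd

# Hypothetical 'all data' dataframe and a dataframe of extra cases
# that are not present in it.
all_df = pd.DataFrame({'consultID': [1001, 1002],
                       'patientID': [7324, 9767],
                       'var2': [454, 276]})
extra_cases = pd.DataFrame({'consultID': [2001],
                            'var2': [590],
                            'patientID': [5322]})

# The two dataframes should have the same set of columns (order is irrelevant).
same_columns = set(extra_cases.columns) == set(all_df.columns)

if same_columns:
    # Reorder the case columns to match before stacking the two dataframes.
    combined = pd.concat([all_df, extra_cases[all_df.columns]], ignore_index=True)
    print(combined)
```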

The following provides a brief description of the workflow used to identify and generate case-control datasets.

IDENTIFY CASES The first step – prior to using these functions – is to identify cases within the data set.

The whole database (or a partial excerpt) is downloaded and stored in a pandas dataframe. The data set consists of consultation ID, date of consultation, patient ID, freetext clinical narrative, variables to be used for matching (if required) and any other variables of interest.
Potential cases are identified using a screening regex applied to the freetext clinical narrative.
The researcher manually reads consultations of potential cases to confirm that they are cases. The consultation numbers of confirmed cases are recorded either alone or as a slice of the dataframe.
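The screening step described above might look something like the following sketch. The regex, the example narratives and the variable names are all made up for illustration; only the column names follow the example table.

```python
import re
import pandas as pd

# Hypothetical consultation data.
consults = pd.DataFrame({
    'consultID': [1001, 1002, 1003, 1004],
    'patientID': [7324, 7324, 9767, 3452],
    'freetext': ['Vomiting since yesterday',
                 'Booster vaccination',
                 'Owner reports VOMIT and lethargy',
                 'Nail clip'],
})

# Hypothetical screening regex, applied case-insensitively to the narrative.
screening_regex = r'vomit'
is_potential_case = consults['freetext'].str.contains(
    screening_regex, flags=re.IGNORECASE, regex=True, na=False)

# Potential cases still need to be read and confirmed manually.
potential_cases = consults.loc[is_potential_case, ['consultID', 'patientID']]
print(potential_cases)
```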

IDENTIFY POTENTIAL CONTROLS The potential controls may be drawn from a larger range of data than was used to select the cases. As a result, it is important that the potential controls do not include any consultations that would have been identified as cases had they been included in the initial screening of cases.

Identify all consultations that would have been identified as a potential CASE using the screening regex and identify the corresponding patient ID.
Identify all corresponding patient IDs for consultations identified as confirmed cases.
Remove all consultations from confirmed and potential case patients (regardless of whether the individual consultation was positive or negative). If a patient has one consultation where the regex identifies a match, all consultations from that animal should be excluded from the list of potential controls. The remaining consultations are, therefore, potential controls.
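The exclusion logic in the three steps above can be sketched in pandas as follows. This is an illustration under assumed data and a hypothetical regex, not the code used inside phjGenerateCaseControlDataset().

```python
import pandas as pd

# Hypothetical 'all data' dataframe.
all_df = pd.DataFrame({
    'consultID': [1001, 1002, 1003, 1004, 1005, 1006],
    'patientID': [7324, 7324, 9767, 9767, 3452, 5322],
    'freetext': ['vomiting', 'booster', 'nail clip',
                 'booster', 'vomit x2', 'lame'],
})
confirmed_case_consults = [1001]  # hypothetical confirmed cases

# Consultations flagged by the (hypothetical) screening regex.
regex_hit = all_df['freetext'].str.contains(
    r'vomit', case=False, regex=True, na=False)
# Consultations already confirmed as cases.
is_confirmed = all_df['consultID'].isin(confirmed_case_consults)

# Any patient with at least one flagged or confirmed consultation is excluded,
# together with ALL of that patient's consultations.
excluded_patients = all_df.loc[regex_hit | is_confirmed, 'patientID'].unique()
potential_controls = all_df[~all_df['patientID'].isin(excluded_patients)]
print(potential_controls)
```

Note that consultation 1002 is dropped even though its own narrative does not match, because it belongs to a patient with a flagged consultation.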

SELECT CONTROL DATASET

Select suitable controls from the dataframe of potential controls, either unmatched or matched on given variables. The controls can be either consultation controls (where individual consultations are selected from the dataframe) or patient controls (where patients are selected).
When selecting patient controls, it is necessary to collapse the consultation-based dataframe down to a patient-based dataframe by grouping on patient ID. By default, a collapsed dataframe will contain a 'count' variable to indicate how many consultations were recorded for each patient, the dates of the first and last consultations, and the last recorded entry for all other variables. This can, however, be altered as necessary.
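The default collapse described above can be approximated with a pandas groupby. The sketch below is only an illustration: the output column names (count, first_date, last_date) are assumptions, and the actual names produced by the function may differ.

```python
import pandas as pd

# Hypothetical consultation-based dataframe (a subset of the example table).
consults = pd.DataFrame({
    'consultID': [1001, 1002, 1003, 1004, 1005],
    'date': pd.to_datetime(['2017-01-23', '2017-01-25', '2017-01-29',
                            '2017-02-05', '2017-02-11']),
    'patientID': [7324, 7324, 7324, 9767, 9767],
    'match': ['catA', 'catB', 'catA', 'catB', 'catB'],
    'var2': [454, 345, 879, 276, 478],
})

# Sort by date so that 'last' means the most recent entry, then
# collapse to one row per patient.
patients = (consults.sort_values('date')
            .groupby('patientID', as_index=False)
            .agg(count=('consultID', 'count'),   # number of consultations
                 first_date=('date', 'min'),     # first consultation
                 last_date=('date', 'max'),      # last consultation
                 match=('match', 'last'),        # last recorded entry
                 var2=('var2', 'last')))
print(patients)
```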

MERGE CASE-CONTROL DATASET WITH ORIGINAL DATAFRAMES The initial selection of the case-control dataset returns a minimalist dataframe that contains only the variables needed to make the selection. After the case-control dataframe has been selected, it is necessary to merge it with the original dataframe to return a complete dataset that contains all the original variables.
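As a rough illustration of this merge step, a skeleton dataframe can be left-joined to the original data on the unique identifier. The dataframes below are hypothetical; the skeleton simply mimics the kind of output described for phjSelectCaseControlDataset().

```python
import pandas as pd

# Hypothetical original data.
all_df = pd.DataFrame({
    'consultID': [1001, 1002, 1003, 1004],
    'patientID': [7324, 7324, 9767, 3452],
    'var2': [454, 345, 276, 222],
})

# Hypothetical skeleton: identifier plus case/control status only.
skeleton = pd.DataFrame({'consultID': [1001, 1003, 1004],
                         'case': [1, 0, 0]})

# Left join keeps every selected case and control; validate raises an error
# if the identifier is duplicated on either side.
full = skeleton.merge(all_df, on='consultID', how='left',
                      validate='one_to_one')
print(full)
```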

POINTS TO NOTE

Collapsing a consultation-based dataframe to a patient-based dataframe requires a lot of computer processing and can be slow. As a result, the consultation-based dataframe is only collapsed to a patient-based dataframe after the controls have been selected; this ensures that only a minimal amount of computer processing is required to collapse the dataset.
The list of confirmed cases can be passed to the function either as a list (or series) with no other variables, or as a dataframe that contains several variables, one of which is the consultation ID or patient ID (depending on whether the required control dataset consists of consultations or patients).
Two dataframes need to be passed to the functions, one is a dataframe of 'ALL' data (including all necessary variables) and the other is a dataframe (or series or list) of confirmed cases. If the confirmed cases are all included in the dataframe of 'ALL' data then the final returned dataset (containing all the necessary other variables of interest) will be recreated from the dataframe of 'ALL' data. This means that any edits included in the dataframe of confirmed cases will be lost in favour of recreating the data from source. However, if the dataframe of cases contains some consultations or patients that are not included in the original dataframe then the returned dataframe will contain the data included in confirmed cases dataframe.

Selecting consultation controls

There are two main functions that can be used to create a case-control dataset:

phjGenerateCaseControlDataset()

This function ultimately calls the phjSelectCaseControlDataset() function but it also attempts to automate a large proportion of the required pre- and post-production faffing around. For example, the function will determine whether a consultation-based or patient-based dataset is required, it will generate the dataframe of potential controls automatically and will merge the skeleton dataframe returned by phjSelectCaseControlDataset() function to produce a dataframe that is complete with all the variables that were included in the original dataframes.
phjSelectCaseControlDataset()

This function takes, as arguments, two dataframes, one of confirmed cases and the other of potential controls. It then returns a 'skeleton' dataframe containing the minimal number of variables (e.g. ID, case/control status and, if a matched control set was required, group membership). It will be necessary to merge this skeleton with appropriate dataframes to produce a complete case-control dataset that contains all the variables required for further analysis.

phjGenerateCaseControlDataset()

This function tries to deliver a complete solution for selecting controls for use in case-control studies.

Function parameters

Exceptions raised

AssertionError if entered variables are invalid.
FileNotFoundError if path and name of file containing screening regex does not exist.
re.error if screening regex (entered as an argument or retrieved from a text file) does not compile.
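The last two exceptions come from standard Python behaviour and can be reproduced without the library. The helper below is hypothetical (it is not the function's actual code) and only shows where each exception would arise.

```python
import re

def load_screening_regex(regex_str=None, path=None):
    """Hypothetical helper: compile a regex given as a string or in a text file."""
    if regex_str is None and path is not None:
        # open() raises FileNotFoundError if the file does not exist.
        with open(path, encoding='utf-8') as f:
            regex_str = f.read().strip()
    # re.compile() raises re.error if the pattern is not valid.
    return re.compile(regex_str)

outcomes = []
for kwargs in ({'regex_str': r'vomit(ing)?'},                 # valid pattern
               {'regex_str': r'vomit('},                      # unbalanced bracket
               {'path': 'no_such_screening_regex_file.txt'}): # missing file
    try:
        load_screening_regex(**kwargs)
        outcomes.append('compiled')
    except FileNotFoundError:
        outcomes.append('FileNotFoundError')
    except re.error:
        outcomes.append('re.error')

print(outcomes)
```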

Returns

Other notes

As mentioned previously, there are some limitations that should be recognised when passing case data. It is, therefore, important to pass suitable case data. The function should be passed a full dataframe containing 'ALL' the data. In fact, some of the confirmed cases need not be included in the dataframe of 'ALL' data (but there are some limitations if this is the case). The requested case-control dataset can be either 'consultation-based' or 'patient-based'. In each of these cases, the confirmed cases can be passed in one of several formats (but, in some situations, returning a valid case-control dataset may not be feasible).

Consultation-based dataset requested

Cases passed as a SERIES of consultation ID numbers, all of which are included in the dataframe of 'ALL' data. SUCCESS. Returned dataframe will contain variables reconstructed from 'ALL' data.
Cases passed as a SERIES of consultation ID numbers, some of which are not included in the dataframe of 'ALL' data. FAILED. Required variables missing for some cases.
Cases passed as a DATAFRAME containing several variables, one of which is the CONSULTATION ID and all consultations are a subset of the consultations in the dataframe of 'ALL' data. SUCCESS
Cases passed as a DATAFRAME containing several variables, one of which is the CONSULTATION ID but not all consultations are included in the dataframe of 'ALL' data. FAILED
Cases passed as a DATAFRAME containing all the same variables as included in the 'ALL' dataframe. Not all consultations are included in the dataframe of 'ALL' data. SUCCESS

Patient-based dataset requested

Cases passed as a SERIES of case PATIENT IDs that are a subset of the information in the dataframe of 'ALL' data. SUCCESS
Cases passed as a SERIES of case PATIENT IDs that are NOT a subset of the information in the dataframe of 'ALL' data (e.g. there may be extra rows). FAILED
Cases passed as a DATAFRAME containing several variables, one of which is the PATIENT ID and all patients are a subset of the patients in the dataframe of 'ALL' data. SUCCESS
Cases passed as a DATAFRAME containing several variables, one of which is the PATIENT ID but not all patients are a subset of the patients in the dataframe of 'ALL' data. FAILED
Cases passed as a DATAFRAME containing numerous variables, one of which is the PATIENT ID and all patients are a subset of the patients in the dataframe of 'ALL' data. The variables are the same as those that will be produced when the consultation dataframe is collapsed based on patient ID data. SUCCESS
Cases passed as a DATAFRAME containing numerous variables, one of which is the PATIENT ID but the patients are NOT a subset of the patients in the dataframe of 'ALL' data. The variables are the same as those that will be produced when the consultation dataframe is collapsed based on patient ID data. SUCCESS
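Most of the scenarios listed above hinge on whether the case IDs are a subset of the IDs in the 'ALL' dataframe. A quick way to check this beforehand, sketched here with hypothetical ID values, is:

```python
import pandas as pd

# Hypothetical IDs in the 'ALL' dataframe and in the list of confirmed cases.
all_ids = pd.Series([1001, 1002, 1003, 1004], name='consultID')
case_ids = pd.Series([1002, 1004, 2001], name='consultID')

# Case IDs that do not appear in the 'ALL' data.
missing = case_ids[~case_ids.isin(all_ids)]
all_cases_in_all_data = missing.empty

print(all_cases_in_all_data)
print(missing.tolist())
```

If any IDs are missing and the cases were passed as a bare series, there is nowhere to recover the other variables from, which corresponds to the FAILED scenarios above.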

phjSelectCaseControlDataset()

The phjSelectCaseControlDataset() function can be used independently to select case-control datasets from the SAVSNET database. It receives, as parameters, two Pandas dataframes, one containing known cases and, the other, potential controls. For unmatched controls, the algorithm selects the requested number of random controls from the database whilst for matched controls, the algorithm steps through each case in turn and selects the relevant number of control subjects from the second dataframe, matching on the list of variables provided as an argument to the function. The function then adds the details of the case and the selected controls to a separate, pre-defined dataframe before moving onto the next case.

Initially, the phjSelectCaseControlDataset() function calls phjParameterCheck() to check that passed parameters meet specified criteria (e.g. ensure lists are lists and ints are ints etc.). If all requirements are met, phjParameterCheck() returns True and phjSelectCaseControlDataset() continues.

The function requires a parameter called phjMatchingVariablesList. If this parameter is None (the default), an unmatched case-control dataset is produced. If, however, the parameter is a list of variable names, the function will return a dataset where controls have been matched on the variables in the list.
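For the unmatched situation, the selection amounts to drawing a random sample of controls without replacement. The following sketch shows the idea in plain pandas; it is an illustration with made-up data and a fixed random seed, not the function's own code.

```python
import pandas as pd

# Hypothetical cases and potential controls.
cases = pd.DataFrame({'animalID': [1, 2, 3]})
controls = pd.DataFrame({'animalID': list(range(11, 21))})
controls_per_case = 2

# Draw (number of cases x controls per case) controls without replacement.
sampled = controls.sample(n=len(cases) * controls_per_case,
                          replace=False, random_state=42)

# Stack cases (case = 1) and sampled controls (case = 0) into a skeleton.
skeleton = pd.concat([cases[['animalID']].assign(case=1),
                      sampled[['animalID']].assign(case=0)],
                     ignore_index=True)
print(skeleton)
```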

The phjSelectCaseControlDataset() function proceeds as follows:

Creates an empty dataframe in which selected cases and controls will be stored.
Steps through each case in the phjCasesDF dataframe, one at a time.
Gets data from matched variables for the case and stores it in a dict.
Creates a mask for the controls dataframe to select all controls that match the case on the matched variables.
Applies the mask to the controls dataframe and counts the number of potential matches.
Adds the case and its controls to the dataframe (through a call to the phjAddRecords() function).
Removes added control records from the potential controls dataframe so a single control cannot be selected more than once.
Returns a Pandas dataframe containing the list of cases and controls. This dataframe only contains columns for unique identifier, case and group ID. It will, therefore, need to be merged with the full database to get any additional required columns.
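The steps above can be sketched as a short, stand-alone function. This is an illustrative re-implementation and not the library code: the function name, the seed argument and the choice to skip cases with too few matching controls are all assumptions.

```python
import pandas as pd

def select_matched_controls(cases, controls, id_col, match_cols,
                            n_per_case, seed=0):
    """Illustrative sketch of matched control selection (not the library code)."""
    remaining = controls.copy()
    rows = []
    for group, (_, case) in enumerate(cases.iterrows()):
        # Values of the matching variables for this case.
        wanted = {col: case[col] for col in match_cols}
        # Mask of remaining controls that match on every variable.
        mask = pd.Series(True, index=remaining.index)
        for col, value in wanted.items():
            mask &= (remaining[col] == value)
        matches = remaining[mask]
        if len(matches) < n_per_case:
            continue  # assumption: skip cases with too few matching controls
        chosen = matches.sample(n=n_per_case, random_state=seed + group)
        # Record the case followed by its controls, all sharing one group ID.
        rows.append({id_col: case[id_col], 'group': group, 'case': 1})
        rows.extend({id_col: cid, 'group': group, 'case': 0}
                    for cid in chosen[id_col])
        # Drop chosen controls so that none can be selected twice.
        remaining = remaining.drop(chosen.index)
    return pd.DataFrame(rows, columns=[id_col, 'group', 'case'])

cases = pd.DataFrame({'animalID': [1, 2], 'sp': ['dog', 'cat']})
controls = pd.DataFrame({'animalID': [11, 12, 13, 14, 15, 16],
                         'sp': ['dog', 'cat', 'dog', 'dog', 'cat', 'cat']})
result = select_matched_controls(cases, controls, 'animalID', ['sp'],
                                 n_per_case=2)
print(result)
```

Removing each chosen control from the pool before moving to the next case is what gives sampling without replacement across the whole dataset.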

Function parameters

The function takes the following parameters:

phjCasesDF

Pandas dataframe containing list of cases.
phjPotentialControlsDF

Pandas dataframe containing a list of potential control cases.
phjUniqueIdentifierVarName

Name of variable that acts as a unique identifier (e.g. consultation ID number would be a good example). N.B. In some cases, the consultation number is not unique but has been entered several times in the database, sometimes in very quick succession (ms). Data must be cleaned to ensure that the unique identifier variable is, indeed, unique.
phjMatchingVariablesList (Default = None)

List of variable names for which the cases and controls should be matched. Must be a list. The default is None.
phjControlsPerCaseInt (Default = 1)

Number of controls that should be selected per case.
phjPrintResults (Default = False)

Print verbose output during execution of scripts. If running in a Jupyter Notebook, setting phjPrintResults = True produces a lot of output and can cause problems connecting to the kernel.
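The uniqueness requirement on phjUniqueIdentifierVarName noted above can be checked, and duplicates removed, before calling the function. The following is a minimal sketch with hypothetical data; keeping the first of each set of duplicates is just one possible cleaning rule.

```python
import pandas as pd

# Hypothetical data in which consultation 1002 was entered twice.
df = pd.DataFrame({'consultID': [1001, 1002, 1002, 1003],
                   'var2': [454, 345, 345, 879]})

# Check whether the identifier is unique.
is_unique_before = df['consultID'].is_unique

# Inspect every row involved in a duplication.
duplicates = df[df['consultID'].duplicated(keep=False)]

# One possible cleaning rule: keep the first occurrence of each identifier.
cleaned = df.drop_duplicates(subset='consultID', keep='first')

print(is_unique_before)
print(duplicates)
print(cleaned['consultID'].is_unique)
```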

Exceptions raised

AssertionError if entered variables are invalid.

Returns

Pandas dataframe containing a column containing the unique identifier variable, a column containing case/control identifier and – for matched case-control studies – a column containing a group identifier. The returned dataframe will need to be left-joined with another dataframe that contains additional required variables.

Other notes

Setting phjPrintResults = True can cause problems when running the script in a Jupyter Notebook.

Examples

Examples of the functions in use are given below:

Selecting unmatched controls

casesDF = pd.DataFrame({'animalID':[1,2,3,4,5],'var1':[43,45,34,45,56],'sp':['dog','dog','dog','dog','dog']})
potControlsDF = pd.DataFrame({'animalID':[11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30],
                              'var1':[34,54,34,23,34,45,56,67,56,67,78,98,65,54,34,76,87,56,45,34],
                              'sp':['dog','cat','dog','dog','cat','dog','cat','dog','cat','dog',
                                    'dog','dog','dog','cat','dog','cat','dog','dog','dog','cat']})

print("This dataframe contains all the cases of disease\n")
print(casesDF)
print("\n")
print("This dataframe contains all the animals you could potentially use as controls\n")
print(potControlsDF)
print("\n")

unmatchedDF = epy.phjSelectCaseControlDataset(phjCasesDF = casesDF,
                                              phjPotentialControlsDF = potControlsDF,
                                              phjUniqueIdentifierVarName = 'animalID',
                                              phjMatchingVariablesList = None,
                                              phjControlsPerCaseInt = 2,
                                              phjPrintResults = False)

print(unmatchedDF)

This produces the following output:

This dataframe contains all the cases of disease

   animalID  var1   sp
0         1    43  dog
1         2    45  dog
2         3    34  dog
3         4    45  dog
4         5    56  dog


This dataframe contains all the animals you could potentially use as controls

    animalID  var1   sp
0         11    34  dog
1         12    54  cat
2         13    34  dog
3         14    23  dog
4         15    34  cat
5         16    45  dog
6         17    56  cat
7         18    67  dog
8         19    56  cat
9         20    67  dog
10        21    78  dog
11        22    98  dog
12        23    65  dog
13        24    54  cat
14        25    34  dog
15        26    76  cat
16        27    87  dog
17        28    56  dog
18        29    45  dog
19        30    34  cat


    case  animalID
0      1         1
1      1         2
2      1         3
3      1         4
4      1         5
5      0        18
6      0        25
7      0        24
8      0        14
9      0        22
10     0        12
11     0        27
12     0        16
13     0        13
14     0        30

Selecting controls that are matched to cases on variable 'sp'

casesDF = pd.DataFrame({'animalID':[1,2,3,4,5],'var1':[43,45,34,45,56],'sp':['dog','dog','dog','dog','dog']})
potControlsDF = pd.DataFrame({'animalID':[11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30],
                              'var1':[34,54,34,23,34,45,56,67,56,67,78,98,65,54,34,76,87,56,45,34],
                              'sp':['dog','cat','dog','dog','cat','dog','cat','dog','cat','dog',
                                    'dog','dog','dog','cat','dog','cat','dog','dog','dog','cat']})

print("This dataframe contains all the cases of disease\n")
print(casesDF)
print("\n")
print("This dataframe contains all the animals you could potentially use as controls\n")
print(potControlsDF)
print("\n")

matchedDF = epy.phjSelectCaseControlDataset(phjCasesDF = casesDF,
                                            phjPotentialControlsDF = potControlsDF,
                                            phjUniqueIdentifierVarName = 'animalID',
                                            phjMatchingVariablesList = ['sp'],
                                            phjControlsPerCaseInt = 2,
                                            phjPrintResults = False)

print(matchedDF)

Output

This dataframe contains all the cases of disease

   animalID   sp  var1
0         1  dog    43
1         2  dog    45
2         3  dog    34
3         4  dog    45
4         5  dog    56


This dataframe contains all the animals you could potentially use as controls

    animalID   sp  var1
0         11  dog    34
1         12  cat    54
2         13  dog    34
3         14  dog    23
4         15  cat    34
5         16  dog    45
6         17  cat    56
7         18  dog    67
8         19  cat    56
9         20  dog    67
10        21  dog    78
11        22  dog    98
12        23  dog    65
13        24  cat    54
14        25  dog    34
15        26  cat    76
16        27  dog    87
17        28  dog    56
18        29  dog    45
19        30  cat    34


UNMATCHED CONTROLS

    case  animalID
0      1         1
1      1         2
2      1         3
3      1         4
4      1         5
5      0        22
6      0        13
7      0        30
8      0        18
9      0        25
10     0        28
11     0        14
12     0        15
13     0        24
14     0        19


MATCHED CONTROLS

   animalID group case   sp
0         1     0    1  dog
1        28     0    0  dog
2        16     0    0  dog
3         2     1    1  dog
4        25     1    0  dog
5        27     1    0  dog
6         3     2    1  dog
7        21     2    0  dog
8        11     2    0  dog
9         4     3    1  dog
10       18     3    0  dog
11       14     3    0  dog
12        5     4    1  dog
13       22     4    0  dog
14       29     4    0  dog