Generate or select a matched or unmatched case control dataset - lvphj/epydemiology GitHub Wiki
Python functions to randomly select and generate matched or unmatched case-control datasets (without replacement) from data in Pandas dataframes.
phjGenerateCaseControlDataset()
import pandas as pd
import epydemiology as epy
df = epy.phjGenerateCaseControlDataset(phjAllDataDF,
phjConsultationIDVarName,
phjPatientIDVarName,
phjCasesDF,
phjMatchingVariablesList = None,
phjControlsPerCaseInt = 1,
phjScreeningRegexStr = None,
phjScreeningRegexPathAndFileName = None,
phjFreeTextVarName = None,
phjControlType = 'consultation',
phjConsultationDateVarName = None,
phjAggDict = None,
phjPrintResults = False)
phjSelectCaseControlDataset()
import pandas as pd
import epydemiology as epy
df = epy.phjSelectCaseControlDataset(phjCasesDF,
phjPotentialControlsDF,
phjUniqueIdentifierVarName,
phjMatchingVariablesList = None,
phjControlsPerCaseInt = 1,
phjPrintResults = False)
Description
These two functions are closely related. In fact, the phjGenerateCaseControlDataset() function calls the phjSelectCaseControlDataset() function as part of the data selection process, but both functions can be used independently. In essence, the difference between them is that the phjSelectCaseControlDataset() function takes as arguments two dataframes, one containing all the data that can be used as potential controls and the other containing a list of cases, and returns a skeletal, bare-bones dataframe containing the unique IDs of cases and controls together with case/control status and, in the case of matched controls, the group membership. This dataframe needs to be merged with the original data to retrieve all the original variables. The phjGenerateCaseControlDataset() function, in contrast, seeks to automate the entire process. It takes a dataframe of all data and a list of cases and returns either a consultation-based or patient-based case-control dataset that contains all the original variables.
General workflow
These functions were written to streamline a commonly-encountered workflow in our research group, namely the need to randomly select matched or unmatched controls from a large dataset, having screened and confirmed the identification of cases. The controls that are selected to go with cases could be either consultation controls (i.e. a random selection of consultations from any animals not represented in the cases dataset) or patient controls (i.e. a random selection of animals that are not represented in the case dataset). In the latter case, consultation-specific information needs to be collapsed on patient ID to produce patient-based information.
It is assumed the cases and controls are ideally* stored in the same flat-file dataframe having the following basic structure:
| consultID | date | patientID | match | freetext | var2 | var3 |
|-----------|------------|-----------|-------|----------|------|------|
| 1001 | 2017-01-23 | 7324 | catA | a | 454 | low |
| 1002 | 2017-01-25 | 7324 | catB | b | 345 | low |
| 1003 | 2017-01-29 | 7324 | catA | c | 879 | low |
| 1004 | 2017-02-05 | 9767 | catB | a | 276 | mid |
| 1005 | 2017-02-11 | 9767 | catB | b | 478 | mid |
| 1006 | 2017-02-28 | 3452 | catA | c | 222 | mid |
| 1007 | 2017-03-23 | 5322 | catA | a | 590 | hi |
| 1008 | 2017-03-23 | 5322 | catB | b | 235 | hi |
| 1009 | 2017-04-02 | 5322 | catB | c | 657 | hi |
etc.
\* Cases that are not part of the whole dataset can be provided as a dataframe but the extra columns in the dataframe (i.e. other variables) need to be the same as the columns in the dataset containing 'all' the data from which the controls will be selected.
The following provides a brief description of the workflow used to identify and generate case-control datasets.
- IDENTIFY CASES The first step, prior to using these functions, is to identify cases within the data set.
  - The whole database (or a partial excerpt) is downloaded and stored in a pandas dataframe. The data set consists of consultation ID, date of consultation, patient ID, freetext clinical narrative, variables to be used for matching (if required) and any other variables of interest.
  - Potential cases are identified using a screening regex applied to the freetext clinical narrative.
  - The researcher manually reads consultations of potential cases to confirm that they are cases. The consultation numbers of confirmed cases are recorded either alone or as a slice of the dataframe.
- IDENTIFY POTENTIAL CONTROLS The potential controls may be drawn from a larger range of data than was used to select the cases. As a result, it is important that the potential controls do not include any consultations that would have been identified as cases had they been included in the initial screening of cases.
  - Identify all consultations that would have been identified as a potential CASE using the screening regex and identify the corresponding patient ID.
  - Identify all corresponding patient IDs for consultations identified as confirmed cases.
  - Remove all consultations from confirmed and potential case patients (regardless of whether the individual consultation was positive or negative). If a patient has one consultation where the regex identifies a match, all consultations from that animal should be excluded from the list of potential controls. The remaining consultations are, therefore, potential controls.
- SELECT CONTROL DATASET
  - Select suitable controls from the dataframe of potential controls, either unmatched or matched on given variables. The controls can be either consultation controls (where individual consultations are selected from the dataframe) or patient controls (where patients are selected).
  - When selecting patient controls, it is necessary to collapse the consultation-based dataframe down to a patient-based dataframe, grouping on patient ID. By default, a collapsed dataframe will contain a 'count' variable to indicate how many consultations were recorded for each patient, the dates of the first and last consultations, and the last recorded entry for all other variables. This can, however, be altered as necessary.
- MERGE CASE-CONTROL DATASET WITH ORIGINAL DATAFRAMES The initial selection of the case-control dataset returns a minimalist dataframe that contains the bare minimum of variables needed to make the selection. After the case-control dataframe has been selected, it is necessary to merge it with the original dataframe to return a complete dataset that contains all the original variables.
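The "identify potential controls" step above can be sketched in plain pandas. This is a minimal illustration, not the code used inside phjGenerateCaseControlDataset(); the toy data, the regex and the variable names are all assumptions made for the example.

```python
import pandas as pd

# Toy consultation data using the column layout shown earlier (values are illustrative).
all_df = pd.DataFrame({
    'consultID': [1001, 1002, 1003, 1004, 1005, 1006],
    'patientID': [7324, 7324, 9767, 9767, 3452, 5322],
    'freetext': ['vomiting today', 'recheck, well', 'booster vaccine',
                 'nail clip', 'Vomit x2 overnight', 'dental check'],
})

screening_regex = r'vomit'      # hypothetical screening regex
confirmed_case_ids = [1001]     # consultations confirmed as cases by manual reading

# Consultations flagged by the screening regex (potential cases).
flagged = all_df['freetext'].str.contains(screening_regex, case=False, regex=True, na=False)

# Patients with any flagged consultation or any confirmed case consultation.
excluded_patients = set(all_df.loc[flagged, 'patientID']) | set(
    all_df.loc[all_df['consultID'].isin(confirmed_case_ids), 'patientID'])

# Drop every consultation from those patients; what remains are potential controls.
potential_controls_df = all_df[~all_df['patientID'].isin(excluded_patients)]
print(potential_controls_df)
```

Note that patient 7324 loses both consultations, even though only one of them matched the regex, which is the behaviour the workflow calls for.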
POINTS TO NOTE
- Collapsing a consultation-based dataframe to a patient-based dataframe requires a lot of computer processing and can be slow. As a result, collapsing the consultation-based dataframe to a patient-based dataframe is only done after the controls have been selected; this ensures that only a minimal amount of computer processing is required to collapse the dataset.
- The list of confirmed cases can be passed to the function either as a list (or series) with no other variables, or as a dataframe which contains several variables, one of which is the consultation ID or patient ID (depending on whether the required control dataset consists of consultations or patients).
- Two dataframes need to be passed to the functions: one is a dataframe of 'ALL' data (including all necessary variables) and the other is a dataframe (or series or list) of confirmed cases. If the confirmed cases are all included in the dataframe of 'ALL' data then the final returned dataset (containing all the necessary other variables of interest) will be recreated from the dataframe of 'ALL' data. This means that any edits included in the dataframe of confirmed cases will be lost in favour of recreating the data from source. However, if the dataframe of cases contains some consultations or patients that are not included in the original dataframe then the returned dataframe will contain the data included in the confirmed cases dataframe.
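To make the collapsing step concrete, the following sketch reduces a consultation-based dataframe to one row per patient with a consultation count, first and last dates, and the last recorded value of the other variables. It uses ordinary pandas groupby aggregation; the output column names first_date and last_date are assumptions, and the sketch does not show how the phjAggDict argument is interpreted by the library.

```python
import pandas as pd

consult_df = pd.DataFrame({
    'consultID': [1001, 1002, 1003, 1004, 1005],
    'date': pd.to_datetime(['2017-01-23', '2017-01-25', '2017-01-29',
                            '2017-02-05', '2017-02-11']),
    'patientID': [7324, 7324, 7324, 9767, 9767],
    'match': ['catA', 'catB', 'catA', 'catB', 'catB'],
    'var2': [454, 345, 879, 276, 478],
})

# Sort by date so that 'last' really is the most recent entry for each patient.
consult_df = consult_df.sort_values(['patientID', 'date'])

patient_df = (consult_df
              .groupby('patientID')
              .agg(count=('consultID', 'count'),     # number of consultations
                   first_date=('date', 'min'),       # date of first consultation
                   last_date=('date', 'max'),        # date of last consultation
                   match=('match', 'last'),          # last recorded entry
                   var2=('var2', 'last'))
              .reset_index())
print(patient_df)
```

Because the aggregation touches every row of every group, running it only on the selected cases and controls, rather than on the full dataset, keeps the cost down.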
Selecting consultation controls
There are two main functions that can be used to create a case-control dataset:
- phjGenerateCaseControlDataset()
  This function ultimately calls the phjSelectCaseControlDataset() function but it also attempts to automate a large proportion of the required pre- and post-production faffing around. For example, the function will determine whether a consultation-based or patient-based dataset is required, it will generate the dataframe of potential controls automatically and it will merge the skeleton dataframe returned by the phjSelectCaseControlDataset() function to produce a dataframe that is complete with all the variables that were included in the original dataframes.
- phjSelectCaseControlDataset()
  This function takes, as arguments, two dataframes, one of confirmed cases and the other of potential controls. It then returns a 'skeleton' dataframe containing the minimal number of variables (e.g. ID, case/control and group membership, if a matched control set was required). It will be necessary to merge this skeleton with appropriate dataframes to produce a complete case-control dataset that contains all the necessary variables required for further analysis.
phjGenerateCaseControlDataset()
This function tries to deliver a complete solution for selecting controls for use in case-control studies.
Function parameters
Exceptions raised
- AssertionError if entered variables are invalid.
- FileNotFoundError if path and name of file containing screening regex does not exist.
- re.error if screening regex (entered as an argument or retrieved from a text file) does not compile.
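As a rough illustration of the last exception, a screening regex can be checked with Python's standard re module before it is passed to the function. The helper name below is hypothetical and is not part of epydemiology.

```python
import re

def compile_screening_regex(pattern):
    """Return a compiled regex, or None if the pattern does not compile."""
    try:
        return re.compile(pattern, flags=re.IGNORECASE)
    except re.error as err:
        # re.error is the exception raised by the re module for invalid patterns.
        print(f"Screening regex did not compile: {err}")
        return None

good = compile_screening_regex(r'vomit(ing|ed)?')
bad = compile_screening_regex(r'vomit(ing')   # unbalanced parenthesis
```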
Returns
Other notes
As mentioned previously, there are some limitations that should be recognised when passing case data, so it is important to pass suitable case data. The function should be passed a full dataframe containing 'ALL' the data. Some of the confirmed cases need not be included in the dataframe of 'ALL' data (but there are some limitations if this is the case). The requested case-control dataset can be either 'consultation-based' or 'patient-based'. In each of these cases, the confirmed cases can be passed in one of several formats (but, in some situations, returning a valid case-control dataset may not be feasible).
- Consultation-based dataset requested
  - Cases passed as a SERIES of consultation ID numbers, all of which are included in the dataframe of 'ALL' data. SUCCESS. Returned dataframe will contain variables reconstructed from 'ALL' data.
  - Cases passed as a SERIES of consultation ID numbers, some of which are not included in the dataframe of 'ALL' data. FAILED. Required variables missing for some cases.
  - Cases passed as a DATAFRAME containing several variables, one of which is the CONSULTATION ID and all consultations are a subset of the consultations in the dataframe of 'ALL' data. SUCCESS
  - Cases passed as a DATAFRAME containing several variables, one of which is the CONSULTATION ID but not all consultations are included in the dataframe of 'ALL' data. FAILED
  - Cases passed as a DATAFRAME containing all the same variables as included in the 'ALL' dataframe. Not all consultations are included in the dataframe of 'ALL' data. SUCCESS
- Patient-based dataset requested
  - Cases passed as a SERIES of case PATIENT IDs that are a subset of the information in the dataframe of 'ALL' data. SUCCESS
  - Cases passed as a SERIES of case PATIENT IDs that are NOT a subset of the information in the dataframe of 'ALL' data (e.g. there may be extra rows). FAILED
  - Cases passed as a DATAFRAME containing several variables, one of which is the PATIENT ID and all patients are a subset of the patients in the dataframe of 'ALL' data. SUCCESS
  - Cases passed as a DATAFRAME containing several variables, one of which is the PATIENT ID but not all patients are a subset of the patients in the dataframe of 'ALL' data. FAILED
  - Cases passed as a DATAFRAME containing numerous variables, one of which is the PATIENT ID and all patients are a subset of the patients in the dataframe of 'ALL' data. The variables are the same as those that will be produced when the consultation dataframe is collapsed based on patient ID data. SUCCESS
  - Cases passed as a DATAFRAME containing numerous variables, one of which is the PATIENT ID but the patients are NOT a subset of the patients in the dataframe of 'ALL' data. The variables are the same as those that will be produced when the consultation dataframe is collapsed based on patient ID data. SUCCESS
phjSelectCaseControlDataset()
The phjSelectCaseControlDataset() function can be used independently to select case-control datasets from the SAVSNET database. It receives, as parameters, two Pandas dataframes, one containing known cases and the other containing potential controls. For unmatched controls, the algorithm selects the requested number of random controls from the database whilst for matched controls, the algorithm steps through each case in turn and selects the relevant number of control subjects from the second dataframe, matching on the list of variables provided as an argument to the function. The function then adds the details of the case and the selected controls to a separate, pre-defined dataframe before moving on to the next case.
Initially, the phjSelectCaseControlDataset() function calls phjParameterCheck() to check that passed parameters meet specified criteria (e.g. ensure lists are lists and ints are ints etc.). If all requirements are met, phjParameterCheck() returns True and phjSelectCaseControlDataset() continues.
The function requires a parameter called phjMatchingVariablesList. If this parameter is None (the default), an unmatched case-control dataset is produced. If, however, the parameter is a list of variable names, the function will return a dataset where controls have been matched on the variables in the list.
The phjSelectCaseControlDataset() function proceeds as follows:
- Creates an empty dataframe in which selected cases and controls will be stored.
- Steps through each case in the phjCasesDF dataframe, one at a time.
  - Gets data from the matching variables for the case and stores it in a dict.
  - Creates a mask for the controls dataframe to select all controls that match the case on the matching variables.
  - Applies the mask to the controls dataframe and counts the number of potential matches.
  - Adds the case and its controls to the dataframe (through a call to the phjAddRecords() function).
  - Removes the added control records from the potential controls dataframe so that a single control cannot be selected more than once.
- Returns a Pandas dataframe containing the list of cases and controls. This dataframe only contains columns for unique identifier, case and group id. It will, therefore, need to be merged with the full database to get any additional required columns.
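The matched-selection loop described above can be sketched as follows. This is an illustrative re-implementation under simplifying assumptions, not the library's code: the function name, the seed argument and the choice to take fewer controls when the pool runs short are all hypothetical.

```python
import pandas as pd

def select_matched_controls(cases_df, controls_df, id_col, match_cols,
                            n_controls=1, seed=None):
    """Hypothetical helper sketching the loop above; not the library's code."""
    pool = controls_df.copy()      # shrinks as controls are used up
    rows = []                      # skeleton rows: id, group, case flag
    for group, (_, case) in enumerate(cases_df.iterrows()):
        # Mask: remaining controls that equal the case on every matching variable.
        mask = pd.Series(True, index=pool.index)
        for col in match_cols:
            mask &= (pool[col] == case[col])
        candidates = pool[mask]
        # Take the requested number, or all that remain if the pool runs short.
        n_take = min(n_controls, len(candidates))
        chosen = candidates.sample(n=n_take, random_state=seed) if n_take else candidates
        rows.append({id_col: case[id_col], 'group': group, 'case': 1})
        rows.extend({id_col: cid, 'group': group, 'case': 0} for cid in chosen[id_col])
        # Remove chosen controls so no control is selected more than once.
        pool = pool.drop(chosen.index)
    return pd.DataFrame(rows, columns=[id_col, 'group', 'case'])

cases = pd.DataFrame({'animalID': [1, 2], 'sp': ['dog', 'cat']})
controls = pd.DataFrame({'animalID': [11, 12, 13, 14, 15, 16],
                         'sp': ['dog', 'cat', 'dog', 'dog', 'cat', 'cat']})
skeleton = select_matched_controls(cases, controls, 'animalID', ['sp'],
                                   n_controls=2, seed=0)
print(skeleton)
```

Dropping the chosen rows from the pool inside the loop is what gives selection without replacement across cases.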
Function parameters
The function takes the following parameters:
- phjCasesDF
  Pandas dataframe containing list of cases.
- phjPotentialControlsDF
  Pandas dataframe containing a list of potential control cases.
- phjUniqueIdentifierVarName
  Name of variable that acts as a unique identifier (e.g. consultation ID number). N.B. In some cases, the consultation number is not unique but has been entered several times in the database, sometimes in very quick succession (ms). Data must be cleaned to ensure that the unique identifier variable is, indeed, unique.
- phjMatchingVariablesList (Default = None)
  List of variable names for which the cases and controls should be matched. Must be a list. The default is None.
- phjControlsPerCaseInt (Default = 1)
  Number of controls that should be selected per case.
- phjPrintResults (Default = False)
  Print verbose output during execution of scripts. If running in a Jupyter Notebook, setting phjPrintResults = True causes a lot of output and can cause problems connecting to the kernel.
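Since the unique identifier must genuinely be unique, a quick check before calling the function can save trouble. The sketch below uses standard pandas calls on made-up data; keeping the first occurrence is only one possible clean-up rule and may not suit every dataset.

```python
import pandas as pd

df = pd.DataFrame({'consultID': [1001, 1002, 1002, 1003],
                   'patientID': [7324, 7324, 7324, 9767]})

# False here, because 1002 appears twice.
print(df['consultID'].is_unique)

# Show every row involved in a duplicate so it can be inspected first.
dupes = df[df['consultID'].duplicated(keep=False)]
print(dupes)

# One possible clean-up: keep the first occurrence of each identifier.
clean_df = df.drop_duplicates(subset='consultID', keep='first')
print(clean_df['consultID'].is_unique)
```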
Exceptions raised
- AssertionError if entered variables are invalid.
Returns
Pandas dataframe containing a column containing the unique identifier variable, a column containing case/control identifier and – for matched case-control studies – a column containing a group identifier. The returned dataframe will need to be left-joined with another dataframe that contains additional required variables.
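A minimal sketch of that left join is shown below, using a hand-written skeleton in place of real function output. The dataframe names and values are illustrative assumptions; validate='one_to_one' is an optional pandas safeguard that raises an error if the identifier is duplicated on either side.

```python
import pandas as pd

# Hand-written stand-in for the skeleton returned by the function.
skeleton_df = pd.DataFrame({'animalID': [1, 28, 16],
                            'group': [0, 0, 0],
                            'case': [1, 0, 0]})

casesDF = pd.DataFrame({'animalID': [1], 'var1': [43], 'sp': ['dog']})
potControlsDF = pd.DataFrame({'animalID': [16, 28, 30],
                              'var1': [45, 56, 34],
                              'sp': ['dog', 'dog', 'cat']})

# Stack cases and potential controls so every ID in the skeleton can be looked up.
all_data_df = pd.concat([casesDF, potControlsDF], ignore_index=True)

# The left join keeps exactly the skeleton's rows, in the skeleton's order.
full_df = skeleton_df.merge(all_data_df, on='animalID', how='left',
                            validate='one_to_one')
print(full_df)
```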
Other notes
Setting phjPrintResults = True can cause problems when running the script in a Jupyter Notebook.
Examples
Examples of the functions in use are given below:
Selecting unmatched controls
casesDF = pd.DataFrame({'animalID':[1,2,3,4,5],'var1':[43,45,34,45,56],'sp':['dog','dog','dog','dog','dog']})
potControlsDF = pd.DataFrame({'animalID':[11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30],
'var1':[34,54,34,23,34,45,56,67,56,67,78,98,65,54,34,76,87,56,45,34],
'sp':['dog','cat','dog','dog','cat','dog','cat','dog','cat','dog',
'dog','dog','dog','cat','dog','cat','dog','dog','dog','cat']})
print("This dataframe contains all the cases of disease\n")
print(casesDF)
print("\n")
print("This dataframe contains all the animals you could potentially use as controls\n")
print(potControlsDF)
print("\n")
unmatchedDF = epy.phjSelectCaseControlDataset(phjCasesDF = casesDF,
phjPotentialControlsDF = potControlsDF,
phjUniqueIdentifierVarName = 'animalID',
phjMatchingVariablesList = None,
phjControlsPerCaseInt = 2,
phjPrintResults = False)
print(unmatchedDF)
This produces the following output:
This dataframe contains all the cases of disease
animalID var1 sp
0 1 43 dog
1 2 45 dog
2 3 34 dog
3 4 45 dog
4 5 56 dog
This dataframe contains all the animals you could potentially use as controls
animalID var1 sp
0 11 34 dog
1 12 54 cat
2 13 34 dog
3 14 23 dog
4 15 34 cat
5 16 45 dog
6 17 56 cat
7 18 67 dog
8 19 56 cat
9 20 67 dog
10 21 78 dog
11 22 98 dog
12 23 65 dog
13 24 54 cat
14 25 34 dog
15 26 76 cat
16 27 87 dog
17 28 56 dog
18 29 45 dog
19 30 34 cat
case animalID
0 1 1
1 1 2
2 1 3
3 1 4
4 1 5
5 0 18
6 0 25
7 0 24
8 0 14
9 0 22
10 0 12
11 0 27
12 0 16
13 0 13
14 0 30
Selecting controls that are matched to cases on variable 'sp'
casesDF = pd.DataFrame({'animalID':[1,2,3,4,5],'var1':[43,45,34,45,56],'sp':['dog','dog','dog','dog','dog']})
potControlsDF = pd.DataFrame({'animalID':[11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30],
'var1':[34,54,34,23,34,45,56,67,56,67,78,98,65,54,34,76,87,56,45,34],
'sp':['dog','cat','dog','dog','cat','dog','cat','dog','cat','dog',
'dog','dog','dog','cat','dog','cat','dog','dog','dog','cat']})
print("This dataframe contains all the cases of disease\n")
print(casesDF)
print("\n")
print("This dataframe contains all the animals you could potentially use as controls\n")
print(potControlsDF)
print("\n")
matchedDF = epy.phjSelectCaseControlDataset(phjCasesDF = casesDF,
phjPotentialControlsDF = potControlsDF,
phjUniqueIdentifierVarName = 'animalID',
phjMatchingVariablesList = ['sp'],
phjControlsPerCaseInt = 2,
phjPrintResults = False)
print(matchedDF)
Output
This dataframe contains all the cases of disease
animalID sp var1
0 1 dog 43
1 2 dog 45
2 3 dog 34
3 4 dog 45
4 5 dog 56
This dataframe contains all the animals you could potentially use as controls
animalID sp var1
0 11 dog 34
1 12 cat 54
2 13 dog 34
3 14 dog 23
4 15 cat 34
5 16 dog 45
6 17 cat 56
7 18 dog 67
8 19 cat 56
9 20 dog 67
10 21 dog 78
11 22 dog 98
12 23 dog 65
13 24 cat 54
14 25 dog 34
15 26 cat 76
16 27 dog 87
17 28 dog 56
18 29 dog 45
19 30 cat 34
UNMATCHED CONTROLS
case animalID
0 1 1
1 1 2
2 1 3
3 1 4
4 1 5
5 0 22
6 0 13
7 0 30
8 0 18
9 0 25
10 0 28
11 0 14
12 0 15
13 0 24
14 0 19
MATCHED CONTROLS
animalID group case sp
0 1 0 1 dog
1 28 0 0 dog
2 16 0 0 dog
3 2 1 1 dog
4 25 1 0 dog
5 27 1 0 dog
6 3 2 1 dog
7 21 2 0 dog
8 11 2 0 dog
9 4 3 1 dog
10 18 3 0 dog
11 14 3 0 dog
12 5 4 1 dog
13 22 4 0 dog
14 29 4 0 dog