Unique Ids - matthewcornell/wikitest GitHub Wiki
Why Unique IDs
To keep track of cases through time, we need to determine whether two reported cases are the same physical case. To do this, we require the reports to match on several fields. If they match on all of those fields, we consider them to be different reports of the same case. To make reported cases easy to compare, we combined the fields into a single string. That string is our unique ID (UID).
How we originally did unique ids
We say that two reported cases are different reports of the same case if they have the same values for the following fields:
- disease
- province
- address_code
- date_sick
- birth_day
- birth_month
- birth_year
- race
- marital
- occupation
- ?
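A minimal sketch of how such a string could be built, assuming each reported case is available as a dictionary keyed by field name. The name `make_uid`, the `|` separator, and the handling of missing fields are assumptions for illustration, not the original implementation, and only the fields named in the list above are included.

```python
# Fields named in the list above; the original list may contain more.
UID_FIELDS = [
    "disease", "province", "address_code", "date_sick",
    "birth_day", "birth_month", "birth_year",
    "race", "marital", "occupation",
]


def make_uid(case, fields=UID_FIELDS, sep="|"):
    """Join the matching fields of one reported case into a single string.

    Hypothetical helper: a missing field becomes an empty string, and the
    separator is assumed not to occur inside any field value.
    """
    return sep.join(str(case.get(field, "")) for field in fields)


def same_case(case_a, case_b):
    """Two reports are treated as the same case when their UIDs are equal."""
    return make_uid(case_a) == make_uid(case_b)
```

Under this sketch, fields outside `UID_FIELDS` have no effect on the comparison, so two reports that differ only in such a field are still treated as the same case.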
Problems with UIDs
We encountered two main problems with matching based on UID:
Collisions
It is entirely possible for two different cases to have the same UID. Some of the fields above appear to be assigned standard values frequently when the data are missing. Even if this were not the case, a single report can contain multiple cases with the same UID. This is part of the reason we use UIDs only for determining date_delivered. We estimate that a few hundred cases collide in this way each year. This is still an issue, and it may mean that some cases appear to be delivered before their actual delivery date.
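Collisions inside a single report can be counted directly. The following is a sketch only, assuming the UIDs of one report are available as a list of strings; `colliding_uids` is a hypothetical name.

```python
from collections import Counter


def colliding_uids(uids):
    """Return each UID that occurs more than once in one report, with its count."""
    counts = Counter(uids)
    return {uid: n for uid, n in counts.items() if n > 1}
```

Because a single report should list each physical case once, any UID returned here points to distinct cases that the chosen fields cannot tell apart.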
Revisions
Some cases are removed or modified after their original submission. If any of the fields used to calculate the UID is modified, the case is assigned a different UID in the later report. This is still an issue, and it may mean that some cases appear to be delivered after their actual delivery date.
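To illustrate why a revision shifts the apparent delivery date, here is a sketch that assumes date_delivered is taken to be the date of the earliest report in which a UID appears. That definition is an assumption inferred from the text, and `first_seen` and the report layout are hypothetical.

```python
def first_seen(reports):
    """Map each UID to the earliest report date in which it appears.

    `reports` is an iterable of (report_date, uids) pairs. Dates are assumed
    to be ISO strings (YYYY-MM-DD), which sort correctly as plain strings.
    """
    earliest = {}
    for report_date, uids in reports:
        for uid in uids:
            if uid not in earliest or report_date < earliest[uid]:
                earliest[uid] = report_date
    return earliest


# Illustrative data: the case with UID "A" is revised before the second
# report, so it reappears there under the new UID "A2".
reports = [
    ("2015-01-01", ["A", "B"]),
    ("2015-01-15", ["A2", "B"]),
]
delivered = first_seen(reports)
```

In this example `delivered["A2"]` is the date of the second report, although the underlying case was first delivered in the first one.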
Direction for the future
In an effort to account for revisions, we wanted to consider a probabilistic matching framework, which would assign a probability that two reported cases are the same based on which of their fields agree and which differ. We could then use this to determine a probability distribution over quantities such as date_delivered.
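One common way to do this kind of matching is Fellegi-Sunter style record linkage, sketched below. This is not a method the project has adopted. The function name, the prior, and the per-field probabilities `m` (agreement given the same case) and `u` (agreement given different cases) are all assumptions that would need to be estimated from data, and fields are treated as independent.

```python
import math


def match_probability(case_a, case_b, m, u, prior=0.5):
    """Estimate the probability that two reported cases are the same case.

    m[field]: assumed P(field values agree | same case)
    u[field]: assumed P(field values agree | different cases)
    All probabilities must lie strictly between 0 and 1.
    """
    log_odds = math.log(prior / (1 - prior))
    for field in m:
        if case_a.get(field) == case_b.get(field):
            # Agreement is evidence for a match when m > u.
            log_odds += math.log(m[field] / u[field])
        else:
            # Disagreement is evidence against a match when m > u.
            log_odds += math.log((1 - m[field]) / (1 - u[field]))
    return 1 / (1 + math.exp(-log_odds))
```

Unlike exact UID comparison, a revised field lowers the match probability instead of ruling the match out, which is the behaviour needed to account for revisions.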