Unique Ids - matthewcornell/wikitest GitHub Wiki

Why Unique IDs

To track cases through time, we need to determine whether two reported cases refer to the same physical case. To do this, we require the reports to match on several fields. If they match on all of those fields, we treat them as different reports of the same case. To make the comparison easy, we combine the fields into a single string. That string is our unique ID (UID).

How we originally did unique IDs

We say that two reported cases are different reports of the same case if they have the same values for the following fields:

  1. disease
  2. province
  3. address_code
  4. date_sick
  5. birth_day
  6. birth_month
  7. birth_year
  8. race
  9. marital
  10. occupation
  11. ?
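
As a minimal sketch of the idea, the fields can be joined into one string with a separator. The helper name `make_uid`, the `|` separator, and the sample values are assumptions for illustration, not the original implementation. The eleventh field is not specified above, so it is left out.

```python
# Fields taken from the list above. The eleventh field is unspecified in the
# source, so it is not included here.
UID_FIELDS = [
    "disease", "province", "address_code", "date_sick",
    "birth_day", "birth_month", "birth_year",
    "race", "marital", "occupation",
]


def make_uid(case, fields=UID_FIELDS, sep="|"):
    """Join the matching fields into one string (hypothetical helper)."""
    # Missing values become empty strings so field positions stay aligned.
    return sep.join("" if case.get(f) is None else str(case[f]) for f in fields)


# Made-up example records: the same case reported twice, plus a different case.
first_report = {
    "disease": 26, "province": 10, "address_code": 100101,
    "date_sick": "2014-05-01", "birth_day": 3, "birth_month": 7,
    "birth_year": 1990, "race": 1, "marital": 2, "occupation": 5,
}
second_report = dict(first_report, report_week=19)  # extra field is ignored
other_case = dict(first_report, birth_year=1985)

same_case = make_uid(first_report) == make_uid(second_report)
different_case = make_uid(first_report) != make_uid(other_case)
```

Using a separator keeps adjacent values from running together, so `1` followed by `23` does not produce the same string as `12` followed by `3`.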

Problems with UIDs

We encountered two main problems with matching based on UIDs: collisions and revisions.

Collisions

It is entirely possible for two different cases to have the same UID. Some of the fields above appear to be frequently assigned standard values when the data are missing. Even if that were not so, a single report can contain multiple cases with the same UID. This is part of the reason we use UIDs only for determining date_delivered. We estimate that a few hundred cases collide in this way each year. This remains an issue and may cause some cases to appear to have been delivered before their actual delivery date.
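
The sketch below shows one way to count collisions within a single report, and why a collision can make a case look as if it was delivered early. It assumes date_delivered is taken as the earliest report date in which a UID appears. That reading, the function names, and the sample UIDs are assumptions for illustration.

```python
from collections import Counter
from datetime import date


def find_collisions(uids):
    """Return each UID that occurs more than once in one report, with its count."""
    counts = Counter(uids)
    return {uid: n for uid, n in counts.items() if n > 1}


def first_seen(reports):
    """Map each UID to the earliest report date in which it appears.

    Assumption: this earliest date is what is used as date_delivered.
    """
    earliest = {}
    for report_date, uids in reports.items():
        for uid in uids:
            if uid not in earliest or report_date < earliest[uid]:
                earliest[uid] = report_date
    return earliest


# Made-up UIDs. "A" appears twice in the second report, so one of those two
# rows is a distinct case that shares its UID with the case from the first report.
reports = {
    date(2014, 5, 5): ["A", "B"],
    date(2014, 5, 19): ["A", "A", "B", "C"],
}

collisions = find_collisions(reports[date(2014, 5, 19)])
delivered = first_seen(reports)
```

Here the second case with UID `A` can only inherit the earlier date, because the UID alone cannot distinguish it from the first.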

Revisions

Some cases are removed or modified after their original submission. If any of the fields used to calculate the UID is modified, the case is assigned a different UID in the later report. This remains an issue and may cause some cases to appear to have been delivered after their actual delivery date.
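
One way to flag possible revisions is to compare the UID sets of two consecutive reports. This is a sketch that assumes each report is a full snapshot, so a case that was not revised or removed keeps its UID from one report to the next. The function name and the sample UIDs are hypothetical.

```python
def revision_candidates(previous_uids, current_uids):
    """Compare two consecutive reports and return the UIDs that changed.

    A UID that disappears may belong to a removed or modified case.
    A UID that appears may be a new case or the modified version of an old one.
    """
    previous, current = set(previous_uids), set(current_uids)
    return {
        "disappeared": previous - current,
        "appeared": current - previous,
    }


# Made-up UIDs: the date_sick part of the first case is corrected in the later
# report, so its old UID vanishes and a new one shows up.
previous_report = ["disease_x|10|2014-05-01", "disease_x|10|2014-05-03"]
current_report = ["disease_x|10|2014-05-02", "disease_x|10|2014-05-03"]

changes = revision_candidates(previous_report, current_report)
```

The set difference cannot tell which appearing UID replaces which disappearing UID. It only narrows down the cases worth a closer look.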

Direction for the future

To account for revisions, we wanted to consider a probabilistic matching framework, which would assign a probability that two reported cases are the same based on which of their fields differ. We could then use this to determine a probability distribution over date_delivered, etc.
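
To make the idea concrete, here is a minimal sketch in the style of Fellegi-Sunter record linkage, which is one common approach and not necessarily the framework the authors had in mind. It assumes fields agree or disagree independently of one another. The `m` values (chance a field agrees when two reports are the same case), the `u` values (chance it agrees when they are not), and the prior are made-up placeholders, not estimates from the data.

```python
def match_probability(case_a, case_b, m, u, prior=0.01):
    """Posterior probability that two reports describe the same case.

    m[field]: P(field agrees | same case)
    u[field]: P(field agrees | different cases)
    prior:    P(same case) before looking at any field
    Assumes fields are conditionally independent (a naive Bayes simplification).
    """
    odds = prior / (1 - prior)
    for field in m:
        if case_a.get(field) == case_b.get(field):
            odds *= m[field] / u[field]
        else:
            odds *= (1 - m[field]) / (1 - u[field])
    return odds / (1 + odds)


# Placeholder probabilities for three of the UID fields.
m = {"province": 0.95, "date_sick": 0.95, "birth_year": 0.95}
u = {"province": 0.10, "date_sick": 0.10, "birth_year": 0.10}

original = {"province": 10, "date_sick": "2014-05-01", "birth_year": 1990}
revised = dict(original, date_sick="2014-05-02")  # one field modified

p_identical = match_probability(original, original, m, u)
p_revised = match_probability(original, revised, m, u)
```

Unlike exact UID matching, a revised case still receives a nonzero probability of being the same case, and that probability could in turn weight the candidate values of date_delivered.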