MDS Data Redaction

Michael Schnuerle edited this page Mar 12, 2021 · 14 revisions

DRAFT.

MDS 1.1.0 introduces two new beta features, Provider Reports and Metrics. This OMF guidance document explains the use of specific data redaction principles to remove low counts of data from these new features for privacy.

Summary

Some uses of Provider Reports and Metrics may return a small count of trips for a certain geographic area or time frame. These low aggregated counts could increase the risk of re-identification when combined with other data sources or outside knowledge. To mitigate that risk, these features do not return aggregate counts below a set threshold. This technique is called k-anonymity, and the threshold we have set is a k-value of 10.

If a query returns fewer than 10 trips in an aggregate count, that row's count is returned as "-1". Note that counts of 0 are also returned as "-1" to account for privacy risk in some edge cases.
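
As a minimal sketch of the rule above, assuming Python and hypothetical row fields (`geography` and `trip_count` are illustrative names, not MDS field names), the redaction could look like this:

```python
K_VALUE = 10    # threshold stated in this guidance
REDACTED = -1   # value returned in place of a low count


def redact_count(count: int, k: int = K_VALUE) -> int:
    """Return the count unchanged if it is at least k, otherwise the redaction marker.

    Counts of 0 are below k as well, so they are redacted like 1 through 9.
    """
    return count if count >= k else REDACTED


# Fabricated example rows; field names are hypothetical.
rows = [
    {"geography": "zone_a", "trip_count": 0},
    {"geography": "zone_b", "trip_count": 9},
    {"geography": "zone_c", "trip_count": 10},
    {"geography": "zone_d", "trip_count": 42},
]

redacted_rows = [
    {**row, "trip_count": redact_count(row["trip_count"])} for row in rows
]
```

Only the rows with 10 or more trips keep their real counts; the rest are indistinguishable from one another after redaction.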

Both Provider Reports and Metrics have a "Data Redaction" section that summarizes the information in this document and links here for more details, context, and guidance. See the Data Redaction sections here within Provider Reports and Metrics.

Feedback Welcome

As these new features are in beta, the k-value of 10 may be adjusted up or down in future releases and/or may become dynamic to account for specific categories of use cases. To improve the specification and to inform future guidance, beta users are encouraged to share their feedback and questions about k-values on this discussion thread and at our weekly Working Group meetings.

Managing Risk

Using this k-anonymity methodology will reduce, but not necessarily eliminate, the risk that an individual could be re-identified in a dataset. Redacting low counts using k-anonymity is just one part of good privacy protection practice, which you can read more about in our MDS Privacy Guide for Cities. The "Managing Risk" section has guidance for cities, including "while it is important to protect data with the strongest possible technical measures, these measures should be further buttressed with strong legal and administrative controls, such as contractual commitments not to attempt re-identification, terms of use, etc".

Risk Scenarios

Higher k-values have lower re-identification risk, but may result in less complete data depending on the duration of time periods and size of geographic areas for which the reports are calculated. Some use cases (such as sharing results with trusted parties who already have access to disaggregated trip data) may not require k-anonymization, while others (such as sharing with less trusted partners or extracts for the public) may require substantial k-anonymization. While reports with any k-value are substantially less sensitive than disaggregated trip records, they should still be treated as potentially sensitive unless a more detailed risk analysis is performed by the hosting organization.
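
To get a feel for the completeness trade-off, one rough approach is to measure what share of rows would be redacted at several candidate k-values. The sketch below assumes Python and uses fabricated counts; `redacted_share` is a hypothetical helper, not part of MDS.

```python
def redacted_share(counts, k):
    """Fraction of rows whose count falls below k and would therefore be redacted."""
    if not counts:
        return 0.0
    return sum(1 for c in counts if c < k) / len(counts)


# Fabricated per-row trip counts for one report (not real data).
sample_counts = [0, 3, 7, 12, 18, 25, 40, 60]

for k in (5, 10, 20):
    print(f"k={k}: {redacted_share(sample_counts, k):.1%} of rows redacted")
```

Running the same check on reports with smaller geographies or shorter time periods would typically show a larger redacted share at the same k-value.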

For example, an aggregate count of trips for a small geography and precise time could allow the determination of the end point of a specific trip, which is why that data is redacted.

Because of scenario variability and the dynamic nature of how Provider Reports and Metrics work with subsets of MDS data, we recommend a lower-risk k-value of 10 during the beta learning period until we get real-world feedback and incorporate changes.

Methodology

It is common practice to remove small counts of individuals from aggregated datasets, e.g., census areas and health department maps. In many of these cases a k-value of 5 is sufficient to protect the privacy of individuals. However, the OMF community has decided that during the learning phase, as cities and companies test out these features in the real world and receive feedback, we should use a value of 10, as it leans towards lower risk and greater data anonymization. See the References section below for some resources.

Factors in Scenario Variability

Low k-values mean more information, but higher risk. High k-values mean less information, but lower risk. We have an idea of the risk, but it changes greatly based on scenarios and audience.

Some factors that affect both risk exposure and the need for more granular data are:

  1. Geography size (parking, no ride, equity zone, operating areas)
  2. Population density (dense, sparse, residential, commercial)
  3. Time frame (month, week, day, hour)
  4. Data consumer/audience (internal, research, public)
  5. Policy reason (enforcement, equity, operations)
  6. Special groups data (all riders, low income)

The combination of these factors could allow different k-values. For now we are using a higher, one-size-fits-all k-value of 10 since it provides the right balance of low risk and adequate data for most policy scenarios.

Hypothetical Scenarios

This hypothetical chart shows the kind of risk incurred with various factors for some fabricated scenarios and data, and can give a visual idea of the complexity and choices that need to be considered for each use case. Different data sets have different risk profiles, and higher k-values reduce risk, but have lower data utility.

[Figure: K-value Risk Variability Chart]

Note this graph is for illustrative purposes and is not showing actual data or research. In the real world there are infinite combinations of factors and acceptable risk ranges.

Open Questions

The OMF acknowledges that there are many ways to use k-anonymity, and we have chosen the method presented here as a low-risk option until we receive more on-the-ground feedback.

  1. The k-value is set now as a flat value of 10. Should it be a range instead? We don't yet have much basis to define ranges for every scenario.
  2. Should there be different k-values for different scenarios? We are not yet sure how to define a basis for all scenario combinations until we get some real-world feedback.
  3. Should we show 0 count values separately? Some other k-anonymity methods do this, but for our case we believe 0 and 1-9 should not be distinguishable.
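
Because 0 and 1-9 are indistinguishable, a data consumer cannot sum redacted rows directly; each "-1" only says the true count lies somewhere from 0 to 9. A minimal sketch of one way a consumer might handle this, assuming Python and a hypothetical `total_bounds` helper, is to report a range instead of a single total:

```python
def total_bounds(counts, k=10, redacted=-1):
    """Return (lower, upper) bounds on the true total of a list of returned counts.

    A redacted entry hides a true value between 0 and k - 1, so it adds
    nothing to the lower bound and k - 1 to the upper bound.
    """
    lower = 0
    upper = 0
    for count in counts:
        if count == redacted:
            upper += k - 1
        else:
            lower += count
            upper += count
    return lower, upper


# Fabricated returned counts: two redacted rows and two visible rows.
returned = [-1, 12, -1, 30]
low, high = total_bounds(returned)
print(f"True total is between {low} and {high}")
```

Adding the "-1" values into a total as if they were counts would silently understate the result, which is why this sketch treats them as a marker and not as a number.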

For more questions and to leave your thoughts, see our public discussion area.

Definitions

  • Disclosure avoidance - a series of techniques that can be used to suppress or redact data, like k-anonymity, record swapping, table collapsing, rounding, top-coding, and differential privacy.
  • K-anonymity - Removing low counts of aggregated data to reduce individual re-identification risk. It's a disclosure avoidance technique called data suppression.
  • K-value - The minimum count below which data is redacted. The “K” here just means a variable you can set, like ‘x’ in algebra. Counts below the "k" value are removed.

References

A sampling of external documents that explain some of the concepts and reasoning used in the MDS Metrics and Provider Reports. Many of these examples cover data suppression required by federal law when sharing HIPAA-compliant data about individuals with the public, whereas Metrics and Provider Reports differ in several ways (already aggregated, not HIPAA data, not for public consumption, internal use only, etc.). These resources can still be informative.

  1. US Census history of privacy protections diagram
  2. CDC statistical notes on Healthy People 2010 Criteria for Data Suppression
  3. US Census American Community Survey Data Suppression
  4. CDC FAQ: Suppressed Small Data Values
  5. CDC Data Use Agreement