Screened Tables - openmpp/openmpp.github.io GitHub Wiki
Screened tables let model developers implement Statistical Disclosure Control (SDC) policies at the cell level of entity tables. Screened tables can also be used for output quality management, for example to suppress statistically unreliable cells or to round values to a given number of digits of precision to avoid the impression of spurious accuracy in model outputs.
- Introduction
- Overview
- Syntax and simple example
- Remarks and limitations
- Extrema collections
- Screening function arguments: arguments of the screening transformation function(s)
- Examples
- Annex 1: SM1 Code
- Annex 2: SM1 Parameters
Please see the following references for an overview of Statistical Disclosure Control:
Screened Tables is an optional capability that allows a model developer to examine and modify table cell values, based on the values themselves and their provenance. It is activated by using a keyword in the entity table declaration, and by providing a C++ function which implements screening at the cell level. That C++ screening function is not necessarily complex, and multiple function arguments are supplied to ease the task. This wiki topic also includes a suite of working examples.
The examples are based on the model SM1 in the OpenM++ distribution at OM_ROOT/models/SM1. SM1 is a simple model which adds several attributes to the NewCaseBased model to provide raw material for the examples in this topic.
The example table declarations in this topic can be copy/pasted directly into the module SM1/code/ScreenedTables.ompp, and the corresponding example screening code can be copy/pasted directly into the body of the function TransformScreened1 in SM1/code/ScreeningCode.ompp.
The upper right corner of each code block in this wiki topic should display a browser pop-up which can be clicked to copy the entire block to the system clipboard for subsequent pasting into SM1, for in-depth exploration.
To provide flexibility and facilitate comparison of different methods, a model can use up to four different screening functions. The examples in this topic use only a single method, method #1.
To activate one of the four screening methods for a given entity table, include exactly one of the four keywords screened1, screened2, screened3, or screened4 in the table's properties, and supply a definition for the corresponding C++ transformation function.
For example, the following table is declared in SM1/code/ScreenedTables.ompp:
table snapshot screened1 Person ExampleTable
[trigger_entrances(integer_age, 50)]
{
{
unit, //EN Persons
mean(earnings), //EN Average earnings
P50(earnings) //EN Median earnings
}
};
The table property screened1 specifies that the table will be screened using screening method #1, which uses the C++ transformation function TransformScreened1.
That means that each of the three accumulators (statistics) in the single cell of ExampleTable will be subject to modification by the developer-supplied C++ function TransformScreened1.
This function is called automatically when the simulation of each sub/member/replicate completes, once for each value in this table.
The developer-supplied function TransformScreened1 takes 10 arguments (described below), whose values are supplied by the framework.
The first argument in_value is a value in the table before modification.
The function returns the possibly modified value.
Here's an example definition of TransformScreened1 in the model code module SM1/code/ScreeningCode.ompp:
double TransformScreened1(
const double in_value,
...
)
{
/// transformed value, initialized to quiet NaN (shows as empty)
double out_value = UNDEF_VALUE;
// notional example of transformation (round to 100's)
out_value = 100.0 * std::round(in_value / 100.0);
return out_value;
}
The C++ code in this example rounds all table values to 100's.
Here's a comparison of the table values for both unscreened and screened versions of the table:
| Quantity | Label | unscreened | screened |
|---|---|---|---|
| unit | Persons | 2538 | 2500 |
| mean(earnings) | Average earnings | 99203.9 | 99200 |
| P50(earnings) | Median earnings | 101191 | 101200 |
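The rounding used in this example generalizes to any step size. The following is a minimal standalone sketch, not part of SM1: `round_to_multiple` is a hypothetical helper name introduced here for illustration, and the values in the usage note are those of the comparison table above.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of OpenM++ or SM1):
// round value to the nearest multiple of step.
// std::round rounds halfway cases away from zero.
double round_to_multiple(double value, double step)
{
    assert(step > 0.0); // a non-positive step has no meaningful rounding grid
    return step * std::round(value / step);
}
```

With a step of 100.0, this helper maps 2538 to 2500, 99203.9 to 99200, and 101191 to 101200, matching the screened column.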
To examine the transformation process in action, build the Debug version of SM1 and set breakpoints in the function TransformScreened1 in the module ScreeningCode.ompp.
- only at sub level, indirectly at run level.
- only at cell level (including marginal cells if present), with no checks for residual disclosure.
- only at accumulator/statistic level, indirectly at expression/measure level.
- only for entity tables, indirectly to derived tables.
- use the statistic mean to screen averages, i.e. mean(x) instead of sum(x)/unit as in older model code.
- to easily deactivate screening without editing any table declarations, insert the single line return in_value; as the first line in the screening function.
- in use cases where model code is released but the screening algorithm is confidential, the screening function can be built separately and supplied as an object file or library to build the model. Alternatively, a substitute module containing a 'do nothing' screening function can be distributed, if the model code is distributed with a non-confidential synthetic version of the microdata file(s).
The screening function arguments smallest and largest are collections containing, respectively, the lowest M and highest M observations in the table cell containing the value being screened, where M is a configurable constant for each of the four screening methods.
These collections allow implementing 'dominance' rules for cell suppression, e.g. suppress a cell total if the top 3 observations in the cell account for more than 70% of the total.
To set an appropriate value for M, use the corresponding option screened[1-4]_extremas_size in model code.
For example, the following statement retains the lowest 3 and highest 3 observations in the smallest and largest extrema collections, respectively, for method #1:
options screened1_extremas_size = 3;
Extrema collections might be smaller than the specified size if there are fewer than that many observations in the cell.
Extrema collections can contain the special floating point values +inf and -inf. They never contain the special floating point value NaN, because a NaN increment is treated as a run-time model error by OpenM++.
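The dominance rule described above can be sketched as a standalone helper that also handles short collections and infinite values. This is only an illustration under stated assumptions: `is_dominated` is a hypothetical name, the value being screened is assumed to be a cell total of non-negative observations, and the choice to suppress short or non-finite cells is a policy assumption of the sketch, not something prescribed by OpenM++.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <set>

// Hypothetical helper (not part of OpenM++): dominance test for a cell total.
// Returns true if the cell should be suppressed.
bool is_dominated(const std::multiset<double>& largest,
                  std::size_t extrema_size,
                  double total,
                  double threshold)
{
    assert(threshold > 0.0 && threshold <= 1.0);
    assert(largest.size() <= extrema_size); // extrema_size is the maximum possible size

    // Fewer than extrema_size observations: the collection holds every
    // observation in the cell, so treat the cell as dominated (assumption).
    if (largest.size() < extrema_size) {
        return true;
    }
    // An infinite observation, or a non-finite or non-positive total, makes
    // the share meaningless, so suppress conservatively (assumption).
    bool all_finite = std::all_of(largest.begin(), largest.end(),
                                  [](double v) { return std::isfinite(v); });
    if (!all_finite || !std::isfinite(total) || total <= 0.0) {
        return true;
    }
    double sum_top = std::accumulate(largest.begin(), largest.end(), 0.0);
    return sum_top > threshold * total;
}
```

For example, with top observations 50, 60 and 70 in a cell whose total is 200, the top 3 account for 90% of the total, so a 70% threshold flags the cell. If the extrema size is 0, the collection is empty and the helper never flags a cell on dominance grounds.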
Only certain statistics are considered eligible for extrema collections. They are:
- sum
- minimum
- maximum
- mean
If a quantity is ineligible the associated extrema collections will be empty.
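A screening function may want to know whether an empty extrema collection is expected for the statistic at hand. The sketch below encodes the eligibility list above as a helper. The `demo::stat` enumeration is a shortened stand-in so that the block compiles on its own; in model code the generated `omr::stat` would be used instead, and `has_extrema` is a hypothetical name.

```cpp
#include <cassert>

namespace demo {
    // Shortened stand-in for the generated omr::stat enumeration (assumption).
    enum class stat { unit, sum, minimum, maximum, mean, variance, stdev, P50, gini };
}

// Hypothetical helper: true if extrema collections are maintained for this statistic.
bool has_extrema(demo::stat statistic)
{
    switch (statistic) {
    case demo::stat::sum:
    case demo::stat::minimum:
    case demo::stat::maximum:
    case demo::stat::mean:
        return true;
    default:
        return false;
    }
}
```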
Reducing the size of extrema collections reduces memory and processing requirements.
M is set to 0 by default for all four screening methods, to avoid the computational and memory overhead of maintaining these collections for each cell of each screened table unless needed by the screening method.
This subtopic contains several reference sections which are listed and linked here for convenience:
- Screening function signature
- statistic enumeration
- increment enumeration
- table enumeration
- attribute enumeration
This subtopic describes the arguments to the developer-supplied screening function(s).
It is essential that the definition of the screening function(s) in model code have the correct argument types. Otherwise the model build will fail at the C++ link stage, with an error message like
error LNK2019: unresolved external symbol "double __cdecl TransformScreened1(...
The correct function signature can be copied from the section below, or from the file OM_ROOT/include/omc/globals1.h in the OpenM++ distribution.
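The link error arises because C++ identifies a function by its name together with its parameter types: a definition with different parameter types defines a separate overload and leaves the declared function without a definition. The sketch below shows one possible way to catch such a mismatch at compile time rather than at link time. It is an illustration only: `TransformDemo` is a hypothetical three-argument stand-in for the real 10-argument function, and the `static_assert` is not something the OpenM++ framework supplies.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <type_traits>

// Hypothetical, shortened stand-in for the framework's declaration
// (the real TransformScreened1 takes 10 arguments).
double TransformDemo(const double in_value,
                     const std::size_t extrema_size,
                     const std::multiset<double>& largest);

// Developer-supplied definition: parameter types match the declaration exactly.
double TransformDemo(const double in_value,
                     const std::size_t extrema_size,
                     const std::multiset<double>& largest)
{
    (void)extrema_size; // unused in this pass-through sketch
    (void)largest;      // unused in this pass-through sketch
    return in_value;
}

// Compile-time check. If the definition used different parameter types, it
// would introduce a second overload, &TransformDemo would be ambiguous, and
// compilation would fail here instead of at the link stage.
using ExpectedSignature = double (*)(double, std::size_t, const std::multiset<double>&);
static_assert(std::is_same<decltype(&TransformDemo), ExpectedSignature>::value,
              "TransformDemo does not have the expected signature");
```

Top-level `const` on by-value parameters is not part of the function type, which is why `ExpectedSignature` omits it.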
The rows of the following table describe the 10 arguments of a screening function.
The example column contains values pasted from a debugger session on a breakpoint in the function TransformScreened1 in SM1, on the second invocation of the function, using the table given earlier.
| Name | Example | Notes |
|---|---|---|
| in_value | 99203.855397951411 | The original value in the table, which can be transformed or suppressed by code in the transformation function. |
| description | ExampleTable: accumulator 1: mean(value_out(interval(earnings))) | A descriptive string which can be useful in debugging sessions. Don't attempt to parse it for content. Instead, use the enumerator arguments described in rows below. |
| statistic | mean (4) | The enumerator for the statistic for use in function code, e.g. omr::stat::mean |
| increment | value_out (7) | The enumerator for the increment for use in function code, e.g. omr::incr::value_out . For the unit keyword (count of increments), the value is omr::incr::unused . |
| table | ExampleTable (0) | The entity table name as an enumerator, e.g. omr::etbl::ExampleTable |
| attribute | earnings (4) | The attribute name as an enumerator, e.g. omr::attr::earnings . If the attribute name is not visible, e.g. duration() , the enumerator is omr::attr::om_none . |
| observations | 2538.0000000000000 | The number of observations (increments) in the cell. It is always unweighted, even if the table is weighted. The value is less than the 5,000 cases in the Default SM1 run due to mortality before age 50. |
| extrema_size | 3 | Identical to the value supplied in the screened[1-4]_extremas_size option, for use in function code. This is the maximum possible size of the extrema collections. The actual size may be less if there are fewer observations in the cell. |
| smallest | {0.0000000000000000, 0.0000000000000000, 0.0000000000000000} | The three smallest observed earnings, in increasing order. The extrema collection is one of the standard C++ container types, specifically std::multiset<double> . See code examples elsewhere in this topic for use. The values are all zero in this example because the distribution of earnings in SM1 is mixed discrete-continuous, with a large subpopulation having zero earnings. |
| largest | {370810.00000000000, 398272.00000000000, 484007.00000000000} | The three largest observed earnings, in increasing order. |
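Because `std::multiset<double>` keeps its elements sorted in increasing order, the cell minimum is the first element of `smallest` and the cell maximum is the last element of `largest`. The following sketch shows one way to read them safely when a collection may be empty. The helper names are hypothetical, and returning NaN for an empty collection is a choice made for this illustration.

```cpp
#include <cassert>
#include <limits>
#include <set>

// Hypothetical helper: lowest observation in the cell, or NaN if the collection is empty.
double lowest_value(const std::multiset<double>& smallest)
{
    return smallest.empty() ? std::numeric_limits<double>::quiet_NaN()
                            : *smallest.begin();
}

// Hypothetical helper: highest observation in the cell, or NaN if the collection is empty.
double highest_value(const std::multiset<double>& largest)
{
    return largest.empty() ? std::numeric_limits<double>::quiet_NaN()
                           : *largest.rbegin();
}
```

Applied to the example values in the table above, `lowest_value` gives 0 and `highest_value` gives 484007.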
/**
* Table screening transformation function #1
*
* @param in_value The table value subject to transformation.
* @param description A formatted string describing the table and statistic.
* @param statistic The statistic of the accumulator, e.g. sum, mean.
* @param increment The increment of the accumulator, e.g. delta, value_out.
* @param table The table of the accumulator (model-specific).
* @param attribute The attribute of the accumulator (model-specific).
* @param observations The count of observations in the cell (# of increments).
* @param extrema_size The maximum size M of the two extrema collections (configurable)
* @param smallest The extrema collection containing the smallest M observations.
* @param largest The extrema collection containing the largest M observations.
*
* @returns The transformed version of in_value.
*/
double TransformScreened1(
const double in_value,
const char* description,
const omr::stat statistic,
const omr::incr increment,
const omr::etbl table,
const omr::attr attribute,
const double observations,
const size_t extrema_size,
const std::multiset<double>& smallest,
const std::multiset<double>& largest
)
The statistic enumeration omr::stat is generated by the OpenM++ compiler in the file src/om_types0.h.
namespace omr {
/// statistic in an entity table
enum class stat {
unit,
sum,
minimum,
maximum,
mean,
variance,
stdev,
P1,
P2,
P5,
P10,
P20,
P25,
P30,
P40,
P50,
P60,
P70,
P75,
P80,
P90,
P95,
P98,
P99,
gini,
};
} // namespace omr
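A screening function that treats all percentile statistics alike can test for the range P1 to P99 rather than listing each enumerator. The sketch below assumes the enumerator order shown in the listing above stays contiguous, which is an assumption about generated code rather than a documented guarantee. `demo::stat` is a stand-in copy of the listing so that the block compiles on its own, and `is_percentile` is a hypothetical name.

```cpp
#include <cassert>

namespace demo {
    // Stand-in copy of the omr::stat listing above (assumption: same order).
    enum class stat {
        unit, sum, minimum, maximum, mean, variance, stdev,
        P1, P2, P5, P10, P20, P25, P30, P40, P50,
        P60, P70, P75, P80, P90, P95, P98, P99,
        gini,
    };
}

// Hypothetical helper: true for the percentile statistics P1 .. P99.
// Scoped enumerations of the same type can be compared with relational operators.
bool is_percentile(demo::stat s)
{
    return s >= demo::stat::P1 && s <= demo::stat::P99;
}
```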
The increment enumeration omr::incr is generated by the OpenM++ compiler in the file src/om_types0.h.
namespace omr {
/// increment in an entity table
enum class incr {
unused,
delta,
delta2,
nz_delta,
value_in,
value_in2,
nz_value_in,
value_out,
value_out2,
nz_value_out,
};
} // namespace omr
The model-specific table enumeration omr::etbl is generated by the OpenM++ compiler in the file src/om_types0.h.
Here it is for the SM1 model:
namespace omr {
/// entity table in model
enum class etbl {
ExampleTable,
om_none
};
} // namespace omr
The model-specific attribute enumeration omr::attr is generated by the OpenM++ compiler in the file src/om_types0.h.
Here it is for the SM1 model:
namespace omr {
/// visible entity attribute in model
enum class attr {
age,
alive,
all_earnings,
benefit,
case_id,
case_seed,
earnings,
entity_id,
integer_age,
lifecycle_counter,
lifecycle_event,
positive_earnings,
region,
se_earnings,
time,
under_audit,
om_none
};
} // namespace omr
This subtopic contains some worked examples of screened tables and screening functions.
Each example shows the table declaration, the body of the screening function, and a cell-by-cell comparison of the effects of screening on the table.
The examples are meant to illustrate coding approaches to different kinds of screening requirements. They have not been tested carefully for validity.
Some of these examples use standard math functions which require the header file <cmath>, which is made available to model code through a #include instruction in the file SM1/code/custom_early.h.
- Example 1: Rounding based on the kind of statistic
- Example 2: Rounding to 3 digits of precision
- Example 3: Suppressing cells with few observations
- Example 4: Suppressing cells dominated by a few large observations
Example 1 rounds values to a fixed decimal place which depends on the kind of statistic. Specifically, the Gini coefficient is not modified, counts are rounded to 100's, means are rounded to 1000's, sums are rounded to 1000000's, and other statistics are suppressed.
Table declaration:
table snapshot screened1 Person ExampleTable
[trigger_entrances(integer_age, 50)]
{
{
unit, //EN Persons
nz_value_out(earnings), //EN Persons with earnings
mean(earnings), //EN Average earnings
sum(earnings), //EN Total earnings
P50(earnings), //EN Median earnings
maximum(earnings), //EN Maximum earnings
gini(earnings) //EN gini of earnings
}
};
Screening function body:
{
/// transformed value, initialized to quiet NaN (shows as empty)
double out_value = UNDEF_VALUE;
/// the increment is from the count-like nz family
bool is_nz =
(increment == omr::incr::nz_delta)
|| (increment == omr::incr::nz_value_in)
|| (increment == omr::incr::nz_value_out);
if (statistic == omr::stat::gini) {
// gini coefficient
// do not modify
out_value = in_value;
}
else if ((statistic == omr::stat::unit) || is_nz) {
// count-like value
// round to 100's
out_value = 100.0 * std::round(in_value / 100.0);
}
else if (statistic == omr::stat::mean) {
// average-like value
// round to 1000's
out_value = 1000.0 * std::round(in_value / 1000.0);
}
else if (statistic == omr::stat::sum) {
// total-like value
// round to 1000000's
out_value = 1000000.0 * std::round(in_value / 1000000.0);
}
else {
// suppress other things
out_value = UNDEF_VALUE;
}
return out_value;
}
Effects:
| Quantity | Label | unscreened | screened |
|---|---|---|---|
| unit | Persons | 2538 | 2500 |
| nz_value_out(earnings) | Persons with earnings | 2030 | 2000 |
| mean(earnings) | Average earnings | 99203.9 | 99000 |
| sum(earnings) | Total earnings | 251779000 | 252000000 |
| P50(earnings) | Median earnings | 101191 | |
| maximum(earnings) | Maximum earnings | 484007 | |
| gini(earnings) | gini of earnings | 0.375879 | 0.375879 |
Example 2 rounds every value to a fixed number of significant digits (3), whatever the statistic. The table declaration is the same as in Example 1.
Table declaration:
table snapshot screened1 Person ExampleTable
[trigger_entrances(integer_age, 50)]
{
{
unit, //EN Persons
nz_value_out(earnings), //EN Persons with earnings
mean(earnings), //EN Average earnings
sum(earnings), //EN Total earnings
P50(earnings), //EN Median earnings
maximum(earnings), //EN Maximum earnings
gini(earnings) //EN gini of earnings
}
};
Screening function body:
{
// pass through non-finite values and 0.0
if (!std::isfinite(in_value) || in_value == 0.0) {
return in_value;
}
/// transformed value, initialized to quiet NaN (shows as empty)
double out_value = UNDEF_VALUE;
// number of significant digits to retain
const static double precision = 3;
double d = std::ceil(std::log10(std::abs(in_value)));
/// power of 10 for scaling
double power = precision - std::trunc(d);
/// scaling needed before rounding
double magnitude = std::pow(10.0, power);
out_value = std::round(in_value * magnitude) / magnitude;
return out_value;
}
Effects:
| Quantity | Label | unscreened | screened |
|---|---|---|---|
| unit | Persons | 2538 | 2540 |
| nz_value_out(earnings) | Persons with earnings | 2030 | 2030 |
| mean(earnings) | Average earnings | 99203.9 | 99200 |
| sum(earnings) | Total earnings | 251779000 | 252000000 |
| P50(earnings) | Median earnings | 101191 | 101000 |
| maximum(earnings) | Maximum earnings | 484007 | 484000 |
| gini(earnings) | gini of earnings | 0.375879 | 0.376 |
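To make the arithmetic of this screening function concrete, here is the calculation it performs for the first row of the table (in_value = 2538, precision = 3), written out step by step. The symbols mirror the variable names in the function body.

```latex
\begin{aligned}
d &= \lceil \log_{10} 2538 \rceil = \lceil 3.4045 \rceil = 4 \\
\text{power} &= \text{precision} - d = 3 - 4 = -1 \\
\text{magnitude} &= 10^{-1} = 0.1 \\
\text{out\_value} &= \frac{\operatorname{round}(2538 \times 0.1)}{0.1} = \frac{254}{0.1} = 2540
\end{aligned}
```

The same steps applied to the Gini coefficient 0.375879 give d = 0, power = 3, magnitude = 1000, and round(375.879) / 1000 = 0.376.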
Example 3 rounds counts to 5's and suppresses cells with fewer than 100 observations. The table has a classification dimension and a margin.
Table declaration:
table snapshot screened1 Person ExampleTable //EN High earners by region
[trigger_entrances(integer_age, 50)]
{
region+
* {
high_earner //EN High earners
}
};
Screening function body:
{
/// transformed value, initialized to quiet NaN (shows as empty)
double out_value = UNDEF_VALUE;
if (observations < 100) {
// suppress if fewer than 100 observations
out_value = UNDEF_VALUE;
}
else {
// round to 5's
out_value = 5.0 * std::round(in_value / 5.0);
}
return out_value;
}
Effects:
| Region | observations | unscreened | screened |
|---|---|---|---|
| 0 | 1374 | 269 | 270 |
| 1 | 675 | 141 | 140 |
| 2 | 373 | 83 | 85 |
| 3 | 59 | 5 | |
| 4 | 57 | 9 | |
| All | 2538 | 507 | 505 |
Example 4 implements a dominance rule which suppresses cells dominated by a few large observations. Specifically, average earnings are suppressed if the top 3 earners account for 60% or more of the earnings in the cell. The code for the screening function also illustrates how to specialize for a specific table.
Table declaration:
table snapshot screened1 Person ExampleTable //EN Average earnings of high earners
[trigger_entrances(integer_age, 50) && high_earner]
{
region+
* {
mean(earnings) //EN Earnings
}
};
Screening function body:
{
/// transformed value, initialized to quiet NaN (shows as empty)
double out_value = UNDEF_VALUE;
switch (table) {
case omr::etbl::ExampleTable:
{
assert(extrema_size == 3); // code below requires that extrema size is 3
assert(statistic == omr::stat::mean); // sanity check
assert(attribute == omr::attr::earnings); // sanity check
double sum_top3 = 0.0;
for (auto& val : largest) {
sum_top3 += val;
}
double sum_all = in_value * observations;
if ((sum_top3 / sum_all) >= 0.60) {
// suppress if top 3 earners account for 60% or more of earnings in cell
out_value = UNDEF_VALUE;
}
else {
// round value to 1000's
out_value = 1000.0 * std::round(in_value / 1000.0);
}
break;
}
default:
{
// code to handle other screened tables would go here
break;
}
} // switch
return out_value;
}
Effects:
| Region | observations | unscreened | screened |
|---|---|---|---|
| 0 | 236 | 172678 | 173000 |
| 1 | 127 | 168087 | 168000 |
| 2 | 74 | 168747 | 169000 |
| 3 | 5 | 173615 | |
| 4 | 9 | 210466 | 210000 |
| All | 451 | 171504 | 172000 |
Note that the number of observations is small because the table is filtered on high earners.
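In this screening function the cell total is reconstructed as the mean times the number of observations. If that total is zero and the top observations are also zero, the ratio is NaN and the comparison with 0.60 is false, so the cell falls through to the rounding branch. The sketch below is one possible way to make that case explicit. `top_share_from_mean` is a hypothetical helper, and returning NaN for an undefined share is a choice made for this illustration, leaving the suppression decision to the caller.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <set>

// Hypothetical helper: share of the cell total held by the observations in
// `largest`, with the total reconstructed as mean * observations.
// Returns NaN if the share is undefined (non-finite or non-positive total).
double top_share_from_mean(const std::multiset<double>& largest,
                           double mean,
                           double observations)
{
    assert(observations >= 0.0); // a count of increments cannot be negative
    double sum_all = mean * observations;
    if (!std::isfinite(sum_all) || sum_all <= 0.0) {
        return std::nan("");
    }
    double sum_top = std::accumulate(largest.begin(), largest.end(), 0.0);
    return sum_top / sum_all;
}
```

A caller could then suppress the cell when the returned share is NaN or is at least 0.60, and round the value otherwise.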
Contents of the module SM1/code/Income.ompp:
/* NOTE(Income.mpp, EN)
This module contains hard-coded notional income dynamics for testing screened tables.
*/
#include "omc/optional_IDE_helper.h" // help an IDE editor recognize model symbols
#if 0 // Hide non-C++ syntactic island from IDE
range REGION //EN Region
{
0,
4
};
parameters
{
double EarningsNonZeroProportion;
double EarningsScaleFactor;
double EarningsSigma;
double SE_EarningsNonZeroProportion;
double SE_EarningsScaleFactor;
double SE_EarningsSigma;
double HighIncomeThreshold;
double GuaranteedAnnualIncome;
double AuditThreshold;
cumrate RegionDistribution[REGION];
};
entity Person
{
//EN Earnings
double earnings = { 0.0 };
//EN Self-employed earnings
double se_earnings = { 0.0 };
//EN All earnings
double all_earnings = earnings + se_earnings;
//EN Positive earnings
double positive_earnings = max(0.0, all_earnings);
//EN Benefit
double benefit = max(GuaranteedAnnualIncome, GuaranteedAnnualIncome - positive_earnings);
//EN High earner
bool high_earner = (all_earnings >= HighIncomeThreshold);
//EN Under audit
bool under_audit = (all_earnings >= AuditThreshold) || (positive_earnings != all_earnings);
//EN Region
REGION region;
//EN Notional model of earnings
void AssignEarnings(void);
//EN Assign region
void AssignRegion(void);
// call AssignEarnings at each change in integer_age
hook AssignEarnings, trigger_changes(integer_age);
// call AssignRegion at Start
hook AssignRegion, Start, 1;
};
#endif // Hide non-C++ syntactic island from IDE
void Person::AssignEarnings(void)
{
if (integer_age == 20) {
// Assign starting earnings at age 20
if (RandUniform(10) < EarningsNonZeroProportion) {
double z = RandNormal(11);
double x = EarningsScaleFactor * std::exp(EarningsSigma * z);
earnings = std::round(x);
}
// else earnings have default value of 0
// Assign starting se_earnings at age 20
if (RandUniform(12) < SE_EarningsNonZeroProportion) {
// 80% have self-employed earnings
double z = RandNormal(13);
double x = SE_EarningsScaleFactor * std::exp(SE_EarningsSigma * z);
se_earnings = std::round(x);
}
}
else if (integer_age > 20) {
// Annual change to earnings
{
double u = RandUniform(14);
// rescale uniform to [0.9, 1.1]
u = 0.9 + 0.2 * u;
double x = earnings * u;
x *= 1.03; // career growth with increasing age
earnings = std::round(x);
}
// Annual change to se_earnings
{
double u = RandUniform(15);
// rescale uniform to [0.9, 1.1]
u = 0.9 + 0.2 * u;
double x = se_earnings * u;
x *= 1.03; // career growth with increasing age
se_earnings = std::round(x);
}
}
else {
// No earnings before age 20
}
}
void Person::AssignRegion(void)
{
double draw = RandUniform(3);
int nRegion = 0;
Lookup_RegionDistribution(draw, &nRegion);
region = (REGION)nRegion;
}
Contents of SM1/parameters/Default/Income.ompp:
parameters
{
double EarningsNonZeroProportion = 0.80;
double EarningsScaleFactor = 50000.0;
double EarningsSigma = 0.25;
double SE_EarningsNonZeroProportion = 0.80;
double SE_EarningsScaleFactor = 40000.0;
double SE_EarningsSigma = 0.25;
double HighIncomeThreshold = 250000.00;
double GuaranteedAnnualIncome = 20000.00;
double AuditThreshold = 250000.00;
cumrate RegionDistribution[REGION] =
{
200, 100, 50, 10, 10
};
};