Horizontal aggregation: census variables to pff variables - NYCPlanning/db-factfinder GitHub Wiki
Combining census variables to create pff-variables
Sums of census variables
To calculate estimates and margins of error (MOEs) for pff variables, input census variables undergo up to two forms of aggregation. We describe these as "horizontal" and "vertical" aggregation, referring to summing across columns and across rows, respectively. Consider the simplified example below. The two input variables represent data as downloaded from the Census API. If PFF Variable = Input 1 + Input 2, the estimate columns need to be combined to derive the PFF variable estimate, and the MOE columns need to be combined to derive the PFF variable MOE. Each of these steps is described in the following sections.
geoid | Input Estimate 1 | Input MOE 1 | Input Estimate 2 | Input MOE 2 |
---|---|---|---|---|
tract 1 | 10 | 1 | 20 | 2 |
tract 2 | 30 | 3 | 40 | 4 |
Horizontal aggregation (excluding the aggregations described in the "Exceptions" section below) happens in the aggregate_horizontal method of the Calculate class.
Calculating estimates of sums
Several population factfinder variables are sums of more granular, mutually exclusive inputs. For example, a count of the population under 18 might be the sum of the counts in several childhood age bins (e.g. 0-4, 5-9, 10-14, 15-17). Or, a variable might be a sum over male- and female-specific counts. This form of aggregation is "horizontal." In our simplified case, the tract-level counts for a variable composed of two inputs would be:
geoid | PFF Variable Estimate |
---|---|
tract 1 | 10 + 20 = 30 |
tract 2 | 30 + 40 = 70 |
In general, PFF variable estimate (for a row) = Sum of Input Estimates (for that row)
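As a minimal sketch of this row-wise sum, the snippet below reproduces the simplified table in plain Python. The `rows` structure and the `sum_estimates` helper are illustrative names only; they are not part of db-factfinder, and the actual aggregate_horizontal method may be organized differently.

```python
# Hypothetical input: one dict per geoid, mirroring the simplified table above.
rows = [
    {"geoid": "tract 1", "estimates": [10, 20]},
    {"geoid": "tract 2", "estimates": [30, 40]},
]


def sum_estimates(estimates):
    """Return the PFF variable estimate for one row: the sum of its input estimates."""
    return sum(estimates)


# Apply the sum independently to each row (geoid).
pff_estimates = {row["geoid"]: sum_estimates(row["estimates"]) for row in rows}
print(pff_estimates)  # {'tract 1': 30, 'tract 2': 70}
```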
Calculating MOEs of sums
The margin of error for an aggregation is the root sum of squares of the input margins of error. This assumes that the input variables are independent.
geoid | PFF Variable MOE |
---|---|
tract 1 | sqrt(1^2 + 2^2) = sqrt(5) ≈ 2.24 |
tract 2 | sqrt(3^2 + 4^2) = sqrt(25) = 5 |
In general, PFF variable MOE (for a row) = Square root of the sum of squared input MOEs (for that row)
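The root sum-of-squares rule can be sketched in a few lines of plain Python. The `moe_of_sum` helper is a hypothetical name used for illustration, not a function from db-factfinder, and it relies on the same independence assumption stated above.

```python
import math


def moe_of_sum(moes):
    """Root sum of squares of the input MOEs (assumes independent inputs)."""
    return math.sqrt(sum(m ** 2 for m in moes))


# Input MOEs from the simplified table above.
tract_1 = moe_of_sum([1, 2])  # sqrt(5)
tract_2 = moe_of_sum([3, 4])  # sqrt(25)
print(round(tract_1, 3), tract_2)  # 2.236 5.0
```

Note that the combined MOE is always smaller than the plain sum of the input MOEs (for tract 2, 5 rather than 7), which is why the MOE columns cannot simply be added the way the estimate columns are.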
Exceptions
Not all pff variables are simple sums of census variables. There are two types of non-sum combinations of census variables: medians and special variables.
Special variables
PFF variables that are non-sum, non-median combinations of census variables are referred to as "special variables". These include:
hovacrtm
percapinc
mntrvtm
mnhhinc
avghhsooc
avghhsroc
avghhsz
avgfmsz
hovacrt
rntvacrt
wrkrnothm
Estimate and MOE calculation for special variables occurs in the calculate_e_m_special method of the Calculate class. After downloading the estimates and MOEs of the necessary input variables, this method calls one of the pff variable-specific functions in special.py to combine the inputs.
Medians
Several PFF variables are medians, rather than counts. These include:
mdage
mdhhinc
mdfaminc
mdnfinc
mdewrk
mdemftwrk
mdefftwrk
mdrms
mdvl
mdgr
Estimate and MOE calculation for medians occurs in the calculate_e_m_median method of the Calculate class. This method calculates medians by:
1. Extracting ranges, design factors, and booleans indicating whether top and bottom coding are appropriate from the metadata class (see metadata documentation for more information).
2. Downloading and calculating the estimate and MOE for all input variables. For medians, input variables are counts within a given bin. For example, a count of people ages 5 to 9 is an input for median age.
3. Pivoting the outputs of step 2 to create a table with each row representing a geoid, where each input pff variable corresponds to two columns (one estimate column and one MOE column).
4. Combining columns (a form of horizontal aggregation), using formulas contained in the Median class.
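The pivoting step described above can be illustrated with a small plain-Python sketch. Everything here is an assumption made for illustration: the long-format record layout, the `bin_a`/`bin_b` variable names, the `_e`/`_m` column suffixes, and the `pivot_wide` helper are hypothetical and are not taken from the db-factfinder code.

```python
from collections import defaultdict

# Hypothetical long-format records: (geoid, input variable, estimate, MOE).
long_rows = [
    ("tract 1", "bin_a", 10, 1),
    ("tract 1", "bin_b", 20, 2),
    ("tract 2", "bin_a", 30, 3),
    ("tract 2", "bin_b", 40, 4),
]


def pivot_wide(records):
    """Return one entry per geoid, with an estimate and an MOE column per input."""
    wide = defaultdict(dict)
    for geoid, var, est, moe in records:
        wide[geoid][f"{var}_e"] = est  # estimate column (suffix is an assumption)
        wide[geoid][f"{var}_m"] = moe  # MOE column (suffix is an assumption)
    return dict(wide)


wide = pivot_wide(long_rows)
print(wide["tract 1"])  # {'bin_a_e': 10, 'bin_a_m': 1, 'bin_b_e': 20, 'bin_b_m': 2}
```

With one row per geoid, the final column-combining step can then operate on each row independently, as in the sum case.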
For more detail on median calculation, as implemented in the Median class, see the median calculation documentation page.