Horizontal aggregation: census variables to pff variables - NYCPlanning/db-factfinder GitHub Wiki

Combining census variables to create pff-variables

Sums of census variables

To calculate estimates and margins of error (MOEs) for pff variables, input census variables undergo up to two forms of aggregation. We describe these as "horizontal" and "vertical" aggregation, referring to summing tables over either columns or rows. The simplified example below shows two input variables as downloaded from the Census API. If PFF Variable = Input 1 + Input 2, the estimate columns need to be combined to derive the PFF variable estimate, and the MOE columns need to be combined to derive the PFF variable MOE. Each of these steps is described in the following sections.

geoid | Input Estimate 1 | Input MOE 1 | Input Estimate 2 | Input MOE 2
tract 1 | 10 | 1 | 20 | 2
tract 2 | 30 | 3 | 40 | 4

Horizontal aggregation (excluding aggregations described in the "Exceptions" section below) happens in the aggregate_horizontal method of the Calculate class.

Calculating estimates of sums

Several population factfinder variables are sums of more granular, mutually exclusive inputs. For example, counts representing a population under 18 might come from the aggregated counts of several non-overlapping childhood age bins (e.g. 0-4, 5-9, 10-14, 15-17). Or, a variable might reflect a sum over male- and female-specific counts. This form of aggregation is "horizontal." In our simplified case, the tract-level counts for a variable composed of two inputs would be:

geoid | PFF Variable Estimate
tract 1 | 10 + 20 = 30
tract 2 | 30 + 40 = 70

In general, PFF variable estimate (for a row) = Sum of Input Estimates (for that row)
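As a minimal sketch of the row-wise sum, the snippet below reproduces the simplified example in plain Python. The column names (e_1, m_1, e_2, m_2) and the helper sum_estimates are hypothetical, chosen only for illustration; this is not the code of the aggregate_horizontal method.

```python
# Hypothetical input rows mirroring the simplified table earlier on this page.
# Column names (e_1, m_1, ...) are illustrative, not the repository's schema.
rows = [
    {"geoid": "tract 1", "e_1": 10, "m_1": 1, "e_2": 20, "m_2": 2},
    {"geoid": "tract 2", "e_1": 30, "m_1": 3, "e_2": 40, "m_2": 4},
]


def sum_estimates(row, estimate_columns):
    """Add the input estimates for a single geography (one row)."""
    return sum(row[col] for col in estimate_columns)


# One PFF variable estimate per geoid.
pff_estimates = {row["geoid"]: sum_estimates(row, ["e_1", "e_2"]) for row in rows}
print(pff_estimates)  # {'tract 1': 30, 'tract 2': 70}
```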

Calculating MOEs of sums

The margin of error for an aggregation is the root sum of squares of the input margins of error. This assumes that the input variables are independent.

geoid | PFF Variable MOE
tract 1 | sqrt(1^2 + 2^2) = sqrt(5) ≈ 2.24
tract 2 | sqrt(3^2 + 4^2) = sqrt(25) = 5

In general, PFF variable MOE (for a row) = Square root of the sum of squared input MOEs (for that row)
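The root-sum-of-squares rule can be sketched in a few lines of Python. The function name moe_of_sum is hypothetical and the inputs are the MOEs from the simplified example; the sketch assumes independent inputs, as stated above.

```python
import math


def moe_of_sum(moes):
    """Root sum of squares of the input MOEs (assumes independent inputs)."""
    return math.sqrt(sum(m ** 2 for m in moes))


tract_1_moe = moe_of_sum([1, 2])  # sqrt(5), roughly 2.24
tract_2_moe = moe_of_sum([3, 4])  # sqrt(25) = 5.0
print(round(tract_1_moe, 2), tract_2_moe)  # 2.24 5.0
```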

Exceptions

Not all pff variables are simple sums of census variables. There are two types of non-sum combinations of census variables: medians and special variables.

Special variables

PFF variables that are non-sum, non-median combinations of census variables are referred to as "special variables". These include:

  • hovacrtm
  • percapinc
  • mntrvtm
  • mnhhinc
  • avghhsooc
  • avghhsroc
  • avghhsz
  • avgfmsz
  • hovacrt
  • rntvacrt
  • wrkrnothm

Estimate and MOE calculation for special variables occurs in the calculate_e_m_special method of the Calculate class. After downloading the estimates and MOEs of the necessary input variables, this method calls one of the pff variable-specific functions in special.py to combine the inputs.


Medians

Several PFF variables are medians, rather than counts. These include:

  • mdage
  • mdhhinc
  • mdfaminc
  • mdnfinc
  • mdewrk
  • mdemftwrk
  • mdefftwrk
  • mdrms
  • mdvl
  • mdgr

Estimate and MOE calculation for medians occurs in the method calculate_e_m_median of the Calculate class. This method calculates medians by:

  1. Extracting ranges, design factors, and booleans indicating whether top and bottom coding are appropriate from the metadata class (see metadata documentation for more information).
  2. Downloading and calculating the estimate and MOE for all input variables. For medians, input variables are counts within a given bin. For example, a count of people ages 5 to 9 is an input for median age.
  3. Pivoting the outputs of step 2 to create a table with each row representing a geoid, where each input pff variable corresponds to two columns (one estimate column and one MOE column).
  4. Combining columns (a form of horizontal aggregation), using formulas contained in the Median class.
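The pivot in step 3 can be illustrated with a small, self-contained sketch. The record layout, the bin names (bin_a, bin_b), the "_e"/"_m" column suffixes, and the helper pivot_wide are all assumptions made for illustration; the actual method may structure its tables differently.

```python
# Hypothetical long-format output of step 2: one record per (geoid, input variable).
long_records = [
    {"geoid": "tract 1", "pff_variable": "bin_a", "e": 10, "m": 1},
    {"geoid": "tract 1", "pff_variable": "bin_b", "e": 20, "m": 2},
    {"geoid": "tract 2", "pff_variable": "bin_a", "e": 30, "m": 3},
    {"geoid": "tract 2", "pff_variable": "bin_b", "e": 40, "m": 4},
]


def pivot_wide(records):
    """Return one row per geoid with an estimate and an MOE column per input."""
    wide = {}
    for rec in records:
        row = wide.setdefault(rec["geoid"], {})
        row[f"{rec['pff_variable']}_e"] = rec["e"]  # estimate column
        row[f"{rec['pff_variable']}_m"] = rec["m"]  # MOE column
    return wide


wide = pivot_wide(long_records)
print(wide["tract 1"])
# {'bin_a_e': 10, 'bin_a_m': 1, 'bin_b_e': 20, 'bin_b_m': 2}
```

Once the table is in this wide shape, every input needed for a geoid sits in a single row, which is what allows the median formulas to be applied as a horizontal (column-wise) combination.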

For more detail on median calculation, as implemented in the Median class, see the median calculation documentation page.