Metadata - NYCPlanning/db-factfinder GitHub Wiki

Metadata

Overview

The pff-factfinder package relies on a series of JSON metadata files. The primary function of these files is to relate a given factfinder variable to its input census variables. Because ACS and census tables can change slightly from year to year as the Census Bureau adds, drops, or modifies variables, the metadata files are specific to a release year. The metadata also contains any other pff_variable-level information necessary for calculations.

Metadata structure

Metadata for a given factfinder variable are structured as follows:

{
    "pff_variable": "lgoenlep1",
    "base_variable": "lgbase",
    "census_variable": [
        "C16001_005",
        "C16001_008",
        "C16001_011",
        "C16001_014",
        "C16001_017",
        "C16001_020",
        "C16001_023",
        "C16001_026",
        "C16001_029",
        "C16001_032",
        "C16001_035",
        "C16001_038"
    ],
    "domain": "community_profiles",
    "rounding": 0,
    "category": "Language Spoken at Home"
}

pff_variable: The field name, as it appears in the final Population FactFinder data files (not a display name) and as it gets called by the main Calculate class

base_variable: The pff_variable name of the associated base variable. This is the denominator when calculating percent estimates and percent MOEs

census_variable: A list containing all input census variables for a given factfinder variable. These are listed without an "E" or "M" suffix (these suffixes are included in the Census API variable documentation and column headings of downloaded ACS data).

domain: Used for filtering of final outputs. For variables used in Population FactFinder, these are "housing", "demographic", "economic", or "social". For variables used only in the Community Profiles datasets, the domain is "community_profiles".

rounding: Variable-specific number of digits for rounding final output estimates and MOEs

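category: The category assigned to the variable by Labs' front-end application (for example, "Language Spoken at Home")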
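As a minimal sketch of how entries with this structure might be consumed, the snippet below indexes a list of entries by pff_variable and filters them by domain. It assumes a metadata file is a JSON list of such entries. The helper names index_by_variable and filter_by_domain, and the second entry in the sample data, are hypothetical and not taken from the package.

```python
import json

# Illustrative data only: the first entry is shortened from the example above,
# the second is a hypothetical placeholder, not a real factfinder variable.
RAW = """
[
  {"pff_variable": "lgoenlep1", "base_variable": "lgbase",
   "census_variable": ["C16001_005", "C16001_008"],
   "domain": "community_profiles", "rounding": 0,
   "category": "Language Spoken at Home"},
  {"pff_variable": "example_var", "base_variable": "example_base",
   "census_variable": ["B00000_001"],
   "domain": "social", "rounding": 1,
   "category": "Example Category"}
]
"""


def index_by_variable(entries):
    """Map each pff_variable name to its metadata entry."""
    return {entry["pff_variable"]: entry for entry in entries}


def filter_by_domain(entries, domain):
    """Return the pff_variable names whose domain matches."""
    return [e["pff_variable"] for e in entries if e["domain"] == domain]


entries = json.loads(RAW)
by_name = index_by_variable(entries)

print(by_name["lgoenlep1"]["base_variable"])            # lgbase
print(filter_by_domain(entries, "community_profiles"))  # ['lgoenlep1']
```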

Median metadata

Variables that are medians have additional metadata, as follows:

"mdage": {
    "design_factor": 1.1,
    "top_coding": true,
    "bottom_coding": true,
    "ranges": {
        "mdpop0t4": [
            0,
            4.9999
        ],
        "mdpop5t9": [
            5,
            9.9999
        ],
        ...
        "mdpop85pl": [
            85,
            115
        ]
    }
}

design_factor: design factor values that account for the fact that the ACS does not use a simple random sample. These values are a ratio of observed standard errors for a variable to the standard errors that would be obtained from a simple random sample of the same size, and come from the Census Bureau.

top_coding: if True, medians falling within the top bin are set to the lower bound of the top bin. For example, if a geography's median age is between 85 and 115 based on the example above, the median is set to 85.

bottom_coding: if True, medians falling within the bottom bin are set to the upper bound of the bottom bin. For example, if a geography's median age is between 0 and 4.9999 based on the example above, the median is set to 4.9999.

ranges: The upper and lower values associated with each input pff_variable, where the inputs are counts of either people or households with a characteristic falling in a particular range
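The ranges structure and the clamping of medians in the outermost bins can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: find_bin and clamp_median are hypothetical helpers, the parameters clamp_bottom_bin and clamp_top_bin are neutral stand-ins rather than the metadata flags themselves, and only the three bins shown in the example above are included (the elided bins are left out).

```python
# Only the bins visible in the example above; intermediate bins are omitted.
ranges = {
    "mdpop0t4": [0, 4.9999],
    "mdpop5t9": [5, 9.9999],
    "mdpop85pl": [85, 115],
}


def find_bin(value, ranges):
    """Return the name of the first bin whose [lower, upper] range contains value."""
    for name, (lower, upper) in ranges.items():
        if lower <= value <= upper:
            return name
    return None


def clamp_median(median, ranges, clamp_bottom_bin=True, clamp_top_bin=True):
    """Clamp a median that falls in the lowest or highest bin.

    Assumes the bins are listed from lowest to highest, as in the example.
    """
    bins = list(ranges.values())
    bottom_upper = bins[0][1]   # upper bound of the bottom bin
    top_lower = bins[-1][0]     # lower bound of the top bin
    if clamp_bottom_bin and median <= bottom_upper:
        return bottom_upper
    if clamp_top_bin and median >= top_lower:
        return top_lower
    return median


print(find_bin(7.5, ranges))       # mdpop5t9
print(clamp_median(3.2, ranges))   # 4.9999
print(clamp_median(90, ranges))    # 85
print(clamp_median(7.5, ranges))   # 7.5
```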

Metadata class

The Metadata class reads and parses the metadata JSON files as a whole. Its properties distinguish between types of Population FactFinder variables. These lists determine which methodology is appropriate when aggregating census variables (either horizontally or vertically) to calculate a pff_variable.

  • median_variables is a list of all pff_variable names referring to medians
  • median_inputs is a list of all pff_variables that are inputs to median calculations
  • median_ranges is a dictionary containing the value ranges associated with each median input variable
  • special_variables is a list of all pff_variables that require special calculations upon aggregation. These calculations are variable-specific functions contained in special.py.
  • profile_only_variables is a list of pff_variables for which percent and percent MOE are available from the Census API, and so are downloaded rather than calculated for geotypes available in the census API (i.e. not NYC-specific geographies). These variables are ones pulled from the census profile tables. The census variable names of profile variables have the prefix "DP".
  • profile_only_exceptions is a list of pff_variables pulled from profile tables (their census variable names have a "DP" prefix), but for which percent and percent MOE are calculated for all geotypes.
  • base_variables is a list of all pff_variables that serve as the base (denominator) for the percent and percent MOE calculation of another pff_variable.
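To make the relationship between the metadata files and these lists concrete, here is a rough sketch of how a few of them could be derived. MetadataSketch is a hypothetical stand-in, not the package's Metadata class. It assumes the per-variable entries arrive as a list of dicts and the median metadata as a dict keyed by median variable name. The entry "example_dp" is invented for illustration, and profile_table_variables is a made-up helper: the "DP" prefix alone cannot separate profile_only_variables from profile_only_exceptions.

```python
class MetadataSketch:
    """Hypothetical illustration of deriving variable lists from metadata."""

    def __init__(self, variables, medians):
        self.variables = variables  # list of per-variable metadata entries
        self.medians = medians      # dict keyed by median pff_variable

    @property
    def median_variables(self):
        """Names of all pff_variables that are medians."""
        return list(self.medians)

    @property
    def median_inputs(self):
        """Names of all pff_variables used as inputs to any median."""
        seen = []
        for meta in self.medians.values():
            for name in meta["ranges"]:
                if name not in seen:
                    seen.append(name)
        return seen

    @property
    def base_variables(self):
        """Names that appear as the base (denominator) of another variable."""
        return sorted(
            {v["base_variable"] for v in self.variables if v.get("base_variable")}
        )

    @property
    def profile_table_variables(self):
        """Variables whose census inputs all carry the "DP" prefix."""
        return [
            v["pff_variable"]
            for v in self.variables
            if v["census_variable"]
            and all(c.startswith("DP") for c in v["census_variable"])
        ]


variables = [
    {"pff_variable": "lgoenlep1", "base_variable": "lgbase",
     "census_variable": ["C16001_005", "C16001_008"]},
    # Hypothetical entry, included only to exercise the "DP" prefix check.
    {"pff_variable": "example_dp", "base_variable": "",
     "census_variable": ["DP02_0001"]},
]
medians = {
    "mdage": {"ranges": {"mdpop0t4": [0, 4.9999],
                         "mdpop5t9": [5, 9.9999],
                         "mdpop85pl": [85, 115]}}
}

meta = MetadataSketch(variables, medians)
print(meta.median_variables)         # ['mdage']
print(meta.median_inputs)            # ['mdpop0t4', 'mdpop5t9', 'mdpop85pl']
print(meta.base_variables)           # ['lgbase']
print(meta.profile_table_variables)  # ['example_dp']
```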

Other methods include:

Variable class

The Variable class reads and parses the metadata files with reference to a particular pff_variable. The census_variables method returns the census variable names associated with the estimate (E), margin of error (M), percent estimate (PE), and percent margin of error (PM) of a given pff_variable. The create_census_variables method splits a given list of partial census variable names (e.g. ["B01001_044", "B01001_045"]) into a tuple of estimate and MOE census variable names (e.g. (["B01001_044E", "B01001_045E"], ["B01001_044M", "B01001_045M"])). Other Variable properties include the domain of the variable (e.g. "economic"), the base_variable (the name of the pff_variable serving as the denominator when calculating percent and percent MOE, where applicable), the number of decimal places to retain in the final rounded estimate and margin of error, and the category assigned to the variable by Labs' front-end application.
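The suffixing step described for create_census_variables can be sketched in a few lines. The function names below, add_suffix and split_census_variables, are hypothetical and are not the package's own method; the sketch only shows the name transformation, assuming each suffix is simply appended to the partial name.

```python
def add_suffix(partial_names, suffix):
    """Append a Census API suffix such as "E", "M", "PE", or "PM" to each name."""
    return [f"{name}{suffix}" for name in partial_names]


def split_census_variables(partial_names):
    """Return a tuple of (estimate names, margin-of-error names)."""
    return add_suffix(partial_names, "E"), add_suffix(partial_names, "M")


estimates, moes = split_census_variables(["B01001_044", "B01001_045"])
print(estimates)  # ['B01001_044E', 'B01001_045E']
print(moes)       # ['B01001_044M', 'B01001_045M']
```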

Metadata maintenance

Maintaining the metadata files is a largely manual process. The metadata typically needs several kinds of updates between yearly data releases:

  • If a Census Bureau table has a schema change that shifts its columns, the census_variable portion of the metadata will likely need to be updated to reflect the new column numbers
  • If Census Bureau tables containing median inputs change to include either more or fewer binned counts, the ranges portion of the median metadata will need to be updated
  • If the Census Bureau releases new design factors associated with median input tables, the design_factor portion of the median metadata will need to be updated
  • If PFF variables are either introduced or discontinued (due to upstream Census Bureau changes or otherwise), these variables will need to be added to or removed from the metadata
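Because these updates are manual, a small consistency check can help catch mistakes after editing. The sketch below is one possible approach and not part of the package: check_metadata is a hypothetical helper, and it assumes the per-variable entries are a list of dicts and the median metadata is a dict keyed by median variable name. It flags base_variable and median input names that are not defined as pff_variables, and ranges whose bounds are reversed.

```python
def check_metadata(variables, medians):
    """Return a list of human-readable problems found in the metadata."""
    known = {v["pff_variable"] for v in variables}
    problems = []

    # Every non-empty base_variable should itself be a defined pff_variable.
    for v in variables:
        base = v.get("base_variable")
        if base and base not in known:
            problems.append(f"{v['pff_variable']}: unknown base_variable {base!r}")

    # Every median input should be a defined pff_variable with a valid range.
    for median, meta in medians.items():
        for name, (lower, upper) in meta["ranges"].items():
            if name not in known:
                problems.append(f"{median}: unknown median input {name!r}")
            if lower > upper:
                problems.append(f"{median}: range for {name!r} is reversed")

    return problems


# Illustrative data: "lgbase" is deliberately left undefined, and the second
# range is deliberately reversed, so that both checks report something.
variables = [
    {"pff_variable": "lgoenlep1", "base_variable": "lgbase"},
    {"pff_variable": "mdpop0t4", "base_variable": ""},
    {"pff_variable": "mdpop5t9", "base_variable": ""},
]
medians = {"mdage": {"ranges": {"mdpop0t4": [0, 4.9999], "mdpop5t9": [9.9999, 5]}}}

for problem in check_metadata(variables, medians):
    print(problem)
```

Running a check like this after each manual edit gives a quick signal that a renamed or removed variable is still referenced elsewhere in the metadata.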