Estimates, MOEs, and CVs - NYCPlanning/db-factfinder GitHub Wiki

Calculating Estimate and MOE of PFF variables

The Calculate class method calculate_c_e_m_p_z is the entry point for all logic that calculates the non-rounded c, e, m, p, and z values (coefficient of variation, estimate, MOE, percent estimate, and percent MOE) for a given PFF variable.

This method first creates an instance of the Variable class, so that all of the metadata associated with the PFF variable is easily accessible. For more information about metadata, see the metadata documentation page.

Basic estimate and MOE workflow

The most straightforward workflow for calculating estimates and MOEs of a given PFF variable is defined in the calculate_e_m method. This is the workflow for non-median, non-special variables.

  1. Determine from & to geography types: In cases where the requested geography type is not a standard Census geography, block group- or tract-level data are aggregated to produce the requested estimates and MOEs. The logic and lookups necessary for geographic aggregation live in year-specific Python files in the geography directory. Each of these files defines an AggregatedGeography class, which contains an options property of the form:
    {
        "decennial": {
            "tract": {"NTA": self.tract_to_nta, "cd": self.tract_to_cd},
            "block": {
                "cd_fp_500": self.block_to_cd_fp500,
                "cd_fp_100": self.block_to_cd_fp100,
                "cd_park_access": self.block_to_cd_park_access,
            },
        },
        "acs": {
            "tract": {"NTA": self.tract_to_nta, "cd": self.tract_to_cd},
            "block group": {
                "cd_fp_500": self.block_group_to_cd_fp500,
                "cd_fp_100": self.block_group_to_cd_fp100,
                "cd_park_access": self.block_group_to_cd_park_access,
            },
        },
    }

These lookups determine the geography that must be downloaded from the Census API in order to produce output for the requested geotype. For example, calculating ACS data at the NTA level (the "to geography") requires raw data at the tract level (the "from geography"), while calculating ACS data for the irregular park access region within each community district (cd_park_access) requires raw data at the block group level.
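The lookup described above can be sketched as a search through the nested dictionary. This is a minimal illustration, not the library's implementation: `ACS_OPTIONS` is a simplified stand-in for the "acs" part of the options property (strings replace the bound methods), and `find_from_geotype` is a hypothetical helper name.

```python
# Simplified, hypothetical stand-in for the "acs" entry of the options property.
# In the real class the values are bound methods; strings keep this sketch self-contained.
ACS_OPTIONS = {
    "tract": {"NTA": "tract_to_nta", "cd": "tract_to_cd"},
    "block group": {
        "cd_fp_500": "block_group_to_cd_fp500",
        "cd_fp_100": "block_group_to_cd_fp100",
        "cd_park_access": "block_group_to_cd_park_access",
    },
}


def find_from_geotype(options, to_geotype):
    """Return (from_geotype, aggregator) for a requested "to geography".

    Assumption: a geotype that appears in no lookup is a standard Census
    geography, so it is downloaded directly and needs no aggregator.
    """
    for from_geotype, targets in options.items():
        if to_geotype in targets:
            return from_geotype, targets[to_geotype]
    return to_geotype, None


print(find_from_geotype(ACS_OPTIONS, "NTA"))
print(find_from_geotype(ACS_OPTIONS, "cd_park_access"))
print(find_from_geotype(ACS_OPTIONS, "tract"))
```
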

  2. Download input data: Once the necessary raw data geography is identified, all necessary census variables are downloaded using the Download class. For more information on downloading data from the Census API using this class, see the "Downloading data from the API" documentation page.

  3. Aggregate horizontally: If a pff_variable is a sum of multiple, mutually exclusive census variables, the data downloaded in step 2 are aggregated "horizontally." For example, if PFF Variable = Input 1 + Input 2, horizontal aggregation combines the two input census variables to calculate a PFF variable estimate and MOE for each row of the input data. For more information on this form of aggregation, see the "Horizontal aggregation" documentation page.

  4. Aggregate vertically: In cases where the requested geography is not a Census geography, the results of step 3 undergo "vertical" aggregation. For example, rows containing tract-level estimates and MOEs for a given PFF variable are combined to produce NTA-level estimates and MOEs. For more information on this form of aggregation, see the "Vertical aggregation" documentation page.
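Horizontal and vertical aggregation can be illustrated with a small pure-Python sketch. It assumes the MOE of a sum is approximated by the root sum of squares of the input MOEs, which is the Census Bureau's standard approximation; whether the library uses exactly this formula in every case is not stated here. The helper `sum_moe`, the row layout, and all numbers are hypothetical.

```python
import math
from collections import defaultdict


def sum_moe(moes):
    """Approximate the MOE of a sum as the root sum of squares of its input MOEs."""
    return math.sqrt(sum(m ** 2 for m in moes))


# Hypothetical tract-level input: two census variables per row, each as (estimate, MOE).
rows = [
    {"tract": "A", "nta": "N1", "input_1": (100, 30), "input_2": (50, 40)},
    {"tract": "B", "nta": "N1", "input_1": (200, 60), "input_2": (25, 80)},
]

# Horizontal aggregation: combine the input variables within each row.
horizontal = []
for row in rows:
    inputs = [row["input_1"], row["input_2"]]
    horizontal.append(
        {
            "nta": row["nta"],
            "e": sum(e for e, _ in inputs),
            "m": sum_moe([m for _, m in inputs]),
        }
    )

# Vertical aggregation: combine the rows that share a target geography.
grouped = defaultdict(list)
for row in horizontal:
    grouped[row["nta"]].append(row)

vertical = {
    nta: {"e": sum(r["e"] for r in group), "m": sum_moe([r["m"] for r in group])}
    for nta, group in grouped.items()
}

print(horizontal)
print(vertical)
```

Keeping the two stages separate mirrors the workflow: the horizontal result is a valid output on its own when the requested geography is a Census geography.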

Estimate & MOE workflow exceptions

Several variables require slight modifications to the workflow above.

  • Medians: For medians, estimate and MOE calculations occur in the calculate_e_m_median method, rather than calculate_e_m. When downloading data (step 2 above), all necessary variables of counts within bins get downloaded. Horizontal and vertical aggregation (steps 3 & 4) are handled by the Median class. For more information about medians, see the "Median calculation" documentation page.

  • Special variables: Several PFF variables are combinations of census variables, but are not simple sums. In these cases, horizontal aggregation relies on variable-specific formulas contained in special.py. For more information about special variable calculation, which occurs in calculate_e_m_special, see the "Special variables" section of the Horizontal aggregation documentation page.

  • Profile-only variables: For some PFF variables, estimates and MOEs are available both as a count and as a percent of the larger population. In these cases, the downloading step also includes the associated percent estimate and percent MOE data. Estimate and MOE calculations for profile-only variables occur in calculate_e_m_p_z. There is no vertical aggregation associated with these cases. For more information about profile-only calculations for non-aggregated geography types, see the exceptions section of the "Percent Estimate and Percent MOE" documentation page.
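The routing between the basic workflow and its exceptions can be pictured as a simple dispatch on variable type. The sketch below is illustrative only: `choose_workflow`, the three sets, the order of the checks, and the variable names are all hypothetical, and the function returns the name of the method described above rather than calling it.

```python
def choose_workflow(pff_variable, medians, special, profile_only):
    """Return the name of the method that would handle a PFF variable.

    Assumption: each variable belongs to at most one exception category,
    so the order of the checks does not matter in this sketch.
    """
    if pff_variable in medians:
        return "calculate_e_m_median"
    if pff_variable in special:
        return "calculate_e_m_special"
    if pff_variable in profile_only:
        return "calculate_e_m_p_z"
    # Non-median, non-special variables follow the basic workflow.
    return "calculate_e_m"


# Hypothetical variable names, for illustration only.
medians = {"example_median"}
special = {"example_special"}
profile_only = {"example_profile_only"}

for name in ["example_median", "example_special", "example_profile_only", "example_sum"]:
    print(name, "->", choose_workflow(name, medians, special, profile_only))
```
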

Performance enhancements with caching and multiprocessing

To improve performance, both raw data and calculated estimate and MOE data are cached locally. When downloading data from the Census API, the Download class first checks whether the same variables for the same geographies exist in the local cache. If so, the raw data are read from the cache and not re-downloaded. Otherwise, raw data are obtained via the API and saved to the cache for future calls, using the write_to_cache utility function.

Caching also occurs after raw data are transformed into PFF variable estimates and MOEs. The method calculate_e_m, described above, first checks whether previously calculated data are saved in the cache. If so, estimate and MOE data are read from local files rather than being recalculated. First-time calculations (those not already in the cache) are then added to the cache.
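The check-then-compute pattern used at both caching points can be sketched as follows. This is a generic illustration under stated assumptions, not the library's code: the `cached` helper, the pickle file format, and the cache key are hypothetical, and the real project uses its own utilities such as write_to_cache.

```python
import os
import pickle
import tempfile


def cached(cache_dir, key, compute):
    """Return the cached value for `key`, computing and storing it on a miss."""
    path = os.path.join(cache_dir, f"{key}.pkl")
    if os.path.exists(path):
        # Cache hit: read the saved result instead of recomputing it.
        with open(path, "rb") as f:
            return pickle.load(f)
    # Cache miss: compute once, then save for future calls.
    result = compute()
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive_calculation():
    """Stand-in for a download or an estimate/MOE calculation."""
    calls.append(1)
    return {"e": 150, "m": 50.0}


cache_dir = tempfile.mkdtemp()
first = cached(cache_dir, "example_variable_NTA", expensive_calculation)
second = cached(cache_dir, "example_variable_NTA", expensive_calculation)
print(first == second, len(calls))
```

The second call returns the stored result, so the expensive function runs only once.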

In cases where a single PFF variable is a combination of multiple input PFF variables (such as the binned data used to calculate medians), inputs are calculated in parallel. The method calculate_e_m_multiprocessing is a wrapper function that calls either calculate_e_m or calculate_e_m_special over a list of input PFF variables.
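A parallel map over a list of input variables might look like the sketch below. To keep the example portable and runnable anywhere, it uses `multiprocessing.pool.ThreadPool`, a thread-backed pool with the same `map` interface as a process pool; it is a stand-in, not a claim about how calculate_e_m_multiprocessing is written. `calculate_one` and the input names are hypothetical placeholders.

```python
from multiprocessing.pool import ThreadPool


def calculate_one(pff_variable):
    """Placeholder for a per-variable calculation such as calculate_e_m."""
    # The "estimate" here is a dummy value so the sketch has something to return.
    return {"pff_variable": pff_variable, "e": len(pff_variable), "m": 0.0}


def calculate_many(pff_variables, workers=4):
    """Run calculate_one over every input in parallel, preserving input order."""
    with ThreadPool(workers) as pool:
        return pool.map(calculate_one, pff_variables)


results = calculate_many(["bin_a", "bin_bb", "bin_ccc"])
for result in results:
    print(result)
```

Because `map` returns results in input order, downstream code (for example, a median calculation over ordered bins) can rely on the positions matching the input list.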

Calculating coefficient of variation

After e and m are calculated in calculate_c_e_m_p_z, c is calculated using the function get_c.

If the estimate is 0 or the MOE is NULL, then c is NULL. Otherwise, c = m / 1.645 / e * 100.
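The rule above translates directly into a few lines of Python. The divisor 1.645 is the z-score for a 90 percent confidence level, so m / 1.645 converts the MOE to a standard error before it is expressed as a percent of the estimate. The function below is a sketch named `get_c_sketch` to make clear that it is not the library's get_c; the extra guard for a missing estimate is an added assumption.

```python
def get_c_sketch(e, m):
    """Coefficient of variation as a percent, or None when it is undefined."""
    # From the rule above: c is NULL when the estimate is 0 or the MOE is NULL.
    # Assumption: a missing estimate (None) is also treated as NULL.
    if e is None or e == 0 or m is None:
        return None
    return m / 1.645 / e * 100


print(get_c_sketch(100, 10))
print(get_c_sketch(0, 10))
print(get_c_sketch(100, None))
```
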