Median calculations - NYCPlanning/db-factfinder GitHub Wiki

Medians

Methods for calculating the median estimate and MOE for a given geography are in the Median class.

Estimate calculation

Median estimates are calculated from count estimates of binned data. For example, median household income estimates are calculated from the estimated counts of households with incomes in various ranges (under 10k, 10-14k, 15-19k, etc.).

Below is an example of tract-level input variable estimates for median household income:

Input variable (range)	Estimate of count in range
mdhhiu10 (0 to 9999)	20.0
mdhhi10t14 (10000 to 14999)	0.0
mdhhi15t19 (15000 to 19999)	12.0
mdhhi20t24 (20000 to 24999)	9.0
mdhhi25t29 (25000 to 29999)	0.0
mdhhi30t34 (30000 to 34999)	0.0
mdhhi35t39 (35000 to 39999)	0.0
mdhhi40t44 (40000 to 44999)	0.0
mdhhi45t49 (45000 to 49999)	0.0
mdhhi50t59 (50000 to 59999)	0.0
mdhhi60t74 (60000 to 74999)	0.0
mdhhi75t99 (75000 to 99999)	0.0
mdhi100t124 (100000 to 124999)	0.0
mdhi125t149 (125000 to 149999)	0.0
mdhi150t199 (150000 to 199999)	0.0
mdhhi200pl (200000 to 9999999)	0.0

This, in turn, corresponds with a cumulative count distribution of:

Input variable (range)	Cumulative count
mdhhiu10 (0 to 9999)	20.0
mdhhi10t14 (10000 to 14999)	20.0
mdhhi15t19 (15000 to 19999)	32.0
mdhhi20t24 (20000 to 24999)	41.0
mdhhi25t29 (25000 to 29999)	41.0
mdhhi30t34 (30000 to 34999)	41.0
mdhhi35t39 (35000 to 39999)	41.0
mdhhi40t44 (40000 to 44999)	41.0
mdhhi45t49 (45000 to 49999)	41.0
mdhhi50t59 (50000 to 59999)	41.0
mdhhi60t74 (60000 to 74999)	41.0
mdhhi75t99 (75000 to 99999)	41.0
mdhi100t124 (100000 to 124999)	41.0
mdhi125t149 (125000 to 149999)	41.0
mdhi150t199 (150000 to 199999)	41.0
mdhhi200pl (200000 to 9999999)	41.0

And a cumulative percent distribution of:

Input variable (range)	Cumulative percent
mdhhiu10 (0 to 9999)	48.78048780487805
mdhhi10t14 (10000 to 14999)	48.78048780487805
mdhhi15t19 (15000 to 19999)	78.04878048780488
mdhhi20t24 (20000 to 24999)	100.0
mdhhi25t29 (25000 to 29999)	100.0
mdhhi30t34 (30000 to 34999)	100.0
mdhhi35t39 (35000 to 39999)	100.0
mdhhi40t44 (40000 to 44999)	100.0
mdhhi45t49 (45000 to 49999)	100.0
mdhhi50t59 (50000 to 59999)	100.0
mdhhi60t74 (60000 to 74999)	100.0
mdhhi75t99 (75000 to 99999)	100.0
mdhi100t124 (100000 to 124999)	100.0
mdhi125t149 (125000 to 149999)	100.0
mdhi150t199 (150000 to 199999)	100.0
mdhhi200pl (200000 to 9999999)	100.0
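
The two derived tables above can be reproduced from the bin counts with a running sum. The snippet below is a minimal sketch of that arithmetic; the variable names are illustrative and are not taken from the Median class.

```python
from itertools import accumulate

# Estimated household counts per income bin, in ascending bin order
# (copied from the example table; every bin above 20000-24999 is 0).
counts = [20.0, 0.0, 12.0, 9.0] + [0.0] * 12

# Running total of counts up to and including each bin.
cumulative_counts = list(accumulate(counts))

# The same distribution expressed as a percent of the total count N.
total = cumulative_counts[-1]
cumulative_percents = [100 * c / total for c in cumulative_counts]

print(cumulative_counts[:4])  # [20.0, 20.0, 32.0, 41.0]
print(cumulative_percents[:4])
```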

Calculating median estimates from binned data occurs in the median method of the Median class. This method calculates the median estimate by:

1. Calculating the sum of all counts (N) across all input bins
2. Using the cumulative distribution of counts within bins to identify which bin contains N/2
3. Using linear interpolation to estimate where within the bin identified in step 2 the median lies
   - First, the difference between N/2 and the total count in all lower bins represents how far into the median-containing bin the median lies.
   - The median is then the lower boundary of that bin, plus that difference times the width of the bin divided by the count in that bin.

Median = (Lower boundary of the median-containing bin) 
          + (N/2 - (Total count in all bins below median-containing bin)) 
            * (Difference between min and max value of the median-containing group) / (Count within the median-containing group)
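
As a rough illustration of the formula above, the following sketch applies it to the first four bins of the example tract. The function name `interpolated_median` is hypothetical, and two behaviours are assumptions rather than documented features of the Median class: returning `None` when all bins are empty, and picking the first non-empty bin whose cumulative count reaches N/2.

```python
def interpolated_median(bins, counts):
    """Estimate a median from binned counts by linear interpolation.

    bins   -- list of (min_value, max_value) tuples in ascending order
    counts -- estimated count in each bin, in the same order as bins
    """
    n = sum(counts)
    if n == 0:
        return None  # assumption: no median when every bin is empty

    half = n / 2
    below = 0.0  # total count in all bins below the current one
    for (low, high), count in zip(bins, counts):
        if count > 0 and below + count >= half:
            # The median lies (half - below) counts into this bin.
            return low + (half - below) * (high - low) / count
        below += count


bins = [(0, 9999), (10000, 14999), (15000, 19999), (20000, 24999)]
counts = [20.0, 0.0, 12.0, 9.0]
median = interpolated_median(bins, counts)
print(round(median, 2))  # 15208.29
```

Here N is 41, so N/2 is 20.5. The first two bins hold 20 households, which places the median 0.5 households into the 15000 to 19999 bin.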

For a video demonstration of median linear interpolation, see here

Top and bottom coding

Some medians undergo top- or bottom-coding, as described in the top_coding and bottom_coding sections of the median metadata.

If bottom_coding is True, medians falling within the bottom bin are set to the max value of the bottom bin. For example, if a geography's median household income is between 0 and 9999 based on the calculations described in the previous section, the median gets set to 9999.

Similarly, if top_coding is True, medians falling within the top bin are set to the min value of the top bin. For example, if a geography's median household income is above 200000 based on the calculations described in the previous section, the median gets set to 200000.
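
The clamping described above can be sketched as follows. The function and its parameters `clamp_bottom_bin` and `clamp_top_bin` are hypothetical stand-ins for the two metadata flags, and the three-bin layout is a simplified assumption made only to keep the example short.

```python
def apply_coding(median, bins, clamp_bottom_bin, clamp_top_bin):
    """Clamp a median that falls in the lowest or highest bin.

    bins -- list of (min_value, max_value) tuples in ascending order
    """
    bottom_max = bins[0][1]  # max value of the bottom bin, e.g. 9999
    top_min = bins[-1][0]    # min value of the top bin, e.g. 200000
    if clamp_bottom_bin and median <= bottom_max:
        return bottom_max
    if clamp_top_bin and median >= top_min:
        return top_min
    return median


# Simplified bins: only the bottom and top ranges match the example table.
coarse_bins = [(0, 9999), (10000, 199999), (200000, 9999999)]
print(apply_coding(4200.0, coarse_bins, True, True))    # 9999
print(apply_coding(350000.0, coarse_bins, True, True))  # 200000
print(apply_coding(52000.0, coarse_bins, True, True))   # 52000.0
```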

MOE calculation

Calculating 1 standard error interval around a 50% proportion

Margins of error for medians are estimated by calculating a 1 standard error interval around a 50% proportion estimate. First, the Median class calculates the standard error of a 50% proportion (se_50) as:

se_50 = (Design Factor) * ((93 / (7 * Base)) * 2500) ^ .5

where the Base is the sum of counts in all bins. Design factors are values that account for the fact that the ACS does not use a simple random sample. These values are a ratio of observed standard errors for a variable to the standard errors that would be obtained from a simple random sample of the same size, and come from the Census Bureau.

This standard error is added to and subtracted from 50, creating a 1SE interval around a 50% estimate (with boundaries p_lower and p_upper).
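
A minimal sketch of this step is shown below. It reads the formula with Base in the denominator, so that the standard error shrinks as the base grows; that reading is an assumption. The function name `proportion_interval` is hypothetical and the design factor of 1.5 is an invented value, not one published by the Census Bureau.

```python
import math


def proportion_interval(base, design_factor):
    """Return (se_50, p_lower, p_upper) for a 50% proportion estimate."""
    # 2500 is 50 * 50, the p * (100 - p) term for a 50% proportion.
    se_50 = design_factor * math.sqrt(93 / (7 * base) * 2500)
    return se_50, 50 - se_50, 50 + se_50


# base = 41 matches the example tract; 1.5 is a made-up design factor.
se_50, p_lower, p_upper = proportion_interval(base=41, design_factor=1.5)
print(round(se_50, 1), round(p_lower, 1), round(p_upper, 1))
```

With a base this small the interval is very wide, which is why tract-level medians often carry large margins of error.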

Comparing confidence interval boundaries to the cumulative distribution

Then, p_lower and p_upper are compared to a cumulative percent distribution (see above), cumm_dist, to determine which bins contain the boundaries for a 1SE interval around a 50% proportion. These bins are saved as lower_bin and upper_bin.

For both lower_bin and upper_bin, the next step is to get the following values using the cumulative percent distribution of all input bins:

A1: The min value of the bin
A2: The min value of the next highest bin
C1: The cumulative percentage of counts strictly less than A1 (the percent of counts in all bins below the one containing the boundary)
C2: The cumulative percentage of counts strictly less than A2 (the percent of counts in all bins up to and including the one containing the boundary)

Calculation of A1, A2, C1, C2 for a given p occurs in the method base_case.
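
The lookup can be sketched as below. This is not the actual base_case method: the name `base_case_sketch`, its signature, and the rule of picking the first bin whose cumulative percent reaches p are all assumptions made for illustration.

```python
def base_case_sketch(p, bins, cumulative_percents):
    """Return (A1, A2, C1, C2) for the bin that contains percent p.

    bins                -- list of (min_value, max_value), ascending
    cumulative_percents -- cumulative percent through each bin
    """
    n_bins = len(bins)
    # First bin whose cumulative percent reaches p (assumes p <= 100).
    i = next(k for k in range(n_bins) if cumulative_percents[k] >= p)
    a1 = bins[i][0]
    # The top bin has no next bin; reuse A1, as the document describes
    # for the case where the boundary falls in the top bin.
    a2 = bins[i + 1][0] if i + 1 < n_bins else a1
    c1 = cumulative_percents[i - 1] if i > 0 else 0.0
    c2 = cumulative_percents[i]
    return a1, a2, c1, c2


example_bins = [(0, 9999), (10000, 14999), (15000, 19999), (20000, 24999)]
example_percents = [100 * c / 41 for c in (20, 20, 32, 41)]
a1, a2, c1, c2 = base_case_sketch(60, example_bins, example_percents)
print(a1, a2)  # 15000 20000
```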

A1, A2, C1, and C2 get calculated relative to both lower_bin and upper_bin by calling base_case where _bin = lower_bin and then _bin = upper_bin. These calls happen in lower_bound and upper_bound methods, respectively.

There are several exceptions in which A1, A2, C1, and C2 do not follow the base case. To account for exceptions, the methods lower_bound and upper_bound subsequently modify the base case results according to the following:

lower_bin is the bottom bin: C1 of lower_bin is 0, and C2 of lower_bin is the percent of counts in the bottom bin.
lower_bin is the first bin with a count more than zero: A1 of lower_bin is 0, and A2 of lower_bin is the lower boundary of the second bin.
upper_bin is the top bin: A1 of upper_bin and A2 of upper_bin are both the lower boundary of the top bin.
upper_bin and lower_bin are both the first bin with a count more than zero: A1 of upper_bin is 0, and A2 of upper_bin is the lower boundary of the second bin.

Calculate a confidence interval around the median

Once A1, A2, C1, and C2 are set, the method get_bound converts these values into a boundary for the confidence interval around the median.

CI boundary = (p - C1) * (A2 - A1) / (C2 - C1) + A1

This equation is similar to the linear interpolation used in estimate calculation, but uses percent cumulative distributions rather than count cumulative distributions. In estimate calculation, we determined where within a given bin an estimate lies, assuming that all frequencies within that bin are uniformly distributed between the min and max values of the bin. If the median was in the bin 1000 to 1499, which contained 45 counts, we assumed that these 45 counts were evenly distributed between 1000 and 1499.

Estimating where within a bin the boundary for a median confidence interval lies is similar. We first identify which bin contains the percentage that is 1SE away from 50%. From there, we assume that the cumulative percentage of counts contained within that bin is evenly distributed between its two extremes. For example, if the bin 1000 to 1499 accounts for 30% to 40% of the cumulative counts, we assume that those 10% of total counts are evenly distributed between 1000 and 1499.

The various components of the CI boundary calculation are:

(p - C1): The difference between the 1SE boundary for the 50% proportion and the percent of counts that are in all bins below the one containing this boundary
(A2 - A1): The width of the bin containing the 1SE boundary for the 50% proportion CI
(C2 - C1): The percent of counts that are in the bin containing the 1SE boundary for the 50% proportion CI
(A1): The lowest value of the bin containing the 1SE boundary for the 50% proportion

The method get_bound is used to calculate both the upper and lower boundaries of the confidence interval. When calculating the lower boundary of the median confidence interval, p refers to p_lower, and A1, A2, C1, C2 are all in reference to p_lower. Similarly, when calculating the upper boundary of the median confidence interval, p refers to p_upper.
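
To make the interpolation concrete, here is a small sketch of the CI boundary equation applied to the 15000 to 19999 bin from the example tract. The function name `get_bound_sketch` and the value p = 60 are illustrative assumptions, and the sketch does not guard against C2 equal to C1 (an empty bin).

```python
def get_bound_sketch(p, a1, a2, c1, c2):
    """Linearly interpolate a CI boundary within a single bin."""
    return (p - c1) * (a2 - a1) / (c2 - c1) + a1


# The 15000-19999 bin spans 2000/41 % to 3200/41 % of the cumulative
# distribution (about 48.78% to 78.05%) in the example tract.
bound = get_bound_sketch(p=60, a1=15000, a2=20000, c1=2000 / 41, c2=3200 / 41)
print(round(bound, 2))  # 16916.67
```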

Use the confidence interval to calculate the median MOE

The median MOE is calculated from the median confidence interval determined above. This occurs in the median_moe class property.

MOE of the median = (Width of CI around the median) * 1.645 / 2
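
The conversion can be sketched as follows. Half the width of the 1SE interval is one standard error, and 1.645 is the z-score for a 90% confidence level. The function name `median_moe_sketch` and the interval boundaries are made up for illustration.

```python
def median_moe_sketch(lower_bound, upper_bound):
    """Convert a 1SE interval around the median into a 90% MOE."""
    return (upper_bound - lower_bound) * 1.645 / 2


moe = median_moe_sketch(lower_bound=14000.0, upper_bound=18000.0)
print(round(moe, 2))  # 3290.0
```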

In the following exceptions, the median MOE is set to NULL: