Median calculations - NYCPlanning/db-factfinder GitHub Wiki
Medians
Methods for calculating the median estimate and MOE for a given geography are in the `Median` class.
Estimate calculation
Median estimates are calculated from count estimates of binned data. For example, median household income estimates are calculated from the estimated counts of households with incomes in various ranges (under 10k, 10-14k, 15-19k, etc.).
Below is an example of tract-level input variable estimates for median household income:
Input variable (range) | Estimate of count in range |
---|---|
mdhhiu10 (0 to 9999) | 20.0 |
mdhhi10t14 (10000 to 14999) | 0.0 |
mdhhi15t19 (15000 to 19999) | 12.0 |
mdhhi20t24 (20000 to 24999) | 9.0 |
mdhhi25t29 (25000 to 29999) | 0.0 |
mdhhi30t34 (30000 to 34999) | 0.0 |
mdhhi35t39 (35000 to 39999) | 0.0 |
mdhhi40t44 (40000 to 44999) | 0.0 |
mdhhi45t49 (45000 to 49999) | 0.0 |
mdhhi50t59 (50000 to 59999) | 0.0 |
mdhhi60t74 (60000 to 74999) | 0.0 |
mdhhi75t99 (75000 to 99999) | 0.0 |
mdhi100t124 (100000 to 124999) | 0.0 |
mdhi125t149 (125000 to 149999) | 0.0 |
mdhi150t199 (150000 to 199999) | 0.0 |
mdhhi200pl (200000 to 9999999) | 0.0 |
This, in turn, corresponds with a cumulative count distribution of:
Input variable (range) | Cumulative count |
---|---|
mdhhiu10 (0 to 9999) | 20.0 |
mdhhi10t14 (10000 to 14999) | 20.0 |
mdhhi15t19 (15000 to 19999) | 32.0 |
mdhhi20t24 (20000 to 24999) | 41.0 |
mdhhi25t29 (25000 to 29999) | 41.0 |
mdhhi30t34 (30000 to 34999) | 41.0 |
mdhhi35t39 (35000 to 39999) | 41.0 |
mdhhi40t44 (40000 to 44999) | 41.0 |
mdhhi45t49 (45000 to 49999) | 41.0 |
mdhhi50t59 (50000 to 59999) | 41.0 |
mdhhi60t74 (60000 to 74999) | 41.0 |
mdhhi75t99 (75000 to 99999) | 41.0 |
mdhi100t124 (100000 to 124999) | 41.0 |
mdhi125t149 (125000 to 149999) | 41.0 |
mdhi150t199 (150000 to 199999) | 41.0 |
mdhhi200pl (200000 to 9999999) | 41.0 |
And a cumulative percent distribution of:
Input variable (range) | Cumulative percent |
---|---|
mdhhiu10 (0 to 9999) | 48.78048780487805 |
mdhhi10t14 (10000 to 14999) | 48.78048780487805 |
mdhhi15t19 (15000 to 19999) | 78.04878048780488 |
mdhhi20t24 (20000 to 24999) | 100.0 |
mdhhi25t29 (25000 to 29999) | 100.0 |
mdhhi30t34 (30000 to 34999) | 100.0 |
mdhhi35t39 (35000 to 39999) | 100.0 |
mdhhi40t44 (40000 to 44999) | 100.0 |
mdhhi45t49 (45000 to 49999) | 100.0 |
mdhhi50t59 (50000 to 59999) | 100.0 |
mdhhi60t74 (60000 to 74999) | 100.0 |
mdhhi75t99 (75000 to 99999) | 100.0 |
mdhi100t124 (100000 to 124999) | 100.0 |
mdhi125t149 (125000 to 149999) | 100.0 |
mdhi150t199 (150000 to 199999) | 100.0 |
mdhhi200pl (200000 to 9999999) | 100.0 |
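As a minimal sketch of how the two cumulative tables above follow from the raw counts, the snippet below recomputes them in plain Python. The variable names are illustrative assumptions and are not taken from the `Median` class.

```python
from itertools import accumulate

# Counts per bin in ascending bin order, copied from the first table above.
counts = [20.0, 0.0, 12.0, 9.0] + [0.0] * 12

# Running total of counts across bins (the cumulative count distribution).
cumulative_counts = list(accumulate(counts))

# Each running total as a percentage of the overall total
# (the cumulative percent distribution).
total = cumulative_counts[-1]
cumulative_percents = [100 * c / total for c in cumulative_counts]

print(cumulative_counts[:4])  # [20.0, 20.0, 32.0, 41.0]
print(cumulative_percents[:4])
```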
Median estimates are calculated from binned data in the `median` method of the `Median` class. This method calculates the median estimate by:
1. Calculating the sum of all counts (N) across all input bins
2. Using the cumulative distribution of counts within bins to identify which bin contains N/2
3. Using linear interpolation to estimate where within the bin identified in step 2 the median lies
   - First, the difference between N/2 and the total count in all lower bins represents how far into the median-containing bin the median lies.
   - The median is then the lower boundary of that bin, plus that difference times the width of the bin divided by the count in that bin.
Median = (Lower boundary of the median-containing bin)
+ (N/2 - (Total count in all bins below median-containing bin))
* (Difference between min and max value of the median-containing group) / (Count within the median-containing group)
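The formula above can be sketched as a small standalone function. This is an illustration under stated assumptions, not the code of the `median` method: the function name `interpolated_median` and its arguments are hypothetical, and it assumes the median-containing bin is the first non-empty bin whose cumulative count reaches N/2.

```python
def interpolated_median(bins, counts):
    """Linear-interpolation median from binned counts.

    bins:   list of (min, max) tuples in ascending order
    counts: list of counts, one per bin
    Hypothetical helper; returns None when there are no counts at all.
    """
    half = sum(counts) / 2
    below = 0.0  # total count in all bins below the current one
    for (low, high), count in zip(bins, counts):
        # Assumption: the median bin is the first non-empty bin whose
        # cumulative count reaches N/2.
        if count > 0 and below + count >= half:
            return low + (half - below) * (high - low) / count
        below += count
    return None


# First four bins of the tract-level example; the remaining bins are empty.
bins = [(0, 9999), (10000, 14999), (15000, 19999), (20000, 24999)]
counts = [20.0, 0.0, 12.0, 9.0]
median = interpolated_median(bins, counts)
print(round(median, 2))  # 15208.29
```

Here N/2 is 20.5, which falls in the 15000 to 19999 bin, so the median is 15000 + (20.5 - 20) * 4999 / 12.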
For a video demonstration of median linear interpolation, see here
Top and bottom coding
Some medians undergo top- or bottom-coding, as described in the `top_coding` and `bottom_coding` sections of the median metadata.
If `top_coding` is True, medians falling within the bottom bin are set to the max value of the bottom bin. For example, if a geography's median household income is between 0 and 9999 based on the calculations described in the previous section, the median gets set to 9999.
Similarly, if `bottom_coding` is True, medians falling within the top bin are set to the min value of the top bin. For example, if a geography's median household income is above 200000 based on the calculations described in the previous section, the median gets set to 200000.
MOE calculation
Calculating 1 standard error interval around a 50% proportion
Margins of error for medians are estimated by calculating a 1 standard error interval around a 50% proportion estimate. First, the `Median` class calculates the standard error of a 50% proportion (`se_50`) as:
se_50 = (Design Factor) * ((93 / (7 * Base)) * 2500) ^ 0.5
where the Base is the sum of counts in all bins. Design factors are values that account for the fact that the ACS does not use a simple random sample. These values are a ratio of observed standard errors for a variable to the standard errors that would be obtained from a simple random sample of the same size, and come from the Census Bureau.
This standard error is added to and subtracted from 50, creating a 1SE interval around a 50% estimate (with boundaries `p_lower` and `p_upper`).
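A minimal sketch of this step follows. It reads the standard error formula as Design Factor times the square root of (93 / (7 * Base)) * 50^2, so that the standard error shrinks as the base grows; that reading is an assumption. The function name and the design factor value of 1.5 are hypothetical and chosen only for illustration.

```python
import math


def se_of_50_percent(design_factor, base):
    """Standard error of a 50% proportion (hypothetical helper).

    Assumes the reading DF * sqrt((93 / (7 * base)) * 50**2).
    """
    return design_factor * math.sqrt((93 / (7 * base)) * 50**2)


design_factor = 1.5  # hypothetical value; real design factors come from the Census Bureau
base = 41.0          # sum of counts in all bins of the tract-level example

se_50 = se_of_50_percent(design_factor, base)
p_lower = 50 - se_50  # lower boundary of the 1SE interval around 50%
p_upper = 50 + se_50  # upper boundary of the 1SE interval around 50%
print(p_lower, p_upper)
```

With a base as small as 41, the interval is very wide, which is expected for a tract with few households.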
Comparing confidence interval boundaries to the cumulative distribution
Then, `p_lower` and `p_upper` are compared to a cumulative percent distribution (see above), `cumm_dist`, to determine which bins contain the boundaries for a 1SE interval around a 50% proportion. These bins are saved as `lower_bin` and `upper_bin`.
For both `lower_bin` and `upper_bin`, the next step is to get the following values using the cumulative percent distribution of all input bins:
- A1: The min value of the bin
- A2: The min value of the next highest bin
- C1: The cumulative percentage of counts strictly less than A1 (total counts in bins up to the one containing the boundary)
- C2: The cumulative percentage of counts strictly less than A2 (total counts in bins up to and including the one containing the boundary)
Calculation of A1, A2, C1, C2 for a given p occurs in the method `base_case`.
A1, A2, C1, and C2 get calculated relative to both `lower_bin` and `upper_bin` by calling `base_case` where `_bin = lower_bin` and then `_bin = upper_bin`. These calls happen in the `lower_bound` and `upper_bound` methods, respectively.
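The base case can be sketched as two small functions: one that locates the bin containing a given percent, and one that reads A1, A2, C1, and C2 for that bin. Both functions (`find_bin`, `base_case_sketch`) are hypothetical stand-ins rather than the class's own methods, and the rule that a boundary belongs to the first bin whose cumulative percent is at least p is an assumption.

```python
def find_bin(p, cumulative_percents):
    """Index of the first bin whose cumulative percent reaches p.

    Assumption: ties go to the lower bin (comparison uses >=).
    """
    for i, c in enumerate(cumulative_percents):
        if c >= p:
            return i
    return len(cumulative_percents) - 1


def base_case_sketch(i, bin_mins, cumulative_percents):
    """A1, A2, C1, C2 for an interior bin i (neither bottom nor top bin)."""
    if i <= 0 or i >= len(bin_mins) - 1:
        raise ValueError("bottom and top bins do not follow the base case")
    a1 = bin_mins[i]                 # min value of the bin
    a2 = bin_mins[i + 1]             # min value of the next highest bin
    c1 = cumulative_percents[i - 1]  # cumulative percent below A1
    c2 = cumulative_percents[i]      # cumulative percent below A2
    return a1, a2, c1, c2


# First five bins of the tract-level example.
bin_mins = [0, 10000, 15000, 20000, 25000]
counts = [20.0, 0.0, 12.0, 9.0, 0.0]
total = sum(counts)
cumulative_percents = []
running = 0.0
for c in counts:
    running += c
    cumulative_percents.append(100 * running / total)

i = find_bin(60.0, cumulative_percents)  # p = 60 is an arbitrary illustrative value
a1, a2, c1, c2 = base_case_sketch(i, bin_mins, cumulative_percents)
print(i, a1, a2)  # 2 15000 20000
```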
There are several exceptions in which A1, A2, C1, and C2 do not follow the base case. To account for these exceptions, the methods `lower_bound` and `upper_bound` subsequently modify the base case results according to the following:
- `lower_bin` is the bottom bin: C1 of `lower_bin` is 0, and C2 of `lower_bin` is the percent of counts in the lowest bin.
- `lower_bin` is the first bin with a count more than zero: A1 of `lower_bin` is 0, and A2 of `lower_bin` is the lower boundary of the second bin.
- `upper_bin` is the top bin: A1 of `upper_bin` and A2 of `upper_bin` are both the lower boundary of the top bin.
- `upper_bin` and `lower_bin` are both the first bin with a count more than zero: A1 of `upper_bin` is 0, and A2 of `upper_bin` is the lower boundary of the second bin.
Calculate a confidence interval around the median
Once A1, A2, C1, and C2 are set, the method `get_bound` converts these values into a boundary for the confidence interval around the median.
CI boundary = (p - C1) * (A2 - A1) / (C2 - C1) + A1
This equation is similar to the linear interpolation used in estimate calculation, but uses percent cumulative distributions rather than count cumulative distributions. In estimate calculation, we determined where within a given bin an estimate lies, assuming that all frequencies within that bin are uniformly distributed between the min and max values of the bin. If the median was in the bin 1000 to 1499, which contained 45 counts, we assumed that these 45 counts were evenly distributed between 1000 and 1499.
Estimating where within a bin the boundary of a median confidence interval lies is similar. We first identify which bin contains the percent that is 1SE away from 50%. From there, we assume that the cumulative percentage of counts contained within that bin is evenly distributed between its two extremes. For example, if the bin 1000 to 1499 accounts for the 30% to 40% range of the cumulative counts, we assume that those 10% of total counts are evenly distributed between 1000 and 1499.
The various components of the CI boundary calculation are:
- (p - C1): The difference between the 1SE boundary for the 50% proportion and the percent of counts that are in all bins below the one containing this boundary
- (A2 - A1): The width of the bin containing the 1SE boundary for the 50% proportion CI
- (C2 - C1): The percent of counts that are in the bin containing the 1SE boundary for the 50% proportion
- (A1): The lowest value of the bin containing the 1SE boundary for the 50% proportion
The method `get_bound` is used to calculate both the upper and lower boundaries of the confidence interval. When calculating the lower boundary of the median confidence interval, p refers to `p_lower`, and A1, A2, C1, and C2 are all in reference to `p_lower`. Similarly, when calculating the upper boundary of the median confidence interval, p refers to `p_upper`.
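The CI boundary equation can be sketched as a single function. The name `interpolate_bound` is a hypothetical stand-in for illustration, and the guard against an empty bin (C2 equal to C1) is an assumption added here, not something the source describes. The inputs reuse the 15000 to 19999 bin of the tract-level example with an arbitrary p of 60.

```python
def interpolate_bound(p, a1, a2, c1, c2):
    """CI boundary = (p - C1) * (A2 - A1) / (C2 - C1) + A1."""
    if c2 == c1:
        # Assumption: an empty bin cannot be interpolated within.
        raise ValueError("bin holds no counts; cannot interpolate")
    return (p - c1) * (a2 - a1) / (c2 - c1) + a1


a1, a2 = 15000, 20000                  # min of the bin, min of the next bin
c1, c2 = 100 * 20 / 41, 100 * 32 / 41  # cumulative percents below A1 and A2
bound = interpolate_bound(60.0, a1, a2, c1, c2)
print(round(bound, 2))  # 16916.67
```

When p equals C1 the boundary is A1, and when p equals C2 it is A2, which is what uniform spreading of the bin's share of counts implies.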
Use the confidence interval to calculate the median MOE
The median MOE is calculated from the median confidence interval determined above. This occurs in the `median_moe` class property.
MOE of the median = (Width of CI around the median) * 1.645 / 2
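As a short sketch of this last step: half the width of the 1SE interval acts as a standard error for the median, and 1.645 is the standard normal multiplier for a 90% confidence level. The function name and the interval boundaries below are hypothetical values chosen for illustration.

```python
Z_90 = 1.645  # multiplier for a 90% confidence level


def moe_from_ci(lower_bound, upper_bound):
    """MOE of the median from the 1SE confidence interval around it."""
    return (upper_bound - lower_bound) * Z_90 / 2


# Hypothetical CI boundaries around a median.
moe = moe_from_ci(14000.0, 17000.0)
print(moe)
```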
In the following exceptions, the median MOE is set to NULL: