ClassifyJenksFisher - ObjectVision/GeoDMS GitHub Wiki
Classify functions ClassifyJenksFisher
- ClassifyJenksFisher(a, domain unit)
ClassifyJenksFisher(a, domain unit) results in a data item with class breaks, based on the method described in Fisher's Natural Breaks Classification complexity proof. The resulting values unit is the values unit of data item a, the resulting domain unit is the domain unit argument.
- a: numeric data item to be classified
- domain unit: determining the number of class breaks.
The Jenks Fisher classification method, is a fast algorithm that results in breaks that minimize the sum of the square deviations from the class means, also known as natural breaks. The self contained code with an example usage is: CalcNaturalBreaks
The same function can also be applied from the GUI by requesting the Palette Editor of a map layer and activate the Classify > JenksFisher classification.
The ClassifyJenksFisher results in a set of ClassBreaks that can be used in the classify function to classify a data item.
- data item a with Numeric value type
- domain unit with value type from group CanBeDomainUnit
The following example classifies the NrInh (number of inhabitants) attribute into 4 classes:
attribute<nrPersons> classifyJfNrInh (inh_4K) := ClassifyJenksFisher(NrInh, inh_4K);
The result contains 4 class break values that minimize within-class variance:
| classifyJfNrInh |
|---|
| 0 |
| 200 |
| 550 |
| 860 |
Table inh_4K, nr of rows = 4
Input data:
| NrInh |
|---|
| 550 |
| 1025 |
| 300 |
| 200 |
| 0 |
| null |
| 300 |
| 2 |
| 20 |
| 55 |
| 860 |
| 1025 |
| 1025 |
| 100 |
| 750 |
Table District, nr of rows = 15
The GeoDMS implementation uses an optimized O(k × n × log(n)) algorithm, compared to the classical Fisher/Jenks dynamic programming approach which has O(k × n²) complexity. The key optimization relies on the "no crossing paths" property (also known as the Monge condition), which states that optimal class break indices are monotonically increasing. This allows a divide-and-conquer strategy that halves the search space at each recursion level.
Fisher's Natural Breaks Classification is mathematically equivalent to optimal 1-D k-means clustering. Because values are sorted, optimal clusters must be contiguous intervals. The objective—minimizing within-class sum of squared deviations—is identical in both formulations. This equivalence means optimizations developed for 1-D k-means (such as the SMAWK algorithm by Aggarwal et al., 1987) are applicable here.
The interval cost function C(a,b) = Σ wᵢ(vᵢ - μ)² satisfies the Monge inequality:
C(a,c) + C(b,d) ≤ C(a,d) + C(b,c) for a ≤ b ≤ c ≤ d
This property ensures the DP transition matrix is totally monotone, formally justifying the divide-and-conquer optimization. While an O(k × n) algorithm is theoretically possible using SMAWK, the current O(k × n × log(n)) implementation provides excellent practical performance—classifying 7 million unique values into 15 classes takes approximately 20 seconds on a desktop PC.
The algorithm requires:
- Sorting: O(n log n) to order values
- Duplicate removal: values are made unique with weights representing occurrence frequency
- Prefix sums: O(n) preprocessing enables O(1) computation of any interval's sum of squared deviations
For more details, see Fisher's Natural Breaks Classification complexity proof.
Null values in the input data item are ignored during classification. In the example above, the District table contains 15 rows but only 14 valid values are used to determine the class breaks.
- ClassifyNonzeroJenksFisher - variant that excludes zero values from classification
- classify - apply the resulting class breaks to categorize data
- Fisher's Natural Breaks Classification complexity proof - mathematical background and proof