ClassifyJenksFisher - ObjectVision/GeoDMS GitHub Wiki

syntax

ClassifyJenksFisher(a, domain unit)

definition

ClassifyJenksFisher(a, domain unit) results in a data item with class breaks, based on the method described in Fisher's Natural Breaks Classification complexity proof. The resulting values unit is the values unit of data item a, the resulting domain unit is the domain unit argument.

a: numeric data item to be classified
domain unit: unit whose number of entries determines the number of class breaks.

description

The Jenks Fisher classification method is a fast algorithm that results in breaks that minimize the sum of the squared deviations from the class means, also known as natural breaks. Self-contained code with an example usage is available in CalcNaturalBreaks.

The same function can also be applied from the GUI by opening the Palette Editor of a map layer and activating the Classify > JenksFisher classification.

ClassifyJenksFisher results in a set of class breaks that can be used in the classify function to classify a data item.
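
How a set of class breaks turns into class numbers can be sketched in Python (not GeoDMS syntax). The function name `classify_values` is hypothetical, and the treatment of nulls and of values below the first break is an assumption of this sketch rather than documented behaviour of the classify function:

```python
from bisect import bisect_right

def classify_values(values, breaks):
    """Return a 0-based class index per value, given ascending lower-bound breaks.

    Assumption of this sketch: a null (None) or a value below the first
    break gets no class (None).
    """
    result = []
    for v in values:
        if v is None or v < breaks[0]:
            result.append(None)
        else:
            # index of the last break that is <= v
            result.append(bisect_right(breaks, v) - 1)
    return result

breaks = [0, 200, 550, 860]  # break values from the example on this page
print(classify_values([0, 100, 200, 300, 550, 750, 860, 1025, None], breaks))
# prints [0, 0, 1, 1, 2, 2, 3, 3, None]
```

Each break acts as the inclusive lower bound of its class, which matches the statement further down that the break values are the minimum values of each class.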

applies to

data item a with Numeric value type
domain unit with value type from group CanBeDomainUnit

example

The following example classifies the NrInh (number of inhabitants) attribute into 4 classes:

attribute<nrPersons> classifyJfNrInh (inh_4K) := ClassifyJenksFisher(NrInh, inh_4K);

The result contains 4 class break values that minimize within-class variance:

classifyJfNrInh
0
200
550
860

Table inh_4K, nr of rows = 4

Input data:

NrInh
550
1025
300
200
0
null
300
2
20
55
860
1025
1025
100
750

Table District, nr of rows = 15

algorithmic considerations

The GeoDMS implementation follows the optimized O(k × m × log(m)) dynamic programming algorithm described in Fisher's Natural Breaks Classification complexity proof, where m is the number of unique values and k the number of classes; the classical Fisher/Jenks approach requires O(k × m²) time. The key optimization relies on the "no crossing paths" property (related to the Monge condition): optimal class break indices are monotonically non-decreasing, which allows a divide-and-conquer strategy that halves the search space at each recursion level. The implementation resides in struct JenksFisher in CalcClassBreaks.cpp; CalcNaturalBreaks presents a self-contained version.

preprocessing

The n input values are aggregated into a sorted table of m ≤ n strictly increasing unique values with their occurrence frequencies as weights (value-count pairs). This table is built tile-parallel: chunks of at most 1024 values are sorted, the resulting sub-tables are pairwise merged, and multiple tiles are processed by multiple threads; a data item already flagged as sorted is counted in a single sequential scan. This preprocessing requires O(n × log(n)) time.
From this table, cumulative weights W and cumulative weighted values WV are precomputed as prefix sums, so that the weighted square of the mean of any interval of values, ssm(b..e) = (WV[e] − WV[b−1])² / (W[e] − W[b−1]), is obtained in constant time.
When the same classification is requested from the Palette Editor of the GUI, the value-count table is first capped at 4096 pairs by repeatedly aggregating adjacent pairs, trading exactness for responsiveness. The ClassifyJenksFisher operator itself always uses the exact table.
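
The table and prefix sums described above can be illustrated with a small Python sketch. It is sequential and ignores the tiling, merging and threading of the actual implementation. The names `value_count_table` and `prefix_sums` are hypothetical, and a leading zero is added to W and WV so that index b − 1 is valid for b = 1:

```python
from collections import Counter
from itertools import accumulate

def value_count_table(data):
    """Sorted (value, count) pairs of the non-null inputs."""
    counts = Counter(v for v in data if v is not None)
    return sorted(counts.items())

def prefix_sums(table):
    """Cumulative weights W and cumulative weighted values WV (W[0] = WV[0] = 0)."""
    W = [0] + list(accumulate(c for _, c in table))
    WV = [0] + list(accumulate(v * c for v, c in table))
    return W, WV

def ssm(W, WV, b, e):
    """Class weight times squared class mean for unique values b..e (1-based, inclusive)."""
    return (WV[e] - WV[b - 1]) ** 2 / (W[e] - W[b - 1])

# NrInh values from the example on this page
nr_inh = [550, 1025, 300, 200, 0, None, 300, 2, 20, 55, 860, 1025, 1025, 100, 750]
table = value_count_table(nr_inh)
W, WV = prefix_sums(table)
print(len(table), W[-1])    # 11 unique values, 14 non-null inputs
print(ssm(W, WV, 1, 5))     # 177**2 / 5 for the values 0, 2, 20, 55, 100
```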

maximizing SSM rather than minimizing SSD

The total weighted sum of squared values SSV does not depend on the chosen partitioning, so minimizing the sum of squared deviations SSD = SSV − WSM is equivalent to maximizing WSM, the sum over all classes of the class weight times the squared class mean. The implementation works exclusively with this maximization: writing SSM_i,j for the maximal such sum when the first i unique values are divided over j classes, it evaluates the dynamic programming recurrence

$$SSM_{i,j} := \max\limits_{p \in \{j..i\}} \left( SSM_{p-1,\,j-1} + ssm(\{p..i\}) \right)$$

where the start row SSM_i,1 = ssm({1..i}) is filled directly from the prefix sums.
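
As a reference point, the recurrence can be evaluated directly. The Python sketch below is the plain O(k × m²) evaluation, not the optimized GeoDMS code: it keeps full tables instead of two rows and assumes 1 ≤ k ≤ m. The name `natural_breaks` and the table layout are choices made for this illustration only:

```python
def natural_breaks(table, k):
    """Class breaks (minimum value per class) for sorted (value, count) pairs."""
    m = len(table)
    W, WV = [0], [0]
    for v, c in table:
        W.append(W[-1] + c)
        WV.append(WV[-1] + v * c)

    def ssm(b, e):  # unique values b..e, 1-based and inclusive
        return (WV[e] - WV[b - 1]) ** 2 / (W[e] - W[b - 1])

    # best[j][i]: maximal sum of ssm terms for the first i values in j classes
    # cb[j][i]:   index p of the first value of the last class in that optimum
    best = [[float("-inf")] * (m + 1) for _ in range(k + 1)]
    cb = [[0] * (m + 1) for _ in range(k + 1)]
    for i in range(1, m + 1):
        best[1][i], cb[1][i] = ssm(1, i), 1
    for j in range(2, k + 1):
        for i in range(j, m + 1):
            for p in range(j, i + 1):
                cand = best[j - 1][p - 1] + ssm(p, i)
                if cand > best[j][i]:
                    best[j][i], cb[j][i] = cand, p

    # walk back from (m, k) to collect the first value of every class
    breaks, i = [], m
    for j in range(k, 0, -1):
        p = cb[j][i]
        breaks.append(table[p - 1][0])
        i = p - 1
    return breaks[::-1]

# value-count pairs of the NrInh example on this page
table = [(0, 1), (2, 1), (20, 1), (55, 1), (100, 1), (200, 1),
         (300, 2), (550, 1), (750, 1), (860, 1), (1025, 3)]
print(natural_breaks(table, 4))  # prints [0, 200, 550, 860]
```

The result agrees with the break values listed in the example section.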

row-wise evaluation and memory layout

The rows j = 2..k−1 of the recurrence are evaluated one after the other:

Only two rows of SSM values are kept in memory, the previous and the current one, swapped after each completed row. Since only entries with j ≤ i and i − j ≤ m − k can be part of an optimal solution, each row buffer holds just m − k + 1 entries.
To reconstruct all class breaks afterwards, the break indices CB_i,j of the intermediate rows are all stored: a matrix of (k − 2) × (m − k + 1) indices. (The proof mentions re-applying the algorithm k times as a memory-saving alternative; the implementation prefers the matrix.)
For the final row j = k only the single entry CB_m,k is needed. It is found with one linear scan, after which all class breaks are collected by walking the stored matrix backwards. The resulting break values are the minimum values of each class; the first break is the overall minimum value.

divide and conquer per row

Within a row, the optimal break index of the middle element of an index range is found by a linear scan over its candidate interval. The no-crossing-paths property then restricts the candidates for the left half of the range to those up to and including the found break, and the candidates for the right half to those from the found break onward. Applying this recursively computes a whole row in O(m × log(m)) time, giving O(k × m × log(m)) in total, as derived in the proof.
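
The recursion can be sketched in Python as follows. This is an illustration under simplifying assumptions, not a transcription of struct JenksFisher: `fill_row` and `natural_breaks_dc` are hypothetical names, full-length rows are used instead of the reduced m − k + 1 buffers, the last row is computed completely, and 1 ≤ k ≤ m is assumed:

```python
def fill_row(prev, ssm, j, m):
    """Compute row j from row j-1 by divide and conquer.

    prev[i] is the best value for the first i unique values in j-1 classes.
    Returns (cur, cb); both are meaningful for i in j..m.
    """
    cur = [float("-inf")] * (m + 1)
    cb = [0] * (m + 1)

    def solve(i_lo, i_hi, p_lo, p_hi):
        if i_lo > i_hi:
            return
        mid = (i_lo + i_hi) // 2
        best_p = p_lo
        # linear scan over the candidate breaks of the middle element
        for p in range(p_lo, min(p_hi, mid) + 1):
            cand = prev[p - 1] + ssm(p, mid)
            if cand > cur[mid]:
                cur[mid], best_p = cand, p
        cb[mid] = best_p
        solve(i_lo, mid - 1, p_lo, best_p)   # left half: breaks <= best_p
        solve(mid + 1, i_hi, best_p, p_hi)   # right half: breaks >= best_p

    solve(j, m, j, m)
    return cur, cb

def natural_breaks_dc(table, k):
    """Class breaks for sorted (value, count) pairs, one fill_row call per row."""
    m = len(table)
    W, WV = [0], [0]
    for v, c in table:
        W.append(W[-1] + c)
        WV.append(WV[-1] + v * c)

    def ssm(b, e):  # unique values b..e, 1-based and inclusive
        return (WV[e] - WV[b - 1]) ** 2 / (W[e] - W[b - 1])

    prev = [0.0] + [ssm(1, i) for i in range(1, m + 1)]  # start row j = 1
    cbs = {}
    for j in range(2, k + 1):
        prev, cbs[j] = fill_row(prev, ssm, j, m)

    breaks, i = [], m
    for j in range(k, 1, -1):                # walk the stored break indices back
        p = cbs[j][i]
        breaks.append(table[p - 1][0])
        i = p - 1
    breaks.append(table[0][0])               # first break: overall minimum
    return breaks[::-1]

table = [(0, 1), (2, 1), (20, 1), (55, 1), (100, 1), (200, 1),
         (300, 2), (550, 1), (750, 1), (860, 1), (1025, 3)]
print(natural_breaks_dc(table, 4))  # prints [0, 200, 550, 860]
```

Each recursion level scans every candidate break at most a constant number of times, and the index range halves per level, which is where the log(m) factor comes from.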

equivalence to 1-D k-means and the Monge property

Fisher's Natural Breaks Classification is mathematically equivalent to optimal weighted 1-D k-means clustering. Because values are sorted, optimal clusters must be contiguous intervals, and the objective—minimizing the within-class sum of squared deviations—is identical in both formulations. The interval cost function C(a,b) = Σ wᵢ(vᵢ - μ)² satisfies the Monge inequality (quadrangle inequality):

C(a,c) + C(b,d) ≤ C(a,d) + C(b,c)  for a ≤ b ≤ c ≤ d

This is the mirror image of the inequality derived in the proof for the maximized ssm terms, from which the no-crossing-paths property follows; it makes the DP transition matrix totally monotone. Consequently, a theoretical O(k × m) algorithm is possible using the SMAWK algorithm (see Grønlund et al., 2017, arXiv:1701.07204). The current O(k × m × log(m)) implementation already provides excellent practical performance: classifying 7 million unique values into 15 classes took approximately 20 seconds on a desktop PC when the algorithm was first published.
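
The quadrangle inequality can be checked numerically on small inputs. The Python sketch below is a brute-force check, not a proof and not part of the implementation. The function names are hypothetical, indices are 0-based, and a small relative tolerance absorbs floating-point rounding:

```python
from itertools import combinations_with_replacement

def interval_cost(values, weights, a, b):
    """Weighted sum of squared deviations of values[a..b] (0-based, inclusive)."""
    vs, ws = values[a:b + 1], weights[a:b + 1]
    mean = sum(w * v for w, v in zip(ws, vs)) / sum(ws)
    return sum(w * (v - mean) ** 2 for w, v in zip(ws, vs))

def satisfies_quadrangle_inequality(values, weights):
    """Check C(a,c) + C(b,d) <= C(a,d) + C(b,c) for all a <= b <= c <= d."""
    def cost(a, b):
        return interval_cost(values, weights, a, b)
    # combinations_with_replacement yields non-decreasing index tuples
    for a, b, c, d in combinations_with_replacement(range(len(values)), 4):
        lhs = cost(a, c) + cost(b, d)
        rhs = cost(a, d) + cost(b, c)
        if lhs > rhs + 1e-9 * max(1.0, rhs):
            return False
    return True

# sorted unique NrInh values and their counts from the example on this page
values = [0, 2, 20, 55, 100, 200, 300, 550, 750, 860, 1025]
weights = [1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3]
print(satisfies_quadrangle_inequality(values, weights))  # prints True
```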

A further improvement suggested in the proof—CB_i,j ≤ CB_i,j+1, so that the breaks of the previous row can serve as lower bounds for the scans in the next row—is present in the source code behind the compile-time flag MG_ASSUME_CB_INC, but currently not enabled; debug builds verify the underlying assumption with an assertion instead.

degenerate cases and variants

If k ≥ m, each unique value can be given its own class and the breaks are the unique values themselves (padded with the highest value if k > m), as in ClassifyUniqueValues; no dynamic programming is performed.
For k = 1, the single break is the minimum value.
ClassifyNonzeroJenksFisher classifies negative and positive values separately with a compulsory class break at 0: it runs the dynamic program for every feasible division of the k classes over both sides and keeps the division with the maximal total WSM.
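
The k ≥ m shortcut from the first item above is simple enough to state as code. The Python sketch below uses the hypothetical name `degenerate_breaks` and assumes a non-empty, sorted list of unique values:

```python
def degenerate_breaks(unique_values, k):
    """Breaks for k >= m: one class per unique value, padded with the highest value."""
    m = len(unique_values)
    if k < m:
        raise ValueError("k < m: the dynamic program is needed")
    return list(unique_values) + [unique_values[-1]] * (k - m)

print(degenerate_breaks([1, 5], 4))  # prints [1, 5, 5, 5]
```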

For the mathematical background and proofs, see Fisher's Natural Breaks Classification complexity proof.

null handling

Null values in the input data item are ignored during classification. In the example above, the District table contains 15 rows but only 14 valid values are used to determine the class breaks.
