ClassifyJenksFisher - ObjectVision/GeoDMS GitHub Wiki

Classify functions ClassifyJenksFisher

syntax

definition

ClassifyJenksFisher(a, domain unit) results in a data item with class breaks, based on the method described in Fisher's Natural Breaks Classification complexity proof. The resulting values unit is the values unit of data item a, the resulting domain unit is the domain unit argument.

  • a: numeric data item to be classified
  • domain unit: determining the number of class breaks.

description

The Jenks Fisher classification method, is a fast algorithm that results in breaks that minimize the sum of the square deviations from the class means, also known as natural breaks. The self contained code with an example usage is: CalcNaturalBreaks

The same function can also be applied from the GUI by requesting the Palette Editor of a map layer and activate the Classify > JenksFisher classification.

The ClassifyJenksFisher results in a set of ClassBreaks that can be used in the classify function to classify a data item.

applies to

  • data item a with Numeric value type
  • domain unit with value type from group CanBeDomainUnit

example

The following example classifies the NrInh (number of inhabitants) attribute into 4 classes:

attribute<nrPersons> classifyJfNrInh (inh_4K) := ClassifyJenksFisher(NrInh, inh_4K);

The result contains 4 class break values that minimize within-class variance:

classifyJfNrInh
0
200
550
860

Table inh_4K, nr of rows = 4

Input data:

NrInh
550
1025
300
200
0
null
300
2
20
55
860
1025
1025
100
750

Table District, nr of rows = 15

algorithmic considerations

The GeoDMS implementation uses an optimized O(k × n × log(n)) algorithm, compared to the classical Fisher/Jenks dynamic programming approach which has O(k × n²) complexity. The key optimization relies on the "no crossing paths" property (also known as the Monge condition), which states that optimal class break indices are monotonically increasing. This allows a divide-and-conquer strategy that halves the search space at each recursion level.

equivalence to 1-D k-means

Fisher's Natural Breaks Classification is mathematically equivalent to optimal 1-D k-means clustering. Because values are sorted, optimal clusters must be contiguous intervals. The objective—minimizing within-class sum of squared deviations—is identical in both formulations. This equivalence means optimizations developed for 1-D k-means (such as the SMAWK algorithm by Aggarwal et al., 1987) are applicable here.

Monge property

The interval cost function C(a,b) = Σ wᵢ(vᵢ - μ)² satisfies the Monge inequality:

C(a,c) + C(b,d) ≤ C(a,d) + C(b,c)  for a ≤ b ≤ c ≤ d

This property ensures the DP transition matrix is totally monotone, formally justifying the divide-and-conquer optimization. While an O(k × n) algorithm is theoretically possible using SMAWK, the current O(k × n × log(n)) implementation provides excellent practical performance—classifying 7 million unique values into 15 classes takes approximately 20 seconds on a desktop PC.

preprocessing

The algorithm requires:

  • Sorting: O(n log n) to order values
  • Duplicate removal: values are made unique with weights representing occurrence frequency
  • Prefix sums: O(n) preprocessing enables O(1) computation of any interval's sum of squared deviations

For more details, see Fisher's Natural Breaks Classification complexity proof.

null handling

Null values in the input data item are ignored during classification. In the example above, the District table contains 15 rows but only 14 valid values are used to determine the class breaks.

see also

⚠️ **GitHub.com Fallback** ⚠️