Analyzers: Kolmogorov Smirnov Test for two samples - lanit-tercom-school/analyzeme GitHub Wiki



Suppose we have two samples. The first sample X_1,...,X_m of size m has distribution function F(x), the second sample Y_1,...,Y_n of size n has distribution function G(x), and we want to test whether F = G. Let F_m(x) and G_n(x) be the corresponding empirical distribution functions. We define the statistic D_nm:

D_nm = sqrt(n * m / (n + m)) * sup_x |F_m(x) - G_n(x)|

The hypothesis that F = G is rejected at level a if D_nm > c(a), where the value of c(a) for each level a is taken from the table below:

| a    | 0.10 | 0.05 | 0.025 | 0.01 | 0.005 | 0.001 |
|------|------|------|-------|------|-------|-------|
| c(a) | 1.22 | 1.36 | 1.48  | 1.63 | 1.73  | 1.95  |
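
The tabulated values agree with the well-known asymptotic approximation c(a) = sqrt(-ln(a / 2) / 2). The Python sketch below stores the table and checks it against that approximation. The names `CRITICAL_VALUES` and `critical_value` are illustrative assumptions and are not taken from the analyzeme code base.

```python
import math

# Critical values c(a) copied from the table above.
CRITICAL_VALUES = {
    0.10: 1.22,
    0.05: 1.36,
    0.025: 1.48,
    0.01: 1.63,
    0.005: 1.73,
    0.001: 1.95,
}


def critical_value(a):
    """Asymptotic approximation c(a) = sqrt(-ln(a / 2) / 2) (hypothetical helper)."""
    return math.sqrt(-math.log(a / 2.0) / 2.0)


# Each tabulated value matches the approximation up to two-decimal rounding.
for a, c in CRITICAL_VALUES.items():
    assert abs(critical_value(a) - c) < 0.005
```

The approximation may be convenient when a level outside the table is needed, though the table itself is what the rejection rule above refers to.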


The purpose of the test is to check whether two data sets differ enough to conclude that they come from different distributions.

Use cases

An example: to test whether a drug really works, we compare the results of the "control group" with those of the "treated group" using the KS test. If the test rejects the hypothesis that both groups follow the same distribution, the difference is significant and the drug seems to work.


Input: two data sets of double values (they may contain different numbers of elements).


Output: a Boolean value, true if at the given level the two data sets seem to come from the same distribution.


The algorithm is straightforward and follows from the mathematical definition. First, D_nm must be calculated, which requires the supremum of |F_m(x) - G_n(x)| over all x. Evaluating the supremum directly from the definition seems complicated, so the following formulas should be used:

(If the image does not load, please refer to slide 19 of [2].)

Then the supremum is multiplied by the square root of (n * m / (n + m)). Afterwards, the product is compared with c(a). The outcome of this comparison is the final result.
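
The steps above can be sketched in Python as follows. This is a minimal illustration, not the analyzeme implementation: the function names `ks_statistic` and `same_distribution` are hypothetical, and the supremum is found by evaluating both empirical distribution functions at every sample point, which may differ from the formulas on the slide referenced above.

```python
import bisect
import math

# Critical values c(a) from the table on this page.
CRITICAL_VALUES = {0.10: 1.22, 0.05: 1.36, 0.025: 1.48,
                   0.01: 1.63, 0.005: 1.73, 0.001: 1.95}


def ks_statistic(xs, ys):
    """Return D_nm = sqrt(n * m / (n + m)) * sup_x |F_m(x) - G_n(x)|."""
    xs, ys = sorted(xs), sorted(ys)
    m, n = len(xs), len(ys)
    if m == 0 or n == 0:
        raise ValueError("both samples must be non-empty")
    sup = 0.0
    # Both empirical distribution functions are step functions that jump
    # only at sample points, so the supremum is attained at one of them.
    for x in xs + ys:
        f_m = bisect.bisect_right(xs, x) / m  # share of xs that are <= x
        g_n = bisect.bisect_right(ys, x) / n  # share of ys that are <= x
        sup = max(sup, abs(f_m - g_n))
    return math.sqrt(n * m / (n + m)) * sup


def same_distribution(xs, ys, a=0.05):
    """Return True if the hypothesis F = G is not rejected at level a."""
    return ks_statistic(xs, ys) <= CRITICAL_VALUES[a]
```

For example, two fully separated samples of ten values each, such as 0..9 and 10..19, give a supremum of 1 and a scale factor of sqrt(100 / 20) = sqrt(5), so D_nm is about 2.24. This exceeds 1.36, and the hypothesis is rejected at level 0.05.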

References and links

