Analyzers: Kolmogorov Smirnov Test for two samples - lanit-tercom-school/analyzeme GitHub Wiki

Description

Suppose we have two samples: X_1,...,X_m of size m with distribution function F(x), and Y_1,...,Y_n of size n with distribution function G(x). We want to test the hypothesis F = G. Let F_m(x) and G_n(x) be the corresponding empirical distribution functions. Define the statistic D_nm:

D_nm = sqrt(n * m / (n + m)) * sup_x |F_m(x) - G_n(x)|

The hypothesis F = G is rejected at significance level a if D_nm > c(a), where c(a) is taken from the table below:

a	0.10	0.05	0.025	0.01	0.005	0.001
c(a)	1.22	1.36	1.48	1.63	1.73	1.95
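
The tabulated values agree with the usual large-sample approximation c(a) = sqrt(-ln(a / 2) / 2). That formula is not stated on this page and is brought in here as an assumption. The sketch below only checks it against the table, and the name `critical_value` is hypothetical.

```python
from math import log, sqrt

# Critical values copied from the table above.
TABLE = {0.10: 1.22, 0.05: 1.36, 0.025: 1.48, 0.01: 1.63, 0.005: 1.73, 0.001: 1.95}


def critical_value(a):
    """Large-sample approximation of c(a); an assumption, not taken from this page."""
    if not 0 < a < 1:
        raise ValueError("level must lie strictly between 0 and 1")
    return sqrt(-log(a / 2) / 2)


if __name__ == "__main__":
    # Print the tabulated and the approximated value side by side.
    for a, c in TABLE.items():
        print(f"a = {a}: table {c}, formula {critical_value(a):.4f}")
```

If the approximation is acceptable, it could supply c(a) for levels missing from the table. Because it is a large-sample result, it should be treated as approximate for small m and n.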

Purpose

The purpose of the test is to check whether two data sets differ enough to conclude that they come from different distributions.

Use cases

An example from http://www.physics.csbsju.edu/stats/KS-test.html: to test whether a drug really works, we compare the "control group" results with the "treated group" results using the KS test. If the two samples do not pass the test (the hypothesis F = G is rejected), the difference is significant and the drug seems to work.

Input

Two data sets of double values (they may contain different numbers of elements).

Output

Boolean value: true if, at the given level, the two data sets seem to come from the same distribution (the hypothesis F = G is not rejected).

Implementation/algorithms

The algorithm is straightforward and follows from the mathematical definition. First, D_nm has to be calculated, which requires the supremum of |F_m(x) - G_n(x)| over all x. Evaluating the supremum directly from the definition is awkward, so the following formulas should be used:

(The formulas are given as an image on the original page; if it does not load, refer to slide 19 of [2].)

The supremum is then multiplied by the square root of (n * m / (n + m)). Finally, the result is compared with c(a); the outcome of that comparison is the result of the test.
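
The steps above can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the function names `ks_statistic` and `same_distribution` are hypothetical. Instead of the formulas from slide 19 of [2], it relies on the fact that both empirical distribution functions are step functions that change only at sample points, so the supremum is attained at one of the pooled sample values.

```python
from bisect import bisect_right
from math import sqrt

# Critical values c(a) copied from the table in the Description section.
CRITICAL_VALUES = {0.10: 1.22, 0.05: 1.36, 0.025: 1.48, 0.01: 1.63, 0.005: 1.73, 0.001: 1.95}


def ks_statistic(xs, ys):
    """Return sqrt(n*m/(n+m)) * sup_x |F_m(x) - G_n(x)| for two samples."""
    m, n = len(xs), len(ys)
    if m == 0 or n == 0:
        raise ValueError("both samples must be non-empty")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    sup = 0.0
    # Check every pooled sample value; the difference of the two step
    # functions is constant between consecutive pooled values.
    for z in xs_sorted + ys_sorted:
        f_m = bisect_right(xs_sorted, z) / m  # share of X values <= z
        g_n = bisect_right(ys_sorted, z) / n  # share of Y values <= z
        sup = max(sup, abs(f_m - g_n))
    return sqrt(n * m / (n + m)) * sup


def same_distribution(xs, ys, level=0.05):
    """Return True if the hypothesis F = G is not rejected at the given level."""
    if level not in CRITICAL_VALUES:
        raise ValueError("level must be one of the tabulated values")
    return ks_statistic(xs, ys) <= CRITICAL_VALUES[level]
```

For example, `same_distribution(control, treated, level=0.05)` would return false when the two groups differ significantly at the 5% level. Sorting once and using binary search keeps the cost at O((n + m) log(n + m)). A single merge pass over the two sorted samples would also work.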

References

[1] http://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2006/lecture-notes/lecture14.pdf

[2] https://compscicenter.ru/media/slides/math_stat_2013_spring/2013_04_10_math_stat_2013_spring.pdf