Analyzers: Kolmogorov-Smirnov Test for two samples - lanit-tercom-school/analyzeme GitHub Wiki
Description
Suppose we have two samples: X_1,...,X_m of size m with distribution function F(x), and Y_1,...,Y_n of size n with distribution function G(x). We want to test whether F = G. Let F_m(x) and G_n(x) be the corresponding empirical distribution functions. Define the statistic D_nm:
$$D_{nm} = \sqrt{\frac{nm}{n+m}} \, \sup_x |F_m(x) - G_n(x)|$$
The hypothesis F = G is rejected at level a if D_nm > c(a), where the critical value c(a) for each level a is taken from the table below:
a | 0.10 | 0.05 | 0.025 | 0.01 | 0.005 | 0.001 |
---|---|---|---|---|---|---|
c(a) | 1.22 | 1.36 | 1.48 | 1.63 | 1.73 | 1.95 |
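The tabulated values agree with the usual large-sample approximation for the two-sided test, c(a) = sqrt(-ln(a / 2) / 2). The short Python sketch below checks this; the names `TABLE` and `critical_value` are illustrative and are not taken from the analyzeme code base.

```python
import math

# Critical values copied from the table above (level a -> c(a)).
TABLE = {0.10: 1.22, 0.05: 1.36, 0.025: 1.48,
         0.01: 1.63, 0.005: 1.73, 0.001: 1.95}


def critical_value(a):
    """Large-sample two-sided critical value: c(a) = sqrt(-ln(a / 2) / 2)."""
    return math.sqrt(-math.log(a / 2) / 2)


for a, tabulated in TABLE.items():
    # Each tabulated value matches the formula up to two-decimal rounding.
    assert abs(critical_value(a) - tabulated) < 0.01
```

The formula is convenient when a level outside the table is needed, under the same large-sample assumption.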
Purpose
The purpose of the test is to check whether two data sets differ enough to conclude that they come from different distributions.
Use cases
An example from http://www.physics.csbsju.edu/stats/KS-test.html: to test whether a drug really works, we compare the results of the "control group" with those of the "treated group" using the KS test. If the two groups do not pass the test (the hypothesis of a common distribution is rejected), the difference is significant and the drug seems to work.
Input
Two data sets of double values (they may contain different numbers of elements).
Output
A Boolean value: true if, at the given level, the two data sets appear to come from the same distribution (the hypothesis F = G is not rejected), and false otherwise.
Implementation/algorithms
The algorithm is straightforward and follows the mathematical definition. First, D_nm must be calculated. This requires the supremum over x of |F_m(x) - G_n(x)|. Evaluating the supremum directly from the definition is awkward, so the formulas given on slide 19 of [2] should be used.
The supremum is then multiplied by the square root of (n * m / (n + m)). Finally, the result is compared with c(a); the outcome of this comparison is the result of the test.
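The steps above can be sketched in Python as follows. This is a minimal illustration, not the analyzeme implementation: the names `ks_statistic`, `same_distribution` and `CRITICAL_VALUES` are hypothetical, both samples are assumed to be non-empty, and the level must be one of the tabulated ones. Instead of the formulas from [2], the sketch uses the fact that both empirical distribution functions are step functions that jump only at sample points, so it is enough to evaluate their difference at every point of the pooled sample.

```python
import math
from bisect import bisect_right

# Critical values c(a) from the table in the description.
CRITICAL_VALUES = {0.10: 1.22, 0.05: 1.36, 0.025: 1.48,
                   0.01: 1.63, 0.005: 1.73, 0.001: 1.95}


def ks_statistic(xs, ys):
    """Return sqrt(n * m / (n + m)) * sup_x |F_m(x) - G_n(x)|."""
    xs, ys = sorted(xs), sorted(ys)
    m, n = len(xs), len(ys)
    sup = 0.0
    # The difference of the two step functions changes only at sample
    # points, so the supremum is reached at one of them.
    for point in xs + ys:
        f_m = bisect_right(xs, point) / m  # share of xs that are <= point
        g_n = bisect_right(ys, point) / n  # share of ys that are <= point
        sup = max(sup, abs(f_m - g_n))
    return math.sqrt(n * m / (n + m)) * sup


def same_distribution(xs, ys, a=0.05):
    """Return True if the hypothesis F = G is not rejected at level a."""
    return ks_statistic(xs, ys) <= CRITICAL_VALUES[a]
```

For example, with made-up data, `same_distribution([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])` returns `True`, because the statistic is zero. Two samples of 20 points each that do not overlap at all give a supremum of 1 and a statistic of sqrt(10), so the hypothesis is rejected at every tabulated level.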
References
[2] https://compscicenter.ru/media/slides/math_stat_2013_spring/2013_04_10_math_stat_2013_spring.pdf