Density Estimation - shivamvats/notes GitHub Wiki
- Nearest-neighbour based approach
- Takes in two parameters - a kernel function and a bandwidth
- Given an input, it computes the density by summing contributions from every point in the training data. Each contribution is the kernel applied to the distance between the input and that training point, scaled by the bandwidth. The sum is normalized so that the estimate integrates to 1.
- A low bandwidth gives more weight to nearby points, which means low bias but high variance (a spiky estimate). A high bandwidth gives a smooth density function with low variance but high bias.
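The steps above can be sketched in a few lines of plain Python. This is a minimal one-dimensional illustration, not a reference implementation: the Gaussian kernel, the function names `gaussian_kernel` and `kde`, and the sample data are all assumptions made for the example.

```python
import math

def gaussian_kernel(u):
    # Standard normal density, used here as the (assumed) kernel function.
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, bandwidth, kernel=gaussian_kernel):
    # Sum the kernel over every training point, with the distance to x
    # scaled by the bandwidth, then normalize by n * bandwidth so the
    # estimate integrates to 1.
    n = len(data)
    total = sum(kernel((x - xi) / bandwidth) for xi in data)
    return total / (n * bandwidth)

# Hypothetical training data.
data = [-1.0, 0.0, 0.5, 2.0]

# Small bandwidth: dominated by the point sitting exactly at x = 0.
narrow = kde(0.0, data, bandwidth=0.1)
# Large bandwidth: every point contributes, giving a smoother estimate.
wide = kde(0.0, data, bandwidth=2.0)
```

Evaluating `kde` on a grid with different bandwidths is a quick way to see the bias/variance trade-off: small values produce a spike at each training point, large values blur them into a single bump.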
The density distribution is approximated with a mixture of $k$ (specified) Gaussian distributions. The density at every point is $p(x) = \sum_{i=1}^{k} a_i \, \mathcal{N}(x; \mu_i, \sigma_i)$, where $\sum_{i=1}^{k} a_i = 1$.
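As a rough illustration of the mixture density described above, the sketch below evaluates it in one dimension for fixed parameters. It assumes that sigma_i denotes a standard deviation; the names `normal_pdf` and `mixture_density` and the parameter values are hypothetical. Fitting the parameters from data (typically done with expectation-maximization) is not shown.

```python
import math

def normal_pdf(x, mu, sigma):
    # Density of a 1-D Gaussian; sigma is assumed to be the standard deviation.
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_density(x, weights, means, sigmas):
    # Weighted sum of the component densities; the weights must sum to 1.
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(a * normal_pdf(x, mu, s)
               for a, mu, s in zip(weights, means, sigmas))

# Hypothetical two-component mixture (k = 2).
weights = [0.3, 0.7]
means = [-2.0, 1.0]
sigmas = [0.5, 1.5]

p = mixture_density(0.0, weights, means, sigmas)
```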
Estimating the CDF is easy: use the empirical distribution (a frequentist approach). Assign probability mass 1/n to each of the n data points and use the resulting step-function CDF as the estimate. This empirical CDF converges uniformly to the true CDF, and the probability that the maximum error between the two exceeds epsilon decreases exponentially in n. The Dvoretzky-Kiefer-Wolfowitz inequality bounds this probability by 2 exp(-2 n epsilon^2).
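A minimal sketch of the empirical CDF, assuming one-dimensional samples. The helper names `make_ecdf` and `dkw_bound` are made up for this example; `dkw_bound` evaluates the Dvoretzky-Kiefer-Wolfowitz tail bound 2 exp(-2 n epsilon^2), capped at 1.

```python
import bisect
import math

def make_ecdf(samples):
    # Sort once; each query is then a binary search.
    xs = sorted(samples)
    n = len(xs)

    def ecdf(x):
        # Fraction of samples <= x, i.e. mass 1/n on every data point.
        return bisect.bisect_right(xs, x) / n

    return ecdf

def dkw_bound(n, eps):
    # Upper bound on P(sup_x |F_n(x) - F(x)| > eps) for n i.i.d. samples.
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps * eps))

# Hypothetical samples, including a repeated value.
F = make_ecdf([3.0, 1.0, 2.0, 2.0])
values = [F(0.5), F(1.0), F(2.0), F(3.0)]  # [0.0, 0.25, 0.75, 1.0]
```

The repeated sample at 2.0 produces a jump of 2/n there, which is why the step function goes from 0.25 to 0.75 at that point.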