04 Non‐Parametric Regression

Non-Parametric Regression

A parametric statistical model is a family of probability distributions with a finite set of parameters. For example, the normal regression model written below has $p+1$ elements in the $\beta$ vector and one parameter ($\sigma^2$) in the variance-covariance matrix. The total number of parameters to estimate in the model is finite ($p+2$) and fixed.

$$ Y \sim N(X\beta,\sigma^2I_n) $$

where $Y$ is the vector of responses, normally distributed with mean $X\beta$; $X$ is the design matrix and $\beta$ is the vector of the intercept and slope parameters; $\sigma^2I_n$ is the variance-covariance matrix, with a constant variance $\sigma^2$ along the diagonal and $0$ covariance between observations.

A non-parametric statistical model is a family of probability distributions with infinitely many parameters. For example:

$$ Y \sim N(f(x_1),\sigma^2I_n) $$

where $Y$ is the vector of responses, normally distributed with a mean described by an arbitrary function $f(x_1)$, with $x_1 \in [-1,1]$. In non-parametric regression we estimate $f(x_1)$ instead of the $\beta$'s. Since $f(x_1)$ can be of any form, describing its shape would require infinitely many parameters. $\sigma^2I_n$ is the variance-covariance matrix, with a constant variance $\sigma^2$ along the diagonal and $0$ covariance between observations.

We can think of parametric regression as fixing the form of $f$ beforehand (e.g. for Poisson regression we fix $f(x)=\exp(\beta_0 +\beta_1x_1 + \cdots + \beta_px_p)$) and then finding the $\beta$'s by MLE. In non-parametric regression we do not fix the form of $f$ but instead learn/estimate it from the data.

The non-parametric approach has the advantage of being flexible and making fewer distributional assumptions. However, it is less efficient when the structure of the relationship between the predictors and the response is already known, and non-parametric fits are harder to interpret because there are no regression parameters to read off directly.

Kernel Estimators

A kernel estimator can be understood as a simple weighted local moving average of the response. We can think of our model as a response equal to some unknown function of the predictor, $f(x_i)$, plus some noise $\epsilon_i$. $f$ is not fixed and we will need to estimate it.

$$ Y_i = f(x_i) + \epsilon_i, \qquad \epsilon_i \sim N(0,\sigma^2) $$

The Kernel Estimator estimates $f(x)$ in the following way:

$$ \hat{f}_\lambda(x) = \dfrac{\dfrac{1}{n\lambda} \sum\limits_{i=1}^n K\left(\dfrac{x-x_i}{\lambda}\right)Y_i}{\dfrac{1}{n\lambda} \sum\limits_{i=1}^n K\left(\dfrac{x-x_i}{\lambda}\right)} $$

$K(x)$ is called a kernel function. It is a nonnegative real-valued function such that $K(x)=K(-x)$ for all $x$ (symmetry) and $\int K(x)\,dx = 1$ over the whole domain of $K(x)$. Some commonly used kernels are listed below (a small R sketch of these kernel functions follows the list):

(Note: in estimating $f(x)$, the resulting estimate is not very sensitive to the choice of kernel function.)

  • Uniform/rectangular Kernel:

$$ K(x) = \dfrac{1}{2}, \qquad -1 \le x \le 1 $$

  • Gaussian/Normal Kernel:

$$ K(x) = \dfrac{1}{\sqrt{2\pi}} \exp\left(-\dfrac{x^2}{2}\right), \qquad -\infty \lt x \lt \infty $$

  • Epanechnikov Kernel:

$$ K(x) = \dfrac{3}{4}(1-x^2), \qquad -1 \le x \le 1 $$
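
As an illustration, the three kernels above can be written directly as R functions. This is just a sketch (the function names are ours), but it makes the symmetry and integrate-to-one properties easy to check:

```r
# Each kernel is symmetric, nonnegative, and integrates to 1.
uniform_kernel      <- function(x) ifelse(abs(x) <= 1, 1 / 2, 0)
gaussian_kernel     <- function(x) (1 / sqrt(2 * pi)) * exp(-x^2 / 2)
epanechnikov_kernel <- function(x) ifelse(abs(x) <= 1, (3 / 4) * (1 - x^2), 0)

# Quick check that each integrates to (approximately) 1:
integrate(uniform_kernel,      -1, 1)$value
integrate(gaussian_kernel,     -Inf, Inf)$value
integrate(epanechnikov_kernel, -1, 1)$value
```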

$\lambda$ is the bandwidth or smoothing parameter of the estimation. It controls the smoothness or bumpiness of the estimated $\hat{f}$. A larger $\lambda$ gives a smoother $\hat{f}$.

The estimator $\hat{f}_\lambda(x)$ can be thought of as a weighted average of the responses $Y_i$, where the weight $w_i$ of each data point is defined as:

$$ w_i = \dfrac{1}{\lambda} K\left(\dfrac{x-x_i}{\lambda}\right), \qquad \text{so that } \hat{f}_\lambda(x) = \dfrac{\sum_{i=1}^n w_i Y_i}{\sum_{i=1}^n w_i} $$
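
To make the formula concrete, here is a hand-rolled sketch of the kernel estimator as this weighted average (the function name nw_estimate, the Gaussian kernel via dnorm(), and the simulated data are our own choices, not part of the original notes):

```r
# Hand-rolled kernel estimator: a weighted moving average of the responses,
# with weights w_i = K((x - x_i) / lambda) / lambda.
nw_estimate <- function(x0, x, y, lambda, K = function(u) dnorm(u)) {
  w <- K((x0 - x) / lambda) / lambda
  sum(w * y) / sum(w)
}

set.seed(1)
x <- sort(runif(100, -1, 1))
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

# Evaluate the estimator over a grid of x values and plot the fitted curve.
grid <- seq(-1, 1, length.out = 200)
fhat <- sapply(grid, nw_estimate, x = x, y = y, lambda = 0.1)

plot(x, y)
lines(grid, fhat, col = "red", lwd = 2)
```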

In R, we could use ksmooth() to get a list of kernel-smoothed y-values for a chosen kernel and bandwidth.
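
For example, a minimal ksmooth() sketch on simulated data might look like this (the data and bandwidth value are purely illustrative):

```r
set.seed(1)
x <- sort(runif(100, -1, 1))
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

# kernel = "box" gives the uniform/rectangular kernel, "normal" the Gaussian.
# bandwidth plays the role of lambda: a larger value gives a smoother fit.
fit <- ksmooth(x, y, kernel = "normal", bandwidth = 0.2, n.points = 200)

plot(x, y)
lines(fit$x, fit$y, col = "blue", lwd = 2)
```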

Smoothing Splines

Given the model $Y_i = f(x_i) + \epsilon_i$, in smoothing splines we choose the $\hat{f}$ that minimises:

$$ \overbrace{\dfrac{1}{n} \sum\limits_{i=1}^n(Y_i-f(x_i))^2}^{\text{MSE to fit the data}} + \underbrace{\lambda \int[f''(x)]^2\, dx}_{\text{smoothness penalty}} $$

Note that the first term $\dfrac{1}{n} \sum\limits_{i=1}^n(Y_i-f(x_i))^2$ is just the MSE. If we minimise only the MSE (driving it to $0$), we get $\hat{f}(x_i)=y_i$, i.e. $\hat{f}$ simply interpolates every $y_i$ and overfits the data.

The second term controls the smoothness of $\hat{f}$. Recall from calculus that the second derivative of a function measures the rate of change of its slope: a small $f''(x)$ avoids abrupt changes in slope, i.e. a smoother curve. Squaring it makes the penalty always positive (we care about the amount of slope change, not its direction) and penalises large changes more heavily. Finally, the integral sums up the slope change over the whole support of the function. The smoothness term therefore prefers smoother functions by minimising the integrated curvature of the function.

The $\lambda$ in the smoothness term controls how strongly we penalise overall curvature. When $\lambda=0$ there is no penalty and we get the over-fitted $\hat{f}$ discussed above. As $\lambda\rightarrow\infty$, $\hat{f}$ is forced to have very low curvature and tends towards a straight line.

The solution to the above minimisation is an $\hat{f}$ in a form called a cubic smoothing spline. A spline is a piecewise function where each segment is a polynomial; splines are continuous and have continuous derivatives at the segment boundaries (knots). In a cubic spline the polynomial of each segment has degree 3.

In R, we could use smooth.spline() to fit a smoothing spline. The spar parameter of smooth.spline() is related to $\lambda$ and can be specified to give different fitted splines.
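
A minimal smooth.spline() sketch on simulated data, comparing a rougher fit, a smoother fit, and the automatically chosen smoothing (the data and spar values are purely illustrative):

```r
set.seed(1)
x <- sort(runif(100, -1, 1))
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

# A larger spar corresponds to a larger lambda and hence a smoother fit;
# leaving spar unspecified lets smooth.spline() pick it by cross-validation.
fit_rough  <- smooth.spline(x, y, spar = 0.3)
fit_smooth <- smooth.spline(x, y, spar = 0.9)
fit_cv     <- smooth.spline(x, y)

plot(x, y)
lines(predict(fit_rough,  x), col = "red",   lwd = 2)
lines(predict(fit_smooth, x), col = "blue",  lwd = 2)
lines(predict(fit_cv,     x), col = "green", lwd = 2)
```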

Loess: Locally Estimated Scatterplot Smoothing

As discussed above, in regression in general we are given the model $Y_i = f(x_i) + \epsilon_i$ and we find an $\hat{f}(x)$ that minimises the MSE:

$$ MSE = \dfrac{1}{n} \sum\limits_{i=1}^n(Y_i-f(x_i))^2 $$

If we approximate $f(x)$ by a Taylor expansion at an arbitrary point $x_0$ and treat each $j$-th derivative of $f$ divided by $j!$ as a regression parameter $\beta_j$, we are looking at a linear regression problem again:

$$ f(x) \approx \underbrace{f(x_0)}_{\beta_0} + \overbrace{\dfrac{f'(x_0)}{1!}}^{\beta_1}(x-x_0) + \overbrace{\dfrac{f''(x_0)}{2!}}^{\beta_2}(x-x_0)^2 + \dfrac{f'''(x_0)}{3!}(x-x_0)^3 +\cdots+ \overbrace{\dfrac{f^{(p)}(x_0)}{p!}}^{\beta_p}(x-x_0)^p $$

The MSE of this "Taylor"-transformed regression looks like the MSE of a linear regression, so we can find the $\beta_j$'s using the same least-squares machinery used to find the $\beta$'s in linear regression.

$$ MSE_{Taylor} = \dfrac{1}{n} \sum\limits_{i=1}^n \left(Y_i- \sum\limits_{j=0}^p \beta_j(x_i-x_0)^j \right)^2, \qquad \text{where } \beta_j=\dfrac{f^{(j)}(x_0)}{j!} \text{ and } f^{(j)} \text{ is the $j$-th derivative of } f $$

In practice we often approximate $f(x)$ with just a linear approximation ($p=1$). Some algorithms use a quadratic approximation ($p=2$), but it is usually a fairly low-order approximation. The idea is to restrict the window around $x_0$, take the relatively few points inside that small window, use them to estimate the local fit near $x_0$, and then move $x_0$ across the whole $x$ domain. Note that if $p=0$, so we fit only a constant term, the loess method reduces to the kernel estimator.

The above method needs to move $x_0$ across the $x$ domain of $f(x)$, and the Taylor approximation is only accurate near $x_0$. We can handle this by effectively using only data points close to $x_0$: we modify the Taylor-expansion MSE to weight points close to $x_0$ more heavily than points far away, using a weight function $w_i(x)$ (the tricube function).

$$ MSE_{Taylor}^w = \dfrac{1}{n} \sum\limits_{i=1}^n w_i(x_i-x_0)\left(Y_i- \sum\limits_{j=0}^p \beta_j(x_i-x_0)^j \right)^2, \qquad \text{where } w_i(x) = \begin{cases} (1-\lvert x\rvert^3)^3 & \text{if } \lvert x \rvert \lt 1 \\ 0 & \text{if } \lvert x \rvert \ge 1 \end{cases} $$
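
To make one loess step concrete, here is a sketch of a single local fit at a point $x_0$, minimising the weighted MSE above with lm() and tricube weights (the window half-width h, the simulated data, and the choice $p=2$ are our own illustrative assumptions):

```r
set.seed(1)
x <- sort(runif(100, -1, 1))
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

tricube <- function(u) ifelse(abs(u) < 1, (1 - abs(u)^3)^3, 0)

# One local fit at x0: weight points by their scaled distance to x0,
# then run a weighted quadratic (p = 2) regression in (x - x0).
x0    <- 0.25
h     <- 0.3                          # half-width of the local window (our choice)
w     <- tricube((x - x0) / h)
local <- lm(y ~ I(x - x0) + I((x - x0)^2), weights = w)

# The intercept beta_0 estimates f(x0); repeating this over a grid of
# x0 values traces out the loess curve.
coef(local)[1]
```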

Since Loess models are non-parametric and locally weighted, standard methods for estimating uncertainty (like in linear regression) do not directly apply. However, uncertainty can still be quantified using resampling or asymptotic approximations.
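
As one illustration of the resampling idea, a simple case-resampling bootstrap for pointwise uncertainty of a loess curve might look like the sketch below (the span, grid, and number of replicates are arbitrary choices, not a prescribed procedure):

```r
set.seed(1)
dat <- data.frame(x = sort(runif(100, -1, 1)))
dat$y <- sin(2 * pi * dat$x) + rnorm(100, sd = 0.3)

grid <- data.frame(x = seq(-1, 1, length.out = 50))

# Refit loess on resampled rows and collect predictions on a fixed grid.
boot_preds <- replicate(200, {
  idx <- sample(nrow(dat), replace = TRUE)
  fit <- loess(y ~ x, data = dat[idx, ], span = 0.5)
  predict(fit, newdata = grid)
})

# Pointwise 95% bootstrap interval for the fitted curve (NAs can occur
# when a resample does not cover the edges of the grid).
ci <- apply(boot_preds, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE)
```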

In R, we could use the geom_smooth() function of ggplot2 to plot a loess fit, but this only draws the curve and does not return the fitted model. If we want a fitted loess model object, we have to use the loess() function.
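
A short sketch contrasting the two approaches (the simulated data, span, and degree values are illustrative):

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(x = sort(runif(100, -1, 1)))
dat$y <- sin(2 * pi * dat$x) + rnorm(100, sd = 0.3)

# 1) geom_smooth() only draws the loess curve (with a confidence band).
ggplot(dat, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.5, se = TRUE)

# 2) loess() returns a model object we can inspect and predict from.
#    span controls the fraction of points in each local window;
#    degree is the order p of the local polynomial (1 or 2).
fit  <- loess(y ~ x, data = dat, span = 0.5, degree = 2)
pred <- predict(fit, newdata = dat, se = TRUE)   # fitted values and SEs
head(pred$fit)
```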