Linear Regression - niranjv/ml-notes GitHub Wiki
- Is there a relationship between covariates & response? Fit a multiple linear regression model to the data and test `H_0`: model coefficients for all covariates are 0. Use the F-statistic to reject `H_0` (see the sketch after this list).
- How strong is the relationship? Use `RSE` & `R^2` as measures of the strength of the relationship.
- Which covariates affect the response the most? Look at the p-values of each covariate in the model.
- Effect of each covariate on the response? Look at the values of the model coefficients. Ideally, the 95% CI for a model coefficient should be narrow and far from 0. Collinearity can affect the width of this interval, so look at `VIF` scores also.
- How accurately can we predict future response values from new covariates? For the mean response, use a confidence interval. To predict the response for a particular set of covariates, use a prediction interval (always wider than the CI due to irreducible error).
- Is the relationship linear? Look at plots of (studentized) residuals vs. fitted values. Transform the response or predictors to remove non-linearity.
- Any interaction effects? Include an interaction term and look at its p-value. Also examine the increase in `R^2` and decrease in `RSE` of the model.
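A minimal sketch of these checks in Python, assuming the `statsmodels` OLS API and simulated data (the column names `x1`/`x2` and the true coefficients are made up for illustration); it prints the F-statistic, per-coefficient p-values and CIs, `R^2`, `RSE`, and confidence vs. prediction intervals for a new point:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
y = 3 + 2 * X["x1"] - 1 * X["x2"] + rng.normal(scale=0.5, size=n)

X_design = sm.add_constant(X)          # add intercept column
model = sm.OLS(y, X_design).fit()

print(model.fvalue, model.f_pvalue)    # F-test of H_0: all slopes are 0
print(model.pvalues)                   # per-coefficient t-test p-values
print(model.conf_int(alpha=0.05))      # 95% CIs for coefficients
print(model.rsquared)                  # R^2
print(np.sqrt(model.mse_resid))        # residual standard error (RSE)

# Confidence interval (mean response) vs. prediction interval (new observation)
new = sm.add_constant(pd.DataFrame({"x1": [0.5], "x2": [-1.0]}), has_constant="add")
pred = model.get_prediction(new)
print(pred.summary_frame(alpha=0.05))  # mean_ci_* columns = CI, obs_ci_* = PI (wider)
```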
- Need to estimate model coefficients such that the resulting line is as close as possible to the data.
- The most common measure of closeness is `least squares` => estimate model coefficients to minimize the `residual sum of squares` (RSS).
- The error term in the model represents everything that the model does not account for - missing covariates, non-linearity, noise in the data, etc. It is usually assumed to be independent of `X`.
- The linear approximation to `f` is `Y = B_0 + B_1*X + e` and defines the `population regression line`. This line is the best linear approximation to the true relationship between `X` and `Y`. The model coefficients determined by least squares regression define the `least squares line`.
- Least squares estimates of the population mean & model coefficients are unbiased.
- Variance of the estimate of the mean = `sigma^2/n` when the observations are all uncorrelated.
- Std errors of the population mean & model coefficients ~ the average amount by which the estimate differs from the true value. The std error formulas assume the errors of the observations are uncorrelated.
- The std deviation of the error is estimated by the `residual standard error = sqrt(RSS/(n-2))`.
- Std errors are used for hypothesis testing, typically `H_0` (no relationship) vs. `H_1` (some relationship). Compute the t-statistic from the estimate and its std error and compare it to a t-distribution with `n-2` degrees of freedom to determine how far the estimate is from 0. The resulting probability is the `p-value`, i.e., for a given model, assuming the null hypothesis, the probability of getting a t-statistic at least as extreme as the computed value due to chance. Reject the null hypothesis if the p-value < a threshold (typically 0.05). A worked numeric version of this test is sketched below.
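A minimal from-scratch sketch of simple linear regression using these formulas, on simulated data (the true coefficients 1.5 and 0.8 are arbitrary assumptions for the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, size=n)
y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=n)

# Least squares estimates for Y = B_0 + B_1*X + e
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# Residual standard error: sqrt(RSS / (n - 2))
resid = y - (b0 + b1 * x)
rss = np.sum(resid ** 2)
rse = np.sqrt(rss / (n - 2))

# Standard error, t-statistic and two-sided p-value for H_0: B_1 = 0
se_b1 = rse / np.sqrt(np.sum((x - x_bar) ** 2))
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 2)
print(b0, b1, rse, t_stat, p_value)
```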
- If the model is plausible (i.e., model coefficients have low p-values), quantify the extent to which the model fits the data. This is typically done using the two measures below (see the sketch after this list):
  - `Residual standard error (RSE)` - an estimate of the std deviation of the error `e`: `RSE = sqrt(RSS/(n-2))`. It is a measure of the lack of fit of the model; large deviations of the estimates from the data result in a high RSE. It is measured in the units of the response.
  - `R^2` statistic - a measure of model fit, i.e., of the linear relationship between response & predictors, similar to correlation. It is the `proportion of variance explained` by the model, is always between `0` and `1`, and is independent of the scale of the response. `R^2 = 1 - (RSS/TSS)`. A good `R^2` value depends on the application. For simple linear regression, `R^2` equals the squared correlation between `X` and `Y`; this identity does not carry over to multiple linear regression.
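A small sketch computing `RSS`, `TSS`, `RSE` and `R^2` directly, and checking the simple-regression identity `R^2 = Cor(X, Y)^2` (data and coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 2.0 + 0.7 * x + rng.normal(scale=0.8, size=n)

# Simple least squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
rse = np.sqrt(rss / (n - 2))         # residual standard error, in units of y
r_squared = 1 - rss / tss

# For simple linear regression, R^2 equals the squared correlation of X and Y
corr = np.corrcoef(x, y)[0, 1]
print(rse, r_squared, corr ** 2)     # the last two agree up to floating-point error
```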
- In multiple linear regression, the model coefficient of a covariate represents the average effect on the response of increasing that covariate by one unit while holding all other covariates fixed.
- Non-linearity - Use residual plots to identify non-linearity in the relationship between response & covariates (residuals on the Y-axis vs. fitted values on the X-axis). There should be no discernible pattern. If a non-linear pattern exists, use a non-linear transformation like `log(X)`, `sqrt(X)`, `X^2`, etc. on the covariates.
- Correlated errors - Std errors for model coefficients assume errors are uncorrelated. If errors are correlated, std errors of model coefficients will be underestimated and confidence/prediction intervals will be incorrectly narrow. This usually occurs in time series data. To detect it, plot residuals vs. observation order (by time); there should be no discernible pattern.
- Non-constant variance of errors - Std errors, confidence intervals & hypothesis tests assume constant variance of errors. Again, plot residuals vs. fitted values. If there is a funnel shape, transform the response with a concave function like `log(Y)` or `sqrt(Y)`. In other cases, weighted least squares can be used.
- Outliers - Outlier := point far from the value predicted by the model, i.e., the response is unusual given the covariates. To find outliers, plot residuals vs. fitted values, or plot studentized residuals (residuals / std error) vs. fitted values. Points with |studentized residual| > 3 are possible outliers.
- High leverage - Points with unusual values for the covariates. Such points can strongly affect the regression line. Use the `leverage statistic` to identify high-leverage points; this statistic is always between `1/n` and `1`, and its average is always `(p+1)/n`. Plot studentized residuals vs. leverage.
- Collinearity - 2 or more covariates are highly correlated. This increases the std error of the estimates of the model coefficients and reduces the power of hypothesis tests. Multicollinearity => 3 or more covariates are correlated. Use the `variance inflation factor` (VIF) to detect multicollinearity. To remove collinearity, drop one of the affected covariates or create a new covariate from the collinear covariates. A diagnostics sketch covering outliers, leverage & VIF follows this list.
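A minimal diagnostics sketch, assuming the `statsmodels` influence and VIF utilities and simulated data; the collinear `x1`/`x2` columns are made up, and the |studentized residual| > 3 and leverage > 2(p+1)/n cutoffs are common rules of thumb rather than part of these notes:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=n)   # deliberately collinear with x1
X = pd.DataFrame({"x1": x1, "x2": x2})
y = 1 + 2 * x1 + rng.normal(size=n)

X_design = sm.add_constant(X)
fit = sm.OLS(y, X_design).fit()

influence = fit.get_influence()
studentized = influence.resid_studentized_external   # |value| > 3 flags possible outliers
leverage = influence.hat_matrix_diag                 # average leverage is (p+1)/n

print(np.where(np.abs(studentized) > 3)[0])              # candidate outliers
print(np.where(leverage > 2 * X_design.shape[1] / n)[0]) # high-leverage points (rule of thumb)

# VIF for each covariate (excluding the intercept); large values suggest collinearity
for i, name in enumerate(X_design.columns):
    if name != "const":
        print(name, variance_inflation_factor(X_design.values, i))
```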
- Parametric methods like linear regression assume a functional form for `f(X)`. This simplifies the problem of estimating `f(X)` since only a few model coefficients need to be estimated, and the resulting model can be easier to interpret. But if the assumed functional form is wrong, the model fit will be poor. Use a parametric method if its form is close to the true form of `f(X)`.
- Non-parametric methods do not assume any functional form and are therefore more flexible, but need a lot more data to get a good fit. E.g., K-nearest neighbors regression: given `K` and a set of covariates `x_0`, find the `K` training points whose covariates are closest to `x_0` and average their responses to get the predicted response at `x_0` (see the sketch below). The bias-variance trade-off determines the optimal value of `K`.
- `KNN` regression degrades as the number of covariates `p` increases, since there may be no nearby neighbors, forcing KNN regression to use covariate sets far from the set for which the response needs to be predicted. Generally, parametric methods will outperform non-parametric methods when there is a small number of data points per covariate. Parametric methods can also be easier to interpret even when a large amount of data is available.
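A minimal from-scratch sketch of KNN regression on made-up 1-D data (the function name `knn_regress` and the sine-shaped example are illustrative only):

```python
import numpy as np

def knn_regress(X_train, y_train, x0, k):
    """Predict the response at x0 by averaging the responses of the
    k training points whose covariates are closest to x0 (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(4)
X_train = rng.uniform(-3, 3, size=(200, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.2, size=200)

x0 = np.array([1.0])
for k in (1, 5, 25):   # small k: low bias / high variance; large k: the reverse
    print(k, knn_regress(X_train, y_train, x0, k))
```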
ISLR, Chapter 3