# Linear Regression
- Is there a relationship between covariates & response?
  - Fit a multiple linear regression model to the data and test `H_0`: model coefficients for all covariates are 0. Use the F-statistic to reject `H_0`.
- How strong is the relationship?
  - Use `RSE` & `R^2` as measures of the strength of the relationship.
- Which covariates affect the response the most?
  - Look at the p-value of each covariate in the model.
- What is the effect of each covariate on the response?
  - Look at the values of the model coefficients. Ideally, the 95% CI for a model coefficient should be narrow and far from 0. Collinearity can affect the width of this interval, so also look at `VIF` scores.
- How accurately can we predict future response values from new covariates?
  - For the mean response, use a confidence interval. To predict the response for a single set of covariates, use a prediction interval (always wider than the CI due to irreducible error).
- Is the relationship linear?
  - Look at plots of (studentized) residuals vs. fitted values. Transform the response or predictors to remove non-linearity.
- Any interaction effects?
  - Include an interaction term and look at its p-value. Also examine the increase in `R^2` and decrease in `RSE` of the model (see the fitting sketch after this list).
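A minimal fitting sketch with `statsmodels` showing where the quantities above come from. The data frame, the variable names (`tv`, `radio`, `sales`), and the synthetic data are placeholders, not from these notes:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder data: two covariates and a response with a known linear structure
rng = np.random.default_rng(0)
df = pd.DataFrame({"tv": rng.uniform(0, 100, 200), "radio": rng.uniform(0, 50, 200)})
df["sales"] = 3 + 0.05 * df["tv"] + 0.10 * df["radio"] + rng.normal(0, 1, 200)

fit = smf.ols("sales ~ tv + radio", data=df).fit()
print(fit.fvalue, fit.f_pvalue)   # F-statistic & p-value for H_0: all coefficients are 0
print(fit.pvalues)                # per-covariate p-values
print(fit.conf_int(alpha=0.05))   # 95% CIs for the coefficients
print(fit.rsquared)               # R^2

# Mean-response CI vs. (wider) prediction interval for a new set of covariates
new = pd.DataFrame({"tv": [50.0], "radio": [20.0]})
print(fit.get_prediction(new).summary_frame(alpha=0.05))

# Interaction effects: refit with "sales ~ tv * radio" and inspect the p-value
# of the tv:radio term, the change in R^2, and the change in RSE.
```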
- Need to estimate model coefficients s.t. the resulting line is as close as possible to the data.
- The most common measure of closeness is *least squares* => estimate the model coefficients that minimize the *residual sum of squares* (`RSS`).
- The error term in the model represents everything the model does not account for (missing covariates, non-linearity, noise in the data, etc.). It is usually assumed to be independent of `X`.
- The linear approximation to `f` is `Y = B_0 + B_1*X + e` and defines the *population regression line*. This line is the best linear approximation to the true relationship between `X` and `Y`.
- The model coefficients determined by least squares regression define the *least squares line*.
- Least squares estimates of the population mean & model coefficients are unbiased.
- Variance of the estimate of the mean = `sigma^2/n` when the observations are all uncorrelated.
- Std errors of the population mean & model coefficients ~ avg amount by which the estimate differs from the true value. The std error formulas assume the errors of the observations are uncorrelated.
- The std deviation of the error term is estimated from the `RSS` as the *residual standard error*: `RSE = sqrt(RSS/(n-2))`.
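A sketch of the closed-form least squares estimates, RSE, and standard errors for simple linear regression, using made-up data (the "true" coefficients 2 and 3 are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 1.5, n)   # placeholder true relationship + noise

# Least squares estimates for Y = B_0 + B_1*X + e
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Residual standard error: RSE = sqrt(RSS / (n - 2)), an estimate of sd(e)
rss = np.sum((y - (b0 + b1 * x)) ** 2)
rse = np.sqrt(rss / (n - 2))

# Std errors of the estimates (formulas assume uncorrelated errors)
se_b1 = rse / np.sqrt(np.sum((x - x.mean()) ** 2))
se_b0 = rse * np.sqrt(1.0 / n + x.mean() ** 2 / np.sum((x - x.mean()) ** 2))
print(b0, b1, se_b0, se_b1, rse)
```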
- Std errors are used for hypothesis testing, typically `H_0` (no relationship) vs. `H_1` (some relationship). Calculate the t-statistic (coefficient estimate divided by its std error) and compare it to a t-distribution with `n-2` degrees of freedom to determine how far the estimate is from 0. The resulting probability is the *p-value*, i.e., for a given model and assuming the null hypothesis, the probability of getting a t-statistic at least as large as the computed one due to chance. Reject the null hypothesis if the p-value < a threshold (typically 0.05).
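Continuing the sketch above (same hypothetical `b1`, `se_b1`, `n`), the t-statistic and two-sided p-value for `H_0: B_1 = 0` can be computed directly:

```python
from scipy import stats

t_b1 = b1 / se_b1                              # how many std errors the estimate is from 0
p_value = 2 * stats.t.sf(abs(t_b1), df=n - 2)  # two-sided p-value, n-2 degrees of freedom
print(t_b1, p_value)                           # reject H_0 if p_value < 0.05
```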
- If the model is plausible (i.e., model coefficients have low p-values), quantify the extent to which the model fits the data. This is typically done using (see the sketch after this list):
  - *Residual standard error (RSE)* - an estimate of the std deviation of the error `e`: `RSE = sqrt(RSS/(n-2))`. It is a measure of the lack of fit of the model; large deviations of the estimates from the data result in a high RSE. It is measured in units of the response.
  - *`R^2` statistic* - a measure of model fit, i.e., of the linear relationship between response & predictors, similar to correlation. It is the *proportion of variance explained* by the model: `R^2 = 1 - (RSS/TSS)`. It is always between `0` and `1` and independent of the scale of the response. A good `R^2` value depends on the application. For simple linear regression, `R^2` equals the squared correlation between `X` and `Y`; this does not carry over to multiple linear regression.
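Continuing the earlier placeholder sketch (reusing `y`, `rss`, `rse`), `R^2` follows directly from the RSS and TSS:

```python
tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
r_squared = 1 - rss / tss           # proportion of variance explained
print(rse, r_squared)
# In simple linear regression, r_squared equals np.corrcoef(x, y)[0, 1] ** 2
```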
- In multiple linear regression, the model coefficient of a covariate represents the average effect on the response of increasing that covariate by one unit while holding all other covariates fixed (see the sketch below).
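Continuing the hypothetical `statsmodels` sketch from earlier (`fit`, `tv`, `radio` are placeholders), increasing one covariate by one unit with the other held fixed changes the prediction by exactly that covariate's coefficient:

```python
import pandas as pd

base = pd.DataFrame({"tv": [50.0], "radio": [20.0]})
bumped = pd.DataFrame({"tv": [51.0], "radio": [20.0]})   # tv + 1, radio held fixed
diff = fit.predict(bumped)[0] - fit.predict(base)[0]
print(diff, fit.params["tv"])   # the two values agree
```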
- Non-linearity - use residual plots to identify non-linearity in the relationship between response & covariates (residuals on the Y-axis vs. fitted values on the X-axis). There should be no discernible pattern. If a non-linear pattern exists, use a non-linear transformation like `log(X)`, `sqrt(X)`, `X^2`, etc. on the covariates.
- Correlated errors - std errors for model coefficients assume errors are uncorrelated. If errors are correlated, the std errors of the model coefficients will be underestimated and confidence/prediction intervals will be incorrectly narrow. Usually occurs in time series data. To detect, plot residuals vs. observation order (by time); there should be no discernible pattern.
- Non-constant variance of errors - std errors, confidence intervals & hypothesis tests assume constant variance of the errors. Again, plot residuals vs. fitted values. If there is a funnel shape, transform the response with a concave function like `log(Y)` or `sqrt(Y)`. In other cases, weighted least squares can be used.
- Outliers - an outlier is a point far from the value predicted by the model, i.e., the response is unusual given the covariates. To find outliers, plot residuals vs. fitted values, or plot studentized residuals (residuals divided by their estimated std error) vs. fitted values. Points with |studentized residual| > 3 are possible outliers.
- High leverage - points with unusual values for the covariates; they can strongly affect the regression line. Use the *leverage statistic* to identify high-leverage points. This statistic is always between `1/n` and `1`, and its average is always `(p+1)/n`. Plot studentized residuals vs. leverage.
- Collinearity - 2 or more covariates are highly correlated. This increases the std error of the model coefficient estimates and reduces the power of hypothesis tests. Multicollinearity refers to collinearity among 3 or more covariates. Use the *variance inflation factor (VIF)* to detect it. To remove collinearity, drop one of the affected covariates or create a new covariate from the collinear ones. (See the diagnostics sketch after this list.)
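A diagnostics sketch, continuing the hypothetical `statsmodels` fit from earlier (`fit`, `df`, and the covariate names are placeholders): studentized residuals for outliers, leverage for high-leverage points, and VIF for collinearity.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

infl = fit.get_influence()
student_resid = infl.resid_studentized_external   # studentized residuals
leverage = infl.hat_matrix_diag                   # leverage statistic per observation

# Typically both are plotted against fitted values / each other; here just flag indices
print(np.where(np.abs(student_resid) > 3)[0])       # candidate outliers
print(np.where(leverage > 3 * leverage.mean())[0])  # points well above the average (p+1)/n

# VIF for each covariate, computed from the design matrix with an intercept column
X = sm.add_constant(df[["tv", "radio"]])
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))
```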
- Parametric methods like linear regression assume a functional form for `f(X)`. This simplifies the problem of finding `f(X)`: only a few model coefficients need to be estimated, and the resulting model is often easier to interpret. But if the assumed functional form is wrong, the model fit will be poor. Use a parametric method if its form is close to the true form of `f(X)`.
- Non-parametric methods do not assume any functional form and are therefore more flexible, but need a lot more data to get a good fit. E.g., K-nearest neighbors (KNN) regression: given `K` and a set of covariates `x_0`, take the `K` training points whose covariates are closest to `x_0` and average their responses to predict the response at `x_0`. The bias-variance trade-off determines the optimal value of `K` (see the KNN sketch below).
- KNN regression degrades as the number of covariates `p` increases, since there may be no nearby neighbors, forcing KNN to use covariate sets far from the one for which the response needs to be predicted. Generally, parametric methods will outperform non-parametric methods when there is a small number of data points per covariate, and parametric methods can be easier to interpret even when a large amount of data is available.
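A small KNN regression sketch with scikit-learn (the data and the choice of `K` values are placeholders), illustrating how `K` controls the bias-variance trade-off:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200).reshape(-1, 1)        # single covariate
y = np.sin(x).ravel() + rng.normal(0, 0.2, 200)   # non-linear truth + noise

x_new = np.array([[5.0]])
for k in (1, 5, 50):
    knn = KNeighborsRegressor(n_neighbors=k).fit(x, y)
    # Small K: flexible, low bias, high variance; large K: smoother, higher bias
    print(k, knn.predict(x_new))
```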
ISLR, Chapter 3