09.Machine learning02.Regression - sporedata/researchdesigneR GitHub Wiki

1. Use cases: in which situations should I use this method?

  • The prediction of scores generated from a demanding data collection: Whenever the calculation of a score, such as physical function, takes significant effort to conduct in the clinic, one can use machine learning regression models to predict those same score levels without going through the time-consuming data collection for each of the components that go into that score. For example, one could take variables captured through administrative databases (information about the use of canes, physical therapy, and medications associated with physical impairment) to predict the final physical function score.
  • The prediction of future continuous clinical outcomes or healthcare utilization metrics, such as those involving prognosis or response to therapy.

2. Input: what kind of data does the method require?

  1. A dataset with predictors and outcomes, either cross-sectional (for diagnostics) or longitudinal (for prediction)

3. Algorithm: how does the method work?

Model mechanics

  • Regression diagnostics are a set of tools to assess model validity [1,2]. This can be done in different ways, for example:

    1. Determining whether a data-fitted regression model adequately represents the structure of the data.
    2. Evaluating the model's assumptions and investigating whether or not there are observations that have undue influence on the analysis.

    Regression diagnostics can help with both outlier detection and assumption checking in models based on the Generalized Linear Model (GLM). Most diagnostic techniques are based on residuals.
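
For reference, the residuals that most of these diagnostics build on can be written as below. This is a standard textbook formulation, not one taken from the cited sources. The symbols $y_i$ (observed outcome), $\hat{\mu}_i$ (fitted mean), and $V(\cdot)$ (variance function of the assumed distribution) are notation introduced here for illustration.

```latex
\begin{aligned}
e_i &= y_i - \hat{\mu}_i && \text{(raw residual)} \\
r_i^{P} &= \frac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}} && \text{(Pearson residual)}
\end{aligned}
```

For a Gaussian model $V(\mu) = 1$, so the Pearson residual equals the raw residual up to the dispersion; for a Poisson model $V(\mu) = \mu$, so residuals are scaled down where the fitted mean is large.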

  • The GLM is linear in its predictors and is not considered a robust methodology for prediction. Because it is simple to interpret, the GLM is still seen as a desirable technique, especially when determining how each predictor affects the outcome. Nevertheless, the assumed linear relationship is not always valid and is sensitive to departures in the data. Therefore, it is not advisable to fit a GLM without diagnosing it.

    GLM is a broad class of models that includes linear regression, logistic regression, and count models (Poisson, negative binomial, etc.). A GLM with a Gaussian error distribution and an identity link function is equivalent to multiple linear regression (MLR), thanks to their common basis [14]. In a GLM, we assume that every individual in the data matrix follows a similar distribution and that the same prediction rule applies to all of them. In other words, there are no subgroups.

    In GLM diagnostics, it is essential to find outliers through quantitative or graphical analysis. Quantitatively, one should detect the points that have an abnormally large influence on the model, or those to which the fitted model is most sensitive. Leverage and Cook's distance are two metrics that can be computed from a fitted model for this purpose. Graphically, outliers in a GLM can be detected with a QQ plot. Unlike in Gaussian linear models, residuals in a GLM are not expected to follow a normal distribution, so one should not expect the QQ plot to show a straight line.

    Also in diagnostics, a plot of residuals vs. fitted values is the most important graph. It is used to verify the model assumptions and to look for any indication of a nonlinear pattern in the residuals across the fitted values. If a linear model is correctly specified, the Pearson residuals are independent of the fitted values and of the regressors or predictors on which they are based. These graphs should therefore be null plots with no systematic features: the conditional distribution of the residuals on the y-axis should not change with the fitted values, a regressor, or a predictor on the x-axis. In most cases, systematic features indicate that one or more model assumptions are incorrect.
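
The quantitative checks described above (leverage and Cook's distance) can be sketched in a few lines. This is a minimal illustration, not the wiki authors' code. It assumes the simplest case, a Gaussian linear model with one predictor, fitted by ordinary least squares in plain Python. The function name `simple_ols_diagnostics`, the toy data, and the `4 / n` flagging cutoff (a common rule of thumb, not a value from the cited sources) are all assumptions made for the example.

```python
from statistics import mean


def simple_ols_diagnostics(x, y):
    """Fit y = b0 + b1 * x by least squares and return per-point diagnostics."""
    n = len(x)
    p = 2  # estimated coefficients: intercept and slope
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance estimate

    # Leverage: diagonal of the hat matrix for a one-predictor model
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Cook's distance: combines residual size and leverage for each point
    cooks = [
        (e ** 2 / (p * s2)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]
    return {
        "coef": (b0, b1),
        "fitted": fitted,
        "residuals": residuals,
        "leverage": leverage,
        "cooks_distance": cooks,
    }


# Hypothetical toy data: the last point is far out in x and off the trend
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 12.0]

diag = simple_ols_diagnostics(x, y)
cutoff = 4 / len(x)  # assumed rule-of-thumb threshold for Cook's distance
flagged = [i for i, d in enumerate(diag["cooks_distance"]) if d > cutoff]

for i, (h, d) in enumerate(zip(diag["leverage"], diag["cooks_distance"])):
    print(f"point {i}: leverage={h:.3f}, cooks_distance={d:.3f}")
print("flagged as influential:", flagged)
```

The returned `fitted` and `residuals` lists are the two axes of the residuals vs. fitted plot. For non-Gaussian GLMs the same quantities are normally taken from the fitted model object of a statistics package rather than computed by hand.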

Describing in words

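  • Regularization methods such as lasso and ridge penalize models for large or additional beta coefficients, thus making simpler models more likely to be chosen (Ockham's razor). Lasso can shrink some coefficients exactly to zero, effectively dropping those variables, whereas ridge only shrinks coefficients toward zero.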
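
The difference between the two penalties is easiest to see in a special case. The sketch below assumes an orthonormal design matrix, for which both penalized solutions have a closed form in terms of the unpenalized least-squares coefficients. It also assumes the objectives are written as half the squared error plus `lam * sum(|b|)` for lasso and plus `(lam / 2) * sum(b ** 2)` for ridge. The function names, the coefficient values, and `lam` are hypothetical, and real data rarely satisfy the orthonormality assumption.

```python
import math


def ridge_shrink(beta_ols, lam):
    """Ridge solution under an orthonormal design: proportional shrinkage."""
    return [b / (1.0 + lam) for b in beta_ols]


def lasso_shrink(beta_ols, lam):
    """Lasso solution under an orthonormal design: soft-thresholding."""
    shrunk = []
    for b in beta_ols:
        magnitude = max(abs(b) - lam, 0.0)
        shrunk.append(0.0 if magnitude == 0.0 else math.copysign(magnitude, b))
    return shrunk


# Hypothetical unpenalized coefficients: two strong and two weak predictors
beta_ols = [3.0, -1.5, 0.4, -0.2]
lam = 0.5  # assumed penalty strength, chosen only for illustration

ridge = ridge_shrink(beta_ols, lam)
lasso = lasso_shrink(beta_ols, lam)

print("ridge:", [round(b, 3) for b in ridge])
print("lasso:", lasso)
print("variables kept by lasso:", sum(b != 0.0 for b in lasso))
```

Under these assumptions ridge keeps every variable with a smaller coefficient, while lasso removes the two weak predictors entirely, which is the sense in which the penalty favors simpler models.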

Learning materials

4. References

[1] Ching-Ti Liu, Milton J, McIntosh A. Correlation and Regression with R- Regression Diagnostics. Boston University School of Public Health. 2016. p. 1–18.

[2] Fox J, Weisberg S. Chapter 6: Diagnosing Problems in Linear and Generalized Linear Models. In: An R Companion to Applied Regression. 2011. p. 285–328.

[3] Nelder JA, Wedderburn RW. Generalized linear models. Journal of the Royal Statistical Society: Series A (General). 1972 May;135(3):370-84.
