# Machine learning
"Ecological patterns are interesting only if they are repeated, and repeated patterns are of special interest because of their generality." (MacArthur)
Niche modelling has moved through increasingly sophisticated forms of regression and is presently flirting with machine learning. Machine learning can be thought of as an extremely complicated, refined, and high-performing regression, able to fit the data it was trained on very closely.
Machine learning methods cannot distinguish between information and noise in the dataset. That is a significant pitfall, because it can make the model see patterns associated with non-relevant information (noise). This causes a loss of predictive power, because the noise patterns are unique to the training dataset.
- Overfitting
  - Definition 1: with big data comes big noise; a tight fit to information + noise gives a high training R2, but the noise patterns are not found in the test/validation data, so the test/validation R2 is low -> use 10-fold cross-validation
  - Definition 2: accepting a predictor variable that is nominally correlated with the response variable in the dataset, but which does not represent a relationship that holds generally
  - Reasons:
    - Weak correlations among variables arise as a result of random noise
    - Associations between the predictor and response variables are real in the dataset, but do not occur under a wide range of conditions
- Issue: Spatial autocorrelation
  - Poor prediction when the test/validation data are spatially distinct from the model calibration data, because validation data are often not independent of the calibration data
  - Overfitting when training and test data are spatially close
  - Spatial autocorrelation is not an issue within the source domains of machine learning
- Issue: Affects spatial transferability
- Issue: Affects temporal transferability
- Implication: lack of generality
- One-dimensional view: we only worry about R2
  - Variable importance alone gives no sign, magnitude, specific interactions, non-linearity, etc.
  - Partial response plots allow the interpretation...
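The overfitting pattern described above (high training R2, low cross-validated R2) can be sketched with a small synthetic example; the data, the decision-tree learner, and scikit-learn itself are all assumptions for illustration, not part of the original notes:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: a systematic signal (sine curve) plus stochastic noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=200)

results = {}
for depth in (2, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    train_r2 = tree.fit(X, y).score(X, y)
    # 10-fold cross-validation: the noise fitted during training
    # is absent from each held-out fold, so CV R2 drops.
    cv_r2 = cross_val_score(
        tree, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0)
    ).mean()
    results[depth] = (train_r2, cv_r2)
    print(f"max_depth={depth}: training R2={train_r2:.2f}, 10-fold CV R2={cv_r2:.2f}")
```

The unrestricted tree fits the training data almost perfectly, while its cross-validated R2 is markedly lower: the gap is the noise it memorised.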
"Models carry the meaning of science. This puts a tremendous burden on the process of model selection. In general practice, models are selected based on their relative goodness of fit to data penalized by model complexity. However, this may not be the most effective approach for selecting models to answer a specific scientific question because the model fit is sensitive to all aspects of a model, not just those relevant to the question."
Use AIC, BIC, WAIC, or WBIC to compensate for overfitting by penalizing model complexity.
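As a minimal sketch of how such criteria penalize complexity, the Gaussian-error forms AIC = n·ln(RSS/n) + 2k and BIC = n·ln(RSS/n) + k·ln(n) can be computed for polynomial fits of increasing degree; the data and the polynomial models are hypothetical:

```python
import numpy as np

# Hypothetical data: the true model is linear, with Gaussian noise.
rng = np.random.default_rng(1)
n = 100
x = np.linspace(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.2, size=n)

rss_by, aic_by, bic_by = {}, {}, {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x, y, degree)
    rss = float(np.sum((y - np.polyval(coefs, x)) ** 2))
    k = degree + 1  # number of fitted coefficients
    aic_by[degree] = n * np.log(rss / n) + 2 * k
    bic_by[degree] = n * np.log(rss / n) + k * np.log(n)
    rss_by[degree] = rss
    print(f"degree={degree}: RSS={rss:.3f}, AIC={aic_by[degree]:.1f}, BIC={bic_by[degree]:.1f}")
```

RSS always shrinks as complexity grows, but the penalty terms make the criteria favour the simpler (true) model.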
Problem with the ensemble!

- Boosted regression trees allow all kinds of non-linear relationships, including thresholds and unimodal responses
- Regression trees suffer from instability (a different sample of the data gives a different tree); ensembling over data samples addresses this
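A small sketch of the threshold point: a boosted regression tree recovers a step-shaped response that a straight-line regression cannot. The data, the threshold location, and the scikit-learn models are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical threshold response: y jumps from 0 to 1 at x = 5.
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(300, 1))
y = np.where(X[:, 0] > 5, 1.0, 0.0) + rng.normal(scale=0.1, size=300)

brt = GradientBoostingRegressor(n_estimators=200, max_depth=2, random_state=0).fit(X, y)
lin = LinearRegression().fit(X, y)

grid = np.array([[2.0], [8.0]])  # one point on each side of the threshold
brt_pred = brt.predict(grid)
lin_pred = lin.predict(grid)
print("boosted trees:", np.round(brt_pred, 2))
print("linear model: ", np.round(lin_pred, 2))
```

The boosted trees predict close to 0 below the threshold and close to 1 above it, while the linear model smears the step into a slope.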
It is not possible to create a 95% CI around a nonparametric function.
Different techniques, different forms:

- Logistic regression -> sigmoidal curve
- Regression tree -> branching decision tree
- Bagged trees / random forest -> ensemble of regression trees fitted to different samples of the data
- Multivariate adaptive regression splines -> sum of many basis functions
- Neural nets -> composition of activation functions
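The different functional forms can be seen by fitting the same binary response with several of the techniques above and inspecting their predicted probabilities along a grid; the 1-D data and the specific scikit-learn estimators are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data generated from a true sigmoidal probability.
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(400, 1))
p_true = 1 / (1 + np.exp(-2 * X[:, 0]))
y = (rng.uniform(size=400) < p_true).astype(int)

grid = np.linspace(-3, 3, 7).reshape(-1, 1)
models = {
    "logistic": LogisticRegression(),                                    # sigmoidal curve
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),         # piecewise-constant steps
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),  # average of many trees
}
probs = {name: m.fit(X, y).predict_proba(grid)[:, 1] for name, m in models.items()}
for name, p in probs.items():
    print(f"{name:8s}", np.round(p, 2))
```

The logistic fit rises smoothly and monotonically, whereas the single tree produces flat steps and the forest a smoothed average of steps.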
The basic idea is to fit the model at a high degree of complexity, then test the fit on a separate set of data ("holdout", or validation data [1]), keeping the model parameters constant while increasing the complexity from low to high (for example by adding more nodes to the neural network, more branches to a regression tree, etc.).
[1] The simplest approach is to hold back, say, 1/3 of the data for validation. More commonly now, a more elaborate technique known as 10-fold cross-validation is used. Here 90% of the data is used for calibration and the remaining 10% for validation. Then another 10% is set aside for validation. This is repeated 10 times, so that all the data are used once.
Holdout:

- Randomly select x% of the data to calibrate, then use the remaining (1-x)% to validate

Cross-validation:

- Randomly select 90% of the data to calibrate, use the remaining 10% to validate, and repeat this ten times so that all the data are used
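The two splitting schemes can be sketched with scikit-learn on a toy set of 30 row indices (the dataset size and split ratio are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(30).reshape(-1, 1)  # 30 toy observations

# Holdout: 2/3 of the rows calibrate, 1/3 validate.
X_cal, X_val = train_test_split(X, test_size=1 / 3, random_state=0)
print("holdout:", len(X_cal), "calibration rows,", len(X_val), "validation rows")

# 10-fold CV: every row ends up in a validation fold exactly once.
validated = []
for cal_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    validated.extend(val_idx.tolist())
print("each row validated once:", sorted(validated) == list(range(30)))
```

Holdout wastes the held-back third for fitting; cross-validation uses every observation for both calibration and validation, just never in the same fold.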
A fundamental goal of ecology is to identify relationships and patterns that are repeatable or general. Such entities can be said to have generality, generalizability (the capacity of a model to produce accurate predictions with new data), or transferability (the capacity of a model to be geographically or temporally cross-applicable) to datasets other than the one for which they were developed:
- Can a species model developed in one region successfully predict in a different region?
- Can models developed in one time period predict a different time period with different weather or climatic conditions?
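One common way to probe this kind of transferability, and to blunt the spatial-autocorrelation problem noted earlier, is region-blocked cross-validation: each validation fold is a whole region never seen during calibration. A minimal sketch with scikit-learn's `GroupKFold` and entirely hypothetical region labels:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: two predictors, 12 sites in four regions.
rng = np.random.default_rng(4)
X = rng.normal(size=(12, 2))
regions = np.array(["north"] * 3 + ["south"] * 3 + ["east"] * 3 + ["west"] * 3)

held_out = []
for cal_idx, val_idx in GroupKFold(n_splits=4).split(X, groups=regions):
    # GroupKFold guarantees no region appears in both calibration and validation.
    held_out.append(set(regions[val_idx]))
    print("calibrate on:", sorted(set(regions[cal_idx])),
          "-> validate on:", sorted(set(regions[val_idx])))
```

A model that scores well under this scheme has at least some evidence of spatial transferability, unlike one validated on randomly shuffled folds.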
Within most datasets there exists an underlying systematic or deterministic relationship, obscured to varying degrees by other factors, including stochastic noise.
The success of the analysis depends on the ability to tease apart the systematic and stochastic components of the data, thereby producing a model representing the underlying systematic aspects of the data, rather than capturing the specific details (i.e. noise contribution) of the particular dataset.
This objective is the central premise behind model selection (i.e., inclusion of significant variables in the model) and validation (i.e., assessing model predictive performance).