Statistical Methods in an ML Project - SoojungHong/StatisticalMind GitHub Wiki
1. Problem Framing
Statistical methods that can aid in the exploration of the data during the framing of a problem include:
- Exploratory Data Analysis. Summarization and visualization in order to explore ad hoc views of the data.
- Data Mining. Automatic discovery of structured relationships and patterns in the data.
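As a concrete illustration of the exploratory side, the sketch below computes a few ad hoc summaries of a numeric and a categorical variable using only the Python standard library; the sample values are made up for illustration.

```python
# A minimal EDA sketch using only the standard library.
# The numeric and categorical values below are illustrative.
import statistics
from collections import Counter

# Hypothetical numeric feature (e.g. a sensor reading).
values = [4.1, 4.7, 5.2, 4.9, 5.0, 4.8, 5.3, 4.6, 5.1, 4.4]

# Summarize the distribution with a few ad hoc statistics.
print("mean:  ", round(statistics.mean(values), 2))
print("median:", statistics.median(values))
print("stdev: ", round(statistics.stdev(values), 2))

# Hypothetical categorical feature; frequency counts give a quick view.
labels = ["red", "blue", "red", "green", "red", "blue"]
print("counts:", Counter(labels).most_common())
```

In practice these quick summaries are usually the first step before committing to a particular framing of the problem.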
2. Data Understanding
Data understanding means having an intimate grasp of both the distributions of variables and the relationships between variables.
Two large branches of statistical methods are used to aid in understanding data; they are:
- Summary Statistics. Methods used to summarize the distribution and relationships between variables using statistical quantities.
- Data Visualization. Methods used to summarize the distribution and relationships between variables using visualizations such as charts, plots, and graphs.
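The summary-statistics branch can be sketched as follows, again with only the standard library; the two variables are made-up data chosen so that a clear linear relationship exists between them.

```python
# A minimal summary-statistics sketch; the data is illustrative.
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]  # roughly linear in x

# Per-variable summaries describe each distribution.
print("x: mean=%.2f stdev=%.2f" % (statistics.mean(x), statistics.stdev(x)))
print("y: mean=%.2f stdev=%.2f" % (statistics.mean(y), statistics.stdev(y)))

# Pearson's correlation coefficient summarizes the linear relationship.
def pearson(a, b):
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5

print("corr(x, y) = %.3f" % pearson(x, y))
```

A correlation near 1.0 here confirms the strong linear relationship that a scatter plot of the same data would show visually.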
3. Data Cleaning
Although the data is digital, it may be subjected to processes that damage its fidelity, and in turn damage any downstream processes or models that make use of it.
Some examples include:
- Data corruption
- Data errors
- Data loss
Statistical methods are used for data cleaning; for example:
- Outlier detection. Methods for identifying observations that are far from the expected value in a distribution.
- Imputation. Methods for repairing or filling in corrupt or missing values in observations.
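Both cleaning steps can be sketched in a few lines of standard-library Python. The data, the median-imputation choice, and the 2-sigma outlier cutoff below are all illustrative assumptions, not fixed rules.

```python
# A minimal data-cleaning sketch: imputation, then outlier detection.
# The raw values and the 2-sigma cutoff are illustrative choices.
import statistics

raw = [5.1, 4.9, 5.3, None, 5.0, 4.8, 21.0, 5.2, None, 4.7]

# Imputation: fill missing values with the median of the observed values
# (the median is more robust to outliers than the mean).
observed = [v for v in raw if v is not None]
fill = statistics.median(observed)
imputed = [fill if v is None else v for v in raw]

# Outlier detection: flag values more than 2 standard deviations
# from the mean (a common, but arbitrary, rule of thumb).
mu = statistics.mean(imputed)
sigma = statistics.stdev(imputed)
outliers = [v for v in imputed if abs(v - mu) > 2 * sigma]

print("imputed with:", fill)
print("outliers:", outliers)
```

Note the ordering matters: imputing before computing the mean and standard deviation means the fill value itself influences the outlier threshold, which is one reason robust statistics like the median are often preferred here.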
4. Data Selection
Not all observations or all variables may be relevant when modeling.
The process of reducing the scope of data to those elements that are most useful for making predictions is called data selection.
Two types of statistical methods that are used for data selection include:
- Data Sample. Methods to systematically create smaller representative samples from larger datasets.
- Feature Selection. Methods to automatically identify those variables that are most relevant to the outcome variable.
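Both kinds of selection can be sketched with the standard library. The synthetic dataset below is constructed so that the outcome depends only on the first feature; the feature names and the correlation-based ranking are illustrative assumptions.

```python
# A minimal data-selection sketch: random sampling of rows, then
# feature ranking by absolute correlation with the outcome.
# The synthetic data is constructed so x0 drives the outcome.
import random
import statistics

random.seed(0)

n = 200
x0 = [random.gauss(0, 1) for _ in range(n)]
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [2 * a + random.gauss(0, 0.1) for a in x0]  # y depends only on x0

# Data sampling: a smaller random subset of row indices.
sample_idx = random.sample(range(n), k=50)

# Feature selection: rank features by absolute Pearson correlation with y.
def abs_corr(a, b):
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    sa = sum((ai - ma) ** 2 for ai in a) ** 0.5
    sb = sum((bi - mb) ** 2 for bi in b) ** 0.5
    return abs(cov / (sa * sb))

scores = {name: abs_corr(col, y)
          for name, col in [("x0", x0), ("x1", x1), ("x2", x2)]}
best = max(scores, key=scores.get)
print("feature scores:", {k: round(v, 3) for k, v in scores.items()})
print("most relevant feature:", best)
```

Correlation-based ranking is only one simple filter method; it captures linear relevance and will miss nonlinear or interaction effects.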
5. Data Preparation
Data can often not be used directly for modeling.
Some transformation is often required in order to change the shape or structure of the data to make it more suitable for the chosen framing of the problem or learning algorithms.
Data preparation is performed using statistical methods. Some common examples include:
- Scaling. Methods such as standardization and normalization.
- Encoding. Methods such as integer encoding and one-hot encoding.
- Transforms. Methods such as power transforms, for example the Box-Cox method.
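The three kinds of preparation can be sketched as follows, using only the standard library; the values and the category set are illustrative, and the power transform is shown only for the simple Box-Cox cases of lambda = 0 (log) and lambda = 1.

```python
# A minimal data-preparation sketch: scaling, encoding, and a
# power transform. Values and categories are illustrative.
import math
import statistics

values = [2.0, 4.0, 6.0, 8.0, 10.0]

# Standardization: rescale to zero mean and unit standard deviation.
mu, sigma = statistics.mean(values), statistics.stdev(values)
standardized = [(v - mu) / sigma for v in values]

# Normalization: rescale to the range [0, 1].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# One-hot encoding of a categorical variable.
categories = ["red", "green", "blue"]
def one_hot(label):
    return [1 if c == label else 0 for c in categories]

# Power transform: the Box-Cox family, which reduces to log(x)
# at lambda = 0 (valid for positive x only).
def box_cox(x, lam):
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

print(standardized)
print(normalized)
print(one_hot("green"))
print(box_cox(10.0, 0))
```

In practice the Box-Cox lambda is usually estimated from the data (e.g. by maximum likelihood) rather than chosen by hand as it is here.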
6. Model Evaluation
A crucial part of a predictive modeling problem is evaluating a learning method.
Generally, planning this process of training and evaluating a predictive model is called experimental design, itself a whole subfield of statistical methods.
- Experimental Design. Methods to design systematic experiments to compare the effect of independent variables on an outcome, such as the choice of a machine learning algorithm on prediction accuracy.
As part of implementing an experimental design, resampling methods are used to make economical use of the available data when estimating the skill of a model. Resampling is itself a subfield of statistical methods.
- Resampling Methods. Methods for systematically splitting a dataset into subsets for the purposes of training and evaluating a predictive model.
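The most common resampling scheme, k-fold cross-validation, can be sketched as below; the dataset size and the choice of k = 5 are illustrative.

```python
# A minimal k-fold cross-validation splitting sketch.
# Dataset size and k are illustrative choices.
import random

random.seed(1)
data = list(range(10))  # stand-in for 10 observation indices

def k_fold_splits(items, k):
    """Yield (train, test) index lists for each of k folds."""
    idx = items[:]
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in k_fold_splits(data, k=5):
    print("train:", sorted(train), "test:", sorted(test))
```

Each observation appears in exactly one test fold, so every data point is used for both training and evaluation across the k iterations, which is what makes the scheme economical.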
7. Model Configuration
A given machine learning algorithm often has a suite of hyperparameters that allow the learning method to be tailored to a specific problem.
The configuration of the hyperparameters is often empirical in nature, rather than analytical, requiring large suites of experiments in order to evaluate the effect of different hyperparameter values on the skill of the model.
The interpretation and comparison of the results between different hyperparameter configurations is made using one of two subfields of statistics, namely:
- Statistical Hypothesis Tests. Methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).
- Estimation Statistics. Methods that quantify the uncertainty of a result using confidence intervals.
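As one small example of the estimation-statistics side, the sketch below computes a normal-approximation confidence interval around a classification accuracy; the accuracy of 0.85 and sample size of 500 are made-up numbers for illustration.

```python
# A minimal estimation-statistics sketch: a 95% confidence interval
# for a classification accuracy, via the normal approximation to the
# binomial. The accuracy and sample size are illustrative.
import math

def accuracy_interval(acc, n, z=1.96):
    """Confidence interval for an accuracy estimated on n examples."""
    half = z * math.sqrt(acc * (1 - acc) / n)
    return max(0.0, acc - half), min(1.0, acc + half)

lo, hi = accuracy_interval(acc=0.85, n=500)
print("accuracy 0.85, 95%% CI: [%.3f, %.3f]" % (lo, hi))
```

The interval narrows as the test set grows, which is why two hyperparameter configurations whose intervals overlap on a small test set may be statistically indistinguishable.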
8. Model Selection
Any one of many machine learning algorithms may be appropriate for a given predictive modeling problem.
The process of selecting one method as the solution is called model selection. The same two subfields of statistics used for model configuration can be used to interpret and compare the estimated skill of candidate models:
- Statistical Hypothesis Tests. Methods that quantify the likelihood of observing the result given an assumption or expectation about the result (presented using critical values and p-values).
- Estimation Statistics. Methods that quantify the uncertainty of a result using confidence intervals.
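As one small example of a hypothesis test for model selection, the sketch below applies a two-sided sign test to per-fold accuracy differences between two models; the fold scores are made-up numbers, and the sign test is just one simple choice among many possible tests.

```python
# A minimal hypothesis-test sketch for model selection: a two-sided
# sign test on per-fold accuracies. The fold scores are illustrative.
import math

scores_a = [0.81, 0.83, 0.80, 0.84, 0.82, 0.85, 0.83, 0.81, 0.84, 0.82]
scores_b = [0.78, 0.80, 0.79, 0.81, 0.80, 0.82, 0.81, 0.79, 0.80, 0.79]

# Count folds where each model wins (ties are dropped).
wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
n = wins_a + wins_b

# Two-sided sign-test p-value: the probability of a split at least this
# lopsided under the null hypothesis that either model is equally
# likely to win a fold, i.e. wins ~ Binomial(n, 0.5).
k = max(wins_a, wins_b)
p_value = min(1.0, 2 * sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n)

print("model A wins %d/%d folds, p = %.4f" % (wins_a, n, p_value))
```

A small p-value here gives some evidence that model A's advantage is not due to the chance assignment of observations to folds, though a caveat applies: scores from overlapping cross-validation folds are not fully independent, so such tests should be interpreted with care.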
9. Model Presentation
Once a final model has been trained, it can be presented to stakeholders prior to being used or deployed to make actual predictions on real data. Estimation statistics, such as confidence intervals and tolerance intervals, can be used to quantify the expected skill of the final model as part of this presentation.