Glossary - gmohandas/handson-ml GitHub Wiki
Data Mining Applying machine learning to dig into large amounts of data so as to help discover patterns that were not immediately apparent.
Anomaly Detection Detection of instances that depart significantly from normal.
Association Rule Learning Discovering interesting relations between attributes that were not immediately apparent by examining vast quantities of data.
Batch Learning A system that must be trained offline using all the available training data at once; it cannot learn incrementally.
Online Learning The system is trained incrementally by feeding it data instances sequentially, either individually or in mini-batches. Learning on the fly.
Out-of-core Learning Online learning algorithms that train on huge datasets (that do not all fit into main memory) piecemeal.
Learning Rate A measure of how fast the learning system adapts to changing data.
Measure of Similarity A quantitative estimate of how similar or dissimilar two data instances are.
Cost Function A quantitative estimate of how bad the model's predictions are on the training data; training aims to minimize it.
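As a concrete illustration of a cost function, here is a small sketch of mean squared error, a common choice for regression (the sample values are made up for illustration):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average squared deviation between targets and predictions.
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

# Lower is better: a perfect model would score 0.
print(mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))
```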
Sampling Noise Non-representative data that arise as a result of chance.
Sampling Bias A data sampling method that is flawed or unsuitable to the problem at hand.
Non-response Bias A special type of sampling bias introduced when part of the sampled population does not respond, making the collected data unrepresentative.
Feature Engineering Determining good features with which to train the system. It involves feature selection and feature extraction.
Feature Selection Selecting the most useful features to train on from the existing set of available features.
Feature Extraction Combining existing features into a more useful one, thereby accomplishing dimensionality reduction.
Regularization Constraining a model to make it simpler and thereby reduce the risk of overfitting.
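A short sketch of regularization in action, using scikit-learn's Ridge (L2) regression on made-up data; `alpha` here is the hyperparameter controlling how strongly the weights are constrained:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=30)

plain = LinearRegression().fit(X, y)   # unconstrained fit
ridge = Ridge(alpha=10.0).fit(X, y)    # L2 penalty shrinks the weights

# The regularized model's weights are smaller in total magnitude,
# i.e. the model is simpler and less prone to overfitting.
print(np.abs(plain.coef_).sum(), np.abs(ridge.coef_).sum())
```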
Hyperparameter A parameter of the learning algorithm (not the model), set before training; for example, the amount of regularization to apply during learning.
Model-based Training A training algorithm that builds a model of the data, tunes the model's parameters to minimize a cost function on the training data, and uses the resulting model to generalize to new instances.
Instance-based Training A training algorithm that learns the training data by heart and uses a similarity measure to generalize to new instances.
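The two approaches above can be contrasted on the same toy data (the dataset and query point are assumptions for illustration): `LinearRegression` is model-based (it tunes a slope and intercept), while `KNeighborsRegressor` is instance-based (it memorizes the training instances and predicts by averaging the nearest ones, a similarity measure):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

model_based = LinearRegression().fit(X, y)                     # learns parameters
instance_based = KNeighborsRegressor(n_neighbors=2).fit(X, y)  # memorizes instances

print(model_based.predict([[2.5]]))     # predicts from the fitted line
print(instance_based.predict([[2.5]]))  # averages the 2 nearest neighbors
```

On this perfectly linear data both predict the same value; they diverge on noisier or non-linear data.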
Generalization Error The error rate on new, unseen instances; it is typically estimated using the test set.
Cross-Validation Splitting the training set into complementary subsets, then training the model on a different combination of these subsets each time and validating it against the remaining part.
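A minimal sketch of k-fold cross-validation with scikit-learn's `cross_val_score` (the Iris dataset and logistic regression model are illustrative choices, not prescribed by this glossary):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# cv=5 splits the training data into 5 folds: each fold serves once as the
# validation set while the model is trained on the other 4.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # average validation accuracy across the 5 folds
```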