terminology - morinim/ultra GitHub Wiki
The following terms are frequently used in Ultra-related documentation.
Class imbalance
The class imbalance problem arises when certain classes have significantly more instances than others. In such cases, standard classifiers tend to favour the majority classes while underperforming on the minority ones.
Discriminant function
A discriminant function is a mathematical function used in statistics and machine learning to classify data points into predefined categories or groups. It helps determine the class to which a new, unseen data point likely belongs, based on its features. In essence, it creates boundaries in the feature space to separate different classes.
Key Concepts
- Classification: discriminant functions are primarily used for classification problems, where the goal is to assign data points to specific categories.
- Feature Space: this refers to the multi-dimensional space defined by the features (variables) of the data points.
- Decision Boundary: the discriminant function defines a decision boundary in the feature space that separates different classes.
- Linear and Non-linear: discriminant functions can be linear (straight lines or planes) or non-linear, depending on the complexity of the data and the desired decision boundary.
How it works in Ultra
A discriminant function is evolved using a dataset with known class labels (the training data). The function evolves to approximate the relationship between features and class labels, effectively learning the decision boundary.
Once evolved, the discriminant function is used by an oracle to predict the class of a new, unlabeled data point by evaluating it on the new data point's features.
Example
A pair consisting of an instance (with its features) and a label.
In Ultra the class representing an example is dataframe::example (see dataframe.h and dataframe.cc).
Heterogeneous Genetic Algorithm (HGA)
The domain of genetic algorithms can be extended by employing more complex solution encodings. One such approach involves concatenating multiple heterogeneously encoded gene segments into a single chromosome.
This allows the algorithm to tackle optimisation problems where parameters belong to vastly different domains. Suppose you're solving a job shop scheduling problem in which:
- you need to assign operations to machines (a permutation of tasks);
- you must also tune algorithmic parameters of the scheduler itself (e.g. priority rules, cooling rate in a hybrid GA-SA approach).
These two components are different in nature:
- job assignment โ typically represented as a permutation (e.g. [task3, task1, task2, ...]);
- parameter tuning โ represented as real numbers or categorical values (e.g. cooling rate = 0.95, rule = "earliest due date")
Using a heterogeneous genetic algorithm allows you to:
- encode the task permutation and parameter vector in separate sections of the chromosome;
- apply crossover and mutation operators tailored to each part (order-based crossover for the task sequence, standard crossover for numeric parameters).
This separation improves convergence and avoids disruptive operations (for example, preventing permutation crossover from corrupting the scheduler's parameters).
In Ultra the class used for HGA chromosomes is hga::individual (see hga/individual.h and hga/individual.cc). The eight queen tutorial provides a good practical introduction.
Instance
An entity for which a prediction is made. For example, an instance might be an image classified as either hotdog or not hotdog (Silicon Valley: Season 4 Episode 4).
Feature
A measurable attribute of an instance used for prediction. For example, a car might have a feature Mileage.
NOTE: in machine learning, a feature can have several meanings depending on the context. It can refer to a data type (e.g. Mileage) or a data type along with its value. Many people use attribute and feature interchangeably (including in this wiki).
Label
An answer for a prediction task, either the machine learning system's output or the correct answer provided in the training data.
For example, the label for an image might be hotdog.
Consider the following:
-
in symbolic regression tasks, labels are numerical values and can be retrieved using the
label_asfunction; -
in classification tasks, class labels may be encoded using different data types. However, Ultra employs a uniform scheme where classes are represented as integers (
class_t). This simplifies data handling and manipulation.The actual integer label can be accessed via the
label(example)function, while the original label can be retrieved using thedataframe::class_namemethod.
Metric
A numerical value of interest. It may or may not be directly optimized.
Model
A statistical representation of a prediction task. A model is trained on examples and subsequently used to make predictions.
Objective
A metric that the algorithm aims to optimize.
Pipeline
The infrastructure surrounding a machine learning algorithm. It encompasses data collection from the front end, processing into training data files, training one or more models, and deploying the models into production.
Primitive set
The primitive set is the alphabet of a genetic programming program. It consists of a terminal set (variables, constants,ย and functions without arguments) and a function set (functions and constructs driven by the problem domain).
Stratified sampling
Stratified sampling is a technique used in statistics and machine learning to ensure that the distribution of samples across different classes or categories remains representative of the population.
The population is divided into distinct groups based on specific characteristics (e.g., age, gender, income level), and samples are then randomly selected from each group in proportion to their representation in the population. This method helps ensure adequate representation of each subgroup, helping to reduce bias in the results.
Stratified sampling is particularly useful when handling imbalanced datasets, where certain classes or categories are significantly more prevalent than others. The goal is to maintain class proportions in the sample that closely reflect those in the overall population.