terminology - morinim/ultra GitHub Wiki

The following terms are frequently used in Ultra-related documentation.

Class imbalance

The class imbalance problem arises when certain classes have significantly more instances than others. In such cases, standard classifiers tend to favour the majority classes while underperforming on the minority ones.
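The classic trap can be shown with a toy sketch (the 95/5 split below is made up for illustration): a degenerate classifier that always predicts the majority class looks accurate while never detecting the minority class.

```cpp
#include <cassert>
#include <vector>

// Accuracy of a classifier that always predicts `majority`, regardless
// of the input. With a 95/5 class split it scores ~95% accuracy while
// being useless on the minority class.
double majority_accuracy(const std::vector<int> &labels, int majority)
{
  std::size_t correct(0);
  for (int l : labels)
    if (l == majority)
      ++correct;

  return static_cast<double>(correct) / labels.size();
}
```

This is why plain accuracy is a poor metric on imbalanced data.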

Discriminant function

A discriminant function is a mathematical function used in statistics and machine learning to classify data points into predefined categories or groups. It helps determine the class to which a new, unseen data point likely belongs, based on its features. In essence, it creates boundaries in the feature space to separate different classes.

Key Concepts

  • Classification: discriminant functions are primarily used for classification problems, where the goal is to assign data points to specific categories.
  • Feature Space: this refers to the multi-dimensional space defined by the features (variables) of the data points.
  • Decision Boundary: the discriminant function defines a decision boundary in the feature space that separates different classes.
  • Linear and Non-linear: discriminant functions can be linear (straight lines or planes) or non-linear, depending on the complexity of the data and the desired decision boundary.

How it works in Ultra

A discriminant function is evolved using a dataset with known class labels (the training data). The function evolves to approximate the relationship between features and class labels, effectively learning the decision boundary. Once evolved, the discriminant function is used by an oracle to predict the class of a new, unlabeled data point by evaluating it on the new data point's features.
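As a general illustration (not Ultra's actual representation), a linear discriminant assigns a class by the sign of a weighted sum of the features; the weights and bias below are placeholders for what would normally be learned or evolved:

```cpp
#include <numeric>
#include <vector>

// Linear discriminant sketch: the decision boundary is the hyperplane
// where the weighted sum of features equals zero. `w` and `b` stand in
// for evolved/learned parameters.
int classify(const std::vector<double> &w, double b,
             const std::vector<double> &x)
{
  const double score =
    std::inner_product(w.begin(), w.end(), x.begin(), b);

  return score >= 0.0 ? 1 : 0;  // which side of the boundary?
}
```

Non-linear discriminants follow the same idea but replace the weighted sum with a more complex function of the features.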

Example

A pair consisting of an instance (with its features) and a label.

In Ultra the class representing an example is dataframe::example (see dataframe.h and dataframe.cc).

Heterogeneous Genetic Algorithm (HGA)

The domain of genetic algorithms can be extended by employing more complex solution encodings. One such approach involves concatenating multiple heterogeneously encoded gene segments into a single chromosome.

This allows the algorithm to tackle optimisation problems where parameters belong to vastly different domains. Suppose you're solving a job shop scheduling problem in which:

  • you need to assign operations to machines (a permutation of tasks);
  • you must also tune algorithmic parameters of the scheduler itself (e.g. priority rules, cooling rate in a hybrid GA-SA approach).

These two components are different in nature:

  • job assignment → typically represented as a permutation (e.g. [task3, task1, task2, ...]);
  • parameter tuning → represented as real numbers or categorical values (e.g. cooling rate = 0.95, rule = "earliest due date").

Using a heterogeneous genetic algorithm allows you to:

  • encode the task permutation and parameter vector in separate sections of the chromosome;
  • apply crossover and mutation operators tailored to each part (order-based crossover for the task sequence, standard crossover for numeric parameters).

This separation improves convergence and avoids disruptive operations (for example, preventing permutation crossover from corrupting the scheduler's parameters).
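A minimal sketch of the idea, with made-up types and operators for the scheduling example above: the chromosome keeps the permutation and the numeric parameters in separate sections, each with a mutation operator that respects its encoding.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Hypothetical heterogeneous chromosome: a permutation section for the
// task order and a real-valued section for scheduler parameters.
struct chromosome
{
  std::vector<int>    task_order;  // permutation section
  std::vector<double> params;      // numeric section
};

// Permutation-safe mutation: swapping two positions keeps the task
// order a valid permutation.
void mutate_permutation(chromosome &c, std::mt19937 &gen)
{
  std::uniform_int_distribution<std::size_t> d(0, c.task_order.size() - 1);
  std::swap(c.task_order[d(gen)], c.task_order[d(gen)]);
}

// Numeric mutation: small Gaussian perturbation of each parameter.
void mutate_params(chromosome &c, std::mt19937 &gen)
{
  std::normal_distribution<double> noise(0.0, 0.1);
  for (auto &p : c.params)
    p += noise(gen);
}
```

Because each operator only touches its own section, mutating the parameters can never corrupt the task permutation and vice versa.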

In Ultra the class used for HGA chromosomes is hga::individual (see hga/individual.h and hga/individual.cc). The eight queens tutorial provides a good practical introduction.

Instance

An entity for which a prediction is made. For example, an instance might be an image classified as either hotdog or not hotdog (Silicon Valley: Season 4 Episode 4).

Feature

A measurable attribute of an instance used for prediction. For example, a car might have a feature Mileage.

NOTE: in machine learning, a feature can have several meanings depending on the context. It can refer to a data type (e.g. Mileage) or a data type along with its value. Many people use attribute and feature interchangeably (including in this wiki).

Label

An answer for a prediction task, either the machine learning system's output or the correct answer provided in the training data.

For example, the label for an image might be hotdog.

Consider the following:

  • in symbolic regression tasks, labels are numerical values and can be retrieved using the label_as function;

  • in classification tasks, class labels may be encoded using different data types. However, Ultra employs a uniform scheme where classes are represented as integers (class_t). This simplifies data handling and manipulation.

    The actual integer label can be accessed via the label(example) function, while the original label can be retrieved using the dataframe::class_name method.
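An illustrative sketch (independent of Ultra's actual implementation) of such a uniform scheme: class names are mapped to consecutive integers the first time they are seen, and the original name can be recovered from the integer, mirroring what label / dataframe::class_name provide.

```cpp
#include <map>
#include <string>
#include <vector>

using class_t = std::size_t;

// Maps arbitrary class names to consecutive integer labels and back.
class label_map
{
public:
  // Returns the integer label for `name`, assigning a new one on first use.
  class_t label(const std::string &name)
  {
    const auto [it, inserted] = index_.try_emplace(name, names_.size());
    if (inserted)
      names_.push_back(name);

    return it->second;
  }

  // Recovers the original class name (cf. dataframe::class_name).
  const std::string &class_name(class_t l) const { return names_[l]; }

private:
  std::map<std::string, class_t> index_;
  std::vector<std::string> names_;
};
```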

Metric

A numerical value of interest. It may or may not be directly optimised.

Model

A statistical representation of a prediction task. A model is trained on examples and subsequently used to make predictions.

NVI (Non-Virtual Interface pattern)

The pattern is a design strategy where you define a class interface using public, non-virtual functions that internally call private or protected virtual functions.

In a standard public virtual interface, the base class yields total control to the derived classes. Consider this typical example:

class Animal
{
public:
  virtual void speak() const = 0;
};

class Dog : public Animal
{
public:
  void speak() const override { std::cout << "Woof!" << std::endl; }
};

class Cat : public Animal
{
public:
  void speak() const override { std::cout << "Meow!" << std::endl; }
};

This uses the usual public virtual interface that we're used to, but it has a couple of problems:

  • each derived class repeats code: the only part that changes is the string, yet every override needs the whole std::cout << ... << std::endl; boilerplate;
  • the base class can't make guarantees about what speak() does: a derived class may forget the newline, write to std::cerr or deviate in any other way.

By moving the customization point to a private virtual function, the base class regains control of the workflow. This is a specific implementation of the Template Method Pattern.

class Animal
{
public:
  // The Non-Virtual Interface (entry point)
  void speak() const
  {
    // The base class controls the "how" (cout and newline)
    std::cout << getSound() << std::endl;
  }

private:
  // The customization point ("what").
  // Derived classes override this to provide specific data.
  virtual std::string getSound() const = 0;
};

class Dog : public Animal
{
private:
  std::string getSound() const override { return "Woof!"; }
};

class Cat : public Animal
{
private:
  std::string getSound() const override { return "Meow!"; }
};

The resulting structure as a class diagram:

classDiagram
note "NVI"

class Animal {
  +speak()
  -getSound()* string
}

class Dog {
  -getSound() string -- override
}

class Cat {
  -getSound() string -- override
}

Animal <|-- Dog : Inheritance
Animal <|-- Cat : Inheritance

The advantages of this pattern are:

  • enforcement of guarantees. The base class ensures that every "speak" action follows the same format (e.g., always ending in a newline);
  • ease of maintenance. If you want to change the output from std::cout to a logging file later, you only change the Animal::speak() method. All derived classes benefit immediately;
  • separation of concerns. The base class handles the workflow (the "template"), while the derived classes focus solely on the implementation details (the "sound").

For further reading, Herb Sutter's article on Virtuality remains the definitive guide on why public virtual functions should generally be avoided in favor of NVI.

The example is based on Peter Alexander's answer on Stack Overflow.

Objective

A metric that the algorithm aims to optimise.

Pipeline

The infrastructure surrounding a machine learning algorithm. It encompasses data collection from the front end, processing into training data files, training one or more models, and deploying the models into production.

Primitive set

The primitive set is the alphabet of a genetic programming program. It consists of a terminal set (variables, constants, and functions without arguments) and a function set (functions and constructs driven by the problem domain).
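A toy illustration of the two sets for a symbolic regression task (the specific terminals, functions and program are made up): terminals sit at the leaves of a program tree, functions at its internal nodes.

```cpp
#include <functional>
#include <map>
#include <string>

// Evaluates the hand-built program (* (+ x 1) x) with x = 3.
// The terminal set supplies leaf values, the function set supplies
// the operations available at internal nodes.
double eval_sample_program()
{
  const std::map<std::string, double> terminals{{"x", 3.0}, {"1", 1.0}};

  const std::map<std::string, std::function<double(double, double)>>
    functions{{"+", std::plus<double>()},
              {"*", std::multiplies<double>()}};

  return functions.at("*")(
           functions.at("+")(terminals.at("x"), terminals.at("1")),
           terminals.at("x"));
}
```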

Steady state

Steady-state population/genetic algorithm (SSGA) refers to a management strategy where only a small number of individuals are replaced at any given time, rather than replacing the entire population at once.

Key features include:

  • minimal turnover. Instead of creating a completely new generation of offspring (the "generational" model), an SSGA typically selects two parents, creates one or two offspring and immediately reinserts them into the current population;
  • overlapping generations. Parents and offspring coexist in the same population pool, meaning high-fitness individuals can survive for many iterations without needing to be "reselected" for a new generation;
  • replacement strategies. Since the population size remains constant, new offspring must replace existing members based on specific criteria (replace worst, replace random, crowding...);
  • other benefits: improved efficiency, lower memory usage and better preservation of diversity.

The overall loop:

graph TD
  Start([Start]) --> Init[Initialize population]
  Init --> Eval[Evaluate fitness of initial population]
    
  Eval --> Select[Select parents from current population]
  Select --> Crossover[Apply crossover & mutation]
  Crossover --> EvalOff[Evaluate fitness of new offspring]
    
  EvalOff --> Replace[Identify & replace member with new offspring]
    
  Replace --> Check{Termination?}
    
  Check -- No --> Select
  Check -- Yes --> End([End & return best solution])

  style Replace fill:#e1f5fe,stroke:#01579b,stroke-width:2px
  style Check fill:#fff9c4,stroke:#fbc02d
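The loop above can be sketched in a few lines; the fitness function, selection and replacement strategy below are deliberately simplistic placeholders (fitness is the individual's own value, selection is a tournament of two, replacement is replace-worst):

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Minimal steady-state loop: each iteration produces one offspring and
// replaces only the worst member, so parents and offspring overlap in
// the same population. Returns the best value found.
double steady_state(std::vector<double> pop, unsigned iterations,
                    std::mt19937 &gen)
{
  std::uniform_int_distribution<std::size_t> pick(0, pop.size() - 1);
  std::normal_distribution<double> noise(0.0, 0.1);

  for (unsigned i(0); i < iterations; ++i)
  {
    // Tournament of two: the better of two random individuals is parent.
    const double parent = std::max(pop[pick(gen)], pop[pick(gen)]);
    const double offspring = parent + noise(gen);  // "mutation"

    // Replace-worst strategy keeps the population size constant.
    const auto worst = std::min_element(pop.begin(), pop.end());
    if (offspring > *worst)
      *worst = offspring;
  }

  return *std::max_element(pop.begin(), pop.end());
}
```

Note how high-fitness individuals are never discarded wholesale: they simply survive in place until something better displaces the current worst.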

Stratified sampling

Stratified sampling is a technique used in statistics and machine learning to ensure that the distribution of samples across different classes or categories remains representative of the population.

The population is divided into distinct groups based on specific characteristics (e.g., age, gender, income level), and samples are then randomly selected from each group in proportion to their representation in the population. This method helps ensure adequate representation of each subgroup, helping to reduce bias in the results.

Stratified sampling is particularly useful when handling imbalanced datasets, where certain classes or categories are significantly more prevalent than others. The goal is to maintain class proportions in the sample that closely reflect those in the overall population.
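A sketch of proportional stratified sampling under simplifying assumptions (items are pre-grouped by class and we take a deterministic prefix of each group rather than a random draw):

```cpp
#include <map>
#include <string>
#include <vector>

// Draws the same fraction from every class ("stratum"), so the sample
// preserves the population's class proportions. A real implementation
// would shuffle each stratum before taking its share.
std::vector<std::string> stratified_sample(
  const std::map<std::string, std::vector<std::string>> &strata,
  double fraction)
{
  std::vector<std::string> sample;

  for (const auto &[cls, members] : strata)
  {
    // Number of items drawn from this stratum, proportional to its size.
    const auto n = static_cast<std::size_t>(members.size() * fraction);
    sample.insert(sample.end(), members.begin(), members.begin() + n);
  }

  return sample;
}
```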