symbolic_regression - morinim/ultra GitHub Wiki
Symbolic regression is a type of regression analysis that searches the space of mathematical expressions to find the model that best fits a given dataset.
No predefined model is assumed at the outset. Instead, candidate expressions are formed by randomly combining mathematical building blocks such as operators, analytic functions, constants, and input variables (sometimes referred to as state variables).
While Genetic Programming (GP) can be used for a wide variety of tasks, symbolic regression is probably its most frequent area of application (the term symbolic regression stems from early work on GP by John Koza).
GP builds a population of simple random formulae that model the relationship between the independent variables and the output, in order to predict new data. Successive generations of formulae (also known as individuals or programs) are evolved from the previous one by selecting the fittest individuals in the population to undergo genetic operations such as crossover and mutation.
The fitness function that drives the evolution can take into account not only error metrics (to ensure the models accurately predict the data), but also special complexity measures, thus ensuring that the resulting models reveal the data's underlying structure in a way that's understandable from a human perspective. This facilitates reasoning and increases the chances of gaining insights about the data-generating system.
```cpp
#include "kernel/ultra.h"

#include <iostream>

int main()
{
  using namespace ultra;

  // DATA SAMPLE (output, input)
  src::raw_data training =  // the target function is `y = x + sin(x)`
  {
    {  "Y",    "X"},
    {-9.456, -10.0},
    {-8.989,  -8.0},
    {-5.721,  -6.0},
    {-3.243,  -4.0},
    {-2.909,  -2.0},
    { 0.000,   0.0},
    { 2.909,   2.0},
    { 3.243,   4.0},
    { 5.721,   6.0},
    { 8.989,   8.0}
  };

  // READING INPUT DATA
  src::problem prob(training, symbol_init::all);

  // SEARCHING
  src::search s(prob);
  const auto result(s.run());

  std::cout << "\nCANDIDATE SOLUTION\n"
            << out::c_language << result.best_individual()
            << "\n\nFITNESS\n" << *result.best_measurements().fitness << '\n';
}
```

(for convenience, the above code is available in the examples/symbolic_regression/symbolic_regression.cc file)
All the classes and functions are placed into the ultra namespace.
```cpp
#include "kernel/ultra.h"
```

ultra.h is the only header file you have to include: it provides everything needed for genetic programming (both symbolic regression and classification), genetic algorithms and differential evolution.
The training data is provided in the `f(X), X` format. In this example, it's hardcoded as a table in memory:

```cpp
src::raw_data training =
{
  { "Y",    "X"},
  {-9.456, -10.0},
  {-8.989,  -8.0},
  ...
};
```

but in general it can be read from a stream or a file (such as a CSV file):
```cpp
std::istringstream training(R"(
  -9.456,-10.0
  -8.989, -8.0
  ...
)");
```

The data points are generated from an unknown target function f. Our goal is to discover this function.
On a graph:

*(figure: plot of the training data points)*
```cpp
src::problem prob(training, symbol_init::all);
```

The src::problem object encapsulates everything required for evolution: parameters, dataset and symbol definitions.
By passing symbol_init::all, ULTRA automatically enables a predefined set of mathematical primitives (such as arithmetic operators and common analytic functions). This allows you to start immediately without worrying about which building blocks to include.
Input variables are still inferred directly from the dataset.
Now all that's left is to start the search:
```cpp
src::search s(prob);
const auto result(s.run());
```

and print the results:
```cpp
std::cout << "\nCANDIDATE SOLUTION\n"
          << out::c_language << result.best_individual()
          << "\n\nFITNESS\n" << *result.best_measurements().fitness << '\n';
```

out::c_language is a manipulator that makes it possible to control the output format. python_language and cpp_language are other possibilities (see individual.h for the full list).
The predefined set is convenient, but in many cases you may want finer control over the search space.
Instead of using symbol_init::all, you can explicitly specify which primitives to include:
```cpp
src::problem prob(training);

prob.insert<real::sin>();
prob.insert<real::cos>();
prob.insert<real::add>();
prob.insert<real::sub>();
prob.insert<real::div>();
prob.insert<real::mul>();
```

(the above code is available in the examples/symbolic_regression/symbolic_regression_bis.cc file)
ULTRA provides a rich collection of ready-to-use primitives (see primitive set).
Choosing the right set of primitives is important:
- fewer primitives → simpler models, faster convergence;
- more primitives → richer models, but a larger and harder search space.
This trade-off is central to symbolic regression and often problem-dependent.
A typical run produces something like:
```
[INFO] Importing dataset...
[INFO] ...dataset imported
[INFO] Examples: 10, features: 1, classes: 0
[INFO] Setting up terminals...
[INFO] Category 0 variables: `X`
[INFO] ...terminals ready
[INFO] Number of layers set to 1
[INFO] Population size set to 72
      0.058      0: -0.00553475
[INFO] Evolution completed at generation: 102. Elapsed time: 5.023
Run 0 TRAINING. Fitness: -0.00553475

CANDIDATE SOLUTION
sin(X)+X

FITNESS
-0.00553475
```
Graphically, this is what happens to the population:

*(animation of the evolving population)*

(if you're curious about the animation, take a look at examples/symbolic_regression/symbolic_regression02.cc and examples/symbolic_regression02.py)
