dataframe - morinim/ultra GitHub Wiki

Structurally, the ultra::dataframe is a 2D data structure designed to hold columns of various types, similar to a SQL table or a spreadsheet.

While it shares functional similarities with Python's pandas library, it is specifically tailored for evolutionary computation tasks.

Key differences from pandas include:

Feature-label splitting. It automatically distinguishes between features (inputs) and labels (outputs) by default.
Row-oriented design. Unlike pandas, which is predominantly column-oriented, this implementation is more row-oriented to suit the needs of genetic programming.
Focused scope. It supports only a subset of pandas' operations, as it is intended for basic usage rather than extensive data pre-processing.

The dataframe can also store unlabelled examples (e.g. for unsupervised learning tasks or for examples that still have to be classified).
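To make the row-oriented point concrete, here is a minimal sketch in standard C++ only. The `example` struct, the `model` alias and `mean_absolute_error` are hypothetical names chosen for illustration and are not part of ultra. It assumes numeric labels and features, and shows why keeping each label next to its features is convenient when a candidate program is scored one example at a time.

```cpp
#include <cmath>
#include <functional>
#include <vector>

// Hypothetical stand-in for one row: NOT the ultra type.
// The label and the features of an example are stored together.
struct example
{
  double output;              // label
  std::vector<double> input;  // features
};

// A candidate program, reduced here to a plain function of the features.
using model = std::function<double(const std::vector<double> &)>;

// Visits the table one row at a time, as a fitness function typically does,
// and returns the mean absolute error (0.0 for an empty table).
double mean_absolute_error(const std::vector<example> &rows, const model &m)
{
  if (rows.empty())
    return 0.0;

  double total = 0.0;
  for (const auto &row : rows)
    total += std::abs(m(row.input) - row.output);

  return total / static_cast<double>(rows.size());
}
```

With a column-oriented layout the same loop would have to gather one value from every column for each example; storing rows directly avoids that step.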

Basic functionality

Import data

A user can import data from a CSV file:

std::istringstream dataset(R"(
   A,   B, C,  D
  a0, 0.0, 0, d0
  a1, 0.1, 1, d1
  a2, 0.2, 2, d2)");

dataframe d;
d.read(dataset);

Or from a table in memory:

// `raw_data` is basically a table: rows of variant values.
raw_data dataset2 =
{
  { "A", "B", "C",  "D"},
  {"a0", 0.0,   0, "d0"},
  {"a1", 0.1,   1, "d1"},
  {"a2", 0.2,   2, "d2"}
};

dataframe d;
d.read(dataset2);

In both cases, the columns are named as follows:

d.columns[0].name() == "A"
d.columns[1].name() == "B"
d.columns[2].name() == "C"
d.columns[3].name() == "D"

By default, the first column (column 0) is treated as the output column. The user can select a different column of the CSV / table by its zero-based index:

d.read(dataset, dataframe::params().output(2));

In this case, the specified output column is shifted to the first position:

d.columns[0].name() == "C"
d.columns[1].name() == "A"
d.columns[2].name() == "B"
d.columns[3].name() == "D"
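The reordering above can be reproduced on a plain list of column names. The following is only a sketch: `move_to_front` is a hypothetical helper, not an ultra function, and it assumes that the remaining columns keep their relative order, as the listing above suggests.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: NOT part of ultra's API.
// Moves the column at `output_index` to position 0 and keeps the relative
// order of all other columns. An out-of-range index leaves the list as is.
std::vector<std::string> move_to_front(std::vector<std::string> names,
                                       std::size_t output_index)
{
  if (output_index < names.size())
  {
    const auto first = names.begin();
    const auto chosen = first + static_cast<std::ptrdiff_t>(output_index);

    // Rotating [first, chosen] by one position brings `chosen` to the front.
    std::rotate(first, chosen, chosen + 1);
  }

  return names;
}
```

Calling `move_to_front({"A", "B", "C", "D"}, 2)` yields the order C, A, B, D, matching the column names listed above.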

Note:

for CSV files, the parser tries to detect whether a header row is present. If detection fails (CSV is a textbook example of how not to design a text-based file format), the user can set the correct configuration manually with params::header / params::no_header;
for tables in memory, read skips detection and always treats the first row as the header.
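To illustrate why header detection can fail, here is one possible heuristic written in standard C++. It is an assumption made for illustration and not the algorithm ultra uses; `split`, `is_number` and `looks_like_header` are hypothetical names. The first line is taken to be a header if some column holds text in the first line and a number in the second.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits a line on commas (no quoting support: this is only a sketch).
std::vector<std::string> split(const std::string &line)
{
  std::vector<std::string> fields;
  std::istringstream ss(line);
  std::string field;

  while (std::getline(ss, field, ','))
    fields.push_back(field);

  return fields;
}

// True if the whole field, ignoring surrounding whitespace, is a number.
bool is_number(const std::string &field)
{
  std::istringstream ss(field);
  double value = 0.0;

  return (ss >> value) && (ss >> std::ws).eof();
}

// Hypothetical heuristic, NOT ultra's: a text field sitting on top of a
// numeric field in the same column suggests that the first line is a header.
bool looks_like_header(const std::string &first, const std::string &second)
{
  const auto a = split(first);
  const auto b = split(second);

  if (a.size() != b.size())
    return false;

  for (std::size_t i = 0; i < a.size(); ++i)
    if (!is_number(a[i]) && is_number(b[i]))
      return true;

  return false;
}
```

A dataset made only of text columns gives this heuristic nothing to work with, which is the kind of ambiguity that a manual header / no_header setting resolves.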

To access the label (output value) and the features (input values) of an example:

std::cout << "Label of the first example is: " << lexical_cast<double>(d.front().output)
          << "\nFeatures are:"
          << "\nA: " << lexical_cast<std::string>(d.front().input[0])
          << "\nB: " << lexical_cast<double>(     d.front().input[1])
          << "\nD: " << lexical_cast<std::string>(d.front().input[2]) << '\n';

For unlabelled examples, use the no_output modifier:

d.read(dataset, dataframe::params().no_output());

In this case:

d.columns[0].name() == ""
d.columns[1].name() == "A"
d.columns[2].name() == "B"
d.columns[3].name() == "C"
d.columns[4].name() == "D"

Here a surrogate empty output column is added at the beginning, and has_value(d.front().output) == false.
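One way to picture the surrogate column is a variant whose first alternative means "no value". The sketch below uses only the standard library and assumes this representation; `cell`, `record`, `holds_value`, `make_unlabelled` and `with_surrogate_output` are hypothetical names and do not come from ultra.

```cpp
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical cell type: std::monostate plays the role of "empty".
using cell = std::variant<std::monostate, int, double, std::string>;

// Illustrative counterpart of the has_value check mentioned above.
bool holds_value(const cell &c)
{
  return !std::holds_alternative<std::monostate>(c);
}

// Hypothetical row type: NOT the ultra type.
struct record
{
  cell output;              // empty for unlabelled examples
  std::vector<cell> input;  // every field of the source row
};

// All fields become features; the output stays empty.
record make_unlabelled(std::vector<cell> fields)
{
  return record{cell{}, std::move(fields)};
}

// Column names gain an unnamed surrogate output column at position 0.
std::vector<std::string> with_surrogate_output(std::vector<std::string> names)
{
  names.insert(names.begin(), std::string{});
  return names;
}
```

Under these assumptions, applying `with_surrogate_output` to A, B, C, D gives the five names listed above, and `holds_value` returns false for the output of every record built with `make_unlabelled`.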

Columns

To access information about the column structure, use the columns member:

std::cout << "Name of the first column: " << d.columns[0].name()
          << "\nCategory of the first column: " << d.columns[0].domain();

std::cout << "\nThere are " << d.columns.size() << " columns\n";

Manual setup

If you need to build the dataframe one piece at a time, you have to:

set up the general schema;
insert the data.

For example:

dataframe d;

d.set_schema({{"A", ultra::d_string}, {"B", ultra::d_double},
              {"C", ultra::d_double}, {"D", ultra::d_string}});

d.push_back({"a0", {0.0,   0, "d0"}});
d.push_back({"a1", {0.1,   1, "d1"}});
d.push_back({"a2", {0.2,   2, "d2"}});

This produces the same dataframe as before.
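The schema-first workflow can be mimicked with a small standalone class. This is a sketch under stated assumptions rather than a description of ultra's internals: `mini_table`, `kind` and `column_info` are hypothetical, and the rule that integer values are accepted in real-valued columns is an assumption made so that a column declared as double can be filled with 0, 1, 2 as in the example above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical domains: NOT ultra::d_string / ultra::d_double.
enum class kind {integer, real, text};

using cell = std::variant<int, double, std::string>;

struct column_info
{
  std::string name;
  kind domain;
};

class mini_table
{
public:
  // Step 1: fix the schema (any rows stored so far are discarded).
  void set_schema(std::vector<column_info> schema)
  {
    schema_ = std::move(schema);
    rows_.clear();
  }

  // Step 2: insert rows; each one is checked against the schema.
  void push_back(std::vector<cell> row)
  {
    if (row.size() != schema_.size())
      throw std::invalid_argument("row width does not match the schema");

    for (std::size_t i = 0; i < row.size(); ++i)
      if (!matches(row[i], schema_[i].domain))
        throw std::invalid_argument("bad value for column " + schema_[i].name);

    rows_.push_back(std::move(row));
  }

  std::size_t size() const { return rows_.size(); }
  const std::vector<column_info> &columns() const { return schema_; }

private:
  // Assumption: an int is acceptable wherever a real number is expected.
  static bool matches(const cell &c, kind k)
  {
    switch (k)
    {
    case kind::integer:
      return std::holds_alternative<int>(c);
    case kind::real:
      return std::holds_alternative<int>(c)
             || std::holds_alternative<double>(c);
    case kind::text:
      return std::holds_alternative<std::string>(c);
    }
    return false;
  }

  std::vector<column_info> schema_;
  std::vector<std::vector<cell>> rows_;
};
```

Declaring the schema before inserting data lets every row be validated on insertion, so a malformed row is rejected immediately instead of surfacing later during evaluation.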

References

CSV dataset format