dataframe - morinim/ultra GitHub Wiki
Structurally, a dataframe is a 2D data structure with columns of potentially different types. You can think of it like a spreadsheet or a SQL table.
Functionally it resembles the pandas DataFrame object, but:
- it supports only a subset of the operations available in Python's Pandas DataFrame. ultra::dataframe covers basic usage scenarios and is not intended as a replacement for tools that provide extensive data preprocessing;
- by default, it automatically splits an example into features (input) and label (output). It also supports storing unlabeled examples (e.g. for unsupervised learning tasks or for storing examples to be classified);
- it is more row-oriented, whereas Pandas DataFrame is predominantly column-oriented.
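To make the row-oriented layout and the feature / label split more concrete, here is a minimal sketch. The names value_t, example, table, holds_value and labeled_count are hypothetical stand-ins chosen for illustration; they are not the types or functions defined by ultra, and the real library may organise things differently.

```cpp
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT the types
// defined by the ultra library.
using value_t = std::variant<std::monostate, int, double, std::string>;

// True when the value holds something other than the "empty" alternative.
inline bool holds_value(const value_t &v)
{
  return !std::holds_alternative<std::monostate>(v);
}

// One row of the table: features and label are stored side by side.
struct example
{
  std::vector<value_t> input;  // features
  value_t output;              // label; std::monostate when unlabeled
};

// Row-oriented container: a sequence of examples.
using table = std::vector<example>;

// Counts the examples that carry a label.
inline std::size_t labeled_count(const table &t)
{
  std::size_t n = 0;
  for (const auto &e : t)
    if (holds_value(e.output))
      ++n;
  return n;
}
```

In this sketch each row keeps its own heterogeneous values, so iterating over examples is natural, while column-wise operations (the strong point of Pandas) would require visiting every row.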
Users can import data from a CSV file:
std::istringstream dataset(R"(
A, B, C, D
a0, 0.0, 0, d0
a1, 0.1, 1, d1
a2, 0.2, 2, d2)");
dataframe d;
d.read(dataset);
or from a table in memory:
// `raw_data` is basically a table: rows of variant values.
raw_data dataset2 =
{
{ "A", "B", "C", "D"},
{"a0", 0.0, 0, "d0"},
{"a1", 0.1, 1, "d1"},
{"a2", 0.2, 2, "d2"}
};
dataframe d;
d.read(dataset2);
In both cases, we have:
d.columns[0].name() == "A"
d.columns[1].name() == "B"
d.columns[2].name() == "C"
d.columns[3].name() == "D"
By default, the first column (column 0) is treated as the output column. The user can specify a different column of the CSV / table:
d.read(dataset, dataframe::params().output(2));
In this case, the specified output column is shifted to the first position:
d.columns[0].name() == "C"
d.columns[1].name() == "A"
d.columns[2].name() == "B"
d.columns[3].name() == "D"
Note:
- for CSV files, the parser sniffs the presence of column headers (if this fails - CSV is a textbook example of how not to design a text-based file format - the user can manually indicate the correct configuration using params::header / params::no_header);
- for tables in memory, read behaves differently and simply assumes the first row contains the headers.
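Header sniffing is inherently heuristic. The sketch below shows one common approach and why it can fail; it is an assumption made for illustration, not the algorithm ultra actually uses. The helpers split_fields, is_number and looks_like_header are hypothetical names, and the splitter ignores quoting.

```cpp
#include <algorithm>
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Splits a line on commas and trims surrounding blanks (no quoting support).
std::vector<std::string> split_fields(const std::string &line)
{
  std::vector<std::string> fields;
  std::istringstream ss(line);
  std::string field;
  while (std::getline(ss, field, ','))
  {
    const auto first = field.find_first_not_of(" \t\r");
    const auto last = field.find_last_not_of(" \t\r");
    fields.push_back(first == std::string::npos
                       ? std::string()
                       : field.substr(first, last - first + 1));
  }
  return fields;
}

// True when the whole field parses as a floating-point number.
bool is_number(const std::string &s)
{
  if (s.empty())
    return false;
  try
  {
    std::size_t pos = 0;
    std::stod(s, &pos);
    return pos == s.size();
  }
  catch (const std::exception &)
  {
    return false;
  }
}

// Illustrative heuristic (NOT ultra's): the first row is taken as a header
// when it has no numeric field while the second row has at least one.
bool looks_like_header(const std::string &row0, const std::string &row1)
{
  const auto head = split_fields(row0);
  const auto body = split_fields(row1);
  const bool head_has_number = std::any_of(head.begin(), head.end(), is_number);
  const bool body_has_number = std::any_of(body.begin(), body.end(), is_number);
  return !head_has_number && body_has_number;
}
```

With the sample data, "A, B, C, D" followed by "a0, 0.0, 0, d0" is recognised as a header. When every column is textual the heuristic cannot tell a header from a data row, which is the kind of situation where an explicit params::header / params::no_header setting is needed.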
To access the label (output value) and the features (input values) of an example (here column C is the output, as in the read call with output(2) above):
std::cout << "Label of the first example is: " << lexical_cast<double>(d.front().output)
          << "\nFeatures are:"
          << "\nA: " << lexical_cast<std::string>(d.front().input[0])
          << "\nB: " << lexical_cast<double>(d.front().input[1])
          << "\nD: " << lexical_cast<std::string>(d.front().input[2]) << '\n';
For unlabeled examples, use the no_output modifier:
d.read(dataset, dataframe::params().no_output());
In this case:
d.columns[0].name() == ""
d.columns[1].name() == "A"
d.columns[2].name() == "B"
d.columns[3].name() == "C"
d.columns[4].name() == "D"
A surrogate empty output column is added at the beginning and has_value(d.front().output) == false.
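The three column layouts described so far (default output, explicit output index, no output) can be summarised with a small sketch. The function arrange_columns is a hypothetical helper written for illustration; it only mimics the resulting order of column names and says nothing about how ultra implements it. It needs C++17 for std::optional.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper (NOT part of ultra): returns the column names in the
// order the text describes.
// - output == k       : column k is moved to the front, the others keep
//                       their relative order;
// - output == nullopt : an empty surrogate output column is prepended.
std::vector<std::string> arrange_columns(std::vector<std::string> names,
                                         std::optional<std::size_t> output = 0)
{
  if (!output)
  {
    names.insert(names.begin(), std::string());
    return names;
  }

  assert(*output < names.size());

  // Rotating [begin, pos + 1) around pos brings the element at pos to the
  // front and shifts the preceding elements one place to the right.
  const auto pos = names.begin() + static_cast<std::ptrdiff_t>(*output);
  std::rotate(names.begin(), pos, pos + 1);
  return names;
}
```

Calling arrange_columns({"A", "B", "C", "D"}, 2) yields C, A, B, D, while passing std::nullopt yields an empty name followed by A, B, C, D, matching the two listings above.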
To access information about the column structure, use the columns member:
std::cout << "Name of the first column: " << d.columns[0].name()
<< "\nCategory of the first column: " << d.columns[0].domain();
std::cout << "\nThere are " << d.columns.size() << " columns\n";