Design - emer/etable GitHub Wiki
etable
etable.Table is just a slice of etensor objects. The primary functionality includes:
- Basic structural management -- creating new tables (with Schema), adding / accessing cols, rows, cells.
- Basic Database-style functions:
  - Sort
  - Filter
  - Split / Group
  - Join
The latter functionality is much more efficient when done using an indexed view into a table, provided by the etable.IdxView type. Thus, all data processing / analysis functions operate on the IdxView. There are also a few basic methods on the table that use the IdxView internally, for convenience.
Below is more discussion and rationale.
IdxView: An indexed view into a Table
Index indirection is very important for efficiency: operations like Sort & Filter are much more efficiently done on a single index list instead of moving entire rows of data around. In C++ emergent, we baked the index functionality directly into the DataTable, but here we keep it as a separate view type that is very lightweight and allows many different views into the same table, and avoids the considerable difficulty in maintaining the integrity of the indexes throughout the entire scope of usage of the table.
If the index is baked into the table itself, it permeates every single access to the underlying etensor data, and thus creates strong dependencies between the tensor columns and the parent table. In C++ emergent, we had complicated indexing functions built into the Matrix column type, and keeping all that synchronized was very expensive in terms of code and complexity.
Thus, instead of baking indexing thoroughly into the entire Table and etensor, it is better to keep those as simple and obvious as possible: a Table is just a slice of tensors, and each tensor is just a tensor.
Indexing is then added as an optional outer IdxView type that provides an indexed view onto a given Table. The std lib sort.Sort operating on sort.Interface essentially requires such a wrapper type to manage the sorting methods, and we extend this to handle all such indexed functionality including Filter, Split, and Join. This design relieves the major burden of needing to track everything through the index and always maintain its integrity -- that is the main problem of baking it in. It is relatively easy to track an index through a specific controlled set of operations, but doing that in general across all time is hard.
Thus, the general workflow involves performing a sequence of data manipulation operations using the IdxView methods, and when a final result is needed, NewTable can be called to "render" or "flatten" the indexed view out into a new concrete Table with data physically arranged according to the indexes. This is more-or-less how pandas works, as described in Python Data Science.
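The workflow of manipulating indexes first and flattening last can be sketched as follows. The Filter and NewTable methods here are written against simplified stand-in types, with method names borrowed from the text; the signatures are assumptions for illustration, not the real etable API.

```go
package main

import "fmt"

// Table and IdxView are hypothetical, simplified stand-ins for the
// etable types discussed above.
type Table struct {
	Names  []string
	Scores []float64
}

type IdxView struct {
	Table *Table
	Idxs  []int
}

// Filter keeps only the indexes for which keep returns true.
// It edits the index list in place and never touches the table.
func (ix *IdxView) Filter(keep func(t *Table, row int) bool) {
	kept := ix.Idxs[:0]
	for _, row := range ix.Idxs {
		if keep(ix.Table, row) {
			kept = append(kept, row)
		}
	}
	ix.Idxs = kept
}

// NewTable "flattens" the view: it copies the indexed rows, in index
// order, into a new concrete Table.
func (ix *IdxView) NewTable() *Table {
	nt := &Table{
		Names:  make([]string, len(ix.Idxs)),
		Scores: make([]float64, len(ix.Idxs)),
	}
	for i, row := range ix.Idxs {
		nt.Names[i] = ix.Table.Names[row]
		nt.Scores[i] = ix.Table.Scores[row]
	}
	return nt
}

func main() {
	t := &Table{
		Names:  []string{"a", "b", "c", "d"},
		Scores: []float64{3.5, 1.0, 2.25, 0.5},
	}
	// Start from a view that is already ordered by ascending score.
	ix := &IdxView{Table: t, Idxs: []int{3, 1, 2, 0}}
	ix.Filter(func(t *Table, row int) bool { return t.Scores[row] >= 1.0 })
	flat := ix.NewTable()
	fmt.Println(flat.Names, flat.Scores)
	fmt.Println("source rows:", len(t.Names))
}
```

The data is copied exactly once, at the end, no matter how many sort or filter steps preceded it.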
Instead of a monolithic "Group" function (as present in C++ emergent), the new design uses Splits, which maintains multiple IdxViews, each of which represents one subset or split of the full Table. Various further operations can then operate on these Splits, including aggregation functions that aggregate across each split and store the results in optional fields on the Splits itself. You can also spin out separate tables for each split (e.g., for train, test splits), etc.
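A rough sketch of the Splits idea, assuming a table with one string grouping column and one numeric column. The Splits struct, SplitByGroup, and AggMean below are hypothetical simplifications (plain index slices instead of full IdxViews), meant only to show one index list per split plus an optional aggregate stored alongside.

```go
package main

import (
	"fmt"
	"sort"
)

// Table is a hypothetical two-column table.
type Table struct {
	Group []string
	Value []float64
}

// Splits is a hypothetical stand-in: one index list per distinct Group
// value, plus an optional per-split aggregate result.
type Splits struct {
	Table *Table
	Keys  []string  // the distinct group values, sorted
	Idxs  [][]int   // Idxs[i] lists the table rows belonging to Keys[i]
	Means []float64 // filled in by AggMean; nil until then
}

// SplitByGroup partitions the rows of t by their Group value.
func SplitByGroup(t *Table) *Splits {
	byKey := map[string][]int{}
	for row, g := range t.Group {
		byKey[g] = append(byKey[g], row)
	}
	sp := &Splits{Table: t}
	for k := range byKey {
		sp.Keys = append(sp.Keys, k)
	}
	sort.Strings(sp.Keys) // deterministic split order
	for _, k := range sp.Keys {
		sp.Idxs = append(sp.Idxs, byKey[k])
	}
	return sp
}

// AggMean computes the mean of Value within each split and stores the
// results on the Splits itself. Every split has at least one row.
func (sp *Splits) AggMean() {
	sp.Means = make([]float64, len(sp.Idxs))
	for i, idxs := range sp.Idxs {
		sum := 0.0
		for _, row := range idxs {
			sum += sp.Table.Value[row]
		}
		sp.Means[i] = sum / float64(len(idxs))
	}
}

func main() {
	t := &Table{
		Group: []string{"train", "test", "train", "test"},
		Value: []float64{1, 10, 3, 20},
	}
	sp := SplitByGroup(t)
	sp.AggMean()
	for i, k := range sp.Keys {
		fmt.Println(k, sp.Idxs[i], sp.Means[i])
	}
}
```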
Why not etensor.IdxView?
In principle, we could define an indexed view into an individual etensor.Tensor too, and define the core functionality there, e.g., for aggregation. However, the Tensor is not fundamentally a row-based data structure itself -- it is only in the context of a Table that the outer row dimension becomes special, and indexed operations like Sort, Filter, Group etc all only really apply to Tables.
Other non-core Data functionality (Analyze, Gen, etc)
General principles:
- Use package-level organization of different types of functionality -- definitely separate from basic Table.
- A major question is whether to take etensor args directly vs. operating on the table. E.g., DistMatrix could be its own package (distmat) that takes a src etensor as the main thing to compute on, plus optional names etc tensors (actually these should just be []string, which we can directly get from a 1D string etensor column).
agg
The agg package provides aggregation operations, Sum, Mean, etc, defined over IdxView, operating on a given column. These operations use the IdxView.AggCol method as the core iterator over a column, which uses the indexed indirection. This allows the same methods to automatically work for Group-level aggregation when applied over the views in a Splits collection of views.
For convenience we provide three different APIs for each function, e.g., for Mean:
- agg.Mean takes a column name and doesn't do error checking: it will return nil or panic if the column name is not found.
- agg.MeanTry takes a column name but is fully safe, and returns an error if the column is not found or another error is encountered during the function.
- agg.MeanIdx takes a column index, and is the core computational function -- it will panic if idx is out of range.
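The three-variant pattern can be sketched like this. Everything here is a simplified assumption: the table holds only scalar float64 columns, the functions return a single float64 rather than whatever the real agg functions return, and the unchecked Mean variant simply panics on a missing column.

```go
package main

import "fmt"

// Table and IdxView are hypothetical stand-ins with scalar float64 columns.
type Table struct {
	ColNames []string
	Cols     [][]float64
}

type IdxView struct {
	Table *Table
	Idxs  []int
}

// ColIdx looks up a column by name, returning -1 and an error if absent.
func (t *Table) ColIdx(name string) (int, error) {
	for i, nm := range t.ColNames {
		if nm == name {
			return i, nil
		}
	}
	return -1, fmt.Errorf("column named %q not found", name)
}

// MeanIdx is the core computation: the mean of one column over the
// indexed rows. It panics if colIdx is out of range, and returns 0
// for an empty view.
func MeanIdx(ix *IdxView, colIdx int) float64 {
	col := ix.Table.Cols[colIdx]
	if len(ix.Idxs) == 0 {
		return 0
	}
	sum := 0.0
	for _, row := range ix.Idxs {
		sum += col[row]
	}
	return sum / float64(len(ix.Idxs))
}

// MeanTry is the safe variant: it reports a missing column as an error.
func MeanTry(ix *IdxView, colNm string) (float64, error) {
	ci, err := ix.Table.ColIdx(colNm)
	if err != nil {
		return 0, err
	}
	return MeanIdx(ix, ci), nil
}

// Mean is the convenience variant: no error checking, so a missing
// column name ends up as an out-of-range panic inside MeanIdx.
func Mean(ix *IdxView, colNm string) float64 {
	ci, _ := ix.Table.ColIdx(colNm)
	return MeanIdx(ix, ci)
}

func main() {
	t := &Table{
		ColNames: []string{"x", "y"},
		Cols:     [][]float64{{1, 2, 3, 4}, {10, 20, 30, 40}},
	}
	ix := &IdxView{Table: t, Idxs: []int{0, 3}} // a view onto rows 0 and 3
	fmt.Println(Mean(ix, "x"))
	fmt.Println(MeanIdx(ix, 1))
	_, err := MeanTry(ix, "nope")
	fmt.Println(err)
}
```

Keeping the computation in the Idx variant means the name-based variants are thin wrappers, so the error-handling policy is the only thing that differs between them.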
dist
The dist package provides distance metric comparisons between tensors and a distance matrix from all pairwise distances across rows of a column in a table. It supports basic Squared, Euclidean, InnerProd, Cosine, Covar, etc metrics.
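As an illustration of what such a package computes, here is a minimal sketch of a few of the named metrics over flat []float64 vectors, plus an all-pairs distance matrix. The MetricFunc type and the function names are assumptions for this sketch, and the rows are plain slices rather than etensor cells. Both inputs to each metric are assumed to have the same length.

```go
package main

import (
	"fmt"
	"math"
)

// MetricFunc is a hypothetical signature for a pairwise metric.
type MetricFunc func(a, b []float64) float64

// SumSquares is the squared distance: the sum of squared differences.
func SumSquares(a, b []float64) float64 {
	ss := 0.0
	for i := range a {
		d := a[i] - b[i]
		ss += d * d
	}
	return ss
}

// Euclidean is the square root of the squared distance.
func Euclidean(a, b []float64) float64 { return math.Sqrt(SumSquares(a, b)) }

// InnerProd is the dot product of a and b.
func InnerProd(a, b []float64) float64 {
	ip := 0.0
	for i := range a {
		ip += a[i] * b[i]
	}
	return ip
}

// Cosine is the inner product normalized by both vector lengths.
// It yields NaN if either vector is all zeros.
func Cosine(a, b []float64) float64 {
	return InnerProd(a, b) / math.Sqrt(InnerProd(a, a)*InnerProd(b, b))
}

// DistMatrix applies fn to every pair of rows, returning an n x n matrix.
func DistMatrix(rows [][]float64, fn MetricFunc) [][]float64 {
	n := len(rows)
	dm := make([][]float64, n)
	for i := range dm {
		dm[i] = make([]float64, n)
		for j := range rows {
			dm[i][j] = fn(rows[i], rows[j])
		}
	}
	return dm
}

func main() {
	rows := [][]float64{{0, 0}, {3, 4}, {6, 8}}
	for _, r := range DistMatrix(rows, Euclidean) {
		fmt.Println(r)
	}
	fmt.Println(Cosine(rows[1], rows[2]))
}
```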
etensor
Much of the numerical heavy-lifting happens at the etensor level. Details below, but key conclusions are:
- Use gonum/floats and the Floats method on etensor for all "flat" []float64 computation -- strongly favor use of Float64 for all analysis-focused data -- Float32 will require a conversion. Someday maybe write a floats32 version for gonum and use that, but maybe not.
- Use gonum/mat directly on 2D etensor elements which support the mat.Matrix interface.
In general, the approach in Go (e.g., gonum) is to have a rather minimal structural interface api associated with the data storage itself (so it is easy to implement that api in another case), and then there are other functional packages that operate on that.
We can look at the Gonum api to see how they've organized everything, and whether we can just work with their stuff or not.
- gonum/floats is the main "core" math api (uses optimized asm / vectorized code). It just operates directly on []float64, and unless there is a bug in Gide/Symbols, there is NO "Set" or "Zero" method to be found in there even! I guess it is so trivial that you are expected to just write it yourself every time. gonum/stats works on top of that, again on []float64. The fact that they didn't introduce a wrapper interface around float64 means that we can't directly use this code on tensor objects except for the one Float64 case. OTOH, the speed cost of such a wrapper would be high relative to operating directly.
  - We can convert any tensor into []float64 for the read-only methods: the Floats method returns []float64, which can be the direct Values for a Float64 etensor.
- gonum/mat.Dense is backed by the blas64 type. It actually DOES look like there is a bug in the Pi parser, because Symbols does not show all the methods in dense_arithmetic which are actually implemented as methods on Dense for all the basic arithmetic ops etc. This is not an interface though, just methods on a concrete type, similar to what I ended up doing on the mat32 Vec3 etc.
- Most of the heavy-lifting in gonum/mat just copies the source to a Dense and runs it through blas anyway. So we don't need to sweat it too much. The basic logic is that we're going to be a lot faster than Python anyway, so who cares. :)
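The Floats conversion idea from the list above can be sketched as follows, assuming a Floats method that returns []float64 as described there. The Floater interface and the two tensor types are hypothetical stand-ins, and the local Sum function stands in for any routine that, like those in gonum/floats, operates directly on a []float64.

```go
package main

import "fmt"

// Floater is a hypothetical interface: anything that can present its
// values as a flat []float64 for read-only computation.
type Floater interface {
	Floats() []float64
}

// Float64 is a stand-in for a float64 tensor.
type Float64 struct{ Values []float64 }

// Floats returns the underlying Values directly: no copy, no conversion.
func (t *Float64) Floats() []float64 { return t.Values }

// Float32 is a stand-in for a float32 tensor.
type Float32 struct{ Values []float32 }

// Floats must allocate and convert every element, which is the cost
// that favors Float64 for analysis-focused data.
func (t *Float32) Floats() []float64 {
	out := make([]float64, len(t.Values))
	for i, v := range t.Values {
		out[i] = float64(v)
	}
	return out
}

// Sum is a local stand-in for a routine operating on a flat []float64.
func Sum(s []float64) float64 {
	sum := 0.0
	for _, v := range s {
		sum += v
	}
	return sum
}

// Total works for any tensor type by going through Floats.
func Total(f Floater) float64 { return Sum(f.Floats()) }

func main() {
	f64 := &Float64{Values: []float64{1, 2, 3}}
	f32 := &Float32{Values: []float32{1.5, 2.5}}
	fmt.Println(Total(f64), Total(f32))
}
```

Because the Float64 case hands back its own backing slice, callers should treat the result as read-only, which matches the "read-only methods" restriction above.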
Specific cases
- Use etensor.SetFunc for all "InitVals" kinds of functionality, but we can add a SetZero function for convenience.
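One way this could look, as a sketch only: the SetFunc signature below (a function of the flat element index) is an assumption made for illustration and may not match the actual etensor method, and the Float64 type is a stand-in.

```go
package main

import "fmt"

// Float64 is a hypothetical stand-in for a float64 tensor.
type Float64 struct{ Values []float64 }

// SetFunc sets every element to fn(idx), where idx is the flat index
// of the element. This signature is an assumption for this sketch.
func (t *Float64) SetFunc(fn func(idx int) float64) {
	for i := range t.Values {
		t.Values[i] = fn(i)
	}
}

// SetZero is the convenience wrapper: SetFunc with a constant zero.
func (t *Float64) SetZero() {
	t.SetFunc(func(int) float64 { return 0 })
}

func main() {
	t := &Float64{Values: make([]float64, 4)}
	t.SetFunc(func(i int) float64 { return float64(i) * 0.5 })
	fmt.Println(t.Values)
	t.SetZero()
	fmt.Println(t.Values)
}
```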