Data Structures - RConsortium/ddR GitHub Wiki

There are three primary data structures that we want to support: darray, dlist, and dframe. These correspond to the three most important shapes of data: rectangular, hierarchical and tabular. We might consider a dvector alias that constructs 1D darrays. We should aim to support all of the canonical operations for each data type.

Laziness

These operations will very often be lazy: they return another distributed object rather than pulling data back to the local session. In particular, extracting a column from a dframe should yield a distributed object, typically a darray. The same goes for extracting an element from a dlist. Coercing a distributed object to its native equivalent, e.g. via as.vector, as.list, or as.data.frame, should collect the data. We should not expect the user to call collect directly.
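
The behaviour described above can be illustrated with a small toy model. This is only a sketch in Python, not the ddR API: the class names `DArray` and `DFrame`, the `collect_count` attribute, and the use of `list()` as a stand-in for `as.vector` are all assumptions made for illustration.

```python
class DArray:
    """Toy stand-in for a distributed 1D array: a list of partitions."""

    def __init__(self, partitions):
        self.partitions = partitions
        self.collect_count = 0  # hypothetical counter: how often data was gathered

    def _collect(self):
        # Gather every partition into one local list.
        self.collect_count += 1
        return [x for part in self.partitions for x in part]

    def __iter__(self):
        # Native coercion such as list(darr) gathers the data implicitly,
        # so the user never has to call a collect function directly.
        return iter(self._collect())


class DFrame:
    """Toy stand-in for a distributed data frame.

    Each partition is a dict mapping a column name to a list of values.
    """

    def __init__(self, partitions):
        self.partitions = partitions

    def __getitem__(self, column):
        # Column extraction stays distributed: one chunk per partition,
        # nothing is gathered into a single local object here.
        return DArray([part[column] for part in self.partitions])


df = DFrame([{"x": [1, 2], "y": ["a", "b"]}, {"x": [3], "y": ["c"]}])
col = df["x"]      # still a DArray with two partitions
local = list(col)  # coercion triggers the collect
```

In this sketch the collect happens exactly once, at the point of coercion, which is the property the design asks for.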

Partitioning

Each of these objects has an underlying partitioning that will typically represent how the data are distributed over the nodes of a cluster.
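
As a rough illustration of what an underlying partitioning might look like, the Python sketch below splits a sequence into contiguous chunks of near-equal size, one per node. The function name `partition` and the contiguous-chunk scheme are assumptions chosen for the example; the text above does not prescribe any particular scheme.

```python
def partition(data, nparts):
    """Split a sequence into nparts contiguous chunks of near-equal size."""
    base, extra = divmod(len(data), nparts)
    parts, start = [], 0
    for i in range(nparts):
        # The first `extra` chunks take one additional element each.
        size = base + (1 if i < extra else 0)
        parts.append(data[start:start + size])
        start += size
    return parts


parts = partition(list(range(10)), 3)
sizes = [len(p) for p in parts]  # chunk sizes, one entry per hypothetical node
```

Concatenating the chunks in order reproduces the original sequence, so the partitioning only records where the data live, not what the data are.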

Grouping

There should be a formal notion of (lazy) grouping, especially for dframe. The primary verb to generate a grouping is split. It could yield a special type of dlist that represents groups of dframe rows.
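
One possible reading of lazy grouping is sketched below in Python. It is a toy model under stated assumptions, not a proposal for the actual interface: the `LazyGrouping` class, its `groups` method, and this `split` function are hypothetical names, and rows are represented as plain dicts.

```python
class LazyGrouping:
    """Records how rows should be grouped; builds the groups only on demand."""

    def __init__(self, rows, key):
        self.rows = rows
        self.key = key
        self._groups = None  # not computed until groups() is first called

    def groups(self):
        if self._groups is None:
            out = {}
            for row in self.rows:
                # Rows sharing the same key value land in the same group.
                out.setdefault(row[self.key], []).append(row)
            self._groups = out
        return self._groups


def split(rows, key):
    # Creating the grouping does no work beyond storing its definition.
    return LazyGrouping(rows, key)


rows = [
    {"species": "a", "n": 1},
    {"species": "b", "n": 2},
    {"species": "a", "n": 3},
]
g = split(rows, "species")
totals = {k: sum(r["n"] for r in v) for k, v in g.groups().items()}
```

The grouping object plays the role of the special dlist mentioned above: it behaves like a mapping from group key to rows, but the rows are only arranged into groups when something actually asks for them.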

Copied and pasted from here