Data Structures - RConsortium/ddR GitHub Wiki
There are three primary data structures that we want to support: darray, dlist, and dframe. These correspond to the three most important shapes of data: rectangular, hierarchical and tabular. We might consider a dvector alias that constructs 1D darrays. We should aim to support all of the canonical operations for each data type.
Laziness
These operations will very often be lazy. In particular, extracting a column from a dframe should yield a distributed object, typically a darray. The same goes for extracting an element from a dlist. Coercing a distributed object to its native equivalent, e.g. via as.vector, as.list or as.data.frame, should collect the data. We should not expect the user to call collect directly.
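To make the intended behavior concrete, here is a minimal base R sketch. It is not the ddR API: the class lazy_dlist, the constructor new_lazy_dlist and the helper make_thunk are hypothetical names, and each "partition" is just a function that stands in for deferred distributed work. The sketch assumes that extraction returns another lazy object and that coercion with as.list is the point where data are collected.

```r
# Hypothetical stand-in for a distributed list: each element is a function
# (a "thunk") that produces its data only when called.
new_lazy_dlist <- function(thunks) {
  structure(list(thunks = thunks), class = "lazy_dlist")
}

# Extraction stays lazy: the result is another lazy_dlist wrapping the
# selected thunks, and nothing is evaluated.
`[.lazy_dlist` <- function(x, i) {
  new_lazy_dlist(unclass(x)$thunks[i])
}

# Coercion to the native type is where the data are collected.
as.list.lazy_dlist <- function(x, ...) {
  lapply(unclass(x)$thunks, function(f) f())
}

# Counter used only to observe when thunks actually run.
evaluated <- 0L
make_thunk <- function(value) {
  force(value)
  function() {
    evaluated <<- evaluated + 1L
    value
  }
}

dl <- new_lazy_dlist(list(make_thunk(1:3), make_thunk(letters[1:2])))
second <- dl[2]            # still lazy: no thunk has run
native <- as.list(second)  # collects: runs only the selected thunk
```

The user never calls a collect function here; the ordinary coercion generic does the work, which is the design choice the paragraph above argues for.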
Partitioning
Each of these objects has an underlying partitioning that will typically represent how the data are distributed over the nodes of a cluster.
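As an illustration only, the following base R sketch assigns the rows of an ordinary data.frame to contiguous partitions. The helper partition_rows is a hypothetical name, and a real backend would place each chunk on a cluster node rather than keep them all in one local list.

```r
# Hypothetical helper: assign n rows to nparts contiguous partitions whose
# sizes differ by at most one row.
partition_rows <- function(n, nparts) {
  part_id <- sort(rep(seq_len(nparts), length.out = n))
  split(seq_len(n), part_id)
}

df <- data.frame(id = 1:10, value = (1:10)^2)
row_parts <- partition_rows(nrow(df), nparts = 3)

# Each chunk is the piece of the data that one node would hold.
chunks <- lapply(row_parts, function(rows) df[rows, , drop = FALSE])
sizes <- vapply(chunks, nrow, integer(1))
```

With 10 rows and 3 partitions, the chunks hold 4, 3 and 3 rows, and together they cover every row exactly once.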
Grouping
Somehow there should be a formal notion of (lazy) grouping, especially of a dframe. The primary verb to generate a grouping is split. It could yield a special type of dlist that represents groups of dframe rows.
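One way such a grouping could work is sketched below in base R on a plain data.frame. The names lazy_split and lazy_groups are hypothetical and not part of ddR. The grouping object records only which rows belong to each group, and per-group data frames are built when the object is coerced with as.list.

```r
# Hypothetical lazy grouping: record the row indices of each group, but do
# not build per-group data frames yet.
lazy_split <- function(df, by) {
  groups <- split(seq_len(nrow(df)), df[[by]])
  structure(list(data = df, groups = groups), class = "lazy_groups")
}

# Collecting the grouping materializes one data frame per group.
as.list.lazy_groups <- function(x, ...) {
  x <- unclass(x)
  lapply(x$groups, function(rows) x$data[rows, , drop = FALSE])
}

df <- data.frame(g = c("a", "b", "a", "c"), v = 1:4)
grouped <- lazy_split(df, "g")   # stores the data and the row indices only
collected <- as.list(grouped)    # list of data frames named "a", "b", "c"
```

Keeping row indices instead of copies means that further operations on the grouping can be planned before any data are moved.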
Copied and pasted from here