Drivers and Backends - RConsortium/ddR GitHub Wiki
The terms 'Backend' and 'Driver' are sometimes used casually to refer to the same thing, but it's useful to distinguish them.
Backends in are the systems providing the distributed computing capability. They are not objects in ddR. Currently supported backends are the parallel package and HP's DistributedR. Parallel is included as a recommended package in R, and it is the default.
Drivers are objects in ddR providing the connection from a local ddR session to the backend.
ddR's Model
ddR can begin creating and computing on distributed objects once useBackend
is called. If it's not explicitly called then it's called implicitly using the default parallel backend. For example:
dr <- useBackend("parallel", type = "PSOCK", executors = 3)
# useBackend is called for its side effects, so this is equivalent:
# useBackend("parallel", type = "PSOCK", executors = 3)
The code above does several things:
- Creates a local parallel cluster. This is the backend.
- Creates an instance of a driver, which is an object of class
c("parallel.ddR", "ddRDriver")
connecting to this particular backend. This driver instance is calleddr
. - Appends
dr
to the global list of active drivers. - Changes the current global driver to
dr
.
New calls to ddR functions like dmapply
will now happen using dr
, since it is the current global driver. Then every new distributed object created will contain a reference to dr
as its driver.
Writing New Drivers
This section describes how to write a new driver for a backend which is not currently supported.
To define a new driver, write an S4 class extending ddRDriver
. This class should implement methods do_dmapply
, do_collect
, and get_parts
. Define classes for DListClass
, DFrameClass
, and DArrayClass
which extend DObject
. The instances of this new driver class represent a connection to a running backend.
ddR learns about new drivers when the function registerDriver
is called. This stores the common name of the backend together with an initialization function called in useBackend
. The initialization function should return an instance of the driver class.
Drivers should pass all tests in the test-driver-common.R
. This is a necessary but not sufficient condition for a new driver to be functional.
Parallel and Fork
Thinking further on this, PSOCK
on a SNOW
cluster is quite different from a system call to fork
. The former can actually represent distributed data structures, while the latter cannot, since it's limited by the memory of the host system. However, fork
is much faster and more efficient because it doesn't copy or serialize data. This suggests that fork
could be very useful to represent and compute on "parallel" data structures. If one was going to write methods to exploit this then it would be best to make these improvements as far upstream as possible, even in base R.
The differences between fork
and a physical cluster where data movement on the network can happen should be addressed by totally different implementations. In contrast, basic implementations for SNOW
clusters and Apache Spark should be similar, since these systems both serialize data to physically remote workers that run the computation.
References
R-Core's vignette on the parallel package is a valuable reference for understanding parallel computation in R.
Copied and pasted from here