XLANG Meeting April 22-23 Thunderdome Breakout - darpa-xdata/xlang GitHub Wiki

Notes from the Thunderdome breakout

Multi-language support

Sometimes it will be easy to exchange data between different languages. In general, R, Python, Julia, and C will interoperate well. JVM-based languages generally pose more problems with data access and ownership.

If "A" calls "B", which then calls environment "A" again, does this cause a problem? There is only one instance of each environment unless explicitly managed by the user.
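A minimal sketch of the one-instance-per-environment idea, in Python. The `Environment` class and its registry are hypothetical illustrations, not part of any TD API: a re-entrant call chain (A calls B, which calls back into A) resolves to the same environment instance rather than spawning a second one.

```python
class Environment:
    """Hypothetical sketch: one shared instance per named environment."""
    _instances = {}

    def __new__(cls, name):
        # A re-entrant call (A -> B -> A) gets the existing instance back
        # instead of creating a second copy of environment A.
        if name not in cls._instances:
            inst = super().__new__(cls)
            inst.name = name
            cls._instances[name] = inst
        return cls._instances[name]

a1 = Environment("R")
a2 = Environment("R")   # re-entry: same object, not a new environment
b = Environment("Julia")
```

Explicit management by the user would amount to bypassing or keying this registry differently.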

Granularity - the current Julia implementation is a direct function call. Exactly what to expose is a big question. The environment endpoints carry the implicit assumption of being fairly heavyweight calls: a CUDA BLAS operation, an SVD, etc. counts as one operation. Fusion of operators is a powerful behavior in a DSL, so if TD is wrapping a DSL, the DSL would generate an entry point for each sufficiently large algorithm.

Run-time compilation could be invoked when a program string is passed from TD to the DSL environment.
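A minimal sketch of that hand-off, assuming Python stands in for the DSL environment. The program string, module name, and `entry_point` function are all hypothetical: the environment receives source text, compiles it at run time, and only then calls the resulting entry point.

```python
# Hypothetical sketch: TD hands the DSL environment a program as a string;
# the environment compiles it once at run time, then invokes it.
source = "def entry_point(x):\n    return x * x\n"

namespace = {}
code = compile(source, "<td-program>", "exec")  # run-time compilation step
exec(code, namespace)                            # define entry_point

result = namespace["entry_point"](7)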

Pipeline Parallelism Discussion

We discussed how endpoints exposed by environments to TD (Thunderdome) would be "heavyweight" calls, expected to take a while to run. It would be neat if environments linked together through TD could operate using pipeline parallelism. We discussed doing this with blocking calls and queues; we also discussed Futures.

If an environment is multithreaded, it could return a Future immediately and then continue processing on a background thread.
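A minimal sketch of the Future pattern above, using Python's standard `concurrent.futures`; the `heavy_call` function is a hypothetical stand-in for a heavyweight endpoint such as an SVD or a CUDA BLAS operation.

```python
from concurrent.futures import ThreadPoolExecutor

def heavy_call(x):
    # Stand-in for a heavyweight environment endpoint (SVD, BLAS op, ...).
    return x * 2

executor = ThreadPoolExecutor(max_workers=2)
future = executor.submit(heavy_call, 21)  # returns immediately with a Future
# ... the caller (or the next pipeline stage) can do other work here ...
value = future.result()                   # blocks only when the value is needed
executor.shutdown()
```

The queue-based alternative discussed above would replace `future.result()` with a blocking `queue.get()` between pipeline stages.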

TD calls are not considered to be fine-grained. We discussed how a DSL (domain-specific language) should integrate with TD. Cross-environment optimization is too hard for now. Environments contributing to a meta-level optimization would be really powerful in the long term, but seems too hard at first.

We discussed the value of providing a multi-step candidate application. If TD has this as an example, then others in XDATA might simply copy the example and swap out the environment calls with their own analyses.

R-specific discussions

R has a specific “NA” value. It is implemented as an encoding: for integers, NA is the smallest negative number (INT_MIN). R uses doubles only, no floats. R pays for the explicit NA encoding with an inefficient wrapper. We discussed the need to expose the datatype as an RINT32 so users know about the encoding. R object headers are always at the beginning of the object.
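A minimal sketch of what that encoding means for a consumer outside R, written in Python. The helper name `is_r_na` is hypothetical; the point is that a raw int32 buffer handed out of R carries NA as the INT_MIN sentinel, which the receiver must check for explicitly.

```python
# R encodes integer NA as the smallest 32-bit integer (INT_MIN).
INT32_MIN = -2**31  # the NA_integer_ sentinel

def is_r_na(value):
    """Hypothetical helper: detect R's integer NA in a raw int32 buffer."""
    return value == INT32_MIN

raw_int32_column = [1, INT32_MIN, 3]  # as read out of an R integer vector
decoded = [None if is_r_na(v) else v for v in raw_int32_column]
```

Exposing the column as an RINT32 (rather than a plain int32) is what would tell users this sentinel check is required.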

R data frames are lists of vectors. Factors look like character vectors, but they are categorical variables. Categorical variables are like enumerations. Once we have a list and a factor (enumerated value), we can compose them to make data frames.
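A minimal sketch of that composition in Python. The names `levels`, `codes`, and `factor_labels` are illustrative: a factor is an integer code vector plus a table of levels, and a data frame is then just a named collection of equal-length vectors.

```python
# Hypothetical sketch: an R factor is integer codes plus a level table.
levels = ["low", "high"]   # the enumeration (categorical levels)
codes = [1, 2, 2, 1]       # 1-based codes, as R stores them

def factor_labels(codes, levels):
    # Expand 1-based codes back into their categorical labels.
    return [levels[c - 1] for c in codes]

data_frame = {               # a data frame as a "list of vectors"
    "id": [1, 2, 3, 4],
    "grade": factor_labels(codes, levels),
}
```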

Rcpp library - generates a C++ representation of a data frame without making a copy. Lists, vectors, etc. are generated as STL-compatible representations of the objects. This means one-way conversion (out of R) is well implemented. Some input functions are also implemented in Rcpp, but they are not as complete.

No copy constructor - There is NO mechanism for creating a copy constructor in R for external data. A default copy constructor exists for native R objects, but this cannot be done for external objects. If we pass in an array from an external library, it will be wrapped by R, but it won’t copy the way native objects do. A function may behave slightly differently because these “external data objects” may not behave as truly native objects.

dlopen discussion

There might be environments that HAVE to run main; this will be allowed. dlopen works if an environment can be invoked as a shared library. A problem with dlopen is that it is not platform-independent; the equivalent mechanism on Windows is very different. Some environments will have to own main, keep standard input/output, and be invoked through system calls.
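A minimal sketch of the shared-library path, using Python's `ctypes` (which wraps dlopen on POSIX and LoadLibrary on Windows). Loading the C math library here is just an illustrative stand-in for an environment exposed as a shared library; `find_library` smooths over some of the platform differences in naming and search paths, but not all of them.

```python
import ctypes
import ctypes.util

# dlopen-style loading of a shared library; ctypes maps to dlopen on
# POSIX and LoadLibrary on Windows, which is exactly the portability
# seam discussed above. libm is a stand-in for an environment library.
libm_path = ctypes.util.find_library("m")
libm = ctypes.CDLL(libm_path)

# Declare the C signature before calling through the handle.
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

root = libm.sqrt(9.0)
```

Environments that must own main cannot be loaded this way and fall back to the system-call route described above.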

DSL Discussion

The script written in a DSL (domain-specific language) could be kept as a string, then compiled. The compiled output could be loaded as an environment.
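A minimal sketch of the second step, loading the compiled output as an environment, again with Python standing in for the DSL. The module name `dsl_environment` and function `svd_step` are hypothetical; the shape is: script kept as a string, compiled, then installed as a named module that callers treat as an environment.

```python
import types

# Hypothetical sketch: a DSL script kept as a string, compiled, and
# loaded as a named "environment" that callers can invoke into.
script = "def svd_step(n):\n    return n + 1\n"

env = types.ModuleType("dsl_environment")
exec(compile(script, "<dsl>", "exec"), env.__dict__)

# Callers now treat `env` like any other loaded environment.
stepped = env.svd_step(1)
```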

Streaming discussion

Cloud deployment and Distributed Data

Avro (for R and Python) and Kryo (for JVM-based languages) are currently implemented in Spark.

Thunderdome programs can be accessed through the streaming API. There will be a serialization step getting data into and out of the cloud architecture.
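A minimal sketch of that serialization step, using Python's `pickle` purely as a stand-in for the Avro/Kryo formats Spark actually uses: a record is serialized to bytes on the way into the cloud side and deserialized back out.

```python
import pickle

# Stand-in for Avro/Kryo: serialize going into the cloud architecture,
# deserialize coming back out across the streaming boundary.
record = {"ids": [1, 2, 3], "score": 0.5}

wire_bytes = pickle.dumps(record)      # going in
round_trip = pickle.loads(wire_bytes)  # coming out
```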

Distributed data, where the data is spread across a cluster, has to be implemented. The distributed data representations contain internal references to their local parts. Other environments might not understand how to handle these distributed references. We discussed how to traverse the distributed object and create a set of Thunderdome-shared objects.
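A minimal sketch of that traversal idea in Python. The `partitions` list is a hypothetical stand-in for the per-node references inside a distributed object; the loop materializes each partition into a local object that other environments could share, instead of handing them the distributed references directly.

```python
# Hypothetical sketch: per-node partition references inside a
# distributed object, materialized one at a time.
partitions = [[1, 2], [3, 4], [5]]

shared_objects = []
for part in partitions:
    # Copy each partition into a locally shared buffer that
    # environments without distributed-reference support can read.
    shared_objects.append(list(part))

flattened = [x for chunk in shared_objects for x in chunk]
```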