XLang April Meeting Datashape Discussion

Andy presented comments from the first tables review:

Why implement a new language for data management? Should we build on existing work instead of reimplementing? Why use a language instead of just a library? Isn't this just a database schema or a serialization library?

Four Proposals Currently

  1. Use datashape (Continuum is using this approach) - a sketch follows this list
  2. Use Thrift / Avro (no current champion within XDATA)
  3. Build a way for us all to talk to C
  4. Just use tagged unions on the heap in C
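
A rough sketch of proposal 1, assuming Continuum's `datashape` Python package; the field names are invented for illustration:

```python
# A sketch of proposal 1: describing an array-of-structs with the
# datashape package from Continuum (names here are illustrative).
from datashape import dshape

# "var" is a variable-length dimension; the braces describe a struct.
nodes = dshape("var * {id: int64, label: string, weight: float64}")

print(nodes)          # the full type: a variable-length array of structs
print(nodes.measure)  # the element (struct) type
```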

There is a concern that complicated, hierarchical objects cannot be represented in TD (Thunderdome).

Avro can handle plugins, but we would have to extend Avro heavily, and our extensions would not be generally useful to the Avro community. Discussion of engineering challenges: annotations (detailed information layered on top of the basic datatypes) can be carried through Avro. Avro does not process them itself, so we could use annotations to represent our richer types.
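
A rough sketch of the annotation idea: Avro schemas are JSON objects, and the Avro spec permits extra attributes to ride along as metadata that Avro itself ignores. The `xlang-shape` attribute below is invented for illustration.

```python
import json

# An Avro record schema with an extra, non-Avro attribute ("xlang-shape")
# carried as metadata. Avro tooling ignores attributes it does not know,
# so a consumer on our side could read the annotation back out.
schema = json.loads("""
{
  "type": "record",
  "name": "GraphNode",
  "fields": [
    {"name": "id",     "type": "long"},
    {"name": "weight", "type": "double"}
  ],
  "xlang-shape": "var * {id: int64, weight: float64}"
}
""")

print(schema["xlang-shape"])  # the annotation, untouched by Avro
```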

The Cap'n Proto designers feel that XDATA applications would be an "off label" use of protobuf. Cap'n Proto has a fixed message format, which happens to match the C x86 in-memory layout (for arrays and primitive types). We cannot define our own in-memory format; they are specifically trying to prevent divergent layouts, and they are focused on a serialization format.

DataScript - another data description language (http://datascript.sourceforge.net) that might inspire our work, though it doesn't seem to be a direct fit.

Discussion of the tradeoffs of passing more complex datatypes.

We certainly can interchange primitive types through TD, but how much farther (toward complex data interchange) do we want to go?

The current DSL / JVM implementation converts arrays of structures into a columnar store: a list of arrays, where each array holds one component of the structure (see the sketch below).
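
A minimal sketch of that conversion (field names are hypothetical):

```python
# Sketch: converting an array of structs into a columnar store,
# i.e., one list per struct field.
rows = [
    {"id": 1, "score": 0.5},
    {"id": 2, "score": 0.9},
    {"id": 3, "score": 0.1},
]

columns = {field: [row[field] for row in rows] for field in rows[0]}
# columns == {"id": [1, 2, 3], "score": [0.5, 0.9, 0.1]}
```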

Why not extend a serialization format? This is still an open question, but nobody in the current XDATA project can champion this approach. Avro would likely be a better choice than Thrift.

What cases of data interchange don't require copy and shuffle? We will need copies at the CPU/GPU boundary, but let's try to minimize copies in other parts of the interchange.

R uses column-major organization, so its orientation differs from most other languages (see the example below). We decided there is only a small set of datatypes that would reliably interchange between all the languages.
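
For example, NumPy can make the difference in orientation visible:

```python
import numpy as np

# The same 2x3 matrix traversed in row-major (C) and column-major
# (Fortran, i.e., R-style) order yields different element sequences.
a = np.arange(6).reshape(2, 3)

print(a.ravel(order="C"))  # [0 1 2 3 4 5]  row-major, the C/NumPy default
print(a.ravel(order="F"))  # [0 3 1 4 2 5]  column-major, R's layout
```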

List of Use Cases we discussed

We are assuming that avoiding copies is important. Let's evaluate whether this is needed.

  1. MIT/LL and Oculus used Avro - Oculus could issue a query; MIT/LL could listen for the query and deliver data back. The data was a graph shape (a subgraph of a bigger graph). Each graph node was a structure, so this was an array of structs. A much larger series of structures went back from MIT/LL to Oculus; lists of 30k-40k structures would be returned. The backend controlled the data and sent "chunks" of the data back. They had to serialize the data to get it back to Oculus.

  2. MIT/LL and Stanford DNN - the DNN (deep neural network) is a backend processing system driven from MIT's front-end user interface.

  3. Large graph exploration (use the XDATA web hyperlink dataset?) - do SVD (in R) and modularity (in Java), producing small representations of the network; integrate these into the visualization. Xgraph could be involved in graph reduction (linear algebra or centrality, with swap-out in SVD). Use Tangelo/Vega/Bokeh for rendering.

  4. Time series aggregation - browse through large amounts of time series data (Twitter, network browsing, Flickr records) and aggregate it in multiple ways to support browsing; for example, aggregate by country, date, etc. Possibly time series correlation analysis.

  5. ETL use case - a Java MapReduce process where the Map step is not implemented in Java. Let's say there is a Python map function that does what we want. We currently have to serialize data from HDFS to send it to the Python function, then serialize the data back for the reduce step (back in Hadoop/Spark/Shark). In this case, Hadoop is in control. We could use a "thunderdome mapper" instead to avoid the copies (see the first sketch after this list). Sequence files are a Hadoop format; RHIPE is a direct-mapping implementation in R, with an implicit interface that presents Java data, serialized through protobufs, to R.

  6. Use R's bigmemory package, which uses memory mapping for stock ticker data. Python iterators would work across the "big memory" matrix. This would pass a large R-based data structure over for access by Python (see the second sketch after this list).

  7. Typed caching system - a pipelined application could be constructed where each component can open an output cache, describe the data using datashape descriptions, and start writing to the cache. The next step in the workflow could read the cache and check the metadata. Would this be memcached without the transformation?

  8. igraph - some important igraph functionality is only available in R, e.g., time-series graph anomaly detection. This is an example where there is good functionality in R and we could make its algorithms available to non-R users. Can we transfer the data in memory? igraph is implemented in C with Python and R interfaces, but the anomaly detection is only implemented in R. This application needs sparse matrices or an adjacency list.

  9. Streaming Twitter case - the Twitter dataset is dirty. It could be cleaned up by one process and then passed to other analyses: applications that graph the data geospatially, natural language processing (NLTK, sentiment analysis), a graph representation. The input is text plus identifying records (which are structs); an array-of-structs plus a string comes out.

  10. Clustering of data - group documents together based on similarity. The groupings end up being arrays of arrays of numbers (a ragged array of arrays). Compare different clustering algorithms, or regular vs. hierarchical clustering. Multiple linear algebra operations; centrality.
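
For use case 5, a sketch of the status quo: a Hadoop Streaming style mapper in Python, where records are serialized to text on the way in (stdin) and out (stdout). A "thunderdome mapper" would hand the same records over in memory instead; the word-count logic is a stand-in.

```python
import sys

def mapper(line):
    """Hypothetical map logic: emit (word, 1) pairs."""
    for word in line.split():
        yield word, 1

# Hadoop Streaming contract: read records from stdin, write
# tab-separated key/value pairs to stdout.
for line in sys.stdin:
    for key, value in mapper(line.rstrip("\n")):
        print(f"{key}\t{value}")
```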
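
For use case 6, a sketch of the Python side, assuming R's bigmemory has left a file-backed matrix of doubles on disk. The file name and shape are assumptions; bigmemory stores matrices column-major, so we map with Fortran order to match R.

```python
import numpy as np

# Map the (assumed) bigmemory backing file without copying it into RAM.
ticks = np.memmap("tickers.bin", dtype=np.float64, mode="r",
                  shape=(1_000_000, 4), order="F")

def column_means(matrix, chunk=100_000):
    """Walk the mapped matrix in chunks; only touched pages are read."""
    totals = np.zeros(matrix.shape[1])
    for start in range(0, matrix.shape[0], chunk):
        totals += matrix[start:start + chunk].sum(axis=0)
    return totals / matrix.shape[0]
```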

General Topics

gridgain.com - a newly open-sourced package that is basically an in-memory HDFS, so Hadoop processing can be done much more quickly in memory.

Missing Data Discussion - missing data has different semantics than exceptions. We often want to handle them differently, but frequently a single representation (like NA in R) is used for both.
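
A small illustration of the distinction in Python/NumPy: NaN is often forced to stand in for both "missing" and "a computation failed", while a masked array keeps missingness separate.

```python
import numpy as np

# NaN conflates two meanings: a value that was never collected...
values = np.array([1.0, np.nan, 3.0])
# ...and a computation that failed (log of a negative number).
with np.errstate(invalid="ignore"):
    result = np.log(np.array([2.0, -1.0, 8.0]))  # [0.69..., nan, 2.07...]

# A masked array represents "missing" explicitly instead:
observed = np.ma.masked_array([1.0, 0.0, 3.0], mask=[False, True, False])
print(observed.mean())  # 2.0 -- the masked entry is skipped, not treated as 0
```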

Minimal Data Interchange Required

What is the minimal set of types that we have to exchange? Thunderdome currently has a restricted set of datatypes; datashape can represent a richer set.

  • struct, nested struct, primitive datatypes, arrays, dates, missing data
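
One way the minimal set might be spelled in datashape (a sketch; field names are invented, and `?type` denotes an optional/missing value):

```python
from datashape import dshape

# struct + nested struct + primitives + fixed-size array + date + missing data
minimal = dshape("""var * {
    id:      int64,
    name:    string,
    when:    date,
    score:   ?float64,
    samples: 10 * int32,
    origin:  {lat: float64, lon: float64}
}""")
```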