Oil and the R Language

I've mentioned to several people that Oil will have a table data structure influenced by the R language for statistical computing.

Why add tables to Oil? The slogan I'm using is that the outputs of "ls" and "ps" are both tables (blog post). It would be nice to have a uniform way of selecting columns, filtering rows with boolean expressions, sorting by a column, making histograms, etc.
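A rough sketch of what those uniform operations might look like, written in Python because the Oil syntax isn't settled. The rows are made-up values shaped like "ps" output, and the helper names (select, where, sort_by, histogram) are hypothetical, not a proposed API.

```python
from collections import Counter

# Hypothetical rows, shaped like the output of "ps"; the values are made up.
procs = [
    {"pid": 1, "user": "root", "rss_kb": 9000},
    {"pid": 120, "user": "andy", "rss_kb": 52000},
    {"pid": 121, "user": "andy", "rss_kb": 3000},
    {"pid": 300, "user": "root", "rss_kb": 15000},
]

def select(rows, *cols):
    """Keep only the named columns."""
    return [{c: row[c] for c in cols} for row in rows]

def where(rows, pred):
    """Keep the rows for which the boolean predicate holds."""
    return [row for row in rows if pred(row)]

def sort_by(rows, col, reverse=False):
    """Sort rows by the value in one column."""
    return sorted(rows, key=lambda row: row[col], reverse=reverse)

def histogram(rows, col):
    """Count how often each value appears in a column."""
    return Counter(row[col] for row in rows)

# Processes using more than 5000 KB, largest first, showing two columns.
big = sort_by(where(procs, lambda r: r["rss_kb"] > 5000), "rss_kb", reverse=True)
print(select(big, "pid", "user"))
print(histogram(procs, "user"))
```

The same four helpers would work unchanged on rows parsed from "ls", which is the point of having one table type.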

Why will they be influenced by R? Because R is a language built around tables (data frames in R terminology). That is, it uses the data model of heterogeneous typed columns (contrast with Matlab, which is based on homogeneous vectors/matrices, and Mathematica, which is based on Lisp-like symbols).

Although data frames have a scientific metaphor (rows are observations and columns are variables), they're still applicable to the shell. Oil won't have advanced statistical functions, but it will be very good at manipulating data.

Another way to think about it is that the Unix cut, join, paste, etc. tools are annoying hacks that we'll subsume with a proper table data structure.
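As one illustration, here is a minimal sketch of what join does once rows carry column names instead of field positions. It's in Python, the tables are invented, and the function is only an assumption about the shape such an operation could take.

```python
def join(left, right, key):
    """Inner join on a shared key column, matched by name rather than position."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    # Merge each left row with every right row that has the same key value.
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

# Hypothetical tables; the values are made up.
procs = [
    {"pid": 1, "uid": 0},
    {"pid": 120, "uid": 1000},
    {"pid": 7, "uid": 42},
]
users = [
    {"uid": 0, "user": "root"},
    {"uid": 1000, "user": "andy"},
]

joined = join(procs, users, "uid")
print(joined)
```

Unlike the Unix join tool, nothing here requires sorted input or counting delimiter-separated fields.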

Examples

Data Frames Cheat Sheet

It's not clear exactly what syntax I'll use. R has at least 3 different syntaxes:

  1. base R data frames: df[df$age > 30,]
  2. data.table package: dt[age > 30,] (I like this better)
     • dt[row_expr, col_expr, by=...]
  3. dplyr package: explicit select(), filter() composed using the %>% pipeline operator.
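To make the three-part shape of dt[row_expr, col_expr, by=...] concrete, here is a small Python sketch. The query function, its parameter names, and the sample rows are all hypothetical; it only mimics the structure of the data.table form, not its actual semantics or performance.

```python
def query(rows, where=None, select=None, by=None):
    """Mimic the shape of dt[row_expr, col_expr, by=...].

    where:  predicate on a single row (the row_expr slot)
    select: function from a list of rows to a result dict (the col_expr slot)
    by:     name of the column to group on
    """
    if where is not None:
        rows = [r for r in rows if where(r)]
    if by is None:
        return rows if select is None else select(rows)
    if select is None:
        select = lambda group: {"n": len(group)}  # assumed default: count rows
    groups = {}
    for r in rows:
        groups.setdefault(r[by], []).append(r)
    return [{by: key, **select(group)} for key, group in groups.items()]

# Hypothetical data; the values are made up.
people = [
    {"name": "alice", "age": 34, "dept": "eng"},
    {"name": "bob", "age": 28, "dept": "eng"},
    {"name": "carol", "age": 45, "dept": "ops"},
    {"name": "dave", "age": 39, "dept": "eng"},
]

def mean_age(group):
    return {"mean_age": sum(r["age"] for r in group) / len(group)}

over_30 = query(people, where=lambda r: r["age"] > 30)
by_dept = query(people, where=lambda r: r["age"] > 30, select=mean_age, by="dept")
print(by_dept)
```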

SQL?

Data frames are just like SQL tables, but they're in memory, and the query language is more like a mathematical expression than English (SELECT * FROM ... WHERE ... GROUP BY).

Another way I think of it is that R is the only language without the ORM problem. In many languages, it's common to map SQL tables, which use the relational model, to native objects in the language.

In R there is no impedance mismatch. There's even a popular R package called sqldf that lets you use SQL syntax on data frames.

SQL syntax doesn't compose well. Nested queries like this are supported, but ugly:

SELECT name, age FROM (SELECT name, age FROM (SELECT name, age FROM inner_table) WHERE ...) WHERE ...

It would be nicer to write this in a left-to-right fashion, like a pipeline.
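A minimal sketch of the left-to-right style, in Python. The Stage class, the where and select helpers, and the sample rows are hypothetical; the trick is only that Python lets a class define what | means, so each stage receives the rows produced by the one before it.

```python
class Stage:
    """Wraps a rows -> rows function so that stages chain with |."""

    def __init__(self, fn):
        self.fn = fn

    def __ror__(self, rows):
        # Called for "rows | stage" because a plain list does not define |.
        return self.fn(rows)

def where(pred):
    return Stage(lambda rows: [r for r in rows if pred(r)])

def select(*cols):
    return Stage(lambda rows: [{c: r[c] for c in cols} for r in rows])

# Hypothetical table; the values are made up.
inner_table = [
    {"name": "alice", "age": 34, "city": "Oslo"},
    {"name": "bob", "age": 28, "city": "Lima"},
    {"name": "carol", "age": 45, "city": "Pune"},
]

# Two filters and a projection, read in the order they are applied.
result = (
    inner_table
    | where(lambda r: r["age"] > 30)
    | where(lambda r: r["name"] != "alice")
    | select("name", "age")
)
print(result)
```

Adding a third filter means appending one more stage, rather than wrapping the whole query in another SELECT.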

Interactivity

Another interesting thing about R is that it's primarily an interactive language, much like shell. This is in contrast to Python/Perl/Ruby/JavaScript.

R has good completion of function names, function arguments, and path names in its default REPL, unlike Python.

Related Links

  • tidyverse by Hadley Wickham (sometimes Hadleyverse)

    • Tidy Data -- A good paper explaining the philosophy. It's basically an algebra of tables, similar to the relational algebra, but geared toward data analysis and numerical data.
    • ggplot2 -- most powerful and concise plotting library by a mile, using the Grammar of Graphics, although it's quite abstract.
    • ggvis is the newer interactive version.
    • plyr, dplyr. dplyr reads left-to-right like shell pipelines. It has the pipe operator %>%. Oil should just use |.
  • Pandas imports the R model of data frames into Python.

Serialization

Design idea: have a convention of foo_data.csv and foo_metadata.csv / foo_schema.csv. This gets around the R problem where it has to guess about the data type of a CSV column. It often does a bad job (e.g. stringsAsFactors=FALSE should be the default).
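Here is one way the convention could work, sketched in Python. The schema layout (a column,type header) and the type names string, integer, and float are assumptions of mine, since the format isn't specified yet; the in-memory strings stand in for foo_data.csv and foo_schema.csv.

```python
import csv
import io

# Stand-ins for foo_data.csv and foo_schema.csv; the contents are made up.
DATA_CSV = "name,age,score\nalice,34,9.5\nbob,28,7.25\n"
SCHEMA_CSV = "column,type\nname,string\nage,integer\nscore,float\n"

# Assumed type names; a real schema format would need to define these.
CONVERTERS = {"string": str, "integer": int, "float": float}

def load_schema(schema_file):
    """Map each column name to the converter named in the schema."""
    return {row["column"]: CONVERTERS[row["type"]] for row in csv.DictReader(schema_file)}

def load_table(data_file, schema_file):
    """Read rows and convert every cell using the declared type, with no guessing."""
    schema = load_schema(schema_file)
    return [
        {col: schema[col](val) for col, val in row.items()}
        for row in csv.DictReader(data_file)
    ]

rows = load_table(io.StringIO(DATA_CSV), io.StringIO(SCHEMA_CSV))
print(rows)
```

Because the types are declared, a column of digits that is really an identifier can be marked string and will never be silently turned into a number.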

Are there any other serialization formats? TODO: Link to new ones.

Comparison to Shell

R is like shell in that it's a somewhat crappy language, but a lot of work gets done with it. They're both interactive languages for non-programmers.

The syntax of R is pretty good, but its semantics can be surprising and cause bugs. The lack of distinction between scalars and vectors is a big problem.

The implementation is also slow, memory-hungry, and otherwise naive (lots of globals).

Language Design

Evaluating the design of the R language -- fantastic paper

  • Scheme / JavaScript-like, with lazy evaluation, so it has the potential for metaprogramming everywhere. Instead of evaluating an expression, you can compute on it.
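"Computing on an expression" is what lets dt[age > 30,] refer to a column called age without quoting it. Python doesn't evaluate arguments lazily, so this sketch is only an analogy: the expression is passed as a string and parsed with the standard ast module. The function names and the sample row are hypothetical.

```python
import ast

def column_names(expr_src):
    """Inspect an expression without evaluating it: which names does it use?"""
    tree = ast.parse(expr_src, mode="eval")
    return sorted({node.id for node in ast.walk(tree) if isinstance(node, ast.Name)})

def eval_in_row(expr_src, row):
    """Evaluate the expression with column names bound to one row's values."""
    code = compile(ast.parse(expr_src, mode="eval"), "<expr>", "eval")
    # No builtins are exposed; names resolve only against the row.
    return eval(code, {"__builtins__": {}}, dict(row))

expr = "age > 30 and dept == 'eng'"
row = {"name": "alice", "age": 34, "dept": "eng"}  # made-up values

print(column_names(expr))
print(eval_in_row(expr, row))
```

In R the caller writes the bare expression and the callee captures it, so no string quoting is needed; that is the part this analogy can't reproduce.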

Appendix: Silly ways to select fields

  • sort -k 1,1

  • cut -f 1 -d ','

  • awk '{print $1}'

  • ps, ls -- have lots of flags for controlling the fields

  • find, stat, time, etc. have printf-style format strings. This is similar to selecting columns and formatting them. xargs is missing --format.