Coprocess Protocol Proposal - oilshell/oil GitHub Wiki

April 2021: This is OLD (2018). See Capers

Coprocess Protocol Proposal (FCLI)

Abstract

This document sketches a protocol to allow coprocesses to substitute for normal "batch" processes in shell scripts. A coprocess can be thought of as a single-threaded server that reads from and writes to pipes.

The goal is to make shell scripts faster. It can also make interactive completion faster, since completion scripts often invoke (multiple) external tools.

Motivation / Analogy

Many language runtimes start up slowly, e.g. when they include a JIT compiler or when many libraries are loaded: Python, Ruby, R, Julia, the JVM (including Clojure), etc.

This problem seems to be getting worse. Python 3 is faster than Python 2 in nearly all dimensions except startup time.

Let's call the protocol FCLI for now. There's a rough analogy to FastCGI and CGI: CGI starts one process per request, while FastCGI handles multiple requests in a process. (I think FastCGI is threaded unlike FCLI, but let's ignore that for now.)

Example / Sketch

Suppose we have a Python command line tool that copies files to a cloud file system. It works like this:

cloudcopy foo.jpg //remote/myhome/mydir/

(This could also be an R tool that does a linear regression, but let's use the cloudcopy example to be concrete. The idea is that a lot of the work is "startup time" like initializing libraries, not "actual work".)

It could be converted to an FCLI coprocess by wrapping main() in a while True loop.

A shell would invoke such a process with these environment variables:

  • FCLI_VERSION -- the process should try to become a coprocess. Some scripts may ignore this! That is OK; the shell/client should handle it.
  • FCLI_REQUEST_FIFO -- read requests from this file system path (a named pipe)
  • FCLI_RESPONSE_FIFO -- write responses to this file system path (a named pipe)
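
As a minimal sketch of how a tool might respond to these variables (assuming Python 3; batch_main and coprocess_main are hypothetical names), it can check FCLI_VERSION and fall back to its normal batch behavior when the variable is absent:

import os
import sys

def batch_main(argv):
  # ... the tool's existing logic, unchanged ...
  return 0

def coprocess_main():
  # Read requests from FCLI_REQUEST_FIFO and write responses to
  # FCLI_RESPONSE_FIFO; sketched in more detail below.
  raise NotImplementedError

if __name__ == '__main__':
  if os.environ.get('FCLI_VERSION'):
    coprocess_main()                 # try to become a coprocess
  else:
    sys.exit(batch_main(sys.argv))   # plain batch invocation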

For worker #9, the shell might set variables like this:

FCLI_REQUEST_FIFO=/tmp/cloudcopy-pool/request-fifo-9 \
FCLI_RESPONSE_FIFO=/tmp/cloudcopy-pool/response-fifo-9 \
  cloudcopy  # no args; they'll be sent as "argv" requests

The requests and responses will look like this. Note that the actual encoding will likely not be JSON, but I'm writing JSON syntax for convenience.

# written by the shell to request-fifo-9
{ argv: ["cloudcopy", "bar.jpg", "//remote/myhome/mydir"]
  env: {"PYTHONPATH": "."}   # optional ENV to override actual env.  May be ignored by some processes.
}

-> 

# written by the cloudcopy process to response-fifo-9
{ "status": 0 }  # 0 on success, 1 on failure

stderr is for logging. stdin / stdout are used as usual. We probably need to instruct the server to flush its streams in order to properly delimit the output of adjacent requests (?). We won't get an EOF, because the pipes stay open across multiple requests.
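
To make the "wrap main() in a loop" idea concrete, here's a rough server-side sketch in Python 3. It assumes one JSON object per line on each FIFO, which is just a stand-in for whatever encoding the protocol ends up using; main() is the tool's existing entry point.

import json
import os
import sys

def main(argv):
  # ... the tool's existing batch logic ...
  return 0

def coprocess_main():
  # Open the FIFOs named by the shell.  Opening the request FIFO for
  # reading blocks until the client opens its end for writing.
  req = open(os.environ['FCLI_REQUEST_FIFO'], 'r')
  resp = open(os.environ['FCLI_RESPONSE_FIFO'], 'w')

  for line in req:                   # one request per line (an assumption)
    msg = json.loads(line)
    if 'argv' in msg:
      try:
        status = main(msg['argv'])
      except SystemExit as e:
        status = e.code if isinstance(e.code, int) else 1
      # Flush stdout and stderr so the client can delimit this command's
      # output from the next one.
      sys.stdout.flush()
      sys.stderr.flush()
      resp.write(json.dumps({'status': status}) + '\n')
      resp.flush()
  # Falling out of the loop means the client closed the request FIFO.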

If you wanted to copy 1,000 files, you could start a pool of 20 or so coprocesses and drive them from an event loop. You would only pay the startup time 20 times instead of 1000 times.

In some cases, it would be possible to add a --num-threads option to your cloudcopy tool. But there are many cases where something like FCLI would be easier to implement. Wrapping main() is a fairly basic change.
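
For illustration, here's a rough client-side sketch of such a pool (Python 3, same one-JSON-object-per-line assumption; the FIFO paths and cloudcopy arguments follow the example above, while the file names and the version string are made up). It dispatches in batches rather than using a real event loop, which is enough to show the startup savings:

import json
import os
import subprocess
import tempfile

NUM_WORKERS = 20
pool_dir = tempfile.mkdtemp(prefix='cloudcopy-pool-')

# Create a FIFO pair per worker and spawn the workers.  Popen doesn't block,
# so the slow interpreter startups overlap.
procs = []
paths = []
for i in range(NUM_WORKERS):
  req_path = os.path.join(pool_dir, 'request-fifo-%d' % i)
  resp_path = os.path.join(pool_dir, 'response-fifo-%d' % i)
  os.mkfifo(req_path)
  os.mkfifo(resp_path)
  env = dict(os.environ, FCLI_VERSION='0.1',
             FCLI_REQUEST_FIFO=req_path, FCLI_RESPONSE_FIFO=resp_path)
  procs.append(subprocess.Popen(['cloudcopy'], env=env))
  paths.append((req_path, resp_path))

# Open the FIFOs in the same order the workers do (request first, then
# response) to avoid deadlock.
workers = [(open(req_p, 'w'), open(resp_p, 'r')) for req_p, resp_p in paths]

files = ['file%04d.jpg' % i for i in range(1000)]
for start in range(0, len(files), NUM_WORKERS):
  batch = files[start:start + NUM_WORKERS]
  for (req, _), name in zip(workers, batch):       # send a batch of requests
    req.write(json.dumps(
        {'argv': ['cloudcopy', name, '//remote/myhome/mydir/']}) + '\n')
    req.flush()
  for (_, resp), name in zip(workers, batch):      # then collect responses
    reply = json.loads(resp.readline())
    if reply['status'] != 0:
      print('copying %s failed with status %s' % (name, reply['status']))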

Errors

The process may also just exit with a non-zero status, e.g. exit 123, and the client will treat that as {"status": 123}. A new coprocess will be started for the next request.
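
On the client side, that case could be handled roughly like this (continuing the sketches above): EOF on the response FIFO means the coprocess exited instead of answering, so the client reaps it and uses its exit code as the status.

import json

def read_status(resp, proc):
  # resp: the response FIFO, open for reading; proc: the subprocess.Popen.
  line = resp.readline()
  if line:
    return json.loads(line)['status']
  # EOF: the coprocess died.  Its exit code becomes the status, and the
  # caller should start a fresh coprocess before the next request.
  return proc.wait()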

List of Request Types

  • argv -- run a new command and print a response to the fifo. Use stdin/stdout/stderr as normal.
  • flush -- flush stdout and stderr. I think this will make it easier to delimit responses from adjacent commands.
  • echo -- for testing protocol conformance?
  • version -- maybe?
  • cd -- instruct the process to change directories? This should be straightforward in most (all?) languages.
  • env -- should this be a separate request, and not part of the argv request? Not sure.
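
Continuing the server-side sketch, dispatch over these request types might look roughly like this (Python 3, one JSON object per line as before; the exact shape of the echo, version, and cd replies is a guess, not settled protocol):

import json
import os
import sys

def main(argv):
  return 0   # the tool's batch logic, as in the earlier sketch

def handle_request(msg, resp):
  if 'argv' in msg:
    reply = {'status': main(msg['argv'])}
  elif 'flush' in msg:
    sys.stdout.flush()                           # delimit output between commands
    sys.stderr.flush()
    reply = {'status': 0}
  elif 'echo' in msg:
    reply = {'status': 0, 'echo': msg['echo']}   # protocol conformance test
  elif 'version' in msg:
    reply = {'status': 0, 'version': '0.1'}
  elif 'cd' in msg:
    os.chdir(msg['cd'])                          # change the coprocess's cwd
    reply = {'status': 0}
  else:
    reply = {'status': 2, 'error': 'unknown request type'}
  resp.write(json.dumps(reply) + '\n')
  resp.flush()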

Note: Shells are servers too!

Shells are usually thought of as clients that drive coprocess "tools" in parallel. But they can also be servers, i.e. they can handle multiple invocations of sh -c within a single process.

Shells are often invoked recursively (including by redo).

Shell Implementation Strategy: Proxy Processes

Internally, a shell can use a mechanism similar to the one it already uses for subshells like ( myfunc ) and pipelines like myfunc | tee foo.txt. That is, myfunc has to be run in a subprocess.

So we can have a proxy process that is passed the file descriptors for a coprocess. And then the shell can interact with the proxy process normally. It can wait() on it, and it can redirect its output.

Waiting simultaneously for a process exit and an event from a pipe is somewhat annoying in Unix; it requires DJB's "self-pipe trick", which turns the exit event into an I/O event.

In a sense, the proxy strategy is the opposite: we're turning an I/O event (the coprocess writing {"status": 0}) into a process exit event!

The key is that fork() is very fast, but starting Python interpreters and JVMs is slow. So this will still be a big win.
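
Here's a rough sketch of the proxy idea in Python 3, ignoring buffering subtleties (a shell would do the equivalent internally; the FIFO paths reuse the worker #9 example). The shell keeps the FIFOs open across requests, while each proxy child sends one argv request, reads one response, and exits with that status, so the shell can wait() on it like any other child:

import json
import os

def run_via_proxy(argv, req, resp):
  # req and resp are the coprocess's FIFOs, already open in the shell.
  pid = os.fork()
  if pid != 0:
    return pid               # parent (the shell): this looks like any child

  # Child (the proxy): one request, one response, then exit.  The parent
  # keeps its copies of the FIFOs open, so the coprocess sees no EOF.
  req.write(json.dumps({'argv': argv}) + '\n')
  req.flush()
  reply = json.loads(resp.readline())
  os._exit(reply['status'])  # turn the I/O event into a process exit

# Usage: open the FIFOs once, then wait() on proxies like batch processes.
req = open('/tmp/cloudcopy-pool/request-fifo-9', 'w')
resp = open('/tmp/cloudcopy-pool/response-fifo-9', 'r')
pid = run_via_proxy(['cloudcopy', 'bar.jpg', '//remote/myhome/mydir/'], req, resp)
_, wait_status = os.waitpid(pid, 0)
print('exit status:', os.WEXITSTATUS(wait_status))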

Why Coprocesses and not Multi-threaded Servers?

Because it will be easier for existing command line tools to implement this protocol. Many tools are written with global variables, or in languages where threading isn't practical anyway (Python, R, etc.).

Use Cases

  • I could have used this for RAPPOR and several other "data science" projects in R.
  • The redo build system starts many short-lived processes.
    • it starts many shell processes to interpret rules, and many "tool" processes.
  • Shellac Protocol Proposal -- this protocol for shell-independent command completion can build on top of the coprocess protocol. It has more of a structured request/response flavor than some command line tools, but that's fine. FCLI works for both use cases.

Relation to Bash Coprocesses

Bash coprocesses communicate structured data over two file descriptors / pipes:

http://wiki.bash-hackers.org/syntax/keywords/coproc

They are not drop-in replacements for command line tools.

FCLI uses at least 4 one-way pipes, in order to separate control from data: the request and response FIFOs carry control messages (argv, status), while stdin and stdout carry data as usual.

Can bash be a client?

It would be nice for adoption to distribute a script like fcli-lib.sh or fcli-lib.bash that could call coprocesses in a transparent fashion.

However, bash can't even determine the length of a byte string (it counts unicode characters, and does so unreliably), which limits the kinds of protocols you can construct with it, e.g. length-prefixed framing.

So bash will not be a client, but it can easily invoke a client, e.g. fcli-driver.

Oil can be a "first-class" client. That is, coprocesses can be substituted for batch processes without a syntax change.

foo() { foo-batch "$@"; }
seq 3 | foo x y z >out.txt 2>err.txt  # runs batch job

foo() { foo-coprocess "$@"; }
seq 3 | foo x y z >out.txt 2>err.txt  # runs coprocess

stdin problems

Don't many tools read until EOF? Consider a simple Python filter:

import sys

for line in sys.stdin:
  print(line.upper(), end='')  # 'line' already ends with a newline

It is somewhat hard to turn this into a coprocess, because the iterator wants an EOF event. Won't it block forever waiting for the next line? I guess that is why we need the FIFOs.

stderr

TODO: Should the shell capture stderr? Or just use it as the normal logging/error stream? Usage errors could be printed there.

cwd

Do processes have to change directories? It shouldn't be super hard for them to implement a cd command. (The shell can optimize that away in some cases.)

Windows?

Process startup time is slow on Windows. It does have named pipes, but they're not on the file system; they live in their own namespace (\\.\pipe\).

Advanced Ideas

  • If you start a coprocess pool, some requests might have affinity for certain replicas, i.e. to try to reuse a certain network connection. The shell could allow the user to specify this logic in a small shell function.

Andy's Notes

I wrote something like this a few years ago, but it assumed too much about the process. It assumed that you controlled all I/O in the process.

Places where you might not:

  • On errors, the Python interpreter prints a stack trace to stderr
  • R will randomly print warnings and other info to stderr !!!
  • Some libraries print to stderr on errors.

It seems like this is mostly a problem for stderr.

Update

Coprocess Protocol V2