Distributed Shell - oilshell/oil GitHub Wiki

Old: Project Goals

Notes on gg (Ad Hoc Multi-Cloud Distribution with Lambdas)

  • Great intro blog post, concentrating mostly on the C++ build use case, which indeed has some unique elements: https://buttondown.email/nelhage/archive/papers-i-love-gg/
    • reaction: distcc pump is another solution to the preprocessor problem, although neither model substitution nor distcc pump is fully general
  • Great Usenix ATC '19 Video: https://www.youtube.com/watch?v=Cc_MVldSijA&ab_channel=USENIX
    • I really like the framing: low latency (which is why I use shell in the first place), and warm vs. cold clusters
    • The IR is extremely similar to Blaze/Forge, and it's described with a tiny set of protobufs! (a rough sketch of the thunk structure follows this list)
  • HN comments from July 2019: https://news.ycombinator.com/item?id=20433315
    • Lambda still has some limitations for huge packages. Good experience report here (although it sounds like the commenter could benefit from "proper" declared dependencies)
    • What about state in lambdas?
  • Source Code: https://github.com/StanfordSNR/gg
  • My initial reaction: https://lobste.rs/s/virbxa/papers_i_love_gg#c_nbmnod
  • Concepts
    • Model Substitution
    • Tail Recursion
    • Dynamic dependencies, not static (how does this relate to Shake?)
    • Lambdas can talk to each other (via NAT traversal?), which solves a well-known performance issue.
  • Citations: UCop, Ciel
  • My sense on limitations
    • It's not a fully general shell parallelizer, because it's mainly about small data and big compute. Some problems are big data and small compute, like analytics (joins, etc.), although POSH fills that gap to some extent!
    • For C++ compilation, these models are pretty complex: https://github.com/StanfordSNR/gg/tree/master/src/models
      • but maybe the models are already written, so in practice it's not a problem?
      • No I think this is the core difficulty with gg. Suppose you pass -fsanitize=address to the C++ compiler. Then the "model" has to know that this also involves linking the ASAN runtime. That is a lot of duplication.
      • Also, is tracing Python command-line tools like MyPy a problem? Then you have to write a Python "model"
      • This is very related to Guo's CDE -- using tracing to package up a minimal reproducible binary
    • A single machine with 48 cores is very competitive with distributed C++ compilation, and beats it in many cases, which shows how hard that problem is.
    • It can't handle, say, a shell script that downloads unknown files in the middle, e.g. doing an apt install. I think the full set of files has to be known up front, and then gg can select the subset that needs to be transmitted to a particular node.
  • Their Notes on Limitations / Future Work
    • Worker communication (didn't understand the NAT traversal bit)
    • They want to schedule thunks onto GPUs
    • A gg DSL! They have C++ and Python SDKs. They say they want "parallel map", "fold", etc. What does this look like? (a hypothetical map/fold sketch also follows this list)
  • Questions
    • Where does the scheduler run? (on a lambda, or does the client need to stay connected the whole time?)
    • How does the worker-to-worker communication work?
    • What would the DSL look like?
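
A rough sketch of what the thunk IR amounts to, as Python dataclasses. The field names are chosen for illustration and are not gg's actual protobuf schema; the point is that a thunk is a function (a content-addressed executable plus arguments and environment) applied to content-addressed dependencies, producing named outputs.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Function:
    """The executable to run, itself referenced by content hash."""
    exe_hash: str                                   # hash of the binary (e.g. a compiler)
    args: List[str] = field(default_factory=list)
    env: List[str] = field(default_factory=list)

@dataclass
class Thunk:
    """One deterministic step: function + content-addressed deps -> named outputs."""
    function: Function
    values: Dict[str, str] = field(default_factory=dict)   # file name -> object hash
    thunks: Dict[str, str] = field(default_factory=dict)   # file name -> hash of another thunk
    outputs: List[str] = field(default_factory=list)        # names of files this step produces
```

Because a dependency can itself be the hash of another thunk, forcing one thunk can demand others first; that's where the dynamic dependencies and the "tail recursion" trick come from.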
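
On the "parallel map" / "fold" question: here is a hypothetical shape for such a DSL, evaluated locally with a thread pool standing in for lambdas. None of these names (`parallel_map`, `fold`) come from the real gg SDKs; this is only a sketch of the programming model.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import Callable, List, TypeVar

T = TypeVar('T')
R = TypeVar('R')

def parallel_map(f: Callable[[T], R], items: List[T]) -> List[R]:
    """Fan out: each call here could become one thunk / one lambda invocation."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(f, items))

def fold(f: Callable[[R, R], R], items: List[R]) -> R:
    """Fan in: combine results; a real DSL might build a tree of reduction thunks."""
    return reduce(f, items)

# Toy stand-ins for "compile each file, then link the results".
objects = parallel_map(lambda src: src.upper(), ['a.c', 'b.c', 'c.c'])
binary = fold(lambda x, y: x + '+' + y, objects)
print(binary)   # A.C+B.C+C.C
```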

Project Ideas

  • Can PaSh, POSH, and gg be unified under a single IR? What does language support look like?
  • First, try gg to see how well it works
    • run their examples
    • create a new example (e.g. out of our own repo)
  • Second: an Oil front end, rather than model substitution or their existing SDKs ("scripting", Python, C++). Does that make sense? That's in their future work -- a gg DSL.
    • Is there some sort of command-line wrapper style that specifies inputs / outputs unambiguously and can be used to wrap every command? Then you wouldn't need model substitution (a hypothetical wrapper sketch follows this list).
    • Can we use gg without syscall tracing? It's not really mainstream; it may cause engineering problems or limit generality.
  • Run "Toil" on gg. For portable continuous builds
    • And try building container images? A "meta build" system.
  • Does it make sense to augment gg with streams? For shell pipelines?
    • dgsh uses Unix domain sockets to implement pipelines (a minimal illustration of that mechanism also follows this list)
  • Big project: write a faster executor that addresses the object distribution problem with differential compression / affinity (e.g. OSTree/casync). Lambda has some limitations on what containers you can run (500 MiB).
  • Could Oil be a local executor for the gg runtime? What does the file system look like?
    • you need a component to set up the file system -- I guess a user-space chroot / bind-mount tool?
  • How can Oil express explicit dependencies but also have fine-grained tasks that are fast to start up?
    • I think we prefer explicit dependencies over model substitution for most problems. Although I feel like the system should be "factored" to support both.
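
On the wrapper idea above: one possible shape is a helper that every command goes through, declaring its inputs and outputs explicitly so nothing has to be inferred by model substitution or tracing. The `run_task` helper (and the files in the example) are made up for illustration; locally it's just a subprocess call, but a distributed executor could consume the same declarations to decide what to ship to a worker and what to fetch back.

```python
import subprocess
from typing import List

def run_task(cmd: List[str], inputs: List[str], outputs: List[str]) -> None:
    """Hypothetical wrapper: inputs and outputs are declared, not inferred.

    A scheduler could use these declarations to upload exactly `inputs` to a
    worker, run `cmd` there, and collect exactly `outputs` -- no per-tool
    "models" and no ptrace.
    """
    # Locally, just run the command; the declarations exist for the scheduler.
    subprocess.run(cmd, check=True)

# Example: a compile step with its dependency set stated unambiguously
# (foo.c / foo.h are hypothetical).
run_task(
    cmd=['gcc', '-c', 'foo.c', '-o', 'foo.o'],
    inputs=['foo.c', 'foo.h'],
    outputs=['foo.o'],
)
```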
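
On the dgsh point: below is a minimal illustration of the underlying mechanism, connecting two processes with a Unix domain socket pair instead of an anonymous pipe. This is not dgsh's actual implementation; it only shows that sockets can stand in for pipes, and since sockets are bidirectional and their descriptors can be passed between processes, they're a plausible basis for multi-input / multi-output or distributed streams.

```python
import socket
import subprocess

# seq 1 5 | wc -l, but over an AF_UNIX socket pair instead of a pipe.
a, b = socket.socketpair()

producer = subprocess.Popen(['seq', '1', '5'], stdout=a.fileno())
consumer = subprocess.Popen(['wc', '-l'], stdin=b.fileno())

producer.wait()
a.close()        # close the parent's copy so the consumer sees EOF
consumer.wait()  # wc prints 5
b.close()
```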

Research Projects