Thoughts on Builds - KirillOsenkov/Bliki GitHub Wiki

Disk layout

Build systems should be tripartite: immutable source directory, intermediate build directory, final output directory.

The source directory is read-only, the output directory is write-only, and every output file is write-once. Deleting the intermediate and output directories should be roughly equivalent to git clean.

Builds should pull down their own dependencies as much as possible and isolate them in a packages folder. Dependencies should be locked to exact versions, not floating, so that builds are reproducible.
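One way to enforce locked dependencies is to check every downloaded package against a hash recorded in a lock file. The sketch below assumes a hypothetical lock format (a plain mapping from package name to SHA-256 hex digest); real package managers have their own lock file formats.

```python
import hashlib


def verify_locked(name: str, content: bytes, lock: dict[str, str]) -> None:
    """Raise if a downloaded package is missing from the lock or does not match it.

    `lock` is a hypothetical mapping: package name -> expected SHA-256 hex digest.
    """
    expected = lock.get(name)
    if expected is None:
        raise KeyError(f"dependency '{name}' is not pinned in the lock file")
    actual = hashlib.sha256(content).hexdigest()
    if actual != expected:
        raise ValueError(f"dependency '{name}' does not match its locked hash")
```

A build that calls a check like this before unpacking into the packages folder fails loudly when an upstream package changes, instead of silently producing different outputs.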

All file paths should be absolute. Nothing should rely on the current directory; any relative path that does enter the build should be converted to a full path immediately.
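A minimal sketch of "convert to full paths immediately", assuming a hypothetical helper `to_absolute` that always resolves against an explicitly passed base directory and never against the process's current directory:

```python
import os
from pathlib import Path


def to_absolute(path: str, base_dir: Path) -> Path:
    """Resolve `path` against an explicit base directory, never the process cwd."""
    if not base_dir.is_absolute():
        raise ValueError(f"base_dir must be absolute: {base_dir}")
    candidate = Path(path)
    if not candidate.is_absolute():
        candidate = base_dir / candidate
    # normpath collapses "." and ".." lexically, without touching the file system.
    return Path(os.path.normpath(candidate))
```

Calling this at the boundary where paths enter the build (command line, project files) means everything downstream only ever sees absolute paths.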

Double-writes

When a destination file is copied to from more than one source, or is written more than once, it's a double-write. It's a harmful condition that introduces non-determinism and race conditions into builds: the final content depends on which write happens last, so build results become unpredictable.
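Double-writes can be detected from a log of copy operations. The sketch below assumes the log is available as (source, destination) pairs; the function name and the log shape are illustrative, not taken from any particular build tool.

```python
from collections import defaultdict


def find_double_writes(copies):
    """Return {destination: [sources...]} for every destination written more than once.

    `copies` is an iterable of (source, destination) pairs in the order they ran.
    """
    writers = defaultdict(list)
    for source, destination in copies:
        writers[destination].append(source)
    return {dest: srcs for dest, srcs in writers.items() if len(srcs) > 1}
```

For example:

```python
copies = [
    ("a/lib.dll", "out/lib.dll"),
    ("b/lib.dll", "out/lib.dll"),  # second writer of the same destination
    ("a/app.exe", "out/app.exe"),
]
print(find_double_writes(copies))
# {'out/lib.dll': ['a/lib.dll', 'b/lib.dll']}
```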

Build languages

Build systems consist of two parts: declarative ("what" to build) and imperative ("how" to build), e.g. MSBuild items (declarative) and Tasks (imperative). The declarative part should be some DSL or a data language. The imperative part is best written in C# or any other OOP language with great tooling and a debugger. The declarative and imperative parts should be orthogonal.

Traits of builds and Problems

Determinism

Deterministic builds are of paramount importance. The same inputs should reliably produce the same outputs in any build environment. From a given source control revision it should be possible to enlist, build and obtain byte-for-byte identical outputs at any time. This eliminates the "we lost the symbols"/"we lost the binaries for this release" problem.

Determinism is also important for verifiability - given this set of binaries, was it really built from these exact sources? Can we trust this release or was this binary tampered with? See e.g. https://twitter.com/reprobuilds
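A simple way to check determinism is to build the same revision twice into two output directories and compare content hashes. The sketch below assumes both builds have already run; `hash_tree` and `diff_builds` are illustrative names.

```python
import hashlib
from pathlib import Path


def hash_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its content."""
    manifest = {}
    for file in sorted(root.rglob("*")):
        if file.is_file():
            relative = file.relative_to(root).as_posix()
            manifest[relative] = hashlib.sha256(file.read_bytes()).hexdigest()
    return manifest


def diff_builds(first: Path, second: Path) -> list[str]:
    """Return the relative paths that are missing or differ between two outputs."""
    a, b = hash_tree(first), hash_tree(second)
    return sorted(path for path in a.keys() | b.keys() if a.get(path) != b.get(path))
```

An empty result means the two outputs are byte-for-byte identical; any listed path points at a source of non-determinism (or tampering) worth investigating.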

Incrementality/Idempotence

If you've already built this project, building it again should be a no-op and very fast. Overbuilding (unnecessary build operations that could be avoided if the build were precise) drains productivity and wastes precious time and resources. It is one of the main sources of friction that slow down software development. Builds should be highly tuned and optimized to rebuild only what really needs to be rebuilt after a change, minimizing work across builds.
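A minimal sketch of an up-to-date check, assuming the build step records a fingerprint of its inputs in a stamp file under the intermediate directory. The names `fingerprint` and `build_if_needed` are hypothetical, and a real system would also fingerprint the tool and its arguments.

```python
import hashlib
from pathlib import Path
from typing import Callable


def fingerprint(inputs: list[Path]) -> str:
    """Hash the paths and contents of all inputs into a single digest."""
    digest = hashlib.sha256()
    for path in sorted(inputs):
        digest.update(str(path).encode())
        digest.update(b"\0")
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()


def build_if_needed(inputs: list[Path], stamp: Path, action: Callable[[], None]) -> bool:
    """Run `action` only if the inputs changed since the last successful run."""
    current = fingerprint(inputs)
    if stamp.exists() and stamp.read_text() == current:
        return False  # up to date: the build is a no-op
    action()
    stamp.parent.mkdir(parents=True, exist_ok=True)
    stamp.write_text(current)  # recorded only after the action succeeded
    return True
```

Hashing content rather than comparing timestamps avoids rebuilding when a file is touched but not changed.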

Independence of environment/Reproducibility

The famous "builds on my machine". The most common problem plaguing projects is that they don't build after enlisting. Such builds rely on implicit or unspecified machine state, such as machine-wide installed SDKs, global environment variables, hardcoded paths, assumptions about file system layout, etc. Ideally every repo should support git clone followed by build, resulting in 0 errors and 0 warnings on any machine. It's OK to require prerequisites, but they need to be explicitly specified, ideally in a machine-readable configuration script. The build should bring as much as possible with it (all dependencies, SDKs, tools, compilers, etc.) to demand as little as possible from the build machine. The build shouldn't assume anything but the bare minimum about the environment.

Purity (Side-effects free build logic)

Build logic should be side-effects free ("pure"), so that given inputs result in the desired outputs, and all outputs are specified explicitly. This way large builds may support a cache system where results for given inputs are cached, so that when the same inputs are requested to be built in the future, the results are simply retrieved from the cache instead of being recomputed from scratch. Global or distributed content-addressable stores are important to implement efficient build caching.
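The caching idea can be sketched as a tiny content-addressable store on local disk. This is an illustration under simplifying assumptions (a single input blob, a single output blob, and a hypothetical `tool_id` string standing in for the tool and its arguments), not a description of any existing cache.

```python
import hashlib
from pathlib import Path
from typing import Callable


class BuildCache:
    """Stores build outputs under a key derived from the tool and its input."""

    def __init__(self, root: Path):
        self.root = root
        root.mkdir(parents=True, exist_ok=True)

    def get_or_build(self, tool_id: str, input_bytes: bytes,
                     build: Callable[[bytes], bytes]) -> bytes:
        key = hashlib.sha256(tool_id.encode() + b"\0" + input_bytes).hexdigest()
        entry = self.root / key
        if entry.exists():
            return entry.read_bytes()  # cache hit: skip the build entirely
        output = build(input_bytes)    # only safe if `build` is pure
        tmp = entry.with_suffix(".tmp")
        tmp.write_bytes(output)
        tmp.replace(entry)             # rename so readers never see a partial entry
        return output
```

The cache is only correct if the build function is pure: any input it reads that is not part of the key can make a cached result stale.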

Purity and reproducibility (correctness) allow for distributed builds, where parts of the build are parallelized and potentially executed on different machines. This together with caching may result in dramatic speed and scale improvements for large builds where a single machine (no matter how powerful) can be a limiting factor.

Correctness

Inputs and outputs should be fully specified (or at least specified as much as possible) and any other disk access that was not declared should be a correctness violation. This is important to catch unintended side-effects, mutation of state, corruption of inputs, corruption of outputs, unintended sharing/collisions in intermediates, etc. Inputs and outputs should form a dependency DAG that is verifiable, deterministic (modulo ordering of independent operations in parallel builds) and side-effects free.
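A sketch of how declared inputs and outputs can be checked, assuming each build step is described by a hypothetical dictionary with `inputs` and `outputs` sets of file paths, and that the actual reads and writes of a step can be observed by some file-access monitor (not shown here).

```python
from graphlib import CycleError, TopologicalSorter


def build_order(steps: dict[str, dict[str, set[str]]]) -> list[str]:
    """Order steps so that producers run before consumers.

    Raises ValueError if two steps declare the same output or if the
    dependencies form a cycle (i.e. the graph is not a DAG).
    """
    producer: dict[str, str] = {}
    for name, step in steps.items():
        for output in step["outputs"]:
            if output in producer:
                raise ValueError(
                    f"'{output}' is written by both '{producer[output]}' and '{name}'")
            producer[output] = name
    # A step depends on whichever step produces each of its inputs;
    # inputs with no producer are treated as source files.
    graph = {
        name: {producer[i] for i in step["inputs"] if i in producer}
        for name, step in steps.items()
    }
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as error:
        raise ValueError(f"dependency cycle: {error.args[1]}") from error


def undeclared_accesses(step: dict[str, set[str]],
                        reads: set[str], writes: set[str]) -> set[str]:
    """Return observed file accesses that the step did not declare."""
    return (reads - step["inputs"]) | (writes - step["outputs"])
```

Any non-empty result from `undeclared_accesses` would be reported as a correctness violation rather than silently tolerated.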

It goes without saying that corrupting or mutating global machine state should be strictly disallowed. Depending on global machine state should be minimized.