mkconda scientific computing package bundle - kutaslab/mkconda GitHub Wiki

What is a "package bundle" and why?

Because for scientific computing there are ...

  • Programming languages

    Such as Python and R.

  • Programming language software packages

    Collections of useful functions and data structures in Python (numpy, pandas, statsmodels) and R (the tidyverse, lme4).

  • Programming language software package versions (vN.N.N)

    Packages are under continuous development, so many different versions of each package are available and more are coming all the time.

  • Programming language software package version builds

    For each version, there are different "builds", where each "build" is one or more binary files that can be, and often are, specific to one kind and version of operating system (half a dozen flavors of Linux, macOS, Windows) and one kind and version of computer hardware (32-bit vs 64-bit; Intel, AMD, NVIDIA chips).

  • Programming language software package version build dependencies

    Each language-package-version-operating_system-hardware-specific-build binary file is constructed and runs properly only in the context of a great deal of other software infrastructure, which may include other packages, programming language tools such as C/C++ compilers, and efficient general-purpose math libraries such as LAPACK and BLAS for linear algebra.

Thus, we want to run packages A and B but ...

  • Talk of running packages A and B is speaking loosely. In reality we need to run Package_A-version-build with its dependencies and Package_B-version-build with its dependencies.

  • The package A and B dependencies may be and often are different.

  • The dependencies of A and the dependencies of B may be, and often are, incompatible with each other.

  • Finding a set of compatible dependencies for packages A and B can be, and often is, tedious and difficult (perhaps impossible).

  • The more packages A, B, C, ... Z we want to run the more difficult it gets to satisfy all the dependency requirements.

  • A set of compatible dependencies can be, and often is, brittle: any revision for a new or updated package-version-build can introduce an incompatible dependency, which restarts the tedious, difficult, perhaps impossible process of finding compatible dependencies.
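The search problem in the list above can be made concrete with a toy model. This is only a sketch and not how conda's solver works: the package names, versions, and the shared dependency `libmath` are all invented for illustration. It tries every combination of versions and keeps the ones whose dependency requirements overlap.

```python
from itertools import product

# Hypothetical catalog (invented for illustration):
# package -> version -> {dependency name: set of acceptable dependency versions}
CATALOG = {
    "A": {
        "1.0": {"libmath": {"1", "2"}},
        "2.0": {"libmath": {"3"}},
    },
    "B": {
        "1.0": {"libmath": {"2"}},
        "1.1": {"libmath": {"2", "3"}},
    },
}


def compatible_sets(catalog):
    """Yield (version choice, shared dependency versions) for each compatible combination."""
    names = sorted(catalog)
    for versions in product(*(sorted(catalog[n]) for n in names)):
        choice = dict(zip(names, versions))
        # For each dependency, intersect what every chosen package-version accepts.
        allowed = {}
        for name, version in choice.items():
            for dep, ok in catalog[name][version].items():
                allowed[dep] = allowed.get(dep, ok) & ok
        # An empty intersection means the chosen versions cannot share that dependency.
        if all(allowed.values()):
            yield choice, allowed


solutions = list(compatible_sets(CATALOG))
for choice, deps in solutions:
    print(choice, deps)
```

Of the four combinations here, three are compatible; A 2.0 with B 1.0 is not, because they accept no common `libmath` version. The number of combinations to check multiplies with every package added, which is why doing this by hand gets tedious quickly.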

Conda environments

The Anaconda (conda) package management system 1) automates the process of finding compatible dependencies for sets of packages (if possible) and 2) encapsulates the set of compatible package-version-builds in a "virtual" environment.
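To make "package-version-build" concrete: conda package files are named `<name>-<version>-<build>` followed by an archive extension. The sketch below splits such a filename into its three parts. The example filename and its build string are made up for illustration, and `PackageBuild` and `parse_filename` are hypothetical helper names, not part of conda.

```python
from typing import NamedTuple


class PackageBuild(NamedTuple):
    name: str
    version: str
    build: str


def parse_filename(filename):
    """Split a conda-style package filename into name, version, and build."""
    stem = filename
    for ext in (".tar.bz2", ".conda"):
        if stem.endswith(ext):
            stem = stem[: -len(ext)]
            break
    # Package names may contain hyphens, so split from the right.
    name, version, build = stem.rsplit("-", 2)
    return PackageBuild(name, version, build)


# Illustrative filename only; the build string is invented.
example = parse_filename("scikit-learn-0.21.2-py37hd81dba3_0.tar.bz2")
print(example)
```

The same package name and version can appear with several different build strings, one per operating system, hardware, or dependency variant.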

"Virtual" environment is poor terminology; "isolated" or "encapsulated" environment is better. A conda environment is real, just isolated from other conda environments in the way that different rooms in a house are isolated from one another. Each room has its own furnishings. The furnishings in different rooms may be the same (this would be odd) or similar but different (master vs. guest bedroom) or wholly different (bedroom vs. kitchen). Stepping into one room makes the furnishings in that room available for use, but just those. There is no bed in the kitchen and no oven in the bedroom.

Conda environments work the same way. Just as new rooms may be added onto a house and furnished differently, each user on a computer system can create new environments, each furnished with various packages for different purposes. The packages in different environments may be the same (this would be odd), or similar but different (Python 3.5 with numpy 1.15 vs Python 3.6 with numpy 1.16) or wholly different (Python + jupyter vs R + RStudio). The different environments are used one at a time ("activate", "deactivate") like the different rooms in a house ("step into", "step out of"). Putting each set of compatible package-version-builds and dependencies in its own isolated environment allows incompatible packages and dependencies to co-exist. The cost of multiple environments is the computer disk space required to store and back up multiple copies of similar-but-different packages. As of this writing, 8 TB of USB storage around the size of a pack of cigarettes can be had for $100 US. Just sayin.
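One way to see which "room" you are standing in is to ask Python itself. The minimal sketch below reports the interpreter in use and the environment variables that `conda activate` sets (`CONDA_DEFAULT_ENV` and `CONDA_PREFIX`); outside an activated conda environment those two come back as `None`. The function name is a hypothetical helper, not part of conda or mkconda.

```python
import os
import sys


def describe_active_environment():
    """Report which interpreter and which conda environment (if any) is in use."""
    return {
        "interpreter": sys.executable,  # path of the running Python binary
        "prefix": sys.prefix,  # root directory of this Python installation
        "conda_env_name": os.environ.get("CONDA_DEFAULT_ENV"),
        "conda_env_path": os.environ.get("CONDA_PREFIX"),
    }


for key, value in describe_active_environment().items():
    print(f"{key}: {value}")
```

Running this after activating two different environments should show two different interpreter paths, which is the isolation at work.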

mkconda metapackage: compatibility and reproducibility

mkconda is a conda package that contains a set of other conda packages.

Why a package of packages instead of just installing each package one by one into an environment? Installing packages one at a time only ensures the new package is compatible with those already in place. When many packages are installed incrementally, it may be, and often is, impossible to resolve dependencies for packages later in the sequence. Installing the packages as a bundle allows the solver to check the dependencies for everything together.
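The difference between one-at-a-time and all-at-once can be sketched with a toy example. This is an illustration under simplifying assumptions, not conda's actual algorithm: the catalog is invented, there is a single shared dependency (a hypothetical `libmath`), and the incremental installer never revisits a choice it has already made.

```python
from itertools import product

# Hypothetical catalog (invented for illustration):
# package -> {version: acceptable versions of a shared library "libmath"}
CATALOG = {
    "A": {"2.0": {"3"}, "1.0": {"2"}},
    "B": {"1.0": {"2"}},
}


def install_incrementally(order):
    """Install packages one at a time, taking the newest version that fits what is in place."""
    allowed = None  # libmath versions acceptable to everything installed so far
    installed = {}
    for name in order:
        for version in sorted(CATALOG[name], reverse=True):  # newest first
            ok = CATALOG[name][version]
            fits = ok if allowed is None else allowed & ok
            if fits:
                installed[name], allowed = version, fits
                break
        else:
            return None  # no version of this package fits the earlier choices
    return installed


def install_as_bundle(names):
    """Consider all packages together and return the first jointly compatible choice."""
    for versions in product(*(sorted(CATALOG[n], reverse=True) for n in names)):
        requirements = [CATALOG[n][v] for n, v in zip(names, versions)]
        if set.intersection(*requirements):
            return dict(zip(names, versions))
    return None


print(install_incrementally(["A", "B"]))  # None: A 2.0 was chosen first and blocks B
print(install_as_bundle(["A", "B"]))  # {'A': '1.0', 'B': '1.0'}
```

Installing A first grabs the newest A, which leaves no `libmath` version that B can use. Looking at A and B together finds that the older A works with B. In this toy, installing B first and then A also happens to succeed, so the outcome of incremental installation depends on the order.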

Installing mkconda into an empty conda environment furnishes the environment with Python and R and a small but powerful stack of general purpose scientific computing packages for each language (numpy, scipy, statsmodels, jupyter, fitgrid; tidyverse, lme4, lmerTest) and special purpose Kutas lab software (mkpy).

The conda installer finds and fetches the compatible package-version-build binaries so everything runs when the environment is active, and the results are reproducible because the binary executables in the environment are frozen and documented.