User Tutorial: Docker and Kubernetes

Running a processing flow using Docker and Kubernetes

Docker provides a more flexible approach to cloud computing than block-booking virtual machines (VMs). When processing very large datasets there can be cost and efficiency reasons for estimating your usage up front and then booking VMs directly so that they are used to the full. For out-of-the-box scalability, however, the approach outlined here is better.

What is Docker?

Docker encapsulates your code in a single, isolated container. It is lightweight in that it does not contain a whole operating system in the way that a VM does. The default approach to parallelism in distpy is Python's multiprocessing, and the level at which that parallelism happens is the scope of a data-chunk. This means that we can simply encapsulate the workers as separate Docker containers.
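To make that mapping concrete, the multiprocessing pattern being replaced looks roughly like the sketch below. This is a generic illustration, not distpy's actual code; process_chunk is a stand-in for one of the high-level workers.

from multiprocessing import Pool

def process_chunk(chunk_file):
    # Placeholder for a high-level worker, e.g. strain-rate to summary.
    print('processing', chunk_file)

if __name__ == '__main__':
    # One worker per data-chunk; each call is an independent unit of work.
    chunk_files = ['chunk_0001.npy', 'chunk_0002.npy', 'chunk_0003.npy']
    with Pool() as pool:
        pool.map(process_chunk, chunk_files)

Because each call is independent, it can just as well become its own container instead of its own process.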

A short tutorial guide to creating a distpy Docker container

The files for this walkthrough can be found in the docker directory.
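For orientation, a minimal Dockerfile for this purpose looks roughly like the sketch below. The base image, the pip install and the entrypoint are assumptions here, so treat the copy in the repository as the reference.

FROM python:3.7
RUN pip install distpy
WORKDIR /app
COPY distpy_docker.py /app/distpy_docker.py
ENTRYPOINT ["python", "distpy_docker.py"]

With an entrypoint along these lines, the -f, -c and -m options given to docker run are passed straight through to distpy_docker.py.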

  1. Copy the files Dockerfile and distpy_docker.py to a clean location.
  2. Create a docker container using
docker build --no-cache -t distpy-python-app .
  3. Test your container using
docker run distpy-python-app

The output should look like the following (additional missing-file messages will appear on non-GPU systems):

Using TensorFlow backend.
usage: distpy-docker.py [-h] [-f FILE] [-c JSON] [-m MODULE]
optional arguments:
  -h,        --help            show this help message and exit
  -f FILE,   --file    FILE    write report to FILE
  -c JSON,   --config  JSON    read config from JSON
  -m MODULE, --module  MODULE  select a processing module

This basic example supports four commands via the -m option, corresponding to the high-level ingestion and processing workers used in the distpy system:

 -m segy_ingest
 -m ingest_h5
 -m strainrate2summary 
 -m plotgenerator

This covers the main use cases. The -f option takes the specific file to operate on, for example the SEGY file to be ingested, and the -c option supplies the JSON configuration. This means that a single Docker image is all that is needed for all distpy workflows, and that as many containers as needed can be spawned from that image using Kubernetes orchestration.

As a concrete example, you can test ingestion of a single SEGY file (see the Tutorials for details on projects and distpy JSON configuration):

 docker run -v C:\NotBackedUp:/scratch distpy-python-app \
            -f /scratch/myproject/sgy/test.sgy           \
            -c /scratch/myproject/config/docker_sgyConfig.json \
            -m segy_ingest

The -v option that comes before the image name binds the local C:\NotBackedUp folder on a Windows machine to the container's /scratch; this mapping creates a shared space that all containers can use. The options after the image name are passed into the container when it starts, and so constitute the instructions:

  1. Configure distpy using docker_sgyConfig.json.
  2. Use this configuration to ingest the SEGY file test.sgy.

This level of granularity means you would have one container for each of the SEGY files you want to ingest, followed by one container for each 1-second numpy file containing ingested strain-rate data. The execution style is therefore the same as that achieved through the CASE00.py example.
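As an illustration of that fan-out, a simple shell loop can launch one container per SEGY file. This is a sketch only, assuming a Linux host with the project under /data; the paths are illustrative.

# One ingestion container per SEGY file, run in the background.
for f in /data/myproject/sgy/*.sgy; do
  docker run -v /data:/scratch distpy-python-app \
             -f "/scratch${f#/data}" \
             -c /scratch/myproject/config/docker_sgyConfig.json \
             -m segy_ingest &
done
wait

Each iteration starts an independent container, so the files are ingested in parallel up to whatever the host can sustain; Kubernetes, described next, does this kind of scheduling for you.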

We replace the multiprocessing by having a separate container where we previously had a separate thread. The other half of the equation is replacing the controllers that served up the separate threads, and for this we use Kubernetes, which orchestrates the lifecycles of the containers.

What is Kubernetes?

Kubernetes is an orchestrator for containers, which means it can take your containers and map them across your available hardware, tracking them, re-running any failed calculations and generally organizing everything. From the distpy perspective we need a flexible batch runner to fire off many instances of our containerized workers, so we are interested in the Kubernetes concept of jobs run to completion.

Consider a basic pod that uses our Docker image:

apiVersion: v1
kind: Pod
metadata:
  name: distpy-job
spec:
  containers:
  - name: distpy-container
    image: distpy-python-app
    args: ["-f", "FILE", "-c", "JSON", "-m", "MODULE"]
    volumeMounts:
    - name: scratch
      mountPath: "/scratch"
  volumes:
  - name: scratch
    hostPath:
      path: "C:\\NotBackedUp"

Referencing back to the command-line SEGY ingestion above, we can see the Windows drive mapped to /scratch and the three arguments in their templated form.
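For actual runs we want the run-to-completion behaviour mentioned above, so the same container spec would typically be wrapped in a Job rather than submitted as a bare pod. The manifest below is a sketch along those lines, re-using the image, arguments and volume from the pod example; the job name and the backoffLimit value are assumptions.

apiVersion: batch/v1
kind: Job
metadata:
  name: distpy-segy-ingest
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: distpy-container
        image: distpy-python-app
        args: ["-f", "/scratch/myproject/sgy/test.sgy",
               "-c", "/scratch/myproject/config/docker_sgyConfig.json",
               "-m", "segy_ingest"]
        volumeMounts:
        - name: scratch
          mountPath: "/scratch"
      volumes:
      - name: scratch
        hostPath:
          path: "C:\\NotBackedUp"

Submitting the manifest with kubectl apply -f and checking it with kubectl get jobs runs the ingestion once and reports when it has completed; one such job per SEGY file, and later per ingested chunk, reproduces the fan-out described earlier. Note that on a real cluster the distpy-python-app image would first need to be pushed to a registry that the cluster can pull from.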