Securing Docker Container Workloads - MatthewJacques/Wiki GitHub Wiki

Namespaces

A Linux namespace is a kernel construct that isolates an operating system resource from the perspective of a running process.

Types

Mount - isolates the set of filesystem mount points seen by a process in the namespace
UTS - isolates system identifiers for hostname and NIS domainname
PID - isolates the process ID number space
IPC - isolates System V IPC objects and POSIX message queues
Network - isolates the network stack, including interfaces, ports, net filtering rules, etc.
User - isolates set of user IDs and group IDs
CGroup - isolates view of the cgroup hierarchy root directories
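
On a Linux host, the namespaces a process belongs to are visible as symbolic links under /proc/<pid>/ns. The sketch below prints one identifier per namespace type; the helper name ns_of is made up for illustration, and the cgroup entry only exists on kernels that support cgroup namespaces.

```shell
# Print the namespace identifier of a given type for a process.
# Usage: ns_of TYPE [PID]   (PID defaults to the current process)
ns_of() {
  readlink "/proc/${2:-self}/ns/$1"
}

# Two processes share a namespace when these identifiers match.
for type in mnt uts pid ipc net user cgroup; do
  printf '%-7s %s\n' "$type" "$(ns_of "$type")"
done
```

Comparing the output on the host with the output for a container's process ID should show different identifiers for each namespace the container does not share with the host.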

Creating Namespaces

Calls

clone() - clones the calling process, placing the child in a new namespace
unshare() - moves the calling process out of its existing namespace and into a new namespace
setns() - places the calling process into a pre-existing namespace
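
From a shell, the util-linux tools unshare and nsenter wrap the unshare() and setns() calls, which makes it possible to experiment with namespaces without writing C. This is a minimal sketch; the wrapper names run_isolated and run_inside and the PID in the example are hypothetical, and both commands usually need root.

```shell
# Run a command in new UTS and PID namespaces.
# --fork makes unshare fork so the command becomes PID 1 of the new PID
# namespace; --mount-proc mounts a fresh /proc so tools such as ps only
# see processes in that namespace.
run_isolated() {
  unshare --uts --pid --fork --mount-proc "$@"
}

# Run a command inside the namespaces of an existing process.
# Usage: run_inside PID COMMAND [ARGS...]
run_inside() {
  target=$1
  shift
  nsenter --target "$target" --uts --pid --net "$@"
}

# Example (hypothetical PID):
#   sudo sh -c '. ./ns.sh; run_isolated hostname'
#   sudo sh -c '. ./ns.sh; run_inside 1234 ip addr'
```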

Cloning a Process

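// Clone the child process into new PID, network and mount namespaces.
// SIGCHLD is OR-ed into the flags so that the parent is signalled when the
// child exits and can reap it with a plain waitpid().
// Requires _GNU_SOURCE with <sched.h>, plus <signal.h> and <sys/wait.h>.
int flags = CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | SIGCHLD;
pid_t child = clone(childFunction, child_stack + STACK_SIZE, flags, &args);
if (child == -1) {
  perror("Parent: main: clone");
  exit(EXIT_FAILURE);
}

// Wait for child to finish
if (waitpid(child, NULL, 0) == -1) {
  perror("Parent: main: waitpid");
  exit(EXIT_FAILURE);
}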

Control Groups

A control group is a kernel construct that limits access to, and accounts for, the usage of a host system's physical resources by a group of processes. It is good practice to limit the resources available to Docker containers so that they keep running smoothly and remain good neighbors to other containers on the same host.

Cgroup Subsystems

Subsystem	Purpose
blkio	Controls and monitors block I/O operations
devices	Controls access to system devices
memory	Controls and monitors the use of memory
pids	Sets a limit to the number of processes in a cgroup
cpu	Controls access to cpu cycles
cpuset	Pins processes to cpu cores and memory nodes
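
Docker exposes several of these subsystems through options on docker container run. The sketch below is one way to apply limits; the wrapper name run_limited, the container name and the limit values are illustrative assumptions, not recommendations.

```shell
# Start a container with cgroup-backed resource limits.
# Usage: run_limited IMAGE [COMMAND...]
run_limited() {
  # --memory      memory subsystem: hard memory limit
  # --cpus        cpu subsystem: share of CPU time (0.5 = half a core)
  # --pids-limit  pids subsystem: maximum number of processes
  # --cpuset-cpus cpuset subsystem: pin to CPU core 0
  docker container run -d --name apiserver \
    --memory 256m \
    --cpus 0.5 \
    --pids-limit 100 \
    --cpuset-cpus 0 \
    "$@"
}

# Example: run_limited apiserver:multi
```

Running docker container stats apiserver afterwards should show the memory limit alongside current usage.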

Managing Privileges

Every program and every user of the system should operate using the least set of privileges necessary to complete the job. By default, Docker runs a container's processes as a privileged user (root). If the container workload does not need any privileges, it is better to run the container as a non-privileged user.

Running Processes as Non-privileged Users

Create a user in the Docker image and select it with the USER instruction. For example, in an Alpine-based image:

# Create user and group for container
RUN addgroup -g 3000 -S http && adduser -u 3000 -S -G http http

# Set container's user/group
USER http:http
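
To confirm which user a container actually runs as, the image's entrypoint can be swapped for the id utility. This is a sketch that assumes the image ships id (Alpine-based images do, scratch-based ones do not); the helper name container_ids is made up for illustration.

```shell
# Print the UID/GID a container's main process would run with.
# Extra arguments (such as --user) are passed through to docker.
# Usage: container_ids IMAGE [DOCKER_RUN_OPTIONS...]
container_ids() {
  image=$1
  shift
  docker container run --rm --entrypoint id "$@" "$image"
}

# Examples:
#   container_ids alpine                    # image default user
#   container_ids alpine --user 3000:3000   # override at run time
```

The --user option overrides the USER instruction for a single run, which is useful for checking that a workload still works without root before changing the image.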

Protecting Container Filesystems

Many exploits rely on tampering with filesystem content, so steps should be taken to prevent this. A container can be started with a read-only filesystem by passing the --read-only config option to docker run. If it is not possible to make the entire filesystem read-only, a minimal filesystem can be used instead, which reduces the attack surface available for compromise.
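
A read-only container often still needs somewhere to write temporary files. One possible approach, sketched below, is to pair --read-only with an in-memory --tmpfs mount; the wrapper name run_read_only and the choice of /tmp are assumptions for illustration.

```shell
# Start a container with a read-only root filesystem and a writable,
# in-memory /tmp for scratch data.
# Usage: run_read_only IMAGE [COMMAND...]
run_read_only() {
  docker container run -d --read-only --tmpfs /tmp "$@"
}

# Example: run_read_only apiserver:multi
```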

A container image can be built from scratch so that the filesystem includes only what is needed to run the workload. The problem with starting from scratch is that there are no user or group files to use for non-privileged users. To get around this, we can build the image using multiple build stages.

FROM golang:alpine as build

RUN mkdir -p /web/assets

RUN addgroup -S http && adduser -S -G http http
.
.
FROM scratch

COPY --from=build /web /
COPY --from=build /etc/passwd /etc/group /etc/
.
.
USER http:http
ENTRYPOINT ["./httpserv"]

Example of Minimising Container Filesystem With a Non-privileged User

This example also reduces the image size from 277MB to 4.57MB, which is great for distribution and start times.

FROM golang:alpine as build

# Add dependencies and create source directory
RUN apk add --no-cache git         && \
    go get github.com/gorilla/mux  && \
    apk del git                    && \
    addgroup -S -g 500 api         && \
    adduser -S -u 500 -G api api   && \
    mkdir -p $GOPATH/src/apiserver

# Set working directory
WORKDIR $GOPATH/src/apiserver

# Copy src into image
COPY ./src/ ./

# Build simple api server
RUN CGO_ENABLED=0 go build -installsuffix cgo -ldflags '-w -s' -o apiserver

FROM scratch

COPY --from=build /go/src/apiserver/apiserver /
COPY --from=build /etc/passwd /etc/group /etc/

# Expose port for service to be consumed
EXPOSE 8000

# Set user as non-privileged alternative to root
USER api:api

# Define entrypoint for container
ENTRYPOINT ["./apiserver"]

Managing User Capabilities

Docker currently does not allow a non-privileged user to be granted individual privileges, but it does allow privileges to be removed from a privileged user. This means that if the container workload needs some privileges, you will need to start with a privileged user and drop the privileges that are not needed.

# Drop all whitelisted capabilities except CAP_NET_RAW
docker container run --cap-drop ALL --cap-add NET_RAW <image>

# Add CAP_SYS_MODULE to the whitelisted capabilities
docker container run --cap-add SYS_MODULE <image>

To add or drop every capability, use ALL as an argument to either config option.
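
The effect of these options can be checked by reading the capability masks of the container's main process from /proc. This is a sketch that assumes the alpine image is available; the helper name show_caps is hypothetical.

```shell
# Print the capability sets (CapInh, CapPrm, CapEff, CapBnd, CapAmb)
# of a container started with the given options.
# Usage: show_caps [DOCKER_RUN_OPTIONS...]
show_caps() {
  docker container run --rm "$@" alpine grep Cap /proc/self/status
}

# Examples:
#   show_caps                                   # default capability set
#   show_caps --cap-drop ALL --cap-add NET_RAW  # only CAP_NET_RAW
# A mask can be decoded on the host with: capsh --decode=<mask>
```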

Limiting the System Calls Available to Container Workloads

Managing the Default Seccomp Profile

You can override the default docker seccomp profile by changing the config.

# Apply a custom seccomp profile to all containers by adding a key/value pair to /etc/docker/daemon.json
{
  "seccomp-profile": "/etc/docker/seccomp/custom.json"
}

Changing Seccomp Profiles on Per Container Basis

The below config option can be applied to the docker container run command

# Config option & argument for unconfined system calls
--security-opt seccomp=unconfined

# Config option & argument for specific seccomp profile
--security-opt seccomp=$PWD/seccomp/nginx.json

Creating a Custom Seccomp Profile

A good place to start when creating your own seccomp profile is to copy Docker's default seccomp profile. To add or remove system calls from the whitelist, edit the 'names' array that lists the whitelisted system calls. Some system calls are required by Docker just to start a container, so use the 'no-new-privileges' security option to minimise the set that must be whitelisted.

Determining the Required System Calls

Determining which system calls a particular container workload requires is not an exact science. A tracing tool such as 'strace' can be used while processing the container workload in order to establish the required system calls. To do this, a new variant of the application's Docker image needs to be created that includes the tracing utility, so that the workload can be exercised effectively. When invoking the container workload, remove all constraints in order to get a true reflection of the requirements.

Here is an example of changing the Dockerfile from above to run strace.

FROM golang:alpine as build

# Add dependencies and create source directory
RUN apk add --no-cache git         && \
    go get github.com/gorilla/mux  && \
    apk del git                    && \
    mkdir -p $GOPATH/src/apiserver

# Set working directory
WORKDIR $GOPATH/src/apiserver

# Copy src into image
COPY ./src/ ./

# Build simple api server
RUN CGO_ENABLED=0 go build -installsuffix cgo -ldflags '-w -s' -o apiserver

FROM alpine

RUN apk add --no-cache strace

COPY --from=build /go/src/apiserver/apiserver /
COPY --from=build /etc/passwd /etc/group /etc/

# Expose port for service to be consumed
EXPOSE 80

# Define entrypoint for container
ENTRYPOINT ["strace", "-cf", "./apiserver"]

Docker run command for the container:

docker container run -itd --name apiserver -p 80:80 \
--security-opt apparmor=unconfined \
--security-opt seccomp=unconfined \
--cap-add ALL \
apiserver:strace

strace prints its summary when the container is stopped, so exercise the workload fully before stopping the container to get a complete view of the system calls it needs. The summary can then be read from the container's logs:

docker container logs apiserver | grep -v strace | less
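
The strace summary can also be turned into a starting point for a custom profile. The sketch below assumes the usual strace -c table layout (two leading decimal columns, syscall name in the last column) and emits a minimal allow-list profile rather than a modified copy of Docker's default one; the function names are made up for illustration, and the result will likely need extra system calls added before a container starts with it.

```shell
# Extract syscall names from an `strace -c` summary read on stdin.
# Data rows start with two decimal columns (% time, seconds); the syscall
# name is the last field. The final "total" row is skipped.
strace_syscalls() {
  awk '/^ *[0-9]+\.[0-9]+ +[0-9]+\.[0-9]+ / && $NF != "total" { print $NF }' |
    sort -u
}

# Wrap syscall names (one per line on stdin) in a minimal seccomp
# profile: deny by default, allow only the named system calls.
seccomp_profile() {
  awk '
    BEGIN { printf "{\n  \"defaultAction\": \"SCMP_ACT_ERRNO\",\n  \"syscalls\": [\n    {\n      \"names\": [" }
    NF    { printf "%s\"%s\"", (n++ ? ", " : ""), $1 }
    END   { printf "],\n      \"action\": \"SCMP_ACT_ALLOW\"\n    }\n  ]\n}\n" }
  '
}

# Example:
#   docker container logs apiserver 2>&1 | strace_syscalls | seccomp_profile \
#     > seccomp/custom.json
```

Comparing the generated list against Docker's default profile is a reasonable sanity check before using it.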

Running Container With Least Privileges

An example of the command to run a container with the least privileges after creating a custom seccomp profile:

docker container run -itd --name apiserver -p 80:80 \
--cap-drop ALL --cap-add NET_BIND_SERVICE \
--security-opt no-new-privileges \
--security-opt seccomp=$PWD/seccomp/custom.json \
apiserver:multi

Implementing Access Control for Container Workloads

Bane can be used to help generate and debug custom AppArmor profiles.

A very simple example .toml file used to generate AppArmor profiles looks like this

# Name of the profile, we will auto prefix with `docker-`
Name = "apiserver"

# Allowed capabilities
[Capabilities]
Allow = [
	"net_bind_service",
	"net_raw"
]

[Network]
# Raw sockets are required by ping; set Raw to false to deny network raw
Raw = true
Protocols = [
	"tcp",
	"icmp"
]

To load the custom AppArmor profile using Bane, use the command sudo bane profile_name. You can then view loaded profiles by using the command sudo aa-status | less.

You can view a generated AppArmor profile at the path /etc/apparmor.d/containers/docker-profile_name.

To run a container using the AppArmor profile, the run command should look something like this

docker container run -itd -p 80:80 --name apiserver --security-opt apparmor=docker-apiserver apiserver
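
Once the container is running, it can be worth confirming that the intended profile was applied. A minimal sketch, assuming the container name used above; the helper name apparmor_profile_of is hypothetical.

```shell
# Print the AppArmor profile a container was started with.
# Usage: apparmor_profile_of CONTAINER
apparmor_profile_of() {
  docker container inspect --format '{{.AppArmorProfile}}' "$1"
}

# Example: apparmor_profile_of apiserver
```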