Kubernetes and etcd

A Homeworld cluster is a set of machines, each running multiple containers. We want to be able to create and destroy containers easily and resource-efficiently without caring about the underlying infrastructure, to expand the network when necessary, and to handle machine failures. Kubernetes is used to manage all of this.

Refer to the Kubernetes overview for more information about its features.

Kubernetes uses etcd to coordinate data across machines. etcd implements the Raft consensus protocol, which we're happy with because it provides the same consistency guarantees as Paxos while being designed to be easier to understand and implement correctly. You don't need to know the details of the protocol.
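For a concrete sense of what "coordinating data" means, here is a minimal sketch of writing and reading a key with etcd's official Go client (go.etcd.io/etcd/client/v3); the endpoint and key are placeholders, not values from our deployment:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint: in Homeworld, only master nodes can reach etcd.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://master-node:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// A write is committed once the Raft leader has replicated it to a
	// quorum of members, so every member agrees on the same value.
	if _, err := cli.Put(ctx, "/demo/key", "value"); err != nil {
		panic(err)
	}

	resp, err := cli.Get(ctx, "/demo/key")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```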

Some Kubernetes terminology: A "node" is Kubernetes' abstraction of a machine (which may be either physical or virtual), and a "pod" is Kubernetes' abstraction of a group of containers that are scheduled together. In most cases, there is only one container per pod.
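As an illustration, this is roughly what a single-container pod looks like when built from the Kubernetes Go API types (k8s.io/api/core/v1); the names and image are hypothetical:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A pod wraps a group of containers that are scheduled onto a node
	// together; here, the common single-container case.
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "example", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{
				{Name: "web", Image: "nginx:1.25"},
			},
		},
	}
	fmt.Printf("pod %s runs %d container(s)\n", pod.Name, len(pod.Spec.Containers))
}
```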

In Homeworld, we have three kinds of nodes:

  • Master nodes (usually three)
  • Supervisor nodes (usually one)
  • Worker nodes (as many as we can afford)

Master nodes

The master nodes run the etcd servers, along with the key services that Kubernetes uses to manage the cluster. With three (or more) of these nodes, the cluster can survive the temporary loss of a minority of them. etcd operates on quorum: two out of three (or three out of five, or four out of seven...) master nodes must be up and mutually accessible for the cluster to function. Because strictly more than 50% of the etcd members must agree, at most one side of a network partition can hold a quorum, so etcd continues to function correctly even when the network splits.
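The quorum rule itself is just arithmetic, as this small sketch shows:

```go
package main

import "fmt"

// quorum returns the minimum number of members that must be up and mutually
// reachable for etcd to commit writes: strictly more than half of the total.
func quorum(members int) int {
	return members/2 + 1
}

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("%d members: quorum %d, tolerates %d failures\n",
			n, quorum(n), n-quorum(n))
	}
}
```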

The Kubernetes apiserver, which handles requests from clients, administrators, and the other Kubernetes services, is stateless, so it can run on all of the master nodes simultaneously. The controller manager, which hosts the "controllers" that supervise container replication, and the scheduler, which assigns pods to nodes, each perform a leader election, so that exactly one instance of each is performing its task at any given time.
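The actual components run their election through the apiserver rather than against etcd directly, but the lease-based pattern is easy to sketch with etcd's own election primitive (go.etcd.io/etcd/client/v3/concurrency); the endpoint and election key here are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://master-node:2379"}, // placeholder
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// A session holds a lease in etcd; if this process dies, the lease
	// expires and another candidate can take over the leadership.
	session, err := concurrency.NewSession(cli)
	if err != nil {
		panic(err)
	}
	defer session.Close()

	// Campaign blocks until this candidate becomes the leader for the key.
	election := concurrency.NewElection(session, "/elections/example")
	if err := election.Campaign(context.Background(), "candidate-1"); err != nil {
		panic(err)
	}
	fmt.Println("elected: this instance now does the work alone")
}
```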

etcd is only accessible from the master nodes themselves; the rest of the cluster is not granted access. Everything else communicates with it indirectly, through the apiserver.

Supervisor nodes

The supervisor nodes run additional helper services for cluster upkeep and administration. For example, they host the key management infrastructure and the services that let administrators authenticate to the cluster's plumbing. Because none of these services are critical to the moment-to-moment functioning of the cluster, a single supervisor node is sufficient.

Because these nodes contain the authentication system, they effectively have access to the entire cluster.

Note: supervisor nodes are not technically part of the Kubernetes cluster, and are not a standard concept in Kubernetes deployments.

Worker nodes

The worker nodes run the actual user containers, as directed by the master nodes. If a worker node goes down, the controller manager running on one of the master nodes notices and starts replacement copies of its pods on other worker nodes.
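This works because replicated workloads are declared as desired state, roughly like the following sketch using the Kubernetes Go API types (k8s.io/api/apps/v1); all names here are hypothetical:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	labels := map[string]string{"app": "web"}
	// The desired state says "three replicas"; the controller manager keeps
	// the actual pod count converged to it, replacing pods from dead nodes.
	deploy := appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "web"},
		Spec: appsv1.DeploymentSpec{
			Replicas: int32Ptr(3),
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "web", Image: "nginx:1.25"}},
				},
			},
		},
	}
	fmt.Printf("%s: %d replicas desired\n", deploy.Name, *deploy.Spec.Replicas)
}
```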

Services

Kubernetes allows us to define services: stable IP addresses that load-balance traffic across a particular set of pods within the cluster. For example, we could use this to spread computationally intensive jobs across different nodes, or simply to have a consistent IP address for a particular database or other internal system.
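A sketch of such a service, built from the Kubernetes Go API types with a hypothetical database backend:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// One stable address for whichever pods currently carry the label
	// app=database; traffic on port 5432 is balanced across them.
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "database"},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{"app": "database"},
			Ports: []corev1.ServicePort{{
				Port:       5432,
				TargetPort: intstr.FromInt(5432),
			}},
		},
	}
	fmt.Printf("service %s -> pods with %v\n", svc.Name, svc.Spec.Selector)
}
```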

Each worker and master node runs kube-proxy, which implements the routing that makes the cluster's service IPs work.

Additional reading

  • The Kubernetes documentation: the authoritative reference for Kubernetes, though not for our particular deployment.
  • Introduction to Kubernetes Architecture: note that we use rkt instead of Docker, and the post has a few minor factual inaccuracies, so if it disagrees with this wiki page, trust this wiki page. Also, don't trust its install guide portion.