Kubernetes


Registration (new nodes register with a control plane and accept a workload) and service discovery (automatic detection of new services via DNS or environment variables) enable easy scalability and availability.

Containers are isolated user spaces, one per running application. The user space is all the code that resides above the kernel, which includes applications and their dependencies. Abstraction is at the level of the application and its dependencies.

Containerization helps with dependency isolation and integration problem troubleshooting.

Core technologies that enhanced containerization:

  • process - each process has its own virtual memory address space, separate from others
  • Linux namespaces - control what an application can see (process ID numbers, directory trees, IP addresses, etc)
  • cgroups - control what resources an application can use (CPU time, memory, IO bandwidth, etc)
  • union file system - encapsulates an application with its dependencies

Everything in Kubernetes is represented by an object with state and attributes that a user can change. Each object has two elements: object spec (desired state) and object status (current state). All Kubernetes objects are identified by a unique name (set by the user) and a unique identifier (set by Kubernetes).

Sample image for testing gcr.io/google-samples/hello-app:1.0.

Architecture

Cluster add-on Pods provide special services in the cluster, e.g. DNS, Ingress (HTTP load balancer), dashboard. A popular option for logging is Fluentd, for metrics - Prometheus.

Control plane (master) components:

Worker components (also present on control plane node):

# Get status on components
$ kubectl get componentstatuses

kube-apiserver

Exposes RESTful operations and accepts commands to view or change the state of the cluster (users interact with it via kubectl). Handles all calls, both internal and external. All actions are validated and authenticated. Manages the cluster state stored in the etcd database and is the only component that connects to it.


kube-scheduler

Watches the API server for unscheduled Pods and schedules them on nodes. Uses an algorithm to determine where a Pod can be scheduled: first current quota restrictions are checked, then taints, tolerations, labels, etc. Scheduling is done by simply writing the node name into the Pod's object data.
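
The chosen node ends up in the Pod object itself; a quick way to check the assignment (jsonpath expression shown as an illustration):

# Show which node a Pod was scheduled to
$ kubectl get pod <pod_name> -o jsonpath='{.spec.nodeName}'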

pod-eviction-timeout (default 5m) specifies a timeout after which Kubernetes should give up on a node and reschedule Pod(s) to a different node.


kube-controller-manager

Continuously monitors the cluster's state through the API server. If the state does not match the desired one, contacts the necessary controller to reconcile it. Multiple roles are included in a single binary:

  • node controller - worker state
  • replication controller - maintaining correct number of Pods
  • endpoint controller - joins services and Pods together
  • service account and token controller - access management

Generally controllers use a watch mechanism to be notified of changes, but also perform a re-list operation periodically to make sure they haven't missed anything. Controllers source code.


etcd

Cluster's database (distributed b+tree key-value store) for storing cluster, network states and other persistent info. Instead of updating existing data, new data is always appended to the end; previous copies are marked for removal.

The etcdctl command provides snapshot save and snapshot restore actions.


cloud-controller-manager

Manages controllers that interact with external cloud providers. Documentation.


Container runtime

Handles container's lifecycle. Kubernetes supports the following runtimes and can use any other that is CRI (Container Runtime Interface) compliant:

  • docker
  • containerd, includes ctr for managing images and crictl for managing containers
    # View running containers on a node
    $ sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps
  • CRI-O
  • frakti

Each node can run a different container runtime.


Kubelet

Kubernetes agent on each node that interacts with the apiserver. It initially registers the node it is running on by creating a Node resource, then continuously monitors the apiserver for Pods scheduled on that node and starts their containers. Other duties are:

  • receives PodSpec (Pod specification)
  • passes requests to local container runtime
  • mounts volumes to Pods
  • ensures access to storage, Secrets and ConfigMaps
  • executes health checks for Pod/node
  • reports Pod's status, events and resource consumption to apiserver

The kubelet process is managed by systemd when the cluster is built with kubeadm. Once running, it starts every Pod found in the staticPodPath setting (default is /etc/kubernetes/manifests). View the status of kubelet with systemctl status kubelet.service; the systemd unit config is /etc/systemd/system/kubelet.service.d/10-kubeadm.conf, and the default kubelet config is /var/lib/kubelet/config.yaml.
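
To inspect that setup on a node, the paths above can be checked directly (assuming a kubeadm-built cluster):

# kubelet service and its kubeadm drop-in unit
$ systemctl status kubelet.service
$ cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# kubelet configuration, including staticPodPath
$ sudo cat /var/lib/kubelet/config.yaml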

Also runs container's liveness probes, and deletes containers when Pod is deleted from apiserver.


kube-proxy

Provides network connectivity to Pods and maintains all networking rules using iptables entries. Works as a local load balancer, forwards TCP and UDP traffic, and implements Services.

kube-proxy has 3 modes:

  • userspace - iptables are modified in a way that the connection goes through kube-proxy itself (the reason the component got its name) and then gets redirected to a backing Pod. This mode also ensured true round-robin load balancing.
  • iptables - current implementation, which modifies iptables, but sends traffic directly to targets. Load balancing selects Pods randomly.
  • ipvs (alpha)

Services at their core are implemented by iptables rules. kube-proxy watches for Service and Endpoints resource updates and sets rules accordingly. If a client sends traffic to a Service, matching iptables rules substitute the destination IP with a randomly selected backing Pod's IP.
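
In iptables mode these rules can be inspected directly on a node; KUBE-SERVICES is the chain kube-proxy conventionally manages:

# Dump Service-related rules
$ sudo iptables-save | grep KUBE-SERVICES
# Rules for a particular Service (kube-proxy annotates them with namespace/name)
$ sudo iptables-save | grep <service_name>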

High availability

An HA cluster (stacked etcd topology) has at least 3 control plane nodes, since etcd needs a majority of its members online to reach a quorum. apiserver and etcd Pods run on all control plane nodes; scheduler and controller-manager also run on all of them, but only one replica of each is active at any given time (implemented via a lease mechanism). A load balancer in front of the apiservers evenly distributes traffic from worker nodes and from outside the cluster. Since apiserver and etcd are coupled on each control plane node, losing a node takes out one apiserver replica and one etcd member together.

In external etcd topology etcd cluster (at least 3 nodes) is set up separately from the control plane. apiserver references this etcd cluster, while the rest is the same as in previous topology.

Simple leader election code example.

Scheduler and controller-manager each elect a leader using an Endpoints resource or a ConfigMap (newer versions use a Lease object). Each replica tries to write its name into a special annotation; one succeeds thanks to the optimistic locking mechanism, letting the others know they should stand by. The leader also updates the resource regularly, so that other replicas know it is still alive. If the update doesn't happen within a specified amount of time, a new election takes place.
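
On recent versions the current leader can be inspected via Lease objects in kube-system (names assume the default components):

$ kubectl get lease -n kube-system kube-scheduler kube-controller-manager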

Installation and configuration

Recommended minimum hardware requirements:

Node     CPU   RAM (GB)   Disk (GB)
master   2     2          8
worker   1     1          8

Cluster network ports:

Component                                          Default port   Used by
apiserver                                          6443           all
etcd                                               2379-2380      apiserver/etcd
scheduler                                          10251          self
controller manager                                 10252          self
kubelet (both on control plane and worker nodes)   10250          control plane
NodePort (worker node)                             30000-32767    all

Self-install options:

  • kubeadm
  • kubespray - advanced Ansible playbook for setting up cluster on various OSs and using different network providers
  • kops (Kubernetes operations) - CLI tool for creating a cluster in Cloud (AWS officially supported, GKE, Azure, etc on the way); also provisions necessary cloud infrastructure; how to
  • kind - running Kubernetes locally on Docker containers
  • Kubernetes in Docker Desktop for Mac

kubeadm

Create cluster with kubeadm.

kubeadm init performs the following actions, in order, by default (highly customizable):

  1. Pre-flight checks (permissions on the system, hardware requirements, etc)
  2. Create certificate authority
  3. Generate kubeconfig files
  4. Generate static Pod manifests (for Control Plane components)
  5. Wait for Control Plane Pods to start
  6. Taint Control Plane node
  7. Generate bootstrap token
    # List join token
    $ kubeadm token list
    # Regenerate join token
    $ kubeadm token create
  8. Start add-on Pods (DNS, kube-proxy, etc)
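
A minimal bootstrap flow might look as follows (the CIDR value is illustrative and depends on the chosen network add-on; the actual join command comes from the init/token output):

# On the control plane node
$ sudo kubeadm init --pod-network-cidr=192.168.0.0/16
# Configure kubectl for the current user
$ mkdir -p $HOME/.kube && sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config
# Print a join command to run on worker nodes
$ kubeadm token create --print-join-command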

High Availability

kubeadm allows joining multiple control plane nodes with collocated etcd databases. At least 2 more instances (3 in total) are required for etcd to be able to determine a quorum and select a leader. A common architecture also includes a load balancer in front of the control plane nodes.

Additional control plane nodes are added similarly to workers, but with the --control-plane and --certificate-key parameters. A new key needs to be generated, unless secondary nodes are added within two hours of the initial bootstrapping.
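
A sketch of joining an extra control plane node (endpoint, token, hash and key are placeholders taken from the kubeadm init/upload-certs output):

# On an existing control plane node: re-upload certificates and print the certificate key
$ sudo kubeadm init phase upload-certs --upload-certs
# On the new control plane node
$ sudo kubeadm join <lb_endpoint>:6443 --token <token> \
	--discovery-token-ca-cert-hash sha256:<hash> \
	--control-plane --certificate-key <certificate_key>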

It is also possible to set up external etcd cluster. etcd must be configured first, then certificates are copied over to the first control plane, redundant control planes are added one at a time, fully initialized.

Pod networking

Container-to-container networking is implemented by the Pod concept, external-to-Pod connectivity is implemented by Services, while Pod-to-Pod networking is expected to be implemented outside Kubernetes by the network configuration (a network add-on).

Overlay networking (also software defined networking) provides layer 3 single network that all Pods can use for intercommunication. Popular network add-ons:

  • Flannel - L3 virtual network between nodes of a cluster
  • Calico - flat L3 network without IP encapsulation; policy based traffic management; calicoctl; Felix (interface monitoring and management, route programming, ACL configuration and state reporting) and BIRD (dynamic IP routing) daemons - routing state is read by Felix and distributed to all nodes allowing a client to connect to any node and get connected to a workload even if it is on a different node. Quickstart.
  • Weave Net - multi-host network typically used as an add-on in a CNI-enabled cluster.
  • Kube-Router

Maintenance

Node

Node is an API object representing a virtual/physical machine instance. Each node has a corresponding Lease object in the kube-node-lease namespace (used for node heartbeats).

Scheduling Pods on the node can be turned on/off with kubectl cordon/uncordon.

# Remove node from cluster
# 1. remove object from API server
$ kubectl delete node <node_name> 
# 2. remove cluster specific info
$ kubeadm reset
# 3. may also need to remove iptables entries

# View CPU, memory and other resource usage, limits, etc
$ kubectl describe node <node_name>

If node is rebooted, Pods running on that node stay scheduled on it, until kubelet's eviction timeout parameter (default 5m) is exceeded.

Upgrading cluster

kubeadm upgrade

  • plan - check installed version against newest in the repository, and verify that upgrade is possible
  • apply - upgrade first Control Plane node to the specified version
  • diff - show difference applied during an upgrade (similar to apply --dry-run)
  • node - allows upgrading kubelet on worker nodes or additional control plane nodes; a phase subcommand allows stepping through the process

Control plane node(s) should be upgraded first. Steps are similar for control plane and worker nodes.

A kubeadm-based cluster can only be upgraded by one minor version at a time (e.g. 1.16 -> 1.17).

Control plane upgrade

Check available and current versions. Then upgrade kubeadm and verify. Drain the Pods (ignoring DaemonSets). Verify the upgrade plan and apply it. kubectl get nodes would still show the old version at this point. Upgrade kubelet and kubectl and restart the daemon. Now kubectl get nodes should show the updated version. Allow Pods to be scheduled on the node again.


Worker upgrade

Same process as on control plane, except kubeadm upgrade command is different, and kubectl commands are still being executed from control plane node.


Upgrade CLI

  • view available versions
     $ sudo apt update
     $ sudo apt-cache madison kubeadm
     # view current version
     $ sudo apt list --installed | grep -i kube
  • upgrade kubeadm on the given node
     $ sudo apt-mark unhold kubeadm
     $ sudo apt-get install kubeadm=<version>
     $ sudo apt-mark hold kubeadm
     $ sudo kubeadm version
  • drain Pods (from control plane for both)
     $ kubectl drain <node_name> --ignore-daemonsets
  • view and apply node update
     # control plane
     $ sudo kubeadm upgrade plan
     $ sudo kubeadm upgrade apply <version>
     # worker (on the node)
     $ sudo kubeadm upgrade node
  • upgrade kubelet and kubectl
     $ sudo apt-mark unhold kubelet kubectl
     $ sudo apt-get install kubelet=<version> kubectl=<version>
     $ sudo apt-mark hold kubelet kubectl
    
     # restart daemon
     $ sudo systemctl daemon-reload
     $ sudo systemctl restart kubelet
  • allow Pods to be deployed on the node
     $ kubectl uncordon <node_name>

etcd backup

The etcd backup file contains the entire state of the cluster. Secrets are not encrypted at rest by default (only base64-encoded), therefore, the backup file should be encrypted and securely stored.

Usually the backup is performed by a script run from Linux or Kubernetes cron jobs.

By default, there is a single etcd Pod running on the control plane node. All data is stored at /var/lib/etcd, which is backed by a hostPath volume on the node.

The etcdctl version should match the etcd version running in the Pod; use the etcd --version command (inside the etcd container) to find out.

$ export ETCDCTL_API=3

$ etcdctl --endpoints=<host>:<port> <command> <args>
# Running etcdctl on master node
$ etcdctl --endpoints=https://127.0.0.1:2379 \
	--cacert=/etc/kubernetes/pki/etcd/ca.crt \
	--cert=/etc/kubernetes/pki/etcd/server.crt \
	--key=/etc/kubernetes/pki/etcd/server.key \
	snapshot save /var/lib/dat-backup.db
# Check the status of backup
$ etcdctl --write-out=table snapshot status <backup_file>

Restoring the backup to the default location:

$ export ETCDCTL_API=3

# By default restores in the current directory at subdir ./default.etcd
$ etcdctl snapshot restore <backup_file>
# Move the original data directory elsewhere
$ mv /var/lib/etcd /var/lib/etcd.OLD
# Stop etcd container at container runtime level, since it is static container
# Move restored backup to default location, `/var/lib/etcd`
$ mv ./default.etcd /var/lib/etcd
# Restarted etcd will find new data

Restoring the backup to the custom location:

$ etcdctl snapshot restore <backup_file> --data-dir=/var/lib/etcd.custom
# Update static Pod manifest:
# 1. --data-dir=/var/lib/etcd.custom
# 2. mountPath: /var/lib/etcd.custom (volumeMounts)
# 3. path: /var/lib/etcd.custom (volumes, hostPath)
# Updating the manifest triggers a Pod restart (also restarts kube-controller-manager
# and kube-scheduler)

Troubleshooting

Network troubleshooting

The sniff plugin allows seeing network traffic from within the cluster, since traffic between Pods may be encrypted on the wire. sniff requires Wireshark and the ability to export a graphical display. The sniff command will use the first container, unless the -c option is used:

$ kubectl krew install sniff
$ kubectl sniff <pod> -c <container>

Troubleshooting DNS can be done by creating a Pod with network tools, creating a Service and running a DNS lookup (other tools include dig, nc, wireshark):

$ nslookup <service_name> <kube-dns_ip>
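
One way to get such a Pod quickly (the netshoot image is a common community choice, not something prescribed here):

# Throwaway Pod with network tools
$ kubectl run tmp-shell --rm -it --image=nicolaka/netshoot -- /bin/bash
# Inside the Pod
$ nslookup kubernetes.default
$ dig <service_name>.<namespace>.svc.cluster.local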

API

Home page

Object organization:

  • Kind - Pod, Service, Deployment, etc (available object types)
  • Group - core, apps, storage, etc (grouped by similar functionality)
  • Version - v1, beta, alpha
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
spec:
  containers:
    - name: nginx
      image: nginx

Core (Legacy) group includes fundamental objects such as Pod, Node, Namespace, PersistentVolume, etc. Other objects are grouped under named API groups such as apps (Deployment), storage.k8s.io (StorageClass), rbac.authorization.k8s.io (Role) and so on.

Versioning follows Alpha->Beta->Stable lifecycle:

  • Alpha (v1alpha1) - disabled by default
  • Beta (v1beta1) - enabled by default, more stable, considered safe and tested, forward changes are backward compatible
  • Stable (v1) - backwards compatible, production ready

List of known API resources (also short info):

$ kubectl api-resources
$ kubectl api-resources --api-group=apps

# list api versions and groups
$ kubectl api-versions | sort

Request

API requests are RESTful (GET, POST, PUT, DELETE, PATCH)

Special API requests:

  • LOG - retrieve container logs
  • EXEC - execute command in a container
  • WATCH - get change notifications on a resource

API resource location:

  • Core API:
    • http://<apiserver>:<port>/api/<version>/<resource_type>
    • (in namespace) http://<apiserver>:<port>/api/<version>/namespaces/<namespace>/<resource_type>/<resource_name>
  • API groups
    • http://<apiserver>:<port>/apis/<group_name>/<version>/<resource_type>
    • (in namespace) http://<apiserver>:<port>/apis/<group_name>/<version>/namespaces/<namespace>/<resource_type>/<resource_name>

Response codes:

  • 2xx (success) - e.g. 201 (created), 202 (request accepted and performed async)
  • 4xx (client side errors) - e.g. 401 (unauthorized, not authenticated), 403 (access denied), 404 - not found
  • 5xx (server side errors) - 500 (internal error)

Curl

Get certificates for easy request writing:

$ export client=$(grep client-cert $HOME/.kube/config | cut -d" " -f6)
$ export key=$(grep client-key-data $HOME/.kube/config | cut -d" " -f6)
$ export auth=$(grep certificate-authority-data $HOME/.kube/config | cut -d" " -f6)

$ echo $client | base64 -d - > ./client.pem
$ echo $key | base64 -d - > ./client-key.pem
$ echo $auth | base64 -d - > ./ca.pem

Make requests using keys and certificates from previous step:

$ curl --cert client.pem \
		--key client-key.pem \
		--cacert ca.pem \
		https://k8sServer:6443/api/v1/pods

Another way to make authenticated request is to start a proxy session in the background:

# Run in a separate session or fg and Ctrl+C
$ kubectl proxy &
# Address (port) is displayed by previous command
$ curl localhost:8001/<request>

Custom Resource Definition

Custom resources can be part of the declarative API, which also requires a controller that is able to retrieve the structured data and maintain the declared state. There are two ways: a Custom Resource Definition (CRD) can be added to the cluster, or Aggregated APIs (AA) can be implemented via a new API server, which runs alongside the main apiserver (more flexible).

CRD objects can only use the same API functionality as built-in objects (respond to REST requests, configuration state validation and storage). New CRDs are registered under the apiextensions.k8s.io/v1 API path.

name must match the spec declared later. group and version become part of the REST API - /apis/<group>/<version> (e.g. /apis/stable.linux.com/v1), and are used as apiVersion in the resource manifest. scope is one of Namespaced or Cluster, and defines whether an object exists in a single namespace or is available cluster-wide. plural defines the last part of the API URL - /apis/stable.linux.com/v1/backups. singular and shortNames are used for display and CLI. kind is used in resource manifests.

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backups.stable.linux.com
spec:
  group: stable.linux.com
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
  scope: Namespaced
  names:
    plural: backups
    singular: backup
    shortNames:
      - bks
    kind: BackUp

spec field depends on the controller. Validation is performed by controller; only existence of the variable is checked by default.

apiVersion: "stable.linux.com/v1"
kind: BackUp
metadata:
  name: a-backup-object
spec:
  timeSpec: "* * * * */5"
  image: linux-backup-image
  replicas: 5

Finalizer

A Finalizer is an asynchronous pre-delete hook. Once a delete request is received, metadata.deletionTimestamp is set, and the controller then triggers the configured Finalizer.

metadata:
  finalizers:
    - finalizer.stable.linux.com

API server aggregation

A custom apiserver can be created to validate custom object(s); those could also already be "baked" into the apiserver without the need to use CRDs. The aggregated API is exposed at a central location and hides the complexity from clients. Each apiserver can use its own etcd store or use the core apiserver's store (in that case CRDs need to be created before creating instances of them).

Custom apiserver runs as a Pod and is exposed via Service. Integrate it with core apiserver using the object below:

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.extensions.example.com
spec:
  group: extensions.example.com # API group custom apiserver is responsible for
  version: v1alpha1 # supported version in the API group
  groupPriorityMinimum: 150
  versionPriority: 10
  service:
    name: <service_name>
    namespace: default

Manifest

Minimal Deployment manifest explained:

# API version
# If API changes, API objects follow and may introduce breaking changes
apiVersion: apps/v1

# Object type
kind: Deployment

metadata:
  # Required, must be unique to the namespace
  name: foo-deployment

# Specification details of an object
spec:
  # Number of pods
  replicas: 1
  # A way for a deployment to identify which pods are members of it
  selector:
    matchLabels:
      app: foo
  # Pod specifications
  template:
    metadata:
      # Assigned to each Pod, must match the selector
      labels:
        app: foo
    # Container specifications
    spec:
      containers:
        - name: foo
          image: nginx

Generate with --dry-run parameter:

$ kubectl create deployment hello-world \
            --image=nginx \
            --dry-run=client \
            -o yaml > deployment.yaml
  • Root metadata should have at least a name field.
  • generation represents a number of changes made to the object.
  • resourceVersion value is tied to etcd to help with concurrency of objects. Any changes in database will change this number.
  • uid - unique id of the object throughout its lifetime.

Pod

Pod is the smallest deployable object (not container). A Pod embodies the environment where containers live, and can hold one or more of them. If there are several containers in a Pod, they share all resources like networking (a unique IP is assigned to the Pod), access to storage, and Linux namespaces. Containers in a Pod start in parallel (there is no way to determine which container becomes available first, but initContainers run sequentially). Containers within a Pod can communicate via the loopback interface, by writing to files in a common filesystem, or via inter-process communication (IPC).

Secondary container may be used for logging, responding to requests, etc. Popular terms are sidecar, adapter, ambassador.

Pod states:

  • Pending - image is retrieved, but container hasn't started yet
  • Running - Pod is scheduled on a node, all containers are created, at least one is running
  • Succeeded - containers terminated successfully and won't be restarting
  • Failed - all containers have terminated, at least one of them with a failure status
  • Unknown - most likely a communication error between the control plane and kubelet
  • CrashLoopBackOff - one of the containers unexpectedly exited after it was restarted at least once (most likely the Pod isn't configured correctly); Kubernetes repeatedly makes new attempts

Specifying ports is purely informational and doesn't affect clients connecting to a Pod (it can even be omitted).

Containers that crash are restarted automatically by kubelet. The exit code is a sum of 2 numbers: 128 and x, where x is the signal number sent to the process that caused it to terminate, e.g. 137 = 128 + 9 (SIGKILL), 143 = 128 + 15 (SIGTERM). When a container is killed, a completely new container is created.

hostPID, hostIPC, hostNetwork Pod spec properties allow Pod to use host's resources - see process tree, network interfaces, etc.

imagePullPolicy set to Always commands the container runtime to contact the image registry every time a new Pod is deployed. This slows down Pod startup time, and can even prevent the Pod from starting if the registry is unreachable. Prefer a proper version tag instead of latest, and avoid setting imagePullPolicy to Always.

Environment variables

User defined environment variables are defined in the Pod's spec (per container) as key/value pairs or via the valueFrom parameter referencing some location or another Kubernetes resource.

System defined environment variables include Service names in the same namespace available at the time of the Pod's creation.

Neither type can be updated once the Pod is created.

Refer to another variable using $(VAR) syntax:

env:
  - name: FIRST_VAR
    value: "foo"
  - name: SECOND_VAR
    value: "$(FIRST_VAR)foo"

Pause container

The pause container is used to provide shared Linux namespaces to user containers. This container is not seen within Kubernetes, but can be discovered with container engine tools.

For example, an IP address is acquired prior to the other containers and is then used in the shared network namespace. Containers will have an eth0@tunl0 interface. The IP persists throughout the life of the Pod.

If the pause container dies, kubelet recreates it and all of the Pod's containers.


Init container

An initContainer runs (and must successfully complete) before main application containers. Multiple init containers can be specified, in which case they run sequentially (in Pod spec order). Primary use cases are setting up the environment, separating duties (different storage and security settings), and environment verification (blocking main application startup if the environment is not properly set up).

spec:
  containers:
    - name: main-app
      image: databaseD
  initContainers:
    - name: wait-database
      image: busybox
      command: ['sh', '-c', 'until ls /db/dir ; do sleep 5; done; ']

Static Pod

A static Pod is managed directly by kubelet (not the apiserver) on a node. The Pod's manifest is placed in a specific location on the node (staticPodPath in kubelet's configuration file) that kubelet continuously watches (files starting with dots are ignored). The default location is /etc/kubernetes/manifests.

kubelet automatically creates a mirror Pod for each static Pod to make them visible in the apiserver, but they can not be controlled from there; deleting such a Pod through the apiserver will not affect the static Pod, and the mirror Pod will be recreated.

kubelet can also fetch web-hosted static Pod manifest.

Control plane component manifests (built by kubeadm) - etcd, apiserver, controller-manager, scheduler - are static Pods.

Resources

The resources section in a container's spec is used to specify the desired and maximum amount of resources (CPU, memory) a container requests/is expected to use. A Pod's resources are the sum of the resources of the containers it contains. If limits are set, but requests are not, requests are set to the limits values.

Example:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  labels:
    app: hog
  name: hog
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: hog
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: hog
    spec:
      containers:
        - image: vish/stress
          imagePullPolicy: Always
          name: stress
          resources:
            limits:
              cpu: "1"
              memory: "2Gi"
            requests:
              cpu: "0.5"
              memory: "500Mi"
          args:
            - -cpus
            - "2"
            - -mem-total
            - "950Mi"
            - -mem-alloc-size
            - "100Mi"
            - -mem-alloc-sleep
            - "1s"
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

A ResourceQuota object can set hard limits (and for more types of resources) in a namespace, thus applying to multiple objects at once.

scheduler has 2 priority functions related to resources - LeastRequestedPriority (prefer nodes with more unallocated resources) and MostRequestedPriority (prefer nodes with less unallocated resources). Only one of the functions can be configured for the scheduler at a time. The latter makes sure Pods are tightly packed, resulting in fewer nodes in total required to run a workload.

top command inside container shows memory and CPU of the whole node it is running on, not of the container (with set limits).

Requests

requests represent minimum amount of resources a container needs to run properly. scheduler includes this information in its decision making process. It considers only nodes with enough unallocated resources to meet requests of a Pod.

CPU can be specified as whole cores or millicores (1 = 1000m). If no limits are set, but requests are, Pods share spare resources in the same proportion as their requests - 2 Pods with 200m and 1000m respectively will share all remaining CPU in a 1 to 5 ratio (if one Pod is idle, the other one is still allowed to use all available CPU until the first one needs more again).

Limits

limits set the maximum amount of a resource that a container can consume. While CPU is a compressible resource, meaning the amount used by a container can be throttled, memory is incompressible - once a chunk is given, it can only be released by the container itself. Thus, always set a limit for memory.

If the CPU limit is exceeded, the container isn't given more CPU time. If the memory limit is exceeded, the container is killed (OOMKilled, Out Of Memory).

Configured CPU limits can be viewed directly in the container's filesystem:

  • /sys/fs/cgroup/cpu/cpu.cfs_quota_us
  • /sys/fs/cgroup/cpu/cpu.cfs_period_us

Quality of Service

QoS defines the priority between Pods and determines the order in which Pods get killed in an overcommitted system. A Pod's QoS is determined based on the QoS of all its containers. If all containers are assigned the Best effort or the Guaranteed class, the Pod's class is the same; any other combination results in the Burstable class.

  • Best effort (lowest) - assigned to containers that don't have requests or limits set.
  • Guaranteed (highest) - assigned to Pods that have requests set equal to limits both for CPU and memory.
  • Burstable - all other containers fall within this class.

Best effort Pods are killed before Burstable and both are killed before Guaranteed Pods, which in turn can be killed, if system Pods need more resources.

For Pods in the same class the OOM score is used; the highest score gets killed first. The OOM score is based on:

  • percentage of the available memory the process is using (a container using more of its requested memory gets killed first)
  • a fixed score adjustment based on the Pod's QoS class and the container's requested memory

QoS is shown on kubectl describe and in status.qosClass field of YAML file.

LimitRange

Provides a means to specify resource limits that objects can consume in a namespace. Applies to each individual Pod/container, not the total consumption in the namespace.

The LimitRange resource is used by the LimitRange Admission Control plugin - when a Pod spec is posted to the apiserver, its contents are validated before being applied. A common practice is to set the maximum no higher than the biggest node can provide, otherwise the apiserver would still accept a Pod with a resource request that can't be satisfied.

min, max, etc refer to both limits and requests unless a setting with the Request suffix is also present, in which case the former specify only limits. At the Pod level only min/max limits can be set. At the container level default limits (default) and default requests (defaultRequest) can be set, which are applied if an object didn't provide values at all. PVC min/max can also be set in a LimitRange resource. All settings can be specified in a single resource or be split into multiple, for example, by type.

apiVersion: v1
kind: LimitRange
metadata:
  name: example
spec:
  limits:
    - type: Pod
      min:
        cpu: 50m
        memory: 5Mi
      max:
        cpu: 1
        memory: 1Gi
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 10Mi
      default:
        cpu: 200m
        memory: 100Mi
      min:
        cpu: 50m
        memory: 5Mi
      max:
        cpu: 1
        memory: 1Gi
      maxLimitRequestRatio:
        cpu: 4
        memory: 10
    - type: PersistentVolumeClaim
      min:
        storage: 1Gi
      max:
        storage: 10Gi

ResourceQuota

A ResourceQuota resource limits the amount of computational resources Pods can use, the amount of storage PVCs can claim, and the total number of API resources that can exist in a namespace.

When a quota is set on a specific resource (CPU or memory), Pods also need values set for that resource. Therefore, a common practice is to provide a LimitRange resource with defaults alongside the ResourceQuota.

Used by the ResourceQuota Admission Control plugin, which checks if a posted Pod spec violates the rules set by ResourceQuota resources. Therefore, it doesn't affect Pods already running in a namespace, only newly posted ones.

Quotas can also be applied to a specific quota scope: BestEffort, NotBestEffort (QoS), Terminating, NotTerminating. The last 2 scopes are related to the activeDeadlineSeconds setting in the Pod spec, which configures the maximum duration a Pod can be active on a node relative to its start time. The Terminating scope covers Pods that have this setting set, while NotTerminating covers Pods without it.

  • max values for requests and limits:
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: cpu-and-mem
    spec:
      hard:
        requests.cpu: 400m
        requests.memory: 200Mi
        limits.cpu: 600m
        limits.memory: 500Mi
  • limit amount of storage to be claimed by PVCs:
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: storage
    spec:
      hard:
        requests.storage: 500Gi # overall
        ssd.storageclass.storage.k8s.io/requests.storage: 300Gi # for particular class
        standard.storageclass.storage.k8s.io/requests.storage: 1Ti # for particular class
  • Limit number of API objects that can be created:
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: objects
    spec:
      hard:
        pods: 10
        replicationcontrollers: 5
        secrets: 10
        configmaps: 10
        persistentvolumeclaims: 4
        services: 5
        services.loadbalancers: 1
        services.nodeports: 2
        ssd.storageclass.storage.k8s.io/persistentvolumeclaims: 2
  • Apply quota to specific scope (Pod must match all for them to apply):
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: besteffort-notterminating-pods
    spec:
      scopes:
        - BestEffort
        - NotTerminating
      hard:
        pods: 4
# Show how much of a quota is currently used
$ kubectl describe quota

Probes

Probes let you run custom health checks on container(s) in a Pod.

  • livenessProbe is a continuous check to see if a container is running. The restart policy is applied on a failure event.
  • readinessProbe is a diagnostic check to see if a container is ready to receive requests. On a failure event the Pod's IP address is removed from the Endpoints object (the restart policy is not applied). Usually used to protect applications that temporarily can't serve requests. If an app isn't ready to serve requests and no readiness probe is configured, clients see "connection refused" type errors.
  • startupProbe is a check run during the startup process, ensuring the container gets to a Ready state. All other probes are disabled until the startupProbe succeeds. On a failure event the restart policy is applied. Usually used for applications requiring long startup times.

Probes can be defined using 3 types of handlers: command, HTTP, and TCP.

  • command's exit code of zero is considered healthy:
     exec:
       command:
         - cat
         - /tmp/ready
  • HTTP GET request return code >= 200 and < 400:
     [...]
     httpGet:
       path: /healthz
       port: 8080
  • successful attempt establishing TCP connection:
     [...]
     tcpSocket:
       port: 8080

Settings:

Name                  Default   Description
initialDelaySeconds   0s        number of seconds after a container has started before running probes
periodSeconds         10s       how frequently to run probes
timeoutSeconds        1s        execution time for a probe before declaring a failure (probe returns Unknown status)
failureThreshold      3         number of missed checks to declare a failure
successThreshold      1         number of successful probes after a failure to consider a container healthy
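
Putting handlers and settings together, a minimal sketch (container name, image, path and port are illustrative):

spec:
  containers:
    - name: web
      image: nginx
      livenessProbe:
        httpGet:
          path: /healthz
          port: 80
        initialDelaySeconds: 10
        periodSeconds: 5
      readinessProbe:
        tcpSocket:
          port: 80
        failureThreshold: 3
        successThreshold: 1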

Lifecycle

Stopping/terminating a Pod: when a stop command is sent to a Pod, SIGTERM is sent to its containers and the Pod's status is set to Terminating. If a container is not terminated by the end of the grace period timer (default 30s), SIGKILL is sent, and apiserver and etcd are updated.

# To immediately delete records from API and etcd, if termination is stuck
# Still have to clean up resources manually
$ kubectl delete pod <name> --grace-period=0 --force

Container(s) in a Pod can restart independently of the Pod. The restart process is protected by an exponential backoff - 10s, 20s, 40s and up to 5m. It resets to 0s after 10m of continuous successful running.

Restart policy:

  • Always (default) - restarts all containers in a Pod, if one stops running
  • OnFailure - restarts only on non-graceful termination (non-zero exit codes)
  • Never

Shutdown process:

  1. Deleting Pod's object via apiserver. apiserver sets deletionTimestamp field, which also makes Pod to go to terminating state
  2. kubelet stops each container in a Pod with a grace period, which is configurable per Pod
    1. Pre-stop hook (if configured) and wait for it to finish
    2. Send SIGTERM to the main process
    3. Wait for clean shutdown or grace period (grace period countdown starts from pre-hook)
    4. Send SIGKILL
  3. When all containers stop, kubelet notifies apiserver and Pod object is deleted. Force delete an object with --grace-period=0 --force options.

terminationGracePeriodSeconds is 30 seconds by default. Can be specified in the Pod's spec and also overridden, when deleting the Pod:

$ kubectl delete pod foo --grace-period=5

Tip: the best way to ensure orphaned data is not lost and is migrated to the remaining Pod(s) is to configure a CronJob or a continuously running Pod that scans for such events and triggers/manages the migration of the data.

Lifecycle hooks

Lifecycle hooks are specified per container, and either perform a command inside a container or perform an HTTP GET against URL.

A post-start hook executes immediately after the container's main process is started. It doesn't wait for that process to start fully, and runs in parallel (asynchronously). Until the hook completes, the container stays in the Waiting state and the Pod in the Pending state accordingly. If the hook fails, the main container is killed. Logs written to stdout aren't visible anywhere; in case of an error a FailedPostStartHook warning is written to the Pod's events (make the post-start hook write logs to the filesystem for easy debugging).

spec:
  containers:
    - name: foo
      image: bar
      lifecycle:
        postStart:
          exec:
            command:
              - sh
              - -c
              - "echo postStart hook ran"

A pre-stop hook executes immediately before a container is terminated - first the configured hook is run, then SIGTERM is sent, and lastly SIGKILL, if the container is unresponsive. Regardless of the status of the pre-stop hook the container will be terminated; on failure a FailedPreStopHook warning is written to the Pod's events (it might go unnoticed, since the Pod is deleted shortly after).

lifecycle:
  preStop:
    httpGet:
      port: 8080
      path: shutdown

Tip: in many cases a pre-stop hook is used to pass SIGTERM to the application, because it seems Kubernetes (kubelet) doesn't send it. This may happen if the image is configured to run a shell, which in turn runs the application - in this case the shell "eats up" the signal. Either handle the signal in the shell script and pass it to the application, or use the exec form of ENTRYPOINT or CMD and run the application directly.

A pre-stop hook can also be used to ensure graceful termination of client requests. When Pod termination is initiated, the route for updating iptables (apiserver -> Endpoints controller -> apiserver -> kube-proxy -> iptables) is considerably longer than the one to remove the Pod (apiserver -> kubelet -> container(s)). Some meaningful delay (5-10 seconds) may be enough to ensure iptables are updated and no new requests are accepted. The application then handles and waits for all active connections to finish, closes inactive ones, and shuts down completely after the last active request is completed.
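
A common pattern is a short sleep in the pre-stop hook; the 5 second value below is just the kind of delay discussed above, not a prescribed number:

lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 5"]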

Security context

Security-related features can be specified under the Pod or individual container spec. Pod level settings serve as defaults for containers, which can override them. Configuring a security context.

Some options are:

  • runAsUser - specify user ID
  • runAsNonRoot - specify true to enforce that the container runs as a non-root user
  • privileged - specify true to allow the Pod to do anything on a node (use protected system devices, kernel features, devices, etc)
  • capabilities - specify individual kernel capabilities to add or drop for a container (Linux kernel capabilities are usually prefixed with CAP_; when specifying them in a Pod spec, leave out the prefix)
  • readOnlyRootFilesystem - mount the container's root filesystem as read-only, so processes can only write to mounted volumes
  • fsGroup - special supplemental group, applies to all volumes attached to the Pod (if the volume plugin allows it); can be used to share volumes between containers that run as different users
  • supplementalGroups - list of additional group IDs the user is associated with

By default a container runs as the user defined in the image (in the Dockerfile; if the USER directive is omitted, it defaults to root).
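
A sketch combining a few of the options above (user/group IDs and the image are arbitrary examples):

spec:
  securityContext:        # Pod-level defaults
    runAsNonRoot: true
    fsGroup: 2000
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      securityContext:    # container-level overrides
        runAsUser: 1000
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]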

Metadata

The Downward API allows passing information about the Pod's metadata to the containers inside, via environment variables or files (a downwardAPI volume): for example, the Pod's name, IP address, namespace, labels, annotations and so on.

Labels and annotations can change during a Pod's lifecycle. Since environment variables cannot be updated, labels and annotations can only be exposed via the downwardAPI volume - Kubernetes continuously updates it when changes occur.

Since the volume is defined at the Pod level, exposing resource fields via the volume requires the container's name. However, this way a container can access the resource request data of other containers in the Pod.

  • Environment variables (resource limits require a divisor parameter - actual value is divided by divisor)
    kind: Pod
    spec:
      containers:
        - name: main
          ...
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: CPU_REQUESTS_MILLICORES
              valueFrom:
                resourceFieldRef:
                  resource: requests.cpu
                  divisor: 1m # specify which unit
            - name: MEMORY_LIMIT_KIBIBYTES
              valueFrom:
                resourceFieldRef:
                  resource: limits.memory
                  divisor: 1Ki # specify which unit
  • downwardAPI (each item is exposed as a file, path parameter serving as a file name):
    kind: Pod
    metadata:
      labels:
        foo: bar
    spec:
      containers:
        - name: main
          volumeMounts:
            - name: downward
              mountPath: /etc/downward
      volumes:
        - name: downward
          downwardAPI:
            items:
              - path: "podName"
                fieldRef:
                  fieldPath: metadata.name
              - path: "labels"
                fieldRef:
                  fieldPath: metadata.labels
              - path: "cpuRequestMilliCores"
                resourceFieldRef:
                  containerName: main
                  resource: requests.cpu
                  divisor: 1m

Controllers

Controllers (or operators) are series of watch-loops that query the apiserver for a particular object state and modify the object until the desired state is achieved. Kubernetes comes with a set of default controllers, while others can be added using custom resource definitions.

Create custom controller via operator framework.

ReplicaSet

Deploys and maintains a defined number of Pods. Usually not used by itself, but through the Deployment controller. Consists of a selector, a number of replicas and a Pod spec.

A ReplicaSet selector can be of the matchLabels or matchExpressions type (the older ReplicationController object only allowed direct matching of key/value pairs). The latter allows the use of the operators In, NotIn, Exists and DoesNotExist.

matchExpressions:
  - key: foo
    operator: In
    values:
      - bar

Deployment

Manages the state of a ReplicaSet and the Pods within, thus providing flexibility with updates and administration. A rolling update is performed by creating a second ReplicaSet and increasing/decreasing the Pods in the two sets. It is also possible to roll back to a previous version, or pause the Deployment and make changes.

Designed for stateless applications, like web front end that doesn't store data or application state to a persistent storage.

Changes in the configuration file automatically trigger a rolling update. You can pause, resume and check the status of this behavior. An exit code of 0 for the status command indicates success, while non-zero indicates failure. If the Deployment is paused, the undo command won't do anything until the Deployment is resumed.

$ kubectl rollout [pause|resume|status] deployment <name>

$ kubectl rollout history deployment <name>
# Get detailed info
$ kubectl rollout history deployment <name> --revision=<number>

# Roll back to a previous version
$ kubectl rollout undo deployment <name>
# Roll back to a specific version
$ kubectl rollout undo deployment <name> --to-revision=<number>

# Restart all Pods. New ReplicaSet is created with the same Pod spec.
# Specified update strategy is applied.
$ kubectl rollout restart deployment <name>

Pod names are constructed as follows - <deployment_name>-<pod_template_hash>-<pod_id>. pod_template_hash is unique ReplicaSet hash within Deployment. pod_id is unique Pod identifier within ReplicaSet.

Create Deployment:

  • declaratively via YAML file:
     $ kubectl apply -f <deployment_file>
  • imperatively using kubectl create command:
     $ kubectl create deployment <name> \
     			--image <image>:<tag> \
     			--replicas <number> \
     			--labels <key>:<value> \
     			--port <port_number> \
     			--generator deployment/apps.v1 \ # api version to be used
     			--save-config # saves the yaml config for future use

To keep desired replica count the same, even when applying changes, do not include it in the YAML when using kubectl apply.


Update strategy

RollingUpdate (default) strategy - the new ReplicaSet starts scaling up, while the old one starts scaling down. maxUnavailable specifies the number of Pods from the total number in a ReplicaSet that can be unavailable (rounded down), maxSurge specifies the number of Pods allowed to run concurrently on top of the total number of replicas in a ReplicaSet (rounded up). Both can be specified as a number of Pods or a percentage.

Recreate strategy - all old Pods are terminated before new ones are created. Used when two versions can't run concurrently.

Other strategies that can be implemented:

  • blue/green deployment - create a completely new Deployment of the application with the new version. Traffic can be redirected using Services. Good for testing; the disadvantage is doubled resources
  • canary deployment - based on blue/green, but traffic is shifted gradually to the new version. This is achieved by not specifying the app's version in the Service selector and just creating Pods of the new version. Can also be achieved by pausing a rolling update.

Related settings:

  • progressDeadlineSeconds - time in seconds until a progress error is reported (image issues, quotas, limit ranges)
  • revisionHistoryLimit (default 10) - how many old ReplicaSet specs to keep for rollback

StatefulSet

Manages stateful applications with a controller. Provides network names, persistent storage and ordered operations for scaling and rolling updates.

Each Pod maintains a persistent identity and has an ordinal index with a matching Pod name, a stable hostname, and stably identified storage. The ordinal index is just a unique zero-based sequential number given to each Pod, representing its order in the sequence of Pods. Deployment, scaling and updates are performed based on this index. For example, the second Pod waits until the first one is ready and running before it is deployed. Scaling down and updates happen in reverse order. This can be changed with the Pod management policy, where OrderedReady is the default and can be switched to Parallel. Each Pod has its own unique PVC, which uses the ReadWriteOnce access mode.

Examples are database workloads, caching servers, application state for web farms.

Naming must be persistent and consistent, as stateful application often needs to know exactly where data resides. Persistent storage ensures data is stored and can be retrieved later on. Headless service (without load balancer or cluster IP) allows applications to use cluster DNS to locate replicas by name.

Governing Service

A StatefulSet requires a Service to control its networking. This is a headless Service, thus each Pod gets its own DNS entry. With a foo Service in the default namespace, a Pod named a-0 will have the FQDN a-0.foo.default.svc.cluster.local. example.yml specifies a headless service with no load balancing by using the clusterIP: None option.

A headless Service also creates SRV records, which point to the individual Pods inside the StatefulSet. Thus, each Pod can perform an SRV DNS lookup to find out its peers.
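
A minimal sketch of such a governing Service (name, selector and port are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: foo
spec:
  clusterIP: None # headless: per-Pod DNS records, no load balancing
  selector:
    app: foo
  ports:
    - port: 80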

StatefulSet storage

StatefulSet keeps its state by keeping data in PVs. volumeClaimTemplates section is used to define a template, which is then used to create PVC for each Pod. StatefulSet automatically adds volume inside Pod's spec and configures it to be bound to PVC.

PVCs are not deleted automatically on a scale-down event to prevent the deletion of potentially important data. Therefore, PVCs and PVs are to be removed manually.
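
A sketch of a volumeClaimTemplates section (claim name, storage class and size are assumptions for illustration):

volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard
      resources:
        requests:
          storage: 1Gi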

DaemonSet

Ensures that a specific single Pod is always running on all or some subset of the nodes. If new nodes are added, DaemonSet will automatically set up Pods on those nodes with the required specification. The word daemon is a computer science term meaning a non-interactive process that provides useful services to other processes.

Examples include logging (fluentd), monitoring, metric and storage daemons.

The RollingUpdate (default) update strategy terminates old Pods and creates new ones in their place. maxUnavailable can be set to an integer or percentage value, the default is 1. With the OnDelete strategy old Pods are not removed automatically; only when the administrator removes them manually are new Pods created.
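
A sketch of a DaemonSet with an explicit update strategy (names and image are illustrative):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent
spec:
  selector:
    matchLabels:
      app: log-agent
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      containers:
        - name: agent
          image: fluentd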

Job

Defines, runs and ensures that a specified number of Pods successfully terminate.

restartPolicy must be set to either OnFailure or Never, since the default policy is Always. In case of restarts, failed Pods are recreated with an exponentially increasing delay: 10, 20, 40... seconds, up to a maximum of 6 minutes.

No matter how Job completes (success or failure) Pods are not deleted (for logs and inspection). Administrator can delete Job manually, which will also delete Pods.

  • activeDeadlineSeconds - max duration time, has precedence over backoffLimit
  • backoffLimit - number of retries before being marked as Failed, defaults to 6
  • completions - number of Pods that need to finish successfully
  • parallelism - max number of Pods that can run simultaneously

Execute from cli:

$ kubectl run pi --image perl --restart Never -- perl -Mbignum -wle 'print bpi(2000)'
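
The same kind of Job can be expressed declaratively; a minimal sketch using the settings above (names and values are illustrative):

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  completions: 5
  parallelism: 2
  backoffLimit: 4
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pi
          image: perl
          command: ["perl", "-Mbignum", "-wle", "print bpi(2000)"]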

Parallel

Parallel Job can launch multiple Pods to run the same task. There are 2 types of parallel Jobs - fixed task completion count and a work queue.

A work queue is created by leaving the completions field empty. The Job controller launches the specified number of Pods simultaneously and waits until one of them signals successful completion. Then it stops and removes all Pods.

For a Job with both completions and parallelism set, the controller won't start new containers if the remaining number of completions is less than the parallelism value.

CronJob

Create and manage Jobs on a defined schedule. CronJob is created at the time of submission to apiserver, but Job is created on schedule.

  • suspend - set to true to not run Jobs anymore
  • concurrencyPolicy - Allow (default), Forbid, or Replace. Depending on how frequently Jobs are scheduled and how long it takes to finish a Job, CronJob might end up executing more than one job concurrently.

In some cases a Job may not run during its time window or may run twice, thus the requested Pod should be idempotent. startingDeadlineSeconds ensures a Pod starts no later than X seconds after the scheduled time. If the Pod doesn't start, no new attempts will be made and the Job will be marked as failed.

Kubernetes retains a number of successful and failed Jobs in history, by default 3 and 1 respectively. The successfulJobsHistoryLimit and failedJobsHistoryLimit options may be used to control this behavior. Deleting a CronJob also deletes all of its Pods.
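
A sketch of a CronJob using the options discussed above (schedule, names and image are illustrative):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 60
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: busybox
              command: ["sh", "-c", "echo backing up"]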

Scheduling

The job of a scheduler is to assign new Pods to nodes. Default is kube-scheduler, but a custom one can be written and set. Multiple schedulers can work in parallel.

Node selection goes through 3 stages:

  • Filtering - remove nodes that can not run the Pod (apply hard constraints, such as available resources, nodeSelectors, etc)
  • Scoring - gather list of nodes that can run the Pod (apply scoring functions to prioritize node list for the most appropriate node to run the workload); ensure Pods of the same service are spread evenly across nodes, node affinity and taints are also applied
  • Binding - updating node name in Pod's object

A PriorityClass resource and the priorityClassName Pod setting can be used to evict lower priority Pods so that higher priority ones can be scheduled (the scheduler determines a node where a pending Pod could run if one or more lower priority Pods were evicted). A PodDisruptionBudget resource can limit the number of Pods to be evicted and ensure enough Pods are running at all times, but it can still be violated by the scheduler if no other option is available. Either a percentage or an absolute number can be specified for the minAvailable or maxUnavailable setting.

End result of a scheduling process is assigning a Binding (Kubernetes API object in api/v1 group) to a Pod that specifies where it should run. Can also be assigned manually without any scheduler.

To manually schedule a Pod to a node (bypassing the scheduling process), specify nodeName (the node must already exist); resource constraints still apply. This way a Pod can still run on a cordoned node, since scheduling is effectively disabled and the node is assigned directly.

Custom scheduler can be implemented; also multiple schedulers can run concurrently. Custom scheduler is packed and deployed as a system Pod. Default scheduler code. Define which scheduler to use in Pod's spec, if none specified, default is used. If specified one isn't running, the Pod remains in Pending state.

Scheduling policy

Priorities are functions used to weight resources. By default, node with the least number of Pods will be ranked the highest (unless SelectorSpreadPriority is set). ImageLocalityPriorityMap favors nodes that already have the container image. cp/pkg/scheduler/algorithm/priorities contains the list of priorities.

Example file for a scheduler policy:

kind: Policy
apiVersion: v1
predicates:
  - name: MatchNodeSelector
    order: 6
  - name: PodFitsHostPorts
    order: 2
  - name: PodFitsResources
    order: 3
  - name: NoDiskConflict
    order: 4
  - name: PodToleratesNodeTaints
    order: 5
  - name: PodFitsHost
    order: 1
priorities:
  - name: LeastRequestedPriority
    weight: 1
  - name: BalancedResourceAllocation
    weight: 1
  - name: ServiceSpreadingPriority
    weight: 2
  - name: EqualPriority
    weight: 1
hardPodAffinitySymmetricWeight: 10

Typically passed as --policy-config-file and --scheduler-name parameters. This would result in 2 schedulers running in a cluster. Client can then choose one in Pods spec.

Node selector

Assign labels to nodes and use nodeSelector on Pods to place them on certain nodes. A simple key/value check based on matchLabels. Usually used to apply a hardware specification (hard disk, GPU) or workload isolation. All selectors must be met, but the node can have additional labels.

nodeName can be used to schedule a Pod to a specific single node.

Affinity

Like nodeSelector, affinity uses labels on nodes to make scheduling decisions, but with matchExpressions. matchLabels can still be used with affinity as well for simple matching.

  • nodeAffinity - use labels on nodes (should some day replace nodeSelector)
  • podAffinity - try to schedule Pods together using Pod labels (same nodes, zone, etc)
  • podAntiAffinity - keep Pods separately (different nodes, zones, etc)

Scheduling conditions:

  • requiredDuringSchedulingIgnoredDuringExecution - Pod is scheduled only if all conditions are met (hard rule)
  • preferredDuringSchedulingIgnoredDuringExecution - Pod gets scheduled even if a node with all matching conditions is not found (soft rule, preference); weight 1 to 100 can be assigned to each rule

Affinity rules use In, NotIn, Exists, and DoesNotExist operators. A matching label is only required when the Pod is scheduled; the Pod keeps running if the label is later removed (hence IgnoredDuringExecution). However, requiredDuringSchedulingRequiredDuringExecution is planned for the future.
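
A nodeAffinity sketch combining a hard and a soft rule (label keys and values are assumptions):

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values:
                  - ssd
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - eu-west-1a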

Schedule caching Pod on the same node as a web server Pod.

spec:
  containers:
    - name: cache
  ...
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - webserver
          topologyKey: "kubernetes.io/hostname"

The scheduler takes other Pods' affinity rules into account, even if the Pod being scheduled doesn't define any (InterPodAffinityPriority) - this ensures that other Pods' affinity rules don't break, if the initial Pod was deleted by accident.

topologyKey can be any label on the node (with some exceptions). Such label must be present on all nodes, otherwise, it could lead to undefined behavior. Some well-known labels for spreading or putting Pods together:

  • kubernetes.io/hostname - host
  • topology.kubernetes.io/zone - availability zone
  • topology.kubernetes.io/region - region

By default labelSelector matches only Pods in the same namespace as the Pod being scheduled. Pods from other namespaces can also be selected by adding namespaces field on the same level as labelSelector.

Taints and tolerations

Opposite of selectors - keeps Pods from being placed on certain nodes. A taint prevents scheduling, while a toleration allows a Pod to ignore the taint and be scheduled as normal.

$ kubectl taint nodes <node_name> <key>=<value>:<effect>
$ kubectl taint nodes <node_name> key1=value1:NoSchedule
# Remove a taint
$ kubectl taint nodes <node_name> <key>:<effect>-

Effects:

  • NoSchedule - do not schedule Pod on a node, unless toleration is present; all existing Pods continue to run
  • PreferNoSchedule - try to avoid particular node; all already running Pods are unaffected
  • NoExecute - evict all existing Pods, unless they have a matching toleration, and do not schedule new Pods; tolerationSeconds can specify for how long a Pod can run before being evicted (in certain cases a default of 300 seconds is added to avoid unnecessary evictions)

Toleration with NoExecute effect and tolerationSeconds setting can be used to configure when Pods on unresponsive nodes should be rescheduled.

Default operator is Equal, which is used to tolerate a specific value. Exists is used to tolerate all values for a specific key; a value generally should not be specified with it. If an empty key uses the Exists operator, it will tolerate every taint. If effect is not specified, but a key and operator are declared, all effects are matched.

All parts have to match the taint on the node:

spec:
  containers:
  ...
  tolerations:
  - key: <key>
    operator: "Equal"
    value: <value>
    effect: NoSchedule
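
A toleration using Exists with the NoExecute effect and tolerationSeconds, roughly what gets added by default for the not-ready node taint:

tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300   # Pod may keep running on a not-ready node for 5 minutes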

Node cordoning

Marks node as unschedulable, preventing new Pods from being scheduled, but does not remove already running Pods. Used as preparatory step before reboot or maintenance.

# Mark node as unschedulable
$ kubectl cordon <node>

# Mark node as unschedulable
# Gracefully evict Pods
# Optionally ignore DaemonSets, since e.g. `kube-proxy` is deployed as a DaemonSet
$ kubectl drain <node> --ignore-daemonsets

# Mark node as schedulable again
$ kubectl uncordon <node>

Pods that are not managed by a controller won't be removed by draining the node. Add --force option to remove them as well.

Autoscaling

# Manual scaling
$ kubectl scale <object_type> <name> --replicas=<number>

HorizontalPodAutoscaler

Automatically scales Replication Controller, ReplicaSet, StatefulSet or Deployment based on resource utilization percentage, such as CPU and memory by updating replicas field. Modification is made through Scale sub-resource, which is exposed for previously mentioned objects only (Autoscaler can operate on any resource that exposes Scale sub-resource).

Custom metrics can also be used. If multiple metrics are specified, target Pod count is calculated for each, then highest value is used.

At most double the current number of Pods can be added in a single operation, if there are more than 2 currently running Pods; with 2 or fewer - at most 4 Pods in a single step. Scale-up can happen at most once in 3 minutes, scale-down - once in 5 minutes.

# Create HPA resource
$ kubectl autoscale deployment <name> \
				--min=5 \
				--max=15 \
				--cpu-percent=75

Autoscaling has a thrashing problem, that is when the target metric changes frequently, which results in frequent up/down scaling. Use --horizontal-pod-autoscaler-downscale-delay option to control this behavior (by specifying a wait period before next down scale; default is 5 minute delay).

Resource metric type

Container resource metrics (defined in resource requests).

CPU usage percentage is based on CPU requests setting, which means that it needs to be present on the Pod. Usage percentage can be over 100%, because Pod can use more than requested amount of CPU.

Memory usually isn't a good metric, because the application itself has to release memory. If new Pods (after killing old ones) don't use less memory, Kubernetes will keep adding new Pods until the limit is reached.
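
A Resource metric sketch in the same older autoscaling/v2beta1 style as the examples below (newer autoscaling/v2 wraps the value in a target block):

spec:
  metrics:
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 75   # percentage of the CPU request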


Pod metric type

Any other (including custom) metric related to Pod directly, such as queries-per-second, message queue size, etc

spec:
  metrics:
    - type: Pods
      pods:
        metricName: qps
        targetAverageValue: 100

Object metric type

Metrics that don't relate to Pods directly, such as average request latency on an Ingress object. Unlike other types, where an average is taken across all Pods, a single value is acquired.

spec:
  metrics:
    - type: Object
      object:
        metricName: latencyMillis
        target:
          apiVersion: extensions/v1beta1
          kind: Ingress
          name: frontend
        targetValue: 20
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: kubia

Vertical Pod Autoscaler

Runs as a separate deployment, adjusts the amount of CPU and memory requested by Pods. Refer to the project.

Cluster Autoscaler

Adds or removes node(s) based on inability to deploy Pods or having low utilized nodes. Contacts cloud provider API to add a new node. Best node is determined based on available or already deployed node groups. Refer to project page for deployment options for particular cloud.

Karpenter, currently supports AWS only.

Resource management

ResourceQuota

Define limits for total resource consumption in a namespace. Applying ResourceQuota with a limit less than already consumed resources doesn't affect existing resources and objects consuming them.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: storagequota
spec:
  hard:
    persistentvolumeclaims: "10"
    requests.storage: "500Mi"

LimitRange

Define limits for resource consumption per object (see the sketch after this list). For example:

  • min/max compute resource per Pod or container
  • min/max storage request per PersistentVolumeClaim
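
A minimal LimitRange sketch for per-container defaults and bounds (all values are examples):

apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container specifies no requests
        cpu: 100m
        memory: 128Mi
      default:               # applied when a container specifies no limits
        cpu: 500m
        memory: 256Mi
      max:
        cpu: "1"
        memory: 1Gi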

Observability

Namespace

Namespaces abstract a single physical cluster into multiple virtual clusters. They provide scope for naming resources like Pods, controllers and Deployments. Primarily used for resource isolation/management. User can create namespaces, while Kubernetes has 4 default ones:

  • default - for objects with no namespace defined
  • kube-system - for objects created by Kubernetes itself (ConfigMap, Secrets, Controllers, Deployments); by default these items are excluded, when using kubectl command (can be viewed explicitly)
  • kube-public - for objects publicly readable for all users
  • kube-node-lease - worker node lease info

Creating a Namespace also creates DNS subdomain <ns_name>.svc.<cluster_domain>, thus, Namespace name can not contain dots, otherwise follows RFC 1035 (Domain name) convention.

Can also be used as a security boundary for RBAC or naming boundary (same resource name in different namespaces). A given object can exist only in one namespace. Not all objects are namespaced (generally physical objects like PersistentVolumes and Nodes).

$ kubectl api-resources --namespaced=true
$ kubectl api-resources --namespaced=false

# List all resources in a namespace
$ kubectl api-resources --verbs=list --namespaced -o name \
  | xargs -n 1 kubectl get --show-kind --ignore-not-found -n <namespace>

Namespace can be specified in command line or manifest file. Deleting a Namespace deletes all resources inside of it as well. Namespace is defined in metadata section of an object.

apiVersion:
kind:
metadata:
  namespace:

# Create namespace
$ kubectl create namespace <namespace_name>

# Set default namespace
$ kubectl config set-context --current --namespace=<namespace_name>
# Validate it
$ kubectl config view --minify | grep namespace

Label

Labels enable managing objects or collection of objects by organizing them into groups, including objects of different types. Label selectors allow querying/selecting multiple objects. Kubernetes also leverages Labels for internal operations.

Non-hierarchical key/value pair (up to 63/253 characters long). Can be assigned at creation time or be added/edited later. Add --overwrite parameter to rewrite already existing label.

$ kubectl label <object> <name> <key1>=<value1> <key2>=<value2> 
$ kubectl label <object> <name> <key1>=<value1> <key2>=<value2> --overwrite
$ kubectl label <object> --all <key>=<value>
# Delete
$ kubectl label <object> <name> <key>-

# Output additional column with all labels
$ kubectl get <object> --show-labels
# Specify columns (labels) to show
$ kubectl get <object> -L <key1>,<key2>

Controllers and Services match Pods using labels. Pod scheduling (e.g. based on hardware specification, SSD, GPU, etc) uses Labels as well.

Deployment and Service example, all labels must match:

kind: Deployment
...
spec:
  selector:
    matchLabels:
      <key>: <value>
  ...
  template:
    metadata:
      labels:
        <key>: <value>
    spec:
      containers:
---
kind: Service
...
spec:
  selector:
    <key>: <value>

Labels are also used to schedule Pods on a specific Node(s):

kind: Pod
...
spec:
  nodeSelector:
    <key>: <value>

Best practices include:

  • Name of application resource belongs to
  • Application tier (frontend, backend, etc)
  • Environment (dev, prod, QA, etc)
  • Version
  • Type of release (stable, canary, blue/green, etc)
  • Tenant (if multiple used in the same namespace)
  • Shard (for sharded systems)

Label selector

Labels can be used to query/filter set of objects.

# Long format
$ kubectl get <object> --selector <key>=<value>

# Check if label exists (or doesn't exist)
$ kubectl get <object> -l <key>
$ kubectl get <object> -l '!<key>'

# Check multiple labels
$ kubectl get <object> -l '<key1>=<value1>,<key2>!=<value2>'
$ kubectl get <object> -l '<key1> in (<value1>,<value2>)'
$ kubectl get <object> -l '<key1> notin (<value1>,<value2>)'

Annotation

Annotations hold object's metadata that is not used by Kubernetes itself, but by people or third-party applications - for example, a timestamp, pointer to related objects from other ecosystems, email of the developer responsible for the object and so on. Non-hierarchical key/value pairs (keys up to 63 characters, total size up to 256KB). Can't be used for querying/selecting.

Manifest file:

kind: Pod
...
metadata:
  annotations:
    owner: Max

$ kubectl annotate <object_type> <name> key=<value>
$ kubectl annotate <object_type> --all key=<value> --namespace <name>
$ kubectl annotate <object_type> <name> key=<new_value> --overwrite

# Delete
$ kubectl annotate <object_type> <name> <key>-

Best practices include at least a description of the resource and contact information of the responsible person. Also could include names of services it is using, build and version info, and so on.

Monitoring

Install Kubernetes dashboard (runs Pods in kubernetes-dashboard namespace):

  • deploy

     $ kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0-rc3/aio/deploy/recommended.yaml
  • start proxy

     $ kubectl proxy
    
  • access the following page (port may vary) http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy

  • choose token option and supply the output of the following command:

     $ kubectl -n kube-system describe secret $(kubectl -n kube-system get secret | awk '/^deployment-controller-token-/{print $1}') | awk '$1=="token:"{print $2}'
  • Comparing monitoring options

Events

Events show cluster operations on resources defined in the cluster, such as scheduling decisions, scaling operation, etc. Retained for one hour. An Event is also a Kubernetes resource like any other.

$ kubectl get events
# Sort chronologically
$ kubectl get events --sort-by='.metadata.creationTimestamp'
# Filter events
$ kubectl get events --field-selector type=Warning,reason=Failed

# Specific object events
$ kubectl describe <type> <name>

Resource usage

kubelet collects resource consumption data both on container and node levels via cAdvisor agent.

Kubernetes Metrics Server collects resource metrics from kubelets and exposes them via apiserver. This includes CPU and memory usage info for Pods and nodes. Intended to be used internally by the cluster - it feeds data to the scheduler and to horizontal and vertical autoscalers (not to third-party monitoring solutions). Once installed, accessible at /apis/metrics.k8s.io.

Can use labels, selectors and --sort-by parameters as well.

# Access actual data resource usage
$ kubectl top pods
# Per container utilization
$ kubectl top pods --containers

# Node resource utilization
$ kubectl top nodes

Logging

Kubernetes keeps container's logs in a file. The location depends on the container runtime (default for containerd is /var/log/containers). Two logs are retained on a node - the current one and, if the container has restarted, the previous run log. Access the one before the most recent restart with --previous parameter.

$ kubectl logs <pod>
# multicontainer Pod
$ kubectl logs <pod> -c container
# logs come in sequence - container1 -> container2 ...
$ kubectl logs <pod> --all-containers

Logs are automatically rotated daily and every time the file reaches 10MB in size. kubectl logs only shows logs from the last rotation.

Once container is removed, so are its logs. Use aggregation tools, such as Fluentd, to gather logs for safekeeping and analysis. ELK is a common stack for aggregation, search and visualization of logging data.

Aggregation tools treat each line as an entry, which makes multi-line logs appear as separate entries. Either configure outputting logs in JSON format or keep human-readable logs in stdout, while writing JSON to a specific location. Aggregation tool will need additional node-level configuration or be deployed as a sidecar.

Nodes run kubelet and kube-proxy. On systemd systems kubelet runs as a systemd service, which means its logs are stored in journald. kube-proxy runs as a Pod in general (same log access methods apply). If it doesn't run inside a Pod, logs are stored in /var/log/kube-proxy.

# -u <service_name>
# opens in pager format: f (forward), b (back)
# add --no-pager parameter to disable it
$ journalctl -u kubelet.service
# narrow down time frame
$ journalctl -u kubelet.service --since today
# non-systemd
$ tail /var/log/kubelet.log
# Locate apiserver log file on a node
$ sudo find / -name "*apiserver*log"

Termination message path

A process in container can write a termination message (reason for termination) into specific file, which is read by kubelet and shown with kubectl describe in the Message field. Default location is /dev/termination-log; can be set to custom location with terminationMessagePath field in the container definition in the Pod spec. Can also be used in Pods that run completable task and terminate successfully.

terminationMessagePolicy set to FallbackToLogsOnError will use last few lines in container's logs as termination message (only on unsuccessful termination).
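
A container-level sketch of both settings (container name, image and path are examples):

spec:
  containers:
    - name: app
      image: some/image
      terminationMessagePath: /tmp/termination-log
      terminationMessagePolicy: FallbackToLogsOnError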

Network

All Pods can communicate with each other on all nodes. Software (agents) on a given node can communicate with all Pods on that node.

Network types:

  • node (real infrastructure)
  • Pod - implemented by network plugin, IPs are assigned from PodCidrRange, but could also be assigned from the node network
  • cluster - used by Services using ClusterIP type, assigned from ServiceClusterIpRange parameter from API server and controller manager configurations

Pod-to-Pod communication on the same node goes through bridge interface. On different nodes could use Layer 2/Layer 3/overlay options. Services are implemented by kube-proxy and can expose Pods both internally and externally.

Pause/Infrastructure container starts first and sets up the namespace and network stack inside a Pod, which is then used by the application container(s). This allows container(s) restart without interrupting network namespace. Pause container has a lifecycle of the Pod (created and deleted along with the Pod).

Container Network Interface (CNI) is abstraction for implementing container and Pod networking (setting namespaces, interfaces, bridge configurations, IP addressing). CNI sits between Kubernetes and container runtime. CNI plugins are usually deployed as Pods controlled by DaemonSets running on each node.

Expose individual Pod directly to the client:

$ kubectl port-forward <pod_name> <localhost_port>:<pod_port>

Use apiserver as proxy to reach individual Pod or Service (use with kubectl proxy to handle authentication):

# Pod
$ curl <apiserver_host>:<port>/api/v1/namespaces/<namespace>/pods/<pod>/proxy/<path>
# Service
$ curl <apiserver_host>:<port>/api/v1/namespaces/<namespace>/services/<service>/proxy/<path>

By default Pods run in a separate network namespace. hostNetwork: true spec makes a Pod use host's network namespace, effectively making it behave as if it was running directly on a node. A process in the Pod that binds to a port will be bound to node's port. hostPort property in spec.containers.ports allows binding to host's port, without using host network.
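
Minimal sketches of both options (image and ports are examples):

# use the host's network namespace
spec:
  hostNetwork: true
  containers:
    - name: app
      image: nginx

# bind a single container port to the host, without host networking
spec:
  containers:
    - name: app
      image: nginx
      ports:
        - containerPort: 8080
          hostPort: 80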

DNS

DNS is available as a Service in a cluster, and Pods by default are configured to use it. Provided by CoreDNS (since v1.13). Configuration is stored as ConfigMap coredns in kube-system namespace, which is mounted to coredns Pods as /etc/coredns/Corefile. Updates to ConfigMap get propagated to CoreDNS Pods in about 1-2 minutes - check logs for reload message. More plugins can be enabled for additional functionality.

dnsPolicy settings in Pod spec can be set to the following:

  • ClusterFirst (default) - send DNS queries with cluster prefix to coredns service
  • Default - inherit node's DNS
  • None - specify DNS settings via another parameter, dnsConfig
     spec:
       dnsPolicy: "None"
       dnsConfig:
         nameservers:
           - 9.9.9.9

A records:

  • for Pods - <ip_in_dash_form>.<namespace>.pod.cluster.local
  • for Services - <service_name>.<namespace>.svc.cluster.local

Traffic can access a Service using a name, even in a different namespace just by adding a namespace name:

# will fail if service is in different namespace
$ curl <service_name>

# works across namespaces
$ curl <service_name>.<namespace>

Network Policy

Can be used for managing Pod networking (communication between Pods); support depends on the networking plugin. Applies to Pods that match its label selector, all Pods in a namespace that match namespace selector, or matching CIDR block.

Common practice is to drop all traffic first, then add policies which allow desired ingress and egress traffic. Default ingress deny example:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {} # applies to all Pods
  policyTypes:
    - Ingress
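
A follow-up policy sketch allowing ingress to backend Pods only from frontend Pods (labels and port are assumptions):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080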

Network policy recipes.

Service

Provides persistent endpoint for clients (virtual IP and DNS). Load balances traffic to Pods and automatically updates during Pod controller operations. Labels and selectors are used to determine which Pods are part of a Service. Default and popular implementation is kube-proxy on the node's iptables.

Acts as a network abstraction for Pod access. Allows communication between sets of deployments. A unique IP and DNS name are assigned at creation time, which other Pods can use to reach the Service.

A Service is a controller, which listens to Endpoints controller to provide persistent IP for Pods. Sends messages (settings) via apiserver to network plugin (e.g. Calico) and to kube-proxy on every node. Also handles access policies for inbound requests.

Service also creates an Endpoint object(s), which are individual IP:PORT pairs of underlying Pods. See the routing IPs (mostly used for troubleshooting):

$ kubectl describe endpoints <service_name>

Imperatively create a new Service (defaults to ClusterIP type; add --type=NodePort for a NodePort Service):

# Create a service
$ kubectl expose deployment <name> \
	--port 80 \
	--target-port 8080

service/kubernetes is an API server service.

Each Service gets a DNS A/AAAA record in cluster DNS in the form <svc_name>.<namespace>.svc.<cluster_domain>. If Pods and Service are in the same namespace, the latter can be referenced simply by the name. Pods that are created after the Service also get environment variables set with the information about Services available at that time.

kubectl proxy command creates a local proxy allowing sending requests to Kubernetes API:

$ kubectl proxy &
# Access foo service
$ curl http://localhost:8001/api/v1/namespaces/default/services/foo/proxy/
# If service has a port_name configured
$ curl http://localhost:8001/api/v1/namespaces/default/services/foo:<port_name>/proxy/

sessionAffinity setting can be set to ClientIP directing single client to the same Pod. Cookie based affinity isn't possible since Services operate at TCP/UDP level.

targetPort setting can also refer to port names specified in Pod spec, instead of numbers. Thus, Pods port number can change without requiring similar change on Service side.

kind: Pod
spec:
  containers:
    - name: foo
      ports:
        - name: http
          containerPort: 8080
        - name: https
          containerPort: 8443
---
kind: Service
spec:
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https

To manually remove a Pod from a Service, include a label such as enabled=true in the Service selector and Pod labels, then switch it to false or remove it on the given Pod.

ClusterIP

Default Service type. Exposes a Service on a cluster-internal IP (exists in iptables on the nodes). IP is chosen from a range specified as a ServiceClusterIPRange parameter both on apiserver and kube-controller-manager configurations. If Service is created before corresponding Pods, they get hostname and IP address as environment variables.


NodePort

Exposes a Service on the IP address of each node in the cluster at a specific port number, making it available outside the cluster. Built on top of ClusterIP Service - creates ClusterIP Service and allocates port on all nodes with a firewall rule to direct traffic on that node to the ClusterIP persistent IP. NodePort option is set automatically from the range 30000 to 32767 or can be specified by user (should still fall within that range).

Regardless of which node is requested, traffic is routed to ClusterIP Service and then to Pod(s) (all implemented by kube-proxy on the node).
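
A declarative NodePort sketch (names, ports and the explicit nodePort are examples):

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: NodePort
  selector:
    app: web
  ports:
    - port: 80          # ClusterIP port
      targetPort: 8080  # container port
      nodePort: 30080   # must fall within 30000-32767; omit to auto-assign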


LoadBalancer

Exposes a Service externally, using a load balancing Service provided by a cloud provider or add-on.

Creates a NodePort Service and makes an async request to use a load balancer. If listener does not answer (no load balancer is created), stays in Pending state.

In GKE it is implemented using GCP's Network Load Balancer. GCP will assign a static IP address to the load balancer, which directs traffic to nodes (randomly). kube-proxy chooses a random Pod, which may reside on a different node, to ensure even balance (default behavior). The response will take the same route back. Use externalTrafficPolicy: Local option to disable this behavior and force kube-proxy to direct traffic to local Pods.


ExternalName

Provides service discovery for external services. Kubernetes creates a CNAME record for external DNS record, allowing Pods to access external services (does not have selectors, defined Endpoints or ports).

apiVersion: v1
kind: Service
metadata:
  name: external-service
spec:
  type: ExternalName
  externalName: someapi.somecompany.com
  ports:
    - port: 80

Headless services

Expose individual Pod IPs backing a Service directly. Define by explicitly specifying None in spec.clusterIP field (headless). Cluster IP is not allocated and kube-proxy does not handle this Service (no load balancing nor proxying). Allows interfacing with other service discovery mechanisms (not tied to Kubernetes).

Service with selectors - Endpoint controller creates endpoint records and modifies DNS config to return A records (IP addresses) pointing directly to Pods. Client decides which one to use. Often used with stateful applications.

Service without selectors - no Endpoints are created. DNS config may look for CNAME record for ExternalName type or any Endpoint records that share a name with a Service (Endpoint object(s) needs to be created manually, and can also include external IP).
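
A headless Service sketch with selectors (names and port are examples):

apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None    # headless: DNS returns Pod IPs directly
  selector:
    app: db
  ports:
    - port: 5432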

Endpoints

Usually not managed directly, represents IPs for Pods that match particular Service. Endpoint controller runs as part of kube-controller-manager.

If Endpoints is empty, meaning no matching Pods, Service definition might be wrong (labels).

On Pod deletion event Endpoints controller removes the Pod as an endpoint (by modifying Endpoints API object). kube-proxies that watch for changes update iptables on respective nodes; however, removing iptables rules doesn't break existing connections with clients.

Ingress

Consists of an Ingress object describing various rules on how HTTP traffic gets routed to Services (and ultimately to Pods) and an Ingress controller (daemon in a Pod) watching for new rules (/ingresses endpoint in the apiserver). Cluster may have multiple Ingress controllers. Both L4 and L7 can be configured. Ingress class or annotation can be used to associate an object with a particular controller (can also create a default class). Absence of an Ingress class or annotation will cause every controller to try to satisfy the traffic.

Ingress also provides load balancing directly to Endpoints bypassing ClusterIP. Name-based virtual hosting is available via host header in HTTP request. Path-based routing and TLS termination are also available.

Ingress controller can be implemented in various ways: Nginx Pods, external hardware (e.g. Citrix), cloud-ingress provider (e.g. AppGW, AWS ALB). Currently 3 Ingress Controllers are supported: AWS, GCE, nginx. Nginx Ingress setup.

Main difference with a LoadBalancer Service is that this resource operates on level 7, which allows it to provide name-based virtual hosting, path-based routing, TLS termination and other capabilities.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
spec:
  ingressClassName: nginx
  # non-matching traffic or when no rules are defined
  defaultBackend:
    service:
      name: example-service
      port:
        number: 80
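
A sketch of host- and path-based rules in the same networking.k8s.io/v1 format (host, path and service names are assumptions):

spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80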

Storage

Rook is a storage orchestration solution.

Kubernetes provides storage abstraction as volumes and persistent volumes. Volumes share lifecycle of the Pod. That means volume persists between container restarts. PersistentVolume stays intact even after Pod is deleted and can be reused again. Volumes are attached to Pods, not containers. volumeMount is used to attach volume defined in a Pod to a container.

Access modes:

  • ReadWriteOnce - read/write to a single node
  • ReadOnlyMany - read-only by multiple nodes
  • ReadWriteMany - read/write by multiple nodes

Kubernetes groups volumes with the same access mode together and sorts them by size from smallest to largest. Claim is checked against each volume in the access mode group until matching size is found.

EmptyDir

Simply an empty directory that can be mounted to a container in a Pod. When the Pod is destroyed, the directory is deleted. Kubernetes creates the emptyDir volume from node's local disk or using a memory-backed file system.
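
A minimal emptyDir sketch (names, image and mount path are examples):

apiVersion: v1
kind: Pod
metadata:
  name: cache-pod
spec:
  containers:
    - name: app
      image: nginx
      volumeMounts:
        - name: scratch
          mountPath: /cache
  volumes:
    - name: scratch
      emptyDir: {}        # set `medium: Memory` for a RAM-backed volume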

PersistentVolume

Storage abstraction with a separate lifecycle from Pod. Managed by kubelet - maps storage on the node and exposes it as a mount.

Persistent volume abstraction has 2 components: PersistentVolume and PersistentVolumeClaim. PersistentVolume is a durable and persistent storage resource managed at the cluster level. PersistentVolumeClaim is a request and claim made by a Pod to use a PersistentVolume (namespaced object, same namespace as Pod). User specifies volume size, access mode, and other storage characteristics. If a claim matches a volume, then claim is bound to that volume and Pod can consume that resource. If no match can be found, Kubernetes will try to allocate one dynamically.

Static provisioning workflow includes manually creating PersistentVolume, PersistentVolumeClaim, and specifying volume in Pod's spec.

PersistentVolume:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-store
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteMany
  nfs:
    ...

PersistentVolumeClaim:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-store
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
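
The remaining piece of the static workflow - a Pod spec sketch mounting the claim above (container name and mount path are examples):

spec:
  containers:
    - name: app
      image: nginx
      volumeMounts:
        - name: data
          mountPath: /var/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: nfs-store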

Reclaim policy

When PersistentVolumeClaim object is deleted, PersistentVolume may be deleted depending on reclaim policy. Reclaim policy can be changed on an existing PersistentVolume.

With Retain reclaim policy PersistentVolume is not reclaimed after PersistentVolumeClaim is deleted. PersistentVolume status changes to Released. Creating new PersistentVolumeClaim doesn't provide access to that storage, and if no other volume is available, claim stays in Pending state.


Dynamic provisioning

StorageClass resource allows admin to create a persistent volume provisioner (with type specific configurations). User requests a claim, and apiserver auto-provisions a PersistentVolume. The resource is reclaimed according to reclaim policy stated in StorageClass (default is Delete). Similar to PersistentVolume StorageClass isn't namespaced.

Dynamic provisioning workflow includes creating a StorageClass object and PersistentVolumeClaim pointing to this class. When a Pod is created, PersistentVolume is dynamically created. Delete reclaim policy in StorageClass will delete the PersistentVolume, if PersistentVolumeClaim is deleted.

StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: main
provisioner:
  kubernetes.io/azure-disk
parameters:
  ...

If PersistentVolumeClaim doesn't specify a StorageClass, default StorageClass is used (default class is marked with the storageclass.beta.kubernetes.io/is-default-class: true annotation). To use a preprovisioned PersistentVolume specify StorageClass as empty string. Deleting a StorageClass doesn't affect existing PV/PVCs.
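
A claim sketch referencing the StorageClass above (claim name and size are examples):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamic-claim
spec:
  storageClassName: main   # empty string ("") would force a pre-provisioned PersistentVolume
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi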

ConfigMap

Provides a way to inject application configuration data into Pods, e.g. config files, command line arguments, environment variables, port number, etc. Can be referenced in a volume. Can ingest data from a literal value, from a file or from a directory of files. Name must be DNS compliant.

ConfigMap can be updated. Also can be set as immutable, meaning can't be changed after creation. kubelet periodically syncs with ConfigMaps to keep ConfigMap volume up to date. Data is updated, even if it is already connected to a Pod (matter of seconds-minutes).

System components and controllers can also use ConfigMaps.

Manifest:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app1config
data:
  key: value

In command line --from-file option takes the file name as key and contents of the file as value; --from-env-file reads lines of key=value pairs. --from-file also takes a directory path as value, in which case a map is created, where file name is a key.

# create
$ kubectl create configmap [NAME] [DATA]
$ kubectl create configmap [NAME] --from-file=[KEY_NAME]=[FILE_PATH]

# examples
$ kubectl create configmap demo --from-literal=lab.difficulty=easy
$ kubectl create configmap demo --from-file=color.properties
$ kubectl create configmap demo --from-file=customkey=color.properties
$ cat color.properties
color.good=green
color.bad=red

Reference a ConfigMap

Environment variables (if key contains dash, it is not converted to underscore, but skipped altogether; key names must be valid environment variable names).

spec:
  containers:
    - name: app1
      # Pass individual value from ConfigMap
      env:
       - name: username
         valueFrom:
           configMapKeyRef:
             name: app1config
             key: username
      # Create environment variable for each entry in ConfigMap
      envFrom:
        - configMapRef:
            name: app1env

Container command setting - expose as environment variable first, then refer to it in command setting:

spec:
  containers:
    - name: app1
      env:
        - name: username
          valueFrom:
            configMapKeyRef:
              name: app1config
              key: username
      args: ["$(USERNAME)"]

Volume - depending on how ConfigMap is created could result in one file with many values or many files with value in each one. Default permissions are 644.

spec:
  containers:
    - name: app1
      volumeMounts:
        - name: app1config
          mountPath: /etc/config
  volumes:
    - name: app1config
      configMap:
        name: app1config

Volume type ConfigMap can also expose individual entries via items attribute; need to specify a file name for each entry:

volumes:
  - name: config
    configMap:
      name: foo
      items:
        - key: bar
          path: custom

If ConfigMap is mounted over non-empty directory, all items stored in that directory are hidden away. However, individual items can be mounted from a volume, instead of volume as a whole via subPath property of the volumeMount.

spec:
  containers:
    - image: some/image
      volumeMounts:
        - name: myvolume
          mountPath: /etc/someconfig.conf
          subPath: myconfig.conf

Secret

Similar to ConfigMap, but is used to store sensitive data.

In case of passing values from files, they should have single entries. File's name serves as a key, while value is its content. Kubelet syncs Secrets volumes just as ConfigMaps.

Secret resource is namespaced, and only Pods in the same namespace can reference a given Secret. Always stored in memory (tmpfs), as opposed to physical storage for ConfigMaps. Maximum size is 1MB.

Values in the data field must be base64 encoded (kubectl create secret encodes values automatically; in a manifest they must be encoded manually). The reason for base64 encoding is that values could also be binary files. stringData field allows passing data without encoding, however, it is automatically merged with data field (stringData overrides preexisting field). When reading Secret's data in a Pod, both through volumes or environment variables, actual value is automatically decoded. Encryption can also be set up.

kind: Secret
apiVersion: v1
stringData:
  foo: bar
data:
  cert: LS0TL..

Values passed will be base64 encoded strings (check result with commands below):

$ echo -n "admin" | base64
$ echo -n "password" | base64

Secret types:

  • generic - creating secrets from files, directories or literal values (see the example after this list)
  • TLS - private-public encryption key pair; supply both Kubernetes public key certificate encoded in PEM format and the private key of that certificate
     $ kubectl create secret tls tls-secret --cert=tls.cert --key=tls.key
  • docker-registry - credentials for a private docker registry (Docker Hub, cloud based container registries)
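
An example for the generic type (secret name, key and file are placeholders):

$ kubectl create secret generic app-creds \
    --from-literal=username=admin \
    --from-file=./password.txt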

Can be exposed to a Pod as environment variable or volume/file, latter being able to be updated and reflected in a Pod. A Secret can be marked as immutable - meaning it can not be changed after creation. A Pod using such Secret must also be deleted to be able to read a new Secret with the same name and updated value.

Secrets can be specified individually or all together from a Secret object, in which case keys will be used as environment names:

spec:
  containers:
  - name: one
    env:
      - name: APP_USERNAME
        valueFrom:
          secretKeyRef:
            name: app1
            key: USERNAME
  - name: two
    envFrom:
      - secretRef:
          name: app2

Exposing as a file creates a file in a container for each key and puts its value inside the file:

spec:
  containers:
    volumeMounts:
      - name: appconfig
        mountPath: /etc/appconfig
  volumes:
    - name: appconfig
      secret:
        secretName: app

Image pull secret

Image pull secret is used to pull images from private registries. First, create a docker-registry secret. A single entry .dockercfg is created in the Secret, just like Docker creates a file in user's home directory for docker login.

kubectl create secret docker-registry mydockerhubsecret \
  --docker-username=myusername --docker-password=mypassword \
  [email protected]

Reference a Secret as imagePullSecrets in Pod's spec:

kind: Pod
spec:
  imagePullSecrets:
    - name: mydockerhubsecret

Security

Every request to API server goes through the 3 step process:

  1. Authentication
  2. Authorization
  3. Admissions (validate contents of the request, optionally modify it)

Kubernetes provides two types of identities: normal user and service account. Users are not created nor managed by the API (there is no API object), but should be managed by external systems. Service Accounts are created by Kubernetes itself to provide identity for processes in Pods to interact with apiserver.

Controlling access to the Kubernetes API.

# check allowed action as current or any given user
$ kubectl auth can-i create deployments
$ kubectl auth can-i create deployments --as <user_name>
$ kubectl auth can-i list pods --as=system:serviceaccount:<namespace>:<service_account>
$ kubectl get pods -v6 --as=system:serviceaccount:<namespace>:<service_account>

kubeadm-based cluster creates a self-signed CA, which is used to create certificates for system components and signed user certificates. kubernetes-admin user is also created, which has all access across the cluster.

Authentication

Authentication validation is performed by authentication plugin. Multiple plugins can be configured, which apiserver calls in turn until one of them determines the identity of the sender - username, user ID, group it belongs to. Below are the main methods:

| Method | Description |
| --- | --- |
| client certificate | Username is included in the certificate itself (Common Name field). Most commonly used in kubeadm-bootstrapped and cloud managed clusters. |
| authentication token | Included in HTTP authorization header. Used with Service Accounts, during bootstrapping, and can also authenticate users via Static Token File, which is read only at apiserver startup; changes in this file require apiserver restart. |
| basic HTTP | User credentials are stored in Static password file, also read only at apiserver startup. However, easy to set up and use for dev environments. |
| OpenID provider | Allow external identity providers for authentication services, SSO is also possible. |

Authentication type is defined in apiserver startup options. Documentation.

Groups are simple strings, representing arbitrary group names. They are used to grant permission to multiple identities at once.

Built-in groups:

  • system:unauthenticated - used for requests where none of the authentication plugins could authenticate
  • system:authenticated - automatically assigned to user who is authenticated
  • system:serviceaccounts - encompasses all service accounts in the system
  • system:serviceaccounts:<namespace> - encompasses all service accounts in the specific namespace

Authorization

Similarly to authentication plugins, multiple authorization plugins can be configured, which apiserver calls in turn, until one of them determines that the user can do the requested action. Authorization plugins:

  • RBAC
  • Node - grant access to kubelets on nodes
  • ABAC (Attribute-based Access Control) - policies with attributes

In kubeadm-based clusters RBAC and Node authorization plugins are enabled by default.

RBAC

Configure RBAC.

Base elements:

  • subject (who) - users or processes that can make requests to apiserver
  • resources (on what) - API objects such as Pods, Deployments, etc
  • actions (what) - verbs, operations such as get, watch, create

Elements are connected together using 2 RBAC API objects: roles (connect API resources and actions) and role bindings (connect roles to subjects). Both can be applied on a cluster or namespace level.

Roles are what can be done to resources. A Role includes one or many rules that specify allowed verbs on resources. Rules are permissive; default action is deny (there is no deny rule). Subjects are users, groups or Service Accounts.

get, list and watch are often used together to provide read-only access. patch and update are also usually used together as a unit. Only get, update, delete and patch can be used on named resources. * represents all actions, full access.

To prevent privilege escalation, the API server only allows users to create and update Roles, if they already have all the permissions listed in that Role (and for the same scope).

Default ClusterRoles and ClusterRoleBindings are updated (recreated) each time apiserver starts - in case one was accidentally deleted or new Kubernetes version brings updates.

| Combination | Scope |
| --- | --- |
| Role + RoleBinding | namespaced resources in a specific namespace |
| ClusterRole + RoleBinding | namespaced resources in a specific namespace (reusing same ClusterRole in multiple namespaces) |
| ClusterRole + ClusterRoleBinding | namespaced resources in any or all namespaces, cluster level resources, non-resource URLs |

# Role and RoleBinding
$ kubectl create role <name> --verb=<list_of_verbs> --resource=<list_of_resource>
$ kubectl create rolebinding <name> --role=<role_name> --serviceaccount=<namespace>:<service_account>
$ kubectl create rolebinding <name> --role=<role_name> --user <user_name>

$ kubectl create role newrole --verb=get,list --resource=pods
$ kubectl create rolebinding newrolebinding --role=newrole --serviceaccount=default:newsvcaccount

# ClusterRoleBinding
$ kubectl create clusterrolebinding <name> --clusterrole=view --user=<user_name>

Role and ClusterRole

ClusterRole is defined at cluster level. Enables access to:

  • cluster scoped resources (Nodes, PersistentVolumes, etc)
  • resources in more than one or all namespaces (acts as a common role to be bound inside individual namespaces)
  • and non-resource URLs (/healthz, /version, etc); for non-resource URLs plain HTTP verbs must be specified (post, get, etc, also need to be lowercase)

Rule anatomy:

  • apiGroups - empty string represents the Core API Group (Pods, etc)
  • resources - Pods, Services, etc (plural form must be used)
  • verbs - get, list, create, update, patch, watch, delete, deletecollection; * represents all verbs

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: demorole
  namespace: ns1
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]

Default ClusterRoles:

| Name | Description |
| --- | --- |
| cluster-admin | Cluster-wide super user; with RoleBinding gives admin access within a namespace (Roles, RoleBindings, ResourceQuotas) |
| admin | Full access within a namespace; with RoleBinding gives admin access within a namespace (Roles, RoleBindings) |
| edit | Read/write access within a namespace. Can't view/edit Roles, RoleBindings, ResourceQuotas; can access Secrets |
| view | Read-only access within a namespace. Can't view/edit Roles, RoleBindings, ResourceQuotas; no access to Secrets |

RoleBinding and ClusterRoleBinding

Role binding always references a single role, but can bind a role to multiple subjects. They can also bind to Service Accounts in another namespace. RoleBinding must be in the same namespace with Role. ClusterRoleBinding provides access across all namespaces.

  • roleRef - Role or ClusterRole reference (RoleBinding can reference Role or ClusterRole, while ClusterRoleBinding can only reference ClusterRole)
  • Subjects
    • kind (user, group, SA)
    • Name
    • Namespace (only for SA, since user and group are not namespaced)

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: demobinding
  namespace: ns1
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: demorole
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: demouser

Admission controller

Admission controller has access to the content of the objects being created, able to modify and validate it, and potentially can deny the request. If a request is trying to create, modify, or delete a resource, it is sent to admission control. Multiple controls can be configured; a request goes through all of them.

Controllers are compiled into apiserver binary, can be enabled or disabled during startup: --enable-admission-plugins=Initializers, --disable-admission-plugins=PodNodeSelector.

Initializer allows dynamic modification of the API request, ResourceQuota ensures the object created doesn't violate any existing quotas.

# View list of enabled and/or disabled admission controllers
$ sudo grep admission /etc/kubernetes/manifests/kube-apiserver.yaml

Certificate authority

In kubeadm-based cluster self signed CA is created by default. However, an external PKI (Public Key Infrastructure) can also be joined. CA is used for secure cluster communications (e.g. apiserver) and for authentication of users and cluster components.

Certificates

CA and core cluster component certificates and keys, etcd cert setup and more are located at /etc/kubernetes/pki. Service Account tokens are seeded from sa.key and sa.pub also located there.

ca.crt is a CA certificate that is used by clients to trust certificates issued by this CA (presented by apiserver to encrypt communication). It is distributed to:

  • nodes during bootstrapping
  • clients, users, to interact with the cluster (e.g. kubeconfig)
  • included in the Secret that is created as part of Service Account

ca.key is a private key that is matched with ca.crt.

apiserver exposes an API to create and sign x509 certificates (through Certificate Signing Request, CSR).

Create a user certificate:

  • create a private key (openssl or cfssl)
     $ openssl genrsa -out <user_name>.key 2048
  • create a CSR (openssl or cfssl); for the CSR object it needs to be base64 encoded into a single line
     # CN - username
     # O - group
     $ openssl req -new -key <user_name>.key -out <user_name>.csr -subj "/CN=new_user"
     $ cat <user_name>.csr | base64 | tr -d "\n" > <user_name>.base64.csr
  • create and submit CSR object
     apiVersion: certificates.k8s.io/v1
     kind: CertificateSigningRequest
     metadata:
       name: <csr_name>
     spec:
       groups:
         - system:authenticated
       request: <contents of <user_name>.base64.csr>
       signerName: kubernetes.io/kube-apiserver-client
       usages:
         - client auth
     $ kubectl apply -f <csr>.yaml
     $ kubectl get csr
  • approve CSR
     $ kubectl certificate approve <csr_name>
  • retrieve certificate
     $ kubectl get certificatesigningrequests <csr_name> \
     	-o jsonpath='{ .status.certificate }' | base64 --decode > <user_name>.crt

CSR objects are garbage collected from the apiserver in 1 hour - CSR approval and certificate retrieval must be done within that 1 hour.

Service account

Namespaced API object that provides an identity for processes in a Pod to access API server and perform actions. Certificates are mounted as a volume and are made available to a Pod at /var/run/secrets/kubernetes.io/serviceaccount/.

Each namespace has a default Service Account (created automatically with a namespace). All Pods must have a Service Account defined; if none is specified, default is used. This setting must be set at creation time; can not be changed later.

Each Service Account is tied to a Secret (created and deleted automatically) stored in the cluster. That Secret contains CA certificate, authentication token (JWT) and namespace of Service Account.

Create Service Account:

  • Declaratively:
     apiVersion: v1
     kind: ServiceAccount
     metadata:
       name: mysvcaccount
  • Imperatively:
     $ kubectl create serviceaccount mysvcaccount
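
A Pod spec sketch referencing the Service Account created above (Pod name and image are examples):

apiVersion: v1
kind: Pod
metadata:
  name: api-client
spec:
  serviceAccountName: mysvcaccount   # must be set at creation time
  containers:
    - name: app
      image: nginx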

PodSecurityAdmission

Automate security enforcement namespace-wide with Pod Security Admission.

kubectl

kubectl [command] [type] [Name] [flag]

All commands can be supplied with object type and a name separately or in the form of object_type/object_name.

Common commands:

  • apply/create - create resources
  • run - start a pod from an image
  • explain - built-in documentation of object or resource, can also pass object's properties via dot notation, e.g. pod.spec.containers
  • delete - delete resources
  • get - list resources, get all shows all Pods and controller objects
  • describe - detailed information on a given resource
  • exec - execute command on a container (in multi container scenario, specify container with -c or --container; defaults to the first declared container)
  • logs - view logs on a container
  • cp path/on/host <pod_name>:path/in/container - copy files from host
  • diff - check the difference between existing object and the one defined in the manifest
     $ kubectl diff -f manifest.yaml

Common flags:

  • -o <format> - output format, one of wide, yaml, json
  • --dry-run <option> - either server or client; client is useful for validating syntax and generating a syntactically correct manifest, server sends the request to apiserver but doesn't persist it in storage
     $ kubectl create deployment nginx --image nginx --dry-run=client -o yaml
  • -v - verbose output, can be set to different levels, e.g. 7. Any number can be specified starting from 0 (less verbose), but there is no implementation for greater than 10
  • --watch - give output over time (updates when status changes)
  • --recursive - show all inner fields
  • --show-labels - add extra column labels to the output with all labels
  • --tail=<number> - limit output to the last <number> lines
  • --since=3h - limit output based on time limit
  • --cascade=orphan - deletes the controller, but not objects it has created.

Running imperative commands, adhocs, like set, create, etc does not leave any change information. --record option can be used to write the command to kubernetes.io/change-cause annotation to be inspected later on, for example, by kubectl rollout history.

Change manifest and object from cli (in JSON parameter specify field to change from root):

$ kubectl patch <object_type> <name> -p <json>
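
For example, a hypothetical Deployment's replica count could be changed like this:

$ kubectl patch deployment <name> -p '{"spec":{"replicas":3}}'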

Troubleshooting

Get privileged access to a node through interactive debugging container (will run in host namespace, node's filesystem will be mounted at /host)

$ kubectl debug node/<name> -ti --image=<image_name>

Troubleshoot a Pod that doesn't have necessary utilities or shell altogether. Adds a new container to a running Pod, which is listed as an ephemeral container. Added via a handler (API call), not Pod's spec. --target option puts the new container into the target container's process namespace (so that ps would show all processes) - must be supported by the container runtime, otherwise the container gets its own separate namespace.

$ kubectl debug -it <pod> --image=busybox --target=<container>

tutum/dnsutils image contains nslookup and dig for DNS debugging.

Access objects

JSONPath support docs. To build up JSONPath parameter output desired objects with -o json first.

# get names of all Pods
$ kubectl get pod -o jsonpath='{ .items[*].metadata.name }'
# get images used in all namespaces
$ kubectl get pod -A -o jsonpath='{ .items[*].spec.containers[*].image }'

Filter specific field instead of retrieving all elements in a list with *. ? - define a filter, @ - refer to current object.

# Retrieve internal IPs of all nodes
$ kubectl get nodes -o jsonpath="{ .items[*].status.addresses[?(@.type=='InternalIP')].address }"

Output can be formatted for easy reading:

$ kubectl get pod -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'

Sorting can be done based on any string or numeric field with --sort-by parameter. Also data can be presented in columns, usually together with custom-columns (e.g. to output fields that are not part of default kubectl output).

$ kubectl get pod -o jsonpath='{ .items[*].metadata.name }' --sort-by=.metadata.name
$ kubectl get pod -o jsonpath='{ .items[*].metadata.name }' \
	--sort-by=.metadata.creationTimestamp \
	--output=custom-columns='Name:metadata.name,CREATIONTIMESTAMP:metadata.creationTimestamp'

kubeconfig

kubectl interacts with kube-apiserver and uses configuration file, $HOME/.kube/config, as a source of server information and authentication. context is a combination of cluster and user credentials. Can be passed as cli parameters, or switch the shell contexts:

$ kubectl config use-context <context>

kubeconfig files define connection settings to the cluster: mainly client certificates and apiserver network location. Often CA certificate that was used to sign the certificate of apiserver is also included, thus, client can trust the certificate presented by apiserver upon connection.

During kubeadm bootstrapping kubeconfig files for various components are placed at /etc/kubernetes:

  • admin.conf - cluster admin account (kubernetes-admin)
  • kubelet.conf
  • controller-manager.conf
  • scheduler.conf

Each worker node also has a kubelet.conf file that is used by kubelet to authenticate to the apiserver. kube-proxy's kubeconfig file is stored as ConfigMap in kube-system namespace.

kubeconfig consists of 3 sections: clusters, users, and contexts (combination of a user and a cluster with optionally a namespace). Each has a name field for reference. Context name convention - <user_name>@<cluster_name>. current-context field specifies the default context to use with all kubectl commands. User defines a user name and either one of certificate/token/password.

  • cluster
    • certificate-authority-data - base64-encoded ca.crt
    • server - URL, location of API server
  • user
    • client-certificate-data - base64-encoded certificate that is presented to API server for authentication (username is encoded inside certificate)
    • client-key-data - correlated private key
  • context
    • cluster - referenced by name
    • user - referenced by name

kubectl config is used to interact with kubeconfig file. Default user location - $HOME/.kube/config. Use --kubeconfig parameter or KUBECONFIG environment variable to use a different file in a custom location.

# View contents, basic server info
# Certificate data is redacted
$ kubectl config view
# View all including certificates (base64 encoded)
$ kubectl config view --raw
$ kubectl config view --kubeconfig=/path/to/kubeconfig

# context, cluster info
# useful to verify the context
$ kubectl cluster-info

# list all contexts
$ kubectl config get-contexts

$ kubectl config use-context <context_name>

# configure user credentials
# token and username/password are mutually exclusive
$ kubectl config set-credentials

# Remove entries
$ kubectl config delete-context <context>
$ kubectl config delete-cluster <cluster>
$ kubectl config unset users.<user>

Manually create kubeconfig file using user certificates obtained earlier:

# Define cluster
# Optionally specify --kubeconfig to use custom file instead of default
# --embed-certs base64 encodes certificate data and inserts it
$ kubectl config set-cluster <cluster_name> \
	--server=<api_server_url> \
	--certificate-authority=<path_to_ca.crt> \
	--embed-certs=true \
	--kubeconfig=<path_to_kubeconfig.conf>

# Define a credential
$ kubectl config set-credentials <user_name> \
	--client-key=<path_to_user.key> \
	--client-certificate=<path_to_user.crt> \
	--embed-certs=true \
	--kubeconfig=<path_to_kubeconfig.conf>

# Define context
$ kubectl config set-context <context_name> \
	--cluster=<cluster_name> \
	--user=<user_name> \
	--kubeconfig=<path_to_kubeconfig.conf>

krew

krew is a plugin (extensions) manager for kubectl. Plugins introduce new commands, but don't overwrite or extend existing kubectl commands. Extend kubectl with plugins.

Ensure that PATH includes plugins (krew's home directory, most likely $HOME/.krew).

Useful commands

  • get container ids in a pod:
     $ kubectl get pod <pod_name> -o=jsonpath='{range .status.containerStatuses[*]}{@.name}{" - "}{@.containerID}{"\n"}{end}'
  • start up ubuntu pod:
     $ kubectl run <name> --rm -i --tty --image ubuntu -- bash

Related products

References
