Kubernetes - vi-credo/knowledge_corner Wiki


Registration (new nodes can easily register with a master node and accept a workload) and service discovery (automatic detection of new services via DNS or environment variables) enable easy scalability and availability.

Containers are isolated user spaces per running application code. The user space is all the code that resides above the kernel, and includes the applications and their dependencies. Abstraction is at the level of the application and its dependencies.

Containerization helps with dependency isolation and integration problem troubleshooting.

Core technologies that enhanced containerization:

  • process - each process has its own virtual memory address space separate from others
  • Linux namespaces - used to control what application can see (process ID numbers, directory trees, IP addresses, etc)
  • cgroups - controls what resources an application can use (CPU time, memory, IO bandwidth, etc)
  • union file system - encapsulating application with its dependencies

Everything in Kubernetes is represented by an object with state and attributes that the user can change. Each object has two elements: object spec (desired state) and object status (current state). All Kubernetes objects are identified by a unique name (set by the user) and a unique identifier (set by Kubernetes).

Sample image for testing gcr.io/google-samples/hello-app:1.0.

Architecture

Cluster add-on Pods provide special services to the cluster, f.e. DNS, Ingress (HTTP load balancer), dashboard. Popular options are Fluentd for logging and Prometheus for metrics.

Control plane (master) components: API server, scheduler, controller manager, etcd, cloud manager (described below).

Worker components (also present on control plane node): container runtime, kubelet, kube-proxy (described below).

API server

Exposes RESTful operations and accepts commands to view or change the state of the cluster (users interact with it via kubectl). Handles all calls, both internal and external. All actions are validated and authenticated. It is the only component that connects to the etcd database, where the cluster state is stored.


Scheduler

Watches the API server for unscheduled Pods and schedules them on nodes. Uses an algorithm to determine where a Pod can be scheduled: first current quota restrictions are checked, then taints, tolerations, labels, etc. Scheduling is done by simply writing the chosen node's name into the Pod's object.


Controller manager

Continuously monitors the cluster's state through the API server. If the current state does not match the desired state, it contacts the necessary controller to reconcile them. Multiple roles are included in a single binary:

  • node controller - worker state
  • replication controller - maintaining correct number of Pods
  • endpoint controller - joins services and pods together
  • service account and token controller - access management

pod-eviction-timeout (default 5m) specifies a timeout after which Kubernetes should give up on a node and reschedule Pod(s) to a different node.


ETCD

Cluster's database (distributed b+tree key-value store) for storing cluster, network states and other persistent info. Instead of updating existing data, new data is always appended to the end; previous copies are marked for removal.

The etcdctl command allows for snapshot save and snapshot restore.


Cloud manager

Manages controllers that interact with external cloud providers.

Documentation.


Container runtime

Container runtime handles the container's lifecycle. Kubernetes supports common runtimes such as containerd and CRI-O, and can use any other that is CRI (Container Runtime Interface) compliant.

Each node can run a different container runtime.


Kubelet

Kubernetes agent on each node that interacts with API server. Responsible for Pod's lifecycle.

  • receives PodSpec (Pod specification)
  • passes requests to local container runtime
  • mounts volumes to Pods
  • ensures access to storage, Secrets and ConfigMaps
  • executes health checks for Pod/node

The kubelet process is managed by systemd when the cluster is built with kubeadm. Once running, it starts every Pod whose manifest is found in /etc/kubernetes/manifests.


Proxy

Provides network connectivity to Pods and maintains all networking rules using iptables entries. Works as a local load-balancer, forwards TCP and UDP traffic, implements services.

kube-proxy has 3 modes:

  • userspace mode
  • iptables mode
  • ipvs mode

High availability

HA cluster (stacked etcd topology) has at least 3 control plane nodes, since etcd requires at least 2 of 3 members online to reach a quorum. API server and etcd Pods run on all control plane nodes, while only one scheduler and controller-manager instance is active at any given time (implemented via a lease/leader-election mechanism); the others are on standby. A load balancer in front of the API server evenly distributes traffic from nodes and from outside the cluster. Since the API server and etcd are colocated on each control plane node, their redundancy is linked.

In external etcd topology etcd cluster (at least 3 nodes) is set up separately from the control plane. API server references this etcd cluster, while the rest is the same as in previous topology.

Installation and configuration

Recommended minimum hardware requirements:

Node CPU RAM (GB) Disk (GB)
master 2 2 8
worker 1 1 8

Cluster network ports:

Component Default port Used by
API server 6443 all
etcd 2379-2380 API/etcd
scheduler 10251 self
controller manager 10252 self
kubelet (both on Control Plane node and worker node) 10250 control plane
NodePort (worker node) 30000-32767 all

Self-install options:

  • kubeadm
  • kubespray - advanced Ansible playbooks for setting up a cluster on various OSs and using different network providers
  • kops (Kubernetes operations) - CLI tool for creating a cluster in Cloud (AWS officially supported, GKE, Azure, etc on the way); also provisions necessary cloud infrastructure; how to
  • kind - running Kubernetes locally on Docker containers
  • Kubernetes in Docker Desktop for Mac

Bootstrapping with kubeadm

kubeadm init performs the following actions, in order, by default (highly customizable):

  1. Pre-flight checks (permissions on the system, hardware requirements, etc)
  2. Create certificate authority
  3. Generate kubeconfig files
  4. Generate static Pod manifests (for Control Plane components)
  5. Wait for Control Plane Pods to start
  6. Taint Control Plane node
  7. Generate bootstrap token
    # List join token
    $ kubeadm token list
    # Regenerate join token
    $ kubeadm token create
  8. Start add-on Pods (DNS, kube-proxy, etc)
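
A minimal bootstrap sketch; the Pod network CIDR value is an assumption (it depends on the chosen network add-on), and the actual token and CA hash come from the kubeadm init output:

# on the control plane node
$ sudo kubeadm init --pod-network-cidr=192.168.0.0/16
# configure kubectl for the current user
$ mkdir -p $HOME/.kube
$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config
# on each worker node, using the values printed by kubeadm init
$ sudo kubeadm join <cp_host>:6443 --token <token> \
	--discovery-token-ca-cert-hash sha256:<hash>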

Pod networking

Container-to-container networking is implemented by the Pod concept, external-to-Pod connectivity is implemented by Services, while Pod-to-Pod networking is expected to be implemented outside Kubernetes by the network configuration (plugin).

Overlay networking (also software defined networking) provides layer 3 single network that all Pods can use for intercommunication. Popular network add-ons:

  • Flannel - L3 virtual network between nodes of a cluster
  • Calico - flat L3 network without IP encapsulation; policy based traffic management; calicoctl; Felix (interface monitoring and management, route programming, ACL configuration and state reporting) and BIRD (dynamic IP routing) daemons - routing state is read by Felix and distributed to all nodes allowing a client to connect to any node and get connected to a workload even if it is on a different node
  • Weave Net - multi-host network typically used as an add-on in a CNI-enabled cluster
  • Kube-Router

Maintenance

Node

Node is an API object representing a virtual/physical instance outside the cluster. Each node has a corresponding Lease object in the kube-node-lease namespace.

Turn scheduling of Pods on a node off/on with kubectl cordon/uncordon.

# Remove node from cluster
# 1. remove object from API server
$ kubectl delete node <node_name> 
# 2. remove cluster specific info
$ kubeadm reset
# 3. may also need to remove iptable entries

# View CPU, memory and other resource usage, limits, etc
$ kubectl describe node <node_name>

If a node is rebooted, Pods running on that node stay scheduled on it until the pod-eviction-timeout (default 5m) is exceeded.

Upgrading cluster

kubeadm upgrade

  • plan - check installed version against newest in the repository, and verify that upgrade is possible
  • apply - upgrade first cp node to the specified version
  • diff - (similar to apply --dry-run) show difference applied during an upgrade
  • node - upgrades the kubelet on a worker node or the control plane components on other control plane nodes; exposes the phase command to step through the process

Control plane node(s) should be upgraded first. Steps are similar for control plane and worker nodes.

A kubeadm-based cluster can only be upgraded by one minor version at a time (e.g. 1.16 -> 1.17).

Control plane upgrade

Check available and current versions. Then upgrade kubeadm and verify. Drain the Pods (ignoring DaemonSets). Verify the upgrade plan and apply it. kubectl get nodes would still show the old version at this point. Upgrade kubelet and kubectl and restart the daemon. Now kubectl get nodes should show the new version. Allow Pods to be scheduled on the node again.


Worker upgrade

Same process as on the control plane, except kubeadm upgrade node is used instead of apply, and the kubectl drain/uncordon commands are executed from the control plane.


Upgrade CLI

  • view available versions
     $ sudo apt update
     $ sudo apt-cache madison kubeadm
     # view current version
     $ sudo apt list --installed | grep -i kube
  • upgrade kubeadm on the given node
     $ sudo apt-mark unhold kubeadm
     $ sudo apt-get install kubeadm=<version>
     $ sudo apt-mark hold kubeadm
     $ sudo kubeadm version
  • drain pods (from control plane for both)
     $ kubectl drain <node_name> --ignore-daemonsets
  • view and apply node update
     # control plane
     $ sudo kubeadm upgrade plan
     $ sudo kubeadm upgrade apply <version>
     # worker (on the node)
     $ sudo kubeadm upgrade node
  • upgrade kubelet and kubectl
     $ sudo apt-mark unhold kubelet kubectl
     $ sudo apt-get install kubelet=<version> kubectl=<version>
     $ sudo apt-mark hold kubelet kubectl
    
     # restart daemon
     $ sudo systemctl daemon-reload
     $ sudo systemctl restart kubelet
  • allow pods to be deployed on the node
     $ kubectl uncordon <node_name>

etcd backup

The etcd backup file contains the entire state of the cluster. Secrets are not encrypted (only base64 encoded), therefore the backup file should be encrypted and stored securely.

Usually the backup script is run by Linux or Kubernetes cron jobs.

By default, there is a single etcd Pod running on the control plane node. All data is stored at /var/lib/etcd, which is backed by a hostPath volume on the node.

The etcdctl version should match the etcd version running in the Pod; use the etcd --version command inside the Pod to find out.

$ ETCDCTL_API=3 etcdctl --endpoints=<host>:<port> <command> <args>
# Running etcdctl on master node
$ ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
	--cacert=/etc/kubernetes/pki/etcd/ca.crt \
	--cert=/etc/kubernetes/pki/etcd/server.crt \
	--key=/etc/kubernetes/pki/etcd/server.key \
	snapshot save /var/lib/dat-backup.db
# Check the status of backup
$ etcdctl --write-out=table snapshot status <backup_file>

Restoring the backup to the default location:

$ export ETCDCTL_API=3
# Restore backup to another location
# By default restores in the current directory at subdir ./default.etcd
$ etcdctl snapshot restore <backup_file>
# Move the original data directory elsewhere
$ mv /var/lib/etcd /var/lib/etcd.OLD
# Stop etcd container at container runtime level, since it is a static Pod
# Move restored backup to default location, `/var/lib/etcd`
$ mv ./default.etcd /var/lib/etcd
# Restarted etcd will find new data

Restoring the backup to the custom location:

$ etcdctl snapshot restore <backup_file> --data-dir=/var/lib/etcd.custom
# Update static Pod manifest:
# 1. --data-dir=/var/lib/etcd.custom
# 2. mountPath: /var/lib/etcd.custom (volumeMounts)
# 3. path: /var/lib/etcd.custom (volumes, hostPath)
# Updating the manifest triggers a Pod restart (also restarts kube-controller-manager
# and kube-scheduler)

API

Home page

Object organization:

  • Kind - Pod, Service, Deployment, etc (available object types)
  • Group - core, apps, storage, etc (grouped by similar functionality)
  • Version - v1, beta, alpha
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
spec:
  containers:
  - name: nginx
    image: nginx

Core (Legacy) group includes fundamental objects such as Pod, Node, Namespace, PersistentVolume, etc. Other objects are grouped under named API groups such as apps (Deployment), storage.k8s.io (StorageClass), rbac.authorization.k8s.io (Role) and so on

Versioning follows Alpha->Beta->Stable lifecycle:

  • Alpha (v1alpha1) - disabled by default
  • Beta (v1beta1) - enabled by default, more stable, considered safe and tested, forward changes are backward compatible
  • Stable (v1) - backwards compatible, production ready

Request

API requests are RESTful (GET, POST, PUT, DELETE, PATCH)

Special API requests:

  • LOG - retrieve container logs
  • EXEC - exec command in a container
  • WATCH - get change notifications on a resource

API resource location:

  • Core API:
    • http://apiserver:port/api/$VERSION/$RESOURCE_TYPE
    • http://apiserver:port/api/$VERSION/namespaces/$NAMESPACE/$RESOURCE_TYPE/$RESOURCE_NAME (in namespace)
  • API groups
    • http://apiserver:port/apis/$GROUP_NAME/$VERSION/$RESOURCE_TYPE
    • http://apiserver:port/apis/$GROUP_NAME/$VERSION/namespaces/$NAMESPACE/$RESOURCE_TYPE/$RESOURCE_NAME

Response codes:

  • 2xx (success) - f.e. 201 (created), 202 (request accepted and performed async)
  • 4xx (client side errors) - f.e. 401 (unauthorized, not authenticated), 403 (access denied), 404 (not found)
  • 5xx (server side errors) - 500 (internal error)

Curl

Get certificates for easy request writing:

$ export client=$(grep client-cert $HOME/.kube/config | cut -d" " -f6)
$ export key=$(grep client-key-data $HOME/.kube/config | cut -d" " -f6)
$ export auth=$(grep certificate-authority-data $HOME/.kube/config | cut -d" " -f6)

$ echo $client | base64 -d - > ./client.pem
$ echo $key | base64 -d - > ./client-key.pem
$ echo $auth | base64 -d - > ./ca.pem

Make requests using keys and certificates from previous step:

$ curl --cert client.pem --key client-key.pem \  
	--cacert ca.pem \   
	https://k8sServer:6443/api/v1/pods 

Another way to make authenticated request is to start a proxy session in the background:

# once done, run fg and ctrl+C to stop the proxy
$ kubectl proxy &
# use port 8001, or whatever port is output by the proxy command
$ curl localhost:8001/<request>

Manifest

Minimal Deployment manifest explained:

# API version
# if API changes, API objects follow and may introduce breaking changes
apiVersion: apps/v1

# object type
kind: Deployment

metadata:
  # required, must be unique to the namespace
  name: foo-deployment
# specification details of an object
spec:
  # number of pods
  replicas: 1
  # a way for a deployment to identify which pods are members of it
  selector:
    matchLabels:
      app: foo
  # pod specifications
  template:
    metadata:
    # assigned to each pod
    # must match the selector
      labels:
        app: foo
    # container specifications
    spec:
      containers:
      - image: nginx
        name: foo

Generate with --dry-run parameter:

$ kubectl create deployment hello-world \
            --image=nginx \
            --dry-run=client \
            -o yaml > deployment.yaml
  • Root metadata should have at least a name.
  • generation represents a number of changes made to the object.
  • resourceVersion value is tied to etcd to help with concurrency of objects. Any change in the database will change this number.
  • uid - unique id of the object throughout its lifetime.

Pod

Pod is the smallest deployable object (not container). Pod embodies the environment where container lives, which can hold one or more containers. If there are several containers in a Pod, they share all resources like networking (unique IP is assigned to a Pod), access to storage and namespace (Linux). Containers in a Pod start in parallel (no way to determine which container becomes available first, but InitContainers can set the start up order). Loopback interface, writing to files in a common filesystem or inter-process communication (IPC) can be used by containers within a Pod for communication.

Secondary container may be used for logging, responding to requests, etc. Popular terms are sidecar, adapter, ambassador.

Pod states:

  • Pending - image is retrieved, but container hasn't started yet
  • Running - pod is scheduled on a node, all containers are created, at least one is running
  • Succeeded - containers terminated successfully and won't be restarted
  • Failed - all containers have terminated, at least one with a failed status
  • Unknown - most likely a communication error between master and kubelet
  • CrashLoopBackOff - one of the containers unexpectedly exited after being restarted at least once (most likely the Pod isn't configured correctly); Kubernetes repeatedly makes new attempts

Environment variables

User defined environment variables are defined in Pod spec as key/value pairs or via valueFrom parameter referencing some location or other Kubernetes resource.

System defined environment variables include Service names available at the time of Pod creation.

Neither type can be updated once the Pod is created.
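
A minimal sketch of both kinds in a Pod spec (names and values are illustrative); the second variable uses valueFrom with the downward API:

spec:
  containers:
  - name: app
    image: nginx
    env:
    # user defined key/value pair
    - name: APP_MODE
      value: "production"
    # value injected from the Pod object itself via valueFrom
    - name: POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP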


Pause container

The pause container is used to acquire the Pod's IP address before the other containers start; its network namespace is then shared by all containers in the Pod. This container is not visible within Kubernetes, but can be discovered with container engine tools. The IP persists through the life of the Pod.


Init container

InitContainer runs (must successfully complete) before main application container. Multiple init containers can be specified, in which case they run sequentially (in Pod spec order). Primary reasons are setting up environment, separating duties (different storage and security settings) and environment verification (block main application start up if environment is not properly set up).

spec:
  containers:
  - name: main-app
    image: databaseD
  initContainers:
  - name: wait-database
    image: busybox
    command: ['sh', '-c', 'until ls /db/dir ; do sleep 5; done; ']

Static Pod

Static Pod is managed by kubelet (not API server) on Nodes. Pod's manifest is placed on a specific location on a node, staticPodPath, that kubelet is continuously watching. Default is /etc/kubernetes/manifests. Kubelet's configuration - /var/lib/kubelet/config.yaml.

The kubelet creates a mirror Pod so that the static Pod is visible to API server commands. However, deleting the mirror Pod through the API server does not affect the static Pod, and the mirror Pod will reappear.

Control plane core components - etcd, API server, controller manager, scheduler - run as static Pods.

Resources

Kubernetes does not manage resources directly at the container level, but resources can be managed through the resources section of the PodSpec, where the CPU and memory a container can consume are specified. A ResourceQuota object can set hard and soft limits (and for more resource types) in a namespace, thus across multiple objects.

Example:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  labels:
    app: hog
  name: hog
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: hog
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: hog
    spec:
      containers:
      - image: vish/stress
        imagePullPolicy: Always
        name: stress
        resources:
          limits:
            cpu: "1"
            memory: "2Gi"
          requests:
            cpu: "0.5"
            memory: "500Mi"
        args:
        - -cpus
        - "2"
        - -mem-total
        - "950Mi"
        - -mem-alloc-size
        - "100Mi"
        - -mem-alloc-sleep
        - "1s"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

Probes

Probes let you run custom health checks on container(s) in a Pod.

  • livenessProbe is a continuous check to see if a container is running. Restart policy is applied on failure event.
  • readinessProbe is diagnostic check to see if a container is ready to receive requests. On failure event Pod's IP address is removed from all service endpoints by the endpoint controller (restart policy is not applied). Usually used to protect applications that temporary can't serve requests.
  • startupProbe - one-time check during the startup process, ensuring containers are in a Ready state. All other probes are disabled until startupProbe succeeds. On failure the restart policy is applied. Usually used for applications requiring long startup times.

Probes can be defined using 3 types of handlers: command, HTTP, and TCP.

  • command's exit code of zero is considered healthy:
     exec:
       command:
       - cat
       - /tmp/ready
  • HTTP GET request return code >= 200 and < 400:
     [...]
     httpGet:
       path: /healthz
       port: 8080
  • successful attempt establishing TCP connection:
     [...]
     tcpSocket:
       port: 8080

Settings:

Name Default Description
initialDelaySeconds 0s number of seconds after a container has started before running probes
periodSeconds 10s how frequently to run probes
timeoutSeconds 1s execution time for a probe before declaring a failure (probe returns Unknown status)
failureThreshold 3 number of missed checks to declare a failure
successThreshold 1 number of successful probes after a failure to consider a container healthy
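
Putting handlers and settings together, a minimal sketch (image, path and port are assumptions):

spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
    readinessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:
      tcpSocket:
        port: 80
      timeoutSeconds: 1
      failureThreshold: 3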

Lifecycle

Stopping/terminating a Pod: when a stop command is sent, SIGTERM is sent to the containers and the Pod's status is set to Terminating. If a container is not terminated by the end of the grace period timer (default 30s), SIGKILL is sent, and the API server and etcd are updated.

# if termination is stuck
# to immediately delete records from API and etcd
# still have to clean up resources manually
$ kubectl delete pod <name> --grace-period=0 --force

Container(s) in a Pod can restart independent of the Pod. Restart process is protected by exponential backoff - 10s, 20s, 40s and up to 5m. Resets to 0s after 10m of continuous successful run.

Restart policy:

  • Always (default) - restarts all containers in a Pod, if one stops running
  • OnFailure - restarts only on non-graceful termination (non-zero exit codes)
  • Never

Controllers

Controllers or operators are series of watch-loops that request API server for a particular object state and modify the object until desired state is achieved. Kubernetes comes with a set of default controllers, while others can be added using custom resource definitions.

ReplicaSet

Deploy and maintain defined number of Pods. Usually, not used by itself, but through Deployment controller. Consists of a selector, number of replicas and Pod spec.

ReplicaSet selector can be of matchLabels or matchExpression type. The latter allows the use of operators In, NotIn, Exists and DoesNotExist.

matchExpressions:
- key: foo
  operator: In
  values:
  - bar

Deployment

Manages the state of a ReplicaSet and the Pods within, thus providing flexibility with upgrades and administration. A rolling upgrade is performed by creating a second ReplicaSet and increasing/decreasing Pods in the two sets. It is also possible to roll back to a previous version, pause the deployment and make changes.

Designed for stateless applications, like web front end that doesn't store data or application state to a persistent storage.

Changes in configuration file automatically trigger rolling updates. You can pause, resume and check status of this behaviour. Exit code of 0 for status command indicates success, while non-zero - failure.

$ kubectl rollout [pause|resume|status] deployment <name>

$ kubectl rollout history deployment <name>
# get detailed info
$ kubectl rollout history deployment <name> --revision=<number>

# roll back to a previous version
$ kubectl rollout undo deployment <name>
# roll back to a specific version
$ kubectl rollout undo deployment <name> --to-revision=<number>

Restart all Pods: a new ReplicaSet is created with the same Pod spec, and the specified update strategy is applied.

$ kubectl rollout restart deployment <name>

Pod names are constructed as <deployment_name>-<pod_template_hash>-<pod_id>. pod_template_hash is the unique ReplicaSet hash within the Deployment; pod_id is the unique Pod identifier within the ReplicaSet.

Create Deployment:

  • declaratively via YAML file:
     $ kubectl apply -f [DEPLOYMENT_FILE]
  • imperatively using kubectl create deployment command:
     # --generator sets the api version to be used
     # --save-config saves the yaml config for future use
     $ kubectl create deployment <name> \
     			--image [IMAGE]:[TAG] \
     			--replicas 3 \
     			--labels [KEY]=[VALUE] \
     			--port 8080 \
     			--generator deployment/apps.v1 \
     			--save-config

Scaling

$ kubectl scale deployment [DEPLOYMENT_NAME] --replicas=5 # manual scaling
$ kubectl autoscale deployment [DEPLOYMENT_NAME] \
				--min=5 \
				--max=15 \
				--cpu-percent=75 # autoscale based on cpu threshold

Autoscaling creates HorizontalPodAutoscaler object. Autoscaling has a thrashing problem, that is when the target metric changes frequently, which results in frequent up/down scaling. Use --horizontal-pod-autoscaler-downscale-delay option to control this behavior (by specifying a wait period before next down scale; default is 5 minute delay).


Update strategy

RollingUpdate (default) strategy - new ReplicaSet starts scaling up, while old one starts scaling down. maxUnavailable specifies number of Pods from the total number in a ReplicaSet that can be unavailable, maxSurge specifies number of Pods allowed to run concurrently on top of total number of replicas in a ReplicaSet. Both can be specified as a number of Pods or percentage.

Recreate strategy - all old Pods are terminated before new ones are created. Used when two versions can't run concurrently.

Other strategies that can be implemented:

  • blue/green deployment - create a completely new Deployment of the application with the new version. Traffic can be redirected using Services. Good for testing; the disadvantage is doubled resource usage.
  • canary deployment - based on blue/green, but traffic is shifted gradually to the new version. This is achieved by keeping the Service selector version-agnostic and gradually adding Pods of the new version.

Related settings:

  • progressDeadlineSeconds - time in seconds until a progress error is reported (image issues, quotas, limit ranges)
  • revisionHistoryLimit (default 10) - how many old ReplicaSet specs to keep for rollback

StatefulSet

Managing stateful application with a controller. Provides network names, persistent storage and ordered operations for scaling and rolling updates.

Each Pod maintains a persistent identity and has an ordinal index with a corresponding Pod name, stable hostname, and stably identified storage. The ordinal index is a unique sequential number given to each Pod, representing its order in the sequence of Pods. Deployment, scaling and updates are performed based on this index; for example, the second Pod waits until the first one is ready and running before it is deployed. Scaling down and updates happen in reverse order. This can be changed via the Pod management policy, where OrderedReady is the default and can be switched to Parallel. Each Pod has its own unique PVC, which uses the ReadWriteOnce access mode.

StatefulSets require a Service to control their networking. example.yml specifies a headless Service with no load balancing by setting the clusterIP: None option.

Examples are database workloads, caching servers, application state for web farms.

Naming must be persistent and consistent, as stateful application often needs to know exactly where data resides. Persistent storage ensures data is stored and can be retrieved later on. Headless service (without load balancer or cluster IP) allows applications to use cluster DNS to locate replicas by name.
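
A minimal sketch of a StatefulSet with its headless Service and a PVC template (names, image and storage size are assumptions):

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  clusterIP: None          # headless service
  selector:
    app: web
  ports:
  - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web         # must reference the headless service
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx
        volumeMounts:
        - name: data
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:    # each replica gets its own PVC (web-0, web-1, ...)
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi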

DaemonSet

Ensures that a specific single Pod is always running on all or some subset of the nodes. If new nodes are added, DaemonSet will automatically set up Pods on those nodes with the required specification. The word daemon is a computer science term meaning a non-interactive process that provides useful services to other processes.

Examples include logging (fluentd), monitoring, metric and storage daemons.

RollingUpdate (default) update strategy terminates old Pods and creates new in their place. maxUnavailable can be set to integer or percentage value, default is 1. In OnDelete strategy old Pods are not removed automatically. Only if administrator removes them manually, new Pods are created.
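
A minimal DaemonSet sketch for a logging agent (image and toleration are assumptions; the toleration lets it also run on control plane nodes carrying that taint):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent
spec:
  selector:
    matchLabels:
      app: log-agent
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      tolerations:
      # assumption: control plane nodes are tainted with this key
      - key: node-role.kubernetes.io/control-plane
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd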

Job

Define, run and ensure that specified number of Pods successfully terminate.

restartPolicy must be set to either OnFailure or Never, since default policy is Always. In case of restart failed Pods are recreated with an exponentially increasing delay: 10, 20, 40... seconds, to a maximum of 6 minutes.

No matter how Job completes (success or failure) Pods are not deleted (for logs and inspection). Administrator can delete Job manually, which will also delete Pods.

  • activeDeadlineSeconds - max duration time, has precedence over backoffLimit
  • backoffLimit - number of retries before being marked as Failed, defaults to 6
  • completions - number of Pods that need to finish successfully
  • parallelism - max number of Pods running in a Job simultaneously
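
The options above combined in a minimal Job sketch (name, image and values are assumptions):

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  completions: 3             # three successful Pods required
  parallelism: 2             # at most two Pods running at a time
  backoffLimit: 4
  activeDeadlineSeconds: 300
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum", "-wle", "print bpi(2000)"]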

Non-parallel

Creates only one pod at a time; job is completed when pod terminates successfully or, if completion counter is defined, when the required number of completions is performed.

Execute from cli:

$ kubectl create job pi --image perl -- perl -Mbignum -wle 'print bpi(2000)'

Parallel

Parallel job can launch multiple pods to run the same task. There are 2 types of parallel jobs - fixed task completion count and the other which processes a work queue.

Work queue is created by leaving completions field empty. Job controller launches specified number of Pods simultaneously and waits until one of them signals successful completion. Then it stops and removes all pods.

For a Job with both completions and parallelism set, the controller won't start new containers if the remaining number of completions is less than the parallelism value.

CronJob

Create and manage Jobs on a defined schedule. CronJob is created at the time of submission to API server, but Job is created on schedule.

  • suspend - set to true to not run Jobs anymore
  • concurrencyPolicy - Allow (default), Forbid, or Replace. Depending on how frequently jobs are scheduled and how long it takes to finish a Job, CronJob might end up executing more than one job concurrently.

In some cases may not run during a time period or run twice, thus, requested Pod should be idempotent.

Kubernetes retains number of successful and failed jobs in history, which is by default 3 and 1 respectively. Options successfulJobsHistoryLimit and failedJobsHistoryLimit may be used to control this behavior. Deleting CronJob also deletes all Pods.
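
A minimal CronJob sketch (schedule, name and image are assumptions; older clusters use batch/v1beta1 instead of batch/v1):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"            # every day at 02:00
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: report
            image: busybox
            command: ["sh", "-c", "date; echo generating report"]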

Scheduling

The job of a scheduler is to assign new Pods to nodes. Default is kube-scheduler, but a custom one can be written and set.

Node selection goes through 3 stages:

  • Filtering - remove nodes that can not run the Pod (apply hard constraints, such as available resources, nodeSelectors, etc)
  • Scoring - gather list of nodes that can run the Pod (apply scoring functions to prioritize node list for the most appropriate node to run workload); ensure Pods of the same service are spread evenly across nodes, node affinity and taints are also applied
  • Binding - updating node name in Pod's object

PriorityClass objects and the priorityClassName Pod setting can be used to evict lower priority Pods so that higher priority ones can be scheduled (the scheduler determines a node where a pending Pod could run if one or more lower priority Pods were evicted). A Pod Disruption Budget can limit the number of Pods to be evicted and ensure enough Pods are running at all times, but it can still be violated by the scheduler if no other option is available.

End result of a scheduling process is assigning a Binding (Kubernetes API object in api/v1 group) to a Pod that specifies where it should run. Can also be assigned manually without any scheduler.

To manually schedule a Pod to a node (bypass scheduling process) specify nodeName (node must already exist); resource constraints still apply. This way a Pod can still run on a cordoned node, since scheduling is basically disabled and node is assigned directly.

Custom scheduler can be implemented; also multiple schedulers can run concurrently. Custom scheduler is packed and deployed as a system Pod. Default scheduler code. Define which scheduler to use in Pod's spec, if none specified, default is used. If specified one isn't running, the Pod remains in Pending state.

# View scheduler and other info
$ kubectl get events

Scheduling policy

Priorities are functions used to weight resources. By default, node with the least number of Pods will be ranked the highest (unless SelectorSpreadPriority is set). ImageLocalityPriorityMap favors nodes that already have the container image. cp/pkg/scheduler/algorithm/priorities contains the list of priorities.

Example file for a scheduler policy:

kind: Policy
apiVersion: v1
predicates:
  - name: MatchNodeSelector
    order: 6
  - name: PodFitsHostPorts
    order: 2
  - name: PodFitsResources
    order: 3
  - name: NoDiskConflict
    order: 4
  - name: PodToleratesNodeTaints
    order: 5
  - name: PodFitsHost
    order: 1
priorities:
  - name: LeastRequestedPriority
    weight : 1
  - name: BalancedResourceAllocation
    weight : 1
  - name: ServiceSpreadingPriority
    weight : 2
  - name: EqualPriority
    weight : 1
hardPodAffinitySymmetricWeight: 10

Typically passed as --policy-config-file and --scheduler-name parameters. This would result in 2 schedulers running in a cluster. Client can then choose one in Pods spec.

Node selector

Assign labels to nodes and use nodeSelector on Pods to place them on certain nodes. Simple key/value check based on matchLabels. Usually used to apply hardware specification (hard disk, GPU) or workload isolation. All selectors must be met, but node could have more labels.

nodeName could be used to schedule a Pod to a specific single node.
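
A small sketch of labeling a node and selecting it in a Pod spec (the disktype label is an assumption):

# label the node
$ kubectl label node <node_name> disktype=ssd

Then reference the label in the Pod spec:

spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: app
    image: nginx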

Affinity

Like node selector uses labels on nodes to make scheduling decisions, but with matchExpressions. matchLabels can still be used with affinity as well for simple matching.

  • nodeAffinity - use labels on nodes (should some day replace nodeSelector)
  • podAffinity - try to schedule Pods together using Pod labels (same nodes, zone, etc)
  • podAntiAffinity - keep Pods separately (different nodes, zones, etc)

Scheduling conditions:

  • requiredDuringSchedulingIgnoredDuringExecution - Pod is scheduled only if all conditions are met (hard rule)
  • preferredDuringSchedulingIgnoredDuringExecution - Pod gets scheduled even if a node with all matching conditions is not found (soft rule, preference); weight 1 to 100 can be assigned to each rule

Affinity rules use the In, NotIn, Exists, and DoesNotExist operators. The matching label is required when the Pod is scheduled, but not if the label is later removed. However, requiredDuringSchedulingRequiredDuringExecution is planned for the future.

Schedule caching Pod on the same node as a web server Pod.

spec:
  containers:
    - name: cache
  ...
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - webserver
        topologyKey: "kubernetes.io/hostname"

Taints and tolerations

Opposite of selectors: keeps Pods from being placed on certain nodes. Taints prevent Pods from being scheduled on a node, while tolerations allow a Pod to ignore a taint and be scheduled as normal.

$ kubectl taint nodes <node_name> <key>=<value>:<effect>
$ kubectl taint nodes <node_name> key1=value1:NoSchedule
# Remove a taint
$ kubectl taint nodes <node_name> <key>:<effect>-

Effects:

  • NoSchedule - do not schedule Pod on a node, unless toleration is present, existing Pods continue to run
  • PreferNoSchedule - try to avoid particular node, running pods are unaffected
  • NoExecute - evacuate all existing Pods, unless one has a toleration, and do not schedule new Pods; tolerationSeconds can specify for how long a Pod can run before being evicted, in certain cases kubelet could add 300 seconds to avoid unnecessary evictions

The default operator is Equal. When the operator is Exists, no value should be specified; a toleration with an empty key and the Exists operator tolerates every taint. If the effect is not specified, but a key and operator are declared, all effects are matched.

All parts have to match to the taint on the node:

spec:
  containers:
  ...
  tolerations:
  - key: <key>
    operator: "Equal"
    value: <value>
    effect: NoSchedule

Node cordoning

Marks node as unschedulable, preventing new Pods from being scheduled, but does not remove already running Pods. Used as preparatory step before reboot or maintenance.

$ kubectl cordon <node>
# gracefully evict Pods
# optionally ignore DaemonSets, since f.e. kube-proxy is deployed as a DaemonSet
$ kubectl drain <node> --ignore-daemonsets

Standalone Pods (not managed by a controller) won't be removed by draining the node; add the --force option to remove them.

Autoscaling

Horizontal Pod Autoscaler

Automatically scales ReplicationControllers, ReplicaSets, or Deployments based on a target metric (50% CPU utilization by default).
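
A declarative sketch of the same idea (assumes the autoscaling/v2 API and a Deployment named web):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50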


Cluster Autoscaler

Adds or removes a node based on inability to deploy Pods or having low utilized nodes for at least 10 minutes.


Vertical Pod Autoscaler

In development. Adjust the amount of CPU and memory requested by Pods.

Resource management

ResourceQuota

Define limits for total resource consumption in a namespace. Applying ResourceQuota with a limit less than already consumed resources doesn't affect existing resources and objects consuming them.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: storagequota
spec:
  hard:
    persistentvolumeclaims: "10"
    requests.storage: "500Mi"

LimitRange

Define limits for resource consumption per objects. For example:

  • min/max compute resource per Pod or container
  • min/max storage request per PersistentVolumeClaim
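
A minimal LimitRange sketch for per-container defaults and maximums (values are assumptions):

apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
spec:
  limits:
  - type: Container
    default:               # applied when a container defines no limits
      cpu: "500m"
      memory: "256Mi"
    defaultRequest:        # applied when a container defines no requests
      cpu: "250m"
      memory: "128Mi"
    max:
      cpu: "1"
      memory: "1Gi"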

Observability

Namespace

Namespaces can abstract a single physical cluster into multiple virtual clusters. They provide scope for naming resources like Pods, controllers and Deployments, and are primarily used for resource isolation/organization. Users can create namespaces, while Kubernetes has four default ones:

  • default - for objects with no namespace defined
  • kube-system - for objects created by Kubernetes itself (ConfigMaps, Secrets, controllers, Deployments); by default these items are excluded from kubectl output (but can be viewed explicitly)
  • kube-public - for objects publicly readable by all users, even unauthenticated ones
  • kube-node-lease - for worker node lease info

Creating a Namespace also creates DNS subdomain <ns_name>.svc.<cluster_domain>.

Can also be used as a security boundary for RBAC or a naming boundary (the same resource name can exist in different namespaces). A given object can exist in only one namespace. Not all objects are namespaced (generally physical objects like PersistentVolumes and Nodes).

$ kubectl api-resources --namespaced=true
$ kubectl api-resources --namespaced=false

A namespace can be set for a resource on the command line or in the manifest file; the command-line option is preferred so that the manifest stays more flexible.

Deleting namespace deletes all resources inside of it as well.

Namespace is defined in metadata section on an object.

apiVersion:
kind:
metadata:
  namespace:

Some related commands:

  • create namespace

     $ kubectl create namespace [NAME]
  • deploy object in a specific namespace, append -n [NAMESPACE_NAME]

Set namespace for all subsequent commands:

$ kubectl config set-context --current --namespace=<insert-namespace-name-here>
# Validate it
$ kubectl config view --minify | grep namespace:


Label

Labels enable managing objects or collection of objects by organizing them into groups, including objects of different types. Label selectors allow querying/selecting multiple objects. Kubernetes also leverages labels for internal operations.

Non-hierarchical key/value pairs (up to 63/253 characters long). Can be assigned at creation time or be added/edited later.

$ kubectl label <object> <name> <key1>=<value1> <key2>=<value2> 
$ kubectl label <object> <name> <key1>=<value1> <key2>=<value2> --overwrite
$ kubectl label <object> --all <key>=<value>
# delete
$ kubectl label <object> <name> <key>-

# output additional column with all labels
$ kubectl get <object> --show-labels
# specify columns (labels) to show
$ kubectl get <object> -L <key1>,<key2>

# query
$ kubectl get <object> --selector <key>=<value>
$ kubectl get <object> -l '<key1>=<value1>,<key2>!=<value2>'
$ kubectl get <object> -l '<key1> in (<value1>,<value2>)'
$ kubectl get <object> -l '<key1> notin (<value1>,<value2>)'

Controllers and Services match Pods using labels. Pod scheduling (f.e. based on hardware specification, SSD, GPU, etc) uses labels as well.

Deployment and Service example, all labels must match:

kind: Deployment
...
spec:
  selector:
    matchLabels:
      <key>: <value>
  ...
  template:
    metadata:
      labels:
        <key>: <value>
    spec:
      containers:
---
kind: Service
...
spec:
  selector:
    <key>: <value>

Labels are also used to schedule Pods on a specific Node(s):

kind: Pod
...
spec:
  nodeSelector:
    <key>: <value>

Annotation

Annotations include object's metadata that can be useful outside cluster's object interaction. For example, timestamp, pointer to related objects from other ecosystems, developer's email responsible for the object.

Used to add additional info about objects in the cluster. Mostly used by people or third-party applications. Non-hierarchical key/value pairs (up to 63 characters, 256KB). Can't be used for querying/selecting.

Operations:

  • manifest file
kind: Pod
...
metadata:
  annotations:
    owner: Max
  • cli
$ kubectl annotate <object_type> <name> <key>=<value>
$ kubectl annotate <object_type> --all <key>=<value> --namespace <name>
$ kubectl annotate <object_type> <name> <key>=<new_value> --overwrite
# delete
$ kubectl annotate <object_type> <name> <key>-

Network

All Pods can communicate with each other on all nodes. Software (agents) on a given node can communicate with all Pods on that node.

Network types:

  • node (real infrastructure)
  • pod - implemented by network plugin, IPs are assigned from PodCidrRange, but could also be assigned from the node network
  • cluster - used by Services using ClusterIP type, assigned from ServiceClusterIpRange parameter from API server and controller manager configurations

Pod-to-Pod communication on the same node goes through bridge interface. On different nodes could use Layer 2/Layer 3/overlay options. Services are implemented by kube-proxy able to expose Pods both internally and externally.

The pause/infrastructure container starts first and sets up the namespaces and network stack inside a Pod, which are then used by the application container(s). This lets containers restart without interrupting the network namespace. The pause container has the lifecycle of the Pod (created and deleted along with the Pod).

Container Network Interface (CNI) is abstraction for implementing container and pod networking (setting namespaces, interfaces, bridge configurations, IP addressing). CNI sits between k8s and container runtime. CNI plugins are usually deployed as Pods controlled by DaemonSets running on each node.

Expose container directly to the client:

$ kubectl port-forward <pod_name> <localhost_port>:<pod_port>

DNS

DNS is available as a Service in a cluster, and Pods by default are configured to use it. Provided by CoreDNS (since v1.13). Configuration is stored as ConfigMap coredns in kube-system namespace, which is mounted to coredns Pods as /etc/coredns/Corefile. Updates to ConfigMap get propagated to CoreDNS Pods in about 1-2 minutes - check logs for reload message. More plugins can be enabled for additional functionality.

dnsPolicy settings in Pod spec can be set to the following:

  • ClusterFirst (default) - send DNS queries with cluster prefix to coredns service
  • Default - inherit node's DNS
  • None - specify DNS settings via another parameter, dnsConfig
     spec:
       dnsPolicy: "None"
       dnsConfig:
         nameservers:
         - 9.9.9.9

A records:

  • for Pods - <ip_in_dash_form>.<namespace>.pod.cluster.local
  • for Services - <service_name>.<namespace>.svc.cluster.local

Troubleshooting DNS can be done by creating a Pod with network tools, creating a Service and running a DNS lookup (other tools dig, nc, wireshark):

$ nslookup <service_name> <kube-dns_ip>

Traffic can access a Service using a name, even in a different namespace just by adding a namespace's name:

# will fail if service is in different namespace
$ curl <service_name>

# works across namespaces
$ curl <service_name>.<namespace>

Service

Provides persistent endpoint for clients (virtual IP and DNS). Load balances traffic to Pods and automatically updates during Pod controller operations. Labels and selectors are used to determine which Pods are part of a Service. Default and popular implementation is kube-proxy on the node's iptables.

Acts as a network abstraction for Pod access. Allows communication between sets of deployments. A unique ID is assigned at creation time, which can be used by other Pods to talk to each other.

Service is an operator inside kube-controller-manager, which sends requests via kube-apiserver to the network plugin (f.e. Calico) and to kube-proxy on the nodes. It also creates an Endpoints operator, which queries Pods with a specific label for their ephemeral IPs.

Imperatively create a new Service (NodePort type):

# create a service
$ kubectl expose deployment <name> \
	--type NodePort \
	--port 80 \
	--target-port 8080
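
A declarative sketch of roughly the same Service (name and selector label are assumptions):

apiVersion: v1
kind: Service
metadata:
  name: foo
spec:
  type: NodePort
  selector:
    app: foo
  ports:
  - port: 80               # Service (ClusterIP) port
    targetPort: 8080       # container port
    # nodePort: 30080      # optional; auto-assigned from 30000-32767 if omitted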

Service lists endpoints which are individual IP:PORT pairs of underlying pods. Can access directly for troubleshooting.

service/kubernetes is the API server's Service.

A Service is a controller that listens to the endpoint controller to provide a persistent IP for Pods. It sends messages (settings) via the API server to kube-proxy on every node and to the network plugin. It also handles access policies for inbound requests.

Creates an Endpoint object. See the routing IPs:

$ kubectl describe endpoints <service_name>

Each Service gets a DNS A/AAAA record in cluster DNS in the form <svc_name>.<namespace>.svc.<cluster_domain>.

The kubectl proxy command creates a local proxy allowing requests to the Kubernetes API:

$ kubectl proxy &
# access foo service
$ http://localhost:8001/api/v1/namespaces/default/services/foo
# if service has a port_name configured
$ http://localhost:8001/api/v1/namespaces/default/services/foo:<port_name>

ClusterIP

Default Service type. Exposes the Service on a cluster-internal IP (exists in iptables on the nodes). The IP is chosen from a range specified by the ServiceClusterIPRange parameter in both the API server and controller manager configurations. If a Service is created before its corresponding Pods, the Pods get its hostname and IP address as environment variables.


NodePort

Exposes the Service on the IP address of each node in the cluster at a specific port number, making it available outside the cluster. Built on top of ClusterIP Service - creates ClusterIP Service and allocates port on all nodes with a firewall rule to direct traffic on that node to ClusterIP persistent IP. NodePort option is set automatically from the range 30000 to 32767 or can be specified by user, if it falls within the range.

Regardless of which node is requested, traffic is routed to the ClusterIP Service and then to the Pod(s) (all implemented by kube-proxy on the node).


LoadBalancer

Exposes the service externally, using a load balancing service provided by a cloud provider or add-on.

Creates a NodePort Service and makes an asynchronous request to the cloud provider for a load balancer. If the listener does not answer (no load balancer is created), it stays in Pending state.

With GKE it is implemented using GCP's network Load Balancer. GCP assigns a static IP address to the load balancer, which directs traffic to a (random) node. kube-proxy then chooses a random Pod, which may reside on a different node, to ensure even balance (this is the default). The response takes the same route back. Use the externalTrafficPolicy: Local option to disable this behaviour and force kube-proxy to direct traffic to local Pods.


ExternalName

Provides service discovery for external services. Kubernetes creates a CNAME record for external DNS record, allowing Pods to access external services (does not have selectors, defined Endpoints or ports).
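
A minimal ExternalName sketch (the external DNS name is an assumption):

apiVersion: v1
kind: Service
metadata:
  name: external-db
spec:
  type: ExternalName
  externalName: db.example.com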


Headless services

Allows interfacing with other service discovery mechanisms (not tied to Kubernetes). Define by explicitly specifying None in spec.clusterIP field. ClusterIP is not allocated and kube-proxy does not handle this Service (no load balancing nor proxying).

Service with selectors - Endpoint controller creates endpoint records and modifies DNS config to return A records (IP addresses) pointing directly to Pods. Client decides which one to use. Often used with stateful applications.

Service without selectors - no Endpoints are created. DNS config may look for CNAME record for ExternalName type or any Endpoint records that share a name with a Service (Endpoint object(s) needs to be created manually, and can also include external IP).

Endpoint

Usually not managed directly, represents IPs for Pods that match particular service. If Endpoint is empty, meaning no matching Pods, service definition might be wrong.

Ingress

Consists of an Ingress object describing various rules on how HTTP traffic gets routed to Services (and ultimately to Pods) and an Ingress controller (daemon in a Pod) watching for new rules (/ingresses endpoint in the API server). A cluster may have multiple Ingress controllers. An Ingress class or annotation can be used to associate an object with a particular controller (a default class can also be created). Absence of an Ingress class or annotation will cause every controller to try to satisfy the traffic.

Both L4 and L7 can be configured. Ingress also provides load balancing directly to Endpoint bypassing ClusterIP. Name-based virtual hosting is available via host header in HTTP request. Also provides path-based routing and TLS termination.

Ingress controller can be implemented in various ways: nginx Pods, external hardware (f.e. Citrix), cloud-ingress provider (f.e. AppGW, AWS ALB).

Main difference with a LoadBalancer Service is that this resource operates on level 7, which allows it to provide name-based virtual hosting, path-based routing, tls termination and other capabilities.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
spec:
  ingressClassName: nginx
  # non-matching traffic or when no rules are defined
  defaultBackend:
    service:
      name: example-service
      port:
        number: 80

Currently 3 Ingress Controllers are supported: AWS, GCE, nginx. Nginx Ingress setup.
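
A sketch of host- and path-based routing rules (host and service names are assumptions):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: routing-example
spec:
  ingressClassName: nginx
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-service
            port:
              number: 80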

Storage

Rook is a storage orchestration solution.

Kubernetes provides storage abstractions as volumes and persistent volumes. A volume is a method by which storage is attached to Pods (not containers).

A volume is persistent storage deployed as part of a Pod spec, including implementation details of the particular volume type (nfs, ebs). It has the same lifecycle as the Pod.

Volumes are declared with spec.volumes and mount points are declared with spec.containers.volumeMounts parameters.
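
A minimal sketch of declaring and mounting a volume (the emptyDir type and the names are assumptions):

spec:
  volumes:
  - name: scratch
    emptyDir: {}
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch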

Access modes:

  • ReadWriteOnce - read/write to a single node
  • ReadOnlyMany - read-only by multiple nodes
  • ReadWriteMany - read/write by multiple nodes

Kubernetes groups volumes with the same access mode together and sorts them by size from smallest to largest. Claim is checked against each volume in the access mode group until matching size is found.

PersistentVolume

Storage abstraction with a lifecycle separate from the Pod. Managed by kubelet, which maps the storage on the node and exposes it as a mount.

The persistent volume abstraction has 2 components: PersistentVolume and PersistentVolumeClaim. PersistentVolume is a durable and persistent storage resource managed at the cluster level. PersistentVolumeClaims are requests/claims made by Pods to use PersistentVolumes. The user specifies volume size, access mode, and other storage characteristics. If a claim matches a volume, the claim is bound to that volume and the Pod can consume that resource. If no match can be found, Kubernetes will try to allocate one dynamically.

PersistentVolume is not a namespaced object, while PersistentVolumeClaim is.

Static provisioning workflow includes manually creating PersistentVolume, PersistentVolumeClaim, and specifying volume in Pod spec.

PersistentVolume:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-store
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteMany
  nfs:
    ...

PersistentVolumeClaim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-store
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
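
A sketch of a Pod consuming the claim above (container name and mount path are assumptions):

spec:
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: nfs-store
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /usr/share/nginx/html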

Reclaim policy

When PersistentVolumeClaim object is deleted, PersistentVolume may be deleted depending on reclaim policy.

With Retain reclaim policy PersistentVolume is not reclaimed after PersistentVolumeClaim is deleted. PersistentVolume changes to Released status. Creating new PersistentVolumeClaim doesn't provide access to that storage, and if no other volume is available, claim stays in Pending state.


Dynamic provisioning

StorageClass resource allows admin to create a persistent volume provisioner (with type specific configurations). User requests a claim, and API server auto-provisions a PersistentVolume. The resource is reclaimed according to reclaim policy stated in StorageClass.

Dynamic provisioning workflow includes creating a StorageClass object and PersistentVolumeClaim pointing to this class. When a Pod is created, PersistentVolume is dynamically created. Delete reclaim policy in StorageClass will delete a PersistentVolume, if PersistentVolumeClaim is deleted.

StorageClass

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: main
provisioner:
  kubernetes.io/azure-disk
parameters:
  ...

GKE has a default standard storage class that uses Compute Engine standard Persistent Disk. In GKE PVC with no defined storage class will use the standard one.
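
A PersistentVolumeClaim referencing the StorageClass above triggers dynamic provisioning (size is an assumption):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamic-claim
spec:
  storageClassName: main
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi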

emptyDir

Simply an empty directory that can be mounted to a container in a Pod. When the Pod is destroyed, the directory is deleted. Kubernetes creates an emptyDir volume from a node's local disk or using a memory-backed file system (tmpfs).

ConfigMap

Can ingest data from a literal value, from a file or from a directory of files.

Provides a way to inject application configuration data into pods. Can be referenced in a volume.

Used to store config files, command line arguments, environment variables, port number, etc.

kubelet periodically syncs with ConfigMaps to keep ConfigMap volume up to date. Data is updated even if it is already connected to a pod (matter of seconds-minutes).

Volume ConfigMap can be updated. Can be set as immutable, meaning can't be changed once created. Namespaced.

Creating individual key value pairs, f.e. via --from-literal, results in a map, while --from-file results in a single key with file's contents as a value.

apiVersion: v1
kind: ConfigMap
metadata:
  name: app1config
data:
  key: value

ConfigMap from command line:

# create
$ kubectl create configmap [NAME] [DATA]
$ kubectl create configmap [NAME] --from-file=[KEY_NAME]=[FILE_PATH]

# examples
$ kubectl create configmap demo --from-literal=lab.difficulty=easy
$ kubectl create configmap demo --from-file=color.properties
$ cat color.properties
color.good=green
color.bad=red

Pods can refer to ConfigMap in 3 ways:

  • environment variables - valueFrom specifies each key individually, envFrom creates a variable for each key/value pair in ConfigMap
     spec:
       containers:
       - name: app1
         env:
         - name: username
           valueFrom:
             configMapKeyRef:
               name: app1config
               key: username
         envFrom:
         - configMapRef:
             name: app1env
  • volume - depending on how ConfigMap is created, could result in one file with many values or many files with a value in each one
     spec:
       volumes:
       - name: app1config
         configMap:
           name: app1config
       containers:
       - name: app1
         volumeMounts:
         - name: app1config
           mountPath: /etc/config
  • Pod commands

System components and controllers can also use ConfigMaps.

Secret

Similar to ConfigMap, but used to store sensitive data.

In case of passing values from files, each file should have a single entry. The file's name serves as the key, while the value is its content.

Kubelet syncs Secret volumes just as ConfigMap volumes.

By default values are only base64 encoded, but encryption at rest can also be configured. The Secret resource is namespaced, and only Pods in the same namespace can reference a given Secret.

Values passed will be base64 encoded strings (check result with commands below):

$ echo -n "admin" | base64
$ echo -n "password" | base64
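
A sketch of a Secret manifest carrying those encoded values (the Secret name and keys are illustrative):

apiVersion: v1
kind: Secret
metadata:
  name: app1
type: Opaque
data:
  USERNAME: YWRtaW4=
  PASSWORD: cGFzc3dvcmQ=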

Secret types:

  • generic - creating secrets from files, directories or literal values
  • TLS - private-public encryption key pair; pass Kubernetes the public key certificate encoded in PEM format, and also supply the private key of that certificate
  • docker-registry - credentials for a private docker registry (Docker Hub, cloud based container registries)
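
Each type can be created from the command line (a sketch; the names, file paths, and registry details are illustrative):

$ kubectl create secret generic app1 --from-literal=USERNAME=admin --from-literal=PASSWORD=password
$ kubectl create secret tls web-tls --cert=tls.crt --key=tls.key
$ kubectl create secret docker-registry regcred --docker-server=<registry> --docker-username=<user> --docker-password=<password>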

Can be exposed to a Pod as an environment variable or as a volume/file, the latter being updated and reflected in a running Pod. A Secret can be marked as immutable - it cannot be changed after creation. A Pod using such a Secret must also be deleted to be able to read a new Secret with the same name and updated value.

Secrets can be specified individually or imported all at once from a Secret object, in which case keys are used as environment variable names:

spec:
  containers:
  - name: one
    env:
    - name: APP_USERNAME
      valueFrom:
        secretKeyRef:
          name: app1
          key: USERNAME
  - name: two
    envFrom:
    - secretRef:
        name: app2

Exposing as a file creates a file for each key and puts its value inside the file:

spec:
  volumes:
  - name: appconfig
    secret:
      secretName: app
  containers:
  - name: app1
    volumeMounts:
    - name: appconfig
      mountPath: /etc/appconfig

Security

Certificate authority

By default a self-signed CA is created. However, an external PKI (Public Key Infrastructure) can also be joined. The CA is used for secure cluster communications (API server) and for authentication of users and cluster components. Files are located at /etc/kubernetes/pki and distributed to each node upon joining the cluster.

Authentication and authorization

Kubernetes provides two types of users: normal users and service accounts. Normal users are managed outside Kubernetes, while service accounts are created by Kubernetes itself to provide identity for processes in Pods to interact with the Kubernetes cluster. Each namespace has a default service account.

After successful authentication, there are two main ways to authorize what an account can do: Cloud IAM and RBAC (Kubernetes role-based access control). Cloud IAM is the access control system for using cloud resources and performing operations at the project and cluster levels (outside the cluster - view and change configuration of the Kubernetes cluster). RBAC provides permissions inside the cluster at the cluster and namespace levels (view and change Kubernetes objects).

API server listens for remote requests on the HTTPS port, and every request must be authenticated before it is acted upon. API server provides 3 methods of authentication: OpenID Connect tokens, x509 client certificates, and basic authentication using static passwords. While OpenID Connect is the preferred method, the last two are disabled by default in GKE.

# check allowed action as current or any given user
$ kubectl auth can-i create deployments
$ kubectl auth can-i create deployments --as bob

Service account

Provides an identity for processes in a Pod to access the API server and perform actions. The service account's credentials (token and CA certificate) are made available to a Pod at /var/run/secrets/kubernetes.io/serviceaccount/.
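
A Pod can run under a specific service account (created, f.e., with kubectl create serviceaccount app1-sa) via serviceAccountName; a minimal sketch with illustrative names:

apiVersion: v1
kind: Pod
metadata:
  name: app1
spec:
  serviceAccountName: app1-sa
  containers:
  - name: app1
    image: nginx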

RBAC

How to configure

Built on 3 base elements: subjects (who - users or processes that can make requests to the Kubernetes API), resources (which - API objects such as pods, deployments, etc.), and verbs (what - operations such as get, watch, create).

Elements are connected together using 2 RBAC API objects: Roles (connect API resources and verbs) and RoleBindings (connect Roles to subjects). Both can be applied at the cluster or namespace level.

RBAC has Roles (defined at the namespace level) and ClusterRoles (defined at the cluster level).

get, list and watch are often used together to provide read-only access. patch and update are also usually used together as a unit.

Only get, update, delete and patch can be used on named resources.
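
A minimal sketch of a namespaced Role granting read-only access to pods, bound to a user (the names and namespace are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: default
  name: read-pods
subjects:
- kind: User
  name: bob
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io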

Monitoring and logging

Install Kubernetes dashboard (runs Pods in kubernetes-dashboard namespace):

kubectl

kubectl [command] [type] [name] [flags]

  • apply/create - create resources
  • run - start a pod from an image
  • explain - documentation of object or resource
  • delete - delete resources
  • get - list resources
  • describe - detailed resource information
  • exec - execute command on a container (in multi container scenario, specify container with -c or --container; defaults to the first declared container)
  • logs - view logs on a container

output format (-o <format>) - wide, yaml, json; combine with --dry-run=client to print the object without sending it to the API server

-v adds verbosity to the output; different levels can be set, f.e. -v=7. Levels start at 0 and can be any number, but there is no additional detail beyond 10.

# basic server info
$ kubectl config view

# list all contexts
$ kubectl config get-contexts

$ kubectl config use-context <context_name>

# context, cluster info
# useful to verify the context
$ kubectl cluster-info

# configure user credentials
# token and username/password are mutually exclusive
$ kubectl config set-credentials

kubectl interacts with kube-apiserver and uses a configuration file, $HOME/.kube/config, as the source of server information and authentication. A context is a combination of a cluster and user credentials. It can be passed as CLI parameters, or the shell context can be switched:

$ kubectl config use-context <context>

List known API resources (short names are also shown):

$ kubectl api-resources
$ kubectl api-resources --api-group=apps

# list api versions and groups
$ kubectl api-versions | sort

All commands can be supplied with the object type and a name separately or in the form object_type/object_name.

--watch gives output over time (updates when the status changes)
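
For example (the pod name is illustrative):

$ kubectl get pod nginx
$ kubectl get pod/nginx
$ kubectl get pods --watch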

kubectl:

  • explain [object] - show built-in documentation; can also pass object's properties via dot notation, f.e. pod.spec.containers
  • --recursive - show all inner fields
  • get [object] [name] - show info on object group (pods, deployments, all)
  • --show-labels - adds an extra column with labels to the output
  • -l, --selector [label=value] - search using labels; add more labels after comma; can also be supplied to other commands, such as delete
    • label in (value1,value2) - outputs objects that satisfy the supplied list of possible values; notin operator can be used to show inverse
  • -o yaml - show full info in yaml format
  • delete [object] [name] - delete k8s object
  • describe [object] [name] - get detailed info on specific object
  • label [object] [name] [new_label=value] - add a label to a resource, supply --overwrite flag if changing the existing label
  • label [object] [name] [label]- - delete a label (note the trailing dash)
  • logs [pod_name] - view logs of a pod; use -c flag to specify container; contains both stdout and stderr
  • --tail=[number] - limit output to the last [number] lines
  • --since=3h - limit output based on time limit
  • exec [pod_name] -- [command] - execute commands and application on a pod; use -c flag to specify container
  • cp path/on/host [pod_name]:path/in/container - copy files
  • top nodes - resource usage (CPU/memory) of nodes

imperative

$ kubectl create deployment nginx --image=nginx
# single pod without a controller
$ kubectl run nginx --image=nginx

Use crictl to view containers running on a node (for containerd):

$ sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps

get runtime info:

$ kubectl get deployment <deployment_name> -o yaml | less
$ kubectl get deployment <deployment_name> -o json | less

--dry-run accepts either server or client.

Server-side dry runs are processed as a typical request, but aren't persisted in storage; they can fail if a syntax error is present or if the object already exists. On the client side the request is printed to stdout and is useful for validating the syntax. It can also be used to generate syntactically correct manifests:

$ kubectl create deployment nginx --image nginx --dry-run=client -o yaml

When applying a change, check the difference with the diff command. It outputs the differences between objects in the cluster and the ones defined in the manifest to stdout: kubectl diff -f manifest.yaml.

Running imperative, ad hoc commands like set, create, etc. does not leave any change information. The --record option can be used to write the command to the kubernetes.io/change-cause annotation to be inspected later on, for example, by kubectl rollout history.
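
For example (a sketch; the deployment name and image tag are illustrative):

$ kubectl set image deployment/nginx nginx=nginx:1.21 --record
$ kubectl rollout history deployment/nginx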

--cascade=orphan deletes the controller, but not the objects it has created.
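
For example (the deployment name is illustrative):

$ kubectl delete deployment nginx --cascade=orphan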

Change a manifest and object from the CLI (in the JSON parameter specify the field to change, starting from the root):

$ kubectl patch <object_type> <name> -p <json>
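
For example, patching a deployment's replica count (a sketch; the name and value are illustrative):

$ kubectl patch deployment nginx -p '{"spec": {"replicas": 3}}'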

Get privileged access to a node through an interactive debugging container (it will run in the host namespace, and the node's filesystem will be mounted at /host)

$ kubectl debug node/<name> -ti --image=<image_name>

Events

Record events

$ kubectl get events --watch &

useful notes

  • get container ids in a pod:
$ kubectl get pod <pod_name> -o=jsonpath='{range .status.containerStatuses[*]}{@.name}{" - "}{@.containerID}{"\n"}{end}'
  • fire up ubuntu pod:
$ kubectl run <name> --rm -i --tty --image ubuntu -- bash

kubeconfig

kubeconfig files define connection settings to a cluster: client certificates and the cluster API server's network location. Often the CA certificate that was used to sign the API server's certificate is also included, so the client can trust the certificate presented by the API server upon connection.

Various files are created for different components at /etc/kubernetes. admin.conf is the cluster admin account. kubelet.conf, controller-manager.conf, and scheduler.conf include the location of the API server and the client certificate to use.

References
