Kubernetes
- Architecture
- Installation and configuration
- Maintenance
- API
- Manifest
- Pod
- Controllers
- Scheduling
- Autoscaling
- Resource management
- Observability
- Monitoring
- Logs
- Network
- Storage
- Security
- kubectl
- Troubleshooting
- Related products
- References
Registration (new nodes register with a control plane and accept a workload) and service discovery (automatic detection of new services via DNS or environment variables) enable easy scalability and availability.
Containers are isolated user spaces for running application code. The user space is all the code that resides above the kernel, which includes applications and their dependencies. Abstraction is at the level of the application and its dependencies.
Containerization helps with dependency isolation and integration problem troubleshooting.
Core technologies that enhanced containerization:
- process - each process has its own virtual memory address space, separate from others
- Linux namespaces - control what an application can see (process ID numbers, directory trees, IP addresses, etc)
- cgroups - control what resources an application can use (CPU time, memory, IO bandwidth, etc)
- union file system - encapsulates an application with its dependencies
Everything in Kubernetes is represented by an object with state and attributes that a user can change. Each object has two elements: object spec (desired state) and object state (current state). All Kubernetes objects are identified by a unique name (set by the user) and a unique identifier (set by Kubernetes).
Cluster Add-on Pods provide special services in the cluster, e.g. DNS, Ingress (HTTP load balancer), dashboard. Popular options are Fluentd for logging and Prometheus for metrics.
Control plane (master) components:
Worker components (also present on control plane node):
# Get status on components
$ kubectl get componentstatuses
Exposes RESTful operations and accepts commands to view or change the state of a cluster (users interact with it via `kubectl`). Handles all calls, both internal and external. All actions are validated and authenticated. Manages cluster state stored in the etcd database, being the only component that has a connection to it.
Watches API server for unscheduled Pods and schedules them on nodes. Uses an algorithm to determine where a Pod can be scheduled: first current quota restrictions are checked, then taints, tolerations, labels, etc. Scheduling is done by simply adding the node in Pod's object data.
`pod-eviction-timeout` (default 5m) specifies a timeout after which Kubernetes should give up on a node and reschedule its Pod(s) to a different node.
Continuously monitors the cluster's state through the API server. If the current state does not match the desired one, it contacts the necessary controller to reconcile it. Multiple roles are included in a single binary:
- node controller - worker state
- replication controller - maintaining correct number of Pods
- endpoint controller - joins services and Pods together
- service account and token controller - access management
Generally controllers use a watch mechanism to be notified of changes, but they also perform a re-list operation periodically to make sure they haven't missed anything. Controllers source code.
Cluster's database (a distributed b+tree key-value store) for storing cluster state, network state and other persistent info. Instead of updating existing data, new data is always appended to the end; previous copies are marked for removal.
The `etcdctl` command provides `snapshot save` and `snapshot restore` actions.
In an HA configuration the Raft Consensus Algorithm is used to allow a group of machines to work together and survive failures of some of its members. At any given time one node acts as a leader (no node is favored), while the rest are followers.
Manages controllers that interact with external cloud providers. Documentation.
Handles container's lifecycle. Kubernetes supports the following runtimes and can use any other that is CRI (Container Runtime Interface) compliant:
- docker
- containerd - includes `ctr` for managing images and `crictl` for managing containers
- CRI-O
- frakti

# View running containers on a node
$ sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps
Each node can run a different container runtime.
Kubernetes agent on each node that interacts with the `apiserver`. Initially registers the node it is running on by creating a Node resource. Then it continuously monitors the `apiserver` for Pods scheduled on that node and starts the Pods' containers. Other duties are:
- receives `PodSpec` (Pod specification)
- passes requests to the local container runtime
- mounts volumes to Pods
- ensures access to storage, Secrets and ConfigMaps
- executes health checks for Pod/node
- reports Pod's status, events and resource consumption to the `apiserver`
`kubelet` connects to container runtimes through a plugin based interface, the Container Runtime Interface. CRI consists of protocol buffers, gRPC API, libraries, and additional specifications and tools. In order to connect to interchangeable container runtimes `kubelet` uses a CRI shim, an application that provides an abstraction layer between `kubelet` and the container runtime; `kubelet` acts as a gRPC client, while CRI provides ImageService and RuntimeService.
The `kubelet` process is managed by systemd when building a cluster with `kubeadm`. Once running, it will start every Pod found in the `staticPodPath` setting (default is `/etc/kubernetes/manifests`). View the status of `kubelet` with `systemctl status kubelet.service`; `/etc/systemd/system/kubelet.service.d/10-kubeadm.conf` is the systemd unit config. Default `kubelet` config is `/var/lib/kubelet/config.yaml`.
Also runs containers' liveness probes, and deletes containers when the Pod is deleted from the `apiserver`.
Provides network connectivity to Pods and maintains all networking rules using `iptables` entries. Works as a local load-balancer, forwards TCP and UDP traffic, and implements Services.
`kube-proxy` has 3 modes:
- userspace - iptables are modified in a way that connections go through `kube-proxy` itself (the reason the component got its name) and then get redirected to a backing Pod. This mode also ensured true round-robin load balancing.
- `iptables` - current implementation, which modifies iptables, but sends traffic directly to targets. Load balancing selects Pods randomly.
- `ipvs` (alpha) - works in the kernel space (greater speed), provides a configurable load-balancing algorithm, such as round-robin, shortest expected delay, least connection, etc. Uses IPVS kernel modules.
Services at the core are implemented by iptables rules. `kube-proxy` watches for Service and Endpoints resource updates and sets rules accordingly. If a client sends traffic to a Service, matching iptables rules substitute the destination IP with a randomly selected backing Pod's IP.
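A minimal way to look at these rules on a node (assuming `kube-proxy` runs in iptables mode, where it programs the KUBE-SERVICES chain):

# List the Service-related NAT chain programmed by kube-proxy
$ sudo iptables -t nat -L KUBE-SERVICES -n | head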
An HA cluster (stacked etcd topology) has at least 3 control plane nodes, since etcd needs a majority of members online to reach a quorum (3 members tolerate the loss of one). `apiserver` and etcd Pods run on all control plane nodes; `scheduler` and `controller-manager` also run on every control plane node, but only one instance of each is active at any given time (implemented via a lease mechanism). A load balancer in front of the `apiserver` evenly distributes traffic from worker nodes and from outside the cluster. Since `apiserver` and etcd are linked together on a given control plane node, there is linked redundancy between components.

In the external etcd topology the etcd cluster (at least 3 nodes) is set up separately from the control plane. `apiserver` references this etcd cluster, while the rest is the same as in the previous topology.
Simple leader election code example. Scheduler (or `controller-manager`) elects a leader using an Endpoints resource (or a ConfigMap). Each replica tries to write its name into a special annotation; one succeeds based on an optimistic locking mechanism, letting the others know that they should stand by. The leader also updates the resource regularly, so that other replicas know that it is still alive. If an update doesn't happen within the specified amount of time, a new election takes place.
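On recent clusters the election record can usually be inspected via Lease objects (a quick check, assuming default component names in kube-system):

# Show which replica currently holds the lease
$ kubectl get lease -n kube-system kube-scheduler -o yaml
$ kubectl get lease -n kube-system kube-controller-manager -o yaml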
Recommended minimum hardware requirements:
Node | CPU | RAM (GB) | Disk (GB) |
---|---|---|---|
master | 2 | 2 | 8 |
worker | 1 | 1 | 8 |
Cluster network ports:
Component | Default port | Used by |
---|---|---|
apiserver | 6443 | all |
etcd | 2379-2380 | API/etcd |
scheduler | 10251 | self |
controller manager | 10252 | self |
kubelet (both on Control Plane node and worker node) | 10250 | control plane |
NodePort (worker node) | 30000-32767 | all |
Self-install options:
- kubeadm
- kubespray - advanced Ansible playbook for setting up cluster on various OSs and using different network providers
- kops (Kubernetes operations) - CLI tool for creating a cluster in Cloud (AWS and GCE are officially supported; Azure, DigitalOcean etc on the way); also provisions necessary cloud infrastructure; how to
- kind - running Kubernetes locally on Docker containers
- k3s - lightweight Kubernetes cluster for local, cloud, edge, IoT deployments, originally from Rancher
- minikube - single node local Kubernetes cluster
- microk8s - local and cloud options for developers and production, from Canonical
`kubeadm init` performs the following actions in order by default (highly customizable):
- Pre-flight checks (permissions on the system, hardware requirements, etc)
- Create certificate authority
- Generate kubeconfig files
- Generate static Pod manifests (for Control Plane components)
- Wait for Control Plane Pods to start
- Taint Control Plane node
- Generate bootstrap token
  # List join tokens
  $ kubeadm token list
  # Regenerate a join token
  $ kubeadm token create
- Start add-on Pods (DNS, `kube-proxy`, etc)
`kubeadm` allows joining multiple control plane nodes with collocated etcd databases. At least 2 more instances are required for etcd to be able to determine a quorum and select a leader. A common architecture also includes a load balancer in front of the control planes.

Additional control planes are added similarly to workers, but with the `--control-plane` and `--certificate-key` parameters. A new key needs to be generated, unless secondary nodes are added within two hours of the initial bootstrapping.
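A sketch of the flow (endpoint, token, hash and key are placeholders):

# On an existing control plane node: re-upload certificates and print a new certificate key
$ sudo kubeadm init phase upload-certs --upload-certs
# On the node being added as an additional control plane
$ sudo kubeadm join <lb_endpoint>:6443 \
    --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane \
    --certificate-key <certificate_key>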
It is also possible to set up an external etcd cluster. etcd must be configured first, then certificates are copied over to the first control plane, and redundant control planes are added one at a time, fully initialized.
Container-to-container networking is implemented by the Pod concept, external-to-Pod networking is implemented by Services, while Pod-to-Pod networking is expected to be implemented outside Kubernetes by the network configuration.
Containers are integrated with Kubernetes networking model through Container Network Interface, which is a set of specifications and libraries that allow plugins to configure networking for containers. There are some core plugins, while most are 3rd-party tools. Some also implement Network policies and other additional features. Container runtime offloads the IP assignment to CNI, which connects to underlying plugin.
Overlay network (also software defined network) provides a single layer 3 network that all Pods can use for intercommunication. Popular network add-ons:
- Flannel - L3 virtual network between nodes of a cluster
- Calico - flat L3 network without IP encapsulation; policy based traffic management; `calicoctl` CLI; Felix (interface monitoring and management, route programming, ACL configuration and state reporting) and BIRD (dynamic IP routing) daemons - routing state is read by Felix and distributed to all nodes, allowing a client to connect to any node and get connected to a workload even if it is on a different node. Quickstart.
- Weave Net - multi-host network typically used as an add-on in a CNI-enabled cluster
Node is an API object outside a cluster representing a virtual/physical instance. Each node's heartbeat is recorded as a Lease object in the `kube-node-lease` namespace.
Scheduling Pods on the node can be turned off/on with `kubectl cordon`/`uncordon`.
# Remove node from cluster
# 1. remove object from API server
$ kubectl delete node <node_name>
# 2. remove cluster specific info
$ kubeadm reset
# 3. may also need to remove iptable entries
# View CPU, memory and other resource usage, limits, etc
$ kubectl describe node <node_name>
If a node is rebooted, Pods running on that node stay scheduled on it, until `kubelet`'s eviction timeout parameter (default 5m) is exceeded.
`kubeadm upgrade` subcommands:
- `plan` - check installed version against the newest in the repository, and verify that an upgrade is possible
- `apply` - upgrade the first Control Plane node to the specified version
- `diff` - show the difference that would be applied during an upgrade (similar to `apply --dry-run`)
- `node` - update `kubelet` on worker nodes or additional control plane nodes; the `phase` subcommand allows stepping through the process
Control plane node(s) should be upgraded first. Steps are similar for control plane and worker nodes.
A kubeadm-based cluster can only be upgraded by one minor version at a time (e.g. 1.16 -> 1.17).
Check available and current versions. Then upgrade `kubeadm` and verify. Drain the Pods (ignoring DaemonSets). Verify the upgrade plan and apply it. `kubectl get nodes` would still show the old version at this point. Upgrade `kubelet`, `kubectl` and restart the daemon. Now `kubectl get nodes` should output the updated version. Allow Pods to be scheduled on the node.
Same process as on the control plane, except the `kubeadm upgrade` command is different, and `kubectl` commands are still executed from the control plane node.
- view available versions
  $ sudo apt update
  $ sudo apt-cache madison kubeadm
  # view current version
  $ sudo apt list --installed | grep -i kube
- upgrade `kubeadm` on the given node
  $ sudo apt-mark unhold kubeadm
  $ sudo apt-get install kubeadm=<version>
  $ sudo apt-mark hold kubeadm
  $ sudo kubeadm version
- drain Pods (from control plane for both)
$ kubectl drain <node_name> --ignore-daemonsets
- view and apply node update
  # control plane
  $ sudo kubeadm upgrade plan
  $ sudo kubeadm upgrade apply <version>
  # worker (on the node)
  $ sudo kubeadm upgrade node
- upgrade `kubelet` and `kubectl`
  $ sudo apt-mark unhold kubelet kubectl
  $ sudo apt-get install kubelet=<version> kubectl=<version>
  $ sudo apt-mark hold kubelet kubectl
  # restart daemon
  $ sudo systemctl daemon-reload
  $ sudo systemctl restart kubelet
- allow Pods to be deployed on the node
$ kubectl uncordon <node_name>
An etcd backup file contains the entire state of the cluster. Secrets are not encrypted (only base64-encoded); therefore, the backup file should be encrypted and securely stored.
Usually the backup script is run by Linux or Kubernetes cron jobs.
By default, there is a single etcd Pod running on the control plane node. All data is stored at `/var/lib/etcd`, which is backed by a `hostPath` volume on the node. The `etcdctl` version should match the etcd running in the Pod; use the `etcd --version` command to find out.
$ export ETCDCTL_API=3
$ etcdctl --endpoints=<host>:<port> <command> <args>
# Running etcdctl on master node
$ etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
snapshot save /var/lib/dat-backup.db
# Check the status of backup
$ etcdctl --write-out=table snapshot status <backup_file>
Restoring the backup to the default location:
$ export ETCDCTL_API=3
# By default restores in the current directory at subdir ./default.etcd
$ etcdctl snapshot restore <backup_file>
# Move the original data directory elsewhere
$ mv /var/lib/etcd /var/lib/etcd.OLD
# Stop etcd container at container runtime level, since it is static container
# Move restored backup to default location, `/var/lib/etcd`
$ mv ./default.etcd /var/lib/etcd
# Restarted etcd will find new data
Restoring the backup to the custom location:
$ etcdctl snapshot restore <backup_file> --data-dir=/var/lib/etcd.custom
# Update static Pod manifest:
# 1. --data-dir=/var/lib/etcd.custom
# 2. mountPath: /var/lib/etcd.custom (volumeMounts)
# 3. path: /var/lib/etcd.custom (volumes, hostPath)
# Updating the manifest triggers an etcd Pod restart (kube-controller-manager
# and kube-scheduler restart as well)
The sniff plugin allows seeing network traffic from within the cluster, since cluster network traffic is encrypted. sniff requires Wireshark and the ability to export a graphical display. The sniff command will use the first container, unless the `-c` option is used:
$ kubectl krew install sniff
$ kubectl sniff <pod> -c <container>
Troubleshooting DNS can be done by creating a Pod with network tools, creating a Service and running a DNS lookup (other tools include dig, nc, wireshark):
$ nslookup <service_name> <kube-dns_ip>
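A throwaway Pod can be used for the lookup as well (the image choice is just an assumption; busybox:1.28 is commonly used because its nslookup behaves predictably):

# One-off Pod for DNS checks
$ kubectl run dns-test --rm -it --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default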
Object organization:
- Kind - Pod, Service, Deployment, etc (available object types)
- Group - core, apps, storage, etc (grouped by similar functionality)
- Version - v1, beta, alpha
apiVersion: v1
kind: Pod
metadata:
name: nginx-pod
spec:
containers:
- name: nginx
image: nginx
Core (Legacy) group includes fundamental objects such as Pod, Node, Namespace, PersistentVolume, etc. Other objects are grouped under named API groups such as apps (Deployment), storage.k8s.io (StorageClass), rbac.authorization.k8s.io (Role) and so on.
Versioning follows the Alpha -> Beta -> Stable lifecycle:
- Alpha (`v1alpha1`) - disabled by default
- Beta (`v1beta1`) - enabled by default, more stable, considered safe and tested, forward changes are backward compatible
- Stable (`v1`) - backwards compatible, production ready
List of known API resources (also short info):
$ kubectl api-resources
$ kubectl api-resources --sort-by=name
$ kubectl api-resources --api-group=apps
# List api versions and groups
$ kubectl api-versions | sort
# Get preferred version of an API group
$ kubectl proxy 8001
$ curl localhost:8001/apis/<group_name>
API requests are RESTful (GET, POST, PUT, DELETE, PATCH)
Special API requests:
- LOG - retrieve container logs
- EXEC - execute command in a container
- WATCH - get change notifications on a resource
API resource location:
- Core API: `http://<apiserver>:<port>/api/<version>/<resource_type>`
- (in namespace) `http://<apiserver>:<port>/api/<version>/namespaces/<namespace>/<resource_type>/<resource_name>`
- API groups: `http://<apiserver>:<port>/apis/<group_name>/<version>/<resource_type>`
- (in namespace) `http://<apiserver>:<port>/apis/<group_name>/<version>/namespaces/<namespace>/<resource_type>/<resource_name>`
Response codes:
- `2xx` (success) - e.g. 201 (created), 202 (request accepted and performed async)
- `4xx` (client side errors) - e.g. 401 (unauthorized, not authenticated), 403 (access denied), 404 (not found)
- `5xx` (server side errors) - e.g. 500 (internal error)
Get certificates for easy request writing:
$ export client=$(grep client-cert $HOME/.kube/config | cut -d" " -f6)
$ export key=$(grep client-key-data $HOME/.kube/config | cut -d" " -f6)
$ export auth=$(grep certificate-authority-data $HOME/.kube/config | cut -d" " -f6)
$ echo $client | base64 -d - > ./client.pem
$ echo $key | base64 -d - > ./client-key.pem
$ echo $auth | base64 -d - > ./ca.pem
Make requests using keys and certificates from previous step:
$ curl --cert client.pem \
--key client-key.pem \
--cacert ca.pem \
https://k8sServer:6443/api/v1/pods
Another way to make authenticated request is to start a proxy session in the background:
# Run in a separate session or fg and Ctrl+C
$ kubectl proxy &
# Address (port) is displayed by previous command
$ curl localhost:8001/<request>
Custom resources can be part of the declarative API, which also requires a controller that is able to retrieve the structured data and maintain the declared state. There are two ways: a Custom Resource Definition (CRD) can be added to the cluster, or Aggregated APIs (AA) can be implemented via a new API server, which would run alongside the main `apiserver` (more flexible).

CRD objects can only use the same API functionality as built-in objects (respond to REST requests, configuration state validation and storage). New CRDs are available at the `apiextensions.k8s.io/v1` API path.
`name` must match the spec declared later. `group` and `version` will be part of the REST API - `/apis/<group>/<version>` (e.g. `/apis/stable/v1`), and used as `apiVersion` in the resource manifest. `scope` is one of `Namespaced` or `Cluster`, and defines if an object exists in a single namespace or is available cluster-wide. `plural` defines the last part of the API URL - `/apis/stable/v1/backups`, and must match the first piece of the `metadata.name` field. `singular` and `shortNames` are used for display and CLI. `kind` is used in resource manifests.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backups.stable.linux.com
spec:
  group: stable.linux.com
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
  scope: Namespaced
  names:
    plural: backups
    singular: backup
    shortNames:
      - bks
    kind: BackUp
The `spec` field depends on the controller. Validation is performed by the controller; only the existence of the variable is checked by default.
apiVersion: "stable.linux.com/v1"
kind: BackUp
metadata:
name: a-backup-object
spec:
timeSpec: "* * * * */5"
image: linux-backup-image
replicas: 5
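Once the CRD above is registered, the custom objects can be handled like any other resource (names follow the example definitions):

$ kubectl get crd backups.stable.linux.com
$ kubectl get backups
$ kubectl get bks          # short name
$ kubectl describe backup a-backup-object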
A Finalizer is an asynchronous pre-delete hook. Once a delete request is received, `metadata.deletionTimestamp` is updated, then the controller triggers the configured Finalizer.
metadata:
finalizers:
- finalizer.stable.linux.com
A custom `apiserver` can be created to validate custom object(s); those could also be already "baked" into the `apiserver` without the need to use CRDs. The Aggregated API is exposed at a central location and hides away the complexity from clients. Each `apiserver` can use its own etcd store or use the core `apiserver`'s store (in that case CRDs need to be created before creating instances of them).

A custom `apiserver` runs as a Pod and is exposed via a Service. Integrate it with the core `apiserver` using the object below:
apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
name: v1alpha1.extentions.example.com
spec:
group: extentions.example.com # API group custom apiserver is responsible for
version: v1alpha1 # supported version in the API group
priority: 150
service:
name: <service_name>
namespace: default
Minimal Deployment manifest explained:
# API version
# If API changes, API objects follow and may introduce breaking changes
apiVersion: apps/v1
# Object type
kind: Deployment
metadata:
# Required, must be unique to the namespace
name: foo-deployment
# Specification details of an object
spec:
# Number of pods
replicas: 1
# A way for a deployment to identify which pods are members of it
selector:
matchLabels:
app: foo
# Pod specifications
template:
metadata:
# Assigned to each Pod, must match the selector
labels:
app: foo
# Container specifications
spec:
containers:
- name: foo
image: nginx
Generate with --dry-run
parameter:
$ kubectl create deployment hello-world \
--image=nginx \
--dry-run=client \
-o yaml > deployment.yaml
- Root `metadata` should have at least a name field.
- `generation` represents the number of changes made to the object.
- `resourceVersion` value is tied to etcd to help with concurrency of objects. Any change in the database will change this number.
- `uid` - unique id of the object throughout its lifetime.
Pod is the smallest deployable object (not container). Pod embodies the environment where container lives, which can hold one or more containers. If there are several containers in a Pod, they share all resources like networking (unique IP is assigned to a Pod), access to storage and namespace (Linux). Containers in a Pod start in parallel (no way to determine which container becomes available first, but InitContainers are set to run sequentially). Loopback interface, writing to files in a common filesystem or inter-process communication (IPC) can be used by containers within a Pod for communication.
Secondary container may be used for logging, responding to requests, etc. Popular terms are sidecar, adapter, ambassador.
Pod states:
- `Pending` - image is retrieved, but container hasn't started yet
- `Running` - Pod is scheduled on a node, all containers are created, at least one is running
- `Succeeded` - containers terminated successfully and won't be restarting
- `Failed` - all containers have terminated, with at least one in failed status
- `Unknown` - most likely a communication error between master and `kubelet`
- `CrashLoopBackOff` - one of the containers unexpectedly exited after it was restarted at least once (most likely the Pod isn't configured correctly); Kubernetes repeatedly makes new attempts
Specifying ports is purely informational and doesn't affect clients connecting to a Pod (they can even be omitted).
Containers that crash are restarted automatically by `kubelet`. The exit code is a sum of 2 numbers: 128 and x, where x is the signal number sent to the process that caused it to terminate, e.g. 137 = 128 + 9 (SIGKILL), 143 = 128 + 15 (SIGTERM). When a container is killed, a completely new container is created.
`hostPID`, `hostIPC`, `hostNetwork` Pod spec properties allow a Pod to use the host's resources - see the process tree, network interfaces, etc.
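A minimal sketch of such a spec (name, image and which host namespaces to share are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: host-debug
spec:
  hostNetwork: true
  hostPID: true
  containers:
  - name: main
    image: busybox
    command: ["sleep", "3600"]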
`imagePullPolicy` set to `Always` commands the container runtime to contact the image registry every time a new Pod is deployed. This slows down Pod startup time, and can potentially prevent a Pod from starting at all, if the registry is unreachable. Prefer using a proper version tag instead of latest, and avoid setting `imagePullPolicy` to `Always`.
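For example (the tag and policy values are illustrative):

spec:
  containers:
  - name: app
    image: nginx:1.25          # pinned tag instead of latest
    imagePullPolicy: IfNotPresent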
User defined environment variables are defined in the Pod's spec (per container) as key/value pairs or via the `valueFrom` parameter referencing some location or another Kubernetes resource.
System defined environment variables include Service names in the same namespace available at the time of Pod's creation.
Both types can not be updated once Pod is created.
Refer to another variable using $(VAR)
syntax:
env:
- name: FIRST_VAR
value: "foo"
- name: SECOND_VAR
value: "$(FIRST_VAR)foo"
Pause container is used to provide shared Linux namespaces to user containers. This container is not seen within Kubernetes, but can be discovered by container engine tools.
For example, an IP address is acquired prior to other containers, which is then used in a shared network namespace. Container(s) will have an `eth0@tunl0` interface. The IP persists throughout the life of a Pod. If the pause container dies, `kubelet` recreates it and all of the Pod's containers.
InitContainer runs (must successfully complete) before main application container. Multiple init containers can be specified, in which case they run sequentially (in Pod spec order). Primary use cases are setting up environment, separating duties (different storage and security settings) and environment verification (block main application start up if environment is not properly set up).
spec:
containers:
- name: main-app
image: databaseD
initContainers:
- name: wait-database
image: busybox
command: ['sh', '-c', 'until ls /db/dir ; do sleep 5; done; ']
A static Pod is managed directly by `kubelet` (not the `apiserver`) on nodes. The Pod's manifest is placed in a specific location on a node (`staticPodPath` in `kubelet`'s configuration file) that `kubelet` is continuously watching (files starting with dots are ignored). Default location is `/etc/kubernetes/manifests`.

`kubelet` automatically creates a mirror Pod for each static Pod to make them visible in the `apiserver`, but they can not be controlled from there; deleting such a Pod through the `apiserver` will not affect it, and the mirror Pod will be recreated.

`kubelet` can also fetch a web-hosted static Pod manifest.

Control plane component manifests (built by `kubeadm`) - etcd, `apiserver`, `controller-manager`, `scheduler` - are static Pods.
The `resources` section in a container's spec is used to specify the desired and maximum amount of resources (CPU, memory) a container requests/is expected to use. A Pod's resources are the sum of the resources of the containers it contains. If `limits` are set, but `requests` are not, the latter are set to the `limits` values.
Example:
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "1"
labels:
app: hog
name: hog
namespace: default
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: hog
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
creationTimestamp: null
labels:
app: hog
spec:
containers:
- image: vish/stress
imagePullPolicy: Always
name: stress
resources:
limits:
cpu: "1"
memory: "2Gi"
requests:
cpu: "0.5"
memory: "500Mi"
args:
- -cpus
- "2"
- -mem-total
- "950Mi"
- -mem-alloc-size
- "100Mi"
- -mem-alloc-sleep
- "1s"
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
ResourceQuota object can set hard and soft limits (and of more types of resources) in a namespace, thus, on multiple objects.
`scheduler` has 2 functions related to resources - LeastRequestedPriority (prefer nodes with more unallocated resources) and MostRequestedPriority (prefer nodes with less unallocated resources). Only one of the functions can be configured for the `scheduler` at a time. The latter makes sure Pods are tightly packed, resulting in fewer nodes in total required to run a workload.
The `top` command inside a container shows the memory and CPU of the whole node it is running on, not of the container (even with limits set).
`requests` represent the minimum amount of resources a container needs to run properly. `scheduler` includes this information in its decision making process. It considers only nodes with enough unallocated resources to meet the requests of a Pod.
CPU resource can be specified as a whole or in millicores (1 = 1000m). If no limits are set, but requests are, Pods share spare resources in the same proportion as their requests - 2 Pods with 200m and 1000m millicores respectively will share all remaining CPU in a 1 to 5 ratio (if one Pod is idle, the other one is still allowed to use all available CPU until the first one needs more again).
`limits` set the maximum amount of a resource that a container can consume. While CPU is a compressible resource, meaning the amount used by a container can be throttled, memory is incompressible - once a chunk is given, it can only be released by the container itself. Thus, always set a limit for memory.
If CPU limit is exceeded, container isn't given more CPU time. If memory limits is exceeded, container is killed (OOMKilled, Out Of Memory).
Configured CPU limits can be viewed directly in the container's filesystem:
/sys/fs/cgroup/cpu/cpu.cfs_quota_us
/sys/fs/cgroup/cpu/cpu.cfs_period_us
QoS defines the priority between Pods and determines the order in which Pods get killed in an overcommitted system. A Pod's QoS is determined based on the QoS of all its containers: if all containers are assigned the BestEffort or Guaranteed class, the Pod's class is the same; any other combination results in the Burstable class.
- BestEffort (lowest) - assigned to containers that don't have `requests` or `limits` set.
- Guaranteed (highest) - assigned to Pods that have `requests` set equal to `limits` both for CPU and memory.
- Burstable - all other containers fall within this class.
Best effort Pods are killed before Burstable and both are killed before Guaranteed Pods, which in turn can be killed, if system Pods need more resources.
For Pods in the same class the OOM score is used; the highest score gets killed first. OOM score is based on:
- percentage of the available memory the process is using (a container using more of its requested memory gets killed first)
- a fixed score adjustment based on the Pod's QoS class and the container's requested memory
QoS is shown in `kubectl describe` output and in the `status.qosClass` field of the YAML.
Provides means to specify resource limits that objects can consume in a namespace. Applies to each individual Pod/container, not total consumption in a namespace.
The LimitRange resource is used by the LimitRange Admission Control plugin - when a Pod spec is posted to the `apiserver`, the contents are validated before being applied. Common practice is to set the limit to the biggest node, otherwise the `apiserver` would still accept a Pod with a resource request that can't be satisfied.
`min`, `max`, etc refer to both limits and requests, unless a setting with the Request suffix is also present, in which case the first ones specify only limits.

At the Pod level only min/max limits can be set. At the container level default limits and default requests can be set, which are applied if an object didn't provide values at all. PVC min/max can also be set in a LimitRange resource. All settings can be specified in a single resource or be split into multiple, for example, by type.
apiVersion: v1
kind: LimitRange
metadata:
name: example
spec:
limits:
- type: Pod
min:
cpu: 50m
memory: 5Mi
max:
cpu: 1
memory: 1Gi
- type: Container
defaultRequest:
cpu: 100m
memory: 10Mi
default:
cpu: 200m
memory: 100Mi
min:
cpu: 50m
memory: 5Mi
max:
cpu: 1
memory: 1Gi
maxLimitRequestRatio:
cpu: 4
memory: 10
- type: PersistentVolumeClaim
min:
storage: 1Gi
max:
storage: 10Gi
A ResourceQuota resource limits the amount of computational resources Pods can use, the amount of storage PVCs can claim and the total number of API resources that can exist in a namespace. Rules defined in a ResourceQuota are applied in the same namespace where the ResourceQuota object exists; in other words, `metadata.namespace` defines where the rules apply.
When a quota is set on a specific resource (CPU or memory) Pods also need values to be set for that resource. Therefore, common practice is to provide LimitRange resource with defaults set alongside ResourceQuota.
Used by ResourceQuota Admission Control plugin, which checks, if posted Pod spec violates rules set by ResourceQuota resource. Therefore, doesn't affect already running Pods in a namespace, but newly posted ones.
Quotas can also be applied to a specific quota scope: BestEffort, NotBestEffort (QoS), Terminating, NotTerminating. The last 2 scopes are related to the `activeDeadlineSeconds` setting in the Pod spec, which configures the maximum duration a Pod can be active on a node relative to its starting time. The Terminating scope matches Pods that have this setting set, while NotTerminating represents Pods without it.
- max values for requests and limits:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cpu-and-mem
spec:
  hard:
    requests.cpu: 400m
    requests.memory: 200Mi
    limits.cpu: 600m
    limits.memory: 500Mi
- limit amount of storage to be claimed by PVCs:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage
spec:
  hard:
    requests.storage: 500Gi # overall
    ssd.storageclass.storage.k8s.io/requests.storage: 300Gi # for particular class
    standard.storageclass.storage.k8s.io/requests.storage: 1Ti # for particular class
- Limit number of API objects that can be created:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: objects
spec:
  hard:
    pods: 10
    replicationcontrollers: 5
    secrets: 10
    configmaps: 10
    persistentvolumeclaims: 4
    services: 5
    services.loadbalancers: 1
    services.nodeports: 2
    ssd.storageclass.storage.k8s.io/persistentvolumeclaims: 2
- Apply quota to specific scope (Pod must match all for them to apply):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: besteffort-notterminating-pods
spec:
  scopes:
  - BestEffort
  - NotTerminating
  hard:
    pods: 4
# Show how much of a quota is currently used
$ kubectl describe quota
Probes let you run custom health checks on container(s) in a Pod. Possible probe results are Success, Failure, and Unknown.
- `livenessProbe` is a continuous check to see if a container is running. The restart policy is applied on a failure event.
- `readinessProbe` is a diagnostic check to see if a container is ready to receive requests. On a failure event the Pod's IP address is removed from the Endpoints object (restart policy is not applied). Usually used to protect applications that temporarily can't serve requests. If the app isn't ready to serve requests (and a readiness probe isn't configured), clients see "connection refused" type errors.
- `startupProbe` is a one time check during the startup process, ensuring containers are in a Ready state. All other probes are disabled until `startupProbe` succeeds. On a failure event the restart policy is applied. Usually used for applications requiring long startup times.
Probes can be defined using 3 types of handlers: command, HTTP, and TCP.
- a command's exit code of zero is considered healthy:
  exec:
    command:
    - cat
    - /tmp/ready
- HTTP GET request return code >= 200 and < 400:
  [...]
  httpGet:
    path: /healthz
    port: 8080
- successful attempt at establishing a TCP connection:
  [...]
  tcpSocket:
    port: 8080
Settings (separate per probe configuration):
Name | Default | Description |
---|---|---|
initialDelaySeconds | 0s | number of seconds after a container has started before running a probe |
periodSeconds | 10s | how frequently to run a probe |
timeoutSeconds | 1s | execution time for a probe before declaring a failure, probe would return Unknown status |
failureThreshold | 3 | number of missed checks to declare a failure |
successThreshold | 1 | number of successful checks after a failure to consider a container healthy |
Stopping/terminating Pod:
When a stop command is sent to a Pod, SIGTERM is sent to its containers and the Pod's status is set to Terminating. If a container is not terminated by the end of the grace period timer (default 30s), SIGKILL is sent, and the `apiserver` and etcd are updated.
# To immediately delete records from API and etcd, if termination is stuck
# Still have to clean up resources manually
$ kubectl delete pod <name> --grace-period=0 --force
Container(s) in a Pod can restart independently of the Pod. The restart process is protected by exponential backoff - 10s, 20s, 40s and up to 5m. It resets to 0s after 10m of continuous successful run.
Restart policy:
- Always (default) - restarts all containers in a Pod, if one stops running
- OnFailure - restarts only on non-graceful termination (non-zero exit codes)
- Never
Shutdown process:
- The Pod's object is deleted via the `apiserver`. The `apiserver` sets the deletionTimestamp field, which also makes the Pod go to the Terminating state.
- `kubelet` stops each container in the Pod with a grace period, which is configurable per Pod:
  - Run pre-stop hook (if configured) and wait for it to finish
  - Send SIGTERM to the main process
  - Wait for clean shutdown or grace period expiry (grace period countdown starts from the pre-stop hook)
  - Send SIGKILL
- When all containers stop, `kubelet` notifies the `apiserver` and the Pod object is deleted. Force delete an object with the `--grace-period=0 --force` options.
`terminationGracePeriodSeconds` is 30 seconds by default. Can be specified in the Pod's spec and also overridden when deleting the Pod:
$ kubectl delete pod foo --grace-period=5
Tip: the best way to ensure orphaned data is not lost and is migrated to the remaining Pod(s) is to configure a CronJob or a continuously running Pod that will scan for such an event and trigger/manage the migration of the data.
Lifecycle hooks are specified per container, and either perform a command inside a container or perform an HTTP GET against URL.
Post-start hook executes immediately after the container's main process is started. It doesn't wait for that process to start fully, and runs in parallel (asynchronously). Until the hook completes, the container stays in the Waiting and the Pod in the Pending state accordingly. If the hook fails, the main container is killed. Logs written to stdout aren't visible anywhere; in case of an error a FailedPostStartHook warning is written to the Pod's events (make the post-start hook write logs to the filesystem for easier debugging).
spec:
containers:
- name: foo
image: bar
lifecycle:
postStart:
exec:
command:
- sh
- -c
- "echo postStart hook ran"
Pre-stop hook executes immediately before the container is terminated - first the configured hook is run, then SIGTERM is sent, and lastly SIGKILL is sent, if unresponsive. Regardless of the status of the pre-stop hook, the container will be terminated; on failure a FailedPreStopHook warning is written to the Pod's events (this might go unnoticed, since the Pod is deleted shortly after).
lifecycle:
preStop:
httpGet:
port: 8080
path: shutdown
Tip: in many cases a pre-stop hook is used to pass SIGTERM to the application, because it seems Kubernetes (`kubelet`) doesn't send it. This may happen if the image is configured to run a shell, which in turn runs the application - in this case the shell "eats up" the signal. Either handle the signal in the shell script and pass it to the application, or use the exec form of ENTRYPOINT or CMD and run the application directly.
A pre-stop hook can also be used to ensure graceful termination of client requests. When Pod termination is initiated, the route for updating iptables (`apiserver` -> Endpoints controller -> `apiserver` -> `kube-proxy` -> iptables) is considerably longer than the one to remove the Pod (`apiserver` -> `kubelet` -> container(s)). Some meaningful delay (5-10 seconds) may be enough to ensure iptables are updated and no new requests are accepted. The application handles and waits for all active connections to finish, closes inactive ones, and shuts down completely after the last active request is completed.
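A minimal sketch of such a delay (the duration is an assumption):

lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 5"]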
Security-related features that can be specified under Pod or individual container spec. Pod level settings serve as defaults for containers, which can override them. Configuring a security context.
Some options are:
- `runAsUser` - specify user ID
- `runAsNonRoot` - specify `true` to enforce that the container runs as a non-root user
- `privileged` - specify `true` to allow the Pod to do anything on a node (use protected system devices, kernel features, devices, etc)
- `capabilities` - specify individual kernel capabilities to add or drop for a container (Linux kernel capabilities are usually prefixed with `CAP_`; when specifying in the Pod spec, leave out the prefix)
- `readOnlyRootFilesystem` - allow processes to only read from mounted volumes
- `fsGroup` - special supplemental group, applies to all volumes attached to the Pod (if the volume plugin allows it); can be used to share volumes between containers that run as different users
- `supplementalGroups` - list of additional group IDs the user is associated with
By default a container runs as the user defined in the image (in the Dockerfile; if the USER directive is omitted, it defaults to root).
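A hedged example combining Pod-level defaults and a container-level override (IDs, image and capabilities are illustrative):

spec:
  securityContext:            # Pod-level defaults
    runAsUser: 1000
    fsGroup: 2000
  containers:
  - name: app
    image: nginx
    securityContext:          # container-level overrides
      runAsNonRoot: true
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"]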
Downward API allows passing information about the Pod's metadata to the containers inside, via environment variables or files (`downwardAPI` volume) - for example, the Pod's name, IP address, the namespace it belongs to, labels, annotations and so on.

Labels and annotations can change during the Pod's lifecycle. Since environment variables can not be updated, labels and annotations can only be exposed via the `downwardAPI` volume - Kubernetes continuously updates it when changes occur.
Since volume is defined at the Pod level, exposing resources via volume requires container's name. However, this way container can access resource request data of other containers in the Pod.
- Environment variables (resource limits require a `divisor` parameter - the actual value is divided by the divisor):
kind: Pod
spec:
  containers:
  - name: main
    ...
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
    - name: CPU_REQUESTS_MILICORES
      valueFrom:
        resourceFieldRef:
          resource: requests.cpu
          divisor: 1m # specify which unit
    - name: MEMORY_LIMIT_KIBIBYTES
      valueFrom:
        resourceFieldRef:
          resource: limits.memory
          divisor: 1Ki # specify which unit
- `downwardAPI` volume (each item is exposed as a file, the `path` parameter serving as the file name):
kind: Pod
metadata:
  labels:
    foo: bar
spec:
  containers:
  - name: main
    volumeMounts:
    - name: downward
      mountPath: /etc/downward
  volumes:
  - name: downward
    downwardAPI:
      items:
      - path: "podName"
        fieldRef:
          fieldPath: metadata.name
      - path: "labels"
        fieldRef:
          fieldPath: metadata.labels
      - path: "cpuRequestMilliCores"
        resourceFieldRef:
          containerName: one
          resource: requests.cpu
          divisor: 1m
Controllers or operators are series of watch-loops that request the `apiserver` for a particular object state and modify the object until the desired state is achieved. Kubernetes comes with a set of default controllers, while others can be added using custom resource definitions.
Create custom controller via operator framework.
Deploy and maintain defined number of Pods. Usually, not used by itself, but through Deployment controller. Consists of a selector, number of replicas and Pod spec.
ReplicaSet selector can be of `matchLabels` or `matchExpressions` type (the older ReplicationController object only allowed direct matching of key/value pairs). The latter allows the use of the operators `In`, `NotIn`, `Exists` and `DoesNotExist`.
matchExpressions:
- key: foo
operator: In
values:
- bar
Manages the state of ReplicaSet and the Pods within, thus, providing flexibility with updates and administration. Rolling update is performed by creating a second ReplicaSet and increasing/decreasing Pods in two sets. It is also possible to roll back to a previous version, pause the deployment and make changes.
Designed for stateless applications, like web front end that doesn't store data or application state to a persistent storage.
Changes in the configuration file automatically trigger a rolling update. One can pause, resume and check the status of this behavior. An exit code of 0 for the status command indicates success, while non-zero indicates failure. If a Deployment is paused, the `undo` command won't do anything until the Deployment is resumed.
$ kubectl rollout [pause|resume|status] deployment <name>
$ kubectl rollout history deployment <name>
# Get detailed info
$ kubectl rollout history deployment <name> --revision=<number>
# Roll back to a previous version
$ kubectl rollout undo deployment <name>
# Roll back to a specific version
$ kubectl rollout undo deployment <name> --to-revision=<number>
# Restart all Pods. New ReplicaSet is created with the same Pod spec.
# Specified update strategy is applied.
$ kubectl rollout restart deployment <name>
Pod names are constructed as follows - `<deployment_name>-<pod_template_hash>-<pod_id>`. `pod_template_hash` is a unique ReplicaSet hash within the Deployment. `pod_id` is a unique Pod identifier within the ReplicaSet.
Create Deployment:
- declaratively via YAML file:
$ kubectl apply -f <deployment_file>
- imperatively using the `kubectl create` command:
  # --generator sets the api version to be used, --save-config saves the config in an annotation
  $ kubectl create deployment <name> \
      --image <image>:<tag> \
      --replicas <number> \
      --labels <key>:<value> \
      --port <port_number> \
      --generator deployment/apps.v1 \
      --save-config
To keep the desired replica count the same even when applying changes, do not include it in the YAML when using `kubectl apply`.
`RollingUpdate` (default) strategy - a new ReplicaSet starts scaling up, while the old one starts scaling down. `maxUnavailable` specifies the number of Pods from the total number in a ReplicaSet that can be unavailable (rounded down), `maxSurge` specifies the number of Pods allowed to run concurrently on top of the total number of replicas in a ReplicaSet (rounded up). Both can be specified as a number of Pods or a percentage.

`Recreate` strategy - all old Pods are terminated before new ones are created. Used when two versions can't run concurrently.
Other strategies that can be implemented:
- blue/green deployment - create a completely new Deployment of an application and change the app's version. Traffic can be redirected using Services. Good for testing, disadvantage is doubled resources. Implemented with label selectors and Service objects. A public Service initially points to the blue deployment (with a dedicated label, e.g. `role=blue`). Once a green deployment is rolled out and tested (most likely via another Service), the label selector is updated in the public Service (see the sketch after this list).
- canary deployment - based on blue/green, but traffic is shifted gradually to a new version. This is achieved by not specifying the app's version in the Service selector and just creating Pods of a new version. Can also be achieved by pausing a rolling update.
Related settings:
- `minReadySeconds` - time in seconds a new Pod should be considered healthy; 0 means immediately
- `progressDeadlineSeconds` - time in seconds until a progress error is reported (image issues, quotas, limit ranges)
- `revisionHistoryLimit` (default 10) - how many old ReplicaSet specs to keep for rollback
Manages stateful applications with a controller. Provides network names, persistent storage and ordered operations for scaling and rolling updates.
Each Pod maintains a persistent identity and has an ordinal index with a relevant Pod name, stable hostname, and stably identified storage. The ordinal index is just a unique zero-based sequential number given to each Pod, representing its order in the sequence of Pods. Deployment, scaling and updates are performed based on this index. For example, the second Pod waits until the first one is ready and running before it is deployed. Scaling and updates happen in reverse order. This can be changed in the Pod management policy, where `OrderedReady` is the default and can be switched to `Parallel`. Each Pod has its own unique PVC, which uses the `ReadWriteOnce` access mode.
Examples are database workloads, caching servers, application state for web farms.
Naming must be persistent and consistent, as stateful application often needs to know exactly where data resides. Persistent storage ensures data is stored and can be retrieved later on. Headless service (without load balancer or cluster IP) allows applications to use cluster DNS to locate replicas by name.
StatefulSet requires a Service to control its networking. This is a headless Service, thus each Pod has its own DNS entry. With a foo Service in the default namespace, a Pod named A-0 will have the `a-0.foo.default.svc.cluster.local` FQDN. `example.yml` specifies a headless service with no load balancing by using the `clusterIP: None` option.
Headless Service also creates a SRV record, which points to individual Pods inside StatefulSet. Thus, each Pod can just perform a SRV DNS lookup to find out its peers.
StatefulSet keeps its state by keeping data in PVs. The `volumeClaimTemplates` section is used to define a template, which is then used to create a PVC for each Pod. StatefulSet automatically adds a volume inside the Pod's spec and configures it to be bound to the PVC.
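A hedged, minimal sketch of a StatefulSet with a headless Service and a volumeClaimTemplate (names, image and sizes are illustrative):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: foo              # headless Service below
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx
        volumeMounts:
        - name: data
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: foo
spec:
  clusterIP: None               # headless
  selector:
    app: web
  ports:
  - port: 80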
PVCs are not deleted automatically on a scale-down event to prevent the deletion of potentially important data. Therefore, PVCs and PVs are to be removed manually.
Ensures that a specific single Pod is always running on all or some subset of the nodes. If new nodes are added, DaemonSet will automatically set up Pods on those nodes with the required specification. The word daemon is a computer science term meaning a non-interactive process that provides useful services to other processes.
Examples include logging (`fluentd`), monitoring, metric and storage daemons.
`RollingUpdate` (default) update strategy terminates old Pods and creates new ones in their place. `maxUnavailable` can be set to an integer or percentage value, default is 1. In the `OnDelete` strategy old Pods are not removed automatically; only when the administrator removes them manually are new Pods created.
Define, run and ensure that a specified number of Pods successfully terminate. `restartPolicy` must be set to either `OnFailure` or `Never`, since the default policy is `Always`. In case of restart, failed Pods are recreated with an exponentially increasing delay: 10, 20, 40... seconds, to a maximum of 6 minutes.
No matter how Job completes (success or failure) Pods are not deleted (for logs and inspection). Administrator can delete Job manually, which will also delete Pods.
- `activeDeadlineSeconds` - max duration time, has precedence over `backoffLimit`
- `backoffLimit` - number of retries before being marked as `Failed`, defaults to 6
- `completions` - number of Pods that need to finish successfully
- `parallelism` - max number of Pods that can run simultaneously
- `ttlSecondsAfterFinished` - kick off cleanup process after that amount of time (CLARIFY)
Execute from cli:
$ kubectl run pi --image perl --restart Never -- perl -Mbignum -wle 'print bpi(2000)'
It's a good idea to set `job.spec.template.spec.restartPolicy` to `Never` to be able to view the logs of the failed Pod.
Parallel Job can launch multiple Pods to run the same task. There are 2 types of parallel Jobs - fixed task completion count and a work queue.
A work queue is created by leaving the `completions` field empty. The Job controller launches the specified number of Pods simultaneously and waits until one of them signals successful completion. Then it stops and removes all Pods.
For a Job with both completions and parallelism options set, the controller won't start new containers if the remaining number of completions is less than the parallelism value.
Creates and manages Jobs on a defined schedule. A CronJob is created at the time of submission to the `apiserver`, but Jobs are created on schedule. The timezone for the cron schedule is based on the `apiserver` setting.
- `suspend` - set to `true` to not run Jobs anymore
- `concurrencyPolicy` - `Allow` (default), another concurrent Job may run depending on the duration of a Job and the schedule; `Forbid`, the current Job continues and the new one is skipped; `Replace`, the current Job is cancelled and a new one is started
In some cases a Job may not run during a time period or may run twice, thus the requested Pod should be idempotent. `startingDeadlineSeconds` ensures a Pod starts no later than X seconds after the scheduled time. If the Pod doesn't start, no new attempts will be made and the Job will be marked as failed. Default is 0, which means no limit. The controller checks for new tasks every 10 seconds, thus, if `startingDeadlineSeconds` is set to less than 10, some Jobs might be missed.

Kubernetes retains a number of successful and failed Jobs in history, which is by default 3 and 1 respectively. The `successfulJobsHistoryLimit` and `failedJobsHistoryLimit` options may be used to control this behavior. Deleting a CronJob also deletes all its Pods.
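A hedged CronJob sketch combining these settings (schedule, name and image are illustrative):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 120
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: report
            image: busybox
            command: ["sh", "-c", "echo generating report"]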
The job of a scheduler is to assign new Pods to nodes. The default is `kube-scheduler`, but a custom one can be written and set. Multiple schedulers can work in parallel.
Node selection goes through 3 stages:
- Filtering - remove nodes that can not run the Pod (apply hard constraints, such as available resources, `nodeSelectors`, etc)
- Scoring - gather a list of nodes that can run the Pod (apply scoring functions to prioritize the node list for the most appropriate node to run the workload); ensure Pods of the same Service are spread evenly across nodes; node affinity and taints are also applied
- Binding - updating node name in Pod's object
The PriorityClass resource and the `priorityClassName` Pod setting can be used to evict lower priority Pods to allow higher priority ones to be scheduled (the scheduler determines a node where a pending Pod could run, if one or more lower priority ones were to be evicted). A PodDisruptionBudget resource can limit the number of Pods to be evicted and ensure enough Pods are running at all times, but it could still be violated by the scheduler, if no other option is available. Either a percentage or an absolute number can be specified for the `minAvailable` or `maxUnavailable` setting.
The end result of a scheduling process is assigning a Binding (a Kubernetes API object in the `api/v1` group) to a Pod that specifies where it should run. It can also be assigned manually without any scheduler.
To manually schedule a Pod to a node (bypass the scheduling process) specify `nodeName` (the node must already exist); resource constraints still apply. This way a Pod can still run on a cordoned node, since scheduling is essentially disabled and the node is assigned directly.
Custom scheduler can be implemented; also multiple schedulers can run concurrently. Custom scheduler is packed and deployed as a system Pod. Default scheduler code. Define which scheduler to use in Pod's spec, if none specified, default is used. If specified one isn't running, the Pod remains in Pending state.
Priorities are functions used to weight resources. By default, the node with the least number of Pods will be ranked the highest (unless `SelectorSpreadPriority` is set). `ImageLocalityPriorityMap` favors nodes that already have the container image. `cp/pkg/scheduler/algorithm/priorities` contains the list of priorities.
Example file for a scheduler policy:
kind: Policy
apiVersion: v1
predicates:
- name: MatchNodeSelector
order: 6
- name: PodFitsHostPorts
order: 2
- name: PodFitsResources
order: 3
- name: NoDiskConflict
order: 4
- name: PodToleratesNodeTaints
order: 5
- name: PodFitsHost
order: 1
priorities:
- name: LeastRequestedPriority
weight: 1
- name: BalancedResourceAllocation
weight: 1
- name: ServiceSpreadingPriority
weight: 2
- name: EqualPriority
weight: 1
hardPodAffinitySymmetricWeight: 10
Typically passed as the `--policy-config-file` and `--scheduler-name` parameters. This would result in 2 schedulers running in a cluster. A client can then choose one in the Pod's spec.
Assign labels to nodes and use `nodeSelector` on Pods to place them on certain nodes. Simple key/value check based on matchLabels. Usually used to apply hardware specification (hard disk, GPU) or workload isolation. All selectors must be met, but the node could have more labels. `nodeName` can be used to schedule a Pod to a specific single node.
Like `nodeSelector`, affinity uses labels on nodes to make scheduling decisions, but with matchExpressions. matchLabels can still be used with affinity as well for simple matching.
- `nodeAffinity` - use labels on nodes (should some day replace `nodeSelector`)
- `podAffinity` - try to schedule Pods together using Pod labels (same node, zone, etc)
- `podAntiAffinity` - keep Pods separate (different nodes, zones, etc)
Scheduling conditions:
- requiredDuringSchedulingIgnoredDuringExecution - Pod is scheduled only if all conditions are met (hard rule)
- preferredDuringSchedulingIgnoredDuringExecution - Pod gets scheduled even if a node with all matching conditions is not found (soft rule, preference); a weight from 1 to 100 can be assigned to each rule
Affinity rules use In, NotIn, Exists, and DoesNotExist operators. A matching
label is required only at scheduling time; the Pod keeps running if the label
is later removed. However, requiredDuringSchedulingRequiredDuringExecution
is planned for the future.
Schedule caching Pod on the same node as a web server Pod.
spec:
containers:
- name: cache
...
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- webserver
topologyKey: "kubernetes.io/hostname"
scheduler
takes other Pods' affinity rules into account, even if the Pod to
be scheduled doesn't define any (InterPodAffinityPriority) - this ensures
that other Pods' affinity rules don't break, if initial Pod was deleted by
accident.
topologyKey
can be any label on the node (with some exceptions). Such label
must be present on all nodes, otherwise, it could lead to undefined behavior.
Some well-known labels for spreading or putting Pods together:
-
kubernetes.io/hostname
- host -
topology.kubernetes.io/zone
- availability zone -
topology.kubernetes.io/region
- region
By default labelSelector
matches only Pods in the same namespace as the Pod
being scheduled. Pods from other namespaces can also be selected by adding
namespaces
field on the same level as labelSelector
.
Opposite of selectors, keeps Pods from being placed on certain nodes. A Taint prevents scheduling on a node, while a Toleration allows a Pod to ignore the Taint and be scheduled as normal.
$ kubectl taint nodes <node_name> <key>=<value>:<effect>
$ kubectl taint nodes <node_name> key1=value1:NoSchedule
# Remove a taint
$ kubectl taint nodes <node_name> <key>:<effect>-
Effects:
- NoSchedule - do not schedule Pod on a node, unless toleration is present; all existing Pods continue to run
- PreferNoSchedule - try to avoid particular node; all already running Pods are unaffected
- NoExecute - evacuate all existing Pods, unless one has a toleration, and do not schedule new Pods; tolerationSeconds can specify for how long a Pod can run before being evicted, in certain cases kubelet could add 300 seconds to avoid unnecessary evictions
Toleration with NoExecute
effect and tolerationSeconds
setting can be used
to configure when Pods on unresponsive nodes should be rescheduled.
Default operator is Equal, which is used to tolerate a specific value.
Exists tolerates all values for a specific key, and a value generally should
not be specified with it. A toleration with an empty key and the Exists
operator tolerates every taint. If effect is not specified, but a key and
operator are declared, all effects are matched.
All parts have to match to the taint on the node:
spec:
containers:
...
tolerations:
- key: <key>
operator: "Equal"
value: <value>
effect: NoSchedule
Marks node as unschedulable, preventing new Pods from being scheduled, but does not remove already running Pods. Used as preparatory step before reboot or maintenance.
# Mark node as unschedulable
$ kubectl cordon <node>
# Mark node as unschedulable
# Gracefully evict Pods
# Optionally ignore daemonsets, since e.g. `kube-proxy` is deployed as a daemonset
$ kubectl drain <node> --ignore-daemonsets
# Mark node as schedulable again
$ kubectl uncordon <node>
Individual Pods won't be removed by draining the node, since they are not
managed by a controller. Add --force
option to remove.
# Manual scaling
$ kubectl scale <object_type> <name> --replicas=<number>
Automatically scales Replication Controller, ReplicaSet, StatefulSet or
Deployment based on resource utilization percentage, such as CPU and memory by
updating replicas
field. Modification is made through Scale sub-resource,
which is exposed for previously mentioned objects only (Autoscaler can operate
on any resource that exposes Scale sub-resource).
Custom metrics can also be used. If multiple metrics are specified, target Pod count is calculated for each, then the highest value is used.
At most double the current number of Pods can be added in a single operation, if there are more than 2 currently running; with fewer running Pods, at most 4 can be added in a single step. Scale-up can happen at most once in 3 minutes, scale-down - once in 5 minutes.
# Create HPA resource
$ kubectl autoscale deployment <name> \
--min=5 \
--max=15 \
--cpu-percent=75
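A sketch of the declarative equivalent using the autoscaling/v2 API, assuming a Deployment named web:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web                # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web              # assumed Deployment name
  minReplicas: 5
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75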
Autoscaling has a thrashing problem, that is when the target metric changes
frequently, which results in frequent up/down scaling. Use
--horizontal-pod-autoscaler-downscale-delay
option to control this behavior
(by specifying a wait period before next down scale; default is 5 minute
delay).
Container resource metrics (defined in resource requests).
CPU usage percentage is based on the CPU requests setting, which means it needs to be present on the Pod. Usage percentage can exceed 100%, because a Pod can use more than the requested amount of CPU.
Memory usually isn't a good metric, because the application has to control its memory consumption. If new Pods (after killing old ones) don't use less memory, Kubernetes will keep adding new Pods until the limit is reached.
Any other (including custom) metric related to Pod directly, such as queries-per-second, message queue size, etc
spec:
metrics:
- type: Pods
  pods:
    metricName: qps
    targetAverageValue: 100
Metrics that don't relate to Pods, such as average request latency on an Ingress object. Unlike the other types, where an average is taken across all Pods, a single value is acquired.
spec:
metrics:
- type: Object
  object:
    metricName: latencyMillis
    target:
      apiVersion: extensions/v1beta1
      kind: Ingress
      name: frontend
    targetValue: 20
scaleTargetRef:
apiVersion: extensions/v1beta1
kind: Deployment
name: kubia
Runs as a separate deployment, adjusts the amount of CPU and memory requested by Pods. Refer to the project.
Adds or removes node(s) based on inability to deploy Pods or having low utilized nodes. Contacts cloud provider API to add a new node. Best node is determined based on available or already deployed node groups. Refer to project page for deployment options for particular cloud.
Karpenter, currently supports AWS only.
Define limits for total resource consumption in a namespace. Applying ResourceQuota with a limit less than already consumed resources doesn't affect existing resources and objects consuming them.
apiVersion: v1
kind: ResourceQuota
metadata:
name: storagequota
spec:
hard:
persistentvolumeclaims: "10"
requests.storage: "500Mi"
Define limits for resource consumption per object (see the sketch after this list). For example:
- min/max compute resource per Pod or container
- min/max storage request per PersistentVolumeClaim
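A minimal LimitRange sketch for the compute case, with assumed example values:
apiVersion: v1
kind: LimitRange
metadata:
  name: compute-limits     # hypothetical name
spec:
  limits:
  - type: Container
    defaultRequest:        # applied when a container sets no request
      cpu: 100m
      memory: 128Mi
    default:               # applied when a container sets no limit
      cpu: 500m
      memory: 256Mi
    max:
      cpu: "1"
      memory: 512Mi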
Namespaces can abstract a single physical cluster into multiple virtual clusters. They provide scope for naming resources like Pods, controllers and Deployments. Primarily used for resource isolation/management. Users can create namespaces, while Kubernetes has 4 default ones:
- default - for objects with no namespace defined
- kube-system - for objects created by Kubernetes itself (ConfigMaps, Secrets, Controllers, Deployments); by default these items are excluded when using kubectl commands (can be viewed explicitly)
- kube-public - for objects publicly readable for all users
- kube-node-lease - worker node lease info
Creating a Namespace also creates DNS subdomain
<ns_name>.svc.<cluster_domain>
, thus, Namespace name can not contain dots,
otherwise follows RFC 1035 (Domain name) convention.
Can also be used as a security boundary for RBAC or a naming boundary (same resource name in different namespaces). A given object can exist only in one namespace. Not all objects are namespaced (generally physical objects like PersistentVolumes and Nodes).
$ kubectl api-resources --namespaced=true
$ kubectl api-resources --namespaced=false
# List all resources in a namespace
$ kubectl api-resources --verbs=list --namespaced -o name \
| xargs -n 1 kubectl get --show-kind --ignore-not-found -n <namespace>
Namespace can be specified in command line or manifest file. Deleting a Namespace deletes all resources inside of it as well. Namespace is defined in metadata section of an object.
apiVersion:
kind:
metadata:
namespace:
# Create namespace
$ kubectl create namespace <namespace_name>
# Set default namespace
$ kubectl config set-context --current --namespace=<namespace_name>
# Validate it
$ kubectl config view --minify | grep namespace
Labels enable managing objects or collection of objects by organizing them into groups, including objects of different types. Label selectors allow querying/selecting multiple objects. Kubernetes also leverages Labels for internal operations.
Non-hierarchical key/value pair (up to 63/253 characters long). Can be assigned
at creation time or be added/edited later. Add --overwrite
parameter to
rewrite already existing label.
$ kubectl label <object> <name> <key1>=<value1> <key2>=<value2>
$ kubectl label <object> <name> <key1>=<value1> <key2>=<value2> --overwrite
$ kubectl label <object> --all <key>=<value>
# Delete
$ kubectl label <object> <name> <key>-
# Output additional column with all labels
$ kubectl get <object> --show-labels
# Specify columns (labels) to show
$ kubectl get <object> -L <key1>,<key2>
Controllers and Services match Pods using labels. Pod scheduling (e.g. based on hardware specification, SSD, GPU, etc) uses Labels as well.
Deployment and Service example, all labels must match:
kind: Deployment
...
spec:
selector:
matchLabels:
<key>: <value>
...
template:
metadata:
labels:
<key>: <value>
spec:
containers:
---
kind: Service
...
spec:
selector:
<key>: <value>
Labels are also used to schedule Pods on a specific Node(s):
kind: Pod
...
spec:
nodeSelector:
<key>: <value>
Best practices include:
- Name of the application the resource belongs to
- Application tier (frontend, backend, etc)
- Environment (dev, prod, QA, etc)
- Version
- Type of release (stable, canary, blue/green, etc)
- Tenant (if multiple used in the same namespace)
- Shard (for sharded systems)
Labels can be used to query/filter set of objects.
# Long format
$ kubectl get <object> --selector <key>=<value>
# Check if label exists (or doesn't exist)
$ kubectl get <object> -l <key>
$ kubectl get <object> -l '!<key>'
# Check multiple labels
$ kubectl get <object> -l '<key1>=<value1>,<key2>!=<value2>'
$ kubectl get <object> -l '<key1> in (<value1>,<value2>)'
$ kubectl get <object> -l '<key1> notin (<value1>,<value2>)'
Label selectors can also be imperatively updated and/or created:
$ kubectl set selector <object> <name> "<key>=<value>"
Annotations hold object metadata that is useful outside of the cluster's own object interactions, i.e. used by people or third-party applications. For example, a timestamp, a pointer to related objects from other ecosystems, the email of the developer responsible for the object and so on. Non-hierarchical key/value pairs (keys up to 63 characters, total size up to 256KB). Can't be used for querying/selecting.
Manifest file:
kind: Pod
...
metadata:
annotations:
owner: Max
$ kubectl annotate <object_type> <name> key=<value>
$ kubectl annotate <object_type> --all key=<value> --namespace <name>
$ kubectl annotate <object_type> <name> key=<new_value> --overwrite
# Delete
$ kubectl annotate <object_type> <name> <key>-
Best practices include at least a description of the resource and contact information of the responsible person. Could also include names of services it is using, build and version info, and so on.
Install Kubernetes dashboard (runs Pods in kubernetes-dashboard
namespace):
- deploy
  $ kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0-rc3/aio/deploy/recommended.yaml
- start proxy
  $ kubectl proxy
- access the following page (port may vary)
  http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy
- choose token option and supply the output of the following command:
  $ kubectl -n kube-system describe secret $(kubectl -n kube-system get secret | awk '/^deployment-controller-token-/{print $1}') | awk '$1=="token:"{print $2}'
Events show cluster operations on resources defined in the cluster, such as scheduling decisions, scaling operation, etc. Retained for one hour. An Event is also a Kubernetes resource like any other.
$ kubectl get events
# Sort chronologically
$ kubectl get events --sort-by='.metadata.creationTimestamp'
# Filter events
$ kubectl get events --field-selector type=Warning,reason=Failed
# Specific object events
$ kubectl describe <type> <name>
kubelet collects resource consumption data both on container and node levels via cAdvisor agent.
Kubernetes Metrics Server
collects resource metrics from kubelet
s and exposes them via apiserver
.
This includes CPU and memory usage info for Pods and nodes. Intended to be used
internally by the cluster - is used to feed data to scheduler for horizontal
and vertical autoscalers (not to feed third-party monitoring solutions). Once
installed, it is accessible at /apis/metrics.k8s.io.
Can use labels, selectors and --sort-by
parameters as well.
# Access actual data resource usage
$ kubectl top pods
# Per container utilization
$ kubectl top pods --containers
# All Pods on a node
$ kubectl top nodes
# Raw (Prometheus) node data
$ kubectl get --raw /api/v1/nodes/<node_name>/proxy/metrics/resource
Kubernetes keeps container's logs in a file. The location depends on the
container runtime (default for containerd is /var/log/containers
). Two logs
are retained on a node - current and, if container has restarted, the previous
run log. Access the one before recent restart with --previous
, -p
parameter.
$ kubectl logs <pod>
# Multicontainer Pod
$ kubectl logs <pod> -c container
$ kubectl logs --all-containers <pod>
# Logs come in sequence - container1 -> container2 ...
$ kubectl logs <pod> --all-containers
# Filter logs
$ kubectl logs --tail=20 <pod>
$ kubectl logs --since=10s <pod>
Logs are automatically rotated daily and every time the file reaches 10MB in
size. kubectl logs
only shows logs from the last rotation.
Once container is removed, so is its logs. Use aggregation tools, such as Fluentd to gather logs for safekeeping and analysis. ELK is a common stack for aggregation, search and visualization of logging data.
Aggregation tools treat each line as an entry, which makes multi-line logs appear as separate entries. Either configure outputting logs in JSON format or keep human-readable logs in stdout, while writing JSON to a specific location. The aggregation tool will need additional node-level configuration or be deployed as a sidecar.
Nodes run kubelet
and kube-proxy
. On systemd systems kubelet
runs as a
systemd service, which means its logs are stored in journald. kube-proxy
runs
as a Pod in general (same log access methods apply). If it doesn't run inside
a Pod, logs are stored in /var/log/kube-proxy
.
# -u <service_name>
# opens in pager format: f (forward), b (back)
# add --no-pager parameter to disable it
$ journalctl -u kubelet.service
# narrow down time frame
$ journalctl -u kubelet.service --since today
# non-systemd
$ tail /var/log/kubelet.log
# Locate apiserver log file on a node
$ sudo find / -name "*apiserver*log"
- Why logging
- Logging with kibana and elasticsearch
- Logging with fluentd, kibana and elasticsearch
- Fluentd architecture
- cAdvisor
A process in container can write a termination message (reason for termination)
into specific file, which is read by kubelet
and shown with kubectl describe
in the Message
field. Default location is /dev/termination-log
;
can be set to custom location with terminationMessagePath
field in the
container definition in the Pod spec. Can also be used in Pods that run
completable task and terminate successfully.
terminationMessagePolicy set to FallbackToLogsOnError will use the last few lines
of the container's logs as the termination message (only on unsuccessful termination).
All Pods can communicate with each other on all nodes. Software (agents) on a given node can communicate with all Pods on that node.
Network types:
- node (real infrastructure)
- Pod - implemented by the network plugin, IPs are assigned from PodCidrRange, but could also be assigned from the node network
- cluster - used by Services of ClusterIP type, assigned from the ServiceClusterIpRange parameter of the API server and controller manager configurations
Pod-to-Pod communication on the same node goes through bridge interface. On
different nodes could use Layer 2/Layer 3/overlay options. Services are
implemented by kube-proxy
and can expose Pods both internally and
externally.
Pause/Infrastructure container starts first and sets up the namespace and network stack inside a Pod, which is then used by the application container(s). This allows container(s) restart without interrupting network namespace. Pause container has a lifecycle of the Pod (created and deleted along with the Pod).
Container Network Interface (CNI) is abstraction for implementing container and Pod networking (setting namespaces, interfaces, bridge configurations, IP addressing). CNI sits between Kubernetes and container runtime. CNI plugins are usually deployed as Pods controlled by DaemonSets running on each node.
Expose individual Pod directly to the client:
$ kubectl port-forward <pod_name> <localhost_port>:<pod_port>
Use apiserver
as proxy to reach individual Pod or Service (use with kubectl proxy
to handle authentication):
# Pod
$ curl <apiserver_host>:<port>/api/v1/namespaces/<namespace>/pods/<pod>/proxy/<path>
# Service
$ curl <apiserver_host>:<port>/api/v1/namespaces/<namespace>/services/<service>/proxy/<path>
By default Pods run in a separate network namespace. The hostNetwork: true
spec setting makes a Pod use the host's network namespace, effectively making it behave as if
it was running directly on a node. Process in a Pod that binds to a port, will
be bound to node's port. hostPort
property in spec.containers.ports
allows
binding to host's port, without using host network.
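A hostPort sketch (assumed image and port numbers); container port 80 becomes reachable on the node's port 8080 without switching the Pod to the host network namespace:
apiVersion: v1
kind: Pod
metadata:
  name: hostport-demo      # hypothetical name
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
      hostPort: 8080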
DNS is available as a Service in a cluster, and Pods by default are configured
to use it. Provided by CoreDNS (since v1.13). Configuration is stored as
ConfigMap coredns
in kube-system
namespace, which is mounted to coredns
Pods as /etc/coredns/Corefile
. Updates to ConfigMap get propagated to CoreDNS
Pods in about 1-2 minutes - check logs for reload message. More
plugins can be enabled for additional
functionality.
dnsPolicy setting in Pod spec can be set to the following:
- ClusterFirst (default) - send DNS queries with cluster prefix to the coredns service
- Default - inherit node's DNS
- None - specify DNS settings via another parameter, dnsConfig
spec:
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
    - 9.9.9.9
A records:
- for Pods - <ip_in_dash_form>.<namespace>.pod.cluster.local
- for Services - <service_name>.<namespace>.svc.cluster.local
Traffic can access a Service using a name, even in a different namespace just by adding a namespace name:
# will fail if service is in different namespace
$ curl <service_name>
# works across namespaces
$ curl <service_name>.<namespace>
Can be used for managing Pod networking (communication between Pods), provided the networking plugin supports it. Applies to Pods that match its label selector, all Pods in namespaces that match the namespace selector, or a matching CIDR block. Network policies themselves are namespaced objects, so a policy applies only to Pods in its own namespace. Traffic not matched by a policy is implicitly denied. All rules are "allow" rules.
Common practice is to drop all traffic, then add other policies, which allow desired ingress and egress traffic. Default ingress deny example:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny
spec:
podSelector: {} # applies to all Pods, not selected by other policies
policyTypes:
- Ingress
Allow egress traffic from foo pods to bar pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: example
spec:
podSelector:
matchLabels:
workload: foo
policyTypes:
- Egress
egress:
- to:
- podSelector:
matchLabels:
workload: bar
ports:
- protocol: TCP
port: 9000
Currently network policies can only target Pods. In the rules section other Pods
can be targeted via a label selector and/or a namespace selector. The latter
doesn't accept a simple name, thus, if name selection is desired, a special
label kubernetes.io/metadata.name
can be used:
...
spec:
...
ingress:
- from:
- podSelector:
matchLabels:
workload: foo
namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: dev
Provides persistent endpoint for clients (virtual IP and DNS). Load balances
traffic to Pods and automatically updates during Pod controller operations.
Labels and selectors are used to determine which Pods are part of a Service.
Default and popular implementation is kube-proxy
on the node's iptables.
Acts as a network abstraction for Pod access. Allows communication between sets of deployments. A unique ID is assigned at creation time, which can be used by other Pods to talk to each other.
A Service is a controller, which listens to the Endpoints controller to provide
persistent IP for Pods. Sends messages (settings) via apiserver
to network
plugin (e.g. Calico) and to kube-proxy
on every node. Also handles access
policies for inbound requests.
Service also creates an Endpoint object(s), which are individual IP:PORT pairs of underlying Pods. See the routing IPs (mostly used for troubleshooting):
$ kubectl describe endpoints <service_name>
Imperatively create a new Service (NodePort
type):
# Create a service
$ kubectl expose deployment <name> \
--port 80 \
--target-port 8080
service/kubernetes
is an API server service.
Each Service gets a DNS A/AAAA record in cluster DNS in the form
<svc_name>.<namespace>.svc.<cluster_domain>
. If Pods and Service are in the
same namespace, the latter can be referenced simply by name. Pods that are
created after the Service also get environment variables set with the
information about Services available at that time.
kubectl proxy
command creates a local proxy allowing sending requests to
Kubernetes API:
$ kubectl proxy &
# Access foo service
$ curl http://localhost:8001/api/v1/namespaces/default/services/foo
# If service has a port_name configured
$ curl http://localhost:8001/api/v1/namespaces/default/services/foo:<port_name>
sessionAffinity
setting can be set to ClientIP
directing single client to
the same Pod. Cookie based affinity isn't possible since Services operate at
TCP/UDP level.
targetPort
setting can also refer to port names specified in Pod spec,
instead of numbers. Thus, Pods port number can change without requiring similar
change on Service side.
kind: Pod
spec:
containers:
- name: foo
ports:
- name: http
containerPort: 8080
- name: https
containerPort: 8443
---
kind: Service
spec:
ports:
- name: http
port: 80
targetPort: http
- name: https
port: 443
targetPort: https
To manually remove a Pod from Service, add enabled=true
as a label, and
switch it to false
or remove completely for a given Pod.
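A sketch of that flow, assuming the Service selector includes enabled=true:
# Flip the label so the Pod no longer matches the Service selector
$ kubectl label pod <pod_name> enabled=false --overwrite
# Or drop the label entirely
$ kubectl label pod <pod_name> enabled-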
Default Service type. Exposes a Service on a cluster-internal IP (exists in
iptables on the nodes). IP is chosen from a range specified as a
ServiceClusterIPRange
parameter both on apiserver
and
kube-controller-manager
configurations. If Service is created before
corresponding Pods, they get hostname and IP address as environment variables.
Exposes a Service on the IP address of each node in the cluster at a specific
port number, making it available outside the cluster. Built on top of
ClusterIP
Service - creates ClusterIP
Service and allocates port on all
nodes with a firewall rule to direct traffic on that node to the ClusterIP
persistent IP. NodePort
option is set automatically from the range 30000 to
32767 or can be specified by user (should still fall within that range).
Regardless of which node is requested traffic is routed to ClusterIP
Service and then to Pod(s) (all implemented by kube-proxy
on the node).
Exposes a Service externally, using a load balancing Service provided by a cloud provider or add-on.
Creates a NodePort
Service and makes an async request to use a load balancer.
If listener does not answer (no load balancer is created), stays in Pending
state.
In GKE it is implemented using GCP's Network Load Balancer. GCP will assign
static IP address to load balancer, which directs traffic to nodes (randomly).
kube-proxy
chooses random Pod, which may reside on a different node to ensure
even balance (default behavior). The response takes the same route back. Use
externalTrafficPolicy: Local
option to disable this behavior and enforce
kube-proxy
to direct traffic to local Pods.
Provides service discovery for external services. Kubernetes creates a CNAME
record for external DNS record, allowing Pods to access external services (does
not have selectors, defined Endpoints or ports).
apiVersion: v1
kind: Service
metadata:
name: external-service
spec:
type: ExternalName
externalName: someapi.somecompany.com
ports:
- port: 80
Expose individual Pod IPs backing a Service directly. Define by explicitly
specifying None
in spec.clusterIP
field (headless). Cluster IP is not
allocated and kube-proxy
does not handle this Service (no load balancing
nor proxying). Allows interfacing with other service discovery mechanisms (not
tied to Kubernetes).
Service with selectors - Endpoint controller creates endpoint records and modifies DNS config to return A records (IP addresses) pointing directly to Pods. Client decides which one to use. Often used with stateful applications.
Service without selectors - no Endpoints are created. DNS config may look
for CNAME record for ExternalName
type or any Endpoint records that share a
name with a Service (Endpoint object(s) needs to be created manually, and can
also include external IP).
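A minimal headless Service sketch with a selector (hypothetical name and assumed app=db label); a DNS lookup of the Service name returns the Pod IPs directly:
apiVersion: v1
kind: Service
metadata:
  name: db-headless        # hypothetical name
spec:
  clusterIP: None          # makes the Service headless
  selector:
    app: db                # assumed Pod label
  ports:
  - port: 5432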
Usually not managed directly, represents IPs for Pods that match particular
Service. Endpoint controller runs as part of kube-controller-manager
.
If Endpoints is empty, meaning no matching Pods, Service definition might be wrong (labels).
On Pod deletion event Endpoints controller removes the Pod as an endpoint (by
modifying Endpoints API object). kube-proxies
that watch for changes update
iptables on respective nodes; however, removing iptables rules doesn't break
existing connections with clients.
Consists of an Ingress object describing various rules on how HTTP traffic gets
routed to Services (and ultimately to Pods) and an Ingress controller (daemon
in a Pod) watching for new rules (/ingresses
endpoint in the apiserver
).
Cluster may have multiple Ingress controllers. Both L4 and L7 can be
configured. Ingress class or annotation can be used to associate an object with
a particular controller (can also create a default class). Absence of an
Ingress class or annotation will cause every controller to try to satisfy the
traffic.
Ingress also provides load balancing directly to Endpoints bypassing
ClusterIP
. Name-based virtual hosting is available via host header in HTTP
request. Path-based routing and TLS termination are also available.
Ingress controller can be implemented in various ways: Nginx Pods, external hardware (e.g. Citrix), cloud ingress providers (e.g. AppGW, AWS ALB). Currently 3 Ingress Controllers are supported: AWS, GCE, nginx. Nginx Ingress setup.
Main difference with a LoadBalancer
Service is that this resource operates
on level 7, which allows it to provide name-based virtual hosting, path-based
routing, TLS termination and other capabilities.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: example-ingress
spec:
ingressClassName: nginx
# non-matching traffic or when no rules are defined
defaultBackend:
service:
name: example-service
port:
number: 80
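Rules can be added under spec of the same object; a sketch of host- and path-based routing with hypothetical host and Service names:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service    # hypothetical backend Service
            port:
              number: 8080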
Rook is a storage orchestration solution.
Kubernetes provides storage abstraction as volumes and persistent volumes. Volumes share lifecycle of the Pod. That means volume persists between container restarts. PersistentVolume stays intact even after Pod is deleted and can be reused again. Volumes are attached to Pods, not containers. volumeMount is used to attach volume defined in a Pod to a container.
Access modes:
- ReadWriteOnce - read/write to a single node
- ReadOnlyMany - read-only by multiple nodes
- ReadWriteMany - read/write by multiple nodes
Kubernetes groups volumes with the same access mode together and sorts them by size from smallest to largest. Claim is checked against each volume in the access mode group until matching size is found.
Simply an empty directory that can be mounted to a container in a Pod. When a Pod
is destroyed, the directory is deleted. Kubernetes creates the emptyDir volume
from the node's local disk or using a memory-backed file system.
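A minimal emptyDir sketch (assumed image and names); medium: Memory would switch it to a memory-backed (tmpfs) volume:
apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo       # hypothetical name
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: scratch
      mountPath: /tmp/scratch
  volumes:
  - name: scratch
    emptyDir: {}           # emptyDir: { medium: Memory } for tmpfs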
Storage abstraction with a separate lifecycle from Pod. Managed by kubelet
-
maps storage on the node and exposes it as a mount.
Persistent volume abstraction has 2 components: PersistentVolume and PersistentVolumeClaim. PersistentVolume is a durable and persistent storage resource managed at the cluster level. PersistentVolumeClaim is a request and claim made by a Pod to use a PersistentVolume (namespaced object, same namespace as Pod). User specifies volume size, access mode, and other storage characteristics. If a claim matches a volume, then claim is bound to that volume and Pod can consume that resource. If no match can be found, Kubernetes will try to allocate one dynamically.
Static provisioning workflow includes manually creating PersistentVolume, PersistentVolumeClaim, and specifying volume in Pod's spec.
PersistentVolume:
apiVersion: v1
kind: PersistentVolume
metadata:
name: nfs-store
spec:
capacity:
storage: 10Gi
accessModes:
- ReadWriteMany
nfs:
...
PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: nfs-store
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 10Gi
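The last step of the static workflow - a sketch of a Pod referencing the claim above (assumed image and names):
apiVersion: v1
kind: Pod
metadata:
  name: nfs-client         # hypothetical name
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: nfs-store # matches the PVC defined above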
When PersistentVolumeClaim object is deleted, PersistentVolume may be deleted depending on reclaim policy. Reclaim policy can be changed on an existing PersistentVolume.
With Retain
reclaim policy PersistentVolume is not reclaimed after
PersistentVolumeClaim is deleted. PersistentVolume status changes to
Released
. Creating new PersistentVolumeClaim doesn't provide access
to that storage, and if no other volume is available, claim stays in Pending
state.
StorageClass resource allows admin to create a persistent volume provisioner
(with type specific configurations). User requests a claim, and apiserver
auto-provisions a PersistentVolume. The resource is reclaimed according to
reclaim policy stated in StorageClass (default is Delete
). Similar to
PersistentVolume StorageClass isn't namespaced.
Dynamic provisioning workflow includes creating a StorageClass object and
PersistentVolumeClaim pointing to this class. When a Pod is created,
PersistentVolume is dynamically created. Delete
reclaim policy in
StorageClass will delete the PersistentVolume, if PersistentVolumeClaim is
deleted.
volumeBindingMode set to WaitForFirstConsumer ensures that the volume isn't
created (even if the PVC exists) until a Pod claims it, thus guaranteeing that it
is created in the same zone and region as the Pod.
StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: main
provisioner: kubernetes.io/azure-disk
parameters:
...
PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: pvc-wait
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
storageClassName: main # reference existing StorageClass
If PersistentVolumeClaim doesn't specify a StorageClass, default StorageClass
is used (default class is marked with
storageclass.beta.kubernetes.io/is-default-class: true
annotation). To
use a preprovisioned PersistentVolume specify StorageClass as empty string.
Deleting a StorageClass doesn't affect existing PV/PVCs.
Provides a way to inject application configuration data into Pods, e.g. config files, command line arguments, environment variables, port number, etc. Can be referenced as a volume. Can ingest data from a literal value, from a file or from a directory of files. Name must be DNS compliant.
ConfigMap can be updated. Also can be set as immutable, meaning can't be
changed after creation. kubelet
periodically syncs with ConfigMap
s to keep
ConfigMap
volume up to date. Data is updated, even if it is already connected
to a Pod (matter of seconds-minutes).
System components and controllers can also use ConfigMaps.
Manifest:
apiVersion: v1
kind: ConfigMap
metadata:
name: app1config
data:
key: value
In command line --from-file
option take the file name as key and contents of
the file as value; --from-env-file
reads lines of key=value
pairs.
--from-file
also takes a directory path as value, in which case a map is
created, where file name is a key.
# create
$ kubectl create configmap [NAME] [DATA]
$ kubectl create configmap [NAME] --from-file=[KEY_NAME]=[FILE_PATH]
# examples
$ kubectl create configmap demo --from-literal=lab.difficulty=easy
$ kubectl create configmap demo --from-file=color.properties
$ kubectl create configmap demo --from-file=customkey=color.properties
$ cat color.properties
color.good=green
color.bad=red
Environment variables (if key contains dash, it is not converted to underscore, but skipped altogether; key names must be valid environment variable names).
spec:
containers:
- name: app1
# Pass individual value from ConfigMap
env:
- name: username
valueFrom:
configMapKeyRef:
name: app1config
key: username
# Create environment variable for each entry in ConfigMap
envFrom:
- configMapRef:
name: app1env
Container command setting - expose as environment variable first, then refer
to it in command
setting:
spec:
containers:
- name: app1
env:
- name: USERNAME
valueFrom:
configMapKeyRef:
name: app1config
key: username
args: ["$(USERNAME)"]
Volume - depending on how ConfigMap is created could result in one file with many values or many files with value in each one. Default permissions are 644.
spec:
containers:
- name: app1
volumeMounts:
- name: app1config
mountPath: /etc/config
volumes:
- name: app1config
configMap:
name: app1config
Volume type ConfigMap can also expose individual entries via items
attribute;
need to specify a file name for each entry:
volumes:
- name: config
configMap:
name: foo
items:
- key: bar
path: custom
If ConfigMap is mounted over non-empty directory, all items stored in that
directory are hidden away. However, individual items can be mounted from a
volume, instead of volume as a whole via subPath
property of the
volumeMount
.
spec:
containers:
- image: some/image
volumeMounts:
- name: myvolume
mountPath: /etc/someconfig.conf
subPath: myconfig.conf
Similar to ConfigMap
, but is used to store sensitive data.
Secret resource is namespaced, and only Pods in the same namespace can reference a given Secret. Always stored in memory (tmpfs), as opposed to physical storage for ConfigMaps. Maximum size is 1MB.
Values must be base64 encoded (when applying manifest, CLI automatically
encodes data). The reason for base64 coding is that values could also be binary
files. stringData
field allows passing data without encoding, however, it is
automatically merged with data
field (stringData
overrides preexisting
field). When reading Secret's data in a Pod, both through volumes or
environment variables, actual value is automatically decoded.
Encryption can also be set up.
kind: Secret
apiVersion: v1
stringData:
foo: bar
data:
cert: LS0TL..
Values passed will be base64 encoded strings (check result with commands below):
$ echo -n "admin" | base64
$ echo -n "password" | base64
Secret types:
- generic - creating secrets from files, directories or literal values (example below)
- TLS - private-public encryption key pair; supply both Kubernetes public key certificate encoded in PEM format and the private key of that certificate
  $ kubectl create secret tls tls-secret --cert=tls.cert --key=tls.key
- docker-registry - credentials for a private docker registry (Docker Hub, cloud based container registries)
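A sketch of creating a generic Secret (hypothetical keys and values):
$ kubectl create secret generic app1 \
    --from-literal=USERNAME=admin \
    --from-literal=PASSWORD=password
# Or from a file; the file name becomes the key
$ kubectl create secret generic app1 --from-file=./credentials.txt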
Can be exposed to a Pod as environment variable or volume/file, latter being able to be updated and reflected in a Pod. A Secret can be marked as immutable - meaning it can not be changed after creation. A Pod using such Secret must also be deleted to be able to read a new Secret with the same name and updated value.
Secrets can be specified individually or all together from a Secret object, in which case keys will be used as environment names; maximum value size is 1MB:
spec:
containers:
- name: one
env:
- name: APP_USERNAME
valueFrom:
secretKeyRef:
name: app1
key: USERNAME
- name: two
envFrom:
- secretRef:
name: app2
Exposing as a file creates a file for each key and puts its value inside as
file content. kubelet
syncs Secret volumes just as ConfigMaps (except
subPath volume, which doesn't receive updates).
spec:
containers:
volumeMounts:
- name: appconfig
mountPath: /etc/appconfig
volumes:
- name: appconfig
secret:
secretName: app
Image pull secret is used to pull images from private registries. First, create
a docker-registry secret. A single entry .dockercfg
is created in the Secret,
just like Docker creates a file in user's home directory for docker login
.
$ kubectl create secret docker-registry mydockerhubsecret \
--docker-username=myusername --docker-password=mypassword \
[email protected]
Reference a Secret as imagePullSecrets
in Pod's spec:
kind: Pod
spec:
imagePullSecrets:
- name: mydockerhubsecret
Every request to API server goes through the 3 step process:
- Authentication
- Authorization
- Admissions (validate contents of the request, optionally modify it)
Kubernetes provides two types of identities: normal user and service account.
Users are not created nor managed by the API (there is no API object), but
should be managed by external systems. Service Accounts are created by
Kubernetes itself to provide identity for processes in Pods to interact with
apiserver
.
Controlling access to the Kubernetes API.
# check allowed action as current or any given user
$ kubectl auth can-i create deployments
$ kubectl auth can-i create deployments --as <user_name>
$ kubectl auth can-i list pods --as=system:serviceaccount:<namespace>:<service_account>
$ kubectl get pods -v6 --as=system:serviceaccount:<namespace>:<service_account>
kubeadm
-based cluster creates a self-signed CA,
which is used to create certificates for system components and signed user
certificates. kubernetes-admin
user is also created, which has all access
across the cluster.
Authentication validation is performed by authentication plugin. Multiple
plugins can be configured, which apiserver
calls in sequence until one of them
determines the identity of the sender - username, user ID, group it belongs
to. Below are the main methods:
Method | Description |
---|---|
client certificate | Username is included in the certificate itself (Common Name field). Most commonly used in kubeadm -bootstrapped and cloud managed clusters. |
authentication token | Included in HTTP authorization header. Used with Service Accounts, during bootstrapping, and can also authenticate users via Static Token File, which is read only at apiserver startup, changes in this file require apiserver restart. |
basic HTTP | User credentials are stored in Static password file. This file is also read only at apiserver startup. However, easy to set up and use for dev environments. |
OpenID provider | Allow external identity providers for authentication services, SSO is also possible. |
Authentication type is defined in apiserver
startup options.
Documentation.
Groups are simple strings, representing arbitrary group names. They are used to grant permissions to multiple identities at once.
Built-in groups:
- system:unauthenticated - used for requests where none of the authentication plugins could authenticate the client
- system:authenticated - automatically assigned to a user who is authenticated
- system:serviceaccounts - encompasses all service accounts in the system
- system:serviceaccounts:<namespace> - encompasses all service accounts in the specific namespace
Similarly to authentication plugins, multiple authorization plugins can be
configured, which apiserver
calls in sequence, until one of them determines that
the user can do the requested action. Authorization plugins:
- RBAC
- Node - grant access to kubelets on nodes
- ABAC (Attribute-based Access Control) - policies with attributes
In kubeadm
-based clusters RBAC and Node authorization plugins are enabled by
default.
Base elements:
- subject (who) - users or processes that can make requests to
apiserver
- resources (on what) - API objects such as Pods, Deployments, etc
- actions (what) - verbs, operations such as get, watch, create
Elements are connected together using 2 RBAC API objects: roles (connect API resources and actions) and role bindings (connect roles to subjects). Both can be applied on a cluster or namespace level.
Roles are what can be done to resources. A Role includes one or many rules that specify allowed verbs on resources. Rules are permissive; default action is deny (there is no deny rule). Subjects are users, groups or Service Accounts.
get
, list
and watch
are often used together to provide read-only access.
patch
and update
are also usually used together as a unit. Only get
,
update
, delete
and patch
can be used on named resources. *
represents
all actions, full access.
To prevent privilege escalation, the API server only allows users to create and update Roles, if they already have all the permissions listed in that Role (and for the same scope).
Default ClusterRoles and ClusterRoleBindings are updated (recreated) each time
apiserver
starts - in case one was accidentally deleted or new Kubernetes
version brings updates.
Combination | Scope |
---|---|
Role + RoleBinding | namespaced resources in a specific namespace |
ClusterRole + RoleBinding | namespaced resources in a specific namespace (reusing same ClusterRole in multiple namespaces) |
ClusterRole + ClusterRoleBinding | namespaced resources in any or all namespaces, cluster level resources, non-resource URLs |
# Role and RoleBinding
$ kubectl create role <name> --verb=<list_of_verbs> --resource=<list_of_resource>
$ kubectl create rolebinding <name> --role=<role_name> --serviceaccount=<namespace>:<service_account>
$ kubectl create rolebinding <name> --role=<role_name> --user <user_name>
$ kubectl create role newrole --verb=get,list --resource=pods
$ kubectl create rolebinding newrolebinding --role=newrole --serviceaccount=default:newsvcaccount
# ClusterRoleBinding
$ kubectl create clusterrolebinding <name> --clusterrole=view --user=<user_name>
ClusterRole is defined at cluster level. Enables access to:
- cluster scoped resources (Nodes, PersistentVolumes, etc)
- resources in more than one or all namespaces (acts as a common role to be bound inside individual namespaces)
- non-resource URLs (/healthz, /version, etc); for non-resource URLs plain HTTP verbs must be specified (post, get, etc, also need to be lowercase)
Rule anatomy:
- apiGroups - empty string represents the Core API Group (Pods, etc)
- resources - Pods, Services, etc (plural form must be used)
- verbs - get, list, create, update, patch, watch, delete,
deletecollection;
*
represents all verbs
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: demorole
namespace: ns1
rules:
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list"]
Default ClusterRoles:
Name | Description |
---|---|
cluster-admin | Cluster-wide super user, with RoleBinding gives admins access within a namespace (Roles, RoleBindings, ResourceQuotas) |
admin | Full access within a namespace, with RoleBinding gives admin access within a namespace (Roles, RoleBindings) |
edit | Read/write access within a namespace. Can't view/edit Roles, RoleBindings, ResourceQuotas, can access Secrets |
view | Read-only access within a namespace. Can't view/edit Roles, RoleBindings, ResourceQuotas, no access to Secrets |
Role binding always references a single role, but can bind a role to multiple subjects. They can also bind to Service Accounts in another namespace. RoleBinding must be in the same namespace with Role. ClusterRoleBinding provides access across all namespaces.
- roleRef - Role or ClusterRole reference (RoleBinding can reference Role or ClusterRole, while ClusterRoleBinding can only reference ClusterRole)
- Subjects
- kind (user, group, SA)
- Name
- Namespace (optionally for SA, since user and group are not namespaced)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: demobinding
namespace: ns1
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: demorole
subjects:
- apiGroup: rbac.authorization.k8s.io
kind: User
name: demouser
If a request is trying to create, modify, or delete a resource, it is sent to admission controller. Multiple controllers can be configured; a request goes through all of them. Admission controller has access to the content of the request, is able to modify and validate it, and potentially can even deny the request. Do not and cannot block read requests.
Controllers are compiled into apiserver
binary, can be enabled or disabled
during startup (multiple controllers can be enabled by providing a comma
separated list): --enable-admission-plugins=Initializers
,
--disable-admission-plugins=PodNodeSelector
.
In addition to compiled-in plugins, admission plugins can be developed as extension and run as a webhook, more on that in Dynamic Admission Control.
Some examples of admission controllers:
- Initializer allows dynamic modification of the API request
- ResourceQuota ensures the object being created doesn't violate any existing quotas
- NamespaceAutoProvision checks a request and creates a namespace, if it doesn't already exist
- LimitRanger applies default memory/cpu limits for a namespace
- PersistentVolumeClaimResize checks incoming PVC resize requests
# View list of enabled and/or disabled admission controllers.
# Changes in this file will be picked up and applied by api-server
$ sudo grep admission /etc/kubernetes/manifests/kube-apiserver.yaml
# Enabling or disabling can also be done directly via kube-apiserver binary;
# get inside kube-apiserver container and run the following
$ kube-apiserver --enable-admission-plugins=LimitRanger
$ kube-apiserver --disable-admission-plugins=NamespaceAutoProvision
In kubeadm
-based cluster self signed CA is created by default. However, an
external PKI (Public Key Infrastructure) can also be joined. CA is used for
secure cluster communications (e.g. apiserver
) and for authentication of
users and cluster components.
CA and core cluster component certificates and keys, etcd cert setup and more
are located at /etc/kubernetes/pki. Service Account tokens are seeded from
sa.key
and sa.pub
also located there.
ca.crt
is a CA certificate that is used by clients to trust certificates
issued by this CA (presented by apiserver
to encrypt communication). It is
distributed to:
- nodes during bootstrapping
- clients, users, to interact with the cluster (e.g. kubeconfig)
- included in the Secret that is created as part of a Service Account
ca.key
is a private key that is matched with ca.crt
.
apiserver
exposes an API to create and sign x509 certificates (through
Certificate Signing Request, CSR).
Create a user certificate:
- create a private key (openssl or cfssl)
$ openssl genrsa -out <user_name>.key 2048
- create a CSR (openssl or cfssl), needs to be base64 encoded, header and
trailer need to be trimmed out
  # CN - username
  # O - group
  $ openssl req -new -key <user_name>.key -out <user_name>.csr -subj "/CN=new_user"
  $ cat <user_name>.csr | base64 | tr -d "\n" > <user_name>.base64.csr
- create and submit CSR object
  apiVersion: certificates.k8s.io/v1
  kind: CertificateSigningRequest
  metadata:
    name: <csr_name>
  spec:
    groups:
    - system:authenticated
    request: <contents of <user_name>.base64.csr>
    signerName: kubernetes.io/kube-apiserver-client
    usages:
    - client auth
  $ kubectl apply -f <csr>.yaml
  $ kubectl get csr
- approve CSR
$ kubectl certificate approve <csr_name>
- retrieve certificate
  $ kubectl get certificatesigningrequests <csr_name> \
      -o jsonpath='{ .status.certificate }' | base64 --decode > <user_name>.crt
CSR objects are garbage collected from the apiserver
in 1 hour - CSR approval
and certificate retrieval must be done within that 1 hour.
Namespaced API object that provides an identity for processes in a Pod to
access API server and perform actions. Certificates are mounted as a volume
and are made available to a Pod at /var/run/secrets/kubernetes.io/serviceaccount/
.
Each namespace has a default Service Account (created automatically with a namespace). All Pods must have a Service Account defined; if none is specified, default is used. This setting must be set at creation time; can not be changed later.
Each Service Account is tied to a Secret (created and deleted automatically) stored in the cluster. That Secret contains CA certificate, authentication token (JWT) and namespace of Service Account.
Create Service Account:
- Declaratively:
  apiVersion: v1
  kind: ServiceAccount
  metadata:
    name: mysvcaccount
- Imperatively:
$ kubectl create serviceaccount mysvcaccount
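Reference the Service Account from a Pod spec (a sketch with an assumed image; the account must exist before the Pod is created):
apiVersion: v1
kind: Pod
metadata:
  name: api-client         # hypothetical name
spec:
  serviceAccountName: mysvcaccount
  containers:
  - name: app
    image: nginx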
Introduces 3 different policies that broadly cover what Pods are allowed to do. Policies are cumulative, and are applied at a namespace level (give appropriate label). Existing Pods are not affected. Rules are enforced via built-in admission controller.
- Privileged - unrestricted, also known as privilege escalations
- Baseline - minimally restrictive, which prevents known privilege escalations, allows the default Pod configuration
- Restricted - heavily restricted, follows current Pod hardening best practices
Example:
apiVersion: v1
kind: Namespace
metadata:
name: no-restrictions-namespace
labels:
pod-security.kubernetes.io/enforce: privileged
pod-security.kubernetes.io/enforce-version: latest
kubectl [command] [type] [Name] [flag]
All commands can be supplied with object type and a name separately or in the
form of object_type/object_name
.
Common commands:
- apply/create - create resources
- run - start a pod from an image
- explain - built-in documentation of an object or resource, can also pass object's properties via dot notation, e.g. pod.spec.containers
- delete - delete resources
- get - list resources, get all shows all Pods and controller objects
- describe - detailed information on a given resource
- exec - execute command on a container (in multi container scenario, specify container with -c or --container; defaults to the first declared container)
- logs - view logs on a container
- cp path/on/host <pod_name>:path/in/container - copy files from host
- diff - check the difference between existing object and the one defined in the manifest
  $ kubectl diff -f manifest.yaml
- set - update various fields in already existing Kubernetes objects
Common flags:
- -o <format> - output format, one of wide, yaml, json
- --dry-run <option> - either server or client; client is useful for validating syntax and generating a syntactically correct manifest, server sends a request to apiserver, but doesn't persist it in storage, could be used to catch syntax errors
  $ kubectl create deployment nginx --image nginx --dry-run=client -o yaml
- -v - verbose output, can be set to different levels, e.g. 7. Any number can be specified starting from 0 (less verbose), but there is no implementation for greater than 10
- --save-config - use with create or apply commands to save configuration in an annotation to serve as metadata for future object updates
- --watch - give output over time (updates when status changes)
- --recursive - show all inner fields
- --show-labels - add extra column labels to the output with all labels
- --tail=<number> - limit output to the last <number> lines (e.g. 20)
- --since=3h - limit output based on a time limit
- --cascade=orphan - deletes the controller, but not objects it has created
Running imperative commands, adhocs, like set
, create
, etc does not leave
any change information. --record
option can be used to write the command to
kubernetes.io/change-cause
annotation to be inspected later on, for example,
by kubectl rollout history
.
Change manifest and object from cli (in JSON parameter specify field to change from root):
$ kubectl patch <object_type> <name> -p <json>
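For example, a sketch that patches the replica count of a Deployment:
$ kubectl patch deployment <name> -p '{"spec":{"replicas":3}}'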
Get privileged access to a node through interactive debugging container (will
run in host namespace, node's filesystem will be mounted at /host
)
$ kubectl debug node/<name> -ti --image=<image_name>
Troubleshoot a Pod that doesn't have necessary utilities or shell altogether.
Adds a new container to a running Pod. This will be listed as an ephemeral
container. Added via a handler (API call), not the Pod's spec. --target option
option
adds new container to the same namespace (so that ps
would show all
processes) - must be implemented by container runtime, otherwise gets added in
its own separate namespace.
$ kubectl debug -it <pod> --image=busybox --target=<pod>
tutum/dnsutils
image contains nslookup
and dig
for DNS debugging.
JSONPath support docs.
To build up JSONPath parameter output desired objects with -o json
first.
# get names of all Pods
$ kubectl get pod -o jsonpath='{ .items[*].metadata.name }'
# get images used in all namespaces
$ kubectl get pod -A -o jsonpath='{ .items[*].spec.containers[*].image }'
Filter specific field instead of retrieving all elements in a list with *
.
?
- define a filter, @
- refer to current object.
# Retrieve internal IPs of all nodes
$ kubectl get nodes -o jsonpath="{ .items[*].status.addresses[?(@.type=='InternalIP')].address }"
Output can be formatted for easy reading:
$ kubectl get pod -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
Sorting can be done based on any string or numeric field with --sort-by
parameter. Also data can be presented in columns; usually used with
custom-columns (e.g. to output fields that are not part of default
kubectl
output).
$ kubectl get pod -o jsonpath='{ .items[*].metadata.name }' --sort-by=.metadata.name
$ kubectl get pod -o jsonpath='{ .items[*].metadata.name }' \
--sort-by=.metadata.creationTimestamp \
--output=custom-columns='Name:metadata.name,CREATIONTIMESTAMP:metadata.creationTimestamp'
kubectl
interacts with kube-apiserver
and uses configuration file,
$HOME/.kube/config
, as a source of server information and authentication.
context is a combination of cluster and user credentials. Can be passed as
cli parameters, or switch the shell contexts:
$ kubectl config use-context <context>
kubeconfig
files define connection settings to the cluster: mainly client
certificates and apiserver
network location. Often CA certificate that
was used to sign the certificate of apiserver
is also included, thus, client
can trust the certificate presented by apiserver
upon connection.
During kubeadm
bootstrapping kubeconfig
files for various components are
placed at /etc/kubernetes
:
- admin.conf - cluster admin account (kubernetes-admin)
- kubelet.conf
- controller-manager.conf
- scheduler.conf
Each worker node also has kubeconfig.conf
file that is used by kubelet
to
authenticate to the apiserver
. kube-proxy
's kubeconfig
file is stored as
ConfigMap in kube-system
namespace.
kubeconfig
consists of 3 sections: clusters, users, and contexts (combination
of a user and a cluster with optionally a namespace). Each has a name
field
for reference. Context name convention - <user_name>@<cluster_name>
.
current-context
field specifies the default context to use with all kubectl
commands. User
defines a user name and either one of
certificate/token/password.
- cluster
  - certificate-authority-data - base64-encoded ca.crt
  - server - URL, location of API server
- user
  - client-certificate-data - base64-encoded certificate that is presented to API server for authentication (username is encoded inside certificate)
  - client-key-data - correlated private key
- context
  - cluster - referenced by name
  - user - referenced by name
kubectl config
is used to interact with kubeconfig
file. Default user
location - $HOME/.kube/config
. Use --kubeconfig
parameter or KUBECONFIG
environment variable to use a different file in a custom location.
# View contents, basic server info
# Certificate data is redacted
$ kubectl config view
# View all including certificates (base64 encoded)
$ kubectl config view --raw
$ kubectl config view --kubeconfig=/path/to/kubeconfig
# context, cluster info
# useful to verify the context
$ kubectl cluster-info
# list all contexts
$ kubectl config get-contexts
$ kubectl config use-context <context_name>
# configure user credentials
# token and username/password are mutually exclusive
$ kubectl config set-credentials
# Remove entries
$ kubectl config delete-context <context>
$ kubectl config delete-cluster <cluster>
$ kubectl config unset users.<user>
Manually create kubeconfig
file using user certificates obtained earlier:
# Define cluster
# Optionally specify --kubeconfig to use custom file instead of default
# --embed-certs base64 encodes certificate data and inserts it
$ kubectl config set-cluster <cluster_name> \
--server=<api_server_url> \
--certificate-authority=<path_to_ca.crt> \
--embed-certs=true \
--kubeconfig=<path_to_kubeconfig.conf>
# Define a credential
$ kubectl config set-credentials <user_name> \
--client-key=<path_to_user.key> \
--client-certificate=<path_to_user.crt> \
--embed-certs=true \
--kubeconfig=<path_to_kubeconfig.conf>
# Define context
$ kubectl config set-context <context_name> \
--cluster=<cluster_name> \
--user=<user_name> \
--kubeconfig=<path_to_kubeconfig.conf>
krew is a plugin (extensions)
manager for kubectl
. Plugins introduce new commands, but don't overwrite or
extend existing kubectl
commands.
Extend kubectl with plugins.
Ensure that PATH includes plugins (krew
's home directory, most likely
$HOME/.krew
).
- get container ids in a pod:
$ kubectl get pod <pod_name> -o=jsonpath='{range .status.containerStatuses[*]}{@.name}{" - "}{@.containerID}{"\n"}{end}'
- start up ubuntu pod:
$ kubectl run <name> --rm -i --tty --image ubuntu -- bash
Sample image for testing gcr.io/google-samples/hello-app:1.0
.
General troubleshooting steps:
- command line errors
- Pod logs and state of Pods
- Pod DNS and network issues (via container shell)
- node logs (check for errors), enough resources allocated
- RBAC, SELinux or AppArmor for security settings
- API calls to and from controllers to
kube-apiserver
- inter-node network issues, DNS and firewall
- control plane controllers
Helpful materials:
- Troubleshooting
- Troubleshooting applications
- Troubleshoot cluster
- Debug pods and ReplicationControllers
- Debug services
Sonobuoy is a conformance testing tool that helps to validate the state of Kubernetes cluster.
- OpenShift Online Starter (free, multi-tenant hosted solution)
- Minishift (similar to Minikube, OpenShift cluster)
- Tutorials
- Kubernetes Fundamentals
- Certified Kubernetes Administrator Pluralsight Path
- Kubernetes in Action (Marko Luksa)
- CKA lab practice