
Kubernetes

Pods

  • create a pod spec: kubectl run podname --image=busybox -o yaml --dry-run=client --command -- sleep 4800 > pod.yaml

  • get pods sorted by creation timestamp: kubectl get pods --sort-by=.metadata.creationTimestamp

  • get pods with "Running" status: kubectl get pods --field-selector=status.phase=Running

  • get pods with status "!Running": kubectl get pods --field-selector=status.phase!=Running

  • get pods with label env=dev or env=prod: kubectl get pods -l 'env in (dev,prod)' --show-labels

  • get pods sorting by node name: kubectl get pods -o wide --sort-by=.spec.nodeName

  • get the container images in a pod (podname): kubectl get pod podname -o jsonpath='{.spec.containers[*].image}{"\n"}'

  • DEBUGGING - get pods with status "!Succeeded": kubectl get pods --field-selector=status.phase!=Succeeded

  • force delete: kubectl delete pod nginx-0 --grace-period=0 --force

  • after force deleting a pod, it might remain in an Unknown state; patching away its finalizers removes the entry from the API server: kubectl patch pod nginx-0 -p '{"metadata":{"finalizers":null}}'

  • check the resources section within the pod’s spec.containers: kubectl explain pod.spec.containers.resources

  • update resources of a container in a deployment: kubectl set resources deployment/nginx-deployment -c=nginx --limits=cpu=200m,memory=512Mi

  • delete a stuck CRD: kubectl patch crd/crd-name -p '{"metadata":{"finalizers":[]}}' --type=merge

  • force deletion of a stuck namespace (in this case developer):

echo '{
    "apiVersion": "v1",
    "kind": "Namespace",
    "metadata": {
        "name": "developer"
    },
    "spec": {
        "finalizers": null
    }
}' > /tmp/namespace.json

# start the API proxy (run it in a separate terminal, or background it with &)
kubectl proxy

curl -k -H "Content-Type: application/json" -X PUT --data-binary @/tmp/namespace.json http://127.0.0.1:8001/api/v1/namespaces/developer/finalize
  • to delete hanging resources in general (a CRD in this example):
kubectl get crd rayclusters.cluster.ray.io -o json > bad.json

# Make a PUT call removing any finalizer
# NOTE: .metadata.selfLink is deprecated and no longer populated on recent Kubernetes
# versions; there, build the resource URL from the API group/version/name manually.
jq '.metadata.finalizers = []' bad.json | curl -kD- -H "Content-Type: application/json" -X PUT --data-binary @- "127.0.0.1:8001$(jq -r '.metadata.selfLink' bad.json)"
  • touch a file in a list of pods: for name in pod1 pod2 pod3; do kubectl exec $name -- touch /tmp/file; done

Secrets

  • get decoded secret: kubectl get secrets/<secret-name> --template='{{.data.<target-key> | base64decode}}'

Debugging

Official Doc: https://kubernetes.io/docs/tasks/debug-application-cluster/

Pods

Nodes

  • create an interactive shell on a node: kubectl debug node/mynode -it --image=busybox

When creating a debugging session on a node, keep in mind that:

  • kubectl debug automatically generates the name of the new Pod based on the name of the Node.
  • The container runs in the host IPC, Network, and PID namespaces.
  • The root filesystem of the Node will be mounted at /host.

Autoscaling: Karpenter automatically launches just the right compute resources to handle your cluster's applications.

Containers

  • Check the container capabilities:
# open a shell inside the target container
kubectl exec -it pod -c container -- sh
# show the capability bitmasks of PID 1 (CapInh, CapPrm, CapEff, CapBnd, CapAmb)
grep Cap /proc/1/status
# decode a bitmask into capability names (example value; run capsh wherever it is installed)
capsh --decode=00000000a80425fb

More info on the Linux capabilities at: https://github.com/torvalds/linux/blob/master/include/uapi/linux/capability.h

Images

  • list all images in the target namespace: kubectl get pods -n namespace -o=jsonpath='{range .items[*]}{"\n"}{range .spec.containers[*]}{.image}{end}{end}' | sort

UI

  • Dashboard: port-forward the service and log in at localhost:12345: kubectl port-forward svc/kubernetes-dashboard -n kubernetes-dashboard 12345:80

    • to get the auth token (on Kubernetes >= 1.24, see also the note after this list):
    1. list the service accounts: kubectl get serviceaccounts
    2. get the target service account and note the name of its token secret: kubectl get serviceaccounts user -o yaml
    3. get the secret (name taken from the previous step): kubectl get secrets gian-token-ghh4l -o yaml
    4. decode the token field: echo -n <token> | base64 -d
  • Headlamp is an easy-to-use and extensible Kubernetes web UI: https://headlamp.dev/
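Note on the Dashboard token steps above: since Kubernetes 1.24, a token Secret is no longer created automatically for a service account. A minimal alternative, assuming the service account from step 2 is named user:

# request a short-lived token via the TokenRequest API instead of reading a Secret
kubectl create token user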

Werf

Helm

  • enable autocompletion in Linux (bash): helm completion bash > /etc/bash_completion.d/helm
  • search Hub for an ingress controller: helm search hub ingress
  • add NGINX to the repositories: helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
  • refresh the local repository index: helm repo update
  • download and unpack the chart: helm pull ingress-nginx/ingress-nginx --untar
  • install the chart (run from inside the untarred chart directory): helm install myingress .

ETCD

Note that the examples refer to etcd version 3.

  • List all the Kubernetes keys: ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key get / --prefix --keys-only

  • Delete a key (example /registry/your-key): ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key del /registry/your-key

Metrics

Schedule Pods on preferred nodes

  • prevent non-core pods from being scheduled on the core nodes -> taint the core nodes
  • make sure that core pods run on the core nodes -> nodeAffinity on the pods + a toleration for the taint
  • make sure that core pods can still be scheduled on non-core nodes if the core nodes fail -> use the preferredDuringSchedulingIgnoredDuringExecution form of nodeAffinity

The following script labels and taints a target node and writes a matching pod spec snippet (affinity + toleration) to pod-spec.yaml:
#!/bin/bash

set -o errexit
set -o nounset
set -o pipefail

if [[ $# -ne 3 ]]; then
  printf "error - wrong input parameters - expected: node-name label-key label-value \n"
  printf "parameters passed: %s\n" "$*"
  exit 1
fi

NODENAME=$1
LABELKEY=$2
LABELVALUE=$3

# script depends on kubectl - check for existence
cmd=kubectl
if ! which "${cmd}" >/dev/null; then
  echo "can't find ${cmd} in PATH, please fix and retry"
  exit 1
fi

# generate the template pod spec
cat <<EOF > pod-spec.yaml
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: ${LABELKEY}
            operator: In
            values:
            - ${LABELVALUE}
  tolerations:
  - key: "${LABELKEY}"
    operator: "Equal"
    value: "${LABELVALUE}"
    effect: "NoSchedule"
EOF

# add a label to the target node
kubectl label nodes "$NODENAME" "$LABELKEY"="$LABELVALUE"

# add a taint to the target node
kubectl taint node "$NODENAME" "$LABELKEY"="$LABELVALUE":NoSchedule

# add a taint to all the nodes that have the target label set
# kubectl taint node -l "$LABELKEY"="$LABELVALUE" "$LABELKEY"="$LABELVALUE":NoSchedule

To test:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: test-taint
  name: test-taint
spec:
  containers:
  - image: bash
    name: test-taint
    command: [ "sleep", "600" ]
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: role
            operator: In
            values:
            - core
  tolerations:
  - key: "role"
    operator: "Equal"
    value: "core"
    effect: "NoSchedule"
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
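After applying the manifest above, check where the pod was scheduled (the file name test-taint.yaml is only an assumed name for the saved manifest):

# apply the test pod and confirm it landed on a node labelled role=core
kubectl apply -f test-taint.yaml
kubectl get pod test-taint -o wide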

Resource Reservation

Pods

DRA

Dynamic Resource Allocation is an API for requesting and sharing resources between pods and containers inside a pod. Kubernetes v1.29 includes cluster-level API support for dynamic resource allocation, but it needs to be enabled explicitly.

The fundamental components of DRA are:

  • ResourceClass: Identifies the resource driver handling a particular kind of resource.
  • ResourceClaim: Specifies a particular resource instance needed by a workload.
  • ResourceClaimTemplate: Defines the specs for creating ResourceClaims.
  • PodSchedulingContext: Facilitates coordination in pod scheduling.

DRA with NVIDIA GPUs: https://github.com/NVIDIA/k8s-dra-driver
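A minimal sketch of how a workload could request a resource through DRA, assuming the v1alpha2 API that ships behind the feature gate in v1.29; the class name gpu.example.com and all other names are placeholders:

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim-template
spec:
  spec:
    resourceClassName: gpu.example.com   # must reference a ResourceClass installed by the driver
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-example
spec:
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: gpu-claim-template
  containers:
  - name: app
    image: bash
    command: ["sleep", "600"]
    resources:
      claims:
      - name: gpu   # consume the claim declared in spec.resourceClaims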

System

Kubernetes offers two variables that can be configured as part of the kubelet configuration file:

  • systemReserved
  • kubeReserved

When configured, these two variables tell the kubelet to reserve a certain amount of resources for system processes (kernel, sshd, etc.) and for Kubernetes node components (like the kubelet itself), respectively.

When these variables are set alongside a third argument that is enabled by default (--enforce-node-allocatable), the kubelet limits the amount of resources that pods on the node can consume (total capacity - kubeReserved - systemReserved), using the Linux cgroup feature.

This limit ensures that whenever the total amount of memory consumed by pods on a node grows above what is allowed, Linux itself starts to evict the pods that consume more resources than they requested. This way, important processes are guaranteed a minimum amount of resources.

To configure, edit the file /etc/kubernetes/kubelet-config.yaml and add the following:

kubeReserved:
  cpu: 100m
  memory: 1G
systemReserved:
  cpu: 100m
  memory: 1G
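To verify the effect, compare Capacity and Allocatable on the node: Allocatable should be roughly Capacity minus kubeReserved, systemReserved, and the hard eviction threshold (mynode is a placeholder):

# Allocatable = Capacity - kubeReserved - systemReserved - evictionHard
kubectl describe node mynode | grep -A 6 -E '^(Capacity|Allocatable)'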

Eviction

Another argument that can be passed to the kubelet is evictionHard, which specifies an absolute amount of memory that should always be available on the node. Setting it gives critical processes extra room to expand above their reserved resources when they need to, and prevents them from starving on the node.

If the amount of memory available on the node drops below the configured value, the kubelet starts to evict pods on the node.

This enforcement is performed by the kubelet itself and is therefore less reliable, but it lowers the chance of resource issues on the node and is recommended. To configure it, update the file /etc/kubernetes/kubelet-config.yaml with the following:

evictionHard:
  memory.available: "500Mi"
  # Default value for evictionHard on kubelet
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"

Note that specifying values for evictionHard overrides the kubelet's default thresholds, which are important, so keep the defaults listed above unless you have a reason to change them. For further reading, refer to reserve-compute-resources.
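A couple of quick checks to see whether evictions or resource pressure are occurring (the node name is a placeholder):

# pods evicted by the kubelet are left in phase Failed
kubectl get pods --all-namespaces --field-selector=status.phase=Failed
# node conditions flip to MemoryPressure / DiskPressure when thresholds are crossed
kubectl describe node mynode | grep -E 'MemoryPressure|DiskPressure'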

Logging

Kubernetes does not provide cluster-wide logging out of the box. A common approach is to use another CNCF project, Fluentd, typically deployed as a DaemonSet: it provides a unified logging layer for the cluster that filters, buffers, and routes messages.
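A minimal sketch of such a Fluentd DaemonSet that tails the node log directory; the image tag is a placeholder and the output (Elasticsearch, S3, ...) still needs to be configured:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch  # placeholder image/tag
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log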

Metrics

  • Resources -> USE method: captures Utilization, Saturation, and Errors (USE) for each of the resources your application uses.
  • Services -> RED method: captures the Rate/Requests, Errors, and Durations of requests that the service handles.
  • Four Golden Signals: Google suggests you measure four critical signals for every service (Latency, Traffic, Errors, and Saturation).

Observability

Prometheus

Prometheus is a popular open-source monitoring and alerting tool that can be used to scrape metrics from your applications running in a Kubernetes cluster. One way to configure Prometheus to scrape metrics from a new application is to use labels or annotations in your Kubernetes deployment.

Here is an example of how to configure a Prometheus scrape target for a new application using labels:

In your Kubernetes deployment file, add the following label to the pod template:

    labels:
      prometheus: "true"

In your Prometheus configuration file, add a new scrape configuration for the pods with the prometheus label:

    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_prometheus]
        action: keep
        regex: true

This configuration tells Prometheus to scrape metrics from all pods that have the label prometheus: "true". The relabel_configs section is used to filter the pods based on their labels.

You can also configure the Prometheus scrape target using annotations. The process is very similar.

In your Kubernetes deployment file, add the following annotation to the pod template:

    annotations:
      prometheus.io/scrape: "true"

In your Prometheus configuration file, add a new scrape configuration for the pods with the prometheus.io/scrape annotation:

    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

This configuration tells Prometheus to scrape metrics from all pods that have the annotation prometheus.io/scrape: "true".
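If the application exposes metrics on a non-default port or path, two additional relabel rules (a common pattern from the upstream Prometheus example configuration) honour the prometheus.io/port and prometheus.io/path annotations:

      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__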

Note that the configurations above are just examples; customize them to your needs, for instance with additional relabelling rules.

Alternatively, if you run the Prometheus Operator, target discovery is configured declaratively with ServiceMonitor and PodMonitor custom resources that select pods and services by label, instead of editing the scrape configuration by hand.
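For example, a minimal PodMonitor sketch that selects the pods labelled prometheus: "true"; the port name metrics is an assumption about how the container port is named:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      prometheus: "true"
  podMetricsEndpoints:
  - port: metrics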

Networking

Tim Hockin, one of the lead Kubernetes developers, has created a very useful slide deck for understanding Kubernetes networking.

Clusters Federation

  • KubeFed: Kubernetes Cluster Federation allows you to coordinate the configuration of multiple Kubernetes clusters from a single set of APIs in a hosting cluster.
  • multicluster-scheduler: Admiralty is a system of Kubernetes controllers that intelligently schedules workloads across clusters.
  • Cilium Multi-cluster: Cilium's multi-cluster implementation - ClusterMesh.
  • Virtual Kubelet: an open-source implementation that masquerades as a kubelet, allowing Kubernetes nodes to be backed by providers such as serverless cloud container platforms.

Stateful workloads

  • PostgreSQL Operator: StackGres is a stack of software components built on standard Postgres, with Patroni for high availability, connection pooling, automated backups, monitoring, centralized logging, and a fully-featured management web console.

Tools

AKS

Additional Resources

Articles

MutatingAdmissionWebhook

Presentations

Manifest Examples

Pod

apiVersion: v1
kind: Pod
metadata:
  name: counter
spec:
  containers:
  - name: count
    image: busybox:1.28
    args:
    - /bin/sh
    - -c
    - >
      i=0;
      while true;
      do
        echo "$i: $(date)" >> /var/log/1.log;
        echo "$(date) INFO $i" >> /var/log/2.log;
        i=$((i+1));
        sleep 1;
      done      
    volumeMounts:
    - name: varlog
      mountPath: /var/log
  volumes:
  - name: varlog
    emptyDir: {}