
Kubernetes

Pods

  • create a pod spec: kubectl run podname --image=busybox -o yaml --dry-run=client --command -- sleep 4800 > pod.yaml

  • get pods sorted by creation timestamp: kubectl get pods --sort-by=.metadata.creationTimestamp

  • get pods with "Running" status: kubectl get pods --field-selector=status.phase=Running

  • get pods with status "!Running": kubectl get pods --field-selector=status.phase!=Running

  • get pods with label env=dev or env=prod: kubectl get pods -l 'env in (dev,prod)' --show-labels

  • get pods sorting by node name: kubectl get pods -o wide --sort-by=.spec.nodeName

  • get the container images in a pod (podname): kubectl get pod podname -o jsonpath='{.spec.containers[*].image}{"\n"}'

  • DEBUGGING - get pods with status "!Succeeded": kubectl get pods --field-selector=status.phase!=Succeeded

  • force delete: kubectl delete pod nginx-0 --grace-period=0 --force

  • after force deleting a pod, it might remain in an Unknown state; patching away its finalizers removes the entry from the API server: kubectl patch pod nginx-0 -p '{"metadata":{"finalizers":null}}'

  • check the resources section within the pod’s spec.containers: kubectl explain pod.spec.containers.resources

  • update resources of a container in a deployment: kubectl set resources deployment/nginx-deployment -c=nginx --limits=cpu=200m,memory=512Mi

  • delete a stuck CRD: kubectl patch crd/crd-name -p '{"metadata":{"finalizers":[]}}' --type=merge

  • force deletion of a stuck namespace (in this case developer):

echo '{
    "apiVersion": "v1",
    "kind": "Namespace",
    "metadata": {
        "name": "developer"
    },
    "spec": {
        "finalizers": null
    }
}' > /tmp/namespace.json

# start the API proxy (run it in a separate terminal, or background it with &)
kubectl proxy

curl -k -H "Content-Type: application/json" -X PUT --data-binary @/tmp/namespace.json http://127.0.0.1:8001/api/v1/namespaces/developer/finalize
  • to delete hanging resources in general (a CRD in this example):
kubectl get crd rayclusters.cluster.ray.io -o json > bad.json

# Make a PUT call removing any finalizer
# NOTE: .metadata.selfLink is deprecated and no longer populated on recent Kubernetes
# versions; there, build the resource URL from the API group/version/name manually.
jq '.metadata.finalizers = []' bad.json | curl -kD- -H "Content-Type: application/json" -X PUT --data-binary @- "127.0.0.1:8001$(jq -r '.metadata.selfLink' bad.json)"
  • touch a file in a list of pods: for name in pod1 pod2 pod3; do kubectl exec $name -- touch /tmp/file; done

Secrets

  • get decoded secret: kubectl get secrets/<secret-name> --template='{{.data.<target-key> | base64decode}}'

Debugging

Official Doc: https://kubernetes.io/docs/tasks/debug-application-cluster/

Pods

Nodes

  • create an interactive shell on a node: kubectl debug node/mynode -it --image=busybox

When creating a debugging session on a node, keep in mind that:

  • kubectl debug automatically generates the name of the new Pod based on the name of the Node.
  • The container runs in the host IPC, Network, and PID namespaces.
  • The root filesystem of the Node will be mounted at /host.

Autoscaling: Karpenter automatically launches just the right compute resources to handle your cluster's applications.

Containers

  • Check the container capabilities:
# open a shell inside the target container
kubectl exec -it pod -c container -- sh
# show the capability bitmasks of PID 1 (CapInh, CapPrm, CapEff, CapBnd, CapAmb)
grep Cap /proc/1/status
# decode a bitmask into capability names (example value; run capsh wherever it is installed)
capsh --decode=00000000a80425fb

More info on the Linux capabilities at: https://github.com/torvalds/linux/blob/master/include/uapi/linux/capability.h

Images

  • list all images in the target namespace: kubectl get pods -n namespace -o=jsonpath='{range .items[*]}{"\n"}{range .spec.containers[*]}{.image}{end}{end}' | sort

UI

  • Dashboard: port-forward the service and log in at localhost:12345: kubectl port-forward svc/kubernetes-dashboard -n kubernetes-dashboard 12345:80

    • to get the auth token (on Kubernetes >= 1.24, see also the note after this list):
    1. list the service accounts: kubectl get serviceaccounts
    2. get the target service account and note the name of its token secret: kubectl get serviceaccounts user -o yaml
    3. get the secret (name taken from the previous step): kubectl get secrets gian-token-ghh4l -o yaml
    4. decode the token field: echo -n <token> | base64 -d
  • Headlamp is an easy-to-use and extensible Kubernetes web UI: https://headlamp.dev/
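Note on the Dashboard token steps above: since Kubernetes 1.24, a token Secret is no longer created automatically for a service account. A minimal alternative, assuming the service account from step 2 is named user:

# request a short-lived token via the TokenRequest API instead of reading a Secret
kubectl create token user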

Werf

Helm

  • enable autocompletion in Linux (bash): helm completion bash > /etc/bash_completion.d/helm
  • search Hub for an ingress controller: helm search hub ingress
  • add NGINX to the repositories: helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
  • refresh the local repository index: helm repo update
  • download and unpack the chart: helm pull ingress-nginx/ingress-nginx --untar
  • install the chart (run from inside the untarred chart directory): helm install myingress .

ETCD

Note that the examples refer to etcd version 3.

  • List all the Kubernetes keys: ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key get / --prefix --keys-only

  • Delete a key (example /registry/your-key): ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key del /registry/your-key

Metrics

Schedule Pods on preferred nodes

  • prevent non-core pods from being scheduled on the core nodes -> taint the core nodes
  • make sure that core pods run on the core nodes -> nodeAffinity on the pods + a toleration for the taint
  • make sure that core pods can still be scheduled on non-core nodes if the core nodes fail -> use the preferredDuringSchedulingIgnoredDuringExecution form of nodeAffinity

The following script labels and taints a target node and writes a matching pod spec snippet (affinity + toleration) to pod-spec.yaml:
#!/bin/bash

set -o errexit
set -o nounset
set -o pipefail

if [[ $# -ne 3 ]]; then
  printf "error - wrong input parameters - expected: node-name label-key label-value \n"
  printf "parameters passed: %s\n" "$*"
  exit 1
fi

NODENAME=$1
LABELKEY=$2
LABELVALUE=$3

# script depends on kubectl - check for existence
cmd=kubectl
if ! which "${cmd}" >/dev/null; then
  echo "can't find ${cmd} in PATH, please fix and retry"
  exit 1
fi

# generate the template pod spec
cat <<EOF > pod-spec.yaml
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: ${LABELKEY}
            operator: In
            values:
            - ${LABELVALUE}
  tolerations:
  - key: "${LABELKEY}"
    operator: "Equal"
    value: "${LABELVALUE}"
    effect: "NoSchedule"
EOF

# add a label to the target node
kubectl label nodes "$NODENAME" "$LABELKEY"="$LABELVALUE"

# add a taint to the target node
kubectl taint node "$NODENAME" "$LABELKEY"="$LABELVALUE":NoSchedule

# add a taint to all the nodes that have the target label set
# kubectl taint node -l "$LABELKEY"="$LABELVALUE" "$LABELKEY"="$LABELVALUE":NoSchedule

To test:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: test-taint
  name: test-taint
spec:
  containers:
  - image: bash
    name: test-taint
    command: [ "sleep", "600" ]
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: role
            operator: In
            values:
            - core
  tolerations:
  - key: "role"
    operator: "Equal"
    value: "core"
    effect: "NoSchedule"
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
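After applying the manifest above, check where the pod was scheduled (the file name test-taint.yaml is only an assumed name for the saved manifest):

# apply the test pod and confirm it landed on a node labelled role=core
kubectl apply -f test-taint.yaml
kubectl get pod test-taint -o wide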

Resource Reservation

Pods

DRA

Dynamic Resource Allocation is an API for requesting and sharing resources between pods and containers inside a pod. Kubernetes v1.29 includes cluster-level API support for dynamic resource allocation, but it needs to be enabled explicitly.

The fundamental components of DRA are:

  • ResourceClass: Identifies the resource driver handling a particular kind of resource.
  • ResourceClaim: Specifies a particular resource instance needed by a workload.
  • ResourceClaimTemplate: Defines the specs for creating ResourceClaims.
  • PodSchedulingContext: Facilitates coordination in pod scheduling.

DRA with NVIDIA GPUs: https://github.com/NVIDIA/k8s-dra-driver
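A minimal sketch of how a workload could request a resource through DRA, assuming the v1alpha2 API that ships behind the feature gate in v1.29; the class name gpu.example.com and all other names are placeholders:

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim-template
spec:
  spec:
    resourceClassName: gpu.example.com   # must reference a ResourceClass installed by the driver
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-example
spec:
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: gpu-claim-template
  containers:
  - name: app
    image: bash
    command: ["sleep", "600"]
    resources:
      claims:
      - name: gpu   # consume the claim declared in spec.resourceClaims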

System

Kubernetes offers two variables that can be configured as part of the kubelet configuration file:

  • systemReserved
  • kubeReserved

When configured, these two variables tell the kubelet to reserve a certain amount of resources for system processes (kernel, sshd, etc.) and for Kubernetes node components (like the kubelet itself), respectively.

When these variables are set alongside a third argument that is enabled by default (--enforce-node-allocatable), the kubelet limits the amount of resources that pods on the node can consume (total capacity - kubeReserved - systemReserved), using the Linux cgroup feature.

This limit ensures that whenever the total amount of memory consumed by pods on a node grows above what is allowed, Linux itself starts to evict the pods that consume more resources than they requested. This way, important processes are guaranteed a minimum amount of resources.

To configure, edit the file /etc/kubernetes/kubelet-config.yaml and add the following:

kubeReserved:
  cpu: 100m
  memory: 1G
systemReserved:
  cpu: 100m
  memory: 1G
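To verify the effect, compare Capacity and Allocatable on the node: Allocatable should be roughly Capacity minus kubeReserved, systemReserved, and the hard eviction threshold (mynode is a placeholder):

# Allocatable = Capacity - kubeReserved - systemReserved - evictionHard
kubectl describe node mynode | grep -A 6 -E '^(Capacity|Allocatable)'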

Eviction

Another argument that can be passed to the kubelet is evictionHard, which specifies an absolute amount of memory that should always be available on the node. Setting it gives critical processes extra room to expand above their reserved resources when they need to, and prevents them from starving on the node.

If the amount of memory available on the node drops below the configured value, the kubelet starts to evict pods on the node.

This enforcement is performed by the kubelet itself and is therefore less reliable, but it lowers the chance of resource issues on the node and is recommended. To configure it, update the file /etc/kubernetes/kubelet-config.yaml with the following:

evictionHard:
  memory.available: "500Mi"
  # Default value for evictionHard on kubelet
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"

Note that specifying values for evictionHard overrides the kubelet's default thresholds, which are important, so keep the defaults listed above unless you have a reason to change them. For further reading, refer to reserve-compute-resources.
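A couple of quick checks to see whether evictions or resource pressure are occurring (the node name is a placeholder):

# pods evicted by the kubelet are left in phase Failed
kubectl get pods --all-namespaces --field-selector=status.phase=Failed
# node conditions flip to MemoryPressure / DiskPressure when thresholds are crossed
kubectl describe node mynode | grep -E 'MemoryPressure|DiskPressure'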

Logging

Kubernetes does not provide cluster-wide logging out of the box. A common approach is to use another CNCF project, Fluentd, typically deployed as a DaemonSet: it provides a unified logging layer for the cluster that filters, buffers, and routes messages.
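A minimal sketch of such a Fluentd DaemonSet that tails the node log directory; the image tag is a placeholder and the output (Elasticsearch, S3, ...) still needs to be configured:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch  # placeholder image/tag
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log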

Metrics

  • Resources -> USE method: captures Utilization, Saturation, and Errors (USE) for each of the resources your application uses.
  • Services -> RED method: captures the Rate/Requests, Errors, and Durations of requests that the service handles.
  • Four Golden Signals: Google suggests you measure four critical signals for every service (Latency, Traffic, Errors, and Saturation).

Observability

Prometheus

Prometheus is a popular open-source monitoring and alerting tool that can be used to scrape metrics from your applications running in a Kubernetes cluster. One way to configure Prometheus to scrape metrics from a new application is to use labels or annotations in your Kubernetes deployment.

Here is an example of how to configure a Prometheus scrape target for a new application using labels:

In your Kubernetes deployment file, add the following label to the pod template:

    labels:
      prometheus: "true"

In your Prometheus configuration file, add a new scrape configuration for the pods with the prometheus label:

    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_prometheus]
        action: keep
        regex: true

This configuration tells Prometheus to scrape metrics from all pods that have the label prometheus: "true". The relabel_configs section is used to filter the pods based on their labels.

You can also configure the Prometheus scrape target using annotations. The process is very similar.

In your Kubernetes deployment file, add the following annotation to the pod template:

    annotations:
      prometheus.io/scrape: "true"

In your Prometheus configuration file, add a new scrape configuration for the pods with the prometheus.io/scrape annotation:

    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

This configuration tells Prometheus to scrape metrics from all pods that have the annotation prometheus.io/scrape: "true".
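If the application exposes metrics on a non-default port or path, two additional relabel rules (a common pattern from the upstream Prometheus example configuration) honour the prometheus.io/port and prometheus.io/path annotations:

      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__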

Note that the configurations above are just examples; customize them to your needs, for instance with additional relabelling rules.

Alternatively, if you run the Prometheus Operator, target discovery is configured declaratively with ServiceMonitor and PodMonitor custom resources that select pods and services by label, instead of editing the scrape configuration by hand.
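For example, a minimal PodMonitor sketch that selects the pods labelled prometheus: "true"; the port name metrics is an assumption about how the container port is named:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      prometheus: "true"
  podMetricsEndpoints:
  - port: metrics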

Networking

Tim Hockin, one of the lead Kubernetes developers, has created a very useful slide deck for understanding Kubernetes networking.

Clusters Federation

  • KubeFed: Kubernetes Cluster Federation allows you to coordinate the configuration of multiple Kubernetes clusters from a single set of APIs in a hosting cluster.
  • multicluster-scheduler: Admiralty is a system of Kubernetes controllers that intelligently schedules workloads across clusters.
  • Cilium Multi-cluster: Cilium's multi-cluster implementation - ClusterMesh.
  • Virtual Kubelet: an open-source implementation that masquerades as a kubelet, allowing Kubernetes nodes to be backed by providers such as serverless cloud container platforms.

Stateful workloads

  • PostgreSQL Operator: StackGres is a stack of software components built on standard Postgres, with Patroni for high availability, connection pooling, automated backups, monitoring, centralized logging, and a fully-featured management web console.

Tools

AKS

Additional Resources

Articles

MutatingAdmissionWebhook

Presentations

Manifest Examples

Pod

apiVersion: v1
kind: Pod
metadata:
  name: counter
spec:
  containers:
  - name: count
    image: busybox:1.28
    args:
    - /bin/sh
    - -c
    - >
      i=0;
      while true;
      do
        echo "$i: $(date)" >> /var/log/1.log;
        echo "$(date) INFO $i" >> /var/log/2.log;
        i=$((i+1));
        sleep 1;
      done      
    volumeMounts:
    - name: varlog
      mountPath: /var/log
  volumes:
  - name: varlog
    emptyDir: {}