Kubernetes - romitagl/kgraph GitHub Wiki
- create a pod spec:

```bash
kubectl run podname --image=busybox -o yaml --dry-run=client --command -- sleep 4800 > pod.yaml
```
- get pods sorted by creation timestamp:

```bash
kubectl get pods --sort-by=.metadata.creationTimestamp
```

- get pods with "Running" status:

```bash
kubectl get pods --field-selector=status.phase=Running
```

- get pods with status "!Running":

```bash
kubectl get pods --field-selector=status.phase!=Running
```
- get pods with label env=dev or env=prod:

```bash
kubectl get pods -l 'env in (dev,prod)' --show-labels
```
- get pods sorted by node name:

```bash
kubectl get pods -o wide --sort-by=.spec.nodeName
```
- get the container images in a pod (podname):

```bash
kubectl get pod podname -o jsonpath='{.spec.containers[*].image}{"\n"}'
```
- DEBUGGING - get pods with status "!Succeeded":

```bash
kubectl get pods --field-selector=status.phase!=Succeeded
```

- force delete:

```bash
kubectl delete pod nginx-0 --grace-period=0 --force
```
- after force deleting a pod, it might stay in an Unknown state; patching its finalizers to null removes the entry from the API server:

```bash
kubectl patch pod nginx-0 -p '{"metadata":{"finalizers":null}}'
```
- check the resources section within the pod's spec.containers:

```bash
kubectl explain pod.spec.containers.resources
```
- update resources of a container in a deployment:

```bash
kubectl set resources deployment/nginx-deployment -c=nginx --limits=cpu=200m,memory=512Mi
```
- delete a stuck CRD:

```bash
kubectl patch crd/crd-name -p '{"metadata":{"finalizers":[]}}' --type=merge
```
- force deletion of a stuck namespace (in this case developer):

```bash
echo '{
  "apiVersion": "v1",
  "kind": "Namespace",
  "metadata": {
    "name": "developer"
  },
  "spec": {
    "finalizers": null
  }
}' > /tmp/namespace.json
# run kubectl proxy in a separate terminal (or in the background) so the API is reachable on 127.0.0.1:8001
kubectl proxy &
curl -k -H "Content-Type: application/json" -X PUT --data-binary @/tmp/namespace.json http://127.0.0.1:8001/api/v1/namespaces/developer/finalize
```
- to delete hanging resources in general (in this example a CRD):

```bash
kubectl get crd rayclusters.cluster.ray.io -o json > bad.json
# Make a PUT call removing any finalizer (requires kubectl proxy running on 127.0.0.1:8001)
cat bad.json | jq '. | setpath(["metadata","finalizers"]; [])' | curl -kD- -H "Content-Type: application/json" -X PUT --data-binary @- "127.0.0.1:8001$(cat bad.json | jq -r '.metadata.selfLink')"
```

Note: `.metadata.selfLink` is no longer populated on recent Kubernetes versions (it was removed in v1.24), so on newer clusters the resource URL has to be built manually (e.g. `/apis/<group>/<version>/<plural>/<name>`).
- touch a file in a list of pods:

```bash
for name in pod1 pod2 pod3; do kubectl exec $name -- touch /tmp/file; done
```
- get decoded secret:

```bash
kubectl get secrets/<secret-name> --template='{{.data.<target-key> | base64decode}}'
```

- Official Doc: https://kubernetes.io/docs/tasks/debug-application-cluster/
- debugging pods: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/#debugging-pods
- create a debug copy of a pod:

```bash
kubectl debug pod/myapp-pod -it --copy-to=myapp-debug --container=myapp-container --image=busybox
```

This command creates a copy of myapp-pod, replacing myapp-container with a busybox image for debugging purposes.

- ephemeral containers: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-running-pod/#ephemeral-container
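For a quicker look without copying the pod, an ephemeral debug container can be attached to the running pod directly (a minimal sketch; the pod and container names are reused from the example above):

```bash
kubectl debug -it pod/myapp-pod --image=busybox --target=myapp-container
```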
- configure process namespace sharing for a pod: https://kubernetes.io/docs/tasks/configure-pod-container/share-process-namespace
- create an interactive shell on a node:

```bash
kubectl debug node/mynode -it --image=busybox
```

When creating a debugging session on a node, keep in mind that:

- kubectl debug automatically generates the name of the new Pod based on the name of the Node.
- The container runs in the host IPC, Network, and PID namespaces.
- The root filesystem of the Node will be mounted at /host.
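For example, once the debug pod is running you can switch to the node's root filesystem, and the auto-generated pod should be removed when you are done (the generated pod name below is illustrative):

```bash
# inside the debug container: use the node's root filesystem
chroot /host
# back on your workstation: clean up the auto-generated debug pod
kubectl get pods -o name | grep node-debugger-mynode
kubectl delete pod node-debugger-mynode-abcde   # illustrative name
```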
- autoscaling: Karpenter automatically launches just the right compute resources to handle your cluster's applications.
- Check the container capabilities:

```bash
# open a shell in the target container
kubectl exec -it pod -c container -- sh
# inside the container: show the capability sets of PID 1
grep Cap /proc/1/status
# decode a capability bitmask (example value)
capsh --decode=00000000a80425fb
```

More info on the Linux capabilities at: https://github.com/torvalds/linux/blob/master/include/uapi/linux/capability.h
- list all images in the target namespace:

```bash
kubectl get pods -n namespace -o=jsonpath='{range .items[*]}{"\n"}{range .spec.containers[*]}{.image}{end}{end}' | sort
```
- Dashboard: login to localhost:12345:

```bash
kubectl port-forward svc/kubernetes-dashboard -n kubernetes-dashboard 12345:80
```
- to get the auth token:
  - list the service accounts:

```bash
kubectl get serviceaccounts
```

  - get the target service account and check the secret name:

```bash
kubectl get serviceaccounts user -o yaml
```

  - get the secret (the secret name below is an example):

```bash
kubectl get secrets gian-token-ghh4l -o yaml
```

  - decode the token field from the secret's data:

```bash
echo -n <base64-token-value> | base64 -d
```
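On clusters running Kubernetes v1.24 or newer, where token Secrets are no longer created automatically for service accounts, a short-lived token can be requested directly (service account name user as in the example above):

```bash
kubectl create token user
```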
- Headlamp is an easy-to-use and extensible Kubernetes web UI: https://headlamp.dev/
- CLI tool to implement full-cycle CI/CD to Kubernetes: https://werf.io
- enable autocompletion in Linux (bash):

```bash
helm completion bash > /etc/bash_completion.d/helm
```

- search the Hub for an ingress controller:

```bash
helm search hub ingress
```

- add NGINX to the repositories:

```bash
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
```

- pull the latest version:

```bash
helm repo update
```

- download the chart (helm fetch is an alias of helm pull in Helm 3):

```bash
helm fetch ingress-nginx/ingress-nginx --untar
```

- install the chart (from inside the untarred chart directory):

```bash
helm install myingress .
```
Note that the examples refer to etcd version 3.

- List all the Kubernetes keys:

```bash
ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key get / --prefix --keys-only
```

- Delete a key (example /registry/your-key):

```bash
ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key del /registry/your-key
```
- Official metrics Documentation: https://github.com/coreos/etcd/blob/v3.2.17/Documentation/metrics.md
- Getting Access to etcd Metrics: https://blog.freshtracks.io/a-deep-dive-into-kubernetes-metrics-part-5-etcd-metrics-6502693fa58
- prevent non-core pods from being scheduled on the core nodes -> Taint the core nodes
- make sure that core pods run on core nodes -> nodeAffinity on the pods + a Toleration for the Taint
- make sure that core pods can still be scheduled on non-core nodes if the core nodes fail -> nodeAffinity with preferredDuringSchedulingIgnoredDuringExecution
The following script labels and taints a node, and generates a matching affinity/toleration snippet for pod specs:

```bash
#!/bin/bash
set -o errexit
set -o nounset
set -o pipefail

if [[ $# -ne 3 ]]; then
  printf "error - wrong input parameters - expected: node-name label-key label-value \n"
  printf "parameters passed: %s\n" "$*"
  exit 1
fi

NODENAME=$1
LABELKEY=$2
LABELVALUE=$3

# script depends on kubectl - check for existence
cmd=kubectl
if ! which "${cmd}" >/dev/null; then
  echo "can't find ${cmd} in PATH, please fix and retry"
  exit 1
fi

# generate the template pod spec
cat <<EOF > pod-spec.yaml
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: ${LABELKEY}
            operator: In
            values:
            - ${LABELVALUE}
  tolerations:
  - key: "${LABELKEY}"
    operator: "Equal"
    value: "${LABELVALUE}"
    effect: "NoSchedule"
EOF

# add a label to the target node
kubectl label nodes "$NODENAME" "$LABELKEY"="$LABELVALUE"
# add a taint to the target node
kubectl taint node "$NODENAME" "$LABELKEY"="$LABELVALUE":NoSchedule
# add a taint to all the nodes that have the target label set
# kubectl taint node -l "$LABELKEY"="$LABELVALUE" "$LABELKEY"="$LABELVALUE":NoSchedule
```
To test:

```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: test-taint
  name: test-taint
spec:
  containers:
  - image: bash
    name: test-taint
    command: [ "sleep", "600" ]
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: role
            operator: In
            values:
            - core
  tolerations:
  - key: "role"
    operator: "Equal"
    value: "core"
    effect: "NoSchedule"
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
```
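A quick way to verify the scheduling, assuming the script above was run with a role=core label/taint and the manifest is saved as test-taint.yaml (the file name is an assumption):

```bash
kubectl apply -f test-taint.yaml
kubectl get pod test-taint -o wide   # the NODE column shows whether the pod landed on a core node
```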
Dynamic Resource Allocation is an API for requesting and sharing resources between pods and containers inside a pod. Kubernetes v1.29 includes cluster-level API support for dynamic resource allocation, but it needs to be enabled explicitly.
The fundamental components of DRA are:
- ResourceClass: Identifies the resource driver handling a particular kind of resource.
- ResourceClaim: Specifies a particular resource instance needed by a workload.
- ResourceClaimTemplate: Defines the specs for creating ResourceClaims.
- PodSchedulingContext: Facilitates coordination in pod scheduling.
DRA with NVIDIA GPUs: https://github.com/NVIDIA/k8s-dra-driver
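A minimal sketch of how these objects fit together, assuming the resource.k8s.io/v1alpha2 API shipped with Kubernetes v1.29 and a hypothetical ResourceClass named gpu.example.com published by a DRA driver:

```yaml
# Hypothetical example - the resourceClassName must match a ResourceClass
# created by the DRA driver installed in the cluster.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim-template
spec:
  spec:
    resourceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-example
spec:
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: gpu-claim-template
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    resources:
      claims:
      - name: gpu
```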
Kubernetes offers two variables that can be configured as part of the kubelet configuration file:
- systemReserved
- kubeReserved
When configured, these two variables "tell" the kubelet to reserve a certain amount of resources for system processes (kernel, sshd, etc.) and for Kubernetes node components (like the kubelet itself) respectively.
When these variables are configured alongside a third argument that is enabled by default (--enforce-node-allocatable), the kubelet limits the amount of resources that can be consumed by pods on the node (Total Amount - kubeReserved - systemReserved), based on the Linux cgroup feature.
This limitation ensures that in any situation where the total amount of memory consumed by pods on a node grows above the allowed limit, Linux itself will start to evict pods that consume more resources than requested. This way, important processes are guaranteed a minimum amount of resources.
To configure, edit the file /etc/kubernetes/kubelet-config.yaml and add the following:

```yaml
kubeReserved:
  cpu: 100m
  memory: 1G
systemReserved:
  cpu: 100m
  memory: 1G
```
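To check the effect, compare the node's Capacity with its Allocatable (Allocatable is roughly Capacity minus kubeReserved, systemReserved and the eviction thresholds); the node name is a placeholder:

```bash
kubectl describe node <node-name> | grep -A 6 -e "^Capacity" -e "^Allocatable"
```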
Another argument that can be passed to the kubelet is evictionHard, which specifies an absolute amount of memory that should always be available on the node. Setting this argument gives critical processes extra room to expand above their reserved resources if they need to, and prevents starvation of those processes on the node.
If the amount of memory available on the node drops below the configured value, the kubelet will start to evict pods on the node.
This enforcement is performed by the kubelet itself, and is therefore less reliable, but it lowers the chance of resource issues on the node and is therefore recommended. To configure, update the file /etc/kubernetes/kubelet-config.yaml with the following:
```yaml
evictionHard:
  memory.available: "500Mi"
  # Default values for evictionHard on the kubelet
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
```
Please note that specifying values for evictionHard overrides the kubelet defaults, which are important to keep. For further reading please refer to reserve-compute-resources (https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/).
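To confirm which values the kubelet is actually running with, its live configuration can be read back through the API server's node proxy (node name is a placeholder; requires jq):

```bash
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | jq '.kubeletconfig | {kubeReserved, systemReserved, evictionHard}'
```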
Kubernetes does not have cluster-wide logging yet. Instead, another CNCF project is used, called Fluentd. When implemented, it provides a unified logging layer for the cluster, which filters, buffers, and routes messages.
- Kubernetes Deployment: https://docs.fluentd.org/container-deployment/kubernetes
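A minimal sketch of the usual deployment pattern: a DaemonSet that mounts the node's log directory. The image tag and any output (e.g. Elasticsearch) configuration are assumptions, so prefer the manifests from the linked documentation:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      tolerations:
      - key: node-role.kubernetes.io/control-plane
        effect: NoSchedule
      containers:
      - name: fluentd
        # image tag is an assumption - see the linked deployment docs
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
```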
- Resources -> USE method: captures Utilization, Saturation, and Errors (USE) for each of the resources your application uses.
- Services -> RED method: captures the Rate/Requests, Errors, and Durations of requests that the service handles.
- Four Golden Signals: Google suggests you measure four critical signals for every service (Latency, Traffic, Errors, and Saturation).
- Kubernetes observability: https://www.cncf.io/blog/2020/11/11/the-top-kubernetes-apis-for-cloud-native-observability-part-1-the-kubernetes-metrics-service-container-apis/
Prometheus is a popular open-source monitoring and alerting tool that can be used to scrape metrics from your applications running in a Kubernetes cluster. One way to configure Prometheus to scrape metrics from a new application is to use labels or annotations in your Kubernetes deployment.
Here is an example of how to configure a Prometheus scrape target for a new application using labels:
In your Kubernetes deployment file, add the following label to the pod template:
```yaml
labels:
  prometheus: "true"
```
In your Prometheus configuration file, add a new scrape configuration for the pods with the prometheus label:
```yaml
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_prometheus]
      action: keep
      regex: true
```
This configuration tells Prometheus to scrape metrics from all pods that have the label prometheus: "true". The relabel_configs section is used to filter the pods based on their labels.
You can also configure a Prometheus scrape target using annotations; the process is very similar.
In your Kubernetes deployment file, add the following annotation to the pod template:
```yaml
annotations:
  prometheus.io/scrape: "true"
```
In your Prometheus configuration file, add a new scrape configuration for the pods with the prometheus.io/scrape annotation:
```yaml
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
```
This configuration tells Prometheus to scrape metrics from all pods that have the annotation prometheus.io/scrape: "true".
Note that the above configurations are just examples; you can customize them to your needs with additional relabelling rules.
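Other relabel rules commonly paired with the annotation-based job above (taken from the canonical kubernetes-pods example) let each pod choose its metrics path and port via annotations:

```yaml
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
```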
A plain Prometheus server with kubernetes_sd_configs performs this pod discovery by itself (it needs RBAC permissions to list pods). If you run the Prometheus Operator instead, scrape targets are typically configured through ServiceMonitor/PodMonitor resources rather than raw scrape configs.
Tim Hockin, one of the lead Kubernetes developers, has created a very useful slide deck for understanding Kubernetes networking.
- KubeFed: Kubernetes Cluster Federation allows you to coordinate the configuration of multiple Kubernetes clusters from a single set of APIs in a hosting cluster.
- multicluster-scheduler: Admiralty is a system of Kubernetes controllers that intelligently schedules workloads across clusters.
- Cilium Multi-cluster: Cilium's multi-cluster implementation - ClusterMesh.
- Virtual Kubelet: an open-source implementation that masquerades as a kubelet, allowing Kubernetes nodes to be backed by Virtual Kubelet providers such as serverless cloud container platforms.
- PostgreSQL Operator: StackGres is a stack of software components built on standard Postgres, with Patroni for high availability, connection pooling, automated backups, monitoring, centralized logging, and a fully-featured management web console.
- node-problem-detector: https://github.com/kubernetes/node-problem-detector
- Descheduler for Kubernetes. Descheduler, based on its policy, finds pods that can be moved and evicts them: https://github.com/kubernetes-sigs/descheduler
- Kubernetes performance and scale test orchestration framework: https://github.com/kube-burner/kube-burner
- Quick troubleshooting for your Azure Kubernetes Service (AKS) cluster: https://github.com/Azure/aks-periscope
- Kubernetes API Conventions: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md
- SDK for rapidly building and publishing Kubernetes APIs in Go: https://kubebuilder.io/
- collection of tools to discover, validate and evaluate your kubernetes storage options: https://kubestr.io/
- open source toolkit to manage Kubernetes Operators, in an effective, automated, and scalable way: https://github.com/operator-framework
- Kubernetes WithOut Kubelet - Simulates thousands of Nodes and Clusters: https://github.com/kubernetes-sigs/kwok
- The simulator for the Kubernetes scheduler: https://github.com/kubernetes-sigs/kube-scheduler-simulator
- JobSet, a k8s native API for distributed ML training and HPC workloads: https://github.com/kubernetes-sigs/jobset
- Scaling Kubernetes to 7,500 Nodes: https://openai.com/blog/scaling-kubernetes-to-7500-nodes/
- Kubernetes Failure Stories: https://k8s.af/
- Why did we transition from Gatekeeper to Kyverno: https://medium.com/adevinta-tech-blog/why-did-we-transition-from-gatekeeper-to-kyverno-for-kubernetes-policy-management-42bc2c4523d0
- MutatingAdmissionWebhook that injects a nginx sidecar container into pod: https://github.com/morvencao/kube-sidecar-injector
- Tim Hockin: https://speakerdeck.com/thockin
Example pod that writes two log streams to files in an emptyDir volume (useful for testing log collection):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: counter
spec:
  containers:
  - name: count
    image: busybox:1.28
    args:
    - /bin/sh
    - -c
    - >
      i=0;
      while true;
      do
        echo "$i: $(date)" >> /var/log/1.log;
        echo "$(date) INFO $i" >> /var/log/2.log;
        i=$((i+1));
        sleep 1;
      done
    volumeMounts:
    - name: varlog
      mountPath: /var/log
  volumes:
  - name: varlog
    emptyDir: {}
```
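To check that both log streams are being written (the file name counter-pod.yaml is an assumption; the paths come from the manifest above):

```bash
kubectl apply -f counter-pod.yaml
kubectl exec counter -c count -- tail -n 3 /var/log/1.log /var/log/2.log
```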