Kubernetes Architecture Documentation
Introduction
Purpose
Vocabulary
- CIDR - Classless Inter-Domain Routing; notation for IP address ranges, e.g. 10.240.0.0/24.
- CNI - Container Network Interface. Allows you to bring your own network implementation.
- CRD - Custom Resource Definition - TODO investigate
- CSI - Container Storage Interface.
- mTLS - mutual TLS; both sides of the connection are authenticated.
- service mesh - a dedicated infrastructure layer for service-to-service traffic (e.g. Istio, Linkerd).
- SPIFFE - Secure Production Identity Framework for Everyone.
References
Overview
- Control plane
- API server
- ETCD
- controller manager (kube-controller-manager)
- cloud-controller-manager
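These components are easiest to see on a kubeadm-based cluster, where they run as pods in kube-system; a quick check, assuming such a cluster:

```bash
# List the control-plane components; on kubeadm clusters they run as static
# pods in kube-system (pod names are suffixed with the node name).
kubectl get pods -n kube-system
# Expect kube-apiserver-*, etcd-*, kube-controller-manager-*, kube-scheduler-*,
# plus coredns and kube-proxy.
```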
Cluster overview
Networking
- Networking & Kubernetes, James Strong and Vallery Lancey
- Kevin Sookocheff A guide to the kubernetes networking model
- The kubernetes network guide
Networking overview
```mermaid
graph TB
  subgraph node_a [Node]
    subgraph pod_a_1[POD]
      pause_a_1[pause]
      container_a_1_1[Container]
      container_a_1_2[Container]
      ceth_a[IP/veth - ceth0]
    end
    subgraph pod_a_2[POD]
      pause_a_2[pause]
      container_a_2_1[Container]
      container_a_2_2[Container]
      ceth_b[IP/veth - ceth1]
    end
    cni_1[CNI]
    routes_1[Routes]
    forward_1[Forward rules]
    kube_proxy_1[kube-proxy]
    subgraph ethernet_a[ethernet]
      eth0 --- cbr0
      cbr0 --- veth_a[veth0]
      cbr0 --- veth_b[veth1]
    end
    veth_a --- ceth_a
    veth_b --- ceth_b
  end
  cni_1 ---|allocate| pod_a_1
  cni_1 ---|manages| routes_1
  kube_proxy_1 ---|manages| forward_1
  subgraph node_b [Node]
    subgraph pod_b_1[POD]
      pause_b_1[pause]
      container_b_1_1[Container]
      container_b_1_2[Container]
      ceth_c[IP/veth]
    end
    subgraph pod_b_2[POD]
      pause_b_2[pause]
      container_b_2_1[Container]
      container_b_2_2[Container]
      ceth_d[IP/veth]
    end
    cni_b[CNI]
    routes_b[Routes]
    forward_b[Forward rules]
    kube_proxy_b[kube-proxy]
    ethernet_b[ethernet]
  end
  cni_b ---|allocate| pod_b_1
  cni_b ---|manages| routes_b
  kube_proxy_b ---|manages| forward_b
  subgraph network[Network]
  end
  network --- ethernet_a
  network --- ethernet_b
```
- CNI
  - allocates
    - interfaces (in pods?)
    - IP addresses
- Inter-container communication is over localhost.
- Pod-to-pod communication: all pods can communicate with other pods via their IP addresses.
- Pod-to-service communication: this is covered by Services.
- External-to-service communication: this is covered by Services.
- Each node has its own subnet from which to allocate pod IP addresses.
  - `kubectl get configmap -n kube-system kubeadm-config -o yaml | grep podSubnet`
- TODO `kubectl get no -o=custom-columns=NAME:.metadata.name,CIDR:.spec.podCIDR,ExternalIP:.status.addresses[0].address`
Each node
- Linux network stack - for simplicity aka rootns (root namespace)
  - networking interface
  - routing
  - iptables
  - conntrack - connection tracking
    - tracks the connections inside the kernel.
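A quick way to poke at each of these pieces on a node; just a sketch, assuming root access on a Linux node with iproute2, iptables and conntrack-tools installed:

```bash
# Run on a node, not in a pod.
ip addr                                   # interfaces in the root namespace
ip route                                  # routing table, including pod-subnet routes
sudo iptables-save -t nat | head -n 20    # NAT rules (kube-proxy lives here in iptables mode)
sudo conntrack -L | head -n 5             # connection tracking entries in the kernel
```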
Network in pods
- The pause container is always there.
  - It is there so the pod does not disappear.
  - It is there so you can give the pod an IP address (and it doesn't go away).
- podns - pod (network) namespace
  - Inside the pod namespace there is a complete copy of the simplified Linux network stack:
    - networking interface
    - routing
    - iptables
    - conntrack - connection tracking
  - Each container connects to the podns.
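A small experiment that makes the separate pod namespace visible; busybox is only used here because it ships the `ip` applet:

```bash
# Start a throwaway pod and list its own network stack.
kubectl run podns-test --image=busybox:1.36 --rm -it --restart=Never -- sh -c 'ip addr; ip route'
# Only lo and eth0 (with the pod IP) show up: the podns is a separate,
# simplified copy of the node's network stack.
```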
Services
Understanding Kubernetes Networking in 30 Minutes - Ricardo Katz & James Strong
- Services - give you a single IP address in front of a set of pods
  - ClusterIP
  - NodePort
  - ExternalName
  - LoadBalancer
  - Headless
- `kubectl get configmap -n kube-system kubeadm-config -o yaml | grep -i servicesubnet`
  - TODO is this the range of IP addresses that Services can be given?
  - IP .1 will always be the API service.
  - IP .10 will always be the DNS (Understanding Kubernetes Networking in 30 Minutes).
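The .1 / .10 convention is easy to verify; a sketch assuming the default service subnet (10.96.0.0/12):

```bash
kubectl get svc kubernetes -n default      # CLUSTER-IP is typically 10.96.0.1 (the API server)
kubectl get svc kube-dns -n kube-system    # CLUSTER-IP is typically 10.96.0.10 (CoreDNS)
```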
- TODO what does this do on a node? `nft -s list map kube-proxy service-ips`
Service - ClusterIP
The command behind it:

```bash
iptables \
  --table nat \
  --append APP-SVC-HTTP \
  --destination 172.21.2.25 \
  --protocol tcp \
  --match tcp \
  --dport 8080 \
  --jump DNAT \
  --to-destination 10.0.0.11:8080
```
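The rule above is a simplified illustration. On a real node running kube-proxy in iptables mode you can grep for the rules generated for a Service; `my-service` is a placeholder name:

```bash
# Look up the Service's ClusterIP, then find the matching NAT rules on a node.
CLUSTER_IP=$(kubectl get svc my-service -o jsonpath='{.spec.clusterIP}')
sudo iptables-save -t nat | grep "$CLUSTER_IP"    # run this part on a node
```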
Service - NodePort
- Allocates a port (on all nodes?)
- The same port on any node will reach the Service (Understanding Kubernetes Networking in 30 Minutes)
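A minimal NodePort walk-through, assuming the nginx Deployment created in the DNS section below already exists:

```bash
kubectl expose deployment nginx --type=NodePort --port=80
kubectl get svc nginx          # note the allocated node port, e.g. 80:31234/TCP
# The same port answers on every node's IP (substitute a real node IP and port):
curl http://<node-ip>:31234/
```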
DNS
- kubectl create deployment nginx --image=nginx
- kubectl get pods
```
> kubectl exec -it nginx-bf5d5cf98-qjbdx -- cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```
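A quick DNS lookup from inside the cluster; busybox is used only because it ships nslookup:

```bash
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- nslookup kubernetes.default
# The search domains expand kubernetes.default to
# kubernetes.default.svc.cluster.local, answered by CoreDNS at 10.96.0.10.
```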
Network policies - controlling which traffic is allowed to and from your pods
- Network policy - another component that creates "firewall rules" on your nodes to control the traffic.
- NetworkPolicies are divided into ingress and egress rules.
  - Once you declare a direction of traffic, only the traffic explicitly allowed in that direction will connect (LFS260, ch6).
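A minimal NetworkPolicy sketch; the names and labels are made up, and enforcement only happens if your CNI plugin supports NetworkPolicy (Calico, Cilium, etc.):

```bash
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-web
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: web              # pods the policy applies to
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend   # only these pods may connect
      ports:
        - protocol: TCP
          port: 80
EOF
```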
Ingress controller
- Manages pods that handle more complex traffic ingress into the cluster (Understanding Kubernetes Networking in 30 Minutes)
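For reference, a minimal Ingress resource sketch (hypothetical host and Service names); it only takes effect if an ingress controller such as one of those listed in the implementation apps section is installed:

```bash
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  rules:
    - host: web.example.com        # hypothetical hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web          # hypothetical backend Service
                port:
                  number: 80
EOF
```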
Kube-proxy - maintains all the networking rules
- Understanding Kubernetes Networking in 30 Minutes - Ricardo Katz & James Strong
- When a Service changes, kube-proxy is responsible for changing the network rules to accommodate the change.
  - using iptables or nftables, depending on your configuration.
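To see which mode your kube-proxy is using, assuming a kubeadm-based cluster where its configuration lives in a ConfigMap:

```bash
kubectl get configmap -n kube-system kube-proxy -o yaml | grep -E '^\s*mode:'
# An empty value means the default (iptables); other values include ipvs and nftables.
```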
CNI
CNI configuration
- Kubernetes Networking: How to Write a CNI Plugin From Scratch - Eran Yanay, Twistlock
- /etc/cni/net.d/10-my-cni-demo.conf - configuration of the CNI (an example follows after this list)
  - cniVersion - version of the CNI spec the configuration conforms to
  - name - name of the network
  - type - name of the plugin binary to look for in /opt/cni/bin
  - podcidr - pod subnet for this node (a key read by this demo plugin, not part of the CNI spec)
- There is a network space for the host nodes
  - e.g. 10.10.10.10/24
- There is a network space for the pods
  - each node gets a subnet of the pod network space, e.g.:
    - node1: 10.240.0.0/24
    - node2: 10.240.1.0/24
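A sketch of what the configuration file named above might contain for the from-scratch plugin; "podcidr" is not a standard CNI key, it is read by that demo plugin, and all values are examples:

```bash
cat <<'EOF' | sudo tee /etc/cni/net.d/10-my-cni-demo.conf
{
  "cniVersion": "0.3.1",
  "name": "my-cni-demo",
  "type": "my-cni-demo",
  "podcidr": "10.240.0.0/24"
}
EOF
```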
What happens during add
- enable forwarding - to allow routing of the pod packets?
- enable masquerading - to allow pod traffic to leave the node for the internet
- create the bridge
  - at the host level?
- create the veth pair between the pod and the bridge
- allow the network traffic with iptables
- ip route add 10.240.1.0/24 via 10.10.10.11 dev enp0s9
  - 10.10.10.11 - node IP (of the node hosting 10.240.1.0/24)
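Roughly the same steps as plain shell, using the example addresses from this section (bridge cbr0, uplink enp0s9); a sketch only, run as root on a node. A real CNI plugin does this per pod and moves the pod end of the veth pair into the pod's network namespace:

```bash
sysctl -w net.ipv4.ip_forward=1                                           # enable forwarding
iptables -t nat -A POSTROUTING -s 10.240.0.0/24 ! -o cbr0 -j MASQUERADE   # masquerade pod traffic leaving the node
ip link add cbr0 type bridge && ip link set cbr0 up                       # create the bridge on the host
ip addr add 10.240.0.1/24 dev cbr0                                        # bridge owns the gateway address of the pod subnet
ip link add veth0 type veth peer name ceth0                               # veth pair: host end / pod end
ip link set veth0 master cbr0 && ip link set veth0 up                     # attach the host end to the bridge
# ceth0 would now be moved into the pod's netns and given an address from 10.240.0.0/24
iptables -A FORWARD -s 10.240.0.0/24 -j ACCEPT                            # allow pod traffic through the FORWARD chain
iptables -A FORWARD -d 10.240.0.0/24 -j ACCEPT
ip route add 10.240.1.0/24 via 10.10.10.11 dev enp0s9                     # reach pods hosted on the other node
```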
Module - Ingress controller
Implementation apps
- Envoy Proxy
- NGINX
- Traefik
- Ambassador
DNS lookup from a container
- TODO document how the cluster actually boots up, who starts first, who talks to whom etc.
Endpoints
- TODO Endpoint controller
  - Who owns the endpoint controller?
  - Where does the endpoint controller look up the label selector match on pods?
  - Where do I document how a pod is started?
Usecase
- Kubernetes Endpoints Explained: How Pod IPs Are Tracked Behind Services
- Pod is created
- Service is created with a label selector
- Endpoint Controller watches for matching pods
  - There is a one-to-one connection between a Service and an Endpoints object.
  - There is a one-to-many connection between a Service and its EndpointSlices.
  - Through the event stream from the API server?
  - TODO who starts the Endpoint controller?
    - Is the endpoint controller part of kubelet?
    - According to chatgpt it runs inside kube-controller-manager and:
      - watches Services
      - watches Pods
      - creates and keeps updated the Endpoints objects (or EndpointSlices) for each Service
  - TODO who queries the EndpointSlices for the IP addresses of the pods?
    - Also from chatgpt:
      - kube-proxy
        - reads Endpoints / EndpointSlices
        - programs iptables / IPVS rules
        - kube-proxy does not poll EndpointSlices; it watches them via the Kubernetes API and reacts incrementally.
        - TODO it seems like kube-proxy receives the event stream from the API server.
      - eBPF dataplanes (Calico, Cilium)
        - use EndpointSlices to build service load-balancing
      - Ingress controllers
      - Service meshes (Istio, Linkerd)
- The Endpoint controller creates an endpoint object with the Pod's IP in the relevant EndpointSlice
  - how is the EndpointSlice chosen?
- kubectl get endpoints my-service
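The matching EndpointSlices carry the same information and can be inspected directly (`my-service` is a placeholder):

```bash
kubectl get endpoints my-service
kubectl get endpointslices -l kubernetes.io/service-name=my-service
kubectl describe endpointslice -l kubernetes.io/service-name=my-service   # pod IPs, ports, readiness
```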
kube-proxy
- kube-proxy does not poll EndpointSlices. It watches them via the Kubernetes API and reacts incrementally.
- kube-proxy sets up watches
  - On startup, kube-proxy establishes shared informers for:
    - Services
    - EndpointSlices (preferred)
    - Endpoints (legacy / fallback)
    - Nodes (for NodePort, health checks)
  - These informers:
    - do an initial LIST
    - then open a long-lived WATCH connection to the API server
    - so kube-proxy is continuously streaming changes, not querying repeatedly.
  - The kube-proxy informers run inside the kube-proxy process, watching:
    - Services
    - EndpointSlices
    - Nodes
  - Every informer:
    - opens a WATCH HTTP connection to the API server
    - gets streamed events
    - reconnects automatically on failure
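The same LIST + WATCH pattern can be driven by hand with kubectl, which is a handy way to see the event stream kube-proxy reacts to:

```bash
# Initial list, then a long-lived watch that prints events as pods come and go (Ctrl-C to stop).
kubectl get endpointslices -A --watch --output-watch-events
# Or hit the API the informers use directly:
kubectl get --raw '/apis/discovery.k8s.io/v1/endpointslices?watch=true'
```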
- EndpointSlice change occurs
  - Example events:
    - A Pod becomes Ready
    - A Pod terminates
    - A Service selector changes
    - A Pod is rescheduled to a new node
  - The endpoint controller updates the relevant EndpointSlice(s).
  - This produces watch events like ADDED, MODIFIED, DELETED.
- Informer updates kube-proxy's local cache
  - The informer:
    - deserializes the event
    - updates kube-proxy's in-memory cache
    - triggers event handlers
  - At this point kube-proxy has an up-to-date view of:
    - which backends exist
    - their IPs, ports, readiness, and topology info
- kube-proxy marks Services as "dirty"
  - kube-proxy doesn't immediately rewrite rules for every tiny change. Instead it:
    - marks the affected Service(s) as needing sync
    - coalesces many updates together
    - uses a rate-limited sync loop
  - This avoids thrashing iptables when many Pods churn at once.
- Sync loop recalculates desired state
  - During a sync, kube-proxy builds the desired Service → backend mapping and applies:
    - session affinity
    - externalTrafficPolicy
    - topology hints
    - health check rules
  - Then it diffs the current kernel state against the desired state.
- Programs the dataplane
  - Depending on mode:
    - iptables mode
      - writes chains like KUBE-SVC-* and KUBE-SEP-*
      - uses probabilistic rules for load balancing
      - performs atomic updates via iptables-restore
    - IPVS mode
      - programs kernel IPVS tables
      - creates virtual services and real servers
      - more scalable for large clusters
  - Either way, this step is idempotent: kube-proxy can safely reapply rules.
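Both dataplanes can be inspected on a node; a sketch, assuming iptables mode (or ipvsadm installed for IPVS mode):

```bash
sudo iptables-save -t nat | grep -E 'KUBE-(SVC|SEP)' | head -n 20   # per-Service and per-backend chains
sudo ipvsadm -Ln                                                    # IPVS mode: virtual services and real servers
```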
How kube-proxy stays consistent
- kube-proxy relies on:
  - ResourceVersion from the API server
  - automatic watch reconnects
  - full resyncs on failure or desync
  - periodic housekeeping syncs
- If it misses events:
  - the watch re-lists
  - kube-proxy rebuilds state from scratch
- That's why Services usually recover even after control-plane hiccups.
Performance and scale notes
- EndpointSlices dramatically reduce churn vs Endpoints
- kube-proxy only watches slices for Services it cares about
- Large clusters almost always use IPVS mode
- Frequent Pod churn means more syncs, but they are still batched
DNS
- /etc/resolv.conf is injected when the pod is created.
- Inside a pod it typically looks like:

```
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```

- 10.96.0.10 = CoreDNS Service ClusterIP
- search domains enable short names
- ndots:5 controls when search domains are applied
For a lookup like my-svc, the resolver tries (in order):
- my-svc.default.svc.cluster.local
- my-svc.svc.cluster.local
- my-svc.cluster.local
- (possibly) my-svc.
Each attempt generates a DNS query.
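The expansion can be observed from a throwaway pod; `my-svc` is a hypothetical Service name, and the trailing dot in the second lookup makes the name fully qualified so no search domains are tried:

```bash
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- sh -c \
  'nslookup my-svc; nslookup my-svc.default.svc.cluster.local.'
```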
- The DNS packet leaves the Pod
  - UDP (usually) to port 53
  - Destination IP: CoreDNS Service ClusterIP
  - Source IP: Pod IP
  - At this point, DNS is just normal network traffic.
- kube-proxy (or eBPF) routes the packet
  - Because the destination is a Service IP, kube-proxy (iptables / IPVS) or an eBPF dataplane (Calico, Cilium) intercepts the packet and load-balances it to one of the CoreDNS Pod IPs.
  - No DNS logic here, just Service routing.
- CoreDNS receives the request on port 53 UDP/TCP
  - Uses its configured plugin chain (from the Corefile):
    - kubernetes
      - "Is this name inside a zone I manage?"
        - Yes: looks in its in-memory cache
          - backed by informers watching:
            - Services
            - EndpointSlices
            - Namespaces
          - no API call per query.
    - forward
      - Forwarded to:
        - the node's /etc/resolv.conf, or
        - explicit upstream resolvers
      - Result is cached and returned
    - cache
- Response goes back to the Pod
  - CoreDNS sends a DNS response
  - Packet travels back directly to the Pod IP
  - No kube-proxy involvement on the return path
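The plugin chain itself comes from the Corefile; on kubeadm-based clusters it lives in the coredns ConfigMap and typically lists the kubernetes, forward and cache plugins:

```bash
kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'
```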