
Kubernetes Architecture Documentation

Introduction

Purpose

Vocabulary

  • CIDR - Classless Inter-Domain Routing; prefix notation for IP address ranges, e.g. 10.0.0.0/24.
  • CNI - Container Network Interface. Allows you to bring your own network implementation.
  • CRD - Custom Resource Definition; extends the Kubernetes API with custom resource types. TODO investigate
  • CSI - Container Storage Interface.
  • mTLS - mutual TLS; both sides of the connection are authenticated.
  • service mesh - a dedicated infrastructure layer for service-to-service traffic, e.g. Istio or Linkerd.
  • SPIFFE - Secure Production Identity Framework for Everyone.

References

Overview

  • Control plane
    • API server
    • ETCD
    • kube-controller-manager
    • cloud-controller-manager
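
On a kubeadm cluster the control-plane components run as pods in the kube-system namespace and can be listed as below (component names vary by setup):

kubectl get pods -n kube-system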

Cluster overview

Networking

  • Networking and Kubernetes, James Strong and Vallery Lancey
  • Kevin Sookocheff, A guide to the Kubernetes networking model
  • The Kubernetes network guide

Networking overview

graph TB
  subgraph node_a [Node]
    subgraph pod_a_1[POD]
      pause_a_1[pause]
      container_a_1_1[Container]
      container_a_1_2[Container]
      ceth_a[IP/veth - ceth0]
    end
    subgraph pod_a_2[POD]
      pause_a_2[pause]
      container_a_2_1[Container]
      container_a_2_2[Container]
      ceth_b[IP/veth - ceth1]
    end
    cni_1[CNI]
    routes_1[Routes]
    forward_1[Forward rules]
    kube_proxy_1[kube-proxy]
    subgraph ethernet_a[ethernet]
      eth0 --- cbr0
      cbr0 --- veth_a[veth0]
      cbr0 --- veth_b[veth1]
    end
    veth_a --- ceth_a
    veth_b --- ceth_b
  end
  cni_1 ---|allocates| pod_a_1
  cni_1 ---|manages| routes_1
  kube_proxy_1 ---|manages| forward_1

  subgraph node_b [Node]
    subgraph pod_b_1[POD]
      pause_b_1[pause]
      container_b_1_1[Container]
      container_b_1_2[Container]
      ceth_c[IP/veth]
    end
    subgraph pod_b_2[POD]
      pause_b_2[pause]
      container_b_2_1[Container]
      container_b_2_2[Container]
      ceth_d[IP/veth]
    end
    cni_b[CNI]
    routes_b[Routes]
    forward_b[Forward rules]
    kube_proxy_b[kube-proxy]
    ethernet_b[ethernet]
  end
  cni_b ---|allocates| pod_b_1
  cni_b ---|manages| routes_b
  kube_proxy_b ---|manages| forward_b

  subgraph network[Network]
  end
  network --- ethernet_a
  network --- ethernet_b

  • CNI

    • allocates
      • Interfaces (in pods?)
      • IP addresses
  • Inter-container communication within a pod is over localhost.

  • Pod-to-pod communication: all pods can communicate with other pods via their IP addresses.

  • Pod-to-service communication: this is covered by Services.

  • External-to-service communication: this is covered by Services.

  • Each node has its own subnet from which pod IP addresses are allocated.

  • kubectl get configmap -n kube-system kubeadm-config -o yaml | grep podSubnet

  • TODO kubectl get no -o=custom-columns=NAME:.metadata.name,CIDR:.spec.podCIDR,ExternalIP:.status.addresses[0].address

Each node

  • Linux network stack - referred to here as rootns (the root network namespace) for simplicity
    • networking interface
    • routing
    • iptables
    • conntrack - connection tracking
      • tracks the connections inside the kernel.
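
A minimal sketch of inspecting this stack on a node (requires root; the conntrack CLI may need to be installed separately):

# interfaces in the root namespace
ip addr
# the node's routing table
ip route
# NAT rules programmed by kube-proxy and the CNI
iptables -t nat -L -n
# connections currently tracked by the kernel
conntrack -L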

network in pods

  • The pause container is always there.
    • It holds the pod's network namespace open so the pod does not disappear.
    • It keeps the pod's IP address in place even when application containers restart.
  • podns - Pod NameSpace
    • Inside the pod namespace there is a complete copy of the simplified Linux network stack:
      • networking interface
      • routing
      • iptables
      • conntrack - connection tracking
  • each container connects to (joins) the podns
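
A sketch of observing this from inside a pod, assuming the image ships the ip tool (the pod name is illustrative; inside the podns the pod-side veth end is usually renamed eth0):

# the pod-side interface and the IP the CNI allocated
kubectl exec -it my-pod -- ip addr show eth0
# the routing table inside the podns
kubectl exec -it my-pod -- ip route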

Services

Understanding Kubernetes Networking in 30 Minutes - Ricardo Katz & James Strong

  • Services - give you a single, stable IP address in front of a set of pods

    • ClusterIP
    • NodePort
    • ExternalName
    • LoadBalancer
    • Headless
  • kubectl get configmap -n kube-system kubeadm-config -o yaml | grep -i servicesubnet

  • TODO what does this do, on a node? : nft -s list map kube-proxy service-ips
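
A minimal ClusterIP Service sketch (all names and ports are illustrative):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
EOF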

Service - ClusterIP

Conceptually, kube-proxy (in iptables mode) implements a ClusterIP with DNAT rules like the following; the chain name and addresses are illustrative:

iptables \
  --table nat \
  --append APP-SVC-HTTP \
  --destination 172.21.2.25 \
  --protocol tcp \
  --match tcp \
  --dport 8080 \
  --jump DNAT \
  --to-destination 10.0.0.11:8080

Service - NodePort
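
A NodePort Service additionally exposes the ClusterIP on a static port (default range 30000-32767) on every node. A minimal sketch, reusing the nginx deployment created in the DNS section below:

kubectl expose deployment nginx --type=NodePort --port=80
kubectl get service nginx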

DNS

  • kubectl create deployment nginx --image=nginx
  • kubectl get pods
> kubectl exec -it nginx-bf5d5cf98-qjbdx -- cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

Network policies - controlling which traffic is allowed to and from your pods

  • Understanding Kubernetes Networking in 30 Minutes

  • Network policy - another component that creates "firewall rules" on your nodes to control pod traffic.

  • NetworkPolicies are divided into ingress and egress rules.

    • Once you declare a policy for a direction of traffic, only the traffic explicitly allowed in that direction will connect (LFS260, ch6). A minimal sketch follows this list.
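
A minimal ingress NetworkPolicy sketch (names and labels are illustrative): once this policy selects the backend pods, only traffic from pods labelled app: frontend reaches them on port 8080.

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
EOF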

Ingress controller

kube-proxy - maintains the Service networking rules on each node

CNI

CNI configuration

  • Kubernetes Networking: How to Write a CNI Plugin From Scratch - Eran Yanay, Twistlock

  • /etc/cni/net.d/10-my-cni-demo.conf - configuration of the CNI

    • cniVersion - the version of the CNI spec the plugin conforms to
    • name - the name of this network configuration
    • type - the name of the plugin binary to look for in /opt/cni/bin
    • podcidr - the pod subnet this node allocates addresses from
  • There is a network space for the host nodes

    • e.g. 10.10.10.0/24, with node IPs like 10.10.10.10
  • There is a network space for the pods

    • each node gets a subnet of the pod network space:
      • node1: 10.240.0.0/24
      • node2: 10.240.1.0/24
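
A sketch of what such a configuration file could contain, following the talk's demo (values are illustrative; podcidr is a field the demo plugin reads, not part of the CNI spec):

cat > /etc/cni/net.d/10-my-cni-demo.conf <<'EOF'
{
  "cniVersion": "0.3.1",
  "name": "my-cni-demo",
  "type": "my-cni-demo",
  "podcidr": "10.240.0.0/24"
}
EOF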

What happens during add

  • enable forwarding - to allow routing of the pod packets?
  • enable masquerading - so pods can reach the internet via the node's address
  • create the bridge
    • at the host level?
  • create the veth pair, between pod and bridge
  • allow the network traffic with iptables
  • ip route add 10.240.1.0/24 via 10.10.10.11 dev enp0s9
    • 10.10.10.11 - the IP of the node that owns the 10.240.1.0/24 pod subnet
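
A sketch of the host-side commands behind these steps (interface names, addresses, and the pod namespace handle are illustrative):

# enable forwarding so the node routes pod packets
sysctl -w net.ipv4.ip_forward=1
# masquerade pod traffic leaving the node so pods can reach the internet
iptables -t nat -A POSTROUTING -s 10.240.0.0/24 ! -o cni0 -j MASQUERADE
# create the bridge at the host level
ip link add cni0 type bridge
ip addr add 10.240.0.1/24 dev cni0
ip link set cni0 up
# create the veth pair; one end attaches to the bridge,
# the other end moves into the pod's network namespace
ip link add veth0 type veth peer name ceth0
ip link set veth0 master cni0
ip link set veth0 up
ip link set ceth0 netns <pod-netns>
# route the other node's pod subnet via that node's IP
ip route add 10.240.1.0/24 via 10.10.10.11 dev enp0s9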

Module - Ingress controller

Implementation apps

  • Envoy Proxy
  • NGINX
  • Traefik
  • Ambassador

DNS lookup from a container

  • TODO document how the cluster actually boots up, who starts first, who talks to whom etc.

Endpoints

  • TODO Endpoint controller
    • Who owns the endpoint controller?
    • Where does the endpoint controller look up the label-selector match on pods?
    • Where do I document how a pod is started?

Use case

  • Kubernetes Endpoints Explained: How Pod IPs Are Tracked Behind Services

  • Pod is created

  • Service is created with label selector

  • Endpoint Controller watches for matching pods

    • There is a one-to-one relationship between a Service and its Endpoints object
      • There is a one-to-many relationship between a Service and its EndpointSlices.
    • Through the event stream from the API server?
    • TODO who starts the Endpoint controller?
      • Is the endpoint controller part of kubelet?
        • According to chatgpt: runs inside kube-controller-manager
        • Watches Services
        • Watches Pods
        • Creates and keeps updated the Endpoint objects (or EndpointSlices) for each Service
    • TODO who queries the EndpointSlices for the IP addresses of the pods?
      • Also from chatgpt
        • kube-proxy
          • Reads Endpoints / EndpointSlices
          • Programs iptables / IPVS rules
          • kube-proxy does not poll EndpointSlices. It watches them via the Kubernetes API and reacts incrementally.
            • TODO it seems like the kube-proxy receives the event stream from the api server.
        • eBPF dataplanes (Calico, Cilium)
          • Use EndpointSlices to build service load-balancing
        • Ingress controllers
        • Service meshes (Istio, Linkerd)
  • The endpoint controller then creates an endpoint entry with the pod's IP in the relevant EndpointSlice

    • how is the EndpointSlice chosen?

kubectl get endpoints my-service
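
The matching EndpointSlices carry the kubernetes.io/service-name label, so they can be listed per Service:

kubectl get endpointslices -l kubernetes.io/service-name=my-service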

kube-proxy

kube-proxy does not poll EndpointSlices. It watches them via the Kubernetes API and reacts incrementally.

  1. kube-proxy sets up watches

On startup, kube-proxy establishes shared informers for:

  • Services
  • EndpointSlices (preferred)
  • Endpoints (legacy / fallback)
  • Nodes (for NodePort, health checks)

These informers:

  • Do an initial LIST
  • Then open a long-lived WATCH connection to the API server

So kube-proxy continuously streams changes instead of querying repeatedly.

The informers run inside the kube-proxy process, watching Services, EndpointSlices, and Nodes. Every informer:

  • Opens a WATCH HTTP connection to the API server
  • Gets streamed events
  • Reconnects automatically on failure
  2. An EndpointSlice change occurs

Example events:

  • A Pod becomes Ready
  • A Pod terminates
  • A Service selector changes
  • A Pod is rescheduled to a new node

The endpoint controller updates the relevant EndpointSlice(s). This produces watch events like ADDED, UPDATED, and DELETED.

  3. Informer updates kube-proxy's local cache

The informer:

  • Deserializes the event
  • Updates kube-proxy's in-memory cache
  • Triggers event handlers

At this point kube-proxy has an up-to-date view of:

  • Which backends exist
  • Their IPs, ports, readiness, and topology info

  4. kube-proxy marks Services as "dirty"

kube-proxy doesn't immediately rewrite rules for every tiny change. Instead it:

  • Marks the affected Service(s) as needing sync
  • Coalesces many updates together
  • Uses a rate-limited sync loop

This avoids thrashing iptables when many Pods churn at once.

  5. Sync loop recalculates desired state

During a sync, kube-proxy builds the desired Service → backend mapping and applies:

  • Session affinity
  • ExternalTrafficPolicy
  • Topology hints
  • Health check rules

Then it diffs the current kernel state against the desired state.

  6. Programs the dataplane

Depending on mode:

iptables mode

  • Writes chains like KUBE-SVC-* and KUBE-SEP-*
  • Uses probabilistic rules for load balancing
  • Performs atomic updates via iptables-restore

IPVS mode

  • Programs kernel IPVS tables
  • Creates virtual services and real servers
  • More scalable for large clusters

Either way, this step is idempotent; kube-proxy can safely reapply rules.
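
The programmed state can be inspected on a node (requires root; the chain name hashes vary per cluster):

# iptables mode: per-Service (KUBE-SVC-*) and per-endpoint (KUBE-SEP-*) chains
iptables-save -t nat | grep -E 'KUBE-(SVC|SEP)'
# IPVS mode: virtual services and their real servers
ipvsadm -Ln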

How kube-proxy stays consistent

kube-proxy relies on:

  • ResourceVersion from the API server
  • Automatic watch reconnects
  • Full resyncs on failure or desync
  • Periodic housekeeping syncs

If it misses events, the watch re-lists and kube-proxy rebuilds its state from scratch. That's why Services usually recover even after control-plane hiccups.

Performance and scale notes

  • EndpointSlices dramatically reduce churn vs Endpoints
  • kube-proxy only watches slices for Services it cares about
  • Large clusters almost always use IPVS mode
  • Frequent Pod churn means more syncs, but they are still batched

DNS

  • /etc/resolv.conf is injected when the pod is created.

Pods

nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

  • 10.96.0.10 = CoreDNS Service ClusterIP
  • search domains enable short names
  • ndots:5 controls when search domains are applied

For a lookup like my-svc, the resolver tries (in order):

  • my-svc.default.svc.cluster.local
  • my-svc.svc.cluster.local
  • my-svc.cluster.local
  • (possibly) my-svc.

Each attempt generates a DNS query.
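
The expansion can be observed from inside a pod, assuming the image ships nslookup (the pod name comes from the deployment above; addresses are illustrative):

kubectl exec -it nginx-bf5d5cf98-qjbdx -- nslookup kubernetes.default
# Server:    10.96.0.10
# Address:   10.96.0.10#53
# Name:      kubernetes.default.svc.cluster.local
# Address:   10.96.0.1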

  • The DNS packet leaves the Pod
    • UDP (usually) to port 53
    • Destination IP: CoreDNS Service ClusterIP
    • Source IP: Pod IP
    • At this point, DNS is just normal network traffic.
  • kube-proxy (or eBPF) routes the packet
    • Because the destination is a Service IP, kube-proxy (iptables / IPVS) or an eBPF dataplane (Calico, Cilium) intercepts the packet and load-balances it to one of the CoreDNS Pod IPs.
    • No DNS logic here, just Service routing.
  • CoreDNS receives the request on port 53 (UDP/TCP)
    • It uses its configured plugin chain (from the Corefile):
      • kubernetes
        • "Is this name inside a zone I manage?"
          • If yes, it answers from its in-memory cache, backed by informers watching:
            • Services
            • EndpointSlices
            • Namespaces
          • No API call per query.
      • forward
        • Names outside the cluster zones are forwarded to:
          • the node's /etc/resolv.conf, or
          • explicit upstream resolvers
        • The result is cached and returned
      • cache
  • The response goes back to the Pod
    • CoreDNS sends a DNS response
    • The packet travels back directly to the Pod IP
      • No kube-proxy involvement on the return path
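
The plugin chain lives in the coredns ConfigMap. It can be dumped with the command below; the commented block is a trimmed sketch of a typical kubeadm Corefile (real ones carry more plugins, e.g. errors, health, loop):

kubectl -n kube-system get configmap coredns -o yaml
# .:53 {
#     kubernetes cluster.local in-addr.arpa ip6.arpa {
#         pods insecure
#         fallthrough in-addr.arpa ip6.arpa
#     }
#     forward . /etc/resolv.conf
#     cache 30
# }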