
Kubernetes Architecture Documentation

Introduction

Purpose

Vocabulary

  • CIDR - Classless Inter-Domain Routing; prefix notation for IP address ranges, e.g. 10.0.0.0/24.
  • CNI - Container Network Interface. Allows you to bring your own network implementation.
  • CRD - Custom Resource Definition; extends the Kubernetes API with custom resource types. TODO investigate
  • CSI - Container Storage Interface.
  • mTLS - mutual TLS; both sides of the connection are authenticated.
  • service mesh - a dedicated infrastructure layer for service-to-service traffic, e.g. Istio or Linkerd.
  • SPIFFE - Secure Production Identity Framework for Everyone.

References

Overview

  • Control plane
    • API server
    • ETCD
    • kube-controller-manager
    • cloud-controller-manager
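
On a kubeadm cluster the control-plane components run as pods in the kube-system namespace and can be listed as below (component names vary by setup):

kubectl get pods -n kube-system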

Cluster overview

Networking

  • Networking and Kubernetes, James Strong and Vallery Lancey
  • Kevin Sookocheff, A guide to the Kubernetes networking model
  • The Kubernetes network guide

Networking overview

graph TB
  subgraph node_a [Node]
    subgraph pod_a_1[POD]
      pause_a_1[pause]
      container_a_1_1[Container]
      container_a_1_2[Container]
      ceth_a[IP/veth - ceth0]
    end
    subgraph pod_a_2[POD]
      pause_a_2[pause]
      container_a_2_1[Container]
      container_a_2_2[Container]
      ceth_b[IP/veth - ceth1]
    end
    cni_1[CNI]
    routes_1[Routes]
    forward_1[Forward rules]
    kube_proxy_1[kube-proxy]
    subgraph ethernet_a[ethernet]
      eth0 --- cbr0
      cbr0 --- veth_a[veth0]
      cbr0 --- veth_b[veth1]
    end
    veth_a --- ceth_a
    veth_b --- ceth_b
  end
  cni_1 ---|allocates| pod_a_1
  cni_1 ---|manages| routes_1
  kube_proxy_1 ---|manages| forward_1

  subgraph node_b [Node]
    subgraph pod_b_1[POD]
      pause_b_1[pause]
      container_b_1_1[Container]
      container_b_1_2[Container]
      ceth_c[IP/veth]
    end
    subgraph pod_b_2[POD]
      pause_b_2[pause]
      container_b_2_1[Container]
      container_b_2_2[Container]
      ceth_d[IP/veth]
    end
    cni_b[CNI]
    routes_b[Routes]
    forward_b[Forward rules]
    kube_proxy_b[kube-proxy]
    ethernet_b[ethernet]
  end
  cni_b ---|allocates| pod_b_1
  cni_b ---|manages| routes_b
  kube_proxy_b ---|manages| forward_b

  subgraph network[Network]
  end
  network --- ethernet_a
  network --- ethernet_b

  • CNI

    • allocates
      • Interfaces (in pods?)
      • IP addresses
  • Inter-container communication within a pod is over localhost.

  • Pod-to-pod communication: all pods can communicate with other pods via their IP addresses.

  • Pod-to-service communication: this is covered by Services.

  • External-to-service communication: this is covered by Services.

  • Each node has its own subnet from which pod IP addresses are allocated.

  • kubectl get configmap -n kube-system kubeadm-config -o yaml | grep podSubnet

  • TODO kubectl get no -o=custom-columns=NAME:.metadata.name,CIDR:.spec.podCIDR,ExternalIP:.status.addresses[0].address

Each node

  • Linux network stack - referred to here as rootns (the root network namespace) for simplicity
    • networking interface
    • routing
    • iptables
    • conntrack - connection tracking
      • tracks the connections inside the kernel.
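
A minimal sketch of inspecting this stack on a node (requires root; the conntrack CLI may need to be installed separately):

# interfaces in the root namespace
ip addr
# the node's routing table
ip route
# NAT rules programmed by kube-proxy and the CNI
iptables -t nat -L -n
# connections currently tracked by the kernel
conntrack -L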

network in pods

  • The pause container is always there.
    • It holds the pod's network namespace open so the pod does not disappear.
    • It keeps the pod's IP address in place even when application containers restart.
  • podns - Pod NameSpace
    • Inside the pod namespace there is a complete copy of the simplified Linux network stack:
      • networking interface
      • routing
      • iptables
      • conntrack - connection tracking
  • each container connects to (joins) the podns
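
A sketch of observing this from inside a pod, assuming the image ships the ip tool (the pod name is illustrative; inside the podns the pod-side veth end is usually renamed eth0):

# the pod-side interface and the IP the CNI allocated
kubectl exec -it my-pod -- ip addr show eth0
# the routing table inside the podns
kubectl exec -it my-pod -- ip route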

Services

Understanding Kubernetes Networking in 30 Minutes - Ricardo Katz & James Strong

  • Services - give you a single, stable IP address in front of a set of pods

    • ClusterIP
    • NodePort
    • ExternalName
    • LoadBalancer
    • Headless
  • kubectl get configmap -n kube-system kubeadm-config -o yaml | grep -i servicesubnet

  • TODO what does this do, on a node? : nft -s list map kube-proxy service-ips
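
A minimal ClusterIP Service sketch (all names and ports are illustrative):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
EOF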

Service - ClusterIP

Conceptually, kube-proxy (in iptables mode) implements a ClusterIP with DNAT rules like the following; the chain name and addresses are illustrative:

iptables \
  --table nat \
  --append APP-SVC-HTTP \
  --destination 172.21.2.25 \
  --protocol tcp \
  --match tcp \
  --dport 8080 \
  --jump DNAT \
  --to-destination 10.0.0.11:8080

Service - NodePort
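
A NodePort Service additionally exposes the ClusterIP on a static port (default range 30000-32767) on every node. A minimal sketch, reusing the nginx deployment created in the DNS section below:

kubectl expose deployment nginx --type=NodePort --port=80
kubectl get service nginx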

DNS

  • kubectl create deployment nginx --image=nginx
  • kubectl get pods
> kubectl exec -it nginx-bf5d5cf98-qjbdx -- cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

Network policies - controlling which traffic is allowed to and from your pods

  • Understanding Kubernetes Networking in 30 Minutes

  • Network policy - another component that creates "firewall rules" on your nodes to control pod traffic.

  • NetworkPolicies are divided into ingress and egress rules.

    • Once you declare a policy for a direction of traffic, only the traffic explicitly allowed in that direction will connect (LFS260, ch6). A minimal sketch follows this list.
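
A minimal ingress NetworkPolicy sketch (names and labels are illustrative): once this policy selects the backend pods, only traffic from pods labelled app: frontend reaches them on port 8080.

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
EOF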

Ingress controller

kube-proxy - maintains the Service networking rules on each node

CNI

CNI configuration

  • Kubernetes Networking: How to Write a CNI Plugin From Scratch - Eran Yanay, Twistlock

  • /etc/cni/net.d/10-my-cni-demo.conf - configuration of the CNI

    • cniVersion - the version of the CNI spec the plugin conforms to
    • name - the name of this network configuration
    • type - the name of the plugin binary to look for in /opt/cni/bin
    • podcidr - the pod subnet this node allocates addresses from
  • There is a network space for the host nodes

    • e.g. 10.10.10.0/24, with node IPs like 10.10.10.10
  • There is a network space for the pods

    • each node gets a subnet of the pod network space:
      • node1: 10.240.0.0/24
      • node2: 10.240.1.0/24
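
A sketch of what such a configuration file could contain, following the talk's demo (values are illustrative; podcidr is a field the demo plugin reads, not part of the CNI spec):

cat > /etc/cni/net.d/10-my-cni-demo.conf <<'EOF'
{
  "cniVersion": "0.3.1",
  "name": "my-cni-demo",
  "type": "my-cni-demo",
  "podcidr": "10.240.0.0/24"
}
EOF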

What happens during add

  • enable forwarding - to allow routing of the pod packets?
  • enable masquerading - so pods can reach the internet via the node's address
  • create the bridge
    • at the host level?
  • create the veth pair, between pod and bridge
  • allow the network traffic with iptables
  • ip route add 10.240.1.0/24 via 10.10.10.11 dev enp0s9
    • 10.10.10.11 - the IP of the node that owns the 10.240.1.0/24 pod subnet
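
A sketch of the host-side commands behind these steps (interface names, addresses, and the pod namespace handle are illustrative):

# enable forwarding so the node routes pod packets
sysctl -w net.ipv4.ip_forward=1
# masquerade pod traffic leaving the node so pods can reach the internet
iptables -t nat -A POSTROUTING -s 10.240.0.0/24 ! -o cni0 -j MASQUERADE
# create the bridge at the host level
ip link add cni0 type bridge
ip addr add 10.240.0.1/24 dev cni0
ip link set cni0 up
# create the veth pair; one end attaches to the bridge,
# the other end moves into the pod's network namespace
ip link add veth0 type veth peer name ceth0
ip link set veth0 master cni0
ip link set veth0 up
ip link set ceth0 netns <pod-netns>
# route the other node's pod subnet via that node's IP
ip route add 10.240.1.0/24 via 10.10.10.11 dev enp0s9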

Module - Ingress controller

Implementation apps

  • Envoy Proxy
  • NGINX
  • Traefik
  • Ambassador

DNS lookup from a container

  • TODO document how the cluster actually boots up, who starts first, who talks to whom etc.

Endpoints

  • TODO Endpoint controller
    • Who owns the endpoint controller?
    • Where does the endpoint controller look up the label-selector match on pods?
    • Where do I document how a pod is started?

Use case

  • Kubernetes Endpoints Explained: How Pod IPs Are Tracked Behind Services

  • Pod is created

  • Service is created with label selector

  • Endpoint Controller watches for matching pods

    • There is a one-to-one relationship between a Service and its Endpoints object
      • There is a one-to-many relationship between a Service and its EndpointSlices.
    • Through the event stream from the API server?
    • TODO who starts the Endpoint controller?
      • Is the endpoint controller part of kubelet?
        • According to chatgpt: runs inside kube-controller-manager
        • Watches Services
        • Watches Pods
        • Creates and keeps updated the Endpoint objects (or EndpointSlices) for each Service
    • TODO who queries the EndpointSlices for the IP addresses of the pods?
      • Also from chatgpt
        • kube-proxy
          • Reads Endpoints / EndpointSlices
          • Programs iptables / IPVS rules
          • kube-proxy does not poll EndpointSlices. It watches them via the Kubernetes API and reacts incrementally.
            • TODO it seems like the kube-proxy receives the event stream from the api server.
        • eBPF dataplanes (Calico, Cilium)
          • Use EndpointSlices to build service load-balancing
        • Ingress controllers
        • Service meshes (Istio, Linkerd)
  • The endpoint controller then creates an endpoint entry with the pod's IP in the relevant EndpointSlice

    • how is the EndpointSlice chosen?

kubectl get endpoints my-service
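
The matching EndpointSlices carry the kubernetes.io/service-name label, so they can be listed per Service:

kubectl get endpointslices -l kubernetes.io/service-name=my-service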

kube-proxy

kube-proxy does not poll EndpointSlices. It watches them via the Kubernetes API and reacts incrementally.

  1. kube-proxy sets up watches

On startup, kube-proxy establishes shared informers for:

  • Services
  • EndpointSlices (preferred)
  • Endpoints (legacy / fallback)
  • Nodes (for NodePort, health checks)

These informers:

  • Do an initial LIST
  • Then open a long-lived WATCH connection to the API server

So kube-proxy continuously streams changes instead of querying repeatedly.

The informers run inside the kube-proxy process, watching Services, EndpointSlices, and Nodes. Every informer:

  • Opens a WATCH HTTP connection to the API server
  • Gets streamed events
  • Reconnects automatically on failure
  2. An EndpointSlice change occurs

Example events:

  • A Pod becomes Ready
  • A Pod terminates
  • A Service selector changes
  • A Pod is rescheduled to a new node

The endpoint controller updates the relevant EndpointSlice(s). This produces watch events like ADDED, UPDATED, and DELETED.

  3. Informer updates kube-proxy's local cache

The informer:

  • Deserializes the event
  • Updates kube-proxy's in-memory cache
  • Triggers event handlers

At this point kube-proxy has an up-to-date view of:

  • Which backends exist
  • Their IPs, ports, readiness, and topology info

  4. kube-proxy marks Services as "dirty"

kube-proxy doesn't immediately rewrite rules for every tiny change. Instead it:

  • Marks the affected Service(s) as needing sync
  • Coalesces many updates together
  • Uses a rate-limited sync loop

This avoids thrashing iptables when many Pods churn at once.

  5. Sync loop recalculates desired state

During a sync, kube-proxy builds the desired Service → backend mapping and applies:

  • Session affinity
  • ExternalTrafficPolicy
  • Topology hints
  • Health check rules

Then it diffs the current kernel state against the desired state.

  6. Programs the dataplane

Depending on mode:

iptables mode

  • Writes chains like KUBE-SVC-* and KUBE-SEP-*
  • Uses probabilistic rules for load balancing
  • Performs atomic updates via iptables-restore

IPVS mode

  • Programs kernel IPVS tables
  • Creates virtual services and real servers
  • More scalable for large clusters

Either way, this step is idempotent; kube-proxy can safely reapply rules.
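
The programmed state can be inspected on a node (requires root; the chain name hashes vary per cluster):

# iptables mode: per-Service (KUBE-SVC-*) and per-endpoint (KUBE-SEP-*) chains
iptables-save -t nat | grep -E 'KUBE-(SVC|SEP)'
# IPVS mode: virtual services and their real servers
ipvsadm -Ln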

How kube-proxy stays consistent

kube-proxy relies on:

  • ResourceVersion from the API server
  • Automatic watch reconnects
  • Full resyncs on failure or desync
  • Periodic housekeeping syncs

If it misses events, the watch re-lists and kube-proxy rebuilds its state from scratch. That's why Services usually recover even after control-plane hiccups.

Performance and scale notes

  • EndpointSlices dramatically reduce churn vs Endpoints
  • kube-proxy only watches slices for Services it cares about
  • Large clusters almost always use IPVS mode
  • Frequent Pod churn means more syncs, but they are still batched

DNS

  • /etc/resolv.conf is injected when the pod is created.

Pods

nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

  • 10.96.0.10 = CoreDNS Service ClusterIP
  • search domains enable short names
  • ndots:5 controls when search domains are applied

For a lookup like my-svc, the resolver tries (in order):

  • my-svc.default.svc.cluster.local
  • my-svc.svc.cluster.local
  • my-svc.cluster.local
  • (possibly) my-svc.

Each attempt generates a DNS query.
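
The expansion can be observed from inside a pod, assuming the image ships nslookup (the pod name comes from the deployment above; addresses are illustrative):

kubectl exec -it nginx-bf5d5cf98-qjbdx -- nslookup kubernetes.default
# Server:    10.96.0.10
# Address:   10.96.0.10#53
# Name:      kubernetes.default.svc.cluster.local
# Address:   10.96.0.1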

  • The DNS packet leaves the Pod
    • UDP (usually) to port 53
    • Destination IP: CoreDNS Service ClusterIP
    • Source IP: Pod IP
    • At this point, DNS is just normal network traffic.
  • kube-proxy (or eBPF) routes the packet
    • Because the destination is a Service IP, kube-proxy (iptables / IPVS) or an eBPF dataplane (Calico, Cilium) intercepts the packet and load-balances it to one of the CoreDNS Pod IPs.
    • No DNS logic here, just Service routing.
  • CoreDNS receives the request on port 53 (UDP/TCP)
    • It uses its configured plugin chain (from the Corefile):
      • kubernetes
        • "Is this name inside a zone I manage?"
          • If yes, it answers from its in-memory cache, backed by informers watching:
            • Services
            • EndpointSlices
            • Namespaces
          • No API call per query.
      • forward
        • Names outside the cluster zones are forwarded to:
          • the node's /etc/resolv.conf, or
          • explicit upstream resolvers
        • The result is cached and returned
      • cache
  • The response goes back to the Pod
    • CoreDNS sends a DNS response
    • The packet travels back directly to the Pod IP
      • No kube-proxy involvement on the return path
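
The plugin chain lives in the coredns ConfigMap. It can be dumped with the command below; the commented block is a trimmed sketch of a typical kubeadm Corefile (real ones carry more plugins, e.g. errors, health, loop):

kubectl -n kube-system get configmap coredns -o yaml
# .:53 {
#     kubernetes cluster.local in-addr.arpa ip6.arpa {
#         pods insecure
#         fallthrough in-addr.arpa ip6.arpa
#     }
#     forward . /etc/resolv.conf
#     cache 30
# }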