# EN_K8s_Networking
In Kubernetes, a Service is an abstraction that defines a logical set of Pods and a policy by which to access them. The Service Type specifies how the Service is exposed to the network.
The main Service Types include:
- ClusterIP: The default type; the Service is only reachable from within the cluster.
- NodePort: Exposes the Service on a static port on each node's IP. You can reach a NodePort Service from outside the cluster at `<NodeIP>:<NodePort>`.
- LoadBalancer: Creates an external load balancer in the current cloud (if supported) and assigns a fixed, external IP to the Service.
- ExternalName: Maps the Service to the contents of the `externalName` field (e.g., `foo.bar.example.com`) by returning a CNAME record with its value.
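For illustration, a minimal sketch of a NodePort Service and an ExternalName Service; the names, labels, and ports are placeholders:

```yaml
# NodePort: reachable from outside the cluster at <NodeIP>:30080
apiVersion: v1
kind: Service
metadata:
  name: web-nodeport          # hypothetical name
spec:
  type: NodePort
  selector:
    app: web                  # hypothetical label
  ports:
    - port: 80                # ClusterIP port
      targetPort: 8080        # container port
      nodePort: 30080         # static port on every node (default range 30000-32767)
---
# ExternalName: cluster-internal alias resolved via a CNAME record
apiVersion: v1
kind: Service
metadata:
  name: external-db           # hypothetical name
spec:
  type: ExternalName
  externalName: foo.bar.example.com
```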
`externalTrafficPolicy` is an option on a Service of type LoadBalancer or NodePort that controls how incoming traffic is routed. It can have two values: `Cluster` or `Local`.
- Cluster: The traffic is routed to any node, and if that node doesn't have a pod for the service, the traffic is forwarded to a node that does. This can cause an extra hop and might obscure the source IP address.
- Local: The traffic is only routed to the nodes that have the pod for the service. If the traffic hits a node without a pod, it's dropped, not forwarded. This preserves the original source IP address but can lead to uneven distribution of traffic across the pods.
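A minimal sketch of a LoadBalancer Service that preserves the client source IP with `externalTrafficPolicy: Local`; the name, label, and ports are placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-lb                      # hypothetical name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local      # traffic only goes to nodes with a local Pod; source IP preserved
  selector:
    app: web                        # hypothetical label
  ports:
    - port: 80
      targetPort: 8080
```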
- A Pod's veth pair connects to the host's `caliXXX` interface. The `caliXXX` interface has no IP address, only a MAC address.
- When a Pod sends an ARP request to learn the destination Pod's MAC address, the host kernel's ARP proxy responds with its own MAC address.
- The Pod sends the packet to the host MAC, and the host routing table forwards it to the destination Pod's veth interface.
- This enables efficient inter-Pod communication without an L2 bridge.
VXLAN operates as L2-over-L3 tunneling:
- Pod's original packet (Inner Ethernet + IP)
- VXLAN header added (includes VNID)
- Outer UDP header added (port 4789)
- Outer IP header added (node IP)
- Outer Ethernet header
MTU decreases from 1500 to 1450 bytes (50-byte overhead), introducing CPU overhead (encapsulation/decapsulation) and ~3% bandwidth overhead. AWS VPC CNI solves this by having Pods use VPC IPs directly.
- iptables: Sequential search with O(n) complexity. Performance degrades with 1000+ Services. Requires full rewrite on rule update.
- IPVS: Hash-table lookup with O(1) complexity, suitable for large clusters. Supports multiple load-balancing algorithms: `rr` (round-robin), `lc` (least connections), `sh` (source hashing).
- Recommendation: Use iptables mode for fewer than 100 Services and IPVS for larger clusters. IPVS allows real-time connection state inspection via `ipvsadm` (a configuration sketch follows this list).
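A minimal sketch of enabling IPVS mode through the kube-proxy configuration, assuming kube-proxy reads a KubeProxyConfiguration (for example from the kube-proxy ConfigMap on kubeadm clusters); the scheduler is one of the algorithms listed above:

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs            # switch from the default iptables mode
ipvs:
  scheduler: rr       # rr | lc | sh, among others
```

After changing the mode, the kube-proxy Pods must be restarted, and the `ip_vs` kernel modules must be available on every node.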
When a Service is created, kube-proxy generates iptables rules:
- `PREROUTING` → `KUBE-SERVICES` → `KUBE-SVC-XXX` → `KUBE-SEP-XXX` (one chain per Pod endpoint)
- ClusterIP traffic: DNAT rewrites the destination to a Pod IP
- NodePort traffic: `KUBE-MARK-MASQ` marks the packet, then SNAT rewrites the source to the node IP (`externalTrafficPolicy: Cluster`). With `Local`, SNAT is skipped.
- `conntrack` tracks connections to ensure response packets are routed back correctly.
Use `iptables-save` to inspect all rules and trace the path starting from the `KUBE-SERVICES` chain.
- CoreDNS processes DNS queries based on the `ndots` setting (default: 5) in `/etc/resolv.conf`.
- For non-FQDN queries, search domains are appended in order, so a lookup for `my-service` can trigger up to 6 queries.
- Reduce the query count by lowering `ndots` or by using FQDNs such as `my-service.namespace.svc.cluster.local` (see the dnsConfig sketch after this list).
- CoreDNS uses the `cache` plugin for TTL-based caching, reducing response time to milliseconds. The `autopath` plugin optimizes the search order.
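A minimal sketch of lowering `ndots` for a single Pod via `dnsConfig`; the Pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-tuned-pod          # hypothetical name
spec:
  containers:
    - name: app
      image: nginx
  dnsConfig:
    options:
      - name: ndots
        value: "2"             # fewer search-domain expansions for non-FQDN lookups
```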
- Calico uses BGP to advertise Pod CIDR information between nodes.
- Full-mesh mode: All nodes peer with each other, creating N(N-1)/2 connections, which causes scalability issues beyond roughly 100 nodes.
- Route Reflector mode: A central RR node manages routing information, requiring only N connections.
- RR nodes should be redundant to avoid a SPOF. Use Kubernetes nodes as RRs or deploy dedicated RR instances (a peering sketch follows this list).
- Check BGP peer status with `calicoctl node status`.
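A minimal Route Reflector peering sketch for Calico, assuming RR nodes are labeled `route-reflector=true` and have a `routeReflectorClusterID` set on their Calico Node resource:

```yaml
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: peer-to-route-reflectors               # hypothetical name
spec:
  nodeSelector: all()                          # every node peers with...
  peerSelector: route-reflector == 'true'      # ...the nodes labeled as route reflectors
```

In this setup the default node-to-node full mesh is usually disabled via the default `BGPConfiguration` (`nodeToNodeMeshEnabled: false`).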
- Generate CA certificates with Cert-Manager
- Mount TLS Secrets to each Pod
- Implement TLS handshake in the application
Use NetworkPolicy to allow only specific labeled Pods to communicate, and restrict privileges with PSA (Pod Security Admission).
Service Meshes like Linkerd/Istio automate this without application code changes, providing traffic encryption, authentication, authorization, and Observability in one place.
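As one concrete example of the NetworkPolicy restriction mentioned above, a minimal sketch that only allows Pods labeled `app: frontend` to reach Pods labeled `app: backend`; names and labels are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-only        # hypothetical name
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: backend                 # policy applies to backend Pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend        # only frontend Pods may connect
      ports:
        - protocol: TCP
          port: 8080
```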
- Ingress Controllers watch Ingress resources and auto-generate reverse proxy configurations.
- NGINX Ingress: Most mature and stable; annotation-based configuration; global settings via ConfigMap.
- Traefik: Dynamic configuration; automatic SSL (Let's Encrypt); middleware chaining; Kubernetes CRD support.
- Istio Gateway: Integrated with Service Mesh; L7 routing; traffic splitting (Canary); built-in mTLS and Observability.
Selection guide: NGINX for simple L7 routing, Traefik for dynamic environments, Istio for advanced microservice features.
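For reference, a minimal NGINX Ingress sketch using annotation-based configuration; the host, backend Service name, and annotation value are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress                                   # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /     # NGINX-specific annotation
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web                             # hypothetical backend Service
                port:
                  number: 80
```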
| | Legacy (in-tree) | AWS Load Balancer Controller (out-of-tree) |
|---|---|---|
| Update cycle | Slow (tied to K8s core) | Independent, fast |
| LB support | Classic LB only | ALB + NLB native |
| Features | Basic | IP mode, TargetGroupBinding, WAF integration, Subnet Discovery, NLB client IP preservation |
When creating an ALB via Ingress, fine-grained control is possible through annotations. When creating an NLB via Service, you can choose Instance/IP type. IP type registers Pod IPs directly, eliminating node hops.
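A minimal sketch of an NLB Service in IP mode, assuming the AWS Load Balancer Controller (v2.x) is installed; the name, label, and ports are placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-nlb                                                           # hypothetical name
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"    # register Pod IPs directly, skipping node hops
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
spec:
  type: LoadBalancer
  selector:
    app: web                                                              # hypothetical label
  ports:
    - port: 80
      targetPort: 8080
```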
Q20. How do you verify external client source IP, and what is the difference between X-Forwarded-For and Proxy Protocol?
- AWS: `externalTrafficPolicy: Local` + NLB (Proxy Protocol v2) to preserve the client IP. ALB uses the `X-Forwarded-For` header.
- On-Premise: `externalTrafficPolicy: Local` + MetalLB (Layer 2/BGP mode).
| | X-Forwarded-For | Proxy Protocol |
|---|---|---|
| Layer | L7 (HTTP header) | L4 (TCP connection) |
| Operation | Application must parse | Binary header at connection start |
| Support | All HTTP proxies | NGINX, HAProxy, NLB |
| Performance | Slightly more overhead | Better performance |
When using `externalTrafficPolicy: Local`, configure anti-affinity to distribute Pods evenly across all nodes and avoid traffic imbalance, as sketched below.
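A minimal anti-affinity sketch (soft rule) that spreads replicas across nodes; the Deployment name, label, and image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                                  # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: web
                topologyKey: kubernetes.io/hostname   # prefer scheduling replicas on different nodes
      containers:
        - name: web
          image: nginx
```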
- External Traffic → Ingress Controller / LoadBalancer (AWS ELB, NLB, ALB)
- Service ClusterIP (Virtual IP, Endpoints management)
- kube-proxy (iptables `KUBE-SERVICES` chain, DNAT to Pod IP; IPVS mode uses a hash table)
- CNI Network (Calico/Flannel routing; same-node traffic uses a veth pair + ARP proxy)
- Overlay Network (cross-node traffic uses VXLAN/IPIP encapsulation over a tunneling interface)
- Pod Container Port reached
Debugging: kubectl logs, tcpdump, iptables-save, check Endpoints, validate NetworkPolicy.
Q21-1. What is the full process of CNI Plugin setting up Pod networking, and what is the role of IPAM?
CNI (Container Network Interface) is the standard interface between kubelet and network plugins.
- kubelet requests container creation via CRI (Container Runtime Interface)
- CRI creates a network namespace
- kubelet calls CNI Plugin (ADD command)
- IPAM (IP Address Management) Plugin assigns an available IP address
- CNI Plugin creates a veth pair (one end in Pod namespace, the other on the host)
- Sets IP on the Pod-side interface and adds default route
- Connects host-side interface to bridge or routing table
- CNI returns result (IP, Gateway, DNS) to kubelet
- host-local: Stores IP allocation in local files; simple but no cross-node synchronization.
- Calico IPAM: Distributed IP management via etcd; efficient allocation using the IP Pool concept (see the sketch after this list).
- Whereabouts: Multi-network IP management via etcd/Kubernetes API.
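A minimal Calico IPPool sketch illustrating the IP Pool concept; the CIDR, block size, and encapsulation settings are placeholder values:

```yaml
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-pool
spec:
  cidr: 192.168.0.0/16        # Pod CIDR managed by Calico IPAM
  blockSize: 26               # per-node allocation block (64 addresses)
  ipipMode: Never
  vxlanMode: CrossSubnet      # encapsulate only across subnets
  natOutgoing: true           # SNAT Pod traffic leaving the cluster
```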
Debugging commands:
```bash
# CNI logs
ls /var/log/pods/
ls /opt/cni/bin/
# IP allocation status (Calico)
calicoctl ipam show
# Network namespace inspection
ip netns list
nsenter -t <pid> -n ip addr
```
Service Meshes like Istio/Linkerd inject Envoy/linkerd-proxy as a sidecar to intercept all traffic.
- Mutating Admission Webhook modifies Pod Spec
- Init Container (`istio-init`) configures iptables rules
- Sidecar Proxy container is added
- Runs alongside the application container
```bash
# Outbound traffic interception
-A OUTPUT -p tcp -j ISTIO_OUTPUT
-A ISTIO_OUTPUT ! -d 127.0.0.1/32 -j ISTIO_REDIRECT
-A ISTIO_REDIRECT -p tcp -j REDIRECT --to-ports 15001
# Inbound traffic interception
-A PREROUTING -p tcp -j ISTIO_INBOUND
-A ISTIO_INBOUND -p tcp --dport 80 -j REDIRECT --to-ports 15006
```
- Outbound: Redirects the application's outbound traffic to Envoy's port 15001
- Inbound: Redirects incoming traffic to Envoy's port 15006
- Envoy applies mTLS, routing, load balancing, Retry, Circuit Breaker, then forwards to actual destination
- Envoy's own traffic is excluded to prevent infinite loops
- Prometheus metrics port (15090) excluded
- Use the `traffic.sidecar.istio.io/excludeOutboundPorts` annotation to exclude specific ports, as shown below
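A minimal sketch of that annotation on a Pod; the Pod name, image, and port list are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-exclusions                                       # hypothetical name
  annotations:
    traffic.sidecar.istio.io/excludeOutboundPorts: "3306,6379"    # these outbound ports bypass the sidecar
spec:
  containers:
    - name: app
      image: nginx
```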
MTU (Maximum Transmission Unit) is the maximum packet size that can be transmitted at once.
- Standard Ethernet MTU: 1500 bytes
- VXLAN Overlay overhead: 50 bytes (VXLAN header 8 + Outer IP 20 + Outer UDP 8 + Outer Ethernet 14)
- Effective Pod MTU: 1450 bytes
- Sending 1500-byte packets causes fragmentation
- Performance degradation during large data transfers
- TCP connections dropping mid-session
- Communication failure when PMTUD (Path MTU Discovery) fails
Solutions:
1. Auto-set Pod MTU:
```yaml
# Calico CNI configuration
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  mtuIfacePattern: "^((en|wl|ww|sl|ib)[opsx].*|(eth|wlan|wwan).*)"
  vxlanMTU: 1450
```
2. CNI auto-detection:
- Calico: `FELIX_IPINIPMTU`, `FELIX_VXLANMTU` environment variables
- Cilium: Auto-calculated based on the `tunnel-protocol` setting
- AWS VPC CNI: Auto-set based on ENI MTU (Jumbo Frames up to 9001)
3. TCP MSS Clamping:
```bash
iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
```
4. Enable Jumbo Frames on the physical interface:
```bash
ip link set dev eth0 mtu 9000
```
Verification:
```bash
# Check MTU inside Pod
kubectl exec -it <pod> -- ip link show eth0
# PMTUD test (ping with Don't Fragment flag)
ping -M do -s 1472 <destination>   # OK if MTU is 1500
ping -M do -s 1422 <destination>   # For VXLAN environments
```
NodePort Service allows external access via a specific port on all nodes.
- External client requests `Node1:30080`
- Node1's kube-proxy randomly selects Pod B on Node2
- SNAT occurs: client IP → Node1 IP
- Pod B perceives Node1 as the client (original IP lost)
- Response also returns through Node1 (extra hop)
- If Pod B responds directly to the client IP, the client sent the request to Node1 but receives a response from Node2, breaking the connection (asymmetric routing).
- SNAT maintains Node1 IP to guarantee the response path.
Session Affinity configuration:
```yaml
apiVersion: v1
kind: Service
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
```
Issue: Due to SNAT, all requests appear to come from the node IP. Multiple clients behind the same node are therefore all routed to the same Pod (imbalance).

Solutions:
1. Use `externalTrafficPolicy: Local`:
```yaml
spec:
  type: NodePort
  externalTrafficPolicy: Local
```
- Pros: Preserves client IP, Session Affinity works correctly, fewer hops
- Cons: Requests to nodes without Pods fail; possible imbalance
2. LoadBalancer + Proxy Protocol:
```yaml
service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
```
3. Use an Ingress Controller:
- Forward the original IP via the `X-Forwarded-For` header at L7
- Ingress manages Session Affinity (cookie-based)
Debugging:
```bash
# Check conntrack table
conntrack -L | grep <service-ip>
# Check iptables SNAT rules
iptables -t nat -L KUBE-POSTROUTING -n -v
# Watch endpoint changes
kubectl get endpoints <service> --watch
```
Q21-5. How do you configure a Dual-Stack (IPv4/IPv6) Kubernetes cluster, and what should you consider?
Dual-Stack supports both IPv4 and IPv6 simultaneously (GA in K8s 1.23+).
```bash
# kube-apiserver flags
--service-cluster-ip-range=10.96.0.0/12,fd00:1234::/112
--feature-gates=IPv6DualStack=true

# kube-controller-manager flags
--cluster-cidr=10.244.0.0/16,fd00:5678::/104
--service-cluster-ip-range=10.96.0.0/12,fd00:1234::/112
--node-cidr-mask-size-ipv4=24
--node-cidr-mask-size-ipv6=120
```
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dual-stack-pod
spec:
  containers:
    - name: app
      image: nginx
status:
  podIPs:
    - ip: 10.244.1.5       # IPv4
    - ip: fd00:5678::5     # IPv6
```
```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  ipFamilyPolicy: PreferDualStack    # SingleStack | PreferDualStack | RequireDualStack
  ipFamilies:
    - IPv4
    - IPv6
  clusterIPs:
    - 10.96.100.200        # Primary (IPv4)
    - fd00:1234::200       # Secondary (IPv6)
```
- Calico: Full Dual-Stack support, IPv6 BGP peering
- Cilium: Native IPv6 support, high performance with eBPF
- Flannel: Limited support (VXLAN mode only)
1. DNS Resolution:
```bash
# CoreDNS auto-generates AAAA records
my-service.default.svc.cluster.local.   # Returns A + AAAA
```
2. Application Compatibility:
- `0.0.0.0:8080` → `[::]:8080` (IPv6 binding)
- Go: `net.Listen("tcp", ":8080")` supports Dual-Stack automatically
- Python: `socket.AF_INET6` + `IPV6_V6ONLY=0`
3. NetworkPolicy:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
spec:
  ingress:
    - from:
        - ipBlock:
            cidr: 10.0.0.0/8   # IPv4
        - ipBlock:
            cidr: fd00::/8     # IPv6
```
4. Cloud Provider Constraints:
- AWS: Requires VPC IPv6 CIDR; ELB supports Dual-Stack NLB only
- GCP: Dual-Stack GKE in beta; requires additional configuration
- Azure: AKS Dual-Stack in preview
- Configure IPv4 Single-Stack cluster
- Gradually add IPv6 with `ipFamilyPolicy: PreferDualStack`
- Validate IPv6 compatibility per application
- Enforce with `ipFamilyPolicy: RequireDualStack`
- Switch the primary IP family to IPv6 (change the `ipFamilies` order)
```bash
# Check Pod IPv6 address
kubectl get pod <pod> -o jsonpath='{.status.podIPs}'
# Test IPv6 connectivity
kubectl exec -it <pod> -- curl -6 http://[fd00:1234::200]:80
# Check CNI IPv6 routing
ip -6 route show
```