# EN_K8s_Networking
In Kubernetes, a Service is an abstraction that defines a logical set of Pods and a policy by which to access them. The Service Type specifies how the Service is exposed to the network.
The main Service Types include:
- ClusterIP: The default type; the Service is only reachable from within the cluster.
- NodePort: Exposes the Service on a static port on each node's IP. You can reach a NodePort Service from outside the cluster at `<NodeIP>:<NodePort>`.
- LoadBalancer: Creates an external load balancer in the current cloud (if supported) and assigns a fixed, external IP to the Service.
- ExternalName: Maps the Service to the contents of the `externalName` field (e.g., `foo.bar.example.com`) by returning a CNAME record with its value.
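For illustration, a minimal sketch of a NodePort Service and an ExternalName Service; the names, labels, and ports are placeholders:

```yaml
# NodePort: reachable from outside the cluster at <NodeIP>:30080
apiVersion: v1
kind: Service
metadata:
  name: web-nodeport          # hypothetical name
spec:
  type: NodePort
  selector:
    app: web                  # hypothetical label
  ports:
    - port: 80                # ClusterIP port
      targetPort: 8080        # container port
      nodePort: 30080         # static port on every node (default range 30000-32767)
---
# ExternalName: cluster-internal alias resolved via a CNAME record
apiVersion: v1
kind: Service
metadata:
  name: external-db           # hypothetical name
spec:
  type: ExternalName
  externalName: foo.bar.example.com
```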
`externalTrafficPolicy` is an option on a Service of type LoadBalancer or NodePort that controls how incoming traffic is routed. It can have two values: `Cluster` or `Local`.
- Cluster: The traffic is routed to any node, and if that node doesn't have a pod for the service, the traffic is forwarded to a node that does. This can cause an extra hop and might obscure the source IP address.
- Local: The traffic is only routed to the nodes that have the pod for the service. If the traffic hits a node without a pod, it's dropped, not forwarded. This preserves the original source IP address but can lead to uneven distribution of traffic across the pods.
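A minimal sketch of a LoadBalancer Service that preserves the client source IP with `externalTrafficPolicy: Local`; the name, label, and ports are placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-lb                      # hypothetical name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local      # traffic only goes to nodes with a local Pod; source IP preserved
  selector:
    app: web                        # hypothetical label
  ports:
    - port: 80
      targetPort: 8080
```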
- A Pod's veth pair connects to the host's `caliXXX` interface. The `caliXXX` interface has no IP address, only a MAC address.
- When a Pod sends an ARP request to learn the destination Pod's MAC address, the host kernel's ARP proxy responds with its own MAC address.
- The Pod sends the packet to the host MAC, and the host routing table forwards it to the destination Pod's veth interface.
- This enables efficient inter-Pod communication without an L2 bridge.
VXLAN operates as L2-over-L3 tunneling:
- Pod's original packet (Inner Ethernet + IP)
- VXLAN header added (includes VNID)
- Outer UDP header added (port 4789)
- Outer IP header added (node IP)
- Outer Ethernet header
MTU decreases from 1500 to 1450 bytes (50-byte overhead), introducing CPU overhead (encapsulation/decapsulation) and ~3% bandwidth overhead. AWS VPC CNI solves this by having Pods use VPC IPs directly.
- iptables: Sequential search with O(n) complexity. Performance degrades with 1000+ Services. Requires full rewrite on rule update.
- IPVS: Hash-table lookup with O(1) complexity, suitable for large clusters. Supports multiple load-balancing algorithms: `rr` (round-robin), `lc` (least connections), `sh` (source hashing).
- Recommendation: Use iptables mode for fewer than 100 Services and IPVS for larger clusters. IPVS allows real-time connection state inspection via `ipvsadm` (a configuration sketch follows this list).
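A minimal sketch of enabling IPVS mode through the kube-proxy configuration, assuming kube-proxy reads a KubeProxyConfiguration (for example from the kube-proxy ConfigMap on kubeadm clusters); the scheduler is one of the algorithms listed above:

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs            # switch from the default iptables mode
ipvs:
  scheduler: rr       # rr | lc | sh, among others
```

After changing the mode, the kube-proxy Pods must be restarted, and the `ip_vs` kernel modules must be available on every node.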
When a Service is created, kube-proxy generates iptables rules:
- `PREROUTING` → `KUBE-SERVICES` → `KUBE-SVC-XXX` → `KUBE-SEP-XXX` (one chain per Pod endpoint)
- ClusterIP traffic: DNAT rewrites the destination to a Pod IP
- NodePort traffic: `KUBE-MARK-MASQ` marks the packet, then SNAT rewrites the source to the node IP (`externalTrafficPolicy: Cluster`). With `Local`, SNAT is skipped.
- `conntrack` tracks connections to ensure response packets are routed back correctly.
Use `iptables-save` to inspect all rules and trace the path starting from the `KUBE-SERVICES` chain.
- CoreDNS processes DNS queries based on the `ndots` setting (default: 5) in `/etc/resolv.conf`.
- For non-FQDN queries, search domains are appended in order, so a lookup for `my-service` can trigger up to 6 queries.
- Reduce the query count by lowering `ndots` or by using FQDNs such as `my-service.namespace.svc.cluster.local` (see the dnsConfig sketch after this list).
- CoreDNS uses the `cache` plugin for TTL-based caching, reducing response time to milliseconds. The `autopath` plugin optimizes the search order.
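A minimal sketch of lowering `ndots` for a single Pod via `dnsConfig`; the Pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-tuned-pod          # hypothetical name
spec:
  containers:
    - name: app
      image: nginx
  dnsConfig:
    options:
      - name: ndots
        value: "2"             # fewer search-domain expansions for non-FQDN lookups
```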
- Calico uses BGP to advertise Pod CIDR information between nodes.
- Full-mesh mode: All nodes peer with each other, creating N(N-1)/2 connections, which causes scalability issues beyond roughly 100 nodes.
- Route Reflector mode: A central RR node manages routing information, requiring only N connections.
- RR nodes should be redundant to avoid a SPOF. Use Kubernetes nodes as RRs or deploy dedicated RR instances (a peering sketch follows this list).
- Check BGP peer status with `calicoctl node status`.
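A minimal Route Reflector peering sketch for Calico, assuming RR nodes are labeled `route-reflector=true` and have a `routeReflectorClusterID` set on their Calico Node resource:

```yaml
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: peer-to-route-reflectors               # hypothetical name
spec:
  nodeSelector: all()                          # every node peers with...
  peerSelector: route-reflector == 'true'      # ...the nodes labeled as route reflectors
```

In this setup the default node-to-node full mesh is usually disabled via the default `BGPConfiguration` (`nodeToNodeMeshEnabled: false`).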
- Generate CA certificates with Cert-Manager
- Mount TLS Secrets to each Pod
- Implement TLS handshake in the application
Use NetworkPolicy to allow only specific labeled Pods to communicate, and restrict privileges with PSA (Pod Security Admission).
Service Meshes like Linkerd/Istio automate this without application code changes, providing traffic encryption, authentication, authorization, and Observability in one place.
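As one concrete example of the NetworkPolicy restriction mentioned above, a minimal sketch that only allows Pods labeled `app: frontend` to reach Pods labeled `app: backend`; names and labels are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-only        # hypothetical name
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: backend                 # policy applies to backend Pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend        # only frontend Pods may connect
      ports:
        - protocol: TCP
          port: 8080
```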
- Ingress Controllers watch Ingress resources and auto-generate reverse proxy configurations.
- NGINX Ingress: Most mature and stable; annotation-based configuration; global settings via ConfigMap.
- Traefik: Dynamic configuration; automatic SSL (Let's Encrypt); middleware chaining; Kubernetes CRD support.
- Istio Gateway: Integrated with Service Mesh; L7 routing; traffic splitting (Canary); built-in mTLS and Observability.
Selection guide: NGINX for simple L7 routing, Traefik for dynamic environments, Istio for advanced microservice features.
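For reference, a minimal NGINX Ingress sketch using annotation-based configuration; the host, backend Service name, and annotation value are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress                                   # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /     # NGINX-specific annotation
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web                             # hypothetical backend Service
                port:
                  number: 80
```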
| | Legacy (in-tree) | AWS Load Balancer Controller (out-of-tree) |
|---|---|---|
| Update cycle | Slow (tied to K8s core) | Independent, fast |
| LB support | Classic LB only | ALB + NLB native |
| Features | Basic | IP mode, TargetGroupBinding, WAF integration, Subnet Discovery, NLB client IP preservation |
When creating an ALB via Ingress, fine-grained control is possible through annotations. When creating an NLB via Service, you can choose Instance/IP type. IP type registers Pod IPs directly, eliminating node hops.
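A minimal sketch of an NLB Service in IP mode, assuming the AWS Load Balancer Controller (v2.x) is installed; the name, label, and ports are placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-nlb                                                           # hypothetical name
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"    # register Pod IPs directly, skipping node hops
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
spec:
  type: LoadBalancer
  selector:
    app: web                                                              # hypothetical label
  ports:
    - port: 80
      targetPort: 8080
```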
Q20. How do you verify external client source IP, and what is the difference between X-Forwarded-For and Proxy Protocol?
- AWS: `externalTrafficPolicy: Local` + NLB (Proxy Protocol v2) to preserve the client IP. ALB uses the `X-Forwarded-For` header.
- On-Premise: `externalTrafficPolicy: Local` + MetalLB (Layer 2/BGP mode).
| | X-Forwarded-For | Proxy Protocol |
|---|---|---|
| Layer | L7 (HTTP header) | L4 (TCP connection) |
| Operation | Application must parse | Binary header at connection start |
| Support | All HTTP proxies | NGINX, HAProxy, NLB |
| Performance | Slightly more overhead | Better performance |
When using `externalTrafficPolicy: Local`, configure anti-affinity to distribute Pods evenly across all nodes and avoid traffic imbalance, as sketched below.
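A minimal anti-affinity sketch (soft rule) that spreads replicas across nodes; the Deployment name, label, and image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                                  # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: web
                topologyKey: kubernetes.io/hostname   # prefer scheduling replicas on different nodes
      containers:
        - name: web
          image: nginx
```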
- External Traffic → Ingress Controller / LoadBalancer (AWS ELB, NLB, ALB)
- Service ClusterIP (Virtual IP, Endpoints management)
- kube-proxy (iptables `KUBE-SERVICES` chain, DNAT to Pod IP; IPVS mode uses a hash table)
- CNI Network (Calico/Flannel routing; same-node traffic uses a veth pair + ARP proxy)
- Overlay Network (cross-node traffic uses VXLAN/IPIP encapsulation over a tunneling interface)
- Pod Container Port reached
Debugging: kubectl logs, tcpdump, iptables-save, check Endpoints, validate NetworkPolicy.
Q21-1. What is the full process of CNI Plugin setting up Pod networking, and what is the role of IPAM?
CNI (Container Network Interface) is the standard interface between kubelet and network plugins.
- kubelet requests container creation via CRI (Container Runtime Interface)
- CRI creates a network namespace
- kubelet calls CNI Plugin (ADD command)
- IPAM (IP Address Management) Plugin assigns an available IP address
- CNI Plugin creates a veth pair (one end in Pod namespace, the other on the host)
- Sets IP on the Pod-side interface and adds default route
- Connects host-side interface to bridge or routing table
- CNI returns result (IP, Gateway, DNS) to kubelet
- host-local: Stores IP allocation in local files; simple but no cross-node synchronization.
- Calico IPAM: Distributed IP management via etcd; efficient allocation using the IP Pool concept (see the sketch after this list).
- Whereabouts: Multi-network IP management via etcd/Kubernetes API.
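A minimal Calico IPPool sketch illustrating the IP Pool concept; the CIDR, block size, and encapsulation settings are placeholder values:

```yaml
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-pool
spec:
  cidr: 192.168.0.0/16        # Pod CIDR managed by Calico IPAM
  blockSize: 26               # per-node allocation block (64 addresses)
  ipipMode: Never
  vxlanMode: CrossSubnet      # encapsulate only across subnets
  natOutgoing: true           # SNAT Pod traffic leaving the cluster
```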
Debugging commands:
```bash
# CNI logs
ls /var/log/pods/
ls /opt/cni/bin/
# IP allocation status (Calico)
calicoctl ipam show
# Network namespace inspection
ip netns list
nsenter -t <pid> -n ip addr
```
Service Meshes like Istio/Linkerd inject Envoy/linkerd-proxy as a sidecar to intercept all traffic.
- Mutating Admission Webhook modifies Pod Spec
- Init Container (`istio-init`) configures iptables rules
- Sidecar Proxy container is added
- Runs alongside the application container
```bash
# Outbound traffic interception
-A OUTPUT -p tcp -j ISTIO_OUTPUT
-A ISTIO_OUTPUT ! -d 127.0.0.1/32 -j ISTIO_REDIRECT
-A ISTIO_REDIRECT -p tcp -j REDIRECT --to-ports 15001
# Inbound traffic interception
-A PREROUTING -p tcp -j ISTIO_INBOUND
-A ISTIO_INBOUND -p tcp --dport 80 -j REDIRECT --to-ports 15006
```
- Outbound: Redirects the application's outbound traffic to Envoy's port 15001
- Inbound: Redirects incoming traffic to Envoy's port 15006
- Envoy applies mTLS, routing, load balancing, Retry, Circuit Breaker, then forwards to actual destination
- Envoy's own traffic is excluded to prevent infinite loops
- Prometheus metrics port (15090) excluded
- Use the `traffic.sidecar.istio.io/excludeOutboundPorts` annotation to exclude specific ports, as shown below
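A minimal sketch of that annotation on a Pod; the Pod name, image, and port list are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-exclusions                                       # hypothetical name
  annotations:
    traffic.sidecar.istio.io/excludeOutboundPorts: "3306,6379"    # these outbound ports bypass the sidecar
spec:
  containers:
    - name: app
      image: nginx
```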
MTU (Maximum Transmission Unit) is the maximum packet size that can be transmitted at once.
- Standard Ethernet MTU: 1500 bytes
- VXLAN Overlay overhead: 50 bytes (VXLAN header 8 + Outer IP 20 + Outer UDP 8 + Outer Ethernet 14)
- Effective Pod MTU: 1450 bytes
- Sending 1500-byte packets causes fragmentation
- Performance degradation during large data transfers
- TCP connections dropping mid-session
- Communication failure when PMTUD (Path MTU Discovery) fails
Solutions:
1. Auto-set Pod MTU:
```yaml
# Calico CNI configuration
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  mtuIfacePattern: "^((en|wl|ww|sl|ib)[opsx].*|(eth|wlan|wwan).*)"
  vxlanMTU: 1450
```
2. CNI auto-detection:
- Calico: `FELIX_IPINIPMTU`, `FELIX_VXLANMTU` environment variables
- Cilium: Auto-calculated based on the `tunnel-protocol` setting
- AWS VPC CNI: Auto-set based on ENI MTU (Jumbo Frames up to 9001)
3. TCP MSS Clamping:
```bash
iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
```
4. Enable Jumbo Frames on the physical interface:
```bash
ip link set dev eth0 mtu 9000
```
Verification:
```bash
# Check MTU inside Pod
kubectl exec -it <pod> -- ip link show eth0
# PMTUD test (ping with Don't Fragment flag)
ping -M do -s 1472 <destination>   # OK if MTU is 1500
ping -M do -s 1422 <destination>   # For VXLAN environments
```
NodePort Service allows external access via a specific port on all nodes.
- External client requests `Node1:30080`
- Node1's kube-proxy randomly selects Pod B on Node2
- SNAT occurs: client IP → Node1 IP
- Pod B perceives Node1 as the client (original IP lost)
- Response also returns through Node1 (extra hop)
- If Pod B responds directly to the client IP, the client sent the request to Node1 but receives a response from Node2, breaking the connection (asymmetric routing).
- SNAT maintains Node1 IP to guarantee the response path.
Session Affinity configuration:
```yaml
apiVersion: v1
kind: Service
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
```
Issue: Due to SNAT, all requests appear to come from the node IP. Multiple clients behind the same node are therefore all routed to the same Pod (imbalance).

Solutions:
1. Use `externalTrafficPolicy: Local`:
```yaml
spec:
  type: NodePort
  externalTrafficPolicy: Local
```
- Pros: Preserves client IP, Session Affinity works correctly, fewer hops
- Cons: Requests to nodes without Pods fail; possible imbalance
2. LoadBalancer + Proxy Protocol:
```yaml
service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
```
3. Use an Ingress Controller:
- Forward the original IP via the `X-Forwarded-For` header at L7
- Ingress manages Session Affinity (cookie-based)
Debugging:
```bash
# Check conntrack table
conntrack -L | grep <service-ip>
# Check iptables SNAT rules
iptables -t nat -L KUBE-POSTROUTING -n -v
# Watch endpoint changes
kubectl get endpoints <service> --watch
```
Q21-5. How do you configure a Dual-Stack (IPv4/IPv6) Kubernetes cluster, and what should you consider?
Dual-Stack supports both IPv4 and IPv6 simultaneously (GA in K8s 1.23+).
```bash
# kube-apiserver flags
--service-cluster-ip-range=10.96.0.0/12,fd00:1234::/112
--feature-gates=IPv6DualStack=true

# kube-controller-manager flags
--cluster-cidr=10.244.0.0/16,fd00:5678::/104
--service-cluster-ip-range=10.96.0.0/12,fd00:1234::/112
--node-cidr-mask-size-ipv4=24
--node-cidr-mask-size-ipv6=120
```
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dual-stack-pod
spec:
  containers:
    - name: app
      image: nginx
status:
  podIPs:
    - ip: 10.244.1.5       # IPv4
    - ip: fd00:5678::5     # IPv6
```
```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  ipFamilyPolicy: PreferDualStack    # SingleStack | PreferDualStack | RequireDualStack
  ipFamilies:
    - IPv4
    - IPv6
  clusterIPs:
    - 10.96.100.200        # Primary (IPv4)
    - fd00:1234::200       # Secondary (IPv6)
```
- Calico: Full Dual-Stack support, IPv6 BGP peering
- Cilium: Native IPv6 support, high performance with eBPF
- Flannel: Limited support (VXLAN mode only)
1. DNS Resolution:
```bash
# CoreDNS auto-generates AAAA records
my-service.default.svc.cluster.local.   # Returns A + AAAA
```
2. Application Compatibility:
- `0.0.0.0:8080` → `[::]:8080` (IPv6 binding)
- Go: `net.Listen("tcp", ":8080")` supports Dual-Stack automatically
- Python: `socket.AF_INET6` + `IPV6_V6ONLY=0`
3. NetworkPolicy:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
spec:
  ingress:
    - from:
        - ipBlock:
            cidr: 10.0.0.0/8   # IPv4
        - ipBlock:
            cidr: fd00::/8     # IPv6
```
4. Cloud Provider Constraints:
- AWS: Requires VPC IPv6 CIDR; ELB supports Dual-Stack NLB only
- GCP: Dual-Stack GKE in beta; requires additional configuration
- Azure: AKS Dual-Stack in preview
- Configure IPv4 Single-Stack cluster
- Gradually add IPv6 with `ipFamilyPolicy: PreferDualStack`
- Validate IPv6 compatibility per application
- Enforce with `ipFamilyPolicy: RequireDualStack`
- Switch the primary IP family to IPv6 (change the `ipFamilies` order)
```bash
# Check Pod IPv6 address
kubectl get pod <pod> -o jsonpath='{.status.podIPs}'
# Test IPv6 connectivity
kubectl exec -it <pod> -- curl -6 http://[fd00:1234::200]:80
# Check CNI IPv6 routing
ip -6 route show
```