Kubernetes: Real‐World Examples - pcont/aws_sample GitHub Wiki

Kubernetes (K8s) for DevOps Architects: Real-World Examples

This page pairs each Kubernetes component with a real-world example that shows how it functions in production environments.

Control Plane Components in Action

API Server

Real-world example: When you run kubectl apply -f deployment.yaml, the kubectl command sends a REST request to the API Server, which validates your YAML manifest before storing it in etcd. During a Black Friday sale, when your team needs to scale up services quickly, all those scaling commands are processed through this component.

etcd

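Real-world example: In a multinational e-commerce platform, etcd stores critical information like which Pods are running on which nodes, what Services exist, and their configurations. If your team updates a Deployment from 5 to 10 replicas, the API Server persists the new desired state in etcd, so it survives restarts of the API Server, Scheduler, or controllers. Because etcd is itself part of the control plane, production clusters typically run it as a replicated cluster (commonly three or five members) and back it up regularly.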

Scheduler

Real-world example: For a machine learning platform, when data scientists submit new model training jobs as Pods with GPU requirements, the Scheduler identifies nodes with available GPUs and assigns the Pods accordingly, considering resource constraints and affinity rules.
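
A GPU training Pod of this kind could be sketched as follows. This is a minimal illustration, not a production manifest: the Pod name, image, and the "accelerator" node label are hypothetical, and the nvidia.com/gpu resource only exists on nodes where the NVIDIA device plugin is installed.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-training                 # hypothetical name
spec:
  restartPolicy: Never                 # a training run should not restart in place
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: accelerator       # hypothetical node label
                operator: In
                values: ["nvidia-a100"]
  containers:
    - name: trainer
      image: registry.example.com/ml/trainer:1.0   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1            # extended resource advertised by the NVIDIA device plugin
          memory: 16Gi
        requests:
          cpu: "4"
          memory: 16Gi
```

The Scheduler only considers nodes that both match the affinity rule and still have an unallocated GPU; if none exist, the Pod stays Pending.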

Controller Manager

Real-world example: In a banking application, if a node running payment processing Pods suddenly fails, the Node Controller (part of the Controller Manager) notices the missed heartbeats and marks the node NotReady. Once the affected Pods are evicted, the ReplicaSet controller behind each Deployment creates replacement Pods, and the Scheduler places them on healthy nodes to restore the desired state of the payment services.

Cloud Controller Manager

Real-world example: In an AWS-hosted Kubernetes cluster, when you create a LoadBalancer Service for your customer-facing API, the Cloud Controller Manager provisions an AWS Elastic Load Balancer automatically and configures it to route traffic to your Service.
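
A LoadBalancer Service like the one described might look like this. The Service name, label, and port numbers are assumptions for illustration; the kind of load balancer actually created depends on the cloud provider integration installed in the cluster.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: customer-api          # hypothetical name
spec:
  type: LoadBalancer          # asks the cloud integration to provision an external load balancer
  selector:
    app: customer-api         # must match the labels on the backing Pods
  ports:
    - name: https
      protocol: TCP
      port: 443               # port exposed by the load balancer
      targetPort: 8443        # port the containers listen on (assumed)
```

Once the load balancer is ready, its address appears in the Service's status and in the EXTERNAL-IP column of kubectl get service.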

Node Components in Practice

Kubelet

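Real-world example: During a rolling update of a Netflix-like streaming service, the Kubelet on each node watches the API Server for Pods assigned to it, starts the new-version Pods, and terminates the old ones. For each terminating Pod it runs any preStop hook, sends SIGTERM, and waits up to the Pod's termination grace period (30 seconds by default) before force-killing the containers. In-flight requests complete only if the application drains them within that window.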
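
The shutdown behaviour the Kubelet enforces is configured on the Pod itself. The sketch below assumes a hypothetical image that contains a shell and a sleep binary; the 60-second grace period and 10-second preStop delay are illustrative values, not recommendations.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: streaming-api                  # hypothetical name
spec:
  terminationGracePeriodSeconds: 60    # total time allowed for preStop plus SIGTERM handling
  containers:
    - name: app
      image: registry.example.com/streaming/api:2.0   # placeholder image
      lifecycle:
        preStop:
          exec:
            # Short pause so load balancers and kube-proxy can stop
            # routing new requests before the process receives SIGTERM.
            command: ["sh", "-c", "sleep 10"]
```

The application still has to handle SIGTERM and finish its open requests; the manifest only gives it time to do so.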

Kube-proxy

Real-world example: In a microservices architecture, when your "user-profile" service needs to communicate with the "payment-history" service, kube-proxy maintains the network rules that allow this internal communication using either iptables or IPVS on each node.

Container Runtime

Real-world example: In a CI/CD pipeline environment, after developers push new code, the Kubelet asks containerd (a common container runtime) through the Container Runtime Interface to pull the newly built container images from your private registry and run them within Pods according to defined resource limits.

Kubernetes Objects in Production

Pod

Real-world example: At Spotify, individual microservices like the "playlist-manager" might run as Pods, with the main application container paired with a sidecar container that handles metrics collection for observability.

Service

Real-world example: In a SaaS application, the "authentication" Service provides a stable virtual IP and DNS name (of the form authentication.<namespace>.svc.cluster.local) that other services can reliably call, even as the underlying authentication Pods scale up and down or get redeployed during updates.

Volume

Real-world example: For a media processing application, a temporary Volume might be mounted to multiple containers in a Pod - one container downloads media files, another processes them, and a third uploads the processed files to cloud storage.
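
A shared scratch volume of this kind is usually an emptyDir. The sketch below uses hypothetical image names and an assumed 10Gi size limit; the volume is created when the Pod starts and deleted when the Pod is removed.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: media-pipeline                 # hypothetical name
spec:
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: 10Gi                # assumed limit; the Pod is evicted if usage exceeds it
  containers:
    - name: downloader
      image: registry.example.com/media/downloader:1.0   # placeholder image
      volumeMounts:
        - name: scratch
          mountPath: /work
    - name: processor
      image: registry.example.com/media/processor:1.0    # placeholder image
      volumeMounts:
        - name: scratch
          mountPath: /work
    - name: uploader
      image: registry.example.com/media/uploader:1.0     # placeholder image
      volumeMounts:
        - name: scratch
          mountPath: /work
```

All three containers see the same /work directory, so coordination between them (for example, renaming a file once it is fully written) is left to the applications.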

Namespace

Real-world example: A financial services company might create separate Namespaces for "trading", "reporting", and "customer-portal" teams, each with their own resource quotas and access permissions to maintain separation of concerns.
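
One of those team Namespaces and its quota could be declared as below. The quota figures are invented for illustration and would normally come from capacity planning.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: trading
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: trading-quota          # hypothetical name
  namespace: trading
spec:
  hard:
    requests.cpu: "20"         # sum of CPU requests across all Pods in the Namespace
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "100"                # maximum number of Pods
```

With compute quotas in place, Pods that omit the corresponding requests or limits are rejected, so teams often pair the quota with a LimitRange that supplies defaults.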

Deployment

Real-world example: For an e-commerce website, the frontend application runs as a Deployment with 10 replicas across multiple nodes. During a new feature release, DevOps engineers perform a rolling update with kubectl set image deployment/frontend frontend=<image>:v2.1.3 (the left-hand "frontend" is the container name and <image> is the image repository), progressively replacing old Pods with new ones.
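
A Deployment with an explicit rolling-update policy might be written like this. The image repository, port, health-check path, and surge settings are assumptions chosen for the sketch.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2              # up to 2 extra Pods may exist during the rollout
      maxUnavailable: 0        # never drop below 10 ready Pods
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: registry.example.com/shop/frontend:v2.1.3   # placeholder repository
          ports:
            - containerPort: 8080
          readinessProbe:      # a new Pod counts as available only after this succeeds
            httpGet:
              path: /healthz   # assumed health endpoint
              port: 8080
```

If the new version misbehaves, kubectl rollout undo deployment/frontend returns the Deployment to its previous revision.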

StatefulSet

Real-world example: A MongoDB replica set in production would be deployed as a StatefulSet named "mongodb" with 3 replicas, ensuring each MongoDB instance gets a predictable name (mongodb-0, mongodb-1, mongodb-2) and persistent storage that follows the Pod if it's rescheduled to another node.
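
A stripped-down version of such a StatefulSet is sketched below. It assumes the official mongo image, a replica set named rs0, and 100Gi volumes from the cluster's default StorageClass; authentication, resource limits, and replica set initialization (rs.initiate) are deliberately left out.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mongodb
spec:
  clusterIP: None              # headless Service: gives each Pod its own DNS record
  selector:
    app: mongodb
  ports:
    - port: 27017
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  serviceName: mongodb         # must reference the headless Service above
  replicas: 3                  # creates mongodb-0, mongodb-1, mongodb-2 in order
  selector:
    matchLabels:
      app: mongodb
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      containers:
        - name: mongod
          image: mongo:7.0
          args: ["--replSet", "rs0", "--bind_ip_all"]
          ports:
            - containerPort: 27017
          volumeMounts:
            - name: data
              mountPath: /data/db
  volumeClaimTemplates:        # one PVC per Pod: data-mongodb-0, data-mongodb-1, ...
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```

Each Pod is reachable at a stable name such as mongodb-0.mongodb.<namespace>.svc.cluster.local, and its PVC is reattached when the Pod is recreated. In practice many teams delegate the remaining setup to an operator.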

DaemonSet

Real-world example: Datadog's monitoring agent runs as a DaemonSet, ensuring every node in your cluster has exactly one monitoring Pod that collects metrics, logs, and traces from all containers running on that node.
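
The general shape of a per-node agent is shown below. This is a generic sketch with a hypothetical image and namespace, not Datadog's actual manifest, which is normally installed through the vendor's Helm chart or operator.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent             # hypothetical name
  namespace: monitoring        # assumed to exist already
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      tolerations:
        # Also run on control plane nodes, which are tainted by default.
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: agent
          image: registry.example.com/monitoring/agent:1.0   # placeholder image
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 256Mi
```

When a node joins the cluster, the DaemonSet controller creates a Pod for it automatically; when the node is removed, that Pod goes with it.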

Job/CronJob

Real-world example: A retail company might use a CronJob to run inventory reconciliation at midnight, while another Job might be triggered after a product import to regenerate search indexes.
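
The midnight reconciliation could be expressed as the CronJob below. The name and image are hypothetical, and the schedule is interpreted in the controller's time zone unless spec.timeZone is set on a cluster version that supports it.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: inventory-reconciliation       # hypothetical name
spec:
  schedule: "0 0 * * *"                # every day at 00:00
  concurrencyPolicy: Forbid            # skip a run if the previous one is still going
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 2                  # retry a failed run at most twice
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: reconcile
              image: registry.example.com/retail/reconcile:1.0   # placeholder image
```

The one-off search reindex in the same example would be a plain Job, created by whatever system handles the product import.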

Ingress

Real-world example: A media company uses an NGINX Ingress Controller to route traffic based on path and hostname: requests to api.example.com go to the API service, while web.example.com routes to the frontend service, with TLS terminated at the Ingress controller using certificates stored in Kubernetes Secrets.
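
The host-based routing described here might be declared as follows. The Service names, ports, and TLS Secret name are assumptions, and the ingressClassName has to match the class registered by the installed controller.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: media-routes                   # hypothetical name
spec:
  ingressClassName: nginx              # assumed class name of the NGINX Ingress Controller
  tls:
    - hosts:
        - api.example.com
        - web.example.com
      secretName: example-com-tls      # hypothetical kubernetes.io/tls Secret
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api              # assumed Service name
                port:
                  number: 80
    - host: web.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend         # assumed Service name
                port:
                  number: 80
```

Certificate issuance and renewal are not part of the Ingress itself; a separate tool such as cert-manager typically keeps the TLS Secret up to date.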

Networking Solutions

Real-world example: A large financial institution might choose Calico as their CNI plugin because it supports network policies for security isolation between banking, investment, and insurance services running on the same cluster.

Storage in Action

Persistent Volumes (PV)

Real-world example: In a medical imaging application, a 500GB PV provisioned on high-performance AWS EBS volumes stores scan data that must persist even when the processing Pods are restarted or rescheduled.

Persistent Volume Claims (PVC)

Real-world example: A content management system's database might use a PVC requesting 100GB of storage with specific performance characteristics, which gets bound to an appropriately sized PV by the cluster.

Storage Classes

Real-world example: An enterprise might define Storage Classes like "fast-ssd" (using NVMe drives) for databases, "standard-hdd" for backups, and "replicated-storage" for critical data, allowing teams to choose the appropriate storage type for their workloads.
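
A StorageClass and a claim that uses it can be sketched together. This version assumes an AWS cluster with the EBS CSI driver installed and maps "fast-ssd" to gp3 volumes, which is a stand-in for the NVMe-backed class in the example; the claim name and size are illustrative.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com           # AWS EBS CSI driver (assumed to be installed)
parameters:
  type: gp3
reclaimPolicy: Retain                  # keep the volume when the claim is deleted
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer   # provision in the zone where the Pod is scheduled
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cms-db-data                    # hypothetical name
spec:
  storageClassName: fast-ssd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
```

With dynamic provisioning, the matching PersistentVolume is created on demand, so nobody has to pre-create a 100Gi PV for the claim to bind.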

Security Implementations

Authentication

Real-world example: A healthcare organization connects Kubernetes to their existing Active Directory through an OIDC-capable identity provider (for example AD FS, Microsoft Entra ID, or Dex), allowing developers to authenticate to the cluster using their corporate credentials.

Authorization (RBAC)

Real-world example: In a multi-tenant platform, the Platform team creates specific Roles like "developer", "operator", and "auditor" with different levels of permissions (for example, read-only access for auditors), then assigns these Roles to users and groups through RoleBindings.
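
As an illustration, a read-only "auditor" Role and its binding could look like this. The namespace, user name, and resource list are hypothetical; the user name must match whatever identity the cluster's authentication layer reports.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: auditor
  namespace: tenant-a                  # hypothetical tenant Namespace
rules:
  - apiGroups: ["", "apps"]            # "" is the core API group
    resources: ["pods", "services", "deployments"]
    verbs: ["get", "list", "watch"]    # read-only access
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: auditor-binding
  namespace: tenant-a
subjects:
  - kind: User
    name: jane@example.com             # hypothetical user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: auditor
  apiGroup: rbac.authorization.k8s.io
```

A quick check is kubectl auth can-i delete pods --as jane@example.com -n tenant-a, which should answer "no" for this Role.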

Admission Control

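Real-world example: A regulated industry uses admission control to enforce that all Pods run as non-root users and cannot mount the host filesystem, reducing the risk of a container breakout. Older clusters did this with the PodSecurityPolicy admission controller, which was deprecated in Kubernetes 1.21 and removed in 1.25; current clusters use the built-in Pod Security Admission controller or a policy engine such as OPA Gatekeeper or Kyverno.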
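
On clusters where PodSecurityPolicy is no longer available (it was removed in Kubernetes 1.25), the same intent can be expressed with the built-in Pod Security Admission labels. The namespace name below is hypothetical.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: regulated-apps                 # hypothetical name
  labels:
    # Reject Pods that violate the "restricted" Pod Security Standard,
    # which requires running as non-root and disallows hostPath volumes.
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # Additionally surface violations as client warnings and audit annotations.
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

Teams often start with only the warn and audit labels to see which workloads would be rejected, then add enforce once those workloads are fixed.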

Network Policies

Real-world example: In a payment card processing environment, Network Policies ensure that only the authorized "payment-processor" Pods can communicate with the "card-vault" Pods, and only on specific ports.
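
A policy with that effect might be written as below. The namespace, labels, and port are assumptions, and the policy is only enforced if the cluster's CNI plugin implements NetworkPolicy.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: card-vault-ingress             # hypothetical name
  namespace: payments                  # hypothetical Namespace
spec:
  podSelector:
    matchLabels:
      app: card-vault                  # the Pods being protected
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: payment-processor   # the only allowed callers (same Namespace)
      ports:
        - protocol: TCP
          port: 8443                   # assumed vault port
```

Because the policy selects the card-vault Pods for ingress, any inbound traffic that does not match the rule is dropped.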

Secret Management

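Real-world example: A B2B SaaS application stores API keys, database credentials, and encryption keys as Kubernetes Secrets, which are then exposed to the appropriate Pods as environment variables or mounted as files. Secret values are only base64-encoded by default, so production clusters usually also enable encryption at rest for etcd and restrict access to Secrets with RBAC.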
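
Both ways of consuming a Secret can be shown in one sketch. All names and values here are placeholders; real credentials should not be committed to Git in this form.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials                 # hypothetical name
type: Opaque
stringData:                            # plain text here; stored base64-encoded by the API Server
  username: app_user
  password: change-me
---
apiVersion: v1
kind: Pod
metadata:
  name: billing-api                    # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/saas/billing-api:1.0   # placeholder image
      env:
        - name: DB_PASSWORD            # injected as an environment variable
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: password
      volumeMounts:
        - name: db-creds               # also mounted as files under /etc/db-creds
          mountPath: /etc/db-creds
          readOnly: true
  volumes:
    - name: db-creds
      secret:
        secretName: db-credentials
```

Mounted files are refreshed when the Secret changes, whereas environment variables keep their original value until the Pod is restarted.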

Advanced Implementations

Service Mesh

Real-world example: Lyft uses Envoy (the basis for many service meshes) to handle inter-service communication, providing circuit breaking, rate limiting, and observability without changing application code.

GitOps

Real-world example: Weaveworks, the company that created Flux, managed its own infrastructure using GitOps principles - infrastructure changes had to be committed to Git, and automated controllers reconciled the cluster state with the Git repository state.

Operators

Real-world example: The Prometheus Operator automates the deployment and management of Prometheus monitoring instances, handling details like configuration, persistent storage, and high availability setups in a Kubernetes-native way.

flowchart TD
    subgraph "Control Plane Components"
        api[API Server\nProcesses kubectl commands] --> etcd[etcd\nStores cluster state]
        api --> scheduler[Scheduler\nAssigns ML training jobs to GPU nodes]
        api --> cm[Controller Manager\nRestores payment Pods after node failure]
        api --> ccm[Cloud Controller Manager\nProvisions AWS ELB for Services]
    end
    
    subgraph "Worker Node 1"
        kubelet1[Kubelet\nManages container lifecycle] --> container1[Container Runtime\nRuns containerd/Docker]
        kp1[Kube Proxy\nManages iptables rules] --> container1
        container1 --> pod11[Pod: Payment Service\nMain + Sidecar containers]
        container1 --> pod12[Pod: User Profile API\nClaims 2 CPU, 4GB RAM]
    end
    
    subgraph "Worker Node 2"
        kubelet2[Kubelet\nEnforces resource limits] --> container2[Container Runtime\nPulls images from registry]
        kp2[Kube Proxy\nEnables Service discovery] --> container2
        container2 --> pod21[Pod: MongoDB-0\nStatefulSet member]
        container2 --> pod22[Pod: Datadog Agent\nFrom DaemonSet]
    end
    
    api <--> kubelet1
    api <--> kubelet2
    api <--> kp1
    api <--> kp2
    
    User[DevOps Engineer] --> api
    
    subgraph "Add-ons and Edge Components"
        dns[CoreDNS\nResolves service.namespace.svc.cluster.local]
        ingress[NGINX Ingress Controller\nRoutes traffic by hostname]
        lb[AWS Load Balancer\nProvided by Cloud Controller]
    end
    
    api --> dns
    api --> ingress
    External[External Traffic\nCustomer requests] --> lb
    lb --> ingress
    
    subgraph "Storage & Persistence"
        sc[Storage Classes\nfast-ssd, standard-hdd]
        pv[Persistent Volumes\n500GB EBS volume]
        pvc[PVC\nRequested by MongoDB StatefulSet]
    end
    
    api --> sc
    sc --> pv
    pv --> pvc
    pvc --> pod21
    
    subgraph "Real Business Workloads"
        dep[Deployment: E-commerce Frontend\n10 replicas with rolling updates]
        ss[StatefulSet: MongoDB Cluster\n3 ordered replicas]
        ds[DaemonSet: Logging Agent\nOne per node]
        cj[CronJob: Nightly Backup\nRuns at 2 AM]
    end
    
    api --> dep
    api --> ss
    api --> ds
    api --> cj

DevOps Best Practices with Examples

Infrastructure as Code

Real-world example: At Monzo Bank, the entire Kubernetes infrastructure is defined in Terraform modules and versioned in Git. When they need to create a new environment, they simply apply the same code with different variables.

CI/CD Integration

Real-world example: At Shopify, when developers merge code to the main branch, their CI pipeline automatically builds container images, runs security scans, updates Kubernetes manifests with the new image tag, and applies the changes to a staging cluster before promoting to production.

Monitoring and Observability

Real-world example: A ride-sharing company uses Prometheus to scrape metrics from all services, Grafana dashboards to visualize performance, and Jaeger to trace requests as they flow from the mobile app through the backend services to the driver matching algorithm.

Disaster Recovery

Real-world example: A Netflix-scale streaming service might regularly test its disaster recovery procedures by using a tool like Velero to back up cluster resources and persistent volume data and restore them into a cluster in a different region, ensuring it can recover from region-wide outages.

Resource Management

Real-world example: An AI company sets memory requests and limits for their model training Pods based on profiling data, and implements horizontal pod autoscaling to handle variable loads, optimizing cluster resource utilization.

Multi-environment Strategy

Real-world example: Zalando maintains separate Kubernetes clusters for development, staging, and production, but uses the same Helm charts with environment-specific values to ensure consistency between environments.

Real-world Challenges and Solutions

Complexity

Real-world example: Airbnb initially struggled with Kubernetes complexity, so they started with a managed EKS service and gradually built expertise before adding custom components and optimizations.

Networking

Real-world example: A global gaming company with strict latency requirements chose Cilium as their CNI for its eBPF-based performance optimizations and integrated service mesh capabilities.

Stateful Applications

Real-world example: Shopify runs MySQL databases on Kubernetes using Vitess Operator, which handles sharding, connection pooling, and failover, demonstrating that even complex stateful workloads can thrive in Kubernetes with the right architecture.

Scalability

Real-world example: During Black Friday, an e-commerce platform uses Horizontal Pod Autoscalers based on custom metrics (order queue length) to scale checkout services independently from product catalog services, handling 20x normal traffic efficiently.
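
An autoscaler driven by queue length could be sketched as below. The metric name, target value, and replica bounds are hypothetical, and the metric is only available if an external metrics adapter (for example one backed by Prometheus) is installed.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout                       # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout                     # scales only the checkout Deployment
  minReplicas: 5
  maxReplicas: 100
  metrics:
    - type: External
      external:
        metric:
          name: order_queue_length     # hypothetical metric served by a metrics adapter
        target:
          type: AverageValue
          averageValue: "30"           # aim for about 30 queued orders per Pod
```

The product catalog would get its own HorizontalPodAutoscaler with a different metric, which is what lets the two services scale independently.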

Security

Real-world example: A cryptocurrency exchange uses Gatekeeper (OPA) to enforce security policies across all deployments, automatically rejecting any Pod that tries to run as root or mount sensitive host paths, while scanning all images for vulnerabilities before deployment.
