
Kubernetes Advanced Scheduling

Advanced Scheduling (Q46-Q50)


Q46. What are the practical strategies for applying Pod Priority and Preemption?

Pod Priority expresses the relative importance of a Pod: higher-priority Pods are scheduled first, and when resources are insufficient the scheduler can preempt (evict) lower-priority Pods to make room for them.

PriorityClass Definition:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical system pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: false
preemptionPolicy: Never   # Disable preemption
description: "Batch jobs"

Priority Strategy:

Class | Value | Use Case
----- | ----- | --------
system-critical | 100,000,000 | kube-system components
production-high | 1,000,000 | Core services
production-normal | 100,000 | General services
best-effort | 0 | Batch jobs

Pod Configuration:

apiVersion: v1
kind: Pod
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: my-app

Preemption Behavior:

  • When a higher-priority Pod cannot be scheduled, the scheduler may preempt (evict) lower-priority Pods to make room for it.
  • Preemption considers PodDisruptionBudgets (PDBs) on a best-effort basis; a PDB can still be violated if no other victims free enough resources.
  • Use preemptionPolicy: Never for Pods that should keep their scheduling priority but never evict others.

Practical Considerations:

  • Combine with ResourceQuota to ensure fairness
  • Monitor eviction events to detect priority misconfiguration
  • Set preemptionPolicy: Never for batch jobs to prevent disrupting services
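
Building on the ResourceQuota point above: a quota can be scoped to a PriorityClass so that high-priority workloads cannot consume the entire namespace budget. A minimal sketch, assuming a production namespace and illustrative limits:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: high-priority-quota       # hypothetical name
  namespace: production           # hypothetical namespace
spec:
  hard:
    cpu: "20"
    memory: 40Gi
    pods: "50"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["high-priority"]   # only Pods using this priorityClassName count against the quota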

Q47. How do you choose between Topology Spread Constraints and Pod Anti-Affinity?

Both mechanisms control Pod distribution, but with different approaches.

Pod Anti-Affinity:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:  # Hard constraint
    - labelSelector:
        matchLabels:
          app: database
      topologyKey: kubernetes.io/hostname
    preferredDuringSchedulingIgnoredDuringExecution:  # Soft constraint
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: web
        topologyKey: topology.kubernetes.io/zone
  • Required rules are binary (schedulable or not schedulable); preferred rules only add scheduling weight
  • Strict separation between specific Pods
  • Use case: DB Primary/Replica must not be on the same node

Topology Spread Constraints:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule   # or ScheduleAnyway
  labelSelector:
    matchLabels:
      app: web
- maxSkew: 2
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: web
  • Distributes Pods evenly across topology domains (nodes/AZs/regions)
  • maxSkew: maximum allowed imbalance between domains
  • whenUnsatisfiable: DoNotSchedule = hard constraint; ScheduleAnyway = soft constraint

Selection Guide:

Scenario | Recommendation
-------- | --------------
DB Primary/Replica must be on different nodes | Pod Anti-Affinity (required)
Distribute stateless apps evenly | Topology Spread Constraints
Multi-AZ high availability | Topology Spread with maxSkew=1
Best effort distribution | Topology Spread with ScheduleAnyway
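
The two approaches can also be combined in one workload. Below is a sketch of a Deployment (name, image, and replica count are illustrative) that spreads replicas evenly across zones while forbidding two replicas on the same node:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:       # hard: keep zones within a skew of 1
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      affinity:
        podAntiAffinity:               # hard: never two replicas on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: my-web-app              # hypothetical image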

Q48. What are the differences between Taint Effects (NoSchedule/PreferNoSchedule/NoExecute) and how are they used?

Taints are set on nodes to restrict which Pods can be scheduled there.

Effect Types:

NoSchedule:

# Set taint on GPU node
kubectl taint nodes gpu-node nvidia.com/gpu=true:NoSchedule
  • Pods without a matching toleration cannot be scheduled on the node
  • Existing Pods are unaffected
  • Use case: Restrict new workloads to specific hardware

PreferNoSchedule:

kubectl taint nodes node-1 spot-instance=true:PreferNoSchedule
  • The scheduler tries to avoid the node, but still schedules Pods there if no untainted node fits
  • Soft constraint for lower-priority workloads

NoExecute:

# Maintenance mode: evicts existing Pods that do not tolerate the taint
kubectl taint nodes node-1 maintenance=true:NoExecute
  • Evicts existing Pods immediately unless they tolerate the taint
  • A grace period is granted per Pod via tolerationSeconds in its toleration, not on the taint itself (see the toleration example below)
  • Use case: Node maintenance, node failure

Toleration Configuration:

tolerations:
# Exact match
- key: "nvidia.com/gpu"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"

# Tolerate any value
- key: "spot-instance"
  operator: "Exists"
  effect: "PreferNoSchedule"

# Tolerate with grace period
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300

Practical Use Cases:

Scenario | Taint Effect
-------- | ------------
Spot/Preemptible instances | NoSchedule
Failed/maintenance nodes | NoExecute
Special hardware (GPU/ARM) | NoSchedule + Toleration
Gradual migration | PreferNoSchedule
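
To audit or roll back taints, the following commands work (node name is illustrative):

# List taints on every node
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Remove a taint (note the trailing hyphen)
kubectl taint nodes node-1 maintenance=true:NoExecute-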

Q49. What are the use cases for Scheduler Profiles and Multiple Schedulers?

Kubernetes supports custom schedulers in addition to the default scheduler.

Scheduler Profiles:

A single scheduler instance can serve multiple profiles, each with different scheduling logic via plugin combinations.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
      - name: PodTopologySpread
- schedulerName: no-scoring-scheduler
  plugins:
    score:
      disabled:
      - name: '*'   # Skip all scoring — first-fit strategy
- schedulerName: gpu-scheduler
  plugins:
    filter:
      enabled:
      - name: NodeResourcesFit
    score:
      enabled:
      - name: NodeResourcesFit
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated   # Bin-packing for GPU nodes
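
Workloads select a profile simply by naming it in schedulerName. For example, a GPU Pod could target the gpu-scheduler profile defined above (Pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  schedulerName: gpu-scheduler      # must match a profile's schedulerName
  containers:
  - name: trainer
    image: my-training-image        # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1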

Multiple Schedulers:

# Deploy a custom scheduler
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-scheduler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: custom-scheduler
  template:
    metadata:
      labels:
        component: custom-scheduler
    spec:
      serviceAccountName: custom-scheduler   # needs RBAC equivalent to the default scheduler
      containers:
      - name: custom-scheduler
        image: my-custom-scheduler:v1.0
        command:
        - /usr/local/bin/kube-scheduler
        - --config=/etc/kubernetes/custom-scheduler-config.yaml

# Pod using the custom scheduler
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  schedulerName: custom-scheduler
  containers:
  - name: app
    image: my-app

Popular Custom Schedulers:

Scheduler | Use Case
--------- | --------
Volcano | Batch jobs (AI/ML), gang scheduling
YuniKorn | Multi-tenancy, fair resource sharing
Default scheduler profiles | Different strategies per workload type

Recommendation: Handle most workloads with the default scheduler; use custom schedulers only for specialized requirements.
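
To confirm which scheduler handled a Pod, check its spec and its Scheduled event (Pod name is illustrative):

# Scheduler requested by the Pod
kubectl get pod app -o jsonpath='{.spec.schedulerName}{"\n"}'

# Scheduled event emitted by the scheduler that bound it
kubectl get events --field-selector involvedObject.name=app,reason=Scheduled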


Q50. What are the performance tuning points for large-scale clusters (1000+ nodes)?

Large-scale clusters see bottlenecks in API Server, etcd, and Scheduler.

API Server Tuning:

# kube-apiserver flags
--max-requests-inflight=800          # Default 400
--max-mutating-requests-inflight=400 # Default 200
--request-timeout=60s
--watch-cache-sizes=nodes#1000,pods#5000
# API Priority and Fairness (APF)
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: PriorityLevelConfiguration
metadata:
  name: workload-high
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 30
    limitResponse:
      type: Queue
      queuing:
        queues: 64
        handSize: 6
        queueLengthLimit: 50
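
A PriorityLevelConfiguration only takes effect once a FlowSchema routes traffic to it. A minimal sketch that sends requests from a hypothetical controller service account to the workload-high level:

apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: workload-high-flows          # hypothetical name
spec:
  priorityLevelConfiguration:
    name: workload-high              # the level defined above
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: critical-controller    # hypothetical service account
        namespace: production
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      namespaces: ["*"]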

etcd Tuning:

# etcd startup flags
--quota-backend-bytes=8589934592      # 8GB
--auto-compaction-retention=1         # Compact every 1 hour
--auto-compaction-mode=periodic
--snapshot-count=10000
--heartbeat-interval=100              # ms
--election-timeout=1000               # ms
  • SSD required: etcd is I/O intensive
  • Dedicated servers: Separate etcd from other components
  • 5-member cluster: tolerates two member failures instead of one, at the cost of slightly higher write latency than 3 members
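
To check whether the quota and compaction settings are holding up, etcdctl reports DB size and raft state per member, and a defrag reclaims the space freed by compaction (endpoints and certificate paths are illustrative, taken from a typical kubeadm layout):

# DB size, leader, and raft index per endpoint
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table

# Reclaim fragmented space after compaction (run one member at a time)
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  defrag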

Scheduler Tuning:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  qps: 100      # Default 50
  burst: 200    # Default 100
percentageOfNodesToScore: 50  # Score only 50% of nodes for large clusters

kube-proxy Tuning:

# Use IPVS mode
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
ipvs:
  scheduler: rr
  syncPeriod: 30s
  minSyncPeriod: 2s

Additional Optimizations:

# Increase CoreDNS cache and replicas
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
      cache 300      # Increase TTL (default 30s)
      # ...
    }
# Scale CoreDNS
kubectl scale deployment coredns -n kube-system --replicas=5

Key Metrics to Monitor:

# API Server latency
apiserver_request_duration_seconds{quantile="0.99"}

# etcd disk performance
etcd_disk_backend_commit_duration_seconds{quantile="0.99"}

# Scheduler latency
scheduler_binding_duration_seconds{quantile="0.99"}

# kube-proxy sync latency
kubeproxy_sync_proxy_rules_duration_seconds{quantile="0.99"}
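
These queries plug straight into alerting; a sketch of a Prometheus rule on API server latency (threshold and duration are illustrative assumptions):

groups:
- name: control-plane-latency
  rules:
  - alert: APIServerHighLatency
    # p99 request latency above 1s for 10 minutes, ignoring long-lived requests
    expr: histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])) > 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "API server p99 request latency has exceeded 1s for 10 minutes"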

Key Terms: PriorityClass, Preemption, Topology Spread Constraints, Taint, Toleration, Scheduler Profile — refer to the Scheduling & Resource Management glossary section for detailed explanations.


