# EN_K8s_Scheduling
Pod Priority determines the order in which Pods are scheduled and which Pods are preempted (evicted) first when cluster resources are insufficient.
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical system pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: false
preemptionPolicy: Never  # Disable preemption
description: "Batch jobs"
```

| Class | Value | Use Case |
|---|---|---|
| system-critical | 100,000,000 | kube-system components |
| production-high | 1,000,000 | Core services |
| production-normal | 100,000 | General services |
| best-effort | 0 | Batch jobs |
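To review the classes actually defined in a cluster (Kubernetes ships the built-in `system-cluster-critical` and `system-node-critical` classes):

```bash
# List PriorityClasses sorted by value
kubectl get priorityclass --sort-by=.value
```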
```yaml
apiVersion: v1
kind: Pod
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: my-app
```

- High-priority Pods evict low-priority Pods to get scheduled.
- Preemption respects PodDisruptionBudgets (PDB) on a best-effort basis: the scheduler prefers victims whose eviction does not violate a PDB, but does not guarantee it.
- Use `preemptionPolicy: Never` to prevent a Pod from preempting others while still giving it priority in the scheduling queue.
- Combine with ResourceQuota to ensure fairness (see the sketch below).
- Monitor eviction events to detect priority misconfiguration.
- Set `preemptionPolicy: Never` for batch jobs to prevent them from disrupting services.
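As a minimal sketch of the ResourceQuota pairing (the `production` namespace and the limits are illustrative), a quota can be scoped so that Pods using a given PriorityClass can only claim a bounded share of the namespace:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: high-priority-quota
  namespace: production          # illustrative namespace
spec:
  hard:
    pods: "10"                   # illustrative limits
    requests.cpu: "20"
    requests.memory: 40Gi
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass   # scope the quota to one PriorityClass
      values: ["high-priority"]
```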
Pod Anti-Affinity and Topology Spread Constraints both control Pod distribution, but with different approaches.
```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:  # Hard constraint
    - labelSelector:
        matchLabels:
          app: database
      topologyKey: kubernetes.io/hostname
    preferredDuringSchedulingIgnoredDuringExecution:  # Soft constraint
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: web
        topologyKey: topology.kubernetes.io/zone
```

- Binary outcome per node: schedulable or not schedulable (for `required` rules)
- Strict separation between specific Pods
- Use case: DB Primary/Replica must not be on the same node
```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule  # or ScheduleAnyway
  labelSelector:
    matchLabels:
      app: web
- maxSkew: 2
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: web
```

- Distributes Pods evenly across topology domains (nodes/AZs/regions)
- `maxSkew`: maximum allowed Pod-count difference between any two domains (with three zones holding 2/2/1 matching Pods and `maxSkew: 1`, the next Pod must land in the zone with 1)
- `whenUnsatisfiable: DoNotSchedule` = hard constraint; `ScheduleAnyway` = soft constraint
| Scenario | Recommendation |
|---|---|
| DB Primary/Replica must be on different nodes | Pod Anti-Affinity (required) |
| Distribute stateless apps evenly | Topology Spread Constraints |
| Multi-AZ high availability | Topology Spread with maxSkew=1 |
| Best effort distribution | Topology Spread with ScheduleAnyway |
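The two mechanisms also compose. Below is a hedged sketch of a multi-AZ Deployment (the `web` labels, replica count, and image are illustrative) that spreads replicas evenly across zones and never co-locates two of them on one node:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                               # even spread across zones
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:                       # never two replicas on one node
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: my-web-app   # illustrative image
```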
Q48. What are the differences between Taint Effects (NoSchedule/PreferNoSchedule/NoExecute) and how are they used?
Taints are set on nodes to restrict which Pods can be scheduled there.
NoSchedule:
```bash
# Set taint on GPU node
kubectl taint nodes gpu-node nvidia.com/gpu=true:NoSchedule
```

- Pods without a matching Toleration cannot be scheduled
- Existing Pods are unaffected
- Use case: Restrict new workloads to specific hardware
PreferNoSchedule:
```bash
kubectl taint nodes node-1 spot-instance=true:PreferNoSchedule
```

- The scheduler prefers not to place non-tolerating Pods here, but will when no better node fits
- Soft constraint for lower-priority workloads
NoExecute:
```bash
# Maintenance mode — evicts existing Pods that do not tolerate the taint
kubectl taint nodes node-1 maintenance=true:NoExecute
```

- Evicts existing non-tolerating Pods immediately
- Use `tolerationSeconds` (set on the Pod's toleration, not on the taint) to allow a grace period before eviction
- Use case: Node maintenance, node failure
```yaml
tolerations:
# Exact match
- key: "nvidia.com/gpu"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
# Tolerate any value of the key
- key: "spot-instance"
  operator: "Exists"
  effect: "PreferNoSchedule"
# Tolerate with a grace period
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
```

| Scenario | Taint Effect |
|---|---|
| Spot/Preemptible instances | NoSchedule |
| Failed/maintenance nodes | NoExecute |
| Special hardware (GPU/ARM) | NoSchedule + Toleration |
| Gradual migration | PreferNoSchedule |
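Note that a Toleration only permits scheduling onto a tainted node; it does not attract Pods there. For the "NoSchedule + Toleration" hardware pattern, pair the toleration with a node label, as in this sketch (the `nvidia.com/gpu=true` node label and the image are assumptions; the GPU resource limit requires the NVIDIA device plugin):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  nodeSelector:
    nvidia.com/gpu: "true"     # assumed label on GPU nodes (attracts the Pod)
  tolerations:
  - key: "nvidia.com/gpu"      # matches the taint set earlier (permits the Pod)
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: cuda
    image: my-gpu-app          # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1
```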
Kubernetes supports custom schedulers in addition to the default scheduler.
A single scheduler instance can serve multiple profiles, each with different scheduling logic via plugin combinations.
```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesFit
      - name: PodTopologySpread
- schedulerName: no-scoring-scheduler
  plugins:
    score:
      disabled:
      - name: '*'  # Skip all scoring — first-fit strategy
- schedulerName: gpu-scheduler
  plugins:
    filter:
      enabled:
      - name: NodeResourcesFit
    score:
      enabled:
      - name: NodeResourcesFit
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated  # Bin-packing for GPU nodes
```

```yaml
# Deploy a custom scheduler
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-scheduler
spec:
  template:
    spec:
      containers:
      - name: custom-scheduler
        image: my-custom-scheduler:v1.0
        command:
        - /usr/local/bin/kube-scheduler
        - --config=/etc/kubernetes/custom-scheduler-config.yaml
```

```yaml
# Pod using the custom scheduler
apiVersion: v1
kind: Pod
spec:
  schedulerName: custom-scheduler
  containers:
  - name: app
    image: my-app
```

| Scheduler | Use Case |
|---|---|
| Volcano | Batch jobs (AI/ML), Gang scheduling |
| Yunikorn | Multi-tenancy, fair resource sharing |
| Default scheduler profiles | Different strategies per workload type |
Recommendation: Handle most workloads with the default scheduler; use custom schedulers only for specialized requirements.
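To verify which scheduler actually bound a Pod, the `Scheduled` event records the responsible component, so a quick check looks like this:

```bash
# The SCHEDULER column shows the schedulerName that bound each Pod
kubectl get events --field-selector reason=Scheduled \
  -o custom-columns=POD:.involvedObject.name,SCHEDULER:.source.component,MESSAGE:.message
```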
Large-scale clusters typically hit bottlenecks in the API Server, etcd, and the Scheduler; kube-proxy and cluster DNS follow close behind.
```bash
# kube-apiserver flags
--max-requests-inflight=800            # Default 400
--max-mutating-requests-inflight=400   # Default 200
--request-timeout=60s
--watch-cache-sizes=node#1000,pod#5000
```

```yaml
# API Priority and Fairness (APF)
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: PriorityLevelConfiguration
metadata:
  name: workload-high
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 30
    limitResponse:
      type: Queue
      queuing:
        queues: 64
        handSize: 6
        queueLengthLimit: 50
```
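A PriorityLevelConfiguration only takes effect once a FlowSchema routes requests to it. A hedged sketch (the tenant name and service account are hypothetical):

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: tenant-a-workloads        # hypothetical name
spec:
  priorityLevelConfiguration:
    name: workload-high           # the level defined above
  matchingPrecedence: 500         # lower values win when multiple schemas match
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: tenant-a            # hypothetical service account
        namespace: tenant-a
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      namespaces: ["*"]
      clusterScope: true
```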
```bash
# etcd startup flags
--quota-backend-bytes=8589934592   # 8GB
--auto-compaction-retention=1      # Compact every 1 hour
--auto-compaction-mode=periodic
--snapshot-count=10000
--heartbeat-interval=100           # ms
--election-timeout=1000            # ms
```

- SSD required: etcd is I/O intensive
- Dedicated servers: run etcd separately from other control-plane components
- 5-member cluster: higher availability than 3-member (tolerates two member failures instead of one)
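To sanity-check whether the disk keeps up before and after tuning, `etcdctl` ships a built-in load test; the certificate paths below are kubeadm defaults and may differ in other setups:

```bash
# Short benchmark against the local member; reports pass/fail per check
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  check perf
```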
```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  qps: 100    # Default 50
  burst: 200  # Default 100
percentageOfNodesToScore: 50  # Score only 50% of nodes in large clusters
```
```yaml
# Use IPVS mode
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
ipvs:
  scheduler: rr
  syncPeriod: 30s
  minSyncPeriod: 2s
```
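To confirm a node is really serving traffic via IPVS, one quick check (assumes `ipvsadm` is installed on the node):

```bash
# Lists one virtual service per Service/port with its real-server backends
sudo ipvsadm -Ln
```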
```yaml
# Increase the CoreDNS cache TTL
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        cache 300   # Increase TTL (default 30s)
        # ...
    }
```

```bash
# Scale out CoreDNS
kubectl scale deployment coredns -n kube-system --replicas=5
```
Key metrics to watch (all four are histograms, so derive the p99 with `histogram_quantile`):

```promql
# API Server latency
histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket[5m]))

# etcd disk performance
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))

# Scheduler latency
histogram_quantile(0.99, rate(scheduler_binding_duration_seconds_bucket[5m]))

# kube-proxy sync latency
histogram_quantile(0.99, rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m]))
```
Key Terms: PriorityClass, Preemption, Topology Spread Constraints, Taint, Toleration, Scheduler Profile — refer to the Scheduling & Resource Management glossary section for detailed explanations.