Kubernetes Cost Optimization: Resource Requests, Autoscaling, and Efficiency
Introduction
Your Kubernetes bill is likely higher than it needs to be. Studies show that 30-50% of cloud infrastructure spending in Kubernetes environments goes to waste through over-provisioning, inefficient autoscaling, and poor resource utilization. The good news? You can reclaim 20-40% of that spending without sacrificing performance or reliability.
This comprehensive guide walks you through three critical pillars of Kubernetes cost optimization: configuring resource requests and limits correctly, implementing intelligent autoscaling mechanisms, and adopting efficiency patterns that compound savings over time. Whether you’re managing a single cluster or orchestrating infrastructure across multiple environments, these strategies are immediately actionable and grounded in real-world production experience.
By the end of this post, you’ll understand how to right-size your workloads, implement autoscaling that actually works, and identify the hidden cost drains in your cluster. More importantly, you’ll have concrete YAML configurations and monitoring strategies you can deploy today.
Part 1: Resource Requests and Limits - The Foundation of Cost Optimization
Understanding Requests vs. Limits
The distinction between resource requests and limits is fundamental to Kubernetes cost optimization, yet it’s where many teams go wrong.
Resource Requests tell the Kubernetes scheduler how much CPU and memory your pod needs to run. The scheduler uses this information to decide which node can accommodate your pod. Requests also drive your cloud bill: they determine how much node capacity gets reserved, and nodes are what you actually pay for. Requests represent your guaranteed resource allocation.
Resource Limits set a hard ceiling on how much CPU and memory a pod can consume. If a pod exceeds its memory limit, it gets killed (OOMKilled). A container that hits its CPU limit is throttled rather than killed, preventing it from using more than allocated.
Here’s the cost implication: if you set requests too high, you’re paying for resources you don’t use. If you set them too low, your pods crash or perform poorly, leading to failed deployments and emergency scaling. The sweet spot is setting requests to match your actual 95th percentile usage, not your peak theoretical maximum.
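The percentile arithmetic is worth making concrete. Here is a short Python sketch of the p95-plus-buffer rule; the usage samples are hypothetical, and in practice you would export them from metrics-server or Prometheus:

```python
import math

def recommend_request(samples_millicores, buffer=0.10):
    """Return a CPU request: 95th percentile of observed usage plus a buffer."""
    ordered = sorted(samples_millicores)
    # Nearest-rank 95th percentile.
    idx = math.ceil(0.95 * len(ordered)) - 1
    p95 = ordered[idx]
    return math.ceil(p95 * (1 + buffer))

# Hypothetical per-minute CPU samples (millicores) for a web service.
samples = [120, 135, 150, 140, 160, 155, 145, 170, 180, 165,
           150, 140, 130, 175, 160, 155, 150, 145, 185, 190]
request = recommend_request(samples)   # 204m: p95 (185m) + 10% buffer
limit = request * 2                    # 150-200% of the request for burst headroom
```

Note how the recommendation (about 204m here) lands well below the 280m peak you might have provisioned for: the p95 rule deliberately ignores the rare spikes that limits and autoscaling are there to absorb.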
Right-Sizing Resource Requests
Start by understanding your current resource consumption. Most teams over-provision by 2-3x because they’re uncertain about actual requirements.
Here’s a practical approach:
- Deploy with minimal requests (but not zeroโthe scheduler needs guidance)
- Monitor actual usage for 1-2 weeks under normal load
- Set requests to 95th percentile usage plus a small buffer (10-15%)
- Set limits to 150-200% of requests to allow for traffic spikes
For example, if you observe that your web service uses an average of 150m CPU and peaks at 280m CPU:
apiVersion: v1
kind: Pod
metadata:
  name: web-service
spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        cpu: "200m"      # 95th percentile + buffer
        memory: "256Mi"
      limits:
        cpu: "400m"      # 2x requests for headroom
        memory: "512Mi"
This configuration ensures your pod gets scheduled appropriately while preventing runaway resource consumption.
Common Over-Provisioning Patterns
Pattern 1: Cargo Cult Requests
Many teams copy resource requests from examples or other services without understanding their workload. A typical mistake: setting every service to cpu: 500m and memory: 512Mi regardless of actual needs.
Pattern 2: Fear-Based Limits
Setting limits extremely high (“just in case”) defeats the purpose. If your limit is 10x your request, you’re essentially not limiting anything: a misbehaving container can starve its neighbors long before the limit kicks in, and the sizing problem is merely postponed.
Pattern 3: Ignoring QoS Classes
Kubernetes assigns Quality of Service (QoS) classes based on requests and limits. Pods whose requests are lower than their limits (or that have no limits) get “Burstable” QoS, which means they’re evicted before “Guaranteed” pods during node pressure; pods with no requests or limits at all (“BestEffort”) go first. Understanding QoS helps you make intentional trade-offs.
Practical Right-Sizing Workflow
Use this workflow to right-size your existing deployments:
# 1. Get current resource usage across all pods
kubectl top pods --all-namespaces --sort-by=memory
# 2. Identify pods with high request-to-usage ratios
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name, requests: .spec.containers[].resources.requests}'
# 3. For a specific deployment, check actual vs. requested
kubectl describe deployment myapp -n production | grep -A 5 "Requests"
Once you’ve identified over-provisioned workloads, update them incrementally. Change one deployment at a time, monitor for 24 hours, then adjust further if needed.
Part 2: Autoscaling Mechanisms - Scaling Intelligently
Autoscaling is where static resource allocation becomes dynamic cost optimization. Rather than paying for peak capacity 24/7, you scale up when needed and scale down during quiet periods.
Horizontal Pod Autoscaler (HPA)
HPA automatically scales the number of pod replicas based on observed metrics. It’s the most common autoscaling mechanism and typically delivers 25-35% cost savings for variable workloads.
How HPA Works:
- Metrics server collects CPU and memory usage from pods
- HPA controller checks metrics every 15 seconds (default)
- If average CPU exceeds target, HPA scales up
- If average CPU falls below target, HPA scales down (after cooldown period)
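The scaling decision itself follows the documented HPA formula, desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue). A short Python sketch of that arithmetic:

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization):
    """Core HPA formula: scale proportionally to how far the metric is from target."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

# 4 replicas averaging 90% CPU against a 70% target -> scale out to 6.
print(desired_replicas(4, 90, 70))   # 6
# 4 replicas averaging 35% CPU against a 70% target -> scale in to 2 (after cooldown).
print(desired_replicas(4, 35, 70))   # 2
```

The ceiling means HPA always rounds up, so it errs toward extra capacity rather than toward saturation.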
Here’s a production-ready HPA configuration:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max
Key configuration decisions:
- minReplicas: 2 - Always run at least 2 replicas for availability. Running 1 replica saves money but creates a single point of failure.
- maxReplicas: 20 - Set this based on your infrastructure limits and budget. This prevents runaway scaling during traffic spikes or bugs.
- averageUtilization: 70 - Scale up when average CPU hits 70%. This provides headroom for traffic spikes without over-provisioning.
- scaleDown stabilization: 300s - Wait 5 minutes before scaling down. This prevents thrashing (rapid up/down scaling) during variable load.
- scaleUp policies - Scale up aggressively (up to a 100% increase every 15 seconds) to handle traffic spikes quickly.
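With selectPolicy: Max, each scale-up step is capped by whichever policy allows more: doubling the current replicas or adding four pods. A minimal Python sketch of that cap (a simplification; the real controller also honors periodSeconds and the stabilization window):

```python
def max_scale_up(current, percent=100, pods=4):
    """selectPolicy: Max picks the more permissive of the two scaleUp policies."""
    by_percent = current + (current * percent) // 100   # Percent policy: +100%
    by_pods = current + pods                            # Pods policy: +4 replicas
    return max(by_percent, by_pods)

print(max_scale_up(2))    # 6  (the Pods policy wins at small replica counts)
print(max_scale_up(10))   # 20 (the Percent policy wins at larger counts)
```

This is why the configuration pairs the two policies: a pure percentage policy scales painfully slowly from 1-2 replicas, while a pure pods policy is too timid at scale.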
Vertical Pod Autoscaler (VPA)
While HPA scales the number of pods, VPA scales the resources per pod. It’s useful for workloads where you can’t easily add more replicas (like databases or stateful services) or where you want to optimize resource requests over time.
VPA works by:
- Monitoring actual resource usage
- Recommending new request/limit values
- Evicting and restarting pods with updated resources
Install VPA:
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
Configure VPA for a deployment:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-service
  updatePolicy:
    updateMode: "Auto"   # Options: Off, Initial, Recreate, Auto
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi
      controlledResources: ["cpu", "memory"]
When to use VPA:
- Workloads with unpredictable resource needs
- Stateful services where HPA isn’t suitable
- Fine-tuning resource requests after initial deployment
- Workloads that need to scale vertically (more powerful pods) rather than horizontally
Important caveat: in Auto and Recreate modes, VPA evicts pods to apply new resource values, causing brief restarts. For production workloads where you want recommendations without automatic updates, use updateMode: "Off" and read the recommendations from the VPA object; "Initial" applies recommendations only when pods are created, never to running pods.
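Whatever the update mode, the recommendation VPA applies is clamped by the minAllowed/maxAllowed bounds in the resourcePolicy. Conceptually (a sketch, not VPA's actual recommender code):

```python
def clamp_recommendation(target_millicores, min_allowed=50, max_allowed=2000):
    """VPA applies its target recommendation, bounded by minAllowed/maxAllowed."""
    return max(min_allowed, min(target_millicores, max_allowed))

print(clamp_recommendation(30))     # 50   (raised to minAllowed: 50m)
print(clamp_recommendation(800))    # 800  (within bounds, applied as-is)
print(clamp_recommendation(5000))   # 2000 (capped at maxAllowed: 2 CPUs)
```

The bounds are your cost guardrails: maxAllowed stops a memory-leaking service from being "right-sized" into an ever-larger pod.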
Cluster Autoscaler
HPA and VPA scale pods, but what happens when your cluster runs out of node capacity? Cluster Autoscaler automatically adds nodes when pods can’t be scheduled and removes nodes when they’re underutilized.
Cluster Autoscaler is essential for cost optimization because it prevents you from over-provisioning nodes upfront. Instead, you start with a baseline and let the autoscaler add capacity as needed.
Configure Cluster Autoscaler for AWS EKS:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
rules:
- apiGroups: [""]
  resources: ["events", "endpoints"]
  verbs: ["create", "patch"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["pods/status"]
  verbs: ["update"]
- apiGroups: [""]
  resources: ["endpoints"]
  resourceNames: ["cluster-autoscaler"]
  verbs: ["get", "update"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["watch", "list", "get", "update"]
- apiGroups: [""]
  resources: ["namespaces", "pods", "services", "replicationcontrollers", "persistentvolumeclaims", "persistentvolumes"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["extensions"]
  resources: ["replicasets", "daemonsets"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["policy"]
  resources: ["poddisruptionbudgets"]
  verbs: ["watch", "list"]
- apiGroups: ["apps"]
  resources: ["statefulsets", "daemonsets", "replicasets", "deployments"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["batch", "extensions"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "patch"]
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["create"]
- apiGroups: ["coordination.k8s.io"]
  resourceNames: ["cluster-autoscaler"]
  resources: ["leases"]
  verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
- kind: ServiceAccount
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0
        name: cluster-autoscaler
        resources:
          limits:
            cpu: 100m
            memory: 600Mi
          requests:
            cpu: 100m
            memory: 600Mi
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag:k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
        - --balance-similar-node-groups
        - --skip-nodes-with-local-storage=false
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m
        - --scale-down-delay-after-failure=3m
        - --scale-down-delay-after-delete=10s
        - --scale-down-unneeded-time=10m
        env:
        - name: AWS_REGION
          value: us-east-1
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      tolerations:
      - key: node-role.kubernetes.io/control-plane
        operator: Equal
        effect: NoSchedule
Key parameters:
- scale-down-enabled=true - Removes underutilized nodes
- scale-down-unneeded-time=10m - Wait 10 minutes before removing a node (prevents thrashing)
- scale-down-delay-after-add=10m - Wait 10 minutes after adding a node before considering removal
- expander=least-waste - When multiple node groups can satisfy a pod, choose the one with the least wasted resources
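The least-waste idea can be sketched in a few lines of Python: among node groups whose nodes fit the pending pod, pick the one that leaves the smallest fraction of the node idle. The node shapes below are hypothetical, and the real expander's scoring is more involved:

```python
def least_waste(node_groups, pod_cpu, pod_mem):
    """Pick the node group whose node leaves the least capacity unused after the pod lands."""
    fitting = {name: shape for name, shape in node_groups.items()
               if shape[0] >= pod_cpu and shape[1] >= pod_mem}

    def waste(shape):
        # Fraction of the node left idle, averaged over CPU and memory.
        cpu, mem = shape
        return ((cpu - pod_cpu) / cpu + (mem - pod_mem) / mem) / 2

    return min(fitting, key=lambda name: waste(fitting[name]))

# Hypothetical node shapes: (CPU millicores, memory MiB) per node.
groups = {"m5.large": (2000, 8192), "m5.xlarge": (4000, 16384), "c5.large": (2000, 4096)}
print(least_waste(groups, pod_cpu=1500, pod_mem=3000))   # c5.large
```

For a CPU-heavy pod, the smaller, CPU-dense instance wins; an expander like "random" or "most-pods" could have picked the m5.xlarge and left most of it idle.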
Combining HPA, VPA, and Cluster Autoscaler
These three mechanisms work together:
- HPA scales pod replicas based on CPU/memory
- Cluster Autoscaler adds nodes when pods can’t be scheduled
- VPA fine-tunes resource requests over time
For most workloads, use HPA + Cluster Autoscaler. Use VPA only for workloads where horizontal scaling isn’t practical.
Part 3: Efficiency Improvements and Advanced Patterns
Node Affinity and Pod Topology Spread
Efficient resource utilization depends on how pods are distributed across nodes. Poor scheduling can leave nodes partially empty while others are overloaded.
Use Pod Topology Spread Constraints to distribute pods evenly:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-service
  template:
    metadata:
      labels:
        app: web-service
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web-service
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: web-service
      containers:
      - name: app
        image: myapp:latest
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "200m"
            memory: "256Mi"
This configuration ensures:
- No more than 1 pod difference per node (maxSkew: 1)
- Pods spread across availability zones
- Better resource utilization and fault tolerance
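maxSkew is simply the difference between the most- and least-loaded topology domains. A quick Python check of an observed distribution (a simplification; the scheduler's full calculation also accounts for domains with zero matching pods):

```python
from collections import Counter

def skew(pod_nodes):
    """Skew = pods on the busiest domain minus pods on the quietest one."""
    counts = Counter(pod_nodes)
    return max(counts.values()) - min(counts.values())

# 6 replicas over 3 nodes: an even spread satisfies maxSkew: 1.
print(skew(["node-a", "node-a", "node-b", "node-b", "node-c", "node-c"]))  # 0
# A lopsided spread violates maxSkew: 1, so DoNotSchedule would block it.
print(skew(["node-a", "node-a", "node-a", "node-b", "node-b", "node-c"]))  # 2
```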
Using Spot Instances for Non-Critical Workloads
Spot instances cost 70-90% less than on-demand instances but can be interrupted. Use them for fault-tolerant workloads like batch jobs, CI/CD runners, and non-critical services.
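The savings compound quickly as you shift eligible workloads over. Assuming a hypothetical $0.096/hour on-demand node rate and a 70% spot discount, a blended fleet cost looks like:

```python
def blended_hourly_cost(nodes, spot_fraction, on_demand_rate=0.096, spot_discount=0.70):
    """Hourly fleet cost when a fraction of nodes runs on discounted spot capacity."""
    spot_nodes = nodes * spot_fraction
    od_nodes = nodes - spot_nodes
    return od_nodes * on_demand_rate + spot_nodes * on_demand_rate * (1 - spot_discount)

all_on_demand = blended_hourly_cost(10, 0.0)   # 10 nodes, no spot
half_spot = blended_hourly_cost(10, 0.5)       # same fleet, half on spot
print(f"${all_on_demand:.3f}/hr vs ${half_spot:.3f}/hr")
```

With those assumed rates, moving half the fleet to spot cuts the hourly bill by 35% before any right-sizing even happens.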
Configure a node pool for spot instances:
# Node labels like these are normally applied by your node group or
# provisioning tooling rather than by applying a Node manifest; shown
# here for reference.
apiVersion: v1
kind: Node
metadata:
  labels:
    workload-type: spot
    capacity-type: spot
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 5
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: capacity-type
                operator: In
                values:
                - spot
      tolerations:
      - key: spot-instance
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: processor
        image: batch-processor:latest
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
Namespace Resource Quotas
Prevent runaway resource consumption by setting namespace-level quotas:
apiVersion: v1
kind: Namespace
metadata:
  name: production
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: "200Gi"
    limits.cpu: "200"
    limits.memory: "400Gi"
    pods: "500"
    services.loadbalancers: "2"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: production-limits
  namespace: production
spec:
  limits:
  - max:
      cpu: "2"
      memory: "2Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
    type: Container
  - max:
      cpu: "4"
      memory: "4Gi"
    min:
      cpu: "100m"
      memory: "128Mi"
    type: Pod
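Quota enforcement at admission time is simple accounting: the sum of every existing pod's requests in the namespace, plus the incoming pod's requests, must stay under the hard cap. A minimal Python sketch with hypothetical numbers:

```python
def fits_quota(existing_requests_cpu, new_pod_cpu, hard_cpu):
    """ResourceQuota admission: reject the pod if aggregate requests would exceed 'hard'."""
    return sum(existing_requests_cpu) + new_pod_cpu <= hard_cpu

# A namespace with 99.5 CPUs already requested against requests.cpu: "100".
print(fits_quota([40.0, 35.5, 24.0], 0.4, 100))   # True  (99.9 <= 100)
print(fits_quota([40.0, 35.5, 24.0], 0.6, 100))   # False (100.1 > 100, pod rejected)
```

This is also why the LimitRange matters: once a ResourceQuota covers CPU or memory, every pod in the namespace must declare requests, and the LimitRange supplies sane defaults and bounds.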
Monitoring and Cost Tracking
You can’t optimize what you don’t measure. Implement cost monitoring to track spending by namespace, team, and workload.
Tools for cost tracking:
- Kubecost - Kubernetes-native cost monitoring with allocation by namespace, pod, and label
- CloudZero - Cloud cost intelligence platform
- Infracost - Infrastructure cost estimation for IaC
- AWS Cost Explorer - Native AWS cost tracking
Install Kubecost:
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer \
--namespace kubecost \
--create-namespace \
--set kubecostModel.warmCache=true \
--set kubecostModel.warmSavingsCache=true
Query Kubecost API for cost by namespace:
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090
# Get costs for last 7 days
curl "http://localhost:9090/api/v1/allocation?window=7d&aggregate=namespace&accumulate=true"
Common Anti-Patterns and How to Avoid Them
Anti-Pattern 1: No Resource Requests
Pods without requests get scheduled anywhere, leading to uneven utilization and poor autoscaling.
Fix: Always set resource requests. Use a LimitRange to enforce this at the namespace level.
Anti-Pattern 2: Requests Equal Limits
Setting requests equal to limits gives you Guaranteed QoS, but it forces requests up to peak usage: you reserve burst capacity 24/7 and give up the bin-packing savings that bursting above requests provides.
Fix: Set limits to 150-200% of requests.
Anti-Pattern 3: Ignoring Pod Disruption Budgets
During node maintenance or scale-down, pods get evicted without warning, causing service disruptions.
Fix: Define PodDisruptionBudgets for critical workloads:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-service-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: web-service
Anti-Pattern 4: Over-Relying on Limits
Setting extremely high limits doesn’t prevent cost overruns; it just delays the problem.
Fix: Combine limits with HPA and Cluster Autoscaler to scale proactively.
Anti-Pattern 5: Not Cleaning Up Old Resources
Orphaned deployments, unused PersistentVolumes, and forgotten namespaces accumulate costs.
Fix: Implement a cleanup policy:
# Find deployments scaled to zero replicas
kubectl get deployments --all-namespaces -o json | \
  jq -r '.items[] | select(.spec.replicas == 0) | "\(.metadata.namespace)/\(.metadata.name)"'
# Find unused (Released) PersistentVolumes
kubectl get pv -o json | \
  jq -r '.items[] | select(.status.phase == "Released") | .metadata.name'
# Find namespaces with no pods
kubectl get namespaces -o json | \
  jq -r '.items[] | select(.status.phase == "Active") | .metadata.name' | \
  while read -r ns; do
    count=$(kubectl get pods -n "$ns" --no-headers 2>/dev/null | wc -l)
    if [ "$count" -eq 0 ]; then
      echo "Empty namespace: $ns"
    fi
  done
Implementation Roadmap
Here’s a phased approach to implementing these optimizations:
Phase 1: Foundation (Week 1-2)
- Audit current resource requests and actual usage
- Identify over-provisioned workloads
- Set baseline resource requests based on observed usage
- Enable metrics-server if not already running
Phase 2: Autoscaling (Week 3-4)
- Deploy HPA for variable workloads
- Configure Cluster Autoscaler
- Set up monitoring and alerting
- Test scale-up and scale-down behavior
Phase 3: Optimization (Week 5-8)
- Deploy VPA for stateful workloads
- Implement Pod Topology Spread Constraints
- Set up namespace quotas and limits
- Migrate non-critical workloads to spot instances
Phase 4: Continuous Improvement (Ongoing)
- Monitor costs weekly
- Review and adjust HPA targets
- Update resource requests based on VPA recommendations
- Identify and clean up unused resources
Conclusion
Kubernetes cost optimization isn’t a one-time project; it’s a continuous practice. By mastering resource requests and limits, implementing intelligent autoscaling, and adopting efficiency patterns, you can realistically achieve 20-40% cost reductions while improving reliability and performance.
The key is starting with the fundamentals: right-size your resource requests, implement HPA for variable workloads, and enable Cluster Autoscaler to scale infrastructure dynamically. From there, layer in advanced patterns like VPA, spot instances, and topology spread constraints.
Begin with Phase 1 this week. Audit your current cluster, identify the biggest waste, and start with one deployment. The compounding effect of these optimizations across your entire infrastructure will be substantial.