Kubernetes Cost Optimization: Resource Requests, Autoscaling, and Efficiency
Introduction
Your Kubernetes bill is likely higher than it needs to be. Studies show that 30-50% of cloud infrastructure spending in Kubernetes environments goes to waste through over-provisioning, inefficient autoscaling, and poor resource utilization. The good news? You can reclaim 20-40% of that spending without sacrificing performance or reliability.
This comprehensive guide walks you through three critical pillars of Kubernetes cost optimization: configuring resource requests and limits correctly, implementing intelligent autoscaling mechanisms, and adopting efficiency patterns that compound savings over time. Whether you’re managing a single cluster or orchestrating infrastructure across multiple environments, these strategies are immediately actionable and grounded in real-world production experience.
By the end of this post, you’ll understand how to right-size your workloads, implement autoscaling that actually works, and identify the hidden cost drains in your cluster. More importantly, you’ll have concrete YAML configurations and monitoring strategies you can deploy today.
Part 1: Resource Requests and Limits - The Foundation of Cost Optimization
Understanding Requests vs. Limits
The distinction between resource requests and limits is fundamental to Kubernetes cost optimization, yet it’s where many teams go wrong.
Resource Requests tell the Kubernetes scheduler how much CPU and memory your pod needs to run. The scheduler uses this information to decide which node can accommodate your pod. Requests also drive your cloud bill: they determine how much node capacity gets reserved, and nodes are what you actually pay for. Requests represent your guaranteed resource allocation.
Resource Limits set a hard ceiling on how much CPU and memory a pod can consume. If a pod exceeds its memory limit, it gets killed (OOMKilled). A container that hits its CPU limit is throttled rather than killed, preventing it from using more than allocated.
Here’s the cost implication: if you set requests too high, you’re paying for resources you don’t use. If you set them too low, your pods crash or perform poorly, leading to failed deployments and emergency scaling. The sweet spot is setting requests to match your actual 95th percentile usage, not your peak theoretical maximum.
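The percentile arithmetic is worth making concrete. Here is a short Python sketch of the p95-plus-buffer rule; the usage samples are hypothetical, and in practice you would export them from metrics-server or Prometheus:

```python
import math

def recommend_request(samples_millicores, buffer=0.10):
    """Return a CPU request: 95th percentile of observed usage plus a buffer."""
    ordered = sorted(samples_millicores)
    # Nearest-rank 95th percentile.
    idx = math.ceil(0.95 * len(ordered)) - 1
    p95 = ordered[idx]
    return math.ceil(p95 * (1 + buffer))

# Hypothetical per-minute CPU samples (millicores) for a web service.
samples = [120, 135, 150, 140, 160, 155, 145, 170, 180, 165,
           150, 140, 130, 175, 160, 155, 150, 145, 185, 190]
request = recommend_request(samples)   # 204m: p95 (185m) + 10% buffer
limit = request * 2                    # 150-200% of the request for burst headroom
```

Note how the recommendation (about 204m here) lands well below the 280m peak you might have provisioned for: the p95 rule deliberately ignores the rare spikes that limits and autoscaling are there to absorb.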
Right-Sizing Resource Requests
Start by understanding your current resource consumption. Most teams over-provision by 2-3x because they’re uncertain about actual requirements.
Here’s a practical approach:
- Deploy with minimal requests (but not zeroโthe scheduler needs guidance)
- Monitor actual usage for 1-2 weeks under normal load
- Set requests to 95th percentile usage plus a small buffer (10-15%)
- Set limits to 150-200% of requests to allow for traffic spikes
For example, if you observe that your web service uses an average of 150m CPU and peaks at 280m CPU:
apiVersion: v1
kind: Pod
metadata:
  name: web-service
spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        cpu: "200m"      # 95th percentile + buffer
        memory: "256Mi"
      limits:
        cpu: "400m"      # 2x requests for headroom
        memory: "512Mi"
This configuration ensures your pod gets scheduled appropriately while preventing runaway resource consumption.
Common Over-Provisioning Patterns
Pattern 1: Cargo Cult Requests
Many teams copy resource requests from examples or other services without understanding their workload. A typical mistake: setting every service to cpu: 500m and memory: 512Mi regardless of actual needs.
Pattern 2: Fear-Based Limits
Setting limits extremely high (“just in case”) defeats the purpose. If your limit is 10x your request, you’re essentially not limiting anything: a misbehaving container can starve its neighbors long before the limit kicks in, and the sizing problem is merely postponed.
Pattern 3: Ignoring QoS Classes
Kubernetes assigns Quality of Service (QoS) classes based on requests and limits. Pods whose requests are lower than their limits (or that have no limits) get “Burstable” QoS, which means they’re evicted before “Guaranteed” pods during node pressure; pods with no requests or limits at all (“BestEffort”) go first. Understanding QoS helps you make intentional trade-offs.
Practical Right-Sizing Workflow
Use this workflow to right-size your existing deployments:
# 1. Get current resource usage across all pods
kubectl top pods --all-namespaces --sort-by=memory
# 2. Identify pods with high request-to-usage ratios
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name, requests: .spec.containers[].resources.requests}'
# 3. For a specific deployment, check actual vs. requested
kubectl describe deployment myapp -n production | grep -A 5 "Requests"
Once you’ve identified over-provisioned workloads, update them incrementally. Change one deployment at a time, monitor for 24 hours, then adjust further if needed.
Part 2: Autoscaling Mechanisms - Scaling Intelligently
Autoscaling is where static resource allocation becomes dynamic cost optimization. Rather than paying for peak capacity 24/7, you scale up when needed and scale down during quiet periods.
Horizontal Pod Autoscaler (HPA)
HPA automatically scales the number of pod replicas based on observed metrics. It’s the most common autoscaling mechanism and typically delivers 25-35% cost savings for variable workloads.
How HPA Works:
- Metrics server collects CPU and memory usage from pods
- HPA controller checks metrics every 15 seconds (default)
- If average CPU exceeds target, HPA scales up
- If average CPU falls below target, HPA scales down (after cooldown period)
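The scaling decision itself follows the documented HPA formula, desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue). A short Python sketch of that arithmetic:

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization):
    """Core HPA formula: scale proportionally to how far the metric is from target."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

# 4 replicas averaging 90% CPU against a 70% target -> scale out to 6.
print(desired_replicas(4, 90, 70))   # 6
# 4 replicas averaging 35% CPU against a 70% target -> scale in to 2 (after cooldown).
print(desired_replicas(4, 35, 70))   # 2
```

The ceiling means HPA always rounds up, so it errs toward extra capacity rather than toward saturation.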
Here’s a production-ready HPA configuration:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max
Key configuration decisions:
- minReplicas: 2 - Always run at least 2 replicas for availability. Running 1 replica saves money but creates a single point of failure.
- maxReplicas: 20 - Set this based on your infrastructure limits and budget. This prevents runaway scaling during traffic spikes or bugs.
- averageUtilization: 70 - Scale up when average CPU hits 70%. This provides headroom for traffic spikes without over-provisioning.
- scaleDown stabilization: 300s - Wait 5 minutes before scaling down. This prevents thrashing (rapid up/down scaling) during variable load.
- scaleUp policies - Scale up aggressively (up to a 100% increase every 15 seconds) to handle traffic spikes quickly.
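With selectPolicy: Max, each scale-up step is capped by whichever policy allows more: doubling the current replicas or adding four pods. A minimal Python sketch of that cap (a simplification; the real controller also honors periodSeconds and the stabilization window):

```python
def max_scale_up(current, percent=100, pods=4):
    """selectPolicy: Max picks the more permissive of the two scaleUp policies."""
    by_percent = current + (current * percent) // 100   # Percent policy: +100%
    by_pods = current + pods                            # Pods policy: +4 replicas
    return max(by_percent, by_pods)

print(max_scale_up(2))    # 6  (the Pods policy wins at small replica counts)
print(max_scale_up(10))   # 20 (the Percent policy wins at larger counts)
```

This is why the configuration pairs the two policies: a pure percentage policy scales painfully slowly from 1-2 replicas, while a pure pods policy is too timid at scale.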
Vertical Pod Autoscaler (VPA)
While HPA scales the number of pods, VPA scales the resources per pod. It’s useful for workloads where you can’t easily add more replicas (like databases or stateful services) or where you want to optimize resource requests over time.
VPA works by:
- Monitoring actual resource usage
- Recommending new request/limit values
- Evicting and restarting pods with updated resources
Install VPA:
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
Configure VPA for a deployment:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-service
  updatePolicy:
    updateMode: "Auto"   # Options: Off, Initial, Recreate, Auto
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi
      controlledResources: ["cpu", "memory"]
When to use VPA:
- Workloads with unpredictable resource needs
- Stateful services where HPA isn’t suitable
- Fine-tuning resource requests after initial deployment
- Workloads that need to scale vertically (more powerful pods) rather than horizontally
Important caveat: in Auto and Recreate modes, VPA evicts pods to apply new resource values, causing brief restarts. For production workloads where you want recommendations without automatic updates, use updateMode: "Off" and read the recommendations from the VPA object; "Initial" applies recommendations only when pods are created, never to running pods.
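Whatever the update mode, the recommendation VPA applies is clamped by the minAllowed/maxAllowed bounds in the resourcePolicy. Conceptually (a sketch, not VPA's actual recommender code):

```python
def clamp_recommendation(target_millicores, min_allowed=50, max_allowed=2000):
    """VPA applies its target recommendation, bounded by minAllowed/maxAllowed."""
    return max(min_allowed, min(target_millicores, max_allowed))

print(clamp_recommendation(30))     # 50   (raised to minAllowed: 50m)
print(clamp_recommendation(800))    # 800  (within bounds, applied as-is)
print(clamp_recommendation(5000))   # 2000 (capped at maxAllowed: 2 CPUs)
```

The bounds are your cost guardrails: maxAllowed stops a memory-leaking service from being "right-sized" into an ever-larger pod.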
Cluster Autoscaler
HPA and VPA scale pods, but what happens when your cluster runs out of node capacity? Cluster Autoscaler automatically adds nodes when pods can’t be scheduled and removes nodes when they’re underutilized.
Cluster Autoscaler is essential for cost optimization because it prevents you from over-provisioning nodes upfront. Instead, you start with a baseline and let the autoscaler add capacity as needed.
Configure Cluster Autoscaler for AWS EKS:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
rules:
- apiGroups: [""]
  resources: ["events", "endpoints"]
  verbs: ["create", "patch"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["pods/status"]
  verbs: ["update"]
- apiGroups: [""]
  resources: ["endpoints"]
  resourceNames: ["cluster-autoscaler"]
  verbs: ["get", "update"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["watch", "list", "get", "update"]
- apiGroups: [""]
  resources: ["namespaces", "pods", "services", "replicationcontrollers", "persistentvolumeclaims", "persistentvolumes"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["extensions"]
  resources: ["replicasets", "daemonsets"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["policy"]
  resources: ["poddisruptionbudgets"]
  verbs: ["watch", "list"]
- apiGroups: ["apps"]
  resources: ["statefulsets", "daemonsets", "replicasets", "deployments"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["batch", "extensions"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "patch"]
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["create"]
- apiGroups: ["coordination.k8s.io"]
  resourceNames: ["cluster-autoscaler"]
  resources: ["leases"]
  verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
- kind: ServiceAccount
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0
        name: cluster-autoscaler
        resources:
          limits:
            cpu: 100m
            memory: 600Mi
          requests:
            cpu: 100m
            memory: 600Mi
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag:k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
        - --balance-similar-node-groups
        - --skip-nodes-with-local-storage=false
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m
        - --scale-down-delay-after-failure=3m
        - --scale-down-delay-after-delete=10s
        - --scale-down-unneeded-time=10m
        env:
        - name: AWS_REGION
          value: us-east-1
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      tolerations:
      - key: node-role.kubernetes.io/control-plane
        operator: Equal
        effect: NoSchedule
Key parameters:
- scale-down-enabled=true - Removes underutilized nodes
- scale-down-unneeded-time=10m - Wait 10 minutes before removing a node (prevents thrashing)
- scale-down-delay-after-add=10m - Wait 10 minutes after adding a node before considering removal
- expander=least-waste - When multiple node groups can satisfy a pod, choose the one with the least wasted resources
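The least-waste idea can be sketched in a few lines of Python: among node groups whose nodes fit the pending pod, pick the one that leaves the smallest fraction of the node idle. The node shapes below are hypothetical, and the real expander's scoring is more involved:

```python
def least_waste(node_groups, pod_cpu, pod_mem):
    """Pick the node group whose node leaves the least capacity unused after the pod lands."""
    fitting = {name: shape for name, shape in node_groups.items()
               if shape[0] >= pod_cpu and shape[1] >= pod_mem}

    def waste(shape):
        # Fraction of the node left idle, averaged over CPU and memory.
        cpu, mem = shape
        return ((cpu - pod_cpu) / cpu + (mem - pod_mem) / mem) / 2

    return min(fitting, key=lambda name: waste(fitting[name]))

# Hypothetical node shapes: (CPU millicores, memory MiB) per node.
groups = {"m5.large": (2000, 8192), "m5.xlarge": (4000, 16384), "c5.large": (2000, 4096)}
print(least_waste(groups, pod_cpu=1500, pod_mem=3000))   # c5.large
```

For a CPU-heavy pod, the smaller, CPU-dense instance wins; an expander like "random" or "most-pods" could have picked the m5.xlarge and left most of it idle.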
Combining HPA, VPA, and Cluster Autoscaler
These three mechanisms work together:
- HPA scales pod replicas based on CPU/memory
- Cluster Autoscaler adds nodes when pods can’t be scheduled
- VPA fine-tunes resource requests over time
For most workloads, use HPA + Cluster Autoscaler. Use VPA only for workloads where horizontal scaling isn’t practical.
Part 3: Efficiency Improvements and Advanced Patterns
Node Affinity and Pod Topology Spread
Efficient resource utilization depends on how pods are distributed across nodes. Poor scheduling can leave nodes partially empty while others are overloaded.
Use Pod Topology Spread Constraints to distribute pods evenly:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-service
  template:
    metadata:
      labels:
        app: web-service
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web-service
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: web-service
      containers:
      - name: app
        image: myapp:latest
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "200m"
            memory: "256Mi"
This configuration ensures:
- No more than 1 pod difference per node (maxSkew: 1)
- Pods spread across availability zones
- Better resource utilization and fault tolerance
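maxSkew is simply the difference between the most- and least-loaded topology domains. A quick Python check of an observed distribution (a simplification; the scheduler's full calculation also accounts for domains with zero matching pods):

```python
from collections import Counter

def skew(pod_nodes):
    """Skew = pods on the busiest domain minus pods on the quietest one."""
    counts = Counter(pod_nodes)
    return max(counts.values()) - min(counts.values())

# 6 replicas over 3 nodes: an even spread satisfies maxSkew: 1.
print(skew(["node-a", "node-a", "node-b", "node-b", "node-c", "node-c"]))  # 0
# A lopsided spread violates maxSkew: 1, so DoNotSchedule would block it.
print(skew(["node-a", "node-a", "node-a", "node-b", "node-b", "node-c"]))  # 2
```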
Using Spot Instances for Non-Critical Workloads
Spot instances cost 70-90% less than on-demand instances but can be interrupted. Use them for fault-tolerant workloads like batch jobs, CI/CD runners, and non-critical services.
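The savings compound quickly as you shift eligible workloads over. Assuming a hypothetical $0.096/hour on-demand node rate and a 70% spot discount, a blended fleet cost looks like:

```python
def blended_hourly_cost(nodes, spot_fraction, on_demand_rate=0.096, spot_discount=0.70):
    """Hourly fleet cost when a fraction of nodes runs on discounted spot capacity."""
    spot_nodes = nodes * spot_fraction
    od_nodes = nodes - spot_nodes
    return od_nodes * on_demand_rate + spot_nodes * on_demand_rate * (1 - spot_discount)

all_on_demand = blended_hourly_cost(10, 0.0)   # 10 nodes, no spot
half_spot = blended_hourly_cost(10, 0.5)       # same fleet, half on spot
print(f"${all_on_demand:.3f}/hr vs ${half_spot:.3f}/hr")
```

With those assumed rates, moving half the fleet to spot cuts the hourly bill by 35% before any right-sizing even happens.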
Configure a node pool for spot instances:
# Node labels like these are normally applied by your node group or
# provisioning tooling rather than by applying a Node manifest; shown
# here for reference.
apiVersion: v1
kind: Node
metadata:
  labels:
    workload-type: spot
    capacity-type: spot
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 5
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: capacity-type
                operator: In
                values:
                - spot
      tolerations:
      - key: spot-instance
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: processor
        image: batch-processor:latest
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
Namespace Resource Quotas
Prevent runaway resource consumption by setting namespace-level quotas:
apiVersion: v1
kind: Namespace
metadata:
  name: production
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: "200Gi"
    limits.cpu: "200"
    limits.memory: "400Gi"
    pods: "500"
    services.loadbalancers: "2"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: production-limits
  namespace: production
spec:
  limits:
  - max:
      cpu: "2"
      memory: "2Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
    type: Container
  - max:
      cpu: "4"
      memory: "4Gi"
    min:
      cpu: "100m"
      memory: "128Mi"
    type: Pod
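Quota enforcement at admission time is simple accounting: the sum of every existing pod's requests in the namespace, plus the incoming pod's requests, must stay under the hard cap. A minimal Python sketch with hypothetical numbers:

```python
def fits_quota(existing_requests_cpu, new_pod_cpu, hard_cpu):
    """ResourceQuota admission: reject the pod if aggregate requests would exceed 'hard'."""
    return sum(existing_requests_cpu) + new_pod_cpu <= hard_cpu

# A namespace with 99.5 CPUs already requested against requests.cpu: "100".
print(fits_quota([40.0, 35.5, 24.0], 0.4, 100))   # True  (99.9 <= 100)
print(fits_quota([40.0, 35.5, 24.0], 0.6, 100))   # False (100.1 > 100, pod rejected)
```

This is also why the LimitRange matters: once a ResourceQuota covers CPU or memory, every pod in the namespace must declare requests, and the LimitRange supplies sane defaults and bounds.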
Monitoring and Cost Tracking
You can’t optimize what you don’t measure. Implement cost monitoring to track spending by namespace, team, and workload.
Tools for cost tracking:
- Kubecost - Kubernetes-native cost monitoring with allocation by namespace, pod, and label
- CloudZero - Cloud cost intelligence platform
- Infracost - Infrastructure cost estimation for IaC
- AWS Cost Explorer - Native AWS cost tracking
Install Kubecost:
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer \
--namespace kubecost \
--create-namespace \
--set kubecostModel.warmCache=true \
--set kubecostModel.warmSavingsCache=true
Query Kubecost API for cost by namespace:
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090
# Get costs for last 7 days
curl "http://localhost:9090/api/v1/allocation?window=7d&aggregate=namespace&accumulate=true"
Common Anti-Patterns and How to Avoid Them
Anti-Pattern 1: No Resource Requests
Pods without requests get scheduled anywhere, leading to uneven utilization and poor autoscaling.
Fix: Always set resource requests. Use a LimitRange to enforce this at the namespace level.
Anti-Pattern 2: Requests Equal Limits
Setting requests equal to limits gives you Guaranteed QoS, but it forces requests up to peak usage: you reserve burst capacity 24/7 and give up the bin-packing savings that bursting above requests provides.
Fix: Set limits to 150-200% of requests.
Anti-Pattern 3: Ignoring Pod Disruption Budgets
During node maintenance or scale-down, pods get evicted without warning, causing service disruptions.
Fix: Define PodDisruptionBudgets for critical workloads:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-service-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: web-service
Anti-Pattern 4: Over-Relying on Limits
Setting extremely high limits doesn’t prevent cost overruns; it just delays the problem.
Fix: Combine limits with HPA and Cluster Autoscaler to scale proactively.
Anti-Pattern 5: Not Cleaning Up Old Resources
Orphaned deployments, unused PersistentVolumes, and forgotten namespaces accumulate costs.
Fix: Implement a cleanup policy:
# Find deployments scaled to zero replicas
kubectl get deployments --all-namespaces -o json | \
  jq -r '.items[] | select(.spec.replicas == 0) | "\(.metadata.namespace)/\(.metadata.name)"'
# Find unused (Released) PersistentVolumes
kubectl get pv -o json | \
  jq -r '.items[] | select(.status.phase == "Released") | .metadata.name'
# Find namespaces with no pods
kubectl get namespaces -o json | \
  jq -r '.items[] | select(.status.phase == "Active") | .metadata.name' | \
  while read -r ns; do
    count=$(kubectl get pods -n "$ns" --no-headers 2>/dev/null | wc -l)
    if [ "$count" -eq 0 ]; then
      echo "Empty namespace: $ns"
    fi
  done
Implementation Roadmap
Here’s a phased approach to implementing these optimizations:
Phase 1: Foundation (Week 1-2)
- Audit current resource requests and actual usage
- Identify over-provisioned workloads
- Set baseline resource requests based on observed usage
- Enable metrics-server if not already running
Phase 2: Autoscaling (Week 3-4)
- Deploy HPA for variable workloads
- Configure Cluster Autoscaler
- Set up monitoring and alerting
- Test scale-up and scale-down behavior
Phase 3: Optimization (Week 5-8)
- Deploy VPA for stateful workloads
- Implement Pod Topology Spread Constraints
- Set up namespace quotas and limits
- Migrate non-critical workloads to spot instances
Phase 4: Continuous Improvement (Ongoing)
- Monitor costs weekly
- Review and adjust HPA targets
- Update resource requests based on VPA recommendations
- Identify and clean up unused resources
Conclusion
Kubernetes cost optimization isn’t a one-time project; it’s a continuous practice. By mastering resource requests and limits, implementing intelligent autoscaling, and adopting efficiency patterns, you can realistically achieve 20-40% cost reductions while improving reliability and performance.
The key is starting with the fundamentals: right-size your resource requests, implement HPA for variable workloads, and enable Cluster Autoscaler to scale infrastructure dynamically. From there, layer in advanced patterns like VPA, spot instances, and topology spread constraints.
Begin with Phase 1 this week. Audit your current cluster, identify the biggest waste, and start with one deployment. The compounding effect of these optimizations across your entire infrastructure will be substantial.