Stop Losing Money on Spot Instance Interruptions in EKS

Running AWS Spot Instances on EKS can save you 60-90% compared to On-Demand pricing. But without proper interruption handling, you're trading reliability for savings—a tradeoff that will eventually bite you in production.

The Problem

Every engineering leader running EKS faces the same dilemma: Spot Instances promise massive savings, but the fear of unexpected interruptions keeps teams on expensive On-Demand instances "just to be safe."

Here's the reality: AWS can reclaim Spot Instances with just 2 minutes' notice. When this happens to a node in your EKS cluster, any pods running on that node get terminated abruptly. Without graceful handling:

In-flight requests get dropped, causing customer-facing errors
Stateful applications lose data mid-transaction
Downstream services cascade into failure as connections break
Your team gets paged at 3 AM for "random" pod failures

The worst part? Many teams run Spot exclusively to cut costs, then spend those savings (and more) on incident response and customer apologies.

The Technical Reality

When AWS decides to reclaim a Spot Instance, it publishes a Spot Instance interruption notice through the EC2 instance metadata service (and as an EventBridge event). From that point, your node has about 2 minutes to gracefully evacuate workloads before hard termination.

The challenge is that Kubernetes doesn't natively understand Spot interruptions. Without explicit handling, your cluster continues scheduling pods to a node that's about to disappear.
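To make the notice concrete, here is a minimal Python sketch of what a poller running on the node might do. It assumes IMDSv2 is enabled and reads the `spot/instance-action` metadata path, which returns 404 until an interruption is scheduled. The function names are illustrative, and this is not how NTH itself is implemented.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=1.0):
    """Return the raw interruption notice, or None if none is pending.

    Only works on an EC2 instance, where the metadata endpoint is reachable.
    """
    # IMDSv2: request a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    action_req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def seconds_until_action(raw_notice, now=None):
    """Parse a notice and return (action, seconds left before it happens)."""
    notice = json.loads(raw_notice)
    deadline = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return notice["action"], (deadline - now).total_seconds()
```

A poller would call `fetch_instance_action()` every few seconds and start cordoning and draining as soon as it returns something other than `None`.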

The Solution: A Complete Spot Interruption Strategy

1. Deploy aws-node-termination-handler (NTH)

The Node Termination Handler is the critical first line of defense:

# Install via Helm
helm repo add eks https://aws.github.io/eks-charts
helm install aws-node-termination-handler \
  eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true \
  --set enableScheduledEventDraining=true

NTH polls the EC2 metadata service, detects the termination notice, cordons the node (preventing new pods), and initiates a graceful drain, giving your pods as much of the 2-minute window as possible to shut down cleanly.

2. Configure Pod Disruption Budgets (PDBs)

PDBs ensure your services maintain availability during node drains:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: critical-service

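This asks Kubernetes to keep at least 2 replicas running while a node is drained, including the drain NTH triggers on a Spot interruption. A PDB only limits evictions, though: it cannot keep pods alive once the instance is actually terminated, so run more replicas than minAvailable and spread them across nodes.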
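How much room a drain has under a PDB comes down to simple arithmetic: the number of evictions allowed at any moment is the count of healthy pods minus minAvailable, floored at zero. The Python sketch below illustrates that rule only. The function name is hypothetical and it does not talk to a cluster.

```python
def disruptions_allowed(current_healthy, min_available):
    """Evictions a PDB with `minAvailable` permits right now."""
    return max(0, current_healthy - min_available)


# With minAvailable: 2, three healthy replicas leave room for one eviction,
# while exactly two replicas leave none and the drain has to wait.
for replicas in (2, 3, 4):
    print(replicas, "healthy ->", disruptions_allowed(replicas, 2), "evictable")
```

If the result is zero, the drain stalls until a replacement pod becomes ready elsewhere, which may not happen inside the interruption window. That is the practical reason to run more replicas than minAvailable.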

3. Diversify Instance Types

Spot interruptions correlate by capacity pool, meaning a given instance type in a given Availability Zone. If you're running all m5.xlarge, a single capacity squeeze can hit all your nodes simultaneously.

With Karpenter, diversification is trivial:

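# Provisioner (karpenter.sh/v1alpha5) is the API of older Karpenter releases;
# newer releases replace it with the NodePool resource.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot-diversified  # example name
spec:
  # providerRef (pointing at your AWSNodeTemplate) omitted for brevity
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.xlarge", "m5a.xlarge", "m6i.xlarge",
               "c5.xlarge", "c5a.xlarge", "r5.xlarge"]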

4. Implement Graceful Shutdown in Applications

Your applications must handle SIGTERM properly:

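import signal
import sys

# `server` and `db` stand in for your application's server and database
# handles; they are placeholders, not a specific library's API.

def graceful_shutdown(signum, frame):
    # Stop accepting new work and finish in-flight requests
    server.stop(grace=30)  # 30-second grace period
    # Close database connections
    db.close()
    sys.exit(0)

# Kubernetes sends SIGTERM to the container when the pod is evicted
signal.signal(signal.SIGTERM, graceful_shutdown)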
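The handler above does its cleanup inside the signal handler. A common alternative, sketched here in self-contained Python, is to have the handler only set a flag and let the main loop finish its current unit of work before draining. The names `serve`, `handle_one`, and `drain` are illustrative placeholders, not part of any framework.

```python
import signal
import threading

# Set once SIGTERM (or Ctrl-C) arrives; the main loop checks it between units of work.
shutdown_requested = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: record the request and return immediately."""
    shutdown_requested.set()


def install_handlers():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, request_shutdown)
    signal.signal(signal.SIGINT, request_shutdown)


def serve(handle_one, drain, poll_seconds=0.1):
    """Process work until shutdown is requested, then run cleanup once."""
    while not shutdown_requested.is_set():
        handle_one()                           # one request, job, or batch
        shutdown_requested.wait(poll_seconds)  # short pause, wakes early on shutdown
    drain()                                    # flush buffers, close connections
```

Call `install_handlers()` once at startup, then `serve(...)` with your own work and cleanup callables. Keeping the handler this small avoids running cleanup code at an arbitrary point in the middle of a request.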

Set terminationGracePeriodSeconds in Kubernetes to cover your actual shutdown time, and keep it under the 2-minute interruption window:

spec:
  terminationGracePeriodSeconds: 60
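A quick sanity check on these numbers can catch mismatches before an interruption does. The Python sketch below assumes the 2-minute Spot notice and that any preStop hook runs inside the pod's grace period. The function and parameter names are hypothetical.

```python
SPOT_NOTICE_SECONDS = 120  # the 2-minute Spot interruption notice


def check_shutdown_budget(termination_grace_period, app_shutdown,
                          pre_stop=0, notice=SPOT_NOTICE_SECONDS):
    """Return a list of problems with the configured timings (empty if fine)."""
    problems = []
    if pre_stop + app_shutdown > termination_grace_period:
        problems.append(
            "pod would be killed before the application finishes shutting down"
        )
    if termination_grace_period > notice:
        problems.append(
            "grace period is longer than the Spot notice; the node may "
            "disappear first"
        )
    return problems


# The values used in this article: 30s application grace, 60s pod grace period.
print(check_shutdown_budget(termination_grace_period=60, app_shutdown=30))
```

In practice it helps to leave a few seconds of headroom below the notice, since detecting the interruption and starting the drain also takes time.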

The Payoff

Teams implementing this strategy can reach 70%+ Spot adoption without customer-facing incidents during interruptions. The 60-90% compute savings become real, tangible savings instead of being eaten up by incidents and reliability debt.

Consider the math: A typical mid-market EKS cluster running $50K/month in compute can save up to $30-45K/month if that compute moves to Spot at a 60-90% discount. That's $360-540K annually, money that goes directly to your bottom line instead of AWS.
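The same arithmetic can be rerun for a partial migration. This small Python sketch assumes savings scale linearly with the share of compute moved to Spot and with the Spot discount. The function name and the linear model are simplifications for illustration, not a pricing tool.

```python
def monthly_spot_savings(monthly_compute, spot_share, spot_discount):
    """Estimated monthly savings in dollars.

    monthly_compute: current On-Demand compute spend per month
    spot_share:      fraction of that compute moved to Spot (0 to 1)
    spot_discount:   Spot discount versus On-Demand (0 to 1)
    """
    if not (0 <= spot_share <= 1 and 0 <= spot_discount <= 1):
        raise ValueError("spot_share and spot_discount must be between 0 and 1")
    return monthly_compute * spot_share * spot_discount


# $50K/month at 100% and at 70% Spot adoption, for 60% and 90% discounts.
for share in (1.0, 0.7):
    low = monthly_spot_savings(50_000, share, 0.60)
    high = monthly_spot_savings(50_000, share, 0.90)
    print(f"{share:.0%} on Spot: ${low:,.0f} to ${high:,.0f} per month")
```

Under these assumptions, 70% Spot adoption on a $50K/month cluster works out to roughly $21K to $31.5K per month rather than the full $30K to $45K.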

But here's what most cost optimization guides won't tell you: the savings only materialize if you maintain reliability. One major outage from a poorly-handled Spot interruption can cost more in customer trust and engineering time than months of Spot savings.

Need Help?

At Uptime & Spend, we specialize in EKS cost optimization and reliability. We've helped mid-market companies implement Spot strategies that cut their AWS bills by 18-35% in 30 days—without sacrificing reliability.

If your Spot adoption is stuck below 30%, or you're experiencing interruption-related incidents, let's talk.

Schedule a Free Consultation

#AWS #EKS #Kubernetes #SpotInstances #FinOps #SRE #PlatformEngineering