AWS Spot instance launch failure fallback with Kubernetes Cluster Autoscaler

Neha Agarwal
4 min read · Sep 12, 2020

Problem Statement: The Kubernetes Cluster Autoscaler fails to stabilize the environment when an AWS Auto Scaling Group (ASG) has no spot pools left to launch instances from.

Solution: Configure one Auto Scaling Group in every Availability Zone (AZ) so that the maximum number of spot pools is available, and if one ASG runs out of spot capacity, fall back to another ASG chosen after a few checks.

If you are using spot instances with Kubernetes on AWS, you know the pain of stabilizing the environment. The reason is that spot instances often become completely unavailable in some AZs at particular times of the day. This not only blocks launching new instances, but also knocks out existing ones, leaving that environment completely unstable.

Some straightforward temporary or permanent solutions, depending on your use case, could be:

  1. Not restricting the ASG to any particular AZ
  2. Configuring multiple instance types with the same CPU and memory resources in one ASG (see the sketch after this list)
  3. Launching on-demand instances on spot instance launch failure (increases cost)
  4. Reducing the node provision time of the cluster autoscaler to a certain extent (not recommended beyond that, as it may affect working scenarios)
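Options 2 and 3 can be expressed through a mixed instances policy on a single ASG. Here is a minimal boto3 sketch; the launch template name, subnet IDs and instance types are hypothetical placeholders for your own setup:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical names and values; substitute your own launch template, subnets and types.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="k8s-workers-spot",
    MinSize=0,
    MaxSize=20,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb,subnet-ccc",  # one subnet per AZ
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "k8s-worker-lt",
                "Version": "$Latest",
            },
            # Option 2: several instance types with the same CPU/memory footprint
            "Overrides": [
                {"InstanceType": "c5.2xlarge"},
                {"InstanceType": "c5a.2xlarge"},
                {"InstanceType": "c5d.2xlarge"},
            ],
        },
        "InstancesDistribution": {
            # Option 3: raise these above 0 to fall back on on-demand (increases cost)
            "OnDemandBaseCapacity": 0,
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "lowest-price",
            "SpotInstancePools": 3,
        },
    },
)
```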

But there is always an exceptional scenario where none of these workarounds work.

I had one such scenario where none of these solutions worked.

One ASG per AZ

As you can see in the image above, I was using 3 AZs (us-east-1a, us-east-1c and us-east-1d) out of the 6 AZs available in us-east-1. The instance type I was using was from the c5 series, so the easiest fix could have been to add the c5a series as the second priority in the ASG, which would have worked for ASG-1. Right when you think you have tackled the problem, you discover that the c5a series is available only in the us-east-1a and us-east-1b availability zones. This is where that plan of action fails.

When the Kubernetes Cluster Autoscaler tries to launch instances in the AZs other than 1a and 1b, and the c5 spot capacity has been exhausted in those AZs, the cluster autoscaler fails to scale up your environment.

Services used for the existing setup:

  1. AWS Spot instances
  2. Self-hosted Kubernetes worker nodes on AWS Auto Scaling Groups
  3. Kubernetes Cluster Autoscaler

Services integrated with existing ones to run this custom logic:

  1. Lambda
  2. SSM (Systems Manager)
  3. SNS (Simple Notification Service)
  4. Multi-AZ Kubernetes cluster worker node setup

High-level overview:

Spot Instance launch failure fallback strategy

No spot pool left in ASG-1 > fallback to ASG-2
No spot pool left in ASG-2 > fallback to ASG-3
No spot pool left in ASG-3 > fallback to ASG-4
No spot pool left in ASG-4 > fallback to ASG-5

Requirements for ASGs:

  1. ASG configured with 100% Spot instances
  2. At least one instance type in every ASG
  3. Unique tag added to all ASGs linked with cluster autoscaler
  4. Same SNS topic attached to every ASG (requirements 3 and 4 are sketched right after this list)
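Requirements 3 and 4 boil down to tagging every ASG with a shared key and pointing its launch-error notifications at the same SNS topic. A minimal boto3 sketch, with hypothetical ASG names, tag key and topic ARN:

```python
import boto3

autoscaling = boto3.client("autoscaling")

ASG_NAMES = ["asg-1", "asg-2", "asg-3", "asg-4", "asg-5"]        # hypothetical names
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:spot-fallback"   # hypothetical topic
TAG_KEY, TAG_VALUE = "spot-fallback-group", "k8s-workers"        # hypothetical tag

for asg in ASG_NAMES:
    # Requirement 4: publish launch errors from every ASG to the same SNS topic
    autoscaling.put_notification_configuration(
        AutoScalingGroupName=asg,
        TopicARN=TOPIC_ARN,
        NotificationTypes=["autoscaling:EC2_INSTANCE_LAUNCH_ERROR"],
    )
    # Requirement 3: a unique tag so the Lambda can discover the sibling ASGs
    autoscaling.create_or_update_tags(
        Tags=[{
            "ResourceId": asg,
            "ResourceType": "auto-scaling-group",
            "Key": TAG_KEY,
            "Value": TAG_VALUE,
            "PropagateAtLaunch": False,
        }]
    )
```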

Algorithm explained:

Let’s assume there are no spot pools left in the AZs of ASG-1, ASG-2 and ASG-3.

  1. The Cluster Autoscaler requests ASG-1 to scale up by ‘n’ instances
  2. The scale-up fails because there is no spot capacity left in that AZ; ASG-1 throws an error message in its events, which triggers the SNS topic
  3. SNS triggers the custom logic through a Lambda function, which takes the following steps (sketched below):
    a. Fetch the other ASGs carrying the unique tag
    b. Check which of them has the capacity to scale up ‘n’ instances (let’s say it is ASG-2)
    c. Check whether a key has already been created in the SSM Parameter Store for this ASG (explained in #d); if yes, take no action; if no, continue with #d to #g
    d. Create a new SSM Parameter Store key named after ASG-2 (this prevents the logic from being triggered again by redundant error messages, as AWS throws multiple error messages for a single launch failure)
    e. Increase the desired count of ASG-2 by ‘n’ instances
    f. Decrease the desired count of ASG-1 by ‘n’ instances
    g. Delete the SSM parameter created in #d

Similarly, ASG-2 will fall back to ASG-3, and ASG-3 will fall back to ASG-4, which has the capacity and scales up the environment.
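Below is a minimal Lambda sketch of steps a to g. The tag key, SSM key prefix and the way ‘n’ is derived are illustrative assumptions, and the SNS parsing assumes the standard ASG notification payload, so treat this as an outline rather than a drop-in implementation:

```python
import json
import boto3

autoscaling = boto3.client("autoscaling")
ssm = boto3.client("ssm")

TAG_KEY, TAG_VALUE = "spot-fallback-group", "k8s-workers"   # hypothetical tag (requirement 3)
LOCK_PREFIX = "/spot-fallback/"                             # hypothetical SSM key prefix


def handler(event, context):
    # The ASG launch-error notification arrives as a JSON string inside the SNS envelope.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    failed_asg = message["AutoScalingGroupName"]

    # (a) Fetch the sibling ASGs that carry the shared tag.
    groups = autoscaling.describe_auto_scaling_groups()["AutoScalingGroups"]
    tagged = [
        g for g in groups
        if any(t["Key"] == TAG_KEY and t["Value"] == TAG_VALUE for t in g["Tags"])
    ]
    source = next(g for g in tagged if g["AutoScalingGroupName"] == failed_asg)
    # Rough estimate of the instances that failed to launch.
    n = source["DesiredCapacity"] - len(source["Instances"])

    # (b) Pick a sibling ASG with enough headroom to absorb 'n' more instances.
    target = next(
        (g for g in tagged
         if g["AutoScalingGroupName"] != failed_asg
         and g["MaxSize"] - g["DesiredCapacity"] >= n),
        None,
    )
    if target is None or n <= 0:
        return

    lock_key = LOCK_PREFIX + target["AutoScalingGroupName"]
    try:
        # (c) If the key already exists, another invocation is handling this failure.
        ssm.get_parameter(Name=lock_key)
        return
    except ssm.exceptions.ParameterNotFound:
        # (d) Create the key so redundant error messages are ignored.
        ssm.put_parameter(Name=lock_key, Value=failed_asg, Type="String")

    # (e) Scale the healthy ASG up and (f) the failed ASG down by 'n'.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=target["AutoScalingGroupName"],
        DesiredCapacity=target["DesiredCapacity"] + n,
    )
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=failed_asg,
        DesiredCapacity=source["DesiredCapacity"] - n,
    )

    # (g) Delete the SSM parameter created in step (d).
    ssm.delete_parameter(Name=lock_key)
```

In practice you would also want to make sure the failed ASG’s desired count never drops below its minimum size, and guard against concurrent invocations racing on the same parameter key.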

To get the best results from this algorithm (without keeping on-demand instances as the last-resort fallback), use all the AZs of the region you are working in: this increases the number of available spot pools, and your compute cost will not increase either.

Also note that one iteration of this logic through Lambda consumes:
Max memory of the Lambda function: 80 MB
Execution time: 2 seconds

Thank you!

To read the updated version of this blog, please refer to tavisca’s website: https://www.tavisca.com/blog/aws-spot-instance-launch-failure-fallback
