Amazon ECS cluster with EC2 Spot Capacity

Nathan Peck profile picture
Nathan Peck
Senior Developer Advocate at AWS

About

EC2 Spot Capacity is spare EC2 capacity that is available for less than the On-Demand price. Because Spot Instances enable you to request unused EC2 instances at steep discounts, you can lower your Amazon EC2 costs significantly. The hourly price for a Spot Instance is called a Spot price. The Spot price of each instance type in each Availability Zone is set by Amazon EC2, and is adjusted gradually based on the long-term supply of and demand for Spot Instances. Your Spot Instance runs whenever capacity is available.

Spot Capacity can be interrupted at any time. Therefore you need to be careful about what types of applications you run on Spot capacity and how you run those application. Amazon ECS is ideal if you want a more stable way to run workloads on top of interruptible Spot Capacity.

Install SAM CLI

This pattern uses AWS SAM CLI for deploying CloudFormation stacks on your account. You should follow the appropriate steps for installing SAM CLI.

Cluster Template

This cluster template demonstrates how to configure an autoscaling group to launch spot capacity with mixed types. A variety of EC2 instances of different sizes will be launched. However, Amazon ECS can gracefully handle this and adjust container density on each individual instance to match the size of that instance.

File: spot-cluster.ymlLanguage: yml
AWSTemplateFormatVersion: '2010-09-09'
Description: ECS cluster that has EC2 Spot capacity to host the containers
Parameters:
  DesiredCapacity:
    Type: Number
    Default: 0
    Description: Number of EC2 instances to launch in your ECS cluster.
  MaxSize:
    Type: Number
    Default: 100
    Description: Maximum number of EC2 instances that can be launched in your ECS cluster.
  ECSAMI:
    Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
    Default: /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id
    Description: The Amazon Machine Image ID used for the cluster, leave it as the default value to get the latest AMI
  VpcId:
    Type: AWS::EC2::VPC::Id
    Description: VPC ID where the ECS cluster is launched
  SubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of subnet IDs where the EC2 instances will be launched

Resources:
  # Cluster that keeps track of container deployments
  ECSCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterSettings:
        - Name: containerInsights
          Value: enabled

  # The config for each EC2 instance that is added to the cluster
  ContainerInstances:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        ImageId: !Ref ECSAMI
        IamInstanceProfile:
          Name: !Ref EC2InstanceProfile
        SecurityGroupIds:
          - !Ref ContainerHostSecurityGroup
        UserData:
          # This injected configuration file is how the EC2 instance
          # knows which ECS cluster on your AWS account it should be joining
          Fn::Base64: !Sub |
            #!/bin/bash
            echo "ECS_CLUSTER=${ECSCluster}" >> /etc/ecs/ecs.config
            echo "ECS_ENABLE_SPOT_INSTANCE_DRAINING=true" >> /etc/ecs/ecs.config
            echo "ECS_CONTAINER_STOP_TIMEOUT=90s" >> /etc/ecs/ecs.config
        BlockDeviceMappings:
          - DeviceName: "/dev/xvda"
            Ebs:
              VolumeSize: 50
              VolumeType: gp3
        # Disable IMDSv1, and require IMDSv2
        MetadataOptions:
          HttpEndpoint: enabled
          HttpTokens: required
  EC2InstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Path: /
      Roles:
        - !Ref EC2Role

  # Autoscaling group. This launches the actual EC2 instances that will register
  # themselves as members of the cluster, and run the docker containers.
  ECSAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    UpdatePolicy:
      AutoScalingReplacingUpdate:
        WillReplace: 'true'
    Properties:
      VPCZoneIdentifier:
        - !Select [ 0, !Ref SubnetIds ]
        - !Select [ 1, !Ref SubnetIds ]
      MinSize: 0
      MaxSize: !Ref MaxSize
      DesiredCapacity: !Ref DesiredCapacity
      NewInstancesProtectedFromScaleIn: true

      # This policy sets up rules which allow the ASG to launch
      # a variety of mixed spot instance types
      MixedInstancesPolicy:
        # Request no on demand, only spot instances
        InstancesDistribution:
          OnDemandBaseCapacity: 0
          OnDemandPercentageAboveBaseCapacity: 0
          SpotAllocationStrategy: capacity-optimized
        # Rules about what type of instances to launch
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateId: !Ref ContainerInstances
            Version: !GetAtt ContainerInstances.LatestVersionNumber
          Overrides:
            - InstanceRequirements:
                VCpuCount:
                  Min: 2
                  Max: 4
                MemoryMiB:
                  Min: 4096
                  Max: 8192
                BurstablePerformance: "excluded"
                InstanceGenerations:
                  - "current"
                CpuManufacturers:
                  - "intel"
                  - "amd"
                ExcludedInstanceTypes: ["t2*","r*","d*","g*","i*","z*","x*"]

  # Custom resource that force destroys the ASG. This cleans up EC2 instances that had
  # managed termination protection enabled, but which are not yet released.
  # This is necessary because ECS does not immediately release an EC2 instance from termination
  # protection as soon as the instance is no longer running tasks. There is a cooldown delay.
  # In the case of tearing down the CloudFormation stack, CloudFormation will delete the
  # AWS::ECS::Service and immediately move on to tearing down the AWS::ECS::Cluster, disconnecting
  # the AWS::AutoScaling::AutoScalingGroup from ECS management too fast, before ECS has a chance
  # to asynchronously turn off managed instance protection on the EC2 instances.
  # This will leave some EC2 instances stranded in a state where they are protected from scale-in forever.
  # This then blocks the AWS::AutoScaling::AutoScalingGroup from cleaning itself up.
  # The custom resource function force destroys the autoscaling group when tearing down the stack,
  # avoiding the issue of protected EC2 instances that can never be cleaned up.
  CustomAsgDestroyerFunction:
    Type: AWS::Lambda::Function
    Properties:
      Code:
        ZipFile: !Sub |
          const { AutoScalingClient, DeleteAutoScalingGroupCommand } = require("@aws-sdk/client-auto-scaling");
          const autoscaling = new AutoScalingClient({ region: '${AWS::Region}' });
          const response = require('cfn-response');

          exports.handler = async function(event, context) {
            console.log(event);

            if (event.RequestType !== "Delete") {
              await response.send(event, context, response.SUCCESS);
              return;
            }

            const input = {
              AutoScalingGroupName: '${ECSAutoScalingGroup}',
              ForceDelete: true
            };
            const command = new DeleteAutoScalingGroupCommand(input);
            const deleteResponse = await autoscaling.send(command);
            console.log(deleteResponse);

            await response.send(event, context, response.SUCCESS);
          };
      Handler: index.handler
      Runtime: nodejs20.x
      Timeout: 30
      Role: !GetAtt CustomAsgDestroyerRole.Arn

  # The role used by the ASG destroyer
  CustomAsgDestroyerRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      ManagedPolicyArns:
        # https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSLambdaBasicExecutionRole.html
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: allow-to-delete-autoscaling-group
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action: autoscaling:DeleteAutoScalingGroup
                Resource: !Sub arn:aws:autoscaling:${AWS::Region}:${AWS::AccountId}:autoScalingGroup:*:autoScalingGroupName/${ECSAutoScalingGroup}

  CustomAsgDestroyer:
    Type: Custom::AsgDestroyer
    DependsOn:
      - CapacityProviderAssociation
    Properties:
      ServiceToken: !GetAtt CustomAsgDestroyerFunction.Arn
      Region: !Ref "AWS::Region"

  # Create an ECS capacity provider to attach the ASG to the ECS cluster
  # so that it autoscales as we launch more containers
  CapacityProvider:
    Type: AWS::ECS::CapacityProvider
    Properties:
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref ECSAutoScalingGroup
        ManagedScaling:
          InstanceWarmupPeriod: 60
          MinimumScalingStepSize: 1
          MaximumScalingStepSize: 100
          Status: ENABLED
          # Percentage of cluster reservation to try to maintain
          TargetCapacity: 100
        ManagedTerminationProtection: ENABLED
        ManagedDraining: ENABLED

  # Create a cluster capacity provider assocation so that the cluster
  # will use the capacity provider
  CapacityProviderAssociation:
    Type: AWS::ECS::ClusterCapacityProviderAssociations
    Properties:
      CapacityProviders:
        - !Ref CapacityProvider
      Cluster: !Ref ECSCluster
      DefaultCapacityProviderStrategy:
        - Base: 0
          CapacityProvider: !Ref CapacityProvider
          Weight: 1

  # A security group for the EC2 hosts that will run the containers.
  # This can be used to limit incoming traffic to or outgoing traffic
  # from the container's host EC2 instance.
  ContainerHostSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Access to the EC2 hosts that run containers
      VpcId: !Ref VpcId

  # Role for the EC2 hosts. This allows the ECS agent on the EC2 hosts
  # to communciate with the ECS control plane, as well as download the docker
  # images from ECR to run on your host.
  EC2Role:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: [ec2.amazonaws.com]
            Action: ['sts:AssumeRole']
      Path: /

      # See reference: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/security-iam-awsmanpol.html#security-iam-awsmanpol-AmazonEC2ContainerServiceforEC2Role
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

  # This is a role which is used within Fargate to allow the Fargate agent
  # to download images, and upload logs.
  ECSTaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: [ecs-tasks.amazonaws.com]
            Action: ['sts:AssumeRole']
            Condition:
              ArnLike:
                aws:SourceArn: !Sub arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:*
              StringEquals:
                aws:SourceAccount: !Ref AWS::AccountId
      Path: /

      # This role enables basic features of ECS. See reference:
      # https://docs.aws.amazon.com/AmazonECS/latest/developerguide/security-iam-awsmanpol.html#security-iam-awsmanpol-AmazonECSTaskExecutionRolePolicy
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

Outputs:
  ClusterName:
    Description: The ECS cluster into which to launch resources
    Value: !Ref ECSCluster
  ECSTaskExecutionRole:
    Description: The role used to start up a task
    Value: !Ref ECSTaskExecutionRole
  CapacityProvider:
    Description: The cluster capacity provider that the service should use
                 to request capacity when it wants to start up a task
    Value: !Ref CapacityProvider

A couple important things to note in this template:

  • ContainerInstances.Properties.UserData - The ecs.config file set configuration that is used by the ECS agent. In addition to the basic info of which cluster to join, this template also sets up automatic task draining whenever a Spot termination notice is sent to the instance. It also modifies the stop timeout period to 90 seconds, so as to be less than the Spot termination window.
  • ECSAutoScalingGroup.Properites.MixedInstancesPolicy - This autoscaling group policy allows the launch of mixed EC2 types, entirely on Spot, with no on-demand instances.

This template requires two input parameters:

  • VpcId - The ID of a VPC on your AWS account. This can be the default VPC
  • SubnetIds - A comma separated list of subnet ID's within that VPC

Additionally you can modify the following parameters:

  • DesiredCapacity - Number of EC2 instances to start with. Default 0
  • MaxSize - An upper limit on number of EC2 instances to scale up to. Default 100
  • ECSAMI - The Amazon Machine Image to use for each EC2 instance. Don't change this unless you really know what you are doing.

Service Template

This template deploys an ECS service that uses the Spot capacity provider. ECS will automatically adapt to launch as many mixed EC2 instances as necessary to host the service tasks.

File: service.ymlLanguage: yml
AWSTemplateFormatVersion: '2010-09-09'
Description: An example service that deploys onto EC2 capacity with
             a capacity provider strategy that autoscales the underlying
             EC2 Capacity as needed by the service

Parameters:
  VpcId:
    Type: String
    Description: The VPC that the service is running inside of
  SubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of subnet IDs the AWS VPC tasks are inside of
  ClusterName:
    Type: String
    Description: The name of the ECS cluster into which to launch capacity.
  ECSTaskExecutionRole:
    Type: String
    Description: The role used to start up an ECS task
  CapacityProvider:
    Type: String
    Description: The cluster capacity provider that the service should use
                 to request capacity when it wants to start up a task
  ServiceName:
    Type: String
    Default: example-service
    Description: A name for the service
  ImageUrl:
    Type: String
    Default: public.ecr.aws/docker/library/busybox:latest
    Description: The url of a docker image that contains the application process that
                 will handle the traffic for this service
  ContainerCpu:
    Type: Number
    Default: 256
    Description: How much CPU to give the container. 1024 is 1 CPU
  ContainerMemory:
    Type: Number
    Default: 512
    Description: How much memory in megabytes to give the container
  Command:
    Type: String
    Default: sleep 3600
    Description: The command to run inside of the container
  DesiredCount:
    Type: Number
    Default: 0
    Description: How many copies of the service task to run

Resources:

  # The task definition. This is a simple metadata description of what
  # container to run, and what resource requirements it has.
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Ref ServiceName
      Cpu: !Ref ContainerCpu
      Memory: !Ref ContainerMemory
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - EC2
      ExecutionRoleArn: !Ref ECSTaskExecutionRole
      ContainerDefinitions:
        - Name: !Ref ServiceName
          Cpu: !Ref ContainerCpu
          Memory: !Ref ContainerMemory
          Image: !Ref ImageUrl
          Command: !Split [' ', !Ref 'Command']
          LogConfiguration:
            LogDriver: 'awslogs'
            Options:
              mode: non-blocking
              max-buffer-size: 25m
              awslogs-group: !Ref LogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Ref ServiceName

  # The service. The service is a resource which allows you to run multiple
  # copies of a type of task, and gather up their logs and metrics, as well
  # as monitor the number of running tasks and replace any that have crashed
  Service:
    Type: AWS::ECS::Service
    Properties:
      ServiceName: !Ref ServiceName
      Cluster: !Ref ClusterName
      PlacementStrategies:
        - Field: attribute:ecs.availability-zone
          Type: spread
        - Field: cpu
          Type: binpack
      CapacityProviderStrategy:
        - Base: 0
          CapacityProvider: !Ref CapacityProvider
          Weight: 1
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 75
      DesiredCount: !Ref DesiredCount
      NetworkConfiguration:
        AwsvpcConfiguration:
          SecurityGroups:
            - !Ref ServiceSecurityGroup
          Subnets:
            - !Select [ 0, !Ref SubnetIds ]
            - !Select [ 1, !Ref SubnetIds ]
      TaskDefinition: !Ref TaskDefinition

  # Because we are launching tasks in AWS VPC networking mode
  # the tasks themselves also have an extra security group that is unique
  # to them. This is a unique security group just for this service,
  # to control which things it can talk to, and who can talk to it
  ServiceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: !Sub Access to service ${ServiceName}
      VpcId: !Ref VpcId

  # This log group stores the stdout logs from this service's containers
  LogGroup:
    Type: AWS::Logs::LogGroup

Most parameters in this stack will be supplied by a parent stack that passes in resources from the capacity provider stack. However you may be interested in overriding the following parameters:

  • ServiceName - A human name for the service.
  • ImageUrl - URL of a container image to run. By default this stack deploys public.ecr.aws/docker/library/busybox:latest
  • ContainerCpu - CPU shares, where 1024 CPU is 1 vCPU. Default: 256 (1/4th vCPU)
  • ContainerMemory - Megabytes of memory to give the conatiner. Default 512
  • Command - Command to run in the container. Default: sleep 3600
  • DesiredCount - Number of copies of the container to run. Default: 0 (So you can test scaling up from zero)

Parent Stack

This stack deploys both stacks as nested stacks, for ease of grouping and passing parameters from one stack to the next.

File: parent.ymlLanguage: yml
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: Parent stack that deploys the ECS cluster and capacity provider
             then launches a service inside of the cluster

Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id
    Description: VPC ID where the ECS cluster is launched
  SubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of subnet IDs where the EC2 instances will be launched

Resources:

  # This stack contains cluster wide resources that will be shared
  # by all services that get launched in the stack
  BaseStack:
    Type: AWS::Serverless::Application
    Properties:
      Location: spot-cluster.yml
      Parameters:
        VpcId: !Ref VpcId
        SubnetIds: !Join [',', !Ref SubnetIds]

  # This service will be launched into the cluster by passing
  # details from the base stack into the service stack
  Service:
    Type: AWS::Serverless::Application
    Properties:
      Location: service.yml
      Parameters:
        VpcId: !Ref VpcId
        SubnetIds: !Join [',', !Ref SubnetIds]
        ClusterName: !GetAtt BaseStack.Outputs.ClusterName
        ECSTaskExecutionRole: !GetAtt BaseStack.Outputs.ECSTaskExecutionRole
        CapacityProvider: !GetAtt BaseStack.Outputs.CapacityProvider

This parent stack requires the following parameters:

  • VpcId - The ID of a VPC on your AWS account. This can be the default VPC
  • SubnetIds - A comma separated list of subnet ID's within that VPC

Usage

First deploy the cluster and spot capacity autoscaling group:

Language: sh
sam deploy \
  --template-file parent.yml \
  --stack-name spot-capacity-provider-environment \
  --resolve-s3 \
  --capabilities CAPABILITY_IAM \
  --parameter-overrides VpcId=vpc-79508710 SubnetIds=subnet-b4676dfe,subnet-c71ebfae

Test it out

The service is initially deployed with a desired count of 0. Use the Amazon ECS console to update the service and scale it up to a larger number of tasks. After an initial delay you will observe the ECS Capacity Provider request instances from the autoscaling group. The autoscaling group will fullfil the request by launching a mixture of EC2 instances based on current Spot market availability and pricing. As the instances join the ECS cluster they will be filled with service tasks.

See Also