CloudFormation template that demonstrates setting up an EC2 Spot capacity provider to supply compute for containers in the cluster
Nathan Peck
Senior Developer Advocate at AWS
About
EC2 Spot Capacity is spare EC2 capacity that is available for less than the On-Demand price. Because Spot Instances enable you to request unused EC2 instances at steep discounts, you can lower your Amazon EC2 costs significantly. The hourly price for a Spot Instance is called a Spot price. The Spot price of each instance type in each Availability Zone is set by Amazon EC2, and is adjusted gradually based on the long-term supply of and demand for Spot Instances. Your Spot Instance runs whenever capacity is available.
Spot capacity can be interrupted at any time, so you need to be careful about which types of applications you run on it and how you run them. Amazon ECS is ideal if you want a more stable way to run workloads on top of interruptible Spot capacity.
Install SAM CLI
This pattern uses AWS SAM CLI for deploying CloudFormation stacks on your account.
You should follow the appropriate steps for installing SAM CLI.
Cluster Template
This cluster template demonstrates how to configure an autoscaling group to launch spot capacity
with mixed types. A variety of EC2 instances of different sizes will be launched.
However, Amazon ECS can gracefully handle this and adjust container density on each individual instance to match the size of that instance.
```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: ECS cluster that has EC2 Spot capacity to host the containers
Parameters:
  DesiredCapacity:
    Type: Number
    Default: 0
    Description: Number of EC2 instances to launch in your ECS cluster.
  MaxSize:
    Type: Number
    Default: 100
    Description: Maximum number of EC2 instances that can be launched in your ECS cluster.
  ECSAMI:
    Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
    Default: /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id
    Description: The Amazon Machine Image ID used for the cluster, leave it as the default value to get the latest AMI
  VpcId:
    Type: AWS::EC2::VPC::Id
    Description: VPC ID where the ECS cluster is launched
  SubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of subnet IDs where the EC2 instances will be launched

Resources:
  # Cluster that keeps track of container deployments
  ECSCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterSettings:
        - Name: containerInsights
          Value: enabled

  # The config for each EC2 instance that is added to the cluster
  ContainerInstances:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        ImageId: !Ref ECSAMI
        IamInstanceProfile:
          Name: !Ref EC2InstanceProfile
        SecurityGroupIds:
          - !Ref ContainerHostSecurityGroup
        UserData:
          # This injected configuration file is how the EC2 instance
          # knows which ECS cluster on your AWS account it should be joining
          Fn::Base64: !Sub |
            #!/bin/bash
            echo "ECS_CLUSTER=${ECSCluster}" >> /etc/ecs/ecs.config
            echo "ECS_ENABLE_SPOT_INSTANCE_DRAINING=true" >> /etc/ecs/ecs.config
            echo "ECS_CONTAINER_STOP_TIMEOUT=90s" >> /etc/ecs/ecs.config
        BlockDeviceMappings:
          - DeviceName: "/dev/xvda"
            Ebs:
              VolumeSize: 50
              VolumeType: gp3
        # Disable IMDSv1, and require IMDSv2
        MetadataOptions:
          HttpEndpoint: enabled
          HttpTokens: required

  EC2InstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Path: /
      Roles:
        - !Ref EC2Role

  # Autoscaling group. This launches the actual EC2 instances that will register
  # themselves as members of the cluster, and run the docker containers.
  ECSAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    UpdatePolicy:
      AutoScalingReplacingUpdate:
        WillReplace: 'true'
    Properties:
      VPCZoneIdentifier:
        - !Select [ 0, !Ref SubnetIds ]
        - !Select [ 1, !Ref SubnetIds ]
      MinSize: 0
      MaxSize: !Ref MaxSize
      DesiredCapacity: !Ref DesiredCapacity
      NewInstancesProtectedFromScaleIn: true
      # This policy sets up rules which allow the ASG to launch
      # a variety of mixed spot instance types
      MixedInstancesPolicy:
        # Request no on demand, only spot instances
        InstancesDistribution:
          OnDemandBaseCapacity: 0
          OnDemandPercentageAboveBaseCapacity: 0
          SpotAllocationStrategy: capacity-optimized
        # Rules about what type of instances to launch
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateId: !Ref ContainerInstances
            Version: !GetAtt ContainerInstances.LatestVersionNumber
          Overrides:
            - InstanceRequirements:
                VCpuCount:
                  Min: 2
                  Max: 4
                MemoryMiB:
                  Min: 4096
                  Max: 8192
                BurstablePerformance: "excluded"
                InstanceGenerations:
                  - "current"
                CpuManufacturers:
                  - "intel"
                  - "amd"
                ExcludedInstanceTypes: ["t2*", "r*", "d*", "g*", "i*", "z*", "x*"]

  # Custom resource that force destroys the ASG. This cleans up EC2 instances that had
  # managed termination protection enabled, but which are not yet released.
  # This is necessary because ECS does not immediately release an EC2 instance from termination
  # protection as soon as the instance is no longer running tasks. There is a cooldown delay.
  # In the case of tearing down the CloudFormation stack, CloudFormation will delete the
  # AWS::ECS::Service and immediately move on to tearing down the AWS::ECS::Cluster, disconnecting
  # the AWS::AutoScaling::AutoScalingGroup from ECS management too fast, before ECS has a chance
  # to asynchronously turn off managed instance protection on the EC2 instances.
  # This will leave some EC2 instances stranded in a state where they are protected from scale-in forever.
  # This then blocks the AWS::AutoScaling::AutoScalingGroup from cleaning itself up.
  # The custom resource function force destroys the autoscaling group when tearing down the stack,
  # avoiding the issue of protected EC2 instances that can never be cleaned up.
  CustomAsgDestroyerFunction:
    Type: AWS::Lambda::Function
    Properties:
      Code:
        ZipFile: !Sub |
          const { AutoScalingClient, DeleteAutoScalingGroupCommand } = require("@aws-sdk/client-auto-scaling");
          const autoscaling = new AutoScalingClient({ region: '${AWS::Region}' });
          const response = require('cfn-response');

          exports.handler = async function(event, context) {
            console.log(event);

            if (event.RequestType !== "Delete") {
              await response.send(event, context, response.SUCCESS);
              return;
            }

            const input = {
              AutoScalingGroupName: '${ECSAutoScalingGroup}',
              ForceDelete: true
            };
            const command = new DeleteAutoScalingGroupCommand(input);
            const deleteResponse = await autoscaling.send(command);
            console.log(deleteResponse);

            await response.send(event, context, response.SUCCESS);
          };
      Handler: index.handler
      Runtime: nodejs20.x
      Timeout: 30
      Role: !GetAtt CustomAsgDestroyerRole.Arn

  # The role used by the ASG destroyer
  CustomAsgDestroyerRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      ManagedPolicyArns:
        # https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSLambdaBasicExecutionRole.html
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: allow-to-delete-autoscaling-group
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action: autoscaling:DeleteAutoScalingGroup
                Resource: !Sub arn:aws:autoscaling:${AWS::Region}:${AWS::AccountId}:autoScalingGroup:*:autoScalingGroupName/${ECSAutoScalingGroup}

  CustomAsgDestroyer:
    Type: Custom::AsgDestroyer
    DependsOn:
      - CapacityProviderAssociation
    Properties:
      ServiceToken: !GetAtt CustomAsgDestroyerFunction.Arn
      Region: !Ref "AWS::Region"

  # Create an ECS capacity provider to attach the ASG to the ECS cluster
  # so that it autoscales as we launch more containers
  CapacityProvider:
    Type: AWS::ECS::CapacityProvider
    Properties:
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref ECSAutoScalingGroup
        ManagedScaling:
          InstanceWarmupPeriod: 60
          MinimumScalingStepSize: 1
          MaximumScalingStepSize: 100
          Status: ENABLED
          # Percentage of cluster reservation to try to maintain
          TargetCapacity: 100
        ManagedTerminationProtection: ENABLED
        ManagedDraining: ENABLED

  # Create a cluster capacity provider association so that the cluster
  # will use the capacity provider
  CapacityProviderAssociation:
    Type: AWS::ECS::ClusterCapacityProviderAssociations
    Properties:
      CapacityProviders:
        - !Ref CapacityProvider
      Cluster: !Ref ECSCluster
      DefaultCapacityProviderStrategy:
        - Base: 0
          CapacityProvider: !Ref CapacityProvider
          Weight: 1

  # A security group for the EC2 hosts that will run the containers.
  # This can be used to limit incoming traffic to or outgoing traffic
  # from the container's host EC2 instance.
  ContainerHostSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Access to the EC2 hosts that run containers
      VpcId: !Ref VpcId

  # Role for the EC2 hosts. This allows the ECS agent on the EC2 hosts
  # to communicate with the ECS control plane, as well as download the docker
  # images from ECR to run on your host.
  EC2Role:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: [ec2.amazonaws.com]
            Action: ['sts:AssumeRole']
      Path: /
      # See reference: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/security-iam-awsmanpol.html#security-iam-awsmanpol-AmazonEC2ContainerServiceforEC2Role
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

  # This is a role which is used by the ECS agent
  # to download images, and upload logs.
  ECSTaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: [ecs-tasks.amazonaws.com]
            Action: ['sts:AssumeRole']
            Condition:
              ArnLike:
                aws:SourceArn: !Sub arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:*
              StringEquals:
                aws:SourceAccount: !Ref AWS::AccountId
      Path: /
      # This role enables basic features of ECS. See reference:
      # https://docs.aws.amazon.com/AmazonECS/latest/developerguide/security-iam-awsmanpol.html#security-iam-awsmanpol-AmazonECSTaskExecutionRolePolicy
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

Outputs:
  ClusterName:
    Description: The ECS cluster into which to launch resources
    Value: !Ref ECSCluster
  ECSTaskExecutionRole:
    Description: The role used to start up a task
    Value: !Ref ECSTaskExecutionRole
  CapacityProvider:
    Description: The cluster capacity provider that the service should use
      to request capacity when it wants to start up a task
    Value: !Ref CapacityProvider
```
A couple of important things to note in this template:
ContainerInstances.Properties.UserData - The user data script writes the ecs.config file that
configures the ECS agent. In addition to the basic info of which cluster to join, this template
also sets up automatic task draining whenever a Spot termination notice is sent to the instance.
It also sets the container stop timeout to 90 seconds, so that it is shorter than the two-minute Spot interruption notice.
ECSAutoScalingGroup.Properties.MixedInstancesPolicy - This autoscaling group policy allows the
launch of mixed EC2 instance types, entirely on Spot, with no On-Demand instances.
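To make the effect of the user data script concrete, here is a minimal sketch that replays the same three `echo` lines against a scratch file instead of `/etc/ecs/ecs.config`. The cluster name `example-cluster` is a hypothetical stand-in for the value that CloudFormation substitutes for `${ECSCluster}`, and the 120 second figure is the standard two-minute Spot interruption notice.

```shell
#!/bin/bash
# Sketch only: writes to a temporary file, not to the real agent config.
CLUSTER_NAME="example-cluster"   # hypothetical; CloudFormation fills in ${ECSCluster}
ECS_CONFIG="$(mktemp)"           # stand-in for /etc/ecs/ecs.config

echo "ECS_CLUSTER=${CLUSTER_NAME}" >> "$ECS_CONFIG"
echo "ECS_ENABLE_SPOT_INSTANCE_DRAINING=true" >> "$ECS_CONFIG"
echo "ECS_CONTAINER_STOP_TIMEOUT=90s" >> "$ECS_CONFIG"

# Show the file the ECS agent would read on startup.
cat "$ECS_CONFIG"

# Compare the stop timeout against the two-minute Spot interruption notice.
SPOT_NOTICE_SECONDS=120
STOP_TIMEOUT_SECONDS=90
echo "headroom: $(( SPOT_NOTICE_SECONDS - STOP_TIMEOUT_SECONDS ))s"
```

On a running container instance, the real file can be inspected the same way with `cat /etc/ecs/ecs.config`.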
This template requires two input parameters:
VpcId - The ID of a VPC on your AWS account. This can be the default VPC
SubnetIds - A comma-separated list of at least two subnet IDs within that VPC (the template uses the first two)
Additionally you can modify the following parameters:
DesiredCapacity - Number of EC2 instances to start with. Default 0
MaxSize - An upper limit on number of EC2 instances to scale up to. Default 100
ECSAMI - The Amazon Machine Image to use for each EC2 instance. Don’t change this unless you really know what you are doing.
Service Template
This template deploys an ECS service that uses the Spot capacity provider. ECS will automatically
adapt to launch as many mixed EC2 instances as necessary to host the service tasks.
```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: An example service that deploys onto EC2 capacity with
  a capacity provider strategy that autoscales the underlying
  EC2 Capacity as needed by the service
Parameters:
  VpcId:
    Type: String
    Description: The VPC that the service is running inside of
  SubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of subnet IDs the AWS VPC tasks are inside of
  ClusterName:
    Type: String
    Description: The name of the ECS cluster into which to launch capacity.
  ECSTaskExecutionRole:
    Type: String
    Description: The role used to start up an ECS task
  CapacityProvider:
    Type: String
    Description: The cluster capacity provider that the service should use
      to request capacity when it wants to start up a task
  ServiceName:
    Type: String
    Default: example-service
    Description: A name for the service
  ImageUrl:
    Type: String
    Default: public.ecr.aws/docker/library/busybox:latest
    Description: The url of a docker image that contains the application process that
      will handle the traffic for this service
  ContainerCpu:
    Type: Number
    Default: 256
    Description: How much CPU to give the container. 1024 is 1 CPU
  ContainerMemory:
    Type: Number
    Default: 512
    Description: How much memory in megabytes to give the container
  Command:
    Type: String
    Default: sleep 3600
    Description: The command to run inside of the container
  DesiredCount:
    Type: Number
    Default: 0
    Description: How many copies of the service task to run

Resources:
  # The task definition. This is a simple metadata description of what
  # container to run, and what resource requirements it has.
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Ref ServiceName
      Cpu: !Ref ContainerCpu
      Memory: !Ref ContainerMemory
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - EC2
      ExecutionRoleArn: !Ref ECSTaskExecutionRole
      ContainerDefinitions:
        - Name: !Ref ServiceName
          Cpu: !Ref ContainerCpu
          Memory: !Ref ContainerMemory
          Image: !Ref ImageUrl
          Command: !Split [' ', !Ref 'Command']
          LogConfiguration:
            LogDriver: 'awslogs'
            Options:
              mode: non-blocking
              max-buffer-size: 25m
              awslogs-group: !Ref LogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Ref ServiceName

  # The service. The service is a resource which allows you to run multiple
  # copies of a type of task, and gather up their logs and metrics, as well
  # as monitor the number of running tasks and replace any that have crashed
  Service:
    Type: AWS::ECS::Service
    Properties:
      ServiceName: !Ref ServiceName
      Cluster: !Ref ClusterName
      PlacementStrategies:
        - Field: attribute:ecs.availability-zone
          Type: spread
        - Field: cpu
          Type: binpack
      CapacityProviderStrategy:
        - Base: 0
          CapacityProvider: !Ref CapacityProvider
          Weight: 1
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 75
      DesiredCount: !Ref DesiredCount
      NetworkConfiguration:
        AwsvpcConfiguration:
          SecurityGroups:
            - !Ref ServiceSecurityGroup
          Subnets:
            - !Select [ 0, !Ref SubnetIds ]
            - !Select [ 1, !Ref SubnetIds ]
      TaskDefinition: !Ref TaskDefinition

  # Because we are launching tasks in AWS VPC networking mode
  # the tasks themselves also have an extra security group that is unique
  # to them. This is a unique security group just for this service,
  # to control which things it can talk to, and who can talk to it
  ServiceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: !Sub Access to service ${ServiceName}
      VpcId: !Ref VpcId

  # This log group stores the stdout logs from this service's containers
  LogGroup:
    Type: AWS::Logs::LogGroup
```
Most parameters in this stack will be supplied by a parent stack that passes in
resources from the capacity provider stack. However, you may be interested
in overriding the following parameters:
ServiceName - A human name for the service.
ImageUrl - URL of a container image to run. By default this stack deploys public.ecr.aws/docker/library/busybox:latest
ContainerCpu - CPU units, where 1024 units is 1 vCPU. Default: 256 (1/4 vCPU)
ContainerMemory - Megabytes of memory to give the container. Default: 512
Command - Command to run in the container. Default: sleep 3600
DesiredCount - Number of copies of the container to run. Default: 0 (So you can test scaling up from zero)
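As a rough illustration of how these defaults interact with the mixed instance sizes in the cluster template, the sketch below computes an upper bound on tasks per instance from CPU and memory reservations alone. The helper name `tasks_that_fit` is hypothetical. The real number will usually be lower: the memory an instance registers with ECS is somewhat less than its nominal size, and in `awsvpc` network mode each task needs its own network interface, so the instance's network interface limit can cap density well below this bound.

```shell
#!/bin/bash
# Upper bound on tasks per instance from CPU and memory reservations alone.
# Usage: tasks_that_fit <vcpus> <memory-mib> <task-cpu-units> <task-memory-mib>
tasks_that_fit() {
  local by_cpu=$(( $1 * 1024 / $3 ))   # 1 vCPU = 1024 CPU units
  local by_mem=$(( $2 / $4 ))
  # The scarcer resource decides how many tasks fit.
  if [ "$by_cpu" -lt "$by_mem" ]; then echo "$by_cpu"; else echo "$by_mem"; fi
}

tasks_that_fit 2 4096 256 512   # smallest instance shape allowed by the ASG: prints 8
tasks_that_fit 4 8192 256 512   # largest instance shape allowed by the ASG: prints 16
```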
Parent Stack
This stack deploys both stacks as nested stacks, for ease of grouping and
passing parameters from one stack to the next.
```yaml
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: Parent stack that deploys the ECS cluster and capacity provider
  then launches a service inside of the cluster
Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id
    Description: VPC ID where the ECS cluster is launched
  SubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of subnet IDs where the EC2 instances will be launched

Resources:
  # This stack contains cluster wide resources that will be shared
  # by all services that get launched in the stack
  BaseStack:
    Type: AWS::Serverless::Application
    Properties:
      Location: spot-cluster.yml
      Parameters:
        VpcId: !Ref VpcId
        SubnetIds: !Join [',', !Ref SubnetIds]

  # This service will be launched into the cluster by passing
  # details from the base stack into the service stack
  Service:
    Type: AWS::Serverless::Application
    Properties:
      Location: service.yml
      Parameters:
        VpcId: !Ref VpcId
        SubnetIds: !Join [',', !Ref SubnetIds]
        ClusterName: !GetAtt BaseStack.Outputs.ClusterName
        ECSTaskExecutionRole: !GetAtt BaseStack.Outputs.ECSTaskExecutionRole
        CapacityProvider: !GetAtt BaseStack.Outputs.CapacityProvider
```
This parent stack requires the following parameters:
VpcId - The ID of a VPC on your AWS account. This can be the default VPC
SubnetIds - A comma-separated list of at least two subnet IDs within that VPC (the templates use the first two)
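If you do not have these IDs to hand, one way to look them up is with the AWS CLI. This is a sketch under assumptions: the function names `default_vpc_id` and `first_two_subnets` are hypothetical helpers, the CLI is configured for the target region, and the account still has a default VPC. The second helper simply takes the first two subnets returned, which are not guaranteed to be in different Availability Zones.

```shell
#!/bin/bash
# Print the ID of the default VPC in the currently configured region.
default_vpc_id() {
  aws ec2 describe-vpcs \
    --filters Name=isDefault,Values=true \
    --query 'Vpcs[0].VpcId' \
    --output text
}

# Print the first two subnet IDs of a VPC as a comma separated list.
# Usage: first_two_subnets <vpc-id>
first_two_subnets() {
  aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=$1" \
    --query 'Subnets[0:2].SubnetId' \
    --output text | tr '\t' ','   # text output is tab separated
}
```

Calling `default_vpc_id` and then `first_two_subnets` with its result yields values in the form expected by the VpcId and SubnetIds parameters.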
Usage
First deploy the cluster and spot capacity autoscaling group:
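A deployment along the following lines should work. It assumes the cluster and service templates are saved as `spot-cluster.yml` and `service.yml` (the names the parent stack refers to) and the parent template as `parent.yml`, which is an assumed file name. The `deploy_stack` helper and its default stack name are hypothetical, and `CAPABILITY_AUTO_EXPAND` is included on the assumption that the nested applications need it.

```shell
#!/bin/bash
# Deploy the parent stack, which in turn deploys the cluster and service stacks.
# Usage: deploy_stack <vpc-id> <subnet-id,subnet-id> [stack-name]
deploy_stack() {
  local vpc_id="$1"
  local subnet_ids="$2"
  local stack_name="${3:-ecs-spot-capacity-example}"   # hypothetical default name

  sam deploy \
    --template-file parent.yml \
    --stack-name "$stack_name" \
    --resolve-s3 \
    --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND \
    --parameter-overrides "VpcId=$vpc_id" "SubnetIds=$subnet_ids"
}
```

For example, `deploy_stack vpc-REPLACE_ME subnet-REPLACE_ME_1,subnet-REPLACE_ME_2`, with the placeholders replaced by IDs from your own account.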
The service is initially deployed with a desired count of 0. Use the Amazon ECS console to update the service and scale it up to a larger number of tasks. After an initial delay you will observe the ECS capacity provider request instances from the autoscaling group. The autoscaling group will fulfill the request by launching a mixture of EC2 instances based on current Spot market availability and pricing. As the instances join the ECS cluster they will be filled with service tasks.
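The same scale-up can also be done from the AWS CLI instead of the console. The sketch below assumes the default service name `example-service` from the service template. The cluster name is generated by CloudFormation, so it has to be looked up first, for example with `aws ecs list-clusters`. The helper names `scale_service` and `list_instances` are hypothetical.

```shell
#!/bin/bash
# Set the desired task count of the service.
# Usage: scale_service <cluster-name> <count> [service-name]
scale_service() {
  local cluster="$1"
  local count="$2"
  local service="${3:-example-service}"   # default ServiceName in the service template

  aws ecs update-service \
    --cluster "$cluster" \
    --service "$service" \
    --desired-count "$count"
}

# List the container instances that have registered with the cluster.
# Usage: list_instances <cluster-name>
list_instances() {
  aws ecs list-container-instances --cluster "$1"
}
```

After running something like `scale_service <cluster-name> 10`, repeated calls to `list_instances <cluster-name>` should show Spot instances registering as the capacity provider scales the autoscaling group out.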