Amazon ECS cluster with EC2 Spot Capacity
About
EC2 Spot Capacity is spare EC2 capacity that is available for less than the On-Demand price. Because Spot Instances enable you to request unused EC2 instances at steep discounts, you can lower your Amazon EC2 costs significantly. The hourly price for a Spot Instance is called a Spot price. The Spot price of each instance type in each Availability Zone is set by Amazon EC2, and is adjusted gradually based on the long-term supply of and demand for Spot Instances. Your Spot Instance runs whenever capacity is available.
Spot Capacity can be interrupted at any time. Therefore, you need to be careful about what types of applications you run on Spot capacity and how you run those applications. Amazon ECS is ideal if you want a more stable way to run workloads on top of interruptible Spot Capacity.
Install SAM CLI
This pattern uses AWS SAM CLI for deploying CloudFormation stacks on your account. You should follow the appropriate steps for installing SAM CLI.
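Before deploying, it can help to confirm that the SAM CLI is actually on your PATH. The following is a minimal sketch, assuming a POSIX shell; check_tool is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# check_tool is a hypothetical helper: it reports whether a command is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: not found" >&2
    return 1
  fi
}

# Example usage (uncomment to run):
# check_tool sam && sam --version
```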
Cluster Template
This cluster template demonstrates how to configure an autoscaling group to launch Spot capacity with mixed instance types. A variety of EC2 instances of different sizes will be launched, but Amazon ECS handles this gracefully by adjusting container density on each individual instance to match the size of that instance.
AWSTemplateFormatVersion: '2010-09-09'
Description: ECS cluster that has EC2 Spot capacity to host the containers
Parameters:
  DesiredCapacity:
    Type: Number
    Default: 0
    Description: Number of EC2 instances to launch in your ECS cluster.
  MaxSize:
    Type: Number
    Default: 100
    Description: Maximum number of EC2 instances that can be launched in your ECS cluster.
  ECSAMI:
    Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
    Default: /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id
    Description: The Amazon Machine Image ID used for the cluster, leave it as the default value to get the latest AMI
  VpcId:
    Type: AWS::EC2::VPC::Id
    Description: VPC ID where the ECS cluster is launched
  SubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of subnet IDs where the EC2 instances will be launched
Resources:
  # Cluster that keeps track of container deployments
  ECSCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterSettings:
        - Name: containerInsights
          Value: enabled
  # The config for each EC2 instance that is added to the cluster
  ContainerInstances:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        ImageId: !Ref ECSAMI
        IamInstanceProfile:
          Name: !Ref EC2InstanceProfile
        SecurityGroupIds:
          - !Ref ContainerHostSecurityGroup
        UserData:
          # This injected configuration file is how the EC2 instance
          # knows which ECS cluster on your AWS account it should be joining
          Fn::Base64: !Sub |
            #!/bin/bash
            echo "ECS_CLUSTER=${ECSCluster}" >> /etc/ecs/ecs.config
            echo "ECS_ENABLE_SPOT_INSTANCE_DRAINING=true" >> /etc/ecs/ecs.config
            echo "ECS_CONTAINER_STOP_TIMEOUT=90s" >> /etc/ecs/ecs.config
        BlockDeviceMappings:
          - DeviceName: "/dev/xvda"
            Ebs:
              VolumeSize: 50
              VolumeType: gp3
        # Disable IMDSv1, and require IMDSv2
        MetadataOptions:
          HttpEndpoint: enabled
          HttpTokens: required
  EC2InstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Path: /
      Roles:
        - !Ref EC2Role
  # Autoscaling group. This launches the actual EC2 instances that will register
  # themselves as members of the cluster, and run the docker containers.
  ECSAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    UpdatePolicy:
      AutoScalingReplacingUpdate:
        WillReplace: 'true'
    Properties:
      VPCZoneIdentifier:
        - !Select [ 0, !Ref SubnetIds ]
        - !Select [ 1, !Ref SubnetIds ]
      MinSize: 0
      MaxSize: !Ref MaxSize
      DesiredCapacity: !Ref DesiredCapacity
      NewInstancesProtectedFromScaleIn: true
      # This policy sets up rules which allow the ASG to launch
      # a variety of mixed spot instance types
      MixedInstancesPolicy:
        # Request no on demand, only spot instances
        InstancesDistribution:
          OnDemandBaseCapacity: 0
          OnDemandPercentageAboveBaseCapacity: 0
          SpotAllocationStrategy: capacity-optimized
        # Rules about what type of instances to launch
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateId: !Ref ContainerInstances
            Version: !GetAtt ContainerInstances.LatestVersionNumber
          Overrides:
            - InstanceRequirements:
                VCpuCount:
                  Min: 2
                  Max: 4
                MemoryMiB:
                  Min: 4096
                  Max: 8192
                BurstablePerformance: "excluded"
                InstanceGenerations:
                  - "current"
                CpuManufacturers:
                  - "intel"
                  - "amd"
                ExcludedInstanceTypes: ["t2*","r*","d*","g*","i*","z*","x*"]
  # Custom resource that force destroys the ASG. This cleans up EC2 instances that had
  # managed termination protection enabled, but which are not yet released.
  # This is necessary because ECS does not immediately release an EC2 instance from termination
  # protection as soon as the instance is no longer running tasks. There is a cooldown delay.
  # In the case of tearing down the CloudFormation stack, CloudFormation will delete the
  # AWS::ECS::Service and immediately move on to tearing down the AWS::ECS::Cluster, disconnecting
  # the AWS::AutoScaling::AutoScalingGroup from ECS management too fast, before ECS has a chance
  # to asynchronously turn off managed instance protection on the EC2 instances.
  # This will leave some EC2 instances stranded in a state where they are protected from scale-in forever.
  # This then blocks the AWS::AutoScaling::AutoScalingGroup from cleaning itself up.
  # The custom resource function force destroys the autoscaling group when tearing down the stack,
  # avoiding the issue of protected EC2 instances that can never be cleaned up.
  CustomAsgDestroyerFunction:
    Type: AWS::Lambda::Function
    Properties:
      Code:
        ZipFile: !Sub |
          const { AutoScalingClient, DeleteAutoScalingGroupCommand } = require("@aws-sdk/client-auto-scaling");
          const autoscaling = new AutoScalingClient({ region: '${AWS::Region}' });
          const response = require('cfn-response');
          exports.handler = async function(event, context) {
            console.log(event);
            if (event.RequestType !== "Delete") {
              await response.send(event, context, response.SUCCESS);
              return;
            }
            const input = {
              AutoScalingGroupName: '${ECSAutoScalingGroup}',
              ForceDelete: true
            };
            const command = new DeleteAutoScalingGroupCommand(input);
            const deleteResponse = await autoscaling.send(command);
            console.log(deleteResponse);
            await response.send(event, context, response.SUCCESS);
          };
      Handler: index.handler
      Runtime: nodejs20.x
      Timeout: 30
      Role: !GetAtt CustomAsgDestroyerRole.Arn
  # The role used by the ASG destroyer
  CustomAsgDestroyerRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      ManagedPolicyArns:
        # https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSLambdaBasicExecutionRole.html
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: allow-to-delete-autoscaling-group
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action: autoscaling:DeleteAutoScalingGroup
                Resource: !Sub arn:aws:autoscaling:${AWS::Region}:${AWS::AccountId}:autoScalingGroup:*:autoScalingGroupName/${ECSAutoScalingGroup}
  CustomAsgDestroyer:
    Type: Custom::AsgDestroyer
    DependsOn:
      - CapacityProviderAssociation
    Properties:
      ServiceToken: !GetAtt CustomAsgDestroyerFunction.Arn
      Region: !Ref "AWS::Region"
  # Create an ECS capacity provider to attach the ASG to the ECS cluster
  # so that it autoscales as we launch more containers
  CapacityProvider:
    Type: AWS::ECS::CapacityProvider
    Properties:
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref ECSAutoScalingGroup
        ManagedScaling:
          InstanceWarmupPeriod: 60
          MinimumScalingStepSize: 1
          MaximumScalingStepSize: 100
          Status: ENABLED
          # Percentage of cluster reservation to try to maintain
          TargetCapacity: 100
        ManagedTerminationProtection: ENABLED
        ManagedDraining: ENABLED
  # Create a cluster capacity provider association so that the cluster
  # will use the capacity provider
  CapacityProviderAssociation:
    Type: AWS::ECS::ClusterCapacityProviderAssociations
    Properties:
      CapacityProviders:
        - !Ref CapacityProvider
      Cluster: !Ref ECSCluster
      DefaultCapacityProviderStrategy:
        - Base: 0
          CapacityProvider: !Ref CapacityProvider
          Weight: 1
  # A security group for the EC2 hosts that will run the containers.
  # This can be used to limit incoming traffic to or outgoing traffic
  # from the container's host EC2 instance.
  ContainerHostSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Access to the EC2 hosts that run containers
      VpcId: !Ref VpcId
  # Role for the EC2 hosts. This allows the ECS agent on the EC2 hosts
  # to communicate with the ECS control plane, as well as download the docker
  # images from ECR to run on your host.
  EC2Role:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: [ec2.amazonaws.com]
            Action: ['sts:AssumeRole']
      Path: /
      # See reference: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/security-iam-awsmanpol.html#security-iam-awsmanpol-AmazonEC2ContainerServiceforEC2Role
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
  # This role is used by the ECS agent to download images and upload logs
  # on behalf of the tasks it starts.
  ECSTaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: [ecs-tasks.amazonaws.com]
            Action: ['sts:AssumeRole']
            Condition:
              ArnLike:
                aws:SourceArn: !Sub arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:*
              StringEquals:
                aws:SourceAccount: !Ref AWS::AccountId
      Path: /
      # This role enables basic features of ECS. See reference:
      # https://docs.aws.amazon.com/AmazonECS/latest/developerguide/security-iam-awsmanpol.html#security-iam-awsmanpol-AmazonECSTaskExecutionRolePolicy
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
Outputs:
  ClusterName:
    Description: The ECS cluster into which to launch resources
    Value: !Ref ECSCluster
  ECSTaskExecutionRole:
    Description: The role used to start up a task
    Value: !Ref ECSTaskExecutionRole
  CapacityProvider:
    Description: The cluster capacity provider that the service should use
      to request capacity when it wants to start up a task
    Value: !Ref CapacityProvider
A couple of important things to note in this template:
- ContainerInstances.Properties.LaunchTemplateData.UserData - The ecs.config file sets configuration that is used by the ECS agent. In addition to the basic info of which cluster to join, this template also sets up automatic task draining whenever a Spot termination notice is sent to the instance. It also sets the stop timeout period to 90 seconds, so that it is shorter than the two-minute Spot termination window.
- ECSAutoScalingGroup.Properties.MixedInstancesPolicy - This autoscaling group policy allows the launch of mixed EC2 types, entirely on Spot, with no on-demand instances.
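The stop timeout only helps if the process inside the container reacts to the SIGTERM it receives when a task is stopped. As a minimal sketch of that idea, assuming a shell-based container entrypoint, the loop below stops taking new work and exits cleanly once SIGTERM arrives. The run_worker and do_work names are hypothetical placeholders, not part of the templates above.

```shell
#!/bin/sh
# Hypothetical unit of work; replace with the real application logic.
do_work() {
  sleep 1
}

# Run until SIGTERM arrives, then let the in-flight step finish and return 0.
run_worker() {
  stopping=0
  trap 'stopping=1' TERM
  while [ "$stopping" -eq 0 ]; do
    # Run the work in the background and wait on it, so the trap
    # fires promptly instead of after a long foreground command.
    do_work &
    wait "$!"
  done
  # Wait for any step that was still running when the signal arrived.
  wait
  echo "SIGTERM received, shutting down cleanly"
  return 0
}

# Example usage as a container entrypoint (uncomment to run):
# run_worker
```

Whatever cleanup the real application performs needs to complete within the 90 second stop timeout, after which the container is killed.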
This template requires two input parameters:
- VpcId - The ID of a VPC on your AWS account. This can be the default VPC.
- SubnetIds - A comma-separated list of subnet IDs within that VPC. The template selects the first two entries, so supply at least two subnets.
Additionally, you can modify the following parameters:
- DesiredCapacity - Number of EC2 instances to start with. Default: 0
- MaxSize - An upper limit on the number of EC2 instances to scale up to. Default: 100
- ECSAMI - The Amazon Machine Image to use for each EC2 instance. Don't change this unless you really know what you are doing.
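To preview which instance types the InstanceRequirements block could match in a given Region, the AWS CLI offers the ec2 get-instance-types-from-instance-requirements command. The sketch below mirrors the attribute values from the template. It assumes the AWS CLI is installed and configured (this pattern itself only requires SAM CLI) and that x86_64 HVM instances are wanted; list_matching_instance_types is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: list instance types matching the same attributes
# as the Overrides section of the cluster template.
# Assumes AWS CLI credentials and a default Region are configured.
list_matching_instance_types() {
  aws ec2 get-instance-types-from-instance-requirements \
    --architecture-types x86_64 \
    --virtualization-types hvm \
    --instance-requirements '{
      "VCpuCount": {"Min": 2, "Max": 4},
      "MemoryMiB": {"Min": 4096, "Max": 8192},
      "BurstablePerformance": "excluded",
      "InstanceGenerations": ["current"],
      "CpuManufacturers": ["intel", "amd"],
      "ExcludedInstanceTypes": ["t2*", "r*", "d*", "g*", "i*", "z*", "x*"]
    }' \
    --query 'InstanceTypes[].InstanceType' \
    --output text
}

# Example usage (uncomment to run):
# list_matching_instance_types
```

The result is only a preview of candidate types; which of them the autoscaling group actually launches depends on Spot capacity at the time of the request.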
Service Template
This template deploys an ECS service that uses the Spot capacity provider. ECS will automatically adapt to launch as many mixed EC2 instances as necessary to host the service tasks.
AWSTemplateFormatVersion: '2010-09-09'
Description: An example service that deploys onto EC2 capacity with
  a capacity provider strategy that autoscales the underlying
  EC2 Capacity as needed by the service
Parameters:
  VpcId:
    Type: String
    Description: The VPC that the service is running inside of
  SubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of subnet IDs the AWS VPC tasks are inside of
  ClusterName:
    Type: String
    Description: The name of the ECS cluster into which to launch capacity.
  ECSTaskExecutionRole:
    Type: String
    Description: The role used to start up an ECS task
  CapacityProvider:
    Type: String
    Description: The cluster capacity provider that the service should use
      to request capacity when it wants to start up a task
  ServiceName:
    Type: String
    Default: example-service
    Description: A name for the service
  ImageUrl:
    Type: String
    Default: public.ecr.aws/docker/library/busybox:latest
    Description: The url of a docker image that contains the application process that
      will handle the traffic for this service
  ContainerCpu:
    Type: Number
    Default: 256
    Description: How much CPU to give the container. 1024 is 1 CPU
  ContainerMemory:
    Type: Number
    Default: 512
    Description: How much memory in megabytes to give the container
  Command:
    Type: String
    Default: sleep 3600
    Description: The command to run inside of the container
  DesiredCount:
    Type: Number
    Default: 0
    Description: How many copies of the service task to run
Resources:
  # The task definition. This is a simple metadata description of what
  # container to run, and what resource requirements it has.
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Ref ServiceName
      Cpu: !Ref ContainerCpu
      Memory: !Ref ContainerMemory
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - EC2
      ExecutionRoleArn: !Ref ECSTaskExecutionRole
      ContainerDefinitions:
        - Name: !Ref ServiceName
          Cpu: !Ref ContainerCpu
          Memory: !Ref ContainerMemory
          Image: !Ref ImageUrl
          Command: !Split [' ', !Ref 'Command']
          LogConfiguration:
            LogDriver: 'awslogs'
            Options:
              mode: non-blocking
              max-buffer-size: 25m
              awslogs-group: !Ref LogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Ref ServiceName
  # The service. The service is a resource which allows you to run multiple
  # copies of a type of task, and gather up their logs and metrics, as well
  # as monitor the number of running tasks and replace any that have crashed
  Service:
    Type: AWS::ECS::Service
    Properties:
      ServiceName: !Ref ServiceName
      Cluster: !Ref ClusterName
      PlacementStrategies:
        - Field: attribute:ecs.availability-zone
          Type: spread
        - Field: cpu
          Type: binpack
      CapacityProviderStrategy:
        - Base: 0
          CapacityProvider: !Ref CapacityProvider
          Weight: 1
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 75
      DesiredCount: !Ref DesiredCount
      NetworkConfiguration:
        AwsvpcConfiguration:
          SecurityGroups:
            - !Ref ServiceSecurityGroup
          Subnets:
            - !Select [ 0, !Ref SubnetIds ]
            - !Select [ 1, !Ref SubnetIds ]
      TaskDefinition: !Ref TaskDefinition
  # Because we are launching tasks in AWS VPC networking mode
  # the tasks themselves also have an extra security group that is unique
  # to them. This is a unique security group just for this service,
  # to control which things it can talk to, and who can talk to it
  ServiceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: !Sub Access to service ${ServiceName}
      VpcId: !Ref VpcId
  # This log group stores the stdout logs from this service's containers
  LogGroup:
    Type: AWS::Logs::LogGroup
Most parameters in this stack will be supplied by a parent stack that passes in resources from the capacity provider stack. However, you may be interested in overriding the following parameters:
- ServiceName - A human-readable name for the service.
- ImageUrl - URL of a container image to run. By default this stack deploys public.ecr.aws/docker/library/busybox:latest
- ContainerCpu - CPU shares, where 1024 CPU shares equal 1 vCPU. Default: 256 (1/4th vCPU)
- ContainerMemory - Megabytes of memory to give the container. Default: 512
- Command - Command to run in the container. Default: sleep 3600
- DesiredCount - Number of copies of the container to run. Default: 0 (So you can test scaling up from zero)
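As a rough illustration of how container density varies across the instance sizes the cluster template allows (2 to 4 vCPUs, 4096 to 8192 MiB), the arithmetic below estimates how many copies of the default task (256 CPU units, 512 MiB) fit on the smallest and largest permitted instances. This is a simplified sketch: tasks_per_instance is a hypothetical helper, and it ignores the memory held back for the operating system and ECS agent as well as the per-instance network interface limits that apply in awsvpc mode, so real numbers are likely to be lower.

```shell
#!/bin/sh
# Hypothetical helper: estimate tasks per instance from CPU and memory alone.
# Arguments: instance vCPUs, instance memory (MiB), task CPU units, task memory (MiB)
tasks_per_instance() {
  by_cpu=$(( $1 * 1024 / $3 ))   # 1024 CPU units per vCPU
  by_mem=$(( $2 / $4 ))
  # The scarcer resource limits how many tasks fit
  if [ "$by_cpu" -lt "$by_mem" ]; then
    echo "$by_cpu"
  else
    echo "$by_mem"
  fi
}

tasks_per_instance 2 4096 256 512   # smallest allowed instance: prints 8
tasks_per_instance 4 8192 256 512   # largest allowed instance: prints 16
```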
Parent Stack
This stack deploys both stacks as nested stacks, for ease of grouping and passing parameters from one stack to the next.
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: Parent stack that deploys the ECS cluster and capacity provider
  then launches a service inside of the cluster
Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id
    Description: VPC ID where the ECS cluster is launched
  SubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of subnet IDs where the EC2 instances will be launched
Resources:
  # This stack contains cluster wide resources that will be shared
  # by all services that get launched in the stack
  BaseStack:
    Type: AWS::Serverless::Application
    Properties:
      Location: spot-cluster.yml
      Parameters:
        VpcId: !Ref VpcId
        SubnetIds: !Join [',', !Ref SubnetIds]
  # This service will be launched into the cluster by passing
  # details from the base stack into the service stack
  Service:
    Type: AWS::Serverless::Application
    Properties:
      Location: service.yml
      Parameters:
        VpcId: !Ref VpcId
        SubnetIds: !Join [',', !Ref SubnetIds]
        ClusterName: !GetAtt BaseStack.Outputs.ClusterName
        ECSTaskExecutionRole: !GetAtt BaseStack.Outputs.ECSTaskExecutionRole
        CapacityProvider: !GetAtt BaseStack.Outputs.CapacityProvider
This parent stack requires the following parameters:
- VpcId - The ID of a VPC on your AWS account. This can be the default VPC.
- SubnetIds - A comma-separated list of subnet IDs within that VPC. Both nested templates select the first two entries, so supply at least two subnets.
Usage
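Save the three templates above as spot-cluster.yml, service.yml, and parent.yml in the same directory, since the parent stack refers to the other two by those file names. Then deploy the parent stack, which creates the cluster, the Spot capacity autoscaling group, and the service: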
sam deploy \
--template-file parent.yml \
--stack-name spot-capacity-provider-environment \
--resolve-s3 \
--capabilities CAPABILITY_IAM \
--parameter-overrides VpcId=vpc-79508710 SubnetIds=subnet-b4676dfe,subnet-c71ebfae
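Once the deployment finishes, one way to confirm the result from the command line is to query CloudFormation for the stack status. This is a sketch that assumes the AWS CLI is installed and configured, which this pattern does not otherwise require; stack_status is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the status of a CloudFormation stack,
# for example CREATE_COMPLETE after a successful first deployment.
stack_status() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query 'Stacks[0].StackStatus' \
    --output text
}

# Example usage (uncomment to run):
# stack_status spot-capacity-provider-environment
```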
Test it out
The service is initially deployed with a desired count of 0. Use the Amazon ECS console to update the service and scale it up to a larger number of tasks. After an initial delay you will observe the ECS Capacity Provider request instances from the autoscaling group. The autoscaling group will fulfill the request by launching a mixture of EC2 instance types chosen according to current Spot capacity availability (the template uses the capacity-optimized allocation strategy). As the instances join the ECS cluster they will be filled with service tasks.
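If you prefer the command line to the console, the same scale-up can be sketched with the AWS CLI. This assumes the AWS CLI is installed and configured; scale_service and running_count are hypothetical helper names. The cluster name has to be looked up because CloudFormation generates it (it is the ClusterName output of the cluster stack).

```shell
#!/bin/sh
# Hypothetical helper: set the desired task count of a service.
# Arguments: cluster name or ARN, service name, desired count
scale_service() {
  aws ecs update-service \
    --cluster "$1" \
    --service "$2" \
    --desired-count "$3" \
    --query 'service.desiredCount' \
    --output text
}

# Hypothetical helper: print how many tasks are currently running.
# Arguments: cluster name or ARN, service name
running_count() {
  aws ecs describe-services \
    --cluster "$1" \
    --services "$2" \
    --query 'services[0].runningCount' \
    --output text
}

# Example usage (uncomment to run):
# aws ecs list-clusters                      # find the generated cluster name
# scale_service <cluster-name> example-service 10
# running_count <cluster-name> example-service
```

The running count is expected to lag behind the desired count at first, because instances have to launch and register with the cluster before tasks can be placed on them.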
See Also
- If you prefer to explicitly choose on-demand EC2 instances to run, then check out the ECS EC2 Capacity Provider pattern.