Amazon ECS Capacity Provider for EC2 instances
Terminology and Background
Amazon Elastic Container Service (ECS) is a container orchestrator that deploys containerized applications to both Amazon EC2 capacity and serverless AWS Fargate capacity.
Amazon ECS capacity providers are a built-in feature that helps you launch EC2 capacity on the fly. When application containers need to run, the capacity provider provisions as many EC2 hosts as necessary. When all containers are done running, the cluster can "scale to zero" by shutting down all EC2 hosts.
This pattern shows a production-ready ECS on EC2 capacity provider configuration. It comes with a variety of helpful, out-of-the-box configurations and failsafes to keep your ECS on EC2 cluster resilient.
Architecture Diagrams
The following diagrams show what this pattern will deploy:
By following the instructions here, you will deploy:
- A group of EC2 instances launched by an EC2 Auto Scaling Group, spread across availability zones
- Each EC2 instance hosts a lightweight (<10 MB memory) daemon task used for health verification
- Each EC2 instance can host multiple application containers. This allows you to save on infrastructure costs and achieve better utilization of your EC2 instances by running more copies of your application per host instance.
The runtime aspects of the architecture are orchestrated by Amazon Elastic Container Service in the following manner:
- Amazon ECS manages the size of the Auto Scaling Group, and automatically scales it to the appropriate size to match the number of application containers you want to run
- A DAEMON type service is used to automatically launch one copy of the health verification task onto each instance when it joins the ECS cluster.
- A REPLICA type service is used to decide how many EC2 instances to scale up to. The service's application container is distributed across the instances.
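The scaling behavior described above can be made concrete with a little arithmetic. The sketch below assumes the CapacityProviderReservation metric that ECS cluster auto scaling is documented to use: the number of instances needed (M) divided by the number currently running (N), times 100, compared against the capacity provider's target capacity. The function name and sample numbers are illustrative only, and integer division is a simplification.

```shell
# Simplified model of the metric used by ECS cluster auto scaling.
# $1 = M, instances needed to place all tasks; $2 = N, instances running now.
capacity_provider_reservation() {
  if [ "$2" -eq 0 ]; then
    # No instances yet: 100 when nothing is needed, 200 when tasks are waiting
    if [ "$1" -eq 0 ]; then echo 100; else echo 200; fi
  else
    echo $(( $1 * 100 / $2 ))
  fi
}

capacity_provider_reservation 0 0   # empty cluster, nothing to run
capacity_provider_reservation 1 0   # first task requested on an empty cluster
capacity_provider_reservation 3 2   # three hosts needed, two running
capacity_provider_reservation 0 2   # all tasks stopped, two idle hosts
```

With a target capacity of 100, a value above 100 asks the Auto Scaling Group to scale out and a value below 100 lets it scale in, which is how the cluster can return to zero instances.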
This architecture also comes with operational enhancements designed to make it easier and safer to manage the EC2 instances that are used as container capacity:
- The CloudFormation template uses a dynamic SSM parameter to determine which ECS Optimized AMI to deploy. Each time you deploy the stack, this parameter is resolved again, so the deployment checks whether there is an available update that needs to be applied to the EC2 instances.
- The Auto Scaling Group is configured to monitor CloudFormation signals when applying a rolling AMI update to the EC2 instances.
- Each EC2 instance runs a CloudFormation initialization script that verifies that the host is actually able to connect to the ECS control plane and launch a health daemon task. Only after the EC2 instance has successfully registered with ECS and launched the health check task is the CloudFormation signal sent to notify the Auto Scaling Group that the host is healthy.
- In the event that a configuration or AMI update does not function, this configuration will automatically roll back the stack to the previous EC2 configuration. This gives you a safe way to continuously roll out updates to the ECS AMI on a regular basis.
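To see which AMI the dynamic SSM parameter currently resolves to, you can query it yourself. This is a minimal sketch that assumes the AWS CLI is installed and configured with credentials. The function name and the fallback region are hypothetical, while the parameter path is the default value of ECSAMI in the template.

```shell
# Public SSM parameter that tracks the recommended ECS Optimized Amazon Linux 2 AMI
ECS_AMI_PARAMETER=/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id

# Print the AMI ID that the parameter resolves to in the given region.
# $1 = AWS region (hypothetical fallback: us-east-1)
latest_ecs_ami() {
  aws ssm get-parameters \
    --names "$ECS_AMI_PARAMETER" \
    --region "${1:-us-east-1}" \
    --query 'Parameters[0].Value' \
    --output text
}
```

Comparing the output of latest_ecs_ami for your region before a deploy with the AMI of your running instances is one way to anticipate whether a rolling update will be triggered.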
Dependencies
This pattern uses the AWS SAM CLI to deploy CloudFormation stacks to your account. Follow the appropriate steps for installing the SAM CLI before you continue.
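A quick way to confirm the dependency is in place is to ask the shell whether it can find the command. The helper below is a hypothetical convenience, not part of the pattern, and it only checks that the command exists, not which version is installed.

```shell
# Return success if the shell can find the named command.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found $1"
  else
    echo "missing $1, install it before continuing" >&2
    return 1
  fi
}

# Check for the SAM CLI and report the result without aborting the shell
require_tool sam || echo "SAM CLI is not installed yet"
```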
Cluster with EC2 Capacity Provider
Download the following cluster-capacity-provider.yml file, which deploys an ECS cluster that has a capacity provider linked to an EC2 Auto Scaling Group. The Auto Scaling Group starts out scaled to zero, empty of EC2 instances.
AWSTemplateFormatVersion: '2010-09-09'
Description: EC2 ECS cluster that starts out empty, with no EC2 instances yet.
An ECS capacity provider automatically launches more EC2 instances
as required on the fly when you request ECS to launch services or
standalone tasks.
Parameters:
InstanceType:
Type: String
Default: c5.xlarge
Description: Class of EC2 instance used to host containers. Choose t2 for testing, m5 for general purpose, c5 for CPU intensive services, and r5 for memory intensive services
AllowedValues: ["a1.2xlarge", "a1.4xlarge", "a1.large", "a1.medium", "a1.metal", "a1.xlarge", "c1.medium", "c1.xlarge", "c3.2xlarge", "c3.4xlarge", "c3.8xlarge", "c3.large", "c3.xlarge", "c4.2xlarge", "c4.4xlarge", "c4.8xlarge", "c4.large", "c4.xlarge", "c5.12xlarge", "c5.18xlarge", "c5.24xlarge", "c5.2xlarge", "c5.4xlarge", "c5.9xlarge", "c5.large", "c5.metal", "c5.xlarge", "c5a.12xlarge", "c5a.16xlarge", "c5a.24xlarge", "c5a.2xlarge", "c5a.4xlarge", "c5a.8xlarge", "c5a.large", "c5a.xlarge", "c5ad.12xlarge", "c5ad.16xlarge", "c5ad.24xlarge", "c5ad.2xlarge", "c5ad.4xlarge", "c5ad.8xlarge", "c5ad.large", "c5ad.xlarge", "c5d.12xlarge", "c5d.18xlarge", "c5d.24xlarge", "c5d.2xlarge", "c5d.4xlarge", "c5d.9xlarge", "c5d.large", "c5d.metal", "c5d.xlarge", "c5n.18xlarge", "c5n.2xlarge", "c5n.4xlarge", "c5n.9xlarge", "c5n.large", "c5n.metal", "c5n.xlarge", "c6a.12xlarge", "c6a.16xlarge", "c6a.24xlarge", "c6a.2xlarge", "c6a.32xlarge", "c6a.48xlarge", "c6a.4xlarge", "c6a.8xlarge", "c6a.large", "c6a.metal", "c6a.xlarge", "c6g.12xlarge", "c6g.16xlarge", "c6g.2xlarge", "c6g.4xlarge", "c6g.8xlarge", "c6g.large", "c6g.medium", "c6g.metal", "c6g.xlarge", "c6gd.12xlarge", "c6gd.16xlarge", "c6gd.2xlarge", "c6gd.4xlarge", "c6gd.8xlarge", "c6gd.large", "c6gd.medium", "c6gd.metal", "c6gd.xlarge", "c6gn.12xlarge", "c6gn.16xlarge", "c6gn.2xlarge", "c6gn.4xlarge", "c6gn.8xlarge", "c6gn.large", "c6gn.medium", "c6gn.xlarge", "c6i.12xlarge", "c6i.16xlarge", "c6i.24xlarge", "c6i.2xlarge", "c6i.32xlarge", "c6i.4xlarge", "c6i.8xlarge", "c6i.large", "c6i.metal", "c6i.xlarge", "c6id.12xlarge", "c6id.16xlarge", "c6id.24xlarge", "c6id.2xlarge", "c6id.32xlarge", "c6id.4xlarge", "c6id.8xlarge", "c6id.large", "c6id.metal", "c6id.xlarge", "c6in.12xlarge", "c6in.16xlarge", "c6in.24xlarge", "c6in.2xlarge", "c6in.32xlarge", "c6in.4xlarge", "c6in.8xlarge", "c6in.large", "c6in.metal", "c6in.xlarge", "c7g.12xlarge", "c7g.16xlarge", "c7g.2xlarge", "c7g.4xlarge", "c7g.8xlarge", "c7g.large", "c7g.medium", 
"c7g.metal", "c7g.xlarge", "c7gd.12xlarge", "c7gd.16xlarge", "c7gd.2xlarge", "c7gd.4xlarge", "c7gd.8xlarge", "c7gd.large", "c7gd.medium", "c7gd.xlarge", "c7gn.12xlarge", "c7gn.16xlarge", "c7gn.2xlarge", "c7gn.4xlarge", "c7gn.8xlarge", "c7gn.large", "c7gn.medium", "c7gn.xlarge", "cc2.8xlarge", "cr1.8xlarge", "d2.2xlarge", "d2.4xlarge", "d2.8xlarge", "d2.xlarge", "d3.2xlarge", "d3.4xlarge", "d3.8xlarge", "d3.xlarge", "d3en.12xlarge", "d3en.2xlarge", "d3en.4xlarge", "d3en.6xlarge", "d3en.8xlarge", "d3en.xlarge", "dl1.24xlarge", "f1.16xlarge", "f1.2xlarge", "f1.4xlarge", "g2.2xlarge", "g2.8xlarge", "g3.16xlarge", "g3.4xlarge", "g3.8xlarge", "g3s.xlarge", "g4ad.16xlarge", "g4ad.2xlarge", "g4ad.4xlarge", "g4ad.8xlarge", "g4ad.xlarge", "g4dn.12xlarge", "g4dn.16xlarge", "g4dn.2xlarge", "g4dn.4xlarge", "g4dn.8xlarge", "g4dn.metal", "g4dn.xlarge", "g5.12xlarge", "g5.16xlarge", "g5.24xlarge", "g5.2xlarge", "g5.48xlarge", "g5.4xlarge", "g5.8xlarge", "g5.xlarge", "g5g.16xlarge", "g5g.2xlarge", "g5g.4xlarge", "g5g.8xlarge", "g5g.metal", "g5g.xlarge", "h1.16xlarge", "h1.2xlarge", "h1.4xlarge", "h1.8xlarge", "hpc7g.16xlarge", "hpc7g.4xlarge", "hpc7g.8xlarge", "hs1.8xlarge", "i2.2xlarge", "i2.4xlarge", "i2.8xlarge", "i2.large", "i2.xlarge", "i3.16xlarge", "i3.2xlarge", "i3.4xlarge", "i3.8xlarge", "i3.large", "i3.metal", "i3.xlarge", "i3en.12xlarge", "i3en.24xlarge", "i3en.2xlarge", "i3en.3xlarge", "i3en.6xlarge", "i3en.large", "i3en.metal", "i3en.xlarge", "i4g.16xlarge", "i4g.2xlarge", "i4g.4xlarge", "i4g.8xlarge", "i4g.large", "i4g.xlarge", "i4i.16xlarge", "i4i.2xlarge", "i4i.32xlarge", "i4i.4xlarge", "i4i.8xlarge", "i4i.large", "i4i.metal", "i4i.xlarge", "im4gn.16xlarge", "im4gn.2xlarge", "im4gn.4xlarge", "im4gn.8xlarge", "im4gn.large", "im4gn.xlarge", "inf1.24xlarge", "inf1.2xlarge", "inf1.6xlarge", "inf1.xlarge", "inf2.24xlarge", "inf2.48xlarge", "inf2.8xlarge", "inf2.xlarge", "is4gen.2xlarge", "is4gen.4xlarge", "is4gen.8xlarge", "is4gen.large", "is4gen.medium", 
"is4gen.xlarge", "m1.large", "m1.medium", "m1.small", "m1.xlarge", "m2.2xlarge", "m2.4xlarge", "m2.xlarge", "m3.2xlarge", "m3.large", "m3.medium", "m3.xlarge", "m4.10xlarge", "m4.16xlarge", "m4.2xlarge", "m4.4xlarge", "m4.large", "m4.xlarge", "m5.12xlarge", "m5.16xlarge", "m5.24xlarge", "m5.2xlarge", "m5.4xlarge", "m5.8xlarge", "m5.large", "m5.metal", "m5.xlarge", "m5a.12xlarge", "m5a.16xlarge", "m5a.24xlarge", "m5a.2xlarge", "m5a.4xlarge", "m5a.8xlarge", "m5a.large", "m5a.xlarge", "m5ad.12xlarge", "m5ad.16xlarge", "m5ad.24xlarge", "m5ad.2xlarge", "m5ad.4xlarge", "m5ad.8xlarge", "m5ad.large", "m5ad.xlarge", "m5d.12xlarge", "m5d.16xlarge", "m5d.24xlarge", "m5d.2xlarge", "m5d.4xlarge", "m5d.8xlarge", "m5d.large", "m5d.metal", "m5d.xlarge", "m5dn.12xlarge", "m5dn.16xlarge", "m5dn.24xlarge", "m5dn.2xlarge", "m5dn.4xlarge", "m5dn.8xlarge", "m5dn.large", "m5dn.metal", "m5dn.xlarge", "m5n.12xlarge", "m5n.16xlarge", "m5n.24xlarge", "m5n.2xlarge", "m5n.4xlarge", "m5n.8xlarge", "m5n.large", "m5n.metal", "m5n.xlarge", "m5zn.12xlarge", "m5zn.2xlarge", "m5zn.3xlarge", "m5zn.6xlarge", "m5zn.large", "m5zn.metal", "m5zn.xlarge", "m6a.12xlarge", "m6a.16xlarge", "m6a.24xlarge", "m6a.2xlarge", "m6a.32xlarge", "m6a.48xlarge", "m6a.4xlarge", "m6a.8xlarge", "m6a.large", "m6a.metal", "m6a.xlarge", "m6g.12xlarge", "m6g.16xlarge", "m6g.2xlarge", "m6g.4xlarge", "m6g.8xlarge", "m6g.large", "m6g.medium", "m6g.metal", "m6g.xlarge", "m6gd.12xlarge", "m6gd.16xlarge", "m6gd.2xlarge", "m6gd.4xlarge", "m6gd.8xlarge", "m6gd.large", "m6gd.medium", "m6gd.metal", "m6gd.xlarge", "m6i.12xlarge", "m6i.16xlarge", "m6i.24xlarge", "m6i.2xlarge", "m6i.32xlarge", "m6i.4xlarge", "m6i.8xlarge", "m6i.large", "m6i.metal", "m6i.xlarge", "m6id.12xlarge", "m6id.16xlarge", "m6id.24xlarge", "m6id.2xlarge", "m6id.32xlarge", "m6id.4xlarge", "m6id.8xlarge", "m6id.large", "m6id.metal", "m6id.xlarge", "m6idn.12xlarge", "m6idn.16xlarge", "m6idn.24xlarge", "m6idn.2xlarge", "m6idn.32xlarge", "m6idn.4xlarge", "m6idn.8xlarge", 
"m6idn.large", "m6idn.metal", "m6idn.xlarge", "m6in.12xlarge", "m6in.16xlarge", "m6in.24xlarge", "m6in.2xlarge", "m6in.32xlarge", "m6in.4xlarge", "m6in.8xlarge", "m6in.large", "m6in.metal", "m6in.xlarge", "m7a.12xlarge", "m7a.16xlarge", "m7a.24xlarge", "m7a.2xlarge", "m7a.32xlarge", "m7a.48xlarge", "m7a.4xlarge", "m7a.8xlarge", "m7a.large", "m7a.medium", "m7a.metal-48xl", "m7a.xlarge", "m7g.12xlarge", "m7g.16xlarge", "m7g.2xlarge", "m7g.4xlarge", "m7g.8xlarge", "m7g.large", "m7g.medium", "m7g.metal", "m7g.xlarge", "m7gd.12xlarge", "m7gd.16xlarge", "m7gd.2xlarge", "m7gd.4xlarge", "m7gd.8xlarge", "m7gd.large", "m7gd.medium", "m7gd.xlarge", "m7i-flex.2xlarge", "m7i-flex.4xlarge", "m7i-flex.8xlarge", "m7i-flex.large", "m7i-flex.xlarge", "m7i.12xlarge", "m7i.16xlarge", "m7i.24xlarge", "m7i.2xlarge", "m7i.48xlarge", "m7i.4xlarge", "m7i.8xlarge", "m7i.large", "m7i.xlarge", "mac1.metal", "mac2.metal", "p2.16xlarge", "p2.8xlarge", "p2.xlarge", "p3.16xlarge", "p3.2xlarge", "p3.8xlarge", "p3dn.24xlarge", "p4d.24xlarge", "p4de.24xlarge", "p5.48xlarge", "r3.2xlarge", "r3.4xlarge", "r3.8xlarge", "r3.large", "r3.xlarge", "r4.16xlarge", "r4.2xlarge", "r4.4xlarge", "r4.8xlarge", "r4.large", "r4.xlarge", "r5.12xlarge", "r5.16xlarge", "r5.24xlarge", "r5.2xlarge", "r5.4xlarge", "r5.8xlarge", "r5.large", "r5.metal", "r5.xlarge", "r5a.12xlarge", "r5a.16xlarge", "r5a.24xlarge", "r5a.2xlarge", "r5a.4xlarge", "r5a.8xlarge", "r5a.large", "r5a.xlarge", "r5ad.12xlarge", "r5ad.16xlarge", "r5ad.24xlarge", "r5ad.2xlarge", "r5ad.4xlarge", "r5ad.8xlarge", "r5ad.large", "r5ad.xlarge", "r5b.12xlarge", "r5b.16xlarge", "r5b.24xlarge", "r5b.2xlarge", "r5b.4xlarge", "r5b.8xlarge", "r5b.large", "r5b.metal", "r5b.xlarge", "r5d.12xlarge", "r5d.16xlarge", "r5d.24xlarge", "r5d.2xlarge", "r5d.4xlarge", "r5d.8xlarge", "r5d.large", "r5d.metal", "r5d.xlarge", "r5dn.12xlarge", "r5dn.16xlarge", "r5dn.24xlarge", "r5dn.2xlarge", "r5dn.4xlarge", "r5dn.8xlarge", "r5dn.large", "r5dn.metal", "r5dn.xlarge", 
"r5n.12xlarge", "r5n.16xlarge", "r5n.24xlarge", "r5n.2xlarge", "r5n.4xlarge", "r5n.8xlarge", "r5n.large", "r5n.metal", "r5n.xlarge", "r6a.12xlarge", "r6a.16xlarge", "r6a.24xlarge", "r6a.2xlarge", "r6a.32xlarge", "r6a.48xlarge", "r6a.4xlarge", "r6a.8xlarge", "r6a.large", "r6a.metal", "r6a.xlarge", "r6g.12xlarge", "r6g.16xlarge", "r6g.2xlarge", "r6g.4xlarge", "r6g.8xlarge", "r6g.large", "r6g.medium", "r6g.metal", "r6g.xlarge", "r6gd.12xlarge", "r6gd.16xlarge", "r6gd.2xlarge", "r6gd.4xlarge", "r6gd.8xlarge", "r6gd.large", "r6gd.medium", "r6gd.metal", "r6gd.xlarge", "r6i.12xlarge", "r6i.16xlarge", "r6i.24xlarge", "r6i.2xlarge", "r6i.32xlarge", "r6i.4xlarge", "r6i.8xlarge", "r6i.large", "r6i.metal", "r6i.xlarge", "r6id.12xlarge", "r6id.16xlarge", "r6id.24xlarge", "r6id.2xlarge", "r6id.32xlarge", "r6id.4xlarge", "r6id.8xlarge", "r6id.large", "r6id.metal", "r6id.xlarge", "r6idn.12xlarge", "r6idn.16xlarge", "r6idn.24xlarge", "r6idn.2xlarge", "r6idn.32xlarge", "r6idn.4xlarge", "r6idn.8xlarge", "r6idn.large", "r6idn.metal", "r6idn.xlarge", "r6in.12xlarge", "r6in.16xlarge", "r6in.24xlarge", "r6in.2xlarge", "r6in.32xlarge", "r6in.4xlarge", "r6in.8xlarge", "r6in.large", "r6in.metal", "r6in.xlarge", "r7g.12xlarge", "r7g.16xlarge", "r7g.2xlarge", "r7g.4xlarge", "r7g.8xlarge", "r7g.large", "r7g.medium", "r7g.metal", "r7g.xlarge", "r7gd.12xlarge", "r7gd.16xlarge", "r7gd.2xlarge", "r7gd.4xlarge", "r7gd.8xlarge", "r7gd.large", "r7gd.medium", "r7gd.xlarge", "r7iz.12xlarge", "r7iz.16xlarge", "r7iz.2xlarge", "r7iz.32xlarge", "r7iz.4xlarge", "r7iz.8xlarge", "r7iz.large", "r7iz.xlarge", "t1.micro", "t2.2xlarge", "t2.large", "t2.medium", "t2.micro", "t2.nano", "t2.small", "t2.xlarge", "t3.2xlarge", "t3.large", "t3.medium", "t3.micro", "t3.nano", "t3.small", "t3.xlarge", "t3a.2xlarge", "t3a.large", "t3a.medium", "t3a.micro", "t3a.nano", "t3a.small", "t3a.xlarge", "t4g.2xlarge", "t4g.large", "t4g.medium", "t4g.micro", "t4g.nano", "t4g.small", "t4g.xlarge", "trn1.2xlarge", "trn1.32xlarge", 
"trn1n.32xlarge", "u-12tb1.112xlarge", "u-18tb1.112xlarge", "u-24tb1.112xlarge", "u-3tb1.56xlarge", "u-6tb1.112xlarge", "u-6tb1.56xlarge", "u-9tb1.112xlarge", "vt1.24xlarge", "vt1.3xlarge", "vt1.6xlarge", "x1.16xlarge", "x1.32xlarge", "x1e.16xlarge", "x1e.2xlarge", "x1e.32xlarge", "x1e.4xlarge", "x1e.8xlarge", "x1e.xlarge", "x2gd.12xlarge", "x2gd.16xlarge", "x2gd.2xlarge", "x2gd.4xlarge", "x2gd.8xlarge", "x2gd.large", "x2gd.medium", "x2gd.metal", "x2gd.xlarge", "x2idn.16xlarge", "x2idn.24xlarge", "x2idn.32xlarge", "x2idn.metal", "x2iedn.16xlarge", "x2iedn.24xlarge", "x2iedn.2xlarge", "x2iedn.32xlarge", "x2iedn.4xlarge", "x2iedn.8xlarge", "x2iedn.metal", "x2iedn.xlarge", "x2iezn.12xlarge", "x2iezn.2xlarge", "x2iezn.4xlarge", "x2iezn.6xlarge", "x2iezn.8xlarge", "x2iezn.metal", "z1d.12xlarge", "z1d.2xlarge", "z1d.3xlarge", "z1d.6xlarge", "z1d.large", "z1d.metal", "z1d.xlarge"]
ConstraintDescription: Please choose a valid instance type.
MaxSize:
Type: Number
Default: '100'
Description: Maximum number of EC2 instances that can be launched in your ECS cluster.
ECSAMI:
Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
Default: /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id
Description: The Amazon Machine Image ID used for the cluster, leave it as the default value to get the latest AMI
VpcId:
Type: AWS::EC2::VPC::Id
Description: VPC ID where the ECS cluster is launched
SubnetIds:
Type: List<AWS::EC2::Subnet::Id>
Description: List of subnet IDs where the EC2 instances will be launched
Resources:
# Cluster that keeps track of container deployments
ECSCluster:
Type: AWS::ECS::Cluster
Properties:
ClusterSettings:
- Name: containerInsights
Value: enabled
# Custom resource that force destroys the ASG. This cleans up EC2 instances that had
# managed termination protection enabled, but which are not yet released.
# This is necessary because ECS does not immediately release an EC2 instance from termination
# protection as soon as the instance is no longer running tasks. There is a cooldown delay.
# In the case of tearing down the CloudFormation stack, CloudFormation will delete the
# AWS::ECS::Service and immediately move on to tearing down the AWS::ECS::Cluster, disconnecting
# the AWS::AutoScaling::AutoScalingGroup from ECS management too fast, before ECS has a chance
# to asynchronously turn off managed instance protection on the EC2 instances.
# This will leave some EC2 instances stranded in a state where they are protected from scale-in forever.
# This then blocks the AWS::AutoScaling::AutoScalingGroup from cleaning itself up.
# The custom resource function force destroys the autoscaling group when tearing down the stack,
# avoiding the issue of protected EC2 instances that can never be cleaned up.
CustomAsgDestroyerFunction:
Type: AWS::Lambda::Function
Properties:
Code:
ZipFile: |
const { AutoScalingClient, DeleteAutoScalingGroupCommand } = require("@aws-sdk/client-auto-scaling");
const response = require('cfn-response');
exports.handler = async function(event, context) {
console.log(event);
if (event.RequestType !== "Delete") {
await response.send(event, context, response.SUCCESS);
return;
}
const autoscaling = new AutoScalingClient({ region: event.ResourceProperties.Region });
const input = {
AutoScalingGroupName: event.ResourceProperties.AutoScalingGroupName,
ForceDelete: true
};
const command = new DeleteAutoScalingGroupCommand(input);
const deleteResponse = await autoscaling.send(command);
console.log(deleteResponse);
await response.send(event, context, response.SUCCESS);
};
Handler: index.handler
Runtime: nodejs20.x
Timeout: 30
Role: !GetAtt CustomAsgDestroyerRole.Arn
# The role used by the ASG destroyer
CustomAsgDestroyerRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: 2012-10-17
Statement:
- Effect: Allow
Principal:
Service:
- lambda.amazonaws.com
Action:
- sts:AssumeRole
ManagedPolicyArns:
# https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSLambdaBasicExecutionRole.html
- arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
Policies:
- PolicyName: allow-to-delete-autoscaling-group
PolicyDocument:
Version: 2012-10-17
Statement:
- Effect: Allow
Action: autoscaling:DeleteAutoScalingGroup
Resource: !Sub arn:aws:autoscaling:${AWS::Region}:${AWS::AccountId}:autoScalingGroup:*:autoScalingGroupName/${ECSAutoScalingGroup}
CustomAsgDestroyer:
Type: Custom::AsgDestroyer
DependsOn:
- EC2Role
Properties:
ServiceToken: !GetAtt CustomAsgDestroyerFunction.Arn
Region: !Ref "AWS::Region"
AutoScalingGroupName: !Ref ECSAutoScalingGroup
# Turn on ENI trunking for the EC2 instances. This setting is not on by default,
# but it is highly important for increasing the density of AWS VPC networking mode
# tasks per instance. Additionally, it is not controllable by default in CloudFormation
# because it has some complexity of needing to be turned on by a bearer of the role
# of the EC2 instances themselves. With this custom function we can assume the EC2 role
# then use that role to call the ecs:PutAccountSetting API in order to enable
# ENI trunking
CustomEniTrunkingFunction:
Type: AWS::Lambda::Function
Properties:
Code:
ZipFile: |
const { ECSClient, PutAccountSettingCommand } = require("@aws-sdk/client-ecs");
const { STSClient, AssumeRoleCommand } = require("@aws-sdk/client-sts");
const response = require('cfn-response');
exports.handler = async function(event, context) {
console.log(event);
if (event.RequestType == "Delete") {
await response.send(event, context, response.SUCCESS);
return;
}
const sts = new STSClient({ region: event.ResourceProperties.Region });
const assumeRoleResponse = await sts.send(new AssumeRoleCommand({
RoleArn: event.ResourceProperties.EC2Role,
RoleSessionName: "eni-trunking-enable-session",
DurationSeconds: 900
}));
// Instantiate an ECS client using the credentials of the EC2 role
const ecs = new ECSClient({
region: event.ResourceProperties.Region,
credentials: {
accessKeyId: assumeRoleResponse.Credentials.AccessKeyId,
secretAccessKey: assumeRoleResponse.Credentials.SecretAccessKey,
sessionToken: assumeRoleResponse.Credentials.SessionToken
}
});
const putAccountResponse = await ecs.send(new PutAccountSettingCommand({
name: 'awsvpcTrunking',
value: 'enabled'
}));
console.log(putAccountResponse);
await response.send(event, context, response.SUCCESS);
};
Handler: index.handler
Runtime: nodejs20.x
Timeout: 30
Role: !GetAtt CustomEniTrunkingRole.Arn
# The role used by the ENI trunking custom resource
CustomEniTrunkingRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: 2012-10-17
Statement:
- Effect: Allow
Principal:
Service:
- lambda.amazonaws.com
Action:
- sts:AssumeRole
ManagedPolicyArns:
# https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSLambdaBasicExecutionRole.html
- arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
# This allows the custom CloudFormation resource in Lambda
# to assume the role that is used by the EC2 instances. The Lambda function must
# assume this role because the ecs:PutAccountSetting must be called either
# by the role that the setting is for, or by the root account, and we aren't
# using the root account for CloudFormation.
AllowEniTrunkingRoleToAssumeEc2Role:
Type: AWS::IAM::Policy
Properties:
Roles:
- !Ref CustomEniTrunkingRole
PolicyName: allow-to-assume-ec2-role
PolicyDocument:
Version: 2012-10-17
Statement:
- Effect: Allow
Action: sts:AssumeRole
Resource: !GetAtt EC2Role.Arn
# This is the actual custom resource, which triggers the invocation
# of the Lambda function that enables ENI trunking during the stack deploy
CustomEniTrunking:
Type: Custom::CustomEniTrunking
Properties:
ServiceToken: !GetAtt CustomEniTrunkingFunction.Arn
Region: !Ref "AWS::Region"
EC2Role: !GetAtt EC2Role.Arn
# Autoscaling group. This launches the actual EC2 instances that will register
# themselves as members of the cluster, and run the docker containers.
ECSAutoScalingGroup:
Type: AWS::AutoScaling::AutoScalingGroup
DependsOn:
# This is to ensure that the ASG gets deleted first before these
# resources, when it comes to stack teardown.
- ECSCluster
- EC2Role
UpdatePolicy:
# This configures the ASG to wait on resource signals from the cfn-init
# script that runs on the instance itself. Depending on the expected
# total size of your ASG you may need to tune the parameters below
AutoScalingRollingUpdate:
MaxBatchSize: 5
MinInstancesInService: 1 # Note that ECS draining hook will maintain instances that are still hosting tasks
PauseTime: PT2M
WaitOnResourceSignals: true
MinSuccessfulInstancesPercent: 100
Properties:
VPCZoneIdentifier: !Ref SubnetIds
LaunchTemplate:
LaunchTemplateId: !Ref ContainerInstances
Version: !GetAtt ContainerInstances.LatestVersionNumber
MinSize: 0
MaxSize: !Ref MaxSize
# We are relying on ECS draining to safely drain tasks from hosts that need
# to be replaced.
NewInstancesProtectedFromScaleIn: false
# The config for each instance that is added to the cluster
ContainerInstances:
Type: AWS::EC2::LaunchTemplate
Metadata:
AWS::CloudFormation::Init:
configSets:
full_install: [install_deps, verify_instance_health, signal_cfn]
# Install dependencies
install_deps:
commands:
InstallDependencies:
command: |
yum install -y awscli jq
# Check the ECS API to see if this instance is available as capacity
# inside of the ECS cluster, and wait for it to run the healthiness daemon
verify_instance_health:
commands:
ECSHealthCheck:
command: |
echo "Introspecting ECS agent status"
find_container_instance_arn() {
CONTAINER_INSTANCE_ARN=$(curl --connect-timeout 1 --max-time 1 -s http://localhost:51678/v1/metadata | jq -r '.ContainerInstanceArn')
}
find_container_instance_arn
while [ "$CONTAINER_INSTANCE_ARN" == "" ]; do sleep 2; find_container_instance_arn; done
echo "Container Instance ARN: $CONTAINER_INSTANCE_ARN"
echo "Waiting for at least one running task"
count_instance_tasks() {
NUMBER_OF_TASKS=$(curl -s http://localhost:51678/v1/tasks | jq '.Tasks | length')
}
count_instance_tasks
while [ $NUMBER_OF_TASKS -lt 1 ]; do sleep 2; count_instance_tasks; done
echo "Instance $CONTAINER_INSTANCE_ARN is now hosting $NUMBER_OF_TASKS task(s)"
# This signals back to CloudFormation once the instance has become healthy in ECS
# and has started hosting at least one task
signal_cfn:
commands:
SignalCloudFormation:
command: !Sub |
/opt/aws/bin/cfn-signal -e $? --stack ${AWS::StackId} --resource ECSAutoScalingGroup --region ${AWS::Region}
Properties:
LaunchTemplateData:
ImageId: !Ref ECSAMI
InstanceType: !Ref InstanceType
IamInstanceProfile:
Name: !Ref EC2InstanceProfile
SecurityGroupIds:
- !Ref ContainerHostSecurityGroup
UserData:
# This injected configuration file is how the EC2 instance
# knows which ECS cluster on your AWS account it should be joining
# It also initiates a CloudFormation init, so that the instance can
# signal back to CloudFormation when it is ready and healthy in the ECS cluster
Fn::Base64: !Sub |
#!/bin/bash -xe
echo ECS_CLUSTER=${ECSCluster} >> /etc/ecs/ecs.config
yum install -y aws-cfn-bootstrap
/opt/aws/bin/cfn-init -v --stack ${AWS::StackId} --resource ContainerInstances --configsets full_install --region ${AWS::Region} &
BlockDeviceMappings:
- DeviceName: "/dev/xvda"
Ebs:
VolumeSize: 50
VolumeType: gp3
# Disable IMDSv1, and require IMDSv2
MetadataOptions:
HttpEndpoint: enabled
HttpTokens: required
EC2InstanceProfile:
Type: AWS::IAM::InstanceProfile
Properties:
Path: /
Roles:
- !Ref EC2Role
# Create an ECS capacity provider to attach the ASG to the ECS cluster
# so that it autoscales as we launch more containers
CapacityProvider:
Type: AWS::ECS::CapacityProvider
Properties:
AutoScalingGroupProvider:
AutoScalingGroupArn: !Ref ECSAutoScalingGroup
ManagedScaling:
InstanceWarmupPeriod: 60
MinimumScalingStepSize: 1
MaximumScalingStepSize: 100
Status: ENABLED
# Percentage of cluster reservation to try to maintain
TargetCapacity: 100
ManagedTerminationProtection: DISABLED
ManagedDraining: ENABLED
# Create a cluster capacity provider association so that the cluster
# will use the capacity provider
CapacityProviderAssociation:
Type: AWS::ECS::ClusterCapacityProviderAssociations
DependsOn:
- CustomEniTrunking
- CustomAsgDestroyer
Properties:
CapacityProviders:
- !Ref CapacityProvider
Cluster: !Ref ECSCluster
DefaultCapacityProviderStrategy:
- Base: 0
CapacityProvider: !Ref CapacityProvider
Weight: 1
# A security group for the EC2 hosts that will run the containers.
# This can be used to limit incoming traffic to or outgoing traffic
# from the container's host EC2 instance.
ContainerHostSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Access to the EC2 hosts that run containers
VpcId: !Ref VpcId
# Role for the EC2 hosts. This allows the ECS agent on the EC2 hosts
# to communicate with the ECS control plane, as well as download the docker
# images from ECR to run on your host.
EC2Role:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Statement:
# Allow the EC2 instances to assume this role
- Effect: Allow
Principal:
Service: [ec2.amazonaws.com]
Action: ['sts:AssumeRole']
# Allow the ENI trunking function to assume this role in order to enable
# ENI trunking while operating under the identity of this role
- Effect: Allow
Principal:
AWS: !GetAtt CustomEniTrunkingRole.Arn
Action: ['sts:AssumeRole']
Path: /
# See reference: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/security-iam-awsmanpol.html#security-iam-awsmanpol-AmazonEC2ContainerServiceforEC2Role
ManagedPolicyArns:
- arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
Policies:
# The ENI trunking function will assume this role and then use
# the ecs:PutAccountSetting to set ENI trunking on for this role
- PolicyName: allow-to-modify-ecs-settings
PolicyDocument:
Version: 2012-10-17
Statement:
- Effect: Allow
Action: ecs:PutAccountSetting
Resource: '*'
# This is a role which is used by the ECS agent on behalf of a task
# to download images, and upload logs.
ECSTaskExecutionRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Statement:
- Effect: Allow
Principal:
Service: [ecs-tasks.amazonaws.com]
Action: ['sts:AssumeRole']
Condition:
ArnLike:
aws:SourceArn: !Sub arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:*
StringEquals:
aws:SourceAccount: !Ref AWS::AccountId
Path: /
# This role enables basic features of ECS. See reference:
# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/security-iam-awsmanpol.html#security-iam-awsmanpol-AmazonECSTaskExecutionRolePolicy
ManagedPolicyArns:
- arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
# This launches a very basic container which is only used to verify that an EC2
# host is capable of launching tasks. The existence of this task is used as an
# EC2 host sanity check. If the EC2 host is incapable of launching this task it will
# fail to signal CloudFormation, and CloudFormation will rollback.
HealthinessDaemonDefinition:
Type: AWS::ECS::TaskDefinition
Properties:
Family: 'healthiness-daemon'
Memory: 10
RequiresCompatibilities:
- EC2
ExecutionRoleArn: !GetAtt ECSTaskExecutionRole.Arn
ContainerDefinitions:
- Name: 'healthcheck-pause'
Image: public.ecr.aws/docker/library/busybox:latest
EntryPoint:
- /bin/sh
- -c
Command:
- while :; do sleep 2073600; done
# This launches one copy of the healthiness daemon onto each host
# in the cluster.
HealthinessDaemon:
Type: AWS::ECS::Service
Properties:
ServiceName: 'healthiness-daemon'
Cluster: !Ref ECSCluster
LaunchType: EC2
SchedulingStrategy: DAEMON
TaskDefinition: !Ref HealthinessDaemonDefinition
Outputs:
ClusterName:
Description: The ECS cluster into which to launch resources
Value: !Ref ECSCluster
ECSTaskExecutionRole:
Description: The role used to start up a task
Value: !Ref ECSTaskExecutionRole
CapacityProvider:
Description: The cluster capacity provider that the service should use
to request capacity when it wants to start up a task
Value: !Ref CapacityProvider
This stack accepts the following parameters that can be used to adjust its behavior:
- InstanceType - An ECS instance type. By default the stack deploys c5.xlarge
- MaxSize - An upper limit on the number of EC2 instances to scale up to. Default 100
- ECSAMI - The Amazon Machine Image to use for each EC2 instance. Don't change this unless you really know what you are doing.
- VpcId - The VPC to launch EC2 instances in. Can be the default account VPC.
- SubnetIds - A comma separated list of subnets from that VPC.
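As an illustration of how these parameters might be supplied, the sketch below wraps a sam deploy call for the cluster template on its own. The function name, the default stack name, and the override values for InstanceType and MaxSize are assumptions chosen for the example, and the VPC and subnet IDs must come from your own account.

```shell
# Deploy the cluster template on its own.
# $1 = VPC ID, $2 = comma separated subnet IDs, $3 = stack name (hypothetical default)
deploy_cluster_stack() {
  sam deploy \
    --template-file cluster-capacity-provider.yml \
    --stack-name "${3:-capacity-provider-cluster}" \
    --resolve-s3 \
    --capabilities CAPABILITY_IAM \
    --parameter-overrides \
      "VpcId=$1" \
      "SubnetIds=$2" \
      "InstanceType=c5.large" \
      "MaxSize=10"
}
```

ECSAMI is deliberately left out of the overrides so that it keeps its default and continues to track the recommended AMI.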
A few things to look out for in this template:
- CustomAsgDestroyerFunction - This is a custom CloudFormation resource that helps clean up the Auto Scaling Group faster when tearing down the stack.
- CustomEniTrunkingFunction - This custom CloudFormation resource enables ENI trunking. See the "ENI trunking for Amazon ECS" pattern for more details.
- AWS::AutoScaling::AutoScalingGroup -> UpdatePolicy - This configuration enables the Auto Scaling Group to automatically roll out updates whenever the ECS AMI is updated. The WaitOnResourceSignals setting is used to validate the EC2 instance health during rolling updates.
- AWS::CloudFormation::Init - This block of configuration defines commands that run on each EC2 instance after it launches. The commands use the ECS agent introspection endpoint to validate that the instance is able to connect to ECS and launch a task.
- HealthinessDaemon - This is an ECS DAEMON type service that launches a lightweight container on each host that just sleeps forever. The existence of this container is used as an indication that the host has been able to successfully join the ECS cluster and launch an ECS task.
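The polling loops in the AWS::CloudFormation::Init block have no upper bound of their own and rely on the rolling update's PauseTime to catch a host that never becomes healthy. If you prefer an explicit limit, a bounded retry helper is one option. This is a sketch under that assumption, not what the template does: wait_until and agent_is_up are hypothetical names, and the introspection URL is the one the template already queries.

```shell
# Retry a command until it succeeds or a time budget runs out.
# $1 = timeout in seconds, $2 = seconds between attempts, rest = command to run
wait_until() {
  wait_timeout=$1
  wait_interval=$2
  shift 2
  wait_elapsed=0
  until "$@"; do
    wait_elapsed=$(( wait_elapsed + wait_interval ))
    if [ "$wait_elapsed" -gt "$wait_timeout" ]; then
      echo "gave up after ${wait_timeout}s: $*" >&2
      return 1
    fi
    sleep "$wait_interval"
  done
}

# Hypothetical check: succeeds once the ECS agent introspection endpoint answers
agent_is_up() {
  curl --connect-timeout 1 --max-time 1 -sf http://localhost:51678/v1/metadata >/dev/null
}
```

On an instance, wait_until 100 2 agent_is_up would stop polling shortly before the two minute PauseTime expires, so the failure shows up in the instance's own logs instead of the loop polling indefinitely.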
Service with a Capacity Provider Strategy
Download the following service-capacity-provider.yml file. This CloudFormation template deploys an ECS service into the cluster, with a capacity provider strategy set up. The service will signal the capacity provider to request capacity, and the capacity provider will scale up the EC2 Auto Scaling Group automatically.
AWSTemplateFormatVersion: '2010-09-09'
Description: An example service that deploys onto EC2 capacity with
             a capacity provider strategy that autoscales the underlying
             EC2 Capacity as needed by the service
Parameters:
  VpcId:
    Type: String
    Description: The VPC that the service is running inside of
  SubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of subnet IDs the AWS VPC tasks are inside of
  ClusterName:
    Type: String
    Description: The name of the ECS cluster into which to launch capacity.
  ECSTaskExecutionRole:
    Type: String
    Description: The role used to start up an ECS task
  CapacityProvider:
    Type: String
    Description: The cluster capacity provider that the service should use
                 to request capacity when it wants to start up a task
  ServiceName:
    Type: String
    Default: example-service
    Description: A name for the service
  ImageUrl:
    Type: String
    Default: public.ecr.aws/docker/library/busybox:latest
    Description: The url of a docker image that contains the application process that
                 will handle the traffic for this service
  ContainerCpu:
    Type: Number
    Default: 256
    Description: How much CPU to give the container. 1024 is 1 CPU
  ContainerMemory:
    Type: Number
    Default: 512
    Description: How much memory in megabytes to give the container
  Command:
    Type: String
    Default: sleep 3600
    Description: The command to run inside of the container
  DesiredCount:
    Type: Number
    Default: 0
    Description: How many copies of the service task to run
Resources:
  # The task definition. This is a simple metadata description of what
  # container to run, and what resource requirements it has.
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Ref ServiceName
      Cpu: !Ref ContainerCpu
      Memory: !Ref ContainerMemory
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - EC2
      ExecutionRoleArn: !Ref ECSTaskExecutionRole
      ContainerDefinitions:
        - Name: !Ref ServiceName
          Cpu: !Ref ContainerCpu
          Memory: !Ref ContainerMemory
          Image: !Ref ImageUrl
          Command: !Split [' ', !Ref 'Command']
          LogConfiguration:
            LogDriver: 'awslogs'
            Options:
              mode: non-blocking
              max-buffer-size: 25m
              awslogs-group: !Ref LogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Ref ServiceName
  # The service. The service is a resource which allows you to run multiple
  # copies of a type of task, and gather up their logs and metrics, as well
  # as monitor the number of running tasks and replace any that have crashed
  Service:
    Type: AWS::ECS::Service
    Properties:
      ServiceName: !Ref ServiceName
      Cluster: !Ref ClusterName
      PlacementStrategies:
        - Field: attribute:ecs.availability-zone
          Type: spread
        - Field: cpu
          Type: binpack
      CapacityProviderStrategy:
        - Base: 0
          CapacityProvider: !Ref CapacityProvider
          Weight: 1
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 75
      DesiredCount: !Ref DesiredCount
      NetworkConfiguration:
        AwsvpcConfiguration:
          SecurityGroups:
            - !Ref ServiceSecurityGroup
          Subnets: !Ref SubnetIds
      TaskDefinition: !Ref TaskDefinition
  # Because we are launching tasks in AWS VPC networking mode
  # the tasks themselves also have an extra security group that is unique
  # to them. This is a unique security group just for this service,
  # to control which things it can talk to, and who can talk to it
  ServiceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: !Sub Access to service ${ServiceName}
      VpcId: !Ref VpcId
  # This log group stores the stdout logs from this service's containers
  LogGroup:
    Type: AWS::Logs::LogGroup
Most parameters in this stack will be supplied by a parent stack that passes in resources from the capacity provider stack. However, you may be interested in overriding the following parameters:
ServiceName
- A human-readable name for the service.
ImageUrl
- URL of a container image to run. By default this stack deploys public.ecr.aws/docker/library/busybox:latest
ContainerCpu
- CPU shares, where 1024 CPU shares equal 1 vCPU. Default: 256 (1/4th vCPU)
ContainerMemory
- Megabytes of memory to give the container. Default: 512
Command
- Command to run in the container. Default: sleep 3600
DesiredCount
- Number of copies of the container to run. Default: 0 (so you can test scaling up from zero)
Parent Stack
Download the following parent.yml file. This stack deploys both of the previous stacks as nested stacks, for ease of grouping and passing parameters from one stack to the next.
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: Parent stack that deploys the ECS cluster and capacity provider
             then launches a service inside of the cluster
Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id
    Description: VPC ID where the ECS cluster is launched
  SubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of subnet IDs where the EC2 instances will be launched
Resources:
  # This stack contains cluster wide resources that will be shared
  # by all services that get launched in the stack
  BaseStack:
    Type: AWS::Serverless::Application
    Properties:
      Location: cluster-capacity-provider.yml
      Parameters:
        VpcId: !Ref VpcId
        SubnetIds: !Join [',', !Ref SubnetIds]
  # This service will be launched into the cluster by passing
  # details from the base stack into the service stack
  Service:
    Type: AWS::Serverless::Application
    Properties:
      Location: service-capacity-provider.yml
      Parameters:
        VpcId: !Ref VpcId
        SubnetIds: !Join [',', !Ref SubnetIds]
        ClusterName: !GetAtt BaseStack.Outputs.ClusterName
        ECSTaskExecutionRole: !GetAtt BaseStack.Outputs.ECSTaskExecutionRole
        CapacityProvider: !GetAtt BaseStack.Outputs.CapacityProvider
This parent stack requires the following parameters:
VpcId
- The ID of a VPC on your AWS account. This can be the default VPC.
SubnetIds
- A comma separated list of subnet IDs within that VPC
Deploying the stacks with SAM
You should now have three files:
cluster-capacity-provider.yml
- Defines an ECS cluster with production ready operational enhancements
service-capacity-provider.yml
- Defines an ECS service that deploys into the cluster
parent.yml
- Parent file that deploys both of the previous files
Use SAM CLI to deploy everything with a command like this:
# Get the VPC ID of the default VPC on the AWS account
DEFAULT_VPC_ID=$(aws ec2 describe-vpcs --filters Name=is-default,Values=true --query 'Vpcs[0].VpcId' --output text)
# Grab the list of subnet IDs from the default VPC, and glue it together into a comma separated list
DEFAULT_VPC_SUBNET_IDS=$(aws ec2 describe-subnets --filters Name=vpc-id,Values=$DEFAULT_VPC_ID --query "Subnets[*].[SubnetId]" --output text | paste -sd, -)
# Now deploy the ECS cluster to the default VPC and its subnets
sam deploy \
  --template-file parent.yml \
  --stack-name capacity-provider-environment \
  --resolve-s3 \
  --capabilities CAPABILITY_IAM \
  --parameter-overrides VpcId=$DEFAULT_VPC_ID SubnetIds=$DEFAULT_VPC_SUBNET_IDS
INFO
This sample command deploys the stack to the AWS account's pre-existing default VPC. You may wish to deploy the workload to a custom VPC, such as the "Large sized VPC for an Amazon ECS cluster".
WARNING
Depending on what you choose to call your stack in the stack-name parameter, you may get an error in CloudFormation that looks like this:
CreateCapacityProvider error: The specified capacity provider name is invalid. Up to 255 characters are allowed, including letters (upper and lowercase), numbers, underscores, and hyphens. The name cannot be prefixed with "aws", "ecs", or "fargate". Specify a valid name and try again.
If this happens, ensure that your parent CloudFormation stack's name does not start with "aws", "ecs", or "fargate". The capacity provider in the stack gets an autogenerated name that is derived from the stack name, so if the stack name starts with a prohibited word, the capacity provider's name will also start with that prohibited word.
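One way to catch this before deploying is a small pre-flight check on the stack name. The sketch below is an illustration only: check_stack_name is a hypothetical helper, and comparing the name case-insensitively is a conservative assumption, since the error message does not say whether letter case matters.

```shell
# Hypothetical helper: returns 0 if the stack name looks safe to use,
# 1 if it starts with a prefix that is rejected for capacity provider names.
check_stack_name() {
  # Lowercase the name first. Treating the prefixes case-insensitively
  # is an assumption made to stay on the safe side.
  lowered=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$lowered" in
    aws*|ecs*|fargate*)
      echo "Stack name '$1' starts with a prohibited prefix" >&2
      return 1
      ;;
  esac
  return 0
}

STACK_NAME=capacity-provider-environment
check_stack_name "$STACK_NAME" && echo "OK to deploy $STACK_NAME"
```

The same function could guard the sam deploy command, so that the deployment is skipped when the check fails.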
Test scaling up from zero
Initially the ECS cluster will be empty, with no EC2 instances. Additionally, the deployed service has a DesiredCount of zero, so there are initially no containers being launched either.
Use the Amazon ECS web console to update the service and set the desired count to a higher number of tasks. You will observe the ECS cluster launch the requested tasks into an initial status of PROVISIONING. At this point each task is just a virtual placeholder. The capacity provider notices the tasks waiting for capacity and responds by scaling up the Auto Scaling Group to provide some EC2 capacity in the cluster. Finally, ECS places tasks onto this brand new capacity as it comes online.
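If you prefer the command line to the web console, the same test can be sketched with the AWS CLI. The two helper functions below are hypothetical names introduced for illustration. The cluster name is generated by the stack, so it is assumed that you look it up first (for example with aws ecs list-clusters); example-service is the default ServiceName from the service template.

```shell
# Hypothetical helper: change how many copies of the service task should run.
# Prints the desired count that ECS recorded for the service.
set_desired_count() {
  cluster="$1"; service="$2"; count="$3"
  aws ecs update-service \
    --cluster "$cluster" \
    --service "$service" \
    --desired-count "$count" \
    --query 'service.desiredCount' \
    --output text
}

# Hypothetical helper: print the last known status of each task in the service,
# for example PROVISIONING, PENDING, or RUNNING.
task_statuses() {
  cluster="$1"; service="$2"
  arns=$(aws ecs list-tasks --cluster "$cluster" --service-name "$service" \
    --query 'taskArns[]' --output text)
  if [ -z "$arns" ] || [ "$arns" = "None" ]; then
    echo "no tasks yet"
    return 0
  fi
  # $arns is intentionally unquoted so that each ARN becomes its own argument
  aws ecs describe-tasks --cluster "$cluster" --tasks $arns \
    --query 'tasks[].lastStatus' --output text
}

# Example usage (CLUSTER_NAME is a placeholder you must fill in):
#   set_desired_count "$CLUSTER_NAME" example-service 3
#   task_statuses "$CLUSTER_NAME" example-service
```

Running task_statuses repeatedly should show the tasks move out of PROVISIONING once the new EC2 instances have registered with the cluster.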
Test rolling out an EC2 instance update
Whenever there is a new ECS Optimized AMI available, the Auto Scaling Group will roll out the update as part of the next CloudFormation stack update. However, you can simulate an update by modifying the AWS::EC2::LaunchTemplate. Locate the UserData script that runs on each EC2 instance, and add a comment to it. For example:
UserData:
  Fn::Base64: !Sub |
    #!/bin/bash -xe
    # added a test comment here so there is a change for CloudFormation to detect
    echo ECS_CLUSTER=${ECSCluster} >> /etc/ecs/ecs.config
    yum install -y aws-cfn-bootstrap
    /opt/aws/bin/cfn-init -v --stack ${AWS::StackId} --resource ContainerInstances --configsets full_install --region ${AWS::Region} &
Now the next time you deploy, CloudFormation will initiate a rolling update of the Auto Scaling Group to replace all the EC2 instances with new instances. You will see that the container workloads on old hosts are gracefully drained and replaced onto new EC2 hosts prior to the older EC2 hosts shutting down.
Test scaling back down to zero
Last but not least, update the service in the ECS console to adjust its desired count back down to zero. Once all instances are empty, you will see ECS begin to shut down EC2 instances until the cluster has been scaled back down to zero.
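Scaling in is not instant, so it can help to poll the cluster rather than refresh the console. The sketch below is a minimal illustration using aws ecs describe-clusters; wait_for_empty_cluster is a hypothetical helper, and the 30 second default polling interval is an arbitrary assumption.

```shell
# Hypothetical helper: poll until the cluster has no registered EC2 instances.
# Arguments: cluster name, then an optional polling interval in seconds.
wait_for_empty_cluster() {
  cluster="$1"
  interval="${2:-30}"   # assumed default polling interval
  while :; do
    count=$(aws ecs describe-clusters --clusters "$cluster" \
      --query 'clusters[0].registeredContainerInstancesCount' --output text)
    echo "registered container instances: $count"
    if [ "$count" = "0" ]; then
      return 0
    fi
    sleep "$interval"
  done
}

# Example usage (CLUSTER_NAME is a placeholder you must fill in):
#   wait_for_empty_cluster "$CLUSTER_NAME"
```

The loop has no timeout, so interrupt it with Ctrl+C if the cluster does not drain as expected.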
Tear it Down
You can use the following command to tear down the test stack and all of its created resources:
sam delete --stack-name capacity-provider-environment --no-prompts
See Also
- If your workload is interruptible, you may prefer to save money on infrastructure costs by using an EC2 Spot Capacity provider instead.