Evenly balance a large ECS deployment across availability zones
About
Amazon Elastic Container Service is a serverless orchestrator that manages container deployments on your behalf.
Capacity providers are a built-in feature of Amazon ECS. A capacity provider launches Amazon EC2 capacity automatically whenever you need compute capacity to run containers.
This reference architecture shows how to create a set of zonal capacity providers, and how to use a capacity provider strategy to distribute ECS tasks evenly across them.
Why?
Amazon ECS comes with built-in placement strategies that serve the vast majority of workloads. For example, you could use the following "AZ balanced binpack" strategy to tell Amazon ECS to distribute containers evenly across multiple availability zones, while densely packing tasks by CPU to save infrastructure cost where possible:
spread(attribute:ecs.availability-zone), binpack(CPU)
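For reference, this strategy can be expressed on an AWS::ECS::Service in CloudFormation roughly as follows (a fragment of a service definition, not a complete resource):

```yaml
# Fragment of an AWS::ECS::Service definition that applies the
# "AZ balanced binpack" placement strategy
PlacementStrategies:
  - Type: spread
    Field: attribute:ecs.availability-zone
  - Type: binpack
    Field: cpu
```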
However, task placement strategies are best effort. Amazon ECS still places tasks even when the optimal placement option is unavailable.
This means that in some circumstances Amazon ECS may place an excessive number of tasks into one or two AZs. The following diagrams demonstrate one scenario in which this can occur.
Imagine a cluster of three instances distributed across three availability zones. Each instance has capacity to run four tasks:
Now you launch service A which deploys four copies of container A, distributed across availability zones:
Because there is one more task than there are availability zones and instances, the first instance in the first AZ gets two tasks instead of one.
Now you deploy a second service B, which deploys four copies of container B:
Because of the "binpack" part of the strategy, the first instance in the first AZ is once again selected to receive two tasks instead of one. That instance is now packed full of tasks and cannot host any additional tasks.
Now you deploy a third service C, which deploys four copies of container C:
This time the only instances that still have capacity are the two instances in the second and third availability zone. As a result these two instances each get two tasks.
The problem is that this third service is not actually distributed across all availability zones. If the workload has a high availability requirement that it span three availability zones, then this reduced distribution may not be acceptable.
This is not the only scenario in which ECS tasks may end up unbalanced. During rolling deployments and scale-ups, ECS may also choose to make denser use of the currently available capacity rather than launching additional instances, leaving the deployment excessively concentrated into one or two availability zones. In the best case the deployment is merely weighted in favor of one availability zone; in the worst case all tasks for a service could end up placed onto capacity from a single AZ.
In general, this problem becomes less serious with larger services that have a higher desired count, and with an increased number of availability zones. However, instead of relying on scale and random chance, you can use ECS capacity providers to enforce evenly balanced task placement.
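The scenario above can be reproduced with a short simulation. This sketch approximates the "spread then binpack" placement logic; the real ECS scheduler is more sophisticated, and the instance layout and tie-breaking order here are illustrative:

```python
def place_tasks(services, instances):
    """Approximate 'spread(availability-zone), binpack(CPU)' placement.

    services:  list of (service_name, task_count)
    instances: list of dicts with keys 'az', 'cap' (task slots), 'tasks'
    """
    for svc, count in services:
        for _ in range(count):
            candidates = [i for i in instances if len(i["tasks"]) < i["cap"]]
            # spread: prefer the AZ with the fewest tasks of this service;
            # binpack: among those, prefer the most-used instance
            def score(inst):
                az_tasks = sum(
                    t == svc
                    for j in instances if j["az"] == inst["az"]
                    for t in j["tasks"]
                )
                return (az_tasks, -len(inst["tasks"]))
            min(candidates, key=score)["tasks"].append(svc)

# Three instances in three AZs, four task slots each
cluster = [{"az": f"az{n}", "cap": 4, "tasks": []} for n in (1, 2, 3)]
place_tasks([("A", 4), ("B", 4), ("C", 4)], cluster)
for inst in cluster:
    print(inst["az"], inst["tasks"])
# Service C lands only in az2 and az3: the az1 instance filled up first
```

Running it reproduces the walkthrough: the first instance ends up packed with services A and B, so service C is spread across only two of the three availability zones.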
Architecture
The following diagram shows how this reference architecture solves for even task balancing across availability zones:
- Instead of one large EC2 Auto Scaling group that spans all three availability zones, there is a separate EC2 Auto Scaling group for each availability zone.
- Each Auto Scaling group is linked to its own ECS capacity provider.
- An ECS capacity provider strategy is configured to distribute tasks for the service evenly across the three capacity providers.
- Each capacity provider then manages the capacity for its own zone, allowing each zone to scale independently to a larger size if necessary to maintain distributed task placement.
In the above example the same three services have been placed into the cluster. This time they are evenly balanced across all three availability zones. This was accomplished by scaling the first AZ up to a larger size, while keeping some unused space in the other two AZs. As a result there is one instance's worth of aggregate unused compute capacity, but all three services are distributed across all three AZs.
WARNING
This approach will deliberately waste EC2 capacity in order to evenly distribute tasks across availability zones. This capacity provider strategy is not optimized for cost. It is optimized for high availability.
Dependencies
This pattern requires the following local dependencies:
- AWS SAM CLI for deploying CloudFormation stacks on your AWS account. You should follow the appropriate steps for installing SAM CLI.
This architecture is defined as a series of separate infrastructure-as-code modules that are linked together by a parent file that defines the application as a whole. Download each of the following files. Instructions for deployment will follow.
Define the ECS cluster
The following cluster.yml
file defines an ECS cluster, plus some supporting infrastructure that will be reused later on.
AWSTemplateFormatVersion: '2010-09-09'
Description: EC2 ECS cluster that starts out empty, with no EC2 instances yet.
An ECS capacity provider automatically launches more EC2 instances
as required on the fly when you request ECS to launch services or
standalone tasks.
Parameters:
VpcId:
Type: AWS::EC2::VPC::Id
Description: VPC ID where the ECS cluster is launched
Resources:
# Cluster that keeps track of container deployments
ECSCluster:
Type: AWS::ECS::Cluster
Properties:
ClusterSettings:
- Name: containerInsights
Value: enabled
# Custom resource that force destroys the ASG. This cleans up EC2 instances that had
# managed termination protection enabled, but which are not yet released.
# This is necessary because ECS does not immediately release an EC2 instance from termination
# protection as soon as the instance is no longer running tasks. There is a cooldown delay.
# In the case of tearing down the CloudFormation stack, CloudFormation will delete the
# AWS::ECS::Service and immediately move on to tearing down the AWS::ECS::Cluster, disconnecting
# the AWS::AutoScaling::AutoScalingGroup from ECS management too fast, before ECS has a chance
# to asynchronously turn off managed instance protection on the EC2 instances.
# This will leave some EC2 instances stranded in a state where they are protected from scale-in forever.
# This then blocks the AWS::AutoScaling::AutoScalingGroup from cleaning itself up.
# The custom resource function force destroys the autoscaling group when tearing down the stack,
# avoiding the issue of protected EC2 instances that can never be cleaned up.
CustomAsgDestroyerFunction:
Type: AWS::Lambda::Function
Properties:
Code:
ZipFile: |
const { AutoScalingClient, DeleteAutoScalingGroupCommand } = require("@aws-sdk/client-auto-scaling");
const response = require('cfn-response');
exports.handler = async function(event, context) {
console.log(event);
if (event.RequestType !== "Delete") {
await response.send(event, context, response.SUCCESS);
return;
}
const autoscaling = new AutoScalingClient({ region: event.ResourceProperties.Region });
const input = {
AutoScalingGroupName: event.ResourceProperties.AutoScalingGroupName,
ForceDelete: true
};
const command = new DeleteAutoScalingGroupCommand(input);
const deleteResponse = await autoscaling.send(command);
console.log(deleteResponse);
await response.send(event, context, response.SUCCESS);
};
Handler: index.handler
Runtime: nodejs20.x
Timeout: 30
Role: !GetAtt CustomAsgDestroyerRole.Arn
# The role used by the ASG destroyer. Note that this role
# starts out with no permissions to actually delete any ASGs. The stack that
# creates the ASG also adds permissions to this role to allow the role to
# delete the ASG
CustomAsgDestroyerRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: 2012-10-17
Statement:
- Effect: Allow
Principal:
Service:
- lambda.amazonaws.com
Action:
- sts:AssumeRole
ManagedPolicyArns:
# https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSLambdaBasicExecutionRole.html
- arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
# Turn on ENI trunking for the EC2 instances. This setting is not on by default,
# but it is highly important for increasing the density of awsvpc networking mode
# tasks per instance. Additionally, it is not controllable by default in CloudFormation,
# because it must be turned on by a bearer of the EC2 instance role itself.
# With this custom function we can assume the EC2 role, then use that role
# to call the ecs:PutAccountSetting API in order to enable ENI trunking.
CustomEniTrunkingFunction:
Type: AWS::Lambda::Function
Properties:
Code:
ZipFile: |
const { ECSClient, PutAccountSettingCommand } = require("@aws-sdk/client-ecs");
const { STSClient, AssumeRoleCommand } = require("@aws-sdk/client-sts");
const response = require('cfn-response');
exports.handler = async function(event, context) {
console.log(event);
if (event.RequestType == "Delete") {
await response.send(event, context, response.SUCCESS);
return;
}
const sts = new STSClient({ region: event.ResourceProperties.Region });
const assumeRoleResponse = await sts.send(new AssumeRoleCommand({
RoleArn: event.ResourceProperties.EC2Role,
RoleSessionName: "eni-trunking-enable-session",
DurationSeconds: 900
}));
// Instantiate an ECS client using the credentials of the EC2 role
const ecs = new ECSClient({
region: event.ResourceProperties.Region,
credentials: {
accessKeyId: assumeRoleResponse.Credentials.AccessKeyId,
secretAccessKey: assumeRoleResponse.Credentials.SecretAccessKey,
sessionToken: assumeRoleResponse.Credentials.SessionToken
}
});
const putAccountResponse = await ecs.send(new PutAccountSettingCommand({
name: 'awsvpcTrunking',
value: 'enabled'
}));
console.log(putAccountResponse);
await response.send(event, context, response.SUCCESS);
};
Handler: index.handler
Runtime: nodejs20.x
Timeout: 30
Role: !GetAtt CustomEniTrunkingRole.Arn
# The role used by the ENI trunking custom resource
CustomEniTrunkingRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: 2012-10-17
Statement:
- Effect: Allow
Principal:
Service:
- lambda.amazonaws.com
Action:
- sts:AssumeRole
ManagedPolicyArns:
# https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSLambdaBasicExecutionRole.html
- arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
# This allows the custom CloudFormation resource in Lambda
# to assume the role that is used by the EC2 instances. The Lambda function must
# assume this role because the ecs:PutAccountSetting must be called either
# by the role that the setting is for, or by the root account, and we aren't
# using the root account for CloudFormation.
AllowEniTrunkingRoleToAssumeEc2Role:
Type: AWS::IAM::Policy
Properties:
Roles:
- !Ref CustomEniTrunkingRole
PolicyName: allow-to-assume-ec2-role
PolicyDocument:
Version: 2012-10-17
Statement:
- Effect: Allow
Action: sts:AssumeRole
Resource: !GetAtt EC2Role.Arn
# This is the actual custom resource, which triggers the invocation
# of the Lambda function that enables ENI trunking during the stack deploy
CustomEniTrunking:
Type: Custom::CustomEniTrunking
DependsOn:
- AllowEniTrunkingRoleToAssumeEc2Role
Properties:
ServiceToken: !GetAtt CustomEniTrunkingFunction.Arn
Region: !Ref "AWS::Region"
EC2Role: !GetAtt EC2Role.Arn
# A security group for the EC2 hosts that will run the containers.
# This can be used to limit incoming traffic to or outgoing traffic
# from the container's host EC2 instance.
ContainerHostSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: Access to the EC2 hosts that run containers
VpcId: !Ref VpcId
# Role for the EC2 hosts. This allows the ECS agent on the EC2 hosts
# to communicate with the ECS control plane, as well as download the docker
# images from ECR to run on your host.
EC2Role:
Type: AWS::IAM::Role
Properties:
Path: /
AssumeRolePolicyDocument:
Statement:
# Allow the EC2 instances to assume this role
- Effect: Allow
Principal:
Service: [ec2.amazonaws.com]
Action: ['sts:AssumeRole']
# Allow the ENI trunking function to assume this role in order to enable
# ENI trunking while operating under the identity of this role
- Effect: Allow
Principal:
AWS: !GetAtt CustomEniTrunkingRole.Arn
Action: ['sts:AssumeRole']
# See reference: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/security-iam-awsmanpol.html#security-iam-awsmanpol-AmazonEC2ContainerServiceforEC2Role
ManagedPolicyArns:
- arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
# The ENI trunking function will assume this role and then use
# the ecs:PutAccountSetting to set ENI trunking on for this role
Policies:
- PolicyName: allow-to-modify-ecs-settings
PolicyDocument:
Version: 2012-10-17
Statement:
- Effect: Allow
Action: ecs:PutAccountSetting
Resource: '*'
# This is a role which is used by the ECS agent
# to download images, and upload logs.
ECSTaskExecutionRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Statement:
- Effect: Allow
Principal:
Service: [ecs-tasks.amazonaws.com]
Action: ['sts:AssumeRole']
Condition:
ArnLike:
aws:SourceArn: !Sub arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:*
StringEquals:
aws:SourceAccount: !Ref AWS::AccountId
Path: /
# This role enables basic features of ECS. See reference:
# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/security-iam-awsmanpol.html#security-iam-awsmanpol-AmazonECSTaskExecutionRolePolicy
ManagedPolicyArns:
- arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
Outputs:
ClusterName:
Description: The ECS cluster into which to launch resources
Value: !Ref ECSCluster
ECSTaskExecutionRole:
Description: The role used to start up a task
Value: !Ref ECSTaskExecutionRole
ContainerHostSecurityGroup:
Description: The security group of the host EC2 instances
Value: !Ref ContainerHostSecurityGroup
EC2Role:
Description: The role used by EC2 instances in the cluster
Value: !Ref EC2Role
CustomAsgDestroyerFunctionArn:
Description: The Lambda function that assists with cleaning up capacity provider ASGs
Value: !GetAtt CustomAsgDestroyerFunction.Arn
CustomAsgDestroyerFunctionRole:
Description: The Lambda function's role, used for adding policies to allow deleting an ASG
Value: !Ref CustomAsgDestroyerRole
Things to look for:
- CustomAsgDestroyerFunction - This custom CloudFormation resource helps clean up the stack faster on teardown
- CustomEniTrunkingFunction - A custom CloudFormation resource that enables ENI trunking for the EC2 instances
Define a zonal capacity provider
Now download the following single-az-capacity-provider.yml
file, which defines an Auto Scaling group and a capacity provider for a single availability zone:
AWSTemplateFormatVersion: '2010-09-09'
Description: EC2 ECS cluster that starts out empty, with no EC2 instances yet.
An ECS capacity provider automatically launches more EC2 instances
as required on the fly when you request ECS to launch services or
standalone tasks.
Parameters:
InstanceType:
Type: String
Default: c5.xlarge
Description: Class of EC2 instance used to host containers. Choose t2 for testing, m5 for general purpose, c5 for CPU intensive services, and r5 for memory intensive services
AllowedValues: ["a1.2xlarge", "a1.4xlarge", "a1.large", "a1.medium", "a1.metal", "a1.xlarge", "c1.medium", "c1.xlarge", "c3.2xlarge", "c3.4xlarge", "c3.8xlarge", "c3.large", "c3.xlarge", "c4.2xlarge", "c4.4xlarge", "c4.8xlarge", "c4.large", "c4.xlarge", "c5.12xlarge", "c5.18xlarge", "c5.24xlarge", "c5.2xlarge", "c5.4xlarge", "c5.9xlarge", "c5.large", "c5.metal", "c5.xlarge", "c5a.12xlarge", "c5a.16xlarge", "c5a.24xlarge", "c5a.2xlarge", "c5a.4xlarge", "c5a.8xlarge", "c5a.large", "c5a.xlarge", "c5ad.12xlarge", "c5ad.16xlarge", "c5ad.24xlarge", "c5ad.2xlarge", "c5ad.4xlarge", "c5ad.8xlarge", "c5ad.large", "c5ad.xlarge", "c5d.12xlarge", "c5d.18xlarge", "c5d.24xlarge", "c5d.2xlarge", "c5d.4xlarge", "c5d.9xlarge", "c5d.large", "c5d.metal", "c5d.xlarge", "c5n.18xlarge", "c5n.2xlarge", "c5n.4xlarge", "c5n.9xlarge", "c5n.large", "c5n.metal", "c5n.xlarge", "c6a.12xlarge", "c6a.16xlarge", "c6a.24xlarge", "c6a.2xlarge", "c6a.32xlarge", "c6a.48xlarge", "c6a.4xlarge", "c6a.8xlarge", "c6a.large", "c6a.metal", "c6a.xlarge", "c6g.12xlarge", "c6g.16xlarge", "c6g.2xlarge", "c6g.4xlarge", "c6g.8xlarge", "c6g.large", "c6g.medium", "c6g.metal", "c6g.xlarge", "c6gd.12xlarge", "c6gd.16xlarge", "c6gd.2xlarge", "c6gd.4xlarge", "c6gd.8xlarge", "c6gd.large", "c6gd.medium", "c6gd.metal", "c6gd.xlarge", "c6gn.12xlarge", "c6gn.16xlarge", "c6gn.2xlarge", "c6gn.4xlarge", "c6gn.8xlarge", "c6gn.large", "c6gn.medium", "c6gn.xlarge", "c6i.12xlarge", "c6i.16xlarge", "c6i.24xlarge", "c6i.2xlarge", "c6i.32xlarge", "c6i.4xlarge", "c6i.8xlarge", "c6i.large", "c6i.metal", "c6i.xlarge", "c6id.12xlarge", "c6id.16xlarge", "c6id.24xlarge", "c6id.2xlarge", "c6id.32xlarge", "c6id.4xlarge", "c6id.8xlarge", "c6id.large", "c6id.metal", "c6id.xlarge", "c6in.12xlarge", "c6in.16xlarge", "c6in.24xlarge", "c6in.2xlarge", "c6in.32xlarge", "c6in.4xlarge", "c6in.8xlarge", "c6in.large", "c6in.metal", "c6in.xlarge", "c7g.12xlarge", "c7g.16xlarge", "c7g.2xlarge", "c7g.4xlarge", "c7g.8xlarge", "c7g.large", "c7g.medium", 
"c7g.metal", "c7g.xlarge", "c7gd.12xlarge", "c7gd.16xlarge", "c7gd.2xlarge", "c7gd.4xlarge", "c7gd.8xlarge", "c7gd.large", "c7gd.medium", "c7gd.xlarge", "c7gn.12xlarge", "c7gn.16xlarge", "c7gn.2xlarge", "c7gn.4xlarge", "c7gn.8xlarge", "c7gn.large", "c7gn.medium", "c7gn.xlarge", "cc2.8xlarge", "cr1.8xlarge", "d2.2xlarge", "d2.4xlarge", "d2.8xlarge", "d2.xlarge", "d3.2xlarge", "d3.4xlarge", "d3.8xlarge", "d3.xlarge", "d3en.12xlarge", "d3en.2xlarge", "d3en.4xlarge", "d3en.6xlarge", "d3en.8xlarge", "d3en.xlarge", "dl1.24xlarge", "f1.16xlarge", "f1.2xlarge", "f1.4xlarge", "g2.2xlarge", "g2.8xlarge", "g3.16xlarge", "g3.4xlarge", "g3.8xlarge", "g3s.xlarge", "g4ad.16xlarge", "g4ad.2xlarge", "g4ad.4xlarge", "g4ad.8xlarge", "g4ad.xlarge", "g4dn.12xlarge", "g4dn.16xlarge", "g4dn.2xlarge", "g4dn.4xlarge", "g4dn.8xlarge", "g4dn.metal", "g4dn.xlarge", "g5.12xlarge", "g5.16xlarge", "g5.24xlarge", "g5.2xlarge", "g5.48xlarge", "g5.4xlarge", "g5.8xlarge", "g5.xlarge", "g5g.16xlarge", "g5g.2xlarge", "g5g.4xlarge", "g5g.8xlarge", "g5g.metal", "g5g.xlarge", "h1.16xlarge", "h1.2xlarge", "h1.4xlarge", "h1.8xlarge", "hpc7g.16xlarge", "hpc7g.4xlarge", "hpc7g.8xlarge", "hs1.8xlarge", "i2.2xlarge", "i2.4xlarge", "i2.8xlarge", "i2.large", "i2.xlarge", "i3.16xlarge", "i3.2xlarge", "i3.4xlarge", "i3.8xlarge", "i3.large", "i3.metal", "i3.xlarge", "i3en.12xlarge", "i3en.24xlarge", "i3en.2xlarge", "i3en.3xlarge", "i3en.6xlarge", "i3en.large", "i3en.metal", "i3en.xlarge", "i4g.16xlarge", "i4g.2xlarge", "i4g.4xlarge", "i4g.8xlarge", "i4g.large", "i4g.xlarge", "i4i.16xlarge", "i4i.2xlarge", "i4i.32xlarge", "i4i.4xlarge", "i4i.8xlarge", "i4i.large", "i4i.metal", "i4i.xlarge", "im4gn.16xlarge", "im4gn.2xlarge", "im4gn.4xlarge", "im4gn.8xlarge", "im4gn.large", "im4gn.xlarge", "inf1.24xlarge", "inf1.2xlarge", "inf1.6xlarge", "inf1.xlarge", "inf2.24xlarge", "inf2.48xlarge", "inf2.8xlarge", "inf2.xlarge", "is4gen.2xlarge", "is4gen.4xlarge", "is4gen.8xlarge", "is4gen.large", "is4gen.medium", 
"is4gen.xlarge", "m1.large", "m1.medium", "m1.small", "m1.xlarge", "m2.2xlarge", "m2.4xlarge", "m2.xlarge", "m3.2xlarge", "m3.large", "m3.medium", "m3.xlarge", "m4.10xlarge", "m4.16xlarge", "m4.2xlarge", "m4.4xlarge", "m4.large", "m4.xlarge", "m5.12xlarge", "m5.16xlarge", "m5.24xlarge", "m5.2xlarge", "m5.4xlarge", "m5.8xlarge", "m5.large", "m5.metal", "m5.xlarge", "m5a.12xlarge", "m5a.16xlarge", "m5a.24xlarge", "m5a.2xlarge", "m5a.4xlarge", "m5a.8xlarge", "m5a.large", "m5a.xlarge", "m5ad.12xlarge", "m5ad.16xlarge", "m5ad.24xlarge", "m5ad.2xlarge", "m5ad.4xlarge", "m5ad.8xlarge", "m5ad.large", "m5ad.xlarge", "m5d.12xlarge", "m5d.16xlarge", "m5d.24xlarge", "m5d.2xlarge", "m5d.4xlarge", "m5d.8xlarge", "m5d.large", "m5d.metal", "m5d.xlarge", "m5dn.12xlarge", "m5dn.16xlarge", "m5dn.24xlarge", "m5dn.2xlarge", "m5dn.4xlarge", "m5dn.8xlarge", "m5dn.large", "m5dn.metal", "m5dn.xlarge", "m5n.12xlarge", "m5n.16xlarge", "m5n.24xlarge", "m5n.2xlarge", "m5n.4xlarge", "m5n.8xlarge", "m5n.large", "m5n.metal", "m5n.xlarge", "m5zn.12xlarge", "m5zn.2xlarge", "m5zn.3xlarge", "m5zn.6xlarge", "m5zn.large", "m5zn.metal", "m5zn.xlarge", "m6a.12xlarge", "m6a.16xlarge", "m6a.24xlarge", "m6a.2xlarge", "m6a.32xlarge", "m6a.48xlarge", "m6a.4xlarge", "m6a.8xlarge", "m6a.large", "m6a.metal", "m6a.xlarge", "m6g.12xlarge", "m6g.16xlarge", "m6g.2xlarge", "m6g.4xlarge", "m6g.8xlarge", "m6g.large", "m6g.medium", "m6g.metal", "m6g.xlarge", "m6gd.12xlarge", "m6gd.16xlarge", "m6gd.2xlarge", "m6gd.4xlarge", "m6gd.8xlarge", "m6gd.large", "m6gd.medium", "m6gd.metal", "m6gd.xlarge", "m6i.12xlarge", "m6i.16xlarge", "m6i.24xlarge", "m6i.2xlarge", "m6i.32xlarge", "m6i.4xlarge", "m6i.8xlarge", "m6i.large", "m6i.metal", "m6i.xlarge", "m6id.12xlarge", "m6id.16xlarge", "m6id.24xlarge", "m6id.2xlarge", "m6id.32xlarge", "m6id.4xlarge", "m6id.8xlarge", "m6id.large", "m6id.metal", "m6id.xlarge", "m6idn.12xlarge", "m6idn.16xlarge", "m6idn.24xlarge", "m6idn.2xlarge", "m6idn.32xlarge", "m6idn.4xlarge", "m6idn.8xlarge", 
"m6idn.large", "m6idn.metal", "m6idn.xlarge", "m6in.12xlarge", "m6in.16xlarge", "m6in.24xlarge", "m6in.2xlarge", "m6in.32xlarge", "m6in.4xlarge", "m6in.8xlarge", "m6in.large", "m6in.metal", "m6in.xlarge", "m7a.12xlarge", "m7a.16xlarge", "m7a.24xlarge", "m7a.2xlarge", "m7a.32xlarge", "m7a.48xlarge", "m7a.4xlarge", "m7a.8xlarge", "m7a.large", "m7a.medium", "m7a.metal-48xl", "m7a.xlarge", "m7g.12xlarge", "m7g.16xlarge", "m7g.2xlarge", "m7g.4xlarge", "m7g.8xlarge", "m7g.large", "m7g.medium", "m7g.metal", "m7g.xlarge", "m7gd.12xlarge", "m7gd.16xlarge", "m7gd.2xlarge", "m7gd.4xlarge", "m7gd.8xlarge", "m7gd.large", "m7gd.medium", "m7gd.xlarge", "m7i-flex.2xlarge", "m7i-flex.4xlarge", "m7i-flex.8xlarge", "m7i-flex.large", "m7i-flex.xlarge", "m7i.12xlarge", "m7i.16xlarge", "m7i.24xlarge", "m7i.2xlarge", "m7i.48xlarge", "m7i.4xlarge", "m7i.8xlarge", "m7i.large", "m7i.xlarge", "mac1.metal", "mac2.metal", "p2.16xlarge", "p2.8xlarge", "p2.xlarge", "p3.16xlarge", "p3.2xlarge", "p3.8xlarge", "p3dn.24xlarge", "p4d.24xlarge", "p4de.24xlarge", "p5.48xlarge", "r3.2xlarge", "r3.4xlarge", "r3.8xlarge", "r3.large", "r3.xlarge", "r4.16xlarge", "r4.2xlarge", "r4.4xlarge", "r4.8xlarge", "r4.large", "r4.xlarge", "r5.12xlarge", "r5.16xlarge", "r5.24xlarge", "r5.2xlarge", "r5.4xlarge", "r5.8xlarge", "r5.large", "r5.metal", "r5.xlarge", "r5a.12xlarge", "r5a.16xlarge", "r5a.24xlarge", "r5a.2xlarge", "r5a.4xlarge", "r5a.8xlarge", "r5a.large", "r5a.xlarge", "r5ad.12xlarge", "r5ad.16xlarge", "r5ad.24xlarge", "r5ad.2xlarge", "r5ad.4xlarge", "r5ad.8xlarge", "r5ad.large", "r5ad.xlarge", "r5b.12xlarge", "r5b.16xlarge", "r5b.24xlarge", "r5b.2xlarge", "r5b.4xlarge", "r5b.8xlarge", "r5b.large", "r5b.metal", "r5b.xlarge", "r5d.12xlarge", "r5d.16xlarge", "r5d.24xlarge", "r5d.2xlarge", "r5d.4xlarge", "r5d.8xlarge", "r5d.large", "r5d.metal", "r5d.xlarge", "r5dn.12xlarge", "r5dn.16xlarge", "r5dn.24xlarge", "r5dn.2xlarge", "r5dn.4xlarge", "r5dn.8xlarge", "r5dn.large", "r5dn.metal", "r5dn.xlarge", 
"r5n.12xlarge", "r5n.16xlarge", "r5n.24xlarge", "r5n.2xlarge", "r5n.4xlarge", "r5n.8xlarge", "r5n.large", "r5n.metal", "r5n.xlarge", "r6a.12xlarge", "r6a.16xlarge", "r6a.24xlarge", "r6a.2xlarge", "r6a.32xlarge", "r6a.48xlarge", "r6a.4xlarge", "r6a.8xlarge", "r6a.large", "r6a.metal", "r6a.xlarge", "r6g.12xlarge", "r6g.16xlarge", "r6g.2xlarge", "r6g.4xlarge", "r6g.8xlarge", "r6g.large", "r6g.medium", "r6g.metal", "r6g.xlarge", "r6gd.12xlarge", "r6gd.16xlarge", "r6gd.2xlarge", "r6gd.4xlarge", "r6gd.8xlarge", "r6gd.large", "r6gd.medium", "r6gd.metal", "r6gd.xlarge", "r6i.12xlarge", "r6i.16xlarge", "r6i.24xlarge", "r6i.2xlarge", "r6i.32xlarge", "r6i.4xlarge", "r6i.8xlarge", "r6i.large", "r6i.metal", "r6i.xlarge", "r6id.12xlarge", "r6id.16xlarge", "r6id.24xlarge", "r6id.2xlarge", "r6id.32xlarge", "r6id.4xlarge", "r6id.8xlarge", "r6id.large", "r6id.metal", "r6id.xlarge", "r6idn.12xlarge", "r6idn.16xlarge", "r6idn.24xlarge", "r6idn.2xlarge", "r6idn.32xlarge", "r6idn.4xlarge", "r6idn.8xlarge", "r6idn.large", "r6idn.metal", "r6idn.xlarge", "r6in.12xlarge", "r6in.16xlarge", "r6in.24xlarge", "r6in.2xlarge", "r6in.32xlarge", "r6in.4xlarge", "r6in.8xlarge", "r6in.large", "r6in.metal", "r6in.xlarge", "r7g.12xlarge", "r7g.16xlarge", "r7g.2xlarge", "r7g.4xlarge", "r7g.8xlarge", "r7g.large", "r7g.medium", "r7g.metal", "r7g.xlarge", "r7gd.12xlarge", "r7gd.16xlarge", "r7gd.2xlarge", "r7gd.4xlarge", "r7gd.8xlarge", "r7gd.large", "r7gd.medium", "r7gd.xlarge", "r7iz.12xlarge", "r7iz.16xlarge", "r7iz.2xlarge", "r7iz.32xlarge", "r7iz.4xlarge", "r7iz.8xlarge", "r7iz.large", "r7iz.xlarge", "t1.micro", "t2.2xlarge", "t2.large", "t2.medium", "t2.micro", "t2.nano", "t2.small", "t2.xlarge", "t3.2xlarge", "t3.large", "t3.medium", "t3.micro", "t3.nano", "t3.small", "t3.xlarge", "t3a.2xlarge", "t3a.large", "t3a.medium", "t3a.micro", "t3a.nano", "t3a.small", "t3a.xlarge", "t4g.2xlarge", "t4g.large", "t4g.medium", "t4g.micro", "t4g.nano", "t4g.small", "t4g.xlarge", "trn1.2xlarge", "trn1.32xlarge", 
"trn1n.32xlarge", "u-12tb1.112xlarge", "u-18tb1.112xlarge", "u-24tb1.112xlarge", "u-3tb1.56xlarge", "u-6tb1.112xlarge", "u-6tb1.56xlarge", "u-9tb1.112xlarge", "vt1.24xlarge", "vt1.3xlarge", "vt1.6xlarge", "x1.16xlarge", "x1.32xlarge", "x1e.16xlarge", "x1e.2xlarge", "x1e.32xlarge", "x1e.4xlarge", "x1e.8xlarge", "x1e.xlarge", "x2gd.12xlarge", "x2gd.16xlarge", "x2gd.2xlarge", "x2gd.4xlarge", "x2gd.8xlarge", "x2gd.large", "x2gd.medium", "x2gd.metal", "x2gd.xlarge", "x2idn.16xlarge", "x2idn.24xlarge", "x2idn.32xlarge", "x2idn.metal", "x2iedn.16xlarge", "x2iedn.24xlarge", "x2iedn.2xlarge", "x2iedn.32xlarge", "x2iedn.4xlarge", "x2iedn.8xlarge", "x2iedn.metal", "x2iedn.xlarge", "x2iezn.12xlarge", "x2iezn.2xlarge", "x2iezn.4xlarge", "x2iezn.6xlarge", "x2iezn.8xlarge", "x2iezn.metal", "z1d.12xlarge", "z1d.2xlarge", "z1d.3xlarge", "z1d.6xlarge", "z1d.large", "z1d.metal", "z1d.xlarge"]
ConstraintDescription: Please choose a valid instance type.
DesiredCapacity:
Type: Number
Default: '0'
Description: Number of EC2 instances to launch in your ECS cluster.
MaxSize:
Type: Number
Default: '100'
Description: Maximum number of EC2 instances that can be launched in your ECS cluster.
ECSAMI:
Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
Default: /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id
Description: The Amazon Machine Image ID used for the cluster, leave it as the default value to get the latest AMI
SubnetId:
Type: AWS::EC2::Subnet::Id
Description: A single subnet ID to launch the ECS nodes in
ClusterName:
Type: String
Description: The ECS cluster that this capacity provider will be associated with
EC2Role:
Type: String
Description: The role that the EC2 instances will use
ContainerHostSecurityGroup:
Type: AWS::EC2::SecurityGroup::Id
Description: The security group used by the EC2 instances
CustomAsgDestroyerFunctionArn:
Type: String
Description: ARN of a shared Lambda function that provides the logic for a custom CloudFormation
resource that helps clean up ASGs associated with an ECS capacity provider.
CustomAsgDestroyerFunctionRole:
Type: String
Description: The role used by the ASG destroyer function
Resources:
# This allows the ASG destroyer to destroy the capacity provider ASG from this stack.
AllowAsgDestroyerToDestroyThisAsg:
Type: AWS::IAM::Policy
Properties:
Roles:
- !Ref CustomAsgDestroyerFunctionRole
PolicyName: !Sub allow-to-destroy-asg-${ECSAutoScalingGroup}
PolicyDocument:
Version: 2012-10-17
Statement:
- Effect: Allow
Action: autoscaling:DeleteAutoScalingGroup
Resource: !Sub arn:aws:autoscaling:${AWS::Region}:${AWS::AccountId}:autoScalingGroup:*:autoScalingGroupName/${ECSAutoScalingGroup}
# This configures a custom hook that helps destroy the ASG cleanly when
# tearing down the CloudFormation stack.
CustomAsgDestroyer:
Type: Custom::AsgDestroyer
DependsOn:
- AllowAsgDestroyerToDestroyThisAsg
Properties:
ServiceToken: !Ref CustomAsgDestroyerFunctionArn
Region: !Ref "AWS::Region"
AutoScalingGroupName: !Ref ECSAutoScalingGroup
# Autoscaling group. This launches the actual EC2 instances that will register
# themselves as members of the cluster, and run the docker containers.
ECSAutoScalingGroup:
Type: AWS::AutoScaling::AutoScalingGroup
Properties:
VPCZoneIdentifier:
- !Ref SubnetId
LaunchTemplate:
LaunchTemplateId: !Ref ContainerInstances
Version: !GetAtt ContainerInstances.LatestVersionNumber
MinSize: 0
MaxSize: !Ref MaxSize
DesiredCapacity: !Ref DesiredCapacity
NewInstancesProtectedFromScaleIn: true
UpdatePolicy:
AutoScalingReplacingUpdate:
WillReplace: 'true'
# The config for each instance that is added to the cluster
ContainerInstances:
Type: AWS::EC2::LaunchTemplate
Properties:
LaunchTemplateData:
ImageId: !Ref ECSAMI
InstanceType: !Ref InstanceType
IamInstanceProfile:
Name: !Ref EC2InstanceProfile
SecurityGroupIds:
- !Ref ContainerHostSecurityGroup
UserData:
# This injected configuration file is how the EC2 instance
# knows which ECS cluster on your AWS account it should be joining
Fn::Base64: !Sub |
#!/bin/bash
echo ECS_CLUSTER=${ClusterName} >> /etc/ecs/ecs.config
BlockDeviceMappings:
- DeviceName: "/dev/xvda"
Ebs:
VolumeSize: 50
VolumeType: gp3
# Disable IMDSv1, and require IMDSv2
MetadataOptions:
HttpEndpoint: enabled
HttpTokens: required
EC2InstanceProfile:
Type: AWS::IAM::InstanceProfile
Properties:
Path: /
Roles:
- !Ref EC2Role
# Create an ECS capacity provider to attach the ASG to the ECS cluster
# so that it autoscales as we launch more containers
CapacityProvider:
Type: AWS::ECS::CapacityProvider
Properties:
AutoScalingGroupProvider:
AutoScalingGroupArn: !Ref ECSAutoScalingGroup
ManagedScaling:
InstanceWarmupPeriod: 60
MinimumScalingStepSize: 1
MaximumScalingStepSize: 100
Status: ENABLED
# Percentage of cluster reservation to try to maintain
TargetCapacity: 100
ManagedTerminationProtection: ENABLED
ManagedDraining: ENABLED
Outputs:
CapacityProvider:
Description: The cluster capacity provider that the service should use
to request capacity when it wants to start up a task
Value: !Ref CapacityProvider
This pattern will be reused once for each AZ in which we wish to host tasks. Things to look for in this template:
- AWS::AutoScaling::AutoScalingGroup - The Auto Scaling group that launches the EC2 instances. Notice that this is a single-zone Auto Scaling group.
- AWS::ECS::CapacityProvider - The ECS capacity provider that scales the Auto Scaling group up and down in response to task launches
- Custom::AsgDestroyer - The custom CloudFormation resource that destroys the Auto Scaling group on stack teardown
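As a sketch of that reuse, a parent template could instantiate this file once per availability zone along the following lines. The resource names and parameter wiring here are illustrative placeholders, not the actual parent file:

```yaml
# Hypothetical parent-template fragment: one zonal stack per AZ.
# CapacityProviderStack01 and CapacityProviderStack02 would repeat this
# pattern with subnets from the second and third availability zones.
CapacityProviderStack00:
  Type: AWS::CloudFormation::Stack
  Properties:
    TemplateURL: single-az-capacity-provider.yml
    Parameters:
      SubnetId: !Select [0, !Ref SubnetIds]   # a subnet in the first AZ
      ClusterName: !GetAtt ClusterStack.Outputs.ClusterName
      EC2Role: !GetAtt ClusterStack.Outputs.EC2Role
      ContainerHostSecurityGroup: !GetAtt ClusterStack.Outputs.ContainerHostSecurityGroup
      CustomAsgDestroyerFunctionArn: !GetAtt ClusterStack.Outputs.CustomAsgDestroyerFunctionArn
      CustomAsgDestroyerFunctionRole: !GetAtt ClusterStack.Outputs.CustomAsgDestroyerFunctionRole
```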
Define the capacity provider association
In order to use a capacity provider with an ECS cluster it must first be associated with the cluster. This capacity-provider-associations.yml
file defines this association:
AWSTemplateFormatVersion: "2010-09-09"
Description: This stack defines the capacity provider strategy that distributes
tasks evenly across all zonal capacity providers for the cluster.
Parameters:
ClusterName:
Type: String
Description: The cluster that uses the capacity providers
CapacityProvider00:
Type: String
Description: The first capacity provider
CapacityProvider01:
Type: String
Description: The second capacity provider
CapacityProvider02:
Type: String
Description: The third capacity provider
Resources:
# Create a cluster capacity provider association list so that the cluster
# will use the capacity providers
CapacityProviderAssociation:
Type: AWS::ECS::ClusterCapacityProviderAssociations
Properties:
CapacityProviders:
- !Ref CapacityProvider00
- !Ref CapacityProvider01
- !Ref CapacityProvider02
Cluster: !Ref ClusterName
DefaultCapacityProviderStrategy:
- Base: 0
CapacityProvider: !Ref CapacityProvider00
Weight: 1
- Base: 0
CapacityProvider: !Ref CapacityProvider01
Weight: 1
- Base: 0
CapacityProvider: !Ref CapacityProvider02
Weight: 1
Things to note:
- A default capacity provider strategy is configured to distribute tasks evenly across the three capacity providers. You can tune the Base values later depending on the total size of your deployment, but the Weight should be set to one for all capacity providers.
- If you expect to launch a large number of containers, you can use the strategy's Base setting to ensure that a minimum number of tasks is always deployed to each capacity provider. For example, if you expect to always run more than 300 tasks, you could set a base of 100 on each capacity provider. The first 300 tasks will always be distributed perfectly evenly across the three capacity providers, and any remaining tasks will be distributed evenly by weight. This helps maintain balance even under unusual circumstances, such as all tasks in a specific AZ crashing and being replaced at the same time.
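The Base-then-Weight behavior described above can be sketched numerically. The following is an illustrative model of the documented semantics, not ECS's actual scheduler code: each provider first receives up to its base, then the remainder is split proportionally by weight.

```python
def distribute(total_tasks, providers):
    """Approximate how a capacity provider strategy splits tasks.

    providers: list of (name, base, weight) tuples.
    """
    counts = {name: 0 for name, _, _ in providers}
    remaining = total_tasks

    # Each provider first receives up to its base.
    for name, base, _ in providers:
        take = min(base, remaining)
        counts[name] += take
        remaining -= take

    # Remaining tasks are split proportionally by weight.
    total_weight = sum(weight for _, _, weight in providers)
    for name, _, weight in providers:
        counts[name] += remaining * weight // total_weight

    # Hand out any rounding remainder one task at a time.
    leftover = total_tasks - sum(counts.values())
    for name, _, _ in providers[:leftover]:
        counts[name] += 1
    return counts

# 350 tasks, base 100 and weight 1 on each zonal capacity provider:
print(distribute(350, [("az-a", 100, 1), ("az-b", 100, 1), ("az-c", 100, 1)]))
# -> {'az-a': 117, 'az-b': 117, 'az-c': 116}
```

The first 300 tasks land evenly through the bases; the remaining 50 split by weight, so no provider ever differs from another by more than one task.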
Define a service
When launching a service in the cluster you need to specify a capacity provider strategy on the service as well. This service-capacity-provider.yml
defines the service that will run in the cluster, and distributes it across all three capacity providers:
AWSTemplateFormatVersion: '2010-09-09'
Description: An example service that deploys onto EC2 capacity with
a capacity provider strategy that autoscales the underlying
EC2 Capacity as needed by the service
Parameters:
VpcId:
Type: String
Description: The VPC that the service is running inside of
SubnetIds:
Type: List<AWS::EC2::Subnet::Id>
Description: List of subnet IDs the AWS VPC tasks are inside of
ClusterName:
Type: String
Description: The name of the ECS cluster into which to launch capacity.
ECSTaskExecutionRole:
Type: String
Description: The role used to start up an ECS task
CapacityProvider00:
Type: String
Description: First AZ capacity provider
CapacityProvider01:
Type: String
Description: Second AZ capacity provider
CapacityProvider02:
Type: String
Description: Third AZ capacity provider
ServiceName:
Type: String
Default: example-service
Description: A name for the service
ImageUrl:
Type: String
Default: public.ecr.aws/docker/library/busybox:latest
Description: The url of a docker image that contains the application process that
will handle the traffic for this service
ContainerCpu:
Type: Number
Default: 256
Description: How much CPU to give the container. 1024 is 1 CPU
ContainerMemory:
Type: Number
Default: 512
Description: How much memory in megabytes to give the container
Command:
Type: String
Default: sleep 86400
Description: The command to run inside of the container
DesiredCount:
Type: Number
Default: 35
Description: How many copies of the service task to run
Resources:
# The task definition. This is a simple metadata description of what
# container to run, and what resource requirements it has.
TaskDefinition:
Type: AWS::ECS::TaskDefinition
Properties:
Family: !Ref ServiceName
Cpu: !Ref ContainerCpu
Memory: !Ref ContainerMemory
NetworkMode: awsvpc
RequiresCompatibilities:
- EC2
ExecutionRoleArn: !Ref ECSTaskExecutionRole
ContainerDefinitions:
- Name: !Ref ServiceName
Cpu: !Ref ContainerCpu
Memory: !Ref ContainerMemory
Image: !Ref ImageUrl
Command: !Split [' ', !Ref 'Command']
LogConfiguration:
LogDriver: 'awslogs'
Options:
mode: non-blocking
max-buffer-size: 25m
awslogs-group: !Ref LogGroup
awslogs-region: !Ref AWS::Region
awslogs-stream-prefix: !Ref ServiceName
# The service. The service is a resource which allows you to run multiple
# copies of a type of task, and gather up their logs and metrics, as well
# as monitor the number of running tasks and replace any that have crashed
Service:
Type: AWS::ECS::Service
Properties:
ServiceName: !Ref ServiceName
Cluster: !Ref ClusterName
PlacementStrategies:
- Field: cpu
Type: binpack
CapacityProviderStrategy:
- Base: 0
CapacityProvider: !Ref CapacityProvider00
Weight: 1
- Base: 0
CapacityProvider: !Ref CapacityProvider01
Weight: 1
- Base: 0
CapacityProvider: !Ref CapacityProvider02
Weight: 1
DeploymentConfiguration:
MaximumPercent: 200
MinimumHealthyPercent: 100
DesiredCount: !Ref DesiredCount
NetworkConfiguration:
AwsvpcConfiguration:
SecurityGroups:
- !Ref ServiceSecurityGroup
Subnets:
- !Select [ 0, !Ref SubnetIds ]
- !Select [ 1, !Ref SubnetIds ]
- !Select [ 2, !Ref SubnetIds ]
TaskDefinition: !Ref TaskDefinition
# Because we are launching tasks in AWS VPC networking mode
# the tasks themselves also have an extra security group that is unique
# to them. This is a unique security group just for this service,
# to control which things it can talk to, and who can talk to it
ServiceSecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: !Sub Access to service ${ServiceName}
VpcId: !Ref VpcId
# This log group stores the stdout logs from this service's containers
LogGroup:
Type: AWS::Logs::LogGroup
Things to note:
- By default this template deploys 35 tasks. Consider increasing or decreasing this value as part of your test workload.
Put it all together
It's time to deploy all this infrastructure. The following serverless application definition ties all the pieces together:
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: Parent stack that deploys an ECS cluster that has a separate capacity
provider per availability zone, as well as an ECS service that uses a
capacity provider strategy to evenly distribute tasks to each AZ
Parameters:
VpcId:
Type: AWS::EC2::VPC::Id
Description: VPC ID where the ECS cluster is launched
SubnetIds:
Type: List<AWS::EC2::Subnet::Id>
Description: List of subnet IDs where the EC2 instances will be launched
Resources:
# This stack contains cluster wide resources that will be shared
# by all services that get launched in the stack
ClusterStack:
Type: AWS::Serverless::Application
Properties:
Location: cluster.yml
Parameters:
VpcId: !Ref VpcId
# Capacity provider for the first availability zone
AzCapacityProviderStack00:
Type: AWS::Serverless::Application
Properties:
Location: single-az-capacity-provider.yml
Parameters:
SubnetId: !Select [0, !Ref SubnetIds]
ClusterName: !GetAtt ClusterStack.Outputs.ClusterName
EC2Role: !GetAtt ClusterStack.Outputs.EC2Role
ContainerHostSecurityGroup: !GetAtt ClusterStack.Outputs.ContainerHostSecurityGroup
CustomAsgDestroyerFunctionArn: !GetAtt ClusterStack.Outputs.CustomAsgDestroyerFunctionArn
CustomAsgDestroyerFunctionRole: !GetAtt ClusterStack.Outputs.CustomAsgDestroyerFunctionRole
# Capacity provider for the second availability zone
AzCapacityProviderStack01:
Type: AWS::Serverless::Application
Properties:
Location: single-az-capacity-provider.yml
Parameters:
SubnetId: !Select [1, !Ref SubnetIds]
ClusterName: !GetAtt ClusterStack.Outputs.ClusterName
EC2Role: !GetAtt ClusterStack.Outputs.EC2Role
ContainerHostSecurityGroup: !GetAtt ClusterStack.Outputs.ContainerHostSecurityGroup
CustomAsgDestroyerFunctionArn: !GetAtt ClusterStack.Outputs.CustomAsgDestroyerFunctionArn
CustomAsgDestroyerFunctionRole: !GetAtt ClusterStack.Outputs.CustomAsgDestroyerFunctionRole
# Capacity provider for the third availability zone
AzCapacityProviderStack02:
Type: AWS::Serverless::Application
Properties:
Location: single-az-capacity-provider.yml
Parameters:
SubnetId: !Select [2, !Ref SubnetIds]
ClusterName: !GetAtt ClusterStack.Outputs.ClusterName
EC2Role: !GetAtt ClusterStack.Outputs.EC2Role
ContainerHostSecurityGroup: !GetAtt ClusterStack.Outputs.ContainerHostSecurityGroup
CustomAsgDestroyerFunctionArn: !GetAtt ClusterStack.Outputs.CustomAsgDestroyerFunctionArn
CustomAsgDestroyerFunctionRole: !GetAtt ClusterStack.Outputs.CustomAsgDestroyerFunctionRole
# Define the strategy for distributing tasks across the capacity providers
CapacityProviderStrategyStack:
Type: AWS::Serverless::Application
Properties:
Location: capacity-provider-associations.yml
Parameters:
ClusterName: !GetAtt ClusterStack.Outputs.ClusterName
CapacityProvider00: !GetAtt AzCapacityProviderStack00.Outputs.CapacityProvider
CapacityProvider01: !GetAtt AzCapacityProviderStack01.Outputs.CapacityProvider
CapacityProvider02: !GetAtt AzCapacityProviderStack02.Outputs.CapacityProvider
# This service will be launched into the cluster by passing
# details from the base stack into the service stack
Service:
Type: AWS::Serverless::Application
# Ensure that the service stack gets torn down before the capacity provider stack
DependsOn:
- CapacityProviderStrategyStack
Properties:
Location: service-capacity-provider.yml
Parameters:
VpcId: !Ref VpcId
SubnetIds: !Join [',', !Ref SubnetIds]
ClusterName: !GetAtt ClusterStack.Outputs.ClusterName
ECSTaskExecutionRole: !GetAtt ClusterStack.Outputs.ECSTaskExecutionRole
CapacityProvider00: !GetAtt AzCapacityProviderStack00.Outputs.CapacityProvider
CapacityProvider01: !GetAtt AzCapacityProviderStack01.Outputs.CapacityProvider
CapacityProvider02: !GetAtt AzCapacityProviderStack02.Outputs.CapacityProvider
Note that this parent.yml
file defines three copies of the single-az-capacity-provider.yml
stack, one copy for each availability zone.
Deploy
You should now have five files:
- cluster.yml - Defines the ECS cluster and supporting infrastructure
- single-az-capacity-provider.yml - Defines an Auto Scaling group and ECS capacity provider for a single availability zone
- capacity-provider-associations.yml - Links the multiple capacity providers to the ECS cluster
- service-capacity-provider.yml - Defines a service distributed across the three capacity providers
- parent.yml - Instantiates the other YAML files, including creating three copies of the zonal capacity provider stack
You can use the following commands to deploy the reference architecture to your account, using the default VPC. If you wish to use a dedicated VPC for this workload, consider downloading and deploying the "Large sized VPC for an Amazon ECS cluster" template as part of this parent.yml stack.
# Get the VPC ID of the default VPC on the AWS account
DEFAULT_VPC_ID=$(aws ec2 describe-vpcs --filters Name=is-default,Values=true --query 'Vpcs[0].VpcId' --output text)
# Grab the list of subnet IDs from the default VPC and glue them together into a comma-separated list
DEFAULT_VPC_SUBNET_IDS=$(aws ec2 describe-subnets --filters Name=vpc-id,Values=$DEFAULT_VPC_ID --query "Subnets[*].[SubnetId]" --output text | paste -sd, -)
# Now deploy the ECS cluster to the default VPC and its subnets
sam deploy \
--template-file parent.yml \
--stack-name capacity-provider-environment \
--resolve-s3 \
--capabilities CAPABILITY_IAM \
--parameter-overrides VpcId=$DEFAULT_VPC_ID SubnetIds=$DEFAULT_VPC_SUBNET_IDS
Test it out
Use the Amazon ECS console to inspect the details of the cluster that was just deployed. You should see something similar to this:
Scale the service up and down and observe that perfect balance is achieved across each availability zone.
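One way to check the balance yourself is to count running tasks per availability zone. The pure-Python summary below is testable anywhere; the boto3 calls that would feed it real data are commented out, since they require AWS credentials and your actual cluster name (pagination is also omitted for brevity):

```python
from collections import Counter

def balance_report(task_azs):
    """Count tasks per availability zone from a list of AZ names."""
    return dict(Counter(task_azs))

# Feeding it real data with boto3 (requires credentials; the cluster name
# comes from the ClusterStack output of this reference architecture):
# import boto3
# ecs = boto3.client("ecs")
# arns = ecs.list_tasks(cluster="<your-cluster-name>")["taskArns"]
# tasks = ecs.describe_tasks(cluster="<your-cluster-name>", tasks=arns)["tasks"]
# print(balance_report([t["availabilityZone"] for t in tasks]))

print(balance_report(["us-east-1a", "us-east-1b", "us-east-1a", "us-east-1c"]))
# -> {'us-east-1a': 2, 'us-east-1b': 1, 'us-east-1c': 1}
```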
WARNING
Notice that each of the three capacity providers scales out to a minimum of two instances. Each capacity provider launches more than one instance so that it can spread its own tasks for high availability. By separating the availability zones into three capacity providers we have created extra redundancy within each zonal capacity provider. For this reason, this architectural approach is best suited to extremely large deployments, where the overhead is negligible because each zonal capacity provider would be fully utilizing more than two EC2 instances anyway.
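The arithmetic behind this warning is straightforward. Assuming, as described above, that each zonal capacity provider keeps at least two instances running, a rough sketch of the wasted capacity at a given scale looks like this:

```python
def min_instance_overhead(needed_instances, azs=3, min_per_provider=2):
    """Instances launched beyond what the workload strictly needs,
    given that every zonal capacity provider keeps a minimum footprint."""
    floor = azs * min_per_provider
    return max(0, floor - needed_instances)

# A tiny workload that only needs 3 instances still runs on 6:
print(min_instance_overhead(needed_instances=3))   # -> 3
# A large workload needing 60 instances pays no extra:
print(min_instance_overhead(needed_instances=60))  # -> 0
```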
Operational Caveats
While this approach functions very well under typical circumstances, there is a caveat to consider for large deployments in the event of an availability zone outage.
If one of the zonal capacity providers is unable to provide capacity, then the tasks distributed to that capacity provider cannot be launched. Instead they wait in the PROVISIONING state until the capacity provider is able to obtain capacity.
Amazon ECS also has a per-cluster limit on the number of tasks that can be in the PROVISIONING state. This limit is 500 tasks. Therefore, if you have a deployment of 1500 tasks distributed across three zonal capacity providers and an entire availability zone of capacity is lost, there will be 500 tasks waiting in the PROVISIONING state. This may block other task launches for a time, until capacity is restored or the provisioning tasks time out and fail to launch.
To recover from such a scenario you could update your service's capacity provider strategy to temporarily remove the failing availability zone from the capacity provider strategy, and force an update to the service.
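That recovery step could be scripted. In the sketch below, the strategy-filtering helper is plain Python; the boto3 update_service call that would apply it is commented out because it needs AWS credentials, and the cluster, service, and capacity provider names shown are hypothetical placeholders rather than values defined by this reference architecture:

```python
def strategy_without(strategy, failed_provider):
    """Return a copy of a capacity provider strategy with one provider removed."""
    return [entry for entry in strategy
            if entry["capacityProvider"] != failed_provider]

# Hypothetical strategy matching the shape the ECS API expects:
strategy = [
    {"capacityProvider": "cp-az-a", "base": 0, "weight": 1},
    {"capacityProvider": "cp-az-b", "base": 0, "weight": 1},
    {"capacityProvider": "cp-az-c", "base": 0, "weight": 1},
]
reduced = strategy_without(strategy, "cp-az-b")

# Applying it (requires credentials and your real cluster/service names):
# import boto3
# boto3.client("ecs").update_service(
#     cluster="<your-cluster-name>",
#     service="example-service",
#     capacityProviderStrategy=reduced,
#     forceNewDeployment=True,
# )
```

Once the availability zone recovers, run the same update with the full three-provider strategy to restore even distribution.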
For smaller deployments, where fewer than 500 tasks would be stuck in the PROVISIONING state, task launch deadlocks are not an issue. You may still wish to respond to a zonal outage by scaling your service up to a larger size in order to distribute more tasks to the remaining zonal capacity providers. ECS will not automatically redistribute the PROVISIONING tasks to other capacity providers, because the capacity provider strategy demands perfect balance.
Tear it Down
You can tear down the infrastructure created by this reference architecture by using the following command:
sam delete --stack-name capacity-provider-environment --no-prompts