Nathan Peck
Senior Developer Advocate at AWS
Sep 19, 2023

Amazon ECS Scalability Best Practices

This presentation covers best practices for planning and executing an auto scaling strategy for containers on AWS under Amazon Elastic Container Service (ECS) orchestration. It covers the following topics:

  • Application scaling mindset - How to think about scaling when deploying containers
  • Vertical scaling - How to choose the right vertical size for your application container
  • Horizontal scaling - How to build automation that increases and decreases the number of containers being run
  • Cluster capacity - How to ensure you have enough underlying compute capacity to run all your containers
  • Summary - The TL;DR if you just want some fast tips

The content below is extracted from the presentation deck, which you can download at the bottom of this article.

Application Scaling Mindset

In order to have great auto scaling with containers you must have an application scaling mindset. This will carry through to how you vertically size your application containers, how you horizontally scale them, and how you provide capacity to run those containers.

A decade ago you might have had a data center with virtual machines inside of it, running on physical hardware. If you wanted to add more capacity you would launch more VMs or add more servers to your racks, and if you wanted to vertically scale you would launch a larger VM or you would slot a more powerful server into the rack. But there was a problem with this type of scaling.

When you scale based on servers or VM instances, the assumption is that the application will adjust to the compute it runs on. For example, in this diagram you can see an instance which has 8 vCPUs and 16 GB of memory, but an application which is only using 2 vCPUs and 6 GB of memory. Often there is a mismatch between the amount of resources on your underlying compute and the amount of resources that the application actually needs.

Applications often don’t have enough load to fully utilize the compute that they are running on. You see this when you have an application that occasionally receives bursts of traffic, but doesn’t always have enough traffic to fully utilize a server or VM. All that white space in the diagram is being wasted; only a small fraction is actually being utilized.

One potential solution to this problem is to combine applications onto compute so you can get better compute utilization. Here you can see two applications: the orange application and the blue application. They have been put together onto the compute and as a total we now have a more reasonable 50% CPU utilization and 75% memory utilization. That’s not bad, but we still have some problems with this setup.

Applications rarely run perfectly steady. One of those applications might have an occasional burst of activity. When that happens the two applications compete with each other for the available compute resources. In this diagram you can see an overlap between how much CPU the blue application wants and how much CPU the orange application wants. This is called “CPU contention”.

Containers were created as a solution to this problem: they let you configure limits per application so that you can put multiple applications onto a server or VM and they won’t compete with each other. Each container has a boundary around the resources that it reserves on that instance. When there is a resource constraint and both of these applications are demanding high levels of resource, they will not interfere with each other. Each container is guaranteed to get the resources that it originally reserved for itself.

Once you start to put a box around the resources that an application requires you realize that the underlying VM is no longer the unit of scaling. Instead the application container is the actual unit of scaling. We just need to provide enough resources to run the container.

This is what it means to have an application first mindset to scaling.

Container orchestration is something that enables you to have that application first mindset. Orchestration allows you to focus on the application by treating the instances and VMs underneath as generic capacity.

Here’s how that works: When you have an orchestrator such as Amazon Elastic Container Service you can specify the resources required to run a copy of your application container. For example: 1 CPU and 2 GB of memory. Then you can specify how many copies of that application container you want to run. Separately you give Amazon ECS a collection of instances to use as capacity.

These EC2 instances don’t have to be all the same size. They don’t have to be the same EC2 generation or have the same amount of CPU and memory. They can be a random collection of devices with varying CPU and memory.

You tell ECS: “Figure it out for me. I would just like to run 4 copies of my application on this collection of instances”.

Amazon ECS looks at all the available capacity from all the servers that you gave it and it finds a solution to place containers across those EC2 instances in order to fulfill your overall request.

In this diagram there are three different types of applications, with three different application sizes, and we’ve told ECS that we want to run them across these 4 instances. Amazon ECS has found a solution to place containers onto these instances in a way that they will fit and they can share the underlying resources without overlapping with each other.

This is the job that ECS does: it lets you think about the application first, and think about the servers or VMs as generic capacity that you provide under the hood.
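To make that concrete, here is a rough sketch of what that request looks like through the ECS API using boto3. The task family, image, and cluster names are placeholders, and the exact container settings will differ for your application.

```python
import boto3

ecs = boto3.client("ecs")

# Define the resources for one copy of the application:
# 1 vCPU (1024 CPU units) and 2 GB of memory per task.
ecs.register_task_definition(
    family="my-web-app",  # hypothetical task family name
    cpu="1024",
    memory="2048",
    containerDefinitions=[
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-web-app:latest",
            "essential": True,
            "portMappings": [{"containerPort": 80}],
        }
    ],
)

# Ask ECS to keep 4 copies running. ECS figures out placement across
# whatever instance capacity is registered to the cluster.
ecs.create_service(
    cluster="my-cluster",  # hypothetical cluster name
    serviceName="my-web-app",
    taskDefinition="my-web-app",
    desiredCount=4,
    launchType="EC2",
)
```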

AWS Fargate is the next step. It provides serverless container capacity that lets you focus even more on your application. You no longer have any EC2 instances to worry about. Taking the same scenario as before, you can actually remove all the EC2 instances from the picture.

Amazon Elastic Container Service uses AWS Fargate to run 4 copies of the container directly. You don’t have to supply any EC2 instances to serve as capacity for the cluster.

The way this works is that AWS Fargate launches each of the containers that you want to run into its own isolated micro VM. Each micro VM is sized perfectly for the needs of a specific container.

In this diagram, rather than having large EC2 instances with multiple applications inside of each EC2 VM, you can see that each of these containers has its own micro VM that is sized perfectly for that container.

With the serverless paradigm, and with the existing capabilities on EC2, we have a new way to think about scaling: application first.

There are two types of application scaling we need to think about.

The first is vertical scaling. Vertical scaling is when you want to increase the size of an application container to give it more resources. And the second is horizontal scaling. That’s when you want to run more copies of the application container in parallel, in order to do more work.

In general, vertical scaling will allow the application to serve a request faster and horizontal scaling will allow the application to serve more requests overall because incoming requests are being distributed across more copies of the application.

Vertical Scaling

Even if you plan to scale primarily using horizontal scaling, it is still important to consider vertical scaling first to make sure that each individual copy of your application container has been sized appropriately. This will lead to better horizontal scaling later on.

For vertical scaling, there are two goals:

  1. Improve the quality of work that is done by a particular container. (Lower latency, faster response time)
  2. Increase the amount of work that a particular container can get done. (More requests per second out of a single container)

When you vertically scale your application, the first step is to identify what resources the application container actually needs in order to function.

There are different dimensions of resources that an application needs. For example: CPU, memory, storage, and network bandwidth. Some machine learning workloads may actually require GPU as well.

Make a list of all of those resources that you think the application needs in order to function.

The second step is identifying how much of each resource is needed, to define a performance envelope within which you would expect the application to function well.

The goal of the performance envelope is to define constraints for the container. If the application exceeds those constraints you will know that it is in a danger zone.

For example, if your application exceeds the CPU dimension of the performance envelope then it is likely that web request latency will increase past your acceptable SLA, and some requests may even begin to timeout or get dropped.

How do you actually figure out the right performance envelope for an application?

The key is to utilize load tests and metrics.

For load tests you can start out with a simple request tool like ApacheBench, or hey. You point these tools at your domain for the application and the tool sends HTTP requests to that public endpoint. The tool measures how your application responds to a certain level of traffic.

In the long run, if you really want to load test your application, you need a little bit more than that. You need a way to actually simulate real user agents using your system.

Think about how a real user uses a system: they sign up, they sign in, they start creating resources in the system. Maybe they edit things, they delete things. They make real API calls that mutate data by calling the database. You can’t truly load test a system using plain HTTP requests to a web server. You need to actually simulate the real API requests that a user would make.

Create a script which simulates a real user actually using the application and doing all the steps that a real user would: from signing up and signing in, to making real API calls. Then you see how many of those user agents you can run concurrently. This creates a much more realistic load test.
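Here is a minimal sketch of that idea in Python, assuming a hypothetical API with signup, login, and item endpoints. A purpose-built tool like Locust or k6 provides ramp-up schedules and reporting on top of this same concept.

```python
# A minimal "real user agent" load test sketch. The BASE_URL and the
# /signup, /login, and /items endpoints are hypothetical placeholders.
import concurrent.futures
import uuid

import requests

BASE_URL = "https://example.com/api"  # placeholder endpoint


def simulate_user(user_index: int) -> None:
    session = requests.Session()
    email = f"loadtest-{uuid.uuid4()}@example.com"

    # Sign up and sign in like a real user would.
    session.post(f"{BASE_URL}/signup", json={"email": email, "password": "test-password"})
    session.post(f"{BASE_URL}/login", json={"email": email, "password": "test-password"})

    # Make real API calls that mutate data, not just GETs against one page.
    created = session.post(f"{BASE_URL}/items", json={"name": f"item-{user_index}"})
    if created.ok:
        item_id = created.json().get("id")
        session.put(f"{BASE_URL}/items/{item_id}", json={"name": "edited"})
        session.delete(f"{BASE_URL}/items/{item_id}")


# Ramp up gradually by increasing the number of concurrent user agents.
for concurrency in (10, 50, 100, 200):
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        pool.map(simulate_user, range(concurrency))
```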

Once you have your load tests running, you need to measure the results of the load test. That’s where metrics come in. ECS comes with default CloudWatch metrics that you can use as a starting point. ECS will gather up statistics on how much CPU and memory the application is consuming. You can also optionally enable a deeper level of metrics with Container Insights. Container Insights gathers metrics like network I/O.

Check the patterns collection for an example of how to query ECS telemetry from Container Insights and how to create a custom dashboard for ECS metrics and events.
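As a starting point, here is a rough sketch of pulling the default ECS CloudWatch metrics for a service over a recent load test window using boto3. The cluster and service names are placeholders.

```python
# Pull the default AWS/ECS CloudWatch metrics for a service so you can
# graph CPU and memory utilization over the load test window.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)  # assume the load test ran in the last hour

for metric_name in ("CPUUtilization", "MemoryUtilization"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ECS",
        MetricName=metric_name,
        Dimensions=[
            {"Name": "ClusterName", "Value": "my-cluster"},   # placeholder
            {"Name": "ServiceName", "Value": "my-web-app"},   # placeholder
        ],
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=["Average", "Maximum"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric_name, point["Timestamp"], point["Average"], point["Maximum"])
```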

The key to using load tests is to graph your gathered metrics over time. As a load test starts out, you want the load test to ramp up gradually. Start with a low level of traffic and then gradually increase traffic to the highest level traffic that you expect the application is probably going to break at. Then check to see how application metrics look over time, as the load ramps up to the max.

When you do that you are going to get some interesting graphs that you can use to figure out how your application is performing and scaling. Let’s look at some examples of these graphs.

In this graph you can see CPU and memory utilization over time as the load test ramps up. The CPU metric is much higher than the memory metric, and it flattens out around 100%.

This means that the application ran out of CPU resource first. The workload is primarily CPU bound. This is quite normal, as most workloads run out of CPU before they run out of memory. As the application runs out of CPU, the quality of the service suffers before it actually runs out of memory.

This tells us one micro optimization we might be able to make: modify the performance envelope to add a bit more CPU and a bit less memory.

The patterns collection has a reference architecture for a custom dashboard that identifies ECS tasks that are wasting CPU or memory reservation. This can help you identify opportunities to right size your container sizes.

This graph might look weird. It looks like the CPU is going over 100%. How is it possible for CPU to go up to 125%?

This is specific to running containers on EC2 instances. You won’t see this in AWS Fargate, but you will see it if you’re running on an EC2 instance, or if you’re running a container via ECS Anywhere on an on-premises server.

By default Docker allows a container to utilize spare CPU capacity on the instance, as long as that capacity isn’t needed for another application container running on the same underlying instance. This “burst” will show up as CPU consumption actually going over 100%.

This is a danger sign because it means that your application may have been functioning pretty well during the load test, but it was actually only functioning well because it was utilizing spare, unreserved capacity.

If you launch another application, or if you try to horizontally scale by launching more copies of your application, there will now be more containers attempting to reserve the CPU that the existing containers are currently bursting into. CPU that is currently being used as burst capacity is going to go away as additional containers launch. The service will be forced to go from 125% utilization down to an upper ceiling of 100% utilization.

This is a dangerous situation because it can actually cause your performance to tank as you try to scale up. You may try to increase the number of containers you are running, but you will not get a linear increase in performance. Instead you’ll have flat or even worse performance, because scaling out is removing burst capacity that was functioning as a crutch for the application.

Be careful when you see a resource metric for your application going over 100%. This is a sign that you may need to adjust the performance envelope of your application, and adjust your scaling policies to react faster.

This is burst being used properly. When an application starts up, it usually needs to do an initial bit of work. An interpreted application might need to read and create bytecode for a lot of code files. Maybe the application needs to download some additional data as part of setup work before it can start really processing requests.

If the application initially bursts to 125%, but then settles down to below 100%, this is good. It is okay for containers to use available burst capacity for efficiency, but containers should not be relying on burst capacity for an extended period of time.

In this graph you can see the CPU and memory utilization flatten out below the maximum amount defined as a performance envelope. CPU is flattening out at roughly 75%. Memory is flattening out below 50%. However the response time for the application is actually skyrocketing.

There’s some type of bottleneck in the application and it’s not the CPU and it’s not the memory. It could be network IO, disk IO, or maybe it’s another downstream service that the application is depending on, such as a database or another downstream API. Response time is skyrocketing because of that other underlying resource.

Most often this happens with a database. The database is overloaded because it’s handling too many queries. This is not an issue that you can actually fix by scaling your application tier. You can’t vertically or horizontally scale out of this scenario.

Fixing this problem might require fixing application code by optimizing database queries. Or it could require that you increase the size of your backend database to a larger tier that can handle more concurrent queries for your application.

This graph is very problematic. As the load test ramps up, you see CPU flattens out at roughly 75%, but memory keeps on rising. It rises in an almost straight line.

This is a memory leak. Every time the application is serving a request, it’s keeping some data in memory and failing to get rid of that data after it serves the request. The data is accumulating in memory. Memory will keep being consumed until the application reaches a point where it has consumed all of the available memory. The application will then crash.

A memory leak cannot be fixed by scaling. You can’t vertically or horizontally scale yourself out of a memory leak. The only way to fix this is to fix the application code. You cannot have scalability with a memory leak.

The end goal of all this load testing and metric analysis is to define an expected performance envelope that fits your application needs. Ideally it should also provide a little bit of extra space for occasional bursts of activity.

There is no such thing as a standard performance envelope for an application. A common question is: “How big should my AWS Fargate task be? Is 1024 CPU enough? Is two gigabytes of memory enough?”

Each application that you run has its own needs and its own performance envelope. Some may require more CPU, some require more memory. Some may require more network bandwidth. You need to find the unique performance envelope for each application. Don’t try to stuff every application container into the same performance envelope, or you’ll be wasting compute resources for some applications, while failing to provide enough resources for other applications.

Performance envelopes can change over time. As you add features or modify how features work to optimize things, the performance envelope will change.

In the diagram above, version 1.0 of the application has high network bandwidth needs, but in version 1.01 the network bandwidth needs were reduced by a new feature that gzips the responses. Compressing responses as they go out of the server reduced the amount of network bandwidth that was needed, but it increased the amount of CPU needed.

This is an example of how performance envelopes change over time.

As you migrate to new EC2 instance generations you may also need to adjust your performance envelope. If you compare the C4 instance class to the C5 instance class, the C4 has a 2.9 gigahertz processor while the C5 has a 3.6 gigahertz sustained processor speed.

If you migrate your application from running on a C4 instance to running on a C5 instance, it may not actually require one whole CPU anymore because the new CPU is actually much faster. Maybe the application only needs 75% of the CPU.

The key thing to remember is that load testing and metric monitoring needs to be ongoing. You can’t do it as a one-time thing and then say: “this will be the static performance envelope forever.” You have to consistently be load testing and tuning the vertical scaling of your containers.

There are limits to how far you can vertically scale.

For ECS on EC2, vertical scaling is fairly flexible because you can create any arbitrary task size that will fit onto the underlying EC2 instance. When it comes to vertically scaling storage, you can adjust the EBS volume attached to the EC2 instance.

For ECS on AWS Fargate there are limits. The smallest CPU size for a task is 1/4th of a vCPU and the largest size is 16 vCPUs. You can go all the way down to 512 MB of memory, or up to 120 GB of memory. AWS Fargate has ephemeral storage that can be scaled up to 200 gigabytes.

Keep vertical scaling limits in mind. But you should start horizontally scaling long before you hit vertical scaling limits.

Horizontal Scaling

Horizontal scaling is when you spread your workload across a larger number of application containers.

It’s usually a good idea to do some vertical scaling experiments first so you know how to size your application’s performance envelope, and to ensure that your application can function properly within the reserved resource dimensions of that envelope. Once you have validated the performance envelope, you can handle dramatically more traffic by launching additional copies of the application container.

Horizontal scaling is based on aggregate resource consumption metrics for the service. For example you can look at average CPU resource consumption metric across all copies of your container. When the aggregate average utilization breaches a high threshold you scale out by adding more copies of the container. If it breaches a low threshold you reduce the number of copies of the container.

ECS integrates with Application Auto Scaling to automatically scale your service. ECS captures aggregate metrics for your containers. The aggregate metrics trigger an application auto scaling policy, and then ECS responds to that policy by adding or removing tasks according to what the policy asked for.
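As a sketch of the setup, here is roughly what registering an ECS service with Application Auto Scaling looks like in boto3. The cluster and service names are placeholders, and the scaling policies discussed below attach to this scalable target.

```python
# Register an ECS service's DesiredCount as a scalable target with
# Application Auto Scaling. Policies then attach to this target.
import boto3

autoscaling = boto3.client("application-autoscaling")

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-web-app",  # service/<cluster>/<service>
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,    # always keep a couple of tasks for headroom
    MaxCapacity=20,   # upper bound so scaling can't run away
)
```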

One pitfall that is very important to avoid is unevenly distributed workload across tasks. You must avoid having hot tasks and cold tasks. This often happens with WebSocket requests, long running upload or download streams, or other types of applications that serve a lot of workload over one connection.

If you aren’t careful you can have one client or one connection that is doing a lot of volume over the connection compared to other connections. This causes hot tasks that are hitting the limits of their performance envelope while other tasks are underutilized.

The danger with this is that you can end up with aggregate resource consumption metrics that look good. The hot tasks and cold tasks average out to create a metric that looks well within bounds. It will appear as if there is no need to scale out, when the reality is that some of your tasks are extremely hot and they’re probably causing a negative experience for your customer.

For HTTP workloads specifically, make sure you have a load balancing ingress that will evenly distribute the traffic across the containers that are available. You can use the Application Load Balancer web service pattern to evenly distribute HTTP requests across your containers.

When your workload is evenly distributed, you’ll see that all of the containers stay roughly in sync in their resource utilization, rather than having some that are hot and some that are cold.

Horizontal scaling should be based on the compute resource that the application runs out of first when you are doing a load test. With most runtimes and web workloads this is almost always going to be CPU utilization.

One tempting way to horizontally scale your service is to do a calculation based on concurrent requests, or number of requests per period. The simple approach is to take the total number of incoming requests arriving per period at the load balancer, divide by the total number of requests per period that you think a single container / task can handle, and then that gives you how many tasks you should run to serve that level of traffic.

However this approach is not a good idea in the long run. It has two core problems:

  1. It assumes that all requests are equal, and that’s not true. Especially if you have a monolithic application, you will have some requests that are orders of magnitude heavier in terms of resource consumption than others. 100 concurrent requests for your heaviest endpoint that consumes the most resources is a completely different scenario than 100 concurrent requests for your most lightweight endpoint. For example, this will hit you hard if you have a burst of signups and authentication, as password hashing tends to be 100x heavier than other endpoints.
  2. This scaling approach assumes that app performance per request doesn’t change over time, and that’s also not true. There are situations where as the number of requests increases, the performance per request actually decreases. For example a database becomes more burdened over time as more data is persisted into it. Or programmers may optimize a code path or introduce a new feature that demands more resources per request. You can quickly end up with a situation where your calculation for number of requests that a single task can handle isn’t based on reality, and you are scaling your service to a number of tasks that doesn’t work anymore.

This type of scaling isn’t recommended unless your application is extraordinarily stable, and you have a microservice environment where every request is super uniform in terms of resource consumption.

Another pitfall is scaling based on response time. This is not a good metric to scale on because response time isn’t always directly related to the resources available to the application. Therefore scaling out in response to request latency won’t necessarily fix the response time.

For example, if you have an inefficient code path or a really large request that takes a long time that can skew your overall response time and make it look like you need to scale out when you don’t actually need to.

Conversely you may have performance issues overall on your most important endpoint, but because the application is also serving a bunch of really light requests that respond extremely quickly, these light requests bring down the average response time. It will look as if performance is good when actually there are performance issues on the most valuable endpoint you have.

Finally you may have a downstream service or database that’s overloaded, and that is what is actually causing the really long response times. If you were to respond to that by scaling out the application tier it won’t help the situation. In fact, it may actually make it much worse because launching more copies of the application causes more connections to an already overloaded database server.

You should always scale based on the metric for the real compute resource that your application runs out of first. When this is working properly, you’ll see a sawtooth pattern that goes up and down.

When your aggregate application resource consumption reaches a high point, then an additional new task is launched. This causes the aggregate resource utilization to drop as load is distributed across an additional task.

If aggregate resource consumption reaches a low point, a task is stopped. This raises the aggregate consumption a bit as the load is redistributed to the remaining tasks.

Always make sure to give your service a little bit of resource headroom in case there was a burst of activity and your scaling can’t respond fast enough.

The easiest way to scale is to setup an AWS Application Auto Scaling target tracking policy. You only have to define a single target number in your policy, and then the service will scale up and down automatically to try to keep the resource consumption as close as possible to that target number.

The downside of this policy is that it can be a little bit slow to respond sometimes, and it also expects that your resource utilization metric will respond proportionally when you add or remove tasks.

If you want to try this out, check out the target tracking scaling policy pattern.
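For reference, a target tracking policy along these lines might look roughly like this in boto3, assuming the service has already been registered as a scalable target. The 70% target is just an example value.

```python
# Target tracking: try to hold the service's average CPU near 70%.
import boto3

autoscaling = boto3.client("application-autoscaling")

autoscaling.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-web-app",  # placeholder
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # keep aggregate CPU near 70% for headroom
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 120,
    },
)
```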

Another option is step scaling. With step scaling you define your own custom metric boundaries. If a metric breaches your specified boundary you can choose to add either a fixed number of tasks, or a percentage of tasks.

This approach is great for giving you the most control over how your infrastructure responds to scaling. Try out the step scaling policy pattern for an example of how to set this up.
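A step scaling policy might look roughly like the sketch below. Note that step scaling also needs a CloudWatch alarm that you create separately and point at the returned policy ARN; the step intervals are relative to that alarm’s threshold, and all names here are placeholders.

```python
# Step scaling: add tasks in larger steps the further the metric climbs
# past the threshold of a separately created CloudWatch alarm.
import boto3

autoscaling = boto3.client("application-autoscaling")

response = autoscaling.put_scaling_policy(
    PolicyName="cpu-step-scaling",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-web-app",  # placeholder
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "Cooldown": 60,
        "MetricAggregationType": "Average",
        "StepAdjustments": [
            # 0-15% above the alarm threshold: add 1 task
            {"MetricIntervalLowerBound": 0.0, "MetricIntervalUpperBound": 15.0, "ScalingAdjustment": 1},
            # more than 15% above the alarm threshold: add 3 tasks
            {"MetricIntervalLowerBound": 15.0, "ScalingAdjustment": 3},
        ],
    },
)
print(response["PolicyARN"])  # attach this ARN as the CloudWatch alarm action
```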

The last type of scaling is scheduled scaling. You can specify specific times during the day that you would like to scale up or scale down. This is great for predictable traffic patterns like batch traffic overnight, or services that are mostly used during work hours. But scheduled scaling is not good for handling unexpected traffic spikes. And it’s not good for gradual traffic trends, like if you’re still signing up lots of new users and your traffic is gradually increasing week over week.
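If your traffic follows the clock, a scheduled scaling setup might look roughly like this sketch, which raises the minimum task count during work hours and lowers it overnight. The service name, times, and capacities are placeholders.

```python
# Scheduled scaling: raise the floor for work hours, lower it overnight.
import boto3

autoscaling = boto3.client("application-autoscaling")

autoscaling.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="workday-scale-up",
    ResourceId="service/my-cluster/my-web-app",  # placeholder
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="cron(0 8 ? * MON-FRI *)",  # 08:00 UTC on weekdays
    ScalableTargetAction={"MinCapacity": 10, "MaxCapacity": 50},
)

autoscaling.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="overnight-scale-down",
    ResourceId="service/my-cluster/my-web-app",  # placeholder
    ScalableDimension="ecs:service:DesiredCount",
    Schedule="cron(0 20 ? * MON-FRI *)",  # 20:00 UTC on weekdays
    ScalableTargetAction={"MinCapacity": 2, "MaxCapacity": 50},
)
```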

Cluster Capacity

As you launch more containers you are going to need more compute capacity to actually run those containers. Managing underlying cluster capacity is critical to having effective vertical and horizontal scaling.

AWS Fargate makes cluster capacity easy. With AWS Fargate you don’t have to worry about EC2 instances or how much underlying capacity your cluster uses. Your tasks just run automatically in AWS Fargate.

Make sure you check your AWS account vCPU quota, however. AWS enforces limits on the total number of vCPUs worth of tasks you can launch. In some cases new accounts may have a very low limit, while long term AWS accounts with significant spend can launch thousands of vCPUs worth of AWS Fargate capacity. If your vCPU limit looks too low for your needs, you can open a support ticket in the AWS Service Quotas dashboard to get it raised so you can launch as many tasks as you need.
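If you want to check the quota programmatically, something like this sketch with the Service Quotas API works; it lists the Fargate quotas rather than hard-coding a specific quota code, since quota names and values vary by account.

```python
# List the Fargate service quotas for the current account and region.
import boto3

quotas = boto3.client("service-quotas")

paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="fargate"):
    for quota in page["Quotas"]:
        print(quota["QuotaName"], quota["Value"])
```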

When running on EC2 capacity, it’s a little more complicated. You have to actually launch EC2 instances to host your tasks on. EC2 has its own vCPU limit that you should check, but you should also be aware of ECS limits.

By default ECS allows 5,000 EC2 instances per cluster, 5,000 services per cluster, and 5,000 tasks per service. So yes, this does mean you could technically launch twenty-five million tasks per cluster on ECS on EC2 without opening a support ticket. But chances are you are going to hit other EC2 vCPU limits or budget constraints long before you hit the ECS limits.

If you do need more tasks than that or if you want more than 5,000 instances to run your tasks, keep in mind that these are soft limits and you can open a support ticket to get these limits raised.

ECS Capacity Providers are designed to solve the problem of managing the EC2 capacity for your container workloads as you scale the number of containers up and down.

ECS comes with two capacity providers out of the box: AWS Fargate and AWS Fargate Spot. You can also optionally create EC2 capacity providers that launch EC2 instances automatically whenever you need capacity to run a task.

Capacity providers track the total resource reservation by all the different types of application containers in your cluster. This produces a total reservation metric which is used to adjust the size of an EC2 Autoscaling Group that provides capacity for running containers.

Here is the end to end flow:

You tell ECS: “I would like to run 10 copies of my container.” ECS initiates launching those 10 containers by creating 10 proposed containers in a pending state. If there is available EC2 capacity to host the containers, then they will be placed immediately. If there is not enough EC2 capacity registered to the cluster, this is where the capacity provider kicks in. The ECS capacity provider sees that cluster reservation is exceeding its target capacity. It calculates that the cluster needs an additional two EC2 instances in order to maintain a little bit of headroom inside the cluster. Then it configures the EC2 Auto Scaling Group to launch those two additional EC2 instances.

Your target capacity for a capacity provider can be any number greater than 0% and up to 100%. An easy way to configure the capacity provider is to set target capacity to 100%. This allows the cluster to scale to zero instances when all your containers stop, and it will attempt to avoid wasted capacity. However, this will also cause additional latency for container launches, as ECS will have to launch EC2 instances on the fly when scaling out the number of containers for a service.

So in general you will likely want to keep just a little headroom in the cluster in case you need to scale up quickly, or you want to do a deployment and roll out a new task version. A target capacity percentage of 90% can result in far faster auto scaling and deployments.

Capacity providers also manage cluster scale-in. If you have tasks running on an EC2 instance, the capacity provider will protect that EC2 instance from termination. Once the EC2 instance no longer has containers running on it, the capacity provider releases the termination protection, allowing the EC2 instance to be terminated automatically by the Auto Scaling Group.
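Putting those pieces together, creating an EC2 capacity provider with managed scaling and managed termination protection might look roughly like this sketch. The Auto Scaling group ARN is a placeholder, and the 90% target capacity keeps a bit of headroom as discussed above.

```python
# Create an EC2 capacity provider backed by an existing Auto Scaling group.
import boto3

ecs = boto3.client("ecs")

ecs.create_capacity_provider(
    name="my-ec2-capacity",  # hypothetical capacity provider name
    autoScalingGroupProvider={
        "autoScalingGroupArn": "arn:aws:autoscaling:us-east-1:123456789012:"
                               "autoScalingGroup:example:autoScalingGroupName/my-asg",
        "managedScaling": {
            "status": "ENABLED",
            "targetCapacity": 90,          # keep ~10% headroom for fast scale-out
            "minimumScalingStepSize": 1,
            "maximumScalingStepSize": 10,
        },
        # Keep instances that still host tasks from being terminated on scale-in.
        "managedTerminationProtection": "ENABLED",
    },
)
```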

You can spread your container deployments across multiple capacity providers. For example, in this diagram a capacity provider strategy is configured to run six copies of my container, with three containers running on AWS Fargate on-demand, and three containers running on AWS Fargate Spot.

Capacity provider strategies enable you to specify a base and a weight for task distribution.

The base is the initial baseline number of tasks that you want to place into a capacity provider.

The weight kicks in once the base number of tasks have been placed. It defines the ratio to use when distributing additional tasks across different capacity providers. Here are a couple examples to better explain how that works.

This strategy helps you fully utilize one set of EC2 instances first. This could be useful if you want to prioritize placement onto a specific class of EC2 instances, or a specific reserved set of EC2 instances. The base of 50 specifies to launch the first 50 tasks onto that set of EC2 instances. But the weight is zero, so that means to stop putting new tasks on these EC2 instances once the first 50 have been placed.

The other EC2 auto scaling group has a base of zero and a weight of one. This means to avoid placing any tasks through this capacity provider initially, but then after the baseline set of 50 tasks has been placed, launch 100% of additional tasks using the second capacity provider.

This strategy is good if you want to minimize the cost of traffic bursts. It is configured to launch a baseline capacity of 100 tasks using the AWS Fargate on-demand capacity provider. But above 100 tasks, launch one task using AWS Fargate on-demand for every three tasks launched using AWS Fargate spot.

This allows you to increase the quality of your service as it scales up, but at a discounted rate by launching three fourths of the additional capacity using AWS Fargate Spot instead of on-demand. As a result you may be able to achieve lower latency and have more workload burst absorption capability at the same cost as using 100% on-demand capacity.
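As a sketch, the on-demand plus Spot strategy described above could be expressed like this when creating the service. The cluster, service, and subnet values are placeholders.

```python
# The first 100 tasks go to Fargate on-demand (base); beyond that, tasks
# are split 1:3 between on-demand and Spot (weights).
import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="my-cluster",        # placeholder
    serviceName="my-web-app",    # placeholder
    taskDefinition="my-web-app",
    desiredCount=200,
    capacityProviderStrategy=[
        {"capacityProvider": "FARGATE", "base": 100, "weight": 1},
        {"capacityProvider": "FARGATE_SPOT", "base": 0, "weight": 3},
    ],
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-00000000"],  # placeholder subnet
            "assignPublicIp": "ENABLED",
        }
    },
)
```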

If you’d like to experiment with capacity providers check out the ECS on EC2 capacity providers CloudFormation pattern.

Summary

Here is the quick TL;DR recap of tips for how to scale your container workloads using Amazon ECS:

  1. Scaling container deployments has to start with an application first mindset. We aren’t thinking in VMs, we are thinking in application containers. VMs are just generic capacity that comes in later to provide computing horsepower for the containers.
  2. Use load tests to define the right performance envelope for your service’s application container.
  3. Performance envelope tuning is not a one-time thing. Plan to keep the performance envelope updated as you add new features, optimize existing code paths, and as you upgrade the compute infrastructure that you have underneath your containers.
  4. Base your horizontal scaling on the aggregate resource metric that your application runs out of first when you’re load testing. Your goal is to create a “sawtooth pattern” that keeps that resource utilization metric within reasonable bounds.
  5. When it comes to the underlying VM capacity to actually horizontally scale onto, default to thinking about using AWS Fargate first if you want easy, worry free capacity.
  6. If you do want to manage EC2 capacity, then EC2 capacity providers are there to save the day and help you launch the correct number of EC2 instances as capacity to run your containers across.

Presentation Download

Did you enjoy this article? You can grab the presentation deck for yourself if you’d like to use the diagrams, or share the presentation with someone else.

Download Presentation