Service Discovery for AWS Fargate tasks with AWS Cloud Map
About
Service discovery is a technique for getting traffic from one container to another using a direct peer to peer connection, instead of routing traffic through an intermediary like a load balancer. Service discovery is suitable for a variety of use cases:
- Privately networked, internal services that will not be used from the public internet.
- Low latency communication between services.
- Long lived bidirectional connections, such as gRPC.
- Low traffic, low cost deployments where you do not wish to pay the hourly fee for a persistent load balancer.
Service discovery for AWS Fargate tasks is powered by AWS Cloud Map. Amazon Elastic Container Service integrates with AWS Cloud Map to configure and sync a list of all your containers. You can then use Cloud Map DNS or API calls to look up the IP address of another task and open a direct connection to it.
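As a minimal sketch of the DNS based lookup, assuming a Cloud Map service named name registered in a private DNS namespace called internal (the names used later in this reference), a Node.js process could list the task IP addresses roughly like this. The listTaskIps helper is hypothetical, and the resolver is passed in as an argument so that it can be replaced with a stub in tests:

```javascript
// Hypothetical helper: returns the IPv4 addresses that a service
// discovery DNS name currently resolves to.
// `resolver` is any object with a resolve4(hostname) method, for
// example an instance of Resolver from 'node:dns/promises'.
async function listTaskIps(resolver, hostname) {
  const addresses = await resolver.resolve4(hostname);
  if (addresses.length === 0) {
    throw new Error(`No tasks registered for ${hostname}`);
  }
  return addresses;
}

// Example usage from inside the VPC (assumed names):
//   import { Resolver } from 'node:dns/promises';
//   const ips = await listTaskIps(new Resolver(), 'name.internal');
```

The lookup only works from inside the VPC that the private namespace is attached to.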
Architecture
In this reference you will deploy the following architecture:
Two services will be deployed as AWS Fargate tasks:
- A front facing hello service
- A backend name service
Inbound traffic from the public internet will arrive at the hello service via an Application Load Balancer.
The hello service needs to fetch a name from the name service. In order to locate instances of the name service task, it will use DNS based service discovery to get a list of tasks to send traffic to. The hello service will do client side load balancing to distribute its requests across the available instances of the name service's task.
Network traffic between the hello service and the name service is direct peer to peer traffic.
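Client side load balancing can be as simple as rotating through the list of addresses returned by service discovery. The following is a rough sketch of a round robin picker; createRoundRobin is a hypothetical name and the IP addresses are made up. The sample application shown later in this reference picks a random address instead, so treat this as an alternative strategy rather than the reference implementation:

```javascript
// Hypothetical round robin picker over a list of task IP addresses.
function createRoundRobin(addresses) {
  if (addresses.length === 0) {
    throw new Error('address list is empty');
  }
  let next = 0;
  return function pick() {
    const address = addresses[next];
    // Wrap around to the start once the end of the list is reached
    next = (next + 1) % addresses.length;
    return address;
  };
}

// Example with made-up addresses
const pick = createRoundRobin(['10.0.0.1', '10.0.0.2', '10.0.0.3']);
console.log(pick(), pick(), pick(), pick());
// prints "10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.1"
```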
Dependencies
This pattern requires that you have an AWS account and that you use the AWS Serverless Application Model (SAM) CLI. If it is not already installed, please install the SAM CLI for your system.
Define the networking
For this architecture we are going to use private networking for the backend services, so grab the vpc.yml file from "Large VPC for Amazon ECS Cluster". Do not deploy this CloudFormation template yet. We will deploy it later on.
Define the cluster
The following template defines an ECS cluster and a Cloud Map namespace that will be used to store information about the tasks in the cluster:
AWSTemplateFormatVersion: '2010-09-09'
Description: Empty ECS cluster that has no EC2 instances. It is designed
             to be used with AWS Fargate serverless capacity
Parameters:
  VpcId:
    Type: String
    Description: The VPC that the service is running inside of
Resources:
  # Cluster that keeps track of container deployments
  ECSCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterSettings:
        - Name: containerInsights
          Value: enabled
  # This is a role which is used within Fargate to allow the Fargate agent
  # to download images, and upload logs.
  ECSTaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: [ecs-tasks.amazonaws.com]
            Action: ['sts:AssumeRole']
            Condition:
              ArnLike:
                aws:SourceArn: !Sub arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:*
              StringEquals:
                aws:SourceAccount: !Ref AWS::AccountId
      Path: /
      # This role enables basic features of ECS. See reference:
      # https://docs.aws.amazon.com/AmazonECS/latest/developerguide/security-iam-awsmanpol.html#security-iam-awsmanpol-AmazonECSTaskExecutionRolePolicy
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
  # This namespace will keep track of all the tasks in the cluster
  ServiceDiscoveryNamespace:
    Type: AWS::ServiceDiscovery::PrivateDnsNamespace
    Properties:
      Name: internal
      Description: Internal, private service discovery namespace
      Vpc: !Ref VpcId
Outputs:
  ClusterName:
    Description: The ECS cluster into which to launch resources
    Value: !Ref ECSCluster
  ECSTaskExecutionRole:
    Description: The role used to start up a task
    Value: !Ref ECSTaskExecutionRole
  ServiceDiscoveryNamespaceId:
    Description: The shared service discovery namespace for all services in the cluster
    Value: !Ref ServiceDiscoveryNamespace
Some things to note in this template:
- An AWS::ServiceDiscovery::PrivateDnsNamespace ensures that Cloud Map can be accessed from inside of the VPC, using a TLD (Top Level Domain) of internal. This will allow us to look up other services in the VPC using a DNS address like http://name.internal
Define the name service
Because the hello service depends on the name service, it makes sense to define the name service first. This service deploys a public sample image located at public.ecr.aws/ecs-sample-image/name-server.
AWSTemplateFormatVersion: '2010-09-09'
Description: An example service that has service discovery service attached
             so that it can be easily located by other services in the ECS cluster.
Parameters:
  VpcId:
    Type: String
    Description: The VPC that the service is running inside of
  PrivateSubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of private subnet IDs that the AWS VPC tasks are in
  ClusterName:
    Type: String
    Description: The name of the ECS cluster into which to launch capacity.
  ServiceDiscoveryNamespaceId:
    Type: String
    Description: The ID of a CloudMap namespace into which the service will be registered
  ECSTaskExecutionRole:
    Type: String
    Description: The role used to start up an ECS task
  ServiceName:
    Type: String
    Default: name
    Description: A name for the service
  ImageUrl:
    Type: String
    Default: public.ecr.aws/ecs-sample-image/name-server
    Description: The url of a sample container image for this workload
  ContainerCpu:
    Type: Number
    Default: 256
    Description: How much CPU to give the container. 1024 is 1 CPU
  ContainerMemory:
    Type: Number
    Default: 512
    Description: How much memory in megabytes to give the container
  ContainerPort:
    Type: Number
    # The name service listens on port 3000, which is also the port that
    # the hello service calls and that its ingress rule opens
    Default: 3000
    Description: What port that the application expects traffic on
  DesiredCount:
    Type: Number
    Default: 2
    Description: How many copies of the service task to run
Resources:
  # The task definition. This is a simple metadata description of what
  # container to run, and what resource requirements it has.
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Ref ServiceName
      Cpu: !Ref ContainerCpu
      Memory: !Ref ContainerMemory
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      ExecutionRoleArn: !Ref ECSTaskExecutionRole
      ContainerDefinitions:
        - Name: !Ref ServiceName
          Cpu: !Ref ContainerCpu
          Memory: !Ref ContainerMemory
          Image: !Ref ImageUrl
          PortMappings:
            - ContainerPort: !Ref ContainerPort
              HostPort: !Ref ContainerPort
          LogConfiguration:
            LogDriver: 'awslogs'
            Options:
              mode: non-blocking
              max-buffer-size: 25m
              awslogs-group: !Ref LogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Ref ServiceName
  # The service. The service is a resource which allows you to run multiple
  # copies of a type of task, and gather up their logs and metrics, as well
  # as monitor the number of running tasks and replace any that have crashed
  NameService:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref ClusterName
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: DISABLED
          SecurityGroups:
            - !Ref ServiceSecurityGroup
          Subnets: !Ref PrivateSubnetIds
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 75
      DesiredCount: !Ref DesiredCount
      TaskDefinition: !Ref TaskDefinition
      ServiceRegistries:
        - RegistryArn: !GetAtt ServiceDiscoveryService.Arn
          ContainerName: !Ref ServiceName
  # Security group that limits network access
  # to tasks from this service
  ServiceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for service
      VpcId: !Ref VpcId
  # Keeps track of the list of tasks for the service
  ServiceDiscoveryService:
    Type: AWS::ServiceDiscovery::Service
    Properties:
      Name: !Ref ServiceName
      DnsConfig:
        NamespaceId: !Ref ServiceDiscoveryNamespaceId
        DnsRecords:
          - TTL: 0
            Type: A
  # This log group stores the stdout logs from this service's containers
  LogGroup:
    Type: AWS::Logs::LogGroup
Outputs:
  ServiceSecurityGroup:
    Description: The security group of the name service
    Value: !Ref ServiceSecurityGroup
Note the following things from the template above:
- An AWS::ServiceDiscovery::Service defines the DNS record type (A) and the TTL (Time to Live).
- The AWS::ECS::Service is configured to attach to the AWS::ServiceDiscovery::Service so that it can keep it in sync with a list of the tasks.
- The AWS::ECS::TaskDefinition is configured in awsvpc networking mode. This gives each task its own unique IP address, which can be plugged into the service discovery DNS.
- The name service listens for traffic on port 3000.
Define the hello service
Now we need to define the hello service. It will be based on the public sample image public.ecr.aws/ecs-sample-image/hello-server:node:
AWSTemplateFormatVersion: '2010-09-09'
Description: An example service that has service discovery service attached
             so that it can be easily located by other services in the ECS cluster.
Parameters:
  VpcId:
    Type: String
    Description: The VPC that the service is running inside of
  PublicSubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of public subnet IDs to put the load balancer and tasks in
  PrivateSubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of private subnet IDs that the AWS VPC tasks are in
  ClusterName:
    Type: String
    Description: The name of the ECS cluster into which to launch capacity.
  NameServiceSecurityGroup:
    Type: String
    Description: The security group of the downstream name service
  ECSTaskExecutionRole:
    Type: String
    Description: The role used to start up an ECS task
  ServiceName:
    Type: String
    Default: hello
    Description: A name for the service
  ImageUrl:
    Type: String
    Default: public.ecr.aws/ecs-sample-image/hello-server:node
    Description: The url of a sample container image for this workload
  ContainerCpu:
    Type: Number
    Default: 256
    Description: How much CPU to give the container. 1024 is 1 CPU
  ContainerMemory:
    Type: Number
    Default: 512
    Description: How much memory in megabytes to give the container
  ContainerPort:
    Type: Number
    Default: 3000
    Description: What port that the application expects traffic on
  DesiredCount:
    Type: Number
    Default: 2
    Description: How many copies of the service task to run
Resources:
  # The task definition. This is a simple metadata description of what
  # container to run, and what resource requirements it has.
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Ref ServiceName
      Cpu: !Ref ContainerCpu
      Memory: !Ref ContainerMemory
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      ExecutionRoleArn: !Ref ECSTaskExecutionRole
      ContainerDefinitions:
        - Name: !Ref ServiceName
          Cpu: !Ref ContainerCpu
          Memory: !Ref ContainerMemory
          Image: !Ref ImageUrl
          Environment:
            - Name: NAME_SERVER
              Value: http://name.internal:3000/
          PortMappings:
            - ContainerPort: !Ref ContainerPort
              HostPort: !Ref ContainerPort
          LogConfiguration:
            LogDriver: 'awslogs'
            Options:
              mode: non-blocking
              max-buffer-size: 25m
              awslogs-group: !Ref LogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Ref ServiceName
  # The service. The service is a resource which allows you to run multiple
  # copies of a type of task, and gather up their logs and metrics, as well
  # as monitor the number of running tasks and replace any that have crashed
  HelloService:
    Type: AWS::ECS::Service
    # Avoid race condition between ECS service creation and associating
    # the target group with the LB
    DependsOn: PublicLoadBalancerListener
    Properties:
      Cluster: !Ref ClusterName
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: DISABLED
          SecurityGroups:
            - !Ref ServiceSecurityGroup
          Subnets: !Ref PrivateSubnetIds
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 75
      DesiredCount: !Ref DesiredCount
      TaskDefinition: !Ref TaskDefinition
      LoadBalancers:
        - ContainerName: !Ref ServiceName
          ContainerPort: !Ref ContainerPort
          TargetGroupArn: !Ref ServiceTargetGroup
  # Security group that limits network access
  # to tasks from this service
  ServiceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for hello service
      VpcId: !Ref VpcId
  # Configure the security group of the name service to accept
  # incoming traffic from the security group of this service
  NameServiceIngressFromHello:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      Description: Allow hello service to make calls to name service
      GroupId: !Ref NameServiceSecurityGroup
      FromPort: 3000
      ToPort: 3000
      # Use tcp so that the port range applies. A protocol of -1 would
      # allow traffic on all ports regardless of FromPort and ToPort.
      IpProtocol: tcp
      SourceSecurityGroupId: !Ref ServiceSecurityGroup
  # Keeps track of the list of tasks for the service
  ServiceTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      HealthCheckIntervalSeconds: 6
      HealthCheckPath: /
      HealthCheckProtocol: HTTP
      HealthCheckTimeoutSeconds: 5
      HealthyThresholdCount: 2
      TargetType: ip
      Port: !Ref ContainerPort
      Protocol: HTTP
      UnhealthyThresholdCount: 10
      VpcId: !Ref VpcId
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds
          Value: 0
  # A public facing load balancer, this is used as ingress for
  # public facing internet traffic.
  PublicLoadBalancerSG:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Access to the public facing load balancer
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        # Allow access to public facing ALB from any IP address
        - CidrIp: 0.0.0.0/0
          IpProtocol: -1
  PublicLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Scheme: internet-facing
      LoadBalancerAttributes:
        - Key: idle_timeout.timeout_seconds
          Value: '30'
      Subnets: !Ref PublicSubnetIds
      SecurityGroups:
        - !Ref PublicLoadBalancerSG
  PublicLoadBalancerListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      DefaultActions:
        - Type: 'forward'
          ForwardConfig:
            TargetGroups:
              - TargetGroupArn: !Ref ServiceTargetGroup
                Weight: 100
      LoadBalancerArn: !Ref 'PublicLoadBalancer'
      Port: 80
      Protocol: HTTP
  # Open up the service's security group to traffic originating
  # from the security group of the load balancer.
  ServiceIngressfromLoadBalancer:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      Description: Ingress from the public ALB
      GroupId: !Ref ServiceSecurityGroup
      IpProtocol: -1
      SourceSecurityGroupId: !Ref 'PublicLoadBalancerSG'
  # This log group stores the stdout logs from this service's containers
  LogGroup:
    Type: AWS::Logs::LogGroup
Some things to note in this template:
- The AWS::ECS::TaskDefinition is configured to have an environment variable NAME_SERVER with the value http://name.internal:3000/. This service discovery endpoint is based on the TLD of the Cloud Map namespace (internal), the name of the service (name), and the port that the service binds to (3000).
- The hello service must create an AWS::EC2::SecurityGroupIngress on the security group of the name service, allowing inbound traffic from the security group of the hello service. Without this ingress rule any direct, inbound, peer to peer connections would be denied by the name security group.
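To make the structure of the endpoint concrete, here is a small sketch that assembles it from the three parts described above and then parses it back. The discoveryUrl helper is hypothetical and exists only for illustration; the values are the ones used in this reference:

```javascript
// Hypothetical helper that builds a service discovery URL from the
// Cloud Map service name, the namespace TLD, and the container port.
function discoveryUrl(serviceName, namespace, port) {
  return `http://${serviceName}.${namespace}:${port}/`;
}

// The built-in URL class splits the endpoint back into its parts
const endpoint = new URL(discoveryUrl('name', 'internal', 3000));
console.log(endpoint.hostname); // prints "name.internal"
console.log(endpoint.port);     // prints "3000"
```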
Look at the code
Properly using DNS based service discovery requires some client side implementation. Let's look at the source for the hello service.
import os from 'node:os';
import express from 'express';
import fetch from 'node-fetch';
import retry from 'async-retry';
import { Resolver } from 'node:dns/promises';

const resolver = new Resolver();
const app = express();

const HOSTNAME = os.hostname();
const PORT = process.env.PORT || 3000;
const NAME_SERVER = process.env.NAME_SERVER;

// Validate the environment before attempting to parse the URL
if (!NAME_SERVER) {
  throw new Error('Expected environment variable NAME_SERVER');
}

const NAME_URL = new URL(NAME_SERVER);

// Logic for looking up the DNS based service discovery record
// and selecting a random record from it.
let dnsRecords = [];
let lastResolveTime = 0;
const TTL = 5000; // How long to cache DNS results, in milliseconds

async function resolveNameService() {
  const now = Date.now();
  if (lastResolveTime < now - TTL) {
    dnsRecords = await resolver.resolve(NAME_URL.hostname);
    // Remember when the cache was refreshed, so that the cached
    // records are reused until the TTL expires
    lastResolveTime = now;
  }
  return dnsRecords[Math.floor(Math.random() * dnsRecords.length)];
}

app.get('/', async function (req, res) {
  // Just in case a downstream task crashes, we wrap this in a retry
  // that will retry against a different task if needed.
  const randomName = await retry(
    async function () {
      const randomIp = await resolveNameService();
      const randomNameResponse = await fetch(`http://${randomIp}:${NAME_URL.port}`);
      return await randomNameResponse.text();
    },
    {
      retries: 5
    }
  );

  res.send(`Hello (from ${HOSTNAME}) ${randomName}`);
});

app.listen(PORT);
console.log(`Listening on http://localhost:${PORT}`);
Things to note:
- Each time a request is made to the downstream name service, the service discovery DNS name must be resolved. Doing a full DNS lookup each time would be expensive for the underlying system and would hurt performance, so the process caches the DNS lookup results for a brief time.
- The service discovery DNS record returns a list of IP addresses. If the DNS name were plugged directly into a fetch() call, the runtime would naively send all requests to the first IP address in the list. In order to evenly distribute traffic across all the downstream targets, the code must implement client side load balancing.
- The entire network request is wrapped in a retry. This is because there is no guarantee that downstream tasks are actually still there. Because of DNS propagation delay, the record may still list a downstream name task that has crashed or been stopped by a scale-in. If the task is no longer there when the hello service tries to reach it, there will be a networking failure. The DNS record is eventually consistent with reality, so in the meantime it is important for the hello service to detect networking issues and retry against a different backend name task if necessary. Note that this simple demo application does not actually remove the failed backend IP address from its locally cached list. A potential improvement would be for the process to temporarily avoid attempting to send any more traffic to an IP address that has had a recent networking failure.
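The improvement suggested in the last note could look roughly like the following sketch. It is not part of the sample application: the AddressPool class, its method names, and the 10 second quarantine period are all assumptions made for illustration. The idea is to remember when each address last failed and to skip it until the quarantine period has passed:

```javascript
// Hypothetical pool that avoids addresses with a recent networking failure.
class AddressPool {
  constructor(quarantineMs = 10000, now = () => Date.now()) {
    this.quarantineMs = quarantineMs; // assumed value, tune as needed
    this.now = now;                   // injectable clock, useful in tests
    this.addresses = [];
    this.failedAt = new Map();        // address -> time of last failure
  }

  // Replace the address list, for example after each DNS refresh.
  update(addresses) {
    this.addresses = addresses;
    // Forget failures for addresses that are no longer in DNS
    for (const address of this.failedAt.keys()) {
      if (!addresses.includes(address)) this.failedAt.delete(address);
    }
  }

  // Record a networking failure for an address.
  markFailed(address) {
    this.failedAt.set(address, this.now());
  }

  // Pick a random address, preferring ones without a recent failure.
  pick() {
    const cutoff = this.now() - this.quarantineMs;
    const healthy = this.addresses.filter((address) => {
      const failed = this.failedAt.get(address);
      return failed === undefined || failed < cutoff;
    });
    // If every address is quarantined, fall back to the full list
    const candidates = healthy.length > 0 ? healthy : this.addresses;
    if (candidates.length === 0) throw new Error('no addresses available');
    return candidates[Math.floor(Math.random() * candidates.length)];
  }
}
```

In the request handler, the retried function would call pool.pick() instead of choosing from the raw DNS result, and call pool.markFailed(ip) when the request to that address throws, before rethrowing so that the retry moves on to another address.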
Deploy it all
You should have the following four files:
- vpc.yml - Template for the base VPC that you wish to host resources in
- cluster.yml - Template for the ECS cluster and its Cloud Map namespace
- hello.yml - Template for the hello service that will be deployed on the cluster
- name.yml - Template for the name service that will be deployed on the cluster
Save the following parent stack as parent.yml. It deploys all four templates as nested stacks:
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: Parent stack that deploys an AWS Fargate service discovery example.
Resources:
  # The networking configuration. This creates an isolated
  # network specific to this particular environment
  VpcStack:
    Type: AWS::Serverless::Application
    Properties:
      Location: vpc.yml
  # This stack contains the Amazon ECS cluster itself
  ClusterStack:
    Type: AWS::Serverless::Application
    Properties:
      Location: cluster.yml
      Parameters:
        VpcId: !GetAtt VpcStack.Outputs.VpcId
  # Deploy the name server as a service
  NameService:
    Type: AWS::Serverless::Application
    Properties:
      Location: name.yml
      Parameters:
        VpcId: !GetAtt VpcStack.Outputs.VpcId
        PrivateSubnetIds: !GetAtt VpcStack.Outputs.PrivateSubnetIds
        ClusterName: !GetAtt ClusterStack.Outputs.ClusterName
        ECSTaskExecutionRole: !GetAtt ClusterStack.Outputs.ECSTaskExecutionRole
        ServiceDiscoveryNamespaceId: !GetAtt ClusterStack.Outputs.ServiceDiscoveryNamespaceId
  # Deploy the hello server as a service
  HelloService:
    Type: AWS::Serverless::Application
    Properties:
      Location: hello.yml
      Parameters:
        VpcId: !GetAtt VpcStack.Outputs.VpcId
        PublicSubnetIds: !GetAtt VpcStack.Outputs.PublicSubnetIds
        PrivateSubnetIds: !GetAtt VpcStack.Outputs.PrivateSubnetIds
        ClusterName: !GetAtt ClusterStack.Outputs.ClusterName
        ECSTaskExecutionRole: !GetAtt ClusterStack.Outputs.ECSTaskExecutionRole
        NameServiceSecurityGroup: !GetAtt NameService.Outputs.ServiceSecurityGroup
Use the following command to deploy the parent stack and all of its nested stacks:
sam deploy \
  --template-file parent.yml \
  --stack-name service-discovery-environment \
  --resolve-s3 \
  --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND
Test it Out
Once the stack deploys, you can use the Amazon ECS console to locate the address of the public facing load balancer that provides ingress to the hello service from the public internet. Navigate to the ECS cluster, view the details of the hello service, and click the link under Networking -> DNS Names -> Open Address.
You should see output similar to this:
Hello (from ip-10-0-138-3.us-east-2.compute.internal) Sophia (from ip-10-0-191-125.us-east-2.compute.internal)
If you refresh multiple times you should see different IP addresses and DNS names showing up, demonstrating that both the front facing load balancing and the backend service discovery load balancing are working to distribute traffic.
Try scaling the name service up and down to test how service discovery reacts to changes in the state of the cluster.
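To check the distribution without refreshing by hand, you could collect a batch of response bodies and count which tasks served them. The sketch below assumes that responses follow the "Hello (from host) Name (from host)" format shown above; parseHosts and tallyHosts are hypothetical helper names:

```javascript
// Hypothetical helper: extracts the two task hostnames from one response line.
function parseHosts(line) {
  const hosts = [...line.matchAll(/\(from ([^)]+)\)/g)].map((match) => match[1]);
  if (hosts.length !== 2) {
    throw new Error(`Unexpected response: ${line}`);
  }
  return { hello: hosts[0], name: hosts[1] };
}

// Counts how many responses were served by each hello task and each name task.
function tallyHosts(lines) {
  const counts = { hello: {}, name: {} };
  for (const line of lines) {
    const { hello, name } = parseHosts(line);
    counts.hello[hello] = (counts.hello[hello] || 0) + 1;
    counts.name[name] = (counts.name[name] || 0) + 1;
  }
  return counts;
}
```

Feed tallyHosts an array of response bodies fetched from the load balancer address. With two tasks per service, each side of the tally should show two hostnames with roughly similar counts.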
Tear it Down
You can tear down the entire stack with the following command:
sam delete --stack-name service-discovery-environment
Alternative Patterns
Not quite right for you? Try another way to do this:
- ECS Service Connect is a similar peer to peer networking option that operates more like a service mesh. With Service Connect you don't need to implement your own client side load balancing. Round robin request routing and retries are offloaded to an Envoy Proxy sidecar that is managed by Amazon ECS.