Service Discovery for AWS Fargate tasks with AWS Cloud Map
How to set up service discovery in ECS, so that microservices can communicate with each other.
Nathan Peck
Senior Developer Advocate at AWS
About
Service discovery is a technique for getting traffic from one container to another using a direct peer to peer connection, instead of routing traffic through an intermediary like a load balancer. Service discovery is suitable for a variety of use cases:
Privately networked, internal services that will not be used from the public internet
Low latency communication between services.
Long lived bidirectional connections, such as gRPC.
Low traffic, low cost deployments where you do not wish to pay the hourly fee for a persistent load balancer.
Service discovery for AWS Fargate tasks is powered by AWS Cloud Map. Amazon Elastic Container Service integrates with AWS Cloud Map to configure and sync a list of all your containers. You can then use Cloud Map DNS or API calls to look up the IP address of another task and open a direct connection to it.
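As a rough illustration of the DNS path described above, the following sketch resolves a service discovery name to a list of task IP addresses in Node.js. The function name lookupTaskIps, the injectable resolve4 parameter, and the hostname name.internal are assumptions made for this example, not part of the sample application. The lookup only succeeds from inside the VPC that the private namespace is attached to.

```javascript
// Minimal sketch: resolve a service discovery name to task IP addresses.
// `resolve4` is injectable (an assumption of this example) so the function
// can be exercised without a VPC. By default it uses node:dns/promises.
async function lookupTaskIps(hostname, resolve4) {
  if (!resolve4) {
    const dns = await import('node:dns/promises');
    resolve4 = (name) => dns.resolve4(name);
  }

  try {
    // The answer is a list of A records (IPv4 addresses) for registered tasks.
    return await resolve4(hostname);
  } catch (err) {
    // A name with no registered tasks may surface as ENOTFOUND or ENODATA.
    // Treat that as an empty list rather than a hard failure.
    if (err.code === 'ENOTFOUND' || err.code === 'ENODATA') {
      return [];
    }
    throw err;
  }
}
```

From a task inside the VPC, a call such as lookupTaskIps('name.internal') would be expected to return the private IP addresses of the running name tasks.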
Architecture
In this reference you will deploy the following architecture:
Two services will be deployed as AWS Fargate tasks:
A front facing hello service
A backend name service
Inbound traffic from the public internet will arrive at the hello service via an Application Load Balancer.
The hello service needs to fetch a name from the name service. In order to locate instances of the name service
task, it will use DNS based service discovery to get a list of tasks to send traffic to. The hello service
will do client side load balancing to distribute its requests across available instances of the name service’s task.
Network traffic between the hello service and the name service is direct peer to peer traffic.
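To make the idea of client side load balancing concrete, here is a minimal sketch of a picker that rotates through a list of task IP addresses. Round-robin is used purely for illustration, and createRoundRobin is a hypothetical helper name. Any strategy that spreads requests across the list, including random selection, serves the same purpose.

```javascript
// Minimal sketch of client side load balancing over a list of task IPs.
// Round-robin is an illustrative choice, not a requirement of Cloud Map.
function createRoundRobin() {
  let counter = 0;

  return function pick(ips) {
    if (!Array.isArray(ips) || ips.length === 0) {
      throw new Error('No task IP addresses available');
    }
    // Apply the modulo on every call so the picker copes with the list
    // growing or shrinking between DNS lookups.
    const ip = ips[counter % ips.length];
    counter = (counter + 1) % Number.MAX_SAFE_INTEGER;
    return ip;
  };
}
```

Each caller keeps one picker and passes it the most recent list of addresses on every request, so successive requests land on successive tasks.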
Dependencies
This pattern requires that you have an AWS account, and that you use AWS Serverless Application Model (SAM) CLI. If not already installed then please install SAM CLI for your system.
Define the networking
For this architecture we are going to use private networking for the backend services, so grab the vpc.yml file from “Large VPC for Amazon ECS Cluster”. Do not deploy this CloudFormation template yet. We will deploy it later on as part of the parent stack.
Define the cluster
The following template defines an ECS cluster and a Cloud Map namespace that will be used to store information about the tasks in the cluster:
AWSTemplateFormatVersion: '2010-09-09'
Description: Empty ECS cluster that has no EC2 instances. It is designed
             to be used with AWS Fargate serverless capacity

Parameters:
  VpcId:
    Type: String
    Description: The VPC that the service is running inside of

Resources:

  # Cluster that keeps track of container deployments
  ECSCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterSettings:
        - Name: containerInsights
          Value: enabled

  # This is a role which is used within Fargate to allow the Fargate agent
  # to download images, and upload logs.
  ECSTaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: [ecs-tasks.amazonaws.com]
            Action: ['sts:AssumeRole']
            Condition:
              ArnLike:
                aws:SourceArn: !Sub arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:*
              StringEquals:
                aws:SourceAccount: !Ref AWS::AccountId
      Path: /

      # This role enables basic features of ECS. See reference:
      # https://docs.aws.amazon.com/AmazonECS/latest/developerguide/security-iam-awsmanpol.html#security-iam-awsmanpol-AmazonECSTaskExecutionRolePolicy
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

  # This namespace will keep track of all the tasks in the cluster
  ServiceDiscoveryNamespace:
    Type: AWS::ServiceDiscovery::PrivateDnsNamespace
    Properties:
      Name: internal
      Description: Internal, private service discovery namespace
      Vpc: !Ref VpcId

Outputs:
  ClusterName:
    Description: The ECS cluster into which to launch resources
    Value: !Ref ECSCluster
  ECSTaskExecutionRole:
    Description: The role used to start up a task
    Value: !Ref ECSTaskExecutionRole
  ServiceDiscoveryNamespaceId:
    Description: The shared service discovery namespace for all services in the cluster
    Value: !Ref ServiceDiscoveryNamespace
Some things to note in this template:
An AWS::ServiceDiscovery::PrivateDnsNamespace ensures that Cloud Map can be accessed from inside of the VPC, using a TLD (Top Level Domain) of internal. This will allow us to look up other services in the VPC using a DNS address like http://name.internal.
Define the name service
Because the hello service depends on the name service, it makes sense to define the name service first. This
service will be deploying a public sample image located at public.ecr.aws/ecs-sample-image/name-server.
AWSTemplateFormatVersion: '2010-09-09'
Description: An example service that has service discovery service attached
             so that it can be easily located by other services in the ECS cluster.

Parameters:
  VpcId:
    Type: String
    Description: The VPC that the service is running inside of
  PrivateSubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of private subnet ID's that the AWS VPC tasks are in
  ClusterName:
    Type: String
    Description: The name of the ECS cluster into which to launch capacity.
  ServiceDiscoveryNamespaceId:
    Type: String
    Description: The ID of a CloudMap namespace into which the service will be registered
  ECSTaskExecutionRole:
    Type: String
    Description: The role used to start up an ECS task
  ServiceName:
    Type: String
    Default: name
    Description: A name for the service
  ImageUrl:
    Type: String
    Default: public.ecr.aws/ecs-sample-image/name-server
    Description: The url of a sample container image for this workload
  ContainerCpu:
    Type: Number
    Default: 256
    Description: How much CPU to give the container. 1024 is 1 CPU
  ContainerMemory:
    Type: Number
    Default: 512
    Description: How much memory in megabytes to give the container
  ContainerPort:
    Type: Number
    Default: 80
    Description: What port that the application expects traffic on
  DesiredCount:
    Type: Number
    Default: 2
    Description: How many copies of the service task to run

Resources:

  # The task definition. This is a simple metadata description of what
  # container to run, and what resource requirements it has.
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Ref ServiceName
      Cpu: !Ref ContainerCpu
      Memory: !Ref ContainerMemory
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      ExecutionRoleArn: !Ref ECSTaskExecutionRole
      ContainerDefinitions:
        - Name: !Ref ServiceName
          Cpu: !Ref ContainerCpu
          Memory: !Ref ContainerMemory
          Image: !Ref ImageUrl
          PortMappings:
            - ContainerPort: !Ref ContainerPort
              HostPort: !Ref ContainerPort
          LogConfiguration:
            LogDriver: 'awslogs'
            Options:
              mode: non-blocking
              max-buffer-size: 25m
              awslogs-group: !Ref LogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Ref ServiceName

  # The service. The service is a resource which allows you to run multiple
  # copies of a type of task, and gather up their logs and metrics, as well
  # as monitor the number of running tasks and replace any that have crashed
  NameService:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref ClusterName
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: DISABLED
          SecurityGroups:
            - !Ref ServiceSecurityGroup
          Subnets: !Ref PrivateSubnetIds
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 75
      DesiredCount: !Ref DesiredCount
      TaskDefinition: !Ref TaskDefinition
      ServiceRegistries:
        - RegistryArn: !GetAtt ServiceDiscoveryService.Arn
          ContainerName: !Ref ServiceName

  # Security group that limits network access
  # to tasks from this service
  ServiceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for service
      VpcId: !Ref VpcId

  # Keeps track of the list of tasks for the service
  ServiceDiscoveryService:
    Type: AWS::ServiceDiscovery::Service
    Properties:
      Name: !Ref ServiceName
      DnsConfig:
        NamespaceId: !Ref ServiceDiscoveryNamespaceId
        DnsRecords:
          - TTL: 0
            Type: A

  # This log group stores the stdout logs from this service's containers
  LogGroup:
    Type: AWS::Logs::LogGroup

Outputs:
  ServiceSecurityGroup:
    Description: The security group of the name service
    Value: !Ref ServiceSecurityGroup
Note the following things from the template above:
An AWS::ServiceDiscovery::Service defines the DNS record type (A) and the TTL (Time to Live), which this template sets to 0.
The AWS::ECS::Service is configured to attach to the AWS::ServiceDiscovery::Service so that it can keep it in sync with a list of the tasks.
The AWS::ECS::TaskDefinition is configured in awsvpc networking mode. This gives each task its own unique IP address, which can be plugged into the service discovery DNS.
The name service listens for traffic on port 3000.
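Because the DNS record above is published with a TTL of 0, it is left to each client to decide how long it is willing to reuse an answer. The following is a minimal sketch of a time based cache around a lookup function. The names createCachedResolver, resolveFn, and ttlMs, along with the injectable clock, are assumptions made for this example.

```javascript
// Minimal sketch: cache DNS answers per hostname for a fixed time window.
// `resolveFn` is any async function that maps a hostname to a list of IPs.
// `now` is injectable so that the behaviour can be checked without waiting.
function createCachedResolver(resolveFn, ttlMs, now = Date.now) {
  const cache = new Map(); // hostname -> { records, resolvedAt }

  return async function resolveCached(hostname) {
    const entry = cache.get(hostname);
    if (entry && now() - entry.resolvedAt < ttlMs) {
      // Still fresh: reuse the previous answer and skip the lookup.
      return entry.records;
    }
    const records = await resolveFn(hostname);
    cache.set(hostname, { records, resolvedAt: now() });
    return records;
  };
}
```

A short window keeps lookups cheap while limiting how long a stopped task can linger in the cached list. The right value depends on how quickly the service scales in and out.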
Define the hello service
Now we need to define the hello service. It will be based on the public sample image public.ecr.aws/ecs-sample-image/hello-server:node:
AWSTemplateFormatVersion: '2010-09-09'
Description: An example service that has service discovery service attached
             so that it can be easily located by other services in the ECS cluster.

Parameters:
  VpcId:
    Type: String
    Description: The VPC that the service is running inside of
  PublicSubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of public subnet ID's to put the load balancer and tasks in
  PrivateSubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of private subnet ID's that the AWS VPC tasks are in
  ClusterName:
    Type: String
    Description: The name of the ECS cluster into which to launch capacity.
  NameServiceSecurityGroup:
    Type: String
    Description: The security group of the downstream name service
  ECSTaskExecutionRole:
    Type: String
    Description: The role used to start up an ECS task
  ServiceName:
    Type: String
    Default: hello
    Description: A name for the service
  ImageUrl:
    Type: String
    Default: public.ecr.aws/ecs-sample-image/hello-server:node
    Description: The url of a sample container image for this workload
  ContainerCpu:
    Type: Number
    Default: 256
    Description: How much CPU to give the container. 1024 is 1 CPU
  ContainerMemory:
    Type: Number
    Default: 512
    Description: How much memory in megabytes to give the container
  ContainerPort:
    Type: Number
    Default: 3000
    Description: What port that the application expects traffic on
  DesiredCount:
    Type: Number
    Default: 2
    Description: How many copies of the service task to run

Resources:

  # The task definition. This is a simple metadata description of what
  # container to run, and what resource requirements it has.
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Ref ServiceName
      Cpu: !Ref ContainerCpu
      Memory: !Ref ContainerMemory
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      ExecutionRoleArn: !Ref ECSTaskExecutionRole
      ContainerDefinitions:
        - Name: !Ref ServiceName
          Cpu: !Ref ContainerCpu
          Memory: !Ref ContainerMemory
          Image: !Ref ImageUrl
          Environment:
            - Name: NAME_SERVER
              Value: http://name.internal:3000/
          PortMappings:
            - ContainerPort: !Ref ContainerPort
              HostPort: !Ref ContainerPort
          LogConfiguration:
            LogDriver: 'awslogs'
            Options:
              mode: non-blocking
              max-buffer-size: 25m
              awslogs-group: !Ref LogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Ref ServiceName

  # The service. The service is a resource which allows you to run multiple
  # copies of a type of task, and gather up their logs and metrics, as well
  # as monitor the number of running tasks and replace any that have crashed
  HelloService:
    Type: AWS::ECS::Service
    # Avoid race condition between ECS service creation and associating
    # the target group with the LB
    DependsOn: PublicLoadBalancerListener
    Properties:
      Cluster: !Ref ClusterName
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: DISABLED
          SecurityGroups:
            - !Ref ServiceSecurityGroup
          Subnets: !Ref PrivateSubnetIds
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 75
      DesiredCount: !Ref DesiredCount
      TaskDefinition: !Ref TaskDefinition
      LoadBalancers:
        - ContainerName: !Ref ServiceName
          ContainerPort: !Ref ContainerPort
          TargetGroupArn: !Ref ServiceTargetGroup

  # Security group that limits network access
  # to tasks from this service
  ServiceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for hello service
      VpcId: !Ref VpcId

  # Configure the security group of the name service to accept
  # incoming traffic from the security group of this service
  NameServiceIngressFromHello:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      Description: Allow hello service to make calls to name service
      GroupId: !Ref NameServiceSecurityGroup
      FromPort: 3000
      ToPort: 3000
      IpProtocol: -1
      SourceSecurityGroupId: !Ref ServiceSecurityGroup

  # Keeps track of the list of tasks for the service
  ServiceTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      HealthCheckIntervalSeconds: 6
      HealthCheckPath: /
      HealthCheckProtocol: HTTP
      HealthCheckTimeoutSeconds: 5
      HealthyThresholdCount: 2
      TargetType: ip
      Port: !Ref ContainerPort
      Protocol: HTTP
      UnhealthyThresholdCount: 10
      VpcId: !Ref VpcId
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds
          Value: 0

  # A public facing load balancer, this is used as ingress for
  # public facing internet traffic.
  PublicLoadBalancerSG:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Access to the public facing load balancer
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        # Allow access to public facing ALB from any IP address
        - CidrIp: 0.0.0.0/0
          IpProtocol: -1
  PublicLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Scheme: internet-facing
      LoadBalancerAttributes:
        - Key: idle_timeout.timeout_seconds
          Value: '30'
      Subnets: !Ref PublicSubnetIds
      SecurityGroups:
        - !Ref PublicLoadBalancerSG
  PublicLoadBalancerListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      DefaultActions:
        - Type: 'forward'
          ForwardConfig:
            TargetGroups:
              - TargetGroupArn: !Ref ServiceTargetGroup
                Weight: 100
      LoadBalancerArn: !Ref 'PublicLoadBalancer'
      Port: 80
      Protocol: HTTP

  # Open up the service's security group to traffic originating
  # from the security group of the load balancer.
  ServiceIngressfromLoadBalancer:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      Description: Ingress from the public ALB
      GroupId: !Ref ServiceSecurityGroup
      IpProtocol: -1
      SourceSecurityGroupId: !Ref 'PublicLoadBalancerSG'

  # This log group stores the stdout logs from this service's containers
  LogGroup:
    Type: AWS::Logs::LogGroup
Some things to note in this template:
The AWS::ECS::TaskDefinition is configured to have an environment variable NAME_SERVER with the value http://name.internal:3000/. This service discovery endpoint is based on the TLD of the Cloud Map namespace (internal), the name of the service (name), and the port that the service binds to (3000).
The hello service must create an AWS::EC2::SecurityGroupIngress on the security group of the name service, allowing inbound traffic from the security group of the hello service. Without this ingress rule any direct, inbound, peer to peer connections would be denied by the name security group.
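The way the endpoint is assembled from its three parts can be sketched as a small helper. The function name discoveryUrl is hypothetical and is shown only to make the naming scheme explicit. The sample application simply reads the finished URL from the NAME_SERVER environment variable.

```javascript
// Minimal sketch: build a service discovery endpoint from its parts.
// Cloud Map DNS names take the form <service name>.<namespace name>.
function discoveryUrl(serviceName, namespace, port) {
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError(`Invalid port: ${port}`);
  }
  return `http://${serviceName}.${namespace}:${port}/`;
}
```

With the values used in this pattern, discoveryUrl('name', 'internal', 3000) yields the same value that the template passes in as NAME_SERVER.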
Look at the code
Properly using DNS based service discovery requires some client side implementation.
Let’s look at the source for the hello service.
import os from 'node:os';
import url from 'node:url';
import express from 'express';
import fetch from 'node-fetch';
import retry from 'async-retry';
import { Resolver } from 'node:dns/promises';

const resolver = new Resolver();
const app = express();

const HOSTNAME = os.hostname();
const PORT = process.env.PORT || 3000;
const NAME_SERVER = process.env.NAME_SERVER;

// Validate the environment variable before trying to parse it
if (!NAME_SERVER) {
  throw new Error('Expected environment variable NAME_SERVER');
}

const NAME_URL = url.parse(NAME_SERVER);

// Logic for looking up the DNS based service discovery record
// and selecting a random record from it.
let dnsRecords = [];
let lastResolveTime = 0;
const TTL = 5000; // How long to reuse a DNS answer, in milliseconds

async function resolveNameService() {
  const now = Date.now();

  if (lastResolveTime < now - TTL) {
    dnsRecords = await resolver.resolve(NAME_URL.hostname);
    // Remember when the lookup happened, so that the cached list
    // is reused until the TTL expires
    lastResolveTime = now;
  }

  return dnsRecords[Math.floor(Math.random() * dnsRecords.length)];
}

app.get('/', async function (req, res) {
  // Just in case a downstream task crashes, we wrap this in a retry
  // that will retry against a different task if needed.
  const randomName = await retry(async function () {
    const randomIp = await resolveNameService();
    const randomNameResponse = await fetch(`http://${randomIp}:${NAME_URL.port}`);
    return await randomNameResponse.text();
  }, {
    retries: 5
  });

  res.send(`Hello (from ${HOSTNAME}) ${randomName}`);
});

app.listen(PORT);
console.log(`Listening on http://localhost:${PORT}`);
Things to note:
Each time a request is made to the downstream name service, the service discovery DNS name must be resolved. Doing a full DNS lookup each time would be expensive for the underlying system and would hurt performance, so the process caches the DNS lookup results for a brief time (5 seconds in this code).
The service discovery DNS record returns a list of IP addresses. Note that if the DNS address was used by plugging it directly into a fetch() the runtime would just naively send all requests to the first IP address
in the list. In order to evenly distribute traffic across all the downstream targets, the code must implement
client side load balancing.
The entire network request is wrapped in a retry. This is because there is no guarantee that downstream tasks are
actually still there. Because of DNS propagation delay, the record may still list a downstream name task that has crashed or been stopped by a scale-in. If the task is no longer there when the hello service tries to reach it, there will be a networking failure. The DNS record is eventually consistent
with reality, so in the meantime it is important for the hello service to detect networking issues and retry against
a different backend name task if necessary. Note that this simple demo application does not actually remove the
failed backend IP address from its locally cached list. A potential improvement would be for the process to temporarily
avoid attempting to send any more traffic to an IP address that has had a recent networking failure.
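The potential improvement mentioned in the last point could look roughly like the following sketch: a picker that chooses a random IP address but skips any address that failed within a cooldown window. The names createTargetPicker, markFailed, and cooldownMs, and the 10 second default, are assumptions made for this example and are not part of the sample application.

```javascript
// Minimal sketch: random selection that temporarily avoids IP addresses
// with a recent networking failure. `now` is injectable for testing.
function createTargetPicker(cooldownMs = 10000, now = Date.now) {
  const failedAt = new Map(); // ip -> timestamp of the most recent failure

  function isCoolingDown(ip) {
    const timestamp = failedAt.get(ip);
    if (timestamp === undefined) return false;
    if (now() - timestamp < cooldownMs) return true;
    // The cooldown has passed, so forget the failure.
    failedAt.delete(ip);
    return false;
  }

  return {
    // Record a networking failure against an IP address.
    markFailed(ip) {
      failedAt.set(ip, now());
    },

    // Choose a random IP, preferring addresses without a recent failure.
    pick(ips, random = Math.random) {
      if (ips.length === 0) {
        throw new Error('No task IP addresses available');
      }
      const healthy = ips.filter((ip) => !isCoolingDown(ip));
      // If every address failed recently, fall back to the full list
      // rather than refusing to send traffic at all.
      const candidates = healthy.length > 0 ? healthy : ips;
      return candidates[Math.floor(random() * candidates.length)];
    },
  };
}
```

Inside the retry callback, the request would be wrapped in a try/catch that calls markFailed with the chosen IP address before rethrowing, so that the next attempt is steered towards a different task.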
Deploy it all
You should have the following four files:
vpc.yml - Template for the base VPC that you wish to host resources in
cluster.yml - Template for the ECS cluster, the task execution role, and the Cloud Map namespace
hello.yml - Template for the hello service that will be deployed on the cluster
name.yml - Template for the name service that will be deployed on the cluster
Use the following parent stack to deploy all four templates as nested stacks:
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: Parent stack that deploys an AWS Fargate service discovery example.

Resources:

  # The networking configuration. This creates an isolated
  # network specific to this particular environment
  VpcStack:
    Type: AWS::Serverless::Application
    Properties:
      Location: vpc.yml

  # This stack contains the Amazon ECS cluster itself
  ClusterStack:
    Type: AWS::Serverless::Application
    Properties:
      Location: cluster.yml
      Parameters:
        VpcId: !GetAtt VpcStack.Outputs.VpcId

  # Deploy the name server as a service
  NameService:
    Type: AWS::Serverless::Application
    Properties:
      Location: name.yml
      Parameters:
        VpcId: !GetAtt VpcStack.Outputs.VpcId
        PrivateSubnetIds: !GetAtt VpcStack.Outputs.PrivateSubnetIds
        ClusterName: !GetAtt ClusterStack.Outputs.ClusterName
        ECSTaskExecutionRole: !GetAtt ClusterStack.Outputs.ECSTaskExecutionRole
        ServiceDiscoveryNamespaceId: !GetAtt ClusterStack.Outputs.ServiceDiscoveryNamespaceId

  # Deploy the hello server as a service
  HelloService:
    Type: AWS::Serverless::Application
    Properties:
      Location: hello.yml
      Parameters:
        VpcId: !GetAtt VpcStack.Outputs.VpcId
        PublicSubnetIds: !GetAtt VpcStack.Outputs.PublicSubnetIds
        PrivateSubnetIds: !GetAtt VpcStack.Outputs.PrivateSubnetIds
        ClusterName: !GetAtt ClusterStack.Outputs.ClusterName
        ECSTaskExecutionRole: !GetAtt ClusterStack.Outputs.ECSTaskExecutionRole
        NameServiceSecurityGroup: !GetAtt NameService.Outputs.ServiceSecurityGroup
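Deploy the parent stack with the SAM CLI (sam deploy), using the stack name service-discovery-environment so that it matches the tear down command at the end of this guide. Because the templates create an IAM role and nested applications, the deployment needs the CAPABILITY_IAM and CAPABILITY_AUTO_EXPAND capabilities.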
Once the stack deploys, you can use the Amazon ECS console to locate the address of the public facing
load balancer that provides ingress to the hello service from the public internet. Navigate to the ECS cluster,
view the details of the hello service, and click the link under Networking -> DNS Names -> Open Address.
If you refresh multiple times you should see different IP addresses and DNS names showing up,
demonstrating that both the front facing load balancing and the backend service discovery load
balancing are working to distribute traffic across tasks.
Try scaling the name service up and down to test out how service discovery reacts to changes in the state of the cluster.
Tear it Down
You can tear down the entire stack with the following command:
sam delete --stack-name service-discovery-environment