Service Discovery for AWS Fargate tasks with AWS Cloud Map

Nathan Peck
Senior Developer Advocate at AWS

About

Service discovery is a technique for getting traffic from one container to another using a direct peer to peer connection, instead of routing traffic through an intermediary like a load balancer. Service discovery is suitable for a variety of use cases:

Privately networked, internal services that will not be used from the public internet.
Low latency communication between services.
Long lived bidirectional connections, such as gRPC.
Low traffic, low cost deployments where you do not wish to pay the hourly fee for a persistent load balancer.

Service discovery for AWS Fargate tasks is powered by AWS Cloud Map. Amazon Elastic Container Service integrates with AWS Cloud Map to configure and sync a list of all your containers. You can then use Cloud Map DNS or API calls to look up the IP address of another task and open a direct connection to it.
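For the API route, a minimal sketch might look like the following. It assumes the response shape of the Cloud Map DiscoverInstances call, in which each instance registered by Amazon ECS carries its IP address in an AWS_INSTANCE_IPV4 attribute. The extractIps helper and the sample response are hypothetical and are not part of this pattern, which uses DNS lookups instead.

```javascript
// Hypothetical helper: pull task IP addresses out of a Cloud Map
// DiscoverInstances response. Assumes ECS registered each task with
// an AWS_INSTANCE_IPV4 attribute.
function extractIps(response) {
  return (response.Instances || [])
    .filter((instance) => instance.HealthStatus !== 'UNHEALTHY')
    .map((instance) => instance.Attributes && instance.Attributes.AWS_INSTANCE_IPV4)
    .filter(Boolean);
}

// Illustrative response. In a real task it would come from the AWS SDK, e.g.
//   import { ServiceDiscoveryClient, DiscoverInstancesCommand }
//     from '@aws-sdk/client-servicediscovery';
//   const response = await new ServiceDiscoveryClient({}).send(
//     new DiscoverInstancesCommand({ NamespaceName: 'internal', ServiceName: 'name' })
//   );
const sampleResponse = {
  Instances: [
    { InstanceId: 'task-a', HealthStatus: 'HEALTHY', Attributes: { AWS_INSTANCE_IPV4: '10.0.191.125' } },
    { InstanceId: 'task-b', HealthStatus: 'UNHEALTHY', Attributes: { AWS_INSTANCE_IPV4: '10.0.140.7' } },
    { InstanceId: 'task-c', HealthStatus: 'UNKNOWN', Attributes: { AWS_INSTANCE_IPV4: '10.0.152.44' } },
  ],
};

console.log(extractIps(sampleResponse));
```

A task calling the API this way would also need an IAM task role that permits servicediscovery:DiscoverInstances, which the templates in this pattern do not define.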

Architecture

In this reference you will deploy the following architecture:

Two services will be deployed as AWS Fargate tasks:

A front facing hello service
A backend name service

Inbound traffic from the public internet will arrive at the hello service via an Application Load Balancer.

The hello service needs to fetch a name from the name service. In order to locate instances of the name service task, it will use DNS based service discovery to get a list of tasks to send traffic to. The hello service will do client side load balancing to distribute its requests across available instances of the name service's task.
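Client side load balancing can be as simple as rotating through the addresses returned by DNS. The following is a minimal sketch of round-robin selection, assuming the list of IP addresses has already been resolved. The createRoundRobin function and the sample addresses are hypothetical; the sample hello service shown later picks a random address instead.

```javascript
// Hypothetical round-robin picker over a list of task IP addresses.
function createRoundRobin() {
  let counter = 0;
  // Returns the next address in rotation; the list may change between calls.
  return function next(addresses) {
    if (!addresses || addresses.length === 0) {
      throw new Error('No addresses available');
    }
    const address = addresses[counter % addresses.length];
    counter += 1;
    return address;
  };
}

const next = createRoundRobin();
const tasks = ['10.0.191.125', '10.0.152.44']; // illustrative IPs
console.log(next(tasks), next(tasks), next(tasks));
```

Because the picker takes the address list on every call, it keeps working when a fresh DNS lookup returns a longer or shorter list.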

Network traffic between the hello service and the name service is direct peer to peer traffic.

Dependencies

This pattern requires an AWS account and the AWS Serverless Application Model (SAM) CLI. If it is not already installed, install the SAM CLI for your system.

Define the networking

For this architecture we are going to use private networking for the backend services, so grab the vpc.yml file from "Large VPC for Amazon ECS Cluster". Do not deploy this CloudFormation template yet. We will deploy it later on.

Define the cluster

The following template defines an ECS cluster and a Cloud Map namespace that will be used to store information about the tasks in the cluster:

File: cluster.yml
Language: yml

AWSTemplateFormatVersion: '2010-09-09'
Description: Empty ECS cluster that has no EC2 instances. It is designed
             to be used with AWS Fargate serverless capacity
Parameters:
  VpcId:
    Type: String
    Description: The VPC that the service is running inside of

Resources:
  # Cluster that keeps track of container deployments
  ECSCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterSettings:
        - Name: containerInsights
          Value: enabled

  # This is a role which is used within Fargate to allow the Fargate agent
  # to download images, and upload logs.
  ECSTaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: [ecs-tasks.amazonaws.com]
            Action: ['sts:AssumeRole']
            Condition:
              ArnLike:
                aws:SourceArn: !Sub arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:*
              StringEquals:
                aws:SourceAccount: !Ref AWS::AccountId
      Path: /
      # This role enables basic features of ECS. See reference:
      # https://docs.aws.amazon.com/AmazonECS/latest/developerguide/security-iam-awsmanpol.html#security-iam-awsmanpol-AmazonECSTaskExecutionRolePolicy
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

  # This namespace will keep track of all the tasks in the cluster
  ServiceDiscoveryNamespace:
    Type: AWS::ServiceDiscovery::PrivateDnsNamespace
    Properties:
      Name: internal
      Description: Internal, private service discovery namespace
      Vpc: !Ref VpcId

Outputs:
  ClusterName:
    Description: The ECS cluster into which to launch resources
    Value: !Ref ECSCluster
  ECSTaskExecutionRole:
    Description: The role used to start up a task
    Value: !Ref ECSTaskExecutionRole
  ServiceDiscoveryNamespaceId:
    Description: The shared service discovery namespace for all services in the cluster
    Value: !Ref ServiceDiscoveryNamespace

Some things to note in this template:

An AWS::ServiceDiscovery::PrivateDnsNamespace ensures that Cloud Map can be accessed from inside of the VPC, using a TLD (Top Level Domain) of internal. This will allow us to look up other services in the VPC using a DNS address like http://name.internal.

Define the `name` service

Because the hello service depends on the name service, it makes sense to define the name service first. This service deploys a public sample image located at public.ecr.aws/ecs-sample-image/name-server.

File: name.yml
Language: yml

AWSTemplateFormatVersion: '2010-09-09'
Description: An example service that has service discovery service attached
             so that it can be easily located by other services in the ECS cluster.

Parameters:
  VpcId:
    Type: String
    Description: The VPC that the service is running inside of
  PrivateSubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of private subnet ID's that the AWS VPC tasks are in
  ClusterName:
    Type: String
    Description: The name of the ECS cluster into which to launch capacity.
  ServiceDiscoveryNamespaceId:
    Type: String
    Description: The ID of a CloudMap namespace into which the service will be registered
  ECSTaskExecutionRole:
    Type: String
    Description: The role used to start up an ECS task
  ServiceName:
    Type: String
    Default: name
    Description: A name for the service
  ImageUrl:
    Type: String
    Default: public.ecr.aws/ecs-sample-image/name-server
    Description: The url of a sample container image for this workload
  ContainerCpu:
    Type: Number
    Default: 256
    Description: How much CPU to give the container. 1024 is 1 CPU
  ContainerMemory:
    Type: Number
    Default: 512
    Description: How much memory in megabytes to give the container
  ContainerPort:
    Type: Number
    Default: 3000
    Description: What port that the application expects traffic on
  DesiredCount:
    Type: Number
    Default: 2
    Description: How many copies of the service task to run

Resources:

  # The task definition. This is a simple metadata description of what
  # container to run, and what resource requirements it has.
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Ref ServiceName
      Cpu: !Ref ContainerCpu
      Memory: !Ref ContainerMemory
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      ExecutionRoleArn: !Ref ECSTaskExecutionRole
      ContainerDefinitions:
        - Name: !Ref ServiceName
          Cpu: !Ref ContainerCpu
          Memory: !Ref ContainerMemory
          Image: !Ref ImageUrl
          PortMappings:
            - ContainerPort: !Ref ContainerPort
              HostPort: !Ref ContainerPort
          LogConfiguration:
            LogDriver: 'awslogs'
            Options:
              mode: non-blocking
              max-buffer-size: 25m
              awslogs-group: !Ref LogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Ref ServiceName

  # The service. The service is a resource which allows you to run multiple
  # copies of a type of task, and gather up their logs and metrics, as well
  # as monitor the number of running tasks and replace any that have crashed
  NameService:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref ClusterName
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: DISABLED
          SecurityGroups:
            - !Ref ServiceSecurityGroup
          Subnets: !Ref PrivateSubnetIds
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 75
      DesiredCount: !Ref DesiredCount
      TaskDefinition: !Ref TaskDefinition
      ServiceRegistries:
        - RegistryArn: !GetAtt ServiceDiscoveryService.Arn
          ContainerName: !Ref ServiceName

  # Security group that limits network access
  # to tasks from this service
  ServiceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for service
      VpcId: !Ref VpcId

  # Keeps track of the list of tasks for the service
  ServiceDiscoveryService:
    Type: AWS::ServiceDiscovery::Service
    Properties:
      Name: !Ref ServiceName
      DnsConfig:
        NamespaceId: !Ref ServiceDiscoveryNamespaceId
        DnsRecords:
          - TTL: 0
            Type: A

  # This log group stores the stdout logs from this service's containers
  LogGroup:
    Type: AWS::Logs::LogGroup

Outputs:
  ServiceSecurityGroup:
    Description: The security group of the name service
    Value: !Ref ServiceSecurityGroup

Note the following things from the template above:

An AWS::ServiceDiscovery::Service defines the DNS record type (A) and the TTL (Time to Live).
The AWS::ECS::Service is configured to attach to the AWS::ServiceDiscovery::Service so that it can keep it in sync with a list of the tasks.
The AWS::ECS::TaskDefinition is configured in awsvpc networking mode. This gives each task its own unique IP address, which can be plugged into the service discovery DNS.
The name service listens for traffic on port 3000.

Define the `hello` service

Now we need to define the hello service. It will be based on the public sample image public.ecr.aws/ecs-sample-image/hello-server:node:

File: hello.yml
Language: yml

AWSTemplateFormatVersion: '2010-09-09'
Description: An example service that has service discovery service attached
             so that it can be easily located by other services in the ECS cluster.

Parameters:
  VpcId:
    Type: String
    Description: The VPC that the service is running inside of
  PublicSubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of public subnet ID's to put the load balancer and tasks in
  PrivateSubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: List of private subnet ID's that the AWS VPC tasks are in
  ClusterName:
    Type: String
    Description: The name of the ECS cluster into which to launch capacity.
  NameServiceSecurityGroup:
    Type: String
    Description: The security group of the downstream name service
  ECSTaskExecutionRole:
    Type: String
    Description: The role used to start up an ECS task
  ServiceName:
    Type: String
    Default: hello
    Description: A name for the service
  ImageUrl:
    Type: String
    Default: public.ecr.aws/ecs-sample-image/hello-server:node
    Description: The url of a sample container image for this workload
  ContainerCpu:
    Type: Number
    Default: 256
    Description: How much CPU to give the container. 1024 is 1 CPU
  ContainerMemory:
    Type: Number
    Default: 512
    Description: How much memory in megabytes to give the container
  ContainerPort:
    Type: Number
    Default: 3000
    Description: What port that the application expects traffic on
  DesiredCount:
    Type: Number
    Default: 2
    Description: How many copies of the service task to run

Resources:

  # The task definition. This is a simple metadata description of what
  # container to run, and what resource requirements it has.
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Ref ServiceName
      Cpu: !Ref ContainerCpu
      Memory: !Ref ContainerMemory
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      ExecutionRoleArn: !Ref ECSTaskExecutionRole
      ContainerDefinitions:
        - Name: !Ref ServiceName
          Cpu: !Ref ContainerCpu
          Memory: !Ref ContainerMemory
          Image: !Ref ImageUrl
          Environment:
            - Name: NAME_SERVER
              Value: http://name.internal:3000/
          PortMappings:
            - ContainerPort: !Ref ContainerPort
              HostPort: !Ref ContainerPort
          LogConfiguration:
            LogDriver: 'awslogs'
            Options:
              mode: non-blocking
              max-buffer-size: 25m
              awslogs-group: !Ref LogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Ref ServiceName

  # The service. The service is a resource which allows you to run multiple
  # copies of a type of task, and gather up their logs and metrics, as well
  # as monitor the number of running tasks and replace any that have crashed
  HelloService:
    Type: AWS::ECS::Service
    # Avoid race condition between ECS service creation and associating
    # the target group with the LB
    DependsOn: PublicLoadBalancerListener
    Properties:
      Cluster: !Ref ClusterName
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: DISABLED
          SecurityGroups:
            - !Ref ServiceSecurityGroup
          Subnets: !Ref PrivateSubnetIds
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 75
      DesiredCount: !Ref DesiredCount
      TaskDefinition: !Ref TaskDefinition
      LoadBalancers:
        - ContainerName: !Ref ServiceName
          ContainerPort: !Ref ContainerPort
          TargetGroupArn: !Ref ServiceTargetGroup

  # Security group that limits network access
  # to tasks from this service
  ServiceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for hello service
      VpcId: !Ref VpcId

  # Configure the security group of the name service to accept
  # incoming traffic from the security group of this service
  NameServiceIngressFromHello:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      Description: Allow hello service to make calls to name service
      GroupId: !Ref NameServiceSecurityGroup
      FromPort: 3000
      ToPort: 3000
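      # Use tcp so that the port range above is enforced; a protocol
      # of -1 would allow all traffic regardless of FromPort and ToPort
      IpProtocol: tcp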
      SourceSecurityGroupId: !Ref ServiceSecurityGroup

  # Keeps track of the list of tasks for the service
  ServiceTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      HealthCheckIntervalSeconds: 6
      HealthCheckPath: /
      HealthCheckProtocol: HTTP
      HealthCheckTimeoutSeconds: 5
      HealthyThresholdCount: 2
      TargetType: ip
      Port: !Ref ContainerPort
      Protocol: HTTP
      UnhealthyThresholdCount: 10
      VpcId: !Ref VpcId
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds
          Value: 0

  # A public facing load balancer, this is used as ingress for
  # public facing internet traffic.
  PublicLoadBalancerSG:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Access to the public facing load balancer
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        # Allow access to public facing ALB from any IP address
        - CidrIp: 0.0.0.0/0
          IpProtocol: -1
  PublicLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Scheme: internet-facing
      LoadBalancerAttributes:
      - Key: idle_timeout.timeout_seconds
        Value: '30'
      Subnets: !Ref PublicSubnetIds
      SecurityGroups:
        - !Ref PublicLoadBalancerSG
  PublicLoadBalancerListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      DefaultActions:
        - Type: 'forward'
          ForwardConfig:
            TargetGroups:
              - TargetGroupArn: !Ref ServiceTargetGroup
                Weight: 100
      LoadBalancerArn: !Ref 'PublicLoadBalancer'
      Port: 80
      Protocol: HTTP

  # Open up the service's security group to traffic originating
  # from the security group of the load balancer.
  ServiceIngressfromLoadBalancer:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      Description: Ingress from the public ALB
      GroupId: !Ref ServiceSecurityGroup
      IpProtocol: -1
      SourceSecurityGroupId: !Ref 'PublicLoadBalancerSG'

  # This log group stores the stdout logs from this service's containers
  LogGroup:
    Type: AWS::Logs::LogGroup

Some things to note in this template:

The AWS::ECS::TaskDefinition is configured to have an environment variable NAME_SERVER with the value http://name.internal:3000/. This service discovery endpoint is based on the TLD of the Cloud Map namespace (internal), the name of the service (name), and the port that the service binds to (3000).
The hello service must create an AWS::EC2::SecurityGroupIngress on the security group of the name service, allowing inbound traffic from the security group of the hello service. Without this ingress rule any direct, inbound, peer to peer connections would be denied by the name security group.
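To make the composition of the endpoint concrete, here is a minimal sketch that builds it from its three parts and then splits it back apart. The discoveryEndpoint helper is hypothetical and not part of the sample application; the values are the ones used in the templates above.

```javascript
// Hypothetical helper: compose a service discovery endpoint from the
// Cloud Map service name, the namespace name, and the container port.
function discoveryEndpoint(serviceName, namespaceName, port) {
  return `http://${serviceName}.${namespaceName}:${port}/`;
}

const endpoint = discoveryEndpoint('name', 'internal', 3000);
console.log(endpoint);

// The WHATWG URL class splits it back into the parts a client needs.
const parsed = new URL(endpoint);
console.log(parsed.hostname, parsed.port);
```

The hostname is what gets resolved through DNS, while the port has to be supplied by the client because an A record carries only IP addresses.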

Look at the code

Properly using DNS based service discovery requires some client side implementation. Let's look at the source for the hello service.

File: index.js
Language: js

import os from 'node:os';
import express from 'express';
import fetch from 'node-fetch';
import retry from 'async-retry';
import { Resolver } from 'node:dns/promises';
const resolver = new Resolver();
const app = express();

const HOSTNAME = os.hostname();
const PORT = process.env.PORT || 3000;
const NAME_SERVER = process.env.NAME_SERVER;
// Validate before parsing so that a missing variable produces a clear error
if (!NAME_SERVER) {
  throw new Error('Expected environment variable NAME_SERVER');
}
const NAME_URL = new URL(NAME_SERVER);

// Logic for looking up the DNS based service discovery record
// and selecting a random record from it.
let dnsRecords = [];
let lastResolveTime = 0;
const TTL = 5000; // How long to cache DNS results, in milliseconds
async function resolveNameService() {
  const now = Date.now();
  if (lastResolveTime < now - TTL) {
    dnsRecords = await resolver.resolve(NAME_URL.hostname);
    // Record the lookup time so the cached list is reused until the TTL expires
    lastResolveTime = now;
  }
  return dnsRecords[Math.floor(Math.random() * dnsRecords.length)];
}

app.get('/', async function (req, res) {
  // Just in case a downstream task crashes, we wrap this in a retry
  // that will retry against a different task if needed.
  const randomName = await retry(
    async function () {
      const randomIp = await resolveNameService();
      const randomNameResponse = await fetch(`http://${randomIp}:${NAME_URL.port}`);
      return await randomNameResponse.text();
    },
    {
      retries: 5
    }
  );

  res.send(`Hello (from ${HOSTNAME}) ${randomName}`)
})

app.listen(PORT)

console.log(`Listening on http://localhost:${PORT}`);

Things to note:

Each time a request is made to the downstream name service, the service discovery DNS name must be resolved. Doing a full DNS lookup on every request would be expensive for the underlying system and would hurt performance, so the process caches the DNS lookup results for a brief time.
The service discovery DNS record returns a list of IP addresses. If the DNS name were plugged directly into fetch(), the runtime would naively send all requests to the first IP address in the list. In order to evenly distribute traffic across all the downstream targets, the code must implement client side load balancing.
The entire network request is wrapped in a retry. This is because there is no guarantee that downstream tasks are actually still there. Because of DNS propagation delay it is possible for a downstream name task to have crashed or been stopped by a scale-in. If the task is no longer there when the hello service tries to reach it, there will be a networking failure. The DNS record is eventually consistent with reality, so in the meantime it is important for the hello service to detect networking issues and retry against a different backend name task if necessary. Note that this simple demo application does not actually remove the failed backend IP address from its locally cached list. A potential improvement would be for the process to temporarily avoid attempting to send any more traffic to an IP address that has had a recent networking failure.
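The potential improvement mentioned in the last point could be sketched as follows. This is not part of the sample application: createFailureTracker, the cooldown value, and the sample addresses are all hypothetical, and the clock is injectable only so that the behavior is easy to follow.

```javascript
// Hypothetical tracker that temporarily skips addresses with a recent failure.
function createFailureTracker(cooldownMs, now = Date.now) {
  const failedAt = new Map(); // ip -> timestamp of the most recent failure

  return {
    // Record a networking failure against an address.
    markFailed(ip) {
      failedAt.set(ip, now());
    },
    // Pick a random address, preferring those without a recent failure.
    pick(addresses) {
      const healthy = addresses.filter((ip) => {
        const t = failedAt.get(ip);
        return t === undefined || now() - t >= cooldownMs;
      });
      // If every address failed recently, fall back to the full list
      // rather than refusing to send traffic at all.
      const pool = healthy.length > 0 ? healthy : addresses;
      return pool[Math.floor(Math.random() * pool.length)];
    },
  };
}

// Usage with a fake clock.
let clock = 0;
const tracker = createFailureTracker(10000, () => clock);
const ips = ['10.0.191.125', '10.0.152.44']; // illustrative IPs
tracker.markFailed('10.0.191.125');
console.log(tracker.pick(ips)); // the address that has not failed
clock = 10000;
console.log(tracker.pick(ips)); // either address; the cooldown has elapsed
```

In the hello service, the retry callback could call markFailed with the chosen IP address inside a catch block before rethrowing, so that the next attempt prefers a different task.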

Deploy it all

You should have the following four files:

vpc.yml - Template for the base VPC that you wish to host resources in
cluster.yml - Template for the ECS cluster, its task execution role, and the Cloud Map namespace
hello.yml - Template for the hello service that will be deployed on the cluster
name.yml - Template for the name service that will be deployed on the cluster

Use the following parent stack to deploy all four stacks:

File: parent.yml
Language: yml

AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: Parent stack that deploys an AWS Fargate service discovery example.

Resources:

  # The networking configuration. This creates an isolated
  # network specific to this particular environment
  VpcStack:
    Type: AWS::Serverless::Application
    Properties:
      Location: vpc.yml

  # This stack contains the Amazon ECS cluster itself
  ClusterStack:
    Type: AWS::Serverless::Application
    Properties:
      Location: cluster.yml
      Parameters:
        VpcId: !GetAtt VpcStack.Outputs.VpcId

  # Deploy the name server as a service
  NameService:
    Type: AWS::Serverless::Application
    Properties:
      Location: name.yml
      Parameters:
        VpcId: !GetAtt VpcStack.Outputs.VpcId
        PrivateSubnetIds: !GetAtt VpcStack.Outputs.PrivateSubnetIds
        ClusterName: !GetAtt ClusterStack.Outputs.ClusterName
        ECSTaskExecutionRole: !GetAtt ClusterStack.Outputs.ECSTaskExecutionRole
        ServiceDiscoveryNamespaceId: !GetAtt ClusterStack.Outputs.ServiceDiscoveryNamespaceId

  # Deploy the hello server as a service
  HelloService:
    Type: AWS::Serverless::Application
    Properties:
      Location: hello.yml
      Parameters:
        VpcId: !GetAtt VpcStack.Outputs.VpcId
        PublicSubnetIds: !GetAtt VpcStack.Outputs.PublicSubnetIds
        PrivateSubnetIds: !GetAtt VpcStack.Outputs.PrivateSubnetIds
        ClusterName: !GetAtt ClusterStack.Outputs.ClusterName
        ECSTaskExecutionRole: !GetAtt ClusterStack.Outputs.ECSTaskExecutionRole
        NameServiceSecurityGroup: !GetAtt NameService.Outputs.ServiceSecurityGroup

Use the following command to deploy the parent stack and its four nested stacks:

Language: shell

sam deploy \
  --template-file parent.yml \
  --stack-name service-discovery-environment \
  --resolve-s3 \
  --capabilities CAPABILITY_IAM

Test it Out

Once the stack deploys, you can use the Amazon ECS console to locate the address of the public facing load balancer that provides ingress to the hello service from the public internet. Navigate to the ECS cluster, view the details of the hello service, and click the link under Networking -> DNS Names -> Open Address.

You should see output similar to this:

Language: txt

Hello (from ip-10-0-138-3.us-east-2.compute.internal) Sophia (from ip-10-0-191-125.us-east-2.compute.internal)

If you refresh multiple times you should see different IP addresses and DNS names showing up, demonstrating that both the front facing load balancing and the backend service discovery load balancing are working to evenly distribute traffic.

Try scaling the name service up and down to test out how service discovery reacts to changes in the state of the cluster.
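Rather than refreshing by hand, the responses can be tallied with a small script. The following is a minimal sketch that assumes the response format shown above; parseHelloResponse and tallyNameHosts are hypothetical helpers, not part of the sample application.

```javascript
// Hypothetical helper: split a hello service response into its parts.
function parseHelloResponse(text) {
  const match = /^Hello \(from ([^)]+)\) (.+) \(from ([^)]+)\)$/.exec(text.trim());
  if (!match) return null;
  return { helloHost: match[1], name: match[2], nameHost: match[3] };
}

// Count how many responses came from each backend name task.
function tallyNameHosts(responses) {
  const counts = {};
  for (const text of responses) {
    const parsed = parseHelloResponse(text);
    if (parsed) counts[parsed.nameHost] = (counts[parsed.nameHost] || 0) + 1;
  }
  return counts;
}

const sample = 'Hello (from ip-10-0-138-3.us-east-2.compute.internal) Sophia (from ip-10-0-191-125.us-east-2.compute.internal)';
console.log(parseHelloResponse(sample));
console.log(tallyNameHosts([sample, sample]));
```

To collect real responses, fetch the load balancer address repeatedly (recent Node.js versions provide a global fetch) and pass the response bodies to tallyNameHosts. A roughly even spread across name tasks suggests the client side load balancing is working.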

Tear it Down

You can tear down the entire stack with the following command:

Language: shell

sam delete --stack-name service-discovery-environment

Alternative Patterns

Not quite right for you? Try another way to do this:

ECS Service Connect Service to service communication with AWS Copilot

ECS Service Connect is a similar peer to peer networking option, that operates more like a service mesh. With Service Connect you don't need to implement your own client side load balancing. Round robin request routing, and retries are offloaded to an Envoy Proxy sidecar that is managed by Amazon ECS.
