Capture ECS task events into Amazon CloudWatch using Amazon EventBridge

Nathan Peck
Senior Developer Advocate at AWS

About

Amazon Elastic Container Service watches over your application 24/7, making autonomous decisions about how to keep your application up and running on your infrastructure. For example, if it sees that your application has crashed, then it will restart it. If an EC2 instance goes offline then Elastic Container Service can relaunch your application on a different EC2 instance that is still online.

By default, ECS only retains information on a task while it is running, and for a brief period of time after the task has stopped. What if you want to capture task history for longer, in order to review older tasks that crashed in the past?

With this pattern you can use Amazon EventBridge to capture ECS task data into long-term storage in Amazon CloudWatch, then query that data back out later using the CloudWatch Log Insights query language.

[Diagram: Amazon ECS emits events to Amazon EventBridge, which stores them in Amazon CloudWatch. The Amazon ECS console shows only recent task info (less than one hour old), while CloudWatch Log Insights, Container Insights, and application logs retain and query historical telemetry and events.]

CloudWatch Container Insights

Amazon ECS CloudWatch Container Insights is an optional feature that you can enable to store and retain task telemetry data for as long as you want. The task telemetry data includes resource usage statistics, at one minute resolution, covering CPU, memory, networking, and storage.

TIP

There is no charge for using Amazon ECS itself; however, the Container Insights feature does come with an additional cost based on the amount of telemetry data stored in CloudWatch, plus an additional cost for querying that data using CloudWatch Log Insights. A task with one container generates about 1 MB of telemetry data per day, so a service running 100 single-container tasks would produce roughly 3 GB of telemetry per month. If there is more than one container per task, or you have frequent task turnover, you may generate even more telemetry data. Queries also cost more based on the amount of telemetry data they process. See Amazon CloudWatch pricing for more info.

To activate Container Insights for a cluster, you can use the command line:

Language: sh
aws ecs update-cluster-settings \
  --cluster cluster_name_or_arn \
  --settings name=containerInsights,value=enabled \
  --region us-east-1
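
After running this command, you can confirm that the setting took effect by describing the cluster and checking the returned settings:

Language: sh
# Verify that containerInsights shows as "enabled" in the cluster settings
aws ecs describe-clusters \
  --clusters cluster_name_or_arn \
  --include SETTINGS \
  --region us-east-1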

Or you can enable Container Insights when creating an ECS cluster with CloudFormation:

Language: yml
MyCluster:
  Type: AWS::ECS::Cluster
  Properties:
    ClusterName: production
    ClusterSettings:
      - Name: containerInsights
        Value: enabled

From this point on you will start to see new metrics and new logs stored in CloudWatch. You can find the raw task details over time stored in CloudWatch Logs, under the namespace /aws/ecs/containerinsights/<cluster-name>. By default this log group only stores data for one day, but you can extend the retention period by finding the log group in the CloudWatch Logs console and editing its settings.
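
If you prefer the command line, the retention period can also be changed with the put-retention-policy command. This is a sketch that assumes the Container Insights performance log group for a cluster named production; substitute the exact log group name you see in your own CloudWatch Logs console:

Language: sh
# Retain Container Insights telemetry for 30 days instead of the default.
# The log group name below is an assumption; copy the real name from
# the CloudWatch Logs console.
aws logs put-retention-policy \
  --log-group-name "/aws/ecs/containerinsights/production/performance" \
  --retention-in-days 30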

Sample Container Insights Telemetry

Container Insights emits both container-level and task-level telemetry events. Here is a sample container-level event, similar to what you will see in CloudWatch after enabling Container Insights:
Language: json
{
    "Version": "0",
    "Type": "Container",
    "ContainerName": "stress-ng",
    "TaskId": "fd84326dd7a44ad48c74d2487f773e1e",
    "TaskDefinitionFamily": "stress-ng",
    "TaskDefinitionRevision": "2",
    "ServiceName": "stress-ng",
    "ClusterName": "benchmark-cluster-ECSCluster-TOl9tY939Z2a",
    "Image": "209640446841.dkr.ecr.us-east-2.amazonaws.com/stress-ng:latest",
    "ContainerKnownStatus": "RUNNING",
    "Timestamp": 1654023960000,
    "CpuUtilized": 24.915774739583338,
    "CpuReserved": 256,
    "MemoryUtilized": 270,
    "MemoryReserved": 512,
    "StorageReadBytes": 0,
    "StorageWriteBytes": 0,
    "NetworkRxBytes": 0,
    "NetworkRxDropped": 0,
    "NetworkRxErrors": 0,
    "NetworkRxPackets": 4532,
    "NetworkTxBytes": 0,
    "NetworkTxDropped": 0,
    "NetworkTxErrors": 0,
    "NetworkTxPackets": 1899
}

Container Insights telemetry can be queried using CloudWatch Log Insights. For example, this sample query fetches the telemetry for a specific task:

Language: query
fields @timestamp, @message
| filter Type="Container" and TaskId="33a03820a2ce4ced85af7e0d4f51daf7"
| sort @timestamp desc
| limit 20

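Beyond fetching individual events, you can aggregate telemetry over time with the stats command. As a sketch, assuming task-level telemetry events of Type "Task", this query computes average CPU and memory usage in five minute buckets:

Language: query
filter Type = "Task"
| stats avg(CpuUtilized) as avgCpu, avg(MemoryUtilized) as avgMemory by bin(5m)
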
You can find more sample queries and query syntax rules in the CloudWatch Log Insights docs.

Capture ECS Task History

In addition to the raw telemetry, Amazon ECS produces events that can be captured in a CloudWatch log group using Amazon EventBridge. These events are emitted when a service is updated, a task changes state, or a container instance changes state.

The following CloudFormation template sets up EventBridge rules that capture service deployment and task events into CloudWatch Logs:

File: eventbridge-ecs-task-events.yml
Language: yml
AWSTemplateFormatVersion: '2010-09-09'
Description: This template deploys an Amazon EventBridge rule that captures
             Elastic Container Service task history for persistence in Amazon CloudWatch.

Parameters:
  ServiceName:
    Type: String
    Description: The name of the ECS service that you would like to capture events from
  ServiceArn:
    Type: String
    Description: The full ARN of the service that you would like to capture events from

Resources:

  # A CloudWatch log group for persisting the Amazon ECS events
  ServiceEventLog:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub /benchmark/${ServiceName}-events

  # Create the EventBridge rule that captures deployment events into the CloudWatch log group
  CaptureServiceDeploymentEvents:
    Type: AWS::Events::Rule
    Properties:
      Description: !Sub 'Capture service deployment events from the ECS service ${ServiceName}'
      # Which events to capture
      EventPattern:
        source:
          - aws.ecs
        detail-type:
          - "ECS Deployment State Change"
          - "ECS Service Action"
        resources:
          - !Ref ServiceArn
      # Where to send the events
      Targets:
        - Arn: !Sub arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:${ServiceEventLog}
          Id: 'CloudWatchLogGroup'
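
  # Create an EventBridge rule that captures task state change events into the
  # CloudWatch log group. NOTE: task state change events carry the task ARN,
  # not the service ARN, in their resources field, so this rule is a sketch
  # that matches on the task group instead, assuming the tasks were launched
  # by the service and therefore belong to the group "service:<ServiceName>"
  CaptureTaskEvents:
    Type: AWS::Events::Rule
    Properties:
      Description: !Sub 'Capture task state change events from the ECS service ${ServiceName}'
      # Which events to capture
      EventPattern:
        source:
          - aws.ecs
        detail-type:
          - "ECS Task State Change"
        detail:
          group:
            - !Sub 'service:${ServiceName}'
      # Where to send the events
      Targets:
        - Arn: !Sub arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:${ServiceEventLog}
          Id: 'CloudWatchLogGroup'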

  # Create a log group resource policy that allows EventBridge to put logs into
  # the log group
  LogGroupForEventsPolicy:
    Type: AWS::Logs::ResourcePolicy
    Properties:
      PolicyName: EventBridgeToCWLogsPolicy
      PolicyDocument: !Sub
      - >
        {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Sid": "EventBridgetoCWLogsPolicy",
              "Effect": "Allow",
              "Principal": {
                "Service": [
                  "delivery.logs.amazonaws.com",
                  "events.amazonaws.com"
                ]
              },
              "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
              ],
              "Resource": [
                "${LogArn}"
              ]
            }
          ]
        }
      - { LogArn: !GetAtt ServiceEventLog.Arn }

The template requires two input parameters:

  • ServiceName - The name of the ECS service that you would like to capture events from. Example: sample-webapp
  • ServiceArn - The full ARN (Amazon Resource Name) of the service. Example: arn:aws:ecs:us-west-2:123456789012:service/sample-webapp

You can deploy this template using the CloudFormation console, or the AWS CLI using a command like:

Language: sh
aws cloudformation deploy \
  --template-file eventbridge-ecs-task-events.yml \
  --stack-name eventbridge-ecs-task-events \
  --capabilities CAPABILITY_IAM \
  --parameter-overrides \
     ServiceName=sample-webapp \
     ServiceArn=arn:aws:ecs:us-west-2:123456789012:service/sample-webapp
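
Once the stack is deployed, you can check that events are arriving by tailing the log group that the template created. This assumes AWS CLI v2, which provides the aws logs tail command:

Language: sh
# Follow events for the sample-webapp service as they arrive
aws logs tail /benchmark/sample-webapp-events --follow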

Sample ECS Task Event

Once deployed, Amazon EventBridge captures ECS events into Amazon CloudWatch. Each event is a full point-in-time snapshot of the ECS task's state. The following JSON is an example of what an event looks like:

File: sample-ecs-event.json
Language: json
{
  "version": "0",
  "id": "b38a1269-debf-7ada-9576-f69ce2752526",
  "detail-type": "ECS Task State Change",
  "source": "aws.ecs",
  "account": "209640446841",
  "time": "2022-05-31T20:12:43Z",
  "region": "us-east-2",
  "resources": [
    "arn:aws:ecs:us-east-2:209640446841:task/benchmark-cluster-ECSCluster-TOl9tY939Z2a/0c45c999f51741509482c5829cebb82e"
  ],
  "detail": {
    "attachments": [
      {
        "id": "4b01ba81-00ee-471d-99dc-9d215bff56e5",
        "type": "eni",
        "status": "DELETED",
        "details": [
          {
            "name": "subnetId",
            "value": "subnet-04f3a518011557633"
          },
          {
            "name": "networkInterfaceId",
            "value": "eni-0685196fd7cf97f27"
          },
          {
            "name": "macAddress",
            "value": "06:a8:e2:77:53:2c"
          },
          {
            "name": "privateDnsName",
            "value": "ip-10-0-121-242.us-east-2.compute.internal"
          },
          {
            "name": "privateIPv4Address",
            "value": "10.0.121.242"
          }
        ]
      }
    ],
    "attributes": [
      {
        "name": "ecs.cpu-architecture",
        "value": "x86_64"
      }
    ],
    "availabilityZone": "us-east-2b",
    "capacityProviderName": "FARGATE",
    "clusterArn": "arn:aws:ecs:us-east-2:209640446841:cluster/benchmark-cluster-ECSCluster-TOl9tY939Z2a",
    "connectivity": "CONNECTED",
    "connectivityAt": "2022-05-31T18:08:12.052Z",
    "containers": [
      {
        "containerArn": "arn:aws:ecs:us-east-2:209640446841:container/benchmark-cluster-ECSCluster-TOl9tY939Z2a/0c45c999f51741509482c5829cebb82e/1471ad51-9c53-4d56-82d9-04b26f82369e",
        "exitCode": 0,
        "lastStatus": "STOPPED",
        "name": "stress-ng",
        "image": "209640446841.dkr.ecr.us-east-2.amazonaws.com/stress-ng:latest",
        "imageDigest": "sha256:75c15a49ea93c3ac12c73a283cb72eb7e602d9b09fe584440bdf7d888e055288",
        "runtimeId": "0c45c999f51741509482c5829cebb82e-2413177855",
        "taskArn": "arn:aws:ecs:us-east-2:209640446841:task/benchmark-cluster-ECSCluster-TOl9tY939Z2a/0c45c999f51741509482c5829cebb82e",
        "networkInterfaces": [
          {
            "attachmentId": "4b01ba81-00ee-471d-99dc-9d215bff56e5",
            "privateIpv4Address": "10.0.121.242"
          }
        ],
        "cpu": "256",
        "memory": "512"
      }
    ],
    "cpu": "256",
    "createdAt": "2022-05-31T18:08:08.011Z",
    "desiredStatus": "STOPPED",
    "enableExecuteCommand": false,
    "ephemeralStorage": {
      "sizeInGiB": 20
    },
    "executionStoppedAt": "2022-05-31T20:12:20.683Z",
    "group": "service:stress-ng",
    "launchType": "FARGATE",
    "lastStatus": "STOPPED",
    "memory": "512",
    "overrides": {
      "containerOverrides": [
        {
          "name": "stress-ng"
        }
      ]
    },
    "platformVersion": "1.4.0",
    "pullStartedAt": "2022-05-31T18:08:22.205Z",
    "pullStoppedAt": "2022-05-31T18:08:23.109Z",
    "startedAt": "2022-05-31T18:08:23.817Z",
    "startedBy": "ecs-svc/3941167241989127803",
    "stoppingAt": "2022-05-31T20:12:06.844Z",
    "stoppedAt": "2022-05-31T20:12:43.412Z",
    "stoppedReason": "Scaling activity initiated by (deployment ecs-svc/3941167241989127803)",
    "stopCode": "ServiceSchedulerInitiated",
    "taskArn": "arn:aws:ecs:us-east-2:209640446841:task/benchmark-cluster-ECSCluster-TOl9tY939Z2a/0c45c999f51741509482c5829cebb82e",
    "taskDefinitionArn": "arn:aws:ecs:us-east-2:209640446841:task-definition/stress-ng:2",
    "updatedAt": "2022-05-31T20:12:43.412Z",
    "version": 6
  }
}

As with the telemetry, these task events can be queried using Amazon CloudWatch Log Insights. The following sample query fetches the state change history for a single task:

Language: query
fields @timestamp, detail.attachments.0.status as ENI, detail.lastStatus as status, detail.desiredStatus as desiredStatus, detail.stopCode as stopCode, detail.stoppedReason as stoppedReason
| filter detail.taskArn = "<your task ARN>"
| sort @timestamp desc
| limit 20

Example output:

Language: txt
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|       @timestamp        |    ENI     |     status     | desiredStatus |         stopCode          |                             stoppedReason                              |
|-------------------------|------------|----------------|---------------|---------------------------|------------------------------------------------------------------------|
| 2022-06-01 19:03:41.000 | DELETED    | STOPPED        | STOPPED       | ServiceSchedulerInitiated | Scaling activity initiated by (deployment ecs-svc/8045142110272152487) |
| 2022-06-01 19:03:08.000 | ATTACHED   | DEPROVISIONING | STOPPED       | ServiceSchedulerInitiated | Scaling activity initiated by (deployment ecs-svc/8045142110272152487) |
| 2022-06-01 19:02:45.000 | ATTACHED   | RUNNING        | STOPPED       | ServiceSchedulerInitiated | Scaling activity initiated by (deployment ecs-svc/8045142110272152487) |
| 2022-06-01 18:56:56.000 | ATTACHED   | RUNNING        | RUNNING       |                           |                                                                        |
| 2022-06-01 18:56:51.000 | ATTACHED   | PENDING        | RUNNING       |                           |                                                                        |
| 2022-06-01 18:56:29.000 | PRECREATED | PROVISIONING   | RUNNING       |                           |                                                                        |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

This abbreviated table shows the history of state changes that an AWS Fargate task goes through as it starts up and shuts down normally.

For a task that stopped unexpectedly at some point in the past, this history of events can be very useful for understanding exactly what happened to the task and why.
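
You can also search across all captured events rather than looking at a single task. As a sketch, assuming the stop codes shown in the sample event above, this query surfaces tasks that stopped for reasons other than routine scaling by the service scheduler:

Language: query
fields @timestamp, detail.taskArn as task, detail.stopCode as stopCode, detail.stoppedReason as stoppedReason
| filter detail.lastStatus = "STOPPED" and detail.stopCode != "ServiceSchedulerInitiated"
| sort @timestamp desc
| limit 20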

If you are interested in service-level or container instance-level events, you can find samples of what those events look like in the Amazon ECS events documentation.

See Also