Maish Saidel-Keesing
Senior Developer Advocate at AWS
Mar 7, 2024

Cleaning up orphaned Amazon ECS container instances

I would like to share an interesting problem that came from some of our customers: an issue they encountered with some of the EC2 instances in their ECS cluster. This blog post will suggest two solutions to this issue.

But first I want to give you a bit of background on how things usually work.

In Amazon ECS we have Capacity Providers that manage the scaling of infrastructure for tasks in your clusters. Each cluster can have one or more capacity providers and an optional capacity provider strategy. Today we are going to talk about a capacity provider for running ECS tasks on Amazon EC2. The capacity provider is made up of two main parts: an Auto Scaling group (which defines the compute instances, their types, which AMI you use for the instances, and a number of additional configurations) and the scaling configuration.
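
If you want to see how those two parts come together in practice, here is a sketch of wiring an existing Auto Scaling group to a cluster as a capacity provider with the AWS CLI (the capacity provider name, cluster name, and Auto Scaling group ARN below are placeholders):

# Create a capacity provider backed by an existing Auto Scaling group
aws ecs create-capacity-provider \
  --name my-capacity-provider \
  --auto-scaling-group-provider "autoScalingGroupArn=arn:aws:autoscaling:us-east-1:111122223333:autoScalingGroup:example-uuid:autoScalingGroupName/my-asg,managedScaling={status=ENABLED,targetCapacity=100}"

# Attach it to the cluster and make it the default for newly launched tasks
aws ecs put-cluster-capacity-providers \
  --cluster MyCluster \
  --capacity-providers my-capacity-provider \
  --default-capacity-provider-strategy capacityProvider=my-capacity-provider,weight=1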

When an EC2 instance is launched as part of the Auto Scaling group, you can either use the Amazon ECS-optimized AMI or, if you would like, create your own AMI based on the published recipes here - https://github.com/aws/amazon-ecs-ami.

As part of the provisioning of the instance, you provide (at a minimum) the appropriate configuration with your cluster name to allow the ECS agent to register your instance with your ECS cluster (this information is part of your user-data when an instance is launched). For more information about these and other agent runtime options, see Amazon ECS container agent configuration.

#!/bin/bash
cat <<'EOF' >> /etc/ecs/ecs.config
ECS_CLUSTER=MyCluster
EOF

That is how your instances are registered to your ECS cluster when everything works as it should, but as we all know things can sometimes go wrong (and probably will).

Let’s have a look at the following example.

You have built your own custom AMI, based on the scripts above, and even after extensive testing, software issues can still occur and things break.

There are a number of reasons an EC2 instance could fail to register to the ECS cluster:

  1. a misconfigured ECS agent with an invalid cluster name
  2. a misconfigured instance profile with missing permissions to interact with ECS
  3. a misconfigured network in your VPC preventing the ECS agent from connecting to ECS
  4. a broken container runtime (docker, containerd)
  5. broken dependencies

So what does it look like when an instance fails to register to the cluster?

  1. The Auto Scaling group provisions a new EC2 instance
  2. The EC2 instance starts, passes its basic health checks, and registers as healthy from an EC2 perspective
  3. The instance runs through the user data and fails to register to the ECS cluster (for reasons such as those above)

And then? Well, nothing really. From the perspective of the Auto Scaling group, the EC2 instance is up and running and healthy. But ECS has no knowledge of the EC2 instance, so the instance will just sit there, idling along doing nothing in an unregistered, orphaned state. Since there is no straightforward way to alarm on such occurrences, it can take some time for you to recover from this state, all while paying for unused EC2 instances.

I mentioned in the beginning that there were two solutions.

  1. Ensuring this does not happen in the first place
  2. Cleaning up these orphaned instances, if you did not use the solution above

Preventing orphaned instances in the first place

My colleague Nathan Peck has already taken care of this for you by providing the Amazon ECS Capacity Provider for EC2 instances pattern, which includes an AWS CloudFormation template that you can deploy in your own account.

But I want to dwell a bit on how this is actually solved in the code, because it might not be that obvious.

In this template (lines 428-454) there is a task definition and a service that will be created on your cluster.

HealthinessDaemonDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: 'healthiness-daemon'
    Memory: 10
    RequiresCompatibilities:
      - EC2
    ExecutionRoleArn: !GetAtt ECSTaskExecutionRole.Arn
    ContainerDefinitions:
      - Name: 'healthcheck-pause'
        Image: public.ecr.aws/docker/library/busybox:latest
        EntryPoint:
          - /bin/sh
          - -c
        Command:
          - while :; do sleep 2073600; done

# This launches one copy of the healthiness daemon onto each host
# in the cluster.
HealthinessDaemon:
  Type: AWS::ECS::Service
  Properties:
    ServiceName: 'healthiness-daemon'
    Cluster: !Ref ECSCluster
    LaunchType: EC2
    SchedulingStrategy: DAEMON
    TaskDefinition: !Ref HealthinessDaemonDefinition

This snippet creates a daemon service that places a single copy of the task on each EC2 instance in the cluster. In other words, when an instance is added to your cluster, this task will run on it. The task is essentially a small container that sleeps for a really long time (2,073,600 seconds, or 24 days). This is a lightweight task that will hardly take up any resources on the instance.
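
If you want to confirm that the daemon is actually running on your registered instances, one way to do so (a sketch, assuming your cluster is named MyCluster) is to list the running tasks for that family with the AWS CLI:

# List running tasks from the healthiness-daemon family;
# you should see one task per registered container instance
aws ecs list-tasks \
  --cluster MyCluster \
  --family healthiness-daemon \
  --desired-status RUNNING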

And here is the snippet of code that prevents the issue from occurring (lines 260-288)

# Check the ECS API to see if this instance is available as capacity
# inside of the ECS cluster, and wait for it to run the healthiness daemon
verify_instance_health:
  commands:
    ECSHealthCheck:
      command: |
        echo "Introspecting ECS agent status"
        find_container_instance_arn() {
          CONTAINER_INSTANCE_ARN=$(curl --connect-timeout 1 --max-time 1 -s http://localhost:51678/v1/metadata | jq -r '.ContainerInstanceArn')
        }
        find_container_instance_arn
        while [ "$CONTAINER_INSTANCE_ARN" == "" ]; do sleep 2; find_container_instance_arn; done
        echo "Container Instance ARN: $CONTAINER_INSTANCE_ARN"

        echo "Waiting for at least one running task"
        count_instance_tasks() {
          NUMBER_OF_TASKS=$(curl -s http://localhost:51678/v1/tasks | jq '.Tasks | length')
        }
        count_instance_tasks
        while [ $NUMBER_OF_TASKS -lt 1 ]; do sleep 2; count_instance_tasks; done

        echo "Instance $CONTAINER_INSTANCE_ARN is now hosting $NUMBER_OF_TASKS task(s)"        
# This signals back to CloudFormation once the instance has become healthy in ECS
# and has started hosting at least one task
signal_cfn:
  commands:
    SignalCloudFormation:
      command: !Sub |
        /opt/aws/bin/cfn-signal -e $? --stack ${AWS::StackId} --resource ECSAutoScalingGroup --region ${AWS::Region}

As part of the init script of the launch template, there is a health check step, verify_instance_health, that checks whether there is an ECS task running on the instance. Only after this health check passes successfully will the EC2 instance report as healthy.

This solution makes the EC2 instance aware of its own state in the ECS cluster and ensures that it has been registered correctly. If the ECS agent was not able to register the instance with the cluster, then no tasks will be started on the instance, the health check will fail, and the instance will be terminated.
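
If you want to see what the script is checking, you can query the same local agent introspection endpoint by hand from an instance (a sketch; it assumes jq is installed, and the exact fields returned can vary by agent version):

# Ask the local ECS agent which container instance ARN it registered as
curl -s http://localhost:51678/v1/metadata | jq -r '.ContainerInstanceArn'

# Count the tasks the agent is currently aware of on this instance
curl -s http://localhost:51678/v1/tasks | jq '.Tasks | length'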

But hold on!! I can already hear you thinking out loud. This is all great for the initial deployment of the CloudFormation template, but what about after the first deployment, when the Auto Scaling group scales up and adds new instances? How does a newly provisioned instance also ensure that this health check runs and fails if it does not register properly with the ECS cluster? There is no CloudFormation deployment running at that point, so there will not be any signal to the CloudFormation service to fail.

Here is where this snippet comes into play (lines 226-235)

UpdatePolicy:
  # This configures the ASG to wait on resource signals from the cfn-init
  # script that runs on the instance itself. Depending on the expected
  # total size of your ASG you may need to tune the parameters below
  AutoScalingRollingUpdate:
    MaxBatchSize: 5
    MinInstancesInService: 1 # Note that ECS draining hook will maintain instances that are still hosting tasks
    PauseTime: PT2M
    WaitOnResourceSignals: true
    MinSuccessfulInstancesPercent: 100

The WaitOnResourceSignals property does the magic. The documentation states:

Specifies whether the Auto Scaling group waits on signals from new instances during an update. Use this property to ensure that instances have completed installing and configuring applications before the Auto Scaling group update proceeds. AWS CloudFormation suspends the update of an Auto Scaling group after new EC2 instances are launched into the group. AWS CloudFormation must receive a signal from each new instance within the specified PauseTime before continuing the update. To signal the Auto Scaling group, use the cfn-signal helper script or SignalResource API.

This means that every instance launched in the Auto Scaling group must also send a cfn-signal to complete (which means that the health check has to pass successfully).
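
For reference, the SignalResource API mentioned in the documentation can also be called directly with the AWS CLI; here is a sketch (the stack name and instance ID are placeholders):

# Send a SUCCESS signal for this instance to the Auto Scaling group resource
aws cloudformation signal-resource \
  --stack-name my-ecs-stack \
  --logical-resource-id ECSAutoScalingGroup \
  --unique-id i-0123456789abcdef0 \
  --status SUCCESS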

Cleaning up orphaned instances

If you are only coming across this post now, there is a good chance that you have not implemented the methods in the template from this pattern (yet), and you might have orphaned EC2 instances in your account that never successfully registered with your cluster.

You can run a few AWS CLI commands in bash to collect the information from both the ECS cluster and the Auto Scaling group, and compare the two to find the difference.

ECS_CLUSTER="demo-cluster"
ASG_NAME="demo-cluster-asg"

# EC2 instance IDs that the Auto Scaling group believes it is running
ASG_INSTANCES=$(aws autoscaling describe-auto-scaling-instances \
 --query "AutoScalingInstances[?AutoScalingGroupName=='${ASG_NAME}'].InstanceId")

# Container instances that actually registered with the ECS cluster,
# resolved to their underlying EC2 instance IDs
CONTAINER_INSTANCES=$(aws ecs list-container-instances --cluster $ECS_CLUSTER \
 --query containerInstanceArns)
CONTAINER_INSTANCE_EC2_ID=$(for instance in $CONTAINER_INSTANCES
    do
        aws ecs describe-container-instances --cluster $ECS_CLUSTER \
        --container-instances $instance \
        --query 'containerInstances[].ec2InstanceId'
    done)

# Compare the two lists side by side to spot instances missing from ECS
sdiff <(echo $CONTAINER_INSTANCE_EC2_ID) <(echo $ASG_INSTANCES)
[                                                               [
    "i-0216431f2808ef331"     |     "i-0216431f2808ef331",
                              >     "i-0474e361b077018e3",
                              >     "i-0a6f58230149d30ff"
]                                                               ]

In the output above, I have one EC2 instance that is registered in both my ECS cluster and the Auto Scaling group, and two instances that only exist in the Auto Scaling group but not in my cluster. With these instance IDs, you can run a command that will terminate these instances, as shown below.
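
For example, one way to clean them up (a sketch using the orphaned instance IDs from the output above) is to terminate them through the Auto Scaling group, decrementing the desired capacity at the same time so the group does not immediately launch replacements:

# Terminate each orphaned instance and reduce the ASG desired capacity
for instance_id in i-0474e361b077018e3 i-0a6f58230149d30ff
do
    aws autoscaling terminate-instance-in-auto-scaling-group \
     --instance-id $instance_id \
     --should-decrement-desired-capacity
done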

I do want to point out a few challenges with the prevention solution described above:

  • It only works if you deploy the entire solution
  • It only works with AWS CloudFormation
  • It supports Linux operating systems only

I would also like to share with you this Containers on AWS pattern. It provides an alternative, comprehensive solution for addressing orphaned instances with existing Auto Scaling groups already deployed in your account.