Advanced Techniques for Amazon ECS Container Health Checks
Introduction
Amazon Elastic Container Service (Amazon ECS) provides a container health check feature that allows you to define health checks for your containerized workloads. This health check runs locally on the container instance or Fargate hosting your ECS task. It checks whether your application running in the container is available and responding as expected.
The container health check provides visibility into the availability of your application from the instance level. However, it does not monitor external network availability or other components of your architecture. As David Yanacek points out, health checks can be implemented at multiple levels of your architecture.
In this pattern, we dive into best practices around leveraging ECS container health checks:
- Improving visibility into the health check process
- Enhancing the security posture of your health checks
- Extending health checks by building a dedicated health check utility as part of your existing Amazon ECS Task.
The goal is to provide guidelines to help you effectively utilize ECS container health checks for monitoring workload availability.
Container Health Checks
Amazon ECS supports defining container health checks in task definitions. Health checks are commands or scripts that run locally within a container to validate application health and availability.
When a health check is defined in a task definition, the container runtime will execute the health check process inside the container and evaluate the exit code to determine the application health. If the health check fails consistently, ECS will mark the container and task as unhealthy and take remediation actions if the task is part of a service.
Because health checks execute inside the container, any tools used such as curl
must be included in the container image. The health check reaches the application via the container's loopback interface at localhost
or 127.0.0.1
.
The example below shows a task definition with a health check defined to run curl
against the nginx web server running in the same container:
{
"containerDefinitions": [
{
"name": "containerone",
"image": "public.ecr.aws/docker/library/nginx:latest",
"healthCheck": {
"command": [
"CMD-SHELL",
"curl -f http://localhost/ || exit 1"
],
"interval": 10,
"retries": 3,
"startPeriod": 0,
"timeout": 5
},
…
}
]
}
In this health check, curl
will be executed against http://localhost/ inside the container. The curl
binary must be included in the container image, as is the case in the official nginx image.
The ||
bash operator means that if curl
returns a non-zero exit code, indicating it was unable to reach the web server, the second exit 1
command will execute instead. ECS counts non-zero exit codes from the health check as failures.
Based on the retries
count, if ECS receives 3 consecutive health check failures, the container will be marked as unhealthy. If any essential container in a task is unhealthy, the entire task is marked unhealthy. For tasks that are part of an ECS service, unhealthy tasks will automatically be replaced.
While simple health checks can be useful, the example above has some drawbacks:
- Health check output is not visible in ECS console or APIs, limiting observability.
- Including additional binaries like
curl
in container images goes against security best practices of reducing the container attack surface area.
Optimizing the container health check
This pattern will provide multiple examples on how container health checks can be optimized in Amazon ECS.
Capturing the output of the health check process
Overview
When defining a container in an Amazon ECS task definition, you can specify a logging driver such as Amazon CloudWatch Logs. This logging driver captures the stdout and stderr streams from the container and forwards them to a central logging service. However, Amazon ECS does not capture the output of the health check process by default. You can optimize your health checks by forwarding the health check output to the stdout/stderr streams of your application. This allows the logging driver to collect the health check output.
Solution
The example below shows how to redirect the health check output so that it is forwarded to the central logging service. This builds on the first example task definition by routing the health check output to the first process in the container using >> /proc/1/fd/1
. It also ensures both stdout and stderr are captured using 2>&1
.
“healthCheck": {
"command": [
"CMD-SHELL",
"curl -f http://localhost/ >> /proc/1/fd/1 2>&1 || exit 1"
],
…
}
By routing the health check output to the application stdout/stderr streams, the configured logging driver can pick up this output and forward it to the central logging server. This provides observability into the health checks results.
Capturing and annotating the output of the health check process
Overview
Our health check process outputs debug information that gets logged, but this results in noisy and hard to parse logs. We want to transform the output to only include relevant data like status codes and timestamps.
Solution
We can encapsulate the health check logic in a bash script called healthcheck.sh
. This allows us to process the raw output and format it before logging. For example, the script can:
- Execute the health check curl command
- Extract and set key variables like HTTP status codes
- Construct log messages including relevant metadata
- Print the log output to stdout for the logging driver
Here is an example healthcheck.sh
script that implements this:
#!/bin/bash
# Define endpoint to check
endpoint="http://localhost/"
# Get current timestamp
timestamp=$(date +"%Y-%m-%d %T")
# Check if curl command is available
if ! [ -x "$(command -v curl)" ]; then
echo "$timestamp - Error: curl is not installed in the container image." >&2
echo "$timestamp - Please install it and try again." >&2
exit 1
fi
# Perform health check and redirect output to stdout
output=$(curl --max-time 5 -s -o /dev/null -w "%{http_code}" $endpoint 2>&1)
http_code=$(echo "$output" | tail -n1)
if [[ $http_code == "000" ]]; then
echo "$timestamp - Error: Connection timed out while trying to reach $endpoint"
exit 1
fi
# Log output to stdout
echo "$timestamp - Health check $endpoint: HTTP status code $http_code" >&1
# Check if output contains "200"
if [[ $http_code == "200" ]]; then
exit 0
else
exit 1
fi
To implement this, the curl
command in the health check definition needs to be replaced with a call to the bash script.
“healthCheck": {
"command": [
"CMD-SHELL",
"/healthcheck.sh >> /proc/1/fd/1"
],
…
}
WARNING
The healthscript.sh file must be copied to the container image.
Now the script will handle executing the health check, and will send the output to the stdout/stderr streams which get captured in the log stream. The resulting logs will only contain relevant information and metadata, easy to parse and troubleshoot.
Wrapping health checks in this bash script allows container logs to be more useful for diagnosing issues, by filtering noise and annotating the output.
Reducing the attack surface of a container image
Overview
When securing a container image, it's best practice to reduce the attack surface by removing non-required components like binaries, libraries, and shells. These components could potentially be leveraged by an attacker to exploit the container. Some container security tools may flag the inclusion of bash and curl as risks. To remove these while still providing a health check, a container health check process can be implemented in a module or binary.
To reduce the number of additional packages in the container image, build the health check using the same runtime environment as the application. For example, if the container runs a Python web application, implement the health check as a Python script that reuses the existing Python interpreter. The script can use Python's requests module to check the application's health, eliminating the need to include bash, curl, or other tools solely for the health check. This approach streamlines the container image by leveraging the application's existing dependencies.
Solution
A healthcheck.py script can leverage the requests library to query the app similar to a curl bash script, but more secured:
import logging
import requests
# Logging to stout
logging.basicConfig(filename="/proc/1/fd/1",
format="{'time':'%(asctime)s', 'level': '%(levelname)s', \
'message': '%(message)s'}",
level=logging.INFO)
URL = 'http://localhost'
PORT = '80'
TIMEOUT = 5
logging.info("Starting Health Check to %s:%s", URL, PORT)
# Health Check
response = requests.get(URL + ':' + PORT + '/health', timeout=TIMEOUT)
if response.status_code == requests.codes.ok:
logging.info(
f'Health check {URL}: HTTP status code {response.status_code}')
exit(0)
else:
logging.error(
f'Health check {URL}: HTTP status code {response.status_code}')
exit(1)
This removes non-required attack surfaces while still providing a health check.
A similar approach can be taken for other languages like Golang:
package main
import (
"fmt"
"log/slog"
"net/http"
"os"
"time"
)
const (
exitCodeError = 1
exitCodeSuccess = 0
)
var (
outfile, _ = os.Create("/proc/1/fd/1")
logger = slog.New(slog.NewJSONHandler(outfile, nil))
)
func runHealthcheck(url string, timeout time.Duration) int {
logger.Info(fmt.Sprintf("Querying Endpoint %v", url))
client := &http.Client{
Timeout: timeout,
}
r, err := http.NewRequest("HEAD", url, nil)
if err != nil {
logger.Error(fmt.Sprintf("error creating healthcheck request: %v", err))
return exitCodeError
}
resp, err := client.Do(r)
if err != nil {
logger.Error(fmt.Sprintf("Health check %s: Error %s", url, err))
return exitCodeError
}
resp.Body.Close()
logger.Info(fmt.Sprintf("Health check %s: HTTP status code %s", url, resp.Status))
return exitCodeSuccess
}
func main() {
healthcheckURL := "http://localhost:8080/health"
healthcheckTimeout := time.Second * 5
logger.Info(fmt.Sprintf("Starting Health Check to %v", healthcheckURL))
exitcode := runHealthcheck(healthcheckURL, healthcheckTimeout)
os.Exit(exitcode)
}
Conclusion
In this post, we demonstrated the health check options that you can configure on Amazon ECS tasks. We discussed advanced scenarios where you can set up the container health check to send logs to CloudWatch Logs. This provides more detailed information about why Amazon ECS tasks become unhealthy. We also covered two approaches to implement custom health checks - using the container command or application code. These allow you to evaluate container health in an advanced way.