Running GPU workloads with Amazon ECS and AWS Cloud Development Kit

Nathan Peck
Senior Developer Advocate at AWS

About

This pattern shows how to set up a fleet of GPU instances and use Amazon ECS to launch GPU-enabled tasks across the cluster. You can use this pattern as the basis for setting up your own GPU-accelerated machine learning workload orchestrated through Amazon ECS.

Setup Cloud Development Kit

This pattern uses TypeScript and Node.js. First, ensure that Node.js is installed on your development machine. Then create the following files:

  • package.json
  • tsconfig.json
  • cdk.json
File: package.json
Language: json
{
  "name": "ecs-cluster",
  "version": "1.0.0",
  "description": "ECS GPU Cluster and Task",
  "private": true,
  "scripts": {
    "build": "tsc",
    "watch": "tsc -w",
    "cdk": "cdk"
  },
  "author": {
    "name": "Amazon Web Services",
    "url": "https://aws.amazon.com",
    "organization": true
  },
  "license": "Apache-2.0",
  "devDependencies": {
    "@types/node": "^8.10.38",
    "aws-cdk": "2.102.0",
    "typescript": "~4.6.0",
    "ts-node": "^10.9.1"
  },
  "dependencies": {
    "aws-cdk-lib": "2.102.0",
    "constructs": "^10.0.0"
  }
}
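
The listing above references tsconfig.json and cdk.json but does not show their contents. The exact settings are flexible; the versions below are a minimal sketch based on what cdk init generates for a TypeScript project, and should work with the dependency versions in package.json:

File: tsconfig.json
Language: json
{
  "compilerOptions": {
    "target": "ES2018",
    "module": "commonjs",
    "lib": ["es2018"],
    "strict": true,
    "declaration": true,
    "noImplicitReturns": true,
    "inlineSourceMap": true,
    "experimentalDecorators": true,
    "typeRoots": ["./node_modules/@types"]
  }
}

cdk.json points CDK at the application entrypoint; since ts-node is already a dev dependency, it can run index.ts directly:

File: cdk.json
Language: json
{
  "app": "npx ts-node index.ts"
}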

The files above serve the following purpose:

  • package.json - Used by npm or Yarn to identify and install all the required dependencies.
  • tsconfig.json - Configures the TypeScript settings for the project.
  • cdk.json - Tells CDK what command to run, and provides a place to pass other contextual settings to CDK.

Last but not least, run the following commands to install the dependencies and set up your AWS account for the deployment:

Language: sh
npm install
npm run-script cdk bootstrap

Create the CDK application

Now create the following file to define the CDK application itself:

File: index.ts
Language: ts
import * as autoscaling from "aws-cdk-lib/aws-autoscaling";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import * as ecs from "aws-cdk-lib/aws-ecs";
import * as cdk from "aws-cdk-lib";

class ECSCluster extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, "MyVpc", { maxAzs: 2 });

    // Autoscaling group that will launch a fleet of instances that have GPUs
    const asg = new autoscaling.AutoScalingGroup(this, "MyFleet", {
      instanceType: ec2.InstanceType.of(
        ec2.InstanceClass.G3,
        ec2.InstanceSize.XLARGE4
      ),
      machineImage: ec2.MachineImage.fromSsmParameter(
        "/aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended/image_id"
      ),
      vpc,
      maxCapacity: 10,
    });

    // Attach the fleet to an ECS cluster with a capacity provider.
    // This capacity provider will automatically scale up the ASG
    // to launch more GPU instances when GPU tasks need them.
    const cluster = new ecs.Cluster(this, "EcsCluster", { vpc });
    const capacityProvider = new ecs.AsgCapacityProvider(
      this,
      "AsgCapacityProvider",
      { autoScalingGroup: asg }
    );
    cluster.addAsgCapacityProvider(capacityProvider);

    // Define a task that requires a GPU. In this case we just run
    // nvidia-smi to verify that the task is able to reach the GPU.
    // Make sure to use the latest NVIDIA CUDA image for your usage,
    // and match the image's OS to your host (Amazon Linux 2 is close to CentOS 7).
    const gpuTaskDefinition = new ecs.Ec2TaskDefinition(this, "gpu-task");
    gpuTaskDefinition.addContainer("gpu", {
      essential: true,
      image: ecs.ContainerImage.fromRegistry("nvidia/cuda:12.3.1-base-centos7"),
      memoryLimitMiB: 80,
      cpu: 100,
      gpuCount: 1,
      command: ["sh", "-c", "nvidia-smi && sleep 3600"],
      logging: new ecs.AwsLogDriver({
        streamPrefix: "gpu-service",
        logRetention: 1,
      }),
    });

    // Request ECS to launch the task onto the fleet
    new ecs.Ec2Service(this, "gpu-service", {
      cluster,
      desiredCount: 2,
      // Service will automatically request capacity from the
      // capacity provider
      capacityProviderStrategies: [
        {
          capacityProvider: capacityProvider.capacityProviderName,
          base: 0,
          weight: 1,
        },
      ],
      taskDefinition: gpuTaskDefinition,
    });
  }
}

const app = new cdk.App();

new ECSCluster(app, "GpuTask");

app.synth();

Some things to note in this application:

  • ec2.MachineImage.fromSsmParameter('/aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended/image_id') - The CDK application uses this SSM Parameter to automatically launch the latest version of the ECS Optimized AMI with built-in GPU support.
  • gpuCount: 1 - The ECS task is configured to require one GPU core.
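
Behind the scenes, gpuCount: 1 is rendered into the resourceRequirements field of the container definition in the generated ECS task definition, which looks roughly like this fragment:

Language: json
"resourceRequirements": [
  {
    "type": "GPU",
    "value": "1"
  }
]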

Build and deploy the CDK app

Preview what will be deployed:

Language: sh
npm run-script cdk diff

Deploy the GPU cluster and task:

Language: sh
npm run-script cdk deploy

Make sure it worked

Open up the Amazon ECS console and find the cluster that was deployed. Within that cluster you will see that the GPU-enabled service has launched two tasks. Select one of the tasks to view its details, then click the "Logs" tab to see its output. You should see something similar to this:

Language: txt
---------------------------------------------------------------------------------------------------
|   timestamp   |                                     message                                     |
|---------------|---------------------------------------------------------------------------------|
| 1681813598289 | Tue Apr 18 10:26:38 2023                                                        |
| 1681813598289 | +-----------------------------------------------------------------------------+ |
| 1681813598289 | | NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     | |
| 1681813598289 | |-------------------------------+----------------------+----------------------+ |
| 1681813598289 | | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC | |
| 1681813598289 | | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. | |
| 1681813598289 | |                               |                      |               MIG M. | |
| 1681813598289 | |===============================+======================+======================| |
| 1681813598295 | |   0  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 | |
| 1681813598295 | | N/A   28C    P0    36W / 150W |      0MiB /  7618MiB |     98%      Default | |
| 1681813598295 | |                               |                      |                  N/A | |
| 1681813598295 | +-------------------------------+----------------------+----------------------+ |
| 1681813598295 |                                                                                 |
| 1681813598295 | +-----------------------------------------------------------------------------+ |
| 1681813598295 | | Processes:                                                                  | |
| 1681813598295 | |  GPU   GI   CI        PID   Type   Process name                  GPU Memory | |
| 1681813598295 | |        ID   ID                                                   Usage      | |
| 1681813598295 | |=============================================================================| |
| 1681813598295 | |  No running processes found                                                 | |
| 1681813598295 | +-----------------------------------------------------------------------------+ |
---------------------------------------------------------------------------------------------------
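
If you prefer the command line, you can also list the running tasks with the AWS CLI. The cluster name below is a placeholder; substitute the generated name shown in the console or in the cdk deploy output:

Language: sh
aws ecs list-tasks --cluster <cluster-name>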

Clean Up

You can tear down the example stack when you are done with it by running:

Language: sh
npm run-script cdk destroy