Running GPU workloads with Amazon ECS and AWS Cloud Development Kit
About
This pattern shows how to set up a fleet of GPU instances and use Amazon ECS to launch GPU-enabled tasks across the cluster. You can use this pattern as the basis for your own GPU-accelerated machine learning workload orchestrated through Amazon ECS.
Set up the Cloud Development Kit
To use this pattern you need TypeScript and Node. First, ensure that you have Node.js installed on your development machine. Then create the following files:
- package.json
- tsconfig.json
- cdk.json
{
  "name": "ecs-cluster",
  "version": "1.0.0",
  "description": "ECS GPU Cluster and Task",
  "private": true,
  "scripts": {
    "build": "tsc",
    "watch": "tsc -w",
    "cdk": "cdk"
  },
  "author": {
    "name": "Amazon Web Services",
    "url": "https://aws.amazon.com",
    "organization": true
  },
  "license": "Apache-2.0",
  "devDependencies": {
    "@types/node": "^8.10.38",
    "aws-cdk": "2.102.0",
    "typescript": "~4.6.0",
    "ts-node": "^10.9.1"
  },
  "dependencies": {
    "aws-cdk-lib": "2.102.0",
    "constructs": "^10.0.0"
  }
}
The files above serve the following purposes:
- package.json: Used by npm or Yarn to identify and install all the required dependencies.
- tsconfig.json: Configures the TypeScript settings for the project.
- cdk.json: Tells CDK what command to run, and provides a place to pass other contextual settings to CDK.
Finally, run the following commands to install dependencies and set up your AWS account for the deployment:
npm install
npm run-script cdk bootstrap
Create the CDK application
Now create the following file to define the CDK application itself:
import autoscaling = require("aws-cdk-lib/aws-autoscaling");
import ec2 = require("aws-cdk-lib/aws-ec2");
import ecs = require("aws-cdk-lib/aws-ecs");
import cdk = require("aws-cdk-lib");

class ECSCluster extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, "MyVpc", { maxAzs: 2 });

    // Autoscaling group that will launch a fleet of instances that have GPUs
    const asg = new autoscaling.AutoScalingGroup(this, "MyFleet", {
      instanceType: ec2.InstanceType.of(
        ec2.InstanceClass.G3,
        ec2.InstanceSize.XLARGE4
      ),
      machineImage: ec2.MachineImage.fromSsmParameter(
        "/aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended/image_id"
      ),
      vpc,
      maxCapacity: 10,
    });

    // Attach the fleet to an ECS cluster with a capacity provider.
    // This capacity provider will automatically scale up the ASG
    // to launch more GPU instances when GPU tasks need them.
    const cluster = new ecs.Cluster(this, "EcsCluster", { vpc });
    const capacityProvider = new ecs.AsgCapacityProvider(
      this,
      "AsgCapacityProvider",
      { autoScalingGroup: asg }
    );
    cluster.addAsgCapacityProvider(capacityProvider);

    // Define a task that requires GPU. In this case we just run
    // nvidia-smi to verify that the task is able to reach the GPU.
    // Make sure to update the image to the latest NVIDIA CUDA image for your usage,
    // and adapt the image to your OS as well (for us, Amazon Linux is close to centos7).
    const gpuTaskDefinition = new ecs.Ec2TaskDefinition(this, "gpu-task");
    gpuTaskDefinition.addContainer("gpu", {
      essential: true,
      image: ecs.ContainerImage.fromRegistry("nvidia/cuda:12.3.1-base-centos7"),
      memoryLimitMiB: 80,
      cpu: 100,
      gpuCount: 1,
      command: ["sh", "-c", "nvidia-smi && sleep 3600"],
      logging: new ecs.AwsLogDriver({
        streamPrefix: "gpu-service",
        logRetention: 1,
      }),
    });

    // Request ECS to launch the task onto the fleet
    new ecs.Ec2Service(this, "gpu-service", {
      cluster,
      desiredCount: 2,
      // Service will automatically request capacity from the
      // capacity provider
      capacityProviderStrategies: [
        {
          capacityProvider: capacityProvider.capacityProviderName,
          base: 0,
          weight: 1,
        },
      ],
      taskDefinition: gpuTaskDefinition,
    });
  }
}

const app = new cdk.App();
new ECSCluster(app, "GpuTask");
app.synth();
Some things to note in this application:
- ec2.MachineImage.fromSsmParameter('/aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended/image_id'): The CDK application uses this SSM Parameter to automatically launch the latest version of the ECS Optimized AMI with built-in GPU support.
- gpuCount: 1: The ECS task is configured to require one GPU, which ECS reserves for the container.
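To make the capacity implication of gpuCount concrete, here is a minimal sketch of the arithmetic. It assumes each g3.4xlarge instance exposes a single GPU (consistent with the single Tesla M60 in the sample output later on this page) and that GPUs are reserved whole per task. The helper name instancesNeeded is hypothetical and is not part of the CDK or ECS APIs; it also ignores CPU and memory, which are small for this task.

```typescript
// Hypothetical helper (not part of CDK or ECS): estimates how many GPU
// instances the Auto Scaling group must run to place every task.
function instancesNeeded(
  desiredCount: number,    // number of tasks the service should run
  gpusPerTask: number,     // the container's gpuCount
  gpusPerInstance: number  // assumption: 1 for g3.4xlarge
): number {
  if (gpusPerTask <= 0 || gpusPerTask > gpusPerInstance) {
    throw new Error("A task must request between 1 and gpusPerInstance GPUs");
  }
  // GPUs are reserved whole, so only this many tasks fit on one instance.
  const tasksPerInstance = Math.floor(gpusPerInstance / gpusPerTask);
  return Math.ceil(desiredCount / tasksPerInstance);
}

// The service in this pattern: 2 tasks, 1 GPU each, 1 GPU per instance.
const needed = instancesNeeded(2, 1, 1);
console.log(needed);
```

Under these assumptions the two tasks need two instances, which stays well below the maxCapacity of 10 configured on the Auto Scaling group.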
Build and deploy the CDK app
Preview what will be deployed:
npm run-script cdk diff
Deploy the GPU cluster and task:
npm run-script cdk deploy
Make sure it worked
Open the Amazon ECS console and find the cluster that was deployed. Within that cluster you will see that the GPU-enabled service has launched two tasks. Select one of the tasks to view its details and click on the "Logs" tab to see its output. You should see something similar to this:
---------------------------------------------------------------------------------------------------
| timestamp | message |
|---------------|---------------------------------------------------------------------------------|
| 1681813598289 | Tue Apr 18 10:26:38 2023 |
| 1681813598289 | +-----------------------------------------------------------------------------+ |
| 1681813598289 | | NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.4 | |
| 1681813598289 | |-------------------------------+----------------------+----------------------+ |
| 1681813598289 | | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | |
| 1681813598289 | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | |
| 1681813598289 | | | | MIG M. | |
| 1681813598289 | |===============================+======================+======================| |
| 1681813598295 | | 0 Tesla M60 Off | 00000000:00:1E.0 Off | 0 | |
| 1681813598295 | | N/A 28C P0 36W / 150W | 0MiB / 7618MiB | 98% Default | |
| 1681813598295 | | | | N/A | |
| 1681813598295 | +-------------------------------+----------------------+----------------------+ |
| 1681813598295 | |
| 1681813598295 | +-----------------------------------------------------------------------------+ |
| 1681813598295 | | Processes: | |
| 1681813598295 | | GPU GI CI PID Type Process name GPU Memory | |
| 1681813598295 | | ID ID Usage | |
| 1681813598295 | |=============================================================================| |
| 1681813598295 | | No running processes found | |
| 1681813598295 | +-----------------------------------------------------------------------------+ |
---------------------------------------------------------------------------------------------------
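If you would rather check the output programmatically than by eye, the banner line is the useful part. The following is a minimal sketch, assuming you have already fetched the log lines as strings by whatever means you prefer. The function name parseSmiVersions and the SmiVersions type are hypothetical, and the pattern only targets the banner format shown above.

```typescript
// Result of parsing the nvidia-smi banner line.
interface SmiVersions {
  driverVersion: string;
  cudaVersion: string;
}

// Hypothetical helper: extracts the driver and CUDA versions from
// nvidia-smi log lines. Returns undefined when no banner line is found.
function parseSmiVersions(logLines: string[]): SmiVersions | undefined {
  const pattern = /Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)/;
  for (const line of logLines) {
    const match = pattern.exec(line);
    if (match) {
      return { driverVersion: match[1], cudaVersion: match[2] };
    }
  }
  return undefined;
}

// Two lines copied from the sample output above.
const sample = [
  "| 1681813598289 | Tue Apr 18 10:26:38 2023 |",
  "| 1681813598289 | | NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.4 | |",
];
console.log(parseSmiVersions(sample));
```

A missing banner (an undefined result) suggests that nvidia-smi did not run or could not reach the GPU, which is the condition this pattern's task is meant to surface.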
Clean Up
You can tear down the example stack when you are done with it by running:
npm run-script cdk destroy