Distributed machine learning with Amazon ECS
Running distributed machine learning (ML) workloads on Amazon Elastic Container Service (Amazon ECS) allows ML teams to focus on creating, training, and deploying models rather than spending time managing the container orchestration engine. With a simple architecture, transparent control plane upgrades, and native AWS Identity and Access Management (IAM) authentication, Amazon ECS provides a great environment to run ML projects. Additionally, Amazon ECS supports workloads that use NVIDIA GPUs and provides optimized images with pre-installed NVIDIA kernel drivers and Docker runtime.
When using a distributed training approach, multiple GPUs in a single instance (multi-GPU) or multiple instances with one or more GPUs (multi-node) are used for training. There are several techniques to accomplish distributed training: pipeline parallelism (different layers of the model are loaded on different GPUs), tensor parallelism (a single layer is split across multiple GPUs), and distributed data parallel …
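To make the distributed data parallel approach concrete, here is a minimal sketch using PyTorch's DistributedDataParallel wrapper. It is illustrative only and not code from this post; it assumes each process is launched by torchrun (or an equivalent ECS task setup) that sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables, and the model, data, and hyperparameters are placeholders.

```python
# Minimal distributed data parallel (DDP) sketch. Each process (one per GPU)
# holds a full replica of the model; gradients are averaged across processes
# during backward(). Assumes torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # NCCL is the usual backend for NVIDIA GPUs.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; each rank keeps a replica on its own GPU.
    model = nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # Dummy batch; in practice a DistributedSampler shards the dataset per rank.
    inputs = torch.randn(32, 128).cuda(local_rank)
    targets = torch.randint(0, 10, (32,)).cuda(local_rank)

    for _ in range(10):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()  # gradients are all-reduced across ranks here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A script like this would typically be launched with something along the lines of `torchrun --nproc_per_node=4 train.py` on each node, so that one process is started per GPU and the processes discover each other through the rendezvous settings torchrun provides.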