CMP304-R1: AWS infrastructure for large-scale training at Facebook AI
AWS re:Invent 2019 - Un pódcast de AWS
Categorías:
In this session, the Facebook AI team discusses its major machine learning models and workloads and the infrastructure challenges it faced with large-scale distributed training. They share details of how they tested these ML workloads on AWS infrastructure and the results of this benchmarking. Then we discuss how the deep breadth of AWS infrastructure for ML workloads in compute, networking, and storage helps address large-scale ML challenges. Specifically, we dive deep into the AWS machine learning stack to choose the right Amazon EC2 platform to fit your ML workload while leveraging 100 Gbps networking and high-performance file systems to efficiently scale from a single GPU to hundreds or thousands.