Enabling reliable and highly scalable distributed ML training on Meta’s large-scale GPU training infrastructure, with a focus on GenAI/LLM scaling
Requirements & Skills:
Currently has, or is in the process of obtaining, a Bachelor’s degree in Computer Science, Computer Engineering, or a related technical field, or equivalent practical experience. The degree must be completed prior to joining Meta.
Specialized experience in one or more of the following machine learning/deep learning domains: distributed ML training, GPU architecture, ML systems, AI infrastructure, high-performance computing, performance optimization, or machine learning frameworks (e.g., PyTorch).
Preferred Qualifications:
PhD in Computer Science, Computer Engineering, or a related technical field
Experience with NCCL and with improving distributed GPU reliability/performance over RoCE/InfiniBand
Experience working with deep learning frameworks such as PyTorch, Caffe2, or TensorFlow
Experience with both data-parallel and model-parallel training, such as Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), tensor parallelism, and pipeline parallelism
Experience developing AI frameworks and trainers to accelerate large-scale distributed deep learning models
Experience in HPC and parallel computing
Knowledge of GPU architectures and CUDA programming