Software Engineer – SystemML, Meta

Company: Meta
Job title: Software Engineer, SystemML – AI Networking
Job location: Menlo Park, CA, US
Type: Full Time

Responsibilities:

  • Enabling reliable and highly scalable distributed ML training on Meta’s large-scale GPU training infra with a focus on GenAI/LLM scaling

Requirements & Skills:

  • Currently has, or is in the process of obtaining, a Bachelor’s degree in Computer Science, Computer Engineering, or a relevant technical field, or has equivalent practical experience. The degree must be completed prior to joining Meta.
  • Specialized experience in one or more of the following machine learning/deep learning domains: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high-performance computing, performance optimizations, or Machine Learning frameworks (e.g. PyTorch).
  • PhD in Computer Science, Computer Engineering, or relevant technical field
  • Experience with NCCL and distributed GPU reliability/performance improvement on RoCE/InfiniBand
  • Experience working with DL frameworks like PyTorch, Caffe2, or TensorFlow
  • Experience with both data-parallel and model-parallel training, such as Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel
  • Experience in AI framework and trainer development for accelerating large-scale distributed deep learning models
  • Experience in HPC and parallel computing
  • Knowledge of GPU architectures and CUDA programming
  • Knowledge of ML, deep learning, and LLMs
