AI Infrastructure Engineer, Scale

Responsibilities:

Build highly available, observable, performant, and cost-effective APIs for model training.
Participate in our team’s on-call process to ensure the availability of our services.
Own projects end-to-end, from requirements, scoping, and design through implementation, in a highly collaborative and cross-functional environment.
Exercise good judgment in building systems and tools, and know when to build versus buy, with an eye toward cost efficiency.

Requirements:

4+ years of experience building machine learning training pipelines or inference services in a production setting.
Experience with distributed training frameworks and techniques such as DeepSpeed and FSDP.
Experience building, deploying, and monitoring complex microservice architectures.
Experience with Python, Docker, Kubernetes, and infrastructure as code (e.g., Terraform).