AI Infrastructure Engineer, Scale

Company: Scale
Job title: AI Infrastructure Engineer, ML Data Platform
Job location: San Francisco, CA; New York, NY
Type: Full Time

Responsibilities:

  • Design, implement, and maintain scalable data platforms to support diverse R&D and applied ML workloads.
  • Partner with ML researchers, product engineers, and operations teams to align data infrastructure with organizational goals.
  • Collaborate with ML researchers to build data access tools that help advance frontier post-training research.
  • Participate in our team’s on-call process to ensure the availability of our services.
  • Own projects end-to-end, from requirements and scoping through design and implementation, in a highly collaborative, cross-functional environment.

Requirements & Skills:

  • 2+ years of experience building and operating large-scale distributed data systems that support ML workloads.
  • Expertise in modern data platform technologies.
  • Experience with standard containerization and deployment technologies such as Kubernetes, Helm, Terraform, and Docker.
  • Strong problem-solving skills and the ability to work effectively in a fast-paced, dynamic environment.

Nice to have:

  • Familiarity with ML development tools such as PyTorch, HuggingFace, or Weights & Biases.
  • Experience with a variety of storage systems: object (S3), document (MongoDB), relational (Postgres), and distributed (Redis, Elasticsearch).
  • Exposure to orchestration platforms like Temporal, Airflow, or AWS Step Functions.
  • Experience supporting post-training workflows such as evaluation, fine-tuning, and RLHF in LLM systems.
  • Experience working in a fast-moving startup or high-scale ML infrastructure environment.
