Company | Scale
Job title | AI Infrastructure Engineer, ML Data Platform |
Job location | San Francisco, CA; New York, NY |
Type | Full Time |
Responsibilities:
- Design, implement, and maintain scalable data platforms to support diverse R&D and applied ML workloads.
- Partner with ML researchers, product engineers, and operations teams to align data infrastructure with organizational goals.
- Collaborate with ML researchers to build data access tools that help advance frontier post-training research.
- Participate in our team’s on-call process to ensure the availability of our services.
- Own projects end-to-end, from requirements and scoping through design and implementation, in a highly collaborative, cross-functional environment.
Requirements & Skills:
- 2+ years of experience in building and operating large-scale distributed data systems that support ML workloads.
- Expertise in modern data platform technologies.
- Experience with standard containerization and infrastructure tooling such as Docker, Kubernetes, Helm, and Terraform.
- Strong problem-solving skills and the ability to work effectively in a fast-paced, dynamic environment.
Nice to have:
- Familiarity with ML development tools such as PyTorch, HuggingFace, or Weights & Biases.
- Experience with a variety of storage systems: object (S3), document (MongoDB), relational (Postgres), key-value (Redis), and search (Elasticsearch).
- Exposure to orchestration platforms like Temporal, Airflow, or AWS Step Functions.
- Experience supporting post-training workflows such as evaluation, fine-tuning, and RLHF in LLM systems.
- Experience working in a fast-moving startup or high-scale ML infrastructure environment.