Company | Scale
Job title | AI Infrastructure Engineer, ML Data Platform |
Job location | San Francisco, CA; New York, NY |
Type | Full Time |
Responsibilities:
- Design, implement, and maintain scalable data platforms to support diverse R&D and applied ML workloads.
- Partner with ML researchers, product engineers, and operations teams to align data infrastructure with organizational goals.
- Collaborate with ML researchers to build data access tools that help advance frontier post-training research.
- Participate in our team’s on-call process to ensure the availability of our services.
- Own projects end-to-end, from requirements and scoping through design and implementation, in a highly collaborative, cross-functional environment.
Requirements & Skills:
- 2+ years of experience in building and operating large-scale distributed data systems that support ML workloads.
- Expertise in modern data platform technologies.
- Experience with standard containerization and infrastructure tooling such as Docker, Kubernetes, Helm, and Terraform.
- Strong problem-solving skills and the ability to work effectively in a fast-paced, dynamic environment.
Nice to have:
- Familiarity with ML development tools such as PyTorch, HuggingFace, or Weights & Biases.
- Experience with a variety of storage systems: object (S3), document (MongoDB), relational (Postgres), key-value (Redis), and search (Elasticsearch).
- Exposure to orchestration platforms like Temporal, Airflow, or AWS Step Functions.
- Experience supporting post-training workflows such as evaluation, fine-tuning, and RLHF in LLM systems.
- Experience working in a fast-moving startup or high-scale ML infrastructure environment.