Tool Development and Management: Build, manage, and maintain tools for system reliability, including dashboards, logging systems, and pager systems.
Infrastructure Maintenance: Help maintain and enhance our CI/CD pipelines, logging infrastructure, and other operational systems crucial for MLOps.
Monorepo Management: Keep the monorepo up to date with the latest dependency and security updates, ensuring a secure and efficient development environment.
Vendor Collaboration: Assist in implementing and maintaining infrastructure and systems managed by external vendor teams.
Incident Management: Lead and participate in incident management processes, including troubleshooting, root cause analysis, and implementing corrective measures to prevent recurrence.
Requirements & Skills:
AI/ML Knowledge: Solid understanding of AI/ML principles and technologies.
System Monitoring and Tools: Experience with system monitoring and observability tools. Knowledge of GCP, Vertex AI, or other cloud platforms is highly beneficial.
Programming and Scripting: Proficiency in a programming language such as Python, including scripting for automation.
Problem-Solving Skills: Strong analytical and problem-solving skills, with the ability to work under pressure.
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
5+ years of experience in the technology field.
Proven experience in a reliability engineering role, preferably with a focus on AI/ML systems.
Experience in incident management and performance optimization.