Implement and maintain CI/CD pipelines for AI and machine learning projects to ensure reliable builds, automated testing, and robust deployments
Monitor and ensure the reliability, availability, and performance of AI applications, particularly those built on large language models (LLMs) and retrieval-augmented generation (RAG)
Collaborate with AI research teams to operationalize machine learning models and systems efficiently
Develop and enforce best practices for version control, configuration management, and testing of AI-driven software solutions
Utilize MLOps tools such as Kubeflow, MLflow, or TensorFlow Extended (TFX) to streamline the machine learning lifecycle from experimentation to production
Implement monitoring solutions that track both system metrics and model performance to facilitate proactive issue resolution
Participate in on-call rotations to support the operational health of critical systems, employing SRE principles to meet service-level objectives (SLOs) and reduce downtime
Requirements & Skills:
Bachelor’s degree in Computer Science, Engineering, or a related field
Proven experience as a DevOps Engineer or SRE, with a strong background in software development and automation
Experience deploying and managing LLMs, including RAG-based systems
Proficiency with CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI) and infrastructure as code (e.g., Terraform, Ansible)
Knowledge of containerization and container orchestration technologies (e.g., Docker, Kubernetes)
Familiarity with MLOps tools and practices to support machine learning lifecycle management
Strong problem-solving skills and the ability to work in a dynamic, fast-paced environment