AI DevOps / SRE Engineer, EPAM Systems

Company: EPAM Systems
Job title: AI DevOps / SRE Engineer
Job location: Remote
Type: Full-time

Responsibilities:

  • Implement and maintain CI/CD pipelines for AI and machine learning projects, with robust deployment strategies
  • Monitor and ensure the reliability, availability, and performance of AI applications, particularly those involving large language models (LLMs) and retrieval-augmented generation (RAG)
  • Collaborate with AI research teams to operationalize machine learning models and systems efficiently
  • Develop and enforce best practices for version control, configuration management, and testing of AI-driven software solutions
  • Utilize MLOps tools such as Kubeflow, MLflow, or TensorFlow Extended (TFX) to streamline the machine learning lifecycle from experimentation to production
  • Implement monitoring solutions that track both system metrics and model performance to facilitate proactive issue resolution
  • Participate in on-call rotations to support the operational health of critical systems, employing SRE principles to meet service-level objectives (SLOs) and reduce downtime

Requirements & Skills:

  • Bachelor’s degree in Computer Science, Engineering, or a related field
  • Proven experience as a DevOps Engineer or SRE, with a strong background in software development and automation
  • Experience deploying and managing LLMs, including techniques such as RAG
  • Proficient in CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI) and infrastructure as code (e.g., Terraform, Ansible)
  • Knowledge of containerization and container orchestration technologies (e.g., Docker, Kubernetes)
  • Familiarity with MLOps tools and practices to support machine learning lifecycle management
  • Strong problem-solving skills and ability to work in a dynamic, fast-paced environment
