Senior Site Reliability Engineer, Stability AI

Company: Stability AI
Job title: Senior Site Reliability Engineer
Job location: Remote – United States
Type: Full Time

Responsibilities:

  • Developing and enforcing SRE best practices and standards across the organization.
  • Architecting and managing scalable systems in AWS and other cloud environments, focusing on high availability and resilience.
  • Implementing and maintaining infrastructure as code using Terraform.
  • Setting up and refining monitoring, logging, and alerting systems.
  • Driving incident management and root cause analysis to improve system reliability.
  • Championing SRE principles and mentoring junior team members.

Requirements & Skills:

  • Experience collaborating with development teams to enhance CI/CD pipelines.
  • Experience scaling resource-intensive systems, be it storage, networking, or compute.
  • Knowledge of and experience with Kubernetes or other container scaling solutions.
  • Background in software development or automation scripting.
  • Knowledge of and experience with Grafana, ELK stack, or similar tools.
