Senior Site Reliability Engineer, Stability AI

Company	Stability AI
Job title	Senior Site Reliability Engineer
Job location	Remote – United States
Type	Full Time

Responsibilities:

Developing and enforcing SRE best practices and standards across the organization.
Architecting and managing scalable systems in AWS and other cloud environments, focusing on high availability and resilience.
Implementing and maintaining infrastructure as code using Terraform.
Setting up and refining monitoring, logging, and alerting systems.
Driving incident management and root cause analysis to improve system reliability.
Championing SRE principles and mentoring junior team members.
Collaborating with development teams to enhance CI/CD pipelines.

Qualifications:

Experience scaling resource-intensive systems, whether storage, networking, or compute.
Knowledge of and experience with Kubernetes or other container scaling solutions.
Background in software development or automation scripting.
Knowledge of and experience with Grafana, the ELK stack, or similar tools.