Maintain oversight of internal metrics, including the health, security, and performance of on-premises and hybrid-cloud network and systems infrastructure.
Execute timely and effective incident response, identifying and mitigating issues to minimize downtime.
Respond to alerts within our established SLOs and assist in incident triage, ensuring that the right teams are engaged to address issues promptly.
Help ensure that system backups, disaster recovery plans, and security protocols are in place and maintained.
Serve as a point of contact for operational issues, providing technical support to internal and external teams and retaining ownership of each issue until resolution.
Collaborate with product and software engineering teams to relay operational insights and requirements.
Continuously identify opportunities for optimization and present findings to technical leads and management.
Research and implement improvements that enhance system performance and scalability.
Continuously research and embrace technological advancements and industry best practices to deliver exceptional service.
Actively identify and mitigate risks and escalate them so the team can proactively address present or anticipated operational challenges.
Develop, implement, and maintain automation frameworks that streamline operational processes and reduce time spent on manual tasks.
Identify catalysts for future optimization, including provisioning techniques, deployment optimization, ancillary services, pipelines, Ansible playbooks, power usage, bandwidth, etc.
Draft comprehensive documentation for system configurations, processes, and incident resolution procedures.
Participate in knowledge sharing within the team and, with support provided on content and delivery, provide cross-training to other relevant departments.
Create and maintain runbooks and technical documentation, in addition to being familiar with internal and external escalation pathways.
Requirements & Skills:
In-depth understanding of the Linux operating environment: kernel tuning, network stack tuning, system observability & instrumentation, and security & access management.
Solid understanding of layer 2-7 networking fundamentals and the relationship between servers & services, and the transit of their packets through network hardware.
In-depth experience engineering and maintaining a private-cloud infrastructure: Bare-metal, vSphere, KVM, Kubernetes.
Experience with tools like Ansible, Terraform, Docker, Kafka, and Nexus.
Familiarity with Big Data tools: Hadoop, HDFS, Spark, and HBase.
Ability to write code in Go, Python, Bash, or Perl for automation.
3-6 years of proven experience in one of the following roles or a comparable role:
DevOps Engineer
Linux System Administrator
Site Reliability Engineer (SRE)
Built or maintained a private cloud infrastructure running CentOS/Rocky Linux on a mix of bare-metal, virtualization, and containerization.
Managed public cloud environments such as AWS, GCP, and Azure, and their federation into on-premises environments.
Life-cycle management of bare-metal servers such as Dell and Supermicro in globally distributed data centers (e.g. break-fix, baseboard/firmware updates).
Built or maintained on-premises and cloud Kubernetes clusters: kubeadm, kind, EKS, GKE.
Built or operated automation and orchestration frameworks for deployment and maintenance pipelines (e.g. Kafka, StackStorm, Ansible, Argo CD, and Terraform) to push out code or configuration updates and build new infrastructure systems.