Production Operations Engineer, Index Exchange

Company Index Exchange
Job title Production Operations Engineer
Job location Toronto, Ontario, Canada
Type Full Time

Responsibilities:

  • Maintain oversight of internal metrics, including the health, security, and performance of on-premises and hybrid-cloud network and systems infrastructure environments.
  • Execute timely and effective incident response, identifying and mitigating issues to minimize downtime.  
  • Respond to alerts within our established SLOs and assist in incident triage, ensuring that the right teams are engaged to address issues promptly. 
  • Help ensure that system backups, disaster recovery plans, and security protocols are in place and maintained.
  • Serve as a point of contact for operational issues, providing both internal and external teams with technical support and retaining ownership of each issue until resolution.
  • Collaborate with product and software engineering teams to relay operational insights and requirements. 
  • Continuously identify opportunities for optimization and present findings to technical leads and management.  
  • Research and implement improvements enhancing systems performance and scalability. 
  • Continuously research and embrace technological advancements and industry best practices to deliver exceptional service. 
  • Actively identify and mitigate risks and escalate them so the team can proactively address present or anticipated operational challenges. 
  • Develop, implement, and maintain automation frameworks streamlining operational processes, reducing time spent on manual tasks. 
  • Identify catalysts for future optimization, including provisioning techniques, deployment optimization, ancillary services, pipelines, Ansible playbooks, power usage, and bandwidth.
  • Draft comprehensive documentation for system configurations, processes, and incident resolution procedures. 
  • Participate in knowledge sharing within the team and, with support on content and delivery, provide cross-training to other relevant departments.
  • Create and maintain runbooks and technical documentation, in addition to being familiar with internal and external escalation pathways.

Requirements & Skills:

  • In-depth understanding of the Linux operating environment: kernel tuning, network stack tuning, system observability & instrumentation, and security & access management. 
  • Solid understanding of layer 2-7 networking fundamentals and the relationship between servers & services, and the transit of their packets through network hardware. 
  • In-depth experience engineering and maintaining a private-cloud infrastructure: bare metal, vSphere, KVM, Kubernetes.
  • Experience with tools such as Ansible, Terraform, Docker, Kafka, and Nexus.
  • Experience with observability platforms: InfluxDB, Prometheus, ELK, Jaeger, Grafana, Nagios, Zabbix.
  • Familiarity with Big Data tools: Hadoop, HDFS, Spark, HBase.
  • Ability to write code in Go, Python, Bash, or Perl for automation. 
  • 3-6 years of proven experience in one of the following roles or a similar role:
    • DevOps Engineer  
    • Linux System Administrator 
    • Site Reliability Engineer (SRE) 
  • Built or maintained a private cloud infrastructure running CentOS/Rocky Linux on a mix of bare-metal, virtualization, and containerization.
  • Managed public cloud environments such as AWS, GCP, and Azure, and their federation into on-premises environments.
  • Life-cycle management of bare-metal servers such as Dell and Supermicro in globally distributed data centers (e.g. break-fix, baseboard/firmware updates).
  • Built or maintained on-premises and cloud Kubernetes clusters: kubeadm, kind, EKS, GKE.
  • Built or operated automation and orchestration frameworks for deployment and maintenance pipelines, e.g. Kafka, StackStorm, Ansible, Argo CD, and Terraform, to push out code or configuration updates and build new infrastructure systems.
