Team Building: Recruit, hire, and train a skilled team of ProdOps Engineers to establish a fully operational ProdOps team from scratch. Develop team goals, performance metrics, and performance evaluation processes. Foster a collaborative and inclusive team culture that promotes teamwork, innovation, and excellence.
Infrastructure Monitoring and Troubleshooting: Oversee the monitoring of our company’s global infrastructure using management tools to detect and resolve infrastructure issues proactively. Lead the troubleshooting efforts to resolve incidents in a timely manner, and escalate issues to appropriate teams for further investigation and resolution when necessary.
Incident Management: Establish and enforce incident management procedures, including incident reporting, escalation, and resolution processes. Coordinate with other engineering teams and external vendors to resolve technology incidents, minimize downtime, and ensure service level agreements (SLAs) are met.
Infrastructure Automation & Optimization: Continuously analyze infrastructure performance data, identify trends, and develop and implement strategies to optimize performance and minimize disruptions. Collaborate with other engineering teams to automate and optimize infrastructure for improved performance, and to implement upgrades and changes that enhance infrastructure reliability, capacity, and security.
Documentation and Reporting: Develop and maintain comprehensive documentation, including network diagrams, runbooks & standard operating procedures (SOPs), and incident reports. Generate regular reports on performance, incidents, and trends for senior management and stakeholders.
Vendor Management: Establish and maintain relationships with technology equipment vendors, service providers, and other relevant stakeholders. Coordinate with vendors to resolve technical issues, manage maintenance contracts, and ensure timely delivery of services and equipment.
Training and Development: Provide ongoing training and professional development opportunities to the ProdOps team to enhance their technical skills, industry knowledge, and job performance. Mentor and coach team members to foster their growth and career advancement.
Requirements & Skills:
Proven experience in building and managing a 24×7 Production Operations team working with peers and colleagues in a distributed global operation
Strong leadership skills with the ability to motivate, mentor, and develop a high-performing team
In-depth knowledge of on-premises and cloud technology concepts, protocols, and procedures
Strong understanding of monitoring tools, incident management processes, automation, and optimization strategies
Ability to analyze complex technical issues, develop effective solutions, and make informed decisions in a fast-paced environment
Excellent communication skills, both written and verbal, with the ability to communicate technical concepts to non-technical stakeholders.
In-depth understanding of the Linux operating environment: kernel tuning, network stack tuning, system observability & instrumentation, and security & access management.
Solid understanding of layer 2-7 networking fundamentals and the relationship between servers & services, and the transit of their packets through network hardware.
In-depth experience engineering and maintaining a private-cloud infrastructure: Bare-metal, vSphere, KVM, Kubernetes.
Experience with tools like Ansible, Terraform, Docker, Kafka, Nexus
Experience with observability platforms: Prometheus, ELK, Jaeger, Grafana, Nagios, Zabbix
Familiarity with Big Data tools: Hadoop, HDFS, Spark, HBase
Ability to write code in Go, Python, Bash, or Perl for automation.