We are seeking a highly motivated and experienced Site Reliability Engineer to join our growing team. You will be responsible for the reliability, performance, and scalability of our production systems, and you will play a critical role in making sure they are designed and operated for resiliency and high availability.
In this role, you will:
– Collaborate with cross-functional teams to design, deploy, and operate large-scale, high-availability systems
– Develop and maintain automation tools and processes to improve the reliability and efficiency of our systems
– Act as a technical lead for SRE-related initiatives, providing guidance and mentorship to junior team members
– Work closely with software engineers to diagnose and resolve production issues
– Continuously monitor and evaluate the health of our systems, identifying and addressing potential issues before they escalate
– Participate in an on-call rotation to provide 24/7 support for production systems
– Drive innovation and improvement in our infrastructure and processes through experimentation and research
– Participate in the design and implementation of disaster recovery plans
Qualifications
1. Bachelor's degree or higher in Computer Science or a related technical discipline
2. 5+ years of experience in Site Reliability Engineering, Production Engineering, or a similar role, working with large-scale distributed systems
3. Strong understanding of containers and container orchestration tools such as Docker and Kubernetes
4. In-depth knowledge of Unix/Linux systems administration, network fundamentals and storage systems
5. Proficiency in one or more programming languages, such as C, C++, Java, Python, Go, Ruby, Rust, or JavaScript
6. Strong analytical and problem-solving skills
7. Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams