Job Description
This is a challenging role in which an experienced Site Reliability Engineer will join our team and drive best practices across the business.
The ideal candidate will have a strong technical background, deep experience in SRE, and a passion for building and delivering robust processes. They will be responsible for leading technical discussions, identifying and tracking actions associated with incident situations, and collaborating with product teams to thoroughly analyze failures and integrate insights to improve service reliability, scalability, and operational efficiency.
Key Responsibilities:
* Own the incident management process, ensuring it drives enduring reliability across all products and services.
* Provide expert leadership during critical outages, coordinating multiple teams to ensure streamlined decision-making and quick resolution.
* Lead and advocate for the transformation to a world-leading SRE organization, promoting SRE principles within the Engineering Department.
* Promote a customer-focused approach by addressing and mitigating global customer environment issues, and foster a culture of continuous learning and technical excellence within the SRE team.
Requirements:
* Previous experience as a Site Reliability Engineer in an Operations or Engineering environment.
* Networking knowledge, including the ability to troubleshoot TCP/IP, SSL/TLS, DNSSEC, IPsec, and BGP issues.
* Coding experience (preferably Python) building tools, scripting, or automation.
* Strong communication skills, including the ability to translate technical issues and concepts into agreed actions.