Job Overview
We are seeking a highly skilled Senior Site Reliability Engineer to join our team.
Main Responsibilities:
* Infrastructure Engineering: Design and implement scalable and resilient infrastructure solutions, ensuring they meet service level agreements (SLAs) and adapt dynamically to changing demands.
* Automation and DevOps: Develop and evolve infrastructure as code (IaC) with tools like Terraform, and build continuous integration/continuous deployment (CI/CD) pipelines with GitHub Actions, to enable safe and rapid feature delivery.
* Reliability and Observability: Partner with teams to establish meaningful service level indicators (SLIs) and service level objectives (SLOs), implement real-time observability using tools like Datadog, Prometheus, and Grafana, and proactively identify potential risks before they impact users.
* Incident Response and Improvement: Lead the on-call rotation, facilitate blameless post-mortems, and foster a culture of continuous improvement to ensure outages become valuable learning opportunities.
* Mentorship and Knowledge Sharing: Share expertise through pairing with engineers, conducting brown-bag sessions on reliability best practices, and contributing to the growth of the global engineering organization.
* Security and Compliance: Collaborate with the Security team to integrate security controls into CI/CD, runtime environments, and disaster-recovery plans, ensuring the protection of customers and citizens.
Requirements:
* Demonstrated experience in production SRE, DevOps, or infrastructure roles, preferably in a SaaS or large-scale web environment.
* Expertise in at least one public cloud (AWS, Azure, or GCP) and comfort designing hybrid migrations from on-premises to cloud.
* Proven track record with IaC tools (Terraform, CloudFormation, or similar) and container orchestration (Kubernetes, ECS, AKS, OpenShift).
* Proven track record with virtual machine orchestration/provisioning and resiliency strategies (KubeVirt, Packer, Ansible, etc.).
* Strong coding/scripting skills (Python, Go, Bash, etc.) and a passion for building reusable, tested libraries and tooling.
* Deep understanding of monitoring, logging, and tracing frameworks (Prometheus/Grafana, ELK/OpenSearch, Jaeger, etc.).
* Excellent communication skills, including the ability to thrive in cross-functional teams and translate complex technical issues into clear, actionable plans.