Job Overview
We are seeking a skilled, motivated Reliability Engineer to join our team. In this role, you will be central to keeping our systems reliable and scalable.
Your primary responsibilities will include investigating and resolving production issues, leading incident management efforts, improving observability and detection, and driving automation initiatives.
You will work closely with cross-functional teams to embed reliability practices and operational excellence into our software development lifecycle.
As a Reliability Engineer, you will have the opportunity to grow and develop your skills in a dynamic and supportive environment.
-----------------------------------
Responsibilities
1. Production Issue Resolution: Investigate and resolve production issues that impact our customers, ensuring minimal disruption and fast recovery.
2. Incident Management: Collaborate with engineering and support teams to drive major incident resolution, post-mortems, and systemic improvements.
3. Observability and Detection: Design, maintain, and evolve monitoring, alerting, and logging to catch issues before they affect our customers.
4. Automation: Eliminate repetitive tasks through scripting and tooling, enabling faster, safer, and more consistent operations.
-----------------------------------
Requirements
* Strong communication, problem-solving, and analytical skills, with the ability to balance customer impact against technical priorities.
* Solid understanding of cloud-native applications, distributed systems, and modern web technologies.
* Hands-on experience with public cloud services (AWS, GCP, or Azure).
* SQL knowledge (querying, troubleshooting, and performance tuning).
* Familiarity with the software delivery lifecycle, CI/CD practices, and DevOps culture.
* Proficiency in scripting/programming (Bash, Python, Go, or Java) and experience with observability and alerting tools (Prometheus, Grafana, ELK, OpsGenie, Datadog, etc.) are a plus.
-----------------------------------
What We Offer
* Company equity for all employees.
* Learning and development opportunities.
* Hybrid/remote working arrangements (location dependent).
* 30 days of paid time off per year.
* Four weeks of paid sabbatical after five years of service.
* Additional benefits based on location.