Senior Site Reliability Engineer
Contract - 6 months
Sydney
Hybrid
Overview
We're looking for an experienced Senior Site Reliability Engineer to join a high-impact digital engineering team supporting one of Australia's most widely used customer-facing eCommerce applications.
This role is all about driving platform stability, performance, and scalability across a complex Azure and Kubernetes environment. You'll take ownership of monitoring, performance optimisation, and automation initiatives that ensure the digital platforms run smoothly.
The ideal candidate is a hands-on engineer who can hit the ground running, work autonomously, and help shape the platform strategy for the future.
Key Responsibilities
* Maintain and improve the reliability, performance, and scalability of large-scale customer-facing applications.
* Manage and optimise Azure Kubernetes Service (AKS) clusters, ensuring cost efficiency and right-sizing at scale.
* Implement and refine monitoring, alerting, and observability using tools such as Dynatrace and Azure-native monitoring solutions.
* Identify and reduce unnecessary logs and alerts to improve signal-to-noise ratio and platform insight.
* Work closely with software engineering teams (primarily .NET and GraphQL stacks) to diagnose performance issues and improve application behaviour within the clusters.
* Collaborate on platform automation, driving efficiency and consistency through Infrastructure as Code and CI/CD pipelines.
* Contribute to defining and executing the platform strategy to ensure reliability, maintainability, and scalability across digital services.
* Take ownership of incident response, post-mortem analysis, and ongoing performance tuning.
* Support and optimise Microsoft SQL environments that underpin core application services.
Skills & Experience
* 7+ years' experience in Site Reliability Engineering, DevOps, or Platform Engineering roles.
* Proven experience running and optimising Azure Kubernetes Service (AKS) clusters in production at scale.
* Strong background in application performance tuning and monitoring/alerting frameworks (preferably Dynatrace).
* Familiarity with .NET and GraphQL application architectures, and an ability to collaborate effectively with development teams to diagnose issues.
* Strong Microsoft SQL Server (MSSQL) experience for performance monitoring and troubleshooting.
* Deep understanding of observability, logging, metrics, and tracing best practices.
* Hands-on experience with automation, scripting, and Infrastructure as Code (PowerShell, Terraform, ARM templates, etc.).
* A proactive mindset focused on platform stability, cost optimisation, and continuous improvement.
* Excellent communication skills and the ability to work independently with minimal guidance.