Reliability & observability senior analyst (x8) - data centers

Sydney

Capital Executive Search

Posted: 12 June

Offer description

We are partnering with a rapidly growing AI infrastructure and cloud technology organisation to appoint eight IOC Reliability & Observability Senior Analysts to its Sydney-based Infrastructure Operations Centre (IOC).

This business operates some of the world's largest GPU compute environments, delivering AI training and inference capabilities at scale. With significant investment in high-performance computing infrastructure and a commitment to sustainable operations, they continue to expand their global footprint while building one of the most advanced AI cloud platforms in the market.

The Role

This position sits within a 24/7 Infrastructure Operations Centre (running on 8-hour shifts, 8am or 10am start) and plays a critical role in ensuring the reliability, observability and operational performance of large-scale GPU compute environments.

Working closely with engineering, infrastructure and operations teams, you'll improve incident detection, alert quality and operational visibility across a highly complex production environment. You will:

* Perform advanced Level 2 incident analysis across GPU clusters, networking and supporting infrastructure.
* Improve alert quality, routing, enrichment and monitoring effectiveness to reduce operational noise and accelerate response times.
* Maintain operational dashboards, reliability metrics and service health reporting used by both technical teams and leadership.
* Investigate recurring incidents and identify opportunities to improve detection, automation and operational workflows.
* Analyse GPU health, performance degradation and failure patterns to support proactive incident management.
* Work with AIOps-generated insights, validating automated detections and ensuring operational signals remain accurate and actionable.
* Support incident management, RCA processes and operational reporting activities.

Desired Background

* 2-5 years of experience within IOC, NOC, Site Reliability Engineering, Production Operations, Observability or Reliability-focused environments.
* Strong understanding of incident management, service reliability and operational performance metrics such as MTTD and MTTR.
* Experience working with Linux systems, infrastructure monitoring platforms and enterprise ITSM tooling.
* Exposure to large-scale distributed environments, cloud infrastructure, HPC environments or GPU-based compute platforms.
* Hands-on experience with observability tooling such as Splunk, Datadog or similar monitoring platforms.
* Ability to correlate logs, metrics and alerts across multiple technology domains to accelerate incident diagnosis and resolution.
* Experience improving alert quality, reducing false positives and optimising operational monitoring practices.
* Familiarity with automation, scripting or configuration-driven operational workflows.

The Opportunity

This is an opportunity to join a business operating at the forefront of AI infrastructure, high-performance computing and large-scale cloud operations. You'll gain exposure to cutting-edge GPU environments, advanced observability platforms and modern reliability engineering practices while helping shape the operational maturity of a rapidly scaling global technology organisation.

P.S. This company does not offer sponsorship. We are only reviewing applications from Australian citizens and permanent residents.
