Principal Site Reliability Engineer

Sydney

Tyro Payments

Posted: 11 February

About Tyro

At Tyro, we’re into business big time. Through our integrated payments, banking and lending solutions, we’re here to ensure nothing stands in the way of Australian business success. With over 21 years' experience, we combine the best people, technology, and partners to deliver simplified payments and seamless business banking to our customers. We power more than 76,000 merchants across Australia and work with almost 800 partners to create seamless experiences for hospitality, retail, services and health providers.

It starts with you. We’re obsessed with the success of our people. When you join, you’ll get support to do your best work. Our close to 600 Tyros are a highly collaborative team, so you’ll work with smart, motivated and friendly people across Tyro. We are fast-paced and innovative, and we live our values daily: commit to greatness, stay hungry, wow the customer, be good and win together. You’ll have opportunities to learn new skills with hands-on experience, further your career, and help unleash the potential of our customers, one payment at a time.

Watch a video to step inside life at Tyro.

About The Role

As a Principal Site Reliability Engineer (SRE) at Tyro, you will be responsible for establishing and maturing Tyro’s Site Reliability Engineering function from inception. This role focuses on improving the reliability, resilience, scalability, and observability of the platforms that underpin Tyro’s critical financial services.

This is a senior individual contributor role with organisation-wide technical leadership responsibility. While the role does not have formal people management accountability, it is expected to lead through influence by defining standards, setting technical direction, and coaching engineers across platform and product teams.

You will shape Tyro’s reliability engineering strategy, define SRE frameworks and operating models, and embed reliability and operational excellence into the software delivery lifecycle. This role is ideal for someone who has successfully built an SRE function from the ground up and is motivated by delivering lasting improvements in system reliability, operational maturity, and engineering culture.

Scope and Impact

This role operates across Tyro’s most critical platforms and services, influencing system architecture, engineering practices, operational risk management, and service resilience at an organisational level. The Principal SRE is a trusted advisor to engineering and business leaders on reliability, observability, and operational health.

What You’ll Do

SRE Function Leadership and Enablement

* Establish and mature Tyro’s Site Reliability Engineering function, including its charter, engagement model, standards, and success metrics.
* Define and embed SRE principles, frameworks, and ways of working across engineering teams.
* Provide technical leadership and strategic direction for reliability engineering across platform and product domains.
* Coach, mentor, and uplift engineers in reliability, automation, observability, and operational excellence practices.
* Act as a trusted advisor to senior engineering and business stakeholders on reliability, resilience, and operational risk.

Reliability and Systems Architecture

* Architect and evolve reliable, scalable systems across hybrid environments, including on-premises Kubernetes platforms and Amazon Elastic Kubernetes Service (EKS).
* Define and implement Service Level Objectives (SLOs), error budgets, reliability standards, and operational health metrics for critical services.
* Lead capacity planning, failover design, and disaster recovery initiatives to ensure service continuity.
* Influence architectural decisions to improve availability, performance, and resilience of distributed systems.

Observability and Operational Excellence

* Define and evolve Tyro’s observability strategy, standards, and operating model, with OpenTelemetry as the foundational approach for metrics, logs, and traces.
* Champion the adoption of OpenTelemetry across engineering teams to ensure consistent, vendor-agnostic instrumentation and high-quality telemetry.
* Lead the evolution and effective use of Tyro’s observability stack, including Grafana, Prometheus, Mimir, Loki, Tempo, and OpenTelemetry collectors and pipelines.
* Establish best practices for instrumentation, semantic conventions, context propagation, and sampling strategies to enable meaningful end-to-end visibility.
* Promote proactive monitoring, alerting, and distributed tracing practices that enable early issue detection and rapid incident resolution.
* Partner with platform and product teams to embed observability and operational readiness into every stage of the software development lifecycle.
* Drive continuous improvements in telemetry pipelines, signal correlation, and observability data quality to improve insight into system performance and customer experience.
* Ensure observability practices support operational risk management, service reliability objectives, and relevant regulatory expectations.

Automation, Tooling, and SDLC Integration

* Lead automation initiatives across build, deploy, and operate phases using GitHub, GitHub Actions, ArgoCD, and GitOps practices.
* Contribute to and maintain infrastructure-as-code (IaC) foundations using Terraform and related tooling.
* Design and build tools that reduce operational toil, improve reliability controls, and integrate with CI/CD workflows.
* Promote engineering practices that improve repeatability, safety, and operational efficiency.

Incident Management and Continuous Improvement

* Lead and support complex incident response activities, providing calm, structured technical leadership during high-impact events.
* Facilitate high-quality post-incident reviews focused on root cause analysis and systemic remediation.
* Define and improve incident management, operational readiness, and resilience processes.
* Partner with engineering teams to prevent recurrence and continuously improve service reliability.

Cross-Functional Integration and Jira Service Management

* Integrate SRE practices with Jira Service Management (JSM) to improve service visibility, incident workflows, and operational reporting.
* Collaborate with Engineering, Product, and Operations teams to align incident, change, and asset management with SRE principles.
* Support the adoption of consistent operational processes across teams.

What You’ll Bring

* 10+ years of experience in Site Reliability Engineering, Platform Engineering, or Systems Engineering.
* Demonstrated experience establishing an SRE function from inception and scaling it to a successful, sustainable operating model.
* Strong background in hybrid and cloud-native architectures, including Kubernetes on bare metal and AWS EKS.
* Deep expertise in observability tooling and practices, including hands-on experience with OpenTelemetry.
* Strong experience with Grafana, Prometheus, Mimir, Loki, and Tempo.
* Hands-on experience with GitHub, GitHub Actions, and ArgoCD in modern CI/CD environments.
* Proficiency in Terraform, Linux, and scripting or programming languages such as Go, Python, or Bash.
* Strong understanding of networking, security, and performance considerations in distributed systems.
* A collaborative and influential communicator who bridges development, operations, and business outcomes.
* A builder who enjoys taking ideas from concept through to a durable, embedded capability.
* Comfortable operating in ambiguity and driving clarity, standards, and alignment across teams.
* Passionate about fostering a culture of transparency, accountability, and continuous improvement.

Nice to Have

* Experience in financial services, payments, or other highly regulated environments.
* Familiarity with Jira Service Management (JSM), including Operations, Assets, and third-party integrations.
* Knowledge of regulatory frameworks such as APRA CPS 230/234, PCI DSS, or ISO 27001.
* Experience implementing chaos engineering, resilience testing, or fault injection practices.
* Contributions to open-source reliability or observability tooling.
* Experience designing and operating OpenTelemetry-based observability architectures at scale.

What Success Looks Like

* A clearly defined and adopted SRE operating model embedded across Tyro’s engineering teams.
* Consistent use of SLOs, error budgets, and reliability metrics for critical services.
* Organisation-wide adoption of OpenTelemetry-based observability standards.
* Measurable improvements in service availability and incident response times, and a measurable reduction in operational toil.
* Improved engineering reliability maturity through coaching, standards, and tooling.
* Increased executive confidence in platform resilience, observability, and operational risk posture.

What’s in it for you?

We’ve worked hard to create an environment that’s big on diversity, inclusion, and flexibility, and one that suits the changing needs of team members across Australia. Here are just some of the things Tyros tell us they love about working here.

You’ll Also Receive

* A mix of in-office and remote working
* Learning and career development opportunities
* 16 weeks’ paid primary carer’s leave
* 12 weeks’ paid secondary carer’s leave
* Annual team-based volunteer day
* Birthday Leave
* Power Up Day (Additional day of leave)
* Weekly team social events, snacks, craft beer and wine, ping pong and video games
* Taco Tuesdays
* Mental health and wellness initiatives
* Novated leasing

Tyro is committed to a diverse, inclusive workplace where everyone thrives. We welcome applicants of all backgrounds and are an equal opportunity employer. If you need accommodations or adjustments at any stage of the recruitment process, simply inform our Talent team during your conversation with them.

Still with us?

If you’ve got this far, you might be a great fit for us. Don’t tick all the boxes above? That’s OK. Apply anyway and our Talent team will review your profile; you might be a fit for future roles.
