Purpose:
To help drive quality outcomes for clients and enable engineering teams by providing capabilities focused on the supportability and reliability of clients' software.
1. The role ensures both engineering and operationally focused teams work seamlessly as one combined, end-to-end product engineering team.
2. The role must demonstrate an ability to work with a large global product engineering team supporting a complex mix of software, clients and delivery outcomes.
3. Must have experience in AWS services, Terraform, monitoring tools and configuration management tools.
Accountabilities & Deliverables
4. Approaches operations as a software problem and develops solutions to eliminate toil.
5. Improves reliability, availability, quality, and deployment frequency of client products
6. Designs and builds software and systems to manage platform infrastructure and applications
7. Runs, monitors, and observes production applications and improves the overall lifecycle of reliability services.
8. Coaches and mentors other SRE team members.
9. Works closely with engineering/product teams to define and align SLIs/SLOs to business needs and to troubleshoot application and infrastructure issues.
10. Optimises on-call and escalation process as part of an incident management system
11. Finds better ways of doing things, with a view to the future and to technology engineering trends.
Core Skills, Knowledge and Attributes
12. Working knowledge of modern software and technology:
13. CI/CD processes and tools such as BuildKite, Jenkins, Azure DevOps or any other similar tool
14. Experience in at least one programming language such as Java, Node.js, Golang or C#.
15. Strong scripting skills in Bash and/or Python
16. Strong knowledge of AWS services such as VPC, EC2, S3, RDS, ECS, etc.
17. Expertise in IaC patterns and tools such as Terraform / CloudFormation
18. Strong knowledge of configuration management tools such as Ansible / Chef / Puppet
19. Expertise in Docker container management and container orchestration platforms such as ECS, AKS, EKS, GKE or native Kubernetes
20. Knowledge of databases (e.g. Postgres, SQL Server, Oracle or NoSQL DBs)
21. Experience in application monitoring and relevant monitoring tools such as Datadog, New Relic, Dynatrace or AppDynamics.
22. Experience in defining SLIs that align with the team's availability and latency objectives.
23. Engineering practices: availability, reliability, scalability and disaster recovery
24. Is a modern thinker who looks to the future while ensuring practical, commercial outcomes are achieved in the present, and does not aim to preserve the status quo.
25. Always communicates relevant, valuable information positively and confidently, both internally and externally.
26. Has an ability to relate effectively and positively with all people at all levels.
27. Creates loyalty, trust and a following.
28. A combination of personality traits: smart, innovative, low ego, collaborative, honest, and of high integrity, intensity and passion.
29. Capable of contributing to broader business conversations beyond operational engineering and technology
30. Solicits the involvement of others to build a sense of ownership and engagement. Must have the confidence to act quickly and decisively.
31. Can define a delivery plan and identify & propose any supporting budget
32. Can empathise with people and clients appropriately and use that empathy in effective decision making
33. Can effectively lead and engage with remote teams
34. Proven ability to navigate ambiguity and collaborate with other functional leaders to provide great outcomes.