Job Title:
Site Reliability Engineer/Support Analyst
Key Responsibilities:
* System Health & Reliability: Monitor, maintain, and improve the performance, availability, and reliability of data processing systems and Spark platforms through automated failure detection, proactive mitigation strategies, and disaster recovery planning.
* Collaboration & Integration: Work closely with data engineers, software developers, and stakeholders to support seamless integration, deployment, and optimisation of data solutions and infrastructure upgrades.
* Automation & Efficiency: Drive system efficiency by identifying automation opportunities, implementing scripts and tools to reduce manual intervention, and enhancing operational workflows.
* Incident & Problem Management: Provide end-to-end production support (including smoke checks, change control, on-call rotations, and incident response), conduct root cause analysis to resolve recurring issues, and maintain robust documentation.
* Governance & Compliance: Ensure adherence to ITIL frameworks, data security protocols, and backup and recovery standards by maintaining detailed runbooks and aligning with IT compliance controls.
Required Skills and Qualifications:
* Bachelor's degree in Computer Science, Data Science, or a related field.
* Proven track record as an SRE or in a similar role within Data Engineering, particularly in managing Spark platforms.
* Excellent problem-solving skills with a keen eye for detail.
* Strong communication and collaboration abilities.
* Ability to work efficiently under pressure in a fast-paced environment.
Why Choose Us?
* We're committed to making our company the best place to work in the country.