Job Title: Data Reliability Specialist
We are looking for a Data Reliability Specialist to join our data team as a key member. This role focuses on ensuring the reliability, scalability, and performance of our data infrastructure and applications.
Key Responsibilities:
* Monitoring and maintaining the health, performance, and availability of data processing systems and infrastructure.
* Collaborating with data engineers, software developers, and stakeholders to ensure seamless integration and deployment of data solutions.
* Improving system reliability and efficiency by automating operational tasks through scripting and tooling.
* Troubleshooting and resolving issues related to data processing, infrastructure, and application performance.
* Implementing best practices for data security, retention, backup, and recovery.
* Providing production support, including smoke checks, incident management, and change control.
* Conducting root cause analysis to identify and implement corrective actions for recurring issues.
* Maintaining documentation, runbooks, and troubleshooting guides for data systems and processes.
* Supporting data projects such as system migrations, upgrades, and expansions.
* Participating in on-call rotations, including weekend and holiday support as needed.
Required Skills and Qualifications:
* Bachelor's degree in Computer Science, Data Science, or a related field.
* Proven track record as a Site Reliability Engineer (SRE) or in a similar role within data engineering, particularly in managing Spark platforms.
* Excellent problem-solving skills with a keen eye for detail.
* Strong communication and collaboration abilities.
* Ability to work efficiently under pressure in a fast-paced environment.