A Site Reliability Engineer ensures the reliability, scalability, and performance of systems and services. They bridge the gap between development and operations by applying software engineering principles to infrastructure and operations problems.
Key Responsibilities
* System Reliability & Performance: Design and build reliable, performant systems.
* Monitor system health and performance using observability tools.
* Incident Management: Respond to production incidents, perform root cause analysis, and implement preventive measures.
* Automation: Develop scripts and tools to automate repetitive tasks and improve efficiency.
* Capacity Planning: Plan for scaling infrastructure.
* Collaboration: Work closely with development teams to ensure reliability is built into applications.
* Security & Compliance: Implement best practices for system security and compliance.
Required Skills
* Strong knowledge of Linux/Unix systems and networking fundamentals.
* Proficiency in programming/scripting languages (Python, Go, Bash).
* Experience with cloud platforms (AWS, Azure, GCP).
* Familiarity with CI/CD pipelines and DevOps practices.
* Expertise in monitoring tools (Prometheus, Grafana, ELK stack).
* Understanding of containerization and orchestration (Docker, Kubernetes).
Qualifications
* Bachelor’s degree in Computer Science, Engineering, or a related field.
* 3+ years of experience in system administration, DevOps, or SRE roles.
* Strong problem-solving and troubleshooting skills.
Preferred
* Experience with Infrastructure as Code (Terraform, Ansible).
* Knowledge of distributed systems and microservices architecture.