System Stability and Scalability Specialist
The ideal candidate will be a key player in ensuring the stability, performance, and scalability of our applications and supporting infrastructure.
The role delivers technical support, automation, and operational excellence across production systems so that customers receive seamless, reliable services.
By combining advanced troubleshooting with strong DevOps practices, it keeps those systems secure, efficient, and well maintained.
Key Responsibilities
* Technical Leadership: Act as a key escalation point for technical application issues, providing expert-level troubleshooting, root cause analysis, and timely resolution for production systems.
* Proactive Monitoring: Monitor application health, performance, and availability, and take corrective action before users are impacted.
* Automation Development: Develop, enhance, and maintain automation scripts in Python, Bash, and other scripting languages to streamline operational workflows and reduce manual effort.
* Workflow Automation: Design, build, and manage workflow automation and system integrations using platforms such as Workato, Zapier, or equivalent tools.
* System Administration: Administer and maintain Linux-based systems, ensuring system stability, security, and optimal performance.
* Log Analysis and Troubleshooting: Perform detailed log analysis, system diagnostics, and error tracing to identify patterns, prevent recurrence, and implement permanent fixes.
* Version Control and CI/CD: Manage code repositories in Git, enforce version control best practices, and support continuous integration and continuous deployment (CI/CD) processes.
* Cloud Deployment: Deploy, monitor, and manage applications and services in AWS environments, leveraging cloud-native tools for scalability, resilience, and cost optimization.
* Documentation and Knowledge Sharing: Document technical solutions, operational procedures, and troubleshooting guides so that knowledge is shared across the team and support becomes more efficient.
* Incident Response: Participate in incident response and post-incident reviews, driving improvements in system reliability and operational processes.
* Process Improvement: Contribute to ongoing process improvements, system enhancements, and automation initiatives aligned with DevOps best practices.