Job Summary
We are seeking a highly skilled High-Performance Cloud Engineer to join our team.
This individual will be responsible for designing and implementing high-throughput systems, ensuring ultra-low-latency performance across our cloud infrastructure.
The ideal candidate will have expertise in Linux low-latency tuning, AWS operations at scale, and Infrastructure as Code / GitOps practices.
Main Responsibilities
* Ultra-Low-Latency EC2 Fleets: Design and manage fleets of EC2 instances optimized for ultra-low latency.
* Cluster Placement Groups: Design cluster placement groups with ENA / SR-IOV networking for high-performance applications.
* Kernel-Level Performance Tuning: Tune CPU pinning, NUMA alignment, IRQ affinity, hugepages, and TCP/UDP sysctl settings to flatten tail latency.
* Immutable Infrastructure & Automated Rollouts: Implement immutable infrastructure using Packer AMIs and Terraform Auto Scaling Groups; run GitLab/Jenkins pipelines with blue-green or canary deploys and sub-2-minute automatic rollbacks.
* High-Throughput Messaging & Gateways: Operate Kafka clusters (partition/ISR tuning, rack awareness) and Nginx WebSocket edges serving 100k+ clients with single-digit-ms fan-out.
* Network Integrity: Run packet-loss analysis and MTU/ECN/queue-depth tuning; enforce least-privilege security-group micro-segmentation.
* Observability & SLO Stewardship: Build Prometheus/Grafana dashboards for order-ack latency, queue depth, and reject rate; write Prometheus alerting rules, routed through Alertmanager, driven by p95/p99 error-budget burn.
* Reliability Testing & Incident Response: Schedule chaos/load drills; take part in 24×7 on-call; use perf/eBPF/FlameGraphs/tcpdump for µs-level root-cause analysis (RCA); and publish post-mortems with remediation actions.
* Cross-Team Collaboration: Pair with Java/Rust engineers and quants to profile hot-path code and eliminate bottlenecks without causing trading downtime.
Requirements
* Linux Low-Latency Tuning: Expertise in CPU pinning, NUMA awareness, IRQ affinity, TCP/UDP stack tweaks, hugepages.
* AWS Operations at Scale: Experience with EKS, EC2, VPC, NLB/ALB, Auto Scaling, multi-AZ fail-over, cost & quota management.
* Infrastructure as Code / GitOps: Knowledge of Terraform (modular state).
* CI/CD Pipelines: Familiarity with GitLab CI or Jenkins; blue-green / canary deploys, sub-2-minute rollbacks, latency smoke-test gates.
* Observability: Experience with Prometheus + Grafana, Alertmanager, high-cardinality metrics, centralized log aggregation, eBPF tracing for µs-level hotspots.
* High-Throughput Messaging: Expertise in Kafka cluster operations (partition strategy, ISR tuning).
* Performance & Reliability Engineering: Skills in perf, FlameGraph, chaos/load testing, p95/p99 latency SLO ownership.
* Automation & Scripting: Familiarity with Python or Go for tooling, incident remediation, environment bootstrap.
* Bonus: Rust/Go code familiarity, CNCF/AWS certifications, XDP/DPDK experience for kernel-bypass networking.