Overview
Firmus is seeking a highly skilled and driven Kubernetes HPC Engineer to join our Software Defined Infrastructure team. In this role, you will build high-performance, fault-tolerant, and reliable infrastructure to support bare-metal provisioning, performance benchmarking, and platform validation.
You will be instrumental in ensuring the stability, performance, and continuous improvement of our complex and mission-critical bare-metal HPC GPU clusters.
Responsibilities
* Design and implement bare-metal provisioning workflows using Ironic and Kubernetes CRDs.
* Deploy and manage GPU-enabled AI compute nodes with RDMA, InfiniBand, and RoCE networking.
* Optimise Kubernetes and Slurm platforms for multi-node AI training performance, including NCCL, UCX, GPUDirect, and fabric tuning.
* Implement GPU scheduling, isolation, and resource management models using Kubernetes primitives.
* Design, deploy, and fine-tune Slurm GPU clusters with topology-aware configurations.
* Develop and execute performance benchmarking workloads, including MLPerf, NCCL tests, microbenchmarks, and throughput/latency validation.
* Establish observability across GPU, InfiniBand fabric, storage, and provisioning components.
* Document architecture designs, operational procedures, and performance results.
* Collaborate with L2 SRE engineers, site operations, and networking teams to ensure platform reliability, reproducibility, and performance.
* Support hardware bring-up activities, including BIOS tuning, GPU topology verification, NUMA alignment, and PCIe/NVLink checks.
* Contribute to continuous improvement in cluster validation, CI/CD automation, and provisioning and testing frameworks.
* Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks that optimise AI workload performance for large-scale GPU cluster commissioning.
Skills & Experience
* Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
* Experience with bare-metal cluster provisioning using tools such as Metal3, OpenStack Ironic, MaaS, xCAT, or similar.
* Deep knowledge of Kubernetes internals, including CRDs, controllers, operators, and cluster lifecycle management.
* Strong understanding of Slurm configuration and experience compiling AI and HPC applications.
* Strong understanding of GPU systems (NVIDIA H100/H200 SXM platforms), CUDA/NCCL, and GPU topology (NVLink, NVSwitch, PCIe).
* Familiarity with container runtimes for compute workloads, including Docker, Enroot, Singularity, and Podman.
* Experience with benchmarking and performance validation for AI, HPC, or distributed training workloads.
* Practical Linux systems engineering experience, including kernel, cgroups, system services, networking, and drivers.
* Strong automation mindset using tools such as Ansible, Helm, Terraform/OpenTofu, or equivalent.
* Understanding of firmware, BIOS, BMC/IPMI/Redfish, and low-level system tuning.
* Proficiency in one or more programming languages such as Go, Bash, Rust, or Python.
* Excellent documentation skills with strong attention to detail.
* Experience participating in an on-call rotation supporting production services.
* Proactive self-starter with a drive for continuous technical improvement.
* Systems Architecture: Ability to design and integrate bare-metal, GPU, RDMA, and Kubernetes/Slurm platforms.
* Infrastructure Automation: Skilled in automated provisioning and lifecycle management of hardware and clusters.
* GPU and HPC Performance: Understanding of GPU systems, RDMA fabrics, and distributed AI workload performance.
* Technical Communication: Ability to communicate technical concepts effectively across diverse engineering and operations teams.
* Continuous Improvement: Demonstrates curiosity, proactive learning, and innovation in AI and HPC infrastructure.
Success Metrics
* Reliable provisioning of Kubernetes and Slurm AI clusters.
* Performance validation and optimisation.
* Improved operational efficiency.
* High-quality documentation and effective knowledge transfer.
Location & Reporting
* Australia (Sydney, NSW or Launceston, TAS)
* Reporting to Senior Manager, Software Defined Infrastructure
Employment Basis
Full-time
Diversity
At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.