The Domain Architect - AI Compute acts as the primary technical authority for the physical and logical lifecycle of high-performance GPU compute fleets across diverse client environments, bridging the gap between architectural design and hands-on execution. You are a "doer" who is as comfortable configuring a cluster in the CLI as you are explaining that configuration to a C-level client.
As a System Integrator, we do not simply manage a static cloud; we design and deliver bespoke, high-scale AI factories for the world's leading enterprises. In this role, you will define the Gold Standard for compute infrastructure, moving beyond single-server management to architect repeatable, scalable, and automated compute fabrics. You will serve as the technical lead for NVIDIA Cloud Provider (NCP) and private enterprise AI cloud deployments, owning the "Compute" in the critical "Compute-Network-Storage" triad.
In this role, you will operate with a 60/40 split between delivering complex AI infrastructure (60%) and providing pre-sales Subject Matter Expertise (SME) (40%). You will lead the physical provisioning of NVIDIA SuperPOD, NVIDIA BasePOD, and Cisco AI Factory environments, ensuring our clients receive "Day 2" ready AI factories, while assisting the sales team in defining the scope and cost of future deployments.
Key Responsibilities
1. Delivery & Implementation (60%)
* Bare Metal Build & Provisioning:
* Lead the physical provisioning of clusters: NVIDIA NVL72, DGX SuperPOD, BasePOD, HGX, MGX, Cisco AI Factory.
* Utilise NVIDIA Base Command Manager (BCM) for diskless booting, firmware management, and OS hardening.
* Establish monitoring with NVIDIA Mission Control.
* Execute automated "Zero Touch Provisioning" (ZTP) workflows to transform bare-metal hardware into production-ready nodes.
* Define and enforce "Fair Share" policies, fractional GPU quotas using Multi-Instance GPU (MIG), and pre-emption logic for multi-tenant environments.
* Ensure orchestration layers respect hardware topology (NUMA affinity, PCIe trees) for maximum performance.
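As a concrete illustration of the fair-share quota work described above, the sketch below apportions the seven MIG compute slices of a single GPU across tenants in proportion to their fair-share weights. Tenant names and weights are purely hypothetical, and real schedulers (and MIG profile geometry) impose constraints this toy largest-remainder method ignores.

```python
# Minimal sketch: apportion the 7 MIG compute slices of one GPU across
# tenants by fair-share weight, using the largest-remainder method.
# Tenant names and weights are illustrative, not from the role description.

def apportion_mig_slices(weights: dict, total_slices: int = 7) -> dict:
    total_weight = sum(weights.values())
    # Ideal (fractional) share per tenant
    ideal = {t: total_slices * w / total_weight for t, w in weights.items()}
    # Floor everyone first, then hand out the leftover slices
    # to the largest fractional remainders
    alloc = {t: int(v) for t, v in ideal.items()}
    remainder = total_slices - sum(alloc.values())
    for t in sorted(ideal, key=lambda t: ideal[t] - alloc[t], reverse=True)[:remainder]:
        alloc[t] += 1
    return alloc

print(apportion_mig_slices({"team-a": 50, "team-b": 30, "team-c": 20}))
# → {'team-a': 4, 'team-b': 2, 'team-c': 1}
```

In production the output of a policy like this would be realised via `nvidia-smi mig` instance creation and enforced by the scheduler's quota layer; the arithmetic above only shows the fair-share idea.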
* Orchestration & Day 2 Operations:
* Deploy and configure management planes like Rafay or Armada to enable multi‐cluster management and observability.
* Implement high-fidelity telemetry using DCGM (Data Center GPU Manager) to monitor GPU health, thermal throttling, and XID error rates.
* Lead the "Day 2" handover, ensuring the environment is fully integrated with the client's identity providers and storage backends.
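To make the XID-monitoring duty concrete, here is a minimal sketch that tallies XID error codes from kernel-log-style lines. The sample lines are illustrative; in practice this data would come from dmesg/journald or from DCGM's XID error fields rather than hand-written strings.

```python
import re
from collections import Counter

# Sketch: tally NVIDIA XID error codes from kernel log lines.
# The log lines below are fabricated examples in the usual
# "NVRM: Xid (PCI:...): <code>, ..." shape, not real telemetry.
XID_RE = re.compile(r"NVRM: Xid \((?P<bdf>PCI:[0-9a-f:]+)\): (?P<code>\d+),")

def count_xids(log_lines):
    """Return a Counter mapping XID code -> occurrence count."""
    counts = Counter()
    for line in log_lines:
        m = XID_RE.search(line)
        if m:
            counts[int(m.group("code"))] += 1
    return counts

sample = [
    "NVRM: Xid (PCI:0000:3b:00): 79, pid=4821, GPU has fallen off the bus.",
    "NVRM: Xid (PCI:0000:3b:00): 79, pid=4821, GPU has fallen off the bus.",
    "NVRM: Xid (PCI:0000:5e:00): 48, pid=1102, Double Bit ECC Error",
]
print(count_xids(sample))  # → Counter({79: 2, 48: 1})
```

A rising count for a specific code (e.g. repeated bus or ECC errors on one PCI address) is exactly the kind of signal this role would wire into alerting.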
* Performance Engineering:
* Conduct validation testing using NCCL-tests, HPL, and HPCG to verify cluster performance.
* Perform kernel‐level tuning (huge pages, sysctl) to optimise the OS for high‐bandwidth InfiniBand/RoCEv2 fabrics.
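By way of illustration, kernel tuning of this kind is typically captured as a drop-in sysctl fragment. The values below are generic starting points only, not validated settings for any particular fabric; real values must follow the NIC vendor's and NVIDIA's guidance for the deployed hardware.

```conf
# /etc/sysctl.d/90-ai-fabric.conf -- illustrative starting points only;
# tune and validate per fabric, NIC vendor guidance, and workload.

# Large socket buffers for high-bandwidth RoCEv2 / IPoIB traffic
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456

# Reserve 2 MiB huge pages for pinned DMA buffers (size pool to workload)
vm.nr_hugepages = 4096

# Discourage swapping of memory backing long-running GPU jobs
vm.swappiness = 10
```

Changes like these are exactly what the SOE (below) should capture, so that every node in a cluster boots with an identical, validated tuning profile.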
2. Pre-Sales & Subject Matter Expertise (40%)
* Assist the sales team by validating customer technical requirements and producing accurate Level of Effort (LOE) estimates for Statements of Work (SOWs).
* Define the "Standard Operating Environment" (SOE) for proposals to ensure repeatability.
* Own the technical accuracy of the Compute Bill of Materials (BoM).
* Ensure all components (RAM, NVMe, NICs, Transceivers) are strictly aligned with the NVIDIA HCL (Hardware Compatibility List) and approved reference architectures (DGX, HGX, MGX, NVL72).
* Ensure alignment of GPU selection with client workloads.
* Client Workshops:
* Lead technical discovery workshops to determine specific workload requirements (e.g., distinguishing between ML workloads vs. LLM Inference vs. LLM Training vs. Omniverse Digital Twin rendering).
Required Skills & Experience
* NVIDIA Ecosystem:
* Deep architectural understanding of NVIDIA GPU platforms (Hopper, Grace‐Hopper, Blackwell, Grace‐Blackwell).
* Mastery of NVL72 rack‐scale integration and NVSwitch fabrics.
* Expertise in the associated software stack (CUDA, cuDNN, NCCL).
* Linux Systems:
* Expert-level knowledge of Linux distributions (Ubuntu, RHEL) optimised for HPC/AI.
* Deep experience with kernel tuning, driver management, and system hardening.
* Infrastructure as Code (IaC):
* Proficiency in Python and Ansible for hardware configuration management and automation.
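As a flavour of the Ansible work involved, a baseline play might pin the qualified driver release and enable persistence mode fleet-wide. The host group, package name, and version below are hypothetical; a real play would be generated from the site's validated SOE baseline.

```yaml
# Illustrative Ansible play; hostnames and the driver version are
# hypothetical placeholders, not a validated baseline.
- name: Baseline GPU compute nodes
  hosts: gpu_nodes
  become: true
  tasks:
    - name: Pin the qualified NVIDIA driver release
      ansible.builtin.apt:
        name: nvidia-driver-550   # hypothetical qualified version
        state: present

    - name: Enable GPU persistence mode at boot
      ansible.builtin.systemd:
        name: nvidia-persistenced
        enabled: true
        state: started
```

Encoding this in version-controlled playbooks, rather than per-node shell sessions, is what makes hundred-node builds repeatable.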
* Industry Background: Experience working within a System Integrator (SI) or Managed Service Provider (MSP) environment.
* Cluster Management: Hands‐on experience with NVIDIA Base Command Manager (BCM) (previously Bright Cluster Manager) and/or NVIDIA Mission Control.
* Network Integration: Solid understanding of high‐speed interconnects (InfiniBand NDR/HDR, RoCEv2) and how they interface with host PCIe/NVLink topologies.
* Facilities & Cooling: Familiarity with liquid cooling implementation (Direct‐to‐Chip, Rear Door Heat Exchangers) and its impact on server maintenance.
* Container Platforms: Experience with Kubernetes/Red Hat OpenShift installation and administration.
* GPU Orchestration Platforms: Familiarity with one or more of:
* Rafay
* NVIDIA Run:AI
* OpenShift AI
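For context on the Kubernetes side, scheduling onto GPUs is expressed through extended resources. The pod below is a generic smoke-test sketch (name and image tag are illustrative) that assumes the NVIDIA device plugin or GPU Operator is installed on the cluster.

```yaml
# Illustrative GPU smoke-test pod; metadata name and image tag are
# placeholders, and the nvidia.com/gpu resource requires the NVIDIA
# device plugin / GPU Operator to be present on the cluster.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

A pod like this is a quick "does the node actually expose its GPUs?" check during cluster handover.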
Success Metrics
* Deployment Success: Successful handover of clusters that pass all NVIDIA validation tests (NCCL/HPL) on the first attempt.
* Client Satisfaction: Achieving positive technical feedback from client workshops and handover sessions.