Hyphen Connect in Australia is looking for a skilled LLM Pre-training & Distributed Systems Engineer to manage large-scale machine learning training runs. The role requires expertise in GPU clusters and a strong background in systems engineering. Key responsibilities include orchestrating distributed training with tools such as PyTorch and optimizing networking so that training runs proceed without disruption. The successful candidate will manage complex infrastructure to keep training runs efficient and will implement automated recovery techniques for extended training periods.