Remote Jobs

Senior AI HPC Cluster Engineer

8/1/2025

No location specified

Job Summary

A company is looking for a Senior AI-HPC Cluster Engineer - MLOps.

Key Responsibilities

Provide leadership and mentorship on managing large-scale HPC systems, including compute, networking, and storage deployment
Develop scalable automation solutions for GPU-accelerated computing, and support researchers with performance analysis and optimization
Conduct root cause analysis, proactively address issues, and build innovative tooling to enhance researchers' efficiency

Required Qualifications

Bachelor's degree in Computer Science, Electrical Engineering, or related field, or equivalent experience
Minimum of 6 years of experience with large-scale compute infrastructure
Experience with AI/HPC job schedulers and orchestrators, such as Slurm or Kubernetes
Proficiency with Linux distributions and container technologies such as Docker
Proficiency in at least one scripting language and at least one compiled language