Job Summary
A company is looking for a Senior AI-HPC Cluster Engineer - MLOps.
Key Responsibilities
- Lead and mentor others in managing large-scale HPC systems, including compute, networking, and storage deployment
- Develop scalable automation solutions for GPU-accelerated computing and support researchers with performance analysis and optimizations
- Conduct root cause analysis, proactively address issues, and build innovative tooling to enhance researchers' efficiency
Required Qualifications
- Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
- Minimum of 6 years of experience with large-scale compute infrastructure
- Experience with AI/HPC job schedulers and orchestrators, such as Slurm or Kubernetes
- Proficiency with Linux distributions and container technologies such as Docker
- Proficiency in one scripting language and at least one compiled language