We are seeking an accomplished HPC/AI Platform Engineering Manager to lead the design, implementation, and optimization of advanced computing environments that power AI, ML, and LLM workloads. This role is ideal for a hands-on technologist with deep expertise in HPC systems, GPU-accelerated infrastructure, and large-scale AI deployments, combined with the leadership ability to drive fast-paced, innovative initiatives.
You will collaborate with engineering, research, and business teams to define infrastructure strategy, assess emerging technologies, and deliver scalable, secure, and high-performance solutions. This role is pivotal in advancing generative AI, analytics, and model training capabilities through robust architecture, automation, and software integration.
Recruiting for this role ends on January 31, 2026.
Key Responsibilities
Architecture & Strategy
- Design and implement HPC and AI infrastructure leveraging HPE Apollo, ProLiant, Cray, and similar enterprise-class systems.
- Architect ultra-low-latency, high-throughput interconnect fabrics (InfiniBand NDR/800G, RoCEv2, 100–400 GbE) for large-scale GPU and HPC clusters.
- Deploy and optimize cutting-edge NVIDIA GPU architectures (e.g., H100, H200, RTX PRO / Blackwell series, NVL-based systems).
- Develop scalable hybrid HPC and cloud architectures across Azure, AWS, GCP, and on-prem environments.
- Establish infrastructure blueprints supporting secure, high-throughput AI workloads.
AI/ML & LLM Platform Enablement
- Build and manage AI/ML infrastructure to maximize performance and productivity of ML research teams.
- Architect and optimize distributed training, storage, and scheduling systems for large GPU clusters (see the sketch after this list).
- Implement automation, observability, and operational frameworks to minimize manual intervention.
- Deploy and manage GPU-accelerated Kubernetes clusters for AI and HPC workloads.
- Integrate open-source GenAI components, including vector databases and AI/ML frameworks, for model serving and experimentation.
- Identify and resolve performance and scalability bottlenecks across infrastructure layers.
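By way of illustration only, the sketch below shows the kind of distributed-training bootstrap this role would help teams run at scale, assuming PyTorch with the NCCL backend and a torchrun-style launcher that sets the usual RANK/LOCAL_RANK/WORLD_SIZE environment variables; the placeholder model is hypothetical.

```python
# Minimal multi-GPU training bootstrap (sketch; assumes PyTorch + NCCL and a
# torchrun-style launcher that sets RANK, LOCAL_RANK, and WORLD_SIZE).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; bind this process to its local device.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Placeholder model; a real job would construct its own architecture here.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ... training loop elided ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```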
Software Engineering & Integration
- Develop and maintain automation tools and utilities in Python, Golang, and Bash.
- Integrate HPC infrastructure with ML frameworks, container runtimes, and orchestration platforms.
- Contribute to job scheduling, resource management, and telemetry components.
- Build APIs and interfaces for workload submission, monitoring, and reporting across heterogeneous environments (a minimal submission sketch follows below).
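As one concrete (hypothetical) pattern for a workload-submission interface, the sketch below wraps Slurm's sbatch CLI in a thin Python helper; sbatch and its --parsable and --partition flags are real, while the function name, partition name, and script path are illustrative.

```python
# Sketch of a workload-submission helper that wraps Slurm's sbatch CLI.
# submit_job, the partition, and the script path are illustrative.
import subprocess

def submit_job(script_path: str, partition: str = "gpu") -> str:
    """Submit a batch script to Slurm and return the job ID."""
    # --parsable makes sbatch print just the job ID (and cluster, if any).
    result = subprocess.run(
        ["sbatch", "--parsable", "--partition", partition, script_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip().split(";")[0]

if __name__ == "__main__":
    job_id = submit_job("train_llm.sh")
    print(f"Submitted Slurm job {job_id}")
```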
Containerization & Orchestration
- Design Kubernetes and OpenShift architectures optimized for GPU and AI workloads.
- Implement GPU scheduling, persistent storage, and high-speed networking configurations (see the GPU-pod sketch after this list).
- Collaborate with DevOps/MLOps teams to build CI/CD pipelines for containerized research and production environments.
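For flavor, here is a minimal sketch of GPU scheduling on Kubernetes using the official kubernetes Python client and the standard nvidia.com/gpu extended resource; it assumes the NVIDIA device plugin is installed on the cluster, and the pod name, namespace, and image tag are placeholders.

```python
# Sketch: schedule a single-GPU pod via the kubernetes Python client.
# Assumes the NVIDIA device plugin exposes the nvidia.com/gpu resource;
# pod and image names are placeholders.
from kubernetes import client, config

def launch_gpu_pod(namespace: str = "default") -> None:
    config.load_kube_config()  # or load_incluster_config() inside a pod
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="cuda",
                    image="nvidia/cuda:12.4.1-base-ubuntu22.04",
                    command=["nvidia-smi"],
                    resources=client.V1ResourceRequirements(
                        limits={"nvidia.com/gpu": "1"},
                    ),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)

if __name__ == "__main__":
    launch_gpu_pod()
```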
Systems & Automation
- Oversee Linux system architectures (RHEL, Ubuntu, OpenShift) with automation via Ansible and Terraform.
- Implement monitoring and observability (e.g., Prometheus, Grafana, DCGM, and NVML); see the telemetry sketch after this list.
- Ensure system scalability, reliability, and security through proactive optimization.
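The sketch below illustrates one common way GPU observability is wired up: polling NVML through the pynvml bindings and exporting gauges with prometheus_client. The metric names, port, and poll interval are illustrative choices, not a standard.

```python
# Sketch: minimal GPU telemetry exporter (pynvml -> Prometheus gauges).
# Metric names, port, and poll interval are illustrative.
import time
import pynvml
from prometheus_client import Gauge, start_http_server

GPU_UTIL = Gauge("gpu_utilization_percent", "GPU utilization", ["gpu"])
GPU_MEM = Gauge("gpu_memory_used_bytes", "GPU memory in use", ["gpu"])

def main() -> None:
    pynvml.nvmlInit()
    start_http_server(9400)  # scrape target for Prometheus
    count = pynvml.nvmlDeviceGetCount()
    while True:
        for i in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            GPU_UTIL.labels(gpu=str(i)).set(util.gpu)
            GPU_MEM.labels(gpu=str(i)).set(mem.used)
        time.sleep(15)

if __name__ == "__main__":
    main()
```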
Governance & Leadership
- Ensure architecture and deployments comply with organizational and regulatory standards.
- Conduct technical workshops, architecture reviews, and presentations for both technical and executive audiences.
- Define and drive the infrastructure roadmap in partnership with business stakeholders.
- Mentor and lead engineering teams, translating business requirements into actionable technical deliverables.
- Foster innovation and cross-functional collaboration to accelerate AI/ML initiatives.
Required Qualifications
- 10+ years of experience in HPC architecture, systems engineering, or platform design, with a focus on designing and operating on-premises Kubernetes for large-scale AI/ML workloads.
- 3+ years of hands-on experience and demonstrated proficiency with Linux, Python, Golang, and/or Bash.
- 2+ years leading teams and/or processes.
- 2+ years of recent experience working with GPU platforms (strong preference for NVIDIA), distributed systems, and performance optimization.
- Ability to travel 0–10%, on average, based on the work you do and the customers you serve.
- Must be a US Citizen.
Preferred Qualifications
- Master’s or Ph.D. in Computer Science, Electrical Engineering, or a related discipline, plus relevant work experience.
- Demonstrated success supporting LLM training and inference workloads in both R&D and production environments.
- Strong knowledge of high-performance networking, storage, and parallel computing frameworks.
- Exceptional communication and leadership skills, capable of bridging technical depth with executive strategy.
The wage range for this role takes into account the wide range of factors that are considered in making compensation decisions including but not limited to skill sets; experience and training; licensure and certifications; and other business and organizational needs. The disclosed range estimate has not been adjusted for the applicable geographic differential associated with the location at which the position may be filled. At Deloitte, it is not typical for an individual to be hired at or near the top of the range for their role and compensation decisions are dependent on the facts and circumstances of each case. A reasonable estimate of the current range is $130,000 to $241,000.
You may also be eligible to participate in a discretionary annual incentive program, subject to the rules governing the program, whereby an award, if any, depends on various factors, including, without limitation, individual and organizational performance.
Information for applicants with a need for accommodation: https://www2.deloitte.com/us/en/pages/careers/articles/join-deloitte-assistance-for-disabled-applicants.html