Senior ML Infrastructure Engineer (Compute)

General Motors • Full-time • United States • 1w ago

About the Team:

The AI Validation Platform team owns the cloud-agnostic, reliable, and cost-efficient platform that powers GM’s AV efforts. We’re proud to serve as the infrastructure platform for teams developing autonomous vehicles (L3/L4/L5). Our platform supports the simulated validation of state-of-the-art (SOTA) machine learning models, with a focus on performance, availability, concurrency, and scalability. We enable rapid innovation and development by prioritizing high-impact, ML-centric use cases.

About the Role:

We are seeking a Senior ML Infrastructure Engineer to help build and scale robust Compute platforms for Simulation workflows. In this role, you will focus on scaling the platform, driving efficiency and high utilization of cutting-edge GPUs, and improving the platform’s reliability. The successful candidate will have experience building and running scalable distributed systems. They will rapidly test and promote ideas, have strong problem-solving skills, and demonstrate a bias for action.

You will play a key role in shaping the architecture, roadmap, and user experience of a robust service supporting our AI Validation / Simulation needs. The ideal candidate also brings experience designing distributed systems and a get-it-done attitude. This is a high-impact opportunity to influence the future of AI infrastructure at GM.

What you’ll be doing:

Design and implement core platform backend software components.

Collaborate with Simulation engineers, ML engineers, and researchers to understand critical workflows, translate them into platform requirements, and deliver incremental value.

Lead technical decision-making on Compute architecture, cloud capacity provisioning, caching, and auto-scaling mechanisms.

Drive the development of monitoring, observability, and metrics to ensure reliability, performance, and resource optimization.

Proactively research and integrate frameworks, hardware accelerators, and distributed computing techniques.

Lead large-scale technical initiatives across GM’s ML infrastructure.

Raise the engineering bar through technical leadership and by establishing best practices.

At a Minimum We'd Like You To Have:

4+ years of industry experience, with a focus on high-performance backend services.

Strong expertise in Go or a similar programming language.

Experience working with cloud platforms such as GCP, Azure, or AWS.

Strong communication skills and a proven track record of driving and delivering cross-functional initiatives.

Ability to thrive in a dynamic, multi-tasking environment with ever-evolving priorities.

It's Preferred If You Have:

Hands-on experience with cloud VM services such as Google Compute Engine.

Experience with hardware-in-the-loop validation systems.

Experience with high-performance computing (HPC).

Experience working with or designing interfaces and clients for developer workflows.

Familiarity with telemetry and other feedback loops that inform product improvements.

Familiarity with hardware acceleration (GPUs) and related optimizations.

Why Join Us?

If you’re excited to tackle some of today’s most complex engineering challenges, see the impact of your work in real-world AV applications, and help shape the future of AI infrastructure at GM, this is the team for you.