Role:
Join General Motors’ Vehicle Security Platforms (VSP) teams, where we build resilient, secure, and scalable platforms supporting mission-critical vehicle security communications. We seek a Staff Site Reliability Engineer (SRE) with extensive experience in scaling distributed systems and driving end-to-end reliability strategies.
In this role, you will shape the reliability of GM’s next-generation vehicle security platforms, influence cross-organizational architecture decisions, and embed reliability as a first-class product concern. Your leadership will contribute directly to protecting millions of vehicles and customers globally.
What You’ll Do:
- Implement and evolve secure, highly available, and globally distributed systems powering GM’s vehicle security platforms.
- Own reliability roadmaps, establishing frameworks and strategies for system hardening, high availability, disaster recovery, and operational scalability.
- Develop automation-first solutions to eliminate operational toil, with advanced use of languages such as Python, Go, and Java.
- Lead incident response, driving systematic elimination of failure modes through blameless postmortems, PRRs, and cross-team preventative initiatives.
- Drive observability strategies with best-in-class practices for metrics, logging, and distributed tracing, using Prometheus, Datadog, or similar stacks.
- Partner with engineering, platform, and security teams to design for reliability from inception, influencing architecture reviews and CI/CD best practices.
- Lead optimization, capacity planning, and performance-tuning strategies for large-scale, security-critical platforms.
- Introduce modern SRE practices such as chaos engineering, resilience testing, and progressive delivery to validate support-team readiness and evolve system safety, along with SLOs, SLIs, and SLAs.
- Mentor engineers across disciplines on SRE, platform resilience, secure operational practices, and architectural trade-offs.
- Evaluate and adopt technologies (open-source, enterprise, homegrown) for security and reliability at scale.
- Influence product strategy in partnership with engineering leads, ensuring operational reliability is prioritized alongside customer and business outcomes.
Your Skills & Abilities (Required Qualifications):
- 7+ years of experience in Site Reliability Engineering, DevOps, or infrastructure/platform roles supporting secure, scalable systems.
- Proven expertise in designing and scaling cloud infrastructure (Azure) and container orchestration systems (Kubernetes, Docker).
- Demonstrated mastery of infrastructure-as-code frameworks (Terraform, Helm, CloudFormation, etc.).
- Proficiency in Python and one JVM language (Java or Kotlin), and working knowledge of Go.
- Deep architectural understanding of distributed systems, networking, system design, and large-scale security practices.
- Track record of architecting and running zero-downtime systems in production.
- Experience with modern monitoring and reliability tooling and frameworks (Prometheus, Datadog, OpenTelemetry, etc.).
- Experience leading incident response, uptime SLO/SLA management, and operational excellence initiatives across multiple teams.
- Capable of influencing architecture and product strategy while maintaining a hands-on approach to systems reliability.
- Exceptional communication skills, able to present complex trade-offs and foster alignment across executive, product, and engineering stakeholders.
What Will Give You A Competitive Edge (Preferred Qualifications):
- BS/MS/PhD in Computer Science or Engineering, or equivalent industry experience.
- Deep understanding of encryption technologies, secure data handling practices, and identity management.
- Experience designing and operating IoT or automotive-focused architectures with rigorous availability and safety requirements.
- Direct experience in chaos engineering, game-day testing, disaster recovery orchestration, and production load testing.
- Ability to grow and mentor engineers into leaders in their domain, building SRE teams that can operate independently at scale.
- Demonstrated success in defining and executing reliability strategies with measurable business impact.
- Strong product mindset with the ability to balance engineering excellence with speed and business priorities.