Manage a team that designs, develops, troubleshoots, and debugs software for databases, applications, tools, networks, and related systems.
Internal Responsibilities
- Own and build solutions that scale and optimize AI compute infrastructure components, such as the GPU control plane and GPU data plane, to improve customer experience and customer workload performance on our AI infrastructure.
- Set and communicate individual expectations and team goals that align with the broader organization's goals.
- Model modern software engineering practices and coach team members in them: using data and telemetry to make decisions, defining clear interfaces across components, holding design and code reviews, following coding standards, and maintaining comprehensive coverage through unit tests, integration tests, and active production monitoring.
- Prioritize the team's work with a focus on customer issues and requirements.
- Ensure that team solutions are well defined, modular, secure, reliable, diagnosable, actively monitored, compliant, and reusable.
- Create a roadmap, define SMART goals, and track team progress against committed OKRs.
Qualifications & Skills:
- 10+ years' experience in software development with programming languages including, but not limited to, C, C++, C#, Java, Go, and Rust.
- 5+ years' experience in a people management or leadership role while working on cross-functional projects.
- 5+ years' experience designing and developing large-scale distributed systems, services, and infrastructure.
- BS (or equivalent experience) in Computer Science, Engineering, or a related field.
- Strong communication, collaboration, and project management skills.
- Ability to adapt to a fast-paced, dynamic environment and manage multiple tasks and priorities effectively.
Preferred Qualifications:
- Experience managing cloud infrastructure with hundreds of thousands of servers.
- Experience with containerization technologies such as Docker and Kubernetes.
- Experience scheduling high-performance workloads on Kubernetes or Slurm.