You will work at the intersection of distributed systems, networking, and AI infrastructure, driving architecture, design, implementation, and performance optimization across software components that support thousands of GPUs and high-bandwidth network fabrics. The ideal candidate combines deep expertise in RDMA and distributed communication systems with a strong track record of delivering production-grade infrastructure at scale.
As a technical leader, you will influence architecture across multiple teams, mentor senior engineers, and help shape the roadmap for Oracle's AI networking platform.
What You'll Bring
- Ability to solve highly complex technical challenges spanning networking, distributed systems, and AI infrastructure.
- Strong system design skills with a focus on scalability, performance, and reliability.
- A data-driven approach to performance analysis and optimization.
- Excellent communication and collaboration skills across engineering organizations.
- Passion for building foundational technologies that enable the next generation of AI workloads.
Key Responsibilities
- Architect and develop high-performance networking software for large-scale AI and HPC environments.
- Design and implement RDMA-based services and infrastructure that enable low-latency, high-throughput communication across GPU clusters.
- Drive the evolution of collective communication frameworks and transport layers used by distributed AI training and inference workloads.
- Develop congestion management, traffic engineering, load balancing, and resiliency mechanisms for large-scale RDMA networks.
- Optimize end-to-end communication performance across networking, GPU, and software stacks.
- Collaborate with hardware, networking, distributed systems, and AI platform teams to deliver scalable infrastructure solutions.
- Lead performance analysis, bottleneck identification, and system-wide optimization efforts.
- Define architecture and technical direction for networking platforms supporting next-generation AI workloads.
- Build observability, monitoring, telemetry, and debugging capabilities for large-scale distributed systems.
- Drive reliability, fault tolerance, and recovery mechanisms for mission-critical AI infrastructure.
- Mentor engineers across the organization and provide technical leadership on complex cross-functional initiatives.
- Influence engineering best practices, architecture reviews, and long-term technology strategy.
Minimum Qualifications
- Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related field; advanced degree preferred.
- 10+ years of software engineering experience building distributed systems, networking software, or infrastructure platforms.
- Deep expertise in RDMA technologies such as RoCE, InfiniBand, or equivalent high-performance networking technologies.
- Strong experience developing networking software in C/C++.
- Experience designing and optimizing distributed communication frameworks and transport protocols.
- Solid understanding of operating systems, networking stacks, memory management, and performance optimization.
- Experience troubleshooting and optimizing large-scale production systems.
- Demonstrated technical leadership driving architecture and execution across multiple teams.
- Strong knowledge of Linux systems and low-level systems programming.
Preferred Qualifications
- Experience with collective communication libraries such as NCCL, RCCL, MPI, UCC, UCX, XCCL, or similar technologies.
- Experience building AI infrastructure supporting distributed training and inference workloads.
- Expertise in GPU networking technologies including GPUDirect RDMA and GPU-aware communication stacks.
- Experience with congestion management, adaptive routing, traffic shaping, and network resiliency mechanisms.
- Familiarity with large-scale GPU clusters consisting of hundreds to thousands of accelerators.
- Experience developing services and platforms operating directly over RDMA transports.
- Knowledge of distributed training frameworks such as PyTorch, DeepSpeed, Megatron-LM, TensorFlow, or JAX.
- Experience with cloud infrastructure and large-scale production service deployment.
- Familiarity with Kubernetes, containerized environments, and cloud-native infrastructure.
- Experience leading architecture for highly available and performance-critical systems.