The ideal candidate is an experienced RDMA software engineer with a strong background in high-performance networking, distributed communication systems, and systems programming. You will work closely with senior technical leaders to design, implement, optimize, and operate critical networking infrastructure used by large-scale AI training and inference workloads.
This is a hands-on engineering role requiring deep technical expertise, strong software development skills, and a passion for solving complex performance and scalability challenges.
What You'll Bring
- Strong software engineering fundamentals and systems programming expertise.
- Deep interest in RDMA, high-performance networking, and distributed communication systems.
- Ability to diagnose and solve complex performance and scalability problems.
- Strong collaboration and communication skills in cross-functional engineering environments.
- Ownership mindset with the ability to independently drive technical initiatives from design through production deployment.
- Passion for building infrastructure that enables next-generation AI systems.
Key Responsibilities
- Design, develop, and optimize RDMA-based software components and services for large-scale AI infrastructure.
- Build and enhance collective communication frameworks, transport layers, and communication libraries used by distributed AI workloads.
- Develop congestion management, load balancing, resiliency, and failover capabilities for RDMA-based networks.
- Analyze and improve communication performance across networking, GPU, and software stacks.
- Design and implement scalable distributed systems supporting AI training and inference environments.
- Collaborate with networking, AI infrastructure, hardware, and cloud platform teams to deliver high-performance solutions.
- Investigate and resolve complex networking, performance, and reliability issues in production environments.
- Develop observability, telemetry, debugging, and performance analysis tools for distributed communication systems.
- Contribute to architectural design discussions and technical direction for networking platforms.
- Participate in code reviews and help maintain engineering excellence across the team.
Minimum Qualifications
- Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related field; advanced degree preferred.
- 7+ years of software engineering experience in systems software, networking, distributed systems, or infrastructure platforms.
- Strong hands-on expertise with RDMA technologies, including RoCEv2 and/or InfiniBand.
- Experience developing RDMA-enabled software, communication libraries, networking services, or distributed infrastructure.
- Strong understanding of RDMA programming concepts, including queue pairs, completion queues, memory registration, verbs, and transport semantics.
- Proficiency in C/C++ and Linux systems programming.
- Experience debugging and optimizing performance-critical software systems.
- Solid understanding of networking fundamentals, operating systems, and distributed systems concepts.
Preferred Qualifications
- Experience with collective communication frameworks and libraries such as NCCL, RCCL, MPI, UCX, UCC, XCCL, or similar technologies.
- Experience supporting AI/ML infrastructure and distributed training environments.
- Knowledge of GPUDirect RDMA and GPU-aware communication technologies.
- Experience developing congestion management, traffic engineering, or network resiliency solutions.
- Familiarity with large-scale GPU clusters and high-performance computing environments.
- Experience building services and infrastructure operating directly over RDMA transports.
- Familiarity with distributed training frameworks such as PyTorch, DeepSpeed, Megatron-LM, TensorFlow, or JAX.
- Experience with Kubernetes, containers, and cloud infrastructure platforms.
- Understanding of performance profiling and benchmarking tools for networking and distributed systems.