We are the AI Infrastructure - Network Operations team at OCI. We support and operate the RDMA/RoCE network fabrics for OCI's largest AI and HPC customers. These fabrics are the foundation underneath OCI's AI, GPU and HPC services, and support major tier-0 vendors in the generative AI industry. If you're running an AI workload at OCI, we're running the RDMA network underneath your workload.
A Principal Network Engineer on our team supports the design, deployment, and operations of a large-scale global Oracle cloud computing environment (Oracle Cloud Infrastructure - OCI). The role focuses primarily on operating and supporting RDMA/RoCE network fabrics and systems, combining deep network understanding with automation skills to run a production environment. Because OCI is a cloud-based network with a global footprint, this support spans hundreds of thousands of network devices supporting millions of servers, connected over a mix of dedicated backbone infrastructure and the Internet.
Internal Responsibilities
Key Responsibilities
- Lead network lifecycle management initiatives by defining technical objectives, delivery plans, and implementation procedures for large-scale network infrastructure projects.
- Translate high-level network architectures into detailed designs and deployment plans while ensuring scalability, reliability, and operational readiness.
- Serve as the technical lead for moderately complex network projects, coordinating the efforts of multiple engineers across design, deployment, automation, and operational support.
- Design, implement, and support network solutions across data center, backbone, cloud, and service provider environments.
- Partner with service owners and infrastructure teams to ensure network solutions are fully integrated with monitoring, observability, automation, and operational support systems.
- Act as a Tier 2 and specialized escalation point for network incidents, driving root cause analysis, corrective actions, and long-term reliability improvements.
- Lead the investigation and resolution of complex network issues and large-scale service-impacting events.
- Develop automation solutions, tools, and scripts to improve operational efficiency, network reliability, deployment consistency, and incident response.
- Contribute to the design and delivery of network automation frameworks and operational tooling.
- Collaborate closely with product teams, program managers, network leadership, and PMO organizations to align infrastructure capabilities with product and service requirements.
- Partner with vendor engineering teams and account managers to troubleshoot issues, evaluate new technologies, and drive operational improvements.
- Participate in hardware evaluations, RFQ/RFP processes, and adoption of new networking technologies and platforms.
- Drive technology decisions that support business, product, and service objectives.
- Mentor junior engineers through technical guidance, troubleshooting support, design reviews, and knowledge sharing.
- Contribute to engineering best practices, documentation standards, operational excellence, and continuous improvement initiatives.
Preferred Skills & Experience
- Strong experience with large-scale network operations, design, and troubleshooting.
- Expertise in routing and switching technologies, including BGP, OSPF, EVPN-VXLAN, MPLS, and data center networking.
- Experience with network automation using Python, Ansible, APIs, or similar technologies.
- Strong understanding of observability, monitoring, telemetry, and incident management.
- Experience working with cloud infrastructure, hyperscale environments, or large-scale distributed systems.
- Ability to lead technical projects and influence outcomes across multiple teams.
- Strong written and verbal communication skills with the ability to work effectively across engineering, operations, and leadership teams.