As a Senior Manager, you will lead a team responsible for the development, operation, and improvement of large-scale OCI network fabrics and supporting systems. This role requires deep networking expertise, especially in the automation of Clos network fabrics, telemetry, and performance troubleshooting, combined with software engineering experience. You will build and improve tools, automation, monitoring, and operational systems that make these fabrics more reliable, observable, and efficient at global cloud scale.
You will work closely with Network Availability, Network Monitoring, GNOC, hardware engineering, and service teams to resolve complex customer escalations, improve operational readiness, and drive engineering programs that increase performance and availability. The ideal candidate brings both hands-on technical depth and strong people leadership, with experience managing engineers who operate and build software for large-scale distributed infrastructure.
Internal Responsibilities
System Design & Architecture – System Scalability:
- Manages the development and implementation of scalable distributed systems and components across multiple teams, including the effective use of distributed state management tools.
- Oversees code and/or system optimization efforts for large-scale data processing and high-throughput requirements within and across teams to support hyper-scale systems.
- Guides teams to define scalability requirements for owned components and ensures design and implementation requirements are met.
- Manages the use of data plane platforms to effectively handle large-scale data retrieval, storage, and processing.
- Ensures the team designs accurate performance and load tests.
System Design & Architecture – System Reliability Design:
- Manages the strategy for building fault-tolerant components and systems capable of withstanding in-service updates by guiding the implementation of redundancy, replication, and automatic failover mechanisms.
- Develops design strategies for systems to effectively handle service disruptions (e.g., network partitions) by prioritizing consistency, availability, or partition tolerance.
- Leads implementation and optimization initiatives across teams for approaches to handle network unreliability, including load-shedding, throttling, and rate-limiting.
- Guides teams to design components and systems that are durable and adhere to service level objectives (SLOs), setting expectations for availability and durability of other computing services within the department.
System Design & Architecture – System Reliability Performance:
- Provides oversight in defining key performance indicators (KPIs) and telemetry to identify gaps or issues in running systems.
- Oversees the building and customization of moderately complex dashboards, telemetry systems, and alerting mechanisms to proactively monitor components and system health.
System Design & Architecture – Correctness / Availability:
- Oversees the design and implementation of functional and correctness requirements for feature sets and/or systems in new or existing systems.
- Guides teams to design complex test scenarios (e.g., fault-injection, brown-out) to evaluate system correctness.
- Directs implementation strategies for data replication and synchronization techniques to maintain data integrity and availability.
Operational Troubleshooting & Incident Management:
- Guides teams to be proactive when diagnosing, debugging, and resolving issues in active components and systems to support ongoing operation.
- Ensures teams apply their expertise to prevent interruptions, so that resolving issues requires no maintenance windows for customers and users.
- Oversees operational readiness protocols and ensures teams remain knowledgeable about owned components and systems to support effective troubleshooting and performance.
- Oversees and approves schedules for operational support rotations.
Compliance & Security:
- Oversees implementation of robust security measures to protect data and applications in multi-tenant environments, ensuring team strategies incorporate encryption techniques and access controls.
- Directs execution of remediation plans to address identified security gaps, promoting continuous improvement of security measures.
- Ensures documentation is comprehensive and cloud infrastructure complies with industry standards and regulations.
Automation & Change Management:
- Oversees the development and maintenance of automation scripts and tools (e.g., Infrastructure as Code (IaC)) to manage cloud infrastructure.
- Works with teams to create and adhere to change management plans for patching, updating, and rolling back applications, and guides development of components to allow for automation of these processes.
Core Responsibilities
Planning & Execution:
- Manages multiple medium- to large-scale projects or initiatives across teams, ensuring timelines, deliverables, and budgets (when applicable) are monitored and met.
- Provides direction to teams on project work, setting priorities, and aligning with business needs.
- Guides teams on adjusting plans to accommodate resource or timeline changes.
Collaboration & Partnership:
- Drives cross-functional partnerships to align on expectations and shared objectives across multiple teams.
- Coaches team members to develop strategic relationships with business leaders, stakeholders, and external partners to foster collaboration and long-term success.
- Promotes inclusivity by actively seeking and listening to diverse perspectives, ensuring others feel heard and respected.
Problem Solving:
- Provides direction to multiple teams on addressing complex operational and/or technical issues, as well as guidance on analyzing complex data and/or information to identify solutions.
- Reviews and provides insights into unresolved or critical issues, helping teams to identify potential solutions.
Continuous Learning:
- Models continuous learning to deepen expertise and stay ahead of industry trends, integrating best practices into strategic planning.
- Leverages feedback to drive personal and team skill improvements.
- Identifies skill gaps across teams and empowers team members to pursue learning and knowledge-sharing opportunities that build their expertise in new areas, coaching them to apply learnings to advance the organization.
Continuous Improvement:
- Drives and oversees teams as they collaborate on, develop, and implement ideas to increase the efficiency and effectiveness of processes, protocols, and workflows within and across teams.
- Guides teams to adopt new ideas for alternative approaches and methods and encourages feedback for continued improvement.
Performance and Development:
- Drives performance across teams by providing feedback and coaching in alignment with performance management processes, guidelines, and expectations.
- Discusses development goals with team members, shares opportunities to facilitate career development, and ensures individual goals are aligned with broader organizational goals.
- Develops and manages the talent acquisition pipeline by leading candidate interviews, monitoring promotion eligibility, and/or orchestrating talent resources.