Join the team building Oracle Cloud Infrastructure's state-of-the-art observability platform, which powers visibility and operational intelligence for both OCI's internal cloud services and customers running mission-critical workloads on OCI. OCI Monitoring and Logging serve as foundational platforms used by OCI engineering teams to operate and troubleshoot hundreds of cloud services, while also enabling customers to monitor, analyze, and gain insights into their own applications and infrastructure. This unique position offers the opportunity to build observability solutions that operate at massive scale, serving the demanding needs of OCI's own services as well as a global customer base. Our team tackles some of the industry's most challenging distributed systems problems, including high-throughput telemetry ingestion, large-scale data processing, cost-efficient storage, low-latency query execution, multi-tenant reliability, and operational excellence. If you are passionate about building cloud-native observability platforms that power both the cloud itself and the customers who depend on it, we'd love to talk to you.
Internal Responsibilities
* Lead the design, development, and operation of cloud-scale observability platforms supporting metrics, logs, traces, and related telemetry data.
* Architect and implement highly scalable, resilient, and cost-efficient telemetry collection, ingestion, processing, storage, and query systems.
* Drive the evolution of end-to-end observability pipelines, from instrumentation and data collection through real-time analytics and long-term retention.
* Design and optimize distributed systems capable of ingesting and processing massive volumes of telemetry data with stringent latency and availability requirements.
* Develop scalable storage and indexing solutions for high-cardinality metrics, large-scale log analytics, and distributed tracing workloads.
* Build and enhance query, search, and retrieval services that deliver fast, reliable, and intuitive access to observability data.
* Collaborate with product management, architects, SREs, and engineering teams to define and deliver next-generation observability capabilities.
* Identify and resolve performance bottlenecks across the observability stack, including ingestion, storage, indexing, aggregation, and query execution.
* Design systems with a strong focus on reliability, fault tolerance, scalability, security, and operational excellence.
* Drive technical strategy and architectural decisions for observability services operating in hyperscale cloud environments.
* Mentor senior and junior engineers, provide technical leadership, and foster engineering best practices across the organization.
* Partner with service teams to improve instrumentation, telemetry quality, and operational visibility across cloud services.
* Establish and monitor key service health, scalability, performance, and cost-efficiency metrics for observability platforms.
* Lead troubleshooting and root-cause analysis efforts for complex distributed systems and large-scale production environments.
* Stay current with emerging trends, technologies, and best practices in observability, distributed systems, data processing, and cloud-native architectures.
Minimum Qualifications:
B.S., M.S., or Ph.D. in Computer Science or equivalent
8+ years of experience in the industry
Programming languages: Java, Go, C, C++, Python
Experience working with the following:
Cloud-scale products and services
Multi-tenant services
Concurrent Programming
Open source technologies for development and management
Cloud technologies
Full product/service development and operations lifecycle
Strong communication and analytical skills
Able to adapt to fast-changing requirements
Preferred Qualifications:
Experience designing and developing observability solutions (metrics, logs, traces)
Experience with tools such as Terraform
Experience with the performance, scalability, reliability, and recovery of large-scale distributed systems