Join the team building Oracle Cloud Infrastructure's state-of-the-art observability platform, powering visibility and operational intelligence for both OCI's internal cloud services and customers running mission-critical workloads on OCI. OCI Logging and Monitoring serve as foundational platforms used by OCI engineering teams to operate and troubleshoot hundreds of cloud services while also enabling customers to monitor, analyze, and gain insights into their own applications and infrastructure.
This unique position in the Logging team offers the opportunity to build observability solutions that operate at massive scale, serving the demanding needs of OCI's own services as well as a global customer base. Our team tackles some of the industry's most challenging distributed systems problems, including high-throughput log ingestion, large-scale data processing, cost-efficient storage, low-latency query execution, multi-tenant reliability, and operational excellence. If you are passionate about building cloud-native observability platforms that power both the cloud itself and the customers who depend on it, we'd love to talk to you.
Internal Responsibilities
- Lead the design, development, and operation of cloud-scale logging platforms supporting log collection, ingestion, processing, storage, indexing, search, and query.
- Architect and implement highly scalable, resilient, and cost-efficient logging systems that serve internal OCI services and external customers.
- Design and optimize distributed systems capable of ingesting, storing, and querying massive volumes of log data with stringent latency, availability, durability, and compliance requirements.
- Develop scalable storage, indexing, and retrieval solutions for high-volume logs and large-scale log analytics workloads.
- Build and enhance query, search, and retrieval services that provide fast, reliable, and intuitive access to log data.
- Drive adoption of next-generation logging storage and query architectures, including optimized storage platforms, query acceleration, and migration from legacy data paths.
- Collaborate with product management, architects, SREs, security, compliance, and engineering teams to define and deliver next-generation logging capabilities.
- Identify and resolve performance bottlenecks across the logging stack, including log ingestion, buffering, processing, storage, indexing, retention, aggregation, and query execution.
- Drive technical strategy and architectural decisions for logging services operating in hyperscale cloud environments.
- Mentor senior and junior engineers, provide technical leadership, and foster strong engineering practices across the Logging team and broader Observability organization.
- Partner with OCI service teams to improve log emission, log quality, schema consistency, operational visibility, and customer troubleshooting experiences.
- Establish and monitor key service health, scalability, performance, availability, durability, and cost-efficiency metrics for logging platforms.
- Lead troubleshooting and root-cause analysis efforts for complex distributed systems, large-scale log processing pipelines, and production incidents.
- Drive technical alignment between Logging and adjacent Observability services, including Telemetry and Log Analytics, to support unified customer experiences across logs, metrics, and traces.
- Stay current with emerging trends, technologies, and best practices in logging, observability, distributed systems, data processing, search, storage, and cloud-native architectures.
Minimum Qualifications:
B.S., M.S., or Ph.D. in Computer Science or equivalent
10+ years of experience in the industry
Programming languages: Java, Go, C, C++, Python
Experience working with the following:
Cloud-scale products and services
Multi-tenant services
Concurrent Programming
Open source technologies for development and management
Cloud technologies
Full product/service development and operations lifecycle
Strong communication and analytical skills
Able to adapt to fast-changing requirements
Preferred Qualifications:
Experience with designing and developing Observability Solutions (metrics, logs, traces)
Experience with Kafka, Lucene, Spark, Parquet, Kubernetes, Terraform
Experience with the performance, scalability, reliability, and recovery of large-scale distributed systems