The Senior Principal AI Agent / ML Software Engineer is a Senior Staff-level, hands-on technical leadership role responsible for defining, building, and operating next-generation AI systems on Oracle Cloud Infrastructure (OCI). This person will set architecture and engineering direction for production-grade agentic AI platforms, autonomous workflows, scalable inference infrastructure, and enterprise AI applications used in large-scale, business-critical environments.
This role requires a proven engineer who can translate ambiguous product and platform goals into durable technical strategy, lead multi-team execution without direct authority, and remain deeply hands-on in design, code, reviews, operations, and incident follow-up. The ideal candidate combines deep distributed-systems experience with practical AI-native engineering, including orchestration of LLMs, tools, APIs, memory, retrieval, evaluation, guardrails, and cloud services. This person is expected to ship, scale, and operate reliable, secure, observable, and cost-aware AI platform systems while raising the technical bar for engineers across the organization.
Responsibilities
- Serve as a senior technical owner for OCI AI platform capabilities, including agent execution, inference systems, model serving, AI workflow orchestration, evaluation, and observability.
- Design, architect, and deliver scalable agentic AI systems capable of reasoning, planning, tool use, workflow execution, multi-step task orchestration, and safe human-in-the-loop escalation.
- Build production-grade services for tool calling, agent memory, context management, Model Context Protocol (MCP) integration, vector retrieval, multi-agent coordination, policy enforcement, and evaluation.
- Lead architecture across distributed services optimized for low latency, high throughput, GPU efficiency, reliability, cost, operability, and secure multi-tenant operation.
- Define service boundaries, APIs, data models, state management, consistency tradeoffs, failure modes, SLIs/SLOs, rollout strategies, and operational readiness criteria for AI platform services.
- Drive technical strategy across infrastructure, platform, security, data, and application engineering teams, converting broad goals into executable multi-quarter plans and measurable milestones.
- Integrate AI agents securely and reliably with enterprise APIs, cloud services, databases, identity systems, secrets management, and external systems.
- Establish AgentOps and LLMOps practices for tracing, monitoring, eval suites, regression testing, experimentation, safety guardrails, prompt/tool versioning, and production reliability.
- Evaluate and operationalize emerging technologies in generative AI, agentic workflows, inference optimization, long-context systems, reasoning models, AI developer tooling, and agentic-first development.
- Drive engineering excellence through code reviews, design reviews, test strategy, deployment automation, incident analysis, documentation, and AI-assisted development practices using tools such as Codex, Claude Code, Cursor, Copilot, or similar systems.
- Mentor Staff and senior engineers, raise architectural standards, and influence engineering practices across OCI without requiring direct management authority.
- Own critical production outcomes, including reliability, performance, security posture, cost efficiency, and supportability for the systems delivered.
Required Qualifications
- Bachelor's, Master's, or Ph.D. in Computer Science, AI/ML, Engineering, or a related field, or equivalent practical experience.
- 12+ years of professional software engineering experience, including significant ownership of production systems; or equivalent experience demonstrating Senior Staff / Principal-level impact.
- Proven track record as a Staff, Senior Staff, Principal, or equivalent technical leader influencing architecture and execution across multiple teams.
- Deep experience designing, building, and operating high-scale distributed systems, cloud services, infrastructure platforms, or AI/ML platform services.
- Hands-on experience with production AI systems, agentic AI applications, autonomous workflows, tool-using agents, multi-step orchestration, or multi-agent systems.
- Practical experience with orchestration frameworks such as LangGraph, LangChain, CrewAI, AutoGen, LlamaIndex, or similar ecosystems.
- Deep understanding of LLM application patterns, including prompt design, structured outputs, function/tool calling, context management, RAG, memory, tool safety, and evaluation.
- Strong programming skills in Python and ability to contribute high-quality production code, reviews, tests, and debugging in complex distributed environments.
- Strong expertise with Kubernetes, Docker, cloud-native infrastructure, service-to-service communication, scalability, fault tolerance, observability, and performance analysis.
- Experience defining SLIs/SLOs, production readiness criteria, incident response practices, monitoring, tracing, experiments, and reliability programs for AI or distributed systems.
- Strong understanding of AI safety, governance, security, and operational risks for autonomous or semi-autonomous systems, including data handling, access control, auditability, and human accountability.
- Excellent written and verbal communication, with demonstrated ability to lead technical direction, resolve ambiguity, and influence senior stakeholders.
Preferred Qualifications
- Experience optimizing large-scale GPU inference or training workloads for latency, throughput, utilization, availability, and cost.
- Experience building or operating model serving, inference gateways, agent runtimes, workflow engines, developer platforms, or internal AI productivity platforms.
- Experience integrating AI systems with enterprise APIs, databases, cloud services, vector databases, embeddings, retrieval systems, identity systems, and policy enforcement layers.
- Experience with LLM fine-tuning, long-context systems, reasoning models, model routing, caching, batching, quantization, or emerging generative AI research.
- Experience building evaluation frameworks for agentic systems, including offline evals, online experiments, golden tasks, adversarial testing, regression gates, and observability dashboards.
- Experience using AI-assisted software development tools such as Codex, Claude Code, Cursor, Copilot, or similar systems in large-scale engineering environments.
- Track record of defining architectural standards, platform capabilities, or engineering practices adopted across multiple teams or organizations.
- Experience in enterprise, cloud infrastructure, regulated, security-sensitive, or mission-critical environments.