Leads enterprise-wide performance monitoring and real-time operational governance, ensuring standardized processes for shift operations, event management, escalation, incident command, and communications. Oversees capacity and readiness for critical infrastructure (power, cooling, controls, life safety, and physical security), ensuring sites are resilient, compliant, and audit-ready.
Partners with executive leadership on multi-year operational, reliability, and financial targets; drives adoption of automation, telemetry, and predictive maintenance to reduce risk and improve mean time to restore (MTTR). Establishes crisis management standards, continuous improvement mechanisms, and a culture of operational excellence, knowledge sharing, and accountability.
Leads major expansion and transformation initiatives impacting operational readiness and serves as senior liaison across regions. Oversees the full lifecycle of critical infrastructure and hardware assets (installation, maintenance strategy, spares, vendor performance, and investment governance) to optimize reliability, security, and scalability.
Internal Responsibilities
Key Responsibilities
24/7 Mission Critical Operations Leadership
Owns operations against a 100% uptime target for a portfolio of very large, complex data center sites, ensuring consistent execution of shift coverage, operational handoffs, and standardized runbooks.
Establishes and governs the Mission Critical Operations (MCO) operating model: command structure, on-call rotations, escalation paths, and service-impacting event response.
Ensures operational readiness for high-severity incidents through drills/tabletops, incident commander training, and continuous improvement of response playbooks.
Performance Monitoring, Controls, and Reliability
Defines the enterprise strategy for real-time monitoring and operational health across the portfolio (BMS/EPMS/SCADA/telemetry), aligning KPIs to uptime, reliability, safety, and customer outcomes.
Drives operating rhythms for reviewing availability, MTTR/MTBF, alarm quality, repeat events, maintenance effectiveness, and risk posture.
Establishes standards for preventive and predictive maintenance, MOP/SOP/EOP quality, change control, and operational compliance.
Incident, Problem, and Crisis Management
Governs standards for event triage, incident command, escalation, stakeholder communications, and customer-impacting notifications.
Leads post-incident reviews for P1/P0 events, ensuring root cause analysis (RCA) quality, corrective/preventive actions (CAPA), and verified closure.
Operates as executive escalation point for highly complex incidents and cross-regional reliability risks.
Capacity, Resiliency, and Site Readiness
Oversees evaluation of power, cooling, physical space, network/support infrastructure, and security capacity, ensuring readiness for load growth and peak conditions.
Ensures resiliency standards are met (redundancy, maintenance windows, failover testing, generator/UPS readiness, fuel strategy as applicable).
Directs operational risk assessments and ensures sites remain audit-ready and compliant with applicable standards and internal controls.
Automation and Operational Tooling
Drives adoption of automation for alarm correlation, workflow orchestration, remote operations, and predictive analytics to reduce human error and improve response times.
Standardizes data quality and instrumentation required for high-confidence operational decision-making.
Expansion, Launch, and Transformation (Operational Readiness Focus)
Leads operational support for expansions/new builds/site launches, ensuring Day-0/Day-1 readiness, staffing, training, spares, procedures, and turnover acceptance criteria.
Partners with engineering and construction to embed operability, maintainability, and safety into design and commissioning.
Asset Lifecycle, Vendors, and Investment Governance
Oversees lifecycle strategy for critical infrastructure and supporting hardware assets: installation, maintenance, spares, logistics, inventory, and decommissioning.
Establishes enterprise standards for vendor performance, SLAs, service quality, and compliance; drives corrective actions where performance gaps exist.
Approves and manages multi-million dollar investments in upgrades, capacity expansion, reliability improvements, and risk remediation.
Core Leadership Responsibilities
Planning & Execution
Provides strategic oversight for mission-critical operational initiatives, ensuring priorities reflect reliability risk, customer impact, and compliance needs.
Collaboration & Partnership
Sets direction and builds strong partnerships with engineering, construction, security, network/IT, program management, and business stakeholders to ensure reliable 24/7 delivery.
Problem Solving
Continuous Learning / Improvement
Champions operational excellence through training programs, certifications, drills, and a sustained improvement roadmap aligned to availability and risk reduction.
Performance and Development
Builds and develops a high-performing 24/7 operations organization, including shift leaders, incident commanders, and regional operations management.
This role supports a 24/7/365 environment and requires participating in and managing incident response and team leadership across all shifts.
Holds explicit accountability for life safety and safe work practices (LOTO, energized work policies as applicable).