Site Reliability Engineer - Platforms

Toyota • Plano, Texas, United States of America

Overview

Who we are

Collaborative. Respectful. A place to dream and do. These are just a few words that describe what life is like at Toyota. As one of the world’s most admired brands, Toyota is growing and leading the future of mobility through innovative, high-quality solutions designed to enhance lives and delight those we serve. We’re looking for talented team members who want to Dream. Do. Grow. with us.

An important part of the Toyota family is Toyota Financial Services (TFS), the finance and insurance brand for Toyota and Lexus in North America. While TFS is a separate business entity, it is an essential part of this world-changing company, delivering on Toyota's vision to move people beyond what's possible. At TFS, you will help create a best-in-class customer experience in an innovative, collaborative environment.

Toyota does not offer support or sponsorship of job applicants for employment-based visas or any other work authorization for this role now or in the future. You must have the right to work in the United States and not require Toyota support or sponsorship for immigration-related employment (e.g., H-1B, O-1, E-3, H-1B1, TN, F-1 OPT, F-1 STEM OPT, F-1 CPT, ‘job flexibility benefits’ (also known as I-140 or Adjustment of Status portability), etc.) now or in the future. You should not apply for this role if you will require Toyota to assist with immigration support or sponsorship now or in the future.

Who we’re looking for

The Toyota Financial Services Technology Operations Center is looking for a passionate and highly motivated Site Reliability Engineer (SRE) - Platforms.

The SRE - Platforms reports to the Manager of the SRE Department.

In this role, you will apply software engineering principles to ensure the availability, performance and stability of TFS’s enterprise platforms and infrastructure services.

You will play a key role in maintaining and modernizing our infrastructure platforms, including the AWS cloud platform and core operating platforms such as Linux and Windows.

What you’ll be doing

Manage and maintain operating systems across Red Hat Enterprise Linux (RHEL), Amazon Linux, and Windows Server environments
Perform OS-level configuration, hardening, and lifecycle management following industry best practices and organizational security standards
Manage user access, permissions, file systems, storage, networking, and core OS services across platforms
Coordinate with relevant teams on maintenance and change management processes as needed
Build, update, own, and maintain the end-to-end patch management lifecycle across all supported operating systems
Maintain tooling and workflows for automated patch scheduling, compliance reporting, and remediation tracking
Ensure patch compliance targets are consistently met and documented
Work with tools such as Red Hat Satellite, AWS Systems Manager (SSM), WSUS, Ansible, or similar patch management platforms
Design and maintain observability setups including metrics, logging, and alerting for all managed systems
Ensure all systems are instrumented with appropriate monitoring agents and are integrated into centralized observability platforms
Define and maintain meaningful alerting thresholds, dashboards, and runbooks to provide operational visibility
Proactively identify gaps in monitoring coverage and address them before they impact reliability
Participate in incident triage and use observability data to drive faster resolution
Manage and maintain backup and restore solutions, such as Cohesity and AWS Backup, for operating systems and critical data
Regularly test and validate restore procedures to ensure reliability and meet defined Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO)
Document backup policies, schedules, and recovery procedures
Identify and remediate failures in backup jobs and ensure alerts are in place for backup health monitoring
Write and maintain scripts and automation workflows to reduce manual toil and streamline operational tasks (e.g., provisioning, configuration management, log rotation, disk cleanup, service restarts)
Develop and implement self-healing mechanisms for common, well-understood system issues such as service crashes, disk space alerts, memory pressure, and connectivity failures
Use tools such as Bash, Python, PowerShell, Ansible, or Terraform to automate repeatable operational workflows
Contribute to internal automation libraries and maintain version-controlled infrastructure code
Troubleshoot complex production issues and implement permanent fixes to improve reliability
Build and maintain the components required to automate operational workflows and reduce toil, using Python or an equivalent scripting language
Participate in capacity planning, disaster recovery, and business continuity exercises
Define and manage SLIs/SLOs, health checks, and automated remediation processes
Collaborate across teams to ensure service reliability, deployment hygiene, and operational readiness
Contribute to incident postmortems and coordinate the fixes needed to prevent recurring incidents
Participate in on-call rotations and major incident restoration

What you bring

Bachelor’s degree in information technology or a related field
Solid understanding of SRE concepts: SLIs, SLOs, error budgets, incident response
Hands-on experience managing RHEL, Amazon Linux, and/or Windows Server in production environments
Solid understanding of Linux/Windows system administration fundamentals (file systems, networking, processes, services, permissions)
Experience with patch management tools and processes (e.g., Red Hat Satellite, AWS SSM Patch Manager, WSUS, Ansible)
Familiarity with monitoring and observability tools such as Dynatrace and CloudWatch
Experience with backup solutions such as Cohesity and AWS Backup, and with restore testing practices
Scripting proficiency in one or more of: Bash, Python, PowerShell
Understanding of automation frameworks such as Ansible or similar configuration management tools
Good troubleshooting and root cause analysis skills
Ability to write clear technical documentation and runbooks
Strong understanding of SRE principles (SLIs/SLOs, error budgets, observability, toil reduction)

What we'll bring

During your interview process, our team can fill you in on all the details of our industry-leading benefits and career development opportunities. A few highlights include:

• A work environment built on teamwork, flexibility and respect
• Professional growth and development programs to help advance your career, as well as tuition reimbursement
• Team Member Vehicle Purchase Discount
• Toyota Team Member Lease Vehicle Program (if applicable)
• Comprehensive health care and wellness plans for your entire family
• Toyota 401(k) Savings Plan featuring a company match, as well as an annual retirement contribution from Toyota regardless of whether you contribute
• Paid holidays and paid time off
• Referral services related to prenatal services, adoption, childcare, schools and more
• Tax Advantaged Accounts (Health Savings Account, Health Care FSA, Dependent Care FSA)
• Relocation assistance (if applicable)

Belonging at Toyota

Our success begins and ends with our people. We embrace all perspectives and value unique human experiences. Respect for all is our North Star. Toyota is proud to have 10+ different Business Partnering Groups across 100 different North American chapter locations that support team members’ efforts to dream, do and grow without questioning that they belong.

Applicants for our positions are considered without regard to race, ethnicity, national origin, sex, sexual orientation, gender identity or expression, age, disability, religion, military or veteran status, or any other characteristics protected by law.

Have a question, need assistance with your application, or require any special accommodations? Please send an email to talent.acquisition@toyota.com.