Director of Site Reliability Engineering The Judge Group Plano, TX Full Time

Kate · 23 May 2021

Location: Plano, TX
Description: Our client is currently seeking a Director of Site Reliability Engineering.

As Director of Site Reliability Engineering at PepsiCo, you will apply your engineering leadership skills and knowledge of infrastructure and software development to drive ultra-scalable and highly reliable software systems for PepsiCo’s products and services. Site Reliability Engineering (SRE) holds responsibility for the big picture: determining how our systems relate to each other and using a breadth of tools and approaches to solve a broad spectrum of problems. Practices such as limiting time spent on operations, blameless postmortems, and proactive identification of potential outages factor into iterative improvement, which is key to both product quality and interesting, dynamic day-to-day work. SRE’s culture of diversity, intellectual curiosity, problem solving, and openness unlocks its success. We encourage collaboration, thinking big, and taking risks in a blame-free environment.

The selected candidate will be responsible for leading a highly skilled team of Site Reliability Engineers who work in a consulting role with business units across sectors. The goals of these engagements are to enable SRE practices and principles that improve the reliability of PepsiCo’s systems, focusing on implementing Service Level Objectives (SLOs), promoting a blameless culture, and driving engineering over toil. Additionally, the candidate will work alongside Availability teams to troubleshoot and resolve outages across large-scale distributed systems.

As SRE lead, you will drive Reliability Engineering strategy and communications across PepsiCo IT, and contribute to and execute long- and short-term IT strategic direction. You will provide high-level consultation and influence on technical solutions, advise senior management, and provide input to aid decision making.
•Build an SRE Center of Excellence (CoE) to coordinate and drive initiatives to improve reliability of services across multiple Engineering teams.
•Drive initiatives across multiple services, such as automation, Chaos Engineering, defining critical metrics (SLIs/SLOs/SLAs), reducing MTTR and MTTD, EDA (Event Driven Automation), AI/ML techniques, improving end-to-end observability, and defining service launch readiness criteria.
•Develop strategic directions, workforce plans, and organizational structure for the site reliability engineering team.
•Partner with peers and SRE leadership to help set team-level goals, objectives, and overall strategies, and provide unwavering support for SREs on projects through collaboration and by leading reviews of processes and architecture design.
•Educate and drive global adoption of automation and orchestration principles, and create an eagerness to automate wherever and whenever the possibility arises.
•Manage, lead, retain, and grow a team of Site Reliability Engineers (SREs). Mentor and coach junior SREs, and be a driver for change and SRE adoption across the broader organization.
•Work cross-functionally in close partnership with the product development team to guide product engineering to build fast, reliable, and durable production systems.
•Lead an organization that is responsible for reliable operation, automation, and evolution, in collaboration with DevOps, I&O, and Enterprise Architecture teams.
•Drive and promote protocols on production readiness and operational excellence
•Partner with product engineering teams to debug production outages and carry out action items to improve the reliability of those systems
•5+ years of Site Reliability Engineering management experience leading technical teams in a globally dispersed enterprise organization
•Leadership experience in a software engineering organization with dozens of stakeholders and conflicting priorities
•10+ years as a software engineering leader in a Platform or Application area, with a passion for applying software development principles to scalability, resiliency, performance, and security
•Ability to build technical and execution credibility with engineers
•Hold yourself and others around you to high standards when working with production
•Ability to identify root causes of instability in a large-scale distributed system, across stacks
•Ability to identify time-consuming and error-prone manual tasks and then manage the building of tooling to automate them
•Experience with engineering operations processes, implementing ITSM tools and workflows for incident response, change, and problem management
•Solid understanding of large-scale complex systems from a reliability perspective
•Experience with cloud-based solutions such as Microsoft Azure, Amazon Web Services (AWS), or Google Cloud
•Experience with a range of SRE/DevOps technologies, including Kubernetes/orchestration, software delivery pipelines, configuration and service discovery, and cloud provider platform services
•Operational knowledge of various data stores such as MongoDB, Postgres, Redis, Cassandra, and Elasticsearch
•BS or MS in a technical engineering discipline

This job and many more are available through The Judge Group.

Recommended Skills

Engineering

Kubernetes

MongoDB

Operations

Communication

PostgreSQL
