Site Reliability Engineering (SRE) experience is preferred. Experience with both Windows and Unix (Linux/AIX) is required. This person will lead initiatives for both disciplines and will perform heavy business analysis for operational needs, along with scripting in Ansible (required skill set) or PowerShell. The candidate should have a minimum of 5 years of experience in a systems engineering role for either OS discipline, as well as experience working in a delivery organization or as part of an IT operations team. Dev/App/Tech Ops experience is a bonus.
Job Description:
The SRE is an integral part of our engineering organization and is key to creating and driving a culture of automated solutions that enables the business to sustain operations and product delivery. The SRE will be responsible for operational stability and performance reliability for one or more critical business functions.
Essential Duties And Responsibilities:
Operational Stability and Reliability Engineering
• Define and standardize toolsets and technology used for daily operations support, service delivery, and enablement of application development
• Standardize monitoring disciplines for end-to-end application or service monitoring and proactive alerting of business-critical applications
• Identify platform or application bottlenecks/defects and work with key stakeholders to drive remediation efforts
• Design and architect operational solutions for the management of applications and infrastructure, with specific goals around increasing automation, repeatability, and consistency of operational tasks
• Partner with internal engineering teams, using a waterfall or agile project delivery methodology, to support various business needs
• Manage work efforts split among operational, app, dev, and delivery engineering work, with a strong focus on production availability
• Prioritize work cycles to ensure that the operational needs of assigned applications/platforms are addressed as needed. Assist management with monthly operational performance reviews with key stakeholders.
Problem & Performance Management
• Participate in on-call duties to triage, solve, and drive automated responses to problems in business-critical services
• Create and maintain monitoring technologies and processes that improve visibility into our applications' performance and meet or exceed defined business metrics
• Partner with other internal engineering teams to develop plans for risk and vulnerability remediation
• Automate processes and systems configuration/deployment
• Monitor and report on SLA/SLO for business-critical applications. Work with business partners and product owners to establish key performance indicators.
• Work with Application Development to ensure that assigned applications/platforms have an appropriate level of monitoring and metrics in place
Release Planning and Coordination
• Work with various engineers, architects, and leadership to develop the long-term Site Reliability Engineering roadmap, which encompasses infrastructure, tools, and application lifecycle management
• Work with Release Management and business development teams to deploy software releases & updates
• Work with business partners and internal engineering teams to properly plan, coordinate, and announce all change releases. Execution, validation, and rollback strategies must be clearly defined, understood, and signed off on prior to implementation.
Operational Readiness
• Ensure that business applications & platforms are operationally ready for production. This includes being able to read monitoring dashboards and ensuring that all SOPs/knowledge articles are in place for issues that could prevent the start of the business day.
• Assist with business unit application or infrastructure go-live events
• Review SOP/knowledge articles on a monthly basis for any new feature launch or other significant change that may impact support documentation.
• Assist with training of Command Center and Application 1st level Support on new SOPs, knowledge articles, and any other support-related needs.
• Perform monthly capacity analysis of applications & platforms, including tracking of end-of-life assets for tech refresh opportunities.
QUALIFICATIONS:
Minimum Requirements:
• Bachelor's degree in business, computer information systems, computer science, or a related field
• 2+ years of experience in supporting and maintaining 24x7 available distributed environments
• 2+ years of experience in maintaining Unix/Windows environments under PCI compliance or similar security requirements
• In lieu of a degree, 4+ years of experience in Information Technology or a related field
Required Knowledge & Skills:
• Experience enhancing and maintaining complex software & web-application environments
• Experienced in the latest DevOps skills and methodologies, including creating and managing continuous build, integration, test, and deployment systems
• Proficient in monitoring, alerting, analyzing, and troubleshooting large-scale distributed systems
• Experience with designing and supporting solutions focused on high availability, resiliency, and scaling
• Familiar with OS tuning, optimization, and system requirements for vertical scaling
• Basic understanding of networking concepts and protocols
• Experience with one or more of the following automation/scripting tools: Chef, Puppet, Ansible, Salt, Python, PowerShell
• Continued curiosity regarding new technologies and evolving best practices
• Familiar with industry Cloud technologies – PCF, Amazon Web Services, Microsoft Azure
• OS knowledge with an emphasis on Windows Server, Red Hat, Oracle Linux, AIX
• Experience maintaining GitHub repositories
• Fundamental knowledge of REST services
• Ability to multi-task and context switch in a high-performing environment
Recommended Skills
Automation
Dev Ops
Scripting
Site Reliability