Sr. Site Reliability Engineer - American Recruiting & Consulting Group - Boca Raton, FL

Kate

Administrator
Forum staff
Sr. Site Reliability Engineer

ARC Group, a Forbes-ranked top 20 recruiting and executive search firm, has a remote contract opportunity for a Sr. Site Reliability Engineer with our client, a large retailer in Boca Raton, FL.

Ideal candidates should have advanced coding skills in Java, Go, Python, Shell and YAML, preferably with a minimum of 3-5 years of experience in all of these or similar languages, and be available to work EST business hours.

Summary:
The role of the Sr. Site Reliability Engineer is to build and enforce reliability elements in technology solutions that deliver an exceptional customer experience. As part of the client's Site Reliability Engineering team, you'll leverage your development background to promote a framework that delivers optimal levels of performance and reliability throughout the client's systems and services. This is an opportunity to shape and strengthen our SRE practice, serving as a key contributor to a versatile, high-velocity team. You will collaborate with our product teams and software developers to improve the resiliency of our applications through development based on reliability requirements, ensure deployment consistency throughout our Technology organizations, and apply your operational excellence to provide stability across our customer-facing sites and services.

Primary Responsibilities:
  • Independently design, implement, productionize and maintain site reliability guidelines, processes and systems
  • Service Level Definition, Configuration and Measurement:
      • Define SLIs, SLOs & SLAs specific to each application or system
      • Configure monitoring & alerting tools suitable for each product and/or platform team
      • Measure reliability & resilience (through pre-defined SLIs & SLOs) using monitoring/alerting tools to drive continuous improvement based on data analysis
  • Incident Management:
      • Facilitate incident response by engaging the relevant teams and stakeholders, while providing robust communication and visibility to the organization during service interruptions
      • Provide Root Cause Analysis for failures
      • Use a modern incident management platform (OpsGenie) to effectively drive incident response and problem resolution
  • Monitoring & Alerting:
      • Debug defects and develop dashboards using modern monitoring tools (e.g. New Relic, Splunk, AIOps) to reduce MTTD (mean time to detect) and MTTR (mean time to resolve)
      • Build monitors and alerts designed to manage SLAs, optimize performance and minimize outages
      • Construct E2E customer journey dashboards and alerts for customized transactions and applications
  • Automate reliability requirements into system and application implementations and updates, including the implementation of self-healing solutions (Ansible, Terraform, etc.)
  • Work with the product management team to contribute to 1) the identification of reliability features & requirements and 2) level-of-effort estimates

Required Experience/Skills:
  • Ideal candidates should have advanced coding skills in Java, Go, Python, Shell and YAML, preferably with a minimum of 3-5 years of experience in all of these or similar languages.
  • Candidates should have 3+ years of experience in SRE and in one or both of the following roles: DevOps, Software Engineering, leveraging automation extensively to achieve key deliverables.

To see all of our jobs, please visit http://www.arcgonline.com
ARC Group is proud to be an equal opportunity workplace dedicated to pursuing and hiring a diverse workforce.
 