Our client is looking for a Senior Site Reliability Developer - Technical Lead to join our rapidly growing technology team. The Senior SRE-TL will join the SRE squad and will be responsible for keeping all user-facing services and other production systems running smoothly. The Senior SRE - Technical Lead will be accountable for the reliability, scalability and resilience of complex infrastructure components.
MPI does not discriminate on the basis of race, color, religion, sex, sexual orientation, gender identity or expression, national origin, age, disability, veteran status, marital status, or based on an individual's status in any group or class protected by applicable federal, state or local law. MPI encourages applications from minorities, women, the disabled, protected veterans and all other qualified applicants.
Description
Team leadership, knowledge sharing & coaching - 25%
- Enforce an effective and efficient scrum process where all team members work in the same direction
- Guide SRE engineers, when needed, to break down user stories into manageable tasks
- Propose and drive a development process that emphasizes quality through code reviews, automated testing, continuous integration pipelines and documentation
- Develop a deep understanding of the team's roadmap and influence it with fact-based technical arguments
- Ensure proper documentation of team activities
- Ensure that demos of developed features are well prepared and presented to stakeholders
- Review pull requests and documentation with the objective of guiding and upskilling junior developers on various technical/SRE topics
- Provide fact-based technical feedback on each squad member to managers as part of the evaluation cycle
- Actively contribute to SSENSE University, the internal peer learning platform, to promote continuous learning
- Participate in the onboarding of new developers
- Mentor junior team members in all areas, and other SREs in areas of deep knowledge
- Set an example for a team of SREs with positive and inclusive leadership and discussion on work
- Be trusted to de-escalate conflicts within the team
- Handle emergency response either by being on-call or by reacting to symptoms according to monitoring and escalation when needed
- Be accountable for maintaining and improving documentation on site reliability measures, either in application documentation or in runbooks, explaining the issues encountered and the solutions implemented
- Actively identify and implement opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation
- Identify parts of the system that do not scale, provide immediate palliative measures and drive long-term resolution of the underlying issues
- Improve the SSENSE codebase by resolving issues
- Optimize cloud cost and reduce system resource usage by setting clear requirements through efficiency and capacity planning
- Plan, design and execute solutions within the infrastructure team to reach specific goals agreed upon
- Share the learnings publicly, either by creating issues that provide enough context for anyone to understand them or by writing blog posts
- Propose ideas and solutions within the infrastructure team to reduce the workload through automation
- Identify Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives
- Run blameless root cause analyses (RCAs) on incidents and outages, aggressively looking for answers that will prevent the incident from ever happening again
- Anticipate the technical challenges the squad will face when delivering solutions and propose and implement technical solutions to those issues
- Write testable, efficient, and reusable code suitable for continuous integration and automated deployments, that respects best practices and SSENSE development standards
- Raise the bar for professional SRE engineers, lead by example, and help others learn the craft through rigorous code reviews and coaching
- Be accountable for performance, reliability, scalability and resilience of complex and critical infrastructure components (web servers, data stores, hosted services, load balancers, etc.) through the proper use of replication, sharding, load balancing, monitoring, SLAs, alerting, and auto-scaling
- Be an active participant in the incident escalation chain and prompt resolution
- Upgrade and patch systems as required while ensuring availability of service
- Contribute to cross-squad initiatives, acting as a change agent amongst peers to foster adoption of new processes or technical solutions
- Bachelor's degree in Computer Science, Engineering, or a related technical field; a Master's degree is an asset
- Minimum 8 years of experience working as an SRE
- A minimum of 8 years of experience administering Linux-based environments (Red Hat, CentOS, Debian or Ubuntu)
- A minimum of 8 years of experience with service-oriented architectures and microservices
- Must have at least 2 years of experience working in an Agile development life cycle
- A minimum of 8 years of experience practicing continuous integration and continuous delivery
- Minimum 5 years of experience with infrastructure automation frameworks in at least two of these technologies: SaltStack, Terraform, or Cloud Foundation engine
- Expertise in infrastructure to support a microservice architecture
- A minimum of 4 years of experience in infrastructure as code, specifically with Terraform
- Strong knowledge of caching technologies (Fastly, Redis) with the ability to identify opportunities for improvement
- Expertise with RDBMS (MySQL, PostgreSQL) and NoSQL (DynamoDB, DocumentDB, MongoDB) databases at scale
- Proficiency with cloud resources (AWS), with the ability to operate them for the components owned; certification preferred
- Ability to use containers and orchestration frameworks (Kubernetes, Docker, container registries, etc.)
- Proficiency in Git
- Must have at least 4 years of experience with Kubernetes; experience with Amazon EKS or ECS is nice to have
Working in an agile environment, our squads are made up of experienced innovators in Product Management, QA, Design, DevOps, Software Development, Machine Learning, Data Engineering, and Security. Headquartered in Montreal, our technology organization has been growing at a rate of 2X year-over-year and is doubling once again in 2021 as we expand across Canada, the US, and Europe.
Michael Page