Team Lead - Site Reliability Engineering

Arbuthnot Latham has been associated with banking since 1833. We combine private and commercial banking, wealth planning and investment management. We believe in traditional relationship and service-led banking powered by modern technology.

Job purpose

The Team Lead - Site Reliability Engineering is responsible for ensuring the effective and efficient running of the current NOC team, with a view to transitioning to an SRE function over time.

The team is responsible for enabling innovation and velocity of change while ensuring system reliability, focusing on the critical features and functionality within products and platforms. It collaborates with the business or product owners to prioritise operational requirements by defining service-level indicators (SLIs) and service-level objectives (SLOs) to monitor and optimise customer journey and experience. Its goal is to design and operate scalable, resilient systems utilising software engineering principles. It brings skills and expertise to automating manual tasks (TOIL) in such areas as incident management, problem management, change management and release management, provides operational insights through monitoring and observability, and supports other aspects involved in preparing and optimising automated delivery solutions.

To place the interests of customers at the centre of all activities, to act in a way that is consistent with achieving good outcomes for consumers, and to comply with the FCA and PRA's Conduct Rules.

Key Responsibilities:
- Lead, manage and motivate the team.
- Ensure the team are following best practice across all disciplines.
- Have oversight of team tasks including investigation, troubleshooting, diagnosis, resolution and recovery to minimise impact to services.
- Audit the Engineers’ calls and tickets for quality assurance and provide feedback and coaching as required.
- Drive a culture of Customer Excellence and Continual Service Improvement within the team.
- Identify, develop, communicate and implement process changes within the team.
- Act as a point of escalation for the team.

SRE responsibilities:
- Help define the SRE practice for the organisation, and collaborate with other stakeholders to select the relevant SRE principles and define the objectives and measurements of the outcomes.
- Collaborate with stakeholders such as product and platform owners to define service-level objectives (SLOs) and service-level indicators (SLIs) for system operations, focused on the critical features of the customer's journey and experience.
- Track and manage reliability performance against agreed SLOs, in partnership with other IT teams or other stakeholders, and ensure systems continue to meet SLOs over time.
- Ensure key stakeholders, product owners and platform owners are informed of reliability concerns and their potential impact on the customer experience.
- Provide expert knowledge on reliability approaches to ensure our organisation achieves its goals and roadmap for reliability.
- Champion reliability being treated as a feature in products and platforms, and promote the concept across all phases of the software development life cycle.
- Create dashboards and reports to communicate key metrics to product owners and key stakeholders.
- Design, code, test and deliver solutions to automate manual operations (i.e., “TOIL”).
- Participate in operations support and on-call rotation shifts for SRE-supported systems and products.
- Participate in or lead problem management activities, including post-mortem incident analysis and provision of technical insight, documented findings, outcomes and recommendations as part of a root cause analysis to troubleshoot priority incidents.
- Implement automation to reduce the probability and/or impact of problems recurring (possible options include automated incident response, enhanced monitoring, observability initiatives, and automation of change and release management).
- Identify, evaluate and recommend monitoring and observability tools and diagnostic techniques to improve system observability and insights, including identification of requirements, non-functional requirements, design, implementation and operationalisation.
- Participate in system design, platform management and capacity planning at launch reviews and sprint planning sessions, or product and platform architecture discussions. Ensure all operational requirements, including availability, performance and disaster recovery, are met.
- Collaborate and share lessons learned regarding reliability, performance and incidents with all stakeholders.
- Participate and exert influence in organisational learning initiatives, such as communities of practice, to share knowledge and foster a continuous learning and improvement mindset.
- Support architects working on new solutions, including analysing requirements, supporting technical architecture activities, prototyping, designing and developing reusable infrastructure artifacts, testing, implementing and preparing for ongoing support.
- Train and mentor the team to ensure SRE best practices evolve and scale successfully in the organisation.
- Shift working pattern: there is a requirement to work shifts and on-call hours.

Risk:
- Responsible for managing risks inherent to the role by diligently observing internal policies and procedures.
- Act as a point of escalation for customers and internal stakeholders as required.

Key Interfaces:
- IT Infrastructure team
- Heads of departments
- Vertical teams
- Change Management team
- All business areas across the Group
- 3rd party suppliers
- IT Service Desk

Person Specification

Knowledge/Experience/Skills:
- Line management/team leader experience
- Understanding of software engineering principles (source control, versioning, code reviews, etc.)
- Working in an environment that complies with ISO 27001, NIST, CIS Benchmarks and PCI DSS, amongst others
- Leading root cause analysis and blameless postmortems in complex environments
- Experience of communicating complex issues to senior stakeholders and technical teams
- Implementation of highly available and reliable systems, using multi-AZ and multiregional approaches
- Expertise with monitoring and observability tools (e.g., SolarWinds, Datadog, Azure/AWS native tools)
- Expertise with SLI/SLO management tools (e.g., ServiceNow)
- Expertise with incident ticketing and change management systems (e.g., ServiceNow, Ivanti)
- Expertise with automated incident response tools (e.g., PagerDuty, ServiceNow)
- Expertise with software development frameworks/languages (e.g., Java, PHP, Python, PowerShell)
- Extensive knowledge of cloud ecosystems (e.g., AWS, Red Hat OpenShift, Oracle Cloud Infrastructure, Microsoft Azure)
- Knowledge of DevOps tools, such as CI/CD tools (e.g., Azure DevOps, GitHub, GitLab, Jira, Harness, Jenkins)
- Knowledge of infrastructure-as-code approaches, role-specific automation tools and associated programming languages (e.g., AWS CloudFormation, Azure ARM, HashiCorp Terraform, Progress Chef, Perforce Puppet)
- Knowledge of orchestration tools (e.g., Cloudify, env0, Morpheus Data, Pliant, RackN, Scalr, Spacelift, Terraform for Cloud) desirable
- Cloud provider services (e.g., AWS, Azure, Oracle, regional providers)
- Operating systems (e.g., Windows and Linux, including scripting experience)
- Knowledge of scalable architectures, including APIs, microservices and PaaS, desirable
- Knowledge of architecting for resilience (e.g., HA, multi-AZ, multiregional, backup and recovery tools) desirable

Qualifications:
- Bachelor's or master's degree in computer science, information systems or a related field, or equivalent work experience
- SRE foundation course completed, and qualification gained
- Automation provider certifications

- Team Working
- Influencing Others
- Performance Focus
- Change Focus
- Working Proactively
- Problem Solving and Judgement

About Us
Life, Work and Benefits

Arbuthnot Latham is committed to equal opportunities for all staff and candidates. We embrace inclusion and diversity and understand why they are critical for the success of our business and people.

- Agile working (3 days in London office per week)
- Competitive salary, pension and holiday allowance
- BUPA Health cover
- 4x Life Assurance
- Discretionary bonus
- Market leading maternity/paternity and menopause policies

Data Privacy and Reasonable adjustments

We take the security of your data seriously. For more detail on how we may keep your data please refer to our Privacy Notice.

Reasonable adjustments: Please let us know of any adjustments or arrangements that you may need to help you apply to this role or that will help you during the recruitment process. If you wish to discuss any particular requirements or concerns you have because of a disability or medical condition, please contact us at recruitment@arbuthnot.co.uk. Information you provide about any disability or medical condition will remain confidential unless it is necessary to disclose it to other members of staff or outside agencies to ensure the health and safety of yourself and others, or to implement the adjustments you require. In these circumstances we will first discuss with you how and to whom the information may be disclosed.