System Development Manager, AWS Resilience, AWS Incident Response Jobs in Dublin, Leinster, Ireland

System Development Manager, AWS Resilience, AWS Incident Response - ENGINEERINGUK

Dublin, Leinster, Ireland
via BeBee.com

Job Description

Job Summary: Manage automated tooling roadmaps and delivery for the detection and resolution of issues within AWS and Amazon infrastructure.

AWS Resilience owns services to prevent and respond to availability and security issues for all AWS Services. We're the people who keep the cloud running, working on challenging problems with a constant stream of new services and possible failure modes to prevent. Our team is diverse, with software and security experts, operations managers, and other vital roles. We collaborate with people across AWS to deliver the highest standards for safety, security, and availability.

AWS Incident Response is at the heart of the high availability of Amazon Web Services. We make customer-impacting events shorter and less frequent by driving large-scale event and incident response. Our automated tooling quickly identifies the cause of an issue and helps mitigate its impact. We also provide manual incident management for AWS and other Amazon groups, directing the resolution of an issue with service teams, and diving deep into those events to drive improvements to the tooling.

As a System Development Manager, you will manage automated tooling roadmaps and delivery for the detection and resolution of issues within AWS and Amazon infrastructure. You will also spend a portion of your time ensuring your team efficiently directs the resolution of high-visibility incidents on conference calls and across global teams. Using data learned from those incidents, you will drive further improvements into our automation, tooling, and processes so that the next event is shorter or avoided entirely.

Key Responsibilities:

Define and Deliver Business Priorities: You will be a key contributor and owner of the direction of the global AWS Incident Response team. You will define, plan, track, and deliver on strategic goals for the team, while ensuring that the team remains unblocked and focused.
Cross-Site, Cross-Team Coordination: You will be responsible for coordinating with your counterparts to ensure that a clear communication channel exists between AWS Operations teams. You will also work closely with systems and product teams to create and maintain proper processes for monitoring and alarming on services.
Incident/Change Management: You will be the point of contact for inquiries regarding engagement processes and issues within the global Amazon platform during your team's coverage. Responsibilities include delegation of emergent engagement issues to team members, driving initiatives regarding improvements to existing tools & processes, and providing feedback on new practices & procedures in order to scale with the rapid expansion of the AWS Services and customer base.
Performance Management/Team Health: You will own all facets of performance and career management for the team.

BASIC QUALIFICATIONS:

5+ years of direct experience with cloud hosting technologies (AWS, Azure, etc.) / 5+ years of experience managing an engineering team operating at scale.
Deep understanding of infrastructure delivered through the software development lifecycle in an API-enabled environment - including agile development, software patterns, and modern cloud services.
Experience in implementing, supporting, and evaluating tools and services with a security, scalability, and performance mindset.
Ability to handle multiple competing priorities in a fast-paced environment.
Excellent written and verbal communication skills and ability to get ideas across to the team, peers, and customers.

PREFERRED QUALIFICATIONS:

Strong understanding of fundamental operational best practices such as monitoring, alerting, deployment, and change policies.
Experience running agile frameworks or other workflow methodologies in a DevOps setting.
Experience dealing with customers during issue resolution and operating under pressure.
Routine communication of status to senior management.
SLA definition and refinement.
Goal-setting for reduction and elimination of customer-facing defects.
Leading post-mortem analysis, including ensuring a high quality bar for analysis and follow-through on consequent action items.

Amazon is an equal opportunities employer. We believe passionately that employing a diverse workforce is central to our success. We make recruiting decisions based on your experience and skills. We value your passion to discover, invent, simplify, and build.
