SRE (Application Support + Dev-Ops + Automation) - Fulcrum Digital
  • N/A, Other, Ireland
  • via ClickaJobs (1)
-
Job Description

SRE (Application Support + Dev-Ops + Automation)
Dublin 2, Ireland | Posted on 09/26/2024

Who are we

Fulcrum Digital is an agile, next-generation digital accelerating company providing digital transformation and technology services from ideation to implementation. These services apply across a variety of industries, including banking & financial services, insurance, retail, higher education, food, healthcare, and manufacturing.

The Role

• Plan, manage, and oversee all aspects of a production environment.
• Define strategies for application performance monitoring and optimization in the production environment.
• Respond to incidents, improve the platform based on feedback, and measure the reduction of incidents over time.
• Support deployment of code into multiple lower environments.
• Support current processes with an emphasis on automating everything as soon as possible.
• Design, develop, and standardize monitoring and alerting mechanisms for the supported applications.
• Take a holistic approach to problem solving by connecting the dots across the technology stacks that make up the platform during a production event, to optimize mean time to recover.
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
• Analyze ITSM activities of the platform and provide a feedback loop to development teams on operational gaps or resiliency concerns.
• Support services before they go live through activities such as system design consulting, capacity planning, and launch reviews.
• Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead in DevOps automation and best practices.
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
• Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
• Work with a global team spread across tech hubs in multiple geographies and time zones.
• Share knowledge and explain processes and procedures to others.
• Mentor junior team members.
• Perform on-call duties on a rotational basis.
• Occasional off-hours work is required.
• Candidates should have an inclination for training, be good trainers, and be ready to mentor others.

Requirements

Skills: Must Have

• Experience in REST and Web API support.
• Experience in supporting cloud-based applications.

Skill Category

• Linux & shell scripting
• Monitoring tools: Splunk, Dynatrace, or other
• Jenkins
• Git/Bitbucket

Site Reliability Engineering:

  o Serve as the primary contact responsible for ensuring application scalability, performance, and resilience.
  o Practice sustainable incident response and blameless post-mortems while taking a holistic approach to problem solving and optimizing time to recover.
  o Automate data-driven alerts to proactively escalate issues. Work with development teams to establish SLOs and improve reliability.
  o Tackle complex development, automation, and business process problems. Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
  o Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead in DevOps automation and best practices.
  o Increase automation and tooling to reduce toil and manual intervention.
  o Analyze ITSM activities of the platform and provide a feedback loop to development teams on operational gaps or resiliency concerns.

The ideal candidate will have experience in many of these areas:

• BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent practical experience.
• Coding or scripting exposure.
• Appetite for change and for pushing the boundaries of what can be done with automation. Be curious about new technology, infrastructure, and practices to scale our architecture and prepare for future growth.
• Experience with algorithms, data structures, scripting, pipeline management, and software design.
• Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
• Interest in designing, analyzing, and troubleshooting large-scale distributed systems.
• Willingness and ability to learn, to take on challenging opportunities, and to work as a member of a matrix-based, diverse, and geographically distributed project team.
• Ability to balance doing things right with fixing things quickly. Flexible and pragmatic, while working towards improving the long-term health of the system.
• Comfortable collaborating with cross-functional teams to ensure that expected system behavior is understood and that monitoring exists to detect anomalies.

Preferred Qualifications:

• Coding experience in one or more of the following: C++, Java, Python, Go.
• Experience working across development, operations, and product teams to prioritize needs and build relationships is a must.
• Experience in an SRE role or related field.
• Background in cloud-native tooling and orchestration technologies (Kubernetes preferred).
• Experience with monitoring tools such as Splunk and Dynatrace.
• Experience with Java, J2EE, web services (SOAP/REST), and Spring/Spring Boot is a plus.
• Experience in production support environments and ITIL processes.
• Experience with industry-standard CI/CD tools like Git/Bitbucket, Jenkins, Maven, Artifactory, Groovy, and Chef. Experience designing and implementing an effective and efficient CI/CD flow that gets code from dev to prod with high quality and minimal manual effort is required.
• Developing and maintaining cloud solutions on Azure, GCP, or AWS in accordance with best practices.
• Understanding of:
  o Client-server relationships.
  o Network concepts (Layer 1 to Layer 3).
  o Stack trace analysis (TCP dumps, heap dumps, CPU/memory analysis, thread dumps).
  o Load balancers and application firewalls.
  o Logging and monitoring methods, standards, and tools.
  o High availability and business continuity planning.
