Site Reliability Engineer (SRE)

Eaton Corporation Peachtree City, GA 30270

Posted 2 months ago

Eaton's Digital Services Operations team is currently seeking a Site Reliability Engineer (SRE) to join our Eaton SRE and SaaS Operations team. This position will be based at a primary Eaton engineering location in Moon Township, PA, Raleigh, NC, Cleveland, OH, or Peachtree City, GA. The position is eligible for a hybrid work setup.

The expected annual salary range for this role is $80,250 - $117,700 a year.

Please note the salary information shown above is a general guideline only. Salaries are based upon candidate skills, experience, and qualifications, as well as market and business considerations.

What you'll do:

The SRE and SaaS Ops Mission and team:

The SRE and SaaS Ops mission is to reduce business impact due to any outage or change events in Eaton production Brightlayer cloud-hosted software and prevent their recurrence. As a part of the Digital Services Operations organization, team members will work across the businesses developing, delivering, and supporting Brightlayer IoT offerings. Collaboration primarily involves the following organizations: software development, product and offer management, technical support, and operations to execute and deliver best practices, consultation, and support for this production suite of offerings. The team will address requirements, measurement, and compliance for Service Level Agreements, reliability, uptime, other core SRE tenets, and business continuity to meet customer commitments and contribute to excellent customer experiences.

In this function you will:

  • Work together with the centralized SRE & SaaSOps mission focused on providing production support across the Brightlayer product software offerings. You may align more closely with certain offerings. You will work together with the SRE/SaaSOps architect and regional SRE leader to provide guidance, best practices, and automation to ensure reliability, resiliency, and availability of the software.

  • You will deliver SRE & SaaS Ops expertise, implement capability, and drive synergy, standardization, and execution in collaboration with product team SMEs.

  • Strive to automate eligible manual activities and design/develop supporting software tools that would improve the system and operations. The target for your effort split is 50/50 across value add development efforts and operational support.

  • Participate in new product launch activities including architecture, implementation assessments, and launch readiness deliverable reviews.

  • As required, respond to monitoring alerts and mitigate any production support escalation issues by restoring normal service operations and conducting post-incident reviews (often referred to as post-mortems), always seeking methods for continuous improvement.

  • Work closely with service management, software engineers, and quality to manage, measure, and report system availability, performance, and reliability.

  • Participate as required in on-call and support rotation processes driving SLA adherence and quality system availability and reliability.

  • Document your system knowledge as you acquire it over time, create runbooks, and ensure critical system information is readily available to those who need it.

  • Help build a Site Reliability Engineering culture across the organization by sharing your best practices, approaches, documentation, training and code with other engineering teams.

Qualifications:

Required (Basic) Qualifications:

  • Bachelor's degree in Computer Science or Engineering from an accredited institution.

  • Minimum of 3 years of experience in the software industry developing enterprise scalable cloud-based applications and/or distributed systems

  • Legally authorized to work in the United States without company sponsorship now or in the future

  • No relocation benefit is being offered for this position. Only candidates within a 50-mile radius of the work locations listed will be considered. Active-duty military Service member candidates are exempt from the geographical area limitation.

Preferred Qualifications:

  • Experience with incident management, including the ability to triage and resolve issues that may affect system reliability and performance

  • Experience with Agile methodologies and concepts

  • Experience with software engineering principles and best practices, including design patterns, testing, and debugging

  • Experience with DevOps teams and with production environments is beneficial

Position Criteria:

  • Some experience developing, deploying, configuring, and monitoring infrastructure, applications, and/or services in Microsoft Azure or AWS public cloud environments. Work with hybrid cloud and/or private cloud is beneficial.

  • Familiarity with continuous integration and continuous delivery (CI/CD) best practices and tools, such as Jenkins, GitHub Actions, Azure DevOps, and Opsera

  • Familiarity with scripting languages such as Python, Go, and Bash

  • Knowledge of operating systems, networking, relational and non-relational databases, and computer systems architecture

  • Knowledge of monitoring, logging, and observability tools such as Prometheus, Grafana, Dynatrace, ELK, and Azure Monitor/Insights

  • Ability to troubleshoot and debug complex problems

  • Good judgment, time management, collaboration, and decision-making skills

  • Ability to stay calm under pressure and a passion for driving continuous improvement

  • Solid understanding of automation concepts, automatic provisioning of software and infrastructure, logging, and data visualization.

  • Working knowledge of other languages such as C#, Java, C++, or Go

  • Understanding of latency, performance, high availability, efficiency, change management, monitoring, and incident management.

  • Familiarity with service-level management and related tools.

Skills:

  • Split your time between developing solutions that increase the reliability of an internal IoT Platform running in Microsoft Azure and Ops/On-call duties. We are targeting a 50/50 split.

  • Apply automation and software to any tasks or parts of the system that would benefit from it or are performed manually.

  • Respond to and mitigate any support escalation issues by restoring normal service operations, cleaning up, and conducting post-mortems.

  • Work closely with software engineers and testers to ensure the system is meeting non-functional requirements such as performance, security, and availability.

  • Help build a Site Reliability Engineering culture across the organization by sharing your best practices, approaches, documentation, and code with other engineering teams.

  • Optimize existing on-call and support rotation processes that ultimately improve system reliability.

  • Document your system knowledge as you acquire it over time, create runbooks, and ensure critical system information is readily available to those who need it.

We are committed to ensuring equal employment opportunities for job applicants and employees. Our recruitment processes use balanced selection criteria and avoid unlawful discrimination against applicants on the basis of their age, colour, disability, marital status, national origin, gender, gender identity, genetic information, race or racial origin, religion, sexual orientation or any other status protected or required by law.

