Reliability System Engineer

Sallie Mae Inc, Indianapolis, IN 46218

Posted 7 days ago

Who we are:

Sallie Mae is proud to help Americans aspiring to create the life they want, whether that means helping them make college happen, or something more. Our colleagues across departments and across the country are united in our passion and our customer-first approach. Whether you want to join a growing company, be part of an agile workforce, or gain new skills, you're in the right place.

Sallie Mae is seeking top-tier individuals who thrive in collaborative agile teams leveraging the latest technologies (AWS, Clojure/ClojureScript, Reagent/React, AWS Lambda, S3 buckets, Beanstalk, PaaS, Azure Functions, C#, Python, Java, Go, Swift, Kotlin, .NET Core, NoSQL databases). Sallie Mae is committed to delivering best-in-class solutions to our customers and is looking for developers who share the desire to deliver only the best code and solutions, and who want to keep sharpening their skills at a company that greatly values constant learning, pair programming, training, and career development.

What You'll Contribute:

Site Reliability is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. RE ensures that services, both our internally critical and our externally visible systems, have reliability and uptime appropriate to users' needs and a fast rate of improvement, while keeping an ever-watchful eye on capacity and performance. RE is also a mindset and a set of engineering approaches to running better production systems: we build our own creative engineering solutions to operations problems.

Much of our software development will focus on optimizing existing systems, building infrastructure, and eliminating work through automation. As SREs are responsible for the big picture of how our systems relate to each other, we will use a breadth of tools and approaches to solve a broad spectrum of problems. Practices such as limiting time spent on operational work, blameless retrospectives, and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. Our ideal candidate is a collected and organized problem solver with an interest in designing, analyzing, and troubleshooting large-scale distributed systems.

What You'll Do:

  • Respond to production bottlenecks, investigate their causes, and engineer lasting solutions

  • Engage in and improve the whole life-cycle of services, from inception and design through to deployment, operation, and refinement

  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews

  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health

  • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity

  • Make strategic maintenance and reliability program improvements

  • Practice sustainable incident response and blameless postmortems

  • Analyze downtime and lost production data to identify and implement solutions for chronic issues

  • Identify and implement predictive and preventive maintenance tools as well as performance metrics to increase asset reliability

  • Develop site monitoring and alerting

  • Report performance of operations

  • Communicate with Users, Computing, and Development teams in the event of an incident

  • Troubleshoot complex issues quickly and effectively

  • Facilitate root cause investigations

  • Recognize automation opportunities

  • Develop tools to enable prevention, detection, and mitigation

  • Ensure software has good logging and diagnostics

  • Create and maintain operational runbooks

  • Help triage escalated support tickets

  • Work on feature requests, defects and other development tasks

  • Contribute to overall product roadmap

  • Continually improve your craft as a software performance engineer by learning and applying the latest design patterns, principles, and technologies for performance testing

  • Research and implement solutions in related technologies to stay ahead of the technology curve

What You Bring:

  • 5 to 7 years of programming experience operating cloud environments, or closely related experience

  • Prior experience with AWS or Azure cloud infrastructure automation and DevOps workflows

  • Ability to debug and optimize code, and to automate routine tasks with focus on self-healing resilient systems

  • Demonstrated problem-solving skills through engineering solutions and open source tools

  • Self-starter, organized and able to work independently

  • Experience with one or more scripting languages for automating common tasks

  • Experience with Agile delivery methodology

  • Strong interpersonal and relationship skills

  • Good team player, but able to work effectively independently

  • Ability to service an interrupt-driven system

  • Experience in one or more of the following: C, C++, Java, Python, Go, Perl or Ruby

Extra Credit:

  • Experience with Python and web programming frameworks (e.g., Django, Flask, ORMs)

  • Experience with DevOps or related fields (e.g., Docker, Kubernetes, Fabric, New Relic)

  • Experience with JavaScript and well-established MV* frameworks (e.g., React, Backbone)

  • Ability to use a wide variety of open source technologies and cloud services

  • Programming experience with multiple scripting and programming languages and supporting technologies such as Reagent/React, Clojure/ClojureScript, AWS Lambda, Java, Python, R, Ruby, Go, bash, Swift, Beanstalk, VSTS

  • Experience designing and developing robust APIs (RESTful, GraphQL) for mission-critical, high-volume systems

  • Experience working in the financial services industry or a compliance-regulated environment

  • Proven ability to learn new technologies quickly

What You'll Get at Sallie Mae:

  • Comprehensive Compensation and Healthcare Benefits (Medical, Dental, Vision plans)

  • Financial Well-being: 401(k) company match and employee stock purchase plan; basic life insurance and short-term disability are provided to employees at no cost

  • Work/Life Balance: Paid time off, time off to volunteer, and tuition reimbursement. In addition, after 6 months of employment, primary caregivers receive 12 weeks of fully paid time off and secondary caregivers are eligible for 4 weeks of fully paid time off for birth or adoption

  • Wellness: Fitness centers/gym subsidies, free Fitbits with step challenges, and wellness education
