Site Reliability Engineer Plano

Splunk Plano, TX 75023

Posted 1 month ago

Join us on the Splunk Site Reliability Team, working on our vision to make machine data accessible, usable, and valuable to everyone! You will configure and maintain our customer-facing SaaS product, Splunk Cloud. Come join a team that is striving for operational awesomeness and trying to automate the world. We have a large AWS presence, and you should have experience with AWS architecting, deployments, and networking. This is an incredible opportunity to use your existing cloud experience and drive the growth of the Splunk Cloud.


Splunk's Cloud group is looking for skilled Site Reliability Engineers to support and build our large-scale Cloud offering. You will be working with a fun, diverse, geographically distributed team to deliver an excellent product and an extraordinary experience to our customers.


  • You are passionate about building and running distributed systems at scale in production. You understand the challenges and trade-offs to be made when building and deploying systems to production.

  • You constantly consider "How can I automate this process?"

  • Knowledge of best practices related to security, performance, and disaster recovery.

  • Skilled in identifying performance bottlenecks, spotting anomalous system behavior, and determining the root cause of incidents.

  • Experience monitoring cloud environments using tools like Splunk, VictorOps, Nagios, Zabbix, and PagerDuty.

  • You care about good documentation and appreciate how it allows a distributed team to function.

  • Ability to tackle complex problems, resolve operational issues, and interact with vendors to find solutions.

  • Comfortable working with critical, customer-facing issues and able to prioritize quickly when escalations happen.


  • Extensive experience as a Linux system administrator supporting enterprise computing platforms and systems.

  • Experience running complex systems in AWS, including Amazon EC2.

  • Experience with Python or Shell for scripting, and Git or a similar version control system.

  • Experience supporting customer-facing SaaS infrastructure or similar cloud-related services.

  • Splunk experience a plus.

  • Bachelor's degree or comparable work experience.

We value diversity at our company. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or any other applicable legally protected characteristics in the location in which the candidate is applying.

For job positions in San Francisco, CA, and other locations where required, we will consider for employment qualified applicants with arrest and conviction records.

Similar Jobs

Sr Site Reliability Engineer

Bank Of America Corporation

Posted Yesterday

Job Description: Resource will be responsible for the following:

  • Drive formulation and implementation of "never down" strategy aimed at identifying opportunities to improve overall stability and resiliency of critical business functions and applications. Partner with infrastructure and application development teams to implement enhancements. Ensure the design and development of new capabilities incorporate best practices aimed at ensuring such capabilities are highly resilient and stable.

  • Partner with production support, infrastructure and application teams to review disaster and contingency capabilities for all critical business functions, applications and individual components. Identify opportunities to optimize such capabilities, specifically recovery time and recovery point objectives, and partner with appropriate teams to implement such enhancements. Ensure the design and development of new capabilities incorporate best practices aimed at ensuring optimal recovery time and point objectives can be achieved.

  • Partner with production support, infrastructure and development organizations to ensure robust disaster recovery and contingency plans and capabilities are implemented and operationalized.

  • Apply extensive technical experience and skill set to drive the triaging of complex, high impact Production incidents to quickly restore service.

  • Partner with application and product managers to identify root cause and actions to correct complex, high impact Production problems. Also, work with those teams to identify other opportunities to improve overall Production stability, including actions to mitigate the reoccurrence of any problem as well as opportunities to improve overall monitoring.

  • Socialize best practice design patterns for highly available and resilient applications with production support, infrastructure and development partners.

  • Function as a subject matter expert for the team on stability and resiliency.

Required Job Skills: Resource will have the following skills:

  • 7+ years of experience in information technology.

  • Knowledgeable in best practice design patterns aimed at highly available and resilient applications.

  • Experience formulating and driving enterprise strategy across a large-scale organization.

  • Previous experience as an architect working with business partners and application development teams to understand business requirements and identify technology solutions best positioned to meet such needs in a highly resilient and stable manner.

  • Experience as a system administrator, database administrator, middleware administrator and/or network administrator. Ideally, experience in more than one role preferred.

  • Experience as an application developer and/or production support.

  • Experience using advanced monitoring tools such as Splunk, AppDynamics, SiteScope, Glassbox, and NetScout.

  • Experience troubleshooting network related incidents.

  • Strong, courageous communicator capable of effectively communicating, verbally, via emails and instant messaging, to both technical and business teams.

  • Capable of periodically providing on call support outside of normal working hours.

  • Capable of working in high pressure situations.

Desired Skills:

  • Bachelor's degree in business, computer science, MIS or related field.

  • Experience working for a large cap technology company.

  • Experience supporting/developing applications that utilize SAN and NAS storage. Any experience with Dell EMC Centera and Hitachi HCP storage a plus.

  • Experience leaning out and automating processes aimed at improving overall efficiency and quality of the work product.

  • Familiarity with the ITIL framework.

Shift: 1st shift (United States of America)

Hours Per Week: 40

Bank Of America Corporation Plano TX
