Cloud Service Group Site Reliability Engineer (Remote)

Splunk San Jose, CA 95111

Posted 2 months ago

Job Description:

Join us as we pursue our disruptive vision to make machine data accessible, usable and valuable to everyone. We are a company filled with people who are passionate about our product and seek to deliver the best experience for our customers. At Splunk, we're committed to our work, customers, having fun and most significantly to each other's success. Learn more about Splunk careers and how you can become a part of our journey!


Splunk's Cloud Services group is looking for a Site Reliability Engineer to help lead, design and build the next generation of our large scale cloud offering. You will be working on core services and applications that form the primitives for our current and future cloud service offerings. Site Reliability Engineers in this role will be engaging with multiple service owners across the platform to teach and implement modern interpretations of SRE, observability, Chaos Engineering and DevOps. This role is highly visible and impactful to the organization and will help shape Splunk's Engineering culture for years to come. Your job, in a nutshell, is to make every team around you better... including your own!

You will:

  • Work across the organization to deliver quality products that delight Splunk's passionate users.

  • Build tools and design processes that help improve observability and system resiliency of the SignalFx Platform.

  • Triage Site Availability Incidents and proactively work towards reducing MTTR for customer impacting incidents.

  • Partner with Service owners to implement Service Level Metrics & Service Level Objectives that act as service level health indicators.

  • Establish design patterns for monitoring, benchmarking and deploying new features for the backend services.

  • Abide by all of the FedRAMP security controls on Splunk Cloud environments.

  • Lead teams of tight-knit engineers who are building a state-of-the-art, cloud-based environment for massive-scale data processing.

  • Mentor and help new engineers to achieve more than they thought possible. You enjoy making other teams successful and are fulfilled through the success of others.

Requirements:


  • Experience monitoring and troubleshooting Splunk environments.

  • Experience in administering or architecting distributed Splunk environments.

  • Experience with the development and deployment of a hosted cloud environment, preferably AWS and GCP.

  • Experience with Python or Shell for scripting, and Git or similar version control systems.

  • Experience supporting customer facing SaaS infrastructure or similar cloud related services.

  • You enjoy building and running distributed systems at scale in production. You understand the challenges and trade-offs to be made when building and deploying systems to production.

  • Expertise in working with container deployment and orchestration technologies at scale, with knowledge of fundamentals including service discovery, deployments, monitoring, scheduling, and load balancing. Knowledge of Kubernetes, Go and Docker preferred.

  • Deep understanding of systems programming (network stack, file system, OS services) and networking (L2 vs. L3, network architecture, VLANs, etc.).

  • Knowledge of standard methodologies related to security, performance, and disaster recovery.

  • Highly skilled in identifying performance bottlenecks, detecting anomalous system behavior, and resolving the root cause of service issues.

  • You've demonstrated the skills to effectively work across teams and functions to influence design, operations and deployment of highly available software.

  • You work hard to make the users of Splunk's products happier every day.

  • Ability to work nights, weekends, and swing shifts.

  • This is a FULLY REMOTE, US-based position. We encourage applicants from anywhere in the US! You must be a US Citizen working on US soil to be considered.

Preferred skills:

  • Experience with large scale distributed cloud service development, infrastructure, traffic management and architecture.

  • Experience with distributed architectures/systems with optimized and scalable software that operates on a large number of nodes.

  • Ability to obtain a favorably adjudicated Single Scope Background Investigation (SSBI) and SECRET clearance.

All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or any other applicable legally protected characteristics in the location in which the candidate is applying. For job positions in San Francisco, CA, and other locations where required, we will consider for employment qualified applicants with arrest and conviction records.
