Site Reliability Engineer - Service Assurance Platforms

Cisco Systems, Inc. San Jose, CA 95111

Posted 1 month ago

Who We Are

Today's business environment is more than challenging: it is a period of disruption driven by the pandemic, global business change and internal process complexity. To focus on simplicity and the best customer experience, we need great talent and the right skill sets. This is now a mantra for our Cisco leadership team and for us.

The Digital Enterprise Solutions team is changing the way we run Cisco's operations by maximizing the power of technology, the best of business processes and superior data insights. Together, we will Reimagine the Cisco experience, show the world how to Reinvent applications, and leverage the future of the Internet to Showcase the power of Cisco: our people, products, processes, systems, and data.

As part of the Digital Enterprise Solutions team, our team is responsible for infrastructure and application monitoring across Cisco, leveraging Agile and DevSecOps frameworks. Our monitoring solutions provide detailed visibility into the performance, availability, and user experience of business-critical applications and the supporting infrastructure and networks. The primary goals of these solutions are to help IT and DevOps teams deliver consistent availability and performance, and to rapidly detect, diagnose and resolve issues before clients are affected.

Please join us and make this journey together!

What You'll Do

Site Reliability Engineers are responsible for, and take ownership of, reliability, scalability, automation, and other issues related to the uptime and availability of our monitoring solutions. You will need strong skills in the following areas:

  • Design, write and build use cases/features to improve the reliability, availability and scalability of our Monitoring Solutions.

  • Augment existing instrumentation to build a cohesive picture of the characteristics of our systems with special attention to points of failure.

  • Design and develop improvements, focused on resilience, to our production systems to achieve and surpass SLOs

  • Help improve our operational practices to minimize service disruptions

  • Work with our Service Assurance team to modernize and improve our monitoring and alerting architecture.

  • Design repeatable & scalable solutions that detect failures or issues before our clients do.

  • Conduct product proof of concepts, establish success criteria and provide recommendations

  • Work with engineers to identify root cause and fix issues

  • Influence, design and create new architectures, standards and methods for large-scale enterprise systems.

  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health.

Who You'll Work With

We are a diverse DevOps team supporting a mix of Cisco on Cisco solutions, such as AppDynamics & ThousandEyes, together with strategic 3rd Party solutions to deliver a best-in-class Service Assurance Architecture across Hybrid Cloud deployment models in Cisco IT. This is a team of highly motivated individuals leveraging SAFe Agile. We thrive in fast-paced environments and are passionate about Infrastructure & Application Monitoring in Hybrid Cloud environments. Giving back to our communities and contributing to the innovative culture here at Cisco is highly encouraged. We have a history of building innovative solutions at scale or being that bridge to what's possible. We are looking for a passionate SRE who is ready to embark on a new transformational journey with us.

Who You Are

You are a success-driven Site Reliability Engineer with proven technical and leadership skills and a passion for designing and implementing innovative enterprise use cases to enhance and optimize our existing monitoring solutions. You will participate in the architecture and design of the monitoring solutions, aligning with leadership, product managers and product owners along with implementation teams to support the transformation of the Service Assurance architecture across Cisco.

This is an opportunity for you to work with the best minds and monitoring solutions in Cisco IT, in a dynamic field of Infrastructure and Application Performance Monitoring in a Hybrid Cloud environment.

Technical Expertise:

  • Experience with tool suites like Elastic, Grafana & Splunk

  • Experience with JavaScript, either Node.js or React

  • Experience with Infrastructure or Application Performance Monitoring solutions, and testing experience in a diverse and complex infrastructure

  • ThousandEyes, AppD or similar experience a plus

  • Experience with building and maintaining Red Hat or CentOS Linux

  • Experience with configuration automation using Ansible

  • Experience with public cloud like AWS, GCP, or Azure

  • Experience with on-premises cloud technologies using VMware or OpenStack

  • Experience with container technologies like OpenShift, Kubernetes, and Docker

  • Software development lifecycle including design, development, testing, packaging, deployment, upgrade and support.

  • Experience with software development tools like Git, Gerrit, Spinnaker, and Jenkins

  • Python, Shell, Go, or similar programming experience.

  • QA and testing experience of your code and the entire platform.

  • Understanding of security including OS hardening, firewalls, iptables, and working with Infosec

  • Understanding of network basics like routers and switches

Non-Technical Requirements:

  • Leadership in building and maintaining SRE technologies

  • Agile software development practices

  • Working with geographically distributed teams

  • Understanding of IT lifecycle processes, including architecture, design, implementation, and operations

  • Open source development experience

  • Self-motivated, able and willing to help where help is needed

  • Able to build relationships, be culturally sensitive, have goal alignment, have learning agility

  • Typically requires BS in Engineering or Computer Science and 8+ yrs of relevant experience.

Why Cisco

#WeAreCisco, where each person is unique, but we bring our talents to work as a team and make a difference powering an inclusive future for all.

We embrace digital, and help our customers implement change in their digital businesses. Some may think we're "old" (36 years strong) and only about hardware, but we're also a software company. And a security company. We even invented an intuitive network that adapts, predicts, learns and protects. No other company can do what we do - you can't put us in a box!

But "Digital Transformation" is an empty buzz phrase without a culture that allows for innovation, creativity, and yes, even failure (if you learn from it.)

Day to day, we focus on the give and take. We give our best, give our egos a break, and give of ourselves (because giving back is built into our DNA.) We take accountability, bold steps, and take difference to heart. Because without diversity of thought and a dedication to equality for all, there is no moving forward.

So, you have colorful hair? Don't care. Tattoos? Show off your ink. Like polka dots? That's cool. Pop culture geek? Many of us are. Passion for technology and world changing? Be you, with us!

