Site Reliability Engineer/Crisis Management

ServiceNow San Diego, CA 92140

Posted 4 months ago

Job Title: Technical Duty Officer

Job Location: San Diego, CA

This position reports to the Manager of Technical Duty Officers in AMS. ServiceNow is changing the way people work. With a service-orientation toward the activities, tasks and processes that make up day-to-day work life, we help the modern enterprise operate faster and be more scalable than ever before. We're disruptive. We work hard but try not to take ourselves too seriously. We are highly adaptable and constantly evolving. We are passionate about our product, and we live for our customers. We have high expectations, and a career at ServiceNow means challenging yourself to always be better.

Who is the TDO?:

The Technical Duty Officer (TDO) team provides leadership to a talented Site Reliability Engineering (SRE) group to keep our worldwide cloud service available. We advance training and give support to the operations teams on all issues impacting our infrastructure. The TDO engages in robust communication across the organization to drive necessary changes and execute initiatives with rigorous determination.

What you get to do in this role?:

  • Leverage your extensive system, network, and database skills to provide technical leadership for a team of on-site engineers who are responsible for the availability and performance of ServiceNow's cloud platform.

  • Lead as the crisis manager during all major outages, and provide technical input to the teams engaged in remediation.

  • Drive organization-wide change by participating in post-incident reviews, approving new architectural designs, and establishing strong relationships by working with many cross-functional teams.

  • Make operations more effective by continually training and mentoring the team on all aspects of the operational environment.

  • Build requirements for new procedures and automations, and verify that these new services meet our needs before they are released to the production environment.

  • Coordinate all recovery efforts to provide rapid relief and resolution to any issue that could be impacting the operational environment.

What should you know to be successful?:

  • An in-depth understanding of the technology associated with operating a service or platform in the cloud, including datacenters, systems, networks, load balancers, applications, and relational databases.

  • Familiarity with networking technologies such as routing, switching, DNS, load balancing, and CDN.

  • Working knowledge of Bash, Python, Perl or other scripting languages.

  • Experience with MariaDB/MySQL configuration, SQL query analysis, and database performance techniques.

  • Solid *nix systems administration, network administration and application layer experience.

  • Excellent collaboration skills across diverse cross-functional teams.

  • Proven ability to promote your ideas effectively and obtain buy-in from stakeholders.

  • Meticulous analytical skills to identify and understand the root cause of critical issues.

Preferred qualifications:

  • 3-5 years of technical leadership experience.

  • Bachelor's degree in Computer Science, Information Systems, or an equivalent technical discipline, or similar work experience in an enterprise 24/7 production environment supporting critical, real-time applications.

  • Strong understanding of Internet protocols, web technologies, and operating systems.

More about this role:

We provide competitive compensation, generous benefits, and a professional atmosphere. This is a very collaborative and inclusive work environment where individuals with a strong aptitude will have an opportunity to grow their professional careers through working with some of the most advanced technology in the industry.

Similar Jobs

Site Reliability Engineer Couchbase


Posted 1 week ago

TE2 - The Experience Engine™ Inc, a division of accesso, is the leader in experience-driven, personalized advertisement and content delivery for connected consumers, bridging the physical and digital brand experience across mobile, wearables and other digital technologies. TE2 is designed for industries where an in-person experience is a critical engagement opportunity, including hospitality, resorts, theme parks, food, travel, education and healthcare. For more information about TE2, please visit

Position Overview:

The Couchbase Site Reliability Engineer's objective is essentially to "make things scale," which includes building software that automates experiences, developing utilities that provide insights and metrics, and providing instrumentation so the Engineering teams can scale up the TE2 platform's performance more efficiently. Red Hat, Docker, Kubernetes, AWS, Jenkins, and Ansible make up the main internal tech stack you will be working with.

The Couchbase Site Reliability Engineer brings deep expertise supporting Couchbase database clusters as part of complex TE2 deployments. You will serve as a subject matter expert on all aspects of our utilization of Couchbase, including deployment, configuration, scaling (MDS), and upgrades. You will debug problems in production and test environments, advise developers on best practices using Couchbase including key-value operations and N1QL queries, and maintain high-volume clusters in multiple datacenters. You will develop automation that improves deployment speed and service reliability of Couchbase clusters.

Challenges that you may tackle include:

  • Instrumentation and metrics collection from AWS Lambda FaaS or otherwise immutable containers

  • Minimize and harden microservices and public-facing API gateway attack surface

  • Continuous delivery using tools such as Jenkins pipelines, Docker, Kubernetes

  • Observability, capacity planning, system and service performance analysis and tuning

  • Orchestration of AWS VPC resources using tools such as terraform, boto, consul

Some of the technologies you will be working with:

  • Configuration management: ansible, aws-cli, git

  • Operating Systems: mostly RedHat derived linux

  • Containerization and virtualization technologies: Docker Enterprise, Kubernetes

  • Metrics and monitoring: statsd, ELK, PagerDuty, Slack chatops

  • Messaging: Kafka, RabbitMQ

  • Microservices patterns: Eureka, Ribbon, Hystrix, nginx

  • Databases: Couchbase (NoSQL, N1QL), memcached, Elasticsearch, PostgreSQL, Oracle

  • L2-L7 frame/packet/session inspection: netflow, WAF, pcap

Requirements:

  • 5+ years of highly-available or high-volume site reliability engineering or systems administration

  • 3+ years of infrastructure automation, configuration management or container orchestration

  • Strong with one or more languages (Go (golang), Python, Java, Ruby, perl or bash) and git

  • BA/BS in Computer Science or a related technical field (preferred, but not necessary)

  • Periodic participation in an after-hours on-call rotation supporting production environments 24x7

  • Willingness to embrace an agile devops culture

What We Offer:

  • Competitive compensation package including discretionary annual bonus opportunity

  • 4 weeks of Paid Time Off for employees up to 3 years of tenure (higher accrual thereafter)

  • 8 hours of paid Volunteer Time Off to give back to organizations and groups you feel most passionately about

  • Three different medical insurance plans to choose from, including an employer-contributed HSA

  • Employer-paid short & long-term disability and life insurance

  • Matching 401K

  • Unlimited access to Udemy for Business for continued learning and career development

Other Considerations:

  • We are an E-Verify organization. Eligible candidates must be authorized to work in the US without requiring visa sponsorship.

  • accesso is a drug free company.

If you are interested in joining a team who values Passion, Commitment, Teamwork, Innovation and Integrity and what we've described above is YOU, then apply today and let's talk!

Accesso San Diego, CA
