Principal Software Engineer - Site Reliability Engineering

Walmart Sunnyvale, CA 94085

Posted 2 months ago

Position Summary...

What you'll do...

Job Summary:

As a member of the Site Reliability Engineering team, you will work with Developers and DevOps practitioners to produce mission-critical infrastructure, tools, and processes that ensure the highest levels of availability and reliability for all our websites, systems, and services. As a senior member of the team, you will be expected to work with management, peers, and customers to define and implement the technical vision of the team.

You are right for the job if you are comfortable with deep technical Linux and networking topics and with distributed architectures. You will work cross-functionally among a variety of teams and be a core contributor to every significant engineering service or solution that we deliver to our stakeholders. You will excel if you have enthusiasm for digging deep and a flair for sharp technical communication, prioritization, and organization. You will work directly with our Software Engineering teams to build our next-generation "always up" cloud-based e-commerce/Retail and Enterprise platform.

Site Reliability principal engineers are hybrid systems, software, and data engineers who take ownership of reliability, scalability, automation, and other issues related to the uptime and availability of Walmart's e-commerce/Retail and Enterprise platform. Our goal is to build, scale, and guard the systems that delight our customers. To do so, you will need strong skills in the following areas:

  • Own end-to-end availability and performance of mission-critical services, build automation to prevent problem recurrence, and eventually automate the response to all non-exceptional service conditions.

  • Define, measure, and meet key Service Level Objectives including availability, performance, incidents, and chronic problems.

  • Augment existing instrumentation to build a cohesive picture of the characteristics of our systems with special attention to points of failure.

  • Participate in capacity planning, demand forecasting, software performance analysis and system tuning.

  • Develop a deep understanding of the numerous services and applications that come together to deliver Walmart e-commerce/Retail and Enterprise products.

  • Design new monitoring tools and build smart alerting systems that help discover failures and issues proactively, and work with engineers to identify root causes and fix issues.

  • Influence, design and create new architectures, standards, and methods for large-scale enterprise systems.

  • Perform root-cause analysis of complex problems involving multiple parties, networks, hardware, and software that relate to scaling and performance.

  • Design and build products and other intellectual property (IP) that can be used across solutions for multiple clients.

  • Determine overall data modeling standards, guidelines, best practices, and approved modeling techniques.

  • Develop innovative ideas and apply AI/ML to specific product challenges across Enterprise Operations.

  • Partner with software engineers, product owners and technology leadership to establish development and operational strategy and then deliver against those expectations.

  • Study and innovate in artificial intelligence/machine learning and its application in reporting and Operations.

  • Design conceptual, logical, and physical data models with competitive AI/ML services for next-generation monitoring, alerting, and reporting, and create prototypes for demonstration.

  • Develop architectures that are inherently secure, robust, resilient, and scalable, and that enable an API-centric, microservices ecosystem.

  • Deep understanding of computer architecture with knowledge of Operations and Site Reliability Engineering.

  • Data Design and Data Modelling for in-house SRE Tools and Applications for both Operational and Analytics purposes.

  • Proven and demonstrated implementation skills (putting theory into practice).

  • Leverage the power of new patterns and architectures (Serverless, Microservices, Event Sourcing, etc.) and technologies (Kafka, Elixir, TensorFlow, Node.js, Docker, Go) to build cloud-native solutions for predictive enterprise monitoring, alerting, and reporting.

Minimum Qualifications:

  • Bachelor of Science and 6 years' experience in software engineering OR Master of Science and 3 years' experience in software engineering OR PhD.

Preferred Qualifications:

  • Experience with more than one programming language (Go, Python, Java, Kotlin, Scala, Clojure, or JavaScript preferred).

  • Experience with Azure, GCP, or private cloud (OpenStack); namely choosing platform components and assembling application and runtime architecture.

  • Experience with the Serverless framework and serverless architecture.

  • Experience with microservices architectures and constituent technologies.

  • Experience with containerization and container platforms (e.g., Docker, Kubernetes, Docker EE, OpenShift, Mesosphere).

  • Experience with Event Driven Architectures and constituent technologies (e.g., Kafka, Event Hubs, PubSub) and patterns (Event Sourcing, CQRS, etc.).

  • Functional programming experience (e.g., in Kotlin, Scala, Clojure, Erlang, Elixir) is good to have.

  • Experience with high-performance networking (QUIC, network layer optimization) or real-time transaction protocols/methods (HTTP/2, Server-Sent Events, MQTT, WebSockets).

  • Experience with distributed NoSQL databases (Cassandra, Riak, ScyllaDB, Aerospike, etc.).

Additional responsibilities may include:

  • Drives standardization and service-focused instrumentation. Provides subject matter expertise. Resolves break/fix scenarios, engaging broader teams as necessary, and partners with or leads others to achieve continuous improvement. Contributes to command-and-control activities focused on the rapid restoration of complex outages. Participates in a 24/7 on-call rotation. May work independently or as part of a team on more complex projects. Provides mentoring and guidance to more junior team members.

  • Creates systems engineering and architectural documentation to be used by others to build and maintain systems.

  • Scripting and Development responsibilities: Designs and develops software in several modern languages. Designs large, complex database-backed systems and has an expert understanding of DB schema and query performance. Given a broad set of goals, can create detailed requirements and technical design specifications. Designs modular systems to be co-developed by teams of less experienced SREs. Designs horizontally scalable solutions with innovative use of storage and networking, including a good understanding of APIs for integration with other systems. Utilizes professional best practices in day-to-day work, such as revision control and unit testing. Applies statistical data analysis techniques.

  • Networking responsibilities: Recommends or helps architect an entire system. Acts as an expert in capturing and interpreting network traces with tcpdump, snoop, and other network sniffers. Understands and applies knowledge of most protocols (TCP/IP, HTTP, UDP, etc.).

  • Application Technologies: Provides expert recommendations and advice to the team and/or department in the areas of web services, OS, and storage, including being an active liaison to Development, QA, and the Business.

  • Analyzes systems and makes recommendations to prevent potential problems. Takes lead on issue resolution activities using knowledge of complex and company-wide systems.

  • Leads end-to-end audits of monitors and alarms based on subsystem knowledge. Takes the lead on defining the requirements for new tools required for CRC and vertical SRE.

  • Utilizes time management and project management skills to lead the resolution of issues in a timely and organized manner, effectively communicating necessary information. May consult directly with developers or third-party vendors; provides subject matter expertise.

  • Consistent exercise of independent judgment and discretion in matters of significance.

  • Other duties and responsibilities as assigned.

About Global Tech

Imagine working in an environment where one line of code can make life easier for hundreds of millions of people and put a smile on their face. That's what we do at Walmart Global Tech. We're a team of 15,000+ software engineers, data scientists and service professionals within Walmart, the world's largest retailer, delivering innovations that improve how our customers shop and empower our 2.2 million associates. To others, innovation looks like an app, service or some code, but Walmart has always been about people. People are why we innovate, and people power our innovations. Being human-led is our true disruption.

We're virtual

Working virtually this year has helped us make quicker decisions, remove location barriers across our global team, be more flexible in our personal lives and spend less time commuting. Today, we are reimagining the tech workplace of the future by making a permanent transition to virtual work for most of our team. Of course, being together in person is an important part of our culture and shared success. We'll collaborate in person at a regular cadence and with purpose.

Minimum Qualifications...

Outlined below are the required minimum qualifications for this position. If none are listed, there are no minimum qualifications.

Minimum Qualifications: Bachelor of Science and 6 years' experience in software engineering OR Master of Science and 3 years' experience in software engineering OR PhD.

Preferred Qualifications...

Outlined below are the optional preferred qualifications for this position. If none are listed, there are no preferred qualifications.

Master's degree in Computer Science or related field and 4 years' experience in software engineering or related field

Primary Location...

840 W CALIFORNIA AVE, SUNNYVALE, CA 94086-4828, United States of America

