What you'll do...
As a member of the Site Reliability Engineering team, you will work with Developers and DevOps practitioners to produce mission-critical infrastructure, tools, and processes that ensure the highest levels of availability and reliability for all our websites, systems, and services. As a senior member of the team, you will be expected to work with management, peers, and customers to define and implement the technical vision of the team.
You are right for the job if you are comfortable with deep technical Linux and networking topics and with distributed architectures. You will work cross-functionally with a variety of teams and be a core contributor to every significant engineering service or solution that we deliver to our stakeholders. You will excel if you have enthusiasm for digging deep and a flair for sharp technical communication, prioritization, and organization. You will work directly with our Software Engineering teams to build our next-generation "always up" cloud-based e-commerce/Retail and Enterprise platform.
Site Reliability principal engineers are hybrid systems, software, and data engineers who take ownership of reliability, scalability, automation, and other issues related to the uptime and availability of Walmart's e-commerce/Retail and Enterprise platform. Our goal is to build, scale, and guard the systems that delight our customers. To do so, you will need strong skills in the following areas:
Own end-to-end availability and performance of mission-critical services, build automation to prevent problem recurrence, and eventually automate the response to all non-exceptional service conditions.
Define, measure, and meet key Service Level Objectives including availability, performance, incidents, and chronic problems.
Augment existing instrumentation to build a cohesive picture of the characteristics of our systems with special attention to points of failure.
Participate in capacity planning, demand forecasting, software performance analysis and system tuning.
Develop a deep understanding of the numerous services and applications that come together to deliver Walmart's e-commerce/Retail and Enterprise products.
Design new monitoring tools and build smart alerting systems that help discover failures and issues proactively, and work with engineers to identify root causes and fix issues.
Influence, design and create new architectures, standards, and methods for large-scale enterprise systems.
Perform root-cause analysis of complex problems involving multiple parties, networks, hardware, and software that relate to scaling and performance.
Design and build products and other intellectual property (IP) that can be used across solutions for multiple clients.
Determine overall data modeling standards, guidelines, best practices, and approved modeling techniques.
Develop innovative ideas and apply AI/ML to specific product challenges across Enterprise Operations.
Partner with software engineers, product owners and technology leadership to establish development and operational strategy and then deliver against those expectations.
Study and innovate in artificial intelligence/machine learning and its application in reporting and Operations.
Design conceptual, logical, and physical data models with competitive AI/ML services for next-generation monitoring, alerting, and reporting, and create prototypes for demonstration.
Develop architectures that are inherently secure, robust, resilient, and scalable, and that enable an API-centric, microservices ecosystem.
Deep understanding of computer architecture with knowledge of Operations and Site Reliability Engineering.
Data Design and Data Modelling for in-house SRE Tools and Applications for both Operational and Analytics purposes.
Proven and demonstrated implementation skills (putting theory into practice).
Leverage the power of new patterns and architectures (Serverless, Microservices, Event Sourcing, etc.) and technologies (Kafka, Elixir, TensorFlow, Node.js, Docker, Go) to build cloud-native solutions for Predictive Enterprise Monitoring, Alerting, and Reporting.
Experience with Azure, GCP, or private cloud (OpenStack), specifically in choosing platform components and assembling application and runtime architectures.
Experience with the Serverless framework and serverless architecture.
Experience with microservices architectures and constituent technologies.
Experience with containerization and container platforms (e.g., Docker, Kubernetes, Docker EE, OpenShift, Mesosphere).
Experience with Event-Driven Architectures and constituent technologies (e.g., Kafka, Event Hubs, PubSub) and patterns (Event Sourcing, CQRS, etc.).
Functional programming experience (e.g., in Kotlin, Scala, Clojure, Erlang, Elixir) is good to have.
Experience with high-performance networking (QUIC, network layer optimization) or real-time transaction protocols/methods (HTTP/2, Server-Sent Events, MQTT, WebSockets).
Experience with distributed NoSQL databases (Cassandra, Riak, ScyllaDB, Aerospike, etc.).
Additional responsibilities may include:
Drives standardization and service-focused instrumentation. Provides subject matter expertise. Resolves break/fix scenarios, engaging broader teams as necessary, and partners with or leads others to achieve continuous improvement. Contributes to command-and-control activities focused on the rapid restoration of service during complex outages. Participates in a 24/7 on-call rotation. May work independently or as part of a team on more complex projects. Provides mentoring and guidance to more junior team members.
Creates systems engineering and architectural documentation to be used by others to build and maintain systems.
Scripting and Development responsibilities: Designs and develops software in several modern languages. Designs large, complex database-backed systems and has an expert understanding of DB schema and query performance. Given a broad set of goals, can create detailed requirements and technical design specifications. Designs modular systems to be co-developed by teams of less experienced SREs. Designs horizontally scalable solutions with innovative use of storage and networking, including a good understanding of APIs for integration with other systems. Utilizes professional best practices in day-to-day work, such as revision control and unit testing. Applies statistical data analysis techniques.
Networking responsibilities: Recommends or helps architect an entire system. Acts as an expert in capturing and interpreting network traffic with tcpdump, snoop, and other network sniffers. Understands and applies knowledge of most protocols (TCP/IP, HTTP, UDP, etc.).
Application Technologies: Provides expert recommendations and advice to the team and/or department in the areas of web services, OS, and storage, including being an active liaison to Development, QA, and the Business.
Analyzes systems and makes recommendations to prevent potential problems. Takes lead on issue resolution activities using knowledge of complex and company-wide systems.
Leads end-to-end audits of monitors and alarms based on subsystem knowledge. Takes the lead on defining the requirements for new tools required for CRC and vertical SRE.
Utilizes time management and project management skills to lead the resolution of issues in a timely and organized manner, effectively communicating necessary information. May consult directly with developers or third-party vendors; provides subject matter expertise.
Consistent exercise of independent judgment and discretion in matters of significance.
Other duties and responsibilities as assigned.
About Global Tech
Imagine working in an environment where one line of code can make life easier for hundreds of millions of people and put a smile on their face. That's what we do at Walmart Global Tech. We're a team of 15,000+ software engineers, data scientists and service professionals within Walmart, the world's largest retailer, delivering innovations that improve how our customers shop and empower our 2.2 million associates. To others, innovation looks like an app, service or some code, but Walmart has always been about people. People are why we innovate, and people power our innovations. Being human-led is our true disruption.
Working virtually this year has helped us make quicker decisions, remove location barriers across our global team, be more flexible in our personal lives and spend less time commuting. Today, we are reimagining the tech workplace of the future by making a permanent transition to virtual work for most of our team. Of course, being together in person is an important part of our culture and shared success. We'll collaborate in person at a regular cadence and with purpose.
Outlined below are the required minimum qualifications for this position. If none are listed, there are no minimum qualifications.
Minimum Qualifications: Bachelor of Science and 6 years' experience in software engineering OR Master of Science and 3 years' experience in software engineering OR PhD.
Outlined below are the optional preferred qualifications for this position. If none are listed, there are no preferred qualifications.
Master's degree in Computer Science or related field and 4 years' experience in software engineering or related field.
840 W CALIFORNIA AVE, SUNNYVALE, CA 94086-4828, United States of America