Senior Infrastructure Engineer, Site Reliability

Credit Karma San Francisco, CA 94118

Posted 1 week ago

Credit Karma is taking the convoluted world of personal finances and making it easy to understand. We are uniquely positioned to help our 80 million (and counting!) users take control of their financial lives. The Site Reliability Engineering team is responsible for taking the complicated world of site operations and making it easy for teams to run reliable and performant services. You'll be expected to improve automation, tools, processes, and communications related to running systems. You will work closely with our network, cloud, and platform engineering teams to deliver best-in-class service to the company and our members.

What the Job Entails:

  • Develop automation and tools to reduce toil and improve the repeatability of processes.

  • Manage critical services that keep our systems running, including logging (Splunk) and infrastructure management systems (Salt, Terraform).

  • Define reliability metrics (KPIs, SLOs) and work to ensure services meet them. Develop runbooks and processes to reduce MTTR during incidents.

  • Work on processes to migrate traditional infrastructure services to a microservice model with Docker and Kubernetes.

  • Collaborate with core infrastructure and service engineers to improve service reliability, scalability, and tooling.

  • Troubleshoot issues across the entire stack: software, hardware, cloud, and networking.

  • Participate in a 24x7 on-call rotation.

  • Mentor junior members of the team and help them reach the next level.

What's Great About It?

  • The changes you make will directly improve our customers' experience by improving reliability.

  • You'll improve the lives of everyone around you by helping to reduce operational toil.

  • You'll get broad exposure to our stack of technologies such as Splunk, Google Cloud, Docker, and Kubernetes.

  • You'll learn a lot; we value continued learning and development both within the team and at CK as a whole.

  • And, of course, all those awesome company perks that you probably already read about.

Our Ideal Candidate:

  • 3+ years of experience in systems engineering, DevOps, or a software engineering role with an infrastructure focus.

  • You have a solid understanding of at least two of the following: Splunk Administration, Docker/Kubernetes, or Cloud infrastructure (AWS or Google Cloud).

  • Deep knowledge of Splunk cluster administration on both the front end and back end.

  • You've built tooling to improve the reliability of systems, automate remediation of issues, or improve scalability.

  • You have experience working in production environments at scale, and want to improve our availability and performance.

  • Systems often need to be reconfigured, so you should have experience with a configuration management system like Puppet, Chef or Salt. (We use Salt.)

  • You should be able to clearly communicate technical details when speaking or writing.

  • This position is part of a well-established team, and you should be excited about working closely with them and with product development teams.

  • Working in the cloud is a little different, so it would be great if you have some experience with AWS or GCP. Bonus if you have experience working with Terraform.

  • Our environment often has new challenges and technologies, so we want a candidate who is excited to learn.
