Senior Site Reliability Engineer

Syrinx Boston, MA 02108

Posted 3 months ago

We are hiring a Sr. Site Reliability Engineer who will work with our software engineers to build reliable, high-capacity, high-performance infrastructure in support of our mission to reimagine learning for millions of students worldwide. If you know AWS services inside out, have solid networking experience, and like engineering solutions to site reliability and operations problems, you will thrive in this position. The position is located at our Boston, MA facility.

Essential Accountabilities:

  • Hands-on design, analysis and troubleshooting of highly distributed, large-scale production systems;
  • Ownership of reliability, uptime, capacity, and performance analysis thereof;
  • Ensuring the repeatability, traceability, and transparency of our infrastructure automation;
  • Identifying highest-impact opportunities to optimize existing systems;
  • System design consulting for teams seeking to leverage or improve their production infrastructure;
  • Anticipating, planning and building capacity for upcoming product/feature launches.

Required Skills:

  • Mastery of AWS services (IAM, EC2, S3, EBS/EFS, ELB/ALB, AutoScaling, RDS and replication techniques, VPC, Subnets, Elastic IP, Route53, CloudWatch, CloudFront, Lambda, CloudFormation, ECS, SNS, ElastiCache);
  • Expertise in container/container-fleet-orchestration technologies (like Docker, Kubernetes, AWS ECS);
  • Expertise in designing and managing escalation response plans that cover monitoring, reaction, response, remediation and retrospection in culturally aligned (proactive, customer-focused, collaborative, data-driven and AUTOMATED) ways;
  • Mastery of infrastructure build and configuration automation technologies (like Terraform, Ansible, Puppet, CodeDeploy, Chef);
  • Strong skills in reading, understanding and writing code in at least two of: JavaScript, Python, PHP, Go, or Ruby;
  • Strong network engineering skills;
  • Cloud and container native Linux administration/build/management skills (AWS AMIs, Packer, etc.);
  • Significant experience troubleshooting concurrent and distributed system interactions;
  • Expertise with continuous-deployment software development lifecycles in the Cloud (CI/CD);
  • Cloud database operations and deployment experience (RDS MySQL/Postgres/Aurora), and caching operations and deployment experience (Memcached, Redis);
  • Expertise with Lean/Agile deployment processes (ZDT: Blue/Green, Canary, DNS strategies);
  • Familiarity with site and infrastructure monitoring systems (CloudWatch, Datadog, New Relic, Sumo Logic, ThousandEyes);
  • Strong problem solving, root cause analysis and systems engineering skills;
  • Good presentation and communication skills;
  • Expertise with SDLC branching, SCM, and code deployment systems (Git/Gitflow, Jenkins, CircleCI, etc.);
  • BS Degree in Computer Science (or related technical field and/or equivalent industry experience).

Similar Jobs

Senior Site Reliability Engineer


Posted 2 days ago

Engineering at Klaviyo:

Klaviyo is a fast-growing and profitable startup located in the heart of downtown Boston. Our mission is to use data science to help ecommerce brands grow faster. We love taking on tough engineering problems such as building real-time analytics systems to process billions of events every day. We believe in ownership and autonomy and look for engineers who are passionate about building, operating and scaling features end to end and breaking through technical challenges to have outsized impact on our customers and on Klaviyo. We pride ourselves on shipping code dozens of times daily to enhance the product that our 10,000+ paying customers rely on to meaningfully engage with more than 1 billion global consumers. Klaviyo's most important asset is our people, and we are committed to always raising the bar for what it means to be a Klaviyo and to investing in and leveling up our people. Read more about our tech and teams at

About the Role:

Site Reliability Engineering (SRE) is what you get when you treat system operations as a software engineering problem. The mission of the Site Reliability Engineering team is to ensure uninterrupted service for Klaviyo customers and act as a force multiplier for Klaviyo product teams to deliver better software faster. The SRE team builds foundational backend services as well as tooling and automation to allow product teams to release and scale their software reliably and predictably. SREs are team players who embed themselves within product teams as needed to advance the architecture and performance of software systems and train their peers in topics such as debugging distributed systems, building self-healing applications and eking out every drop of performance possible. As a Senior Site Reliability Engineer you will own multiple foundational Klaviyo services and make a big impact on the productivity of our product engineering teams.

What you'll be doing:

  • Ship foundational services to enable Klaviyo engineering to move faster with confidence
  • Design and develop systems and processes that enable highly available and scalable systems
  • Achieve breakthroughs in systems throughput by identifying and eliminating bottlenecks
  • Leverage technology such as Python, AWS, Django, Kubernetes, Bash, Terraform, MySQL, RabbitMQ, Redis, Cassandra and PostgreSQL to advance Klaviyo's platform
  • Champion best practices by actively collaborating with other teams in a culture that values whiteboarding and technical design review
  • Contribute to the company as a subject matter expert in multiple areas, constantly pushing yourself to be a better engineer and to level up all of your peers within your team and within Klaviyo

Responsibilities:

  • Design, build and deliver software to dramatically improve the availability, scalability, latency, and efficiency of Klaviyo's services
  • Mentor and pair with other Klaviyo engineers to build better software by focusing on performance, self-healing systems, configuration as code, defensive programming, application security, etc.
  • Participate in periodic on-call duties with a focus on solving issues when they are discovered, preventing recurrences and minimizing alert fatigue
  • Prototype and advocate for architectural improvements to achieve breakthrough results in Klaviyo systems' operational scalability and reliability
  • Work hand-in-hand with product-facing engineers to ship impactful code
  • Perform quantitative analysis to understand and scale Klaviyo systems and manage the cross-functional effort to resolve scalability issues
  • Produce and advocate for preventative, upstream solutions with internal stakeholders and external vendors and dependencies
  • Confidently make informed, data-driven decisions in a fast-paced environment with competing priorities
  • Evangelize Site Reliability best practices across the engineering organization and community

Requirements:

  • BA or BS Degree in Computer Science, related field, or equivalent experience
  • 5+ years of responsibility operating and scaling complex distributed systems
  • Ability to handle yourself and complex systems in outage situations and to drive failures to root cause analysis and prevention of future issues
  • Fundamental understanding of Linux (we run Ubuntu) and all layers of the networking stack. You should be confident administering and debugging production Linux systems
  • Experience working on an engineering team building software
  • Experience developing applications in Python, Ruby, Go, etc.

Klaviyo Boston, MA

Senior Site Reliability Engineer