We are hiring a Sr. Site Reliability Engineer who will work with the software engineers to build reliable, high-capacity, high-performance infrastructure in support of our mission to reimagine learning for millions of students worldwide. If you know AWS services inside out, have solid networking experience, and like engineering solutions to site reliability and operations problems, you will thrive in this position. The position will be located at our Boston, MA facility.
Hands-on design, analysis, and troubleshooting of highly distributed, large-scale production systems
Ownership of reliability, uptime, capacity, and performance analysis thereof
Ensuring the repeatability, traceability, and transparency of our infrastructure automation
Identifying highest-impact opportunities to optimize existing systems
System design consulting for teams seeking to leverage or improve their production infrastructure
Anticipating, planning, and building capacity for upcoming product/feature launches
Expertise in container/container-fleet-orchestration technologies (like Docker, Kubernetes, AWS ECS);
Expertise in designing and managing escalation response plans that span monitoring, reaction, response, remediation, and retrospection in culturally aligned (proactive, customer-focused, collaborative, data-driven, and AUTOMATED) ways;
Mastery of infrastructure build and configuration automation technologies (like Terraform, Ansible, Puppet, CodeDeploy, Chef);
Strong network engineering skills;
Cloud and container native Linux administration/build/management skills (AWS AMIs, Packer, etc.);
Significant experience troubleshooting concurrent and distributed system interactions;
Expertise with continuous-deployment software development lifecycles in the Cloud (CI/CD);
Engineering at Klaviyo
Klaviyo is a fast-growing and profitable startup located in the heart of downtown Boston. Our mission is to use data science to help ecommerce brands grow faster. We love taking on tough engineering problems such as building real-time analytics systems to process billions of events every day.
We believe in ownership and autonomy and look for engineers who are passionate about building, operating & scaling features end to end and breaking through technical challenges to have outsized impact on our customers and on Klaviyo. We pride ourselves on shipping code dozens of times daily to enhance the product that our 10,000+ paying customers rely on to meaningfully engage with more than 1 billion global consumers.
Klaviyo's most important asset is our people and we are committed to always raising the bar for what it means to be a Klaviyo and to investing in and leveling up our people.
Read more about our tech and teams at https://klaviyo.tech
About the Role
Site Reliability Engineering (SRE) is what you get when you treat system operations as a software engineering problem. The mission of the Site Reliability Engineering team is to ensure uninterrupted service for Klaviyo customers and act as a force multiplier for Klaviyo product teams to deliver better software faster.
The SRE team builds foundational backend services as well as tooling and automation to allow product teams to release and scale their software reliably and predictably. SREs are team players who embed themselves within product teams as needed to advance the architecture and performance of software systems and train their peers in topics such as debugging distributed systems, building self-healing applications and eking out every drop of performance possible.
As a Senior Site Reliability Engineer you will own multiple foundational Klaviyo services and make a big impact on the productivity of our product engineering teams.
What you'll be doing
* Ship foundational services to enable Klaviyo engineering to move faster with confidence
* Design and develop systems and processes that enable highly available & scalable systems
* Achieve breakthroughs in systems throughput by identifying and eliminating bottlenecks
* Leverage technology such as Python, AWS, Django, Kubernetes, Bash, Terraform, MySQL, RabbitMQ, Redis, Cassandra, and PostgreSQL to advance Klaviyo's platform
* Champion best practices by actively collaborating with other teams in a culture that values whiteboarding and technical design review
* Contribute to the company as a subject matter expert in multiple areas, constantly pushing yourself to be a better engineer and to level up all of your peers within your team and within Klaviyo.
* Design, build and deliver software to dramatically improve the availability, scalability, latency, and efficiency of Klaviyo's services
* Mentor and pair with other Klaviyo engineers to build better software by focusing on performance, self-healing systems, configuration as code, defensive programming, application security, etc.
* Participate in periodic on-call duties with a focus on solving issues when they are discovered, preventing recurrences and minimizing alert fatigue
* Prototype and advocate for architectural improvements to achieve breakthrough results in Klaviyo systems' operational scalability and reliability
* Work hand-in-hand with product-facing engineers to ship impactful code
* Perform quantitative analysis to understand and scale Klaviyo systems and manage the cross-functional effort to resolve scalability issues
* Produce and advocate for preventative, upstream solutions with internal stakeholders and external vendors and dependencies
* Confidently make informed, data-driven decisions in a fast-paced environment with competing priorities
* Evangelize Site Reliability best practices across the engineering organization and community
* BA or BS Degree in Computer Science, related field, or equivalent experience
* 5+ years of responsibility operating & scaling complex distributed systems
* Ability to handle yourself and complex systems in outage situations and to drive failures to root cause analysis and prevention of future issues
* Fundamental understanding of Linux (we run Ubuntu) and all layers of the networking stack. You should be confident administering and debugging production Linux systems
* Experience working on an engineering team building software
* Experience developing applications in Python, Ruby, Go, etc.
Datto, the world's leading provider of IT solutions delivered through managed service providers, is looking for a Senior Site Reliability Engineer to join a growing team. Datto is a creative company at its core and is an exciting and dynamic workplace. We're 100% focused on our managed service provider partners and believe that with the right technology, managed service providers can change how businesses around the world operate. Datto provides data protection, business continuity, networking, business management, and file backup and sync products that empower and protect the clients of our 14,000+ partners. We're headquartered in Norwalk, Connecticut and have 22 offices worldwide.
We're looking for a motivated, self-starting, Sr. Site Reliability Engineer to help pioneer this role at Datto. The Sr. Site Reliability Engineer attaches to our Core Products Team, which maintains and develops new features for all of Datto's backup appliances (~75K devices and growing quickly). The backup device is a physical or virtual appliance that takes block-level backups of Windows, Mac, and Linux machines, turns them into raw disk images and stores them on a local ZFS-based disk array. In the case of a disaster, our customers restore these backups/disk-images instantly as KVM-based virtual machines, iSCSI targets, Samba shares, and many other formats. We also offer a virtual VMware/Hyper-V-based appliance and integrate with their hypervisors. We write code in modern Symfony-based PHP (with some Python and C++ sprinkled in), and we strongly rely on our Ubuntu-based Linux stack. We do amazing and exciting things every day, such as detecting when a VM has booted successfully, injecting drivers into the Windows registry before boot, and generating vmdk files on the fly. On top of that, we work with many low-level technologies, such as hypervisors and the ZFS filesystem. This is not your average PHP webdev gig! You will report to the Sr. Director of Software Engineering.
Does This Describe You:
You're a technical expert!
A Look Inside the Job:
* Collaborate with Product and Software Development teams to determine the Core Products reliability strategy, including Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
* Guide product reliability improvement through monitoring, alerting, and application of software development best practices
* Collect SLI metrics and establish monitoring based on SLO thresholds and other product requirements
* Establish and configure transaction volume, traffic, performance, and error rate monitoring including alert thresholds, capacity planning, and performance impact analysis
* Participate in SRE software engineering, writing code to continually reduce human intervention in operational tasks and to automate processes
* Troubleshoot complex issues quickly and effectively
* Develop a balanced on-call program with appropriate staffing
* Communicate with Users, Support, and Development teams in the event of an incident
* Diagnose and develop root cause solutions for failures and performance issues in our production environment
* Bachelor's degree in Computer Science or equivalent experience
* Strong root cause analysis and troubleshooting competency
* Experience working with automation and data-driven analysis
* Experience with OOP languages such as Java, PHP, C#, or C++
* Solid understanding of Object-Oriented Programming fundamentals
* Experience with distributed systems, hypervisors or file systems
At Datto, we believe our employees are our greatest asset and offer all full-time employees a wide-ranging benefits package, including:
* Comprehensive health-care benefits
* Free lunch every Friday
* Flexible paid time off policy
* Free food, drinks, and fresh organic fruit
* Fitness reimbursement
* Charity match program
* Transit subsidy in select cities
* Education reimbursement
* And more!
By submitting an application, you acknowledge we will process your data in order to consider you for the position you apply for and for other open positions within our company for which you may be suited. We collect and store your data in accordance with our Recruiting Privacy Practices.
Datto is an equal opportunity employer.
ezCater is the world's largest online marketplace for catering – a $60+ billion market in the U.S. We make it superbly easy for businesspeople to find and order great food for meetings and events, and we help our catering partners grow their business. We're backed by $320 million in venture funding and in early 2019 were valued at $1.25 billion. Our mission is to power the world's catering, and we'll make it happen – even more surely if you come help us.
We're looking for a top-notch, hands-on senior-level SRE to join our small and talented infrastructure engineering team and help us elevate our game when it comes to designing, building and operating high-performance and highly-available systems.
At ezCater, every engineer is responsible for the software they build, and SREs play a critical part in providing the tools, practices, and expertise to support them.
Our production systems are hosted in AWS and run multiple Rails and Node.js services on Kubernetes. We employ continuous delivery to allow our developers to deploy as often as they need. Our systems are stable and fire drills are rare.
Technologies we're currently using include:
* Amazon Web Services (EC2, S3, RDS, ElastiCache) and Ubuntu Linux
* Kubernetes, Postgres, Redis, Memcached, Elasticsearch
* Terraform, Chef, Fluentd, Test Kitchen, Datadog, Sumo Logic
In this mission-critical role, you will:
* Design, build and maintain the core infrastructure for ezCater
* Develop operational and security standards and champion operational excellence and secure coding practices
* Partner with engineering teams closely to educate and consult
* Participate in solution design for new features, products, systems, and tooling
* Debug complex problems across the whole stack
* Continually monitor application/system performance and costs, generate actionable insights and either implement or advocate for them
* Participate in on-call rotations, along with every member of the engineering team
* Ruthlessly eliminate repetitive manual tasks and recurring errors
* Ensure we are always employing best-of-breed tooling for all our infrastructure and automation needs
* Collaboratively plot a course for the maturation and growth of ezCater's infrastructure
* Participate (and sometimes run point) in handling production incidents
* Work closely with engineering teams to conduct root cause analysis for production incidents, and evolve infrastructure and tooling.
This role might be that rare opportunity if you:
* Thrive in a highly collaborative, no red-tape, rapid-growth environment
* Love building tooling and infrastructure to help developers be more productive
* Love eliminating repetitive manual tasks through automation
* Have a healthy appreciation of what it means to work in production
* Have solid Unix command line and systems chops
* Have experience with substantial, distributed SaaS or eCommerce systems
* Have vision and well-informed opinions about how to build infrastructure for a high-growth, technology-driven company that's headed towards the $2B mark.
What you'll get from us:
Importantly, you'll get a tremendous amount of authority and autonomy. You'll own your outcomes and see measurable results for your efforts. With ezCater's radical transparency and trust, you'll have open access to the data that drives our decisions. ezUniversity sessions will provide plenty of opportunities to expand your mind.
At the same time, you'll get sane working hours and a huge amount of flexibility around work/life balance. Have people in your life – of any age – who always, often, or sometimes need your help? We make room for that. Have a bad thing or a good thing happen to you? We make room for that, too.
Oh, and here's what else you'll get: Market salary, stock options you'll help make worth a lot, the usual holidays, all-you-can-eat vacation, 401K, health/dental/FSA, long-term disability insurance, subsidized T-passes, a great office in the heart of Boston, a tremendous amount of responsibility and autonomy, wicked awesome co-workers, cupcakes (and many more goodies), and knowing that you helped get this rocket ship to the moon.
ezCater is an equal opportunity employer. We embrace humans of every background, appearance, race, religion, color, national origin, gender, gender identity, sexual orientation, age, marital status, veteran status, and disability status. At the same time, we do not employ jerks, even brilliant ones.