Senior Site Reliability Engineer


Syrinx Boston, MA 02110

Posted Today


U.S. Citizens and those authorized to work in the U.S. are encouraged to apply. We are unable to sponsor at this time. No Corp to Corp.
Our e-commerce partner is hiring for the next generation of their products that are delivering engaging, adaptive, and personalized experiences to optimally support every learner.
We are hiring a Sr. Site Reliability Engineer who will work with the software engineers to build reliable, high-capacity, high-performance infrastructure in support of our mission to reimagine learning for millions of students worldwide.
If you know AWS services inside out, have solid networking experience, and you like
engineering solutions to solve site reliability and operations problems, you will thrive in
this position.
Essential Accountabilities:
Hands-on design, analysis and troubleshooting of highly distributed, large-scale
production systems
Ownership of reliability, uptime, capacity, and performance analysis thereof
Ensuring the repeatability, traceability, and transparency of our infrastructure automation
Identifying highest-impact opportunities to optimize existing systems
System design consulting for teams seeking to leverage or improve their production
Anticipate, build and plan capacity for upcoming product/feature launches
Required Skills:
Mastery of AWS services (IAM, EC2, S3, EBS/EFS, ELB/ALB, AutoScaling, RDS and
replication techniques, VPC, Subnets, Elastic IP, Route53, CloudWatch, CloudFront,
Lambda, CloudFormation, ECS, SNS, ElastiCache);
Expertise in container/container-fleet-orchestration technologies (like Docker,
Kubernetes, AWS ECS);
Expertise in designing and managing escalation response plans, from monitoring through
reaction, response, remediation and retrospective, in culturally aligned (proactive,
customer-focused, collaborative, data-driven and AUTOMATED) ways;
Mastery of infrastructure build and configuration automation technologies (like
Terraform, Ansible, Puppet, CodeDeploy, Chef);
Strong skills in reading, understanding and writing code in at least two of: JavaScript,
Python, PHP, Go, or Ruby;
Strong network engineering skills;
Cloud and container native Linux administration/build/management skills (AWS AMIs,
Packer, etc.);
Significant experience troubleshooting concurrent and distributed system interactions;
Expertise with continuous-deployment software development lifecycles in the Cloud (e.g.
Cloud database operations and deployment experience (RDS MySQL/Postgres/Aurora),
caching operations & deployments (Memcached, Redis);
Expertise with Lean/Agile deployment processes (ZDT: Blue/Green, Canary, DNS
Familiarity with site and infrastructure monitoring systems (CloudWatch, Datadog, New
Relic, Sumo Logic, ThousandEyes);
Strong problem solving, root cause analysis and systems engineering skills;
Good presentation and communication skills;
Expertise with SDLC branching, SCM, and code deployment systems (Git/Gitflow,
Jenkins, CircleCI, etc.);
