Site Reliability Engineer, Dtx

Harvard University, Boston, MA 02298

Posted 2 months ago

Additional Information

This is a hybrid role (some combination of onsite and remote work) that requires you to be onsite at our Boston, MA campus a set number of days per month. Specific days and schedule will be determined between you and your manager.

We may conduct candidate interviews virtually (phone and/or via Zoom) and/or in-person for this role.

A cover letter is required to be considered for this opportunity.

Harvard Business School will not offer visa sponsorship for this opportunity.

Culture of Inclusion: The work and well-being of HBS is profoundly strengthened by the diversity of our network and our differences in background, culture, national origin, religion, sexual orientation, and life experiences. Explore more about HBS work culture at https://www.hbs.edu/employment.

Basic Qualifications

  • Minimum of five years' post-secondary education or relevant work experience

Position Description

Be a pioneer in business, education, and global impact by joining the Harvard Business School (HBS) Digital Transformation team, a "startup with assets," where you will have the chance to deploy digital- and emerging-technology education solutions. Where else can you make a difference at the intersection of cutting-edge technology, world-class education, noble purpose, and timeless legacy?

We are building educational and research solutions powered by Generative AI (GenAI) that scale across hundreds of courses and to hundreds of thousands of users. Our products assist educators and students alike with intelligent, adaptive capabilities that make education more accessible, engaging, and effective.

As a Site Reliability Engineer at Harvard Business School (HBS), you will play a crucial role in ensuring the high availability, performance, security, and scalability of our cloud-based solutions and services. You will work closely with our development and operations teams to build and maintain robust, efficient, and reliable systems on the AWS platform. You will work at the intersection of software engineering and systems engineering to build and run large-scale, fault-tolerant systems that balance speed of deployment with stability and operate at peak efficiency while keeping costs under control.

  • Design, implement, and maintain scalable, reliable, and efficient systems on the AWS platform.

  • Automate the deployment, scaling, and management of applications using AWS services such as EC2, S3, RDS, Lambda, CloudFormation, etc.

  • Monitor system performance, troubleshoot issues, and implement solutions to ensure optimal operation and uptime.

  • Implement solutions that enable running multiple GenAI workflows using shared infrastructure, while ensuring high throughput, low latency, and speed of deployment.

  • Provide a platform for machine learning (and other exciting workloads) allowing developers to move quickly and experiment.

  • Collaborate with development teams to optimize applications for the cloud and implement best practices for cloud-native development.

  • Implement and manage continuous integration and deployment (CI/CD) pipelines.

  • Develop and maintain disaster recovery plans and conduct regular system backups.

  • Ensure security compliance and best practices throughout the AWS infrastructure.

  • Document system configurations, processes, and procedures.

  • Develop runbooks and recipes for on-call support, and take part in the rotation schedule to resolve critical issues outside of regular business hours.

  • Adhere to standard methodologies in architectural design, testing (unit, integration, visual, and regression), and scrum.

  • Evaluate developer platform designs, technical decisions, and code to ensure all are high quality, efficient, and well documented.

  • Develop and lead all aspects of the Container Orchestration Platform, a diverse ecosystem of multiple applications.

  • Complete other responsibilities as assigned.
