Site Reliability Engineer - High Performance Computing / AI-ML

Twitter, San Francisco, CA 94118

Posted 2 weeks ago

Are you prepared to join the X team and help build the ultimate real-time information-sharing app, revolutionizing how people connect? At X, we're on a mission to become a trusted global digital public square, committed to minimal censorship within legal boundaries. Our goal is to empower every user to freely create and share ideas, fostering open public discourse without barriers. Join us in shaping this thrilling journey where your contribution will be invaluable to our success!

Role: Site Reliability Engineer - HPC / AI-ML (All Levels)

Location: San Francisco, New York, Los Angeles, Seattle or Austin

Base Salary Range: $120,000 to $297,000

Who We Are:

At X, we're pioneering the frontier of technology with our innovative Everything App. Our mission is to revolutionize how people connect, share ideas, and engage in meaningful conversations. We champion freedom of speech and strive to create a platform that embraces diverse perspectives. Our commitment is to foster open dialogue and empower individuals to express themselves freely.

What You'll Do:

As a Site Reliability Engineer (SRE) supporting HPC (High Performance Computing) + AI/ML initiatives at X, you will play a crucial role in maintaining and enhancing the reliability, availability, and performance of our large-scale systems. Your responsibilities will include:

Managing and troubleshooting large-scale clusters (primarily Linux + Kubernetes) to ensure the stability and efficiency of our platform
Collaborating with cross-functional teams, including hardware engineers and software developers, to support and improve our infrastructure
Automating the provisioning and deployment of systems to enhance long-term health and scalability
Ensuring the robustness of our HPC environments and storage clusters
Writing and maintaining scripts and tools for automation and monitoring
Addressing system failures and performance issues, identifying root causes, and implementing preventive measures
Working closely with end users to understand changing needs as our environment evolves

Who You Are:

We're looking for exceptional engineers who are passionate about our mission and have a strong desire to make a meaningful impact. The ideal candidate will have:

2+ years of professional software development experience
Extensive experience with Kubernetes and container orchestration
Proficiency in one or more object-oriented programming languages (e.g., Python, Java, C++, Scala)
Proficiency in scripting languages (e.g., Python, Bash)
Strong experience in configuration management (e.g., Puppet, Ansible, Chef)
Familiarity with Ethernet networking at scale and distributed systems
Strong troubleshooting skills and experience with HPC environments
Experience managing large-scale systems, ideally supporting thousands of machines
Working understanding of the storage systems required to support such environments
Experience with various GPU / accelerator architectures and ability to optimize performance on such platforms
Ability to think outside the box and come up with innovative solutions to complicated problems
Strong commitment and willingness to work in a fast-paced environment
Excellent communication and interpersonal skills

At X, our small but fast-paced team values innovation, creativity, and a strong commitment to our mission. As a Site Reliability Engineer, you'll have the opportunity to make a significant impact on the future of X and our aspiration to build the Everything App.
