HPC Engineer, Autopilot Infrastructure

Tesla, Palo Alto, CA 94306

Posted 2 months ago

Tesla's Supercomputing/HPC team works directly with the high-performance computing and machine learning infrastructure on which our ML algorithms run; this includes virtual simulations, Autopilot hardware, silicon design, and Dojo. With the rapidly growing need for more data and optimized compute resources, cluster builds are getting larger and increasingly complex. Continued development and automation of deployment, monitoring, self-healing, and alerting processes is imperative to the success of our engineering groups.

As an HPC Engineer on our Supercomputing/HPC team, you will be responsible for maintaining and improving our infrastructure to ensure engineering teams across Autopilot/AI and Dojo have the necessary tools and resources to be productive. This includes managing our HPC clusters, monitoring compute/GPU/network metrics, writing scripts for configuration management, and collaborating with our Data Center team to coordinate the smooth operation of hundreds of servers and bring up new capacity on our GPU clusters.

  • Support the AI/ML cluster infrastructure on both GPU and Dojo platforms, focusing on systems automation, configuration management and deployment at scale

  • Improve our cluster health monitoring and auto-recovery pipeline

  • Work with users on debugging application performance issues

  • Work with hardware and storage vendors to tune and optimize our servers, storage and network

  • Write Ansible playbooks for configuration management

  • Performance tuning and OS provisioning on Linux systems

  • Manage HPC clusters, workloads and applications

  • Automation and systems engineering in Python, Golang or Bash/Shell

  • Participate in a 24x7 on-call rotation

  • Proficiency in a high-level programming language and/or scripting (Python, Golang, Bash)

  • Strong understanding of Linux fundamentals and performance optimizations (Ubuntu/RHEL OS)

  • Advanced experience with configuration management systems such as Ansible

  • Demonstrable knowledge of TCP/IP, IPoIB, Linux operating system internals, filesystems, disk/storage technologies and storage protocols

  • Experience in collaborating with network and data center teams for large scale cluster builds

  • Experience with configuration management software (Ansible, etc.), systems monitoring and alerting (Prometheus, Grafana, Telegraf, Splunk, etc.), and/or administering HPC workload managers (Slurm, LSF, etc.)

  • Experience with high-throughput low-latency networks, GPU-based computing systems, and/or high performance storage systems

  • Experience with Slurm and storage management of distributed parallel file systems is a plus

  • Bachelor's degree in computer science, electrical engineering or related field

  • 3+ years of additional equivalent experience or evidence of exceptional ability related to the position
