Staff Site Reliability Engineer, PLM Operations

Tesla, Palo Alto, CA 94306

Posted 2 days ago

This position can be based in Palo Alto, CA; San Diego, CA; or Austin, TX.

Every day, thousands of Tesla Engineers around the world use a variety of software tools and data stores to design mechanical, electrical, electronic, and software systems. The PLM/CAD Operations team, POPS for short, maintains and improves these systems as technologies evolve so that Tesla Engineers have access to reliable and performant engineering design tools.

Due to the breadth of technology used by Tesla, the members of the POPS team are expected to be technical generalists with deeper expertise in a few areas, e.g. databases, networking, or cluster management. As SREs, we replace toil with automation. We develop tooling in Go, but we encounter plenty of Java, Python, JS frameworks, Tcl, and even some VB. We manage clusters above the node allocation layer, handling, for example, our own kubelet upgrades and Windows nodes.

  • Define SLOs around latency, traffic, errors and saturation. Reliability and performance are the team's deliverables
  • Maintain Tesla-custom Helm Charts to deploy highly customized and evolving 3DExperience (Dassault Systèmes) services running on on-prem Kubernetes
  • Modernize our deployment infrastructure using custom GitHub Actions, ArgoCD, Atlantis, and Terraform
  • Achieve high-performance services using tools like Prometheus, Grafana, Catchpoint, Splunk, and OpsGenie
  • Participate in an on-call rotation, manage incidents as Incident Commander, and write actionable incident reports
  • Manage tasks via Jira for observability and human capacity planning. Maintain excellent Jira hygiene
  • Write and review design docs - testing frameworks, deployment models, environment definitions, etc.
  • Deep networking experience, e.g. experience troubleshooting outages from L7 to L3, experience contributing to infra or networking GitHub repos or publications
  • Deep Oracle Database experience, e.g. indexing deltas, schema migrations
  • Docker/Kubernetes experience, e.g. performed kubelet upgrades in situ, used skopeo or CRI-O intentionally, configured containerd
  • Diagnosing problems in legacy enterprise Java stacks
  • Installing, managing or using 3DExperience, or similar experience with other PLM software
  • Outstanding experience with scientific computing or LIMS
  • Deep understanding of hypervisor technology (VMware)