Staff Site Reliability Engineer, PLM Operations

Tesla Palo Alto, CA 94306

Posted 2 days ago

This position can be based in Palo Alto, CA, San Diego, CA or Austin, TX.

Every day, thousands of Tesla Engineers around the world use a variety of software tools and data stores to design mechanical, electrical, electronic, and software systems. The PLM/CAD Operations team, POPS for short, maintains and improves these systems as technologies evolve so that Tesla Engineers have access to reliable and performant engineering design tools.

Because of the breadth of technology used at Tesla, members of the POPS team are expected to be technical generalists with deeper expertise in a few areas, e.g. databases, networking, or cluster management. As SREs, we replace toil with automation. We develop tooling in Go, but we encounter plenty of Java, Python, JS frameworks, Tcl, and even some VB. We manage clusters above the node allocation layer, handling, for example, our own kubelet upgrades and Windows nodes.

Define SLOs around latency, traffic, errors, and saturation. Reliability and performance are the team's deliverables.
Maintain Tesla-custom Helm Charts to deploy highly customized and evolving 3DExperience (Dassault Systèmes) services running on on-prem Kubernetes
Modernize our deployment infrastructure using custom GitHub Actions, ArgoCD, Atlantis, and Terraform
Maintain high-performance services using tools like Prometheus, Grafana, Catchpoint, Splunk, and OpsGenie
Be in an on-call rotation, manage incidents as Incident Commander, write actionable incident reports
Manage tasks via Jira for observability and human capacity planning. Maintain excellent Jira hygiene
Write and review design docs - testing frameworks, deployment models, environment definitions, etc.
Deep networking experience, e.g. experience troubleshooting outages from L7 to L3, experience contributing to infra or networking GitHub repos or publications
Deep Oracle Database experience, e.g. indexing deltas, schema migrations
Docker/Kubernetes, e.g. performed kubelet upgrades in-situ, used skopeo or CRI-O intentionally, configured containerd
Diagnosing problems in legacy enterprise Java stacks
Installing, managing or using 3DExperience, or similar experience with other PLM software
Outstanding experience with scientific computing or LIMS
Deep understanding of hypervisor technology (VMware)
