Staff Site Reliability Engineer, PLM Operations
Tesla
Palo Alto, CA 94306
Posted 2 days ago
This position can be based in Palo Alto, CA, San Diego, CA or Austin, TX.
Every day, thousands of Tesla Engineers around the world use a variety of software tools and data stores to design mechanical, electrical, electronic, and software systems. The PLM/CAD Operations team, POPS for short, maintains and improves these systems as technologies evolve so that Tesla Engineers have access to reliable and performant engineering design tools.
Due to the breadth of technology used by Tesla, the members of the POPS team are expected to be technical generalists with deeper expertise in a few areas, e.g. databases, networking, or cluster management. As SREs, we replace toil with automation. We develop tooling in Go, but we encounter plenty of Java, Python, JS frameworks, Tcl, and even some VB. We manage clusters above the node allocation layer; for example, we handle our own kubelet upgrades and Windows nodes.
- Define SLOs around latency, traffic, errors and saturation. Reliability and performance are the team's deliverables
- Maintain Tesla-custom Helm charts to deploy highly customized and evolving 3DExperience (Dassault Systèmes) services running on on-prem Kubernetes
- Modernize our deployment infrastructure using custom GitHub Actions, ArgoCD, Atlantis, and Terraform
- Keep services highly performant using tools like Prometheus, Grafana, Catchpoint, Splunk, and OpsGenie
- Participate in an on-call rotation, manage incidents as Incident Commander, and write actionable incident reports
- Manage tasks via Jira for observability and human capacity planning. Maintain excellent Jira hygiene
- Write and review design docs - testing frameworks, deployment models, environment definitions, etc.
- Deep networking experience, e.g. experience troubleshooting outages from L7 to L3, experience contributing to infra or networking GitHub repos or publications
- Deep Oracle Database experience, e.g. indexing deltas, schema migrations
- Docker/Kubernetes, e.g. performed kubelet upgrades in situ, used skopeo or CRI-O intentionally, configured containerd
- Diagnosing problems in legacy enterprise Java stacks
- Installing, managing or using 3DExperience, or similar experience with other PLM software
- Outstanding experience with scientific computing or LIMS
- Deep understanding of hypervisor technology (VMware)