We are now seeking a GPU Computing Systems Specialist.
Would you be thrilled to work with the most cutting-edge hardware and applications for deep learning in the world? Do you have the skills to run a diverse computing cluster filled with the latest NVIDIA GPUs? NVIDIA's Deep Learning Architecture and Libraries group is looking for a world-class systems specialist to run and grow our internal development cluster, the core infrastructure that software developers and GPU architects rely on for every stage of our product development. Our mission, which spans both hardware and software, is to consistently deliver the world's fastest deep learning technology stack for applications ranging from autonomous vehicles to training enormous models on supercomputers.
Your work will enable engineers to work efficiently with a wide variety of systems as they vigilantly seek out opportunities for performance optimization and continuously deliver high-quality software. As a member of our team, you will need to be versatile enough to wear many hats: systems specialist, system administrator, and software engineer. Your work will enable the groundbreaking experimentation that allows us to design the world's most powerful systems for the most demanding computing applications. You will have a meaningful impact at a fast-moving company that is spearheading the next wave in computing technology. Join our technically diverse team of GPU architects, software engineers, and infrastructure experts to unlock unprecedented deep learning performance in every domain!
What you'll be doing:
Administer a diverse GPU computing cluster containing production and pre-production GPUs
Use modern cluster management tools to configure and monitor the nodes and network
Develop scripts, tools, and distributed systems to automate cluster management tasks and simplify usage
Assist users with experiment and application setup using a variety of development, performance analysis, and hardware configuration tools
Work closely with multiple teams to identify new infrastructure and software requirements
Influence methodologies for cluster usage and testing of tools and workflows
What we need to see:
BA, BS, or MS in relevant field (e.g. CS, EE, CE)
At least 2 years of experience deploying and administering Linux clusters, with at least 5 years of relevant industry experience
Deep understanding of operating systems, containers, computer networks, and high-performance applications
Experience with modern DevOps tools (Docker, GitLab, SaltStack, or similar)
Background with HPC job schedulers (SLURM or similar)
Ways to stand out from the crowd:
Familiarity with GPU computing, HPC, and parallel programming (CUDA, MPI, OpenMP)
Experience working with deep learning frameworks like Caffe, TensorFlow, and Torch
Strong programming skills in Python (or similar) and C++ (or similar)
NVIDIA is widely considered to be one of technology's most desirable employers. We have some of the most forward-thinking and hardworking people on the planet working for us. Does the idea of contributing to and pushing the boundaries of state-of-the-art AI and compute systems excite you? If so, we want to hear from you!
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.