The objective of this work is to manage a high performance computing (HPC) cluster (HPC administrator) and to support users (HPC analyst) with the installation, execution, and debugging of research applications and code on HPC clusters. This requires troubleshooting and ensuring client satisfaction so that clients (scientists) can devote their time to NRC research priorities rather than to resolving IT-related issues.
Scope of Work
Category
Tasks for Contractor
HPC Administrator Tasks
Maintain an HPC cluster (hardware, image management, local networking, scheduler, backups).
Troubleshoot the environment when an incident occurs to ensure a quick return to normal operations.
HPC Analyst Tasks
Meet with scientists and evaluate their requirements for HPC support.
Develop a task plan to meet scientists' needs and consult the technical authority for approval.
Application builds and installs, runtime troubleshooting (GNU, Intel, Fortran, Nvidia).
Support for open-source and commercial off-the-shelf (COTS) software, including:
Python and Anaconda installs.
Bash scripts, build/make tools, EasyBuild, and Spack.
Assist with in-house developed applications (compilation and runtime).
Other General Tasks
Management of:
Operating system (patching schedule, reliability for Linux distributions).
Accounts (creation, deletion).
Configuration via Git, MS DevOps, Ansible Playbooks.
RPM/DEB Packages.
Environment modules.
ThinLinc troubleshooting.
Troubleshooting & Hardware
Troubleshooting jobs on schedulers (PBS Pro/Torque, SLURM, SGE).
Ensure reliable CUDA installs, troubleshoot GPU failures and other CUDA software/driver issues.
Hardware support (memory upgrades, storage arrays, power and network cabling, iLO).
Documentation
Document each process for every task to ensure enterprise knowledge continuity.
Mandatory Requirements
The proposed resource has five (5) years’ experience within the last ten (10) years in administering HPC (High Performance Computing) systems and performing HPC analyst tasks, as per Annex A – Statement of Work.
The proposed resource has worked for more than twelve (12) months. Each reference provided must have supervised the proposed resource.