HPC Systems Administrator
Vector Institute
Date: 11 hours ago
City: Toronto, ON
Contract type: Full time

Position Summary
The Vector Institute is seeking an HPC Systems Administrator to join our growing team in Toronto as we continue the work of making Canada a centre of expertise for AI in the world.
The incumbent in this role will participate in the building and maintenance of High-Performance Computing environments for world-class research in Machine Learning.
As a member of the Scientific Computing team, the role will share responsibility for managing servers, networks, storage, and security for the High-Performance Computing infrastructure, as well as providing support for the office local area network, servers, and scientific computing workstations. The role will also install and maintain server software and layered AI and machine learning software to support our 1000+ researchers and affiliates.
We are seeking a highly motivated Systems Administrator with a hands-on, problem-solving approach to managing and troubleshooting high-tech environments. The role will combine remote work, on-site work at the office, and work at our co-location facility as required.
Key Responsibilities
- Support the Vector HPC systems, a growing set of HPC compute clusters with 250+ nodes, 10,000+ cores, and 1,200+ GPUs;
- Support our GPU-enabled workstation office environment;
- Provide guidance and support to our research community;
- Develop and maintain solutions for automatic installation and configuration of infrastructure;
- Perform hardware and software system upgrades and maintenance;
- Install new scientific software and libraries on servers, workstations, or laptops across a variety of operating systems (Linux, macOS, Windows);
- Support researchers in all their computing needs;
- Maintain network infrastructure and assist users;
- Maintain system security: firewall, IPS, system logs; and,
- General enterprise IT operations.
- Ensures the smooth functioning of the research systems, by undertaking troubleshooting, maintenance and installation tasks;
- Researchers and the enterprise operations feel supported in all other computing needs;
- Builds and maintains tools that facilitate the automated or direct administration of network and computing infrastructure, both locally and on the cloud.
- Degree or diploma in computer science or engineering, or equivalent, or more than three (3) years of proven, hands-on experience in Linux/UNIX systems administration (e.g., Ubuntu, Red Hat, CentOS), preferably in a research environment;
- Hands-on experience in managing an HPC grid, Slurm, or equivalent scheduler;
- Proven programming/scripting skills as they pertain to systems administration;
- Managing and troubleshooting environments using mostly open-source software;
- Demonstrated ability to learn quickly;
- Demonstrated ability to prioritize tasks and resolve problems in a timely manner;
- Ability to work autonomously, multi-task and work in a fast-paced and stressful environment;
- Being proactive, addressing potential problems before they occur;
- Possessing a strong attention to detail;
- Having a problem-solving outlook;
- Excellent verbal and written communication skills.
- Hands-on experience in managing HPC workload management systems such as Slurm, SGE, Moab/Torque, or an equivalent scheduler;
- Experience supporting large scale-out storage infrastructure technologies (SAN/NAS) and a good understanding of file systems such as ZFS and GPFS;
- Good understanding of high-speed internetworking technologies such as 100GE, InfiniBand, and link aggregation;
- Good understanding of and experience with data management at scale, including performance, backups, archive, and monitoring;
- Experience maintaining application tools and databases (e.g., MySQL, PostgreSQL);
- Experience with open-source infrastructure systems such as OpenLDAP, NFS, OpenZFS, and 2FA systems.
If you require an accommodation at any point throughout the recruitment and selection process, please contact [email protected] and we will happily work with you to meet your needs.