HPC Systems Administrator

Vector Institute

Date: 2 weeks ago

City: Toronto, ON

Contract type: Full time

Position Summary

The Vector Institute is seeking an HPC Systems Administrator to join our growing team in Toronto as we continue the work of making Canada a global centre of expertise in AI.

The incumbent in this role will participate in the building and maintenance of High-Performance Computing environments for world-class research in Machine Learning.

As a member of the Scientific Computing team, the role will share responsibility for managing servers, networks, storage, and security for the High-Performance Computing infrastructure, and will provide support for the office local area network, servers, and scientific computing workstations. The role will also install and maintain server software and layered AI and machine learning software to support our 1000+ researchers and affiliates.

We are seeking a highly motivated System Administrator with a hands-on, problem-solving approach to managing and troubleshooting high-tech environments. The role combines remote work, on-site work at the office, and work at our co-location facility as required.

Key Responsibilities

Support the Vector HPC systems, which comprise growing HPC compute clusters with 250+ nodes, 10,000+ cores, and 1,200+ GPUs;
Support our GPU-enabled workstation office environment;
Provide guidance and support to our research community;
Develop and maintain solutions for automatic installation and configuration of infrastructure;
Perform hardware and software system upgrades and maintenance;
Install new scientific software and libraries on servers, workstations, or laptops across a variety of operating systems (Linux, macOS, Windows);
Support researchers in all their computing needs;
Maintain network infrastructure and assist users;
Maintain system security: firewall, IPS, system logs; and,
General enterprise IT operations.

Key Success Measures

Ensures the smooth functioning of the research systems by undertaking troubleshooting, maintenance, and installation tasks;
Ensures that researchers and enterprise operations feel supported in all other computing needs;
Builds and maintains tools that facilitate the automated or direct administration of network and computing infrastructure, both locally and in the cloud.

Profile of the Ideal Candidate

Degree or diploma in computer science or engineering, or equivalent, or more than three (3) years of proven, hands-on experience in Linux/UNIX systems administration (e.g., Ubuntu, Red Hat, CentOS), preferably in a research environment;
Hands-on experience managing an HPC grid with Slurm or an equivalent scheduler;
Proven programming/scripting skills as they pertain to systems administration;
Experience managing and troubleshooting environments built mostly on open-source software;
Demonstrated ability to learn quickly;
Demonstrated ability to prioritize tasks and resolve problems in a timely manner;
Ability to work autonomously, multi-task and work in a fast-paced and stressful environment;
Being proactive, addressing potential problems before they occur;
Possessing a strong attention to detail;
Having a problem-solving outlook;
Excellent verbal and written communication skills.

The Following Qualifications and Experience Are Considered Assets

Hands-on experience managing HPC workload management systems such as Slurm, SGE, Moab/Torque, or an equivalent scheduler;
Experience supporting large scale-out storage infrastructure technologies (SAN/NAS) and a good understanding of file systems such as ZFS and GPFS;
Good understanding of high-speed internetworking technologies such as 100GE, InfiniBand, and link aggregation;
Good understanding of and experience with data management at scale, including performance, backups, archive, and monitoring;
Experience maintaining application tools and databases (e.g., MySQL, PostgreSQL);
Experience with open-source infrastructure systems such as OpenLDAP, NFS, OpenZFS, and 2FA systems.

At the Vector Institute, we are committed to driving excellence and leadership in Canada’s knowledge, creation, and use of AI to foster economic growth and improve the lives of Canadians. We strive for greater inclusion in the programs and culture that we build by welcoming and encouraging applications from all qualified candidates. This includes, but is not limited to, applicants who are Indigenous, 2SLGBTQIA+, racialized persons/visible minorities, women, and people with disabilities.

If you require an accommodation at any point throughout the recruitment and selection process, please contact [email protected] and we will happily work with you to meet your needs.