Infrastructure Manager - Linux Subject Matter Expert (34285)
Myticas Consulting
Date: 19 hours ago
City: Ottawa, ON
Contract type: Full time

The Manager will serve as a Linux Subject Matter Expert (SME), responsible for monitoring, maintaining, troubleshooting, and supporting high-performance computing (HPC) nodes that are critical to our client’s day-to-day operations. The role focuses on ensuring a secure, optimized, and highly available HPC environment, while delivering deep technical expertise and guidance to users and internal teams.
Candidates must be able to work onsite at least 4 days per week.
Key Responsibilities
- Act as the primary technical expert for Linux-based HPC clusters – ensuring performance, capacity, and availability targets are met.
- Identify, diagnose, and resolve complex second-level issues for hardware, software, network, VPN, and Linux environments; escalate as needed with full documentation.
- Manage daily operations of Linux-based HPC environments, including patching, upgrades, security hardening, and configuration of Ubuntu and Red Hat systems.
- Support job submission and workload management using Slurm or OpenHPC, and assist end users in optimizing compute workloads.
- Migrate existing nodes to Linux, ensuring minimal downtime and performance impact.
- Implement and manage cluster patching/automation tools such as Foreman (or similar) to streamline operations.
- Install and configure servers, storage, hypervisors (KVM), and other HPC infrastructure components.
- Automate administrative tasks to improve operational efficiency.
- Execute firewall access requests, monitor security alerts, and assist in incident response.
- Provide second-level support and mentorship to junior technical staff, ensuring knowledge transfer and consistent process execution.
- Develop, maintain, and publish technical documentation, KB articles, and end-user guides for new systems or upgrades.
- Participate in on-call rotations, emergency incident response, and occasional after-hours maintenance windows.
Qualifications
- Diploma or Degree in Computer Science, Information Technology, or related field.
- At least 2 years in IT (with a related university degree) or at least 7 years in IT (with a three-year college diploma).
- Enterprise-level Linux expertise (Ubuntu and/or Red Hat) is essential.
- Certifications (e.g., MCSE, CISSP) are strong assets.
- Proven track record as a Linux SME in installation, tuning, and operational support.
- In-depth experience with HPC clusters and job scheduling tools such as Slurm, LSF, or GridEngine.
- Strong knowledge of KVM or similar hypervisors.
- Working understanding of network systems, protocols, and standards including Active Directory integration.
- Identity management experience (Microsoft Identity Manager, Azure AD Connect).
- Solid scripting skills (Bash required; additional scripting languages are an asset).
- Experience applying advanced troubleshooting to resolve performance, configuration, or security issues.
- Excellent problem-solving, organizational, and documentation skills.
- Ability to communicate clearly with both technical and non-technical stakeholders.
- Bilingualism (English/French) is an asset.
- Microsoft Windows knowledge is an asset.
- Operate with minimal supervision while making decisions based on analysis, troubleshooting, and established procedures.
- Coordinate with helpdesk, networking, platform, and security teams to ensure alignment of upgrades, patches, and operations.
Working Conditions
- Comfortable office environment with periodic physical tasks (e.g., installing hardware).
- Requires appropriate security clearance.
- Must be willing to provide occasional off-hours support and participate in on-call rotation.