Infrastructure Manager - Linux Subject Matter Expert (34285)

Myticas Consulting


Date: 19 hours ago
City: Ottawa, ON
Contract type: Full time
The Manager will serve as a Linux Subject Matter Expert (SME), responsible for monitoring, maintaining, troubleshooting, and supporting high-performance computing (HPC) nodes that are critical to our client’s day-to-day operations. The role focuses on ensuring a secure, optimized, and highly available HPC environment, while delivering deep technical expertise and guidance to users and internal teams.

Candidates must be able to work onsite at least 4 days per week.

Key Responsibilities

  • Act as the primary technical expert for Linux-based HPC clusters – ensuring performance, capacity, and availability targets are met.
  • Identify, diagnose, and resolve complex second-level issues for hardware, software, network, VPN, and Linux environments; escalate as needed with full documentation.
  • Manage daily operations of Linux-based HPC environments, including patching, upgrades, security hardening, and configuration of Ubuntu and RedHat systems.
  • Support job submission and workload management using Slurm or OpenHPC, and assist end users in optimizing compute workloads.
  • Migrate existing nodes to Linux, ensuring minimal downtime and performance impact.
  • Implement and manage cluster patching/automation tools such as Foreman (or similar) to streamline operations.
  • Install and configure servers, storage, hypervisors (KVM), and other HPC infrastructure components.
  • Automate administrative tasks to improve operational efficiency.
  • Execute firewall access requests, monitor security alerts, and assist in incident response.
  • Provide second-level support and mentorship to junior technical staff, ensuring knowledge transfer and consistent process execution.
  • Develop, maintain, and publish technical documentation, KB articles, and end-user guides for new systems or upgrades.
  • Participate in on-call rotations, emergency incident response, and occasional after-hours maintenance windows.

Education & Experience

  • Diploma or Degree in Computer Science, Information Technology, or related field.
  • Minimum of 2 years in IT (with a related university degree) or 7 years in IT (with a three-year college diploma).
  • Enterprise-level Linux expertise (Ubuntu and/or RedHat) is essential.
  • Certifications (e.g., MCSE, CISSP) are strong assets.

Specialized Skills

  • Proven track record as a Linux SME in installation, tuning, and operational support.
  • In-depth experience with HPC clusters and job scheduling tools such as Slurm, LSF, or GridEngine.
  • Strong knowledge of KVM or similar hypervisors.
  • Working understanding of network systems, protocols, and standards including Active Directory integration.
  • Identity management experience (Microsoft Identity Manager, Azure AD Connect).
  • Solid scripting skills (Bash required; additional scripting languages are an asset).
  • Experience applying advanced troubleshooting to resolve performance, configuration, or security issues.
  • Excellent problem-solving, organizational, and documentation skills.
  • Ability to communicate clearly with both technical and non-technical stakeholders.
  • Bilingualism (English/French) is an asset.
  • Microsoft Windows knowledge is an asset.

Decision Making & Supervision

  • Operate with minimal supervision while making decisions based on analysis, troubleshooting, and established procedures.
  • Coordinate with helpdesk, networking, platform, and security teams to ensure alignment of upgrades, patches, and operations.

Working Conditions

  • Comfortable office environment with periodic physical tasks (e.g., installing hardware).
  • Requires appropriate security clearance.
  • Must be willing to provide occasional off-hours support and participate in on-call rotation.

Matheo Theodossiou

How to apply

To apply for this job, you need to log in on our website. If you don't have an account yet, please register.
