Senior AI Infrastructure & Platform Engineer - Riyadh, KSA
Role Overview
We are seeking a highly skilled Senior AI Infrastructure & Platform Engineer to join our client’s team in Riyadh. In this role, you’ll be responsible for building, managing, and optimizing scalable AI infrastructure and compute environments that support high-performance workloads, including GPU-accelerated AI/ML pipelines, cluster scheduling, and orchestration.
Key Responsibilities
• Deploy, maintain, and optimize GPU-based compute clusters and infrastructure.
• Manage and operate GPU orchestration tools and platforms such as:
  • Nvidia Base Command Manager (critical)
  • Nvidia AI Enterprise Suite
  • Nvidia GPU and Network Operators
  • Nvidia NIMs and Blueprints
• Configure, deploy, and maintain compute workloads using scheduling and orchestration tools including:
  • Slurm (critical)
  • Vanilla Kubernetes
• Install, configure, and maintain the underlying OS (e.g. Canonical Ubuntu) and supporting system software.
• Monitor and troubleshoot infrastructure performance, availability, and reliability; ensure high uptime for AI/ML workloads.
• Work with data scientists, ML engineers, and dev teams to define infrastructure requirements, resource allocation, and deployment workflows.
• Develop automation scripts, CI/CD pipelines, and best practices for infrastructure provisioning and management.
• Document architecture, configurations, and operational procedures; enforce security, compliance, and backup policies.
Required Skills & Experience
• Proven experience managing GPU-based AI/ML infrastructure and compute clusters.
• Hands-on experience with:
  • Nvidia Base Command Manager
  • Nvidia AI Enterprise Suite
  • Nvidia GPU/Network Operators, NIMs, Blueprints
• Strong experience with Slurm and/or Kubernetes orchestration.
• Solid Linux system administration skills, preferably on Ubuntu or similar distributions.
• Strong scripting/automation ability (e.g. Bash, Python, or relevant tooling) for provisioning, deployment, and maintenance.
• Excellent troubleshooting and performance-tuning skills.
• Experience collaborating with ML/data science teams and integrating infrastructure with their workflows.
• Strong understanding of networking, security, resource allocation, and cluster management best practices.
Preferred Qualifications
- Previous experience working in a high-performance computing (HPC) or AI-focused infrastructure team.
- Knowledge of containerization, container orchestration, and GPUs in cloud or on-prem environments.
- Experience with CI/CD, infrastructure-as-code (e.g. Terraform, Ansible), monitoring tools, and logging setups.
- Familiarity with workload scheduling, job queuing, resource quotas, and GPU-shared environments.
Requirements
- Proven experience managing GPU-based AI/ML infrastructure and compute clusters
- Hands-on experience with Nvidia Base Command Manager, Nvidia AI Enterprise Suite, Nvidia GPU/Network Operators, NIMs, Blueprints
- Strong experience with Slurm and/or Kubernetes orchestration
- Solid Linux system administration skills (Ubuntu or similar)
- Strong scripting/automation ability (Bash, Python)
- Excellent troubleshooting and performance-tuning skills
- Experience collaborating with ML/data science teams
- Strong understanding of networking, security, resource allocation, and cluster management
Nice to Have
- Previous experience in HPC or AI-focused infrastructure
- Knowledge of containerization, container orchestration, and GPUs
- Experience with CI/CD, infrastructure-as-code
- Familiarity with workload scheduling, job queuing, resource quotas, and GPU-shared environments
Responsibilities
- Deploy, maintain, and optimize GPU-based compute clusters and infrastructure
- Manage and operate GPU orchestration tools
- Configure, deploy, and maintain compute workloads using scheduling and orchestration tools
- Install, configure, and maintain the underlying OS and supporting system software
- Monitor and troubleshoot infrastructure performance, availability, and reliability
- Define infrastructure requirements, resource allocation, and deployment workflows with data scientists and engineers
- Develop automation scripts, CI/CD pipelines, and best practices
- Document architecture, configurations, and operational procedures; enforce security, compliance, and backup policies
DeepSource provides an AI-powered platform for automated code review, helping development teams improve code quality and reduce bugs. It serves software engineering teams of all sizes.