Install GPU Operator Offline

DCE 5.0 provides a GPU Operator offline package with CentOS 7.9 (kernel 3.10.0-1160) preinstalled. This article explains how to deploy the GPU Operator offline and covers the parameter configuration for the following NVIDIA GPU usage modes:

  • GPU Full Mode
  • GPU vGPU Mode
  • GPU MIG Mode

Please refer to NVIDIA GPU Card Usage Modes for more details. This article demonstrates the installation on the AMD64 architecture with CentOS 7.9 (kernel 3.10.0-1160). If you want to deploy on Red Hat 8.4, refer to Uploading Red Hat GPU Operator Offline Images to Bootstrap Nodes and Building Red Hat 8.4 Offline Yum Repository.

Prerequisites

  1. The addon offline package v0.12.0 or above has been installed on the platform.
  2. The kernel versions of all nodes in the cluster where the GPU Operator will be deployed must be identical, and the OS distributions and GPU card models must be within the scope of the GPU Support Matrix.
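
Before installing, you can quickly confirm that the kernel and distribution requirements are met; the following is a minimal sketch, assuming SSH access to the nodes and that the node names are placeholders for your own:

      # Compare the kernel version and distribution of each GPU node; the outputs must match.
      for node in gpu-node-1 gpu-node-2; do
        ssh "$node" 'echo "$(hostname): $(uname -r), $(cat /etc/redhat-release 2>/dev/null || head -n1 /etc/os-release)"'
      done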

Steps

To install the GPU Operator plugin for your cluster, follow these steps:

  1. Log in to the platform, go to Container Management -> Clusters , and click the target cluster to check the cluster details.

  2. On the Helm Charts page, select All Repositories and search for gpu-operator .

  3. Select gpu-operator and click Install .

  4. Configure the installation parameters for gpu-operator based on the instructions below to complete the installation.
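
The chart can also be inspected from the command line before installing; this is a minimal sketch, assuming helm access to the cluster and that addon is the name of the Helm repository hosting the offline gpu-operator chart:

      # List the gpu-operator charts available in the configured repositories.
      helm search repo gpu-operator

      # Review the chart's default values before filling in the parameters described below.
      helm show values addon/gpu-operator --version 23.6.10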

Configure parameters

Basic information

  • Name : Enter the plugin name
  • Namespace : Select the namespace for installing the plugin
  • Version : Plugin version, for example, 23.6.10
  • Wait : When enabled, all associated resources must be in a ready state for the application installation to be marked as successful
  • Deletion failed : If the installation fails, delete the associated resources that were already installed. Enabling this automatically enables Wait
  • Detail Logs : When enabled, detailed logs of the installation process will be recorded
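
For reference, the Wait , Deletion failed , and Detail Logs options map roughly onto standard Helm flags when installing from the command line; the following is a sketch under that assumption, not an exact description of the UI behavior:

      # --wait   : Wait, block until all associated resources are ready
      # --atomic : Deletion failed, remove installed resources if the installation fails (implies --wait)
      # --debug  : Detail Logs, print verbose output of the installation process
      helm install gpu-operator addon/gpu-operator \
        --namespace gpu-operator --create-namespace \
        --version 23.6.10 --wait --atomic --debug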

Advanced settings

Operator parameters

  1. InitContainer.image : Configure the CUDA image, recommended default image: nvidia/cuda
  2. InitContainer.repository : Repository where the CUDA image is located, defaults to nvcr.m.daocloud.io repository
  3. InitContainer.version : Version of the CUDA image, please use the default parameter
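
When installing from the command line, these parameters can be carried in a values file; the snippet below is a sketch that simply mirrors the parameter names listed above, and the key layout should be verified against helm show values before use:

      # Hypothetical values file overriding the CUDA init container image; verify the key
      # layout against the chart's default values before applying it.
      cat > gpu-operator-values.yaml <<'EOF'
      InitContainer:
        repository: nvcr.m.daocloud.io
        image: nvidia/cuda
      EOF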

Driver parameters

  1. Driver.enable : Configure whether to deploy the NVIDIA driver on the node, default is enabled. If you have already deployed the NVIDIA driver on the node before using the GPU Operator, please disable this.
  2. Driver.image : Configure the GPU driver image, recommended default image: nvidia/driver .
  3. Driver.repository : Repository where the GPU driver image is located, default is nvidia's nvcr.io repository.
  4. Driver.version : Version of the GPU driver image. Use the default parameter for offline deployment; configuration is only required for online installation. Different Driver image versions exist for different operating systems; for more details, refer to Nvidia GPU Driver Versions. Examples of the Driver Version for different operating systems are as follows:

    Note

    The system provides the image 525.147.05-centos7 by default. For other images, refer to Upload Image to Bootstrap Node Repository. There is no need to include the operating system name such as Ubuntu, CentOS, Red Hat at the end of the version number. If the official image contains an operating system suffix, manually remove it.

    • For Red Hat systems, for example, 525.105.17
    • For Ubuntu systems, for example, 535-5.15.0-1043-nvidia
    • For CentOS systems, for example, 525.147.05
  5. Driver.RepoConfig.ConfigMapName : Records the name of the config map that holds the offline yum source configuration for the GPU Operator. When using the pre-installed offline package, the global cluster can run the following command directly. Worker clusters should reuse the yum source configuration of any node in the Global cluster.

    • Configuration for the global cluster

      kubectl create configmap local-repo-config -n gpu-operator --from-file=CentOS-Base.repo=/etc/yum.repos.d/extension.repo
      
    • Configuration for the worker cluster

    Using the yum source configuration of any node in the Global cluster
    1. Use SSH or another method to access any node in the Global cluster and retrieve the platform's offline source profile extension.repo :

      cat /etc/yum.repos.d/extension.repo # View the contents of extension.repo.
      

      The expected output should look like this:

      [extension-0]
      async = 1
      baseurl = http://x.x.x.x:9000/kubean/centos/$releasever/os/$basearch
      gpgcheck = 0
      name = kubean extension 0
      
      [extension-1]
      async = 1
      baseurl = http://x.x.x.x:9000/kubean/centos-iso/$releasever/os/$basearch
      gpgcheck = 0
      name = kubean extension 1
      
    2. Copy the contents of the extension.repo file mentioned above. In the gpu-operator namespace of the cluster where the GPU Operator will be deployed, create a new config map named local-repo-config . Refer to Creating ConfigMaps for creating the config map, or use the command-line sketch shown at the end of this section.

      Note

      The configuration key value must be CentOS-Base.repo, and the value should be the content of the offline source configuration file extension.repo.

    For other operating systems or kernels, refer to the corresponding documentation on building an offline yum source to create the yum source file.
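
For the worker cluster case above, the same config map can also be created from the command line once extension.repo has been copied off a node of the Global cluster; this is a sketch, assuming SSH access to a Global cluster node and a kubeconfig for the worker cluster (angle-bracketed names are placeholders):

      # Copy the offline source configuration file from any node of the Global cluster.
      scp <global-cluster-node>:/etc/yum.repos.d/extension.repo ./extension.repo

      # Create the config map in the worker cluster; the key must be CentOS-Base.repo.
      kubectl --kubeconfig <worker-cluster-kubeconfig> create configmap local-repo-config \
        -n gpu-operator --from-file=CentOS-Base.repo=./extension.repo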

Toolkit Parameters

  1. Toolkit.enable : Default is enabled. This component enables containerd/docker to support running containers that require GPU.

  2. Toolkit.image : Configure the Toolkit image, recommended default image: nvidia/k8s/container-toolkit .

  3. Toolkit.repository : Repository where the Toolkit image is located, defaults to nvcr.m.daocloud.io repository.

  4. Toolkit.version : Version of the Toolkit image; keep it consistent with the official release. The CentOS image is used by default. If you are using Ubuntu, manually modify the addon's YAML and change the CentOS suffix to Ubuntu (see the sketch below). Refer to NVIDIA Container Toolkit for the specific image versions.
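
As an illustration of the Ubuntu change mentioned above, the adjustment is a values override that swaps the operating system suffix of the Toolkit image tag; the following sketch uses a placeholder tag, since the exact version must match what your chart ships:

      # Hypothetical values snippet switching the Toolkit image from the CentOS tag to the
      # matching Ubuntu tag; replace <toolkit-version> with the version shipped by your chart.
      cat > toolkit-ubuntu-values.yaml <<'EOF'
      Toolkit:
        version: <toolkit-version>-ubuntu20.04
      EOF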

MIG Parameters

For detailed configuration, refer to Enabling MIG Functionality

  1. MigManager.enabled : Whether to enable MIG capability feature.
  2. MigManager.Config.name : Name of the MIG partitioning profile, used to define the (GI, CI) partitioning strategy for MIG. Default is default-mig-parted-config . For custom parameters, refer to Enabling MIG Functionality.
  3. Mig.strategy : Strategy for exposing MIG devices on the GPU cards of the node. NVIDIA provides two strategies, single and mixed ; details can be found in NVIDIA GPU Card Mode Explanation.
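
If the release is managed with Helm, the MIG-related parameters can also be switched on in place; the following is a sketch that uses the lower-cased parameter paths of the upstream gpu-operator chart, which is an assumption and should be checked against your chart's values:

      # Enable the MIG manager and select the single strategy; the parameter paths follow
      # the upstream gpu-operator chart layout and are an assumption in this sketch.
      helm upgrade gpu-operator addon/gpu-operator -n gpu-operator \
        --version 23.6.10 --reuse-values \
        --set migManager.enabled=true \
        --set mig.strategy=single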

Node-Feature-Discovery Parameters

Node-Feature-Discovery.enableNodeFeatureAPI : Enable or disable the Node Feature API.

  • When set to true , the Node Feature API is enabled.
  • When set to false or not set, the Node Feature API is disabled.
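
When the Node Feature API is enabled, node-feature-discovery publishes the detected node features as NodeFeature objects, which offers a quick way to check that the setting took effect:

      # With enableNodeFeatureAPI set to true, each node should have a corresponding
      # NodeFeature object in the cluster.
      kubectl get nodefeatures -A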

Next steps

After completing the parameter configurations and creations mentioned above:

  1. If you are using Full GPU mode , follow the instructions in Using GPU Resources in Application Creation.

  2. If you are using vGPU mode , proceed to vGPU Addon Installation.

  3. If you are using MIG mode and need to allocate specific GPU nodes according to a certain partitioning specification, assign the following label to the corresponding node:

    • For single mode, assign the label as follows:

      kubectl label nodes {node} nvidia.com/mig.config="all-1g.10gb" --overwrite
      
    • For mixed mode, assign the label as follows:

      kubectl label nodes {node} nvidia.com/mig.config="custom-config" --overwrite
      

    After partitioning, applications can use MIG GPU Resources.
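
    After the label is applied and the MIG manager has finished reconfiguring the card, you can confirm that the node advertises the expected resources; the resource names depend on the chosen strategy and profile:

      # List the NVIDIA resources reported by the node, for example nvidia.com/gpu under the
      # single strategy or nvidia.com/mig-1g.10gb under the mixed strategy.
      kubectl describe node <node-name> | grep -i "nvidia.com/"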
