Skip to content

Deploy AI Lab Components

Starting from version 0.17.0 of the DCE 5.0 installer, the AI Lab module can be installed concurrently during the installation of the commercial version, without the need for separate installation ; please contact the delivery support team to obtain the commercial installation package.

Global Service Cluster

AI Lab needs to be installed only in the Global Service Cluster.

Go to the Global Service Cluster, then navigate to Helm Apps -> Helm Charts to find baize and run the installation steps.

Important Notes

  • The namespace is baize-system
  • After replacing the environment address, open <YOUR_DCE_HOST>/kpanda/clusters/kpanda-global-cluster/helm/charts/addon/baize
  • kpanda-global-cluster is the name of the Global Service Cluster.

baize-agent Worker Cluster

Warning

If the cluster cannot be selected in AI Lab or indicates that baize-agent is missing, it means that the components for this worker cluster have not been successfully deployed.

In each worker cluster with computing resources, the corresponding basic components need to be deployed. The main components include:

  • gpu-operator initializes the GPU resources in the cluster. This part may vary depending on the installation method of the GPU resource type. For details, refer to GPU Management.
  • insight-agent is an observability component used to collect infrastructure information from the cluster, including logs, metrics, and events.
  • baize-agent contains the core components of AI Lab, including scheduling, monitoring, and computing components like Pytorch and TensorFlow.
  • [Optional] nfs storage service for dataset preloading.

Danger

The above components must be installed; otherwise, functionality may not work properly.

Install baize-agent on UI

baize-agent needs to be deployed in the worker cluster.

Follow the prompts below to enter the worker cluster, then navigate to Helm Apps -> Helm Charts to find baize-agent and execute the installation steps.

Important Notes

  • The namespace is baize-system
  • After replacing the environment address, open <YOUR_DCE_HOST>/kpanda/clusters/<cluster_name>/helm/charts/addon/baize
  • cluster_name is the name of the corresponding worker cluster.

Install baize-agent

Here is a YAML Example:

cluster-controller:
  image:
    registry: ''
    repository: baize/baize-cluster-controller
    tag: v0.4.1
global:
  cluster:
    schedulers: []
  config:
    cluster_name: ''
    dataset_job_spec: {}
    inference_config:
      triton_image: m.daocloud.io/nvcr.io/nvidia/tritonserver:24.01-py3
      triton_images_map:
        VLLM: m.daocloud.io/nvcr.io/nvidia/tritonserver:24.01-vllm-python-py3
  debug: false
  high_available: false
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  imageRegistry: release.daocloud.io
  prod: baize-agent
  resources: {}
kubeRbacProxy:
  image:
    registry: ''
    repository: baize/kube-rbac-proxy
    tag: v0.8.0
kueue:
  enablePlainPod: false
  fullnameOverride: kueue
  image:
    registry: ''
    repository: baize/kueue
    tag: v0.6.2
loader:
  image:
    registry: ''
    repository: baize/baize-data-loader
    tag: v0.4.1
notebook:
  image:
    registry: ''
    repository: baize/baize-notebook
    tag: v0.4.1
notebook-controller:
  image:
    registry: ''
    repository: baize/notebook-controller
    tag: v1.8.0
priority:
  high:
    value: 100000
  low:
    value: 1000
  medium:
    value: 10000
training-operator:
  image:
    registry: ''
    repository: baize/training-operator
    tag: v1-5525468

Install baize-agent with helm

Ensure that the AI Lab components are already installed in the Global Service Cluster. You can check in the GUI if the AI Lab module is available.

Info

You need to have an AI Lab entry in the main navigation bar to ensure that the management components are deployed successfully.

# Baize is the development code name for the AI Lab component
helm repo add baize https://release.daocloud.io/chartrepo/baize
helm repo update baize
helm search repo baize # Get the latest version number
export VERSION=<version> # Make sure to use the current latest version
helm upgrade --install baize-agent baize/baize-agent \
    --create-namespace \
    -n baize-system \
    --set global.imageRegistry=release.daocloud.io \
    --version=${VERSION}

Once the above tasks are completed, the worker cluster will be successfully initialized, and you can perform task training and model development in the AI Lab module.

Components for Preloading

The dataset preloading capability provided in the AI Lab module relies on the storage service, and it is recommended to use the NFS service:

  • Deploy NFS Server

    • If an NFS already exists, you can skip this step.
    • If it does not exist, you can refer to the NFS Service Deployment in best practices.
  • Deploy nfs-driver-csi

  • Deploy StorageClass

Conclusion

After completing the above steps, you can fully experience the functionalities of AI Lab within the worker cluster. Enjoy your usage!

Comments