# GPU Alerting Rules
This document describes how to set up GPU-related alerting rules on the DCE 5.0 platform.
## Prerequisites
- GPU devices are properly installed on the cluster nodes.
- The gpu-operator component is correctly installed in the cluster.
- If vGPU is used, the Nvidia-vgpu component must also be installed in the cluster, and its ServiceMonitor must be enabled.
- The insight-agent component is properly installed in the cluster.
## Common GPU Metrics for Alerts
This section introduces commonly used GPU metrics for alerting, divided into two categories:
- Metrics at the GPU card level, mainly reflecting the operational status of a single GPU device.
- Metrics at the application level, mainly reflecting the Pod’s usage of the GPU.
### GPU Card Metrics
Metric Name | Unit | Description |
---|---|---|
DCGM_FI_DEV_GPU_UTIL | % | GPU utilization |
DCGM_FI_DEV_MEM_COPY_UTIL | % | Memory copy utilization |
DCGM_FI_DEV_ENC_UTIL | % | Encoder utilization |
DCGM_FI_DEV_DEC_UTIL | % | Decoder utilization |
DCGM_FI_DEV_FB_FREE | MB | Amount of free GPU memory |
DCGM_FI_DEV_FB_USED | MB | Amount of used GPU memory |
DCGM_FI_DEV_GPU_TEMP | °C | Current GPU temperature |
DCGM_FI_DEV_POWER_USAGE | W | Device power usage |
DCGM_FI_DEV_XID_ERRORS | - | The last XID error code that occurred within the time window. XID messages indicate GPU hardware, NVIDIA software, or application errors, including the error type, location, and code. More details on XID info |
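As a rough illustration, the card-level metrics above can be turned into threshold expressions such as the sketch below. The thresholds are example values only, and the exact label set on each series depends on the DCGM exporter and gpu-operator versions in your cluster.

```promql
# GPU temperature above 85 °C on any card (85 is an example threshold)
DCGM_FI_DEV_GPU_TEMP > 85

# Less than 1024 MB of free GPU memory on any card
DCGM_FI_DEV_FB_FREE < 1024

# A non-zero XID error code was reported for a card
DCGM_FI_DEV_XID_ERRORS > 0
```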
### Application-Level Metrics
Metric Name | Unit | Description |
---|---|---|
kpanda_gpu_pod_utilization | % | GPU utilization by the Pod |
kpanda_gpu_mem_pod_usage | MB | GPU memory usage by the Pod |
kpanda_gpu_mem_pod_utilization | % | GPU memory utilization by the Pod |
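These Pod-level series can likewise be filtered and compared against thresholds. A minimal sketch, assuming the metrics carry the usual namespace and pod labels (verify the labels actually exposed by insight-agent in your cluster; the namespace value is a placeholder):

```promql
# Pods in the example namespace using more than 90% of their allocated GPU memory
kpanda_gpu_mem_pod_utilization{namespace="demo-namespace"} > 90
```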
## Setting Alert Rules
The following steps show how to set a GPU alert rule, using GPU utilization as an example. Choose metrics and write PromQL queries that match your actual business scenario.
Goal: Trigger an alert if the GPU utilization stays above 80% continuously for 5 seconds.
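Expressed as a PromQL condition, this goal is roughly the simple threshold below; the "continuously for 5 seconds" part is normally configured as the rule's duration in the alert policy rather than written into the query.

```promql
# Card-level GPU utilization above 80%; pair this with a 5-second duration in the alert policy
DCGM_FI_DEV_GPU_UTIL > 80
```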
1. On the Observability page, click Alerting -> Alert Policies -> Create Alert Policy.
2. Fill in the basic information.
3. Add the alert rule.
4. Choose the notification method.

After setup, if a GPU maintains utilization above 80% for 5 seconds, you will receive an alert message.
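To double-check that the rule actually fires, you can query Prometheus's built-in ALERTS series; the alertname value below is a placeholder for whatever name you gave the rule.

```promql
# Pending or firing instances of the example rule (alertname is a placeholder)
ALERTS{alertname="gpu-utilization-above-80"}
```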