Application Overview | Pod GPU Utilization | Utilization of the GPU cards currently used by the Pod | DCGM_FI_DEV_GPU_UTIL{cluster="$cluster",exported_namespace="$namespace",exported_pod="$pod"} | Line chart |
Application Overview | Pod GPU Utilization (Only MIG Enabled) | Utilization of the GPU cards currently used by the Pod when MIG is enabled | DCGM_FI_PROF_GR_ENGINE_ACTIVE{cluster="$cluster",exported_namespace="$namespace",exported_pod="$pod"} * 100 | Line chart |
Application Overview | Pod GPU Utilization (vGPU) | Utilization of the GPU cards currently used by the Pod when vGPU is enabled | vGPUCorePercentage{cluster="$cluster",exported_namespace="$namespace",podname="$pod"} | Line chart |
Application Overview | Pod GPU Memory Utilization | The ratio of GPU memory currently used by the Pod | DCGM_FI_DEV_FB_USED{cluster="$cluster",exported_namespace="$namespace",exported_pod="$pod"} | Line chart |
Application Overview | Pod GPU Memory Utilization (vGPU) | The ratio of GPU memory currently used by the Pod in vGPU mode | vGPUMemoryPercentage{cluster="$cluster",exported_namespace="$namespace",podname="$pod"} | Line chart |
Application Overview | Pod Memory Usage | The memory usage of GPU cards currently used by the Pod | DCGM_FI_DEV_FB_USED{cluster="$cluster",exported_namespace="$namespace",exported_pod="$pod"} | Line chart |
Application Overview | Pod Memory Usage (vGPU) | The memory usage of GPU cards currently used by the Pod in vGPU mode | sum(GPUDeviceMemoryLimit{cluster="$cluster"}) * vGPUMemoryPercentage{cluster="$cluster",exported_namespace="$namespace",podname="$pod"} | Line chart |
Application Overview | Pod GPU Memory Copy Utilization | Memory copy utilization of the GPU cards currently used by the Pod | DCGM_FI_DEV_MEM_COPY_UTIL{cluster="$cluster",exported_namespace="$namespace",exported_pod="$pod"} | Line chart |
Application Overview | Pod Decode Utilization | Decode engine utilization of the GPU cards currently used by the Pod | DCGM_FI_DEV_DEC_UTIL{cluster="$cluster",exported_namespace="$namespace",exported_pod="$pod"} | - |
Application Overview | Pod Encode Utilization | Encode engine utilization of the GPU cards currently used by the Pod | DCGM_FI_DEV_ENC_UTIL{cluster="$cluster",exported_namespace="$namespace",exported_pod="$pod"} | - |
GPU Card - Compute & Memory | GPU Utilization Details | Usage details (max, avg, current) of the GPU cards associated with the Pod over the last 24 hours | DCGM_FI_DEV_GPU_UTIL{cluster="$cluster", UUID="$…",GPU_I_ID=~"${gpu_i_id}"} | Line chart |
GPU Card - Compute & Memory | GPU Utilization Details (Only MIG Enabled) | Usage details (max, avg, current) of the GPU cards associated with the Pod over the last 24 hours when MIG is enabled | DCGM_FI_PROF_GR_ENGINE_ACTIVE{cluster="$cluster", UUID="$…",GPU_I_ID=~"${gpu_i_id}"} * 100 | Line chart |
GPU Card - Compute & Memory | GPU Memory Used Details | Memory usage details (min, max, avg, current) of the GPU cards associated with the Pod over the last 24 hours | DCGM_FI_DEV_FB_USED{cluster="$cluster", UUID="$…",GPU_I_ID=~"${gpu_i_id}"} | Time-based line chart |
GPU Card - Compute & Memory | GPU Memory Copy Utilization | Memory copy utilization of the GPU cards associated with the Pod | DCGM_FI_DEV_MEM_COPY_UTIL{cluster="$cluster", UUID="$…",GPU_I_ID=~"${gpu_i_id}"} | Line chart |
GPU Card - Engine Overview | GPU Graphics Engine Active | The ratio of time the Graphics or Compute engine is active within a monitoring period | DCGM_FI_PROF_GR_ENGINE_ACTIVE{cluster="$cluster", UUID="$…",GPU_I_ID=~"${gpu_i_id}"} * 100 | Line chart |
GPU Card - Engine Overview | GPU DRAM Active | Memory bandwidth utilization | DCGM_FI_PROF_DRAM_ACTIVE{cluster="$cluster", UUID="$…",GPU_I_ID=~"${gpu_i_id}"} * 100 | Line chart |
GPU Card - Engine Overview | GPU Tensor Core Engine Active | The ratio of time the Tensor Core pipeline is active within a monitoring period | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{cluster="$cluster", UUID="$…",GPU_I_ID=~"${gpu_i_id}"} * 100 | Line chart |
GPU Card - Engine Overview | GPU FP16 Engine Active | The ratio of time the FP16 pipeline is active within a monitoring period | DCGM_FI_PROF_PIPE_FP16_ACTIVE{cluster="$cluster", UUID="$…",GPU_I_ID=~"${gpu_i_id}"} * 100 | Line chart |
GPU Card - Engine Overview | GPU FP32 Engine Active | The ratio of time the FP32 pipeline is active within a monitoring period | DCGM_FI_PROF_PIPE_FP32_ACTIVE{cluster="$cluster", UUID="$…",GPU_I_ID=~"${gpu_i_id}"} * 100 | Line chart |
GPU Card - Engine Overview | GPU FP64 Engine Active | The ratio of time the FP64 pipeline is active within a monitoring period | DCGM_FI_PROF_PIPE_FP64_ACTIVE{cluster="$cluster", UUID="$…",GPU_I_ID=~"${gpu_i_id}"} * 100 | Line chart |
GPU Card - Engine Overview | GPU Decode Utilization | Utilization of the GPU decode engine | DCGM_FI_DEV_DEC_UTIL{cluster="$cluster", UUID="$…",GPU_I_ID=~"${gpu_i_id}"} | Line chart |
GPU Card - Engine Overview | GPU Encode Utilization | Utilization of the GPU encode engine | DCGM_FI_DEV_ENC_UTIL{cluster="$cluster", UUID="$…",GPU_I_ID=~"${gpu_i_id}"} | Line chart |
GPU Card - Temperature & Power | GPU Temperature | Temperature of all GPU cards in the cluster | DCGM_FI_DEV_GPU_TEMP{cluster="$cluster", UUID="$…",GPU_I_ID=~"${gpu_i_id}"} | Bar chart |
GPU Card - Temperature & Power | GPU Power Usage | Power usage of all GPU cards in the cluster | DCGM_FI_DEV_POWER_USAGE{cluster="$cluster", UUID="$…",GPU_I_ID=~"${gpu_i_id}"} | Bar chart |
GPU Card - Temperature & Power | GPU Total Energy Consumption | Total energy consumption of GPU cards | sum(DCGM_FI_DEV_POWER_USAGE{cluster="$cluster", UUID="$…",GPU_I_ID=~"${gpu_i_id}"}) | Line chart |
GPU Card - Temperature & Power | GPU Memory Clock | Memory clock frequency | DCGM_FI_DEV_MEM_CLOCK{cluster="$cluster",UUID="$…",GPU_I_ID=~"${gpu_i_id}"} * 1000 * 1000 | Line chart |
GPU Card - Temperature & Power | GPU APP SM Clock | Application SM clock frequency | DCGM_FI_DEV_APP_SM_CLOCK{cluster="$cluster",UUID="$…",GPU_I_ID=~"${gpu_i_id}"} * 1000 * 1000 | Line chart |
GPU Card - Temperature & Power | GPU APP Memory Clock | Application memory clock frequency of the GPU card | DCGM_FI_DEV_APP_MEM_CLOCK{cluster="$cluster",UUID="$…",GPU_I_ID=~"${gpu_i_id}"} * 1000 * 1000 | - |
GPU Card - Temperature & Power | GPU Video Clock | Video engine clock frequency of the GPU card | DCGM_FI_DEV_VIDEO_CLOCK{cluster="$cluster",UUID="$…",GPU_I_ID=~"${gpu_i_id}"} * 1000 * 1000 | - |
GPU Card - Temperature & Power | GPU Clock Throttle Reasons | Reasons the GPU card clock is being throttled | DCGM_FI_DEV_CLOCK_THROTTLE_REASONS{cluster="$cluster",UUID="$…",GPU_I_ID=~"${gpu_i_id}"} | - |
GPU Card - Other Details | PCIE TX BYTES | Data transmit rate of the node GPU card over the PCIe bus | rate(DCGM_FI_PROF_PCIE_TX_BYTES{cluster="$cluster",UUID="$…",GPU_I_ID=~"${gpu_i_id}"}[1m]) | - |
GPU Card - Other Details | PCIE RX BYTES | Data receive rate of the node GPU card over the PCIe bus | rate(DCGM_FI_PROF_PCIE_RX_BYTES{cluster="$cluster",UUID="$…",GPU_I_ID=~"${gpu_i_id}"}[1m]) | - |
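
The queries in this table are templates: `$cluster`, `$namespace`, `$pod`, and `${gpu_i_id}` are dashboard variables that are filled in before the query is sent to the metrics backend. As a rough illustration of that substitution, the sketch below expands the Pod GPU Utilization query with Python's `string.Template`, which happens to use the same `$name` / `${name}` placeholder syntax. The `render` helper and the sample values (`demo-cluster`, `default`, `trainer-0`) are hypothetical, and this is not how the dashboard itself performs interpolation.

```python
from string import Template

# Query template for the "Pod GPU Utilization" panel.
# $cluster, $namespace and $pod are the dashboard variables.
POD_GPU_UTIL = Template(
    'DCGM_FI_DEV_GPU_UTIL{cluster="$cluster",'
    'exported_namespace="$namespace",exported_pod="$pod"}'
)


def render(template: Template, **variables: str) -> str:
    """Hypothetical helper: fill in every variable of a query template.

    Template.substitute raises KeyError when a variable is missing, so an
    incomplete query is never produced silently.
    """
    return template.substitute(**variables)


# Sample values are made up for illustration only.
query = render(
    POD_GPU_UTIL,
    cluster="demo-cluster",
    namespace="default",
    pod="trainer-0",
)
print(query)
```

Rendering the template by hand like this can be useful for pasting a concrete query into a PromQL console when a panel shows no data and you want to check the label values directly.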