NVIDIA DCGM dashboard metrics

Written by

Updated at April 8, 2026

General info
Dashboard variables
Panels and metrics
Monitoring best practices
Extra resources

Grafana's NVIDIA DCGM dashboard enables comprehensive monitoring of NVIDIA GPUs in a Stackland cluster. The dashboard presents the metrics of temperature, power consumption, GPU utilization, memory usage, and other critical GPU performance parameters.

To open a dashboard:

Open Grafana.
Select NVIDIA DCGM official in the dashboard list.

The GPU monitoring dashboard will open.

General info

Dashboard title: NVIDIA DCGM official.
UID: gpu-official.
Refresh interval: 30 seconds.
Data source: Prometheus.

Dashboard variables

The dashboard uses the following variables for data filtering:

DS_PROMETHEUS: Prometheus data source.
instance: Selects a specific DCGM Exporter instance (multi-select supported).
gpu: Selects a specific GPU ID or all GPUs (multi-select supported, all GPUs selected by default).

Panels and metrics

1. GPU Temperature

Description: Temperature of each GPU in degrees Celsius. The chart displays temperature for each GPU, calculating the average, last, and maximum values.

Metric: DCGM_FI_DEV_GPU_TEMP.

Unit of measurement: Degrees Celsius.

Thresholds:

Green: Normal temperature.
Red: Temperature above 80°C.

2. GPU Avg. Temp

Description: Average temperature of all GPUs. A gauge panel displays the current average temperature value with color indication.

Metric: avg(DCGM_FI_DEV_GPU_TEMP).

Unit of measurement: Degrees Celsius.

Thresholds:

Green: Temperature below 83°C.
Yellow: Temperature from 83°C to 87°C.
Red: Temperature above 87°C.

Recommendations: If the temperature exceeds 83°C, check the cooling system and GPU load. Temperatures above 87°C may lead to throttling and reduced performance.

3. GPU Power Usage

Description: Power consumption of each GPU in watts. The chart shows current power draw per GPU, calculating the average, last, and maximum values.

Metric: DCGM_FI_DEV_POWER_USAGE.

Unit of measurement: Watts.

4. GPU Power Total

Description: Total power consumption of all GPUs. A gauge panel displays the total power consumption of all GPUs.

Metric: sum(DCGM_FI_DEV_POWER_USAGE).

Unit of measurement: Watts.

Thresholds:

Green: Below 1,800 W.
Yellow: From 1800 W to 2200 W.
Red: Above 2200 W.

Recommendations: Monitor the total power consumption not to exceed the power supply unit's limits or the capacity allocated for the rack.

5. GPU SM Clocks

Description: Clock frequency of the GPU streaming multiprocessors (SMs), in Hz. The metric is multiplied by 1,000,000 to convert MHz to Hz.

Metric: DCGM_FI_DEV_SM_CLOCK * 1000000.

Unit of measurement: Hertz.

Description: An SM (streaming multiprocessor) is the NVIDIA GPU architecture's primary compute unit responsible for parallel computing. SM frequency determines the speed of computing operations.

6. GPU Utilization

Description: GPU utilization percentage. Shows the currently used portion of the the GPU's computing resources.

Metric: DCGM_FI_DEV_GPU_UTIL.

Unit of measurement: Percentage (0-100).

Recommendations: Low GPU utilization despite active tasks may indicate data transfer bottlenecks or suboptimal code. High utilization (close to 100%) is normal for computing tasks.

7. GPU Framebuffer Mem Used

Description: The amount of framebuffer memory used per GPU. Framebuffer is the GPU memory area used to store data, textures, and intermediate computation results.

Metric: DCGM_FI_DEV_FB_USED.

Unit of measurement: Megabytes.

Recommendations: Monitor the GPU memory usage. Exceeding the available memory will cause task execution errors or reduced performance due to swapping.

8. Tensor Core Utilization

Description: GPU tensor core utilization Tensor cores are specialized compute units for matrix operations, commonly used in machine learning and AI tasks.

Metric: DCGM_FI_PROF_PIPE_TENSOR_ACTIVE.

Unit of measurement: Fraction from 0 to 1 (0% to 100%).

Description: High tensor core utilization is typical for deep learning tasks that employ matrix operations (e.g., neural network training). Low utilization during ML tasks may indicate suboptimal GPU usage.

Monitoring best practices

Critical metrics

The following metrics require immediate attention when they deviate from normal values:

GPU Avg. Temp must be less than 83°C; exceeding 87°C is critical.
GPU Power Total: Monitor total energy consumption, make sure it does not exceed 2200 W.
GPU Framebuffer Mem Used: Keep track of how much memory is used to avoid out-of-memory errors.

Performance metrics

To evaluate GPU utilization efficiency, pay attention to:

GPU Utilization: Should be high during computing tasks.
Tensor Core Utilization: Should be substantial for ML tasks.
GPU SM Clocks: SM clock frequency must be aligned with the GPU specifications.

Temperature metrics and power consumption metrics

For thermal and power monitoring:

GPU Temperature: The temperature of each GPU must be within the normal range.
GPU Power Usage: Power consumption must correspond to workload.
GPU Avg. Temp: Average temperature must not exceed the safe thresholds.

NVIDIA DCGM dashboard metrics

General info

Dashboard variables

Panels and metrics

1. GPU Temperature

2. GPU Avg. Temp

3. GPU Power Usage

4. GPU Power Total

5. GPU SM Clocks

6. GPU Utilization

7. GPU Framebuffer Mem Used

8. Tensor Core Utilization

Monitoring best practices

Critical metrics

Performance metrics

Temperature metrics and power consumption metrics

Memory metrics

Extra resources

Was the article helpful?

NVIDIA DCGM dashboard metrics

General infoGeneral info

Dashboard variablesDashboard variables

Panels and metricsPanels and metrics

1. GPU Temperature1. GPU Temperature

2. GPU Avg. Temp2. GPU Avg. Temp

3. GPU Power Usage3. GPU Power Usage

4. GPU Power Total4. GPU Power Total

5. GPU SM Clocks5. GPU SM Clocks

6. GPU Utilization6. GPU Utilization

7. GPU Framebuffer Mem Used7. GPU Framebuffer Mem Used

8. Tensor Core Utilization8. Tensor Core Utilization

Monitoring best practicesMonitoring best practices

Critical metricsCritical metrics

Performance metricsPerformance metrics

Temperature metrics and power consumption metricsTemperature metrics and power consumption metrics

Memory metricsMemory metrics

Extra resourcesExtra resources

Was the article helpful?

General info

Dashboard variables

Panels and metrics

1. GPU Temperature

2. GPU Avg. Temp

3. GPU Power Usage

4. GPU Power Total

5. GPU SM Clocks

6. GPU Utilization

7. GPU Framebuffer Mem Used

8. Tensor Core Utilization

Monitoring best practices

Critical metrics

Performance metrics

Temperature metrics and power consumption metrics

Memory metrics

Extra resources