© 2026 Direct Cursus Technology L.L.C.
NVIDIA DCGM dashboard metrics

Written by
Yandex Cloud
Updated on April 8, 2026
  • General info
  • Dashboard variables
  • Panels and metrics
    • 1. GPU Temperature
    • 2. GPU Avg. Temp
    • 3. GPU Power Usage
    • 4. GPU Power Total
    • 5. GPU SM Clocks
    • 6. GPU Utilization
    • 7. GPU Framebuffer Mem Used
    • 8. Tensor Core Utilization
  • Monitoring best practices
    • Critical metrics
    • Performance metrics
    • Temperature metrics and power consumption metrics
    • Memory metrics
  • Extra resources

The NVIDIA DCGM dashboard in Grafana lets you monitor NVIDIA GPUs in a Stackland cluster. It shows temperature, power consumption, GPU utilization, memory usage, and other key GPU performance metrics.

To open the dashboard:

  1. Open Grafana.
  2. Select NVIDIA DCGM official in the dashboard list.

The GPU monitoring dashboard will open.

General info

Dashboard title: NVIDIA DCGM official.
UID: gpu-official.
Refresh interval: 30 seconds.
Data source: Prometheus.

Dashboard variables

The dashboard uses the following variables for data filtering:

  • DS_PROMETHEUS: Prometheus data source.
  • instance: Selects a specific DCGM Exporter instance (multi-select supported).
  • gpu: Selects a specific GPU ID or all GPUs (multi-select supported, all GPUs selected by default).
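To show how these variables narrow down a query, here is a minimal Python sketch that builds a PromQL selector for one of the dashboard metrics. It assumes the time series carry `instance` and `gpu` labels matching the variable names; the function name `build_selector` is hypothetical and not part of the dashboard or of any library.

```python
def build_selector(metric, instances=None, gpus=None):
    """Build a PromQL selector filtered by instance and GPU ID.

    Assumption: the series carry `instance` and `gpu` labels.
    Values are joined with "|" into a regex matcher, so they are
    expected to contain no regex metacharacters that need escaping.
    """
    matchers = []
    if instances:
        matchers.append('instance=~"{}"'.format("|".join(instances)))
    if gpus:
        matchers.append('gpu=~"{}"'.format("|".join(gpus)))
    if not matchers:
        # No filter: equivalent to selecting all instances and all GPUs.
        return metric
    return "{}{{{}}}".format(metric, ", ".join(matchers))


print(build_selector("DCGM_FI_DEV_GPU_TEMP", ["node-1:9400"], ["0", "1"]))
```

Passing no filters returns the bare metric name, which corresponds to the default state where all GPUs are selected.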

Panels and metrics

1. GPU Temperature

Description: Temperature of each GPU in degrees Celsius. The chart displays temperature for each GPU, calculating the average, last, and maximum values.

Metric: DCGM_FI_DEV_GPU_TEMP.

Unit of measurement: Degrees Celsius.

Thresholds:

  • Green: Normal temperature.
  • Red: Temperature above 80°C.

2. GPU Avg. Temp

Description: Average temperature of all GPUs. A gauge panel displays the current average temperature value with color indication.

Metric: avg(DCGM_FI_DEV_GPU_TEMP).

Unit of measurement: Degrees Celsius.

Thresholds:

  • Green: Temperature below 83°C.
  • Yellow: Temperature from 83°C to 87°C.
  • Red: Temperature above 87°C.

Recommendations: If the temperature exceeds 83°C, check the cooling system and GPU load. Temperatures above 87°C may lead to throttling and reduced performance.
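The threshold logic of this panel can be sketched in Python as follows. This is an illustration only: the function name `avg_temp_status` is hypothetical, and the treatment of the exact boundary values (83°C and 87°C count as yellow) is an assumption based on the threshold list above.

```python
def avg_temp_status(temps_c):
    """Return (average temperature, panel color) for per-GPU readings in °C."""
    if not temps_c:
        raise ValueError("at least one temperature reading is required")
    avg = sum(temps_c) / len(temps_c)
    if avg > 87:
        return avg, "red"      # above 87°C: throttling is likely
    if avg >= 83:
        return avg, "yellow"   # 83°C to 87°C: check cooling and load
    return avg, "green"        # below 83°C


print(avg_temp_status([80, 82]))
print(avg_temp_status([90, 86]))
```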

3. GPU Power Usage

Description: Power consumption of each GPU in watts. The chart shows current power draw per GPU, calculating the average, last, and maximum values.

Metric: DCGM_FI_DEV_POWER_USAGE.

Unit of measurement: Watts.

4. GPU Power Total

Description: Total power consumption of all GPUs, displayed on a gauge panel.

Metric: sum(DCGM_FI_DEV_POWER_USAGE).

Unit of measurement: Watts.

Thresholds:

  • Green: Below 1,800 W.
  • Yellow: From 1,800 W to 2,200 W.
  • Red: Above 2,200 W.

Recommendations: Monitor total power consumption to make sure it does not exceed the power supply unit's limits or the power capacity allocated to the rack.
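As a rough illustration of this check, the sketch below sums per-GPU power readings, maps the total to the panel colors, and reports the remaining headroom against a power budget. The function name `power_total_status` and the `budget_w` parameter are hypothetical; the default budget of 2,200 W simply reuses the panel's red threshold and should be replaced with the actual PSU or rack limit.

```python
def power_total_status(watts_per_gpu, budget_w=2200.0):
    """Return (total watts, panel color, headroom in watts)."""
    total_w = sum(watts_per_gpu)
    if total_w > 2200:
        color = "red"
    elif total_w >= 1800:
        color = "yellow"
    else:
        color = "green"
    # Negative headroom means the budget is already exceeded.
    return total_w, color, budget_w - total_w


total, color, headroom = power_total_status([310.0, 295.5, 402.0, 388.5])
print(total, color, headroom)
```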

5. GPU SM Clocks

Description: Clock frequency of the GPU streaming multiprocessors (SMs), in Hz. The metric is multiplied by 1,000,000 to convert MHz to Hz.

Metric: DCGM_FI_DEV_SM_CLOCK * 1000000.

Unit of measurement: Hertz.

Note: A streaming multiprocessor (SM) is the primary compute unit of the NVIDIA GPU architecture and is responsible for parallel computing. The SM clock frequency determines how fast compute operations run.
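The unit conversion used by this panel is plain arithmetic. A minimal sketch, with a hypothetical function name and an illustrative input value:

```python
MHZ_TO_HZ = 1_000_000  # same multiplier as in the panel's metric expression


def sm_clock_hz(sm_clock_mhz):
    """Convert an SM clock reading from MHz to Hz."""
    return sm_clock_mhz * MHZ_TO_HZ


# Illustrative value only: a reading of 1410 MHz becomes 1,410,000,000 Hz.
print(sm_clock_hz(1410))
```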

6. GPU Utilization

Description: GPU utilization percentage. Shows the portion of the GPU's computing resources currently in use.

Metric: DCGM_FI_DEV_GPU_UTIL.

Unit of measurement: Percentage (0-100).

Recommendations: Low GPU utilization despite active tasks may indicate data transfer bottlenecks or suboptimal code. High utilization (close to 100%) is normal for computing tasks.

7. GPU Framebuffer Mem Used

Description: The amount of framebuffer memory used per GPU. Framebuffer is the GPU memory area used to store data, textures, and intermediate computation results.

Metric: DCGM_FI_DEV_FB_USED.

Unit of measurement: Megabytes.

Recommendations: Monitor the GPU memory usage. Exceeding the available memory will cause task execution errors or reduced performance due to swapping.

8. Tensor Core Utilization

Description: GPU tensor core utilization. Tensor cores are specialized compute units for matrix operations, commonly used in machine learning and AI tasks.

Metric: DCGM_FI_PROF_PIPE_TENSOR_ACTIVE.

Unit of measurement: Fraction from 0 to 1 (0% to 100%).

Recommendations: High tensor core utilization is typical for deep learning tasks that rely on matrix operations (e.g., neural network training). Low utilization during ML tasks may indicate suboptimal GPU usage.

Monitoring best practices

Critical metrics

The following metrics require immediate attention when they deviate from normal values:

  1. GPU Avg. Temp: The average temperature should stay below 83°C; exceeding 87°C is critical.
  2. GPU Power Total: Monitor total power consumption and make sure it does not exceed 2,200 W.
  3. GPU Framebuffer Mem Used: Keep track of how much memory is used to avoid out-of-memory errors.

Performance metrics

To evaluate GPU utilization efficiency, pay attention to:

  1. GPU Utilization: Should be high during computing tasks.
  2. Tensor Core Utilization: Should be substantial for ML tasks.
  3. GPU SM Clocks: The SM clock frequency should match the GPU specifications.

Temperature metrics and power consumption metrics

For thermal and power monitoring:

  1. GPU Temperature: The temperature of each GPU must be within the normal range.
  2. GPU Power Usage: Power consumption should be consistent with the workload.
  3. GPU Avg. Temp: Average temperature must not exceed the safe thresholds.

Memory metrics

For video memory usage monitoring:

  1. GPU Framebuffer Mem Used: Monitor memory used by each GPU.
  2. Compare the used memory against the total available GPU memory.
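The comparison in step 2 can be sketched as below. The dashboard metric reports only used memory, so the total per-GPU memory is assumed to come from the GPU specification or another source; the function name `fb_usage` and the 90% warning level are hypothetical and chosen for illustration.

```python
def fb_usage(used_mb, total_mb, warn_fraction=0.9):
    """Return (fraction of framebuffer memory in use, warning flag).

    `total_mb` is an assumption: it is not provided by the
    DCGM_FI_DEV_FB_USED metric and must be supplied by the caller.
    """
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    fraction = used_mb / total_mb
    return fraction, fraction >= warn_fraction


print(fb_usage(20480, 40960))
print(fb_usage(38000, 40000))
```

A warning flag of True indicates that the GPU is close to running out of memory and the task may fail with an out-of-memory error.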

Extra resources

  • NVIDIA DCGM Documentation
  • DCGM Exporter on GitHub
  • NVIDIA GPU Architecture
  • Tensor Cores Documentation
