GPU Metrics

When your reservation is active on an AMD Accelerator Cloud (AAC) cluster, you will receive an email with Grafana credentials to access your personal GPU metrics dashboard.

| Field | Value |
| --- | --- |
| URL | https://aac.amd.com:6443/dashboards |
| Username | Provided in your credential email |
| Password | Provided in your credential email |

Your credentials are unique to you. Do not share them with others.


Logging in

  1. Open a browser and go to https://aac.amd.com:6443/dashboards
  2. Enter the username and password from your credential email
  3. You will land on your personal dashboard folder — only your dashboards are visible to you

If you have trouble logging in:

  • Check your spam/junk folder for the credential email
  • Contact the AAC team at dl.dcgpu.aacteam@amd.com
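
If the browser login works but you want to confirm from a script which dashboards your account can see, Grafana's HTTP search API is one option. The sketch below is a minimal illustration, not an AAC-supported workflow: it assumes the Grafana root is `https://aac.amd.com:6443` (the host part of the dashboard URL above) and that basic authentication against the HTTP API is enabled on this instance, neither of which this page confirms. The function names are hypothetical.

```python
import base64
import json
import urllib.request

# Assumption: the Grafana root is the host part of the dashboards URL.
BASE_URL = "https://aac.amd.com:6443"


def build_search_request(username, password, base_url=BASE_URL):
    """Build (but do not send) a request to Grafana's dashboard search API."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    req = urllib.request.Request(f"{base_url}/api/search?type=dash-db")
    # Assumption: basic auth is accepted by this Grafana instance.
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Accept", "application/json")
    return req


def list_dashboard_titles(username, password, base_url=BASE_URL):
    """Send the search request and return the titles of visible dashboards."""
    req = build_search_request(username, password, base_url)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return [item["title"] for item in json.load(resp)]
```

Read the username and password from environment variables or a secrets store rather than hard-coding them, since the credentials are personal.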


Your dashboards

You will see five dashboards, each scoped exclusively to your reserved nodes and jobs.


1. Reservation overview

A high-level summary of everything in your reservation at a glance.

| Panel | Description |
| --- | --- |
| Reservation Summary | Your reservation name, node count, and active status |
| Reserved Nodes | List of nodes assigned to your reservation |
| Total GPUs | Total number of GPUs across your reserved nodes |
| Current Avg GPU Utilization | Average GPU compute usage (%) across all your nodes right now |
| Total Power Usage | Combined power draw (W) across all reserved nodes |
| Active Jobs | Number of jobs currently running in your reservation |
| Energy Consumed | Cumulative energy used (kWh) during your reservation |
| GPU Performance | GPU compute core usage trend over time |
| GPU Utilization per Node | Per-node GPU utilization breakdown |
| GPU Memory Used % per Node | VRAM usage percentage per node |
| GPU Clock Speed per Node | Current GPU clock frequency per node |
| HBM Memory Temperature per Node | HBM/VRAM temperature per node |
| Avg GPU Temperature per Node | Average GPU die temperature per node |
| GPU Power per Node | Power draw per individual node |

When to use: Start here for a quick health check of your entire reservation.
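
The Total Power Usage (W) and Energy Consumed (kWh) panels are related by time integration: energy is power accumulated over the reservation. This page does not say how the dashboard computes its figure, so the following is only a sketch of the arithmetic, using trapezoidal integration over power samples. The function name and the sample format are hypothetical.

```python
def energy_kwh(samples):
    """Estimate energy in kWh from (timestamp_seconds, power_watts) samples.

    Samples must be sorted by time. Uses the trapezoidal rule, which is an
    assumption here and not necessarily what the dashboard does.
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # Average power over the interval times its duration gives joules.
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    # 1 kWh = 3.6e6 J
    return joules / 3.6e6
```

For instance, a steady 1000 W draw sustained for one hour works out to 1 kWh.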


2. GPU

Deep-dive metrics for an individual GPU within a compute node.

| Panel | Description |
| --- | --- |
| GPU Power Usage | Real-time power draw per GPU |
| GPU Usage | Compute utilization per GPU core |
| GPU Package Power | Total package power for the node |
| GPU Temperature | Die temperature per GPU |
| Memory Temperature | HBM/VRAM temperature per GPU |
| Temperature Sensors | All available temperature sensor readings |
| VRAM Usage | Video memory usage per GPU |
| Used VRAM | Absolute VRAM consumed (GB) |
| GPU Compute Core Usage | Core-level utilization breakdown |
| PCIe Bandwidth | PCIe read/write throughput |
| PCIe Max Speed / Counts | Link speed and lane count |
| Total / Total Uncorrectable ECC | ECC error counts (correctable and uncorrectable) |
| Energy Consumed | Per-node energy consumption over time |
| GPU Errors | Hardware error events |

When to use: Investigating a specific node — thermal issues, ECC errors, power spikes, or VRAM pressure.


3. Compute node

Node-level aggregated view across all GPUs on a single compute node.

| Panel | Description |
| --- | --- |
| GPUs / Allocated GPUs | Total vs allocated GPU count on this node |
| Jobs | Jobs currently running on this node |
| Average GPU Usage | Mean compute utilization across all GPUs on the node |
| Average GPU Power | Mean power draw across all GPUs |
| Average GPU / Memory Temperature | Mean temperatures across all GPUs |
| Used VRAM / Memory Usage | Aggregate memory consumption |
| Top 5 Jobs by GPU Usage | Which jobs are consuming the most GPU compute |
| Top 5 Used VRAM | Which jobs are consuming the most VRAM |
| Total Power Usage | Combined power for the whole node |
| Total ECC Counts | Aggregate ECC error counts |
| Total / Current PCIe Bandwidth | PCIe throughput (historical and current) |
| Average Sensor Temperatures | All sensor averages |
| All GPUs | Per-GPU utilization view for all GPUs on the node |
| Energy Consumed | Node-level energy usage over time |

When to use: Understanding how a single node is being used across multiple GPUs and jobs simultaneously.
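
To make the node-level aggregates concrete, the sketch below shows how a node average and a "top 5 jobs" ranking could be derived from per-GPU utilization samples. It is an illustration only: the record keys (`gpu`, `job`, `util_pct`) are hypothetical, and ranking jobs by their mean per-GPU utilization is an assumption, since this page does not state how the dashboard aggregates per job.

```python
from collections import defaultdict
from statistics import mean


def node_summary(gpu_samples, top_n=5):
    """Return (node average utilization, top jobs by mean GPU utilization).

    gpu_samples: list of dicts with hypothetical keys
      "gpu" (index), "job" (job ID or None if idle), "util_pct" (0-100).
    """
    node_avg = mean(s["util_pct"] for s in gpu_samples)

    # Group utilization readings by the job that owns each GPU.
    per_job = defaultdict(list)
    for s in gpu_samples:
        if s["job"] is not None:
            per_job[s["job"]].append(s["util_pct"])

    ranked = sorted(
        ((job, mean(values)) for job, values in per_job.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return node_avg, ranked[:top_n]
```

Note that idle GPUs lower the node average without appearing in the job ranking, which is one reason the two panels can tell different stories.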


4. Job

Resource consumption metrics for a specific Slurm job.

| Panel | Description |
| --- | --- |
| Name | Job name and ID |
| Allocated GPUs | GPUs assigned to this job |
| Compute Nodes | Which nodes this job is running on |
| Average GPU Usage | Mean GPU compute utilization for this job |
| Top 5 GPU Usage | Top GPU consumers within the job |
| Top 5 Used VRAM | Top VRAM consumers within the job |
| Average GPU / Memory Temperature | Thermal view for this job's GPUs |
| Used VRAM / Memory Usage | Memory consumption for this job |
| Total Power Usage | Power draw attributed to this job |
| Total ECC Counts | ECC errors occurring during this job |
| Total / Current PCIe Bandwidth | PCIe I/O for this job |
| Average Sensor Temperatures | Sensor averages for this job's GPUs |
| All GPUs | Per-GPU breakdown for all GPUs in this job |
| GPU Performance | Compute performance trend for this job |

When to use: Profiling a specific training run or workload — correlate GPU utilization, memory pressure, and thermal headroom for your job.


5. BMC log events

Hardware-level event logs from the Baseboard Management Controller (BMC) on your reserved nodes. These are low-level firmware and hardware events, independent of your workload.

| Panel | Description |
| --- | --- |
| BMC Events (last 5m) | Count of all BMC events in the last 5 minutes |
| Warning/Critical (last 5m) | Count of warning and critical severity events |
| Per-node logs (last 5m) | Event count broken down by node |
| Event rate by severity | Timeline of events grouped by severity level |
| Event rate by node | Timeline of events grouped by node |
| Top sensors (time range) | Most frequently reporting hardware sensors |
| Top entry codes (time range) | Most common BMC event codes |
| Warning/Critical events (last 1h) | Warning/critical events over the past hour |
| Recent BMC events (logs) | Raw log stream of recent BMC events |

When to use: Investigating hardware-level issues — unexpected reboots, thermal warnings, power faults, or sensor alerts on your nodes.
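
The windowed counters in this dashboard amount to filtering events by time and grouping them by severity or node. The sketch below reproduces that idea on a list of event records. The record keys (`time`, `node`, `severity`) and the severity labels `warning` and `critical` are assumptions made for illustration; the actual BMC log schema is not described on this page.

```python
from collections import Counter
from datetime import timedelta


def count_recent_events(events, now, window=timedelta(minutes=5)):
    """Summarize events whose timestamp falls within `window` before `now`.

    events: list of dicts with hypothetical keys
      "time" (datetime), "node" (str), "severity" (str).
    """
    cutoff = now - window
    recent = [e for e in events if cutoff <= e["time"] <= now]

    by_severity = Counter(e["severity"] for e in recent)
    by_node = Counter(e["node"] for e in recent)

    return {
        "total": len(recent),
        # Counter returns 0 for severities that did not occur.
        "warning_critical": by_severity["warning"] + by_severity["critical"],
        "by_node": dict(by_node),
    }
```

Passing `window=timedelta(hours=1)` gives the equivalent of the one-hour warning/critical view.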


Tips

  • Time range: Use the time picker (top right) to zoom into a specific training window
  • Node selection: Dashboards scope automatically to your reserved nodes — no manual filtering needed
  • Refresh: Set auto-refresh to 30s or 1m during active training runs
  • Dashboards reset on reservation change: Dashboards reflect only nodes active in your current reservation
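
The time range and refresh settings can also be carried in the dashboard URL, which is handy for bookmarking or sharing a link to a specific training window with yourself across sessions. Grafana accepts `from` and `to` as epoch milliseconds and `refresh` as an interval such as `30s`. The sketch below builds such a link. The base URL and the dashboard UID are assumptions: copy the real UID from your browser's address bar after opening a dashboard.

```python
from urllib.parse import urlencode


def dashboard_url(base_url, uid, start, end, refresh=None):
    """Build a Grafana dashboard link pinned to an absolute time window.

    base_url and uid are placeholders you supply; start and end are
    timezone-aware datetime objects.
    """
    params = {
        "from": int(start.timestamp() * 1000),  # epoch milliseconds
        "to": int(end.timestamp() * 1000),
    }
    if refresh:
        params["refresh"] = refresh  # e.g. "30s" or "1m"
    return f"{base_url}/d/{uid}?{urlencode(params)}"
```

Auto-refresh is mainly useful with a relative range such as the last 15 minutes; for a window that has already ended, leave `refresh` unset.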

Support

For access issues or questions about your metrics:

Email: dl.dcgpu.aacteam@amd.com