GPU Metrics

Once your reservation on an AMD Accelerator Cloud (AAC) cluster becomes active, you will receive an email with the Grafana credentials for your personal GPU metrics dashboards.

Field	Value
URL	https://aac.amd.com:6443/dashboards
Username	Provided in your credential email
Password	Provided in your credential email

Your credentials are unique to you. Do not share them with others.

Logging in

Open a browser and go to https://aac.amd.com:6443/dashboards
Enter the username and password from your credential email
You will land in your personal dashboard folder, where only your own dashboards are visible

If you have trouble logging in:

- Check your spam/junk folder for the credential email
- Contact the AAC team at dl.dcgpu.aacteam@amd.com

Your dashboards

You will see five dashboards, each scoped exclusively to your reserved nodes and jobs.

1. Reservation overview

A high-level summary of everything in your reservation at a glance.

Panel	Description
Reservation Summary	Your reservation name, node count, and active status
Reserved Nodes	List of nodes assigned to your reservation
Total GPUs	Total number of GPUs across your reserved nodes
Current Avg GPU Utilization	Average GPU compute usage (%) across all your nodes right now
Total Power Usage	Combined power draw (W) across all reserved nodes
Active Jobs	Number of jobs currently running in your reservation
Energy Consumed	Cumulative energy used (kWh) during your reservation
GPU Performance	GPU compute core usage trend over time
GPU Utilization per Node	Per-node GPU utilization breakdown
GPU Memory Used % per Node	VRAM usage percentage per node
GPU Clock Speed per Node	Current GPU clock frequency per node
HBM Memory Temperature per Node	HBM/VRAM temperature per node
Avg GPU Temperature per Node	Average GPU die temperature per node
GPU Power per Node	Power draw per individual node

When to use: Start here for a quick health check of your entire reservation.
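The Total Power Usage panel reports instantaneous watts, while Energy Consumed reports cumulative kWh. The relationship between the two can be sketched as a time integral of power. The following is a minimal sketch, not the dashboard's actual computation: the function name `energy_kwh` and the sample readings are hypothetical, and it assumes you have (time in seconds, power in watts) samples sorted by time.

```python
def energy_kwh(samples):
    """Integrate (time_seconds, power_watts) samples with the trapezoidal rule.

    `samples` must be sorted by time. Returns energy in kWh.
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # Average power over the interval times its duration: W * s = J
        joules += (p0 + p1) / 2.0 * (t1 - t0)
    return joules / 3.6e6  # 1 kWh = 3.6e6 J


# Hypothetical readings: a steady 6000 W draw for one hour,
# sampled every 30 minutes.
samples = [(0, 6000.0), (1800, 6000.0), (3600, 6000.0)]
print(energy_kwh(samples))
```

With these made-up readings, a steady 6000 W over one hour comes to 6 kWh, which is the kind of figure you would compare against the Energy Consumed panel.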

2. GPU

Deep-dive metrics for an individual GPU within a compute node.

Panel	Description
GPU Power Usage	Real-time power draw per GPU
GPU Usage	Compute utilization per GPU core
GPU Package Power	Total package power for the node
GPU Temperature	Die temperature per GPU
Memory Temperature	HBM/VRAM temperature per GPU
Temperature Sensors	All available temperature sensor readings
VRAM Usage	Video memory usage per GPU
Used VRAM	Absolute VRAM consumed (GB)
GPU Compute Core Usage	Core-level utilization breakdown
PCIe Bandwidth	PCIe read/write throughput
PCIe Max Speed / Counts	Link speed and lane count
Total / Total Uncorrectable ECC	ECC error counts (all errors, and the uncorrectable subset)
Energy Consumed	Per-node energy consumption over time
GPU Errors	Hardware error events

When to use: Investigating individual GPUs on a specific node, for example thermal issues, ECC errors, power spikes, or VRAM pressure.

3. Compute node

Node-level aggregated view across all GPUs on a single compute node.

Panel	Description
GPUs / Allocated GPUs	Total vs allocated GPU count on this node
Jobs	Jobs currently running on this node
Average GPU Usage	Mean compute utilization across all GPUs on the node
Average GPU Power	Mean power draw across all GPUs
Average GPU / Memory Temperature	Mean temperatures across all GPUs
Used VRAM / Memory Usage	Aggregate memory consumption
Top 5 Jobs by GPU Usage	Which jobs are consuming the most GPU compute
Top 5 Used VRAM	Which jobs are consuming the most VRAM
Total Power Usage	Combined power for the whole node
Total ECC Counts	Aggregate ECC error counts
Total / Current PCIe Bandwidth	PCIe throughput (historical and current)
Average Sensor Temperatures	All sensor averages
All GPUs	Per-GPU utilization view for all GPUs on the node
Energy Consumed	Node-level energy usage over time

When to use: Understanding how a single node is being used across multiple GPUs and jobs simultaneously.

4. Job

Resource consumption metrics for a specific Slurm job.

Panel	Description
Name	Job name and ID
Allocated GPUs	GPUs assigned to this job
Compute Nodes	Which nodes this job is running on
Average GPU Usage	Mean GPU compute utilization for this job
Top 5 GPU Usage	Top GPU consumers within the job
Top 5 Used VRAM	Top VRAM consumers within the job
Average GPU / Memory Temperature	Thermal view for this job's GPUs
Used VRAM / Memory Usage	Memory consumption for this job
Total Power Usage	Power draw attributed to this job
Total ECC Counts	ECC errors occurring during this job
Total / Current PCIe Bandwidth	PCIe I/O for this job
Average Sensor Temperatures	Sensor averages for this job's GPUs
All GPUs	Per-GPU breakdown for all GPUs in this job
GPU Performance	Compute performance trend for this job

When to use: Profiling a specific training run or workload — correlate GPU utilization, memory pressure, and thermal headroom for your job.

5. BMC log events

Hardware-level event logs from the Baseboard Management Controller (BMC) on your reserved nodes. These are low-level firmware and hardware events, independent of your workload.

Panel	Description
BMC Events (last 5m)	Count of all BMC events in the last 5 minutes
Warning/Critical (last 5m)	Count of warning and critical severity events
Per-node logs (last 5m)	Event count broken down by node
Event rate by severity	Timeline of events grouped by severity level
Event rate by node	Timeline of events grouped by node
Top sensors (time range)	Most frequently reporting hardware sensors
Top entry codes (time range)	Most common BMC event codes
Warning/Critical events (last 1h)	Warning/critical events over the past hour
Recent BMC events (logs)	Raw log stream of recent BMC events

When to use: Investigating hardware-level issues — unexpected reboots, thermal warnings, power faults, or sensor alerts on your nodes.
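The "last 5m" panels above are trailing-window counts grouped by severity or by node. As a rough illustration of what such a count means, here is a minimal sketch. It assumes events are available as simple records; the field names (`time`, `node`, `severity`), the node names, and the function `count_recent` are hypothetical and do not reflect the actual BMC log schema or how the dashboard queries it.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_recent(events, now, window=timedelta(minutes=5)):
    """Count events inside the trailing window, by severity and by node."""
    recent = [e for e in events if now - window <= e["time"] <= now]
    by_severity = Counter(e["severity"] for e in recent)
    by_node = Counter(e["node"] for e in recent)
    return len(recent), by_severity, by_node


# Hypothetical events; the third one is older than five minutes.
now = datetime(2025, 1, 1, 12, 0, 0)
events = [
    {"time": now - timedelta(minutes=1), "node": "node-a", "severity": "warning"},
    {"time": now - timedelta(minutes=3), "node": "node-b", "severity": "info"},
    {"time": now - timedelta(minutes=9), "node": "node-a", "severity": "critical"},
]

total, by_severity, by_node = count_recent(events, now)
warn_crit = by_severity["warning"] + by_severity["critical"]
print(total, warn_crit, dict(by_node))
```

In this example two events fall inside the five-minute window and one of them is warning or critical. The older critical event would still appear in a one-hour view such as Warning/Critical events (last 1h).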

Tips

Time range: Use the time picker (top right) to zoom into a specific training window
Node selection: Dashboards scope automatically to your reserved nodes — no manual filtering needed
Refresh: Set auto-refresh to 30s or 1m during active training runs
Dashboards reset on reservation change: Dashboards reflect only nodes active in your current reservation
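The time range and refresh interval can usually also be set in the dashboard link itself, since Grafana reads `from`, `to`, and `refresh` query parameters from the URL. The following is a minimal sketch for building such a link, assuming this deployment follows standard Grafana URL conventions. The helper `dashboard_url` and the path `/d/abc123/job` are placeholders; copy the real path from your browser's address bar after opening a dashboard.

```python
from urllib.parse import urlencode

BASE_URL = "https://aac.amd.com:6443"


def dashboard_url(dashboard_path, time_from="now-1h", time_to="now", refresh="30s"):
    """Build a dashboard link with an explicit time range and auto-refresh."""
    query = urlencode({"from": time_from, "to": time_to, "refresh": refresh})
    return f"{BASE_URL}{dashboard_path}?{query}"


# "/d/abc123/job" is a placeholder path, not a real dashboard on this cluster.
print(dashboard_url("/d/abc123/job", time_from="now-6h", refresh="1m"))
```

A link built this way is handy for bookmarking a view of a long training run, such as the last six hours refreshed every minute.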

Support

For access issues or questions about your metrics:

Email: dl.dcgpu.aacteam@amd.com