GPU Metrics
When your reservation is active on an AMD Accelerator Cloud (AAC) cluster, you will receive an email with Grafana credentials to access your personal GPU metrics dashboard.
| Field | Value |
|---|---|
| URL | https://aac.amd.com:6443/dashboards |
| Username | Provided in your credential email |
| Password | Provided in your credential email |
Your credentials are unique to you. Do not share them with others.
Logging in
- Open a browser and go to https://aac.amd.com:6443/dashboards
- Enter the username and password from your credential email
- You will land on your personal dashboard folder — only your dashboards are visible to you
If you have trouble logging in:
- Check your spam/junk folder for the credential email
- Contact the AAC team at dl.dcgpu.aacteam@amd.com
Your dashboards
You will see five dashboards, each scoped exclusively to your reserved nodes and jobs.
1. Reservation overview
A high-level summary of everything in your reservation at a glance.
| Panel | Description |
|---|---|
| Reservation Summary | Your reservation name, node count, and active status |
| Reserved Nodes | List of nodes assigned to your reservation |
| Total GPUs | Total number of GPUs across your reserved nodes |
| Current Avg GPU Utilization | Average GPU compute usage (%) across all your nodes right now |
| Total Power Usage | Combined power draw (W) across all reserved nodes |
| Active Jobs | Number of jobs currently running in your reservation |
| Energy Consumed | Cumulative energy used (kWh) during your reservation |
| GPU Performance | GPU compute core usage trend over time |
| GPU Utilization per Node | Per-node GPU utilization breakdown |
| GPU Memory Used % per Node | VRAM usage percentage per node |
| GPU Clock Speed per Node | Current GPU clock frequency per node |
| HBM Memory Temperature per Node | HBM/VRAM temperature per node |
| Avg GPU Temperature per Node | Average GPU die temperature per node |
| GPU Power per Node | Power draw per individual node |
When to use: Start here for a quick health check of your entire reservation.
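To make the aggregate panels more concrete, here is a minimal Python sketch of how values such as Current Avg GPU Utilization, Total Power Usage, and Energy Consumed could be derived from per-node samples. The node names, sample values, and the trapezoidal integration are assumptions for illustration only; the queries the dashboards actually run are not described on this page.

```python
# Hypothetical per-node power samples: (seconds since start, watts).
power_samples = {
    "node-a": [(0, 3000.0), (1800, 3600.0), (3600, 3000.0)],
    "node-b": [(0, 2000.0), (1800, 2000.0), (3600, 2000.0)],
}
# Hypothetical current GPU utilization (%) per node.
utilization_now = {"node-a": 90.0, "node-b": 70.0}


def energy_kwh(samples):
    """Integrate power (W) over time (s) with the trapezoidal rule, in kWh."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += (p0 + p1) / 2.0 * (t1 - t0)
    return joules / 3_600_000.0  # 1 kWh = 3.6e6 J


def reservation_summary(power, utilization):
    """Combine per-node values into reservation-wide figures."""
    return {
        # Mean of the per-node utilization values.
        "avg_gpu_utilization_pct": sum(utilization.values()) / len(utilization),
        # Sum of the most recent power reading on each node.
        "total_power_w": sum(samples[-1][1] for samples in power.values()),
        # Sum of the integrated energy of each node.
        "energy_kwh": sum(energy_kwh(samples) for samples in power.values()),
    }


summary = reservation_summary(power_samples, utilization_now)
print(summary)
```

The point of the sketch is the relationship between the panels: power is an instantaneous reading in watts, while energy accumulates over the life of the reservation, so a short power spike barely moves the kWh figure.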
2. GPU
Deep-dive metrics for an individual GPU within a compute node.
| Panel | Description |
|---|---|
| GPU Power Usage | Real-time power draw per GPU |
| GPU Usage | Compute utilization per GPU core |
| GPU Package Power | Total package power for the node |
| GPU Temperature | Die temperature per GPU |
| Memory Temperature | HBM/VRAM temperature per GPU |
| Temperature Sensors | All available temperature sensor readings |
| VRAM Usage | Video memory usage per GPU |
| Used VRAM | Absolute VRAM consumed (GB) |
| GPU Compute Core Usage | Core-level utilization breakdown |
| PCIe Bandwidth | PCIe read/write throughput |
| PCIe Max Speed / Counts | Link speed and lane count |
| Total / Total Uncorrectable ECC | ECC error counts (correctable and uncorrectable) |
| Energy Consumed | Per-node energy consumption over time |
| GPU Errors | Hardware error events |
When to use: Investigating a specific GPU on a node: thermal issues, ECC errors, power spikes, or VRAM pressure.
3. Compute node
Node-level aggregated view across all GPUs on a single compute node.
| Panel | Description |
|---|---|
| GPUs / Allocated GPUs | Total vs allocated GPU count on this node |
| Jobs | Jobs currently running on this node |
| Average GPU Usage | Mean compute utilization across all GPUs on the node |
| Average GPU Power | Mean power draw across all GPUs |
| Average GPU / Memory Temperature | Mean temperatures across all GPUs |
| Used VRAM / Memory Usage | Aggregate memory consumption |
| Top 5 Jobs by GPU Usage | Which jobs are consuming the most GPU compute |
| Top 5 Used VRAM | Which jobs are consuming the most VRAM |
| Total Power Usage | Combined power for the whole node |
| Total ECC Counts | Aggregate ECC error counts |
| Total / Current PCIe Bandwidth | PCIe throughput (historical and current) |
| Average Sensor Temperatures | All sensor averages |
| All GPUs | Per-GPU utilization view for all GPUs on the node |
| Energy Consumed | Node-level energy usage over time |
When to use: Understanding how a single node is being used across multiple GPUs and jobs simultaneously.
4. Job
Resource consumption metrics for a specific Slurm job.
| Panel | Description |
|---|---|
| Name | Job name and ID |
| Allocated GPUs | GPUs assigned to this job |
| Compute Nodes | Which nodes this job is running on |
| Average GPU Usage | Mean GPU compute utilization for this job |
| Top 5 GPU Usage | Top GPU consumers within the job |
| Top 5 Used VRAM | Top VRAM consumers within the job |
| Average GPU / Memory Temperature | Thermal view for this job's GPUs |
| Used VRAM / Memory Usage | Memory consumption for this job |
| Total Power Usage | Power draw attributed to this job |
| Total ECC Counts | ECC errors occurring during this job |
| Total / Current PCIe Bandwidth | PCIe I/O for this job |
| Average Sensor Temperatures | Sensor averages for this job's GPUs |
| All GPUs | Per-GPU breakdown for all GPUs in this job |
| GPU Performance | Compute performance trend for this job |
When to use: Profiling a specific training run or workload — correlate GPU utilization, memory pressure, and thermal headroom for your job.
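As an illustration of what the Average GPU Usage and Top 5 panels summarize, the following Python sketch ranks the GPUs of one job by compute usage and by used VRAM. The row layout and all values are hypothetical; the dashboard's real data source and field names are not documented here.

```python
from statistics import mean

# Hypothetical per-GPU averages for one job:
# (node, GPU index, usage %, used VRAM in GB).
job_gpus = [
    ("node-a", 0, 95.0, 150.0),
    ("node-a", 1, 40.0, 60.0),
    ("node-a", 2, 88.0, 170.0),
    ("node-b", 0, 91.0, 120.0),
    ("node-b", 1, 12.0, 20.0),
    ("node-b", 2, 77.0, 140.0),
]


def top_n(rows, key_index, n=5):
    """Return the n rows with the largest value in column key_index."""
    return sorted(rows, key=lambda row: row[key_index], reverse=True)[:n]


avg_usage = mean(row[2] for row in job_gpus)  # mean usage across the job
top_usage = top_n(job_gpus, key_index=2)      # ranked by compute usage
top_vram = top_n(job_gpus, key_index=3)       # ranked by used VRAM

print(f"average GPU usage: {avg_usage:.1f}%")
for node, gpu, usage, vram in top_usage:
    print(f"{node} gpu{gpu}: {usage:.0f}% usage, {vram:.0f} GB VRAM")
```

A GPU that sits well below the job average, such as the 12% entry in this made-up data, drops out of the top-five ranking, which is one way an underused allocation can show up when comparing the two views.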
5. BMC log events
Hardware-level event logs from the Baseboard Management Controller (BMC) on your reserved nodes. These are low-level firmware and hardware events, independent of your workload.
| Panel | Description |
|---|---|
| BMC Events (last 5m) | Count of all BMC events in the last 5 minutes |
| Warning/Critical (last 5m) | Count of warning and critical severity events |
| Per-node logs (last 5m) | Event count broken down by node |
| Event rate by severity | Timeline of events grouped by severity level |
| Event rate by node | Timeline of events grouped by node |
| Top sensors (time range) | Most frequently reporting hardware sensors |
| Top entry codes (time range) | Most common BMC event codes |
| Warning/Critical events (last 1h) | Warning/critical events over the past hour |
| Recent BMC events (logs) | Raw log stream of recent BMC events |
When to use: Investigating hardware-level issues — unexpected reboots, thermal warnings, power faults, or sensor alerts on your nodes.
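The windowed counts in this dashboard can be pictured with a small Python sketch that filters events by time and groups them by severity and node. The event records, field names, and severity labels below are assumptions for illustration; the actual BMC log schema may differ.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Fixed "now" so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Hypothetical BMC event records.
events = [
    {"time": now - timedelta(minutes=1), "node": "node-a", "severity": "Warning"},
    {"time": now - timedelta(minutes=3), "node": "node-a", "severity": "OK"},
    {"time": now - timedelta(minutes=4), "node": "node-b", "severity": "Critical"},
    {"time": now - timedelta(minutes=30), "node": "node-b", "severity": "Warning"},
    {"time": now - timedelta(hours=2), "node": "node-a", "severity": "Critical"},
]

ALERT_SEVERITIES = {"Warning", "Critical"}  # assumed severity labels


def in_window(records, end, window):
    """Keep the records whose timestamp falls within [end - window, end]."""
    return [r for r in records if end - window <= r["time"] <= end]


last_5m = in_window(events, now, timedelta(minutes=5))
last_1h = in_window(events, now, timedelta(hours=1))

total_5m = len(last_5m)
warn_crit_5m = sum(r["severity"] in ALERT_SEVERITIES for r in last_5m)
per_node_5m = Counter(r["node"] for r in last_5m)
warn_crit_1h = [r for r in last_1h if r["severity"] in ALERT_SEVERITIES]

print(total_5m, warn_crit_5m, dict(per_node_5m), len(warn_crit_1h))
```

Because each panel uses its own window, an event can appear in the one-hour warning/critical view while no longer counting toward the five-minute totals.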
Tips
- Time range: Use the time picker (top right) to zoom into a specific training window
- Node selection: Dashboards scope automatically to your reserved nodes — no manual filtering needed
- Refresh: Set auto-refresh to 30s or 1m during active training runs
- Reservation changes: Dashboards reflect only the nodes in your current reservation, so their contents change when your reservation does
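The time range and refresh tips can also be applied through the dashboard URL. Grafana dashboards generally accept `from`, `to`, and `refresh` query parameters, so a bookmarked link can reopen the same training window. The Python sketch below adds those parameters to a dashboard URL copied from the browser; the example URL path (`/d/EXAMPLE_UID/job`) is a hypothetical placeholder, and whether your deployment honors these parameters is an assumption worth verifying.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_time_range(url, time_from="now-6h", time_to="now", refresh="30s"):
    """Return url with Grafana-style time range and refresh parameters set."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))  # keep any existing parameters
    params.update({"from": time_from, "to": time_to, "refresh": refresh})
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical dashboard URL; copy the real one from your browser.
example = "https://aac.amd.com:6443/d/EXAMPLE_UID/job?orgId=1"
link = with_time_range(example, time_from="now-1h", refresh="1m")
print(link)
```

Relative values such as `now-1h` follow the current time, which suits monitoring an active run; for a finished run, absolute timestamps chosen in the time picker are the better fit.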
Support
For access issues or questions about your metrics:
Email: dl.dcgpu.aacteam@amd.com