Clusters at a Glance
This page provides a quick comparison of AAC Bare Metal clusters to help you choose the right environment for your workload.
Quick reference
| Feature | MI325X Cluster | MI355X Cluster |
|---|---|---|
| GPU Model | AMD Instinct MI325X | AMD Instinct MI355X |
| GPUs per Node | 8 | 8 |
| GPU Memory per GPU | 256 GB HBM3E | 288 GB HBM3E |
| Total GPU Memory per Node | 2 TB (8 × 256 GB) | 2.25 TB (8 × 288 GB) |
| CPU Cores per Node | 256 (dual-socket AMD EPYC) | 256 (dual-socket AMD EPYC) |
| Operating System | Ubuntu 22.04 | Ubuntu 22.04 |
| Slurm Partition | 256C8G1H_MI325X_Ubuntu22 | 256C8G1H_MI355X_Ubuntu22 |
| ROCm Version (default) | 7.2.0 | 7.2.0 |
| Anaconda Module | ✅ anaconda3/25.5.1 | ❌ Not available |
| Pyxis (Container Integration) | ✅ Enabled | ✅ Enabled |
| GPU Partitioning Modes | SPX, DPX, QPX, CPX | SPX, DPX, QPX, CPX |
| NUMA Modes | NPS1, NPS2, NPS4 | Fixed (no user control) |
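The per-node memory totals in the table can be reproduced with simple arithmetic. The sketch below assumes the TB figures use 1 TB = 1024 GB, which is what makes 8 × 288 GB come out to 2.25 TB; the function name `gpu_mem_tb` is hypothetical and not part of any AAC tooling.

```bash
# Hypothetical helper: total GPU memory in TB (assuming 1 TB = 1024 GB).
#   $1  GPUs per node
#   $2  memory per GPU in GB
#   $3  number of nodes (optional, defaults to 1)
gpu_mem_tb() {
    local gpus="$1" per_gpu_gb="$2" nodes="${3:-1}"
    local total_gb=$(( gpus * per_gpu_gb * nodes ))
    # awk handles the fractional division that shell arithmetic cannot
    awk -v gb="$total_gb" 'BEGIN { printf "%.2f\n", gb / 1024 }'
}
```

For example, `gpu_mem_tb 8 288` corresponds to one MI355X node, and a third argument scales the figure to a multi-node job.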
Which cluster should I use?
Use MI325X when:
- You need the Anaconda module for conda environment management
- Your workload benefits from NUMA mode tuning (NPS1/NPS2/NPS4)
- You're working with models or datasets that fit within 256 GB per GPU
- You need access to legacy software in /shared/apps2
Use MI355X when:
- You need larger GPU memory (288 GB vs 256 GB per GPU)
- You're training very large models that require more GPU memory
- You don't need Anaconda (can use containers or venv instead)
- You want the latest-generation MI355X accelerators
Either cluster works for:
- Standard containerized workflows (both have Pyxis)
- ROCm 7.X.X workloads
- Multi-node jobs with high-speed interconnect
- Standard GPU partitioning modes (SPX/DPX/QPX/CPX)
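The selection criteria above can be sketched as a small shell function. This is only an illustration of the decision logic: the function name `choose_partition` and its arguments are hypothetical, and it defaults to MI325X whenever the workload fits in 256 GB per GPU, which is one reasonable reading of the guidance rather than a cluster policy.

```bash
# Hypothetical helper: suggest a Slurm partition from the criteria above.
#   $1  GPU memory needed per GPU, in whole GB
#   $2  "yes" if the Anaconda module is required (default: no)
#   $3  "yes" if NUMA mode tuning (NPS1/NPS2/NPS4) is required (default: no)
choose_partition() {
    local mem_gb="$1" need_conda="${2:-no}" need_numa="${3:-no}"
    local need_325=no
    if [ "$need_conda" = yes ] || [ "$need_numa" = yes ]; then
        need_325=yes
    fi

    if [ "$mem_gb" -gt 288 ]; then
        echo "error: neither cluster offers more than 288 GB per GPU" >&2
        return 1
    fi
    if [ "$mem_gb" -gt 256 ]; then
        # Only MI355X has enough memory, but it lacks Anaconda and NUMA control
        if [ "$need_325" = yes ]; then
            echo "error: Anaconda and NUMA tuning are MI325X-only (256 GB per GPU)" >&2
            return 1
        fi
        echo "256C8G1H_MI355X_Ubuntu22"
        return 0
    fi
    # Fits in 256 GB: MI325X satisfies every requirement (assumed default)
    echo "256C8G1H_MI325X_Ubuntu22"
}
```

The output can be passed straight to Slurm, for example `sbatch --partition="$(choose_partition 200 yes)" job.sh`, where `job.sh` stands in for your own script.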
Shared storage
Both clusters share the same NFS-based storage layout:
| Path | Purpose | Access |
|---|---|---|
| $HOME (/shared/amdgpu/home/<user>) | User home directory | Read/write, private |
| /shared/data | Shared datasets, models, containers | Read/write, shared |
| /shared/apps | Software, ROCm modules | Read-only, shared |
| /shared/apps2 | Legacy software (MI325X only) | Read-only, shared |
See Storage and Shared Filesystems for details.
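To confirm what a login session actually sees, a quick check along these lines may help. It is a minimal sketch: the function names `describe_path` and `check_shared_storage` are hypothetical, and the write test reflects only the current user's permissions, not the mount options.

```bash
# Hypothetical helper: report whether a directory exists and is writable
# by the current user.
describe_path() {
    local path="$1"
    if [ ! -d "$path" ]; then
        echo "$path: missing"
    elif [ -w "$path" ]; then
        echo "$path: read/write"
    else
        echo "$path: read-only"
    fi
}

# Paths taken from the table above; /shared/apps2 is expected to be
# missing on MI355X.
check_shared_storage() {
    local p
    for p in "$HOME" /shared/data /shared/apps /shared/apps2; do
        describe_path "$p"
    done
}
```

Running `check_shared_storage` on each cluster gives a quick side-by-side view of the layout.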
Software availability
Available on both clusters
| Software | Module/Command |
|---|---|
| ROCm 7.X.X | module load rocm/<rocm version> |
| ROCm 7.X.X RC | module load rocm/<rocm rc version> |
| Podman | podman (no module needed) |
| Enroot | enroot (no module needed) |
| Pyxis | srun --container-image=... |
| MPI | $MPI_HOME (loaded with ROCm) |
Available on MI325X only
| Software | Module/Command |
|---|---|
| Anaconda3 25.5.1 | module load anaconda3/25.5.1 |
Note: If you need Anaconda on MI355X, contact cluster operations or use containerized Python environments.
High-availability controllers
Both clusters use dual Slurm controllers for high availability:
- Always use the canonical cluster DNS alias in the aacXX.amd.com format for SSH access
- The alias automatically routes your connection to the active controller
- This keeps access uninterrupted during controller failover
GPU partitioning and NUMA modes
Both clusters support GPU partitioning, but NUMA mode control differs:
| Mode | MI325X | MI355X |
|---|---|---|
| SPX (full GPU) | ✅ --constraint=spx | ✅ --constraint=spx |
| DPX (dual partition) | ✅ --constraint=dpx | ✅ --constraint=dpx |
| QPX (quad partition) | ✅ --constraint=qpx | ✅ --constraint=qpx |
| CPX (compute partition) | ✅ --constraint=cpx | ✅ --constraint=cpx |
| NPS1/NPS2/NPS4 | ✅ --constraint=nps1 etc. | ❌ Fixed NUMA config |
See Node Reference Guide for detailed specifications and examples.
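The table above can be turned into a small guard that builds the `--constraint` flag and rejects NUMA modes on MI355X. This is a sketch under assumptions: the function name `build_constraint` is hypothetical, and combining a GPU mode with a NUMA mode using Slurm's `&` (AND) operator is assumed to work with how features are defined on these nodes; check the Node Reference Guide before relying on it.

```bash
# Hypothetical helper: build a --constraint flag for sbatch/srun.
#   $1  cluster: MI325X or MI355X
#   $2  GPU partitioning mode: spx, dpx, qpx, or cpx
#   $3  NUMA mode (optional): nps1, nps2, or nps4
build_constraint() {
    local cluster="$1" gpu_mode="$2" numa_mode="${3:-}"
    case "$gpu_mode" in
        spx|dpx|qpx|cpx) ;;
        *) echo "error: unknown GPU partitioning mode: $gpu_mode" >&2; return 1 ;;
    esac
    if [ -z "$numa_mode" ]; then
        echo "--constraint=$gpu_mode"
        return 0
    fi
    case "$numa_mode" in
        nps1|nps2|nps4) ;;
        *) echo "error: unknown NUMA mode: $numa_mode" >&2; return 1 ;;
    esac
    if [ "$cluster" != MI325X ]; then
        echo "error: NUMA mode is fixed on $cluster" >&2
        return 1
    fi
    # Slurm joins required features with '&' (logical AND)
    echo "--constraint=${gpu_mode}&${numa_mode}"
}
```

Quote the result when passing it on, for example `srun "$(build_constraint MI325X spx nps4)" ...`, so the shell does not interpret the `&`.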
Migration guide: Moving between clusters
If you need to move work from one cluster to another:
Files and data
- Both clusters mount the same /shared/data and $HOME via NFS
- Files in $HOME and /shared/data are accessible from both clusters
- No file transfer needed!
Container images
- Store .sqsh images in /shared/data so they are accessible from both clusters
- Docker Hub images work on both via Pyxis
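One way to produce a shared .sqsh file is `enroot import` with the `-o` output option. The sketch below only prints the command so it can be reviewed first; the function names and the `+`-separated file naming are hypothetical choices, and the image reference is the one used elsewhere on this page.

```bash
# Hypothetical helper: derive a .sqsh file path from an image reference.
#   $1  image reference, e.g. rocm/pytorch-training:v25.5
#   $2  target directory (optional, defaults to /shared/data)
sqsh_path_for() {
    local image="$1" dir="${2:-/shared/data}"
    # Replace '/' and ':' so the reference becomes a single file name
    local name
    name="$(printf '%s' "$image" | tr '/:' '++')"
    echo "$dir/$name.sqsh"
}

# Print (not run) the enroot command that would create the image file.
import_command() {
    local image="$1" dir="${2:-/shared/data}"
    echo "enroot import -o $(sqsh_path_for "$image" "$dir") docker://$image"
}
```

After running the printed command, the resulting file path can be given to Pyxis in place of a registry reference, for example `srun --container-image="$(sqsh_path_for rocm/pytorch-training:v25.5)" ...`.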
Job scripts
- Update partition name: 256C8G1H_MI325X_Ubuntu22 ↔ 256C8G1H_MI355X_Ubuntu22
- Remove NUMA constraints if moving from MI325X → MI355X
- If using Anaconda on MI325X, switch to containers on MI355X:

```bash
# MI325X with Anaconda
module load anaconda3/25.5.1
conda activate myenv
python train.py

# MI355X with container
srun --container-image=docker://rocm/pytorch-training:v25.5 \
     --container-mounts=$HOME:/workdir \
     --container-workdir=/workdir \
     python train.py
```
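The first two job-script changes are mechanical and could be scripted. The following is a rough sketch, not a supported tool: the function name `to_mi355x` is hypothetical, and it only recognizes NUMA constraints written as `#SBATCH --constraint=npsN`, `#SBATCH -C npsN`, or joined to another feature with `&`. Anaconda-specific lines still need to be converted by hand.

```bash
# Hypothetical helper: rewrite an MI325X job script for MI355X.
# Reads the script on stdin and writes the converted script to stdout.
to_mi355x() {
    sed -E \
        -e 's/256C8G1H_MI325X_Ubuntu22/256C8G1H_MI355X_Ubuntu22/g' \
        -e '/^#SBATCH[[:space:]]+(--constraint|-C)[= ]nps[124][[:space:]]*$/d' \
        -e '/^#SBATCH/ s/&nps[124]//g' \
        -e '/^#SBATCH/ s/nps[124]&//g'
}
```

A typical use would be `to_mi355x < job_mi325x.sh > job_mi355x.sh` (both file names are placeholders), followed by a manual review of the result before submitting it.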
Getting help
For questions about cluster access, quotas, or software installation:
- Contact your AMD sponsor
- Submit a support ticket through the AAC portal
- See Prerequisites for access requirements
Related documentation
- Prerequisites - Access requirements and common software
- AAC Slurm Cluster User Guide - Detailed Slurm usage
- Node Reference Guide - Complete node specifications
- Storage and Shared Filesystems - Storage layout and best practices
- Using Anaconda - Anaconda on MI325X
- What's New - Latest cluster updates