
Clusters at a Glance

This page provides a quick comparison of AAC Bare Metal clusters to help you choose the right environment for your workload.

Quick reference

| Feature | MI325X Cluster | MI355X Cluster |
|---|---|---|
| GPU Model | AMD Instinct MI325X | AMD Instinct MI355X |
| GPUs per Node | 8 | 8 |
| GPU Memory per GPU | 256 GB HBM3E | 288 GB HBM3E |
| Total GPU Memory per Node | 2 TB (8 × 256 GB) | 2.25 TB (8 × 288 GB) |
| CPU Cores per Node | 256 (dual-socket AMD EPYC) | 256 (dual-socket AMD EPYC) |
| Operating System | Ubuntu 22.04 | Ubuntu 22.04 |
| Slurm Partition | 256C8G1H_MI325X_Ubuntu22 | 256C8G1H_MI355X_Ubuntu22 |
| ROCm Version (default) | 7.2.0 | 7.2.0 |
| Anaconda Module | anaconda3/25.5.1 | ❌ Not available |
| Pyxis (Container Integration) | ✅ Enabled | ✅ Enabled |
| GPU Partitioning Modes | SPX, DPX, QPX, CPX | SPX, DPX, QPX, CPX |
| NUMA Modes | NPS1, NPS2, NPS4 | Fixed (no user control) |

Which cluster should I use?

Use MI325X when:

  • You need the Anaconda module for conda environment management
  • Your workload benefits from NUMA mode tuning (NPS1/NPS2/NPS4)
  • You're working with models or datasets that fit within 256 GB per GPU
  • You need access to legacy software in /shared/apps2

Use MI355X when:

  • You need larger GPU memory (288 GB vs 256 GB per GPU)
  • You're training very large models that require more GPU memory
  • You don't need Anaconda (can use containers or venv instead)
  • You want the latest generation MI355X accelerators

Either cluster works for:

  • Standard containerized workflows (both have Pyxis)
  • ROCm 7.X.X workloads
  • Multi-node jobs with high-speed interconnect
  • Standard GPU partitioning modes (SPX/DPX/QPX/CPX)
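The criteria above can be condensed into a small shell helper. This is only an illustrative sketch: the function name `pick_partition` and its two arguments are hypothetical, and defaulting to MI355X when no MI325X-only feature is needed is a simplification, since either cluster works for many jobs. The partition names and memory limits are taken from the quick reference table.

```bash
# Hypothetical helper (not part of the cluster tooling).
# $1: GPU memory needed per GPU, in GB
# $2: "yes" if the job relies on MI325X-only features
#     (Anaconda module, NPS tuning, /shared/apps2); defaults to "no"
pick_partition() {
    local need_gb="$1" needs_mi325x_features="${2:-no}"
    if [ "$needs_mi325x_features" = "yes" ]; then
        if [ "$need_gb" -le 256 ]; then
            echo "256C8G1H_MI325X_Ubuntu22"
        else
            echo "MI325X-only features requested, but need exceeds 256 GB per GPU" >&2
            return 1
        fi
    elif [ "$need_gb" -le 288 ]; then
        echo "256C8G1H_MI355X_Ubuntu22"
    else
        echo "need exceeds 288 GB per GPU on either cluster" >&2
        return 1
    fi
}

# Example use in a submission wrapper (job.sh is a placeholder):
# sbatch --partition="$(pick_partition 200 yes)" job.sh
```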

Shared storage

Both clusters share the same NFS-based storage layout:

| Path | Purpose | Access |
|---|---|---|
| $HOME (/shared/amdgpu/home/<user>) | User home directory | Read/write, private |
| /shared/data | Shared datasets, models, containers | Read/write, shared |
| /shared/apps | Software, ROCm modules | Read-only, shared |
| /shared/apps2 | Legacy software (MI325X only) | Read-only, shared |

See Storage and Shared Filesystems for details.
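To confirm from a login or compute node that these paths are mounted with the access listed above, a quick check along the following lines may help. The function name `report_access` is hypothetical, and the check only reflects what the current user can do, not the underlying export options.

```bash
# Hypothetical helper: print whether each given directory exists and
# whether the current user can write to it.
report_access() {
    local path
    for path in "$@"; do
        if [ ! -d "$path" ]; then
            echo "$path missing"
        elif [ -w "$path" ]; then
            echo "$path read/write"
        else
            echo "$path read-only"
        fi
    done
}

# On the cluster, the paths from the table would be checked like this:
# report_access "$HOME" /shared/data /shared/apps /shared/apps2
```

On an MI355X node, /shared/apps2 would be expected to show up as missing, since the table lists it as MI325X only.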

Software availability

Available on both clusters

| Software | Module/Command |
|---|---|
| ROCm 7.X.X | module load rocm/<rocm version> |
| ROCm 7.X.X RC | module load rocm/<rocm rc version> |
| Podman | podman (no module needed) |
| Enroot | enroot (no module needed) |
| Pyxis | srun --container-image=... |
| MPI | $MPI_HOME (loaded with ROCm) |
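As a sketch of how the ROCm and MPI rows fit together, the helper below loads a ROCm module and prints the MPI location that comes with it. It assumes the default version 7.2.0 is exposed under the module name rocm/7.2.0, following the `rocm/<rocm version>` pattern; the function name `load_rocm` is hypothetical, so check `module avail rocm` for the names actually installed.

```bash
# Hypothetical helper: load a ROCm module and show the MPI it provides.
# The default of 7.2.0 is an assumption based on the quick reference table.
load_rocm() {
    local version="${1:-7.2.0}"
    module load "rocm/${version}" || return 1
    # According to the table above, MPI_HOME is set when ROCm is loaded.
    echo "MPI_HOME=${MPI_HOME:-unset}"
}

# module avail rocm     # list the ROCm versions actually installed
# load_rocm             # load the assumed default
# load_rocm 7.2.0       # or name a version explicitly
```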

Available on MI325X only

| Software | Module/Command |
|---|---|
| Anaconda3 25.5.1 | module load anaconda3/25.5.1 |

Note: If you need Anaconda on MI355X, contact cluster operations or use containerized Python environments.

High-availability controllers

Both clusters use dual Slurm controllers for high availability:

  • Always use the canonical cluster DNS alias in the aacXX.amd.com format for SSH access
  • The alias automatically routes your connection to the active controller
  • This keeps access uninterrupted during controller failover

GPU partitioning and NUMA modes

Both clusters support GPU partitioning, but NUMA mode control differs:

| Mode | MI325X | MI355X |
|---|---|---|
| SPX (full GPU) | --constraint=spx | --constraint=spx |
| DPX (dual partition) | --constraint=dpx | --constraint=dpx |
| QPX (quad partition) | --constraint=qpx | --constraint=qpx |
| CPX (compute partition) | --constraint=cpx | --constraint=cpx |
| NPS1/NPS2/NPS4 | --constraint=nps1 etc. | ❌ Fixed NUMA config |

See Node Reference Guide for detailed specifications and examples.
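The constraint flags in the table can be assembled by a small helper such as the one below. It is a sketch under two assumptions: the function name `gpu_constraint` is hypothetical, and combining a GPU partitioning mode with an NPS mode through Slurm's `&` (AND) operator is assumed to be accepted on the MI325X cluster. Which combinations are actually offered is cluster-specific, so the Node Reference Guide remains the authority.

```bash
# Hypothetical helper: build a --constraint flag from a GPU partitioning
# mode and an optional NUMA mode (NUMA modes apply to MI325X only).
gpu_constraint() {
    local gpu_mode="$1" nps_mode="${2:-}"
    case "$gpu_mode" in
        spx|dpx|qpx|cpx) ;;
        *) echo "unknown GPU partitioning mode: $gpu_mode" >&2; return 1 ;;
    esac
    if [ -z "$nps_mode" ]; then
        echo "--constraint=${gpu_mode}"
        return 0
    fi
    case "$nps_mode" in
        nps1|nps2|nps4) echo "--constraint=${gpu_mode}&${nps_mode}" ;;
        *) echo "unknown NUMA mode: $nps_mode" >&2; return 1 ;;
    esac
}

# Keep the result quoted so the shell does not interpret the "&":
# srun --partition=256C8G1H_MI325X_Ubuntu22 "$(gpu_constraint cpx nps4)" rocm-smi
# srun --partition=256C8G1H_MI355X_Ubuntu22 "$(gpu_constraint spx)" rocm-smi
```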

Migration guide: Moving between clusters

If you need to move work from one cluster to another:

Files and data

  • Both clusters mount the same $HOME and /shared/data over NFS, so files in either location are visible from both clusters
  • No file transfer is needed

Container images

  • Store .sqsh images in /shared/data so they are accessible from both clusters
  • Docker Hub images work on both via Pyxis
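One way to combine both points is to import a Docker Hub image into a .sqsh file once and reuse it from either cluster. The sketch below uses `enroot import --output`; the function name `cache_image` and the `<your-dir>` path component are placeholders, not an established convention on these clusters.

```bash
# Hypothetical helper: import an image to a .sqsh file only if that file
# does not exist yet, so repeated jobs reuse the shared copy.
cache_image() {
    local uri="$1" sqsh="$2"
    if [ -f "$sqsh" ]; then
        echo "reusing $sqsh"
    else
        enroot import --output "$sqsh" "$uri"
    fi
}

# Import once, then point Pyxis at the file from either cluster:
# cache_image docker://rocm/pytorch-training:v25.5 /shared/data/<your-dir>/pytorch-training.sqsh
# srun --container-image=/shared/data/<your-dir>/pytorch-training.sqsh python train.py
```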

Job scripts

  • Update partition name: 256C8G1H_MI325X_Ubuntu22 → 256C8G1H_MI355X_Ubuntu22
  • Remove NUMA constraints if moving from MI325X → MI355X
  • If using Anaconda on MI325X, switch to containers on MI355X:

```bash
# MI325X with Anaconda
module load anaconda3/25.5.1
conda activate myenv
python train.py

# MI355X with container
srun --container-image=docker://rocm/pytorch-training:v25.5 \
     --container-mounts=$HOME:/workdir \
     --container-workdir=/workdir \
     python train.py
```
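The first two job-script changes (partition name and NUMA constraints) are mechanical, so they can be sketched with `sed`. The function name `migrate_to_mi355x` is hypothetical, and the patterns assume constraints are written as `--constraint=npsN` on their own `#SBATCH` line or joined to a GPU mode with `&`. Review the output before submitting, because scripts that spell constraints differently will not be caught.

```bash
# Hypothetical helper: print an MI355X version of an MI325X job script.
# The original file is left untouched; redirect the output to a new file.
migrate_to_mi355x() {
    sed -E \
        -e 's/256C8G1H_MI325X_Ubuntu22/256C8G1H_MI355X_Ubuntu22/g' \
        -e '/^#SBATCH[[:space:]]+--constraint=nps[124][[:space:]]*$/d' \
        -e 's/&nps[124]//g; s/nps[124]&//g' \
        "$1"
}

# migrate_to_mi355x job_mi325x.sh > job_mi355x.sh
```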

Getting help

For questions about cluster access, quotas, or software installation:

  • Contact your AMD sponsor
  • Submit a support ticket through the AAC portal
  • See Prerequisites for access requirements