
Release Notes - Bare Metal v2026.04

Release date: April 23, 2026

🎉 Highlights

This release documents the new MI355X cluster (MI355X.amd.com) alongside the existing MI325X cluster (MI325X.amd.com), promotes ROCm 7.2 as the recommended module on both clusters, and formalizes Pyxis-based containerized job submission with srun --container-image=.... The bare-metal documentation now also includes a dedicated GPU partitioning modes page.

🚀 New features

Pyxis for containerized srun

The user guide now shows how to launch jobs directly from a container image using srun --container-image=..., backed by the Pyxis SPANK plugin installed on both clusters. For most workflows, users no longer need to run enroot create / enroot start manually.
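A minimal sketch of the pattern described above. The helper name pyxis_srun_cmd, the ACCOUNT_NAME placeholder, and the rocm/pytorch:latest image reference are illustrative assumptions, not values taken from the user guide; the partition is the MI355X partition named in these notes. The helper only prints the command line so it can be reviewed before submitting.

```shell
# Sketch only: build and print a Pyxis-style srun command line.
# pyxis_srun_cmd is a hypothetical helper, not part of the cluster tooling.
pyxis_srun_cmd() {
  partition="$1"   # Slurm partition, e.g. 256C8G1H_MI355X_Ubuntu22
  account="$2"     # your Slurm account (placeholder below)
  image="$3"       # container image reference (assumed example below)
  shift 3
  # --container-image is the option provided by the Pyxis SPANK plugin;
  # everything after it is the command to run inside the container.
  echo "srun -p $partition --account=$account --gres=gpu:8 --container-image=$image $*"
}

# Print the command for review; copy-paste it on a login node to submit.
pyxis_srun_cmd 256C8G1H_MI355X_Ubuntu22 ACCOUNT_NAME rocm/pytorch:latest rocm-smi
```

Printing rather than executing keeps the sketch safe to run anywhere; substitute your own account and image before submitting the resulting command.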

GPU partitioning modes documentation

A new GPU partitioning modes page explains SPX/DPX/QPX/CPX and NPS1/NPS2/NPS4 settings and how to request them via Slurm --constraint.
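As a rough illustration of requesting a mode through --constraint, the sketch below assumes that node feature names match the mode names (for example CPX and NPS4). That is an assumption: check the actual feature names with sinfo -o "%P %f" or the partitioning modes page. The helper partition_constraint is hypothetical.

```shell
# Sketch only: compose a Slurm --constraint value from a compute mode and a
# memory mode. ASSUMPTION: node features are named exactly like the modes.
partition_constraint() {
  compute="$1"   # SPX, DPX, QPX, or CPX
  memory="$2"    # NPS1, NPS2, or NPS4
  case "$compute" in
    SPX|DPX|QPX|CPX) ;;
    *) echo "unknown compute mode: $compute" >&2; return 1 ;;
  esac
  case "$memory" in
    NPS1|NPS2|NPS4) ;;
    *) echo "unknown memory mode: $memory" >&2; return 1 ;;
  esac
  # In Slurm constraint syntax, '&' means both features are required.
  echo "--constraint=${compute}&${memory}"
}

# Print the option; quote it when passing to salloc or srun so the shell
# does not interpret '&'.
partition_constraint CPX NPS4
```

Validating the mode names up front catches typos before a job sits pending on a constraint that no node can satisfy.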

🔧 Improvements

  • ROCm examples: Updated default examples to module load rocm/7.2.0. Removed references to rocm-6.4.2 as the default. Both clusters expose ROCm 7.0 through 7.13; MI325X additionally retains ROCm 6.3.x and 6.4.x for legacy workloads.
  • Login hosts: Replaced vague aac13.amd.com and aacXX.amd.com placeholders with MI325X.amd.com (MI325X) and MI355X.amd.com (MI355X) when identifying which cluster a doc section applies to. For SSH and other client connections, continue to use the aacXX.amd.com alias as documented below.
  • Partitions: Replaced the non-existent sample partition 1CN96C8G1H_4IB_MI250_Ubuntu22 with the actual production partitions: 256C8G1H_MI325X_Ubuntu22 (MI325X) and 256C8G1H_MI355X_Ubuntu22 (MI355X).
  • Reservations: Clarified that --reservation= is not currently required on either cluster. The default flow is now salloc -p <partition> --account=<ACCOUNT_NAME> --gres=gpu:8 --mem=0 --exclusive.
  • High-availability controllers: Documented that each cluster runs two Slurm controllers (primary + backup). Use the aacXX.amd.com alias for SSH and other client connections so failover remains transparent. Treat MI325X.amd.com and MI355X.amd.com as cluster-identification names in the docs, not as the preferred login hostnames.
  • Style: Fixed HTML entities (&lt;, &gt;, &lowbar;) in headings, replaced en-dash characters in ssh and salloc examples, and standardized on module load rocm/<version> (slash) rather than rocm-<version> (hyphen).
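The partition names and the default allocation flow listed above can be combined as in the following sketch. The helpers cluster_partition and salloc_cmd are hypothetical, and ACCOUNT_NAME is a placeholder; the partition names and salloc options are the ones given in these notes.

```shell
# Sketch only: map a cluster name to its production partition.
cluster_partition() {
  case "$1" in
    MI325X) echo "256C8G1H_MI325X_Ubuntu22" ;;
    MI355X) echo "256C8G1H_MI355X_Ubuntu22" ;;
    *) echo "unknown cluster: $1" >&2; return 1 ;;
  esac
}

# Print the default allocation command (no --reservation= needed).
salloc_cmd() {
  partition=$(cluster_partition "$1") || return 1
  echo "salloc -p $partition --account=$2 --gres=gpu:8 --mem=0 --exclusive"
}

salloc_cmd MI355X ACCOUNT_NAME
# Once the allocation is granted, load the recommended ROCm module:
#   module load rocm/7.2.0
```

Note the slash form rocm/7.2.0 in the module name, matching the standardized style.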

🐛 Bug fixes

  • Fixed broken partition-listing example that mangled partition case via tr '[:lower:]' '[:upper:]'. Replaced with sacctmgr show assoc where user=$USER format=Account,Partition,QOS%40 and sinfo -o "%P %a %l %F %G %f".
  • Fixed sinfo –o "%P%" in the FAQ (incorrect trailing % and en-dash). Now sinfo -o "%P".
  • Fixed ssh –J –A examples that used en-dashes instead of -, which broke when copy-pasted.
  • Removed references to non-existent shared mounts /shareddata and /shareddata.ai. Real shared mounts are /shared/data, /shared/apps, and $HOME (/shared/amdgpu/home/<user>).
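To illustrate the first fix, the sketch below reproduces what the old tr pipeline did to a mixed-case partition name, then runs the replacement commands from these notes when the Slurm client tools are available. The guard around sacctmgr and sinfo is an addition for illustration, so the snippet is harmless off-cluster.

```shell
# The old example upper-cased its output, which changes the partition name.
partition="256C8G1H_MI325X_Ubuntu22"
mangled=$(printf '%s\n' "$partition" | tr '[:lower:]' '[:upper:]')
echo "$mangled"   # 256C8G1H_MI325X_UBUNTU22, no longer the configured name

# Replacement commands: they print partition names exactly as configured.
# Guarded so this only runs where the Slurm client tools are installed.
if command -v sacctmgr >/dev/null 2>&1 && command -v sinfo >/dev/null 2>&1; then
  sacctmgr show assoc where user="$USER" format=Account,Partition,QOS%40
  sinfo -o "%P %a %l %F %G %f"
fi
```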