Release Notes - Bare Metal v2026.04
Release date: April 23, 2026
🎉 Highlights
This release documents the new MI355X cluster (MI355X.amd.com) alongside the existing MI325X cluster (MI325X.amd.com), promotes ROCm 7.2 as the recommended module on both clusters, and formalizes Pyxis-based containerized job submission with srun --container-image=.... The bare-metal documentation now also includes a dedicated GPU partitioning modes page.
🚀 New features
Pyxis for containerized srun
The user guide now shows how to launch jobs directly from a container image using srun --container-image=..., backed by the Pyxis SPANK plugin installed on both clusters. For most workflows, users no longer need to run enroot create / enroot start manually.
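As a rough sketch of what such a launch might look like, the script below builds an srun command that uses --container-image. The account name and image reference are hypothetical placeholders, not values from the user guide. The small run helper is an illustrative convenience that only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run helper (illustrative): prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

PARTITION="256C8G1H_MI355X_Ubuntu22"   # production partition named in these notes
ACCOUNT="my_account"                    # hypothetical: use your own Slurm account
IMAGE="rocm/pytorch:latest"             # hypothetical image reference

# Pyxis imports and starts the container for the job step, so there is
# no separate enroot create / enroot start step.
run srun -p "$PARTITION" --account="$ACCOUNT" --gres=gpu:8 \
    --container-image="$IMAGE" \
    rocm-smi
```

By default this only prints the srun line for review. Running the script with DRY_RUN=0 on a login node would submit it for real.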
GPU partitioning modes documentation
A new GPU partitioning modes page explains SPX/DPX/QPX/CPX and NPS1/NPS2/NPS4 settings and how to request them via Slurm --constraint.
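The sketch below shows one way a compute mode and a memory mode could be combined into a single --constraint expression, using Slurm's & (AND) operator. It assumes the node feature names match the mode names listed above, which is a guess. The actual names should be checked on the cluster, for example with sinfo -o "%P %f". The account name is a hypothetical placeholder, and not every mode combination is necessarily offered.

```shell
#!/bin/sh
# Build a Slurm feature expression that ANDs a compute mode with a memory mode.
# Assumption: feature names are identical to the mode names in these notes.
constraint_for() {
  case "$1" in
    SPX|DPX|QPX|CPX) ;;
    *) echo "unknown compute mode: $1" >&2; return 1 ;;
  esac
  case "$2" in
    NPS1|NPS2|NPS4) ;;
    *) echo "unknown memory mode: $2" >&2; return 1 ;;
  esac
  printf '%s&%s\n' "$1" "$2"
}

PARTITION="256C8G1H_MI325X_Ubuntu22"   # production partition named in these notes
ACCOUNT="my_account"                    # hypothetical: use your own Slurm account
CONSTRAINT=$(constraint_for SPX NPS1)

# Print the allocation request for review instead of submitting it.
echo "salloc -p $PARTITION --account=$ACCOUNT --constraint=\"$CONSTRAINT\" --gres=gpu:8 --mem=0 --exclusive"
```

Quoting the constraint matters here: an unquoted & would be interpreted by the shell as a background operator.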
🔧 Improvements
- ROCm examples: Updated default examples to module load rocm/7.2.0. Removed references to rocm-6.4.2 as the default. Both clusters expose ROCm 7.0 through 7.13; MI325X additionally retains ROCm 6.3.x and 6.4.x for legacy workloads.
- Login hosts: Replaced the vague aac13.amd.com and aacXX.amd.com placeholders with MI325X.amd.com (MI325X) and MI355X.amd.com (MI355X) when identifying which cluster a doc section applies to. For SSH and other client connections, continue to use the aacXX.amd.com alias as documented below.
- Partitions: Replaced the non-existent sample partition 1CN96C8G1H_4IB_MI250_Ubuntu22 with the actual production partitions: 256C8G1H_MI325X_Ubuntu22 (MI325X) and 256C8G1H_MI355X_Ubuntu22 (MI355X).
- Reservations: Clarified that --reservation= is not currently required on either cluster. The default flow is now salloc -p <partition> --account=<ACCOUNT_NAME> --gres=gpu:8 --mem=0 --exclusive.
- High-availability controllers: Documented that each cluster runs two Slurm controllers (primary + backup). Use the aacXX.amd.com alias for SSH and other client connections so failover remains transparent. Treat MI325X.amd.com and MI355X.amd.com as cluster-identification names in the docs, not as the preferred login hostnames.
- Style: Fixed HTML entities (<, >, _) in headings, replaced en-dash characters in ssh and salloc examples, and standardized on module load rocm/<version> (slash) rather than rocm-<version> (hyphen).
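To make the default flow concrete, here is a minimal sketch that maps a cluster label to its production partition and prints the allocation and module commands described in the list above. The partition_for helper and the account name are hypothetical, introduced only for illustration.

```shell
#!/bin/sh
# Map a cluster label to its production partition (names taken from these notes).
partition_for() {
  case "$1" in
    MI325X) echo "256C8G1H_MI325X_Ubuntu22" ;;
    MI355X) echo "256C8G1H_MI355X_Ubuntu22" ;;
    *) echo "unknown cluster: $1" >&2; return 1 ;;
  esac
}

ACCOUNT="my_account"                 # hypothetical: use your own Slurm account
PARTITION=$(partition_for MI355X)

# Print the commands for review; no --reservation= flag is included.
echo "salloc -p $PARTITION --account=$ACCOUNT --gres=gpu:8 --mem=0 --exclusive"
# Inside the allocation, load the recommended ROCm module (slash form).
echo "module load rocm/7.2.0"
```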
🐛 Bug fixes
- Fixed the broken partition-listing example that mangled partition case via tr '[:lower:]' '[:upper:]'. Replaced with sacctmgr show assoc where user=$USER format=Account,Partition,QOS%40 and sinfo -o "%P %a %l %F %G %f".
- Fixed sinfo –o "%P%" in the FAQ (incorrect trailing % and en-dash). Now sinfo -o "%P".
- Fixed ssh –J –A examples that used en-dashes instead of -, which broke when copy-pasted.
- Removed references to non-existent shared mounts /shareddata and /shareddata.ai. Real shared mounts are /shared/data, /shared/apps, and $HOME (/shared/amdgpu/home/<user>).
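Two of these fixes are easy to reproduce locally. The sketch below first shows how the old tr pipeline changes the case of a partition name, then shows one possible way to normalize en-dashes in a pasted command. The fix_dashes helper is hypothetical and not part of the documentation tooling.

```shell
#!/bin/sh
PARTITION="256C8G1H_MI325X_Ubuntu22"   # production partition named in these notes

# The old pipeline upper-cased every letter, so the result no longer
# matches the partition name as listed.
MANGLED=$(printf '%s\n' "$PARTITION" | tr '[:lower:]' '[:upper:]')
echo "$MANGLED"    # 256C8G1H_MI325X_UBUNTU22

# Replace en-dashes (U+2013) with ASCII hyphens in a pasted command line.
fix_dashes() {
  printf '%s\n' "$1" | sed 's/–/-/g'
}

fix_dashes 'sinfo –o "%P"'    # sinfo -o "%P"
```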