Quick Start AAC Slurm Cluster User Guide
This guide explains how to use the AMD Accelerator Cloud (AAC) Slurm cluster.
How to log in to the AAC Slurm cluster
Use ssh to log in to the AAC Slurm cluster with your registered user ID and SSH key:
ssh registered-userid@aacXX.amd.com (example - for MI355 Cluster use aac13.amd.com)
To register for an account on the AAC Slurm cluster, generate an SSH key pair and send the public key to have your account created. Contact your AMD sponsor for more details.
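A key pair can be generated with ssh-keygen. The key type, file name, and comment below are illustrative choices, not AAC requirements (an empty passphrase is used here for brevity; consider setting one in practice):

```shell
# Create ~/.ssh if it does not exist, then generate an ed25519 key pair.
# Only the public key (the .pub file) is sent for registration --
# never share the private key.
mkdir -p "$HOME/.ssh"
ssh-keygen -t ed25519 -f "$HOME/.ssh/id_ed25519_aac" -N "" -C "aac-access"

# Print the public key to copy into your registration request.
cat "$HOME/.ssh/id_ed25519_aac.pub"
```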
AAC Slurm Cluster partition naming convention
Each Slurm partition is named after the compute nodes it contains: the number of CPUs, GPUs, and GPU hives per node, the GPU product, and the operating system distribution of the nodes, as follows:
<n>C<m>G<p>H_<GPU_Product>_<OS>
where <n> is the total number of CPUs per node, <m> is the total number of GPUs per node, <p> is the total number of GPU hives per node, <GPU_Product> is the AMD Instinct accelerator product name, and <OS> is the operating system distribution running on the nodes in the cluster.
The following are the currently configured partitions in AAC Slurm cluster:
XXXC8G1H_MI3XXX_Ubuntu22
You will be given access to specific partition(s)/queue(s).
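As a sketch, the naming convention above can be decoded with a small bash snippet; the partition name used here is an example and not necessarily one you have access to:

```shell
#!/usr/bin/env bash
# Decode an AAC partition name of the form <n>C<m>G<p>H_<GPU_Product>_<OS>.
# The example name below is illustrative only.
name="256C8G1H_MI355X_Ubuntu22"

if [[ "$name" =~ ^([0-9]+)C([0-9]+)G([0-9]+)H_([^_]+)_(.+)$ ]]; then
    echo "CPUs per node: ${BASH_REMATCH[1]}"
    echo "GPUs per node: ${BASH_REMATCH[2]}"
    echo "GPU hives:     ${BASH_REMATCH[3]}"
    echo "GPU product:   ${BASH_REMATCH[4]}"
    echo "OS:            ${BASH_REMATCH[5]}"
else
    echo "Name does not match the AAC convention" >&2
fi
```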
How to display all Slurm partition/queue information
Use sinfo to list all partitions and their available compute nodes:
sinfo --format="%.32P %.5a %.10l %.16F %N"
The partition marked with an asterisk (*) is the default partition. However, a partition/queue name must always be specified when submitting jobs with Slurm commands.
How to submit a batch job
Use the sbatch command to queue an SBATCH script to be executed on a specific partition, using the -p option to specify the partition to use:
sbatch -p <partition_name> <path_to_the_sbatch_file>
How to create an SBATCH file
Sample SBATCH scripts for AMD Infinity Hub applications (https://www.amd.com/en/technologies/infinity-hub) and other sample applications are located at https://github.com/amddcgpuce/sbatchfiles/tree/main. They cover 1-, 2-, 4-, and 8-GPU workloads on a single compute node and can be used as a starting point for writing your own SBATCH script.
Here's a template for an SBATCH file for a 1 GPU workload on AAC Slurm cluster: https://github.com/amddcgpuce/sbatchfiles/blob/main/example1gpu.sbatch
Note: Include these lines in the SBATCH file for the AAC Slurm cluster to load the correct OS-specific module environment for the compute node. At the time of this writing, the ROCm 6.4.2 environment is the latest available.
source /etc/profile.d/modules.sh
source /shared/apps/aac.modules.bash
module purge
module load rocm-<ROCm Version>   # example: rocm-6.4.2
The line module load rocm-<ROCm Version> loads the ROCm toolchain and sets up the environment variables for that version.
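Putting the pieces above together, a minimal single-GPU SBATCH script might look like the following sketch; the partition name, job name, and the rocminfo check are placeholders to adapt to your queue and application:

```
#!/bin/bash
#SBATCH --job-name=rocm-test                 # placeholder job name
#SBATCH --partition=256C8G1H_MI355X_Ubuntu22 # replace with a partition you can access
#SBATCH --nodes=1
#SBATCH --gres=gpu:1                         # request one GPU
#SBATCH --output=%x_%j.out                   # <job-name>_<job-id>.out

# Load the OS-specific module environment (required on AAC compute nodes).
source /etc/profile.d/modules.sh
source /shared/apps/aac.modules.bash
module purge
module load rocm-6.4.2

# Replace with your application; rocminfo just verifies the GPU is visible.
rocminfo
```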
How to check a queued job status
Use squeue to view submitted job status:
squeue --format="%.18i %.32P %.8j %.8u %.2t %.10M %.6D %R %n"
How to find details of a submitted or queued job
Use scontrol to show details of the job ID.
For example, to view details of job ID 21851, use
scontrol show job 21851
How to allocate and SSH to a compute node
Use salloc to allocate a whole node from your assigned partition/queue and SSH to the allocated node.
salloc --exclusive --mem=0 --gres=gpu:8 -p <QUEUE_NAME>
Where --exclusive reserves the whole node, --mem=0 grants access to all of the node's memory, --gres=gpu:8 requests all eight GPUs, and -p selects your assigned partition/queue.
How to create additional SSH sessions to a compute node allocated with salloc
You can start additional SSH connections to the salloc-allocated node:
ssh -A -J <USERID>@aacXX.amd.com <USERID>@<COMPUTENODE_HOSTNAME>
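As an alternative to typing the jump-host option each time, a ProxyJump entry in ~/.ssh/config can be used. This is a sketch; the host alias, user ID, and host names are placeholders to substitute with your values:

```
# ~/.ssh/config -- replace <USERID>, aacXX, and <COMPUTENODE_HOSTNAME>
Host aac
    HostName aacXX.amd.com
    User <USERID>
    ForwardAgent yes

Host <COMPUTENODE_HOSTNAME>
    User <USERID>
    ProxyJump aac
```

With this in place, ssh <COMPUTENODE_HOSTNAME> connects through the login node automatically.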
How to set up ROCm environment on the compute node
After SSHing to the compute node, load the ROCm environment using:
module load rocm-<ROCm Version>   # example: rocm-6.4.2
Note that the ROCm environment module must be loaded in each new SSH session.
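As an optional convenience (an assumption of this guide, not an AAC requirement), the module setup can be appended to ~/.bashrc so every new shell on a compute node loads ROCm automatically; the version below is an example:

```shell
# Append the AAC module setup to ~/.bashrc so each new shell loads ROCm.
# "rocm-6.4.2" is an example version; adjust to what `module avail` lists.
cat >> "$HOME/.bashrc" <<'EOF'
if [ -f /etc/profile.d/modules.sh ]; then
    source /etc/profile.d/modules.sh
    source /shared/apps/aac.modules.bash
    module load rocm-6.4.2
fi
EOF
```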
How to display the accessible Slurm queues
To show the <QUEUE_NAME> accessible to you, run:
sshare -nm --format="Partition%32" | tr '[:lower:]' '[:upper:]'
How to enable GCC/GFORTRAN toolset 11 on RHEL8
Use salloc -p <desired_partition>_RHEL8 to allocate and ssh to a compute node running RHEL8. To enable gcc-toolset-11, start a new bash shell:
$ scl enable gcc-toolset-11 bash
How to use Podman to run Docker containers
To start the amddcgpuce/rocm:6.1.2-ub22 Docker image in interactive mode using podman, with $HOME mounted as /workdir inside the container:
podman run -it --privileged --network=host --ipc=host -v $HOME:/workdir -v /shareddata:/shareddata -v /shared:/shared --workdir /workdir docker://amddcgpuce/rocm:6.1.2-ub22 bash
This pulls the Docker image, starts an interactive session with the working directory set to /workdir, and opens a bash shell.
Create work files under /workdir, which is your $HOME and is accessible from every node you SSH into.
To exit the interactive Docker session, type exit to return to the shell prompt on the allocated node.
Suggested best practices
- Release cache space used by Docker images: podman system prune -a
- Keep work files OUTSIDE the container, under $HOME, and use Docker as a customized tools environment with packages installed specifically for your application development.
- Files under $HOME are visible on ALL compute nodes, so you can resume work with Docker images on ANY allocated node.
Note: Podman caches Docker images on local storage on the compute node. Cached images are removed periodically to free disk space.
How to reattach to a Podman container after exiting it
From SSH session on Compute Node,
Use podman ps -a to get the CONTAINER ID of the exited container.
Use podman start <CONTAINER ID> to restart the container.
Use podman attach <CONTAINER ID> to attach and return to the bash shell prompt.
How to exit Podman container and get back to compute node shell prompt
Type exit to leave the interactive session and return to the compute node shell prompt.
How to pull a private Docker image
To pull private Docker images, first log in to your registry account:
podman login docker.io
Log out when done:
podman logout docker.io
How to allocate multi-node compute cluster
Allocate a 2-node MI355 cluster (example: 2-node MI355, Ubuntu 22):
salloc -N 2 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p <QUEUE_NAME>
Ex: salloc -N 2 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p 256C8G1H_MI355X_Ubuntu22
Allocate 4-node MI355 cluster
salloc -N 4 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p <QUEUE_NAME>
Ex: salloc -N 4 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p 256C8G1H_MI355X_Ubuntu22
Allocate 8-node MI355 cluster
salloc -N 8 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p <QUEUE_NAME>
Ex: salloc -N 8 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p 256C8G1H_MI355X_Ubuntu22
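For batch work, the same multi-node resource shape can be captured in an SBATCH script. The sketch below mirrors the 2-node example above; the partition name and ROCm version are examples to adapt:

```
#!/bin/bash
#SBATCH --job-name=multinode-test            # placeholder job name
#SBATCH --partition=256C8G1H_MI355X_Ubuntu22 # replace with your partition
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=12
#SBATCH --gres=gpu:8
#SBATCH --mem=0
#SBATCH --output=%x_%j.out

# Load the OS-specific module environment on the batch node.
source /etc/profile.d/modules.sh
source /shared/apps/aac.modules.bash
module purge
module load rocm-6.4.2

# One task per node just to confirm every node joined the job;
# replace with your multi-node launcher (e.g., mpirun/srun of your app).
srun --ntasks-per-node=1 hostname
```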
FAQ
Q: What does the "Invalid account or account/partition combination specified" error mean?
It means you do not have access to the nodes behind the specified <QUEUE_NAME>. Both "salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified" and "sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified" indicate that you specified a <QUEUE_NAME> you do not have access to.
Sometimes copy-pasting a command introduces an invalid character that causes this error. Type the command in manually to verify whether that works.
Q: What does the "invalid partition specified" error mean?
It means that the <QUEUE_NAME> specified does not exist. Run sinfo -o "%P" to list valid queues.
Q: How do I fix “rocminfo: command not found” or “Command 'rocminfo' not found … Please ask your administrator.”?
Load the ROCm environment with module load rocm-<ROCm Version> (example: rocm-6.4.2).