Quick Start AAC Slurm Cluster User Guide
How to Log in to the AAC Slurm Cluster
Use ssh to log in to the AAC Slurm Cluster with your registered userid and SSH key:
ssh registered-userid@aacXX.amd.com (example: for the MI355 cluster, use aac13.amd.com)
To register for an account to access the AAC Slurm Cluster, please generate an SSH key pair and send the public SSH key to get your account created. Contact your AMD Sponsor for more details.
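For example, a key pair can be generated with ssh-keygen; the ed25519 key type and the comment shown here are illustrative choices, not requirements:
ssh-keygen -t ed25519 -C "registered-userid@amd-aac"
The public key to send is then the contents of ~/.ssh/id_ed25519.pub.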
AAC Slurm Cluster Partition Naming Convention
The Slurm partitions are named according to each Compute Node's CPU count, GPU count, number of GPU hives, cluster interconnect, GPU product, and the Operating System distribution of the nodes, as follows:
<n>C<m>G<p>H_<GPU_Product>_<OS>
where <n> is total number of CPUs per node, <m> is total number of GPUs per node, <p> is total number of GPU hives per node, <GPU_Product> is the AMD Instinct Accelerator Product name, and <OS> is the Operating System distribution running on the nodes in the cluster.
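For example, the partition name 256C8G1H_MI355X_Ubuntu22 (used in the multi-node allocation examples later in this guide) denotes nodes with 256 CPUs, 8 GPUs, and 1 GPU hive, equipped with AMD Instinct MI355X accelerators and running Ubuntu 22.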
The following partitions are currently configured in the AAC Slurm cluster:
XXXC8G1H_MI3XXX_Ubuntu22
Each user will be given access permissions to specific partition(s)/queue(s).
How to Display All Slurm Partition/Queue Information
Use sinfo to list all the partitions and available Compute Nodes:
sinfo --format="%.32P %.5a %.10l %.16F %N"
The partition marked with an asterisk (*) is the default partition. However, a partition/queue name must still be specified when submitting jobs using Slurm commands.
How to Submit a Batch Job
Use the sbatch command to queue an SBATCH script for execution on a specific partition, using the -p option to specify the partition:
sbatch -p <specific_partition_name> <path_to_the_sbatch_file>
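For example, assuming access to the 256C8G1H_MI355X_Ubuntu22 partition and an SBATCH file named example1gpu.sbatch (both names here are illustrative):
sbatch -p 256C8G1H_MI355X_Ubuntu22 example1gpu.sbatch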
How to Create an SBATCH File
There are sample SBATCH scripts located at https://github.com/amddcgpuce/sbatchfiles/tree/main for submitting AMD Infinity Hub applications (https://www.amd.com/en/technologies/infinity-hub) and other sample applications as 1-, 2-, 4-, and 8-GPU workloads on a single Compute Node. These can be used as a starting point for writing your own SBATCH script for your application.
Here is a template SBATCH file for a 1-GPU workload on the AAC Slurm cluster: https://github.com/amddcgpuce/sbatchfiles/blob/main/example1gpu.sbatch
NOTE: Please include the following lines in the SBATCH file for the AAC Slurm cluster; they load the correct OS-specific module environment corresponding to the OS running on the Compute Node. At the time of this writing, the ROCm 6.4.2 environment is the latest available.
source /etc/profile.d/modules.sh
source /shared/apps/aac.modules.bash
module purge
module load rocm-<ROCm Version> (example rocm-6.4.2)
The line module load rocm-<ROCm Version> loads the specified ROCm environment module (for example, rocm-6.4.2) when the job runs on the Compute Node.
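Putting these pieces together, here is a minimal 1-GPU SBATCH sketch; the job name, output file, and time limit are illustrative placeholders, and the example1gpu.sbatch template linked above remains the authoritative reference:
#!/bin/bash
#SBATCH --job-name=rocm-test          # illustrative job name
#SBATCH --output=rocm-test_%j.out     # illustrative output file (%j = job ID)
#SBATCH --gres=gpu:1                  # request 1 GPU
#SBATCH --time=01:00:00               # illustrative time limit

# Load the OS-specific module environment and ROCm (required on the AAC Slurm cluster)
source /etc/profile.d/modules.sh
source /shared/apps/aac.modules.bash
module purge
module load rocm-6.4.2

# Replace with your application; rocminfo is shown only as a sanity check
rocminfo | head
Submit it with sbatch -p <specific_partition_name> as described above.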
How to Check a Queued Job Status
Use the squeue command shown below to view the status of submitted jobs:
squeue --format="%.18i %.32P %.8j %.8u %.2t %.10M %.6D %R %n"
How to Find Details of a Submitted or Queued Job
Use scontrol to show the details of a job by its job ID. For example, to view the details of job ID 21851, use:
scontrol show job 21851
How to Allocate and SSH to a Compute Node
Use salloc to allocate a whole node from the assigned partition/queue and SSH to the allocated node.
salloc --exclusive --mem=0 --gres=gpu:8 -p <QUEUE_NAME>
Here --exclusive requests exclusive use of the node, --mem=0 requests all of the node's memory, --gres=gpu:8 requests all 8 GPUs, and -p <QUEUE_NAME> names the partition/queue you have access to.
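For example, assuming access to the 256C8G1H_MI355X_Ubuntu22 queue, a whole node can be allocated and reached as follows (the hostname is whatever squeue reports for your job):
salloc --exclusive --mem=0 --gres=gpu:8 -p 256C8G1H_MI355X_Ubuntu22
ssh <COMPUTENODE_HOSTNAME>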
How to Create Additional SSH Sessions to the Allocated Compute Node (previously allocated using the salloc command)
Additional SSH connections to the salloc-allocated node can be started by the user:
ssh -A -J <USERID>@aacXX.amd.com <USERID>@<COMPUTENODE_HOSTNAME>
How to Set up ROCm Environment on the Compute Node
After SSH-ing to the Compute Node, load the ROCm environment using:
module load rocm-<ROCm Version> (example: rocm-6.4.2)
The ROCm environment module should be loaded from each SSH session.
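Once the module is loaded, you can optionally confirm that the ROCm tools are on your PATH; for example (output will vary by node):
rocminfo | head
rocm-smi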
How to Display the Accessible Slurm Queues
To show the Slurm partitions/queues your account has access to, run:
sshare -nm --format="Partition%32" | tr '[:lower:]' '[:upper:]'
How to Enable GCC/GFORTRAN toolset 11 on RHEL8
Use salloc -p <desired_partition>_RHEL8 to allocate, and then SSH to, a Compute Node running the RHEL8 OS distribution. To enable gcc-toolset-11, start a new bash shell:
$ scl enable gcc-toolset-11 bash
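In the new shell you can verify that the toolset is active; gcc and gfortran should report an 11.x version:
$ gcc --version
$ gfortran --version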
How to Use Podman to Run Docker Containers
To start the amddcgpuce/rocm:6.1.2-ub22 Docker image in interactive mode using podman, with $HOME mounted as /workdir inside the Docker container:
podman run -it --privileged --network=host --ipc=host -v $HOME:/workdir -v /shareddata:/shareddata -v /shared:/shared --workdir /workdir docker://amddcgpuce/rocm:6.1.2-ub22 bash
This will pull the Docker image, start an interactive session with the current working directory set to /workdir, and start a bash shell.
Create work files under /workdir, which is $HOME and is accessible from ALL nodes in any SSH session.
To exit the Docker interactive session, type exit <return> to return to the SSH login prompt on the allocated node.
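Podman can also run a single command non-interactively with the same mounts; for example (the rocm-smi call is illustrative and assumes the image provides it, as the ROCm images normally do):
podman run --rm --privileged --network=host --ipc=host -v $HOME:/workdir -v /shareddata:/shareddata -v /shared:/shared --workdir /workdir docker://amddcgpuce/rocm:6.1.2-ub22 rocm-smi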
Suggested Best Practices:
- To release cache space used by Docker images, run: podman system prune -a
- Keep work files OUTSIDE the container, under $HOME, and use Docker as a customized tools environment with packages installed specifically for your application development.
- Files under $HOME are visible on ALL Compute Nodes, so you can resume work with Docker images on ANY allocated node.
NOTE: Podman is configured to cache Docker images on local storage on the Compute Node. Cached images will be removed periodically to free up disk space.
How to Reattach to a Podman Container After Exiting It
From an SSH session on the Compute Node:
Use podman ps -a to get the CONTAINER ID of the exited container.
Use podman start <CONTAINER ID> to start the container.
Use podman attach <CONTAINER ID> to attach and get back to the bash shell prompt.
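A worked example of the full sequence, using a hypothetical CONTAINER ID:
podman ps -a            # note the CONTAINER ID of the exited container, e.g. 3f1c2d4e5a6b
podman start 3f1c2d4e5a6b
podman attach 3f1c2d4e5a6b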
How to Exit the Podman Container and Return to the Compute Node Shell Prompt
Type exit to leave the interactive Docker session and return to the Compute Node shell prompt.
How to Pull a Private Docker Image
To pull private Docker images, first log in to the account:
podman login docker.io
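Once logged in, pull the image; the repository and tag below are placeholders for your own private image:
podman pull docker://docker.io/<your-namespace>/<private-image>:<tag>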
Log out when done:
podman logout docker.io
How to Allocate a Multi-Node Compute Cluster
Allocate a 2-node MI355 cluster (example: 2-node MI355 Ubuntu 22):
salloc -N 2 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p <QUEUE_NAME>
Ex: salloc -N 2 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p 256C8G1H_MI355X_Ubuntu22
Allocate a 4-node MI355 cluster:
salloc -N 4 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p <QUEUE_NAME>
Ex: salloc -N 4 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p 256C8G1H_MI355X_Ubuntu22
Allocate an 8-node MI355 cluster:
salloc -N 8 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p <QUEUE_NAME>
Ex: salloc -N 8 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p 256C8G1H_MI355X_Ubuntu22
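Once the allocation is granted, srun launches tasks across all allocated nodes; a quick sanity check (illustrative) is:
srun hostname
which prints the hostname of each allocated node once per task.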
FAQ
Q: What does the "Invalid account or account/partition combination specified" error mean?
It means that the user does not have access to the nodes behind the specified <QUEUE_NAME>.
The errors "salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified" and "sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified" indicate that a <QUEUE_NAME> was specified to which the user does not have access permissions.
Sometimes copy-pasting the command can introduce an invalid character that causes this error. Please type the command in manually to verify whether that resolves it.
Q: What does the "error: invalid partition specified" error message mean?
It means that the specified <QUEUE_NAME> does not exist. Run sinfo -o "%P" to list valid queues.
Q: How do I fix “rocminfo: command not found” or “Command 'rocminfo' not found … Please ask your administrator.”?
Load the ROCm environment with module load rocm-<ROCm Version> (example: rocm-6.4.2), as described above.