Quick Start AAC Slurm Cluster User Guide
This guide explains how to use the AMD Accelerator Cloud (AAC) Slurm cluster.
How to log in to the AAC Slurm cluster
Use ssh to log in to the AAC Slurm cluster with your registered user ID and SSH key:
ssh registered-userid@aacXX.amd.com (example - for MI355 Cluster use aac13.amd.com)
To register for an account on the AAC Slurm cluster, generate an SSH key pair and send the public key to have your account created. Contact your AMD sponsor for more details.
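A key pair can be generated with ssh-keygen. The key type, file name, and comment below are illustrative choices, not AAC requirements (an empty passphrase is used here for brevity; consider setting one in practice):

```shell
# Create ~/.ssh if it does not exist, then generate an ed25519 key pair.
# Only the public key (the .pub file) is sent for registration --
# never share the private key.
mkdir -p "$HOME/.ssh"
ssh-keygen -t ed25519 -f "$HOME/.ssh/id_ed25519_aac" -N "" -C "aac-access"

# Print the public key to copy into your registration request.
cat "$HOME/.ssh/id_ed25519_aac.pub"
```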
AAC Slurm Cluster partition naming convention
Each Slurm partition is named after the compute nodes it contains: the number of CPUs, GPUs, and GPU hives per node, the GPU product, and the operating system distribution of the nodes, as follows:
<n>C<m>G<p>H_<GPU_Product>_<OS>
where <n> is the total number of CPUs per node, <m> is the total number of GPUs per node, <p> is the total number of GPU hives per node, <GPU_Product> is the AMD Instinct accelerator product name, and <OS> is the operating system distribution running on the nodes in the cluster.
The following are the currently configured partitions in AAC Slurm cluster:
XXXC8G1H_MI3XXX_Ubuntu22
You will be given access to specific partition(s)/queue(s).
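As a sketch, the naming convention above can be decoded with a small bash snippet; the partition name used here is an example and not necessarily one you have access to:

```shell
#!/usr/bin/env bash
# Decode an AAC partition name of the form <n>C<m>G<p>H_<GPU_Product>_<OS>.
# The example name below is illustrative only.
name="256C8G1H_MI355X_Ubuntu22"

if [[ "$name" =~ ^([0-9]+)C([0-9]+)G([0-9]+)H_([^_]+)_(.+)$ ]]; then
    echo "CPUs per node: ${BASH_REMATCH[1]}"
    echo "GPUs per node: ${BASH_REMATCH[2]}"
    echo "GPU hives:     ${BASH_REMATCH[3]}"
    echo "GPU product:   ${BASH_REMATCH[4]}"
    echo "OS:            ${BASH_REMATCH[5]}"
else
    echo "Name does not match the AAC convention" >&2
fi
```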
How to display all Slurm partition/queue information
Use sinfo to list all partitions and their available compute nodes:
sinfo --format="%.32P %.5a %.10l %.16F %N"
The partition marked with an asterisk (*) is the default partition. However, a partition/queue name must always be specified when submitting jobs with Slurm commands.
How to submit a batch job
Use the sbatch command to queue an SBATCH script to be executed on a specific partition, using the -p option to specify the partition to use:
sbatch -p <partition_name> <path_to_the_sbatch_file>
How to create an SBATCH file
Sample SBATCH scripts for AMD Infinity Hub applications (https://www.amd.com/en/technologies/infinity-hub) and other sample applications are located at https://github.com/amddcgpuce/sbatchfiles/tree/main. They cover 1-, 2-, 4-, and 8-GPU workloads on a single compute node and can be used as a starting point for writing your own SBATCH script.
Here's a template for an SBATCH file for a 1 GPU workload on AAC Slurm cluster: https://github.com/amddcgpuce/sbatchfiles/blob/main/example1gpu.sbatch
Note: Include these lines in the SBATCH file for the AAC Slurm cluster to load the correct OS-specific module environment for the compute node. At the time of this writing, the ROCm 6.4.2 environment is the latest available.
source /etc/profile.d/modules.sh
source /shared/apps/aac.modules.bash
module purge
module load rocm-<ROCm Version>   # example: rocm-6.4.2
The line module load rocm-<ROCm Version> loads the ROCm toolchain and sets up the environment variables for that version.
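Putting the pieces above together, a minimal single-GPU SBATCH script might look like the following sketch; the partition name, job name, and the rocminfo check are placeholders to adapt to your queue and application:

```
#!/bin/bash
#SBATCH --job-name=rocm-test                 # placeholder job name
#SBATCH --partition=256C8G1H_MI355X_Ubuntu22 # replace with a partition you can access
#SBATCH --nodes=1
#SBATCH --gres=gpu:1                         # request one GPU
#SBATCH --output=%x_%j.out                   # <job-name>_<job-id>.out

# Load the OS-specific module environment (required on AAC compute nodes).
source /etc/profile.d/modules.sh
source /shared/apps/aac.modules.bash
module purge
module load rocm-6.4.2

# Replace with your application; rocminfo just verifies the GPU is visible.
rocminfo
```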
How to check a queued job status
Use squeue to view submitted job status:
squeue --format="%.18i %.32P %.8j %.8u %.2t %.10M %.6D %R %n"
How to find details of a submitted or queued job
Use scontrol to show details of the job ID.
For example, to view details of job ID 21851, use
scontrol show job 21851
How to allocate and SSH to a compute node
Use salloc to allocate a whole node from your assigned partition/queue and SSH to the allocated node.
salloc --exclusive --mem=0 --gres=gpu:8 -p <QUEUE_NAME>
Where --exclusive reserves the whole node, --mem=0 grants access to all of the node's memory, --gres=gpu:8 requests all eight GPUs, and -p selects your assigned partition/queue.
How to create additional SSH sessions to a compute node allocated with salloc
You can start additional SSH connections to the salloc-allocated node:
ssh -A -J <USERID>@aacXX.amd.com <USERID>@<COMPUTENODE_HOSTNAME>
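As an alternative to typing the jump-host option each time, a ProxyJump entry in ~/.ssh/config can be used. This is a sketch; the host alias, user ID, and host names are placeholders to substitute with your values:

```
# ~/.ssh/config -- replace <USERID>, aacXX, and <COMPUTENODE_HOSTNAME>
Host aac
    HostName aacXX.amd.com
    User <USERID>
    ForwardAgent yes

Host <COMPUTENODE_HOSTNAME>
    User <USERID>
    ProxyJump aac
```

With this in place, ssh <COMPUTENODE_HOSTNAME> connects through the login node automatically.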
How to set up ROCm environment on the compute node
After SSHing to the compute node, load the ROCm environment using:
module load rocm-<ROCm Version>   # example: rocm-6.4.2
Note that the ROCm environment module must be loaded in each new SSH session.
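As an optional convenience (an assumption of this guide, not an AAC requirement), the module setup can be appended to ~/.bashrc so every new shell on a compute node loads ROCm automatically; the version below is an example:

```shell
# Append the AAC module setup to ~/.bashrc so each new shell loads ROCm.
# "rocm-6.4.2" is an example version; adjust to what `module avail` lists.
cat >> "$HOME/.bashrc" <<'EOF'
if [ -f /etc/profile.d/modules.sh ]; then
    source /etc/profile.d/modules.sh
    source /shared/apps/aac.modules.bash
    module load rocm-6.4.2
fi
EOF
```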
How to display the accessible Slurm queues
To show the <QUEUE_NAME> accessible to you, run:
sshare -nm --format="Partition%32" | tr '[:lower:]' '[:upper:]'
How to enable GCC/GFORTRAN toolset 11 on RHEL8
Use salloc -p <desired_partition>_RHEL8 to allocate and ssh to a compute node running RHEL8. To enable gcc-toolset-11, start a new bash shell:
$ scl enable gcc-toolset-11 bash
How to use Podman to run Docker containers
To start the amddcgpuce/rocm:6.1.2-ub22 Docker image in interactive mode using podman, with $HOME mounted as /workdir inside the container:
podman run -it --privileged --network=host --ipc=host -v $HOME:/workdir -v /shareddata:/shareddata -v /shared:/shared --workdir /workdir docker://amddcgpuce/rocm:6.1.2-ub22 bash
This pulls the Docker image, starts an interactive session with the working directory set to /workdir, and opens a bash shell.
Create work files under /workdir, which is your $HOME and is accessible from every node you SSH into.
To exit the interactive Docker session, type exit to return to the shell prompt on the allocated node.
Suggested best practices
- Release cache space used by Docker images: podman system prune -a
- Keep work files OUTSIDE the container, under $HOME, and use Docker as a customized tools environment with packages installed specifically for your application development.
- Files under $HOME are visible on ALL compute nodes, so you can resume work with Docker images on ANY allocated node.
Note: Podman caches Docker images on local storage on the compute node. Cached images are removed periodically to free disk space.
How to reattach to a Podman container after exiting it
From SSH session on Compute Node,
Use podman ps -a to get the CONTAINER ID of the exited container.
Use podman start <CONTAINER ID> to restart the container.
Use podman attach <CONTAINER ID> to attach and return to the bash shell prompt.
How to exit Podman container and get back to compute node shell prompt
Type exit to leave the interactive session and return to the compute node shell prompt.
How to pull a private Docker image
To pull private Docker images, first log in to your registry account:
podman login docker.io
Log out when done:
podman logout docker.io
How to allocate multi-node compute cluster
Allocate a 2-node MI355 cluster (example: 2-node MI355, Ubuntu 22):
salloc -N 2 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p <QUEUE_NAME>
Ex: salloc -N 2 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p 256C8G1H_MI355X_Ubuntu22
Allocate 4-node MI355 cluster
salloc -N 4 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p <QUEUE_NAME>
Ex: salloc -N 4 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p 256C8G1H_MI355X_Ubuntu22
Allocate 8-node MI355 cluster
salloc -N 8 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p <QUEUE_NAME>
Ex: salloc -N 8 --cpus-per-task=12 --mem=0 --gres=gpu:8 --ntasks-per-node=8 -p 256C8G1H_MI355X_Ubuntu22
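For batch work, the same multi-node resource shape can be captured in an SBATCH script. The sketch below mirrors the 2-node example above; the partition name and ROCm version are examples to adapt:

```
#!/bin/bash
#SBATCH --job-name=multinode-test            # placeholder job name
#SBATCH --partition=256C8G1H_MI355X_Ubuntu22 # replace with your partition
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=12
#SBATCH --gres=gpu:8
#SBATCH --mem=0
#SBATCH --output=%x_%j.out

# Load the OS-specific module environment on the batch node.
source /etc/profile.d/modules.sh
source /shared/apps/aac.modules.bash
module purge
module load rocm-6.4.2

# One task per node just to confirm every node joined the job;
# replace with your multi-node launcher (e.g., mpirun/srun of your app).
srun --ntasks-per-node=1 hostname
```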
FAQ
Q: What does the "Invalid account or account/partition combination specified" error mean?
It means you do not have access to the nodes behind the specified <QUEUE_NAME>. Both "salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified" and "sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified" indicate that you specified a <QUEUE_NAME> you do not have access to.
Sometimes copy-pasting a command introduces an invalid character that causes this error. Type the command in manually to verify whether that works.
Q: What does the "invalid partition specified" error mean?
It means that the <QUEUE_NAME> specified does not exist. Run sinfo -o "%P" to list valid queues.
Q: How do I fix “rocminfo: command not found” or “Command 'rocminfo' not found … Please ask your administrator.”?
Load the ROCm environment with module load rocm-<ROCm Version> (example: rocm-6.4.2).