How to Build and Run RCCL Tests on AMD Accelerator Cloud (AAC) Slurm Cluster

Examples below use the Slurm partitions for MI325X and MI355X clusters.

Clone and build RCCL tests

Allocate and SSH to a node from your partition:

salloc --exclusive --mem=0 --gres=gpu:8 -p <Partition_Name> --account=<ACCOUNT_NAME>

Example:

salloc --exclusive --mem=0 --gres=gpu:8 -p 256C8G1H_MI325X_Ubuntu22 --account=myteam

In the SSH session, load ROCm environment:

module load rocm/<version>

Example:

module load rocm/7.2.0

Clone rccl-tests git repository

cd $HOME
git clone https://github.com/ROCmSoftwarePlatform/rccl-tests.git

Change to rccl-tests directory

cd $HOME/rccl-tests/

Compile the tests with MPI support

./install.sh --mpi --mpi_home=$MPI_HOME --rccl_home=$ROCM_PATH/rccl
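Before moving on, it can help to confirm that the build produced the two test programs used later in this guide. The following is a minimal sketch; check_binaries is a hypothetical helper, not part of rccl-tests, and the demonstration runs against a throwaway directory rather than a real build.

```shell
# check_binaries DIR: report whether the two test programs used in this
# guide exist and are executable in DIR. Hypothetical helper, not part of rccl-tests.
check_binaries() {
    dir="$1"
    missing=0
    for bin in all_reduce_perf alltoall_perf; do
        if [ -x "$dir/$bin" ]; then
            echo "found:   $dir/$bin"
        else
            echo "missing: $dir/$bin"
            missing=1
        fi
    done
    return "$missing"
}

# On the cluster: check_binaries "$HOME/rccl-tests/build"
# Offline demonstration against a throwaway directory (not a real build):
DEMO_DIR=$(mktemp -d)
touch "$DEMO_DIR/all_reduce_perf" "$DEMO_DIR/alltoall_perf"
chmod +x "$DEMO_DIR/all_reduce_perf" "$DEMO_DIR/alltoall_perf"
check_binaries "$DEMO_DIR" && BUILD_OK=yes || BUILD_OK=no
echo "build ok: $BUILD_OK"
rm -rf "$DEMO_DIR"
```

The function returns a non-zero status if either program is missing, so it can also gate a later step in a script.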

Single-node multi-GPU RCCL test example

Allocate and SSH to a node:

salloc --exclusive --mem=0 --gres=gpu:8 -p <Partition_Name> --account=<ACCOUNT_NAME>

Example:

salloc --exclusive --mem=0 --gres=gpu:8 -p 256C8G1H_MI355X_Ubuntu22 --account=myteam

In the SSH session, load ROCm environment:

module load rocm/7.2.0

Change to $HOME/rccl-tests/build directory

cd $HOME/rccl-tests/build

Run all_reduce_perf on 8 GPUs/GCDs

mpirun -np 8 --mca coll_hcoll_enable 0 --mca btl ^self,vader,openib -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_NET_GDR_LEVEL=3 -x UCX_IB_PCI_RELAXED_ORDERING=on -x UCX_TLS=self,sm,rc_x -x NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_4:1,mlx5_6:1  $HOME/rccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g 1 -c 0
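In the command above, -b 8 -e 16G -f 2 sweeps message sizes from 8 bytes up to 16 GiB, multiplying the size by 2 at each step. The sketch below only counts how many sizes that sweep visits, assuming the test steps geometrically and includes both end points; it does not run any RCCL code.

```shell
# Count the message sizes visited by "-b 8 -e 16G -f 2".
# Assumption: sizes grow by the step factor and the end size is inclusive.
size=8                                   # -b 8   (start size in bytes)
end=$((16 * 1024 * 1024 * 1024))         # -e 16G (end size in bytes)
factor=2                                 # -f 2   (multiplication factor)
SWEEP_COUNT=0
while [ "$size" -le "$end" ]; do
    SWEEP_COUNT=$((SWEEP_COUNT + 1))
    size=$((size * factor))
done
echo "message sizes in sweep: $SWEEP_COUNT"
```

Lowering the -e value is a simple way to shorten a run when only small or medium messages are of interest.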

Multinode multi-GPU/GCD RCCL tests example

Allocate a 2-node cluster:

salloc --exclusive --mem=0 --gres=gpu:8 -p <Partition_Name> -N 2 --account=<ACCOUNT_NAME>

Example:

salloc --exclusive --mem=0 --gres=gpu:8 -p 256C8G1H_MI325X_Ubuntu22 -N 2 --account=myteam

In the SSH session, load ROCm environment:

module load rocm/7.2.0

Change to $HOME/rccl-tests/build directory

cd $HOME/rccl-tests/build

Get list of nodes currently allocated

scontrol show hostname $SLURM_NODELIST

Example output

node1
node2
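The node names printed above can be turned into the host:slots string that mpirun -H expects, instead of typing it by hand. The following is a minimal sketch; build_host_list is a hypothetical helper, and the slot count of 8 assumes 8 GPUs/GCDs per node as in the rest of this guide.

```shell
# build_host_list SLOTS: read hostnames (one per line) on stdin and print
# them as "host1:SLOTS,host2:SLOTS,...". Hypothetical helper for this guide.
build_host_list() {
    slots="${1:-8}"
    awk -v s="$slots" 'NF { printf "%s%s:%s", sep, $1, s; sep = "," } END { print "" }'
}

# On the cluster:
#   HOSTS=$(scontrol show hostname "$SLURM_NODELIST" | build_host_list 8)
# Offline demonstration using the example output shown above:
HOSTS=$(printf 'node1\nnode2\n' | build_host_list 8)
echo "$HOSTS"
```

The resulting value can then be passed as -H "$HOSTS" in the mpirun commands that follow.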

Example: All Reduce

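Run all_reduce_perf on 8 GPUs/GCDs on each node of the 2-node cluster (16 ranks in total). Use the -H option to list the hosts from $SLURM_NODELIST, each followed by its slot count (:8). Replace <node1> and <node2> with the node names returned by scontrol.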

mpirun -np 16 -H <node1>:8,<node2>:8 --mca coll_hcoll_enable 0 --mca btl ^self,vader,openib -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_NET_GDR_LEVEL=3 -x UCX_IB_PCI_RELAXED_ORDERING=on -x UCX_TLS=self,sm,rc_x -x NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_4:1,mlx5_6:1  $HOME/rccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g 1 -c 0

Example: All to All

Run alltoall_perf on 8 GPUs/GCDs on each node of the 2-node cluster (16 ranks in total). Use the -H option to list the hosts from $SLURM_NODELIST, each followed by its slot count (:8).

mpirun -np 16 -H <node1>:8,<node2>:8 --mca coll_hcoll_enable 0 --mca btl ^self,vader,openib -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_NET_GDR_LEVEL=3 -x UCX_IB_PCI_RELAXED_ORDERING=on -x UCX_TLS=self,sm,rc_x -x NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_4:1,mlx5_6:1  $HOME/rccl-tests/build/alltoall_perf -b 8 -e 16G -f 2 -g 1 -c 0

Run alltoall_perf on 8 GPUs/GCDs on each node of an 8-node cluster (64 ranks in total). This requires an allocation made with -N 8.

mpirun -np 64 -H <node1>:8,<node2>:8,<node3>:8,<node4>:8,<node5>:8,<node6>:8,<node7>:8,<node8>:8 --mca coll_hcoll_enable 0 --mca btl ^self,vader,openib -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_NET_GDR_LEVEL=3 -x UCX_IB_PCI_RELAXED_ORDERING=on -x UCX_TLS=self,sm,rc_x -x NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_4:1,mlx5_6:1  $HOME/rccl-tests/build/alltoall_perf -b 8 -e 16G -f 2 -g 1 -c 0
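The -np value in these commands is the number of nodes multiplied by the GPUs/GCDs per node. The sketch below shows the arithmetic for the 2-node and 8-node examples; total_ranks is a hypothetical helper, and 8 GPUs per node is assumed as elsewhere in this guide.

```shell
# total_ranks NODES GPUS_PER_NODE: print the value to pass to mpirun -np.
# Hypothetical helper; inside an allocation, Slurm's SLURM_JOB_NUM_NODES
# variable could supply the node count.
total_ranks() {
    echo $(( $1 * $2 ))
}

NP_2=$(total_ranks 2 8)
NP_8=$(total_ranks 8 8)
echo "2 nodes: -np $NP_2"
echo "8 nodes: -np $NP_8"
```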