How to Build and Run RCCL Tests on AMD Accelerator Cloud (AAC) Slurm Cluster
Examples below use the Slurm partitions for MI325X and MI355X clusters.
Clone and build RCCL tests
Allocate and SSH to a node from your partition:
salloc --exclusive --mem=0 --gres=gpu:8 -p <Partition_Name> --account=<ACCOUNT_NAME>
Example:
salloc --exclusive --mem=0 --gres=gpu:8 -p 256C8G1H_MI325X_Ubuntu22 --account=myteam
In the SSH session, load ROCm environment:
module load rocm/<version>
Example:
module load rocm/7.2.0
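The build step in this guide passes $MPI_HOME and $ROCM_PATH to install.sh, so it can help to confirm both are set once the module is loaded. The following is a minimal sketch: require_env is a hypothetical helper name, and whether the rocm module sets MPI_HOME on your cluster is an assumption to verify.

```shell
# require_env is a hypothetical helper: it reports every named variable
# that is unset or empty, and returns non-zero if at least one is missing.
require_env() {
    missing=0
    for name in "$@"; do
        # Indirect lookup via eval keeps this compatible with POSIX sh.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, require_env ROCM_PATH MPI_HOME returns non-zero and names the missing variable on stderr if either one is unset.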
Clone rccl-tests git repository
cd $HOME
git clone https://github.com/ROCmSoftwarePlatform/rccl-tests.git
Change to rccl-tests directory
cd $HOME/rccl-tests/
Build the tests with MPI support
./install.sh --mpi --mpi_home=$MPI_HOME --rccl_home=$ROCM_PATH/rccl
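After the build finishes, the test executables used later in this guide (all_reduce_perf and alltoall_perf) are expected under $HOME/rccl-tests/build. A quick way to confirm they exist before requesting more nodes is sketched below; check_binaries is a hypothetical helper, not part of rccl-tests.

```shell
# check_binaries is a hypothetical helper: given a build directory and a
# list of test names, it reports which expected executables are present.
check_binaries() {
    dir=$1
    shift
    status=0
    for bin in "$@"; do
        if [ -x "$dir/$bin" ]; then
            echo "ok: $bin"
        else
            echo "not found: $bin"
            status=1
        fi
    done
    return "$status"
}
```

For example: check_binaries "$HOME/rccl-tests/build" all_reduce_perf alltoall_perf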
Single node multi-GPU RCCL test example
Allocate and SSH to a node:
salloc --exclusive --mem=0 --gres=gpu:8 -p <Partition_Name> --account=<ACCOUNT_NAME>
Example:
salloc --exclusive --mem=0 --gres=gpu:8 -p 256C8G1H_MI355X_Ubuntu22 --account=myteam
In the SSH session, load ROCm environment:
module load rocm/7.2.0
Change to $HOME/rccl-tests/build directory
cd $HOME/rccl-tests/build
Run all_reduce_perf on 8 GPUs/GCDs
mpirun -np 8 --mca coll_hcoll_enable 0 --mca btl ^self,vader,openib -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_NET_GDR_LEVEL=3 -x UCX_IB_PCI_RELAXED_ORDERING=on -x UCX_TLS=self,sm,rc_x -x NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_4:1,mlx5_6:1 $HOME/rccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g 1 -c 0
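In the command above, -b 8 and -e 16G set the smallest and largest message sizes, and -f 2 is the multiplication factor between steps, so the run sweeps powers of two from 8 bytes upward. As a sanity check on how many result rows to expect, the sketch below counts the sizes in such a sweep. count_sizes is a hypothetical helper, and it assumes that 16G means 16 * 2^30 bytes.

```shell
# count_sizes MIN MAX FACTOR: hypothetical helper that counts the sizes
# visited when starting at MIN bytes and multiplying by FACTOR up to MAX.
count_sizes() {
    size=$1
    max=$2
    factor=$3
    # A factor of 1 or less would never reach MAX, so reject it.
    if [ "$factor" -le 1 ]; then
        echo "factor must be greater than 1" >&2
        return 1
    fi
    count=0
    while [ "$size" -le "$max" ]; do
        count=$((count + 1))
        size=$((size * factor))
    done
    echo "$count"
}

# 16G is assumed here to be 16 * 2^30 = 17179869184 bytes.
count_sizes 8 17179869184 2   # prints 32
```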
Multinode multi-GPU/GCD RCCL tests example
Allocate a 2-node cluster:
salloc --exclusive --mem=0 --gres=gpu:8 -p <Partition_Name> -N 2 --account=<ACCOUNT_NAME>
Example:
salloc --exclusive --mem=0 --gres=gpu:8 -p 256C8G1H_MI325X_Ubuntu22 -N 2 --account=myteam
In the SSH session, load ROCm environment:
module load rocm/7.2.0
Change to $HOME/rccl-tests/build directory
cd $HOME/rccl-tests/build
Get list of nodes currently allocated
scontrol show hostname $SLURM_NODELIST
Example output
node1
node2
Example: All Reduce
Run all_reduce_perf on 8 GPUs/GCDs on each node of the 2-node cluster (16 ranks in total). Use the -H option to list the nodes from $SLURM_NODELIST, with 8 slots per node.
mpirun -np 16 -H <node1>:8,<node2>:8 --mca coll_hcoll_enable 0 --mca btl ^self,vader,openib -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_NET_GDR_LEVEL=3 -x UCX_IB_PCI_RELAXED_ORDERING=on -x UCX_TLS=self,sm,rc_x -x NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_4:1,mlx5_6:1 $HOME/rccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g 1 -c 0
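The -H host list can be derived from the scontrol output rather than typed by hand. A minimal sketch follows; build_host_list is a hypothetical helper name, and the 8 slots per node match the --gres=gpu:8 allocation used in this guide.

```shell
# build_host_list SLOTS: hypothetical helper that reads one hostname per
# line on stdin and prints "host1:SLOTS,host2:SLOTS,..." for mpirun -H.
build_host_list() {
    slots=$1
    # Append ":SLOTS" to each hostname, then join the lines with commas.
    sed "s/\$/:$slots/" | paste -sd, -
}

# Example with the sample hostnames from the scontrol output:
printf 'node1\nnode2\n' | build_host_list 8   # prints node1:8,node2:8
```

Inside an allocation, this could be used as HOSTS=$(scontrol show hostname "$SLURM_NODELIST" | build_host_list 8), followed by passing -H "$HOSTS" to mpirun.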
Example: All to All
Run alltoall_perf on 8 GPUs/GCDs on each node of the 2-node cluster (16 ranks in total). Use the -H option to list the nodes from $SLURM_NODELIST, with 8 slots per node.
mpirun -np 16 -H <node1>:8,<node2>:8 --mca coll_hcoll_enable 0 --mca btl ^self,vader,openib -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_NET_GDR_LEVEL=3 -x UCX_IB_PCI_RELAXED_ORDERING=on -x UCX_TLS=self,sm,rc_x -x NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_4:1,mlx5_6:1 $HOME/rccl-tests/build/alltoall_perf -b 8 -e 16G -f 2 -g 1 -c 0
Run alltoall_perf on 64 GPUs/GCDs (8 on each node) of an 8-node cluster. This requires an allocation made with -N 8.
mpirun -np 64 -H <node1>:8,<node2>:8,<node3>:8,<node4>:8,<node5>:8,<node6>:8,<node7>:8,<node8>:8 --mca coll_hcoll_enable 0 --mca btl ^self,vader,openib -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_NET_GDR_LEVEL=3 -x UCX_IB_PCI_RELAXED_ORDERING=on -x UCX_TLS=self,sm,rc_x -x NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_4:1,mlx5_6:1 $HOME/rccl-tests/build/alltoall_perf -b 8 -e 16G -f 2 -g 1 -c 0
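The -np value in these commands is the node count multiplied by the 8 GPUs/GCDs per node: 16 for 2 nodes, 64 for 8 nodes. Inside a Slurm allocation it can be computed from SLURM_JOB_NUM_NODES instead of being hard-coded. A minimal sketch, where total_ranks is a hypothetical helper name:

```shell
# total_ranks NODES GPUS_PER_NODE: hypothetical helper that prints the
# MPI rank count to pass to mpirun -np (one rank per GPU/GCD).
total_ranks() {
    echo $(( $1 * $2 ))
}

total_ranks 2 8   # prints 16
total_ranks 8 8   # prints 64
```

For example, NP=$(total_ranks "$SLURM_JOB_NUM_NODES" 8) gives the value for -np on an allocation of any size.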