Run RCCL Tests

How to Build and Run RCCL Tests on the Plano Slurm Cluster

The examples below use the 1CN96C8G1H_4IB_MI250_Ubuntu22 Slurm partition, which has MI250 compute nodes.

Clone and Build RCCL Tests

Allocate a node from the partition 1CN96C8G1H_4IB_MI250_Ubuntu22 and SSH to it

salloc --exclusive --mem=0 --gres=gpu:8 -p 1CN96C8G1H_4IB_MI250_Ubuntu22
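
salloc typically leaves you in a shell on the login host; one way to find and reach the granted compute node (a sketch, assuming direct SSH to compute nodes is permitted on this cluster):

ssh $(scontrol show hostname $SLURM_NODELIST | head -n1)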

In the SSH session, load the ROCm 6.1.2 environment

module load rocm-6.1.2 
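
As a quick sanity check that the module took effect, verify the ROCm tools resolve and the GPUs are visible (rocminfo ships with ROCm; MI250 GCDs report as gfx90a):

which rocminfo          # should resolve under the ROCm 6.1.2 install tree
rocminfo | grep -m1 gfx # MI250 GCDs report as gfx90a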

Clone the rccl-tests Git repository

cd $HOME
git clone https://github.com/ROCmSoftwarePlatform/rccl-tests.git

Change to the rccl-tests directory

cd $HOME/rccl-tests/

Compile the tests with MPI support

./install.sh --mpi --mpi_home=$MPI_HOME --rccl_home=$ROCM_PATH/rccl
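
If the build succeeds, install.sh leaves one binary per collective under ./build; a quick check (the exact set of binaries depends on the rccl-tests version):

ls $HOME/rccl-tests/build
# expect all_reduce_perf, alltoall_perf, and the other *_perf binaries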

Single-Node Multi-GPU RCCL Test Example

Allocate a node from the partition 1CN96C8G1H_4IB_MI250_Ubuntu22 and SSH to it

salloc --exclusive --mem=0 --gres=gpu:8 -p 1CN96C8G1H_4IB_MI250_Ubuntu22

In the SSH session, load the ROCm 6.1.2 environment

module load rocm-6.1.2 

Change to the $HOME/rccl-tests/build directory

cd $HOME/rccl-tests/build

Run all_reduce_perf on 8 GPUs/GCDs

mpirun -np 8 --mca coll_hcoll_enable 0 --mca btl ^self,vader,openib -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_NET_GDR_LEVEL=3 -x UCX_IB_PCI_RELAXED_ORDERING=on -x UCX_TLS=self,sm,rc_x -x NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_4:1,mlx5_6:1  $HOME/rccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g 1 -c 0
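
For reference, a sketch of what each knob in the command above does (the mlx5_* device names are specific to these nodes; see the Open MPI, UCX, and RCCL documentation for authoritative definitions):

# -np 8                              one MPI rank per GCD (4 MI250 cards = 8 GCDs)
# --mca coll_hcoll_enable 0          disable HCOLL collective offload
# --mca btl ^self,vader,openib       exclude these Open MPI BTLs so UCX carries the traffic
# -x NCCL_IB_PCI_RELAXED_ORDERING=1  enable relaxed PCIe ordering for RCCL's InfiniBand traffic
# -x NCCL_NET_GDR_LEVEL=3            allow GPUDirect RDMA up to this topology distance
# -x UCX_IB_PCI_RELAXED_ORDERING=on  the same relaxed-ordering hint for UCX
# -x UCX_TLS=self,sm,rc_x            restrict UCX to shared-memory and RC transports
# -x NCCL_IB_HCA=mlx5_0:1,...        pin RCCL to these InfiniBand ports (node-specific)
# -b 8 -e 16G -f 2                   sweep message sizes from 8 B to 16 GiB, doubling each step
# -g 1                               one GPU per MPI rank
# -c 0                               skip result correctness checking for speed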

Multi-Node Multi-GPU/GCD RCCL Test Example

Allocate 2 nodes from the partition 1CN96C8G1H_4IB_MI250_Ubuntu22 and SSH to one of them

salloc --exclusive --mem=0 --gres=gpu:8 -p 1CN96C8G1H_4IB_MI250_Ubuntu22 -N 2

In the SSH session, load the ROCm 6.1.2 environment

module load rocm-6.1.2

Change to the $HOME/rccl-tests/build directory

cd $HOME/rccl-tests/build

Get the list of currently allocated nodes

scontrol show hostname $SLURM_NODELIST

Example Output:

ubb-r09-11
ubb-r09-12
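
Rather than hand-typing host lists, the -H argument for mpirun can be built from the allocation (a sketch, assuming 8 GCDs per node):

HOSTS=$(scontrol show hostname $SLURM_NODELIST | sed 's/$/:8/' | paste -sd,)
echo $HOSTS   # ubb-r09-11:8,ubb-r09-12:8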

Example: All Reduce

Run all_reduce_perf on 8 GPUs/GCDs on each node of the 2-node allocation (16 ranks in total). Use the -H option to list the hosts from $SLURM_NODELIST with 8 slots each.

mpirun -np 16 -H ubb-r09-11:8,ubb-r09-12:8 --mca coll_hcoll_enable 0 --mca btl ^self,vader,openib -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_NET_GDR_LEVEL=3 -x UCX_IB_PCI_RELAXED_ORDERING=on -x UCX_TLS=self,sm,rc_x -x NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_4:1,mlx5_6:1  $HOME/rccl-tests/build/all_reduce_perf -b 8 -e 16G -f 2 -g 1 -c 0

Example: All to All

Run alltoall_perf on 8 GPUs/GCDs on each node of the 2-node allocation (16 ranks in total). Use the -H option to list the hosts from $SLURM_NODELIST with 8 slots each.

mpirun -np 16 -H ubb-r09-11:8,ubb-r09-12:8 --mca coll_hcoll_enable 0 --mca btl ^self,vader,openib -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_NET_GDR_LEVEL=3 -x UCX_IB_PCI_RELAXED_ORDERING=on -x UCX_TLS=self,sm,rc_x -x NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_4:1,mlx5_6:1  $HOME/rccl-tests/build/alltoall_perf -b 8 -e 16G -f 2 -g 1 -c 0

Run alltoall_perf on 8 GPUs/GCDs on each node of an 8-node allocation (64 ranks in total; requires allocating 8 nodes, e.g. with -N 8).

mpirun -np 64 -H ubb-r09-09:8,ubb-r09-10:8,ubb-r09-11:8,ubb-r09-12:8,ubb-r09-13:8,ubb-r09-14:8,ubb-r09-15:8,ubb-r09-17:8 --mca coll_hcoll_enable 0 --mca btl ^self,vader,openib -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_NET_GDR_LEVEL=3 -x UCX_IB_PCI_RELAXED_ORDERING=on -x UCX_TLS=self,sm,rc_x -x NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_4:1,mlx5_6:1  $HOME/rccl-tests/build/alltoall_perf -b 8 -e 16G -f 2 -g 1 -c 0
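
The pattern scales with the allocation: -np is always 8 times the node count. A sketch that derives both values from $SLURM_NODELIST instead of hard-coding them:

NNODES=$(scontrol show hostname $SLURM_NODELIST | wc -l)
HOSTS=$(scontrol show hostname $SLURM_NODELIST | sed 's/$/:8/' | paste -sd,)
mpirun -np $((NNODES * 8)) -H $HOSTS --mca coll_hcoll_enable 0 --mca btl ^self,vader,openib -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_NET_GDR_LEVEL=3 -x UCX_IB_PCI_RELAXED_ORDERING=on -x UCX_TLS=self,sm,rc_x -x NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_4:1,mlx5_6:1 $HOME/rccl-tests/build/alltoall_perf -b 8 -e 16G -f 2 -g 1 -c 0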