How to run PyTorch Multinode
We will use the elastic_ddp.py file to run the PyTorch DDP test:
https://raw.githubusercontent.com/ozziemoreno/files/main/elastic_ddp.py
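For reference, the sketch below gives a rough idea of what such a script does. It is modeled on the standard PyTorch "basic DDP" tutorial (which prints the same "Start running basic DDP example on rank N." lines seen later in this guide); the exact contents of the hosted elastic_ddp.py may differ.
```python
# Minimal sketch of an elastic DDP test script, modeled on the standard
# PyTorch DDP tutorial; the hosted elastic_ddp.py may differ in detail.
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

def demo_basic():
    # torchrun supplies rank/world-size via environment variables,
    # so no explicit init_method is needed here.
    dist.init_process_group("nccl")  # maps to RCCL on ROCm builds
    rank = dist.get_rank()
    print(f"Start running basic DDP example on rank {rank}.")

    # Pin each process to one GPU on its node.
    device_id = rank % torch.cuda.device_count()
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])

    # One dummy forward/backward/step to exercise the gradient all-reduce.
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    demo_basic()
```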
Request 2 nodes with 8 GPUs per node through Slurm:
salloc --exclusive --mem=0 --gres=gpu:8 -p 1CN128C8G2H_2IB_MI210_SLES15 -N 2
- Example: Change -N to 3 to request 3 nodes.
Sample output
salloc: Granted job allocation 61583
salloc: Waiting for resource configuration
salloc: Nodes smc-r07-[07-08] are ready for job
<user>@smc-r07-07:~$
Slurm logs you into the first node (the master node), in this case smc-r07-07. Exiting this session terminates the sessions on all nodes. Log in to the second node, either from the first node or from the Slurm control node in a new terminal:
ssh smc-r07-08
On the master node
Print the list of nodes allocated to the job:
scontrol show hostname $SLURM_NODELIST
Sample output
smc-r07-07
smc-r07-08
Launch the PyTorch container (rocm/pytorch:latest) with podman:
podman run -it --privileged --network=host --ipc=host -v $HOME:/workdir -v /shareddata:/shareddata -v /shared:/shared --workdir /workdir docker://rocm/pytorch:latest bash
Inside the container, copy the elastic_ddp.py file from step 1 to /var/lib/jenkins and launch training with torchrun:
torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=smc-r07-07:29400 elastic_ddp.py
Change --rdzv_endpoint to the master node allocated in your instance; keep the same port (29400).
Change --nnodes to 3 for 3-node testing, and --nproc_per_node to change the number of GPUs used per node.
The training will start but will wait until the torchrun command is launched on the other nodes.
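Optionally, you can verify the rendezvous before running the real test by launching a tiny diagnostic with the same torchrun arguments, substituting it for elastic_ddp.py. This check_env.py is a hypothetical helper (not part of the original test) and relies only on the environment variables torchrun exports to each worker; with 2 nodes and 8 processes per node it should report a world size of 16, with global ranks 0-7 on one node and 8-15 on the other. The output of the real elastic_ddp.py run on the master node is shown below.
```python
# check_env.py (hypothetical helper, not part of the original test):
# prints the rendezvous info torchrun provides to each worker process.
import os
import torch.distributed as dist

def main():
    # These variables are set by torchrun for every worker it spawns.
    for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
        print(f"{var}={os.environ.get(var)}")

    # With no explicit store/init_method, this reads the env:// variables above.
    dist.init_process_group("nccl")
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} joined the job")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```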
Output on master node
[2023-12-19 21:46:44,746] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2023-12-19 21:46:44,746] torch.distributed.run: [WARNING]
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
On the second node
Run the podman, copy, and torchrun commands from step 3 a (the same steps as above). Note that --rdzv_endpoint still points to the master node, so no changes are required in the torchrun command. The training will then proceed on both nodes.
Output on second node
root@smc-r07-08:/var/lib/jenkins# torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=smc-r07-07:29400 elastic_ddp.py
[2023-12-19 21:48:15,473] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2023-12-19 21:48:15,473] torch.distributed.run: [WARNING]
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Start running basic DDP example on rank 9.
Start running basic DDP example on rank 8.
Start running basic DDP example on rank 10.
Start running basic DDP example on rank 12.
Start running basic DDP example on rank 15.
Start running basic DDP example on rank 13.
Start running basic DDP example on rank 11.
Start running basic DDP example on rank 14.
root@smc-r07-08:/var/lib/jenkins#
Output on the first node (training resumes)
Start running basic DDP example on rank 0.
Start running basic DDP example on rank 7.Start running basic DDP example on rank 5.
Start running basic DDP example on rank 4.
Start running basic DDP example on rank 2.
Start running basic DDP example on rank 6.
Start running basic DDP example on rank 3.
Start running basic DDP example on rank 1.
root@smc-r07-07:/var/lib/jenkins#
The above logs indicate that distributed training succeeded in a multi-node environment.
Notes: You can also run single-node training to test an individual node. Just change --nnodes to 1 in the above command:
```
torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=smc-r07-07:29400 elastic_ddp.py
```
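It can also help to confirm that each container actually sees all eight GPUs before launching a multinode run. The script below is a minimal, hypothetical gpu_check.py (not part of the original test) that can be run directly with python inside the rocm/pytorch container on each node.
```python
# gpu_check.py (hypothetical helper): confirm GPU visibility inside the container.
import torch

if __name__ == "__main__":
    print("PyTorch version:", torch.__version__)
    print("GPUs visible   :", torch.cuda.device_count())  # expect 8 per node here
    for i in range(torch.cuda.device_count()):
        print(f"  device {i}: {torch.cuda.get_device_name(i)}")
```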