Run Multinode PyTorch Application
This guide shows how to run a multi-node PyTorch application. Sign in to AAC if you have not already.
Select application
- Click Applications in the navigation bar.
- In the filter, type Multinode and wait for the PyTorch application family to appear.
- Open the family and select the PyTorch Multinode application.
- Click New Workload.
Note
In this example we use AMD Internal as the team.
Select team
If you have more than one team, select one in the pop-up and click Launch.
Select input files
Upload any input files the application needs via Upload files or Browse files, then click Next. You can reuse files from a previous workload. The elastic_ddp.py script can be used for simple testing.
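The guide does not reproduce elastic_ddp.py, so as an illustrative sketch only (this is an assumption, not the actual script), a minimal single-step DDP routine in that spirit might look like the following; it falls back to the Gloo backend so it also runs on a CPU-only machine:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run_one_step():
    """One forward/backward/optimizer step under DDP. Expects the launcher
    to have set RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT."""
    # NCCL for GPU runs; fall back to Gloo so the sketch also works on CPU.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend)

    if torch.cuda.is_available():
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
        model = DDP(nn.Linear(10, 5).to(device), device_ids=[local_rank])
    else:
        device = torch.device("cpu")
        model = DDP(nn.Linear(10, 5).to(device))

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss = nn.MSELoss()(model(torch.randn(20, 10, device=device)),
                        torch.randn(20, 5, device=device))
    loss.backward()   # DDP averages gradients across all ranks here
    optimizer.step()

    dist.destroy_process_group()
    return loss.item()
```

In a real workload, the launcher (for example torchrun) starts one copy of this function per GPU and provides the rendezvous environment variables before `run_one_step()` is called.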
Configure run script
Modify the workload run script and set the input_script variable to your Python script.
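As a hypothetical sketch only (the run script's actual contents are provided by the platform; everything here other than the `input_script` variable is an assumption), the edited portion of the run script might look like this:

```shell
#!/bin/bash
# Hypothetical run-script excerpt: point input_script at the uploaded training script.
input_script="elastic_ddp.py"

# Launch one process per GPU with torchrun, using the values the platform
# injects into each node (the PLEXUS_* and MASTER_* variables).
torchrun --nnodes="$PLEXUS_NUM_NODES" \
         --nproc_per_node="$PLEXUS_NUM_GPUS" \
         --node_rank="$PLEXUS_NODE_INDEX" \
         --master_addr="$MASTER_ADDR" \
         --master_port="$MASTER_PORT" \
         "$input_script"
```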
Each node execution is configured with the following environment variables, which can be used to customize the distributed Python model:
- PLEXUS_NODE_INDEX: Node rank value.
- PLEXUS_NUM_NODES: Number of nodes.
- PLEXUS_NUM_GPUS: Number of GPUs per node.
- PLEXUS_WORLD_SIZE: World size value (number of GPUs * number of nodes).
- PLEXUS_BATCH_RANK: Batch rank value (node index * number of GPUs).
- NCCL backend:
- MASTER_ADDR: IP address or hostname of the node where the master is deployed.
- MASTER_PORT: Port on which the master is listening.
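The arithmetic behind PLEXUS_WORLD_SIZE and PLEXUS_BATCH_RANK can be sketched in a few lines of Python (the helper name `plexus_ranks` is an assumption for illustration, not part of the platform):

```python
import os


def plexus_ranks(env=None):
    """Recompute the derived values from the per-node Plexus variables."""
    env = os.environ if env is None else env
    num_nodes = int(env["PLEXUS_NUM_NODES"])
    num_gpus = int(env["PLEXUS_NUM_GPUS"])      # GPUs per node
    node_index = int(env["PLEXUS_NODE_INDEX"])  # this node's rank
    world_size = num_nodes * num_gpus           # matches PLEXUS_WORLD_SIZE
    batch_rank = node_index * num_gpus          # matches PLEXUS_BATCH_RANK
    return world_size, batch_rank
```

For example, two nodes with eight GPUs each give a world size of 16, and node 1 starts at batch rank 8.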
Select resources
Specify the number of nodes/replicas, the GPUs per node (e.g., 1 GPU), and the maximum allowed runtime for this workload. After selecting the required resources, click Next to proceed.
The workload is cancelled when the maximum allowed runtime is exceeded.
Select compute
Select the cluster and queue (use filters if needed), then click Next.
Review workload submission
Review the workload details; use Change in any section to edit them. At the bottom of the page, the estimated costs for the selected application and compute system are displayed. If everything looks good, click Run Workload to launch the workload. You are then redirected to the workloads page, where a notification shows that the workload is being created and sent to the selected compute.
Monitor workload
After you submit, you can monitor the workload on the workload information page. Workload states:
- Created – The workload has been created in the system
- Sent – The workload has been sent to the queue that you selected in the workload submission process
- Pending – The workload is in a waiting state in the queue
- Running – The workload has started running in the selected queue
- Completed – The workload has successfully finished processing
- Failed – A problem has occurred which has prevented the workload from completing successfully
- Canceled – The workload has been canceled by the user and stopped running
Click the workload to open the workload information page. The left panel shows submission details (name, application, input files, run script, resources). The right panel shows activity, queue status, cost, and workload info.
- View Log – information about the workload throughout the entire process.
- View Stdout – the standard output of the workload, which sometimes includes its results.
- View Stderr – the standard error, which helps you understand issues encountered during the process; it can include logs from the Docker image pull.
- Download log files – download the Log, Stdout, and Stderr log files.
When the script finishes, the workload completes automatically. Files generated in /home/aac are uploaded to Plexus.