
How to run PyTorch Multi Node

This document shows how to run a multi-node PyTorch application.

Select Application

Step 1: To start creating a workload, sign into the AAC web platform.

Step 2: A dashboard will appear. Click the Applications tab in the navigation bar.

Step 3: Type Multinode in the filter section to find the application for the workload.

Note: Wait until the PyTorch application family is displayed on the Applications page.

Step 4: Open the family and select the PyTorch Multinode application.

Step 5: Click New Workload.

Select Team

If the user is assigned to more than one team, a pop-up window will require selecting one of the customer teams to which the user belongs. If only one team is assigned to the user, this step is skipped.

Note: In this case, we have selected Team as AMD Internal.

Click the Start new workload button.

Select Input Files

Step 6: Select Input Files. Upload any input file(s) the application will need to run. Click Upload files, then drag the files into AAC, or click Browse files to open the file dialog window. After selecting input files, click NEXT to proceed.

If a file was already uploaded in a previous workload, you do not need to upload it again.

Note: The elastic_ddp.py script can be used for simple testing.
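The elastic_ddp.py script itself is not reproduced here. As a hypothetical stand-in for a simple test (this is NOT elastic_ddp.py, just a minimal sketch), an uploaded script that merely reports where it is running is enough to verify that every node executes it:

```python
#!/usr/bin/env python3
# Hypothetical stand-in for a simple test script (assumption: not the
# actual elastic_ddp.py). It only reports which host and node index it
# runs on, which is enough to confirm per-node execution.
import os
import platform

def report() -> str:
    # PLEXUS_NODE_INDEX is injected by the platform (see the runscript
    # section); "unknown" is a fallback for local runs.
    node = os.environ.get("PLEXUS_NODE_INDEX", "unknown")
    return f"running on host={platform.node()} node_index={node}"

if __name__ == "__main__":
    print(report())
```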

Configure Runscript

Step 7: Configure the workload runscript to select a custom Python script by modifying the input_script variable.

Every node execution is configured with the proper values, which can be used to customize the distributed Python model:

  • PLEXUS_NODE_INDEX: Node rank value.
  • PLEXUS_NUM_NODES: Number of nodes.
  • PLEXUS_NUM_GPUS: Number of GPUs per node.
  • PLEXUS_WORLD_SIZE: World size value: number of GPUs per node * number of nodes.
  • PLEXUS_BATCH_RANK: Batch rank value: node index * number of GPUs per node.
  • NCCL backend:
    • MASTER_ADDR: IP address or hostname of the node where the master is deployed.
    • MASTER_PORT: Port on which the master is deployed.
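The variables above can be read inside a custom script to derive the distributed-training parameters. The following is an illustrative sketch; the sample values stand in for what the platform would inject on a 2-node, 8-GPU-per-node run (those numbers, and the address, are assumptions for the example):

```python
import os

# Sample values an AAC run might inject (assumptions for illustration only).
os.environ.update({
    "PLEXUS_NODE_INDEX": "1",
    "PLEXUS_NUM_NODES": "2",
    "PLEXUS_NUM_GPUS": "8",
    "MASTER_ADDR": "10.0.0.1",
    "MASTER_PORT": "29500",
})

node_index = int(os.environ["PLEXUS_NODE_INDEX"])
num_nodes = int(os.environ["PLEXUS_NUM_NODES"])
num_gpus = int(os.environ["PLEXUS_NUM_GPUS"])

# Derived values, following the definitions above.
world_size = num_gpus * num_nodes   # PLEXUS_WORLD_SIZE
batch_rank = node_index * num_gpus  # PLEXUS_BATCH_RANK

# Rendezvous address a distributed script would typically build for NCCL.
init_method = f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"

print(world_size, batch_rank, init_method)  # 16 8 tcp://10.0.0.1:29500
```

In a real script, these values would be passed to the distributed initialization (for example, a torch.distributed process group) rather than printed.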

Select Resources

Step 8: Select Resources. Specify how many nodes/replicas, how many GPUs per node (e.g., 1 GPU), and the maximum allowed runtime this workload requires. After selecting the required resources, click NEXT to proceed. The workload will be cancelled when the maximum allowed runtime is exceeded.

Select Compute

Step 9: Select Compute. Clusters or queues can be filtered for running the workload. After selecting the compute parameters, click NEXT.

Review Workload Submission

Step 10: Review the workload submission. Review the information that has been entered for this workload. If any change is needed, click the appropriate section to make revisions.

At the bottom of the review page, the estimated costs for the selected application and the compute system will appear. If everything looks good, click Run Workload to launch the workload. You will then be redirected to the Workloads page, where a notification shows that the workload is being created and sent to the selected compute.

Monitor Workload

Step 11: After submitting a workload, the user can monitor how the workload is performing by checking its status on the workload information page. Each workload goes through several states after it is submitted:

Created – The workload has been created in the system

Sent – The workload has been sent to the queue that you selected in the workload submission process

Pending – The workload is in a waiting state in the queue

Running – The workload has started running in the selected queue

Completed – The workload has successfully finished processing

Failed – A problem has occurred which has prevented the workload from completing successfully

Canceled – The workload has been canceled by the user and stopped running

Step 12: Clicking the selected compute node opens the workload information page. In the left panel, the user can see all the detailed information entered during the workload submission process: workload name, application details, input files, runscript, and resources configuration. In the right panel, the user can see information related to the activity of the workload: how many workloads are queued and running now, the total and estimated cost for the current workload in the cost section, and up-to-the-second details of activity after the workload is created in the workload info section.

View Log - Information about the workload throughout the entire process.

View Stdout - Standard output, which presents the output of a workload and sometimes includes its results.

View Stderr - Standard error, which helps you understand why you may have encountered certain issues during the process. It can also show logs from the Docker image pull.

Download log files - Download the Log, Stdout, and Stderr log files.

Step 13: The workload completes automatically after the script execution finishes.

Once the workload is completed, any file generated in /home/aac will be automatically uploaded to Plexus.