
PyTorch Multinode

This document shows how to run a multi-node PyTorch application.

Select Application

Step 1: To start creating a workload, sign into the AAC.

AAC Sign in

Step 2: A dashboard will appear. Click the Applications tab in the navigation bar.

Step 3: Type the application name Multinode in the filter section to find the application to run.

Note: Wait until the PyTorch application family is displayed on the applications page.

Multinode families

Step 4: Open the family and select the PyTorch Multinode application.

Select application

Step 5: Click New Workload.

New workload

Select Team

If the user is assigned to more than one team, a pop-up window will ask to select one of the customer teams the user belongs to. If the user is assigned to only one team, this selection is not required.

Note: In this case, we have selected Team as AMD Internal.

Select team

Click the Start new workload button.

New workload

Select Input Files

Step 6: Select Input Files. Upload any input file(s) that the application will need to run. Click Upload files, then drag the files into AAC or click Browse files to open the file dialog window. After selecting input files, click NEXT to proceed.

If a file was already uploaded in a previous workload, you do not need to upload it again.

Note: This elastic_ddp.py script can be used for simple testing.

Input files
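For reference, a minimal elastic_ddp.py in the spirit of the PyTorch DDP tutorial might look like the sketch below. The ToyModel class and demo_basic function are only illustrative; the actual script provided with the application may differ.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    """Small two-layer model used only to exercise the DDP setup."""

    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic():
    # When launched with torchrun, RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and
    # MASTER_PORT are set in the environment, so init_process_group can pick them up.
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device_id = rank % torch.cuda.device_count()

    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    # One dummy forward/backward/step to confirm gradients synchronize across ranks.
    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10).to(device_id))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    demo_basic()
```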

Configure Runscript

Step 7: Configure the workload run script to select a custom Python script by modifying the input_script variable.

Each node execution is configured with the values below, which can be used to customize the distributed Python model (see the sketch after this list):

  • PLEXUS_NODE_INDEX: Node rank value.
  • PLEXUS_NUM_NODES: Number of nodes.
  • PLEXUS_NUM_GPUS: Number of GPUs per node.
  • PLEXUS_WORLD_SIZE: World size value (number of GPUs * number of nodes).
  • PLEXUS_BATCH_RANK: Batch rank value (node index * number of GPUs).
  • NCCL backend:
    • MASTER_ADDR: IP or hostname on which the master node is deployed.
    • MASTER_PORT: Port on which the master node is deployed.
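As a sketch of how these values could be consumed, the snippet below reads the Plexus environment variables and initializes the NCCL process group. The init_worker helper and the way it derives the global rank are hypothetical illustrations, not part of the provided run script.

```python
import os
import torch.distributed as dist

# Values injected by Plexus for each node execution (see the list above).
node_index = int(os.environ["PLEXUS_NODE_INDEX"])   # rank of this node
num_nodes  = int(os.environ["PLEXUS_NUM_NODES"])    # total number of nodes
num_gpus   = int(os.environ["PLEXUS_NUM_GPUS"])     # GPUs per node
world_size = int(os.environ["PLEXUS_WORLD_SIZE"])   # num_gpus * num_nodes
batch_rank = int(os.environ["PLEXUS_BATCH_RANK"])   # node_index * num_gpus


def init_worker(local_rank: int) -> int:
    """Hypothetical helper: initialize the process that drives one local GPU."""
    # Global rank = first rank on this node + local GPU index.
    global_rank = batch_rank + local_rank
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}",
        rank=global_rank,
        world_size=world_size,
    )
    return global_rank
```

For example, with 2 nodes and 8 GPUs per node, PLEXUS_WORLD_SIZE is 16 and the second node starts at PLEXUS_BATCH_RANK 8.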

Run script

Select Resources

Step 8: Select Resources. Specify how many nodes/replicas, how many GPUs per node (e.g., 1 GPU), and the maximum allowed runtime required for this specific workload. After selecting the required resources, click NEXT to proceed.

The workload will be cancelled when the maximum allowed runtime is exceeded.

Select replicas

Select Compute

Step 9: Select Compute. Clusters or queues can be filtered to choose where the workload runs. After selecting the compute parameters, click NEXT.

Select queue

Review Workload Submission

Step 10: Review workload submission. Review the information entered for this workload. If any change is needed, click the appropriate section to make revisions.

Workload review

At the bottom of the review page, the estimated costs for the use of the selected application and of the compute system will appear.

If everything looks good, click Run Workload to launch the workload. You will then be redirected to the workloads page, with a notification that the workload is being created and sent to the selected compute.

Monitor Workload

Step 11: After submitting a workload, the user can monitor how the workload is performing by checking the workload status on the workload information page. Each workload goes through several states after it is submitted:

  • Created – The workload has been created in the system
  • Sent – The workload has been sent to the queue that you selected in the workload submission process
  • Pending – The workload is in a waiting state in the queue
  • Running – The workload has started running in the selected queue
  • Completed – The workload has successfully finished processing
  • Failed – A problem has occurred which has prevented the workload from completing successfully
  • Canceled – The workload has been canceled by the user and stopped running

Workload status

Step 12: Click on the selected compute node to open the workload information page.

In the left panel (below), the user can see all the detailed information entered during the workload submission process: workload name, application details, input files, run scripts, and resources configuration.

In the right panel, the user can see information related to the activity of the workload: how many workloads are queued and running now, the total and estimated cost for the current workload in the cost section, and up-to-the-second details of the activity after the workload is created in the workload info section.

Workload parameters

View Log - Information about the workload throughout the entire process.

Workload syslog

View Stdout - Standard output that presents the output of a workload and sometimes includes the results of the workload.

Workload stdout

View Stderr - Standard error that helps you understand why you may have encountered certain issues during the process. It can also show logs from the Docker image pull.

Workload stderr

Download log files - Download the Log, Stdout, and Stderr log files.

Step 13: The workload completes automatically after the script execution finishes.

Once the workload is completed, all files generated in /home/aac will be automatically uploaded to Plexus.

Workload output files
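As an illustration, a script can write its results under /home/aac so they are returned with the workload. The save_checkpoint helper below is a hypothetical example, not part of the platform or the provided script.

```python
import os
import torch
import torch.distributed as dist


def save_checkpoint(model: torch.nn.Module, output_dir: str = "/home/aac") -> None:
    """Hypothetical helper: write a checkpoint where Plexus collects output files."""
    # Only rank 0 writes, so the file is not overwritten by the other ranks.
    if not dist.is_initialized() or dist.get_rank() == 0:
        torch.save(model.state_dict(), os.path.join(output_dir, "model_checkpoint.pt"))
```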