
Run Multinode TensorFlow Application

This guide shows how to run a multi-node TensorFlow application. Sign in to AAC if you have not already.

Select application

  1. Click Applications in the navigation bar.
  2. In the filter, type Multinode and wait for the TensorFlow application family to appear.
  3. Open the family and select the TensorFlow Multinode application.
  4. Click New Workload.

Multinode families

Multinode TensorFlow application

Multinode TensorFlow new workload

Select team

If you have more than one team, select one in the pop-up and click Launch.

Note

In this example, we use the AMD Internal team.

Selected team

Select input files

Upload any input files via Upload files or Browse files, then click Next. You can reuse files from a previous workload. If you have no files to upload, click Next.

Workload input files

Configure run script

Set the input_script variable to the path of your Python script. If you leave it unchanged, the workload runs the default script. Uploaded input files are placed in /home/aac, e.g. input_script=/home/aac/custom_script.py.

Multinode TensorFlow run script
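A custom input_script typically needs to know which node it is running on. TensorFlow's multi-worker strategies (e.g. tf.distribute.MultiWorkerMirroredStrategy) coordinate through the TF_CONFIG environment variable; the sketch below shows how a script might read it. How AAC populates the environment is an assumption here, and the hostnames in the example are hypothetical — check your job's actual environment.

```python
# Sketch: discover this process's role in a multi-node run from TF_CONFIG,
# the JSON environment variable TensorFlow's multi-worker strategies read.
import json
import os


def worker_role(tf_config_json: str):
    """Return (task_type, task_index, num_workers) parsed from a TF_CONFIG string."""
    cfg = json.loads(tf_config_json)
    task = cfg["task"]
    workers = cfg["cluster"]["worker"]
    return task["type"], task["index"], len(workers)


# Example TF_CONFIG for a two-node run (hostnames are hypothetical).
# In a real job you would read os.environ["TF_CONFIG"] instead.
example = json.dumps({
    "cluster": {"worker": ["node0:12345", "node1:12345"]},
    "task": {"type": "worker", "index": 0},
})
print(worker_role(example))  # -> ('worker', 0, 2)
```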

Select resources

Set the number of nodes, GPUs per node, and maximum allowed runtime. Click Next.

Multinode TensorFlow resources

Select compute

Search for and select the queue and cluster on which to run the workload. After selecting the compute parameters, click Next.

Multinode TensorFlow select compute

Review workload submission

Review the workload details. Use Change in any section to edit. Estimated costs appear at the bottom. Click Run Workload to launch; you are redirected to the Workloads page.

Multinode TensorFlow review 1

Multinode TensorFlow review 2

Payment run workload

Monitor workload

After you submit, monitor the workload on the Workloads page and the Workload Information page. Workload states:

Created – The workload has been created in the system

Sent – The workload has been sent to the queue that you selected in the workload submission process

Pending – The workload is in a waiting state in the queue

Running – The workload has started running in the selected queue

Completed – The workload has successfully finished processing

Failed – A problem occurred that prevented the workload from completing successfully

Canceled – The workload has been canceled by the user and stopped running

Multinode TensorFlow workload status
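If you script around workload monitoring, the states above map naturally onto a small state model. A minimal sketch — the state names come from this guide, but the helper and its use are illustrative, not part of any AAC API:

```python
from enum import Enum


class WorkloadState(Enum):
    """Workload states as listed on the Workloads page."""
    CREATED = "Created"
    SENT = "Sent"
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"
    CANCELED = "Canceled"


# Terminal states: once reached, the workload will not change state again.
TERMINAL = {WorkloadState.COMPLETED, WorkloadState.FAILED, WorkloadState.CANCELED}


def is_finished(state: WorkloadState) -> bool:
    """True when a polling loop can stop watching the workload."""
    return state in TERMINAL


print(is_finished(WorkloadState.RUNNING))    # False
print(is_finished(WorkloadState.COMPLETED))  # True
```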

Click the workload to open the Workload Information page. The left panel shows submission details (name, application, input files, run script, resources). The right panel shows activity, queue status, cost, and workload info.

Multinode TensorFlow workload parameters 1

Multinode TensorFlow workload parameters 2

View logs

Click the links on the Workload Information screen to view the job log, stdout log, and stderr log:

View Log - Information about the workload throughout the entire process

Multinode TensorFlow syslog tab

View Stdout - Standard output of the workload, which sometimes includes its results.

Multinode TensorFlow STDOUT tab

View Stderr - Standard error output, useful for diagnosing issues encountered during the run.

Multinode TensorFlow STDERR tab

Download log files - Download the log, stdout, and stderr files.

When the script finishes, the workload completes automatically. Files generated in /home/aac are uploaded to Plexus.

Multinode TensorFlow output files in home directory
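Because files written to /home/aac are collected after the run, a script can simply drop its results there. A sketch of that pattern — the output directory is parameterized only so the example is runnable anywhere; on AAC you would write directly into /home/aac:

```python
import pathlib
import tempfile


def save_results(out_dir: str, name: str, text: str) -> pathlib.Path:
    """Write a result file into out_dir so it is collected after the run.
    On AAC the collected directory is /home/aac; a temp directory is used
    below only to keep this sketch self-contained."""
    path = pathlib.Path(out_dir) / name
    path.write_text(text)
    return path


demo_dir = tempfile.mkdtemp()
result = save_results(demo_dir, "metrics.txt", "loss=0.123\n")
print(result.read_text())  # loss=0.123
```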