
Run Multinode TensorFlow Application

This guide shows how to run a multi-node TensorFlow application. Sign in to AAC if you have not already.

Select application

  1. Click Applications in the navigation bar.
  2. In the filter, type Multinode and wait for the TensorFlow application family to appear.
  3. Open the family and select the TensorFlow Multinode application.
  4. Click New Workload.

Multinode families

Multinode TensorFlow application

Multinode TensorFlow new workload

Select team

If you have more than one team, select one in the pop-up and click Launch.

Note

In this example, we use the AMD Internal team.

Selected team

Select input files

Upload any input files via Upload files or Browse files, then click Next. You can reuse files from a previous workload. If you have no files to upload, click Next.

Workload input files

Configure run script

Set the input_script variable to the path of your Python script. If you leave it unchanged, the workload runs the default script. Uploaded input files are placed in /home/aac, e.g. input_script=/home/aac/custom_script.py.

Multinode TensorFlow run script
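A custom input_script typically needs to know which node it is running on. TensorFlow's multi-worker strategies (e.g. tf.distribute.MultiWorkerMirroredStrategy) coordinate through the TF_CONFIG environment variable; the sketch below shows how a script might read it. How AAC populates the environment is an assumption here, and the hostnames in the example are hypothetical — check your job's actual environment.

```python
# Sketch: discover this process's role in a multi-node run from TF_CONFIG,
# the JSON environment variable TensorFlow's multi-worker strategies read.
import json
import os


def worker_role(tf_config_json: str):
    """Return (task_type, task_index, num_workers) parsed from a TF_CONFIG string."""
    cfg = json.loads(tf_config_json)
    task = cfg["task"]
    workers = cfg["cluster"]["worker"]
    return task["type"], task["index"], len(workers)


# Example TF_CONFIG for a two-node run (hostnames are hypothetical).
# In a real job you would read os.environ["TF_CONFIG"] instead.
example = json.dumps({
    "cluster": {"worker": ["node0:12345", "node1:12345"]},
    "task": {"type": "worker", "index": 0},
})
print(worker_role(example))  # -> ('worker', 0, 2)
```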

Select resources

Set the number of nodes, GPUs per node, and maximum allowed runtime. Click Next.

Multinode TensorFlow resources

Select compute

Search for and select the queue and cluster on which to run the workload. After selecting the compute parameters, click Next.

Multinode TensorFlow select compute

Review workload submission

Review the workload details. Use Change in any section to edit. Estimated costs appear at the bottom. Click Run Workload to launch; you are redirected to the Workloads page.

Multinode TensorFlow review 1

Multinode TensorFlow review 2

Payment run workload

Monitor workload

After you submit, monitor the workload on the Workloads page and the Workload Information page. Workload states:

Created – The workload has been created in the system

Sent – The workload has been sent to the queue that you selected in the workload submission process

Pending – The workload is in a waiting state in the queue

Running – The workload has started running in the selected queue

Completed – The workload has successfully finished processing

Failed – A problem occurred that prevented the workload from completing successfully

Canceled – The workload has been canceled by the user and stopped running

Multinode TensorFlow workload status
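If you script around workload monitoring, the states above map naturally onto a small state model. A minimal sketch — the state names come from this guide, but the helper and its use are illustrative, not part of any AAC API:

```python
from enum import Enum


class WorkloadState(Enum):
    """Workload states as listed on the Workloads page."""
    CREATED = "Created"
    SENT = "Sent"
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"
    CANCELED = "Canceled"


# Terminal states: once reached, the workload will not change state again.
TERMINAL = {WorkloadState.COMPLETED, WorkloadState.FAILED, WorkloadState.CANCELED}


def is_finished(state: WorkloadState) -> bool:
    """True when a polling loop can stop watching the workload."""
    return state in TERMINAL


print(is_finished(WorkloadState.RUNNING))    # False
print(is_finished(WorkloadState.COMPLETED))  # True
```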

Click the workload to open the Workload Information page. The left panel shows submission details (name, application, input files, run script, resources). The right panel shows activity, queue status, cost, and workload info.

Multinode TensorFlow workload parameters 1

Multinode TensorFlow workload parameters 2

View logs

Click the links on the Workload Information screen to view the job log, stdout log, and stderr log:

View Log - Information about the workload throughout the entire process

Multinode TensorFlow syslog tab

View Stdout - Standard output of the workload, which sometimes includes its results.

Multinode TensorFlow STDOUT tab

View Stderr - Standard error output, useful for diagnosing issues encountered during the run.

Multinode TensorFlow STDERR tab

Download log files - Download the log, stdout, and stderr files.

When the script finishes, the workload completes automatically. Files generated in /home/aac are uploaded to Plexus.

Multinode TensorFlow output files in home directory
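Because files written to /home/aac are collected after the run, a script can simply drop its results there. A sketch of that pattern — the output directory is parameterized only so the example is runnable anywhere; on AAC you would write directly into /home/aac:

```python
import pathlib
import tempfile


def save_results(out_dir: str, name: str, text: str) -> pathlib.Path:
    """Write a result file into out_dir so it is collected after the run.
    On AAC the collected directory is /home/aac; a temp directory is used
    below only to keep this sketch self-contained."""
    path = pathlib.Path(out_dir) / name
    path.write_text(text)
    return path


demo_dir = tempfile.mkdtemp()
result = save_results(demo_dir, "metrics.txt", "loss=0.123\n")
print(result.read_text())  # loss=0.123
```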