How To Run TensorFlow Docker Application

Login To AAC

Log in to https://aac.amd.com/.

Select Application

Click on Applications. Select TensorFlow.

In the Select An Application pop-up, select the desired TensorFlow version with the container type set to Docker.

Note: In this case, we have selected the TensorFlow 2-10 ROCm 5-4-1 version with Docker as the container type.

New Workload

Click the New Workload button in the top-right corner.

Select Team

If the user is assigned to more than one team, a pop-up window prompts the user to select one of the customer teams they belong to. If only one team is assigned to the user, this step is not required.

Note: In this case, we have selected Team as AMD Internal.

Click the Start new workload button.

Click the Next button to continue.

Select Resources

On the Select Resources page, specify the number of GPUs (e.g., 1 GPU) and the maximum allowed runtime for the workload. Click the Next button.

Note: The maximum number of GPUs is 8.

Select the cluster and the desired queue to run the job. In this case, 1CN128C8G2H_2IB_MI210_SLES15 (Pre-emptible) - AAC Plano is selected. Click Next.

Review Workload Submission

Review all the selected configurations and click Run Workload.

Once the workload is submitted, its status changes to Running when the queue becomes available. Click on the running workload.

The user can see the system logs in the SYSLOG tab, the output in the STDOUT tab, and any errors in the STDERR tab.

A token is generated in the STDOUT tab, highlighted in yellow as shown below. Copy the token.

Interactive Endpoints

Once the interactive endpoints are enabled, click Connect to launch ML Studio (JupyterLab).

JupyterLab opens. Paste the token into the Password or token field. Click Log in.

The user can now see JupyterLab, which can be used for Python-based development work.
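
Before running any workload, a quick sanity check in a notebook cell can confirm that TensorFlow detects the allocated GPUs. The snippet below is a minimal sketch using standard TensorFlow APIs; the exact device listing depends on the GPUs allocated to the workload.

    import tensorflow as tf

    # Print the TensorFlow version bundled in the container.
    print("TensorFlow version:", tf.__version__)

    # List the GPUs visible to TensorFlow; each allocated GPU should appear
    # as a PhysicalDevice of type 'GPU'.
    print("Visible GPUs:", tf.config.list_physical_devices("GPU"))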

Click on Terminal to open it.

In the terminal, run the following benchmark command:

    python3 /root/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=resnet50 --num_gpus=8 --batch_size=256 --num_batches=100 --print_training_accuracy=True --variable_update=parameter_server --local_parameter_device=gpu
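
The --num_gpus value should match the number of GPUs allocated to the workload. For example, a single-GPU workload (as selected earlier in this guide) would use the same command with the GPU count adjusted; this is a sketch with the remaining flags left unchanged:

    python3 /root/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=resnet50 --num_gpus=1 --batch_size=256 --num_batches=100 --print_training_accuracy=True --variable_update=parameter_server --local_parameter_device=gpu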

Collect Performance Metrics
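
The benchmark reports its throughput (images/sec) in the terminal output as it runs. GPU utilization can also be checked from another terminal session with the ROCm SMI tool, assuming it is available inside the container (typically the case for ROCm-based images):

    rocm-smi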

Once the work in JupyterLab is done, close it.

Finish Workload

Click the Finish Workload button.

Download Logs

Once the workload is finished, logs can be downloaded from the STDOUT tab by clicking Download Logs.