How to Run a Workload
This document shows how to run a workload on the AAC web platform.
To start creating a workload, sign in to the AAC web platform.
A workload can be launched by following the steps below.
Select Application
In this step, select the application for which the workload is to be run. This can be done in either of the two ways described below.
From the Applications page:
a. Navigate to the Applications page.
b. Select the application for which the workload is to be run.
Note: In this case, we have selected HPCG as the application.
c. Select the desired version if prompted. The container type can be either Docker or Singularity.
Note: In this case, we have selected the rocHPCG 3.1.0_97 version and the Docker container.
d. Click the New Workload button.
From the Workloads page:
a. Navigate to the Workloads page and click the New Workload button.
b. Select the desired application for which the workload is to be run.
Note: In this case, we have selected HPCG as the application.
c. Select the desired version if prompted. The container type can be either Docker or Singularity.
Note: In this case, we have selected the rocHPCG 3.1.0_97 version and the Docker container.
Click the Next button.
Select Team
If the user is assigned to more than one team, a pop-up window will prompt them to select one of the customer teams they belong to. If the user is assigned to only one team, this selection is not required.
Note: In this case, we have selected AMD Internal as the team.
Click the Start new workload button.
Select Input Files
The files required for the workload can be uploaded in this step.
a. Click Upload files.
b. Click Browse Files.
c. A maximum of 5 files can be uploaded at a time.
d. Select the required files and click Next.
Application Configuration
The user can either continue with the default scripts or enter custom commands into the following script fields.
Note: The Application Configuration step is not available for interactive applications such as PyTorch, TensorFlow, Jammy, etc.
a. Pre-run Script: Commands that need to be executed before the container is created can be added to the Pre-run Script field.
b. Run Script: Commands that need to be executed inside the created container can be added to the Run Script field. For the selected application, a benchmarking script for 8 GPUs is available in this field by default. Refer to the AMD Infinity Hub for more benchmarking commands.
c. Post-run Script: Commands that need to be executed after the container run can be added to the Post-run Script field.
A combined sketch of the three script fields is shown below. Once the scripts are in place, click Next.
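For reference, the three fields might look like the following minimal sketch. This is only an illustration, not the platform's default script: the binary name, problem-size arguments, and file names are placeholders, and the actual rocHPCG benchmark command should be taken from the default Run Script or from the AMD Infinity Hub page for the application.

# Pre-run Script (executed before the container is created)
echo "Starting pre-run step"
ls -l .                                  # for example, confirm the uploaded input files are present

# Run Script (executed inside the created container)
# Hypothetical 8-GPU benchmark launch; the rank count should match the
# number of GPUs selected later in the Select Resources step.
mpirun -np 8 rochpcg 280 280 280 1860    # problem size and run time are placeholder values

# Post-run Script (executed after the container run)
echo "Run finished"
tar -czf results.tar.gz ./*.log          # for example, archive any log output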
Select Resources
Required inputs for the workload, such as the number of GPUs and the maximum allowed runtime, can be entered in this step.
a. Queue oversubscribe: Disabled by default.
Enabling Queue oversubscribe means the resources allocated to the workload will be shared with other workloads.
b. Telemetry enabled: Telemetry is enabled by default.
If telemetry is enabled, real-time metrics appear on the workload's detail page, and once the workload is completed, a Performance tab is shown. If disabled, neither the real-time metrics nor the Performance tab is shown on the workload's detail page.
c. Maximum allowed runtime: The time for which the workload is allowed to run should be specified in the Maximum allowed runtime field. By default, 1 hour is selected.
If the maximum allowed runtime is 1 hour, the workload will run for at most 1 hour and will be stopped automatically once that limit is reached, as it is not allowed to exceed the Maximum allowed runtime.
The user should change the Maximum allowed runtime based on the time the workload requires.
Example: If the Maximum allowed runtime is changed to 2 hours and 30 minutes, the workload is allowed to run for a maximum duration of 2 hours and 30 minutes.
Once the workload is launched, the user cannot change the total workload time; it has to be configured before running.
d. Scheduled: Scheduling a job is optional. This step can be skipped if you want to run the job immediately. The Scheduled field can be used to schedule the job for a desired date and time, and the schedule can be made recurring daily or weekly. To schedule a job, click the toggle button shown below.
The time at which the workload has to be triggered and how frequently it should recur (i.e., every how many days or on which weekdays) can be defined under the Scheduled field.
Note: Example 1: Here, the workload runs every two days at 15:30. The last day is 9/20/2023, after which the workload will not trigger.
Example 2: Here, the workload triggers every Tuesday and Wednesday at 19:15. The workload will not trigger after 9/21/2023.
The number of GPUs required for the workload has to be selected in this step. For HPC applications, it should be selected based on the run script provided on the Application Configuration page, whereas for AI/ML applications, the number of GPUs should be selected as required by the user.
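For an HPC application, one way to determine the GPU count is to read the rank count from the launcher line in the Run Script. A minimal sketch, assuming the run script has been copied into a hypothetical local file named run_script.sh and launches one MPI rank per GPU:

# Print the MPI rank count from the script; select that many GPUs in this step.
# For a line such as "mpirun -np 8 rochpcg ...", this prints 8.
grep -oE 'mpirun[[:space:]]+-np[[:space:]]+[0-9]+' run_script.sh | grep -oE '[0-9]+$'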
Click Next to proceed.
Select Compute
Based on the number of GPUs selected in the previous step, a list of available queues is displayed. The required queue can be selected in this step based on accelerator type and OS.
One of the nodes in the selected queue will be assigned to the workload based on availability. If the node is occupied, the workload goes into a pending state; once the node is available, the workload starts running.
Note: In this case, 1CN128C8G2H_2IB_MI210_SLES15 is selected.
Click Next.
Review Workload Submission
Review all the configurations selected in the previous steps. If required, click the Change button in any section to modify its configuration.
Click RUN WORKLOAD
The workload will be submitted to the selected queue and will start when the requested resources are ready. Once the workload finishes successfully, its status changes to Completed.