How to run your own Docker batch workload

This document shows how to run your own Docker workload in batch mode.

To start creating a workload, sign in to the AAC web platform.

Permissions

To run your own Docker container, you need Developer or Admin permissions. Contact support to request them.

New Workload

Click on the Workloads option in the top bar menu.
Once the workload view is open, click the New Workload button located at the top right.

Select Team

If you are assigned to more than one team, a pop-up window asks you to select one of the customer teams you belong to. If you are assigned to only one team, this step is not required.

Note: In this example, the selected team is AMD Internal.

Click the Next button located at the top right.

Configure your own container

The following view shows two tabs:

Docker container: Allows you to configure your own container on the fly.
Application list: Allows you to select a preconfigured application.

The Docker container view is where you set the container URL and, for private images, the repository credentials.

General information:
- SSH enabled: Determines whether the workload is SSH interactive or batch.
- Container URL: Docker image URL.

Repository Authorization:
- User: User name for logging in to the Docker repository server.
- Password: Password for logging in to the Docker repository server.
- Server: Docker repository server. The default is Docker Hub: https://index.docker.io/v1/

This document covers batch mode, so the SSH service is deactivated and there is no SSH endpoint to connect to. The workload runs until the runscript ends.

Fill in the container URL. The credential fields are required only when the image is private.
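
The rule above (URL always required, credentials only for private images) can be sketched as a small check. This is a minimal illustration, not an AAC API: the `ContainerConfig` class, its field names, the `private_image` flag, and the example image names are all hypothetical and simply mirror the form fields described in this section.

```python
from dataclasses import dataclass
from typing import List, Optional

# Default repository server named in this guide (Docker Hub).
DEFAULT_SERVER = "https://index.docker.io/v1/"


@dataclass
class ContainerConfig:
    """Hypothetical mirror of the Docker container form; not an AAC API."""
    container_url: str
    ssh_enabled: bool = False        # False corresponds to batch mode
    private_image: bool = False      # hypothetical flag; the form has no such field
    user: Optional[str] = None
    password: Optional[str] = None
    server: str = DEFAULT_SERVER

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means the form is complete."""
        found = []
        if not self.container_url.strip():
            found.append("container URL is empty")
        if self.private_image and not (self.user and self.password):
            found.append("private image needs user and password")
        return found


# A public image needs only the URL; a private one also needs credentials.
print(ContainerConfig("example/public-image:latest").problems())
print(ContainerConfig("registry.example.com/team/app:1.0", private_image=True).problems())
```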

Select Input Files

Upload any input files the application needs to run. Click Upload files, and then drag the files into AAC, or click Browse files to open the file dialog window. In this example no input files are required, so click Next to proceed.

Configure Runscript

Configure the custom scripts to be executed for your workload.

Pre-run script: Command executed before your container starts. It runs in a basic Ubuntu Docker image.
Runscript: Command to be executed in your container.
Post-run script: Command executed after your container finishes. It runs in a basic Ubuntu Docker image.

Click the Next button in the top right corner.
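
The order of the three scripts, and the image each one runs in, can be sketched as follows. This is only an illustration of the description above: the function `plan_phases`, the image name `example/image:tag`, and the `"ubuntu"` placeholder for the basic Ubuntu image are assumptions, and skipping empty pre-run or post-run fields is also assumed rather than documented.

```python
from typing import List, Tuple


def plan_phases(pre_run: str, runscript: str, post_run: str,
                user_image: str, helper_image: str = "ubuntu") -> List[Tuple[str, str, str]]:
    """Return (phase, image, command) tuples in execution order.

    Assumption: empty script fields are simply skipped.
    """
    phases = [
        ("pre-run", helper_image, pre_run),    # runs before your container
        ("runscript", user_image, runscript),  # runs inside your container
        ("post-run", helper_image, post_run),  # runs after your container
    ]
    return [(name, image, cmd) for name, image, cmd in phases if cmd.strip()]


# Example: a pre-run step and a runscript, with no post-run script.
for name, image, cmd in plan_phases("mkdir -p /tmp/out", "python3 train.py", "",
                                    "example/image:tag"):
    print(f"{name:10} {image:18} {cmd}")
```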

Request Resources

In this step, you request the resources needed for the workload.

Number of GPUs: Number of GPU devices required in the compute resources. The default is 1 GPU.
Maximum allowed runtime: Maximum time the workload may run before the system cancels it. The default is 1 hour. This value filters the compute resources to those with an equal or higher maximum runtime.
Oversubscribe: Selects compute resources that allow resources to be shared on the same GPU. Disabled by default.
Schedule: Schedules the workload to run at a future time. The time is specified in UTC.
Telemetry: Enables workload performance tracking, showing GPU, CPU, memory, network, I/O, and other metrics. Telemetry is disabled by default because it can slightly affect compute performance.
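
How these settings narrow down the compute resources can be sketched as a simple filter, together with a helper that converts a local start time to UTC for the Schedule field. The `Queue` class, its fields, and both function names are hypothetical. Treating the GPU count as a minimum and ignoring the oversubscribe capability when Oversubscribe is disabled are assumptions; only the runtime filter is stated in this guide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class Queue:
    """Hypothetical description of a compute queue; field names are illustrative."""
    name: str
    gpus: int
    max_runtime: timedelta
    allows_oversubscribe: bool


def matching_queues(queues: List[Queue], gpus: int = 1,
                    runtime: timedelta = timedelta(hours=1),
                    oversubscribe: bool = False) -> List[Queue]:
    """Keep queues that satisfy the request; defaults follow the list above."""
    return [
        q for q in queues
        if q.gpus >= gpus                                   # assumed: enough GPUs
        and q.max_runtime >= runtime                        # equal or higher maximum runtime
        and (q.allows_oversubscribe or not oversubscribe)   # assumed semantics
    ]


def to_utc(local_start: datetime) -> datetime:
    """Convert a timezone-aware start time to UTC for the Schedule field."""
    if local_start.tzinfo is None:
        raise ValueError("start time must be timezone-aware")
    return local_start.astimezone(timezone.utc)


queues = [
    Queue("short", 8, timedelta(minutes=30), False),
    Queue("long", 8, timedelta(hours=24), True),
]
print([q.name for q in matching_queues(queues)])
print(to_utc(datetime(2024, 1, 15, 9, 0, tzinfo=timezone(timedelta(hours=-5)))))
```

Because the runtime cannot be changed after launch, it helps to choose a value with some margin before checking which queues remain available.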

Once the workload is launched, the total workload time cannot be changed, so it has to be configured in this step. Click the Next button in the top right corner.

Select Compute Resources

Select the cluster and queue that are assigned to the team and are available to run the job.

In this example, a resource queue with MI250 is available, so we select it by clicking on it.

Click the Next button in the top right corner.

Review Workload Submission

Review the information that has been entered for this workload. If any change is needed, click the Change button in the appropriate section to make revisions.

Click Run Workload.

Monitor Workload

Each workload goes through several different states after it is submitted:

Created – The workload has been created in the system.
Sent – The workload has been sent to the queue that you selected in the workload submission process.
Pending – The workload is in a waiting state in the queue.
Running – The workload has started running in the selected queue.
Completed – The batch workload has successfully completed processing the runscript.
Failed – A problem has occurred which has prevented the workload from completing successfully.
Cancelled – The workload has been cancelled by the user or the system (maximum runtime exceeded).
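
For scripts that track workloads, the states above can be modelled as an enumeration. This is a minimal sketch: the `WorkloadState` class and `is_finished` helper are hypothetical, and treating Completed, Failed, and Cancelled as final states is an inference from the descriptions above rather than something the platform documents.

```python
from enum import Enum


class WorkloadState(Enum):
    """State labels as listed in this guide, in the order they are described."""
    CREATED = "Created"
    SENT = "Sent"
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"
    CANCELLED = "Cancelled"


# Assumed final states: no further transition is expected after these.
TERMINAL = {WorkloadState.COMPLETED, WorkloadState.FAILED, WorkloadState.CANCELLED}


def is_finished(state_label: str) -> bool:
    """Return True when the given state label is one of the assumed final states."""
    return WorkloadState(state_label) in TERMINAL


for label in ("Pending", "Running", "Cancelled"):
    print(label, is_finished(label))
```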

Logs

You can check the system, stdout, and stderr logs by clicking the respective tabs.

View SysLog - Information about the workload throughout the entire process.
View Stdout - Standard output that presents the output of a workload and sometimes includes the results of the workload.
View Stderr - Standard error output that helps you understand why you may have encountered certain issues during the process.
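
To make the two output streams concrete, here is a minimal sketch of a batch job that writes its result to standard output and a diagnostic message to standard error. The function `run_job` and its messages are hypothetical. That each stream ends up in the matching log tab follows from the descriptions above, while the effect of the return value on the workload state is an assumption this guide does not confirm.

```python
import io
import sys
from typing import Optional


def run_job(out: Optional[io.TextIOBase] = None,
            err: Optional[io.TextIOBase] = None) -> int:
    """Hypothetical batch job: results go to stdout, diagnostics to stderr."""
    out = out if out is not None else sys.stdout
    err = err if err is not None else sys.stderr
    total = sum(range(10))
    print("result: sum =", total, file=out)             # expected under View Stdout
    print("warning: no input files given", file=err)    # expected under View Stderr
    return 0  # assumed: zero signals success to whatever launches the job


run_job()
```

When used as a runscript, a script like this would typically end with `sys.exit(run_job())` so that the return value becomes the process exit status.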