Frequently Asked Questions
How to launch/run a workload?
Please refer to HowTo_Run_Workload.
How to change/increase the workload run time?
The total run time cannot be changed or increased once the workload is launched. It has to be configured before launch.
While launching the workload, use the Maximum allowed runtime option in the Select Resources step to set the total time the workload is allowed to run.
The default is 1 hour.
If Maximum allowed runtime is 1 hour, the workload runs for at most 1 hour and is stopped automatically when that limit is reached, because it is not allowed to exceed Maximum allowed runtime.
Set Maximum allowed runtime based on the time your workload needs.
Example: If Maximum allowed runtime is changed to 2 hours 30 minutes, the workload is allowed to run for at most 2 hours 30 minutes.
Please refer to HowTo_Run_Workload for detailed steps on launching a workload.
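As a rough illustration of the limit described above, the following Python sketch computes the latest time a workload could still be running for a given Maximum allowed runtime. The function name and the start time are hypothetical and are not part of the platform. Only the 2 hours 30 minutes value comes from the example.

```python
from datetime import datetime, timedelta


def forced_stop_time(start: datetime, max_runtime: timedelta) -> datetime:
    """Return the moment at which a still-running workload would be stopped."""
    return start + max_runtime


# Hypothetical start time; the limit is the 2 h 30 min example from the text.
start = datetime(2023, 9, 1, 10, 0)
limit = timedelta(hours=2, minutes=30)
print(forced_stop_time(start, limit))  # 2023-09-01 12:30:00
```

A workload that finishes earlier simply ends earlier. The limit only matters if the workload is still running when it is reached.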
How to schedule a job?
While launching the workload, use the Scheduled option in the Select Resources step to schedule a job for a desired date and time. The schedule can be made recurring, daily or weekly.
Scheduling a job is optional. Skip this step if you want to run the job immediately.
To schedule a job, click the toggle button so that it shows Yes.
Under the Scheduled field, define the time at which the workload is triggered and how frequently it is triggered, i.e., every how many days or on which days of the week.
Example 1:
Here, the workload runs every two days at 15:30. The last day is 9/20/2023, after which the workload will not trigger.
Example 2:
Here, the workload triggers every Tuesday and Wednesday at 19:15. The workload will not trigger after 9/21/2023.
Please refer to HowTo_Run_Workload for detailed steps on launching a workload.
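The two schedules above can be sketched in Python to see which trigger times they produce. This only illustrates the recurrence rules and is not how the platform implements scheduling. The function names and the first-day values are assumptions, and the end date is assumed to be inclusive.

```python
from datetime import date, datetime, time, timedelta


def every_n_days(first_day: date, last_day: date, n: int, at: time) -> list[datetime]:
    """Trigger times for a schedule that repeats every n days."""
    runs = []
    day = first_day
    while day <= last_day:  # assumption: the last day itself still triggers
        runs.append(datetime.combine(day, at))
        day += timedelta(days=n)
    return runs


def on_weekdays(first_day: date, last_day: date, weekdays: set[int], at: time) -> list[datetime]:
    """Trigger times for a weekly schedule; weekdays use Monday=0 ... Sunday=6."""
    runs = []
    day = first_day
    while day <= last_day:
        if day.weekday() in weekdays:
            runs.append(datetime.combine(day, at))
        day += timedelta(days=1)
    return runs


# Example 1: every two days at 15:30, last day 9/20/2023 (first day is hypothetical).
example1 = every_n_days(date(2023, 9, 14), date(2023, 9, 20), 2, time(15, 30))
print([run.day for run in example1])  # [14, 16, 18, 20]

# Example 2: every Tuesday (1) and Wednesday (2) at 19:15, ending 9/21/2023.
example2 = on_weekdays(date(2023, 9, 11), date(2023, 9, 21), {1, 2}, time(19, 15))
print([run.day for run in example2])  # [12, 13, 19, 20]
```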
Why can't I see any queue while launching a workload?
Check the following to understand why queues are not visible while launching a workload:
- Make sure you are assigned to a team. On the user profile page, check in the Teams section whether you are assigned to your organization team. If you are not assigned to a team, contact your sponsor to get a team assigned. Please refer to HowTo_Check_Team_Assigned for detailed steps to check the team assigned to a user.
- Once the above step is verified, navigate to the queues page and check whether queues are visible. If queues are not visible, contact your sponsor to get queues assigned to your team. Please refer to HowTo_Check_Queues_Assigned for detailed steps to check the queues assigned.
- Queue availability while launching a workload depends on the number of GPUs selected in the Select Resources step. The number of GPUs should not be more than 8. If queues are still not visible after selecting 8 or fewer GPUs, please contact your sponsor for assistance.
Why is my workload/job in pending state?
A workload stays in pending state until the resources it needs are available.
To elaborate, while launching a workload, the user selects a queue. Each queue is a group of a certain number of nodes. A node is where the workload is executed. If the nodes you require are already occupied, the workload stays in pending state. Once the required number of nodes are freed, the workload starts running.
Note: A workload can run for a maximum of one week. So, in rare cases where all the nodes are occupied by workloads that run for 7 days, it might take up to 7 days for your workload to start running.
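To make the waiting time concrete, here is a minimal Python sketch that estimates when enough nodes become free. It assumes you know how many hours each node in the queue remains occupied (0 for an idle node) and that no other pending workload is ahead of yours. The function name and the sample numbers are hypothetical, and the platform's actual scheduler may behave differently.

```python
def hours_until_start(remaining_hours: list[float], nodes_needed: int) -> float:
    """Estimate the wait until `nodes_needed` nodes of a queue are free.

    remaining_hours holds, per node, the hours until that node is released.
    """
    if nodes_needed < 1 or nodes_needed > len(remaining_hours):
        raise ValueError("nodes_needed must be between 1 and the queue size")
    # The workload can start once the nodes_needed-th earliest node is released.
    return sorted(remaining_hours)[nodes_needed - 1]


# Hypothetical queue of four nodes: one idle, one free in 5 hours,
# two occupied by workloads with a full week (168 hours) left.
queue = [0, 5, 168, 168]
print(hours_until_start(queue, 2))  # 5
print(hours_until_start(queue, 3))  # 168
```

The second call shows the rare worst case from the note: needing a node that is held by a week-long workload means waiting up to 7 days.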
Why can't I see the Performance tab after the workload is completed?
The Performance tab is visible only if the Telemetry enabled option is turned on in the Select Resources step while launching the workload.
Please refer to the Select Resources step in the HowTo_Run_Workload guide.
Why can't I see the App Config/Application Configuration step while launching a workload?
Interactive applications such as PyTorch, TensorFlow, Jammy, etc. do not have an Application Configuration step, so you will not see it.
How to SSH into the container for Jammy/Ubuntu applications?
Launch a Jammy application. Once interactive endpoints are available on the workload overview page, navigate to STDOUT and copy the username and password.
Click Connect. Use the ssh url provided. Replace \
Please refer to How_To_Launch_Jammy(SSH)_Application for detailed steps.
How to connect to JupyterLab for AI/ML applications like PyTorch, TensorFlow, etc.?
Launch a PyTorch application. Once interactive endpoints are available, click Connect.
Copy the token from the STDOUT tab and use it to connect.
Please refer to HowTo_Launch_Pytorch_Application for detailed steps.
What does the “Invalid account or account/partition combination specified” error mean?
It means that user does not have access to nodes behind the
- salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified.
- sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified.indicates a
was specified to which the user does not have access permissions.
Sometimes, copying and pasting the command introduces invalid characters, which can cause this error. Please type the command manually to verify whether that works.
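One way to spot such characters before retyping is to scan the pasted command for anything outside printable ASCII. The Python sketch below is only an aid for that check. The function name is hypothetical, and the partition name demo and script job.sh in the sample are placeholders.

```python
import unicodedata


def suspicious_characters(command: str) -> list[tuple[int, str, str]]:
    """Return (position, character, Unicode name) for non-printable-ASCII characters."""
    found = []
    for index, char in enumerate(command):
        is_control = ord(char) < 32 and char not in "\t\n"
        if ord(char) > 126 or is_control:
            found.append((index, char, unicodedata.name(char, "UNKNOWN")))
    return found


# A pasted command in which "--" was silently turned into an en dash.
pasted = "sbatch \u2013partition=demo job.sh"
for position, char, name in suspicious_characters(pasted):
    print(f"position {position}: {char!r} ({name})")  # position 7: '–' (EN DASH)
```

If the list is empty, the command contains only plain ASCII and the error most likely has another cause.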
What does the “error: invalid partition specified” error message mean?
It means that the
How do I fix “rocminfo: command not found” or “Command 'rocminfo' not found … Please ask your administrator.”?
Load the ROCm environment with “module load rocm-6.1.2” and retry.
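If you start jobs from a Python script, a small pre-check can tell you whether rocminfo is reachable before anything else runs. This is a sketch only and the helper name is hypothetical. It just looks the command up on PATH with the standard library and does not load the module for you.

```python
import shutil


def tool_available(name: str) -> bool:
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


if not tool_available("rocminfo"):
    print("rocminfo is not on PATH; run 'module load rocm-6.1.2' in this shell and retry.")
```

The module load command itself is meant to be run in the shell (or job script) where rocminfo will be used, since it changes that shell's environment.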
How to check my role?
Navigate to the User Profile page. Your role is shown in the Organization panel.