How to Use Podman on AMD Accelerator Cloud (AAC) to Debug an Application with rocgdb

Introduction

This guide describes how to allocate an AMD GPU compute node on the Slurm cluster, open a shell on it, and use Podman to run a ROCm Docker image so that you can debug, with rocgdb, sources that live outside the container (e.g. under $HOME, mounted as /workdir inside the container).

Allocate and SSH to an AMD GPU compute node

From the Slurm login node, use a command like the following to allocate and SSH to an AAC compute node:

salloc -N 1 --exclusive --mem=0 --gres=gpu:8 -p <Partition_Name> --account=<ACCOUNT_NAME>

Example:

salloc -N 1 --exclusive --mem=0 --gres=gpu:8 -p 256C8G1H_MI325X_Ubuntu22 --account=myteam
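Before starting a container, it can help to confirm that the current shell really is inside the allocation. The sketch below relies on the SLURM_JOB_ID and SLURM_JOB_NODELIST variables that Slurm sets within an allocation; the helper names in_allocation and describe_allocation are hypothetical and not part of Slurm.

```shell
# Sketch: report whether the current shell is inside a Slurm allocation.
# in_allocation and describe_allocation are hypothetical helper names.
in_allocation() {
    # Slurm exports SLURM_JOB_ID inside an allocation created by salloc.
    [ -n "${SLURM_JOB_ID:-}" ]
}

describe_allocation() {
    if in_allocation; then
        echo "job ${SLURM_JOB_ID} on ${SLURM_JOB_NODELIST:-unknown}"
    else
        echo "no allocation"
    fi
}

describe_allocation
```

If this prints "no allocation", the salloc request has probably not been granted yet or the shell is still on the login node.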

Pulling and verifying container images with Podman

  1. Pulling an image:

Command Structure:

podman pull [registry/]image_name[:tag]

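registry/: (Optional) Specifies the container registry. Examples: docker.io, quay.io, or a private registry. If omitted, Podman resolves the short name using the registries listed in its registries.conf (commonly docker.io), so specifying the registry explicitly is safer.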

image_name: The name of the container image.

tag: (Optional) Specifies the image tag. If omitted, Podman defaults to latest.

Example (pulling rocm/pytorch-training:v25.5):

podman pull docker.io/rocm/pytorch-training:v25.5

This command pulls the v25.5 tag of the ROCm PyTorch training image from the docker.io registry. Naming both the registry and the tag makes the pull reproducible: it does not depend on Podman's default registry, and it avoids the implicit latest tag.
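The command structure can be illustrated with a small helper that assembles a fully qualified reference from its parts. This is only a sketch: image_ref is a hypothetical function, not a Podman feature, and the docker.io default it applies is an assumption that mirrors this guide rather than any particular registries.conf.

```shell
# Sketch: build a fully qualified image reference from its parts.
# image_ref is a hypothetical helper, not part of Podman.
image_ref() {
    name=$1
    tag=${2:-latest}            # latest is the tag used when none is given
    registry=${3:-docker.io}    # assumed default registry for this guide
    printf '%s/%s:%s\n' "$registry" "$name" "$tag"
}

image_ref rocm/pytorch-training v25.5
```

The result can be passed straight to Podman, for example podman pull "$(image_ref rocm/pytorch-training v25.5)".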

  2. Verifying the image:

List images:

podman images

This command displays a list of all locally stored container images.

Example session (pulling nginx:latest, then listing the local images):

@smc-r09-03:~$ podman pull docker.io/nginx:latest
Trying to pull docker.io/library/nginx:latest...
Getting image source signatures
Copying blob 943ea0f0c2e4 done   |
Copying blob 7cf63256a31a done   |
Copying blob bf9acace214a done   |
Copying blob 513c3649bb14 done   |
Copying blob d014f92d532d done   |
Copying blob 9dd21ad5a4a6 done   |
Copying blob 103f50cb3e9f done   |
Copying config b52e0b094b done   |
Writing manifest to image destination
b52e0b094bc0e26c9eddc9e4ab7a64ce0033c3360d8b7ad4ff4132c4e03e8f7b
@smc-r09-03:~$ podman images
REPOSITORY               TAG         IMAGE ID      CREATED      SIZE
docker.io/library/nginx  latest      b52e0b094bc0  4 weeks ago  196 MB
@smc-r09-03:~$
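The pull and verify steps can also be combined so that an image is pulled only when it is missing from local storage. The sketch below uses the podman image exists subcommand, which returns a zero exit status when the image is present; ensure_image and the overridable PODMAN variable are hypothetical names introduced for illustration.

```shell
# Sketch: pull an image only when it is missing from local storage.
# PODMAN can be overridden, for example with a stub, where podman is unavailable.
PODMAN=${PODMAN:-podman}

ensure_image() {
    image=$1
    # "podman image exists" exits 0 if the image is in local storage.
    if "$PODMAN" image exists "$image"; then
        echo "present: $image"
    else
        "$PODMAN" pull "$image" && echo "pulled: $image"
    fi
}
```

A typical call would be ensure_image docker.io/rocm/pytorch-training:v25.5.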

Run ROCm Docker image using Podman

The following command uses Podman to launch the rocm/pytorch-training:v25.5 Docker image in interactive mode, mounting the $HOME directory as /workdir (along with /shared/data and /shared/apps) and starting bash.

podman run -it --privileged --network=host --ipc=host -v $HOME:/workdir -v /shared/data:/shared/data -v /shared/apps:/shared/apps --workdir /workdir docker://rocm/pytorch-training:v25.5 bash

If the image is not already in local storage, Podman pulls it first and then starts the interactive session.
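Once the container shell is up, a quick check that the GPU device nodes are visible can save time before compiling anything. The sketch below assumes the usual ROCm device nodes /dev/kfd and /dev/dri; check_rocm_devices is a hypothetical helper, and its optional argument exists only so the check can be pointed at another directory for testing.

```shell
# Sketch: confirm the ROCm device nodes are visible inside the container.
# check_rocm_devices is a hypothetical helper name.
check_rocm_devices() {
    root=${1:-/dev}    # /dev in normal use; overridable for testing
    missing=0
    for node in kfd dri; do
        if [ -e "$root/$node" ]; then
            echo "ok: $root/$node"
        else
            echo "missing: $root/$node"
            missing=1
        fi
    done
    # Non-zero exit status when at least one node is absent.
    return "$missing"
}
```

Inside the container, running check_rocm_devices with no argument should report both nodes as ok; if either is missing, the GPUs are unlikely to be usable from rocgdb.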

Compile and debug the sample application hellowgpu.cpp in the container environment

The following example shows navigating to the src/ directory under $HOME (mounted as /workdir in the Podman environment), compiling the hellowgpu.cpp sample, launching rocgdb to debug, and exiting the debugger.

root@gpu-node:/workdir# cd src/
root@gpu-node:/workdir/src# ls
a.out hello.c hellowgpu.cpp python
root@gpu-node:/workdir/src# which hipcc
/opt/rocm/bin/hipcc
root@gpu-node:/workdir/src# hipcc -g hellowgpu.cpp
root@gpu-node:/workdir/src# rocgdb ./a.out
GNU gdb (rocm-rel-5.7-98) 13.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://github.com/ROCm-Developer-Tools/ROCgdb/issues>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./a.out...
(gdb) break main
Breakpoint 1 at 0x20e7e7: file hellowgpu.cpp, line 20.
(gdb) run
Starting program: /workdir/src/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, main (argc=1, argv=0x7fffffffe598) at hellowgpu.cpp:20
20 if (argc > 1)
(gdb) bt
#0 main (argc=1, argv=0x7fffffffe598) at hellowgpu.cpp:20
(gdb) quit
A debugging session is active.
Inferior 1 [process 261] will be killed.
Quit anyway? (y or n) y
root@gpu-node:/workdir/src#
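The same session can be replayed without typing commands by using the standard GDB -batch and -ex options, which rocgdb inherits from GDB. This is a sketch under a few assumptions: the helper names run and debug_main and the output name hello_dbg are hypothetical, and -O0 is added here (it is not in the session above) to keep source lines easy to follow while stepping.

```shell
# Sketch: replay the interactive debugging session non-interactively.
# run and debug_main are hypothetical helpers; hello_dbg is an assumed output name.
run() {
    # With DRY_RUN=1 the command is printed (quoting is not preserved) instead of executed.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

debug_main() {
    src=$1
    # Build with debug info, then break at main, run, and print a backtrace.
    run hipcc -g -O0 -o hello_dbg "$src" &&
    run rocgdb -batch -ex "break main" -ex "run" -ex "bt" ./hello_dbg
}
```

For example, DRY_RUN=1 debug_main hellowgpu.cpp only prints the two commands, while debug_main hellowgpu.cpp inside the container compiles the sample and runs the debugger in batch mode.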

Exit the Podman session and return to the native Slurm module environment

The following shows exiting the ROCm Podman environment back to the native SSH shell prompt.

root@gpu-node:/workdir/src# exit
exit
@gpu-node:~$

Release the allocated AMD GPU compute node to terminate the session

The following shows releasing the allocated node back to the Slurm queue and returning to the Slurm login node shell.

@gpu-node:~$ exit
exit
Job Ended Successfully
salloc: Relinquishing job allocation 59093
@pl1vm1pctlgpu-node02:~$