
Ensuring Persistent SSH Sessions on AMD Accelerator Cloud (AAC)

When you work on AMD Accelerator Cloud (AAC) systems over SSH, an unstable connection or a temporary disconnection ends your login shell and, with it, any work running in that shell. Long-running tasks can also be cut off when an idle SSH session times out. Tmux and Screen create persistent terminal sessions that keep running on the remote host after you disconnect or log out. This guide shows how to use either tool to keep your workflow uninterrupted.

End-to-end flow using Tmux and Screen: accessing the cluster

Scenario

You need to access a compute cluster from an AMD remote server. The process involves:

  • SSH Login: Establishing a secure shell connection to the cluster.
  • Allocate a Node: Requesting a compute node for your workloads.
  • Load ROCm Environment: Setting up the necessary environment for running ROCm applications.

Step 1: SSH Login

Use the following SSH command to connect to the cluster:

ssh -i <private key file> <username>@<cluster domain name, e.g., aac13.amd.com>

Replace <private key file> with the path to your private key and <username> with your actual username.

Step 2: Initiating a Session

Using Tmux

Start a new Tmux session:

tmux new -s cluster_access

Using Screen

Start a new Screen session:

screen -S cluster_access
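
Starting a session from inside another session leads to nested sessions, which are awkward to detach from. As a minimal sketch, the helper below reports whether the current shell is already running inside Tmux or Screen. It relies on the TMUX and STY environment variables, which Tmux and Screen respectively set inside their sessions; the function name current_multiplexer is hypothetical.

```shell
# Report which terminal multiplexer, if any, the current shell runs in.
# Tmux sets $TMUX and Screen sets $STY inside their sessions.
# The function name is illustrative, not part of either tool.
current_multiplexer() {
    if [ -n "${TMUX:-}" ]; then
        echo "tmux"
    elif [ -n "${STY:-}" ]; then
        echo "screen"
    else
        echo "none"
    fi
}
```

If current_multiplexer prints none, it is safe to start a new session as shown above.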

Step 3: Allocate a Node

Use the salloc command to allocate the necessary resources.

For Exclusive Access to One Node with All 8 GPUs:

salloc --mem=0 --gres=gpu:8 --reservation=<reservation, e.g., PU3E3E_reservation>
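
After reattaching to a session it is easy to lose track of whether the shell still holds an allocation. The sketch below checks the SLURM_JOB_ID and SLURM_JOB_NODELIST environment variables, which Slurm sets in shells that run inside an allocation. The function name show_allocation is hypothetical, and the exact behavior of salloc on AAC (for example, whether it opens a shell on the compute node) is an assumption that may differ by site configuration.

```shell
# Print the Slurm allocation visible to the current shell.
# SLURM_JOB_ID and SLURM_JOB_NODELIST are set by Slurm inside an allocation.
# Returns 1 when no allocation is visible.
show_allocation() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "job ${SLURM_JOB_ID} on ${SLURM_JOB_NODELIST:-unknown}"
    else
        echo "no allocation in this shell"
        return 1
    fi
}
```

Running squeue -u "$USER" gives a complementary view: it lists all of your jobs known to the scheduler, regardless of which shell you are in.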

Step 4: Load ROCm Environment

Once logged into the node, load the ROCm environment:

module avail
module load <ROCm module>

Example:

module load rocm/6.4.2
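
To confirm that the module actually put the ROCm tools on your PATH, a quick check like the following can help. This is a sketch: check_tools is a hypothetical helper, and the tool names passed to it (rocminfo, rocm-smi, hipcc) are common ROCm commands that are assumed, not guaranteed, to be provided by the module you loaded.

```shell
# Report whether each named command is on PATH.
# Returns non-zero if at least one command is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call after module load would be check_tools rocminfo rocm-smi hipcc.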

Step 5: Leaving the Session

To leave the session without terminating it:

For Tmux

Ctrl + b, then d

For Screen

Ctrl + a, then d
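
If the workload needs no interaction, an alternative to detaching by hand is to start it in a session that is detached from the beginning. The sketch below wraps the two standard invocations, tmux new-session -d and screen -dmS, and pipes the output through tee so it survives after the session closes. The function name start_detached and the example script and log names are hypothetical, and the log path is assumed to contain no spaces.

```shell
# Start a command in a new detached session and log its output.
# Arguments: tool ("tmux" or "screen"), session name, command line, log file.
# Assumes the log file path contains no spaces.
start_detached() {
    tool="$1"; session_name="$2"; job_cmd="$3"; log_file="$4"
    # Merge stderr into stdout and copy everything to the log file.
    wrapped="$job_cmd 2>&1 | tee $log_file"
    case "$tool" in
        tmux)   tmux new-session -d -s "$session_name" "$wrapped" ;;
        screen) screen -dmS "$session_name" sh -c "$wrapped" ;;
        *)      echo "unknown tool: $tool" >&2; return 2 ;;
    esac
}
```

For example, start_detached tmux cluster_access './run.sh' run.log would launch a hypothetical run.sh script. A session started this way ends when the command finishes, which is why the output is also written to a log file.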

Step 6: List Existing Sessions

After you reconnect to the cluster over SSH, list the sessions that are still running:

For Tmux

tmux ls

For Screen

screen -ls

Step 7: Attach to Previous Session

To reattach to your session:

For Tmux

tmux attach -t cluster_access

For Screen

screen -r cluster_access
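
For Tmux, the create and reattach steps can be combined so that the same command works on the first login and on every later one. The sketch below uses tmux has-session to test whether the named session exists; the function name attach_or_create is hypothetical.

```shell
# Attach to the named Tmux session if it exists; otherwise create it.
attach_or_create() {
    session_name="$1"
    if tmux has-session -t "$session_name" 2>/dev/null; then
        tmux attach-session -t "$session_name"
    else
        tmux new-session -s "$session_name"
    fi
}
```

Calling attach_or_create cluster_access right after SSH login then always lands you in the same session. Tmux versions that support the -A flag offer the same behavior directly through tmux new-session -A -s cluster_access.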

Step 8: Monitor Progress

Once reattached, you can continue to monitor your workload.
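
For a quick progress check, it is also possible to look at a Tmux session without attaching to it. The sketch below uses tmux capture-pane -p, which prints the visible contents of the session's active pane to standard output, then drops empty lines and keeps only the last few. The function name peek_session and the default of 20 lines are arbitrary choices.

```shell
# Print the last N non-empty lines visible in a Tmux session's active pane
# without attaching to the session. N defaults to 20.
peek_session() {
    session_name="$1"
    line_count="${2:-20}"
    tmux capture-pane -p -t "$session_name" | grep -v '^$' | tail -n "$line_count"
}
```

For example, peek_session cluster_access 5 shows the five most recent non-empty lines on screen in that session.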

Step 9: Terminate the Session

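After your workload finishes, type exit in the session's shell to close it. If that shell is still inside the salloc allocation, the first exit typically releases the allocation and a second exit closes the Tmux or Screen session: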

exit
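
A session can also be closed from outside, for example when you no longer want to reattach to it. The sketch below wraps the standard commands for this, tmux kill-session and screen -X quit; the function name end_session is hypothetical. Closing a session this way stops whatever is still running inside it.

```shell
# Terminate a named session from outside the session.
# Arguments: tool ("tmux" or "screen"), session name.
end_session() {
    tool="$1"; session_name="$2"
    case "$tool" in
        tmux)   tmux kill-session -t "$session_name" ;;
        screen) screen -S "$session_name" -X quit ;;
        *)      echo "unknown tool: $tool" >&2; return 2 ;;
    esac
}
```

For example, end_session tmux cluster_access removes the session created in Step 2. Afterwards, tmux ls or screen -ls should no longer list it.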