Ensuring Persistent SSH Sessions on AMD Accelerator Cloud (AAC)
When you work on AMD Accelerator Cloud (AAC) systems over SSH, an unstable connection or a temporary disconnection typically ends your login shell and interrupts whatever was running in it; long-running tasks can also be cut off when the SSH session times out. Tmux and Screen are terminal multiplexers that keep terminal sessions running on the remote host even after you disconnect or log out. This guide shows how to use either tool to keep your workflow uninterrupted.
End-to-end flow using Tmux and Screen: accessing the cluster
Scenario
You need to access a compute cluster from an AMD remote server. The process involves:
- SSH Login: Establishing a secure shell connection to the cluster.
- Allocate a Node: Requesting a compute node for your workloads.
- Load ROCm Environment: Setting up the necessary environment for running ROCm applications.
Step 1: SSH Login
Use the following SSH command to connect to the cluster:
ssh -i <private key file> <username>@<cluster domain name>
Replace <private key file> with the path to your private key, <username> with your actual username, and <cluster domain name> with the domain name of the cluster (for example, aac13.amd.com).
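A persistent session survives a dropped connection, but client-side keepalives may reduce how often an idle connection is dropped in the first place. The following is a minimal sketch, assuming an OpenSSH client. ServerAliveInterval and ServerAliveCountMax are standard OpenSSH client options; the function name aac_ssh_command, the 60-second interval, the 3-probe limit, and the key path are illustrative assumptions, not AAC recommendations.

```shell
# Print (do not run) an ssh command line with client-side keepalives enabled.
# Arguments: <private key file> <username> <cluster domain name>
# The interval (60 s) and probe count (3) are assumed values, not AAC policy.
aac_ssh_command() {
  printf 'ssh -i %s -o ServerAliveInterval=60 -o ServerAliveCountMax=3 %s@%s\n' \
    "$1" "$2" "$3"
}

# Example with placeholder values: review the printed command, then run it.
aac_ssh_command '~/.ssh/id_ed25519' 'your_username' 'aac13.amd.com'
```

Printing the command rather than executing it keeps the sketch safe to try; paste the printed line into your terminal once the values look right.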
Step 2: Initiating a Session
Using Tmux
Start a new Tmux session:
tmux new -s cluster_access
Using Screen
Start a new Screen session:
screen -S cluster_access
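Starting a new session under a name that is already in use either fails (Tmux) or creates a second session with the same name (Screen). One way to avoid this is to attach when the session exists and create it otherwise. The sketch below uses only standard Tmux and Screen commands; the helper name start_or_attach is hypothetical, and the choice to prefer Tmux over Screen is an assumption.

```shell
# Attach to a named session if it already exists, otherwise create it.
# Prefers tmux and falls back to screen. Argument: <session name>
start_or_attach() {
  if command -v tmux >/dev/null 2>&1; then
    if tmux has-session -t "$1" 2>/dev/null; then
      tmux attach -t "$1"
    else
      tmux new -s "$1"
    fi
  elif command -v screen >/dev/null 2>&1; then
    # screen -ls prints sessions as "<pid>.<name>"; look for our name.
    if screen -ls | grep -q "[0-9]\.$1[[:space:]]"; then
      screen -r "$1"
    else
      screen -S "$1"
    fi
  else
    echo "neither tmux nor screen is installed" >&2
    return 1
  fi
}

# Usage (run interactively on the cluster login host):
#   start_or_attach cluster_access
```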
Step 3: Allocate a Node
Use the salloc command to allocate the necessary resources.
For Exclusive Access to One Node with All 8 GPUs:
salloc --mem=0 --gres=gpu:8 --reservation=<reservation>
Replace <reservation> with the name of your reservation (for example, PU3E3E_reservation).
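After you reattach to a session, it is not always obvious whether the current shell is still inside the allocation. Slurm sets the SLURM_JOB_ID environment variable in the shell that salloc starts, so a quick check can be sketched as follows; the helper name allocation_status is hypothetical.

```shell
# Report whether the current shell is running inside a Slurm allocation.
# salloc exports SLURM_JOB_ID into the shell it spawns.
allocation_status() {
  if [ -n "${SLURM_JOB_ID:-}" ]; then
    echo "inside allocation ${SLURM_JOB_ID}"
  else
    echo "no allocation in this shell"
  fi
}

allocation_status
```

On the cluster, squeue -u "$USER" lists your jobs from any shell, which helps when the check above reports no allocation but you expect one to be running.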
Step 4: Load ROCm Environment
Once you are logged in to the node, list the available modules and load the ROCm module:
module avail
module load <ROCm module>
For example:
module load rocm/6.4.2
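To confirm that loading the module had the intended effect, you can check that the ROCm command-line tools are on your PATH. This is a minimal sketch: rocm-smi, rocminfo, and hipcc are tools commonly shipped with ROCm, but whether a given module exposes all of them is an assumption, and the helper name check_rocm_tools is hypothetical.

```shell
# Report which of the named tools can be found on PATH.
# Arguments: one or more command names
check_rocm_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
    fi
  done
}

# Tool names below are assumptions about what the ROCm module provides.
check_rocm_tools rocm-smi rocminfo hipcc
```

If the tools are found, running rocm-smi on the allocated node should list the GPUs visible to your job.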
Step 5: Leaving the Session
To leave the session without terminating it:
For Tmux
Ctrl + b, then d
For Screen
Ctrl + a, then d
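Detaching only protects work that was started inside a Tmux or Screen session, so it can be worth confirming where the current shell is running before you close the terminal. Tmux sets the TMUX environment variable and Screen sets STY in the shells they start; the helper name session_kind in this sketch is hypothetical.

```shell
# Report which multiplexer, if any, the current shell is running inside.
# tmux exports TMUX and screen exports STY into the shells they start.
session_kind() {
  if [ -n "${TMUX:-}" ]; then
    echo "tmux"
  elif [ -n "${STY:-}" ]; then
    echo "screen"
  else
    echo "none"
  fi
}

session_kind
```

A result of none means the shell is a plain SSH login shell, and anything running in it is likely to stop when the connection closes.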
Step 6: List Existing Sessions
After you log back in to the cluster over SSH, list the sessions that are still running:
For Tmux
tmux ls
For Screen
screen -ls
Step 7: Attach to Previous Session
To reattach to your session:
For Tmux
tmux attach -t cluster_access
For Screen
screen -r cluster_access
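If the connection dropped instead of being detached cleanly, the session may still be marked as attached to the old terminal, and a plain screen -r can then refuse to reattach. Both tools can detach the stale client while attaching: tmux attach -d and screen -d -r are standard options. The wrapper name force_reattach below is hypothetical.

```shell
# Reattach to a session, detaching any other (possibly stale) client first.
# Arguments: <tmux|screen> <session name>
force_reattach() {
  case "$1" in
    tmux)   tmux attach -d -t "$2" ;;
    screen) screen -d -r "$2" ;;
    *)
      echo "usage: force_reattach tmux|screen <session name>" >&2
      return 2
      ;;
  esac
}

# Usage (run interactively on the cluster login host):
#   force_reattach tmux cluster_access
#   force_reattach screen cluster_access
```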
Step 8: Monitor Progress
Once reattached, you can continue to monitor your workload.
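Terminal scrollback inside a session is finite, so for long runs it can help to also write the workload's output to a file that you can inspect after reattaching. The sketch below relies only on the standard tee utility; the helper name run_logged, the log path, and the train.sh script in the usage comment are hypothetical.

```shell
# Run a command, showing its output and also appending it to a log file.
# Arguments: <log file> <command> [args...]
# Note: the pipeline's exit status is tee's, not the command's.
run_logged() {
  logfile="$1"
  shift
  "$@" 2>&1 | tee -a "$logfile"
}

# Usage inside the session (train.sh is a hypothetical workload script):
#   run_logged "$HOME/run.log" ./train.sh
```

After reattaching, tail -f on the log file shows new output as it arrives, and the full history stays available even if it has scrolled out of the terminal buffer.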
Step 9: Terminate the Session
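After your workload finishes, exit the session. If the current shell was started by salloc, the first exit typically releases the node allocation; exiting the last remaining shell then ends the Tmux or Screen session: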
exit
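A session can also be ended from outside, without attaching to it, which is useful for cleaning up sessions you no longer need. tmux kill-session and screen -X quit are standard commands; the wrapper name end_session is hypothetical. Note that this stops everything still running in the session.

```shell
# End a named session from outside, without attaching to it.
# Arguments: <tmux|screen> <session name>
# Warning: any process still running in the session is terminated.
end_session() {
  case "$1" in
    tmux)   tmux kill-session -t "$2" ;;
    screen) screen -S "$2" -X quit ;;
    *)
      echo "usage: end_session tmux|screen <session name>" >&2
      return 2
      ;;
  esac
}

# Usage (run on the cluster login host):
#   end_session tmux cluster_access
#   end_session screen cluster_access
```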