Ensuring Persistent SSH Sessions on AAC
For users working on AAC systems over SSH, an unstable connection or a temporary disconnection can halt progress, and a long-running task can be cut off when the SSH session reaches its default time limit. Tmux and Screen solve this by creating persistent terminal sessions that stay active on the remote host even after you disconnect or log out. This guide helps you get started with these tools so your workflows continue uninterrupted.
End-to-End Flow Using Tmux and Screen: Accessing the Cluster
Scenario
You need to access a compute cluster from an AMD remote server. The process involves:
- SSH Login: Establishing a secure shell connection to the cluster.
- Allocate a Node: Requesting a compute node for your workloads.
- Load ROCm Environment: Setting up the necessary environment for running ROCm applications.
Step 1: SSH Login
Use the following SSH command to connect to the cluster:
ssh -i <priv key> Username@<cluster domain name, e.g., aac1.amd.com>
Replace <priv key> with the path to your private key file, Username with your actual username, and the cluster domain name with the hostname of your cluster.
Step 2: Initiating a Session
Using Tmux
Start a new Tmux session:
tmux new -s cluster_access
Using Screen
Start a new Screen session:
screen -S cluster_access
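If you are not sure which of the two tools is installed on the login node, a small helper can decide for you. The following is a minimal sketch, not part of the AAC tooling: the function names pick_multiplexer and start_command and the SESSION variable are hypothetical, and the sketch assumes a POSIX shell.

```shell
#!/bin/sh
# Hypothetical helper: choose tmux if available, otherwise screen.
# SESSION defaults to the session name used throughout this guide.
SESSION="${SESSION:-cluster_access}"

# Print "tmux", "screen", or "none" depending on what is on PATH.
pick_multiplexer() {
    if command -v tmux >/dev/null 2>&1; then
        echo tmux
    elif command -v screen >/dev/null 2>&1; then
        echo screen
    else
        echo none
    fi
}

# Print the command from this guide that starts a new named session.
start_command() {
    case "$1" in
        tmux)   echo "tmux new -s $SESSION" ;;
        screen) echo "screen -S $SESSION" ;;
        *)      echo "neither tmux nor screen is installed" >&2; return 1 ;;
    esac
}
```

Running start_command "$(pick_multiplexer)" prints the matching command from this step, which you can then run yourself.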
Step 3: Allocate a Node
Use the salloc command to allocate the necessary resources.
For Exclusive Access to One Node with All 8 GPUs:
salloc --mem=0 --gres=gpu:8 --reservation=<reservation, e.g., s30-05_Reservation>
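After you reattach to a session later, it is not always obvious whether the current shell is still the one that salloc started. The sketch below relies on Slurm exporting the SLURM_JOB_ID variable into the shell of an allocation. The function names in_allocation and allocation_status are hypothetical and used only for illustration.

```shell
#!/bin/sh
# salloc exports SLURM_JOB_ID into the shell it starts, so a non-empty
# value suggests that this shell is running inside an allocation.
in_allocation() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

# Print a one-line summary of the allocation state of this shell.
allocation_status() {
    if in_allocation; then
        echo "inside Slurm job $SLURM_JOB_ID"
    else
        echo "no active allocation in this shell"
    fi
}
```

Calling allocation_status before you launch a workload is a quick way to confirm that the allocation was granted and has not been released.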
Step 4: Load ROCm Environment
Once the allocation is granted and you have a shell on the node, list the available modules and load the ROCm one:
module avail
module load <modulefile>
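To confirm that the module put the ROCm tools on your PATH, you can check for the standard ROCm command-line utilities rocm-smi and rocminfo. This is a rough sketch: the function name check_rocm is hypothetical, and it assumes the module you loaded exposes these two tools, which may differ between module files.

```shell
#!/bin/sh
# Report whether the ROCm command-line tools are reachable on PATH.
# Assumption: the loaded module provides rocm-smi and rocminfo.
check_rocm() {
    for tool in rocm-smi rocminfo; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}
```

If both tools are reported as found, running rocm-smi on the allocated node should list the GPUs you requested.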
Step 5: Leaving the Session
To leave the session without terminating it:
For Tmux
Ctrl + b, then d
For Screen
Ctrl + a, then d
Step 6: Come Back and Check Previous Sessions
After you log in to the cluster again, list the sessions that are still running:
For Tmux
tmux ls
For Screen
screen -ls
Step 7: Attach to Previous Session
To reattach to your session:
For Tmux
tmux attach -t cluster_access
For Screen
screen -r cluster_access
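Steps 2, 6, and 7 can be combined into a single "reattach if the session exists, otherwise create it" helper. The following is a minimal sketch under the assumption of a POSIX shell. The function names resume_tmux and resume_screen are hypothetical. tmux has-session is a standard tmux subcommand that succeeds only when the target session exists, and the Screen variant searches the screen -ls listing for the session name.

```shell
#!/bin/sh
# Reattach to a named tmux session if it exists; otherwise create it.
# "cluster_access" is the session name used in this guide.
resume_tmux() {
    name="${1:-cluster_access}"
    if tmux has-session -t "$name" 2>/dev/null; then
        tmux attach -t "$name"
    else
        tmux new -s "$name"
    fi
}

# Same idea for Screen: sessions appear as "<pid>.<name>" in `screen -ls`.
resume_screen() {
    name="${1:-cluster_access}"
    if screen -ls | grep -q "\.$name[[:space:]]"; then
        screen -r "$name"
    else
        screen -S "$name"
    fi
}
```

With these functions defined in your shell startup file, running resume_tmux or resume_screen right after SSH login returns you to your work without having to remember whether the session already exists.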
Step 8: Monitor Progress
Once reattached, you can continue to monitor your workload.
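Monitoring is easier if the workload's output is also written to a file, since the scrollback of a terminal session is limited. The sketch below is one possible approach and not something the guide prescribes: the function name run_logged is hypothetical, and it uses the standard tee utility to show output live while appending a copy to a log file.

```shell
#!/bin/sh
# Run a command, showing its output live while appending a copy to a log.
# Usage: run_logged <logfile> <command> [args...]
# Note: the pipeline's exit status is tee's, not the command's.
run_logged() {
    log="$1"
    shift
    "$@" 2>&1 | tee -a "$log"
}
```

For example, start the job inside the session with run_logged run.log followed by your workload command (the file name run.log is a placeholder), then follow it from any other shell with tail -f run.log.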
Step 9: Terminate the Session
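After your workload finishes, type exit to close the shell. If you are still inside the shell started by salloc, the first exit releases the allocation and a second exit ends the Tmux or Screen session: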
exit
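If a session is no longer responding, or you are not attached to it, you can also end it from outside. This sketch wraps the standard tmux kill-session subcommand and Screen's -X quit command. The function name end_session is hypothetical.

```shell
#!/bin/sh
# End a named session from outside it, for example after a workload hangs.
# Both calls destroy the session and every program still running inside it.
# Usage: end_session tmux|screen [name]
end_session() {
    tool="$1"
    name="${2:-cluster_access}"
    case "$tool" in
        tmux)   tmux kill-session -t "$name" ;;
        screen) screen -S "$name" -X quit ;;
        *)      echo "usage: end_session tmux|screen [name]" >&2; return 2 ;;
    esac
}
```

Because this also ends a salloc shell running inside the session, use it only when you are sure the workload is finished or no longer needed.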