Plexus Satellite Application
This guide explains how to set up and manage an isolated networked cluster on AMD Plexus using the Plexus Satellite container. You will learn how to configure your cluster, verify compatibility, onboard it to Plexus, and launch the broker satellite daemon.
Overview
The Plexus Satellite container provides tools for connecting isolated clusters to the AMD Plexus software stack. It supports three resource managers: Kubernetes, Slurm, and Edge SSH. The container performs the following functions:
- Verifies that your target cluster is compliant with Plexus prerequisites.
- Onboards your cluster to a specified Plexus instance.
- Launches the broker satellite daemon to enable communication between isolated clusters and Plexus.
AMD Plexus allows you to run your own private in-house AI cloud deployment and take advantage of your existing infrastructure and MLOps investment. With Plexus, your data scientists have a single pane of glass for working across on-premises and public clouds.
Plexus is a foundational component of AMD Accelerator Cloud (AAC), a data science platform built for AI and HPC. Combining leading-edge hardware with the Plexus software stack, AAC offers access to a co-located AI Platform-as-a-Service (AI PaaS).
Prerequisites
To run the Plexus Satellite container, you need the following:
- An account on a Plexus platform (for example, AAC), or on a Plexus instance installed on your premises. If you don't have an account, contact aac.support@amd.com.
- Provider or Admin permissions on your Plexus instance. Check with your Plexus administrator if you are unsure.
- A Docker Hub account.
- Access to the Plexus Satellite container image: amdaccelcloud/plexussatellite:3.0.0.
- An up-to-date image installed. Run docker pull to ensure you have the latest version.
Configure your cluster
Kubernetes clusters
Before using Plexus Satellite with a Kubernetes cluster, complete the following steps:
- Configure the cluster to meet Plexus requirements.
- Format the kubeconfig file correctly in YAML.
- Create an account in the Plexus Control Panel with Provider permissions. Only users with the Provider role can connect to the Plexus Control Panel when using Plexus Satellite.
Note: If you have a regular user account on Plexus, ask your Plexus administrator for Provider permissions.
Pod network isolation
Plexus network isolation works by default for the following IP ranges:
- 10.0.0.0/8
- 172.16.0.0/12
- 192.168.0.0/16
- 100.69.0.0/16
If your cluster uses a different pod CIDR, network isolation won't work. Contact Plexus support to enable network isolation for your cluster.
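Before contacting support, you can check locally whether a pod CIDR falls inside one of the default ranges. The following is a minimal sketch in POSIX-style shell, not the check the container itself runs. The function names (ip_to_int, cidr_within, pod_cidr_covered) are hypothetical helpers introduced here for illustration, and the sketch assumes IPv4 CIDRs written without leading zeros.

```shell
# Default ranges covered by Plexus network isolation, as listed above.
PLEXUS_RANGES="10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 100.69.0.0/16"

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d rest
  a=${1%%.*}; rest=${1#*.}
  b=${rest%%.*}; rest=${rest#*.}
  c=${rest%%.*}; d=${rest#*.}
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 if CIDR $1 lies entirely inside CIDR $2.
cidr_within() {
  local inner_len outer_len mask inner outer
  inner_len=${1#*/}; outer_len=${2#*/}
  # A shorter prefix means a larger block, which cannot fit inside $2.
  [ "$inner_len" -ge "$outer_len" ] || return 1
  mask=$(( (0xFFFFFFFF << (32 - outer_len)) & 0xFFFFFFFF ))
  inner=$(ip_to_int "${1%/*}")
  outer=$(ip_to_int "${2%/*}")
  [ $(( inner & mask )) -eq $(( outer & mask )) ]
}

# Return 0 if the pod CIDR $1 falls within any of the default ranges.
pod_cidr_covered() {
  local range
  for range in $PLEXUS_RANGES; do
    cidr_within "$1" "$range" && return 0
  done
  return 1
}
```

For example, `pod_cidr_covered 10.244.0.0/16 && echo "covered"` prints "covered", while a CIDR such as 100.64.0.0/16 returns a non-zero status.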
Slurm clusters
Configure your Slurm cluster to meet Plexus requirements before using Plexus Satellite.
Edge SSH clusters
An Edge SSH cluster can have one or more nodes. Each node must meet the following requirements:
- Accessible through SSH.
- Has at least one container manager installed (Singularity, Docker, or both).
- Shares the same configuration (user, password, port, and home directory) as every other node. Nodes may differ only in hostname, latitude, and longitude.
Run the container
Start the Plexus Satellite container in interactive mode:
docker run --rm -it amdaccelcloud/plexussatellite:3.0.0
Note: Always run the container in interactive mode (-it). Otherwise, the container enters a loop.
Kubernetes parameters
For Kubernetes clusters, mount the kubeconfig file in the container. The default path is /plexus/kubeconfig. This file is required for both verification and onboarding.
docker run --rm -it \
-v <path_to_kubeconfig>:/plexus/kubeconfig:ro \
amdaccelcloud/plexussatellite:3.0.0
Slurm and Edge SSH authentication
For Slurm and Edge SSH clusters, you can authenticate with either a password or a private SSH key. To mount a private key:
docker run --rm -it \
-v <path_to_private_key>:/plexus/private_key:ro \
amdaccelcloud/plexussatellite:3.0.0
Edge SSH node file
For Edge SSH clusters with multiple nodes, mount a nodes file in the container. The default path is /plexus/nodes. Each line in the file defines a node using the format <hostname>;<latitude>;<longitude>.
docker run --rm -it \
-v <path_to_nodes_file>:/plexus/nodes:ro \
amdaccelcloud/plexussatellite:3.0.0
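A malformed nodes file is easier to catch before mounting it. The sketch below checks that every line has the three semicolon-separated fields described above. The function name validate_nodes_file is hypothetical, and the sketch makes assumptions the guide does not state: latitude and longitude are plain decimal numbers within the usual geographic bounds, and blank lines are ignored.

```shell
# Check that each line of a nodes file is <hostname>;<latitude>;<longitude>.
# Prints one message per bad line to stderr and returns non-zero if any is found.
validate_nodes_file() {
  awk -F';' '
    # Skip blank lines (an assumption; the guide does not say if they are allowed).
    /^[[:space:]]*$/ { next }
    NF != 3 || $1 == "" ||
    $2 !~ /^-?[0-9]+(\.[0-9]+)?$/ || $3 !~ /^-?[0-9]+(\.[0-9]+)?$/ ||
    $2 < -90 || $2 > 90 || $3 < -180 || $3 > 180 {
      printf "line %d: expected <hostname>;<latitude>;<longitude>, got: %s\n", NR, $0 > "/dev/stderr"
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}
```

A file containing `node1.example.com;40.4168;-3.7038` passes, while a line with a missing longitude is reported with its line number.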
Local network settings
If your Plexus platform is on a local network or behind a VPN, apply these additional settings:
- Use the host network: --network=host. This is required when using any local network resource (cluster or Plexus server).
- Mount the host DNS configuration: -v /etc/resolv.conf:/etc/resolv.conf:ro. This is required when connecting to the cluster through a VPN.
Note: On macOS, reach the host by using host.docker.internal.
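The local network settings can be combined with the kubeconfig mount in a single command. The wrapper below is a sketch for a Kubernetes cluster reached over a VPN. The function name run_satellite and the DRY_RUN variable are hypothetical conveniences, not part of the container.

```shell
# Start the satellite container for a Kubernetes cluster on a local network or VPN.
# $1 is the kubeconfig path on the host. Pass an absolute path, because Docker
# treats a relative -v source as a named volume.
# With DRY_RUN=1 the command is printed instead of executed.
run_satellite() {
  set -- docker run --rm -it \
    --network=host \
    -v /etc/resolv.conf:/etc/resolv.conf:ro \
    -v "$1:/plexus/kubeconfig:ro" \
    amdaccelcloud/plexussatellite:3.0.0
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

Running `DRY_RUN=1 run_satellite "$HOME/.kube/config"` shows the full command for review before you launch the container.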
Use the satellite menu
When you start the container, it prompts you to select a resource manager (SSH, Slurm, or Kubernetes):
-- ---------------------------------------------- --
-- -- Welcome to the cluster satellite script --- --
-- ---------------------------------------------- --
-- Select resource manager [SSH, Slurm or Kubernetes]: --
Enter resource manager [Kubernetes]:
After selecting a resource manager, the main menu appears. The menu options are the same for all resource managers:
-- Please enter your choice: --
1) Pre-flight cluster compatibility test
2) Onboard your cluster in Plexus control panel
3) Start the Plexus Satellite
4) Quit
#?
- Option 1: Run the cluster compatibility verification.
- Option 2: Onboard your cluster to Plexus.
- Option 3: Start the broker satellite daemon.
- Option 4: Exit the container.
Verify cluster compatibility (option 1)
Option 1 checks that your cluster configuration meets Plexus requirements. The checks and prompts vary by resource manager.
Kubernetes
The verification checks the following:
- User role permissions
- Storage class availability
- Resources can be properly set up in the cluster
Enter the following parameters when prompted:
- Kubeconfig path: Path to the kubeconfig file in the container (default: /plexus/kubeconfig)
- Storage class: The storage class for persistent data
- Namespace: The namespace to use (defaults to the one specified in the kubeconfig)
- Pod CIDR: Used to check whether it falls within the range covered by Plexus network policies
#? 1
-- Option Pre-flight cluster compatibility test --
Enter kubeconfig path [/plexus/kubeconfig]:
Enter storage class: storage-class
Enter namespace [default]:
Enter pod cidr[]: 10.0.0.0/8
Slurm
The verification checks the following:
- Queues are available and properly configured
- Singularity is installed on every node (including the head node)
- The head node can download Singularity images
- Workloads can run on a specified partition
- Workload output files are shared across all nodes
Enter the following parameters when prompted:
- Cluster host: The hostname of your cluster
- Cluster port: The SSH port
- Cluster username: Your SSH username
- Partition for testing: The partition name to test
- Cluster shared home path: A folder shared among all nodes
- Authentication type: Either password or private_key
- Cluster password: Your password (or passphrase to unlock the key)
#? 1
-- Option: Pre-flight Cluster compatibility test --
Enter cluster host []: serve.com
Enter cluster port []: 22
Enter cluster username []: ubuntu
Enter partition for testing [debug]: MI300
Enter cluster shared home path []: /home/ubuntu/nfs
Select authentication type [password or private_key]:
Enter authentication type [password]: private_key
Enter cluster private key path [/plexus/private_key]:
Following password will be used to attempt to unlock the key.
Enter cluster password:
Edge SSH
The verification checks one of the nodes for the following requirements (all nodes must have the same configuration):
- Node has enough resources
- Singularity is properly installed
- Docker is properly installed
- The node can download Singularity images
- Singularity and Docker workloads can run on the node
Enter the following parameters when prompted:
- Cluster host: The hostname of the node
- Cluster port: The SSH port
- Cluster username: Your SSH username
- Cluster shared home path: A folder shared among all nodes
- Authentication type: Either password or private_key
- Cluster password: Your password (or passphrase to unlock the key)
-- Option: Pre-flight Cluster compatibility test --
Enter cluster host []: serve.com
Enter cluster port []: 22
Enter cluster username []: ubuntu
Enter cluster shared home path []: /home/ubuntu/nfs
Select authentication type [password or private_key]:
Enter authentication type [password]: private_key
Enter cluster private key path [/plexus/private_key]:
Following password will be used to attempt to unlock the key.
Enter cluster password:
Onboard your cluster (option 2)
Option 2 registers your cluster in the Plexus server. The prompts vary by resource manager.
Kubernetes
Provide the following parameters when prompted:
- Kubeconfig path: Path to the kubeconfig file in the container (default: /plexus/kubeconfig)
- Storage class: The storage class for persistent data
- Default storage size: Storage allocation per user in gigabytes (default: 1)
- Cluster name: The name to register for this cluster in Plexus
- Plexus server: The Plexus API server URL
- Email: Your Plexus account email
- Password: Your Plexus account password
#? 2
-- Option: Onboard your cluster in Plexus control panel --
Enter kubeconfig path [/plexus/kubeconfig]:
Enter storage class: storage-class
Enter default storage size for users (Gigabytes) [1]:
Enter cluster name: satellite-cluster
Is it a satellite cluster? [true]:
Enter Plexus server [https://aac-api.amd.com]:
Insert your Plexus email: test@amd.com
Insert your Plexus password:
-- Onboarding cluster --
-- Creating user token in https://aac-api.amd.com --
Token generated with value: dc2dd0xxxxxxxxxxxxxx7a12fd5ba71241c777
-- Onboarding cluster --
-- Creating cluster "satellite-cluster" in https://aac-api.amd.com --
Cluster "satellite-cluster" successfully created in: https://aac-api.amd.com/clusters/777
Cluster "satellite-cluster" has uuid: 81bffe12-f7c3-4d3b-b4c7-84f4348ddacf
Slurm
Note: Slurm cluster onboarding is limited to password authentication. If you use SSH-key authentication, onboard the cluster from the Plexus UI instead.
Provide the following SSH credentials first:
- Cluster host: The hostname of your cluster
- Cluster port: The SSH port
- Cluster shared home path: A folder shared among all nodes
- Cluster username: Your SSH username
- Cluster password: Your password
Then provide the Plexus registration details:
- Cluster name: The name to register for this cluster
- Plexus server: The Plexus API server URL
- Email: Your Plexus account email
- Password: Your Plexus account password
#? 2
-- Option: Onboard your cluster in Plexus control panel --
Enter cluster host []: test.cluster.com
Enter cluster port []: 22
Enter cluster shared home path []: /home/ubuntu/nfs
Enter cluster username []: user
Enter cluster password:
Enter cluster name []: satellite-slurm-cluster
Is it a satellite cluster? [true]:
Enter Plexus server [https://aac-api.amd.com]:
Insert your Plexus email: test@amd.com
Insert your Plexus password:
-- Onboarding cluster --
-- Creating user token in https://aac-api.amd.com --
Token generated with value: dc2dd0xxxxxxxxxxxxxx7a12fd5ba71241c777
-- Onboarding cluster --
-- Creating cluster "satellite-slurm-cluster" in https://aac-api.amd.com --
Cluster "satellite-slurm-cluster" successfully created in: https://aac-api.amd.com/clusters/777
Cluster "satellite-slurm-cluster" has uuid: 81bffe12-f7c3-4d3b-b4c7-84f4348ddacf
Edge SSH
Note: Edge SSH cluster onboarding is limited to password authentication. If you use SSH-key authentication, onboard the cluster from the Plexus UI instead.
Provide the following SSH credentials first:
- Nodes config path: Path to the nodes file (default: /plexus/nodes)
- Cluster port: The SSH port
- Cluster shared home path: A folder shared among all nodes
- Cluster username: Your SSH username
- Cluster password: Your password
Then provide the Plexus registration details:
- Cluster name: The name to register for this cluster
- Plexus server: The Plexus API server URL
- Email: Your Plexus account email
- Password: Your Plexus account password
#? 2
-- Option: Onboard your cluster in Plexus control panel --
Enter nodes config path [/plexus/nodes]:
Enter cluster port []: 22
Enter cluster shared home path []: /home/ubuntu
Enter cluster username []: user
Enter cluster password:
Enter cluster name []: satellite-edge-cluster
Is it a satellite cluster? [true]:
Enter Plexus server [https://aac-api.amd.com]:
Insert your Plexus email: test@amd.com
Insert your Plexus password:
-- Onboarding cluster --
-- Creating user token in https://aac-api.amd.com --
Token generated with value: dc2dd0xxxxxxxxxxxxxx7a12fd5ba71241c777
-- Onboarding cluster --
-- Creating cluster "satellite-edge-cluster" in https://aac-api.amd.com --
Cluster "satellite-edge-cluster" successfully created in: https://aac-api.amd.com/clusters/777
Cluster "satellite-edge-cluster" has uuid: 81bffe12-f7c3-4d3b-b4c7-84f4348ddacf
Onboarding output
After a successful onboarding, the script returns the API server URL and the cluster UUID. Save the cluster UUID; you need it to launch the broker satellite.
- API server URL: https://aac-api.amd.com/clusters/777
- Cluster UUID: 81bffe12-f7c3-4d3b-b4c7-84f4348ddacf
You can find your new cluster in the Plexus UI at the corresponding cluster page (for example, https://aac.amd.com/clusters/777).
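If you keep a copy of the onboarding output, the UUID can be pulled out of it rather than copied by hand. The sketch below matches the "has uuid:" line shown in the sample transcripts. The function name extract_cluster_uuid and the file name onboarding.log are hypothetical, and the sketch assumes you saved the terminal output to a file yourself.

```shell
# Print the cluster UUID found in saved onboarding output.
# Reads the files given as arguments, or standard input when none are given.
extract_cluster_uuid() {
  sed -n 's/^Cluster ".*" has uuid: \([0-9a-fA-F-]\{36\}\)[[:space:]]*$/\1/p' "$@"
}
```

For example, `extract_cluster_uuid onboarding.log` prints only the UUID, ready to paste at the "Enter cluster uuid" prompt in option 3.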
Start the Plexus Satellite broker (option 3)
Option 3 launches the broker satellite daemon, which handles communication between your isolated cluster and the Plexus platform. This option is the same for all resource managers.
Provide the following parameters when prompted:
- Plexus server: The Plexus API server URL
- Email: Your Plexus account email
- Password: Your Plexus account password
- Cluster UUID: The UUID returned during onboarding
#? 3
-- Option Start the Plexus Satellite --
Enter plexus server [https://aac-api.amd.com]:
Enter your Plexus email: test@amd.com
Enter your Plexus password:
Enter cluster uuid: 81bffe12-f7c3-4d3b-b4c7-84f4348ddacf
-- Launching broker --
2020-10-09 14:44:02.780978. Getting auth token
2020-10-09 14:44:02.817024. Pulling requests from API
Troubleshooting
Workloads stay pending forever (Kubernetes)
- Missing node labels: CPU workloads run on CPU-only nodes and GPU workloads run on GPU nodes. Label your nodes correctly:
  - CPU-only nodes: node-role.kubernetes.io/plexus-worker-type=plexus-cpu-worker
  - GPU nodes: node-role.kubernetes.io/plexus-worker-type=plexus-gpu-worker
  - Hybrid CPU/GPU nodes: node-role.kubernetes.io/plexus-worker-type=plexus-hybrid-cpu-gpu-worker
- Insufficient resources: The nodes don't have enough resources for the workload. Decrease the number of GPUs or CPUs required by your workload.
- Queue configuration mismatch: The number of CPUs assigned in the cluster queue configuration exceeds the physical cluster capacity. Decrease the maximum CPUs per workload in the queue configuration.
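The node labels listed under "Missing node labels" can be applied with kubectl label. The helper below is a sketch that only prints the command, so you can review it before running it against your cluster. The function name plexus_label_cmd and the short type names cpu, gpu, and hybrid are hypothetical.

```shell
# Print the kubectl command that applies a Plexus worker-type label to a node.
# $1 is the node name, $2 is one of: cpu, gpu, hybrid.
plexus_label_cmd() {
  local value
  case $2 in
    cpu)    value=plexus-cpu-worker ;;
    gpu)    value=plexus-gpu-worker ;;
    hybrid) value=plexus-hybrid-cpu-gpu-worker ;;
    *)
      echo "unknown worker type: $2 (use cpu, gpu, or hybrid)" >&2
      return 1
      ;;
  esac
  echo "kubectl label node $1 node-role.kubernetes.io/plexus-worker-type=$value"
}
```

For example, `plexus_label_cmd worker-1 gpu` prints the label command for a GPU node named worker-1. If the node already carries a different value for this label, kubectl requires the --overwrite flag to change it.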
Namespaces appear as disabled (Kubernetes)
This issue can have several causes:
- Invalid kubeconfig: Fix the kubeconfig file. Run the compatibility check (option 1) before creating or launching the broker.
- Incorrect cluster configuration: Ensure your cluster meets Plexus configuration requirements. Run the compatibility check (option 1) to diagnose issues.
- Broker launch delay: If more than 60 seconds pass between cluster creation and launching the broker, click Update Details in the cluster view to discover queues.
License
An End User License Agreement is included with this product. By pulling and using this container, you accept the terms and conditions of this license.