Create a Multinode Application
This guide explains how to create multinode applications in AMD Accelerator Cloud (AAC). Read Create a Docker application first.
Plexus multi-node orchestration script
When more than one replica (node) is selected for a workload execution, Plexus loads a set of variables to orchestrate multi-node applications (e.g., PyTorch distributed training or TensorFlow MultiWorkerMirroredStrategy).
The following environment variables are configured by Plexus on multinode execution:
- PLEXUS_JOB_UUID: Execution identifier.
- PORT_RANGE: Range of ports used to expose the master and worker nodes for internal communication. Default value: "9000 9100". It can be configured at the cluster level.
- PLEXUS_HOSTNAME: Container IP address used to identify each node.
- PLEXUS_WORKER_ADDR: Set to the value of PLEXUS_HOSTNAME.
- PLEXUS_WORKER_PORT: Port on which the worker is deployed. This is the first available port found in the PORT_RANGE for the node.
- PLEXUS_HOST_BACKEND: Endpoint where the node is listening, constructed as PLEXUS_WORKER_ADDR:PLEXUS_WORKER_PORT.
- PLEXUS_NODE_INDEX: Node rank value. Defaults to an automatically assigned index for non-MPI executions, or to OMPI_COMM_WORLD_RANK if using MPI.
- PLEXUS_NUM_NODES: Number of nodes, set by the number of replicas on the execution.
- PLEXUS_NUM_GPUS: Number of GPUs per node.
- PLEXUS_WORLD_SIZE: World size, calculated as number of GPUs * number of nodes.
- PLEXUS_BATCH_RANK: Batch rank, calculated as node index * number of GPUs.
- WORKLOAD_TYPE: Specifies the type of application. It can be modified in the application environment variables. Possible values: "tensorflow" (TensorFlow distributed training) and "pytorch" (PyTorch Distributed Data Parallel, DDP). Select the "pytorch" configuration for most current multinode applications, such as JAX or Megatron. Defaults to "pytorch".
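To illustrate how the derived values relate to the base ones, here is a minimal Python sketch; the fallback numbers are hypothetical stand-ins for the values Plexus injects at runtime:

```python
import os

# Hypothetical defaults stand in for the values Plexus injects at runtime.
num_nodes = int(os.environ.get("PLEXUS_NUM_NODES", "2"))
num_gpus = int(os.environ.get("PLEXUS_NUM_GPUS", "8"))   # GPUs per node
node_index = int(os.environ.get("PLEXUS_NODE_INDEX", "1"))

# The derived values follow the formulas described above.
world_size = num_gpus * num_nodes    # PLEXUS_WORLD_SIZE
batch_rank = node_index * num_gpus   # PLEXUS_BATCH_RANK

print(f"world_size={world_size} batch_rank={batch_rank}")
```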
PyTorch-specific variables:
- MASTER_ADDR: IP address or hostname where the master node is deployed. The master node sets this value using PLEXUS_HOSTNAME.
- MASTER_PORT: Port on which the master node is deployed. Selected from the PORT_RANGE.
- BACKEND_ENDPOINT: Endpoint where the master node is listening, constructed as MASTER_ADDR:MASTER_PORT.
TensorFlow-specific variables:
TF_CONFIG: Exposes the cluster configuration with the following structure:
"{"cluster": {"worker": [LIST OF BACKEND NODES]}, "task": {"type": "worker", "index": '$PLEXUS_NODE_INDEX'}}"
- LIST OF BACKEND NODES is a list of the PLEXUS_HOST_BACKEND values for each node.
- PLEXUS_NODE_INDEX is an integer identifying the current node.
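The same structure can be assembled in plain Python. In the sketch below, the worker addresses are hypothetical placeholders for each node's PLEXUS_HOST_BACKEND:

```python
import json
import os

# Hypothetical backends; in practice each entry is a node's PLEXUS_HOST_BACKEND.
workers = ["10.0.0.1:9000", "10.0.0.2:9000"]
node_index = int(os.environ.get("PLEXUS_NODE_INDEX", "0"))

tf_config = {
    "cluster": {"worker": workers},
    "task": {"type": "worker", "index": node_index},
}
# TensorFlow reads this JSON string from the TF_CONFIG environment variable.
os.environ["TF_CONFIG"] = json.dumps(tf_config)
print(os.environ["TF_CONFIG"])
```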
Create application
After reading Create a Docker application, you can configure the application using the multinode variables provided by Plexus (described above).
General application as multinode
Set Allow replicas to true.
Configure container run script
Multinode variables are loaded into applications by default when the workload has more than one replica. For example:
torchrun --nnodes=$PLEXUS_NUM_NODES --nproc_per_node=$PLEXUS_NUM_GPUS --master_addr $MASTER_ADDR --master_port $MASTER_PORT script.py
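Inside script.py, torchrun itself exports RANK, LOCAL_RANK, and WORLD_SIZE to every worker it starts, and with a node rank of PLEXUS_NODE_INDEX the global rank lines up with the Plexus batch rank. A small sketch of that relationship, with hypothetical values standing in for the Plexus variables:

```python
# With torchrun --nnodes=N --nproc_per_node=G --node_rank=R, each worker's
# global rank is R * G + local_rank, i.e. PLEXUS_BATCH_RANK + local_rank.
num_gpus = 8      # hypothetical PLEXUS_NUM_GPUS
node_index = 1    # hypothetical PLEXUS_NODE_INDEX
batch_rank = node_index * num_gpus   # PLEXUS_BATCH_RANK

# Global ranks of the workers torchrun starts on this node:
ranks = [batch_rank + local_rank for local_rank in range(num_gpus)]
print(ranks)
```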
Use cases
PyTorch
- Upload the Python script to be executed.
- Verify the Python script file exists.
- WORKLOAD_TYPE defaults to "pytorch".
- Execute the PyTorch command using the variables provided by Plexus.
- An example script can be downloaded here: elastic_ddp.py
Docker containers in Kubernetes or Slurm
- The number of nodes is selected by the replicas attribute.
- Kubernetes does not properly resolve pod hostnames, so the script uses --rdzv_backend=static.
- Input files can be added to the container.
- Application example: Example application (944)
input_script=/home/aac/elastic_ddp.py
if [ ! -f "$input_script" ]; then
    >&2 echo "File not found: $input_script"
    exit 1
fi
torchrun --nnodes=$PLEXUS_NUM_NODES --nproc_per_node=$PLEXUS_NUM_GPUS --rdzv_backend=static --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --node_rank=$PLEXUS_NODE_INDEX $input_script
TensorFlow
- Configure the application environment variable WORKLOAD_TYPE=tensorflow.
- Upload the Python script to be executed.
- Execute the TensorFlow command.
- Variables are set in TF_CONFIG and are automatically used by TensorFlow MultiWorkerMirroredStrategy.
python3 "$input_file"
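To show what MultiWorkerMirroredStrategy picks up from the environment, the sketch below parses a TF_CONFIG of the structure described above without importing TensorFlow; the addresses and index are hypothetical:

```python
import json
import os

# Hypothetical TF_CONFIG of the shape Plexus exports; real values come
# from each node's PLEXUS_HOST_BACKEND and PLEXUS_NODE_INDEX.
os.environ["TF_CONFIG"] = (
    '{"cluster": {"worker": ["10.0.0.1:9000", "10.0.0.2:9000"]},'
    ' "task": {"type": "worker", "index": 0}}'
)

cfg = json.loads(os.environ["TF_CONFIG"])
workers = cfg["cluster"]["worker"]   # endpoints of all worker nodes
me = workers[cfg["task"]["index"]]   # this node's own endpoint
print(f"{len(workers)} workers; this node is {me}")
```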
