Create Multinode Application
This article explains how to create multinode applications in AAC. Before continuing with this document, read the Create Docker Applications article first.
Multinode Wrapper Script
Plexus Multi-Node Orchestration Script
Plexus provides a script to orchestrate multi-node applications based on PyTorch distributed parallelism and TensorFlow MultiWorkerMirroredStrategy. This script dynamically sets the correct variable values for each node in the workload.
The following environment variables are configured by the script:
- PLEXUS_JOB_UUID: Execution identification.
- PORT_RANGE: Range of ports used to expose the master and worker nodes. Can be preconfigured in container scripts before invoking the multinode wrapper script. Default value: "9000 9100".
- PLEXUS_HOSTNAME: Hostname or IP address used to identify each node. Defaults to the container IP address of the node. Can be preconfigured in container scripts.
- PLEXUS_WORKER_ADDR: Set to the value of PLEXUS_HOSTNAME.
- PLEXUS_WORKER_PORT: Port on which the worker is deployed. This is the first available port found in the PORT_RANGE for the node.
- PLEXUS_HOST_BACKEND: Endpoint where the node is listening, constructed as PLEXUS_WORKER_ADDR:PLEXUS_WORKER_PORT.
- PLEXUS_NODE_INDEX: Node rank value. Can be preconfigured in container scripts before invoking the multinode wrapper script. Defaults to OMPI_COMM_WORLD_RANK if using MPI; for non-MPI workloads, an index is assigned automatically.
- PLEXUS_NUM_NODES: Number of nodes. Can be preconfigured in container scripts.
- PLEXUS_NUM_GPUS: Number of GPUs.
- PLEXUS_WORLD_SIZE: World size, calculated as Number of GPUs * Number of nodes.
- PLEXUS_BATCH_RANK: Batch rank, calculated as Node index * Number of GPUs (see the example after this list).
- WORKLOAD_TYPE: Specifies the type of application. Possible values: "tensorflow" (TensorFlow Distributed Training) and "pytorch" (PyTorch Distributed Data Parallel, DDP). Defaults to "pytorch".
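To illustrate how the derived values relate, here is a minimal sketch assuming a hypothetical two-node workload with four GPUs per node (all values are illustrative, not taken from a real run):

# Hypothetical inputs: 2 nodes, 4 GPUs per node, this node has rank 1.
PLEXUS_NUM_NODES=2
PLEXUS_NUM_GPUS=4
PLEXUS_NODE_INDEX=1

# World size = Number of GPUs * Number of nodes
PLEXUS_WORLD_SIZE=$((PLEXUS_NUM_GPUS * PLEXUS_NUM_NODES))   # 8

# Batch rank = Node index * Number of GPUs
PLEXUS_BATCH_RANK=$((PLEXUS_NODE_INDEX * PLEXUS_NUM_GPUS))  # 4

echo "world size: $PLEXUS_WORLD_SIZE, batch rank: $PLEXUS_BATCH_RANK"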
PyTorch Specific Variables:
- MASTER_ADDR: IP address or hostname where the master node is deployed. The master node sets this value using PLEXUS_HOSTNAME.
- MASTER_PORT: Port on which the master node is deployed. Selected from the PORT_RANGE.
- BACKEND_ENDPOINT: Endpoint where the master node is listening, constructed as MASTER_ADDR:MASTER_PORT (see the example below).
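For example, with the hypothetical values below (the address and port are illustrative; the wrapper sets these at runtime), the endpoint would be constructed as follows:

MASTER_ADDR=10.0.0.1
MASTER_PORT=9000
BACKEND_ENDPOINT=$MASTER_ADDR:$MASTER_PORT   # 10.0.0.1:9000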
TensorFlow Specific Variables:
- TF_CONFIG: Exposes the cluster configuration with the following structure:

{"cluster": {"worker": [LIST OF BACKEND NODES]}, "task": {"type": "worker", "index": $PLEXUS_NODE_INDEX}}

LIST OF BACKEND NODES is a list of the PLEXUS_HOST_BACKEND values of all nodes. PLEXUS_NODE_INDEX is an integer identifying the current node (see the example below).
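As a concrete illustration, on a hypothetical two-node workload where the wrapper selected port 9000 on both nodes (addresses and ports are illustrative), the node with index 0 would receive:

TF_CONFIG='{"cluster": {"worker": ["10.0.0.1:9000", "10.0.0.2:9000"]}, "task": {"type": "worker", "index": 0}}'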
Create application
After reading the Create Docker Applications article, you will have enough background to configure the application directly using the multinode wrapper script provided by Plexus.
Configure a general application as multinode
Set the Allow replicas option to true.
Configure the application settings
To mount the multinode wrapper script in your application, configure it in the application settings:
- Select the API file served setting.
- Fill it with the name multinode_wrapper.
Configure container runscript
Plexus exposes the location of the multinode wrapper script in the PLEXUS_FILE_MULTINODE_WRAPPER environment variable.
The wrapper can either be sourced or executed with the main command as a parameter.
- Sourcing the wrapper
source $PLEXUS_FILE_MULTINODE_WRAPPER
torchrun --nnodes=$PLEXUS_NUM_NODES --nproc_per_node=$PLEXUS_NUM_GPUS --master_addr $MASTER_ADDR --master_port $MASTER_PORT script.py
- Executing the wrapper with the command as a parameter. In this case, variables must be escaped with "\" so they are expanded by the wrapper itself rather than by the calling shell.
$PLEXUS_FILE_MULTINODE_WRAPPER torchrun --nnodes=\$PLEXUS_NUM_NODES --nproc_per_node=\$PLEXUS_NUM_GPUS --master_addr \$MASTER_ADDR --master_port \$MASTER_PORT script.py
Use Cases
PyTorch
- Verify the wrapper script file exists.
- Upload the Python script to be executed.
- Verify the Python script file exists.
- Source the wrapper script. WORKLOAD_TYPE defaults to "pytorch".
- Execute the PyTorch command using the variables provided by the wrapper.
- An example script can be downloaded here: elastic_ddp.py
Docker containers in Kubernetes or Slurm
- The number of nodes is selected by the replicas attribute.
- Kubernetes does not properly resolve pod hostnames, so the example uses --rdzv_backend=static.
- Input files must be added.
- Application example: https://aac.amd.com/applications/944
# Verify the wrapper script file exists.
if [ -z "$PLEXUS_FILE_MULTINODE_WRAPPER" ] || [ ! -f "$PLEXUS_FILE_MULTINODE_WRAPPER" ]; then
    >&2 echo "PLEXUS_FILE_MULTINODE_WRAPPER file not found."
    exit 1
fi
# Verify the uploaded Python script exists.
input_script=/home/aac/elastic_ddp.py
if [ ! -f "$input_script" ]; then
    >&2 echo "File not found: $input_script"
    exit 1
fi
# Source the wrapper, then launch one process per GPU on each node.
source "$PLEXUS_FILE_MULTINODE_WRAPPER"
torchrun --nnodes=$PLEXUS_NUM_NODES --nproc_per_node=$PLEXUS_NUM_GPUS --rdzv_backend=static --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --node_rank=$PLEXUS_NODE_INDEX $input_script
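For reference, this is a sketch of how the torchrun command above would expand on the first node of a hypothetical two-node job with eight GPUs per node (the address, port, and counts are illustrative):

torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=static --master_addr=10.0.0.1 --master_port=9000 --node_rank=0 /home/aac/elastic_ddp.py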
TensorFlow
- Upload the Python script to be executed.
- Verify the wrapper script file exists.
- Set WORKLOAD_TYPE to "tensorflow" and source the wrapper script.
- Execute the TensorFlow command.
- The cluster variables are set in TF_CONFIG, which is used automatically by the TensorFlow MultiWorkerMirroredStrategy.
# Verify the wrapper script file exists.
if [ -z "$PLEXUS_FILE_MULTINODE_WRAPPER" ] || [ ! -f "$PLEXUS_FILE_MULTINODE_WRAPPER" ]; then
    >&2 echo "PLEXUS_FILE_MULTINODE_WRAPPER file not found."
    exit 1
fi
# Path to the uploaded TensorFlow script (illustrative; replace with your own file).
input_file=/home/aac/script.py
# Select the TensorFlow workload type before sourcing the wrapper.
WORKLOAD_TYPE=tensorflow
source "$PLEXUS_FILE_MULTINODE_WRAPPER"
python3 $input_file