
Create a Multinode Application

This guide explains how to create multinode applications in AMD Accelerator Cloud (AAC). Read Create a Docker application first.

Plexus multi-node orchestration script

When more than one replica (node) is selected for a workload execution, Plexus loads a set of environment variables to orchestrate multi-node applications (e.g. PyTorch distributed or TensorFlow MultiWorkerMirroredStrategy).

The following environment variables are configured by Plexus on multinode execution:

  • PLEXUS_JOB_UUID: Unique identifier of the execution.
  • PORT_RANGE: Range of ports used to expose the master and worker nodes for internal communication. Default value: "9000 9100". The range can be configured at the cluster level.
  • PLEXUS_HOSTNAME: Container IP address used to identify each node.
  • PLEXUS_WORKER_ADDR: Set to the value of PLEXUS_HOSTNAME.
  • PLEXUS_WORKER_PORT: Port on which the worker is deployed. This is the first available port found in the PORT_RANGE for the node.
  • PLEXUS_HOST_BACKEND: Endpoint where the node is listening, constructed as PLEXUS_WORKER_ADDR:PLEXUS_WORKER_PORT.
  • PLEXUS_NODE_INDEX: Node rank value. Defaults to:
    • An automatically assigned index for non-MPI workloads.
    • OMPI_COMM_WORLD_RANK when using MPI.
  • PLEXUS_NUM_NODES: Number of nodes, set by the number of replicas selected for the execution.
  • PLEXUS_NUM_GPUS: Number of GPUs per node.
  • PLEXUS_WORLD_SIZE: World size, calculated as PLEXUS_NUM_GPUS * PLEXUS_NUM_NODES.
  • PLEXUS_BATCH_RANK: Batch rank, calculated as PLEXUS_NODE_INDEX * PLEXUS_NUM_GPUS.
  • WORKLOAD_TYPE: Specifies the type of application. It can be overridden in the application's environment variables.
    • Possible values: "pytorch", "tensorflow".
    • Defaults to "pytorch".
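As a sketch of how the derived values relate, the snippet below recomputes the world size and batch rank from the base variables. The example values are hypothetical; on AAC, Plexus exports these variables before the run script starts:

```python
import os

# Hypothetical example values; on a real execution Plexus exports these.
os.environ.setdefault("PLEXUS_NUM_NODES", "2")
os.environ.setdefault("PLEXUS_NUM_GPUS", "8")
os.environ.setdefault("PLEXUS_NODE_INDEX", "1")

num_nodes = int(os.environ["PLEXUS_NUM_NODES"])
num_gpus = int(os.environ["PLEXUS_NUM_GPUS"])
node_index = int(os.environ["PLEXUS_NODE_INDEX"])

# PLEXUS_WORLD_SIZE = GPUs * nodes; PLEXUS_BATCH_RANK = node index * GPUs.
world_size = num_gpus * num_nodes
batch_rank = node_index * num_gpus
print(world_size, batch_rank)  # 16 8 with the example values above
```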

PyTorch-specific variables:

  • MASTER_ADDR: IP address or hostname where the master node is deployed. The master node sets this value using PLEXUS_HOSTNAME.
  • MASTER_PORT: Port on which the master node is deployed. Selected from the PORT_RANGE.
  • BACKEND_ENDPOINT: Endpoint where the master node is listening, constructed as MASTER_ADDR:MASTER_PORT.
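A PyTorch script usually only needs MASTER_ADDR and MASTER_PORT, which torch.distributed reads from the environment. A minimal sketch of how BACKEND_ENDPOINT is derived (the address and port below are hypothetical examples):

```python
import os

# Hypothetical example values; Plexus sets these on a real execution.
os.environ.setdefault("MASTER_ADDR", "10.0.0.1")
os.environ.setdefault("MASTER_PORT", "9000")

# BACKEND_ENDPOINT is constructed as MASTER_ADDR:MASTER_PORT.
backend_endpoint = f"{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
# A script could pass this as init_method=f"tcp://{backend_endpoint}"
# to torch.distributed.init_process_group (not executed here).
print(backend_endpoint)  # 10.0.0.1:9000 with the example values
```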

TensorFlow-specific variables:

  • TF_CONFIG: Exposes the cluster configuration with the following structure:

{"cluster": {"worker": [LIST OF BACKEND NODES]}, "task": {"type": "worker", "index": $PLEXUS_NODE_INDEX}}

  • LIST OF BACKEND NODES is a list of PLEXUS_HOST_BACKEND values for each node.
  • PLEXUS_NODE_INDEX is an integer identifying the current node.
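A sketch of how such a TF_CONFIG value can be assembled from the per-node backends. The worker addresses below are hypothetical stand-ins for the PLEXUS_HOST_BACKEND values of each node:

```python
import json
import os

# Hypothetical PLEXUS_HOST_BACKEND values collected from each node.
backend_nodes = ["10.0.0.1:9000", "10.0.0.2:9000"]
node_index = int(os.environ.get("PLEXUS_NODE_INDEX", "0"))

# Same structure as the TF_CONFIG value described above.
tf_config = {
    "cluster": {"worker": backend_nodes},
    "task": {"type": "worker", "index": node_index},
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)
print(os.environ["TF_CONFIG"])
```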

Create application

After reading Create a Docker application, you can configure the application using the multinode variables provided by Plexus (described above).

General application as multinode

Set Allow replicas to true.

General attributes

Configure container run script

Multinode variables are loaded by default when the workload has more than one replica. For example, a container run script can use them to launch torchrun:

torchrun --nnodes=$PLEXUS_NUM_NODES --nproc_per_node=$PLEXUS_NUM_GPUS --master_addr $MASTER_ADDR  --master_port $MASTER_PORT script.py

Use cases

PyTorch

  • Upload the Python script to be executed.
  • Verify that the Python script file exists.
  • WORKLOAD_TYPE defaults to "pytorch".
  • Execute the PyTorch command using the variables provided by Plexus.
  • An example script can be downloaded here: elastic_ddp.py

Docker containers in Kubernetes or Slurm

  • The number of nodes is selected by the replicas attribute.
  • Kubernetes does not reliably resolve pod hostnames, so the example uses --rdzv_backend=static.
  • Input files can be added to the container.
  • Application example: Example application (944)

input_script=/home/aac/elastic_ddp.py

# Abort early if the uploaded script is missing.
if [ ! -f "$input_script" ]; then
  >&2 echo "File not found: $input_script"
  exit 1
fi

torchrun --nnodes=$PLEXUS_NUM_NODES --nproc_per_node=$PLEXUS_NUM_GPUS --rdzv_backend=static --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --node_rank=$PLEXUS_NODE_INDEX "$input_script"

TensorFlow

  • Configure the application environment variable WORKLOAD_TYPE=tensorflow.
  • Upload the Python script to be executed.
  • Execute the TensorFlow command.
  • The cluster variables are set in TF_CONFIG and are automatically used by TensorFlow MultiWorkerMirroredStrategy.

python3 $input_file
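Before launching the training script, a worker can sanity-check the TF_CONFIG it received. A minimal sketch, using a hypothetical two-node TF_CONFIG value in place of the one Plexus would export:

```python
import json
import os

# Hypothetical TF_CONFIG, as Plexus would export it on a two-node run.
os.environ.setdefault(
    "TF_CONFIG",
    '{"cluster": {"worker": ["10.0.0.1:9000", "10.0.0.2:9000"]},'
    ' "task": {"type": "worker", "index": 0}}',
)

tf_config = json.loads(os.environ["TF_CONFIG"])
workers = tf_config["cluster"]["worker"]
index = tf_config["task"]["index"]

# The task index must point at one of the listed worker backends.
assert 0 <= index < len(workers), "task index out of range"
print(f"worker {index} of {len(workers)}: {workers[index]}")
```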