Train a Model with Primus and Megatron-LM

This guide shows how to train a large language model on AAC compute nodes using the AMD ROCm Primus training framework together with Megatron-LM. It covers both single-node (8 GPUs) and multinode training on MI325X or MI355X.

The canonical upstream tutorial, Training a model with Primus and Megatron-LM, is maintained by AMD in the ROCm documentation. This page adapts those steps to the AAC Slurm cluster.

Container components

The rocm/primus:v26.2 image bundles the following software stack optimized for Primus and Megatron-LM training:

Software component	Version
ROCm	7.2.0
PyTorch	2.10.0a0+git449b176
Python	3.12.3
Transformer Engine	2.8.0.dev0+51f74fa7
Flash Attention	2.8.3
hipBLASLt	1.2.0-de5c1aebb6
Triton	3.6.0
RCCL	2.27.7

Prerequisites

See Bare metal prerequisites. In addition:

Access to the MI325X or MI355X partition.
A Hugging Face token (only required if you plan to download gated tokenizers or weights).
Podman or Enroot on the compute node (both are installed by default).

Allocate a node

Single node (8 GPUs)

salloc -p <Partition_Name> --gres=gpu:8 --mem=0 --exclusive --account=<ACCOUNT_NAME>

Example:

salloc -p 256C8G1H_MI325X_Ubuntu22 --gres=gpu:8 --mem=0 --exclusive --account=myteam

Multinode (example: 4 nodes × 8 GPUs)

salloc -N 4 --ntasks-per-node=8 --cpus-per-task=12 --gres=gpu:8 --mem=0 \
       -p <Partition_Name> --account=<ACCOUNT_NAME>

Example:

salloc -N 4 --ntasks-per-node=8 --cpus-per-task=12 --gres=gpu:8 --mem=0 \
       -p 256C8G1H_MI355X_Ubuntu22 --account=myteam

After salloc completes, SSH into the first allocated node.
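
Inside the salloc shell, Slurm exports the allocated node list in SLURM_JOB_NODELIST. The following is a minimal sketch for finding the first node, assuming scontrol is on your PATH. The helper name first_node is illustrative and not part of any AAC tooling.

```shell
# Print the first hostname of the current Slurm allocation.
# first_node is a hypothetical helper name; scontrol and SLURM_JOB_NODELIST
# are provided by Slurm inside the salloc shell.
first_node() {
    scontrol show hostnames "${SLURM_JOB_NODELIST:?run this inside the salloc shell}" | head -n 1
}

# Usage (inside the salloc shell):
#   ssh "$(first_node)"
```

scontrol show hostnames expands the compact node list (for example a bracketed range) into one hostname per line, so head -n 1 picks the first allocated node.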

Pull and start the container

The commands below use the rocm/primus image, which belongs to the same family of ROCm training images as the rocm/megatron-lm image used by the Run Megatron guide.

podman pull docker.io/rocm/primus:v26.2

podman run -it \
    --device /dev/dri --device /dev/kfd --device /dev/infiniband \
    --network host --ipc host \
    --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged \
    -v "$HOME:$HOME" -v "$HOME/.ssh:/root/.ssh" \
    -v /shared/data:/shared/data -v /shared/apps:/shared/apps \
    --name primus_training_env \
    docker.io/rocm/primus:v26.2

To rejoin the container later:

podman start primus_training_env
podman exec -it primus_training_env bash

Tip: You can also launch the container directly via srun --container-image=... with Pyxis — see Using Enroot.

Prepare the Primus repository

Inside the container:

cd /workspace/Primus
pip install -r requirements.txt
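
Before launching training, it can help to confirm that the container sees all eight GPUs. The sketch below assumes python3 and PyTorch are available in the container (both are listed in the container components). The helper names gpu_count and check_gpus are hypothetical.

```shell
# Count the GPUs PyTorch can see. ROCm builds of PyTorch expose AMD GPUs
# through the torch.cuda namespace.
gpu_count() {
    python3 -c 'import torch; print(torch.cuda.device_count())'
}

# Succeed only if the visible GPU count equals the expected count (default 8).
# check_gpus is a hypothetical helper name.
check_gpus() {
    expected="${1:-8}"
    found="$(gpu_count)"
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found GPUs visible"
    else
        echo "Expected $expected GPUs, found $found" >&2
        return 1
    fi
}

# Usage (inside the container):
#   check_gpus 8
```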

Dataset options

You can use either mock data or real data for training.

Mock data can be useful for testing and validation. Use the mock_data field in the Primus config to toggle between mock and real data. The default value is true (mock data enabled):

mock_data: true

If you're using a real dataset, set mock_data: false and update the train_data_path field to point to the location of your dataset:

mock_data: false
train_data_path: /path/to/your/dataset

Ensure the dataset files are accessible from inside the container — mount them via -v on the podman run command, or place them under $HOME (already mounted).
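
To confirm that a dataset location is visible from inside the container, a small check like the following can be run there. It is a sketch only. Megatron-style datasets are often referenced by a file prefix, so the check accepts either an existing path or any file that starts with the given prefix. Whether your Primus config expects a prefix or a full path is an assumption to verify, and dataset_visible is a hypothetical helper name.

```shell
# Succeed if the given path exists, or if any file starts with it as a prefix.
# dataset_visible is a hypothetical helper name.
dataset_visible() {
    for candidate in "$1"*; do
        if [ -e "$candidate" ]; then
            echo "found: $candidate"
            return 0
        fi
    done
    echo "nothing found at $1 from this environment" >&2
    return 1
}

# Usage (inside the container), with the value you set in train_data_path:
#   dataset_visible /path/to/your/dataset
```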

Tokenizer

Set the HF_TOKEN environment variable to a Hugging Face token that has permission to access the tokenizer for your chosen model:

export HF_TOKEN=<your_HF_token>
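
A missing token usually surfaces only once the tokenizer download starts, so a quick guard before launching can save a failed run. This is a minimal sketch. The helper name require_hf_token is hypothetical, and it checks only that the variable is non-empty, not that the token is valid.

```shell
# Fail early if HF_TOKEN is unset or empty, without printing the token itself.
# require_hf_token is a hypothetical helper name.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; gated tokenizers will fail to download" >&2
        return 1
    fi
    echo "HF_TOKEN is set (${#HF_TOKEN} characters)"
}

# Usage:
#   require_hf_token && echo "ready to launch"
```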

Run training (single node)

From /workspace/Primus inside the container, launch pretraining.

On MI355X

bash runner/primus-cli direct \
  --log_file /tmp/primus_llama3.3_70B.log \
  -- train pretrain \
  --config examples/megatron/configs/MI355X/llama3.3_70B-BF16-pretrain.yaml

On MI325X

Export the FP32 atomic flags first, then launch. Note that the MI325X run uses the config under examples/megatron/configs/MI300X/:

export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1

bash runner/primus-cli direct \
  --log_file /tmp/primus_llama3.3_70B.log \
  -- train pretrain \
  --config examples/megatron/configs/MI300X/llama3.3_70B-BF16-pretrain.yaml

Training logs stream to stdout and are also written to /tmp/primus_llama3.3_70B.log.
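
The two single-node launches differ only in the config directory and the MI325X-specific flags. The sketch below wraps that choice in one function. The function name select_primus_config and the variable PRIMUS_CONFIG are hypothetical; the config paths and flags are the ones used in this guide.

```shell
# Set PRIMUS_CONFIG to the BF16 Llama 3.3 70B config for the given GPU model
# and, on MI325X, export the FP32 atomic flags.
# Call this directly (not inside $(...)) so the exports reach the current shell.
select_primus_config() {
    case "$1" in
        MI355X)
            PRIMUS_CONFIG="examples/megatron/configs/MI355X/llama3.3_70B-BF16-pretrain.yaml"
            ;;
        MI325X)
            export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
            export NVTE_CK_IS_V3_ATOMIC_FP32=1
            PRIMUS_CONFIG="examples/megatron/configs/MI300X/llama3.3_70B-BF16-pretrain.yaml"
            ;;
        *)
            echo "unsupported GPU model: $1" >&2
            return 1
            ;;
    esac
}

# Usage (from /workspace/Primus inside the container):
#   select_primus_config MI325X
#   bash runner/primus-cli direct \
#     --log_file /tmp/primus_llama3.3_70B.log \
#     -- train pretrain \
#     --config "$PRIMUS_CONFIG"
```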

Run training (multinode)

Multinode training uses Slurm directly via Primus's run_slurm_pretrain.sh launcher. Run these steps from the first allocated compute node (after salloc -N <nodes> ...).

1. Clone Primus

git clone --recurse-submodules https://github.com/AMD-AGI/Primus.git
cd Primus/
git checkout 44f780d
git submodule update --init --recursive

2. Set environment variables

export DOCKER_IMAGE=rocm/primus:v26.2
export HF_TOKEN=<your_HF_token>

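RDMA / networking: set these to match your cluster's fabric NIC. Run ip a on the host, outside the container, to find the socket interface names. NCCL_IB_HCA takes RDMA device names, which ibv_devices lists: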

export NCCL_IB_HCA=<your_NCCL_IB_HCA>
export NCCL_SOCKET_IFNAME=<your_NCCL_SOCKET_IFNAME>
export GLOO_SOCKET_IFNAME=<your_GLOO_SOCKET_IFNAME>
export NCCL_IB_GID_INDEX=3 # default 3 for RoCE

On MI325X also set the FP32 atomic flags for better performance:

export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1
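
Because a missing networking variable tends to show up only as a hang or a connection error at startup, it can be worth checking the environment before launching. The following is a sketch. The helper name check_multinode_env is hypothetical, and it verifies only that the variables from this step are non-empty, not that their values are correct for your fabric.

```shell
# Report any of the multinode variables from this step that are unset or empty.
# check_multinode_env is a hypothetical helper name.
check_multinode_env() {
    missing=""
    for name in DOCKER_IMAGE NCCL_IB_HCA NCCL_SOCKET_IFNAME GLOO_SOCKET_IFNAME; do
        # Indirect lookup of the variable named by $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing" >&2
        return 1
    fi
    echo "multinode environment looks complete"
}

# Usage (on the host, before step 3):
#   check_multinode_env
```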

3. Launch training

Llama 3.3 70B — FP8 on 8 nodes

NNODES=8 \
EXP=examples/megatron/configs/MI300X/llama3.3_70B-FP8-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
  --micro_batch_size 4 \
  --global_batch_size 256 \
  --recompute_num_layers 80

Llama 3.3 70B — BF16 on 8 nodes

NNODES=8 \
EXP=examples/megatron/configs/MI300X/llama3.3_70B-BF16-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
  --micro_batch_size 1 \
  --global_batch_size 256 \
  --recompute_num_layers 12

Set NNODES to the number of nodes in your allocation and scale global_batch_size in proportion to it. For example, on 8 nodes use global_batch_size = 8 × <single_node_bs>. When running on MI355X, pick the matching config under examples/megatron/configs/MI355X/.
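
The scaling rule is plain multiplication, sketched below. The helper name scaled_global_batch_size is hypothetical, and the single-node batch size of 32 is an illustrative value chosen so that 8 nodes give the 256 used in the examples above.

```shell
# Scale a single-node global batch size linearly with the node count.
# scaled_global_batch_size is a hypothetical helper name.
scaled_global_batch_size() {
    nodes="$1"
    single_node_bs="$2"
    echo $(( nodes * single_node_bs ))
}

# Illustrative values: a single-node batch size of 32 on 8 nodes.
scaled_global_batch_size 8 32   # prints 256
```

With the same illustrative value, a 4-node allocation would pass --global_batch_size "$(scaled_global_batch_size 4 32)", which evaluates to 128.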