Train a Model with Primus and Megatron-LM
This guide shows how to train a large language model on AAC compute nodes using the AMD ROCm Primus training framework together with Megatron-LM. It covers both single-node (8 GPUs) and multinode training on MI325X or MI355X.
AMD maintains the canonical upstream tutorial, "Training a model with Primus and Megatron-LM", in the ROCm documentation. This page adapts those steps to the AAC Slurm cluster.
Container components
The rocm/primus:v26.2 image bundles the following software stack optimized for Primus and Megatron-LM training:
| Software component | Version |
|---|---|
| ROCm | 7.2.0 |
| PyTorch | 2.10.0a0+git449b176 |
| Python | 3.12.3 |
| Transformer Engine | 2.8.0.dev0+51f74fa7 |
| Flash Attention | 2.8.3 |
| hipBLASLt | 1.2.0-de5c1aebb6 |
| Triton | 3.6.0 |
| RCCL | 2.27.7 |
Prerequisites
See Bare metal prerequisites. In addition:
- Access to the MI325X or MI355X partition.
- A Hugging Face token (only required if you plan to download gated tokenizers or weights).
- Podman or Enroot on the compute node (both are installed by default).
Allocate a node
Single node (8 GPUs)
salloc -p <Partition_Name> --gres=gpu:8 --mem=0 --exclusive --account=<ACCOUNT_NAME>
Example:
salloc -p 256C8G1H_MI325X_Ubuntu22 --gres=gpu:8 --mem=0 --exclusive --account=myteam
Multinode (example: 4 nodes × 8 GPUs)
salloc -N 4 --ntasks-per-node=8 --cpus-per-task=12 --gres=gpu:8 --mem=0 \
-p <Partition_Name> --account=<ACCOUNT_NAME>
Example:
salloc -N 4 --ntasks-per-node=8 --cpus-per-task=12 --gres=gpu:8 --mem=0 \
-p 256C8G1H_MI355X_Ubuntu22 --account=myteam
After salloc completes, SSH into the first allocated node.
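One way to find the first allocated node is to expand the allocation's node list with scontrol. This is a minimal sketch, assuming you are in the shell that salloc opened, where Slurm sets SLURM_JOB_NODELIST. The function name first_allocated_node is illustrative and not part of Slurm or Primus.

```shell
# Print the first hostname of the current Slurm allocation.
# Assumes SLURM_JOB_NODELIST is set, as it is inside the shell started by salloc.
first_allocated_node() {
    if [ -z "${SLURM_JOB_NODELIST:-}" ]; then
        echo "SLURM_JOB_NODELIST is not set; run this inside the salloc shell" >&2
        return 1
    fi
    # 'scontrol show hostnames' expands a compact list such as node[01-04]
    # into one hostname per line; keep only the first line.
    scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1
}

# Usage (inside the salloc shell):
#   ssh "$(first_allocated_node)"
```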
Pull and start the container
Primus ships in the same rocm/megatron-lm image family used by the Run Megatron guide.
podman pull docker.io/rocm/primus:v26.2
podman run -it \
--device /dev/dri --device /dev/kfd --device /dev/infiniband \
--network host --ipc host \
--group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged \
-v "$HOME:$HOME" -v "$HOME/.ssh:/root/.ssh" \
-v /shared/data:/shared/data -v /shared/apps:/shared/apps \
--name primus_training_env \
docker.io/rocm/primus:v26.2
To rejoin the container later:
podman start primus_training_env
podman exec -it primus_training_env bash
Tip: You can also launch the container directly with Pyxis via srun --container-image=... (see Using Enroot).
Prepare the Primus repository
Inside the container:
cd /workspace/Primus
pip install -r requirements.txt
Dataset options
You can use either mock data or real data for training.
Mock data can be useful for testing and validation. Use the mock_data field in the Primus config to toggle between mock and real data. The default value is true (mock data enabled):
mock_data: true
If you're using a real dataset, set mock_data: false and update the train_data_path field to point to the location of your dataset:
mock_data: false
train_data_path: /path/to/your/dataset
Ensure the dataset files are accessible from inside the container — mount them via -v on the podman run command, or place them under $HOME (already mounted).
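A quick check from inside the container can confirm that the mount worked before you start a long run. The sketch below is an assumption-laden helper, not part of Primus: the name check_dataset_dir is hypothetical, and because train_data_path may name a file or a file prefix rather than a directory, the helper takes the containing directory as its argument.

```shell
# Check that the directory holding the dataset is visible and readable.
# Intended to be run inside the container; the function name is illustrative.
check_dataset_dir() {
    if [ -d "$1" ] && [ -r "$1" ]; then
        echo "ok: $1 is readable"
    else
        echo "not readable: $1 (add a -v mount to podman run, or use a path under \$HOME)" >&2
        return 1
    fi
}

# Usage (inside the container), with a placeholder path:
#   check_dataset_dir /path/to/your
```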
Tokenizer
Set the HF_TOKEN environment variable to a Hugging Face token that has permission to access the tokenizer for your chosen model:
export HF_TOKEN=<your_hftoken>
Run training (single node)
From /workspace/Primus inside the container, launch pretraining.
On MI355X
bash runner/primus-cli direct \
--log_file /tmp/primus_llama3.3_70B.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/llama3.3_70B-BF16-pretrain.yaml
On MI325X
Export the FP32 atomic flags first, then launch:
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1
bash runner/primus-cli direct \
--log_file /tmp/primus_llama3.3_70B.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/llama3.3_70B-BF16-pretrain.yaml
Training logs stream to stdout and are also written to /tmp/primus_llama3.3_70B.log.
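The two single-node variants differ only in the config directory and the FP32 atomic flags, so a small wrapper can select both from the GPU model. This is a sketch under stated assumptions: the function name select_pretrain_config and the shell variable CONFIG_PATH are illustrative and are not read by Primus, while the paths and environment variables are the ones shown in the commands above.

```shell
# Set CONFIG_PATH (and, on MI325X, the FP32 atomic flags) for the
# Llama 3.3 70B BF16 pretraining configs used in this guide.
# Call the function directly, not inside $(...), so the exports persist.
select_pretrain_config() {
    case "$1" in
        MI355X)
            CONFIG_PATH="examples/megatron/configs/MI355X/llama3.3_70B-BF16-pretrain.yaml"
            ;;
        MI325X)
            # This guide uses the MI300X config plus the FP32 atomic flags on MI325X.
            export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
            export NVTE_CK_IS_V3_ATOMIC_FP32=1
            CONFIG_PATH="examples/megatron/configs/MI300X/llama3.3_70B-BF16-pretrain.yaml"
            ;;
        *)
            echo "unsupported GPU model: $1" >&2
            return 1
            ;;
    esac
}

# Usage (from /workspace/Primus inside the container):
#   select_pretrain_config MI325X
#   bash runner/primus-cli direct \
#       --log_file /tmp/primus_llama3.3_70B.log \
#       -- train pretrain \
#       --config "$CONFIG_PATH"
```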
Run training (multinode)
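Multinode training uses Slurm directly via Primus's run_slurm_pretrain.sh launcher. Run these steps from the first allocated compute node (after salloc -N <nodes> ...). The launch examples in step 3 set NNODES=8; allocate 8 nodes or change NNODES to match your allocation.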
1. Clone Primus
git clone --recurse-submodules https://github.com/AMD-AGI/Primus.git
cd Primus/
git checkout 44f780d
git submodule update --init --recursive
2. Set environment variables
export DOCKER_IMAGE=rocm/primus:v26.2
export HF_TOKEN=<your_HF_token>
RDMA / networking — match your cluster's fabric NIC (run ip a outside the container to find the interface):
export NCCL_IB_HCA=<your_NCCL_IB_HCA>
export NCCL_SOCKET_IFNAME=<your_NCCL_SOCKET_IFNAME>
export GLOO_SOCKET_IFNAME=<your_GLOO_SOCKET_IFNAME>
export NCCL_IB_GID_INDEX=3 # default 3 for RoCE
On MI325X also set the FP32 atomic flags for better performance:
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1
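Before launching, it can help to confirm that none of the variables from this step were left unset. The helper below is a minimal sketch with an illustrative name, check_multinode_env. It leaves out HF_TOKEN on the assumption that you only need it for gated tokenizers or weights, and it leaves out the MI325X-only flags.

```shell
# Report which of the step 2 variables are unset or empty.
# Returns non-zero if any are missing. The function name is illustrative.
check_multinode_env() {
    missing=0
    for name in DOCKER_IMAGE NCCL_IB_HCA NCCL_SOCKET_IFNAME \
                GLOO_SOCKET_IFNAME NCCL_IB_GID_INDEX; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_multinode_env && echo "environment looks complete"
```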
3. Launch training
Llama 3.3 70B — FP8 on 8 nodes
NNODES=8 \
EXP=examples/megatron/configs/MI300X/llama3.3_70B-FP8-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
--micro_batch_size 4 \
--global_batch_size 256 \
--recompute_num_layers 80
Llama 3.3 70B — BF16 on 8 nodes
NNODES=8 \
EXP=examples/megatron/configs/MI300X/llama3.3_70B-BF16-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
--micro_batch_size 1 \
--global_batch_size 256 \
--recompute_num_layers 12
Adjust global_batch_size proportionally to your node count, for example global_batch_size = 8 × <single_node_bs> for 8 nodes. Pick the matching MI355X config under examples/megatron/configs/MI355X/ when running on MI355X.
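The scaling rule can be computed in the shell before launching. In this sketch SINGLE_NODE_BS=32 is a placeholder, chosen only so that 8 nodes reproduce the global batch size of 256 used in the examples above. Substitute the value that works for your model on one node.

```shell
# Scale a single-node global batch size linearly with the node count.
NNODES=8
SINGLE_NODE_BS=32   # placeholder value, not a recommendation
GLOBAL_BATCH_SIZE=$((NNODES * SINGLE_NODE_BS))
echo "global_batch_size=$GLOBAL_BATCH_SIZE"   # prints global_batch_size=256
```

The result can then be passed to the launcher as --global_batch_size "$GLOBAL_BATCH_SIZE", in place of the literal 256 in the step 3 commands.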