Train a Model with Primus and PyTorch

This guide shows how to run the AMD ROCm Primus + PyTorch (torchtitan) training benchmark on AAC compute nodes. The steps are adapted from the upstream tutorial "Training a model with Primus and PyTorch" in the ROCm documentation. The default example trains Llama 3.1 8B; Primus also ships configs for Llama 3.1 70B and DeepSeek variants.

Primus with the PyTorch torchtitan backend replaces the older rocm/pytorch-training workflow.

Container components

The rocm/primus:v26.2 image bundles the following software stack:

Software component	Version
ROCm	7.2.0
PyTorch	2.10.0a0+git449b176
Python	3.12.3
Transformer Engine	2.8.0.dev0+51f74fa7
Flash Attention	2.8.3
hipBLASLt	1.2.0-de5c1aebb6
Triton	3.6.0
RCCL	2.27.7

Supported models

The following models are pre-optimized for performance on AMD Instinct MI325X and MI355X GPUs. Some instructions, commands, and training recommendations vary by model.

Primus ships ready-made configs for:

Meta Llama: Llama 3.1 8B, Llama 3.1 70B
DeepSeek: DeepSeek V3 16B

Prerequisites

See Bare metal prerequisites. In addition:

Access to the MI325X or MI355X partition.
A Hugging Face token for downloading gated models or datasets.
Podman on the compute node (installed by default).

Allocate a node

Single node (8 GPUs)

salloc -p <Partition_Name> --gres=gpu:8 --mem=0 --exclusive --account=<ACCOUNT_NAME>

Example:

salloc -p 256C8G1H_MI325X_Ubuntu22 --gres=gpu:8 --mem=0 --exclusive --account=myteam

Multinode (example: 4 nodes × 8 GPUs)

salloc -N 4 --ntasks-per-node=8 --cpus-per-task=12 --gres=gpu:8 --mem=0 \
       -p <Partition_Name> --account=<ACCOUNT_NAME>

Example:

salloc -N 4 --ntasks-per-node=8 --cpus-per-task=12 --gres=gpu:8 --mem=0 \
       -p 256C8G1H_MI355X_Ubuntu22 --account=myteam

SSH into the first allocated node once salloc returns.
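
If salloc leaves you on the login node, one way to find the first allocated node is to expand the job's node list with scontrol. The sketch below assumes it runs in the shell that salloc opened, where Slurm sets SLURM_JOB_NODELIST. The name first_node is a hypothetical helper, not part of Slurm or Primus.

```shell
# Hypothetical helper: print the first hostname of the current allocation.
# Assumes SLURM_JOB_NODELIST is set, which is the case inside an salloc shell.
first_node() {
  if [ -z "${SLURM_JOB_NODELIST:-}" ]; then
    echo "SLURM_JOB_NODELIST is not set; run this in the salloc shell" >&2
    return 1
  fi
  # scontrol expands a compact node list (e.g. node[01-04]) to one name per line.
  scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1
}

# Usage:
#   ssh "$(first_node)"
```

Some sites configure salloc to open the shell directly on the first compute node; in that case the ssh step is unnecessary.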

Pull and start the container

Pull the rocm/primus:v26.2 image:

podman pull docker.io/rocm/primus:v26.2

Start the container:

podman run -it \
    --device /dev/dri --device /dev/kfd \
    --network host --ipc host \
    --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged \
    -v "$HOME:$HOME" -v "$HOME/.ssh:/root/.ssh" \
    -v /shared/data:/shared/data -v /shared/apps:/shared/apps \
    --name training_env \
    docker.io/rocm/primus:v26.2

To rejoin the container later:

podman start training_env
podman exec -it training_env bash

The container ships with a verified commit of the Primus repository under /workspace/Primus.
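
Once inside the container, a quick sanity check can confirm that the GPU device nodes passed with --device and the Primus checkout are visible. This is a minimal sketch; check_paths is a hypothetical helper name, and its default paths are the ones named in this guide.

```shell
# Hypothetical helper: verify that each given path exists.
# With no arguments it checks the device nodes from the podman run command
# and the Primus checkout location mentioned in this guide.
check_paths() {
  if [ "$#" -eq 0 ]; then
    set -- /dev/kfd /dev/dri /workspace/Primus
  fi
  for p in "$@"; do
    if [ ! -e "$p" ]; then
      echo "missing: $p" >&2
      return 1
    fi
  done
  echo "all paths present"
}

# Usage (inside the container):
#   check_paths
```

A "missing" message for /dev/kfd or /dev/dri suggests the container was started without the --device flags shown above.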

Prepare training datasets and dependencies

The benchmarking examples download models and datasets from Hugging Face. Export your token inside the container to access gated repos:

export HF_TOKEN=<your_hugging_face_token>
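
Before launching a run, it can help to confirm that the token is actually set in the current shell. The sketch below prints only the token length, never the token itself; check_hf_token is a hypothetical helper name.

```shell
# Hypothetical helper: fail early if HF_TOKEN is empty or unset.
# Prints the token length only, so the secret does not end up in logs.
check_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  echo "HF_TOKEN is set (${#HF_TOKEN} characters)"
}

# Usage (inside the container, after the export above):
#   check_hf_token
```

This only checks that the variable is present; it does not verify that the token has access to a given gated repo.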

Pretraining

Navigate to the Primus directory in your container:

cd /workspace/Primus

Use the primus-cli commands below to start the pretraining benchmark.

Run training (single node)

The commands below are tailored to Llama 3.1 8B. Swap the --config path for Llama 3.1 70B or DeepSeek variants under the same examples/torchtitan/configs/<GPU>/ directory.
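
The Llama config paths in this guide follow one naming pattern, so swapping models can be sketched as a small path builder. primus_config is a hypothetical helper name; the pattern is inferred from the Llama 3.1 file names shown in this guide and is not guaranteed to hold for DeepSeek configs.

```shell
# Hypothetical helper: build a Llama 3.1 config path relative to /workspace/Primus.
#   $1 = GPU directory (MI355X or MI300X)
#   $2 = model size    (8B or 70B)
#   $3 = precision     (BF16 or FP8)
primus_config() {
  local gpu="$1" size="$2" precision="$3"
  printf 'examples/torchtitan/configs/%s/llama3.1_%s-%s-pretrain.yaml\n' \
    "$gpu" "$size" "$precision"
}

# Usage (from /workspace/Primus):
#   bash runner/primus-cli direct \
#     --log_file /tmp/primus_llama3.1_70B.log \
#     -- train pretrain \
#     --config "$(primus_config MI355X 70B BF16)"
```

The helper does not check that the file exists, so confirm the path in the configs directory before launching.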

MI355X

Run Llama 3.1 8B with BF16 precision using Primus torchtitan:

bash runner/primus-cli direct \
  --log_file /tmp/primus_llama3.1_8B.log \
  -- train pretrain \
  --config examples/torchtitan/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml

Run Llama 3.1 8B with FP8 precision:

bash runner/primus-cli direct \
  --log_file /tmp/primus_llama3.1_8B_fp8.log \
  -- train pretrain \
  --config examples/torchtitan/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml

MI325X / MI300X

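MI325X and MI300X share the configs under examples/torchtitan/configs/MI300X/. Export the FP32 atomic flags for better performance before launching; they stay set for later commands in the same shell session.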

Run Llama 3.1 8B with BF16 precision:

export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1

bash runner/primus-cli direct \
  --log_file /tmp/primus_llama3.1_8B.log \
  -- train pretrain \
  --config examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml

Run Llama 3.1 8B with FP8 precision:

bash runner/primus-cli direct \
  --log_file /tmp/primus_llama3.1_8B_fp8.log \
  -- train pretrain \
  --config examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml

Training logs stream to stdout and to the --log_file path.

For Llama 3.1 70B, use the matching llama3.1_70B-*-pretrain.yaml config and adjust --recompute_num_layers and the batch sizes to fit GPU memory.

Choosing other models

Primus ships ready-made configs under examples/torchtitan/configs/<GPU>/. Examples:

llama3.1_70B-BF16-pretrain.yaml
llama3.1_70B-FP8-pretrain.yaml
DeepSeek variants
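
To see exactly which configs a given image ships, you can list the YAML files in the GPU directory. The sketch below is a minimal example; list_configs is a hypothetical helper name, and the default root assumes the /workspace/Primus checkout described earlier.

```shell
# Hypothetical helper: list the YAML config file names for one GPU directory.
#   $1 = GPU directory name (for example MI355X or MI300X)
#   $2 = optional Primus root (defaults to /workspace/Primus)
list_configs() {
  local gpu="$1"
  local root="${2:-/workspace/Primus}"
  local f
  for f in "$root/examples/torchtitan/configs/$gpu"/*.yaml; do
    # Skip the unexpanded pattern when the directory has no YAML files.
    if [ -e "$f" ]; then
      basename "$f"
    fi
  done
}

# Usage (inside the container):
#   list_configs MI355X
```

Pass any file name from the listing to --config with its examples/torchtitan/configs/<GPU>/ prefix.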

Troubleshooting

"rocminfo: command not found": the image bundles ROCm, so this error usually means you are outside the container. On the host, run module load rocm/7.2.0 first.
Out-of-memory or low throughput: adjust micro_batch_size and global_batch_size in the YAML, or switch to the FP8 config on MI355X.
Hugging Face download fails: confirm HF_TOKEN is exported inside the container and has access to the gated repo.
NCCL errors on multinode: verify NCCL_SOCKET_IFNAME and NCCL_IB_HCA match the fabric NIC on every node.
Job stays Pending: the partition may be busy, or a --constraint you added to the request is not currently provisioned. See GPU partitioning modes.
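
Two of the checks above can be sketched as small helpers: one prints the batch-size settings of a config, the other prints the current values of the two fabric variables. Both function names are hypothetical, and the first assumes the YAML uses the key names micro_batch_size and global_batch_size as written in this section.

```shell
# Hypothetical helper: show the batch-size lines of a config with line numbers.
#   $1 = path to a pretrain YAML under examples/torchtitan/configs/<GPU>/
show_batch_settings() {
  grep -nE 'micro_batch_size|global_batch_size' "$1"
}

# Hypothetical helper: print the two fabric-related variables,
# or <unset> when a variable is empty or missing.
show_nccl_env() {
  printf 'NCCL_SOCKET_IFNAME=%s\n' "${NCCL_SOCKET_IFNAME:-<unset>}"
  printf 'NCCL_IB_HCA=%s\n' "${NCCL_IB_HCA:-<unset>}"
}

# Usage (inside the container, from /workspace/Primus):
#   show_batch_settings examples/torchtitan/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml
#   show_nccl_env
```

For multinode jobs, run show_nccl_env on every node and compare the output; the values should name the same fabric interface everywhere.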

For the full upstream walkthrough, see "Training a model with Primus and PyTorch" in the ROCm documentation.