Train a Model with Primus and PyTorch
This guide shows how to run the AMD ROCm Primus + PyTorch (torchtitan) training benchmark on AAC compute nodes. The steps are adapted from the upstream tutorial at Training a model with Primus and PyTorch — ROCm Documentation. The default example trains Llama 3.1 8B; Primus also ships configs for Llama 3.1 70B and DeepSeek variants.
Primus with the PyTorch torchtitan backend replaces the older rocm/pytorch-training workflow.
Container components
The rocm/primus:v26.2 image bundles the following software stack:
| Software component | Version |
|---|---|
| ROCm | 7.2.0 |
| PyTorch | 2.10.0a0+git449b176 |
| Python | 3.12.3 |
| Transformer Engine | 2.8.0.dev0+51f74fa7 |
| Flash Attention | 2.8.3 |
| hipBLASLt | 1.2.0-de5c1aebb6 |
| Triton | 3.6.0 |
| RCCL | 2.27.7 |
Supported models
The following models are pre-optimized for performance on AMD Instinct MI325X and MI355X GPUs. Some commands and training recommendations vary by model.
Primus ships ready-made configs for:
- Meta Llama: Llama 3.1 8B, Llama 3.1 70B
- DeepSeek: DeepSeek V3 16B
Prerequisites
See Bare metal prerequisites. In addition:
- Access to the MI325X or MI355X partition.
- A Hugging Face token for downloading gated models or datasets.
- Podman on the compute node (installed by default).
Allocate a node
Single node (8 GPUs)
salloc -p <Partition_Name> --gres=gpu:8 --mem=0 --exclusive --account=<ACCOUNT_NAME>
Example:
salloc -p 256C8G1H_MI325X_Ubuntu22 --gres=gpu:8 --mem=0 --exclusive --account=myteam
Multinode (example: 4 nodes × 8 GPUs)
salloc -N 4 --ntasks-per-node=8 --cpus-per-task=12 --gres=gpu:8 --mem=0 \
-p <Partition_Name> --account=<ACCOUNT_NAME>
Example:
salloc -N 4 --ntasks-per-node=8 --cpus-per-task=12 --gres=gpu:8 --mem=0 \
-p 256C8G1H_MI355X_Ubuntu22 --account=myteam
SSH into the first allocated node once salloc returns.
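The two allocation shapes above differ only in a few flags, which the sketch below makes explicit. It only prints the salloc command and does not run it. The function name print_salloc and its argument order are hypothetical, chosen for this illustration.

```shell
# Hypothetical helper: prints (does not run) the salloc command from this guide.
# Usage: print_salloc <partition> <account> [nodes]
print_salloc() {
  ps_partition="$1"
  ps_account="$2"
  ps_nodes="${3:-1}"
  if [ "$ps_nodes" -le 1 ]; then
    # Single node: 8 GPUs, all node memory, exclusive access.
    echo "salloc -p $ps_partition --gres=gpu:8 --mem=0 --exclusive --account=$ps_account"
  else
    # Multinode: 8 tasks per node (one per GPU), 12 CPU cores per task.
    echo "salloc -N $ps_nodes --ntasks-per-node=8 --cpus-per-task=12 --gres=gpu:8 --mem=0 -p $ps_partition --account=$ps_account"
  fi
}

print_salloc 256C8G1H_MI325X_Ubuntu22 myteam
print_salloc 256C8G1H_MI355X_Ubuntu22 myteam 4
```

Review the printed command before pasting it into your shell; the partition and account values are the example values used above.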
Pull and start the container
- Pull the rocm/primus:v26.2 image:
podman pull docker.io/rocm/primus:v26.2
- Start the container:
podman run -it \
--device /dev/dri --device /dev/kfd \
--network host --ipc host \
--group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged \
-v "$HOME:$HOME" -v "$HOME/.ssh:/root/.ssh" \
-v /shared/data:/shared/data -v /shared/apps:/shared/apps \
--name training_env \
docker.io/rocm/primus:v26.2
To rejoin the container later:
podman start training_env
podman exec -it training_env bash
The container ships with a verified commit of the Primus repository under /workspace/Primus.
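Once inside the container, it can help to confirm that the two device nodes passed with --device are actually visible. The sketch below is a minimal check, not part of Primus. The function name check_gpu_devices is hypothetical, and the optional root-prefix argument exists only so the check can be exercised on a machine without GPUs.

```shell
# Hypothetical helper: verifies that /dev/kfd and /dev/dri exist.
# Optional argument: a root prefix to prepend (for dry runs); defaults to empty.
check_gpu_devices() {
  cg_root="${1:-}"
  cg_rc=0
  for cg_dev in /dev/kfd /dev/dri; do
    if [ -e "$cg_root$cg_dev" ]; then
      echo "ok: $cg_dev"
    else
      echo "missing: $cg_dev"
      cg_rc=1
    fi
  done
  return "$cg_rc"
}

check_gpu_devices || echo "GPU device nodes not visible; re-check the --device flags"
```

If either path is reported missing, recreate the container with the podman run command above rather than patching the running one.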
Prepare training datasets and dependencies
The benchmarking examples download models and datasets from Hugging Face. Export your token inside the container to access gated repos:
export HF_TOKEN=<your_hugging_face_token>
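A missing token usually surfaces only when a download fails partway into a run, so a quick guard before launching can save time. The sketch below only checks that the variable is non-empty; it does not validate the token against Hugging Face. The function name require_hf_token is hypothetical.

```shell
# Hypothetical helper: fails early when HF_TOKEN is unset or empty.
# It prints the token length only, never the token itself.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; export it before launching training" >&2
    return 1
  fi
  echo "HF_TOKEN is set (${#HF_TOKEN} characters)"
}

require_hf_token || true
```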
Pretraining
Navigate to the Primus directory in your container:
cd /workspace/Primus
Use the run_pretrain.sh / primus-cli workflow below to start the pretraining benchmark.
Run training (single node)
The commands below are tailored to Llama 3.1 8B. Swap the --config path for Llama 3.1 70B or DeepSeek variants under the same examples/torchtitan/configs/<GPU>/ directory.
MI355X
Run Llama 3.1 8B with BF16 precision using Primus torchtitan:
bash runner/primus-cli direct \
--log_file /tmp/primus_llama3.1_8B.log \
-- train pretrain \
--config examples/torchtitan/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml
Run Llama 3.1 8B with FP8 precision:
bash runner/primus-cli direct \
--log_file /tmp/primus_llama3.1_8B_fp8.log \
-- train pretrain \
--config examples/torchtitan/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml
MI325X / MI300X
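Export the FP32 atomic flags for better performance, then launch. Both GPUs use the configs under examples/torchtitan/configs/MI300X/. The exports persist for the shell session, so the FP8 command inherits them when run in the same shell as the BF16 command.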
Llama 3.1 8B — BF16:
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1
bash runner/primus-cli direct \
--log_file /tmp/primus_llama3.1_8B.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml
Llama 3.1 8B — FP8:
bash runner/primus-cli direct \
--log_file /tmp/primus_llama3.1_8B_fp8.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml
Training logs stream to stdout and to the --log_file path.
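The MI325X / MI300X flag exports can be wrapped so that a launch script sets them only on those GPUs. This is a minimal sketch under that assumption: the function name set_atomic_flags is hypothetical, and for any other GPU name it leaves the environment untouched rather than unsetting anything.

```shell
# Hypothetical helper: exports the FP32 atomic flags for MI325X / MI300X only.
# Usage: set_atomic_flags <gpu>
set_atomic_flags() {
  case "$1" in
    MI325X|MI300X)
      export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
      export NVTE_CK_IS_V3_ATOMIC_FP32=1
      ;;
    *)
      # Other GPUs (for example MI355X): this guide sets no flags, so do nothing.
      ;;
  esac
}

set_atomic_flags MI325X
echo "PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=${PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32:-<unset>}"
echo "NVTE_CK_IS_V3_ATOMIC_FP32=${NVTE_CK_IS_V3_ATOMIC_FP32:-<unset>}"
```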
For Llama 3.1 70B, use the matching llama3.1_70B-*-pretrain.yaml config and adjust --recompute_num_layers and batch sizes.
Choosing other models
Primus ships ready-made configs under examples/torchtitan/configs/<GPU>/. Examples:
- llama3.1_70B-BF16-pretrain.yaml
- llama3.1_70B-FP8-pretrain.yaml
- DeepSeek variants
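The config file names in this guide follow the pattern <model>-<precision>-pretrain.yaml under a per-GPU directory. The sketch below builds a path from that pattern. The function name primus_config is hypothetical, the MI325X-to-MI300X directory mapping mirrors the commands shown earlier in this guide, and the function does not check that the resulting file exists in your Primus checkout.

```shell
# Hypothetical helper: builds a config path from the naming pattern in this guide.
# Usage: primus_config <gpu> <model> <precision>
primus_config() {
  case "$1" in
    MI355X) pc_dir=MI355X ;;
    MI325X|MI300X) pc_dir=MI300X ;;  # this guide uses the MI300X configs for MI325X
    *) echo "unsupported GPU: $1" >&2; return 1 ;;
  esac
  echo "examples/torchtitan/configs/$pc_dir/$2-$3-pretrain.yaml"
}

primus_config MI355X llama3.1_70B FP8
primus_config MI325X llama3.1_8B BF16
```

Before passing the result to --config, list the directory under /workspace/Primus to confirm the file is present, since not every model and precision combination is guaranteed to ship for every GPU.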
Troubleshooting
- "rocminfo: command not found" inside the container: the image bundles ROCm; outside the container, run module load rocm/7.2.0.
- Out-of-memory or low throughput: adjust micro_batch_size and global_batch_size in the YAML, or switch to the FP8 config on MI355X.
- Hugging Face download fails: confirm HF_TOKEN is exported inside the container and has access to the gated repo.
- NCCL errors on multinode: verify NCCL_SOCKET_IFNAME and NCCL_IB_HCA match the fabric NIC on every node.
- Job stays Pending: the partition may be busy or the requested --constraint is not currently provisioned. See GPU partitioning modes.
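For the multinode NCCL item, a first step is to see what the two variables are set to on each node. The sketch below only prints their current values and marks unset ones; it does not probe the network. The function name show_nccl_env is hypothetical.

```shell
# Hypothetical helper: prints the two NCCL variables named above.
# Unset or empty variables are shown as <unset>.
show_nccl_env() {
  echo "NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-<unset>}"
  echo "NCCL_IB_HCA=${NCCL_IB_HCA:-<unset>}"
}

show_nccl_env
```

Run it on every allocated node and compare the output; the values should be identical and should name interfaces that exist on that node.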
Related pages
For the full upstream walkthrough, see Training a model with Primus and PyTorch → ROCm Documentation.