How To Run Megatron
The ROCm Megatron-LM framework is a specialized fork of Megatron-LM designed to train large-scale language models efficiently on AMD GPUs. By leveraging AMD Instinct™ MI300X accelerators, this framework offers enhanced performance, scalability, and optimized support for large language model training workloads.
Megatron GitHub Repository
https://github.com/ROCm/Megatron-LM
An AMD Confluence page shows how to run Megatron using Docker (with all the latest updates, changes, and parameters).
Prerequisites
ROCm 6.4.3
Podman
Hugging Face API Token
Allocate and SSH to a node from the partition 256C8G1H_MI355X_Ubuntu22
Example:
salloc --exclusive --mem=0 --gres=gpu:8 -p 256C8G1H_MI355X_Ubuntu22
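Once the allocation is granted, locate the assigned node and SSH to it. A minimal sketch, assuming a standard Slurm setup (node naming and login flow vary by cluster):
squeue -u "$USER" -h -o "%N" # prints the node list for your running jobs (-h suppresses the header)
ssh <node-name> # replace <node-name> with the node printed above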
Pull image using Podman
podman pull docker.io/rocm/megatron-lm:v25.8_py310
Launch the container using Podman
podman run -it \
--device /dev/dri \
--device /dev/kfd \
--device /dev/infiniband \
--network host --ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v "$HOME:$HOME" \
-v "$HOME/.ssh:/root/.ssh" \
--name megatron_training_env \
docker.io/rocm/megatron-lm:v25.8_py310
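Once inside the container, you can optionally confirm that the GPUs are visible before proceeding (rocm-smi ships with the ROCm image):
rocm-smi # lists the detected AMD GPUs and their utilization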
Note: Use these commands if you exit the megatron_training_env container and need to return to it.
podman start megatron_training_env
podman exec -it megatron_training_env bash
Megatron-LM backward compatibility setup
cd /workspace/Megatron-LM/
pip uninstall megatron-core
pip install -e .
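As an optional sanity check, confirm that megatron.core now resolves to the local checkout rather than the previously installed package:
python -c "import megatron.core, os; print(os.path.dirname(megatron.core.__file__))" # should print a path under /workspace/Megatron-LM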
This guide uses the ROCm/Megatron-LM repository https://github.com/ROCm/Megatron-LM/tree/rocm_dev at verified commit e8e9edc.
Configuration
Update the train_llama3.sh configuration script in the examples/llama directory of ROCm/Megatron-LM https://github.com/ROCm/Megatron-LM/tree/rocm_dev to configure your training run. Options can also be passed as command line arguments as described in Run training.
Network interface
Update the network interface in the script to match your system’s network interface. To find your network interface, run the following (outside of any Podman container):
ip a
Look for an active interface that has an IP address in the same subnet as your other nodes. Then, update the following variables in the script, for example:
export NCCL_SOCKET_IFNAME=ens50f0np0
export GLOO_SOCKET_IFNAME=ens50f0np0
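If you prefer to derive the interface programmatically, the following sketch picks the interface used for the default route (8.8.8.8 is only a probe address; verify the result matches the interface on your cluster's data-plane subnet):
IFACE=$(ip route get 8.8.8.8 | awk '{for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit }}')
export NCCL_SOCKET_IFNAME="$IFACE"
export GLOO_SOCKET_IFNAME="$IFACE"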
Tokenizer
You can assign the path of an existing tokenizer to the TOKENIZER_MODEL variable, as shown in the following example. If the tokenizer is not found locally, it is downloaded automatically when publicly available.
The training script uses the HuggingFaceTokenizer. Set TOKENIZER_MODEL to the appropriate Hugging Face model path.
TOKENIZER_MODEL="meta-llama/Llama-3.1-8B"
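Because meta-llama/Llama-3.1-8B is a gated model, your Hugging Face API token must have access to it. As an optional check, you can log in and pre-fetch the tokenizer; this sketch assumes the transformers package is available in the container image:
huggingface-cli login # paste your Hugging Face API token when prompted
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B')"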
Dataset options
You can use either mock data or real data for training. Mock data can be useful for testing and validation. Use the MOCK_DATA variable to toggle between mock and real data. The default value is 1 (enabled).
MOCK_DATA=1
If you’re using a real dataset, update the DATA_PATH variable to point to the location of your dataset.
MOCK_DATA=0
DATA_PATH="/data/bookcorpus_text_sentence" # Change to where your dataset is stored
Ensure that the files are accessible inside the Podman container.
Download the dataset
For Llama models, use the prepare_dataset.sh https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/llama script to prepare your dataset. To download the dataset, set the DATASET variable to the dataset you’d like to use. Three datasets are supported: DATASET=wiki, DATASET=fineweb, and DATASET=bookcorpus.
DATASET=wiki TOKENIZER_MODEL=NousResearch/Llama-2-7b-chat-hf bash examples/llama/prepare_dataset.sh #for wiki-en dataset
DATASET=bookcorpus TOKENIZER_MODEL=NousResearch/Llama-2-7b-chat-hf bash examples/llama/prepare_dataset.sh #for bookcorpus dataset
TOKENIZER_MODEL can be any accessible Hugging Face tokenizer.
NOTE: When training, set DATA_PATH to the file name prefix of the .bin and .idx files, as in the following example:
DATA_PATH="data/bookcorpus_text_sentence" # Change to where your dataset is stored.
Multi-node configuration
If you’re running multi-node training, update the following environment variables. They can also be passed as command line arguments. Refer to the following example configurations.
Change localhost to the master node’s hostname:
MASTER_ADDR="${MASTER_ADDR:-localhost}"
Set the number of nodes you want to train on (for instance, 2, 4, 8):
NNODES="${NNODES:-1}"
Set the rank of each node (0 for master, 1 for the first worker node, and so on):
NODE_RANK="${NODE_RANK:-0}"
Set DATA_CACHE_PATH to a common directory accessible by all the nodes (for example, an NFS directory) for multi-node runs:
DATA_CACHE_PATH=/root/cache # Set to a common directory for multi-node runs
For multi-node runs, make sure the correct network drivers are installed on the nodes. If inside a container, either install the drivers inside the Podman container or pass the network drivers from the host while creating the Podman container.
Specify which RDMA interfaces to use for communication:
export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
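To discover the RDMA device names available on a node, you can list them with ibv_devices from rdma-core (device naming varies by system):
ibv_devices # prints one row per RDMA device, e.g. rdma0 ... rdma7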
Run training
Use the following example commands to set up the environment, configure key options, and run training on MI300X series accelerators with the AMD Megatron-LM environment.
Single Node Training
To run training on a single node for Llama 3.1 8B FP8, navigate to the Megatron-LM folder and use the following command.
TEE_OUTPUT=1 \
MBS=2 \
BS=128 \
TP=1 \
TE_FP8=1 \
SEQ_LENGTH=8192 \
MODEL_SIZE=8 \
TOTAL_ITERS=50 \
bash examples/llama/train_llama3.sh
For Llama 3.1 8B BF16, use the following command:
TEE_OUTPUT=1 \
MBS=2 \
BS=128 \
TP=1 \
TE_FP8=0 \
SEQ_LENGTH=8192 \
MODEL_SIZE=8 \
TOTAL_ITERS=50 \
bash examples/llama/train_llama3.sh
Multi-node training examples
To run training on multiple nodes, launch the container on each node. For example, for Llama 3 with a two-node setup (NODE0 as the master node), use these commands.
On the master node NODE0:
TEE_OUTPUT=1 \
MBS=2 \
BS=256 \
TP=1 \
TE_FP8=1 \
SEQ_LENGTH=8192 \
MODEL_SIZE=8 \
MASTER_ADDR=IP_NODE0 \
NNODES=2 \
NODE_RANK=0 \
bash examples/llama/train_llama3.sh
On the worker node NODE1:
TEE_OUTPUT=1 \
MBS=2 \
BS=256 \
TP=1 \
TE_FP8=1 \
SEQ_LENGTH=8192 \
MODEL_SIZE=8 \
MASTER_ADDR=IP_NODE0 \
NNODES=2 \
NODE_RANK=1 \
bash examples/llama/train_llama3.sh
Key Points
The benchmark tests support the following variables.
TEE_OUTPUT
1 to enable training logs or 0 to disable.
TE_FP8
0 for BF16 or 1 for FP8 – 0 by default.
GEMM_TUNING
1 to enable GEMM tuning, which boosts performance by using the best GEMM kernels.
USE_FLASH_ATTN
1 to enable Flash Attention.
FSDP
1 to enable PyTorch FSDP2. If FSDP is enabled, --use-distributed-optimizer, --overlap-param-gather, and --sequence-parallel are automatically disabled.
ENABLE_PROFILING
1 to enable PyTorch profiling for performance analysis.
transformer-impl
transformer_engine to use the Transformer Engine (TE) or local to disable TE.
MODEL_SIZE
For example, 8B or 70B for Llama 3 and 3.1, or 7B or 70B for Llama 2.
TOTAL_ITERS
The total number of iterations – 10 by default.
MOCK_DATA
1 to use mock data or 0 to use real data you provide.
MBS
Micro batch size.
BS
Global batch size.
TP / TP_SIZE
Tensor parallel (1, 2, 4, 8). TP is disabled when FSDP is turned on.
EP / EP_SIZE
Expert parallel for MoE models.
SEQ_LENGTH
Input sequence length.
PR
Precision for training. bf16 for BF16 (default) or fp8 for FP8 GEMMs.
AC
Activation checkpointing (none, sel, or full) – sel by default.
NUM_LAYERS
Use a reduced number of layers as a proxy model.
RECOMPUTE_NUM_LAYERS
Number of layers used for checkpointing recompute.
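As an illustration, several of these variables can be combined in a single invocation. The values below are examples, not tuned settings:
TEE_OUTPUT=1 GEMM_TUNING=1 USE_FLASH_ATTN=1 AC=sel \
MBS=2 BS=128 TP=1 TE_FP8=0 SEQ_LENGTH=8192 MODEL_SIZE=8 TOTAL_ITERS=10 \
bash examples/llama/train_llama3.sh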