How To Run Megatron on Plano Slurm Cluster
The examples below use the 1CN96C8G1H_4IB_MI250_Ubuntu22 Slurm partition, which consists of MI250 compute nodes.
Setup Megatron Environment Using Conda
Allocate and SSH to a node from the partition 1CN96C8G1H_4IB_MI250_Ubuntu22
salloc --exclusive --mem=0 --gres=gpu:8 -p 1CN96C8G1H_4IB_MI250_Ubuntu22
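After salloc returns, you can look up the node assigned to the job and SSH to it (a sketch; squeue output columns and compute-node SSH policy vary by site):
squeue -u $USER    # the NODELIST column shows the assigned node
ssh <node-name>    # substitute the node name printed by squeue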
In the SSH session, load the ROCm 6.1.2 environment
module load rocm-6.1.2
Load the Anaconda environment
module load anaconda3/4.12.0
. $CONDA_ROOT/etc/profile.d/conda.sh
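To confirm the module loaded correctly, conda should now be on your PATH:
conda --version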
Create a conda environment named megatron with Python 3.8
conda create -n megatron python=3.8 -y
Activate the conda environment
conda activate megatron
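A quick sanity check that the new environment is active and provides Python 3.8:
python --version    # should print Python 3.8.x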
Install the PyTorch wheel built for ROCm 6.1.2
pip3 install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1.2
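Before continuing, it is worth verifying that the wheel imports and can see the eight GCDs (on ROCm builds of PyTorch, the torch.cuda API is backed by HIP, so these calls apply unchanged):
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"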
Clone and install Apex from source
cd $HOME
git clone https://github.com/ROCmSoftwarePlatform/apex.git
cd apex/
python setup.py install --cpp_ext --cuda_ext
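A minimal import check confirms the compiled extensions are usable; instantiating FusedLayerNorm triggers loading of the fused kernels built with --cuda_ext:
python3 -c "import torch; from apex.normalization import FusedLayerNorm; FusedLayerNorm(8); print('apex OK')"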
Install the latest DeepSpeed
pip3 install --no-cache-dir deepspeed
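DeepSpeed ships a ds_report utility that summarizes the detected PyTorch/ROCm installation and which ops are compatible; it is a convenient post-install check:
ds_report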
Install Megatron dependencies
pip3 install --no-cache-dir six
pip3 install --no-cache-dir regex
pip3 install --no-cache-dir pybind11
Create a directory for checkpoints
mkdir -p $HOME/checkpoints/megatron
Clone the Megatron-DeepSpeed Repository
cd $HOME
git clone https://github.com/microsoft/Megatron-DeepSpeed.git
cd $HOME/Megatron-DeepSpeed/
git checkout 1f640c00c115eee9cd8515db80f8c92a4c24e9ca
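To confirm the pinned commit is checked out:
git log -1 --oneline    # should start with 1f640c0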
Set Up the pretrain_gpt_distributed_slurm.sh Example Script
Make a copy of $HOME/Megatron-DeepSpeed/examples/pretrain_gpt_distributed.sh
cp $HOME/Megatron-DeepSpeed/examples/pretrain_gpt_distributed.sh $HOME/Megatron-DeepSpeed/examples/pretrain_gpt_distributed_slurm.sh
Edit the newly copied script $HOME/Megatron-DeepSpeed/examples/pretrain_gpt_distributed_slurm.sh
vim $HOME/Megatron-DeepSpeed/examples/pretrain_gpt_distributed_slurm.sh
and give it the following content, noting that:
- The training data is the Megatron-LM-v1.1.5-ZeRO3 set, available on the shared storage under /shareddata/DeepSpeed/Megatron-LM-v1.1.5-ZeRO3/
- BASE_PATH points to the directory /shareddata/DeepSpeed/Megatron-LM-v1.1.5-ZeRO3/
- DATA_PATH points to $BASE_PATH/my-gpt2_text_document
- VOCAB_FILE points to $BASE_PATH/gpt2-vocab.json
- MERGE_FILE points to $BASE_PATH/gpt2-merges.txt
- CHECKPOINT_PATH points to the directory created earlier, $HOME/checkpoints/megatron
#!/bin/bash
## Runs the "345M" parameter model

# Use the first node in the Slurm allocation as the rendezvous master
node_list=$(scontrol show hostnames $SLURM_JOB_NODELIST)
node_array=(${node_list})
master_node=${node_array[0]}
GPUS_PER_NODE=${SLURM_GPUS_ON_NODE}
# Change for multinode config
MASTER_ADDR=${master_node}
NNODES=${SLURM_NNODES}
NODE_RANK=${SLURM_NODEID}
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
BASE_PATH=/shareddata/DeepSpeed/Megatron-LM-v1.1.5-ZeRO3/
DATA_PATH=$BASE_PATH/my-gpt2_text_document
VOCAB_FILE=$BASE_PATH/gpt2-vocab.json
MERGE_FILE=$BASE_PATH/gpt2-merges.txt
CHECKPOINT_PATH=$HOME/checkpoints/megatron
DISTRIBUTED_ARGS="--nproc_per_node=$GPUS_PER_NODE --nnodes=$NNODES --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:29400"
torchrun $DISTRIBUTED_ARGS \
pretrain_gpt.py \
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--micro-batch-size 8 \
--global-batch-size 64 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--train-iters 1000 \
--lr-decay-iters 100 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 0.00015 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--lr-warmup-fraction .01 \
--checkpoint-activations \
--log-interval 100 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--fp16
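Before launching, you can confirm the shared dataset files are readable from the node (my-gpt2_text_document is a path prefix; Megatron reads the matching .bin and .idx files):
ls /shareddata/DeepSpeed/Megatron-LM-v1.1.5-ZeRO3/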
Single-Node Multi-GPU/GCD Megatron Test Example
Allocate and SSH to a node from the partition 1CN96C8G1H_4IB_MI250_Ubuntu22
salloc --exclusive --mem=0 --gres=gpu:8 -p 1CN96C8G1H_4IB_MI250_Ubuntu22
In the SSH session, load the ROCm 6.1.2 environment
module load rocm-6.1.2
Load the Anaconda environment
module load anaconda3/4.12.0
. $CONDA_ROOT/etc/profile.d/conda.sh
Activate the conda environment megatron
conda activate megatron
Change to the $HOME/Megatron-DeepSpeed directory
cd $HOME/Megatron-DeepSpeed
Run pretrain_gpt_distributed_slurm.sh on 1 node
srun -N 1 -n 1 bash $HOME/Megatron-DeepSpeed/examples/pretrain_gpt_distributed_slurm.sh
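While the job runs, training progress (loss, learning rate, iteration time) is printed every --log-interval iterations. In a second shell on the node you can watch GPU/GCD utilization with ROCm's monitoring tool:
watch -n 2 rocm-smi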
Multinode Multi-GPU/GCD Megatron Test Example
Allocate and SSH to a node from a 2-node allocation in the partition 1CN96C8G1H_4IB_MI250_Ubuntu22
salloc --exclusive --mem=0 --gres=gpu:8 -p 1CN96C8G1H_4IB_MI250_Ubuntu22 -N 2
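Inside the allocation, you can confirm both nodes were granted before launching:
scontrol show hostnames $SLURM_JOB_NODELIST    # should list two hosts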
In the SSH session, load the ROCm 6.1.2 environment
module load rocm-6.1.2
Load the Anaconda environment
module load anaconda3/4.12.0
. $CONDA_ROOT/etc/profile.d/conda.sh
Activate the conda environment megatron
conda activate megatron
Change to the $HOME/Megatron-DeepSpeed directory
cd $HOME/Megatron-DeepSpeed
Change --global-batch-size to 128 in the example script pretrain_gpt_distributed_slurm.sh
vim $HOME/Megatron-DeepSpeed/examples/pretrain_gpt_distributed_slurm.sh
Run pretrain_gpt_distributed_slurm.sh on 2 nodes
srun -N 2 -n 2 bash $HOME/Megatron-DeepSpeed/examples/pretrain_gpt_distributed_slurm.sh
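If both nodes joined the rendezvous, Megatron's startup output should report a world size of 16 (2 nodes × 8 GCDs). Assuming you captured stdout to a file such as pretrain.log (the filename here is just an example), you can check with:
grep -i "world size" pretrain.log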