How To Build and Run NanoGPT on AAC Slurm Cluster

https://github.com/karpathy/nanoGPT

Examples below use the 256C8G1H_MI355X_Ubuntu22 Slurm partition, which has MI355X compute nodes.

Setup Environment

Allocate an Ubuntu 8-GPU MI355X workload and SSH to a node from the 256C8G1H_MI355X_Ubuntu22 partition

salloc --exclusive --mem=0 --gres=gpu:8 -p 256C8G1H_MI355X_Ubuntu22

In the SSH session, load the ROCm environment module

module load rocm-<ROCm Module>
example: module load rocm-6.4.2

Load the anaconda3 modulefile

module load anaconda3/<anaconda3 module>
example: module load anaconda3/25.5.1

Source the conda.sh file

. $CONDA_ROOT/etc/profile.d/conda.sh

Create a conda environment named nano with Python 3.10.12

conda create -n nano python=3.10.12 -y 

If conda reports that the channel Terms of Service must be accepted, accept them and run the create command again

conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
conda create -n nano python=3.10.12 -y

Activate the newly created nano environment

conda activate nano

Install the Stable PyTorch Release

https://pytorch.org/get-started/locally/

pip3 install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.4.2

Install nanoGPT dependencies

https://github.com/karpathy/nanoGPT#install

pip3 install --no-cache-dir numpy transformers datasets tiktoken wandb tqdm

NanoGPT Examples

Clone the nanoGPT GitHub repository

git clone https://github.com/karpathy/nanoGPT.git

Change to the nanoGPT directory

cd nanoGPT/

The following steps can be found on the official nanoGPT page:

https://github.com/karpathy/nanoGPT#quick-start

Prepare the Shakespeare character-level dataset

python3 data/shakespeare_char/prepare.py
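Conceptually, the prepare.py step builds a character-level vocabulary from the tiny Shakespeare corpus, encodes the text as integer ids, and splits it into train and validation sets (the real script also downloads the corpus and writes the ids to train.bin and val.bin as uint16). A minimal sketch of the encoding idea, using an illustrative snippet of text rather than the downloaded corpus:

```python
# Sketch of the character-level encoding done by prepare.py.
# Illustrative only: the real script handles download and .bin output.

def build_char_codec(text):
    chars = sorted(set(text))                      # the vocabulary
    stoi = {ch: i for i, ch in enumerate(chars)}   # char -> int
    itos = {i: ch for ch, i in stoi.items()}       # int -> char
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)
    return encode, decode, len(chars)

text = "First Citizen:\nBefore we proceed any further, hear me speak."
encode, decode, vocab_size = build_char_codec(text)
ids = encode(text)

# 90/10 train/val split, like the real prepare script.
n = int(0.9 * len(ids))
train_ids, val_ids = ids[:n], ids[n:]
```

Round-tripping decode(encode(text)) recovers the original text, which is the property the trained model's samples rely on.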

Train a GPT on the Shakespeare character-level dataset

python3 train.py config/train_shakespeare_char.py
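Each training step draws a random batch of fixed-length blocks from the encoded ids, and the target at every position is simply the next character. A pure-Python sketch of that batching idea (the real train.py returns torch tensors read from the memory-mapped .bin files):

```python
import random

def get_batch(ids, block_size, batch_size, rng=None):
    """Sample random (input, target) blocks for next-character prediction.

    Conceptual sketch of nanoGPT's batch sampling; illustrative, not the
    actual implementation.
    """
    rng = rng or random.Random(0)
    xs, ys = [], []
    for _ in range(batch_size):
        i = rng.randrange(0, len(ids) - block_size)
        xs.append(ids[i : i + block_size])           # model input
        ys.append(ids[i + 1 : i + 1 + block_size])   # same block shifted by one
    return xs, ys

ids = list(range(100))  # stand-in for the encoded corpus
xs, ys = get_batch(ids, block_size=8, batch_size=4)
```

The shift by one is what makes this a next-token objective: for every sampled block, ys[k] is the character that follows xs[k].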

Sample from the trained model

python3 sample.py --out_dir=out-shakespeare-char
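sample.py generates text autoregressively, scaling the model's logits by a temperature and optionally keeping only the top-k candidates before drawing each token. A minimal pure-Python sketch of one such sampling step (the real script does this on torch tensors inside the model's generate loop):

```python
import math
import random

def sample_logits(logits, temperature=0.8, top_k=2, rng=None):
    """Draw one index from `logits` using temperature and top-k filtering.

    Conceptual sketch of the sampling step in nanoGPT's sample.py;
    illustrative, not the actual implementation.
    """
    rng = rng or random.Random(0)
    # Keep only the top-k logits; mask the rest out entirely.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(ranked[:top_k])
    scaled = [logits[i] / temperature if i in keep else float("-inf")
              for i in range(len(logits))]
    # Softmax over the scaled logits (subtract the max for stability).
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sample one index according to the resulting distribution.
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

With top_k=1 this reduces to greedy decoding; higher temperatures flatten the distribution and make the samples more varied.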

Reproducing GPT-2

https://github.com/karpathy/nanoGPT#reproducing-gpt-2

Prepare the OpenWebText dataset

python3 data/openwebtext/prepare.py

Train GPT-2 on the OpenWebText dataset using torchrun, one process per GPU

torchrun --nnodes=1 --standalone --nproc_per_node=8 train.py config/train_gpt2.py
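torchrun launches 8 processes, one per MI355X GPU, and nanoGPT wraps the model in DistributedDataParallel: each rank computes gradients on its own data shard, the gradients are all-reduced (averaged) across ranks, and every rank then applies the same update. A toy pure-Python sketch of that averaging idea, fitting a scalar w to y = 3x (illustrative only; the real all-reduce happens in the collective-communication library, not in Python):

```python
# Conceptual sketch of data-parallel training across 8 "GPUs":
# per-rank gradients on disjoint shards, averaged each step so all
# ranks stay in sync. Toy model: minimize mean((w*x - y)^2).

def local_gradient(w, shard):
    # d/dw of mean((w*x - y)^2) over this rank's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(grads):
    # stand-in for the all-reduce performed by the comms library
    return sum(grads) / len(grads)

data = [(float(x), 3.0 * x) for x in range(1, 17)]  # y = 3x
shards = [data[r::8] for r in range(8)]             # one shard per rank

w, lr = 0.0, 0.005
for _ in range(200):
    grads = [local_gradient(w, s) for s in shards]  # per-rank backward pass
    g = allreduce_mean(grads)                       # synchronized gradient
    w -= lr * g                                     # identical update on every rank
```

Because the shards are equal-sized, the averaged gradient equals the full-batch gradient, so the 8 synchronized workers converge exactly as one large-batch worker would (here, w converges to 3).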