How To Build and Run NanoGPT on AAC Slurm Cluster
https://github.com/karpathy/nanoGPT
Examples below use the 256C8G1H_MI355X_Ubuntu22 Slurm partition, which has MI355X compute nodes.
Setup Environment
Allocate an Ubuntu 8-GPU MI355X workload and SSH to a node from the partition 256C8G1H_MI355X_Ubuntu22
salloc --exclusive --mem=0 --gres=gpu:8 -p 256C8G1H_MI355X_Ubuntu22
In the SSH session, load ROCm
module load rocm-<ROCm Module>
example: module load rocm-6.4.2
Load the anaconda3 modulefile
module load anaconda3/<anaconda3 module>
example: module load anaconda3/25.5.1
Source the conda.sh file
. $CONDA_ROOT/etc/profile.d/conda.sh
Create a conda environment named nano with Python 3.10.12
conda create -n nano python=3.10.12 -y
If the command fails with a Terms of Service error, accept the ToS for the default channels and run the create command again
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
conda create -n nano python=3.10.12 -y
Activate the newly created nano environment
conda activate nano
Install the stable PyTorch release
https://pytorch.org/get-started/locally/
pip3 --no-cache-dir install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.4.2
Install nanoGPT dependencies
- https://github.com/karpathy/nanoGPT#install
pip3 install --no-cache-dir numpy transformers datasets tiktoken wandb tqdm
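After the installs finish, it is worth a quick sanity check that the ROCm build of PyTorch can see the GPUs. A minimal sketch, assuming it is run inside the activated nano environment (on ROCm wheels, torch.version.hip is set and the torch.cuda.* API is backed by HIP):

```python
# Hedged sanity check for a ROCm PyTorch install.
# Assumption: run inside the activated "nano" conda environment.
def check_torch() -> str:
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    # On ROCm wheels, torch.version.hip holds the HIP version string.
    hip = getattr(torch.version, "hip", None)
    return (f"torch {torch.__version__}, hip={hip}, "
            f"gpus_visible={torch.cuda.device_count()}")

if __name__ == "__main__":
    print(check_torch())
```

On an 8-GPU MI355X node with a healthy install, gpus_visible should report 8.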
NanoGPT Examples
Clone the nanoGPT GitHub repository
git clone https://github.com/karpathy/nanoGPT.git
Change to the nanoGPT directory
cd nanoGPT/
The following steps can be found on the official nanoGPT page:
https://github.com/karpathy/nanoGPT#quick-start
Prepare the Shakespeare character-level dataset
python3 data/shakespeare_char/prepare.py
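The prepare script builds a character-level vocabulary over the input text, encodes the text to integer ids, and writes a train/val split plus vocabulary metadata. A simplified stdlib-only sketch of that encoding idea (the real data/shakespeare_char/prepare.py writes binary token files and a metadata pickle; function names here are illustrative):

```python
# Simplified sketch of character-level dataset preparation.
# Illustrative only; the upstream prepare.py persists the encoded
# splits and vocab to disk rather than returning them.
def prepare_chars(text: str, train_frac: float = 0.9):
    chars = sorted(set(text))                  # character vocabulary
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    ids = [stoi[ch] for ch in text]            # encode text -> int ids
    n = int(train_frac * len(ids))             # simple 90/10 split
    return ids[:n], ids[n:], stoi, itos

def decode(ids, itos):
    return "".join(itos[i] for i in ids)
```

Encoding then decoding round-trips the original text, which is a handy property to check when adapting the script to a new corpus.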
Train a GPT on the Shakespeare character data
python3 train.py config/train_shakespeare_char.py
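The config file passed to train.py is plain Python whose assignments override the training defaults (nanoGPT's configurator.py essentially exec()s the file over the script's globals). A toy sketch of that override mechanism, with made-up default names for illustration rather than nanoGPT's exact settings:

```python
# Toy sketch of nanoGPT-style config overriding: a config file is just
# Python assignments exec()'d over the defaults.
# Assumption: default names below are illustrative, not the real ones.
defaults = {"batch_size": 12, "learning_rate": 6e-4, "max_iters": 600000}

def apply_config(defaults: dict, config_text: str) -> dict:
    cfg = dict(defaults)
    exec(config_text, {}, cfg)  # assignments in the file override defaults
    # keep only keys that were already known defaults
    return {k: cfg[k] for k in defaults}

config_text = "batch_size = 64\nmax_iters = 5000\n"
cfg = apply_config(defaults, config_text)
```

Anything not assigned in the config file keeps its default value, which is why the train_shakespeare_char.py config only needs to list the handful of settings it changes.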
Sample from the trained model
python3 sample.py --out_dir=out-shakespeare-char
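sample.py draws tokens autoregressively from the model's output distribution, with a temperature and an optional top-k cutoff. A stdlib-only sketch of that per-token sampling step (the real script does this with torch tensors on the GPU):

```python
import math, random

# Sketch of temperature + top-k sampling over one logits vector.
# sample.py applies the same idea for each generated token, in torch.
def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # keep only the top_k highest logits; mask the rest out
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

Lower temperatures sharpen the distribution toward the most likely tokens; top_k=1 degenerates to greedy decoding.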
Reproducing GPT-2
https://github.com/karpathy/nanoGPT#reproducing-gpt-2
Prepare the OpenWebText dataset
python3 data/openwebtext/prepare.py
Train GPT-2 on the OpenWebText dataset using torchrun
torchrun --nnodes=1 --standalone --nproc_per_node=8 train.py config/train_gpt2.py
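torchrun launches nproc_per_node copies of train.py and hands each process its identity through environment variables (RANK, LOCAL_RANK, WORLD_SIZE, plus MASTER_ADDR/MASTER_PORT for rendezvous); nanoGPT's train.py inspects these to decide whether it is running under DDP. A stdlib sketch of that detection, with an illustrative structure rather than nanoGPT's exact code:

```python
import os

# Sketch: detect a torchrun/DDP launch from the env vars torchrun sets.
# Assumption: the returned dict shape is illustrative, not nanoGPT's API.
def ddp_info(env=os.environ):
    rank = int(env.get("RANK", -1))
    if rank == -1:  # RANK absent -> not launched via torchrun
        return {"ddp": False, "rank": 0, "local_rank": 0, "world_size": 1}
    return {
        "ddp": True,
        "rank": rank,                                  # global rank
        "local_rank": int(env.get("LOCAL_RANK", 0)),   # GPU index on this node
        "world_size": int(env.get("WORLD_SIZE", 1)),   # total processes
    }
```

With --nproc_per_node=8 on one node, each of the eight processes sees WORLD_SIZE=8 and a distinct LOCAL_RANK in 0..7, which is how each process picks its own GPU.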