
How To Build and Run NanoGPT on AAC Slurm Cluster

https://github.com/karpathy/nanoGPT

Prerequisites

An AMD GPU with HIP (Heterogeneous-computing Interface for Portability) support. HIP is required because PyTorch, which nanoGPT is built on, relies on HIP to run its GPU kernels on AMD hardware.

To build and run NanoGPT on the AAC Slurm cluster, follow these steps:

  1. Allocate a node on the AAC cluster using Slurm.
  2. Create a dockerfile in a working directory.
  3. Build a Podman image from the dockerfile.
  4. Run the Podman container to prepare the dataset and train the NanoGPT model (the image bundles NanoGPT and its dependencies).
  5. Sample from the trained model.
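Before starting, a quick preflight check can confirm that the tools and device nodes the steps above depend on are actually present. This is a minimal sketch: the tool names (salloc, podman, git) and device paths (/dev/kfd, /dev/dri) are the ones used later in this guide.

```shell
# Preflight check (sketch): verifies the CLI tools and ROCm device nodes
# that the steps below assume. Reports findings; does not abort.
status=0
for tool in salloc podman git; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "OK      $tool found"
    else
        echo "MISSING $tool not on PATH"
        status=1
    fi
done
for dev in /dev/kfd /dev/dri; do
    if [ -e "$dev" ]; then
        echo "OK      $dev present"
    else
        echo "MISSING $dev (needed for GPU passthrough)"
        status=1
    fi
done
echo "preflight done (status=$status)"
```

Note that /dev/kfd and /dev/dri only exist on the GPU node itself, so run this after the allocation, not on the login node.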

Setup Environment

Allocate a node from the partition

salloc -p <partition_name> --exclusive --mem=0 -w <node_name>

example : salloc -p 256C8G1H_MI355X_Ubuntu22 --exclusive --mem=0 -w gpu-21 

Inside the allocated node, pull the Docker base image with Podman

podman pull docker.io/rocm/pytorch:latest-release

Create a dockerfile in Working Directory

Create a working directory

mkdir <directory_name>

example : mkdir NanoGPT

Create a dockerfile in it

cd <directory_name>
touch dockerfile

Build a Podman image from the dockerfile

Open the dockerfile in an editor

vim dockerfile

dockerfile

FROM rocm/pytorch:latest-release

# Log: Setting the working directory
WORKDIR /app
# Log: Working directory set to /app

# Log: Installing additional dependencies
RUN echo "--- LOG: Starting apt update and install git ---" \
    && apt update \
    && apt install -y git \
    && echo "--- LOG: Finished installing git and dependencies ---"

# Log: Cloning NanoGPT repository
RUN echo "--- LOG: Starting git clone of nanoGPT ---" \
    && git clone https://github.com/karpathy/nanoGPT.git . \
    && echo "--- LOG: Finished cloning nanoGPT into /app ---"

# Log: Installing Python dependencies
RUN echo "--- LOG: Starting pip dependency installation ---" \
    && pip install --upgrade pip \
    && pip install "numpy<2" transformers datasets tiktoken wandb tqdm \
    && echo "--- LOG: Finished installing Python dependencies ---"

# Log: Running data preparation script
RUN echo "--- LOG: Starting data preparation script for shakespeare_char ---" \
    && python data/shakespeare_char/prepare.py \
    && echo "--- LOG: Finished data preparation script ---"

# Log: Debugging: list contents of the data directory to verify train.bin exists
RUN echo "--- LOG: Verifying contents of data/shakespeare_char/ ---" \
    && ls -l data/shakespeare_char/ \
    && echo "--- LOG: Finished directory listing ---"

# Log: Disabling torch.compile in default config
RUN echo "--- LOG: Disabling torch.compile in train_shakespeare_char.py ---" \
    && sed -i 's/compile = True/compile = False/' config/train_shakespeare_char.py \
    && echo "--- LOG: Finished modifying config file ---"
# Log: Setting a non-root user
RUN echo "--- LOG: Creating and setting up non-root user appuser ---" \
    && useradd -m appuser && chown -R appuser /app \
    && echo "--- LOG: Finished setting up appuser ---"
RUN mkdir -p /app/out-shakespeare-char && chown -R appuser:appuser /app/out-shakespeare-char

USER appuser
# Log: User set to appuser

# Log: Setting the default command
CMD ["python", "train.py", "config/train_shakespeare_char.py"]
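The sed step in the dockerfile rewrites the default training config so it does not attempt torch.compile. The substitution can be sanity-checked on its own, outside the image, against a throwaway file:

```shell
# Stand-alone check of the sed substitution used in the dockerfile above.
# Writes a throwaway config line, applies the same pattern, prints the result.
tmpcfg="$(mktemp)"
echo "compile = True" > "$tmpcfg"
sed -i 's/compile = True/compile = False/' "$tmpcfg"
result="$(cat "$tmpcfg")"
echo "$result"    # compile = False
rm -f "$tmpcfg"
```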

Build a Podman image using the dockerfile

podman build -f /path/to/Dockerfile -t <podman_image_name> /path/to/context

example : podman build -f /shared/devtest/home/sukesh_kalla_qle/NanoGPT/dockerfile -t nanogpt-rocm /shared/devtest/home/sukesh_kalla_qle/NanoGPT

Run the Podman container with the GPU devices attached

podman run --rm -it --device=/dev/kfd --device=/dev/dri <podman_image_name> /bin/bash

example : podman run --rm -it --device=/dev/kfd --device=/dev/dri localhost/nanogpt-rocm /bin/bash
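Once inside the container, a quick check that PyTorch actually sees the GPU through HIP can save a failed training run. This is a sketch that degrades gracefully if torch is missing from the environment:

```python
# GPU visibility check (sketch). Inside the rocm/pytorch container this should
# report a HIP version and at least one visible device; elsewhere it just
# reports what is missing.
import importlib.util

torch_present = importlib.util.find_spec("torch") is not None
if not torch_present:
    print("torch is not installed in this environment")
else:
    import torch
    # torch.version.hip is set on ROCm builds of PyTorch, None on CUDA builds
    print("HIP build:", getattr(torch.version, "hip", None))
    print("GPU visible:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device 0:", torch.cuda.get_device_name(0))
```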

Train the model, then sample from it (inside the container)

python train.py config/train_shakespeare_char.py --compile=False
python sample.py --out_dir=out-shakespeare-char
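For non-interactive runs, the interactive salloc/podman session can be replaced by a Slurm batch job. This is a hedged sketch: the partition name, image name, and training command are the examples used above and may differ on your cluster.

```shell
# Write a Slurm batch script that runs training non-interactively (sketch).
# Partition and image names are the examples from this guide; adjust as needed.
cat > nanogpt.sbatch <<'EOF'
#!/bin/bash
#SBATCH --partition=256C8G1H_MI355X_Ubuntu22
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --job-name=nanogpt-train

podman run --rm --device=/dev/kfd --device=/dev/dri \
    localhost/nanogpt-rocm \
    python train.py config/train_shakespeare_char.py --compile=False
EOF
echo "wrote nanogpt.sbatch; submit with: sbatch nanogpt.sbatch"
```

Submitting with sbatch queues the job instead of holding an interactive shell, which is handy for long training runs.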