
How To Build and Run NanoGPT on AMD Accelerator Cloud (AAC) Slurm Cluster

https://github.com/karpathy/nanoGPT

Prerequisites

See Bare metal prerequisites for access, partition naming, and common software. For HIP and ROCm, see the Glossary. You need an allocated node with Podman (e.g. from a partition such as 256C8G1H_MI355X_Ubuntu22).

Steps to build and run NanoGPT on the AAC Slurm cluster

  1. Allocate a node on the AAC cluster using Slurm.
  2. Create a Dockerfile in a working directory.
  3. Build a Podman image from the Dockerfile (the build clones NanoGPT, installs its dependencies, and prepares the dataset).
  4. Run the Podman container and train the NanoGPT model.
  5. Sample from the trained model.
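
The five steps above condense to the following command outline (angle-bracket placeholders, mirroring the commands detailed in the sections below; a sketch, not literal copy-paste):

```
salloc -p <partition_name> --exclusive --mem=0 -w <node_name>     # 1. allocate a node
mkdir NanoGPT && cd NanoGPT                                       # 2. write the Dockerfile here
podman build -f dockerfile -t nanogpt-rocm .                      # 3. build the image
podman run --rm -it --device=/dev/kfd --device=/dev/dri \
    localhost/nanogpt-rocm /bin/bash                              # 4. start the container
python train.py config/train_shakespeare_char.py --compile=False  # 5. train, then sample
python sample.py --out_dir=out-shakespeare-char
```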

Setup environment

Allocate a node from the partition

salloc -p <partition_name> --exclusive --mem=0 -w <node_name>

example : salloc -p 256C8G1H_MI355X_Ubuntu22 --exclusive --mem=0 -w gpu-21 

On the allocated node, pull the Docker base image with Podman

podman pull docker.io/rocm/pytorch:latest-release

Create a Dockerfile in the working directory

Create a working directory

mkdir <directory_name>

example : mkdir NanoGPT

Create a dockerfile in it

cd <directory_name>
touch dockerfile

Write the Dockerfile

Edit the dockerfile and add the following contents

vim dockerfile

dockerfile

FROM rocm/pytorch:latest-release

# Log: Setting the working directory
WORKDIR /app
# Log: Working directory set to /app

# Log: Installing additional dependencies
RUN echo "--- LOG: Starting apt update and install git ---" \
    && apt update \
    && apt install -y git \
    && echo "--- LOG: Finished installing git and dependencies ---"

# Log: Cloning NanoGPT repository
RUN echo "--- LOG: Starting git clone of nanoGPT ---" \
    && git clone https://github.com/karpathy/nanoGPT.git . \
    && echo "--- LOG: Finished cloning nanoGPT into /app ---"

# Log: Installing Python dependencies
RUN echo "--- LOG: Starting pip dependency installation ---" \
    && pip install --upgrade pip \
    && pip install "numpy<2" transformers datasets tiktoken wandb tqdm \
    && echo "--- LOG: Finished installing Python dependencies ---"

# Log: Running data preparation script
RUN echo "--- LOG: Starting data preparation script for shakespeare_char ---" \
    && python data/shakespeare_char/prepare.py \
    && echo "--- LOG: Finished data preparation script ---"

# Log: Debugging: list contents of the data directory to verify train.bin exists
RUN echo "--- LOG: Verifying contents of data/shakespeare_char/ ---" \
    && ls -l data/shakespeare_char/ \
    && echo "--- LOG: Finished directory listing ---"

# Log: Disabling torch.compile in default config
RUN echo "--- LOG: Disabling torch.compile in train_shakespeare_char.py ---" \
    && sed -i 's/compile = True/compile = False/' config/train_shakespeare_char.py \
    && echo "--- LOG: Finished modifying config file ---"
# Log: Setting a non-root user
RUN echo "--- LOG: Creating and setting up non-root user appuser ---" \
    && useradd -m appuser && chown -R appuser /app \
    && echo "--- LOG: Finished setting up appuser ---"
RUN mkdir -p /app/out-shakespeare-char && chown -R appuser:appuser /app/out-shakespeare-char

USER appuser
# Log: User set to appuser

# Log: Setting the default command
CMD ["python", "train.py", "config/train_shakespeare_char.py"]
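
For orientation, the data/shakespeare_char/prepare.py step baked into the image does roughly the following: build a character vocabulary from the raw text, encode every character to an integer id, and write a 90/10 train/val split (the real script downloads the Shakespeare text and writes uint16 .bin files plus a meta.pkl). A simplified sketch, not the actual script:

```python
def prepare_char_dataset(text):
    """Sketch of nanoGPT's char-level prep: map each distinct character
    to an integer id, encode the text, and split 90/10 into train/val
    (the real script writes these as uint16 .bin files)."""
    chars = sorted(set(text))                      # the vocabulary
    stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
    itos = {i: ch for ch, i in stoi.items()}       # id -> char
    data = [stoi[c] for c in text]
    n = int(0.9 * len(data))                       # 90% train, 10% val
    return {"vocab_size": len(chars), "train": data[:n],
            "val": data[n:], "itos": itos}

meta = prepare_char_dataset("First Citizen: Speak!")
print(meta["vocab_size"])  # -> 16
```

The vocab_size it derives is what sets the size of the model's embedding table, which is why prepare.py must run before training.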

Build a Podman image using the Dockerfile

podman build -f /path/to/Dockerfile -t <podman_image_name> /path/to/context

example : podman build -f /shared/devtest/home/sukesh_kalla_qle/NanoGPT/dockerfile -t nanogpt-rocm /shared/devtest/home/sukesh_kalla_qle/NanoGPT

Run the Podman container

The --device flags pass the AMD GPU device nodes (/dev/kfd and /dev/dri) through to the container; the dataset was already prepared during the image build.

podman run --rm -it --device=/dev/kfd --device=/dev/dri <podman_image_name> /bin/bash

example : podman run --rm -it --device=/dev/kfd --device=/dev/dri localhost/nanogpt-rocm /bin/bash
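
Before launching, it can be worth confirming that the device nodes the --device flags refer to actually exist on the host; a small check (an illustrative helper, printing one line per device):

```shell
# Report whether each AMD GPU device node is present on this host.
for dev in /dev/kfd /dev/dri; do
    if [ -e "$dev" ]; then
        echo "found $dev"
    else
        echo "missing $dev"
    fi
done
```

If either node is missing, the container will start but PyTorch will not see the GPU.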

Inside the container, train the model, then sample from it

python train.py config/train_shakespeare_char.py --compile=False
python sample.py --out_dir=out-shakespeare-char
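
sample.py autoregressively draws one token at a time from the trained model; the core of each step is temperature-scaled softmax sampling, which in isolation looks roughly like this (a sketch of the idea, not nanoGPT's actual code):

```python
import math
import random

def sample_next(logits, temperature=1.0, rng=random):
    """Scale logits by temperature, softmax them into probabilities,
    and draw one token id. Lower temperature -> greedier choices."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()                       # inverse-CDF draw
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# Very low temperature makes the highest logit all but certain:
print(sample_next([5.0, 1.0, 0.5], temperature=0.05))  # -> 0
```

sample.py exposes this knob as its --temperature flag; values below 1.0 make the generated text more repetitive and deterministic, values above 1.0 more varied.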