How To Build and Run NanoGPT on AAC Slurm Cluster
https://github.com/karpathy/nanoGPT
Prerequisites
An AMD GPU with ROCm/HIP support. HIP (Heterogeneous-computing Interface for Portability) is required to run nanoGPT on an AMD GPU because PyTorch, which nanoGPT is built on, relies on HIP to translate its GPU operations for AMD hardware.
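A quick way to confirm that the ROCm build of PyTorch can see the AMD GPU through HIP (a minimal check that can be run on the node or inside the container; torch.version.hip is None on non-ROCm builds of PyTorch):
python -c "import torch; print('HIP version:', torch.version.hip); print('GPU available:', torch.cuda.is_available())"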
To build and run NanoGPT on the AAC Slurm cluster, follow these steps:
- Allocate a node on the AAC cluster using Slurm.
- Create a dockerfile in a working directory.
- Build a Podman image from the dockerfile (the build installs NanoGPT and its dependencies and prepares the dataset).
- Run the Podman container to train the NanoGPT model.
- Sample from the trained model.
Set Up the Environment
Allocate a node from the partition
salloc -p <partition_name> --exclusive --mem=0 -w <node_name>
example : salloc -p 256C8G1H_MI355X_Ubuntu22 --exclusive --mem=0 -w gpu-21
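If you are unsure which partitions or nodes are available, or want to confirm that the allocation is active, the standard Slurm query commands can help (a quick sketch; the partition and node names shown on your cluster will differ):
sinfo
squeue -u $USER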
Inside the allocated node, pull the Docker base image with Podman
podman pull docker.io/rocm/pytorch:latest-release
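To confirm the pull succeeded, list the local images (the image ID and exact tag depend on what rocm/pytorch:latest-release currently points to):
podman images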
Create a dockerfile in the Working Directory
Create a working directory
mkdir <directory_name>
example : mkdir NanoGPT
Create a dockerfile in it
cd <directory_name>
touch dockerfile
Build a Podman image from the dockerfile
Edit the dockerfile and add the following content
vim dockerfile
dockerfile
FROM rocm/pytorch:latest-release
# Log: Setting the working directory
WORKDIR /app
# Log: Working directory set to /app
# Log: Installing additional dependencies
RUN echo "--- LOG: Starting apt update and install git ---" \
&& apt update \
&& apt install -y git \
&& echo "--- LOG: Finished installing git and dependencies ---"
# Log: Cloning NanoGPT repository
RUN echo "--- LOG: Starting git clone of nanoGPT ---" \
&& git clone https://github.com/karpathy/nanoGPT.git . \
&& echo "--- LOG: Finished cloning nanoGPT into /app ---"
# Log: Installing Python dependencies
RUN echo "--- LOG: Starting pip dependency installation ---" \
&& pip install --upgrade pip \
&& pip install "numpy<2" transformers datasets tiktoken wandb tqdm \
&& echo "--- LOG: Finished installing Python dependencies ---"
# Log: Running data preparation script
RUN echo "--- LOG: Starting data preparation script for shakespeare_char ---" \
&& python data/shakespeare_char/prepare.py \
&& echo "--- LOG: Finished data preparation script ---"
# Log: Debugging: list contents of the data directory to verify train.bin exists
RUN echo "--- LOG: Verifying contents of data/shakespeare_char/ ---" \
&& ls -l data/shakespeare_char/ \
&& echo "--- LOG: Finished directory listing ---"
# Log: Disabling torch.compile in default config
RUN echo "--- LOG: Disabling torch.compile in train_shakespeare_char.py ---" \
&& sed -i 's/compile = True/compile = False/' config/train_shakespeare_char.py \
&& echo "--- LOG: Finished modifying config file ---"
# Log: Setting a non-root user
RUN echo "--- LOG: Creating and setting up non-root user appuser ---" \
&& useradd -m appuser && chown -R appuser /app \
&& echo "--- LOG: Finished setting up appuser ---"
RUN mkdir -p /app/out-shakespeare-char && chown -R appuser:appuser /app/out-shakespeare-char
USER appuser
# Log: User set to appuser
# Log: Setting the default command
CMD ["python", "train.py", "config/train_shakespeare_char.py"]
Build a Podman Image using the dockerfile
podman build -f /path/to/Dockerfile -t <podman_image_name> /path/to/context
example : podman build -f /shared/devtest/home/sukesh_kalla_qle/NanoGPT/dockerfile -t nanogpt-rocm /shared/devtest/home/sukesh_kalla_qle/NanoGPT
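Once the build finishes, the new image should appear in the local image list (the grep filter is just for convenience; the tag matches whatever was passed to -t):
podman images | grep nanogpt-rocm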
Run the Podman container with the GPU devices passed through (the dataset was already prepared during the image build)
podman run --rm -it --device=/dev/kfd --device=/dev/dri <podman_image_name> /bin/bash
example : podman run --rm -it --device=/dev/kfd --device=/dev/dri localhost/nanogpt-rocm /bin/bash
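Inside the interactive shell you can verify that the GPU is visible before training (rocm-smi normally ships with the ROCm base image; if it is not on PATH, the PyTorch check from the prerequisites works as well):
rocm-smi
Because the image's CMD already launches training, the container can also be started non-interactively. Note that --rm discards the container, and with it the checkpoint written to /app/out-shakespeare-char, once it exits; mounting a host directory over that path keeps the output (a sketch, assuming the host directory exists and is writable by the container user; on SELinux-enabled hosts a :Z suffix on the volume may be needed):
podman run --rm --device=/dev/kfd --device=/dev/dri -v $PWD/out-shakespeare-char:/app/out-shakespeare-char localhost/nanogpt-rocm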
Train the model inside the container, then sample from the trained model
python train.py config/train_shakespeare_char.py --compile=False
python sample.py --out_dir=out-shakespeare-char
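Both scripts accept --key=value overrides of their configuration variables on the command line. For a quick smoke test, the run can be shortened and the sampling output adjusted like this (the values below are arbitrary illustrations, not tuned settings):
python train.py config/train_shakespeare_char.py --compile=False --max_iters=500 --eval_interval=100
python sample.py --out_dir=out-shakespeare-char --num_samples=3 --max_new_tokens=200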