How To Build and Run NanoGPT on AMD Accelerator Cloud (AAC) Slurm Cluster
https://github.com/karpathy/nanoGPT
Prerequisites
See Bare metal prerequisites for access, partition naming, and common software. For HIP and ROCm, see the Glossary. You need an allocated node with Podman (e.g. from a partition such as 256C8G1H_MI355X_Ubuntu22).
Steps to build and run NanoGPT on the AAC Slurm cluster
- Allocate a node on the AAC cluster using Slurm.
- Create a dockerfile in the working directory.
- Build a Podman image from the dockerfile (the image includes NanoGPT, its dependencies, and the prepared dataset).
- Run the Podman container and train the NanoGPT model.
- Sample from the trained model.
Setup environment
Allocate a node from the partition
salloc -p <partition_name> --exclusive --mem=0 -w <node_name>
example : salloc -p 256C8G1H_MI355X_Ubuntu22 --exclusive --mem=0 -w gpu-21
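The allocation command can be assembled from variables so that the same partition and node names are easy to review and reuse. This is a minimal sketch: PARTITION, NODE, and ALLOC_CMD are illustrative names (not part of Slurm or AAC), and the values are copied from the example above. The sketch only prints the command; it does not request a node.

```shell
# Illustrative variables (hypothetical names); values come from the example above.
PARTITION="256C8G1H_MI355X_Ubuntu22"
NODE="gpu-21"

# Assemble the allocation command as a string so it can be reviewed before running.
ALLOC_CMD="salloc -p ${PARTITION} --exclusive --mem=0 -w ${NODE}"
echo "${ALLOC_CMD}"

# Report whether Slurm's salloc is available on this machine.
if command -v salloc >/dev/null 2>&1; then
    echo "salloc found; run the command above to request the node"
else
    echo "salloc not found; this machine is probably not a Slurm login node"
fi
```

Once the allocation is granted, squeue -u "$USER" lists it together with the node it was placed on.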
On the allocated node, pull the Docker base image with Podman
podman pull docker.io/rocm/pytorch:latest-release
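After the pull, it can help to confirm that the base image is in local storage before starting a build. The sketch below uses podman image exists, which exits with status 0 when the image is present. The helper name base_image_status and its three return strings are assumptions made for illustration.

```shell
BASE_IMAGE="docker.io/rocm/pytorch:latest-release"

# Hypothetical helper: prints "present", "missing", or "no-podman".
base_image_status() {
    if ! command -v podman >/dev/null 2>&1; then
        echo "no-podman"
    elif podman image exists "$1" 2>/dev/null; then
        echo "present"
    else
        echo "missing"
    fi
}

STATUS="$(base_image_status "${BASE_IMAGE}")"
echo "base image ${BASE_IMAGE}: ${STATUS}"
```

A result of "missing" on the allocated node suggests the pull did not complete and should be repeated.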
Create a dockerfile in the working directory
Create a working directory
mkdir <directory_name>
example : mkdir NanoGPT
Create an empty dockerfile inside it
cd <directory_name>
touch dockerfile
Build a Podman image from the dockerfile
Open the dockerfile in an editor and add the contents shown below
vim dockerfile
dockerfile
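# Fully qualified name, matching the image pulled earlier, so Podman does not have to resolve a short name
FROM docker.io/rocm/pytorch:latest-release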
# Log: Setting the working directory
WORKDIR /app
# Log: Working directory set to /app
# Log: Installing additional dependencies
RUN echo "--- LOG: Starting apt update and install git ---" \
&& apt update \
&& apt install -y git \
&& echo "--- LOG: Finished installing git and dependencies ---"
# Log: Cloning NanoGPT repository
RUN echo "--- LOG: Starting git clone of nanoGPT ---" \
&& git clone https://github.com/karpathy/nanoGPT.git . \
&& echo "--- LOG: Finished cloning nanoGPT into /app ---"
# Log: Installing Python dependencies
RUN echo "--- LOG: Starting pip dependency installation ---" \
&& pip install --upgrade pip \
&& pip install "numpy<2" transformers datasets tiktoken wandb tqdm \
&& echo "--- LOG: Finished installing Python dependencies ---"
# Log: Running data preparation script
RUN echo "--- LOG: Starting data preparation script for shakespeare_char ---" \
&& python data/shakespeare_char/prepare.py \
&& echo "--- LOG: Finished data preparation script ---"
# Log: Debugging: list contents of the data directory to verify train.bin exists
RUN echo "--- LOG: Verifying contents of data/shakespeare_char/ ---" \
&& ls -l data/shakespeare_char/ \
&& echo "--- LOG: Finished directory listing ---"
# Log: Disabling torch.compile in default config
RUN echo "--- LOG: Disabling torch.compile in train_shakespeare_char.py ---" \
&& sed -i 's/compile = True/compile = False/' config/train_shakespeare_char.py \
&& echo "--- LOG: Finished modifying config file ---"
# Log: Setting a non-root user
RUN echo "--- LOG: Creating and setting up non-root user appuser ---" \
&& useradd -m appuser && chown -R appuser /app \
&& echo "--- LOG: Finished setting up appuser ---"
RUN mkdir -p /app/out-shakespeare-char && chown -R appuser:appuser /app/out-shakespeare-char
USER appuser
# Log: User set to appuser
# Log: Setting the default command (torch.compile disabled explicitly, matching the manual training command)
CMD ["python", "train.py", "config/train_shakespeare_char.py", "--compile=False"]
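The sed step in the dockerfile only changes a file that contains the literal text compile = True. The sketch below reproduces the same substitution on throwaway files so its effect can be checked before relying on it. The file contents and the names demo_cfg and other_cfg are made up for illustration and are not the actual nanoGPT config; GNU sed (as found in the Ubuntu-based image) is assumed for the -i option.

```shell
# Throwaway file standing in for a config (contents are illustrative only).
demo_cfg="$(mktemp)"
printf 'batch_size = 64\ncompile = True\n' > "${demo_cfg}"

# Same substitution as in the dockerfile.
sed -i 's/compile = True/compile = False/' "${demo_cfg}"

# Count the lines that now carry the disabled setting.
disabled_count="$(grep -c '^compile = False$' "${demo_cfg}")"
echo "lines with compile = False: ${disabled_count}"

# A file without the literal text is left unchanged by the same command.
other_cfg="$(mktemp)"
printf '# compile = False\n' > "${other_cfg}"
sed -i 's/compile = True/compile = False/' "${other_cfg}"
unchanged="$(cat "${other_cfg}")"
echo "second file: ${unchanged}"

# Clean up the temporary files.
rm -f "${demo_cfg}" "${other_cfg}"
```

Inside the image, grep -n compile config/train_shakespeare_char.py shows whether the substitution matched anything. Passing --compile=False on the command line, as in the training step, does not depend on it.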
Build a Podman image using the dockerfile
podman build -f /path/to/dockerfile -t <podman_image_name> /path/to/context
example : podman build -f /shared/devtest/home/sukesh_kalla_qle/NanoGPT/dockerfile -t nanogpt-rocm /shared/devtest/home/sukesh_kalla_qle/NanoGPT
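When the build is started from inside the working directory, the dockerfile path and the build context can both be derived from the current directory. This is a sketch under stated assumptions: IMAGE_NAME, CONTEXT_DIR, DOCKERFILE_PATH, BUILD_CMD, and the RUN_BUILD switch are illustrative names, and the build runs only when RUN_BUILD=1 is set, so the default behaviour is a dry run that prints the command.

```shell
IMAGE_NAME="nanogpt-rocm"                 # image tag from the example above
CONTEXT_DIR="${PWD}"                      # assumes the current directory is the working directory
DOCKERFILE_PATH="${CONTEXT_DIR}/dockerfile"

# Print the full command first so paths can be checked.
BUILD_CMD="podman build -f ${DOCKERFILE_PATH} -t ${IMAGE_NAME} ${CONTEXT_DIR}"
echo "${BUILD_CMD}"

# Hypothetical opt-in switch: build only when RUN_BUILD=1 is set explicitly.
if [ "${RUN_BUILD:-0}" = "1" ]; then
    if [ -f "${DOCKERFILE_PATH}" ] && command -v podman >/dev/null 2>&1; then
        podman build -f "${DOCKERFILE_PATH}" -t "${IMAGE_NAME}" "${CONTEXT_DIR}" \
            && podman images "${IMAGE_NAME}"
    else
        echo "podman or ${DOCKERFILE_PATH} not found; build skipped"
    fi
else
    echo "dry run; set RUN_BUILD=1 to build"
fi
```

After a successful build, podman images lists the image as localhost/nanogpt-rocm, which is the name used in the run step.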
Run the Podman container with GPU access and open a shell
podman run --rm -it --device=/dev/kfd --device=/dev/dri <podman_image_name> /bin/bash
example : podman run --rm -it --device=/dev/kfd --device=/dev/dri localhost/nanogpt-rocm /bin/bash
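The two --device options pass the host's GPU device nodes into the container, so the container can only use the GPU if those nodes exist on the host. The sketch below checks for them before the container is started. The helper name check_gpu_devices and the GPU_READY variable are assumptions made for illustration.

```shell
# Hypothetical helper: report whether each GPU device node exists on this host.
# Returns the number of missing nodes, so 0 means both were found.
check_gpu_devices() {
    missing=0
    for dev in /dev/kfd /dev/dri; do
        if [ -e "${dev}" ]; then
            echo "found ${dev}"
        else
            echo "missing ${dev}"
            missing=$((missing + 1))
        fi
    done
    return "${missing}"
}

if check_gpu_devices; then
    GPU_READY="yes"
else
    GPU_READY="no"
fi
echo "GPU device nodes ready: ${GPU_READY}"
```

Inside the container, python -c "import torch; print(torch.cuda.is_available())" is a quick follow-up check; ROCm builds of PyTorch expose AMD GPUs through the torch.cuda API.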
Train the model and sample from it (inside the container)
python train.py config/train_shakespeare_char.py --compile=False
python sample.py --out_dir=out-shakespeare-char
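Before a full training run, a shortened run can confirm that training, checkpointing, and sampling all work end to end. The sketch below overrides a few nanoGPT settings on the command line; the specific values (200 iterations, evaluation every 100) are arbitrary choices for a smoke test, not recommendations, and TRAIN_ARGS, SAMPLE_ARGS, and OUT_DIR are illustrative names. It runs only when train.py and sample.py are in the current directory, as they are in /app inside the container.

```shell
OUT_DIR="out-shakespeare-char"

# Reduced settings for a quick smoke test (values are arbitrary assumptions).
TRAIN_ARGS="config/train_shakespeare_char.py --compile=False --max_iters=200 --eval_interval=100 --eval_iters=20 --log_interval=10"
SAMPLE_ARGS="--out_dir=${OUT_DIR} --num_samples=1 --max_new_tokens=200"

echo "python train.py ${TRAIN_ARGS}"
echo "python sample.py ${SAMPLE_ARGS}"

if [ -f train.py ] && [ -f sample.py ]; then
    # The variables are left unquoted on purpose so each option becomes its own argument.
    python train.py ${TRAIN_ARGS} && python sample.py ${SAMPLE_ARGS}
else
    echo "train.py or sample.py not found; run this from /app inside the container"
fi
```

The text sampled after so few iterations will be poor; the point is only to see a checkpoint written to the output directory and a sample printed without errors.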