SGLang Inference on AAC

This guide explains how to run SGLang inference workloads for large language models on AMD Accelerator Cloud (AAC) clusters.

Overview

SGLang is a fast serving framework for large language models and vision-language models that runs on AMD Instinct GPUs through ROCm. This guide focuses on running inference benchmarks for models such as DeepSeek-R1-Distill-Qwen-32B.

Prerequisites

Access to an AAC cluster (MI325X or MI355X)
Hugging Face account and access token (for gated models)
Basic familiarity with SLURM commands

Supported hardware

AMD Instinct MI355X GPUs (MI355X cluster)
AMD Instinct MI325X GPUs (MI325X cluster)
ROCm 7.2.0 with SGLang

Single-node Inference

Step 1: Allocate a compute node

# For the MI325X cluster
salloc -p 256C8G1H_MI325X_Ubuntu22 --gres=gpu:8 --mem=0 --exclusive --account=<ACCOUNT_NAME>

# For the MI355X cluster
salloc -p 256C8G1H_MI355X_Ubuntu22 --gres=gpu:8 --mem=0 --exclusive --account=<ACCOUNT_NAME>

Example:

salloc -p 256C8G1H_MI355X_Ubuntu22 --gres=gpu:8 --mem=0 --exclusive --account=myteam
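
Once salloc returns, Slurm exports variables such as SLURM_JOB_ID and SLURM_JOB_NODELIST into the new shell. The sketch below uses them to confirm that the shell is inside an allocation before continuing. The helper name in_allocation is hypothetical and not part of any AAC tooling.

```shell
# Hypothetical helper: succeeds only when the shell is inside a Slurm allocation.
# SLURM_JOB_ID is exported by Slurm for shells started through salloc.
in_allocation() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_allocation; then
    echo "Inside allocation ${SLURM_JOB_ID} on nodes: ${SLURM_JOB_NODELIST:-unknown}"
else
    echo "No Slurm allocation detected; run salloc first" >&2
fi
```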

Step 2: Load ROCm environment

module load rocm/7.2.0

Step 3: Pull and run Podman container

podman pull docker.io/lmsysorg/sglang:v0.4.5-rocm630

podman run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v /shared/data:/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace/hf_cache \
    docker.io/lmsysorg/sglang:v0.4.5-rocm630
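
Before starting the container, it can help to confirm that the device nodes passed with --device exist on the host. The following is a minimal sketch; missing_paths is a hypothetical helper, and the only paths it assumes are the two already used in the podman command above.

```shell
# Hypothetical pre-flight helper: print each path that does not exist.
# Returns non-zero if any path is missing.
missing_paths() {
    status=0
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p"
            status=1
        fi
    done
    return "$status"
}

# /dev/kfd and /dev/dri are the devices handed to podman in this step.
if missing_paths /dev/kfd /dev/dri; then
    echo "GPU device nodes are present"
else
    echo "GPU device nodes not found; check that you are on a compute node" >&2
fi
```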

Step 4: Clone MAD repository

git clone https://github.com/ROCm/MAD
cd MAD/scripts/sglang

Step 5: Set Hugging Face token (if needed)

export HF_TOKEN=<your_personal_hf_token>
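
Typing the token inline leaves it in the shell history. One alternative, sketched here under assumptions, is to read it from a file with restricted permissions. The helper load_hf_token and the path /workspace/.hf_token are hypothetical; substitute whatever location your team uses.

```shell
# Hypothetical helper: export HF_TOKEN from a file instead of typing it inline.
load_hf_token() {
    token_file=$1
    if [ ! -r "$token_file" ]; then
        echo "cannot read $token_file" >&2
        return 1
    fi
    # Strip spaces, tabs, and line endings around the token.
    HF_TOKEN=$(tr -d ' \t\r\n' < "$token_file")
    export HF_TOKEN
}

# The path below is an assumed example, not a location defined by AAC.
load_hf_token /workspace/.hf_token || echo "HF_TOKEN not loaded" >&2
```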

Step 6: Run inference benchmarks

Latency Test (8 GPUs, bfloat16):

./sglang_benchmark_report.sh \
    -s latency \
    -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    -g 8 \
    -d bfloat16

Throughput Test (8 GPUs, bfloat16, random dataset):

./sglang_benchmark_report.sh -s throughput \
    -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    -g 8 -d bfloat16 -a random
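
The two invocations above differ only in the scenario and the dataset flag. As a sketch, a small helper can assemble both command lines from the same inputs. benchmark_cmd is a hypothetical name, it uses only the flags shown in this guide, and it prints the commands rather than running them so they can be reviewed first.

```shell
# Hypothetical helper: print the benchmark command for one scenario.
# args: scenario, model ID, GPU count, data type
benchmark_cmd() {
    scenario=$1
    model=$2
    gpus=$3
    dtype=$4
    cmd="./sglang_benchmark_report.sh -s $scenario -m $model -g $gpus -d $dtype"
    # This guide passes -a random only for the throughput scenario.
    if [ "$scenario" = "throughput" ]; then
        cmd="$cmd -a random"
    fi
    echo "$cmd"
}

# Dry run: print both commands without executing them.
for s in latency throughput; do
    benchmark_cmd "$s" deepseek-ai/DeepSeek-R1-Distill-Qwen-32B 8 bfloat16
done
```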

View results

Results are saved to:

./reports_bfloat16/summary/DeepSeek-R1-Distill-Qwen-32B_throughput_report.csv
./reports_bfloat16/summary/DeepSeek-R1-Distill-Qwen-32B_latency_report.csv
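
Both paths above appear to follow the pattern reports_<dtype>/summary/<model name>_<scenario>_report.csv. Assuming that pattern holds for other models and data types, which this guide does not confirm, the sketch below derives the path and previews the file if it exists. report_path is a hypothetical helper.

```shell
# Hypothetical helper: build the report path from the pattern seen above.
# args: model ID, data type, scenario
report_path() {
    model=$1
    dtype=$2
    scenario=$3
    # ${model##*/} drops the "org/" prefix of the Hugging Face model ID.
    echo "./reports_${dtype}/summary/${model##*/}_${scenario}_report.csv"
}

report=$(report_path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B bfloat16 latency)
if [ -f "$report" ]; then
    # Show the header and the first few rows.
    head -n 5 "$report"
else
    echo "report not found yet: $report"
fi
```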

Multi-Node Inference

Step 1: Allocate multiple nodes

# Allocate 2 nodes on MI355X cluster
salloc -N 2 \
  -p 256C8G1H_MI355X_Ubuntu22 \
  --gres=gpu:8 \
  --mem=0 \
  --exclusive \
  --ntasks-per-node=8 \
  --account=<ACCOUNT_NAME>

Example:

salloc -N 2 -p 256C8G1H_MI355X_Ubuntu22 --gres=gpu:8 --mem=0 --exclusive --ntasks-per-node=8 --account=myteam

Step 2: Create SLURM batch script

Create a file sglang_inference.sh:

#!/bin/bash
#SBATCH -J sglang_inference
#SBATCH -p 256C8G1H_MI355X_Ubuntu22
#SBATCH --gres=gpu:8
#SBATCH --mem=0
#SBATCH -N 2
#SBATCH --exclusive
#SBATCH --ntasks-per-node=8
#SBATCH --account=<ACCOUNT_NAME>

# Load ROCm
module load rocm/7.2.0

# Set environment variables
export HF_TOKEN=<your_token>
export HUGGINGFACE_HUB_CACHE=/shared/data/hf_cache

# Clone MAD if not already done
if [ ! -d "/shared/data/MAD" ]; then
    cd /shared/data
    git clone https://github.com/ROCm/MAD
fi

# Run multi-node inference
cd /shared/data/MAD/scripts/sglang

srun --container-image=docker://lmsysorg/sglang:v0.4.5-rocm630 \
  --container-mounts=/shared/data:/shared/data \
  --container-workdir=/shared/data/MAD/scripts/sglang \
  --container-env="HF_TOKEN,HUGGINGFACE_HUB_CACHE" \
  ./sglang_benchmark_report.sh -s throughput \
    -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    -g 16 -d bfloat16 -a random
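
The value -g 16 in the script corresponds to 2 nodes with 8 GPUs each. As a sketch, the count can be derived instead of hard-coded. total_gpus is a hypothetical helper; SLURM_JOB_NUM_NODES is set by Slurm inside a job, and the fallback of 2 is only for trying the snippet outside one.

```shell
# Hypothetical helper: total GPUs = nodes * GPUs per node.
total_gpus() {
    echo $(( $1 * $2 ))
}

# Fall back to 2 nodes when the variable is not set (outside a Slurm job).
nodes=${SLURM_JOB_NUM_NODES:-2}
echo "-g $(total_gpus "$nodes" 8)"
```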

Step 3: Submit the job

sbatch sglang_inference.sh
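
By default sbatch reports the new job with a line of the form "Submitted batch job <id>". The sketch below extracts the ID from such a line and builds the default log file name used in the next step. job_id_from_message is a hypothetical helper, and the message is a sample string rather than real output. If your Slurm version supports it, sbatch --parsable prints the ID directly.

```shell
# Hypothetical helper: take the last field of sbatch's confirmation line.
job_id_from_message() {
    echo "$1" | awk '{print $NF}'
}

# Sample message for illustration; in practice use: msg=$(sbatch sglang_inference.sh)
msg="Submitted batch job 12345"
job_id=$(job_id_from_message "$msg")
echo "log file: slurm-${job_id}.out"
```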

Step 4: Monitor the job

# Check job status
squeue -u $USER

# View output
tail -f slurm-<job_id>.out

Using MAD for Automated Benchmarking

MAD (Model Automation and Dashboarding) automates benchmark runs through the madengine command-line tool:

Step 1: Setup MAD

git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt

Step 2: Set credentials

export MAD_SECRETS_HFTOKEN="your_token"

Step 3: Run benchmark

madengine run \
    --tags pyt_sglang_deepseek-r1-distill-qwen-32b \
    --keep-model-dir \
    --live-output \
    --timeout 28800

Results

Results are saved to:

~/MAD/perf_DeepSeek-R1-Distill-Qwen-32B.csv

Supported Models

SGLang works with various models including:

DeepSeek: DeepSeek-R1-Distill-Qwen-32B, DeepSeek-V2
Llama: Llama 2, Llama 3, Llama 3.1
Qwen: Qwen 2.5 family
Mixtral: Mixtral 8x7B

Performance Optimization Tips

1. Tensor Parallelism

For large models, use tensor parallelism across multiple GPUs:

# 8 GPUs with tensor parallelism
./sglang_benchmark_report.sh -s throughput \
    -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    -g 8 -d bfloat16
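
As a rough sizing aid, tensor parallelism splits the model weights across the GPUs, so the weight memory per GPU is approximately parameters times bytes per parameter divided by the GPU count. The sketch below does this arithmetic for a nominal 32-billion-parameter model at 2 bytes per parameter. It covers weights only and ignores the KV cache, activations, and runtime overhead; weights_gb_per_gpu is a hypothetical helper.

```shell
# Hypothetical helper: approximate weight memory per GPU in GB (weights only).
# args: parameters in billions, bytes per parameter, number of GPUs
# Uses integer division, so results are rounded down.
weights_gb_per_gpu() {
    echo $(( $1 * $2 / $3 ))
}

# Nominal 32B parameters, bfloat16 (2 bytes each), sharded over 8 GPUs.
weights_gb_per_gpu 32 2 8
```

With these inputs the estimate is about 8 GB of weights per GPU, compared with roughly 64 GB on a single GPU.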

2. Data Type Selection

bfloat16: Best balance of speed and accuracy
float16: May be faster on some kernels, but its narrower exponent range can cause overflow and numerical instability
int8: Experimental quantization support
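
The data types above differ in storage width as well as numerics. As a small sketch, the mapping below records the bytes per parameter for each one: bfloat16 and float16 occupy the same space, so choosing between them is about numerical behavior, while int8 halves the weight footprint. bytes_per_param is a hypothetical helper, and whether the benchmark script accepts every value is not confirmed here.

```shell
# Hypothetical helper: bytes used to store one parameter for each data type.
bytes_per_param() {
    case "$1" in
        bfloat16|float16) echo 2 ;;
        int8)             echo 1 ;;
        *) echo "unknown dtype: $1" >&2; return 1 ;;
    esac
}

for dtype in bfloat16 float16 int8; do
    echo "$dtype: $(bytes_per_param "$dtype") byte(s) per parameter"
done
```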

3. Batch Size Tuning

Adjust batch size based on available memory:

# Modify in the benchmark script or use custom parameters

Troubleshooting

Out of memory errors

Increase the number of GPUs (-g) so the model weights are sharded across more devices
Use quantized models (int8, int4)
Reduce batch size or sequence length

Hugging Face authentication errors

Ensure the token is set correctly:

export HF_TOKEN=<your_token>

# Verify it's set
if [ -n "$HF_TOKEN" ]; then
    echo "HF_TOKEN is set"
else
    echo "HF_TOKEN is not set"
fi
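
A token can be set and still be malformed, for example when a placeholder was pasted by mistake. Hugging Face user access tokens normally begin with hf_, so a slightly stricter check is sketched below. looks_like_hf_token is a hypothetical helper; it tests only the prefix and does not contact Hugging Face or prove that the token is valid.

```shell
# Hypothetical helper: succeeds if the value starts with "hf_" followed by
# at least one more character. This checks the format only, not validity.
looks_like_hf_token() {
    case "$1" in
        hf_?*) return 0 ;;
        *)     return 1 ;;
    esac
}

if looks_like_hf_token "${HF_TOKEN:-}"; then
    echo "HF_TOKEN has the expected prefix"
else
    echo "HF_TOKEN is empty or does not start with hf_" >&2
fi
```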

Model download issues

If models fail to download:

# Set cache directory to shared storage
export HUGGINGFACE_HUB_CACHE=/shared/data/hf_cache

# Pre-download models manually
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
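
To see whether a model is already in the cache, note that the Hugging Face hub cache stores each model in a directory named models--<org>--<name>. The sketch below builds that directory name and checks for it. model_cache_dir is a hypothetical helper, and the fallback cache root is the shared path used elsewhere in this guide.

```shell
# Hypothetical helper: map a model ID to its directory in the hub cache.
# args: cache root, model ID (org/name)
model_cache_dir() {
    cache_root=$1
    model=$2
    # The hub cache replaces "/" in the model ID with "--".
    echo "${cache_root}/models--$(echo "$model" | sed 's|/|--|g')"
}

dir=$(model_cache_dir "${HUGGINGFACE_HUB_CACHE:-/shared/data/hf_cache}" \
    deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)
if [ -d "$dir" ]; then
    echo "model already cached at $dir"
else
    echo "model not cached yet: $dir"
fi
```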

Container permissions

If you get container permission errors on AAC, do not try to use sudo or add yourself to the Docker group. AAC supports containerized workloads through Podman (for interactive sessions) and Pyxis/Enroot (for batch jobs). See Using Enroot with Pyxis for more details.
