SGLang Inference on AAC

This guide explains how to run SGLang inference workloads for large language models on AMD Accelerator Cloud (AAC) clusters.

Overview

SGLang is a fast serving framework for large language models and vision language models, optimized for AMD Instinct GPUs. This guide focuses on running inference benchmarks for models like DeepSeek-R1-Distill-Qwen-32B.

Prerequisites

  • Access to an AAC cluster (MI325X or MI355X)
  • Hugging Face account and access token (for gated models)
  • Basic familiarity with SLURM commands

Supported hardware

  • AMD Instinct MI355X GPUs (MI355X cluster)
  • AMD Instinct MI325X GPUs (MI325X cluster)
  • ROCm 7.2.0 with SGLang

Single-Node Inference

Step 1: Allocate a compute node

# For the MI325X cluster
salloc -p 256C8G1H_MI325X_Ubuntu22 --gres=gpu:8 --mem=0 --exclusive --account=<ACCOUNT_NAME>

# For the MI355X cluster
salloc -p 256C8G1H_MI355X_Ubuntu22 --gres=gpu:8 --mem=0 --exclusive --account=<ACCOUNT_NAME>

Example:

salloc -p 256C8G1H_MI355X_Ubuntu22 --gres=gpu:8 --mem=0 --exclusive --account=myteam
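
Once salloc returns, it can be useful to confirm that the shell is actually inside the allocation before loading modules or starting containers. The following is a minimal sketch that relies on the SLURM_JOB_ID and SLURM_JOB_NODELIST variables SLURM exports inside an allocation; the function name in_allocation is hypothetical and not part of any AAC tooling.

```shell
# Hypothetical helper: report whether this shell is inside a SLURM allocation.
in_allocation() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        # SLURM sets these variables for shells spawned by salloc.
        echo "job ${SLURM_JOB_ID} on ${SLURM_JOB_NODELIST:-unknown}"
    else
        echo "no active SLURM allocation" >&2
        return 1
    fi
}

# Usage: in_allocation || echo "run salloc first"
```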

Step 2: Load ROCm environment

module load rocm/7.2.0

Step 3: Pull and run Podman container

podman pull docker.io/lmsysorg/sglang:v0.4.5-rocm630

podman run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v /shared/data:/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace/hf_cache \
    docker.io/lmsysorg/sglang:v0.4.5-rocm630
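
The podman run command passes /dev/kfd and /dev/dri into the container, so a missing device node on the host usually shows up as a confusing failure inside the container. A small pre-flight check run on the compute node before starting the container might look like the sketch below; check_gpu_devices is a hypothetical helper name, and the check only tests that the paths exist, not that the GPUs are healthy.

```shell
# Hypothetical helper: verify that every path given as an argument exists.
check_gpu_devices() {
    local missing=0
    local path
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            missing=1
        fi
    done
    # Returns 0 when all paths exist, 1 otherwise.
    return "$missing"
}

# Usage: check_gpu_devices /dev/kfd /dev/dri && echo "GPU device nodes present"
```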

Step 4: Clone MAD repository

git clone https://github.com/ROCm/MAD
cd MAD/scripts/sglang

Step 5: Set Hugging Face token (if needed)

export HF_TOKEN=<your_personal_hf_token>

Step 6: Run inference benchmarks

Latency Test (8 GPUs, bfloat16):

./sglang_benchmark_report.sh \
    -s latency \
    -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    -g 8 \
    -d bfloat16

Throughput Test (8 GPUs, bfloat16, random dataset):

./sglang_benchmark_report.sh -s throughput \
    -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    -g 8 -d bfloat16 -a random

View results

Results are saved to:

./reports_bfloat16/summary/DeepSeek-R1-Distill-Qwen-32B_throughput_report.csv
./reports_bfloat16/summary/DeepSeek-R1-Distill-Qwen-32B_latency_report.csv
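
To read a report in the terminal, the CSV can be aligned into columns. This is a minimal sketch: show_report is a hypothetical helper, it assumes the util-linux column tool may or may not be installed in the container, and it does not handle quoted fields that contain commas.

```shell
# Hypothetical helper: print a CSV report as aligned columns when possible.
show_report() {
    local report="$1"
    if [ ! -f "$report" ]; then
        echo "report not found: $report" >&2
        return 1
    fi
    if command -v column >/dev/null 2>&1; then
        # Split on commas and pad each field to a common width.
        column -s, -t < "$report"
    else
        # Fall back to the raw file if `column` is unavailable.
        cat "$report"
    fi
}

# Usage:
# show_report ./reports_bfloat16/summary/DeepSeek-R1-Distill-Qwen-32B_latency_report.csv
```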

Multi-Node Inference

Step 1: Allocate multiple nodes

# Allocate 2 nodes on MI355X cluster
salloc -N 2 \
  -p 256C8G1H_MI355X_Ubuntu22 \
  --gres=gpu:8 \
  --mem=0 \
  --exclusive \
  --ntasks-per-node=8 \
  --account=<ACCOUNT_NAME>

Example:

salloc -N 2 -p 256C8G1H_MI355X_Ubuntu22 --gres=gpu:8 --mem=0 --exclusive --ntasks-per-node=8 --account=myteam

Step 2: Create SLURM batch script

Create a file sglang_inference.sh:

#!/bin/bash
#SBATCH -J sglang_inference
#SBATCH -p 256C8G1H_MI355X_Ubuntu22
#SBATCH --gres=gpu:8
#SBATCH --mem=0
#SBATCH -N 2
#SBATCH --exclusive
#SBATCH --ntasks-per-node=8
#SBATCH --account=<ACCOUNT_NAME>

# Load ROCm
module load rocm/7.2.0

# Set environment variables
export HF_TOKEN=<your_token>
export HUGGINGFACE_HUB_CACHE=/shared/data/hf_cache

# Clone MAD if not already done
if [ ! -d "/shared/data/MAD" ]; then
    cd /shared/data
    git clone https://github.com/ROCm/MAD
fi

# Run multi-node inference
cd /shared/data/MAD/scripts/sglang

srun --container-image=docker://lmsysorg/sglang:v0.4.5-rocm630 \
  --container-mounts=/shared/data:/shared/data \
  --container-workdir=/shared/data/MAD/scripts/sglang \
  --container-env="HF_TOKEN,HUGGINGFACE_HUB_CACHE" \
  ./sglang_benchmark_report.sh -s throughput \
    -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    -g 16 -d bfloat16 -a random

Step 3: Submit the job

sbatch sglang_inference.sh
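
When the job ID is needed later, for example to follow the output file, it can be captured at submission time. The sketch below uses the sbatch --parsable option, which prints the job ID (optionally followed by ";" and a cluster name) instead of the usual message; submit_job is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: submit a batch script and print only the job ID.
submit_job() {
    local out
    # --parsable makes sbatch print "jobid" or "jobid;cluster".
    out=$(sbatch --parsable "$@") || return 1
    # Drop the optional ";cluster" suffix.
    echo "${out%%;*}"
}

# Usage:
# job_id=$(submit_job sglang_inference.sh)
# tail -f "slurm-${job_id}.out"
```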

Step 4: Monitor the job

# Check job status
squeue -u $USER

# View output
tail -f slurm-<job_id>.out
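
For unattended runs, the status check above can be wrapped in a polling loop. This is a rough sketch, assuming that a job which no longer appears in squeue output has finished (successfully or not); wait_for_job and its default 30-second interval are illustrative choices.

```shell
# Hypothetical helper: block until a job no longer appears in the queue.
wait_for_job() {
    local job_id="$1"
    local interval="${2:-30}"
    # -h suppresses the header, so empty output means the job has left the queue.
    while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "job $job_id is no longer in the queue"
}

# Usage: wait_for_job <job_id> 60
```

The function does not report the exit state; checking the slurm-<job_id>.out file afterwards is still needed to tell success from failure.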

Using MAD for Automated Benchmarking

MAD (Model Automation and Dashboarding) automates benchmark runs through the madengine command-line tool.

Step 1: Setup MAD

git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt

Step 2: Set credentials

export MAD_SECRETS_HFTOKEN="your_token"

Step 3: Run benchmark

madengine run \
    --tags pyt_sglang_deepseek-r1-distill-qwen-32b \
    --keep-model-dir \
    --live-output \
    --timeout 28800

Results

Results are saved to:

~/MAD/perf_DeepSeek-R1-Distill-Qwen-32B.csv

Supported Models

SGLang works with various models including:

  • DeepSeek: DeepSeek-R1-Distill-Qwen-32B, DeepSeek-V2
  • Llama: Llama 2, Llama 3, Llama 3.1
  • Qwen: Qwen 2.5 family
  • Mixtral: Mixtral 8x7B

Performance Optimization Tips

1. Tensor Parallelism

For large models, use tensor parallelism across multiple GPUs:

# 8 GPUs with tensor parallelism
./sglang_benchmark_report.sh -s throughput \
    -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    -g 8 -d bfloat16

2. Data Type Selection

  • bfloat16: Best balance of speed and accuracy
  • float16: Can be faster for some kernels, but its narrower exponent range may cause overflow and numerical stability issues
  • int8: Experimental quantization support
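
The data type also determines how much GPU memory the weights need. As a back-of-the-envelope sketch, assuming roughly 32 billion parameters for a 32B model, 2 bytes per parameter for bfloat16 or float16, 1 byte for int8, and an even split across GPUs under tensor parallelism, the per-GPU weight footprint can be estimated as follows. The helper name weights_gb_per_gpu is hypothetical, and the estimate ignores the KV cache, activations, and runtime overhead.

```shell
# Hypothetical helper: estimate weight memory per GPU in GB (10^9 bytes).
# $1 = parameters in billions, $2 = bytes per parameter, $3 = number of GPUs
weights_gb_per_gpu() {
    awk -v p="$1" -v b="$2" -v g="$3" 'BEGIN { printf "%.1f\n", p * b / g }'
}

# Usage:
# weights_gb_per_gpu 32 2 8   # bfloat16 on 8 GPUs
# weights_gb_per_gpu 32 1 8   # int8 on 8 GPUs
```

With these assumptions, bfloat16 weights come to about 64 GB in total, or about 8 GB per GPU on an 8-GPU node.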

3. Batch Size Tuning

Adjust batch size based on available memory:

# Modify in the benchmark script or use custom parameters

Troubleshooting

Out of memory errors

  1. Increase the number of GPUs (-g) so the model is sharded across more devices
  2. Use quantized models (int8, int4)
  3. Reduce batch size or sequence length

Hugging Face authentication errors

Ensure the token is set correctly:

export HF_TOKEN=<your_token>

# Verify it's set
if [ -n "$HF_TOKEN" ]; then
  echo "HF_TOKEN is set"
else
  echo "HF_TOKEN is not set"
fi

Model download issues

If models fail to download:

# Set cache directory to shared storage
export HUGGINGFACE_HUB_CACHE=/shared/data/hf_cache

# Pre-download models manually
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
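
After a manual download, it can help to confirm that the model landed in the shared cache rather than in a per-user location. The sketch below relies on the Hugging Face hub cache layout, in which each repository is stored under a directory named models--<org>--<name> with a snapshots subdirectory; model_cached is a hypothetical helper, and the check does not prove that every file finished downloading.

```shell
# Hypothetical helper: succeed if a model repo has a snapshots directory in the hub cache.
model_cached() {
    local repo="$1"
    local cache="${HUGGINGFACE_HUB_CACHE:-$HOME/.cache/huggingface/hub}"
    local dir
    # "org/name" becomes "models--org--name" in the cache.
    dir="$cache/models--$(printf '%s' "$repo" | sed 's|/|--|g')"
    [ -d "$dir/snapshots" ]
}

# Usage:
# model_cached deepseek-ai/DeepSeek-R1-Distill-Qwen-32B && echo "model found in cache"
```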

Container permissions

If you get container permission errors on AAC, do not try to use sudo or add yourself to the Docker group. AAC supports containerized workloads through Podman (for interactive sessions) and Pyxis/Enroot (for batch jobs). See Using Enroot with Pyxis for more details.

Storage Recommendations

  • Store models in /shared/data/hf_cache for multi-node access
  • Set HUGGINGFACE_HUB_CACHE=/shared/data/hf_cache to avoid re-downloading
  • Store benchmark results in $HOME for easy retrieval

Benchmark Metrics

Latency Metrics

  • Time to First Token (TTFT): Time until first output token
  • Time Per Output Token (TPOT): Average time per generated token
  • End-to-End Latency: Total request completion time
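
These three metrics are related: for a single request, end-to-end latency is approximately TTFT plus TPOT multiplied by the number of output tokens after the first. The sketch below illustrates that arithmetic; e2e_latency is a hypothetical helper, the inputs are in seconds, and real measurements will deviate because of queueing and scheduling effects.

```shell
# Hypothetical helper: estimate end-to-end latency in seconds.
# $1 = TTFT (s), $2 = TPOT (s), $3 = number of output tokens
e2e_latency() {
    awk -v ttft="$1" -v tpot="$2" -v n="$3" \
        'BEGIN { printf "%.3f\n", ttft + tpot * (n - 1) }'
}

# Usage: e2e_latency 0.5 0.02 101
```

For example, a TTFT of 0.5 s and a TPOT of 0.02 s over 101 output tokens gives an estimate of 2.5 s.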

Throughput Metrics

  • Tokens Per Second: Total tokens generated per second
  • Requests Per Second: Number of completed requests per second
  • GPU Utilization: Percentage of GPU compute used

External Resources