SGLang Inference on AAC
This guide explains how to run SGLang inference workloads for large language models on AMD Accelerator Cloud (AAC) clusters.
Overview
SGLang is a fast serving framework for large language models and vision-language models, optimized for AMD Instinct GPUs. This guide focuses on running inference benchmarks for models such as DeepSeek-R1-Distill-Qwen-32B.
Prerequisites
- Access to an AAC cluster (MI325X or MI355X)
- Hugging Face account and access token (for gated models)
- Basic familiarity with SLURM commands
Supported hardware
- AMD Instinct MI355X GPUs (MI355X cluster)
- AMD Instinct MI325X GPUs (MI325X cluster)
- Software stack: ROCm 7.2.0 with SGLang
Single-Node Inference
Step 1: Allocate a compute node
# For the MI325X cluster
salloc -p 256C8G1H_MI325X_Ubuntu22 --gres=gpu:8 --mem=0 --exclusive --account=<ACCOUNT_NAME>
# For the MI355X cluster
salloc -p 256C8G1H_MI355X_Ubuntu22 --gres=gpu:8 --mem=0 --exclusive --account=<ACCOUNT_NAME>
Example:
salloc -p 256C8G1H_MI355X_Ubuntu22 --gres=gpu:8 --mem=0 --exclusive --account=myteam
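After `salloc` returns, it can help to confirm that the shell is actually inside an allocation before continuing. The sketch below relies on the `SLURM_JOB_ID` and `SLURM_JOB_NODELIST` variables that Slurm exports inside an allocation; the helper name `in_allocation` is hypothetical and not part of any AAC tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only when running inside a Slurm allocation.
in_allocation() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "Inside job ${SLURM_JOB_ID} on ${SLURM_JOB_NODELIST:-unknown nodes}"
    else
        echo "No Slurm allocation detected; run salloc first" >&2
        return 1
    fi
}

# Report the current state without aborting the shell.
in_allocation || true
```

If the check fails, the later steps would run on a login node rather than on a compute node with GPUs.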
Step 2: Load ROCm environment
module load rocm/7.2.0
Step 3: Pull and run Podman container
podman pull docker.io/lmsysorg/sglang:v0.4.5-rocm630
podman run -it \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
--shm-size 16G \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
--cap-add=SYS_PTRACE \
-v /shared/data:/workspace \
--env HUGGINGFACE_HUB_CACHE=/workspace/hf_cache \
docker.io/lmsysorg/sglang:v0.4.5-rocm630
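Once inside the container, a quick check that the device nodes passed with `--device` are visible can save time before launching a benchmark. This is a minimal sketch: the helper name `check_gpu_devices` is hypothetical, and it only tests that the paths exist, not that the GPUs are usable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the given device paths exist.
# With no arguments it checks the two paths passed to podman run above.
check_gpu_devices() {
    local missing=0 dev
    if [ "$#" -eq 0 ]; then set -- /dev/kfd /dev/dri; fi
    for dev in "$@"; do
        if [ -e "$dev" ]; then
            echo "found $dev"
        else
            echo "missing $dev"
            missing=1
        fi
    done
    return "$missing"
}

check_gpu_devices || echo "GPU devices are not visible in this environment"
```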
Step 4: Clone MAD repository
git clone https://github.com/ROCm/MAD
cd MAD/scripts/sglang
Step 5: Set Hugging Face token (if needed)
export HF_TOKEN=<your_personal_hf_token>
Step 6: Run inference benchmarks
Latency Test (8 GPUs, bfloat16):
./sglang_benchmark_report.sh \
-s latency \
-m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
-g 8 \
-d bfloat16
Throughput Test (8 GPUs, bfloat16, random dataset):
./sglang_benchmark_report.sh -s throughput \
-m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
-g 8 -d bfloat16 -a random
View results
Results are saved to:
./reports_bfloat16/summary/DeepSeek-R1-Distill-Qwen-32B_throughput_report.csv
./reports_bfloat16/summary/DeepSeek-R1-Distill-Qwen-32B_latency_report.csv
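To get a quick look at the reports without opening them in a spreadsheet, something like the following may be enough. It makes no assumption about the column names in the CSV files; the helper name `summarize_report` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the header line and the number of data rows
# of a CSV report.
summarize_report() {
    local file="$1"
    if [ ! -f "$file" ]; then
        echo "report not found: $file" >&2
        return 1
    fi
    echo "columns: $(head -n 1 "$file")"
    echo "rows: $(awk 'NR > 1 { n++ } END { print n + 0 }' "$file")"
}

# Summarize every report in the summary directory, if any exist.
for report in ./reports_bfloat16/summary/*_report.csv; do
    if [ -f "$report" ]; then
        summarize_report "$report"
    fi
done
```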
Multi-Node Inference
Step 1: Allocate multiple nodes
# Allocate 2 nodes on MI355X cluster
salloc -N 2 \
-p 256C8G1H_MI355X_Ubuntu22 \
--gres=gpu:8 \
--mem=0 \
--exclusive \
--ntasks-per-node=8 \
--account=<ACCOUNT_NAME>
Example:
salloc -N 2 -p 256C8G1H_MI355X_Ubuntu22 --gres=gpu:8 --mem=0 --exclusive --ntasks-per-node=8 --account=myteam
Step 2: Create SLURM batch script
Create a file sglang_inference.sh:
#!/bin/bash
#SBATCH -J sglang_inference
#SBATCH -p 256C8G1H_MI355X_Ubuntu22
#SBATCH --gres=gpu:8
#SBATCH --mem=0
#SBATCH -N 2
#SBATCH --exclusive
#SBATCH --ntasks-per-node=8
#SBATCH --account=<ACCOUNT_NAME>
# Load ROCm
module load rocm/7.2.0
# Set environment variables
export HF_TOKEN=<your_token>
export HUGGINGFACE_HUB_CACHE=/shared/data/hf_cache
# Clone MAD if not already done
if [ ! -d "/shared/data/MAD" ]; then
    cd /shared/data
    git clone https://github.com/ROCm/MAD
fi
# Run multi-node inference
cd /shared/data/MAD/scripts/sglang
srun --container-image=docker://lmsysorg/sglang:v0.4.5-rocm630 \
--container-mounts=/shared/data:/shared/data \
--container-workdir=/shared/data/MAD/scripts/sglang \
--container-env="HF_TOKEN,HUGGINGFACE_HUB_CACHE" \
./sglang_benchmark_report.sh -s throughput \
-m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
-g 16 -d bfloat16 -a random
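The `-g 16` value in the script above equals the number of nodes times the GPUs per node (2 x 8). Rather than hard-coding it, the total could be derived from the `SLURM_JOB_NUM_NODES` variable that Slurm sets inside a job. This is only a sketch: the helper name `total_gpus` is hypothetical, and 8 GPUs per node is taken from the `--gres=gpu:8` request.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total GPUs = nodes in the job x GPUs per node.
# Falls back to a single node when not running under Slurm.
total_gpus() {
    local gpus_per_node="${1:-8}"
    local nodes="${SLURM_JOB_NUM_NODES:-1}"
    echo $(( nodes * gpus_per_node ))
}

echo "Total GPUs for this job: $(total_gpus 8)"
```

The result could then be passed to the benchmark script as `-g "$(total_gpus 8)"`, so that changing `-N` does not require editing the GPU count by hand.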
Step 3: Submit the job
sbatch sglang_inference.sh
Step 4: Monitor the job
# Check job status
squeue -u $USER
# View output
tail -f slurm-<job_id>.out
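As an alternative to copying the job ID by hand, submission and monitoring can be combined. The sketch below uses `sbatch --parsable`, which prints only the job ID (optionally followed by `;cluster`), and assumes the default Slurm output file name `slurm-<job_id>.out`. The helper name `job_id_from_parsable` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the job ID from `sbatch --parsable` output,
# which has the form "jobid" or "jobid;cluster".
job_id_from_parsable() {
    echo "${1%%;*}"
}

# Submit only when sbatch is available on this machine.
if command -v sbatch >/dev/null 2>&1; then
    job_id="$(job_id_from_parsable "$(sbatch --parsable sglang_inference.sh)")"
    echo "Submitted job ${job_id}; output in slurm-${job_id}.out"
fi
```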
Using MAD for Automated Benchmarking
MAD (Model Automation and Dashboarding) automates benchmark runs. The steps below set it up and run the SGLang benchmark for DeepSeek-R1-Distill-Qwen-32B.
Step 1: Setup MAD
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Step 2: Set credentials
export MAD_SECRETS_HFTOKEN="your_token"
Step 3: Run benchmark
madengine run \
--tags pyt_sglang_deepseek-r1-distill-qwen-32b \
--keep-model-dir \
--live-output \
--timeout 28800
Results
Results are saved to:
~/MAD/perf_DeepSeek-R1-Distill-Qwen-32B.csv
Supported Models
SGLang works with various models including:
- DeepSeek: DeepSeek-R1-Distill-Qwen-32B, DeepSeek-V2
- Llama: Llama 2, Llama 3, Llama 3.1
- Qwen: Qwen 2.5 family
- Mixtral: Mixtral 8x7B
Performance Optimization Tips
1. Tensor Parallelism
For large models, use tensor parallelism across multiple GPUs:
# 8 GPUs with tensor parallelism
./sglang_benchmark_report.sh -s throughput \
-m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
-g 8 -d bfloat16
2. Data Type Selection
- bfloat16: Best balance of speed and accuracy
- float16: Similar speed to bfloat16, but its narrower dynamic range may cause numerical stability issues
- int8: Experimental quantization support
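The data type also determines how much GPU memory the model weights occupy: bfloat16 and float16 use 2 bytes per parameter and int8 uses 1. A rough per-GPU estimate under tensor parallelism is parameters x bytes per parameter / number of GPUs. The sketch below assumes roughly 32 billion parameters (taken from the "32B" in the model name) and ignores the KV cache and activations, so actual usage will be higher. The helper name `weights_gb_per_gpu` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: approximate weight memory per GPU in GB (1 GB = 10^9 bytes).
# Arguments: parameters in billions, bytes per parameter, number of GPUs.
weights_gb_per_gpu() {
    awk -v p="$1" -v b="$2" -v g="$3" 'BEGIN { printf "%.1f\n", p * b / g }'
}

echo "bfloat16, 8 GPUs: $(weights_gb_per_gpu 32 2 8) GB per GPU"
echo "int8, 8 GPUs: $(weights_gb_per_gpu 32 1 8) GB per GPU"
```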
3. Batch Size Tuning
Adjust batch size based on available memory:
# Modify in the benchmark script or use custom parameters
Troubleshooting
Out of memory errors
- Increase the number of GPUs so that the model weights and KV cache are sharded across more devices
- Use quantized models (int8, int4)
- Reduce batch size or sequence length
Hugging Face authentication errors
Ensure the token is set correctly:
export HF_TOKEN=<your_token>
# Verify it's set
if [ -n "$HF_TOKEN" ]; then
echo "HF_TOKEN is set"
else
echo "HF_TOKEN is not set"
fi
Model download issues
If models fail to download:
# Set cache directory to shared storage
export HUGGINGFACE_HUB_CACHE=/shared/data/hf_cache
# Pre-download models manually
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
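To check whether a model is already present before downloading it again, it helps to know where the Hugging Face Hub cache stores it: each repository lives in a directory named `models--<org>--<name>` under the cache root. The sketch below builds that path; the helper name `model_cache_path` is hypothetical, and the fallback location applies only when `HUGGINGFACE_HUB_CACHE` is unset.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the expected cache directory for a Hub repository.
model_cache_path() {
    local repo="$1"
    local cache="${HUGGINGFACE_HUB_CACHE:-$HOME/.cache/huggingface/hub}"
    echo "${cache}/models--$(echo "$repo" | sed 's|/|--|g')"
}

path="$(model_cache_path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)"
if [ -d "$path" ]; then
    echo "model already cached at $path"
else
    echo "model not cached yet; expected at $path"
fi
```

The directory existing does not guarantee that the download completed, so treat this as a first check only.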
Container permissions
If you get container permission errors on AAC, do not try to use sudo or add yourself to the Docker group. AAC supports containerized workloads through Podman (for interactive sessions) and Pyxis/Enroot (for batch jobs). See Using Enroot with Pyxis for more details.
Storage Recommendations
- Store models in /shared/data/hf_cache for multi-node access
- Set HUGGINGFACE_HUB_CACHE=/shared/data/hf_cache to avoid re-downloading
- Store benchmark results in $HOME for easy retrieval
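Model downloads can be large, so it may be worth checking the cache directory and the free space on its filesystem before starting a benchmark. This is a minimal sketch using `du` and `df`; the helper name `cache_status` is hypothetical, and the default path is the one recommended above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show how much the cache uses and how much space is free.
cache_status() {
    local dir="${1:-${HUGGINGFACE_HUB_CACHE:-/shared/data/hf_cache}}"
    if [ ! -d "$dir" ]; then
        echo "cache directory does not exist: $dir"
        return 1
    fi
    echo "cache directory: $dir"
    # First field of du -sh is the total size of the directory.
    du -sh "$dir" | awk '{ print "used: " $1 }'
    # Fourth field of the df data line is the available space.
    df -Ph "$dir" | awk 'NR == 2 { print "free on filesystem: " $4 }'
}

cache_status || true
```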
Benchmark Metrics
Latency Metrics
- Time to First Token (TTFT): Time until first output token
- Time Per Output Token (TPOT): Average time per generated token
- End-to-End Latency: Total request completion time
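These three metrics are related. A common approximation is end-to-end latency = TTFT + TPOT x (output tokens - 1), because the first token is covered by TTFT and every later token costs one TPOT on average. The sketch below applies that approximation to illustrative numbers; the benchmark script may compute its figures differently, and the helper name `estimate_e2e_latency` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: approximate end-to-end latency in seconds.
# Arguments: TTFT in seconds, TPOT in seconds, number of output tokens.
estimate_e2e_latency() {
    awk -v ttft="$1" -v tpot="$2" -v n="$3" \
        'BEGIN { printf "%.3f\n", ttft + tpot * (n - 1) }'
}

# Illustrative values only: 0.5 s to first token, 20 ms per token, 101 tokens.
echo "estimated latency: $(estimate_e2e_latency 0.5 0.02 101) s"
```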
Throughput Metrics
- Tokens Per Second: Total tokens generated per second
- Requests Per Second: Number of completed requests per second
- GPU Utilization: Percentage of GPU compute used
Related Documentation
- AAC Slurm Cluster User Guide
- Using Enroot with Pyxis
- Storage and Shared Filesystems
- Node Reference Guide