dgx-spark-playbooks/nvidia/station-vllm/README.md
2026-05-29 00:08:55 +00:00

205 lines
6.6 KiB
Markdown

# vLLM for Inference
> Install and use vLLM on DGX Station
## Table of Contents
- [Overview](#overview)
- [Instructions](#instructions)
- [Troubleshooting](#troubleshooting)
---
## Overview
## Basic idea
vLLM is an inference engine designed to run large language models efficiently. The key idea is **maximizing throughput and minimizing memory waste** when serving LLMs.
- **PagedAttention** handles long sequences without running out of GPU memory.
- **Continuous batching** keeps GPUs fully utilized by adding new requests to batches in progress.
- **OpenAI-compatible API** allows applications built for OpenAI to switch to vLLM with minimal changes.
## What you'll accomplish
Serve a **supported model** using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models.
You'll set up vLLM high-throughput LLM serving on NVIDIA DGX Station with Blackwell architecture.
## What to know before starting
- Basic Docker container usage
- Familiarity with REST APIs
## Prerequisites
- NVIDIA DGX Station with GB300 and RTX 6000 Pro GPUs
- Docker installed: `docker --version`
- NVIDIA Container Toolkit configured
- HuggingFace account with access token
- Network access to NGC and HuggingFace
## Model Support Matrix
The following models are supported with vLLM on DGX Station. All listed models are available and ready to use:
| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
## Time & risk
* **Duration:** 30 minutes (longer on first run due to model download)
* **Risks:** Model download requires HuggingFace authentication
* **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 05/28/2026
* Update models
## Instructions
## Step 1. Set up Docker permissions
If you haven't already, add your user to the docker group to run Docker without sudo:
```bash
sudo usermod -aG docker $USER
newgrp docker
```
## Step 2. Set up environment variables
Set the following so the vLLM container can download the model and use your chosen context length:
```bash
## HuggingFace token (required)
## Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"
## Model to serve
export MODEL_HANDLE="<HF_HANDLE>"
## Maximum context length
export MAX_MODEL_LEN=8192
```
## Step 3. Pull vLLM container image
Pull the vLLM container from NGC. Use the **26.01** image on DGX Station; the 25.10 image can fail during engine startup with a FlashInfer buffer overflow on some configurations.
```bash
docker pull nvcr.io/nvidia/vllm:26.01-py3
```
For Step-3.7-Flash models, pull the custom VLLM container
```bash
docker pull vllm/vllm-openai:stepfun37
```
## Step 4. Start vLLM server
Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
For Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
nvcr.io/nvidia/vllm:26.01-py3 \
vllm serve "$MODEL_HANDLE" \
--max-model-len $MAX_MODEL_LEN \
--gpu-memory-utilization 0.9
```
For Step-3.7-Flash models, run with the custom VLLM container. The FP8 and the NVFP4 versions fit entirely in VRAM on the GB300.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
vllm/vllm-openai:stepfun37 \
"$MODEL_HANDLE" \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--kv-cache-dtype fp8
```
Check the server logs for startup progress:
```bash
docker logs -f vllm-server
```
Expected output includes:
- Model download progress (first run only)
- Model loading into GPU memory
- `Application startup complete.`
Press `Ctrl+C` to exit log view once the server is ready.
## Step 5. Test the API
Send a test request to verify the server is working:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'"$MODEL_HANDLE"'",
"messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
"max_tokens": 256
}'
```
The response should contain a `choices` array with the model's answer.
## Step 6. Cleanup
Stop and remove the container:
```bash
docker stop vllm-server
docker rm vllm-server
```
Optionally, remove the image and cached model:
Eg.
```bash
docker rmi "<docker image name>"
rm -rf $HOME/.cache/huggingface/hub/"<downloaded model name>"
```
## Troubleshooting
## Common issues
| Symptom | Cause | Fix |
|---------|--------|-----|
| "permission denied" when running docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` |
| Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run `nvidia-ctk runtime configure --runtime=docker` and restart Docker |
| "Token is required" or 401 error | Missing HuggingFace token | Ensure `HF_TOKEN` is exported before running docker command |
| Model download hangs or fails | Network or authentication issue | Check internet connection, verify HF_TOKEN is valid |
| CUDA out of memory | Context length too large | Reduce `MAX_MODEL_LEN` or lower `--gpu-memory-utilization` |
| Server not responding on port 8000 | Port already in use | Check with `lsof -i :8000`, use `-p 8001:8000` for different port |
| Model runs on wrong GPU | Default GPU selection | Use `--gpus '"device=0"'` to select specific GPU |
| NGC authentication fails | Invalid or missing credentials | Run `docker login nvcr.io` with NGC API key |
| EngineCore failed / FlashInfer "Buffer overflow when allocating memory for batch_prefill_tmp_v" | Known issue with vLLM 25.10 on some DGX Station setups during CUDA graph capture | Use the **26.01** container image: `nvcr.io/nvidia/vllm:26.01-py3` instead of 25.10. |