vLLM for Inference
Install and use vLLM on DGX Station
Overview
Basic idea
vLLM is an inference engine designed to run large language models efficiently. The key idea is maximizing throughput and minimizing memory waste when serving LLMs.
- PagedAttention stores the KV cache in fixed-size blocks, which reduces memory fragmentation and waste when handling long sequences.
- Continuous batching keeps GPUs fully utilized by adding new requests to batches in progress.
- OpenAI-compatible API allows applications built for OpenAI to switch to vLLM with minimal changes.
What you'll accomplish
You'll set up high-throughput LLM serving with vLLM on an NVIDIA DGX Station with Blackwell architecture and serve a supported model. Refer to the Model Support Matrix table below to see the supported models.
What to know before starting
- Basic Docker container usage
- Familiarity with REST APIs
Prerequisites
- NVIDIA DGX Station with GB300 and RTX 6000 Pro GPUs
- Docker installed: docker --version
- NVIDIA Container Toolkit configured
- HuggingFace account with access token
- Network access to NGC and HuggingFace
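The tool-related prerequisites above can be checked up front. The following is a minimal sketch, assuming a POSIX shell; the helper name require_cmd is hypothetical, and it only confirms that each command is on PATH, not that it is configured correctly.

```shell
# Hypothetical helper: succeed if the named command is on PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Commands used later in this guide; report each one without stopping early.
for cmd in docker nvidia-smi nvidia-ctk curl; do
  require_cmd "$cmd" || true
done
```

Any "missing" line points at a prerequisite to install before continuing.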
Model Support Matrix
The following models are supported with vLLM on DGX Station. All listed models are available and ready to use:
| Model | Quantization | Support Status | HF Handle |
|---|---|---|---|
| DiffusionGemma 26B A4B IT | BF16 | ✅ | google/diffusiongemma-26B-A4B-it |
| DiffusionGemma 26B A4B IT | NVFP4 | ✅ | nvidia/diffusiongemma-26B-A4B-it-NVFP4 |
| Step-3.7-Flash-FP8 | FP8 | ✅ | stepfun-ai/Step-3.7-Flash-FP8 |
| Step-3.7-Flash-NVFP4 | NVFP4 | ✅ | stepfun-ai/Step-3.7-Flash-NVFP4 |
| Qwen3-235B-A22B-NVFP4 | NVFP4 | ✅ | nvidia/Qwen3-235B-A22B-NVFP4 |
Time & risk
- Duration: 30 minutes (longer on first run due to model download)
- Risks: Model download requires HuggingFace authentication
- Rollback: Stop and remove the container to restore state
- Last Updated: 06/10/2026 (models updated)
Instructions
Step 1. Set up Docker permissions
If you haven't already, add your user to the docker group to run Docker without sudo:
sudo usermod -aG docker $USER
newgrp docker
Step 2. Set up environment variables
Set the following so the vLLM container can download the model and use your chosen context length:
## HuggingFace token (required)
## Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"
## Model to serve
export MODEL_HANDLE="<HF_HANDLE>"
## Maximum context length
export MAX_MODEL_LEN=8192
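Because the placeholder values above are easy to leave in place, it can help to validate the variables before starting a container. This is a sketch only; check_env is a hypothetical helper, and the checks it performs (non-empty, not a placeholder, whole number) are assumptions about what counts as usable rather than anything vLLM requires.

```shell
# Hypothetical helper: fail early if the variables from this step are unusable.
check_env() {
  if [ -z "${HF_TOKEN:-}" ] || [ "$HF_TOKEN" = "your_huggingface_token" ]; then
    echo "HF_TOKEN is not set to a real token" >&2
    return 1
  fi
  if [ -z "${MODEL_HANDLE:-}" ] || [ "$MODEL_HANDLE" = "<HF_HANDLE>" ]; then
    echo "MODEL_HANDLE is not set to a handle from the support matrix" >&2
    return 1
  fi
  # Reject an empty value, anything containing a non-digit, and a bare 0.
  case "${MAX_MODEL_LEN:-}" in
    ''|*[!0-9]*|0)
      echo "MAX_MODEL_LEN must be a whole number greater than 0" >&2
      return 1
      ;;
  esac
  return 0
}
```

Running check_env && echo "environment looks usable" after the exports gives a quick confirmation before moving on.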
Step 3. Pull vLLM container image
Pull the vLLM container from NGC. Use the 26.01 image on DGX Station; the 25.10 image can fail during engine startup with a FlashInfer buffer overflow on some configurations.
docker pull nvcr.io/nvidia/vllm:26.01-py3
For DiffusionGemma, use the vLLM custom container:
docker pull vllm/vllm-openai:gemma
For Step-3.7-Flash models, pull the custom vLLM container:
docker pull vllm/vllm-openai:stepfun37
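The pairing of model family and image described in this step can be captured in a small lookup. This is a sketch, not part of the official workflow: select_image is a hypothetical helper, it matches on substrings of the handles listed in the support matrix, and falling back to the NGC image for any other handle is an assumption.

```shell
# Hypothetical helper: print the container image this guide pairs with a model handle.
select_image() {
  case "$1" in
    *diffusiongemma*) echo "vllm/vllm-openai:gemma" ;;
    *Step-3.7-Flash*) echo "vllm/vllm-openai:stepfun37" ;;
    # Assumption: every other handle uses the NGC image.
    *)                echo "nvcr.io/nvidia/vllm:26.01-py3" ;;
  esac
}

# Example (not executed here):
# docker pull "$(select_image "$MODEL_HANDLE")"
```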
Step 4. Start vLLM server
Start the vLLM server with the model. On a single-GPU DGX Station, --gpus all uses the GB300. If the system has multiple GPUs and you want to use only the GB300, replace it with --gpus '"device=N"', where N is the GB300 device ID reported by nvidia-smi.
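On a multi-GPU system, the device ID can be looked up from nvidia-smi instead of being read off by eye. The sketch below assumes the GPU name reported by nvidia-smi contains the substring GB300, which may not hold on every driver version; find_gpu_index is a hypothetical helper.

```shell
# Hypothetical helper: read "index, name" lines on stdin and print the index
# of the first GPU whose name contains the given substring.
# Exits non-zero if no line matches.
find_gpu_index() {
  awk -F', ' -v want="$1" '
    index($2, want) { print $1; found = 1; exit }
    END { exit !found }
  '
}

# Example (not executed here):
# GB300_ID=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | find_gpu_index GB300)
# docker run ... --gpus "\"device=${GB300_ID}\"" ...
```

If the lookup fails, run nvidia-smi directly and adjust the substring to match the name it prints.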
For the Qwen3-235B-A22B-NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
nvcr.io/nvidia/vllm:26.01-py3 \
vllm serve "$MODEL_HANDLE" \
--max-model-len $MAX_MODEL_LEN \
--gpu-memory-utilization 0.9
For DiffusionGemma models (e.g. google/diffusiongemma-26B-A4B-it), run with the custom vLLM container:
docker run -d \
--name vllm-server \
-p 8000:8000 \
--gpus all \
--shm-size=16g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e VLLM_USE_V2_MODEL_RUNNER=1 \
-e HF_TOKEN="$HF_TOKEN" \
vllm/vllm-openai:gemma "$MODEL_HANDLE" \
--gpu-memory-utilization 0.85 \
--attention-backend TRITON_ATTN \
--max-num-seqs 16 \
--diffusion-config '{"canvas_length":256}' \
--override-generation-config '{"max_new_tokens": null}' \
--load-format fastsafetensors \
--enable-prefix-caching \
--reasoning-parser gemma4 \
--default-chat-template-kwargs '{"enable_thinking": true}' \
--enable-auto-tool-choice \
--tool-call-parser gemma4
## For the BF16 checkpoint, add "--moe-backend triton" for better performance
For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300.
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
vllm/vllm-openai:stepfun37 \
"$MODEL_HANDLE" \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--kv-cache-dtype fp8
Check the server logs for startup progress:
docker logs -f vllm-server
Expected output includes:
- Model download progress (first run only)
- Model loading into GPU memory
- Application startup complete.
Press Ctrl+C to exit log view once the server is ready.
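As an alternative to watching the logs, readiness can be polled from a script. This is a minimal sketch: wait_until is a hypothetical generic retry helper, and the example assumes the server answers on a /health route at port 8000, as the vLLM OpenAI-compatible server normally does. The attempt count and delay are arbitrary choices.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Success of the wrapped command ends the wait immediately.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (not executed here): poll for up to about 10 minutes.
# wait_until 120 5 curl -sf http://localhost:8000/health && echo "server ready"
```

A first run can take much longer than this because of the model download, so raise the attempt count if needed.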
Step 5. Test the API
Send a test request to verify the server is working:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'"$MODEL_HANDLE"'",
"messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
"max_tokens": 256
}'
The response should contain a choices array with the model's answer.
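For scripted smoke tests, the presence of the choices array can be checked automatically. The sketch below uses a plain text match rather than a JSON parser, so it only confirms that the key appears somewhere in the body; has_choices is a hypothetical helper.

```shell
# Hypothetical helper: succeed if the text on stdin contains a "choices" key.
# This is a substring match, not a JSON parse.
has_choices() {
  grep -q '"choices"'
}

# Example (not executed here):
# curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d '{"model": "'"$MODEL_HANDLE"'", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}' \
#   | has_choices && echo "API OK"
```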
Step 6. Cleanup
Stop and remove the container:
docker stop vllm-server
docker rm vllm-server
Optionally, remove the image and the cached model. For example:
docker rmi "<docker image name>"
rm -rf "$HOME/.cache/huggingface/hub/<downloaded model name>"
Troubleshooting
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| "permission denied" when running docker | User not in docker group | Run sudo usermod -aG docker $USER && newgrp docker |
| Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run nvidia-ctk runtime configure --runtime=docker and restart Docker |
| "Token is required" or 401 error | Missing HuggingFace token | Ensure HF_TOKEN is exported before running the docker command |
| Model download hangs or fails | Network or authentication issue | Check internet connection, verify HF_TOKEN is valid |
| CUDA out of memory | Context length too large | Reduce MAX_MODEL_LEN or lower --gpu-memory-utilization |
| Server not responding on port 8000 | Port already in use | Check with lsof -i :8000, or use -p 8001:8000 to map a different host port |
| Model runs on wrong GPU | Default GPU selection | Use --gpus '"device=0"' to select a specific GPU |
| NGC authentication fails | Invalid or missing credentials | Run docker login nvcr.io with NGC API key |
| EngineCore failed / FlashInfer "Buffer overflow when allocating memory for batch_prefill_tmp_v" | Known issue with the 25.10 vLLM container image on some DGX Station setups during CUDA graph capture | Use the 26.01 container image (nvcr.io/nvidia/vllm:26.01-py3) instead of 25.10 |