dgx-spark-playbooks/nvidia/station-vllm/endpoint-test.yaml
2026-05-30 11:49:27 +00:00

353 lines
16 KiB
YAML
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

kind: Playbook
metadata:
name: station-vllm
displayName: vLLM for Inference
shortDescription: Install and use vLLM on DGX Station
publisher: nvidia
description: |
# REPLACE THIS WITH YOUR MODEL CARD
https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads
labelsV2:
- gpuType:playbook:gpu_type_station
- Inference
- vLLM
attributes:
- key: DURATION
value: 30 MIN
spec:
artifactName: station-vllm
nvcfFunctionId: None
attributes:
showUnavailableBanner: false
apiDocsUrl: None
termsOfUse: |
tabs:
-
id: overview
label: Overview
content: |
# Basic idea
vLLM is an inference engine designed to run large language models efficiently. The key idea is **maximizing throughput and minimizing memory waste** when serving LLMs.
- **PagedAttention** handles long sequences without running out of GPU memory.
- **Continuous batching** keeps GPUs fully utilized by adding new requests to batches in progress.
- **OpenAI-compatible API** allows applications built for OpenAI to switch to vLLM with minimal changes.
# What you'll accomplish
Serve a **supported model** using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models.
You'll set up vLLM high-throughput LLM serving on NVIDIA DGX Station with Blackwell architecture.
# What to know before starting
- Basic Docker container usage
- Familiarity with REST APIs
# Prerequisites
- NVIDIA DGX Station with GB300 and RTX 6000 Pro GPUs
- Docker installed: `docker --version`
- NVIDIA Container Toolkit configured
- HuggingFace account with access token
- Network access to NGC and HuggingFace
# Model Support Matrix
The following models are supported with vLLM on DGX Station. All listed models are available and ready to use:
| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
| **Kimi-K2.5 (1T)** | NVFP4 | ✅ | [`nvidia/Kimi-K2.5-NVFP4`](https://huggingface.co/nvidia/Kimi-K2.5-NVFP4) |
| **DeepSeek-V4-Flash** | NVFP4 | ✅ | [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) |
# Time & risk
* **Duration:** 30 minutes (longer on first run due to model download)
* **Risks:** Model download requires HuggingFace authentication
* **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 05/29/2026
* Update models
* Add base configuration example, per-setting explanations, and DeepSeek-V4-Flash recipe
-
id: instructions
label: Instructions
content: |
# Step 1. Set up Docker permissions
If you haven't already, add your user to the docker group to run Docker without sudo:
```bash
sudo usermod -aG docker $USER
newgrp docker
```
# Step 2. Set up environment variables
Set the following so the vLLM container can download the model and use your chosen context length:
```bash
# HuggingFace token (required)
# Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"
# Model to serve
export MODEL_HANDLE="<HF_HANDLE>"
# Maximum context length
export MAX_MODEL_LEN=8192
```
# Step 3. Pull vLLM container image
Pull the vLLM container from NGC. Use the **26.01** image on DGX Station; the 25.10 image can fail during engine startup with a FlashInfer buffer overflow on some configurations.
```bash
docker pull nvcr.io/nvidia/vllm:26.01-py3
```
For Step-3.7-Flash models, pull the custom VLLM container
```bash
docker pull vllm/vllm-openai:stepfun37
```
For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, pull the **26.03** image, which includes the `--cpu-offload-params` support used below:
```bash
docker pull nvcr.io/nvidia/vllm:26.03-py3
```
For DeepSeek-V4-Flash, pull the stable DeepSeek-V4 release container. Use the **cu130** build on DGX Station (Blackwell):
```bash
docker pull vllm/vllm-openai:v0.20.0-cu130
```
# Step 4. Start vLLM server
Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
## Base configuration (most models)
This is the recommended starting point for any model that fits entirely in VRAM on the GB300. The Qwen3-235B-A22B-NVFP4 model, for example, runs directly with this configuration.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
nvcr.io/nvidia/vllm:26.01-py3 \
vllm serve "$MODEL_HANDLE" \
--max-model-len $MAX_MODEL_LEN \
--gpu-memory-utilization 0.9
```
Settings used:
- `--max-model-len` — maximum context length (prompt + output) per request. Larger values reserve more GPU memory for the KV cache; size it to your workload.
- `--gpu-memory-utilization 0.9` — fraction of GPU memory vLLM may use for weights and KV cache. `0.9` leaves headroom for other processes; raise toward `0.95` to fit more KV cache if the GPU is dedicated.
## Step-3.7-Flash (FP8 / NVFP4)
For Step-3.7-Flash models, run with the custom VLLM container. The FP8 and the NVFP4 versions fit entirely in VRAM on the GB300.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
vllm/vllm-openai:stepfun37 \
"$MODEL_HANDLE" \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--kv-cache-dtype fp8
```
Settings used (in addition to the base configuration):
- `--trust-remote-code` — allows the model's custom modeling code (shipped in its repo) to load. Required for Step-3.7.
- `--reasoning-parser step3p5` — parses the model's reasoning/thinking tokens into the dedicated `reasoning_content` response field.
- `--enable-auto-tool-choice` — lets the model decide when to call a tool, enabling OpenAI-compatible function calling.
- `--tool-call-parser step3p5` — parses the model's tool-call output into structured `tool_calls`. Pairs with `--enable-auto-tool-choice`.
- `--kv-cache-dtype fp8` — stores the KV cache in FP8, roughly halving KV-cache memory versus 16-bit and allowing more concurrent/longer sequences.
## Kimi-K2.5 NVFP4 (1T) — CPU offloading
For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, run with the **26.03** NGC container. This model does not fit entirely in VRAM, so the MoE expert weights are offloaded to CPU DRAM with `--cpu-offload-gb 375 --cpu-offload-params experts`. Ensure the system has enough free DRAM to hold the offloaded weights.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
nvcr.io/nvidia/vllm:26.03-py3 \
vllm serve nvidia/Kimi-K2.5-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--dtype auto \
--kv-cache-dtype auto \
--gpu-memory-utilization 0.95 \
--served-model-name nvidia/Kimi-K2.5-NVFP4 \
--tensor-parallel-size 1 \
--no-enable-prefix-caching \
--trust-remote-code \
--max-model-len 40960 \
--max-num-seqs 1 \
--max-num-batched-tokens 32768 \
--cpu-offload-gb 375 \
--cpu-offload-params experts
```
Settings used (in addition to the base configuration):
- `--cpu-offload-gb 375` — amount of CPU DRAM (in GiB) vLLM may use to hold weights that don't fit in VRAM. Must be large enough for the offloaded experts; the system needs at least this much free DRAM.
- `--cpu-offload-params experts` — offloads only the MoE expert weights (the bulk of a large MoE model) to DRAM, keeping attention and other hot weights in VRAM.
- `--tensor-parallel-size 1` — single GPU; the GB300 serves the whole model.
- `--max-num-seqs 1` / `--max-num-batched-tokens 32768` — caps concurrency to one sequence and the batch token budget. With expert weights paged from DRAM, throughput is offload-bound, so a low concurrency keeps latency predictable.
- `--no-enable-prefix-caching` — disables prefix-cache reuse. Offloaded experts make the memory budget tight, so the cache is turned off here rather than spent on KV reuse.
- `--kv-cache-dtype auto` / `--dtype auto` — let vLLM pick the KV-cache and compute dtypes from the model's quantization (NVFP4).
## DeepSeek-V4-Flash — MTP + agentic
For DeepSeek-V4-Flash, run with the stable **v0.20.0-cu130** container. This recipe targets agentic workloads and enables Multi-Token Prediction (MTP) speculative decoding. On a single GB300 (TP1) the MoE expert-parallel path is used; the `deep_gemm_mega_moe` backend from some internal recipes is not needed at TP1 and is omitted here.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
vllm/vllm-openai:v0.20.0-cu130 \
deepseek-ai/DeepSeek-V4-Flash \
--enable-expert-parallel \
--kv-cache-dtype fp8 \
--trust-remote-code \
--block-size 256 \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
--attention_config.use_fp4_indexer_cache True \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
--max-model-len 32768
```
Settings used (in addition to the base configuration):
- `--enable-expert-parallel` — shards the MoE experts across the available GPU(s) using expert parallelism, the recommended MoE execution path for DeepSeek-V4.
- `--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'` — enables **MTP (Multi-Token Prediction)** speculative decoding: the model proposes 3 tokens per step that are verified in a single forward pass, cutting latency for accepted tokens.
- `--kv-cache-dtype fp8` — FP8 KV cache to fit more concurrent/longer sequences.
- `--block-size 256` — KV-cache page size in tokens. DeepSeek-V4 uses multiple KV-cache groups; `256` matches the recipe validated on Station.
- `--attention_config.use_fp4_indexer_cache True` — enables the FP4 indexer cache used by DeepSeek-V4's attention. (Drop this flag on platforms without native FP4, e.g. Hopper.)
- `--tokenizer-mode deepseek_v4` / `--tool-call-parser deepseek_v4` / `--reasoning-parser deepseek_v4` — DeepSeek-V4-specific tokenizer, tool-call, and reasoning parsers.
- `--enable-auto-tool-choice` — OpenAI-compatible function calling for agentic use.
- `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` — uses full + piecewise CUDA graph capture and enables all custom ops for lower per-step overhead.
- **Prefix caching is left enabled (the vLLM default).** For agentic workloads with large shared prefixes (e.g. a 32k system/context prefix) at low batch sizes (~BS 34), prefix caching gives a significant throughput boost by reusing the cached prefix across requests. Some internal recipes carry `--no-enable-prefix-caching`, but that was inherited from random-data benchmarking and is not recommended for agentic use here.
Check the server logs for startup progress:
```bash
docker logs -f vllm-server
```
Expected output includes:
- Model download progress (first run only)
- Model loading into GPU memory
- `Application startup complete.`
Press `Ctrl+C` to exit log view once the server is ready.
# Step 5. Test the API
Send a test request to verify the server is working:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'"$MODEL_HANDLE"'",
"messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
"max_tokens": 256
}'
```
The response should contain a `choices` array with the model's answer.
# Step 6. Cleanup
Stop and remove the container:
```bash
docker stop vllm-server
docker rm vllm-server
```
Optionally, remove the image and cached model:
Eg.
```bash
docker rmi "<docker image name>"
rm -rf $HOME/.cache/huggingface/hub/"<downloaded model name>"
```
-
id: troubleshooting
label: Troubleshooting
content: |
# Common issues
| Symptom | Cause | Fix |
|---------|--------|-----|
| "permission denied" when running docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` |
| Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run `nvidia-ctk runtime configure --runtime=docker` and restart Docker |
| "Token is required" or 401 error | Missing HuggingFace token | Ensure `HF_TOKEN` is exported before running docker command |
| Model download hangs or fails | Network or authentication issue | Check internet connection, verify HF_TOKEN is valid |
| CUDA out of memory | Context length too large | Reduce `MAX_MODEL_LEN` or lower `--gpu-memory-utilization` |
| Server not responding on port 8000 | Port already in use | Check with `lsof -i :8000`, use `-p 8001:8000` for different port |
| Model runs on wrong GPU | Default GPU selection | Use `--gpus '"device=0"'` to select specific GPU |
| NGC authentication fails | Invalid or missing credentials | Run `docker login nvcr.io` with NGC API key |
| EngineCore failed / FlashInfer "Buffer overflow when allocating memory for batch_prefill_tmp_v" | Known issue with vLLM 25.10 on some DGX Station setups during CUDA graph capture | Use the **26.01** container image: `nvcr.io/nvidia/vllm:26.01-py3` instead of 25.10. |
resources:
- name: vLLM Documentation
url: https://docs.vllm.ai/en/latest/