kind: Playbook metadata: name: station-vllm displayName: vLLM for Inference shortDescription: Install and use vLLM on DGX Station publisher: nvidia description: | # REPLACE THIS WITH YOUR MODEL CARD https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads labelsV2: - gpuType:playbook:gpu_type_station - Inference - vLLM attributes: - key: DURATION value: 30 MIN spec: artifactName: station-vllm nvcfFunctionId: None attributes: showUnavailableBanner: false apiDocsUrl: None termsOfUse: | tabs: - id: overview label: Overview content: | # Basic idea vLLM is an inference engine designed to run large language models efficiently. The key idea is **maximizing throughput and minimizing memory waste** when serving LLMs. - **PagedAttention** handles long sequences without running out of GPU memory. - **Continuous batching** keeps GPUs fully utilized by adding new requests to batches in progress. - **OpenAI-compatible API** allows applications built for OpenAI to switch to vLLM with minimal changes. # What you'll accomplish Serve a **supported model** using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models. You'll set up vLLM high-throughput LLM serving on NVIDIA DGX Station with Blackwell architecture. # What to know before starting - Basic Docker container usage - Familiarity with REST APIs # Prerequisites - NVIDIA DGX Station with GB300 and RTX 6000 Pro GPUs - Docker installed: `docker --version` - NVIDIA Container Toolkit configured - HuggingFace account with access token - Network access to NGC and HuggingFace # Model Support Matrix The following models are supported with vLLM on DGX Station. 
All listed models are available and ready to use:

| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
| **Kimi-K2.5 (1T)** | NVFP4 | ✅ | [`nvidia/Kimi-K2.5-NVFP4`](https://huggingface.co/nvidia/Kimi-K2.5-NVFP4) |
| **DeepSeek-V4-Flash** | NVFP4 | ✅ | [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) |

# Time & risk

* **Duration:** 30 minutes (longer on first run due to model download)
* **Risks:** Model download requires HuggingFace authentication
* **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 05/29/2026
  * Update models
  * Add base configuration example, per-setting explanations, and DeepSeek-V4-Flash recipe

- id: instructions
  label: Instructions
  content: |

# Step 1. Set up Docker permissions

If you haven't already, add your user to the docker group to run Docker without sudo:

```bash
sudo usermod -aG docker $USER
newgrp docker
```

# Step 2. Set up environment variables

Set the following so the vLLM container can download the model and use your chosen context length:

```bash
# HuggingFace token (required)
# Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"

# Model to serve
export MODEL_HANDLE=""

# Maximum context length
export MAX_MODEL_LEN=8192
```

# Step 3. Pull vLLM container image

Pull the vLLM container from NGC. Use the **26.01** image on DGX Station; the 25.10 image can fail during engine startup with a FlashInfer buffer overflow on some configurations.

```bash
docker pull nvcr.io/nvidia/vllm:26.01-py3
```

For Step-3.7-Flash models, pull the custom vLLM container:

```bash
docker pull vllm/vllm-openai:stepfun37
```

For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, pull the **26.03** image, which includes the `--cpu-offload-params` support used below:

```bash
docker pull nvcr.io/nvidia/vllm:26.03-py3
```

For DeepSeek-V4-Flash, pull the stable DeepSeek-V4 release container. Use the **cu130** build on DGX Station (Blackwell):

```bash
docker pull vllm/vllm-openai:v0.20.0-cu130
```

# Step 4. Start vLLM server

Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace it with `--gpus '"device=N"'`, where N is the GB300 device ID from `nvidia-smi`.

## Base configuration (most models)

This is the recommended starting point for any model that fits entirely in VRAM on the GB300. The Qwen3-235B-A22B-NVFP4 model, for example, runs directly with this configuration.

```bash
docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve "$MODEL_HANDLE" \
  --max-model-len $MAX_MODEL_LEN \
  --gpu-memory-utilization 0.9
```

Settings used:

- `--max-model-len`: maximum context length (prompt + output) per request. Larger values reserve more GPU memory for the KV cache; size it to your workload.
- `--gpu-memory-utilization 0.9`: fraction of GPU memory vLLM may use for weights and KV cache. `0.9` leaves headroom for other processes; raise toward `0.95` to fit more KV cache if the GPU is dedicated.

## Step-3.7-Flash (FP8 / NVFP4)

For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300.

```bash
docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  vllm/vllm-openai:stepfun37 \
  "$MODEL_HANDLE" \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --kv-cache-dtype fp8
```

Settings used (in addition to the base configuration):

- `--trust-remote-code`: allows the model's custom modeling code (shipped in its repo) to load. Required for Step-3.7.
- `--reasoning-parser step3p5`: parses the model's reasoning/thinking tokens into the dedicated `reasoning_content` response field.
- `--enable-auto-tool-choice`: lets the model decide when to call a tool, enabling OpenAI-compatible function calling.
- `--tool-call-parser step3p5`: parses the model's tool-call output into structured `tool_calls`. Pairs with `--enable-auto-tool-choice`.
- `--kv-cache-dtype fp8`: stores the KV cache in FP8, roughly halving KV-cache memory versus 16-bit and allowing more concurrent/longer sequences.

## Kimi-K2.5 NVFP4 (1T): CPU offloading

For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, run with the **26.03** NGC container. This model does not fit entirely in VRAM, so the MoE expert weights are offloaded to CPU DRAM with `--cpu-offload-gb 375 --cpu-offload-params experts`. Ensure the system has enough free DRAM to hold the offloaded weights.

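Before launching, it can help to confirm that the host has that much DRAM available. The following is a minimal sketch, not part of vLLM or this playbook: `available_gib` and `check_dram` are hypothetical helper names, and the 375 GiB threshold simply mirrors the `--cpu-offload-gb` value used in this recipe. It assumes a Linux host that reports `MemAvailable` in `/proc/meminfo`.

```shell
# Hypothetical pre-flight helpers (not provided by vLLM or the playbook).
# Threshold mirrors --cpu-offload-gb from the Kimi-K2.5 recipe.
REQUIRED_GIB=375

# Print available DRAM in whole GiB (MemAvailable is reported in kiB).
available_gib() {
  awk '/^MemAvailable:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

# Return 0 if available DRAM is at least the given number of GiB, else 1.
check_dram() {
  required="$1"
  available="$(available_gib)"
  available="${available:-0}"   # treat a missing MemAvailable field as 0
  if [ "$available" -ge "$required" ]; then
    echo "OK: ${available} GiB available, ${required} GiB required"
  else
    echo "WARN: only ${available} GiB available, ${required} GiB required" >&2
    return 1
  fi
}

check_dram "$REQUIRED_GIB" || echo "Free up memory or lower --cpu-offload-gb before launching."
```

A non-zero return from `check_dram` suggests freeing memory or lowering `--cpu-offload-gb` before running the launch command. The check only compares against the offload budget; the operating system and other processes need additional headroom.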
```bash
docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  nvcr.io/nvidia/vllm:26.03-py3 \
  vllm serve nvidia/Kimi-K2.5-NVFP4 \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --kv-cache-dtype auto \
  --gpu-memory-utilization 0.95 \
  --served-model-name nvidia/Kimi-K2.5-NVFP4 \
  --tensor-parallel-size 1 \
  --no-enable-prefix-caching \
  --trust-remote-code \
  --max-model-len 40960 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 32768 \
  --cpu-offload-gb 375 \
  --cpu-offload-params experts
```

Settings used (in addition to the base configuration):

- `--cpu-offload-gb 375`: amount of CPU DRAM (in GiB) vLLM may use to hold weights that don't fit in VRAM. Must be large enough for the offloaded experts; the system needs at least this much free DRAM.
- `--cpu-offload-params experts`: offloads only the MoE expert weights (the bulk of a large MoE model) to DRAM, keeping attention and other hot weights in VRAM.
- `--tensor-parallel-size 1`: single GPU; the GB300 serves the whole model.
- `--max-num-seqs 1` / `--max-num-batched-tokens 32768`: caps concurrency to one sequence and the batch token budget. With expert weights paged from DRAM, throughput is offload-bound, so a low concurrency keeps latency predictable.
- `--no-enable-prefix-caching`: disables prefix-cache reuse. Offloaded experts make the memory budget tight, so the cache is turned off here rather than spent on KV reuse.
- `--kv-cache-dtype auto` / `--dtype auto`: let vLLM pick the KV-cache and compute dtypes from the model's quantization (NVFP4).

## DeepSeek-V4-Flash: MTP + agentic

For DeepSeek-V4-Flash, run with the stable **v0.20.0-cu130** container. This recipe targets agentic workloads and enables Multi-Token Prediction (MTP) speculative decoding.
On a single GB300 (TP1) the MoE expert-parallel path is used; the `deep_gemm_mega_moe` backend from some internal recipes is not needed at TP1 and is omitted here.

```bash
docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  vllm/vllm-openai:v0.20.0-cu130 \
  deepseek-ai/DeepSeek-V4-Flash \
  --enable-expert-parallel \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --block-size 256 \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
  --attention_config.use_fp4_indexer_cache True \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
  --max-model-len 32768
```

Settings used (in addition to the base configuration):

- `--enable-expert-parallel`: shards the MoE experts across the available GPU(s) using expert parallelism, the recommended MoE execution path for DeepSeek-V4.
- `--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'`: enables **MTP (Multi-Token Prediction)** speculative decoding: the model proposes 3 tokens per step that are verified in a single forward pass, cutting latency for accepted tokens.
- `--kv-cache-dtype fp8`: FP8 KV cache to fit more concurrent/longer sequences.
- `--block-size 256`: KV-cache page size in tokens. DeepSeek-V4 uses multiple KV-cache groups; `256` matches the recipe validated on Station.
- `--attention_config.use_fp4_indexer_cache True`: enables the FP4 indexer cache used by DeepSeek-V4's attention. (Drop this flag on platforms without native FP4, e.g. Hopper.)
- `--tokenizer-mode deepseek_v4` / `--tool-call-parser deepseek_v4` / `--reasoning-parser deepseek_v4`: DeepSeek-V4-specific tokenizer, tool-call, and reasoning parsers.
- `--enable-auto-tool-choice`: OpenAI-compatible function calling for agentic use.
- `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'`: uses full + piecewise CUDA graph capture and enables all custom ops for lower per-step overhead.
- **Prefix caching is left enabled (the vLLM default).** For agentic workloads with large shared prefixes (e.g. a 32k system/context prefix) at low batch sizes (~BS 3–4), prefix caching gives a significant throughput boost by reusing the cached prefix across requests. Some internal recipes carry `--no-enable-prefix-caching`, but that was inherited from random-data benchmarking and is not recommended for agentic use here.

Check the server logs for startup progress:

```bash
docker logs -f vllm-server
```

Expected output includes:

- Model download progress (first run only)
- Model loading into GPU memory
- `Application startup complete.`

Press `Ctrl+C` to exit the log view once the server is ready.

# Step 5. Test the API

Send a test request to verify the server is working:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_HANDLE"'",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 256
  }'
```

The response should contain a `choices` array with the model's answer.

# Step 6. Cleanup

Stop and remove the container:

```bash
docker stop vllm-server
docker rm vllm-server
```

Optionally, remove the image and cached model, for example:

```bash
docker rmi ""
rm -rf $HOME/.cache/huggingface/hub/""
```

- id: troubleshooting
  label: Troubleshooting
  content: |

# Common issues

| Symptom | Cause | Fix |
|---------|--------|-----|
| "permission denied" when running docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` |
| Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run `nvidia-ctk runtime configure --runtime=docker` and restart Docker |
| "Token is required" or 401 error | Missing HuggingFace token | Ensure `HF_TOKEN` is exported before running docker command |
| Model download hangs or fails | Network or authentication issue | Check internet connection, verify HF_TOKEN is valid |
| CUDA out of memory | Context length too large | Reduce `MAX_MODEL_LEN` or lower `--gpu-memory-utilization` |
| Server not responding on port 8000 | Port already in use | Check with `lsof -i :8000`, use `-p 8001:8000` for different port |
| Model runs on wrong GPU | Default GPU selection | Use `--gpus '"device=0"'` to select specific GPU |
| NGC authentication fails | Invalid or missing credentials | Run `docker login nvcr.io` with NGC API key |
| EngineCore failed / FlashInfer "Buffer overflow when allocating memory for batch_prefill_tmp_v" | Known issue with vLLM 25.10 on some DGX Station setups during CUDA graph capture | Use the **26.01** container image: `nvcr.io/nvidia/vllm:26.01-py3` instead of 25.10. |

resources:
  - name: vLLM Documentation
    url: https://docs.vllm.ai/en/latest/