dgx-spark-playbooks/nvidia/station-vllm/endpoint-test.yaml

kind: Playbook
metadata:
  name: station-vllm
  displayName: vLLM for Inference
  shortDescription: Install and use vLLM on DGX Station
  publisher: nvidia
  description: |
    # REPLACE THIS WITH YOUR MODEL CARD
    https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads
    
  labelsV2:
  - gpuType:playbook:gpu_type_station
  - Inference
  - vLLM
  
  attributes:
  - key: DURATION
    value: 30 MIN
  
spec:
  artifactName: station-vllm
  nvcfFunctionId: None
  attributes:

    showUnavailableBanner: false
    apiDocsUrl: None
    termsOfUse: |
      
    tabs:
    - 
      id: overview
      
      label: Overview
      content: |
        # Basic idea
        
        vLLM is an inference engine designed to run large language models efficiently. The key idea is **maximizing throughput and minimizing memory waste** when serving LLMs.
        
        - **PagedAttention** handles long sequences without running out of GPU memory.
        - **Continuous batching** keeps GPUs fully utilized by adding new requests to batches in progress.
        - **OpenAI-compatible API** allows applications built for OpenAI to switch to vLLM with minimal changes.
        
        # What you'll accomplish
        
        Serve a **supported model** with vLLM on an NVIDIA DGX Station (Blackwell architecture), configured for high-throughput LLM serving. The Model Support Matrix below lists the supported models.
        
        # What to know before starting
        
        - Basic Docker container usage
        - Familiarity with REST APIs
        
        # Prerequisites
        
        - NVIDIA DGX Station with GB300 and RTX PRO 6000 GPUs
        - Docker installed: `docker --version`
        - NVIDIA Container Toolkit configured
        - HuggingFace account with access token
        - Network access to NGC and HuggingFace
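
        The tool checks above can be scripted. The following is a minimal sketch, assuming the prerequisites map to the `docker`, `nvidia-smi`, `nvidia-ctk`, and `curl` commands being on `PATH`; the helper name `check_prereqs` is hypothetical and it only tests for presence, not for correct configuration.

        ```bash
        # Report whether each required command is on PATH.
        # Returns 0 if all are present, 1 if any is missing.
        check_prereqs() {
          missing=0
          for cmd in "$@"; do
            if command -v "$cmd" >/dev/null 2>&1; then
              echo "ok:      $cmd"
            else
              echo "missing: $cmd"
              missing=1
            fi
          done
          return "$missing"
        }

        # Usage on the Station:
        #   check_prereqs docker nvidia-smi nvidia-ctk curl
        ```

        A non-zero return code means at least one tool still needs to be installed before continuing.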
        
        # Model Support Matrix
        
        The following models are supported with vLLM on DGX Station. All listed models are available and ready to use:
        
        | Model | Quantization | Support Status | HF Handle |
        |-------|-------------|----------------|-----------|
        | **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
        | **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
        | **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
        | **Kimi-K2.5 (1T)** | NVFP4 | ✅ | [`nvidia/Kimi-K2.5-NVFP4`](https://huggingface.co/nvidia/Kimi-K2.5-NVFP4) |
        | **DeepSeek-V4-Flash** | NVFP4 | ✅ | [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) |
        
        # Time & risk
        
        * **Duration:** 30 minutes (longer on first run due to model download)
        * **Risks:** Model download requires HuggingFace authentication
        * **Rollback:** Stop and remove the container to restore state
        * **Last Updated:** 05/29/2026
          * Update models
          * Add base configuration example, per-setting explanations, and DeepSeek-V4-Flash recipe
        
      
    - 
      id: instructions
      
      label: Instructions
      content: |
        # Step 1. Set up Docker permissions
        
        If you haven't already, add your user to the docker group to run Docker without sudo:
        
        ```bash
        sudo usermod -aG docker $USER
        newgrp docker
        ```
        
        # Step 2. Set up environment variables
        
        Set the following environment variables so the vLLM container can download the model and use your chosen context length:
        
        ```bash
        # HuggingFace token (required)
        # Get a token from https://huggingface.co/settings/tokens
        export HF_TOKEN="your_huggingface_token"
        
        # Model to serve
        export MODEL_HANDLE="<HF_HANDLE>"
        
        # Maximum context length
        export MAX_MODEL_LEN=8192
        ```
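
        Because a leftover placeholder only surfaces later as a failed download, it may help to validate the variables first. This is a sketch under the assumption that a valid handle has the form `<org>/<model>`; the function name `validate_env` is hypothetical.

        ```bash
        # Validate the three variables from this step before starting the container.
        # Returns 0 if all look usable, 1 otherwise (messages go to stderr).
        validate_env() {
          rc=0
          if [ -z "${HF_TOKEN:-}" ] || [ "$HF_TOKEN" = "your_huggingface_token" ]; then
            echo "HF_TOKEN is unset or still the placeholder" >&2
            rc=1
          fi
          case "${MODEL_HANDLE:-}" in
            */*) ;;  # expected form: <org>/<model>
            *) echo "MODEL_HANDLE should look like <org>/<model>" >&2; rc=1 ;;
          esac
          case "${MAX_MODEL_LEN:-}" in
            ''|0|*[!0-9]*) echo "MAX_MODEL_LEN must be a positive integer" >&2; rc=1 ;;
          esac
          return "$rc"
        }

        # Usage: validate_env && echo "environment looks good"
        ```

        The check is purely syntactic: it does not confirm that the token is valid or that the model exists.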
        
        # Step 3. Pull vLLM container image
        
        Pull the vLLM container from NGC. Use the **26.01** image on DGX Station; the 25.10 image can fail during engine startup with a FlashInfer buffer overflow on some configurations.
        
        ```bash
        docker pull nvcr.io/nvidia/vllm:26.01-py3
        ```
        
        For Step-3.7-Flash models, pull the custom vLLM container:
        ```bash
        docker pull vllm/vllm-openai:stepfun37
        ```
        
        For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, pull the **26.03** image, which includes the `--cpu-offload-params` support used below:
        ```bash
        docker pull nvcr.io/nvidia/vllm:26.03-py3
        ```
        
        For DeepSeek-V4-Flash, pull the stable DeepSeek-V4 release container. Use the **cu130** build on DGX Station (Blackwell):
        ```bash
        docker pull vllm/vllm-openai:v0.20.0-cu130
        ```
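
        The model-to-image pairing described in this step can be captured in one helper so the right image is pulled for whichever `MODEL_HANDLE` is set. This is a sketch that only encodes the pairings listed above; the function name `image_for_model` is hypothetical.

        ```bash
        # Print the container image this playbook pairs with a given model handle.
        image_for_model() {
          case "$1" in
            stepfun-ai/Step-3.7-Flash-*)   echo "vllm/vllm-openai:stepfun37" ;;
            nvidia/Kimi-K2.5-NVFP4)        echo "nvcr.io/nvidia/vllm:26.03-py3" ;;
            deepseek-ai/DeepSeek-V4-Flash) echo "vllm/vllm-openai:v0.20.0-cu130" ;;
            *)                             echo "nvcr.io/nvidia/vllm:26.01-py3" ;;
          esac
        }

        # Usage: docker pull "$(image_for_model "$MODEL_HANDLE")"
        ```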
        
        # Step 4. Start vLLM server
        
        Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace it with `--gpus '"device=N"'`, where N is the GB300 device ID reported by `nvidia-smi`.
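
        Looking up the device ID can be automated. The sketch below parses the `index, name` rows printed by `nvidia-smi --query-gpu=index,name --format=csv,noheader`; the helper name `find_gpu_index` is hypothetical, and the sample rows are illustrative only, since the exact name strings on your system may differ.

        ```bash
        # Print the index of the first GPU whose name contains the given text.
        # Reads "index, name" rows on stdin.
        find_gpu_index() {
          awk -F', ' -v pat="$1" 'index($2, pat) { print $1; exit }'
        }

        # Illustrative sample rows (not captured from a real system).
        printf '%s\n' '0, Example Workstation GPU' '1, NVIDIA GB300' | find_gpu_index GB300
        # prints: 1

        # Usage on the Station:
        #   N=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | find_gpu_index GB300)
        #   docker run --gpus "\"device=$N\"" ...
        ```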
        
        ## Base configuration (most models)
        
        This is the recommended starting point for any model that fits entirely in VRAM on the GB300. The Qwen3-235B-A22B-NVFP4 model, for example, runs directly with this configuration.
        
        ```bash
        docker run -d \
          --name vllm-server \
          --gpus all \
          --ipc host \
          --ulimit memlock=-1 \
          --ulimit stack=67108864 \
          -p 8000:8000 \
          -e HF_TOKEN="$HF_TOKEN" \
          -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
          nvcr.io/nvidia/vllm:26.01-py3 \
          vllm serve "$MODEL_HANDLE" \
            --max-model-len $MAX_MODEL_LEN \
            --gpu-memory-utilization 0.9
        ```
        
        Settings used:
        - `--max-model-len` — maximum context length (prompt + output) per request. Larger values reserve more GPU memory for the KV cache; size it to your workload.
        - `--gpu-memory-utilization 0.9` — fraction of GPU memory vLLM may use for weights and KV cache. `0.9` leaves headroom for other processes; raise toward `0.95` to fit more KV cache if the GPU is dedicated.
        
        ## Step-3.7-Flash (FP8 / NVFP4)
        
        For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300.
        
        ```bash
        docker run -d \
          --name vllm-server \
          --gpus all \
          --ipc host \
          --ulimit memlock=-1 \
          --ulimit stack=67108864 \
          -p 8000:8000 \
          -e HF_TOKEN="$HF_TOKEN" \
          -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
          vllm/vllm-openai:stepfun37 \
          "$MODEL_HANDLE" \
            --gpu-memory-utilization 0.95 \
            --trust-remote-code \
            --reasoning-parser step3p5 \
            --enable-auto-tool-choice \
            --tool-call-parser step3p5 \
            --kv-cache-dtype fp8
        ```
        
        Settings used (in addition to the base configuration):
        - `--trust-remote-code` — allows the model's custom modeling code (shipped in its repo) to load. Required for Step-3.7.
        - `--reasoning-parser step3p5` — parses the model's reasoning/thinking tokens into the dedicated `reasoning_content` response field.
        - `--enable-auto-tool-choice` — lets the model decide when to call a tool, enabling OpenAI-compatible function calling.
        - `--tool-call-parser step3p5` — parses the model's tool-call output into structured `tool_calls`. Pairs with `--enable-auto-tool-choice`.
        - `--kv-cache-dtype fp8` — stores the KV cache in FP8, roughly halving KV-cache memory versus 16-bit and allowing more concurrent/longer sequences.
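
        The "roughly halving" claim follows from simple arithmetic. As a sketch, for a standard multi-head or grouped-query attention layout the KV cache per token is 2 (K and V) x layers x KV heads x head dimension x bytes per element. The dimensions below are hypothetical and are not Step-3.7-Flash's actual architecture; models with other attention schemes, and any FP8 scaling metadata, are not covered by this estimate.

        ```bash
        # KV-cache bytes per token for a standard attention layout.
        # Arguments: <layers> <kv heads> <head dim> <bytes per element>
        kv_bytes_per_token() {
          echo $(( 2 * $1 * $2 * $3 * $4 ))
        }

        # Hypothetical dimensions, for illustration only.
        LAYERS=48; KV_HEADS=8; HEAD_DIM=128
        echo "16-bit: $(kv_bytes_per_token $LAYERS $KV_HEADS $HEAD_DIM 2) bytes/token"
        echo "fp8:    $(kv_bytes_per_token $LAYERS $KV_HEADS $HEAD_DIM 1) bytes/token"
        ```

        With these example numbers the 16-bit cache needs 196608 bytes per token and the FP8 cache 98304, so the same memory budget holds about twice as many cached tokens.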
        
        ## Kimi-K2.5 NVFP4 (1T) — CPU offloading
        
        For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, run with the **26.03** NGC container. This model does not fit entirely in VRAM, so the MoE expert weights are offloaded to CPU DRAM with `--cpu-offload-gb 375 --cpu-offload-params experts`. Ensure the system has enough free DRAM to hold the offloaded weights.
        
        ```bash
        docker run -d \
          --name vllm-server \
          --gpus all \
          --ipc host \
          --ulimit memlock=-1 \
          --ulimit stack=67108864 \
          -p 8000:8000 \
          -e HF_TOKEN="$HF_TOKEN" \
          -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
          nvcr.io/nvidia/vllm:26.03-py3 \
          vllm serve nvidia/Kimi-K2.5-NVFP4 \
            --host 0.0.0.0 \
            --port 8000 \
            --dtype auto \
            --kv-cache-dtype auto \
            --gpu-memory-utilization 0.95 \
            --served-model-name nvidia/Kimi-K2.5-NVFP4 \
            --tensor-parallel-size 1 \
            --no-enable-prefix-caching \
            --trust-remote-code \
            --max-model-len 40960 \
            --max-num-seqs 1 \
            --max-num-batched-tokens 32768 \
            --cpu-offload-gb 375 \
            --cpu-offload-params experts
        ```
        
        Settings used (in addition to the base configuration):
        - `--cpu-offload-gb 375` — amount of CPU DRAM (in GiB) vLLM may use to hold weights that don't fit in VRAM. Must be large enough for the offloaded experts; the system needs at least this much free DRAM.
        - `--cpu-offload-params experts` — offloads only the MoE expert weights (the bulk of a large MoE model) to DRAM, keeping attention and other hot weights in VRAM.
        - `--tensor-parallel-size 1` — single GPU; the GB300 serves the whole model.
        - `--max-num-seqs 1` / `--max-num-batched-tokens 32768` — caps concurrency to one sequence and the batch token budget. With expert weights paged from DRAM, throughput is offload-bound, so a low concurrency keeps latency predictable.
        - `--no-enable-prefix-caching` — disables prefix-cache reuse. Offloaded experts make the memory budget tight, so the cache is turned off here rather than spent on KV reuse.
        - `--kv-cache-dtype auto` / `--dtype auto` — let vLLM pick the KV-cache and compute dtypes from the model's quantization (NVFP4).
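
        Since the offload needs that much free DRAM, a rough pre-launch check can save a long failed startup. This sketch reads `MemAvailable` from `/proc/meminfo` (reported in kB) and compares it with the offload size; both helper names are hypothetical, and `MemAvailable` is only the kernel's estimate, so treat a pass as necessary rather than sufficient.

        ```bash
        # Print MemAvailable in whole GiB from a /proc/meminfo-style file.
        mem_available_gib() {
          awk '/^MemAvailable:/ { print int($2 / 1048576) }' "${1:-/proc/meminfo}"
        }

        # Succeed only if available DRAM covers the requested offload size in GiB.
        # Arguments: <required GiB> [meminfo file]
        has_dram_for_offload() {
          avail=$(mem_available_gib "${2:-/proc/meminfo}")
          [ "${avail:-0}" -ge "$1" ]
        }

        # Usage before launching the Kimi-K2.5 recipe:
        #   has_dram_for_offload 375 || echo "Not enough free DRAM for --cpu-offload-gb 375" >&2
        ```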
        
        ## DeepSeek-V4-Flash — MTP + agentic
        
        For DeepSeek-V4-Flash, run with the stable **v0.20.0-cu130** container. This recipe targets agentic workloads and enables Multi-Token Prediction (MTP) speculative decoding. On a single GB300 (TP1) the MoE expert-parallel path is used; the `deep_gemm_mega_moe` backend from some internal recipes is not needed at TP1 and is omitted here.
        
        ```bash
        docker run -d \
          --name vllm-server \
          --gpus all \
          --ipc host \
          --ulimit memlock=-1 \
          --ulimit stack=67108864 \
          -p 8000:8000 \
          -e HF_TOKEN="$HF_TOKEN" \
          -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
          vllm/vllm-openai:v0.20.0-cu130 \
          deepseek-ai/DeepSeek-V4-Flash \
            --enable-expert-parallel \
            --kv-cache-dtype fp8 \
            --trust-remote-code \
            --block-size 256 \
            --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
            --attention_config.use_fp4_indexer_cache True \
            --tokenizer-mode deepseek_v4 \
            --tool-call-parser deepseek_v4 \
            --enable-auto-tool-choice \
            --reasoning-parser deepseek_v4 \
            --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
            --max-model-len 32768
        ```
        
        Settings used (in addition to the base configuration):
        - `--enable-expert-parallel` — shards the MoE experts across the available GPU(s) using expert parallelism, the recommended MoE execution path for DeepSeek-V4.
        - `--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'` — enables **MTP (Multi-Token Prediction)** speculative decoding: the model proposes 3 tokens per step that are verified in a single forward pass, cutting latency for accepted tokens.
        - `--kv-cache-dtype fp8` — FP8 KV cache to fit more concurrent/longer sequences.
        - `--block-size 256` — KV-cache page size in tokens. DeepSeek-V4 uses multiple KV-cache groups; `256` matches the recipe validated on Station.
        - `--attention_config.use_fp4_indexer_cache True` — enables the FP4 indexer cache used by DeepSeek-V4's attention. (Drop this flag on platforms without native FP4, e.g. Hopper.)
        - `--tokenizer-mode deepseek_v4` / `--tool-call-parser deepseek_v4` / `--reasoning-parser deepseek_v4` — DeepSeek-V4-specific tokenizer, tool-call, and reasoning parsers.
        - `--enable-auto-tool-choice` — OpenAI-compatible function calling for agentic use.
        - `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` — uses full + piecewise CUDA graph capture and enables all custom ops for lower per-step overhead.
        - **Prefix caching is left enabled (the vLLM default).** For agentic workloads with large shared prefixes (e.g. a 32k system/context prefix) at low batch sizes (~BS 3–4), prefix caching gives a significant throughput boost by reusing the cached prefix across requests. Some internal recipes carry `--no-enable-prefix-caching`, but that was inherited from random-data benchmarking and is not recommended for agentic use here.
        
        Check the server logs for startup progress:
        
        ```bash
        docker logs -f vllm-server
        ```
        
        Expected output includes:
        - Model download progress (first run only)
        - Model loading into GPU memory
        - `Application startup complete.`
        
        Press `Ctrl+C` to exit the log view once the server is ready.
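
        Instead of watching the logs, readiness can be polled from a script. The sketch below is a generic retry loop; the helper name `wait_until` is hypothetical, and the usage line assumes the server exposes vLLM's `/health` endpoint on the port mapped in Step 4.

        ```bash
        # Retry a command until it succeeds or the attempt budget runs out.
        # Arguments: <max attempts> <seconds between attempts> <command...>
        wait_until() {
          attempts=$1
          delay=$2
          shift 2
          i=1
          while [ "$i" -le "$attempts" ]; do
            if "$@" >/dev/null 2>&1; then
              return 0
            fi
            i=$((i + 1))
            sleep "$delay"
          done
          return 1
        }

        # Usage: poll for up to about 30 minutes (180 attempts, 10 s apart).
        #   wait_until 180 10 curl -sf http://localhost:8000/health && echo "server ready"
        ```

        The attempt count and delay are arbitrary starting points; first runs that download a large model may need a bigger budget.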
        
        # Step 5. Test the API
        
        Send a test request to verify the server is working:
        
        ```bash
        curl http://localhost:8000/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{
            "model": "'"$MODEL_HANDLE"'",
            "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
            "max_tokens": 256
          }'
        ```
        
        The response should contain a `choices` array with the model's answer.
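
        For servers started with `--enable-auto-tool-choice` and a tool-call parser (the Step-3.7-Flash and DeepSeek-V4-Flash recipes), a function-calling request can be tested the same way. This is a sketch using the OpenAI-style `tools` field; the `get_weather` tool, its schema, and the helper name `tool_request` are hypothetical and exist only for illustration.

        ```bash
        # Build a chat request that advertises one tool to the model.
        # Argument: <model name as served>. The value is inserted as-is (no JSON escaping).
        tool_request() {
          printf '%s' \
            '{"model":"' "$1" '",' \
            '"messages":[{"role":"user","content":"What is the weather in Paris?"}],' \
            '"tools":[{"type":"function","function":{"name":"get_weather",' \
            '"description":"Get the current weather for a city",' \
            '"parameters":{"type":"object","properties":{"city":{"type":"string"}},' \
            '"required":["city"]}}}],' \
            '"tool_choice":"auto","max_tokens":256}'
        }

        # Usage:
        #   tool_request "$MODEL_HANDLE" | curl -s http://localhost:8000/v1/chat/completions \
        #     -H "Content-Type: application/json" -d @-
        ```

        If the model decides to call the tool, the response message should carry a structured `tool_calls` entry rather than plain text.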
        
        # Step 6. Cleanup
        
        Stop and remove the container:
        
        ```bash
        docker stop vllm-server
        docker rm vllm-server
        ```
        
        Optionally, remove the image and the cached model. For example:

        ```bash
        docker rmi "<docker image name>"
        # Hub cache directories are named models--<org>--<model name>
        rm -rf "$HOME/.cache/huggingface/hub/<downloaded model name>"
        ```
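
        The cached model's directory name can be derived from its handle, because the HuggingFace hub cache stores each model under `models--<org>--<model name>`. A minimal sketch, with the hypothetical helper name `hf_cache_dir`:

        ```bash
        # Map a HuggingFace handle (<org>/<model>) to its hub cache directory name.
        hf_cache_dir() {
          echo "models--$(echo "$1" | sed 's|/|--|g')"
        }

        hf_cache_dir "nvidia/Kimi-K2.5-NVFP4"
        # prints: models--nvidia--Kimi-K2.5-NVFP4

        # Usage:
        #   rm -rf "$HOME/.cache/huggingface/hub/$(hf_cache_dir "$MODEL_HANDLE")"
        ```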
        
      
    - 
      id: troubleshooting
      
      label: Troubleshooting
      content: |
        # Common issues
        
        | Symptom | Cause | Fix |
        |---------|--------|-----|
        | "permission denied" when running docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` |
        | Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run `sudo nvidia-ctk runtime configure --runtime=docker`, then restart Docker with `sudo systemctl restart docker` |
        | "Token is required" or 401 error | Missing HuggingFace token | Ensure `HF_TOKEN` is exported before running the `docker run` command |
        | Model download hangs or fails | Network or authentication issue | Check the internet connection and verify that `HF_TOKEN` is valid |
        | CUDA out of memory | Context length too large | Reduce `MAX_MODEL_LEN` or lower `--gpu-memory-utilization` |
        | Server not responding on port 8000 | Port already in use | Check with `lsof -i :8000`; use `-p 8001:8000` to map a different host port |
        | Model runs on wrong GPU | Default GPU selection | Use `--gpus '"device=0"'` to select a specific GPU |
        | NGC authentication fails | Invalid or missing credentials | Run `docker login nvcr.io` with NGC API key |
        | EngineCore failed / FlashInfer "Buffer overflow when allocating memory for batch_prefill_tmp_v" | Known issue with the 25.10 vLLM container on some DGX Station setups during CUDA graph capture | Use the **26.01** container image (`nvcr.io/nvidia/vllm:26.01-py3`) instead of 25.10 |
        
      
    resources:
    - name: vLLM Documentation
      url: https://docs.vllm.ai/en/latest/