vLLM for Inference

Install and use vLLM on DGX Station

Overview

Basic idea

vLLM is an inference engine designed to run large language models efficiently. The key idea is maximizing throughput and minimizing memory waste when serving LLMs.

  • PagedAttention stores the KV cache in fixed-size blocks, which reduces memory fragmentation so long sequences are less likely to exhaust GPU memory.
  • Continuous batching keeps GPUs fully utilized by adding new requests to batches in progress.
  • OpenAI-compatible API allows applications built for OpenAI to switch to vLLM with minimal changes.

What you'll accomplish

You'll set up high-throughput LLM serving with vLLM on an NVIDIA DGX Station (Blackwell architecture) and serve one of the models listed in the Model Support Matrix below.

What to know before starting

  • Basic Docker container usage
  • Familiarity with REST APIs

Prerequisites

  • NVIDIA DGX Station with GB300 and RTX 6000 Pro GPUs
  • Docker installed: docker --version
  • NVIDIA Container Toolkit configured
  • HuggingFace account with access token
  • Network access to NGC and HuggingFace
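The prerequisites above can be checked up front with a small script. This is a minimal sketch, not part of the playbook: `require_cmd` and `preflight` are hypothetical helper names, and the check only confirms that each tool is on `PATH`, not that it is configured correctly.

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is available on PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Hypothetical helper: check the tools used in this playbook and
# return non-zero if any of them is missing.
preflight() {
  missing=0
  for cmd in docker nvidia-smi curl; do
    require_cmd "$cmd" || missing=$((missing + 1))
  done
  echo "missing tools: $missing"
  [ "$missing" -eq 0 ]
}
```

Calling `preflight` before Step 1 gives a quick pass/fail signal; a non-zero exit status means at least one tool still needs to be installed.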

Model Support Matrix

The following models are supported with vLLM on DGX Station. All listed models are available and ready to use:

| Model | Quantization | Support Status | HF Handle |
|---|---|---|---|
| DiffusionGemma 26B A4B IT | BF16 | Supported | google/diffusiongemma-26B-A4B-it |
| DiffusionGemma 26B A4B IT | NVFP4 | Supported | nvidia/diffusiongemma-26B-A4B-it-NVFP4 |
| Step-3.7-Flash-FP8 | FP8 | Supported | stepfun-ai/Step-3.7-Flash-FP8 |
| Step-3.7-Flash-NVFP4 | NVFP4 | Supported | stepfun-ai/Step-3.7-Flash-NVFP4 |
| Qwen3-235B-A22B-NVFP4 | NVFP4 | Supported | nvidia/Qwen3-235B-A22B-NVFP4 |

Time & risk

  • Duration: 30 minutes (longer on first run due to model download)
  • Risks: Model download requires HuggingFace authentication
  • Rollback: Stop and remove the container to restore state
  • Last Updated: 06/10/2026
    • Update models

Instructions

Step 1. Set up Docker permissions

If you haven't already, add your user to the docker group to run Docker without sudo:

sudo usermod -aG docker $USER
newgrp docker

Step 2. Set up environment variables

Set the following so the vLLM container can download the model and use your chosen context length:

## HuggingFace token (required)
## Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"

## Model to serve
export MODEL_HANDLE="<HF_HANDLE>"

## Maximum context length
export MAX_MODEL_LEN=8192
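
Before starting the container, it can help to confirm that these variables hold usable values rather than the placeholders. The sketch below is an illustration only; `validate_env` is a hypothetical helper, and the rule that a model handle looks like `<organization>/<model>` is an assumption based on the handles in the Model Support Matrix.

```shell
#!/bin/sh
# Hypothetical helper: fail early if the variables from this step are unusable.
validate_env() {
  # Reject an empty token and the placeholder value from the example above.
  if [ -z "${HF_TOKEN:-}" ] || [ "$HF_TOKEN" = "your_huggingface_token" ]; then
    echo "HF_TOKEN is not set to a real token" >&2
    return 1
  fi
  # Assumed shape of a model handle: <organization>/<model>.
  case "${MODEL_HANDLE:-}" in
    */*) ;;
    *) echo "MODEL_HANDLE should look like <organization>/<model>" >&2
       return 1 ;;
  esac
  # MAX_MODEL_LEN must consist of digits only and be greater than zero.
  case "${MAX_MODEL_LEN:-}" in
    ''|*[!0-9]*) echo "MAX_MODEL_LEN must be a positive integer" >&2
                 return 1 ;;
  esac
  if [ "$MAX_MODEL_LEN" -le 0 ]; then
    echo "MAX_MODEL_LEN must be greater than zero" >&2
    return 1
  fi
}
```

Running `validate_env` after the exports returns zero only when all three variables look plausible; it does not verify that the token is actually accepted by HuggingFace.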

Step 3. Pull vLLM container image

Pull the vLLM container from NGC. Use the 26.01 image on DGX Station; the 25.10 image can fail during engine startup with a FlashInfer buffer overflow on some configurations.

docker pull nvcr.io/nvidia/vllm:26.01-py3

For DiffusionGemma, use the vLLM custom container:

docker pull vllm/vllm-openai:gemma

For Step-3.7-Flash models, pull the custom vLLM container:

docker pull vllm/vllm-openai:stepfun37
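
The pairing of model family and container image described in this step can be captured in a small lookup. This is a sketch under the assumption that the handle contains the family name as it appears in the Model Support Matrix; `image_for_model` is a hypothetical helper, not something shipped with vLLM.

```shell
#!/bin/sh
# Hypothetical helper: pick the container image for a model handle,
# following the pairing described in this step.
image_for_model() {
  case "$1" in
    *diffusiongemma*) echo "vllm/vllm-openai:gemma" ;;
    *Step-3.7-Flash*) echo "vllm/vllm-openai:stepfun37" ;;
    *)                echo "nvcr.io/nvidia/vllm:26.01-py3" ;;
  esac
}

# Typical use, once MODEL_HANDLE is exported:
#   docker pull "$(image_for_model "$MODEL_HANDLE")"
```

Any handle that matches neither custom family falls through to the NGC image, which is the default used for the Qwen3 model in this playbook.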

Step 4. Start vLLM server

Start the vLLM server with the model. On a single-GPU DGX Station, --gpus all uses the GB300. If you have multiple GPUs and want to use only the GB300, replace it with --gpus '"device=N"', where N is the GB300 device ID reported by nvidia-smi.
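
Finding the device ID can be scripted by filtering the `nvidia-smi` GPU list. The sketch below is an assumption-laden illustration: `gpu_index_for` is a hypothetical helper, and it assumes the GPU name reported by `nvidia-smi` contains the substring you pass in (for example `GB300`), which you should confirm on your own system.

```shell
#!/bin/sh
# Hypothetical helper: read "index, name" lines on stdin and print the index
# of the first GPU whose name contains the given substring.
# Exits non-zero when no GPU matches.
gpu_index_for() {
  awk -F', *' -v want="$1" '
    index($2, want) { print $1; found = 1; exit }
    END { exit !found }
  '
}

# Typical use (requires the NVIDIA driver):
#   nvidia-smi --query-gpu=index,name --format=csv,noheader | gpu_index_for GB300
```

The printed index is the value to substitute for N in --gpus '"device=N"'.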

For the Qwen3-235B-A22B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.

docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve "$MODEL_HANDLE" \
    --max-model-len "$MAX_MODEL_LEN" \
    --gpu-memory-utilization 0.9

For DiffusionGemma models (e.g. google/diffusiongemma-26B-A4B-it), run with the custom vLLM container:

docker run -d \
  --name vllm-server \
  -p 8000:8000 \
  --gpus all \
  --shm-size=16g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e VLLM_USE_V2_MODEL_RUNNER=1 \
  -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:gemma "$MODEL_HANDLE" \
  --gpu-memory-utilization 0.85 \
  --attention-backend TRITON_ATTN \
  --max-num-seqs 16 \
  --diffusion-config '{"canvas_length":256}' \
  --override-generation-config '{"max_new_tokens": null}' \
  --load-format fastsafetensors \
  --enable-prefix-caching \
  --reasoning-parser gemma4 \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4

## For the BF16 checkpoint, add "--moe-backend triton" for better performance

For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300.

docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  vllm/vllm-openai:stepfun37 \
  "$MODEL_HANDLE" \
    --gpu-memory-utilization 0.95 \
    --trust-remote-code \
    --reasoning-parser step3p5 \
    --enable-auto-tool-choice \
    --tool-call-parser step3p5 \
    --kv-cache-dtype fp8

Check the server logs for startup progress:

docker logs -f vllm-server

Expected output includes:

  • Model download progress (first run only)
  • Model loading into GPU memory
  • Application startup complete.

Press Ctrl+C to exit log view once the server is ready.
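
Instead of watching the logs by hand, readiness can also be polled from a script. This is a minimal sketch, assuming the server exposes a `/health` endpoint on the mapped port, as the vLLM OpenAI-compatible server does; the custom containers may differ. `wait_for_server` and its default timeout and interval are hypothetical choices.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it answers with an HTTP success
# status or the timeout expires.
# Arguments: URL, timeout in seconds, polling interval in seconds.
wait_for_server() {
  url=${1:-http://localhost:8000/health}
  timeout=${2:-600}
  interval=${3:-5}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # -f makes curl return non-zero on HTTP errors; -s silences progress output.
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      echo "server is ready"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "server did not become ready within ${timeout}s" >&2
  return 1
}
```

Large models can take a long time to download and load on the first run, so a generous timeout is usually the safer choice.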

Step 5. Test the API

Send a test request to verify the server is working:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_HANDLE"'",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 256
  }'

The response should contain a choices array with the model's answer.
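
The check described above can be automated with a simple pattern match on the response body. This is only a sketch: `response_has_choices` and `smoke_test` are hypothetical helpers, the match is textual rather than a real JSON parse, and `smoke_test` assumes that `MODEL_HANDLE` contains no characters that need JSON escaping.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a response body contains a "choices" key.
response_has_choices() {
  printf '%s' "$1" | grep -q '"choices"[[:space:]]*:'
}

# Hypothetical wrapper: send a short chat request to the local server
# and check that the reply has the expected shape.
smoke_test() {
  body=$(curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "'"$MODEL_HANDLE"'", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 16}')
  response_has_choices "$body"
}
```

An error reply, such as one caused by a wrong model name, does not contain a choices array, so `smoke_test` returns non-zero in that case.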

Step 6. Cleanup

Stop and remove the container:

docker stop vllm-server
docker rm vllm-server

Optionally, remove the image and the cached model. For example:

docker rmi "<docker image name>"
rm -rf "$HOME/.cache/huggingface/hub/<downloaded model name>"
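
To locate the cached model directory, note that the HuggingFace hub cache stores each model under a folder named `models--<organization>--<model>`. The sketch below derives that path from a model handle; `hf_cache_dir` is a hypothetical helper, and it assumes the default cache location that the docker run commands mount from the host.

```shell
#!/bin/sh
# Hypothetical helper: print the host cache directory for a model handle,
# assuming the default HuggingFace hub cache layout (models--<org>--<model>).
hf_cache_dir() {
  name=$(printf '%s' "$1" | sed 's|/|--|g')
  printf '%s\n' "$HOME/.cache/huggingface/hub/models--$name"
}

# Typical use: inspect the directory before deleting it.
#   ls -d "$(hf_cache_dir "$MODEL_HANDLE")"
```

Listing the directory before removing it is a cheap safeguard against deleting the wrong path.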

Troubleshooting

Common issues

| Symptom | Cause | Fix |
|---|---|---|
| "permission denied" when running docker | User not in docker group | Run sudo usermod -aG docker $USER && newgrp docker |
| Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run nvidia-ctk runtime configure --runtime=docker and restart Docker |
| "Token is required" or 401 error | Missing HuggingFace token | Ensure HF_TOKEN is exported before running the docker command |
| Model download hangs or fails | Network or authentication issue | Check the internet connection and verify that HF_TOKEN is valid |
| CUDA out of memory | Context length too large | Reduce MAX_MODEL_LEN or lower --gpu-memory-utilization |
| Server not responding on port 8000 | Port already in use | Check with lsof -i :8000; use -p 8001:8000 to map a different port |
| Model runs on wrong GPU | Default GPU selection | Use --gpus '"device=0"' to select a specific GPU |
| NGC authentication fails | Invalid or missing credentials | Run docker login nvcr.io with an NGC API key |
| EngineCore failed / FlashInfer "Buffer overflow when allocating memory for batch_prefill_tmp_v" | Known issue with vLLM 25.10 on some DGX Station setups during CUDA graph capture | Use the 26.01 container image (nvcr.io/nvidia/vllm:26.01-py3) instead of 25.10 |