mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git synced 2026-06-18 04:22:21 +00:00

vLLM for Inference

Install and use vLLM on DGX Station

Overview

Basic idea

vLLM is an inference engine designed to run large language models efficiently. The key idea is maximizing throughput and minimizing memory waste when serving LLMs.

PagedAttention stores the KV cache in fixed-size blocks, which reduces GPU memory fragmentation and waste when serving long sequences.
Continuous batching keeps GPUs fully utilized by adding new requests to batches in progress.
OpenAI-compatible API allows applications built for OpenAI to switch to vLLM with minimal changes.

What you'll accomplish

You'll set up vLLM for high-throughput LLM serving on an NVIDIA DGX Station with Blackwell architecture, and serve one of the models listed in the Model Support Matrix below.

What to know before starting

Basic Docker container usage
Familiarity with REST APIs

Prerequisites

NVIDIA DGX Station with GB300 and RTX 6000 Pro GPUs
Docker installed: docker --version
NVIDIA Container Toolkit configured
HuggingFace account with access token
Network access to NGC and HuggingFace
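
The prerequisites above can be spot-checked before starting. The following is a minimal sketch, not part of the original playbook: `have_cmd` is a hypothetical helper name, and the script only reports what it finds on the host without installing or changing anything.

```shell
# Preflight sketch: report which prerequisites are visible on this host.

# have_cmd (hypothetical helper): succeeds if the command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

for tool in docker nvidia-smi nvidia-ctk curl; do
  if have_cmd "$tool"; then
    echo "ok:      $tool"
  else
    echo "missing: $tool"
  fi
done

# The token is only checked for presence, never printed.
if [ -n "${HF_TOKEN:-}" ]; then
  echo "ok:      HF_TOKEN is set"
else
  echo "missing: HF_TOKEN is not set in this shell"
fi
```

A "missing" line for `HF_TOKEN` is expected at this point if you have not yet reached Step 2.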

Model Support Matrix

The following models are supported with vLLM on DGX Station. All listed models are available and ready to use:

Model	Quantization	Support Status	HF Handle
DiffusionGemma 26B A4B IT	BF16	✅	`google/diffusiongemma-26B-A4B-it`
DiffusionGemma 26B A4B IT	NVFP4	✅	`nvidia/diffusiongemma-26B-A4B-it-NVFP4`
Step-3.7-Flash-FP8	FP8	✅	`stepfun-ai/Step-3.7-Flash-FP8`
Step-3.7-Flash-NVFP4	NVFP4	✅	`stepfun-ai/Step-3.7-Flash-NVFP4`
Qwen3-235B-A22B-NVFP4	NVFP4	✅	`nvidia/Qwen3-235B-A22B-NVFP4`

Time & risk

Duration: 30 minutes (longer on first run due to model download)
Risks: Model download requires HuggingFace authentication
Rollback: Stop and remove the container to restore state
Last Updated: 06/10/2026
- Update models

Instructions

Step 1. Set up Docker permissions

If you haven't already, add your user to the docker group to run Docker without sudo:

sudo usermod -aG docker $USER
newgrp docker
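
To confirm that the current shell has picked up the new group, a check along these lines may help. It is a sketch: `shell_has_group` is a hypothetical helper, and it inspects only the groups of the shell it runs in (other terminals may still need a fresh login).

```shell
# shell_has_group (hypothetical helper): succeeds if the current shell's
# group list contains the group named in $1.
shell_has_group() {
  id -nG | tr ' ' '\n' | grep -qx "$1"
}

if shell_has_group docker; then
  echo "this shell is in the docker group"
else
  echo "this shell is not in the docker group yet"
fi
```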

Step 2. Set up environment variables

Set the following so the vLLM container can download the model and use your chosen context length:

## HuggingFace token (required)
## Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"

## Model to serve
export MODEL_HANDLE="<HF_HANDLE>"

## Maximum context length
export MAX_MODEL_LEN=8192
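
Before starting the container, it can be worth checking that the variables are not still placeholders. The sketch below is one way to do that; `validate_env` is a hypothetical helper, and the checks (a slash in the handle, a whole number for the context length) are assumptions about what sensible values look like rather than rules enforced by vLLM.

```shell
# validate_env (hypothetical helper): prints problems to stderr and
# returns non-zero if any variable looks unset or still a placeholder.
validate_env() {
  status=0
  if [ -z "${HF_TOKEN:-}" ] || [ "${HF_TOKEN:-}" = "your_huggingface_token" ]; then
    echo "HF_TOKEN is empty or still the placeholder" >&2
    status=1
  fi
  case "${MODEL_HANDLE:-}" in
    */*) ;;  # expected shape: <org>/<model>
    *) echo "MODEL_HANDLE should look like <org>/<model>" >&2; status=1 ;;
  esac
  case "${MAX_MODEL_LEN:-}" in
    ''|*[!0-9]*|0)
      echo "MAX_MODEL_LEN should be a whole number greater than zero" >&2
      status=1 ;;
  esac
  return "$status"
}

if validate_env; then
  echo "environment looks usable"
else
  echo "fix the variables listed above, then re-run"
fi
```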

Step 3. Pull vLLM container image

Pull the vLLM container from NGC. Use the 26.01 image on DGX Station; the 25.10 image can fail during engine startup with a FlashInfer buffer overflow on some configurations.

docker pull nvcr.io/nvidia/vllm:26.01-py3

For DiffusionGemma, use the vLLM custom container:

docker pull vllm/vllm-openai:gemma

For Step-3.7-Flash models, pull the custom vLLM container:

docker pull vllm/vllm-openai:stepfun37
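
The image choice above depends only on the model family, so it can be derived from `MODEL_HANDLE`. This is a sketch: `image_for_model` is a hypothetical helper, the patterns are matched against the handles in the Model Support Matrix, and falling back to the NGC image for any other handle is an assumption (the playbook only lists the Qwen3 model for that image).

```shell
# image_for_model (hypothetical helper): print the container image this
# playbook uses for a given HuggingFace handle.
image_for_model() {
  case "$1" in
    *diffusiongemma*) echo "vllm/vllm-openai:gemma" ;;
    *Step-3.7-Flash*) echo "vllm/vllm-openai:stepfun37" ;;
    *)                echo "nvcr.io/nvidia/vllm:26.01-py3" ;;  # assumed default
  esac
}

# Example with a handle from the support matrix:
image_for_model "stepfun-ai/Step-3.7-Flash-FP8"
```

With the variables from Step 2 exported, `docker pull "$(image_for_model "$MODEL_HANDLE")"` would pull the matching image.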

Step 4. Start vLLM server

Start the vLLM server with the model. On a single-GPU DGX Station, --gpus all uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with --gpus '"device=N"' where N is the GB300 device ID from nvidia-smi.
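
On a multi-GPU system, the device ID can be looked up by name instead of read off by eye. The sketch below assumes the GB300 appears with "GB300" somewhere in the name reported by `nvidia-smi`; `gpu_index_by_name` is a hypothetical helper, and the two sample lines are made-up input used only to show the parsing.

```shell
# gpu_index_by_name (hypothetical helper): read "index, name" lines on stdin
# and print the index of the first GPU whose name contains the text in $1.
gpu_index_by_name() {
  awk -F', ' -v pat="$1" 'index($2, pat) { print $1; exit }'
}

# Demonstration on illustrative sample input (not real nvidia-smi output):
printf '%s\n' '0, Sample Workstation GPU' '1, Sample GB300' | gpu_index_by_name GB300

# On the DGX Station itself, the input would come from nvidia-smi:
#   N=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | gpu_index_by_name GB300)
#   echo "use: --gpus '\"device=$N\"'"
```

If nothing matches, the helper prints nothing, so check that `N` is non-empty before using it.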

For the Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.

docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve "$MODEL_HANDLE" \
    --max-model-len "$MAX_MODEL_LEN" \
    --gpu-memory-utilization 0.9

For DiffusionGemma models (e.g. google/diffusiongemma-26B-A4B-it), run with the custom vLLM container.

docker run -d \
  --name vllm-server \
  -p 8000:8000 \
  --gpus all \
  --shm-size=16g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e VLLM_USE_V2_MODEL_RUNNER=1 \
  -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:gemma "$MODEL_HANDLE" \
  --gpu-memory-utilization 0.85 \
  --attention-backend TRITON_ATTN \
  --max-num-seqs 16 \
  --diffusion-config '{"canvas_length":256}' \
  --override-generation-config '{"max_new_tokens": null}' \
  --load-format fastsafetensors \
  --enable-prefix-caching \
  --reasoning-parser gemma4 \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4

## For the BF16 checkpoint, add "--moe-backend triton" for better performance

For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and the NVFP4 versions fit entirely in VRAM on the GB300.

docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  vllm/vllm-openai:stepfun37 \
  "$MODEL_HANDLE" \
    --gpu-memory-utilization 0.95 \
    --trust-remote-code \
    --reasoning-parser step3p5 \
    --enable-auto-tool-choice \
    --tool-call-parser step3p5 \
    --kv-cache-dtype fp8

Check the server logs for startup progress:

docker logs -f vllm-server

Expected output includes:

Model download progress (first run only)
Model loading into GPU memory
Application startup complete.

Press Ctrl+C to exit log view once the server is ready.
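
As an alternative to watching the logs, a script can poll until the server answers. The sketch below is a generic retry loop; `wait_until` is a hypothetical helper, and the probe shown in the comment assumes the container's vLLM build exposes a `/health` route on port 8000 (if it does not, `/v1/models` is another candidate to probe).

```shell
# wait_until ATTEMPTS DELAY_SECONDS CMD...  (hypothetical helper)
# Runs CMD until it succeeds, up to ATTEMPTS times, sleeping DELAY_SECONDS
# between tries. Returns 0 on success and 1 if every attempt failed.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$((n + 1))
  done
  return 1
}

# Example probe (assumed endpoint), allowing up to 60 tries 10 seconds apart:
#   wait_until 60 10 curl -sf http://localhost:8000/health && echo "server is ready"
```

The first run can take much longer than later ones because of the model download, so size the attempt count accordingly.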

Step 5. Test the API

Send a test request to verify the server is working:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_HANDLE"'",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 256
  }'

The response should contain a choices array with the model's answer.
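
To print only the answer text, the JSON can be piped through a small parser. This sketch assumes `python3` is available on the host and that the response follows the OpenAI chat completions shape (`choices[0].message.content`); `extract_answer` is a hypothetical helper, and the sample JSON is hand-written for illustration, not real server output.

```shell
# extract_answer (hypothetical helper): read a chat completions response on
# stdin and print the first choice's message content.
extract_answer() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Demonstration on a hand-written sample response:
printf '%s' '{"choices":[{"message":{"role":"assistant","content":"hello"}}]}' | extract_answer
```

Appending `-s` to the `curl` command above and piping its output into `extract_answer` applies the same extraction to a live response.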

Step 6. Cleanup

Stop and remove the container:

docker stop vllm-server
docker rm vllm-server

Optionally, remove the image and the cached model. For example:

docker rmi "<docker image name>"
rm -rf "$HOME/.cache/huggingface/hub/<downloaded model name>"
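
The directory name under the hub cache can be derived from the model handle. The sketch below relies on the HuggingFace hub cache layout, where a repository `<org>/<model>` is stored as `models--<org>--<model>`; `hf_cache_dirname` is a hypothetical helper, and the example only prints a name without deleting anything.

```shell
# hf_cache_dirname (hypothetical helper): map "<org>/<model>" to the
# directory name used by the HuggingFace hub cache.
hf_cache_dirname() {
  printf 'models--%s\n' "$(printf '%s' "$1" | sed 's|/|--|g')"
}

# Example with a handle from the support matrix:
hf_cache_dirname "nvidia/Qwen3-235B-A22B-NVFP4"
```

Running `ls -d "$HOME/.cache/huggingface/hub/$(hf_cache_dirname "$MODEL_HANDLE")"` first is a cautious way to confirm the path exists before passing it to `rm -rf`. The directory is only present on the host if the container was started with the cache volume mounted.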

Troubleshooting

Common issues

Symptom	Cause	Fix
"permission denied" when running docker	User not in docker group	Run `sudo usermod -aG docker $USER && newgrp docker`
Container fails to start with GPU error	NVIDIA Container Toolkit not configured	Run `sudo nvidia-ctk runtime configure --runtime=docker` and restart Docker
"Token is required" or 401 error	Missing HuggingFace token	Ensure `HF_TOKEN` is exported before running the docker command
Model download hangs or fails	Network or authentication issue	Check the internet connection and verify that `HF_TOKEN` is valid
CUDA out of memory	Context length too large	Reduce `MAX_MODEL_LEN` (pass `--max-model-len` if your command does not already) or lower `--gpu-memory-utilization`
Server not responding on port 8000	Port already in use	Check with `lsof -i :8000`, or use `-p 8001:8000` to map a different host port
Model runs on wrong GPU	Default GPU selection	Use `--gpus '"device=0"'` to select a specific GPU, with the device ID taken from `nvidia-smi`
NGC authentication fails	Invalid or missing credentials	Run `docker login nvcr.io` with NGC API key
EngineCore failed / FlashInfer "Buffer overflow when allocating memory for batch_prefill_tmp_v"	Known issue with vLLM 25.10 on some DGX Station setups during CUDA graph capture	Use the 26.01 container image: `nvcr.io/nvidia/vllm:26.01-py3` instead of 25.10.