# vLLM for Inference

> Install and use vLLM on DGX Station

## Table of Contents

- [Overview](#overview)
- [Instructions](#instructions)
- [Troubleshooting](#troubleshooting)

---

## Overview

## Basic idea

vLLM is an inference engine designed to run large language models efficiently. The key idea is **maximizing throughput and minimizing memory waste** when serving LLMs.

- **PagedAttention** stores the KV cache in fixed-size blocks, which reduces memory fragmentation and makes long sequences less likely to exhaust GPU memory.
- **Continuous batching** keeps GPUs fully utilized by adding new requests to batches in progress.
- **OpenAI-compatible API** allows applications built for OpenAI to switch to vLLM with minimal changes.

## What you'll accomplish

You'll set up high-throughput LLM serving with vLLM on an NVIDIA DGX Station (Blackwell architecture) and serve a **supported model** from the [Model Support Matrix](#model-support-matrix).

## What to know before starting

- Basic Docker container usage
- Familiarity with REST APIs

## Prerequisites

- NVIDIA DGX Station with GB300 and RTX 6000 Pro GPUs
- Docker installed: `docker --version`
- NVIDIA Container Toolkit configured
- HuggingFace account with access token
- Network access to NGC and HuggingFace
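
The tool prerequisites above can be checked from a terminal before starting. The following is a minimal sketch rather than part of the playbook: `missing_tools` is a hypothetical helper name, and it only confirms that the commands are on `PATH`, not that the NVIDIA Container Toolkit is configured correctly.

```bash
# Hypothetical helper: print each named command that is not found on PATH.
# Returns 0 when every command is present, 1 otherwise.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      rc=1
    fi
  done
  return "$rc"
}

# Example:
#   missing_tools docker nvidia-smi nvidia-ctk curl
```

An empty result with exit status 0 means every listed command was found.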

## Model Support Matrix

The following models are supported with vLLM on DGX Station. All listed models are available and ready to use:

| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |

## Time & risk

* **Duration:** 30 minutes (longer on first run due to model download)
* **Risks:** Model download requires HuggingFace authentication
* **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 05/28/2026
  * Update models

## Instructions

## Step 1. Set up Docker permissions

If you haven't already, add your user to the docker group to run Docker without sudo:

```bash
sudo usermod -aG docker $USER
newgrp docker
```

## Step 2. Set up environment variables

Set the following so the vLLM container can download the model and use your chosen context length:

```bash
## HuggingFace token (required)
## Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"

## Model to serve
export MODEL_HANDLE="<HF_HANDLE>"

## Maximum context length
export MAX_MODEL_LEN=8192
```
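
Placeholder values left in these variables tend to surface later as confusing download or startup errors. As a rough sketch, the hypothetical helper `check_vllm_env` below validates the three variables before any container is started. The specific checks (placeholder detection, the `org/model` shape, a non-zero integer) are assumptions about what a valid value looks like, not rules enforced by vLLM.

```bash
# Hypothetical helper: validate the three variables from this step.
# Prints one message per problem and returns non-zero if any check fails.
check_vllm_env() {
  rc=0
  if [ -z "${HF_TOKEN:-}" ] || [ "${HF_TOKEN:-}" = "your_huggingface_token" ]; then
    echo "HF_TOKEN is unset or still the placeholder"
    rc=1
  fi
  case "${MODEL_HANDLE:-}" in
    ""|*"<"*|*">"*)
      echo "MODEL_HANDLE is unset or still a placeholder"
      rc=1
      ;;
    */*) ;;  # expected form: org/model
    *)
      echo "MODEL_HANDLE should look like org/model"
      rc=1
      ;;
  esac
  case "${MAX_MODEL_LEN:-}" in
    ""|0|*[!0-9]*)
      echo "MAX_MODEL_LEN must be a positive integer"
      rc=1
      ;;
  esac
  return "$rc"
}
```

Running `check_vllm_env && echo "environment looks OK"` after the exports gives a quick pass/fail signal.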

## Step 3. Pull vLLM container image

Pull the vLLM container from NGC. Use the **26.01** image on DGX Station; the 25.10 image can fail during engine startup with a FlashInfer buffer overflow on some configurations.

```bash
docker pull nvcr.io/nvidia/vllm:26.01-py3
```

For Step-3.7-Flash models, pull the custom vLLM container instead:

```bash
docker pull vllm/vllm-openai:stepfun37
```

## Step 4. Start vLLM server

Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300. If the system has multiple GPUs and you want to use only the GB300, replace `--gpus all` with `--gpus '"device=N"'`, where `N` is the GB300 device index reported by `nvidia-smi`.
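
On a multi-GPU system, the device index can be looked up by name instead of read off by eye. The sketch below assumes the GB300 is reported with `GB300` somewhere in its name, which is worth confirming against the actual `nvidia-smi` output. `gpu_index_for` is a hypothetical helper that parses the `index, name` lines produced by `nvidia-smi --query-gpu=index,name --format=csv,noheader`.

```bash
# Hypothetical helper: read "index, name" lines on stdin and print the index
# of the first GPU whose name contains the given substring.
# Exits non-zero if no GPU matches.
gpu_index_for() {
  awk -F', ' -v want="$1" '
    index($2, want) { print $1; found = 1; exit }
    END { exit !found }
  '
}

# Example (requires the NVIDIA driver on the host):
#   GPU_ID=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | gpu_index_for GB300)
#   docker run --gpus "\"device=$GPU_ID\"" ...
```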

For the Qwen3-235B-A22B-NVFP4 model, run the NGC container. This model fits entirely in VRAM on the GB300.

```bash
docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve "$MODEL_HANDLE" \
    --max-model-len "$MAX_MODEL_LEN" \
    --gpu-memory-utilization 0.9
```

For Step-3.7-Flash models, run the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300. This command does not pass `--max-model-len`, so `MAX_MODEL_LEN` is not used here.

```bash
docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  vllm/vllm-openai:stepfun37 \
  "$MODEL_HANDLE" \
    --gpu-memory-utilization 0.95 \
    --trust-remote-code \
    --reasoning-parser step3p5 \
    --enable-auto-tool-choice \
    --tool-call-parser step3p5 \
    --kv-cache-dtype fp8
```

Check the server logs for startup progress:

```bash
docker logs -f vllm-server
```

Expected output includes:
- Model download progress (first run only)
- Model loading into GPU memory
- `Application startup complete.`

Press `Ctrl+C` to exit log view once the server is ready.
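
Instead of watching the logs, a script can poll until the server answers. The sketch below is one way to do that, assuming the server exposes vLLM's `/health` endpoint on the port mapped in Step 4. `wait_until` is a hypothetical helper, and the attempt count and delay in the example are arbitrary choices.

```bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command...>
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  while [ "$attempts" -gt 0 ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    if [ "$attempts" -gt 0 ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example: try every 10 seconds, up to 180 times (about 30 minutes).
#   wait_until 180 10 curl -sf http://localhost:8000/health && echo "server is ready"
```

A first run that includes a large model download may need a longer timeout than the example uses.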

## Step 5. Test the API

Send a test request to verify the server is working:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_HANDLE"'",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 256
  }'
```

The response should contain a `choices` array with the model's answer.

## Step 6. Cleanup

Stop and remove the container:

```bash
docker stop vllm-server
docker rm vllm-server
```

Optionally, remove the image and the cached model. For example:

```bash
docker rmi "<docker image name>"
rm -rf "$HOME/.cache/huggingface/hub/<downloaded model name>"
```
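
The Hugging Face hub cache stores each model in a directory named `models--<org>--<name>`, so the directory to delete can be derived from the model handle. The helper below is a small sketch of that mapping. `hf_cache_dir` is a hypothetical name, and it assumes the default cache layout mounted in Step 4.

```bash
# Hypothetical helper: map a Hugging Face handle (org/name) to the name of
# its directory inside the hub cache.
hf_cache_dir() {
  printf 'models--%s\n' "$(printf '%s' "$1" | sed 's|/|--|g')"
}

# Example: remove the cached copy of the model served in this playbook.
#   rm -rf "$HOME/.cache/huggingface/hub/$(hf_cache_dir "$MODEL_HANDLE")"
```

Listing `$HOME/.cache/huggingface/hub` first is a cheap way to confirm the directory name before deleting anything.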

## Troubleshooting

## Common issues

| Symptom | Cause | Fix |
|---------|--------|-----|
| "permission denied" when running docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` |
| Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run `sudo nvidia-ctk runtime configure --runtime=docker`, then restart Docker with `sudo systemctl restart docker` |
| "Token is required" or 401 error | Missing HuggingFace token | Ensure `HF_TOKEN` is exported before running the `docker run` command |
| Model download hangs or fails | Network or authentication issue | Check the internet connection and verify that `HF_TOKEN` is valid |
| CUDA out of memory | Context length too large | Reduce `MAX_MODEL_LEN` or lower `--gpu-memory-utilization` |
| Server not responding on port 8000 | Port already in use | Check with `lsof -i :8000`; use `-p 8001:8000` to map a different host port |
| Model runs on wrong GPU | Default GPU selection | Use `--gpus '"device=0"'` to select a specific GPU |
| NGC authentication fails | Invalid or missing credentials | Run `docker login nvcr.io` with NGC API key |
| EngineCore failed / FlashInfer "Buffer overflow when allocating memory for batch_prefill_tmp_v" | Known issue with vLLM 25.10 on some DGX Station setups during CUDA graph capture | Use the **26.01** container image: `nvcr.io/nvidia/vllm:26.01-py3` instead of 25.10. |