mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git synced 2026-06-20 13:19:34 +00:00

History

GitLab CI 4073d2c1de chore: Regenerate all playbooks		2026-05-26 18:25:53 +00:00
..
endpoint-production.yaml	chore: Regenerate all playbooks	2026-05-26 18:25:53 +00:00
endpoint-test.yaml	chore: Regenerate all playbooks	2026-05-26 18:25:53 +00:00
overview.md	chore: Regenerate all playbooks	2026-05-26 18:25:53 +00:00
README.md	chore: Regenerate all playbooks	2026-05-26 18:25:53 +00:00

README.md

Serve Qwen3-235B with vLLM

Set up vLLM server with Qwen3-235B on DGX Station

Overview
Serve Qwen3-235B
Troubleshooting

Overview

Basic idea

vLLM is an inference engine designed to run large language models efficiently. The key idea is maximizing throughput and minimizing memory waste when serving LLMs.

PagedAttention handles long sequences without running out of GPU memory.
Continuous batching keeps GPUs fully utilized by adding new requests to batches in progress.
OpenAI-compatible API allows applications built for OpenAI to switch to vLLM with minimal changes.

What you'll accomplish

Serve the Qwen3-235B-A22B-NVFP4 model using vLLM on NVIDIA DGX Station. This 235B parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU.

What to know before starting

Basic Docker container usage
Familiarity with REST APIs

Prerequisites

NVIDIA DGX Station with GB300 and RTX 6000 Pro GPUs
Docker installed: docker --version
NVIDIA Container Toolkit configured
HuggingFace account with access token
Network access to NGC and HuggingFace

Time & risk

Duration: 15-20 minutes (longer on first run due to model download)
Risks: Model download requires HuggingFace authentication
Rollback: Stop and remove the container to restore state
Last Updated: 03/02/2026
- First Publication

Serve Qwen3-235B

Step 1. Set up Docker permissions

If you haven't already, add your user to the docker group to run Docker without sudo:

sudo usermod -aG docker $USER
newgrp docker

Step 2. Set up environment variables

Set the following so the vLLM container can download the model and use your chosen context length:

## HuggingFace token (required)
## Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"

## Model to serve
export MODEL_HANDLE="nvidia/Qwen3-235B-A22B-NVFP4"

## Maximum context length
export MAX_MODEL_LEN=8192

Step 3. Pull vLLM container image

Pull the vLLM container from NGC. Use the 26.01 image on DGX Station; the 25.10 image can fail during engine startup with a FlashInfer buffer overflow on some configurations.

docker pull nvcr.io/nvidia/vllm:26.01-py3

Step 4. Start vLLM server

Start the vLLM server with the Qwen3-235B model. This model fits entirely in VRAM on the GB300. On a single-GPU DGX Station, --gpus all uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with --gpus '"device=N"' where N is the GB300 device ID from nvidia-smi.

docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve "$MODEL_HANDLE" \
    --max-model-len $MAX_MODEL_LEN \
    --gpu-memory-utilization 0.9

Check the server logs for startup progress:

docker logs -f vllm-server

Expected output includes:

Model download progress (first run only)
Model loading into GPU memory
Uvicorn running on http://0.0.0.0:8000

Press Ctrl+C to exit log view once the server is ready.

Step 5. Test the API

Send a test request to verify the server is working:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_HANDLE"'",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 256
  }'

The response should contain a choices array with the model's answer.

Step 6. Cleanup

Stop and remove the container:

docker stop vllm-server
docker rm vllm-server

Optionally, remove the image and cached model:

docker rmi nvcr.io/nvidia/vllm:26.01-py3
rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3-235B-A22B-NVFP4

Troubleshooting

Common issues

Symptom	Cause	Fix
"permission denied" when running docker	User not in docker group	Run `sudo usermod -aG docker $USER && newgrp docker`
Container fails to start with GPU error	NVIDIA Container Toolkit not configured	Run `nvidia-ctk runtime configure --runtime=docker` and restart Docker
"Token is required" or 401 error	Missing HuggingFace token	Ensure `HF_TOKEN` is exported before running docker command
Model download hangs or fails	Network or authentication issue	Check internet connection, verify HF_TOKEN is valid
CUDA out of memory	Context length too large	Reduce `MAX_MODEL_LEN` or lower `--gpu-memory-utilization`
Server not responding on port 8000	Port already in use	Check with `lsof -i :8000`, use `-p 8001:8000` for different port
Model runs on wrong GPU	Default GPU selection	Use `--gpus '"device=0"'` to select specific GPU
NGC authentication fails	Invalid or missing credentials	Run `docker login nvcr.io` with NGC API key
EngineCore failed / FlashInfer "Buffer overflow when allocating memory for batch_prefill_tmp_v"	Known issue with vLLM 25.10 on some DGX Station setups during CUDA graph capture	Use the 26.01 container image: `nvcr.io/nvidia/vllm:26.01-py3` instead of 25.10.