# vLLM for Inference

> Install and use vLLM on DGX Station

## Table of Contents

- [Overview](#overview)
- [Instructions](#instructions)
- [Troubleshooting](#troubleshooting)

---

## Overview

## Basic idea

vLLM is an inference engine designed to run large language models efficiently. The key idea is **maximizing throughput and minimizing memory waste** when serving LLMs.

- **PagedAttention** manages the KV cache in fixed-size blocks, which reduces memory fragmentation and helps long sequences fit in GPU memory.
- **Continuous batching** keeps GPUs busy by adding new requests to batches that are already in progress.
- **OpenAI-compatible API** allows applications built for OpenAI to switch to vLLM with minimal changes.

## What you'll accomplish

You'll set up high-throughput LLM serving with vLLM on an NVIDIA DGX Station with Blackwell architecture and serve a **supported model**. Refer to the table below to see the supported models.

## What to know before starting

- Basic Docker container usage
- Familiarity with REST APIs

## Prerequisites

- NVIDIA DGX Station with GB300 and RTX 6000 Pro GPUs
- Docker installed: `docker --version`
- NVIDIA Container Toolkit configured
- HuggingFace account with access token
- Network access to NGC and HuggingFace

## Model Support Matrix

The following models are supported with vLLM on DGX Station.
All listed models are available and ready to use:

| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **DiffusionGemma 26B A4B IT** | BF16 | ✅ | [`google/diffusiongemma-26B-A4B-it`](https://huggingface.co/google/diffusiongemma-26B-A4B-it) |
| **DiffusionGemma 26B A4B IT** | NVFP4 | ✅ | [`nvidia/diffusiongemma-26B-A4B-it-NVFP4`](https://huggingface.co/nvidia/diffusiongemma-26B-A4B-it-NVFP4) |
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |

## Time & risk

* **Duration:** 30 minutes (longer on first run due to model download)
* **Risks:** Model download requires HuggingFace authentication
* **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 06/10/2026
  * Update models

## Instructions

## Step 1. Set up Docker permissions

If you haven't already, add your user to the docker group so you can run Docker without sudo:

```bash
sudo usermod -aG docker $USER
newgrp docker
```

## Step 2. Set up environment variables

Set the following variables so the vLLM container can download the model and use your chosen context length:

```bash
# HuggingFace token (required)
# Get a token from https://huggingface.co/settings/tokens
export HF_TOKEN="your_huggingface_token"

# Model to serve (use an HF Handle from the Model Support Matrix)
export MODEL_HANDLE=""

# Maximum context length
export MAX_MODEL_LEN=8192
```

## Step 3. Pull vLLM container image

Pull the vLLM container from NGC. Use the **26.01** image on DGX Station; the 25.10 image can fail during engine startup with a FlashInfer buffer overflow on some configurations.
```bash
docker pull nvcr.io/nvidia/vllm:26.01-py3
```

For DiffusionGemma, use the custom vLLM container:

```bash
docker pull vllm/vllm-openai:gemma
```

For Step-3.7-Flash models, pull the custom vLLM container:

```bash
docker pull vllm/vllm-openai:stepfun37
```

## Step 4. Start vLLM server

Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace it with `--gpus '"device=N"'`, where N is the GB300 device ID from `nvidia-smi`.

For the Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.

```bash
docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve "$MODEL_HANDLE" \
  --max-model-len "$MAX_MODEL_LEN" \
  --gpu-memory-utilization 0.9
```

For DiffusionGemma models (e.g. `google/diffusiongemma-26B-A4B-it`), run with the custom vLLM container:

```bash
docker run -d \
  --name vllm-server \
  -p 8000:8000 \
  --gpus all \
  --shm-size=16g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e VLLM_USE_V2_MODEL_RUNNER=1 \
  -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:gemma "$MODEL_HANDLE" \
  --gpu-memory-utilization 0.85 \
  --attention-backend TRITON_ATTN \
  --max-num-seqs 16 \
  --diffusion-config '{"canvas_length":256}' \
  --override-generation-config '{"max_new_tokens": null}' \
  --load-format fastsafetensors \
  --enable-prefix-caching \
  --reasoning-parser gemma4 \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4
# For the BF16 checkpoint, add "--moe-backend triton" for better performance
```

For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and the NVFP4 versions fit entirely in VRAM on the GB300.
```bash
docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  vllm/vllm-openai:stepfun37 \
  "$MODEL_HANDLE" \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --kv-cache-dtype fp8
```

Check the server logs for startup progress:

```bash
docker logs -f vllm-server
```

Expected output includes:

- Model download progress (first run only)
- Model loading into GPU memory
- `Application startup complete.`

Press `Ctrl+C` to exit the log view once the server is ready.

## Step 5. Test the API

Send a test request to verify the server is working:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL_HANDLE"'",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 256
  }'
```

The response should contain a `choices` array with the model's answer.

## Step 6. Cleanup

Stop and remove the container:

```bash
docker stop vllm-server
docker rm vllm-server
```

Optionally, remove the image and cached model. For example:
```bash
# Fill in the image name and the cached model directory inside the empty quotes
# first; with the path left empty, rm deletes the entire hub cache.
docker rmi ""
rm -rf $HOME/.cache/huggingface/hub/""
```

## Troubleshooting

## Common issues

| Symptom | Cause | Fix |
|---------|--------|-----|
| "permission denied" when running docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` |
| Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run `nvidia-ctk runtime configure --runtime=docker` and restart Docker |
| "Token is required" or 401 error | Missing HuggingFace token | Ensure `HF_TOKEN` is exported before running the docker command |
| Model download hangs or fails | Network or authentication issue | Check the internet connection and verify that `HF_TOKEN` is valid |
| CUDA out of memory | Context length too large | Reduce `MAX_MODEL_LEN` or lower `--gpu-memory-utilization` |
| Server not responding on port 8000 | Port already in use | Check with `lsof -i :8000`; use `-p 8001:8000` to map a different host port |
| Model runs on wrong GPU | Default GPU selection | Use `--gpus '"device=0"'` to select a specific GPU |
| NGC authentication fails | Invalid or missing credentials | Run `docker login nvcr.io` with an NGC API key |
| EngineCore failed / FlashInfer "Buffer overflow when allocating memory for batch_prefill_tmp_v" | Known issue with vLLM 25.10 on some DGX Station setups during CUDA graph capture | Use the **26.01** container image `nvcr.io/nvidia/vllm:26.01-py3` instead of 25.10 |
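
When working through the "Server not responding on port 8000" row above, it can help to tell a server that is still loading the model apart from one that never started listening. The following is a minimal sketch of a readiness poll, not part of the original playbook. The function name `wait_for_vllm`, its defaults, and the choice of polling interval are illustrative assumptions. It relies on the `/health` endpoint of vLLM's OpenAI-compatible server, which is assumed here to be exposed by the container images used in this guide.

```bash
# Hypothetical helper: poll the server until it answers or attempts run out.
# Arguments (all optional): base URL, maximum attempts, delay in seconds.
wait_for_vllm() {
  local base_url="${1:-http://localhost:8000}"
  local max_attempts="${2:-60}"
  local delay_seconds="${3:-10}"
  local attempt=1

  while [ "$attempt" -le "$max_attempts" ]; do
    # -f: fail on HTTP errors; -s: silent; -o /dev/null: discard the body
    if curl -fs -o /dev/null --max-time 5 "${base_url}/health"; then
      echo "vLLM is ready after ${attempt} attempt(s)"
      return 0
    fi
    sleep "$delay_seconds"
    attempt=$((attempt + 1))
  done

  echo "vLLM did not become ready after ${max_attempts} attempt(s)" >&2
  return 1
}
```

For example, running `wait_for_vllm http://localhost:8000 60 10` after `docker run` polls for roughly ten minutes before giving up. If it times out, `docker logs vllm-server` is still the place to look for the underlying cause.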