2.6 KiB
vLLM Setup on DGX Station
Deploy a vLLM inference server on DGX Station with validated configuration.
Steps
-
Find the GB300 GPU index. Run:
nvidia-smi --query-gpu=index,name --format=csv,noheaderIdentify the device index for the GB300 (typically device 1). Use this index for
--gpusbelow. Do NOT use--gpus all— mixed coherency will cause CUDA failures. -
Ask the user which model to serve. If they don't have a preference, suggest:
nvidia/Qwen3-235B-A22B-NVFP4— large MoE model, fits in 279 GB HBMmeta-llama/Llama-3.1-70B-Instruct— solid general-purpose modelQwen/Qwen3-8B— small model for testing
-
Check if the user has an HF_TOKEN. Many models require HuggingFace authentication. The token must be passed inline with
-e HF_TOKEN="..."— do not rely on shell export in background Docker tasks. -
Deploy the container. Use this validated configuration:
docker pull nvcr.io/nvidia/vllm:26.01-py3 docker run -d \ --name vllm-server \ --gpus '"device=<GB300_INDEX>"' \ --ipc host \ --ulimit memlock=-1 \ --ulimit stack=67108864 \ -p 8000:8000 \ -e HF_TOKEN="<TOKEN>" \ -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \ nvcr.io/nvidia/vllm:26.01-py3 \ vllm serve "<MODEL>" \ --max-model-len 32768 \ --gpu-memory-utilization 0.9Container version: Use
nvcr.io/nvidia/vllm:26.01-py3. Do NOT use 25.10 — it has a FlashInfer buffer overflow on DGX Station. -
Wait for the server to be ready. Monitor logs:
docker logs -f vllm-serverWait for the line indicating the server is listening on port 8000.
-
Test the server:
curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "<MODEL>", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64 }' -
Report the result to the user, including:
- Model loaded and serving on port 8000
- GPU memory utilization
- How to stop:
docker stop vllm-server && docker rm vllm-server
Tuning parameters
Adjust these based on the user's workload:
| Parameter | Default | Agent workloads | Throughput workloads |
|---|---|---|---|
--max-model-len |
32768 | 32768-65536 | 8192-16384 |
--gpu-memory-utilization |
0.9 | 0.85-0.90 | 0.90-0.92 |
--enable-prefix-caching |
off | Enable (multi-turn reuse) | Enable |
--max-num-seqs |
default | 4-16 (lower latency) | 32+ (higher throughput) |