# vLLM NVFP4 Deployment
This setup deploys the NVIDIA Llama 4 Scout model with vLLM, using NVFP4 quantization on Blackwell GPUs and FP8 on Hopper.
## Quick Start

1. Set up your HuggingFace token:

   ```bash
   cp env.example .env
   # Edit .env and add your HF_TOKEN
   ```

2. Build and run:

   ```bash
   docker-compose up --build
   ```

3. Test the deployment:

   ```bash
   curl -X POST "http://localhost:8001/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -d '{
       "model": "nvidia/Llama-4-Scout-17B-16E-Instruct-FP4",
       "messages": [{"role": "user", "content": "Hello! How are you?"}],
       "max_tokens": 100
     }'
   ```
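Note that the model weights are pulled from HuggingFace on first start, so the server can take a while before it accepts requests. A minimal readiness poll, assuming the container exposes vLLM's standard OpenAI-compatible `/v1/models` endpoint on port 8001 (as in the curl example above):

```bash
# Poll the OpenAI-compatible /v1/models endpoint until the server is ready.
# -s: silent, -f: treat HTTP errors as failures so the loop keeps waiting.
until curl -sf http://localhost:8001/v1/models > /dev/null; do
  echo "waiting for vLLM to finish loading..."
  sleep 10
done
echo "server is up"
```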
## Model Information

- **Model:** `nvidia/Llama-4-Scout-17B-16E-Instruct-FP4`
- **Quantization:** NVFP4 (optimized for Blackwell architecture)
- **Alternative:** `nvidia/Llama-4-Scout-17B-16E-Instruct-FP8` (for Hopper architecture)
## Performance Tuning

The startup script automatically detects your GPU architecture and applies optimal settings (an illustrative sketch of this detection follows the two lists below):

### Blackwell (Compute Capability 10.0)
- Enables FlashInfer backend
- Uses NVFP4 quantization
- Enables async scheduling
- Applies fusion optimizations
### Hopper (Compute Capability 9.0)
- Uses FP8 quantization
- Disables async scheduling (due to vLLM limitations)
- Standard optimization settings
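The actual detection logic lives in the repo's startup script; the following is only a sketch of how it can work, using `nvidia-smi`'s `compute_cap` query field. The variable names are hypothetical, and the exact vLLM flags may vary by version:

```bash
#!/usr/bin/env bash
# Illustrative sketch only; the repo's actual startup script may differ.
# Query the compute capability of the first visible GPU, e.g. "10.0" or "9.0".
CAP=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)

case "$CAP" in
  10.*)
    # Blackwell: NVFP4 checkpoint plus FlashInfer and async scheduling
    MODEL="nvidia/Llama-4-Scout-17B-16E-Instruct-FP4"
    export VLLM_ATTENTION_BACKEND=FLASHINFER
    EXTRA_ARGS="--async-scheduling"
    ;;
  9.*)
    # Hopper: FP8 checkpoint, async scheduling disabled
    MODEL="nvidia/Llama-4-Scout-17B-16E-Instruct-FP8"
    EXTRA_ARGS=""
    ;;
  *)
    echo "Unsupported GPU (compute capability $CAP)" >&2
    exit 1
    ;;
esac
```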
## Configuration Options

Adjust these environment variables in your `.env` file:

- `VLLM_TENSOR_PARALLEL_SIZE`: Number of GPUs to use (default: 2)
- `VLLM_MAX_NUM_SEQS`: Maximum number of sequences per batch (default: 128)
- `VLLM_MAX_NUM_BATCHED_TOKENS`: Token batching limit (default: 8192)
- `VLLM_GPU_MEMORY_UTILIZATION`: Fraction of GPU memory vLLM may use (default: 0.9)
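These variables presumably feed the server's launch command. A hypothetical mapping onto the standard `vllm serve` options, with the defaults above as fallbacks (the actual entrypoint in this repo may wire them differently):

```bash
# Hypothetical: how the .env knobs could map onto standard vllm serve flags.
vllm serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 \
  --tensor-parallel-size "${VLLM_TENSOR_PARALLEL_SIZE:-2}" \
  --max-num-seqs "${VLLM_MAX_NUM_SEQS:-128}" \
  --max-num-batched-tokens "${VLLM_MAX_NUM_BATCHED_TOKENS:-8192}" \
  --gpu-memory-utilization "${VLLM_GPU_MEMORY_UTILIZATION:-0.9}" \
  --port 8001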
## Performance Scenarios

- **Maximum Throughput:** `VLLM_TENSOR_PARALLEL_SIZE=1`, increase `VLLM_MAX_NUM_SEQS`
- **Minimum Latency:** `VLLM_TENSOR_PARALLEL_SIZE=4` to `8`, `VLLM_MAX_NUM_SEQS=8`
- **Balanced:** `VLLM_TENSOR_PARALLEL_SIZE=2`, `VLLM_MAX_NUM_SEQS=128` (the defaults)
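For example, the minimum-latency scenario is just two overrides in `.env` (values taken from the list above; pick a tensor-parallel size that matches your GPU count):

```bash
# .env overrides for the minimum-latency scenario
VLLM_TENSOR_PARALLEL_SIZE=4   # or up to 8, depending on available GPUs
VLLM_MAX_NUM_SEQS=8
```

Restart the stack (`docker-compose up --build`) for the change to take effect.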
## Benchmarking

To benchmark performance:

```bash
docker exec -it vllm-nvfp4-server vllm bench serve \
  --host 0.0.0.0 \
  --port 8001 \
  --model nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency 128 \
  --num-prompts 1280
```
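To find the throughput/latency knee rather than a single operating point, you can sweep concurrency levels with the same invocation; a small sketch (the 10x prompts-per-concurrency ratio mirrors the 128/1280 pairing above):

```bash
# Sweep several concurrency levels with the same random-token workload.
for c in 8 32 128; do
  docker exec -it vllm-nvfp4-server vllm bench serve \
    --host 0.0.0.0 --port 8001 \
    --model nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 \
    --dataset-name random \
    --random-input-len 1024 --random-output-len 1024 \
    --max-concurrency "$c" --num-prompts $((c * 10))
done
```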
## Requirements
- NVIDIA GPU with Blackwell or Hopper architecture
- CUDA Driver 575 or above
- Docker with NVIDIA Container Toolkit
- HuggingFace token (for model access)
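A quick way to verify the GPU and driver requirements before building (the `compute_cap` query field needs a reasonably recent driver, which the 575 requirement already implies):

```bash
# Driver version plus compute capability: expect driver >= 575 and
# compute_cap 9.0 (Hopper) or 10.0 (Blackwell).
nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv
```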
## Troubleshooting

- Check GPU compatibility: `nvidia-smi`
- View logs: `docker-compose logs -f vllm-nvfp4`
- Monitor GPU usage: `nvidia-smi -l 1`
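If you are tuning `VLLM_GPU_MEMORY_UTILIZATION`, a narrower query than plain `nvidia-smi -l 1` shows just the memory headroom:

```bash
# Per-second snapshot of GPU memory usage; useful when adjusting
# VLLM_GPU_MEMORY_UTILIZATION (vLLM preallocates most of this at startup).
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```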