# vLLM NVFP4 Deployment

This setup deploys the NVIDIA Llama 4 Scout model with NVFP4 quantization using vLLM, optimized for Blackwell and Hopper GPU architectures.

## Quick Start

1. **Set up your HuggingFace token:**

   ```bash
   cp env.example .env
   # Edit .env and add your HF_TOKEN
   ```
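
   The resulting `.env` only needs the token at this point; a minimal sketch (the value shown is a placeholder, use a token from your own HuggingFace account):

   ```bash
   # .env
   HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxx
   ```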

2. **Build and run:**

   ```bash
   docker-compose up --build
   ```
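
   To run it in the background instead and follow the logs (service name as used in the Troubleshooting section below):

   ```bash
   docker-compose up -d --build
   docker-compose logs -f vllm-nvfp4
   ```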

3. **Test the deployment:**

   ```bash
   curl -X POST "http://localhost:8001/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -d '{
       "model": "nvidia/Llama-4-Scout-17B-16E-Instruct-FP4",
       "messages": [{"role": "user", "content": "Hello! How are you?"}],
       "max_tokens": 100
     }'
   ```
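
   The server needs time to download and load the model on first start; you can poll the OpenAI-compatible models endpoint until it responds before sending chat requests:

   ```bash
   curl http://localhost:8001/v1/models
   ```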
## Model Information

- **Model**: `nvidia/Llama-4-Scout-17B-16E-Instruct-FP4`
- **Quantization**: NVFP4 (optimized for Blackwell architecture)
- **Alternative**: `nvidia/Llama-4-Scout-17B-16E-Instruct-FP8` (for Hopper architecture)
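
If you want to pre-fetch the weights outside the container (optional; assumes the `huggingface_hub` CLI is installed and your token has access to the model), something like this works:

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli login   # or: export HF_TOKEN=...
huggingface-cli download nvidia/Llama-4-Scout-17B-16E-Instruct-FP4
```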
## Performance Tuning

The startup script automatically detects your GPU architecture and applies the appropriate settings; a minimal sketch of the detection logic follows the two lists below.
### Blackwell (Compute Capability 10.0)

- Enables FlashInfer backend
- Uses NVFP4 quantization
- Enables async scheduling
- Applies fusion optimizations
### Hopper (Compute Capability 9.0)

- Uses FP8 quantization
- Disables async scheduling (due to vLLM limitations)
- Standard optimization settings
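
A minimal sketch of that detection logic, assuming a recent driver (for the `compute_cap` query) and a recent vLLM (for `--async-scheduling`); the actual startup script may differ:

```bash
#!/usr/bin/env bash
# Illustrative sketch only -- not the shipped startup script.
cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)

if [ "${cc%%.*}" -ge 10 ]; then
  # Blackwell (10.x): NVFP4 checkpoint, FlashInfer backend, async scheduling
  export VLLM_ATTENTION_BACKEND=FLASHINFER
  MODEL="nvidia/Llama-4-Scout-17B-16E-Instruct-FP4"
  EXTRA_ARGS="--async-scheduling"
else
  # Hopper (9.0): FP8 checkpoint, async scheduling off
  MODEL="nvidia/Llama-4-Scout-17B-16E-Instruct-FP8"
  EXTRA_ARGS=""
fi

vllm serve "$MODEL" \
  --port 8001 \
  --tensor-parallel-size "${VLLM_TENSOR_PARALLEL_SIZE:-2}" \
  --max-num-seqs "${VLLM_MAX_NUM_SEQS:-128}" \
  --max-num-batched-tokens "${VLLM_MAX_NUM_BATCHED_TOKENS:-8192}" \
  --gpu-memory-utilization "${VLLM_GPU_MEMORY_UTILIZATION:-0.9}" \
  $EXTRA_ARGS
```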
### Configuration Options

Adjust these environment variables in your `.env` file (a sample follows the list):

- `VLLM_TENSOR_PARALLEL_SIZE`: Number of GPUs to shard the model across (default: 2)
- `VLLM_MAX_NUM_SEQS`: Maximum number of sequences processed per batch (default: 128)
- `VLLM_MAX_NUM_BATCHED_TOKENS`: Maximum number of tokens batched per step (default: 8192)
- `VLLM_GPU_MEMORY_UTILIZATION`: Fraction of GPU memory vLLM may allocate (default: 0.9)
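
A sample `.env` with the defaults written out (the token value is a placeholder):

```bash
# .env
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxx
VLLM_TENSOR_PARALLEL_SIZE=2
VLLM_MAX_NUM_SEQS=128
VLLM_MAX_NUM_BATCHED_TOKENS=8192
VLLM_GPU_MEMORY_UTILIZATION=0.9
```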
### Performance Scenarios

- **Maximum Throughput**: `VLLM_TENSOR_PARALLEL_SIZE=1`, increase `VLLM_MAX_NUM_SEQS`
- **Minimum Latency**: `VLLM_TENSOR_PARALLEL_SIZE=4-8`, `VLLM_MAX_NUM_SEQS=8`
- **Balanced**: `VLLM_TENSOR_PARALLEL_SIZE=2`, `VLLM_MAX_NUM_SEQS=128` (default)
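
For example, a latency-oriented `.env` on a 4-GPU node might look like this (illustrative values; re-run `docker-compose up -d --build` for changes to take effect):

```bash
# .env -- latency-oriented example
VLLM_TENSOR_PARALLEL_SIZE=4
VLLM_MAX_NUM_SEQS=8
```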
## Benchmarking

To benchmark performance:

```bash
docker exec -it vllm-nvfp4-server vllm bench serve \
  --host 0.0.0.0 \
  --port 8001 \
  --model nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency 128 \
  --num-prompts 1280
```
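
To probe latency rather than throughput, lower the concurrency and prompt count (same flags, smaller values):

```bash
docker exec -it vllm-nvfp4-server vllm bench serve \
  --host 0.0.0.0 \
  --port 8001 \
  --model nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency 8 \
  --num-prompts 64
```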
## Requirements

- NVIDIA GPU with Blackwell or Hopper architecture
- CUDA driver 575 or above
- Docker with NVIDIA Container Toolkit
- HuggingFace token (for model access)
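
You can sanity-check these prerequisites before launching (the `--query-gpu` fields need a reasonably recent driver; the CUDA image tag is illustrative):

```bash
# Driver version and compute capability
nvidia-smi --query-gpu=driver_version,compute_cap --format=csv

# Confirm Docker can reach the GPUs through the NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
```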
## Troubleshooting

- Check that the GPU and driver are visible: `nvidia-smi`
- View logs: `docker-compose logs -f vllm-nvfp4`
- Monitor GPU usage: `nvidia-smi -l 1`
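
To confirm which architecture the startup script will select, query the compute capability directly (10.x is Blackwell, 9.0 is Hopper):

```bash
nvidia-smi --query-gpu=name,compute_cap --format=csv
```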