# vLLM NVFP4 Deployment

This setup deploys the NVIDIA Llama 4 Scout model with NVFP4 quantization using vLLM, optimized for Blackwell and Hopper GPU architectures.

## Quick Start

1. **Set up your HuggingFace token:**

   ```bash
   cp env.example .env
   # Edit .env and add your HF_TOKEN
   ```
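
   The resulting `.env` only needs the token at this point; a minimal sketch (the value shown is a placeholder, use a token from your own HuggingFace account):

   ```bash
   # .env
   HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxx
   ```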

2. **Build and run:**

   ```bash
   docker-compose up --build
   ```
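
   To run it in the background instead and follow the logs (service name as used in the Troubleshooting section below):

   ```bash
   docker-compose up -d --build
   docker-compose logs -f vllm-nvfp4
   ```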

3. **Test the deployment:**

   ```bash
   curl -X POST "http://localhost:8001/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -d '{
       "model": "nvidia/Llama-4-Scout-17B-16E-Instruct-FP4",
       "messages": [{"role": "user", "content": "Hello! How are you?"}],
       "max_tokens": 100
     }'
   ```
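
   The server needs time to download and load the model on first start; you can poll the OpenAI-compatible models endpoint until it responds before sending chat requests:

   ```bash
   curl http://localhost:8001/v1/models
   ```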
## Model Information

- **Model**: `nvidia/Llama-4-Scout-17B-16E-Instruct-FP4`
- **Quantization**: NVFP4 (optimized for Blackwell architecture)
- **Alternative**: `nvidia/Llama-4-Scout-17B-16E-Instruct-FP8` (for Hopper architecture)
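
If you want to pre-fetch the weights outside the container (optional; assumes the `huggingface_hub` CLI is installed and your token has access to the model), something like this works:

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli login   # or: export HF_TOKEN=...
huggingface-cli download nvidia/Llama-4-Scout-17B-16E-Instruct-FP4
```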
## Performance Tuning

The startup script automatically detects your GPU architecture and applies the appropriate settings; a minimal sketch of the detection logic follows the two lists below.
### Blackwell (Compute Capability 10.0)

- Enables FlashInfer backend
- Uses NVFP4 quantization
- Enables async scheduling
- Applies fusion optimizations
### Hopper (Compute Capability 9.0)

- Uses FP8 quantization
- Disables async scheduling (due to vLLM limitations)
- Standard optimization settings
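
A minimal sketch of that detection logic, assuming a recent driver (for the `compute_cap` query) and a recent vLLM (for `--async-scheduling`); the actual startup script may differ:

```bash
#!/usr/bin/env bash
# Illustrative sketch only -- not the shipped startup script.
cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)

if [ "${cc%%.*}" -ge 10 ]; then
  # Blackwell (10.x): NVFP4 checkpoint, FlashInfer backend, async scheduling
  export VLLM_ATTENTION_BACKEND=FLASHINFER
  MODEL="nvidia/Llama-4-Scout-17B-16E-Instruct-FP4"
  EXTRA_ARGS="--async-scheduling"
else
  # Hopper (9.0): FP8 checkpoint, async scheduling off
  MODEL="nvidia/Llama-4-Scout-17B-16E-Instruct-FP8"
  EXTRA_ARGS=""
fi

vllm serve "$MODEL" \
  --port 8001 \
  --tensor-parallel-size "${VLLM_TENSOR_PARALLEL_SIZE:-2}" \
  --max-num-seqs "${VLLM_MAX_NUM_SEQS:-128}" \
  --max-num-batched-tokens "${VLLM_MAX_NUM_BATCHED_TOKENS:-8192}" \
  --gpu-memory-utilization "${VLLM_GPU_MEMORY_UTILIZATION:-0.9}" \
  $EXTRA_ARGS
```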
### Configuration Options

Adjust these environment variables in your `.env` file (a sample follows the list):

- `VLLM_TENSOR_PARALLEL_SIZE`: Number of GPUs to shard the model across (default: 2)
- `VLLM_MAX_NUM_SEQS`: Maximum number of sequences processed per batch (default: 128)
- `VLLM_MAX_NUM_BATCHED_TOKENS`: Maximum number of tokens batched per step (default: 8192)
- `VLLM_GPU_MEMORY_UTILIZATION`: Fraction of GPU memory vLLM may allocate (default: 0.9)
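
A sample `.env` with the defaults written out (the token value is a placeholder):

```bash
# .env
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxx
VLLM_TENSOR_PARALLEL_SIZE=2
VLLM_MAX_NUM_SEQS=128
VLLM_MAX_NUM_BATCHED_TOKENS=8192
VLLM_GPU_MEMORY_UTILIZATION=0.9
```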
### Performance Scenarios

- **Maximum Throughput**: `VLLM_TENSOR_PARALLEL_SIZE=1`, increase `VLLM_MAX_NUM_SEQS`
- **Minimum Latency**: `VLLM_TENSOR_PARALLEL_SIZE=4-8`, `VLLM_MAX_NUM_SEQS=8`
- **Balanced**: `VLLM_TENSOR_PARALLEL_SIZE=2`, `VLLM_MAX_NUM_SEQS=128` (default)
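
For example, a latency-oriented `.env` on a 4-GPU node might look like this (illustrative values; re-run `docker-compose up -d --build` for changes to take effect):

```bash
# .env -- latency-oriented example
VLLM_TENSOR_PARALLEL_SIZE=4
VLLM_MAX_NUM_SEQS=8
```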
## Benchmarking

To benchmark performance:

```bash
docker exec -it vllm-nvfp4-server vllm bench serve \
  --host 0.0.0.0 \
  --port 8001 \
  --model nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency 128 \
  --num-prompts 1280
```
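
To probe latency rather than throughput, lower the concurrency and prompt count (same flags, smaller values):

```bash
docker exec -it vllm-nvfp4-server vllm bench serve \
  --host 0.0.0.0 \
  --port 8001 \
  --model nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency 8 \
  --num-prompts 64
```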
## Requirements

- NVIDIA GPU with Blackwell or Hopper architecture
- CUDA driver 575 or above
- Docker with NVIDIA Container Toolkit
- HuggingFace token (for model access)
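
You can sanity-check these prerequisites before launching (the `--query-gpu` fields need a reasonably recent driver; the CUDA image tag is illustrative):

```bash
# Driver version and compute capability
nvidia-smi --query-gpu=driver_version,compute_cap --format=csv

# Confirm Docker can reach the GPUs through the NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
```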
## Troubleshooting

- Check that the GPU and driver are visible: `nvidia-smi`
- View logs: `docker-compose logs -f vllm-nvfp4`
- Monitor GPU usage: `nvidia-smi -l 1`
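
To confirm which architecture the startup script will select, query the compute capability directly (10.x is Blackwell, 9.0 is Hopper):

```bash
nvidia-smi --query-gpu=name,compute_cap --format=csv
```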