# vLLM Service
This service provides advanced GPU-accelerated LLM inference using vLLM with FP8 quantization, offering higher throughput than Ollama for production workloads.
## Overview
vLLM is an optional service that complements Ollama by providing:
- Higher throughput for concurrent requests
- Advanced quantization (FP8)
- PagedAttention for efficient memory usage
- OpenAI-compatible API
## Quick Start
### Using the Complete Stack
The easiest way to run vLLM is with the complete stack:
```bash
# From project root
./start.sh --complete
```
This starts vLLM along with all other optional services.
### Manual Docker Compose
```bash
# From project root
docker compose -f deploy/compose/docker-compose.complete.yml up -d vllm
```
### Testing the Deployment
```bash
# Check health
curl http://localhost:8001/v1/models

# Test chat completion
curl -X POST "http://localhost:8001/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello! How are you?"}],
    "max_tokens": 100
  }'
```
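
The completion comes back as an OpenAI-style JSON body, with the assistant text at `choices[0].message.content`. A minimal offline sketch of pulling that field out (the sample response below is illustrative, not captured from a live server):

```shell
# Illustrative OpenAI-style response body (not real server output)
sample_response='{"choices":[{"message":{"role":"assistant","content":"Hello! I am doing well."}}]}'

# Extract the assistant text with Python's stdlib (avoids a jq dependency)
reply=$(printf '%s' "$sample_response" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])')

echo "$reply"
```

Against the running service, you would pipe the output of the `curl` command above into the same extractor.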
## Default Configuration
- **Model**: `meta-llama/Llama-3.2-3B-Instruct`
- **Quantization**: FP8 (optimized for compute efficiency)
- **Port**: 8001
- **API**: OpenAI-compatible endpoints
## Configuration Options
Environment variables configured in `docker-compose.complete.yml`:
- `VLLM_MODEL`: Model to load (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `VLLM_TENSOR_PARALLEL_SIZE`: Number of GPUs to use (default: 1)
- `VLLM_MAX_MODEL_LEN`: Maximum sequence length (default: 4096)
- `VLLM_GPU_MEMORY_UTILIZATION`: GPU memory usage (default: 0.9)
- `VLLM_QUANTIZATION`: Quantization method (default: fp8)
- `VLLM_KV_CACHE_DTYPE`: KV cache data type (default: fp8)
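
These variables presumably end up as flags on the `vllm serve` command inside the container. A hedged sketch of how an entrypoint might apply the documented defaults with shell parameter expansion (this entrypoint fragment is an assumption for illustration, not taken from the repo):

```shell
# Hypothetical entrypoint fragment: fall back to the documented defaults
MODEL="${VLLM_MODEL:-meta-llama/Llama-3.2-3B-Instruct}"
TP="${VLLM_TENSOR_PARALLEL_SIZE:-1}"
MAXLEN="${VLLM_MAX_MODEL_LEN:-4096}"
GPU_UTIL="${VLLM_GPU_MEMORY_UTILIZATION:-0.9}"
QUANT="${VLLM_QUANTIZATION:-fp8}"
KV_DTYPE="${VLLM_KV_CACHE_DTYPE:-fp8}"

# Compose (but don't run) the serve command so it can be inspected
cmd="vllm serve $MODEL \
  --tensor-parallel-size $TP \
  --max-model-len $MAXLEN \
  --gpu-memory-utilization $GPU_UTIL \
  --quantization $QUANT \
  --kv-cache-dtype $KV_DTYPE \
  --port 8001"

echo "$cmd"
```

With none of the variables set, the echoed command uses exactly the defaults listed above.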
## Frontend Integration
The txt2kg frontend automatically detects and uses vLLM when available:
1. Triple extraction: `/api/vllm` endpoint
2. RAG queries: Automatically uses vLLM if configured
3. Model selection: Choose vLLM models in the UI
## Using Different Models
To use a different model, edit the `VLLM_MODEL` environment variable in `docker-compose.complete.yml`:
```yaml
environment:
  - VLLM_MODEL=meta-llama/Llama-3.1-8B-Instruct
```
Then restart the service:
```bash
docker compose -f deploy/compose/docker-compose.complete.yml restart vllm
```
## Performance Tips
1. **Single GPU**: Set `VLLM_TENSOR_PARALLEL_SIZE=1` for best single-GPU performance
2. **Multi-GPU**: Increase `VLLM_TENSOR_PARALLEL_SIZE` to use multiple GPUs
3. **Memory**: Adjust `VLLM_GPU_MEMORY_UTILIZATION` based on available VRAM
4. **Throughput**: For high throughput, use smaller models or more aggressive quantization
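
`VLLM_GPU_MEMORY_UTILIZATION` is a fraction of total VRAM that vLLM pre-allocates for weights plus KV cache. A back-of-the-envelope sketch of what a given setting claims (the card size here is illustrative):

```shell
# Back-of-the-envelope: how much VRAM vLLM will claim at a given utilization.
# Numbers are illustrative; actual usage depends on model, dtype, and cache settings.
total_mib=24576          # e.g. a 24 GiB card
util=0.90                # VLLM_GPU_MEMORY_UTILIZATION

claimed_mib=$(awk -v t="$total_mib" -v u="$util" 'BEGIN { printf "%d", t * u }')
echo "vLLM would claim ~${claimed_mib} MiB of ${total_mib} MiB"
```

Lowering the fraction leaves headroom for other processes on the same GPU at the cost of a smaller KV cache (and thus fewer concurrent sequences).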
## Requirements
- NVIDIA GPU with CUDA support (Ampere architecture or newer recommended)
- CUDA Driver 535 or above
- Docker with NVIDIA Container Toolkit
- At least 8 GB of VRAM for the default model
- HuggingFace token for gated models (optional, cached in `~/.cache/huggingface`)
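
A quick preflight sketch that reports which host-side tools are visible on `PATH` (presence on `PATH` is necessary but not sufficient; the NVIDIA Container Toolkit must also be wired into Docker's runtime configuration):

```shell
# Minimal preflight: report which host-side prerequisites are on PATH
report=$(
  for tool in docker nvidia-smi curl; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: MISSING"
    fi
  done
)
echo "$report"
```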
## Troubleshooting
### Check Service Status
```bash
# View logs
docker compose -f deploy/compose/docker-compose.complete.yml logs -f vllm

# Check health
curl http://localhost:8001/v1/models
```
### GPU Issues
```bash
# Check GPU availability
nvidia-smi

# Check vLLM container GPU access
docker exec vllm-service nvidia-smi
```
### Model Loading Issues
- Ensure sufficient VRAM for the model
- Check HuggingFace cache: `ls ~/.cache/huggingface/hub`
- For gated models, set the `HF_TOKEN` environment variable
## Comparison with Ollama
| Feature | Ollama | vLLM |
|---------|--------|------|
| **Ease of Use** | ✅ Very easy | ⚠️ More complex |
| **Model Management** | ✅ Built-in pull/push | ❌ Manual download |
| **Throughput** | ⚠️ Moderate | ✅ High |
| **Quantization** | Q4_K_M | FP8, GPTQ |
| **Memory Efficiency** | ✅ Good | ✅ Excellent (PagedAttention) |
| **Use Case** | Development, small-scale | Production, high-throughput |
## When to Use vLLM
Use vLLM when:

- Processing large batches of requests
- You need maximum throughput
- Using multiple GPUs
- Deploying to production under high load

Use Ollama when:

- Getting started with the project
- Doing single-user development
- You want simpler model management
- You don't need maximum performance