mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-23 18:33:54 +00:00
# vLLM Service

This service provides advanced GPU-accelerated LLM inference using vLLM with FP8 quantization, offering higher throughput than Ollama for production workloads.

## Overview

vLLM is an optional service that complements Ollama by providing:

- Higher throughput for concurrent requests
- Advanced quantization (FP8)
- PagedAttention for efficient memory usage
- An OpenAI-compatible API

## Quick Start

### Using the Complete Stack

The easiest way to run vLLM is with the complete stack:

```bash
# From project root
./start.sh --complete
```

This starts vLLM along with all other optional services.

### Manual Docker Compose

```bash
# From project root
docker compose -f deploy/compose/docker-compose.complete.yml up -d vllm
```

### Testing the Deployment

```bash
# Check health
curl http://localhost:8001/v1/models

# Test chat completion
curl -X POST "http://localhost:8001/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello! How are you?"}],
    "max_tokens": 100
  }'
```

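While the model is still loading, the health check above will fail. A small helper can poll `/v1/models` until the server answers; this is a sketch (the endpoint and port come from this document, the helper name `wait_for_vllm` is ours):

```shell
# Poll a vLLM server's /v1/models endpoint until it answers, then report readiness.
# Usage: wait_for_vllm <base_url> [max_attempts] [delay_seconds]
wait_for_vllm() {
  local base_url="$1" max_attempts="${2:-30}" delay="${3:-5}" attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # /v1/models returns a JSON object with a "data" list once the engine is up
    if curl -sf "$base_url/v1/models" | grep -q '"data"'; then
      echo "vLLM ready after $attempt attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "vLLM not ready after $max_attempts attempts" >&2
  return 1
}
```

For example, run `wait_for_vllm http://localhost:8001` before issuing the chat completion test.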
## Default Configuration

- **Model**: `meta-llama/Llama-3.2-3B-Instruct`
- **Quantization**: FP8 (optimized for compute efficiency)
- **Port**: 8001
- **API**: OpenAI-compatible endpoints

## Configuration Options

Environment variables configured in `docker-compose.complete.yml`:

- `VLLM_MODEL`: Model to load (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `VLLM_TENSOR_PARALLEL_SIZE`: Number of GPUs to use (default: 1)
- `VLLM_MAX_MODEL_LEN`: Maximum sequence length (default: 4096)
- `VLLM_GPU_MEMORY_UTILIZATION`: Fraction of GPU memory to reserve (default: 0.9)
- `VLLM_QUANTIZATION`: Quantization method (default: fp8)
- `VLLM_KV_CACHE_DTYPE`: KV cache data type (default: fp8)

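Taken together, the defaults above correspond to a service definition along these lines (a sketch assembled from the list of defaults, not copied from the actual compose file):

```yaml
services:
  vllm:
    environment:
      - VLLM_MODEL=meta-llama/Llama-3.2-3B-Instruct
      - VLLM_TENSOR_PARALLEL_SIZE=1
      - VLLM_MAX_MODEL_LEN=4096
      - VLLM_GPU_MEMORY_UTILIZATION=0.9
      - VLLM_QUANTIZATION=fp8
      - VLLM_KV_CACHE_DTYPE=fp8
```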
## Frontend Integration

The txt2kg frontend automatically detects and uses vLLM when available:

1. Triple extraction: `/api/vllm` endpoint
2. RAG queries: automatically use vLLM if configured
3. Model selection: choose vLLM models in the UI

## Using Different Models

To use a different model, edit the `VLLM_MODEL` environment variable in `docker-compose.complete.yml`:

```yaml
environment:
  - VLLM_MODEL=meta-llama/Llama-3.1-8B-Instruct
```

Then recreate the service so the new environment is applied (`restart` alone reuses the old container configuration):

```bash
docker compose -f deploy/compose/docker-compose.complete.yml up -d vllm
```

## Performance Tips

1. **Single GPU**: Set `VLLM_TENSOR_PARALLEL_SIZE=1` for best single-GPU performance
2. **Multi-GPU**: Increase `VLLM_TENSOR_PARALLEL_SIZE` to shard the model across multiple GPUs
3. **Memory**: Adjust `VLLM_GPU_MEMORY_UTILIZATION` based on available VRAM
4. **Throughput**: For higher throughput, use smaller models or more aggressive quantization

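A rough rule of thumb when sizing memory: model weights alone take about one byte per parameter at FP8 and two at FP16, before KV cache, activations, and CUDA overhead. A quick back-of-envelope helper (the function name is ours):

```shell
# Back-of-envelope VRAM for model weights alone, in GB.
# Excludes KV cache, activations, and CUDA overhead.
# Usage: estimate_weight_vram_gb <params_in_billions> <bytes_per_param>
estimate_weight_vram_gb() {
  # FP8 stores ~1 byte per parameter, FP16 ~2 bytes
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b }'
}
```

For the default 3B model at FP8, `estimate_weight_vram_gb 3 1` gives roughly 3 GB of weights, which is why 8 GB of VRAM is a comfortable floor once cache and overhead are added.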
## Requirements

- NVIDIA GPU with CUDA support (Ampere architecture or newer recommended)
- CUDA driver 535 or newer
- Docker with the NVIDIA Container Toolkit
- At least 8 GB VRAM for the default model
- HuggingFace token for gated models (optional; downloaded models are cached in `~/.cache/huggingface`)

## Troubleshooting

### Check Service Status

```bash
# View logs
docker compose -f deploy/compose/docker-compose.complete.yml logs -f vllm

# Check health
curl http://localhost:8001/v1/models
```

### GPU Issues

```bash
# Check GPU availability
nvidia-smi

# Check vLLM container GPU access
docker exec vllm-service nvidia-smi
```

### Model Loading Issues

- Ensure sufficient VRAM for the model
- Check the HuggingFace cache: `ls ~/.cache/huggingface/hub`
- For gated models, set the `HF_TOKEN` environment variable

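The hub cache stores each repository under a directory named `models--<org>--<name>`, so a model's presence can be checked directly. A sketch (the function name is ours; the `HF_HOME` fallback follows the standard HuggingFace convention):

```shell
# Report whether a model repo is already in the local HuggingFace hub cache.
# Usage: hf_cached <org>/<name>, e.g. hf_cached meta-llama/Llama-3.2-3B-Instruct
hf_cached() {
  local model="$1"
  # The hub cache names directories models--<org>--<name>
  local dir="${HF_HOME:-$HOME/.cache/huggingface}/hub/models--${model//\//--}"
  if [ -d "$dir" ]; then
    echo "cached: $dir"
  else
    echo "not cached: $dir"
    return 1
  fi
}
```

A "not cached" result for the configured model means the first startup will download it, which can take a while and is easy to mistake for a hang.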
## Comparison with Ollama

| Feature | Ollama | vLLM |
|---------|--------|------|
| **Ease of Use** | ✅ Very easy | ⚠️ More complex |
| **Model Management** | ✅ Built-in pull/push | ❌ Manual download |
| **Throughput** | ⚠️ Moderate | ✅ High |
| **Quantization** | Q4_K_M | FP8, GPTQ |
| **Memory Efficiency** | ✅ Good | ✅ Excellent (PagedAttention) |
| **Use Case** | Development, small-scale | Production, high-throughput |

## When to Use vLLM

Use vLLM when:

- Processing large batches of requests
- You need maximum throughput
- Using multiple GPUs
- Deploying to production with high load

Use Ollama when:

- Getting started with the project
- Doing single-user development
- You want simpler model management
- You don't need maximum performance