# vLLM Service
This service provides advanced GPU-accelerated LLM inference using vLLM with FP8 quantization, offering higher throughput than Ollama for production workloads.
## Overview
vLLM is an optional service that complements Ollama by providing:
- Higher throughput for concurrent requests
- Advanced quantization (FP8)
- PagedAttention for efficient memory usage
- OpenAI-compatible API
## Quick Start

### Using the Complete Stack

The easiest way to run vLLM is with the complete stack:

```bash
# From project root
./start.sh --complete
```
This starts vLLM along with all other optional services.
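To confirm the service is running, check its state with Compose (the `vllm` service name and compose path are the ones used throughout this README):

```bash
# Show the status of the vllm service in the complete stack
docker compose -f deploy/compose/docker-compose.complete.yml ps vllm
```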
### Manual Docker Compose

```bash
# From project root
docker compose -f deploy/compose/docker-compose.complete.yml up -d vllm
```
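The first start can take several minutes while the model downloads and loads. A simple way to wait for readiness is to poll the same `/v1/models` endpoint used in the next section:

```bash
# Poll the OpenAI-compatible endpoint until the server responds
until curl -sf http://localhost:8001/v1/models > /dev/null; do
  echo "Waiting for vLLM to finish loading..."
  sleep 10
done
echo "vLLM is ready"
```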
### Testing the Deployment

```bash
# Check health
curl http://localhost:8001/v1/models

# Test chat completion
curl -X POST "http://localhost:8001/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello! How are you?"}],
    "max_tokens": 100
  }'
```
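The same endpoint also accepts the standard OpenAI `stream` flag if you want tokens as they are generated; a minimal sketch:

```bash
# Stream a chat completion (-N disables curl's output buffering)
curl -N -X POST "http://localhost:8001/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 100,
    "stream": true
  }'
```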
## Default Configuration

- Model: `meta-llama/Llama-3.2-3B-Instruct`
- Quantization: FP8 (optimized for compute efficiency)
- Port: 8001
- API: OpenAI-compatible endpoints
## Configuration Options

Environment variables configured in `docker-compose.complete.yml`:

- `VLLM_MODEL`: Model to load (default: meta-llama/Llama-3.2-3B-Instruct)
- `VLLM_TENSOR_PARALLEL_SIZE`: Number of GPUs to use (default: 1)
- `VLLM_MAX_MODEL_LEN`: Maximum sequence length (default: 4096)
- `VLLM_GPU_MEMORY_UTILIZATION`: GPU memory usage (default: 0.9)
- `VLLM_QUANTIZATION`: Quantization method (default: fp8)
- `VLLM_KV_CACHE_DTYPE`: KV cache data type (default: fp8)
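For reference, these variables live in the `vllm` service's `environment` block; a sketch with the documented defaults (the actual block in `docker-compose.complete.yml` may be laid out differently):

```yaml
environment:
  - VLLM_MODEL=meta-llama/Llama-3.2-3B-Instruct
  - VLLM_TENSOR_PARALLEL_SIZE=1
  - VLLM_MAX_MODEL_LEN=4096
  - VLLM_GPU_MEMORY_UTILIZATION=0.9
  - VLLM_QUANTIZATION=fp8
  - VLLM_KV_CACHE_DTYPE=fp8
```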
## Frontend Integration

The txt2kg frontend automatically detects and uses vLLM when available:

- Triple extraction: the `/api/vllm` endpoint
- RAG queries: automatically uses vLLM if configured
- Model selection: choose vLLM models in the UI
## Using Different Models

To use a different model, edit the `VLLM_MODEL` environment variable in `docker-compose.complete.yml`:

```yaml
environment:
  - VLLM_MODEL=meta-llama/Llama-3.1-8B-Instruct
```
Then recreate the service so the new environment is applied (a plain `restart` does not re-read the compose file):

```bash
docker compose -f deploy/compose/docker-compose.complete.yml up -d vllm
```
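Meta's Llama checkpoints on HuggingFace are gated, so if the new model has not been downloaded yet you may also need a token (see Troubleshooting). This sketch assumes the compose file forwards `HF_TOKEN` into the container:

```bash
# Export a HuggingFace token, then recreate the service so it can fetch the gated model
export HF_TOKEN=<your-huggingface-token>
docker compose -f deploy/compose/docker-compose.complete.yml up -d vllm
```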
## Performance Tips

- Single GPU: set `VLLM_TENSOR_PARALLEL_SIZE=1` for best single-GPU performance
- Multi-GPU: increase `VLLM_TENSOR_PARALLEL_SIZE` to shard the model across multiple GPUs
- Memory: adjust `VLLM_GPU_MEMORY_UTILIZATION` based on available VRAM
- Throughput: for high throughput, use smaller models or more aggressive quantization
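Before raising `VLLM_GPU_MEMORY_UTILIZATION`, it can help to check how much VRAM is actually free:

```bash
# Report total and currently used memory per GPU
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
```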
## Requirements

- NVIDIA GPU with CUDA support (Ampere architecture or newer recommended)
- CUDA driver 535 or above
- Docker with the NVIDIA Container Toolkit
- At least 8 GB VRAM for the default model
- HuggingFace token for gated models (optional; models are cached in `~/.cache/huggingface`)
## Troubleshooting

### Check Service Status

```bash
# View logs
docker compose -f deploy/compose/docker-compose.complete.yml logs -f vllm

# Check health
curl http://localhost:8001/v1/models
```
### GPU Issues

```bash
# Check GPU availability
nvidia-smi

# Check vLLM container GPU access
docker exec vllm-service nvidia-smi
```
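If the container cannot see the GPU, verify that the NVIDIA Container Toolkit works for containers in general (a quick check; substitute whatever CUDA base image tag you have available):

```bash
# Any CUDA-enabled container should be able to run nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```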
### Model Loading Issues

- Ensure sufficient VRAM for the model
- Check the HuggingFace cache: `ls ~/.cache/huggingface/hub`
- For gated models, set the `HF_TOKEN` environment variable
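If downloads keep failing inside the container, one option is to pre-populate the shared cache from the host. This is a sketch; it assumes the `huggingface_hub` CLI is installed on the host and that the container mounts `~/.cache/huggingface` as described under Requirements:

```bash
# Authenticate once (needed for gated models), then pre-download into the shared cache
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct
```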
## Comparison with Ollama
| Feature | Ollama | vLLM |
|---|---|---|
| Ease of Use | ✅ Very easy | ⚠️ More complex |
| Model Management | ✅ Built-in pull/push | ❌ Manual download |
| Throughput | ⚠️ Moderate | ✅ High |
| Quantization | Q4_K_M | FP8, GPTQ |
| Memory Efficiency | ✅ Good | ✅ Excellent (PagedAttention) |
| Use Case | Development, small-scale | Production, high-throughput |
## When to Use vLLM

Use vLLM when:

- You are processing large batches of requests
- You need maximum throughput
- You are running on multiple GPUs
- You are deploying to production with high load

Use Ollama when:

- You are getting started with the project
- You are doing single-user development
- You want simpler model management
- You don't need maximum performance