
vLLM Service

This service provides GPU-accelerated LLM inference using vLLM with FP8 quantization, offering higher throughput than Ollama for production workloads.

Overview

vLLM is an optional service that complements Ollama by providing:

  • Higher throughput for concurrent requests
  • Advanced quantization (FP8)
  • PagedAttention for efficient memory usage
  • OpenAI-compatible API

Quick Start

Using the Complete Stack

The easiest way to run vLLM is with the complete stack:

# From project root
./start.sh --complete

This starts vLLM along with all other optional services.

Manual Docker Compose

# From project root
docker compose -f deploy/compose/docker-compose.complete.yml up -d vllm

Testing the Deployment

# Check health
curl http://localhost:8001/v1/models

# Test chat completion
curl -X POST "http://localhost:8001/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello! How are you?"}],
    "max_tokens": 100
  }'
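If you only want the assistant's text from the JSON response, a small helper can pull it out. The field path follows the OpenAI chat-completions schema that vLLM serves; the helper name is illustrative:

```shell
# Illustrative helper: print the assistant message from an OpenAI-style
# chat-completions JSON response read on stdin.
extract_reply() {
  python3 -c 'import sys, json; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Example (assumes the service is running on port 8001):
# curl -s -X POST http://localhost:8001/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d '{"model": "meta-llama/Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "Hi"}]}' \
#   | extract_reply
```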

Default Configuration

  • Model: meta-llama/Llama-3.2-3B-Instruct
  • Quantization: FP8 (optimized for compute efficiency)
  • Port: 8001
  • API: OpenAI-compatible endpoints

Configuration Options

Environment variables configured in docker-compose.complete.yml:

  • VLLM_MODEL: Model to load (default: meta-llama/Llama-3.2-3B-Instruct)
  • VLLM_TENSOR_PARALLEL_SIZE: Number of GPUs to use (default: 1)
  • VLLM_MAX_MODEL_LEN: Maximum sequence length (default: 4096)
  • VLLM_GPU_MEMORY_UTILIZATION: GPU memory usage (default: 0.9)
  • VLLM_QUANTIZATION: Quantization method (default: fp8)
  • VLLM_KV_CACHE_DTYPE: KV cache data type (default: fp8)
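As an illustration, these variables can be overridden in the vllm service definition of docker-compose.complete.yml (the values below are examples, not tuned recommendations):

```yaml
environment:
  - VLLM_MODEL=meta-llama/Llama-3.2-3B-Instruct
  - VLLM_MAX_MODEL_LEN=8192          # longer context costs more KV-cache memory
  - VLLM_GPU_MEMORY_UTILIZATION=0.85 # leave headroom for other GPU workloads
```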

Frontend Integration

The txt2kg frontend automatically detects and uses vLLM when available:

  1. Triple extraction: /api/vllm endpoint
  2. RAG queries: Automatically uses vLLM if configured
  3. Model selection: Choose vLLM models in the UI

Using Different Models

To use a different model, edit the VLLM_MODEL environment variable in docker-compose.complete.yml:

environment:
  - VLLM_MODEL=meta-llama/Llama-3.1-8B-Instruct

Then restart the service:

docker compose -f deploy/compose/docker-compose.complete.yml restart vllm

Performance Tips

  1. Single GPU: Set VLLM_TENSOR_PARALLEL_SIZE=1 for best single-GPU performance
  2. Multi-GPU: Increase VLLM_TENSOR_PARALLEL_SIZE to use multiple GPUs
  3. Memory: Adjust VLLM_GPU_MEMORY_UTILIZATION based on available VRAM
  4. Throughput: For high throughput, use smaller models or more aggressive quantization

Requirements

  • NVIDIA GPU with CUDA support (Ampere architecture or newer recommended)
  • CUDA Driver 535 or above
  • Docker with NVIDIA Container Toolkit
  • At least 8GB VRAM for default model
  • HuggingFace token for gated models (optional); downloaded weights are cached in ~/.cache/huggingface

Troubleshooting

Check Service Status

# View logs
docker compose -f deploy/compose/docker-compose.complete.yml logs -f vllm

# Check health
curl http://localhost:8001/v1/models
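On a first start the server can take several minutes to answer while weights download and load, so a polling loop is handy. This is a sketch; the function name and defaults are illustrative:

```shell
# Illustrative readiness check: poll the models endpoint until vLLM responds,
# or give up after a number of tries.
wait_for_vllm() {
  local url="${1:-http://localhost:8001/v1/models}"
  local tries="${2:-60}" interval="${3:-5}"
  local i
  for i in $(seq "$tries"); do
    curl -sf "$url" >/dev/null 2>&1 && return 0
    sleep "$interval"
  done
  return 1
}

# Usage:
# wait_for_vllm && echo "vLLM is ready"
```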

GPU Issues

# Check GPU availability
nvidia-smi

# Check vLLM container GPU access
docker exec vllm-service nvidia-smi

Model Loading Issues

  • Ensure sufficient VRAM for the model
  • Check HuggingFace cache: ls ~/.cache/huggingface/hub
  • For gated models, set HF_TOKEN environment variable
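To see whether vLLM will find a model locally before starting the container, you can check the hub cache directly. This sketch relies on the models--&lt;org&gt;--&lt;name&gt; directory layout the HuggingFace hub cache uses; the function name is illustrative:

```shell
# Illustrative check: does the HuggingFace hub cache already hold this repo?
model_cached() {
  local repo="$1"  # e.g. meta-llama/Llama-3.2-3B-Instruct
  local dir="${HF_HOME:-$HOME/.cache/huggingface}/hub/models--${repo//\//--}"
  [ -d "$dir" ]
}

# Usage:
# model_cached meta-llama/Llama-3.2-3B-Instruct || echo "will download on first start"
```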

Comparison with Ollama

Feature           | Ollama                   | vLLM
Ease of Use       | Very easy                | ⚠️ More complex
Model Management  | Built-in pull/push       | Manual download
Throughput        | ⚠️ Moderate              | High
Quantization      | Q4_K_M                   | FP8, GPTQ
Memory Efficiency | Good                     | Excellent (PagedAttention)
Use Case          | Development, small-scale | Production, high-throughput

When to Use vLLM

Use vLLM when:

  • Processing large batches of requests
  • Maximizing throughput
  • Running on multiple GPUs
  • Deploying to production under high load

Use Ollama when:

  • Getting started with the project
  • Running single-user development workloads
  • Preferring built-in model management
  • Not needing maximum performance