
vLLM Service

This service provides GPU-accelerated LLM inference using vLLM with FP8 quantization, offering higher throughput than Ollama for production workloads.

Overview

vLLM is an optional service that complements Ollama by providing:

  • Higher throughput for concurrent requests
  • Advanced quantization (FP8)
  • PagedAttention for efficient memory usage
  • OpenAI-compatible API

Quick Start

Using the Complete Stack

The easiest way to run vLLM is with the complete stack:

# From project root
./start.sh --complete

This starts vLLM along with all other optional services.

Manual Docker Compose

# From project root
docker compose -f deploy/compose/docker-compose.complete.yml up -d vllm

Testing the Deployment

# Check health
curl http://localhost:8001/v1/models

# Test chat completion
curl -X POST "http://localhost:8001/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello! How are you?"}],
    "max_tokens": 100
  }'
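If you only want the assistant's text from the JSON response, a small helper can pull it out. The field path follows the OpenAI chat-completions schema that vLLM serves; the helper name is illustrative:

```shell
# Illustrative helper: print the assistant message from an OpenAI-style
# chat-completions JSON response read on stdin.
extract_reply() {
  python3 -c 'import sys, json; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Example (assumes the service is running on port 8001):
# curl -s -X POST http://localhost:8001/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d '{"model": "meta-llama/Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "Hi"}]}' \
#   | extract_reply
```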

Default Configuration

  • Model: meta-llama/Llama-3.2-3B-Instruct
  • Quantization: FP8 (optimized for compute efficiency)
  • Port: 8001
  • API: OpenAI-compatible endpoints

Configuration Options

Environment variables configured in docker-compose.complete.yml:

  • VLLM_MODEL: Model to load (default: meta-llama/Llama-3.2-3B-Instruct)
  • VLLM_TENSOR_PARALLEL_SIZE: Number of GPUs to use (default: 1)
  • VLLM_MAX_MODEL_LEN: Maximum sequence length (default: 4096)
  • VLLM_GPU_MEMORY_UTILIZATION: GPU memory usage (default: 0.9)
  • VLLM_QUANTIZATION: Quantization method (default: fp8)
  • VLLM_KV_CACHE_DTYPE: KV cache data type (default: fp8)
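As an illustration, these variables can be overridden in the vllm service definition of docker-compose.complete.yml (the values below are examples, not tuned recommendations):

```yaml
environment:
  - VLLM_MODEL=meta-llama/Llama-3.2-3B-Instruct
  - VLLM_MAX_MODEL_LEN=8192          # longer context costs more KV-cache memory
  - VLLM_GPU_MEMORY_UTILIZATION=0.85 # leave headroom for other GPU workloads
```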

Frontend Integration

The txt2kg frontend automatically detects and uses vLLM when available:

  1. Triple extraction: /api/vllm endpoint
  2. RAG queries: Automatically uses vLLM if configured
  3. Model selection: Choose vLLM models in the UI

Using Different Models

To use a different model, edit the VLLM_MODEL environment variable in docker-compose.complete.yml:

environment:
  - VLLM_MODEL=meta-llama/Llama-3.1-8B-Instruct

Then restart the service:

docker compose -f deploy/compose/docker-compose.complete.yml restart vllm

Performance Tips

  1. Single GPU: Set VLLM_TENSOR_PARALLEL_SIZE=1 for best single-GPU performance
  2. Multi-GPU: Increase VLLM_TENSOR_PARALLEL_SIZE to use multiple GPUs
  3. Memory: Adjust VLLM_GPU_MEMORY_UTILIZATION based on available VRAM
  4. Throughput: For high throughput, use smaller models or more aggressive quantization

Requirements

  • NVIDIA GPU with CUDA support (Ampere architecture or newer recommended)
  • CUDA Driver 535 or above
  • Docker with NVIDIA Container Toolkit
  • At least 8GB VRAM for default model
  • HuggingFace token for gated models (optional); downloaded weights are cached in ~/.cache/huggingface

Troubleshooting

Check Service Status

# View logs
docker compose -f deploy/compose/docker-compose.complete.yml logs -f vllm

# Check health
curl http://localhost:8001/v1/models
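On a first start the server can take several minutes to answer while weights download and load, so a polling loop is handy. This is a sketch; the function name and defaults are illustrative:

```shell
# Illustrative readiness check: poll the models endpoint until vLLM responds,
# or give up after a number of tries.
wait_for_vllm() {
  local url="${1:-http://localhost:8001/v1/models}"
  local tries="${2:-60}" interval="${3:-5}"
  local i
  for i in $(seq "$tries"); do
    curl -sf "$url" >/dev/null 2>&1 && return 0
    sleep "$interval"
  done
  return 1
}

# Usage:
# wait_for_vllm && echo "vLLM is ready"
```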

GPU Issues

# Check GPU availability
nvidia-smi

# Check vLLM container GPU access
docker exec vllm-service nvidia-smi

Model Loading Issues

  • Ensure sufficient VRAM for the model
  • Check HuggingFace cache: ls ~/.cache/huggingface/hub
  • For gated models, set HF_TOKEN environment variable
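To see whether vLLM will find a model locally before starting the container, you can check the hub cache directly. This sketch relies on the models--&lt;org&gt;--&lt;name&gt; directory layout the HuggingFace hub cache uses; the function name is illustrative:

```shell
# Illustrative check: does the HuggingFace hub cache already hold this repo?
model_cached() {
  local repo="$1"  # e.g. meta-llama/Llama-3.2-3B-Instruct
  local dir="${HF_HOME:-$HOME/.cache/huggingface}/hub/models--${repo//\//--}"
  [ -d "$dir" ]
}

# Usage:
# model_cached meta-llama/Llama-3.2-3B-Instruct || echo "will download on first start"
```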

Comparison with Ollama

Feature           | Ollama                   | vLLM
Ease of Use       | Very easy                | ⚠️ More complex
Model Management  | Built-in pull/push       | Manual download
Throughput        | ⚠️ Moderate              | High
Quantization      | Q4_K_M                   | FP8, GPTQ
Memory Efficiency | Good                     | Excellent (PagedAttention)
Use Case          | Development, small-scale | Production, high-throughput

When to Use vLLM

Use vLLM when:

  • Processing large batches of requests
  • Maximizing throughput
  • Running on multiple GPUs
  • Deploying to production under high load

Use Ollama when:

  • Getting started with the project
  • Running single-user development workloads
  • Preferring built-in model management
  • Not needing maximum performance