# vLLM Service
This service provides advanced GPU-accelerated LLM inference using vLLM with FP8 quantization, offering higher throughput than Ollama for production workloads.
## Overview
vLLM is an optional service that complements Ollama by providing:
- Higher throughput for concurrent requests
- Advanced quantization (FP8)
- PagedAttention for efficient memory usage
- OpenAI-compatible API
## Quick Start
### Using the Complete Stack
The easiest way to run vLLM is with the complete stack:
```bash
# From project root
./start.sh --complete
```
This starts vLLM along with all other optional services.
### Manual Docker Compose
```bash
# From project root
docker compose -f deploy/compose/docker-compose.complete.yml up -d vllm
```
### Testing the Deployment
```bash
# Check health
curl http://localhost:8001/v1/models

# Test chat completion
curl -X POST "http://localhost:8001/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello! How are you?"}],
    "max_tokens": 100
  }'
```
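
The completion comes back as an OpenAI-style JSON body, with the assistant text at `choices[0].message.content`. A minimal offline sketch of pulling that field out (the sample response below is illustrative, not captured from a live server):

```shell
# Illustrative OpenAI-style response body (not real server output)
sample_response='{"choices":[{"message":{"role":"assistant","content":"Hello! I am doing well."}}]}'

# Extract the assistant text with Python's stdlib (avoids a jq dependency)
reply=$(printf '%s' "$sample_response" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])')

echo "$reply"
```

Against the running service, you would pipe the output of the `curl` command above into the same extractor.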
## Default Configuration
- **Model**: `meta-llama/Llama-3.2-3B-Instruct`
- **Quantization**: FP8 (optimized for compute efficiency)
- **Port**: 8001
- **API**: OpenAI-compatible endpoints
## Configuration Options
Environment variables configured in `docker-compose.complete.yml`:
- `VLLM_MODEL`: Model to load (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `VLLM_TENSOR_PARALLEL_SIZE`: Number of GPUs to use (default: 1)
- `VLLM_MAX_MODEL_LEN`: Maximum sequence length (default: 4096)
- `VLLM_GPU_MEMORY_UTILIZATION`: GPU memory usage (default: 0.9)
- `VLLM_QUANTIZATION`: Quantization method (default: fp8)
- `VLLM_KV_CACHE_DTYPE`: KV cache data type (default: fp8)
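
These variables presumably end up as flags on the `vllm serve` command inside the container. A hedged sketch of how an entrypoint might apply the documented defaults with shell parameter expansion (this entrypoint fragment is an assumption for illustration, not taken from the repo):

```shell
# Hypothetical entrypoint fragment: fall back to the documented defaults
MODEL="${VLLM_MODEL:-meta-llama/Llama-3.2-3B-Instruct}"
TP="${VLLM_TENSOR_PARALLEL_SIZE:-1}"
MAXLEN="${VLLM_MAX_MODEL_LEN:-4096}"
GPU_UTIL="${VLLM_GPU_MEMORY_UTILIZATION:-0.9}"
QUANT="${VLLM_QUANTIZATION:-fp8}"
KV_DTYPE="${VLLM_KV_CACHE_DTYPE:-fp8}"

# Compose (but don't run) the serve command so it can be inspected
cmd="vllm serve $MODEL \
  --tensor-parallel-size $TP \
  --max-model-len $MAXLEN \
  --gpu-memory-utilization $GPU_UTIL \
  --quantization $QUANT \
  --kv-cache-dtype $KV_DTYPE \
  --port 8001"

echo "$cmd"
```

With none of the variables set, the echoed command uses exactly the defaults listed above.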
## Frontend Integration
The txt2kg frontend automatically detects and uses vLLM when available:
1. Triple extraction: `/api/vllm` endpoint
2. RAG queries: Automatically uses vLLM if configured
3. Model selection: Choose vLLM models in the UI
## Using Different Models
To use a different model, edit the `VLLM_MODEL` environment variable in `docker-compose.complete.yml`:
```yaml
environment:
  - VLLM_MODEL=meta-llama/Llama-3.1-8B-Instruct
```
Then restart the service:
```bash
docker compose -f deploy/compose/docker-compose.complete.yml restart vllm
```
## Performance Tips
1. **Single GPU**: Set `VLLM_TENSOR_PARALLEL_SIZE=1` for best single-GPU performance
2. **Multi-GPU**: Increase `VLLM_TENSOR_PARALLEL_SIZE` to use multiple GPUs
3. **Memory**: Adjust `VLLM_GPU_MEMORY_UTILIZATION` based on available VRAM
4. **Throughput**: For high throughput, use smaller models or more aggressive quantization
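
`VLLM_GPU_MEMORY_UTILIZATION` is a fraction of total VRAM that vLLM pre-allocates for weights plus KV cache. A back-of-the-envelope sketch of what a given setting claims (the card size here is illustrative):

```shell
# Back-of-the-envelope: how much VRAM vLLM will claim at a given utilization.
# Numbers are illustrative; actual usage depends on model, dtype, and cache settings.
total_mib=24576          # e.g. a 24 GiB card
util=0.90                # VLLM_GPU_MEMORY_UTILIZATION

claimed_mib=$(awk -v t="$total_mib" -v u="$util" 'BEGIN { printf "%d", t * u }')
echo "vLLM would claim ~${claimed_mib} MiB of ${total_mib} MiB"
```

Lowering the fraction leaves headroom for other processes on the same GPU at the cost of a smaller KV cache (and thus fewer concurrent sequences).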
## Requirements
- NVIDIA GPU with CUDA support (Ampere architecture or newer recommended)
- CUDA Driver 535 or above
- Docker with NVIDIA Container Toolkit
- At least 8 GB of VRAM for the default model
- HuggingFace token for gated models (optional, cached in `~/.cache/huggingface`)
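
A quick preflight sketch that reports which host-side tools are visible on `PATH` (presence on `PATH` is necessary but not sufficient; the NVIDIA Container Toolkit must also be wired into Docker's runtime configuration):

```shell
# Minimal preflight: report which host-side prerequisites are on PATH
report=$(
  for tool in docker nvidia-smi curl; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: MISSING"
    fi
  done
)
echo "$report"
```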
## Troubleshooting
### Check Service Status
```bash
# View logs
docker compose -f deploy/compose/docker-compose.complete.yml logs -f vllm

# Check health
curl http://localhost:8001/v1/models
```
### GPU Issues
```bash
# Check GPU availability
nvidia-smi

# Check vLLM container GPU access
docker exec vllm-service nvidia-smi
```
### Model Loading Issues
- Ensure sufficient VRAM for the model
- Check HuggingFace cache: `ls ~/.cache/huggingface/hub`
- For gated models, set the `HF_TOKEN` environment variable
## Comparison with Ollama
| Feature | Ollama | vLLM |
|---------|--------|------|
| **Ease of Use** | ✅ Very easy | ⚠️ More complex |
| **Model Management** | ✅ Built-in pull/push | ❌ Manual download |
| **Throughput** | ⚠️ Moderate | ✅ High |
| **Quantization** | Q4_K_M | FP8, GPTQ |
| **Memory Efficiency** | ✅ Good | ✅ Excellent (PagedAttention) |
| **Use Case** | Development, small-scale | Production, high-throughput |
## When to Use vLLM
Use vLLM when:

- Processing large batches of requests
- You need maximum throughput
- Using multiple GPUs
- Deploying to production under high load

Use Ollama when:

- Getting started with the project
- Doing single-user development
- You want simpler model management
- You don't need maximum performance