# vLLM Service

This service provides GPU-accelerated LLM inference using vLLM with FP8 quantization, offering higher throughput than Ollama for production workloads.

## Overview

vLLM is an optional service that complements Ollama by providing:

- Higher throughput for concurrent requests
- Advanced quantization (FP8)
- PagedAttention for efficient memory usage
- OpenAI-compatible API

## Quick Start

### Using the Complete Stack

The easiest way to run vLLM is with the complete stack:

```bash
# From project root
./start.sh --complete
```

This starts vLLM along with all other optional services.

### Manual Docker Compose

```bash
# From project root
docker compose -f deploy/compose/docker-compose.complete.yml up -d vllm
```

### Testing the Deployment

```bash
# Check health
curl http://localhost:8001/v1/models

# Test chat completion
curl -X POST "http://localhost:8001/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello! How are you?"}],
    "max_tokens": 100
  }'
```

## Default Configuration

- **Model**: `meta-llama/Llama-3.2-3B-Instruct`
- **Quantization**: FP8 (optimized for compute efficiency)
- **Port**: 8001
- **API**: OpenAI-compatible endpoints

## Configuration Options

Environment variables configured in `docker-compose.complete.yml`:

- `VLLM_MODEL`: Model to load (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `VLLM_TENSOR_PARALLEL_SIZE`: Number of GPUs to use (default: 1)
- `VLLM_MAX_MODEL_LEN`: Maximum sequence length (default: 4096)
- `VLLM_GPU_MEMORY_UTILIZATION`: Fraction of GPU memory to use (default: 0.9)
- `VLLM_QUANTIZATION`: Quantization method (default: fp8)
- `VLLM_KV_CACHE_DTYPE`: KV cache data type (default: fp8)

## Frontend Integration

The txt2kg frontend automatically detects and uses vLLM when available:

1. Triple extraction: the `/api/vllm` endpoint
2. RAG queries: automatically use vLLM if configured
3. Model selection: choose vLLM models in the UI

## Using Different Models

To use a different model, edit the `VLLM_MODEL` environment variable in `docker-compose.complete.yml`:

```yaml
environment:
  - VLLM_MODEL=meta-llama/Llama-3.1-8B-Instruct
```

Then restart the service:

```bash
docker compose -f deploy/compose/docker-compose.complete.yml restart vllm
```

## Performance Tips

1. **Single GPU**: Set `VLLM_TENSOR_PARALLEL_SIZE=1` for best single-GPU performance
2. **Multi-GPU**: Increase `VLLM_TENSOR_PARALLEL_SIZE` to use multiple GPUs
3. **Memory**: Adjust `VLLM_GPU_MEMORY_UTILIZATION` based on available VRAM
4. **Throughput**: For higher throughput, use smaller models or more aggressive quantization (see the sketch below for a quick concurrency check)
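To get a rough feel for concurrent throughput, the sketch below fires several chat-completion requests in parallel against the endpoint documented above. It assumes the default port (8001) and default model from this compose file and that `curl` is available; for real benchmarking, use a dedicated load-testing tool.

```bash
#!/usr/bin/env bash
# Rough concurrency smoke test (sketch): send N chat completions in parallel
# and report wall-clock time. Assumes the defaults from docker-compose.complete.yml.
N=${N:-8}
URL="http://localhost:8001/v1/chat/completions"
MODEL="meta-llama/Llama-3.2-3B-Instruct"

start=$(date +%s)
for i in $(seq 1 "$N"); do
  curl -s -X POST "$URL" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Say hello $i\"}], \"max_tokens\": 32}" \
    > /dev/null &
done
wait
end=$(date +%s)
echo "Completed $N concurrent requests in $((end - start))s"
```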
## Requirements

- NVIDIA GPU with CUDA support (Ampere architecture or newer recommended)
- CUDA driver 535 or newer
- Docker with the NVIDIA Container Toolkit
- At least 8 GB VRAM for the default model
- HuggingFace token for gated models (optional; cached in `~/.cache/huggingface`)

## Troubleshooting

### Check Service Status

```bash
# View logs
docker compose -f deploy/compose/docker-compose.complete.yml logs -f vllm

# Check health
curl http://localhost:8001/v1/models
```

### GPU Issues

```bash
# Check GPU availability
nvidia-smi

# Check vLLM container GPU access
docker exec vllm-service nvidia-smi
```

### Model Loading Issues

- Ensure sufficient VRAM for the model
- Check the HuggingFace cache: `ls ~/.cache/huggingface/hub`
- For gated models, set the `HF_TOKEN` environment variable

## Comparison with Ollama

| Feature | Ollama | vLLM |
|---------|--------|------|
| **Ease of Use** | ✅ Very easy | ⚠️ More complex |
| **Model Management** | ✅ Built-in pull/push | ❌ Manual download |
| **Throughput** | ⚠️ Moderate | ✅ High |
| **Quantization** | Q4_K_M | FP8, GPTQ |
| **Memory Efficiency** | ✅ Good | ✅ Excellent (PagedAttention) |
| **Use Case** | Development, small-scale | Production, high-throughput |

## When to Use vLLM

Use vLLM when you:

- Process large batches of requests
- Need maximum throughput
- Use multiple GPUs
- Deploy to production with high load

Use Ollama when you:

- Are getting started with the project
- Are doing single-user development
- Want simpler model management
- Don't need maximum performance
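As a quick way to see which backend is actually reachable before deciding, the sketch below probes both services. It assumes the vLLM port documented above (8001) and Ollama's standard port and list-models endpoint (11434, `/api/tags`); adjust if your deployment differs.

```bash
#!/usr/bin/env bash
# Probe which inference backends are reachable (sketch).
# Assumes default ports: vLLM on 8001 (documented above), Ollama on 11434.
check() {
  local name=$1 url=$2
  if curl -sf --max-time 2 "$url" > /dev/null; then
    echo "$name: available ($url)"
  else
    echo "$name: not reachable ($url)"
  fi
}

check "vLLM"   "http://localhost:8001/v1/models"
check "Ollama" "http://localhost:11434/api/tags"
```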