From 2e2bd293ed031426b6b155d9e5b576a744f5737b Mon Sep 17 00:00:00 2001
From: GitLab CI
Date: Tue, 7 Oct 2025 17:34:59 +0000
Subject: [PATCH] chore: Regenerate all playbooks

---
 nvidia/speculative-decoding/README.md |   5 +-
 nvidia/txt2kg/assets/README.md        | 195 +++++---------------
 2 files changed, 37 insertions(+), 163 deletions(-)

diff --git a/nvidia/speculative-decoding/README.md b/nvidia/speculative-decoding/README.md
index 494dd39..6da7f6f 100644
--- a/nvidia/speculative-decoding/README.md
+++ b/nvidia/speculative-decoding/README.md
@@ -5,7 +5,7 @@
 ## Table of Contents
 
 - [Overview](#overview)
-- [How to run inference with speculative decoding](#how-to-run-inference-with-speculative-decoding)
+- [Instructions](#instructions)
 - [Step 1. Configure Docker permissions](#step-1-configure-docker-permissions)
 - [Step 2. Run Draft-Target Speculative Decoding](#step-2-run-draft-target-speculative-decoding)
 - [Step 3. Test the Draft-Target setup](#step-3-test-the-draft-target-setup)
@@ -57,7 +57,7 @@ These examples demonstrate how to accelerate large language model inference whil
 
 **Rollback:** Stop Docker containers and optionally clean up downloaded model cache
 
-## How to run inference with speculative decoding
+## Instructions
 
 ## Traditional Draft-Target Speculative Decoding
 
@@ -169,3 +169,4 @@ docker stop
 - Experiment with different `max_draft_len` values (1, 2, 3, 4, 8)
 - Monitor token acceptance rates and throughput improvements
 - Test with different prompt lengths and generation parameters
+- Read more about speculative decoding in the [TensorRT-LLM documentation](https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html)
diff --git a/nvidia/txt2kg/assets/README.md b/nvidia/txt2kg/assets/README.md
index dbec322..b223803 100644
--- a/nvidia/txt2kg/assets/README.md
+++ b/nvidia/txt2kg/assets/README.md
@@ -1,35 +1,29 @@
 # NVIDIA txt2kg
 
-Use the following documentation to learn about NVIDIA txt2kg.
-- [Overview](#overview)
-- [Key Features](#key-features)
-- [Target Audience](#target-audience)
-- [Software Components](#software-components)
-- [Technical Diagram](#technical-diagram)
-- [GPU-Accelerated Visualization](#gpu-accelerated-visualization)
-- [Minimum System Requirements](#minimum-system-requirements)
-  - [OS Requirements](#os-requirements)
-  - [Deployment Options](#deployment-options)
-  - [Driver Versions](#driver-versions)
-  - [Hardware Requirements](#hardware-requirements)
-- [Next Steps](#next-steps)
-- [Deployment Guide](#deployment-guide)
-  - [Standard Deployment](#standard-deployment)
-  - [PyGraphistry GPU-Accelerated Deployment](#pygraphistry-gpu-accelerated-deployment)
-- [Available Customizations](#available-customizations)
-- [License](#license)
-
 ## Overview
 
-This blueprint serves as a reference solution for knowledge graph extraction and querying with Retrieval Augmented Generation (RAG). This txt2kg blueprint extracts knowledge triples from text and constructs a knowledge graph for visualization and querying, creating a more structured form of information retrieval compared to traditional RAG approaches. By leveraging graph databases and entity relationships, this blueprint delivers more contextually rich answers that better represent complex relationships in your data.
+This playbook serves as a reference solution for knowledge graph extraction and querying with Retrieval Augmented Generation (RAG). This txt2kg playbook extracts knowledge triples from text and constructs a knowledge graph for visualization and querying, creating a more structured form of information retrieval compared to traditional RAG approaches. By leveraging graph databases and entity relationships, this playbook delivers more contextually rich answers that better represent complex relationships in your data.
 
-By default, this blueprint leverages **Ollama** for local LLM inference, providing a fully self-contained solution that runs entirely on your own hardware. You can optionally use NVIDIA-hosted models available in the [NVIDIA API Catalog](https://build.nvidia.com) or vLLM for advanced GPU-accelerated inference.
+
+📋 Table of Contents
+
+
+- [Overview](#overview)
+- [Key Features](#key-features)
+- [Software Components](#software-components)
+- [Technical Diagram](#technical-diagram)
+- [Minimum System Requirements](#minimum-system-requirements)
+- [Deployment Guide](#deployment-guide)
+- [Available Customizations](#available-customizations)
+- [License](#license)
+
+
+
+By default, this playbook leverages **Ollama** for local LLM inference, providing a fully self-contained solution that runs entirely on your own hardware. You can optionally use NVIDIA-hosted models available in the [NVIDIA API Catalog](https://build.nvidia.com) for advanced capabilities.
 
 ## Key Features
 
-![Screenshot](/frontend/public/txt2kg.png)
-
-[Watch the demo video](https://drive.google.com/file/d/1a0VG67zx_pGT4WyPTPH2ynefhfy2I0Py/view?usp=sharing)
+![Screenshot](./frontend/public/txt2kg.png)
 
 - Knowledge triple extraction from text documents
 - Knowledge graph construction and visualization
@@ -44,26 +38,14 @@ By default, this blueprint leverages **Ollama** for local LLM inference, providi
 - Fully containerized deployment with Docker Compose
 - Decomposable and customizable
 
-## Target Audience
-
-This blueprint is for:
-
-- **Developers**: Developers who want to quickly set up a local-first Graph-based RAG solution
-- **Data Scientists**: Data scientists who want to extract structured knowledge from unstructured text
-- **Enterprise Architects**: Architects seeking to combine knowledge graph and RAG solutions for their organization
-- **Privacy-Conscious Users**: Organizations requiring fully local, air-gapped deployments
-- **GPU Researchers**: Researchers wanting to leverage GPU acceleration for LLM inference and graph visualization
-
 ## Software Components
 
-The following are the default components included in this blueprint:
+The following are the default components included in this playbook:
 
 * **LLM Inference**
   * **Ollama** (default): Local LLM inference with GPU acceleration
     * Default model: `llama3.1:8b`
     * Supports any Ollama-compatible model
-  * **vLLM** (optional): Advanced GPU-accelerated inference with quantization
-    * Default model: `meta-llama/Llama-3.2-3B-Instruct`
   * **NVIDIA API** (optional): Cloud-based models via NVIDIA API Catalog
 * **Vector Database & Embedding**
   * **SentenceTransformer**: Local embedding generation
@@ -98,7 +80,7 @@ The architecture follows this workflow:
 
 ## GPU-Accelerated LLM Inference
 
-This blueprint includes **GPU-accelerated LLM inference** with Ollama:
+This playbook includes **GPU-accelerated LLM inference** with Ollama:
 
 ### Ollama Features
 - **Fully local inference**: No cloud dependencies or API keys required
@@ -114,52 +96,21 @@ This blueprint includes **GPU-accelerated LLM inference** with Ollama:
 - Flash attention enabled
 - Q8_0 KV cache for memory efficiency
 
-### Using Different Models
-```bash
-# Pull a different model
-docker exec ollama-compose ollama pull llama3.1:70b
-
-# Update environment variable in docker-compose.yml
-OLLAMA_MODEL=llama3.1:70b
-```
-
 ## Minimum System Requirements
 
-### OS Requirements
+**OS Requirements:**
+- Ubuntu 22.04 or later
 
-Ubuntu 22.04 or later
+**Driver Versions:**
+- GPU Driver: 530.30.02+
+- CUDA: 12.0+
 
-### Deployment Options
-
-- [Standard Docker Compose](./deploy/compose/docker-compose.yml) (Default - Ollama + ArangoDB + Pinecone)
-- [vLLM Docker Compose](./deploy/compose/docker-compose.vllm.yml) (Advanced - vLLM for FP8 and NVFP4 quantization)
-- [Complete Docker Compose](./deploy/compose/docker-compose.complete.yml) (Full stack with MinIO S3)
-
-### Driver Versions
-
-- GPU Driver - 530.30.02+
-- CUDA version - 12.0+
-
-### Hardware Requirements
-
-- **For Ollama LLM inference**:
-  - NVIDIA GPU with CUDA support (GTX 1060 or newer, RTX series recommended)
-  - VRAM requirements depend on model size:
-    - 8B models: 6-8GB VRAM
-    - 70B models: 48GB+ VRAM (or use quantized versions)
-  - System RAM: 16GB+ recommended
-- **For vLLM (optional)**:
-  - NVIDIA GPU with Ampere architecture or newer (RTX 30xx+, A100, H100)
-  - Support for FP8 quantization for optimal performance
-  - Similar VRAM requirements as Ollama
-
-## Next Steps
-
-- Clone the repository
-- Install Docker and NVIDIA Container Toolkit
-- Deploy with Docker Compose (no API keys required!)
-- Pull your preferred Ollama model
-- Upload documents and explore the knowledge graph
-- Customize for your specific use case
+**Hardware Requirements:**
+- NVIDIA GPU with CUDA support (GTX 1060 or newer, RTX series recommended)
+- VRAM requirements depend on model size:
+  - 8B models: 6-8GB VRAM
+  - 70B models: 48GB+ VRAM (or use quantized versions)
+- System RAM: 16GB+ recommended
 
 ## Deployment Guide
 
@@ -173,8 +124,7 @@ The default configuration uses:
 - Local ArangoDB (no authentication by default)
 - Local SentenceTransformer embeddings
 
-#### Optional Environment Variables
-
+Optional environment variables for customization:
 ```bash
 # Ollama configuration (optional - defaults are set)
 OLLAMA_BASE_URL=http://ollama:11434/v1
@@ -182,13 +132,9 @@ OLLAMA_MODEL=llama3.1:8b
 
 # NVIDIA API (optional - for cloud models)
 NVIDIA_API_KEY=your-nvidia-api-key
-
-# vLLM configuration (optional)
-VLLM_BASE_URL=http://vllm:8001/v1
-VLLM_MODEL=meta-llama/Llama-3.2-3B-Instruct
 ```
 
-### Standard Deployment
+### Quick Start
 
 1. **Clone the repository:**
 ```bash
@@ -218,86 +164,13 @@ docker exec ollama-compose ollama pull llama3.1:8b
 - **ArangoDB**: http://localhost:8529 (no authentication required)
 - **Ollama API**: http://localhost:11434
 
-### Advanced Deployment Options
-
-#### Using vLLM for FP8 Quantization
-
-vLLM provides advanced GPU acceleration with FP8 quantization for smaller memory footprint:
-
-```bash
-# Use vLLM compose file
-docker compose -f deploy/compose/docker-compose.vllm.yml up -d
-```
-
-vLLM is recommended for:
-- Newer NVIDIA GPUs (Ampere architecture or later)
-- Production deployments requiring maximum throughput
-- Memory-constrained environments (FP8 uses less VRAM)
-
-#### GPU Setup Prerequisites
-
-1. **Install NVIDIA Container Toolkit**:
-   ```bash
-   # Ubuntu/Debian
-   distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
-   curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
-   curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
-
-   sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
-   sudo systemctl restart docker
-   ```
-
-2. **Verify GPU Access**:
-   ```bash
-   docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu22.04 nvidia-smi
-   ```
-
-#### Troubleshooting
-
-**Check Service Logs**:
-```bash
-# View all service logs
-docker compose logs -f
-
-# View Ollama logs
-docker compose logs -f ollama
-
-# View vLLM logs (if using vLLM)
-docker compose -f deploy/compose/docker-compose.vllm.yml logs -f vllm
-```
-
-**GPU Issues**:
-```bash
-# Check GPU availability
-nvidia-smi
-
-# Verify Docker GPU access
-docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi
-```
-
-**Ollama Model Management**:
-```bash
-# List available models
-docker exec ollama-compose ollama list
-
-# Pull a different model
-docker exec ollama-compose ollama pull mistral
-
-# Remove a model to free space
-docker exec ollama-compose ollama rm llama3.1:8b
-```
-
 ## Available Customizations
 
-The following are some of the customizations you can make:
-
-- **Switch Ollama models**: Use any model from Ollama's library (Llama, Qwen, etc.)
+- **Switch Ollama models**: Use any model from Ollama's library (Llama, Mistral, Qwen, etc.)
 - **Modify extraction prompts**: Customize how triples are extracted from text
 - **Adjust embedding parameters**: Change the SentenceTransformer model
 - **Implement custom entity relationships**: Define domain-specific relationship types
 - **Add domain-specific knowledge sources**: Integrate external ontologies or taxonomies
-- **Configure GPU settings**: Optimize VRAM usage and performance for your hardware
-- **Switch to vLLM**: Use vLLM for advanced quantization and higher throughput
 - **Use NVIDIA API**: Connect to cloud models for specific use cases
 
 ## License
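
Reviewer note on the `max_draft_len` sweep that the speculative-decoding README in this patch recommends: under the common i.i.d. acceptance-rate model for draft-target speculative decoding, the expected number of tokens produced per target-model pass is E = (1 - a^(g+1)) / (1 - a), where a is the per-token acceptance rate and g is `max_draft_len`. The sketch below is illustrative only; a = 0.8 is an assumed figure, not a measurement from these playbooks.

```shell
# Back-of-envelope estimate for the max_draft_len sweep (g = 1, 2, 3, 4, 8)
# under the standard acceptance-rate model: E = (1 - a^(g+1)) / (1 - a).
# The acceptance rate a=0.8 is a hypothetical value for illustration.
for g in 1 2 3 4 8; do
  awk -v a=0.8 -v g="$g" 'BEGIN {
    e = (1 - a^(g+1)) / (1 - a)   # expected tokens per target-model pass
    printf "max_draft_len=%d  expected tokens/pass=%.2f\n", g, e
  }'
done
```

At this assumed acceptance rate the gain flattens past g of about 4, which is one reason to sweep small draft lengths and monitor measured acceptance rates rather than simply maximizing `max_draft_len`.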