mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-23 02:23:53 +00:00

chore: Regenerate all playbooks

This commit is contained in:
parent f7f0a5ec85
commit 89b4835335
@ -77,8 +77,10 @@ to the DGX Spark device

## Step 1. Install NVIDIA Sync

NVIDIA Sync is a desktop app that connects your computer to your DGX Spark over the local network.
It gives you a single interface to manage SSH access and launch development tools on your DGX Spark.

Download and install NVIDIA Sync on your computer to get started.

::spark-download

@ -115,30 +117,27 @@ interface for managing SSH connections and launching development tools on your D

## Step 2. Configure Apps

After starting NVIDIA Sync and agreeing to the EULA, select which development tools you want to use.
Apps are desktop programs installed on your laptop that NVIDIA Sync can configure and launch with an automatic connection to your Spark.

You can change your app selections anytime in the Settings window. Apps that are marked "unavailable" must be installed before you can use them.

**Default apps:**
- **DGX Dashboard**: Web application pre-installed on DGX Spark for system management and integrated JupyterLab access
- **Terminal**: Your system's built-in terminal with automatic SSH connection

**Optional apps (require separate installation):**
- **VS Code**: Download from https://code.visualstudio.com/download
- **Cursor**: Download from https://cursor.com/downloads
- **NVIDIA AI Workbench**: Download from https://www.nvidia.com/workbench

## Step 3. Add your DGX Spark device

> [!NOTE]
> You must know either your hostname or IP address to connect.
>
> - The default hostname can be found on the Quick Start Guide included in the box. For example, `spark-abcd.local`
> - If you have a display connected to your device, you can find the hostname on the Settings page of the [DGX Dashboard](http://localhost:11000).
> - If `.local` (mDNS) hostnames don't work on your network, you must use an IP address. This can be found in Ubuntu's network settings or by logging into the admin console of your router.

Finally, connect your DGX Spark by filling out the form:

@ -159,7 +158,8 @@ Click "Add" and NVIDIA Sync will automatically:

4. Create an SSH alias locally for future connections
5. Discard your username and password information

> [!IMPORTANT]
> After completing system setup for the first time, your device may take several minutes to update and become available on the network. If NVIDIA Sync fails to connect, please wait 3-4 minutes and try again.

## Step 4. Access your DGX Spark

@ -178,9 +178,10 @@ connection to your DGX Spark.

## Step 5. Validate SSH setup

NVIDIA Sync creates an SSH alias for your device for easy access, manually or from other SSH-enabled apps.

Verify your local SSH configuration is correct by using the SSH alias. You should not be prompted for your
password when using the alias:

```bash
## Configured if you use mDNS hostname
```

@ -207,12 +208,21 @@ Exit the SSH session

```bash
exit
```
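The alias works because a `Host` entry lives in your SSH configuration. The exact entry NVIDIA Sync generates is not shown in this playbook; the sketch below writes an illustrative one to a temporary file (hostname, user, and key path are all assumptions) just to show the shape:

```bash
# Illustrative Host entry of the kind an SSH alias relies on; written to a
# temp file so the sketch is self-contained. All values here are examples,
# not the ones NVIDIA Sync actually generates.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
Host spark-abcd
    HostName spark-abcd.local
    User myuser
    IdentityFile ~/.ssh/id_ed25519
EOF
# With an entry like this in ~/.ssh/config, `ssh spark-abcd` picks up the
# hostname, user, and key automatically.
grep -c '^Host ' "$cfg"
```

With such an entry in place, `ssh spark-abcd` needs no hostname, username, or password on the command line.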

## Step 6. Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| Device name doesn't resolve | mDNS blocked on network | Use IP address instead of hostname.local |
| Connection refused/timeout | DGX Spark not booted or SSH not ready | Wait for the device to finish booting; SSH becomes available after updates finish |
| Authentication failed | SSH key setup incomplete | Re-run device setup in NVIDIA Sync; check credentials |
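To check from your laptop whether an mDNS name resolves at all, a quick lookup helps on Linux (the hostname below is the illustrative one from the Quick Start Guide example; substitute your own):

```bash
# Resolve the device name; fall back to a hint when mDNS is blocked.
host="spark-abcd.local"
if getent hosts "$host" >/dev/null 2>&1; then
  echo "$host resolves"
else
  echo "$host does not resolve; use the IP address instead"
fi
```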
## Step 7. Next steps

Test your setup by launching a development tool:
- Click the NVIDIA Sync system tray icon.
- Select "Terminal" to open a terminal session on your DGX Spark.
- Select "DGX Dashboard" to use JupyterLab and manage updates.
- Try [a custom port example with Open WebUI](/spark/open-webui/sync)

## Connect with Manual SSH

@ -60,14 +60,18 @@ If you see a permission denied error (something like `permission denied while tr

```bash
sudo usermod -aG docker $USER
newgrp docker
```

Test Docker access again. In the terminal, run:

```bash
docker ps
```

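Before re-testing, you can confirm whether your current session already has the group membership without touching Docker at all; this is a generic check, not part of the official playbook:

```bash
# Print whether "docker" is among the current session's groups.
# A fresh login (or `newgrp docker`) is needed before it appears.
if id -nG | tr ' ' '\n' | grep -qx docker; then
  echo "docker group active in this session"
else
  echo "docker group not active yet; log out and back in"
fi
```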
## Step 2. Verify Docker setup and pull container

Pull the Open WebUI container image with integrated Ollama:

```bash
docker pull ghcr.io/open-webui/open-webui:ollama
```
@ -130,7 +134,8 @@ Press Enter to send the message and wait for the model's response.

Steps to completely remove the Open WebUI installation and free up resources:

> [!WARNING]
> These commands will permanently delete all Open WebUI data and downloaded models.

Stop and remove the Open WebUI container:

@ -151,9 +156,6 @@ Remove persistent data volumes:

```bash
docker volume rm open-webui open-webui-ollama
```

To roll back the permission change: `sudo deluser $USER docker`

## Step 9. Next steps

Try downloading different models from the Ollama library at https://ollama.com/library.

@ -168,7 +170,8 @@ docker pull ghcr.io/open-webui/open-webui:ollama

## Setup Open WebUI on Remote Spark with NVIDIA Sync

> [!TIP]
> If you haven't already installed NVIDIA Sync, [learn how here.](/spark/connect-to-your-spark/sync)

## Step 1. Configure Docker permissions

@ -184,17 +187,18 @@ If you see a permission denied error (something like `permission denied while tr

```bash
sudo usermod -aG docker $USER
newgrp docker
```

Test Docker access again. In the terminal, run:

```bash
docker ps
```

## Step 2. Verify Docker setup and pull container

This step confirms Docker is working properly and downloads the Open WebUI container
image. This runs on the DGX Spark device and may take several minutes depending on network speed.

Open a new Terminal app from NVIDIA Sync and pull the Open WebUI container image with integrated Ollama on your DGX Spark:

```bash
docker pull ghcr.io/open-webui/open-webui:ollama
```
@ -204,18 +208,15 @@ Once the container image is downloaded, continue to set up NVIDIA Sync.

## Step 3. Open NVIDIA Sync Settings

- Click on the NVIDIA Sync icon in your system tray or taskbar to open the main application window.
- Click the gear icon in the top right corner to open the Settings window.
- Click on the "Custom" tab to access Custom Ports configuration.

## Step 4. Add Open WebUI custom port

This step creates a new entry in NVIDIA Sync that will manage the Open
WebUI container and create the necessary SSH tunnel.

Click the "Add New" button on the Custom tab.

Fill out the form with these values:

@ -270,22 +271,23 @@ echo "Running. Press Ctrl+C to stop ${NAME}."

```bash
while :; do sleep 86400; done
```

Click the "Add" button to save the configuration to your DGX Spark.

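The script's final lines follow a common keep-alive pattern: print a status line, then block forever so NVIDIA Sync can stop the container when you close the entry. A minimal sketch (the `NAME` variable is defined earlier in the real start script, which is elided above; the value here is illustrative):

```bash
# Keep-alive pattern: announce, then block until interrupted.
NAME="open-webui"   # illustrative; the real script sets this earlier
echo "Running. Press Ctrl+C to stop ${NAME}."
# while :; do sleep 86400; done   # the real script blocks here; commented so the sketch exits
```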
## Step 5. Launch Open WebUI

This step starts the Open WebUI container on your DGX Spark and establishes the SSH
tunnel. The browser will open automatically if configured correctly.

Click on the NVIDIA Sync icon in your system tray or taskbar to open the main application window.

Under the "Custom" section, click on "Open WebUI".

Your default web browser should automatically open to the Open WebUI interface at `http://localhost:12000`.

> [!TIP]
> On first run, Open WebUI downloads models. This can delay server start and cause the page to fail to load in your browser. Simply wait and refresh the page.
> On future launches it will open quickly.

## Step 6. Create administrator account

To start using Open WebUI you must create an initial administrator account. This is a local account that you will use to access the Open WebUI interface.

In the Open WebUI interface, click the "Get Started" button at the bottom of the screen.

@ -295,14 +297,14 @@ Click the registration button to create your account and access the main interfa

## Step 7. Download and configure a model

Next, download a language model with Ollama and configure it for use in
Open WebUI. This download happens on your DGX Spark device and may take several minutes.

Click on the "Select a model" dropdown in the top left corner of the Open WebUI interface.

Type `gpt-oss:20b` in the search field.

Click the `Pull "gpt-oss:20b" from Ollama.com` button that appears.

Wait for the model download to complete. You can monitor progress in the interface.

@ -310,9 +312,6 @@ Once complete, select "gpt-oss:20b" from the model dropdown.

## Step 8. Test the model

This step verifies that the complete setup is working properly by testing model
inference through the web interface.

In the chat textarea at the bottom of the Open WebUI interface, enter:

@ -331,11 +330,40 @@ Under the "Custom" section, click the `x` icon on the right of the "Open WebUI"

This will close the tunnel and stop the Open WebUI docker container.

## Step 10. Troubleshooting

Common issues and their solutions:

| Symptom | Cause | Fix |
|---------|-------|-----|
| Permission denied on docker ps | User not in docker group | Run Step 1 completely, including terminal restart |
| Browser doesn't open automatically | Auto-open setting disabled | Manually navigate to localhost:12000 |
| Model download fails | Network connectivity issues | Check internet connection, retry download |
| GPU not detected in container | Missing `--gpus=all` flag | Recreate container with correct start script |
| Port 12000 already in use | Another application using port | Change port in Custom App settings or stop conflicting service |

## Step 11. Next steps

Try downloading different models from the Ollama library at https://ollama.com/library.

You can monitor GPU and memory usage through the DGX Dashboard available in NVIDIA Sync as you try different models.

If Open WebUI reports an update is available, you can pull the container image by running this in your terminal:

```bash
docker stop open-webui
docker rm open-webui
docker pull ghcr.io/open-webui/open-webui:ollama
```

After the update, launch Open WebUI again from NVIDIA Sync.

## Step 12. Cleanup and rollback

Steps to completely remove the Open WebUI installation and free up resources:

> [!WARNING]
> These commands will permanently delete all Open WebUI data and downloaded models.

Stop and remove the Open WebUI container:

@ -356,24 +384,8 @@ Remove persistent data volumes:

```bash
docker volume rm open-webui open-webui-ollama
```

To roll back the permission change: `sudo deluser $USER docker`

Remove the Custom App from NVIDIA Sync by opening Settings > Custom tab and deleting the entry.

## Troubleshooting

## Common issues with manual setup

@ -395,7 +407,8 @@ After the update, launch Open WebUI again from NVIDIA Sync.

| GPU not detected in container | Missing `--gpus=all` flag | Recreate container with correct start script |
| Port 12000 already in use | Another application using port | Change port in Custom App settings or stop conflicting service |

> [!NOTE]
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

@ -28,55 +28,57 @@ By default, this playbook leverages **Ollama** for local LLM inference, providin

- Knowledge triple extraction from text documents
- Knowledge graph construction and visualization
- Graph database integration with ArangoDB
- Sentence Transformers for efficient embedding generation
- Interactive knowledge graph visualization with Three.js WebGPU
- GPU-accelerated LLM inference with Ollama
- Fully containerized deployment with Docker Compose
- Decomposable and customizable
- Optional NVIDIA API integration for cloud-based models
- Optional vector search and advanced inference capabilities
- Optional graph-based RAG for contextual answers

## Software Components

### Core Components (Default)

* **LLM Inference**
  * **Ollama**: Local LLM inference with GPU acceleration
    * Default model: `llama3.1:8b`
    * Supports any Ollama-compatible model
* **Knowledge Graph Database**
  * **ArangoDB**: Graph database for storing knowledge triples (entities and relationships)
    * Web interface on port 8529
    * No authentication required (configurable)
* **Graph Visualization**
  * **Three.js WebGPU**: Client-side GPU-accelerated graph rendering
    * Optional remote WebGPU clustering for large graphs
* **Frontend & API**
  * **Next.js**: Modern React framework with API routes

### Optional Components

* **Vector Database & Embedding** (with `--complete` flag)
  * **SentenceTransformer**: Local embedding generation (model: `all-MiniLM-L6-v2`)
  * **Pinecone**: Self-hosted vector storage and similarity search
* **Cloud Models** (configure separately)
  * **NVIDIA API**: Cloud-based models via NVIDIA API Catalog

## Technical Diagram

### Default Architecture (Minimal Setup)

The core workflow for knowledge graph building and visualization:
1. User uploads documents through the txt2kg web UI
2. Documents are processed and chunked for analysis
3. **Ollama** extracts knowledge triples (subject-predicate-object) from the text using local LLM inference
4. Triples are stored in the **ArangoDB** graph database
5. Knowledge graph is visualized with **Three.js WebGPU** rendering in the browser
6. Users can query the graph and generate insights using Ollama
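The subject-predicate-object triples extracted in step 3 have a simple shape. This sketch just prints an illustrative one; the sentence and field names are examples, not the exact schema txt2kg uses:

```bash
# One knowledge triple extracted from "Marie Curie discovered polonium."
# (illustrative structure only; the real extraction is done by the LLM)
echo '{"subject": "Marie Curie", "predicate": "discovered", "object": "polonium"}'
```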

### Future Enhancements

Additional capabilities can be added:
- **Vector search**: Add semantic similarity search with local Pinecone and SentenceTransformer embeddings
- **S3 storage**: MinIO for scalable document storage
- **GNN-based GraphRAG**: Graph Neural Networks for enhanced retrieval

## GPU-Accelerated LLM Inference

@ -86,7 +88,7 @@ This playbook includes **GPU-accelerated LLM inference** with Ollama:

- **Fully local inference**: No cloud dependencies or API keys required
- **GPU acceleration**: Automatic CUDA support with NVIDIA GPUs
- **Multiple model support**: Use any Ollama-compatible model
- **Optimized inference**: Flash attention, KV cache optimization, and quantization
- **Easy model management**: Pull and switch models with simple commands
- **Privacy-first**: All data processing happens on your hardware

@ -96,21 +98,10 @@ This playbook includes **GPU-accelerated LLM inference** with Ollama:

- Flash attention enabled
- Q8_0 KV cache for memory efficiency

## Software Requirements

**OS Requirements:**
- Ubuntu 22.04 or later
- CUDA 12.0+
- Docker with NVIDIA Container Toolkit

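A quick way to sanity-check these requirements on a host is to look for the relevant tools on PATH; a generic sketch, not part of the playbook (it prints "missing" instead of failing when a tool is absent):

```bash
# Report which prerequisite tools are visible on PATH.
for tool in docker nvidia-smi; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing"
  fi
done
```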
## Deployment Guide

@ -120,9 +111,7 @@ This playbook includes **GPU-accelerated LLM inference** with Ollama:

The default configuration uses:
- Local Ollama (no API key needed)
- Local ArangoDB (no authentication by default)

Optional environment variables for customization:

@ -150,7 +139,6 @@ cd txt2kg

That's it! No configuration needed. The script will:
- Start all required services with Docker Compose
- Set up the ArangoDB database
- Launch Ollama with GPU acceleration
- Start the Next.js frontend

@ -168,8 +156,6 @@ docker exec ollama-compose ollama pull llama3.1:8b

- **Switch Ollama models**: Use any model from Ollama's library (Llama, Mistral, Qwen, etc.)
- **Modify extraction prompts**: Customize how triples are extracted from text
- **Adjust embedding parameters**: Change the SentenceTransformer model
- **Implement custom entity relationships**: Define domain-specific relationship types
- **Add domain-specific knowledge sources**: Integrate external ontologies or taxonomies
- **Use NVIDIA API**: Connect to cloud models for specific use cases

@ -177,4 +163,4 @@ docker exec ollama-compose ollama pull llama3.1:8b

[MIT](LICENSE)

This project will download and install additional third-party open source software projects and containers.

@ -5,34 +5,63 @@ This directory contains all deployment-related configuration for the txt2kg proj

## Structure

- **compose/**: Docker Compose files for local development and testing
  - `docker-compose.yml`: Minimal Docker Compose configuration (Ollama + ArangoDB + Next.js)
  - `docker-compose.complete.yml`: Complete stack with optional services (vLLM, Pinecone, Sentence Transformers)
  - `docker-compose.optional.yml`: Additional optional services
  - `docker-compose.vllm.yml`: Legacy vLLM configuration (use `--complete` flag instead)

- **app/**: Frontend application Docker configuration
  - Dockerfile for Next.js application

- **services/**: Containerized services
  - **ollama/**: Ollama LLM inference service with GPU support
  - **sentence-transformers/**: Sentence transformer service for embeddings (optional)
  - **vllm/**: vLLM inference service with FP8 quantization (optional)
  - **gpu-viz/**: GPU-accelerated graph visualization services (optional, run separately)
  - **gnn_model/**: Graph Neural Network model service (experimental, not in default compose files)

## Usage

**Recommended: Use the start script**

```bash
# Minimal setup (Ollama + ArangoDB + Next.js frontend)
./start.sh

# Complete stack (includes vLLM, Pinecone, Sentence Transformers)
./start.sh --complete

# Development mode (run frontend without Docker)
./start.sh --dev-frontend
```

**Manual Docker Compose commands:**

To start the minimal services:

```bash
docker compose -f deploy/compose/docker-compose.yml up -d
```

To start the complete stack:

```bash
docker compose -f deploy/compose/docker-compose.complete.yml up -d
```

## Services Included

### Minimal Stack (default)
- **Next.js App**: Web UI on port 3001
- **ArangoDB**: Graph database on port 8529
- **Ollama**: Local LLM inference on port 11434

### Complete Stack (`--complete` flag)
All minimal services plus:
- **vLLM**: Advanced LLM inference on port 8001
- **Pinecone (Local)**: Vector embeddings on port 5081
- **Sentence Transformers**: Embedding generation on port 8000

### Optional Services (run separately)
- **GPU-Viz Services**: See `services/gpu-viz/README.md` for GPU-accelerated visualization
- **GNN Model Service**: See `services/gnn_model/README.md` for experimental GNN-based RAG

@ -1,5 +1,3 @@

services:
  app:
    build:

@ -19,23 +17,26 @@ services:

      - MODEL_NAME=all-MiniLM-L6-v2
      - GRPC_SSL_CIPHER_SUITES=HIGH+ECDSA:HIGH+aRSA
      - NODE_TLS_REJECT_UNAUTHORIZED=0
      - OLLAMA_BASE_URL=http://ollama:11434/v1
      - OLLAMA_MODEL=llama3.1:8b
      - VLLM_BASE_URL=http://vllm:8001/v1
      - VLLM_MODEL=meta-llama/Llama-3.2-3B-Instruct
      - REMOTE_WEBGPU_SERVICE_URL=http://txt2kg-remote-webgpu:8083
      # Node.js timeout configurations for large model processing
      - NODE_OPTIONS=--max-http-header-size=80000
      - UV_THREADPOOL_SIZE=128
      - HTTP_TIMEOUT=1800000
      - REQUEST_TIMEOUT=1800000
    networks:
      - txt2kg-network
    depends_on:
      - arangodb
      - ollama

  arangodb:
    image: arangodb:latest

@ -89,52 +90,93 @@ services:

    networks:
      - default

  ollama:
    build:
      context: ../services/ollama
      dockerfile: Dockerfile
    image: ollama-custom:latest
    container_name: ollama-compose
    ports:
      - '11434:11434'
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_KEEP_ALIVE=30m
      - OLLAMA_CUDA=1
      - OLLAMA_LLM_LIBRARY=cuda
      - OLLAMA_NUM_PARALLEL=1
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_KV_CACHE_TYPE=q8_0
      - OLLAMA_GPU_LAYERS=999
      - OLLAMA_GPU_MEMORY_FRACTION=0.9
      - CUDA_VISIBLE_DEVICES=0
    networks:
      - default
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

createbucket:
|
||||
image: minio/mc
|
||||
depends_on:
|
||||
- minio
|
||||
entrypoint: >
|
||||
/bin/sh -c "
|
||||
sleep 5;
|
||||
/usr/bin/mc config host add myminio http://minio:9000 minioadmin minioadmin;
|
||||
/usr/bin/mc mb myminio/txt2kg;
|
||||
/usr/bin/mc policy set public myminio/txt2kg;
|
||||
exit 0;
|
||||
"
|
||||
vllm:
|
||||
build:
|
||||
context: ../../deploy/services/vllm
|
||||
dockerfile: Dockerfile
|
||||
container_name: vllm-service
|
||||
ports:
|
||||
- '8001:8001'
|
||||
environment:
|
||||
- VLLM_MODEL=meta-llama/Llama-3.2-3B-Instruct
|
||||
- VLLM_TENSOR_PARALLEL_SIZE=1
|
||||
- VLLM_MAX_MODEL_LEN=4096
|
||||
- VLLM_GPU_MEMORY_UTILIZATION=0.9
|
||||
- VLLM_QUANTIZATION=fp8
|
||||
- VLLM_KV_CACHE_DTYPE=fp8
|
||||
- VLLM_PORT=8001
|
||||
- VLLM_HOST=0.0.0.0
|
||||
- CUDA_VISIBLE_DEVICES=0
|
||||
- NCCL_DEBUG=INFO
|
||||
volumes:
|
||||
- vllm_models:/app/models
|
||||
- /tmp:/tmp
|
||||
- ~/.cache/huggingface:/root/.cache/huggingface
|
||||
networks:
|
||||
- s3-net
|
||||
- default
|
||||
restart: unless-stopped
|
||||
deploy:
|
||||
resources:
|
||||
reservations:
|
||||
devices:
|
||||
- driver: nvidia
|
||||
count: 1
|
||||
capabilities: [gpu]
|
||||
healthcheck:
|
||||
test: ["CMD", "curl", "-f", "http://localhost:8001/v1/models"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 5
|
||||
start_period: 120s
|
||||
|
||||
volumes:
|
||||
arangodb_data:
|
||||
arangodb_apps_data:
|
||||
minio_data:
|
||||
ollama_data:
|
||||
vllm_models:
|
||||
|
||||
networks:
|
||||
pinecone-net:
|
||||
name: pinecone
|
||||
s3-net:
|
||||
name: s3-network
|
||||
  default:
    driver: bridge
  txt2kg-network:
    driver: bridge

@@ -0,0 +1,86 @@
services:
  app:
    environment:
      - PINECONE_HOST=entity-embeddings
      - PINECONE_PORT=5081
      - PINECONE_API_KEY=pclocal
      - PINECONE_ENVIRONMENT=local
      - SENTENCE_TRANSFORMER_URL=http://sentence-transformers:80
      - MODEL_NAME=all-MiniLM-L6-v2
      - VLLM_BASE_URL=http://vllm:8001/v1
      - VLLM_MODEL=meta-llama/Llama-3.2-3B-Instruct
    networks:
      - pinecone-net
    depends_on:
      - entity-embeddings
      - sentence-transformers
      - vllm

  entity-embeddings:
    image: ghcr.io/pinecone-io/pinecone-index:latest
    container_name: entity-embeddings
    environment:
      PORT: 5081
      INDEX_TYPE: serverless
      VECTOR_TYPE: dense
      DIMENSION: 384
      METRIC: cosine
      INDEX_NAME: entity-embeddings
    ports:
      - "5081:5081"
    platform: linux/amd64
    networks:
      - pinecone-net
    restart: unless-stopped

  sentence-transformers:
    build:
      context: ../../deploy/services/sentence-transformers
      dockerfile: Dockerfile
    ports:
      - '8000:80'
    environment:
      - MODEL_NAME=all-MiniLM-L6-v2
    networks:
      - default

  vllm:
    build:
      context: ../../deploy/services/vllm
      dockerfile: Dockerfile
    container_name: vllm-service
    ports:
      - '8001:8001'
    environment:
      - VLLM_MODEL=meta-llama/Llama-3.2-3B-Instruct
      - VLLM_TENSOR_PARALLEL_SIZE=1
      - VLLM_MAX_MODEL_LEN=4096
      - VLLM_GPU_MEMORY_UTILIZATION=0.9
      - VLLM_QUANTIZATION=fp8
      - VLLM_KV_CACHE_DTYPE=fp8
      - VLLM_PORT=8001
      - VLLM_HOST=0.0.0.0
    volumes:
      - vllm_models:/app/models
      - /tmp:/tmp
    networks:
      - default
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8001/v1/models"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

volumes:
  vllm_models:

networks:
  pinecone-net:
    name: pinecone

@@ -1,3 +1,7 @@
# This is a legacy file - use --with-optional flag instead
# The vLLM service is now included in docker-compose.optional.yml
# This file is kept for backwards compatibility

services:
  app:
    build:

@@ -8,20 +8,11 @@ services:
    environment:
      - ARANGODB_URL=http://arangodb:8529
      - ARANGODB_DB=txt2kg
      - PINECONE_HOST=entity-embeddings
      - PINECONE_PORT=5081
      - PINECONE_API_KEY=pclocal
      - PINECONE_ENVIRONMENT=local
      - LANGCHAIN_TRACING_V2=true
      - SENTENCE_TRANSFORMER_URL=http://sentence-transformers:80
      - MODEL_NAME=all-MiniLM-L6-v2
      - GRPC_SSL_CIPHER_SUITES=HIGH+ECDSA:HIGH+aRSA
      - NODE_TLS_REJECT_UNAUTHORIZED=0
      # - XAI_API_KEY=${XAI_API_KEY} # xAI integration removed
      - OLLAMA_BASE_URL=http://ollama:11434/v1
      - OLLAMA_MODEL=llama3.1:8b
      - VLLM_BASE_URL=http://vllm:8001/v1
      - VLLM_MODEL=meta-llama/Llama-3.2-3B-Instruct
      - REMOTE_WEBGPU_SERVICE_URL=http://txt2kg-remote-webgpu:8083
      # Node.js timeout configurations for large model processing
      - NODE_OPTIONS=--max-http-header-size=80000
@@ -29,9 +20,11 @@ services:
      - HTTP_TIMEOUT=1800000
      - REQUEST_TIMEOUT=1800000
    networks:
      - pinecone-net
      - default
      - txt2kg-network
    depends_on:
      - arangodb
      - ollama
  arangodb:
    image: arangodb:latest
    ports:
@@ -54,32 +47,6 @@ services:
      echo 'Creating txt2kg database...' &&
      arangosh --server.endpoint tcp://arangodb:8529 --server.authentication false --javascript.execute-string 'try { db._createDatabase(\"txt2kg\"); console.log(\"Database txt2kg created successfully!\"); } catch(e) { if(e.message.includes(\"duplicate\")) { console.log(\"Database txt2kg already exists\"); } else { throw e; } }'
      "
  entity-embeddings:
    image: ghcr.io/pinecone-io/pinecone-index:latest
    container_name: entity-embeddings
    environment:
      PORT: 5081
      INDEX_TYPE: serverless
      VECTOR_TYPE: dense
      DIMENSION: 384
      METRIC: cosine
      INDEX_NAME: entity-embeddings
    ports:
      - "5081:5081"
    platform: linux/amd64
    networks:
      - pinecone-net
    restart: unless-stopped
  sentence-transformers:
    build:
      context: ../../deploy/services/sentence-transformers
      dockerfile: Dockerfile
    ports:
      - '8000:80'
    environment:
      - MODEL_NAME=all-MiniLM-L6-v2
    networks:
      - default
  ollama:
    build:
      context: ../services/ollama
@@ -117,52 +84,14 @@ services:
      timeout: 10s
      retries: 3
      start_period: 60s
  vllm:
    build:
      context: ../../deploy/services/vllm
      dockerfile: Dockerfile
    container_name: vllm-service
    ports:
      - '8001:8001'
    environment:
      - VLLM_MODEL=meta-llama/Llama-3.2-3B-Instruct
      - VLLM_TENSOR_PARALLEL_SIZE=1
      - VLLM_MAX_MODEL_LEN=4096
      - VLLM_GPU_MEMORY_UTILIZATION=0.9
      - VLLM_QUANTIZATION=fp8
      - VLLM_KV_CACHE_DTYPE=fp8
      - VLLM_PORT=8001
      - VLLM_HOST=0.0.0.0
    volumes:
      - vllm_models:/app/models
      - /tmp:/tmp
    networks:
      - default
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8001/v1/models"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

volumes:
  arangodb_data:
  arangodb_apps_data:
  ollama_data:
  vllm_models:

networks:
  pinecone-net:
    name: pinecone
  default:
    driver: bridge
  txt2kg-network:
    driver: bridge

@@ -1,44 +1,66 @@
# GNN Model Service (Experimental)

**Status**: This is an experimental service for serving Graph Neural Network models trained for enhanced RAG retrieval.

**Note**: This service is **not included** in the default docker-compose configurations and must be deployed separately.

## Overview

This service provides a REST API for serving predictions from a Graph Neural Network (GNN) model that enhances knowledge graph retrieval:

- Load pre-trained GNN models (GAT architecture)
- Process queries with graph-structured knowledge
- Combine GNN embeddings with LLM generation
- Compare GNN-based retrieval vs traditional RAG

## Getting Started

### Prerequisites

- Python 3.8+
- PyTorch and PyTorch Geometric
- A trained model file (created using `train_export.py` in `scripts/gnn/`)
- Docker (optional)

### Training the Model

Before using the service, you must train a GNN model using the training pipeline:

```bash
# See scripts/gnn/README.md for full instructions

# 1. Preprocess data from ArangoDB
python scripts/gnn/preprocess_data.py --use_arango --output_dir ./output

# 2. Train the model
python scripts/gnn/train_test_gnn.py --output_dir ./output

# 3. Export model for serving
python deploy/services/gnn_model/train_export.py --output_dir models
```

This creates the `tech-qa-model.pt` file needed by the service.

### Running the Service

#### Option A: Direct Python

```bash
cd deploy/services/gnn_model
pip install -r requirements.txt
python app.py
```

Service runs on: http://localhost:5000

#### Option B: Docker

```bash
cd deploy/services/gnn_model
docker build -t gnn-model-service .
docker run -p 5000:5000 -v $(pwd)/models:/app/models gnn-model-service
```

## API Endpoints

@@ -89,7 +111,26 @@ The GNN model service uses:
- A Language Model (LLM) to generate answers
- A combined architecture (GRetriever) that leverages both components

## Integration with txt2kg

To integrate this service with the main txt2kg application:

1. Train a model using the GNN training pipeline
2. Deploy the GNN service on a separate port
3. Update the frontend to call the GNN service endpoints
4. Compare GNN-enhanced retrieval vs standard RAG

## Current Status

This is an experimental feature. The service code exists but requires:
- A trained GNN model
- Integration with the frontend query pipeline
- Graph construction from txt2kg knowledge graphs
- Performance benchmarking vs traditional RAG

## Future Enhancements

- Docker Compose integration for easier deployment
- Automatic model training from txt2kg graphs
- Real-time model updates as graphs grow
- Comparison UI in the frontend

@@ -1,305 +0,0 @@
# GPU Rendering Library Options for Remote Visualization

## 🎯 **Yes! Three.js is Perfect for Adding GPU Rendering**

Your existing **Three.js v0.176.0** stack is ideal for adding true GPU-accelerated WebGL rendering to the remote service. Here's a comprehensive comparison of options:

## 🚀 **Option 1: Three.js (Recommended)**

### **Why Three.js is Perfect**
- ✅ **Already in your stack** - Three.js v0.176.0 in package.json
- ✅ **Mature WebGL abstraction** - Handles GPU complexity
- ✅ **InstancedMesh for performance** - Single draw call for millions of nodes
- ✅ **Built-in optimizations** - Frustum culling, LOD, memory management
- ✅ **Easy development** - High-level API, good documentation

### **Three.js GPU Features for Graph Rendering**

#### **1. InstancedMesh for Mass Node Rendering**
```javascript
// Single GPU draw call for 100k+ nodes
const geometry = new THREE.CircleGeometry(1, 8);
const material = new THREE.MeshBasicMaterial({ vertexColors: true });
const instancedMesh = new THREE.InstancedMesh(geometry, material, nodeCount);

// Set position, scale, color for each instance
const matrix = new THREE.Matrix4();
const color = new THREE.Color();

nodes.forEach((node, i) => {
  matrix.makeScale(node.size, node.size, 1);
  matrix.setPosition(node.x, node.y, 0);
  instancedMesh.setMatrixAt(i, matrix);

  color.setHex(node.clusterColor);
  instancedMesh.setColorAt(i, color);
});

// GPU renders all nodes in one call
scene.add(instancedMesh);
```

#### **2. BufferGeometry for Edge Performance**
```javascript
// GPU-optimized edge rendering
const positions = new Float32Array(edgeCount * 6);
const colors = new Float32Array(edgeCount * 6);

edges.forEach((edge, i) => {
  const idx = i * 6;
  // Source vertex
  positions[idx] = edge.source.x;
  positions[idx + 1] = edge.source.y;
  // Target vertex
  positions[idx + 3] = edge.target.x;
  positions[idx + 4] = edge.target.y;
});

const geometry = new THREE.BufferGeometry();
geometry.setAttribute('position', new THREE.BufferAttribute(positions, 3));
geometry.setAttribute('color', new THREE.BufferAttribute(colors, 3));

const lineSegments = new THREE.LineSegments(geometry, material);
```

#### **3. Built-in Performance Optimizations**
```javascript
// Three.js GPU optimizations
renderer.sortObjects = false; // Disable expensive sorting
renderer.setPixelRatio(Math.min(devicePixelRatio, 2)); // Limit pixel density

// Frustum culling (automatic)
// Level-of-detail (LOD) support
// Automatic geometry merging
// GPU texture atlasing
```

### **Performance Comparison**

| Approach | 10k Nodes | 100k Nodes | 1M Nodes | FPS |
|----------|-----------|------------|----------|-----|
| **D3.js SVG** | ✅ Good | ❌ Slow | ❌ Unusable | 15fps |
| **Three.js Standard** | ✅ Excellent | ✅ Good | ❌ Slow | 45fps |
| **Three.js Instanced** | ✅ Excellent | ✅ Excellent | ✅ Good | 60fps |

## 🔧 **Option 2: deck.gl (For Data-Heavy Visualizations)**

### **Pros**
- ✅ **Built for large datasets** - Optimized for millions of points
- ✅ **WebGL2 compute shaders** - True GPU computation
- ✅ **Built-in graph layouts** - Force-directed on GPU
- ✅ **Excellent performance** - 1M+ nodes at 60fps

### **Cons**
- ❌ **Large bundle size** - Adds ~500KB
- ❌ **Complex API** - Steeper learning curve
- ❌ **React-focused** - Less suitable for iframe embedding

```javascript
// deck.gl GPU-accelerated approach
import { ScatterplotLayer, LineLayer } from '@deck.gl/layers';

const nodeLayer = new ScatterplotLayer({
  data: nodes,
  getPosition: d => [d.x, d.y],
  getRadius: d => d.size,
  getFillColor: d => d.color,
  radiusUnits: 'pixels',
  // GPU instancing automatically enabled
});

const edgeLayer = new LineLayer({
  data: edges,
  getSourcePosition: d => [d.source.x, d.source.y],
  getTargetPosition: d => [d.target.x, d.target.y],
  getColor: [100, 100, 100],
  getWidth: 1
});
```

## ⚡ **Option 3: regl (Raw WebGL Performance)**

### **Pros**
- ✅ **Maximum performance** - Direct WebGL access
- ✅ **Small bundle** - ~50KB
- ✅ **Full control** - Custom shaders, compute pipelines
- ✅ **Functional API** - Clean, predictable

### **Cons**
- ❌ **Low-level complexity** - Manual memory management
- ❌ **Shader development** - GLSL programming required
- ❌ **More development time** - Everything custom

```javascript
// regl direct WebGL approach
const drawNodes = regl({
  vert: `
    attribute vec2 position;
    attribute float size;
    attribute vec3 color;
    varying vec3 vColor;

    void main() {
      gl_Position = vec4(position, 0, 1);
      gl_PointSize = size;
      vColor = color;
    }
  `,

  frag: `
    precision mediump float;
    varying vec3 vColor;

    void main() {
      gl_FragColor = vec4(vColor, 1);
    }
  `,

  attributes: {
    position: nodePositions,
    size: nodeSizes,
    color: nodeColors
  },

  count: nodeCount,
  primitive: 'points'
});
```

## 🎮 **Option 4: WebGPU (Future-Proof)**

### **Pros**
- ✅ **Next-generation API** - Successor to WebGL
- ✅ **Compute shaders** - True parallel processing
- ✅ **Better performance** - Lower overhead
- ✅ **Multi-threading** - Parallel command buffers

### **Cons**
- ❌ **Limited browser support** - Chrome/Edge only (2024)
- ❌ **New API** - Rapidly changing specification
- ❌ **Complex setup** - More verbose than WebGL

```javascript
// WebGPU approach (future)
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();

const computePipeline = device.createComputePipeline({
  compute: {
    module: device.createShaderModule({
      code: `
        @compute @workgroup_size(64)
        fn main(@builtin(global_invocation_id) global_id : vec3<u32>) {
          let index = global_id.x;
          if (index >= arrayLength(&positions)) { return; }

          // GPU-parallel force calculation
          var force = vec2<f32>(0.0, 0.0);
          for (var i = 0u; i < arrayLength(&positions); i++) {
            if (i != index) {
              let diff = positions[index] - positions[i];
              let dist = length(diff);
              force += normalize(diff) * (1.0 / (dist * dist));
            }
          }

          velocities[index] += force * 0.01;
          positions[index] += velocities[index] * 0.1;
        }
      `
    }),
    entryPoint: 'main'
  }
});
```

## 🏆 **Recommendation: Three.js Integration**

### **For Your Use Case, Three.js is Optimal Because:**

1. **Already Available** - No new dependencies
2. **Proven Performance** - Handles 100k+ nodes smoothly
3. **Easy Integration** - Replace D3.js rendering with Three.js
4. **Maintenance** - Well-documented, stable API
5. **Development Speed** - Rapid implementation

### **Implementation Strategy**

#### **Phase 1: Basic Three.js WebGL (Week 1)**
```python
# Enhanced remote service with Three.js
def _generate_threejs_html(self, session_data, config):
    return f"""
    <script src="https://cdnjs.cloudflare.com/ajax/libs/three.js/0.176.0/three.min.js"></script>
    <script>
      // Basic Three.js WebGL rendering
      const renderer = new THREE.WebGLRenderer({{
        powerPreference: "high-performance"
      }});
      const scene = new THREE.Scene();
      const camera = new THREE.PerspectiveCamera(75, width/height, 0.1, 1000);

      // Render nodes and edges with GPU
      createNodeVisualization();
      createEdgeVisualization();
    </script>
    """
```

#### **Phase 2: GPU Optimization (Week 2)**
- Add InstancedMesh for node rendering
- Implement BufferGeometry for edges
- Enable frustum culling and LOD

#### **Phase 3: Advanced Features (Week 3)**
- GPU-based interaction (raycasting)
- Smooth camera controls
- Real-time layout animation

### **Expected Performance Improvements**

| Feature | D3.js SVG | Three.js WebGL | Improvement |
|---------|-----------|----------------|-------------|
| **50k nodes** | 5 FPS | 60 FPS | **12x faster** |
| **Animation** | Choppy | Smooth | **Fluid motion** |
| **Memory usage** | 200MB DOM | 50MB GPU | **4x less memory** |
| **Interaction** | Laggy | Responsive | **Real-time** |
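
As a rough sanity check on the memory column above, here is a small sketch. The ~64 bytes per node is an assumption borrowed from the capability-detection heuristic used elsewhere in this project, not a measured figure:

```javascript
// Rough estimate of client-side geometry memory for a graph, assuming
// ~64 bytes of position/scale/color attribute data per node (an assumption,
// mirroring the frontend's capability-detection heuristic).
function estimatedMemoryMB(nodeCount, bytesPerNode = 64) {
  return (nodeCount * bytesPerNode) / (1024 * 1024);
}

// 50k nodes come out around 3 MB of raw attribute data; the much larger
// D3.js SVG figure in the table reflects per-node DOM overhead, not attributes.
```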

## 💡 **Implementation Roadmap**

### **Step 1: Replace HTML Template**
```python
# In remote_gpu_rendering_service.py
def _generate_interactive_html(self, session_data, config):
    if config.get('use_webgl', True):
        return self._generate_threejs_webgl_html(session_data, config)
    else:
        return self._generate_d3_svg_html(session_data, config)  # Fallback
```

### **Step 2: Add WebGL Configuration**
```typescript
// In RemoteGPUViewer component
const processWithWebGLOptimization = async () => {
  const config = {
    use_webgl: nodeCount > 5000,
    instanced_rendering: nodeCount > 10000,
    lod_enabled: nodeCount > 25000,
    render_quality: 'high'
  };
  // Process with enhanced GPU service
};
```

### **Step 3: Performance Monitoring**
```javascript
// Built-in Three.js performance monitoring
console.log('Render Info:', {
  triangles: renderer.info.render.triangles,
  calls: renderer.info.render.calls,
  geometries: renderer.info.memory.geometries,
  textures: renderer.info.memory.textures
});
```

**Result**: Your remote GPU service will provide **true GPU-accelerated rendering** with minimal development effort by leveraging your existing Three.js stack.

@@ -1,264 +0,0 @@
# JavaScript Library Stack Integration with Remote GPU Rendering

## 🚀 **Library Architecture Overview**

Your project leverages a sophisticated JavaScript stack optimized for graph visualization performance:

### **Core Visualization Libraries**
```json
{
  "3d-force-graph": "^1.77.0",   // WebGL 3D graph rendering
  "three": "^0.176.0",           // WebGL/WebGPU 3D engine
  "d3": "^7.9.0",                // Data binding & force simulation
  "@types/d3": "^7.4.3",         // TypeScript definitions
  "@types/three": "^0.175.0"     // Three.js TypeScript support
}
```

### **Frontend Framework**
```json
{
  "next": "15.1.0",          // React framework with SSR
  "react": "^19",            // Component architecture
  "tailwindcss": "^3.4.17"   // Utility-first CSS
}
```

## 🎯 **Performance Optimization Strategies**

### **1. Dynamic Import Strategy**

**Problem:** Large visualization libraries increase initial bundle size
**Solution:** Conditional loading based on graph complexity

```typescript
// ForceGraphWrapper.tsx - Dynamic loading pattern
const ForceGraph3D = (await import('3d-force-graph')).default;

// Benefits:
// - Reduces initial bundle by ~2MB
// - Enables GPU capability detection
// - Prevents SSR WebGL conflicts
```

### **2. GPU Capability Detection**

**Enhanced detection based on your library capabilities:**

```typescript
const shouldUseRemoteRendering = (nodeCount: number) => {
  const hasWebGPU = 'gpu' in navigator;
  const maxWebGLNodes = window.WebGL2RenderingContext ? 50000 : 10000;
  const maxWebGPUNodes = hasWebGPU ? 100000 : 25000;

  // Three.js geometry memory limits
  const estimatedMemoryMB = (nodeCount * 64) / (1024 * 1024);
  const maxClientMemory = hasWebGPU ? 512 : 256; // MB

  return nodeCount > maxWebGLNodes || estimatedMemoryMB > maxClientMemory;
};
```

### **3. Library-Specific Optimizations**

#### **Three.js Renderer Settings**
```typescript
const optimizeForThreeJS = (nodeCount: number) => ({
  // Instanced rendering for large graphs
  instance_rendering: nodeCount > 10000,

  // Texture optimization
  texture_atlasing: nodeCount > 5000,
  max_texture_size: nodeCount > 25000 ? 2048 : 1024,

  // Performance culling
  frustum_culling: nodeCount > 15000,
  occlusion_culling: nodeCount > 25000,

  // Level-of-detail for distant nodes
  enable_lod: nodeCount > 25000
});
```

#### **D3.js Force Simulation Tuning**
```typescript
const optimizeForD3 = (nodeCount: number) => ({
  // Reduced iterations for large graphs
  physics_iterations: nodeCount > 50000 ? 100 : 300,

  // Faster convergence
  alpha_decay: nodeCount > 50000 ? 0.05 : 0.02,

  // More damping for stability
  velocity_decay: nodeCount > 50000 ? 0.6 : 0.4
});
```

## 🔧 **Remote GPU Service Integration**

### **Enhanced HTML Template Generation**

The remote GPU service now generates HTML compatible with your frontend:

```python
def _generate_interactive_html(self, session_data: dict, config: dict) -> str:
    html_template = f"""
    <!-- Using D3.js v7.9.0 consistent with frontend -->
    <script src="https://d3js.org/d3.v7.min.js"></script>

    <script>
      // Configuration matching your library versions
      const config = {{
        d3_version: "7.9.0",            // Match package.json
        threejs_version: "0.176.0",     // Match package.json
        force_graph_version: "1.77.0",  // Match package.json

        // Performance settings based on render quality
        maxParticles: {settings['particles']},
        lineWidth: {settings['line_width']},
        nodeDetail: {settings['node_detail']}
      }};

      // D3 force simulation with GPU-optimized parameters
      this.simulation = d3.forceSimulation()
        .force("link", d3.forceLink().id(d => d.id).distance(60))
        .force("charge", d3.forceManyBody().strength(-120))
        .force("center", d3.forceCenter(this.width / 2, this.height / 2))
        .alphaDecay(0.02)
        .velocityDecay(0.4);
    </script>
    """
```

### **Frontend Component Integration**

```typescript
// RemoteGPUViewer.tsx - Library-aware processing
const processGraphWithLibraryOptimization = async () => {
  const optimizedConfig = {
    // Frontend library compatibility
    d3_version: "7.9.0",
    threejs_version: "0.176.0",
    force_graph_version: "1.77.0",

    // WebGL optimization features
    webgl_features: {
      instance_rendering: nodeCount > 10000,
      texture_atlasing: nodeCount > 5000,
      frustum_culling: nodeCount > 15000
    },

    // Performance tuning
    progressive_loading: nodeCount > 25000,
    gpu_memory_management: true
  };

  const response = await fetch('/api/render', {
    method: 'POST',
    body: JSON.stringify({ graph_data, config: optimizedConfig })
  });
};
```

## 📊 **Performance Benchmarks by Library Stack**

### **Client-Side Rendering Limits**

| Library Stack | Max Nodes | Memory Usage | Performance |
|---------------|-----------|--------------|-------------|
| **D3.js + SVG** | 5,000 | ~50MB | Good interaction |
| **Three.js + WebGL** | 50,000 | ~256MB | Smooth 60fps |
| **Three.js + WebGPU** | 100,000 | ~512MB | GPU-accelerated |
| **Remote GPU** | 1M+ | ~100KB transfer | Server-rendered |

### **Rendering Strategy Decision Tree**

```typescript
const selectRenderingStrategy = (nodeCount: number) => {
  const hasWebGPU = 'gpu' in navigator;

  if (nodeCount < 5000) {
    return "local_svg";       // D3.js + SVG DOM
  } else if (nodeCount < 25000) {
    return "local_webgl";     // Three.js + WebGL
  } else if (nodeCount < 100000 && hasWebGPU) {
    return "local_webgpu";    // Three.js + WebGPU
  } else {
    return "remote_gpu";      // Remote cuGraph + GPU
  }
};
```

## 🚀 **Advanced Integration Features**

### **1. Progressive Loading**
```typescript
// For graphs >25k nodes, enable progressive loading
if (nodeCount > 25000) {
  config.progressive_loading = true;
  config.initial_load_size = 10000; // Load first 10k nodes
  config.batch_size = 5000;         // Load 5k at a time
}
```
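
As an illustration of how those settings translate into load batches, here is a small sketch; `planBatches` and its return shape are illustrative helpers, not part of the actual frontend code:

```javascript
// Sketch: plan progressive-loading batches from initial_load_size and
// batch_size values like the config above. planBatches is hypothetical.
function planBatches(totalNodes, initialLoadSize = 10000, batchSize = 5000) {
  const batches = [Math.min(initialLoadSize, totalNodes)];
  let loaded = batches[0];
  while (loaded < totalNodes) {
    const next = Math.min(batchSize, totalNodes - loaded);
    batches.push(next);
    loaded += next;
  }
  return batches;
}

// A 27,000-node graph loads as an initial 10k chunk followed by 5k batches,
// with a final smaller remainder batch.
```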

### **2. WebSocket Real-time Updates**
```typescript
// Real-time parameter updates via WebSocket
const updateLayoutAlgorithm = (algorithm: string) => {
  if (wsRef.current?.readyState === WebSocket.OPEN) {
    wsRef.current.send(JSON.stringify({
      type: "update_params",
      layout_algorithm: algorithm
    }));
  }
};
```

### **3. Memory-Aware Quality Settings**
```typescript
const adjustQuality = (availableMemory: number, nodeCount: number) => {
  if (availableMemory < 256) return "low";    // Mobile devices
  if (availableMemory < 512) return "medium"; // Standard devices
  if (nodeCount > 100000) return "high";      // Large graphs
  return "ultra";                             // High-end systems
};
```
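
A few concrete data points for those thresholds, restating the function so the snippet runs standalone:

```javascript
// Same thresholds as adjustQuality above, restated for a self-contained run.
const adjustQuality = (availableMemory, nodeCount) => {
  if (availableMemory < 256) return "low";    // Mobile devices
  if (availableMemory < 512) return "medium"; // Standard devices
  if (nodeCount > 100000) return "high";      // Large graphs
  return "ultra";                             // High-end systems
};

console.log(adjustQuality(128, 1000));    // "low"   - memory-constrained device
console.log(adjustQuality(384, 1000));    // "medium"
console.log(adjustQuality(1024, 500000)); // "high"  - node count caps quality
console.log(adjustQuality(1024, 1000));   // "ultra"
```

Note the memory checks take precedence: a 500k-node graph on a 128 MB device still gets "low".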

## 💡 **Best Practices for Your Stack**

### **1. Bundle Optimization**
- Use dynamic imports for 3D libraries
- Lazy load based on graph size detection
- Implement service worker caching for repeated visualizations

### **2. Memory Management**
```typescript
// Cleanup Three.js resources
const cleanup = () => {
  if (graphRef.current) {
    graphRef.current.scene?.traverse((object) => {
      if (object.geometry) object.geometry.dispose();
      if (object.material) object.material.dispose();
    });
    graphRef.current.renderer?.dispose();
  }
};
```

### **3. Responsive Rendering**
```typescript
// Adjust complexity based on device capabilities
const getDeviceCapabilities = () => ({
  memory: (navigator as any).deviceMemory || 4, // GB
  cores: navigator.hardwareConcurrency || 4,
  gpu: 'gpu' in navigator ? 'webgpu' : 'webgl'
});
```

## 🎯 **Integration Results**

✅ **Seamless fallback** between local and remote rendering
✅ **Library version consistency** across client and server
✅ **Memory-aware quality adjustment** based on device capabilities
✅ **Progressive enhancement** from SVG → WebGL → WebGPU → Remote GPU
✅ **Real-time parameter updates** via WebSocket
✅ **Zero-config optimization** based on graph complexity

This integration provides the best of both worlds: the interactivity of your existing Three.js/D3.js stack for smaller graphs, and the scalability of remote GPU processing for large-scale visualizations.

@@ -1,48 +1,68 @@
# GPU Graph Visualization Services

## 🚀 Overview

This directory contains optional GPU-accelerated graph visualization services that run separately from the main txt2kg application. These services provide advanced visualization capabilities for large-scale graphs.

**Note**: These services are **optional** and not included in the default docker-compose configurations. They must be run separately.

## 📦 Available Services

### 1. Unified GPU Service (`unified_gpu_service.py`)
Combines **PyGraphistry Cloud** and **Local GPU (cuGraph)** processing into a single FastAPI service.

**Processing Modes:**

| Mode | Description | Requirements |
|------|-------------|--------------|
| **PyGraphistry Cloud** | Interactive GPU embeds in browser | API credentials |
| **Local GPU (cuGraph)** | Full GPU processing on your hardware | NVIDIA GPU + cuGraph |
| **Local CPU** | NetworkX fallback processing | None |
|
||||
|
||||
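
The fallback order in the table can be sketched as a simple selection routine. This is an illustrative sketch only; the function and mode names are assumptions, not the service's actual API:

```python
def select_processing_mode(has_gpu: bool, has_graphistry_credentials: bool) -> str:
    """Pick the best available processing mode, preferring cloud, then local GPU.

    Illustrative only -- the real service performs equivalent capability checks
    (credential lookup, cuGraph import) at startup.
    """
    if has_graphistry_credentials:
        return "pygraphistry-cloud"
    if has_gpu:
        return "local-gpu-cugraph"
    # NetworkX CPU processing always works, so it is the final fallback
    return "local-cpu-networkx"
```
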
## 🛠️ Quick Setup
### 2. Remote GPU Rendering Service (`remote_gpu_rendering_service.py`)
Provides GPU-accelerated graph layout and rendering with iframe-embeddable visualizations.

### 3. Local GPU Service (`local_gpu_viz_service.py`)
Local GPU processing service with WebSocket support for real-time updates.

## 🛠️ Setup

### Prerequisites
- NVIDIA GPU with CUDA support (for GPU modes)
- RAPIDS cuGraph (for local GPU processing)
- PyGraphistry account (for cloud mode)

### Installation

### 1. Set Environment Variables (Optional)
```bash
# For PyGraphistry Cloud features
export GRAPHISTRY_PERSONAL_KEY="your_personal_key"
export GRAPHISTRY_SECRET_KEY="your_secret_key"
# Install dependencies
pip install -r deploy/services/gpu-viz/requirements.txt

# For remote WebGPU service
pip install -r deploy/services/gpu-viz/requirements-remote-webgpu.txt
```

### 2. Run the Service
### Running Services

#### Option A: Direct Python
#### Unified GPU Service
```bash
cd services
cd deploy/services/gpu-viz
python unified_gpu_service.py
```

#### Option B: Using Startup Script
Service runs on: http://localhost:8080

#### Remote GPU Rendering Service
```bash
cd services
./start_gpu_services.sh
cd deploy/services/gpu-viz
python remote_gpu_rendering_service.py
```

#### Option C: Docker (NVIDIA PyG Container)
Service runs on: http://localhost:8082

#### Using Startup Script
```bash
cd services
docker build -t unified-gpu-viz .
docker run --gpus all -p 8080:8080 \
  -e GRAPHISTRY_PERSONAL_KEY="your_key" \
  -e GRAPHISTRY_SECRET_KEY="your_secret" \
  unified-gpu-viz
cd deploy/services/gpu-viz
./start_remote_gpu_services.sh
```

## 📡 API Usage

@ -85,25 +105,19 @@ Response:

## 🎯 Frontend Integration

### React Component Usage
The txt2kg frontend includes built-in components for GPU visualization:

```tsx
import { UnifiedGPUViewer } from '@/components/unified-gpu-viewer'
- `UnifiedGPUViewer`: Connects to unified GPU service
- `PyGraphistryViewer`: Direct PyGraphistry cloud integration
- `ForceGraphWrapper`: Three.js WebGPU visualization (default)

function MyApp() {
  const graphData = {
    nodes: [...],
    links: [...]
  }
### Using GPU Services in Frontend

  return (
    <UnifiedGPUViewer
      graphData={graphData}
      onError={(error) => console.error(error)}
    />
  )
}
```
The frontend has API routes that can connect to these services:
- `/api/pygraphistry/*`: PyGraphistry integration
- `/api/unified-gpu/*`: Unified GPU service integration

To use these services, ensure they are running separately and configure the frontend environment variables accordingly.

### Mode-Specific Processing

@ -1,243 +0,0 @@
# True GPU Rendering vs Current Approach

## 🎯 **Current Remote GPU Service**

### **What Uses GPU (✅)**
- **Graph Layout**: cuGraph Force Atlas 2, Spectral Layout
- **Clustering**: cuGraph Leiden, Louvain algorithms
- **Centrality**: cuGraph PageRank, Betweenness Centrality
- **Data Processing**: Node positioning, edge bundling

### **What Uses CPU (❌)**
- **Visual Rendering**: D3.js SVG/Canvas drawing
- **Animation**: D3.js transitions and transforms
- **Interaction**: DOM event handling, hover, zoom
- **Text Rendering**: Node labels, tooltips

## 🔥 **True GPU Rendering (Like PyGraphistry)**

### **What Would Need GPU Acceleration**

#### **1. WebGL Compute Shaders**
```glsl
// Vertex shader for node positioning
attribute vec2 position;
attribute float size;
attribute vec3 color;

uniform mat4 projectionMatrix;
uniform float time;

void main() {
    // GPU-accelerated node positioning
    vec2 pos = position + computeForceLayout(time);
    gl_Position = projectionMatrix * vec4(pos, 0.0, 1.0);
    gl_PointSize = size;
}
```

#### **2. GPU Particle Systems**
```javascript
// WebGL-based node rendering
class GPUNodeRenderer {
    constructor(gl, nodeCount) {
        this.nodeCount = nodeCount;

        // Create vertex buffers for GPU processing
        this.positionBuffer = gl.createBuffer();
        this.colorBuffer = gl.createBuffer();
        this.sizeBuffer = gl.createBuffer();

        // Compile GPU shaders
        this.program = this.createShaderProgram(gl);
    }

    render(nodes) {
        // Update GPU buffers - no CPU iteration
        gl.bindBuffer(gl.ARRAY_BUFFER, this.positionBuffer);
        gl.bufferData(gl.ARRAY_BUFFER, new Float32Array(positions), gl.DYNAMIC_DRAW);

        // GPU draws all nodes in single call
        gl.drawArrays(gl.POINTS, 0, this.nodeCount);
    }
}
```

#### **3. GPU-Based Interaction**
```javascript
// GPU picking for node selection
class GPUPicker {
    constructor(gl, nodeCount) {
        // Render nodes to off-screen framebuffer with unique colors
        this.pickingFramebuffer = gl.createFramebuffer();
        this.pickingTexture = gl.createTexture();
    }

    getNodeAtPosition(x, y) {
        // Read single pixel from GPU framebuffer
        const pixel = new Uint8Array(4);
        gl.readPixels(x, y, 1, 1, gl.RGBA, gl.UNSIGNED_BYTE, pixel);

        // Decode node ID from color
        return this.colorToNodeId(pixel);
    }
}
```

## 📊 **Performance Comparison**

### **Current D3.js CPU Rendering**
```javascript
// CPU-bound operations
nodes.forEach(node => {
    // For each node, update DOM element
    d3.select(`#node-${node.id}`)
        .attr("cx", node.x)
        .attr("cy", node.y)
        .attr("r", node.size);
});

// Performance: O(n) DOM operations
// 10k nodes = 10k DOM updates per frame
// Maximum ~60fps with heavy optimization
```

### **GPU WebGL Rendering**
```javascript
// GPU-accelerated operations
class GPURenderer {
    updateNodes(nodeData) {
        // Single buffer update for all nodes
        gl.bufferSubData(gl.ARRAY_BUFFER, 0, nodeData);

        // Single draw call for all nodes
        gl.drawArraysInstanced(gl.TRIANGLES, 0, 6, nodeCount);
    }
}

// Performance: O(1) GPU operations
// 1M nodes = 1 GPU draw call
// Can maintain 60fps with millions of nodes
```

## 🛠️ **Implementation Options**

### **Option 1: WebGL2 + Compute Shaders**
```html
<!-- Enhanced HTML template with WebGL -->
<canvas id="gpu-canvas" width="800" height="600"></canvas>
<script>
const canvas = document.getElementById('gpu-canvas');
const gl = canvas.getContext('webgl2');

// Load compute shaders for layout animation
const computeShader = gl.createShader(gl.COMPUTE_SHADER);
gl.shaderSource(computeShader, computeShaderSource);

// Render loop using GPU
function animate() {
    // Update node positions on GPU
    gl.useProgram(computeProgram);
    gl.dispatchCompute(Math.ceil(nodeCount / 64), 1, 1);

    // Render nodes on GPU
    gl.useProgram(renderProgram);
    gl.drawArraysInstanced(gl.POINTS, 0, 1, nodeCount);

    requestAnimationFrame(animate);
}
</script>
```

### **Option 2: WebGPU (Future)**
```javascript
// Next-generation WebGPU API
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();

// GPU compute pipeline for layout
const computePipeline = device.createComputePipeline({
    compute: {
        module: device.createShaderModule({ code: layoutComputeShader }),
        entryPoint: 'main'
    }
});

// GPU render pipeline
const renderPipeline = device.createRenderPipeline({
    vertex: { module: vertexShaderModule, entryPoint: 'main' },
    fragment: { module: fragmentShaderModule, entryPoint: 'main' },
    primitive: { topology: 'point-list' }
});
```

### **Option 3: Three.js GPU Optimization**
```javascript
// Use Three.js InstancedMesh for GPU instancing
import * as THREE from 'three';

class GPUGraphRenderer {
    constructor(nodeCount) {
        // Single geometry instanced for all nodes
        const geometry = new THREE.CircleGeometry(1, 8);
        const material = new THREE.MeshBasicMaterial();

        // GPU-instanced mesh for all nodes
        this.instancedMesh = new THREE.InstancedMesh(
            geometry, material, nodeCount
        );

        // Position matrix for each instance
        this.matrix = new THREE.Matrix4();
    }

    updateNode(index, x, y, scale, color) {
        // Update single instance matrix
        this.matrix.makeScale(scale, scale, 1);
        this.matrix.setPosition(x, y, 0);
        this.instancedMesh.setMatrixAt(index, this.matrix);
        this.instancedMesh.setColorAt(index, color);
    }

    render() {
        // Single GPU draw call for all nodes
        this.instancedMesh.instanceMatrix.needsUpdate = true;
        this.instancedMesh.instanceColor.needsUpdate = true;
    }
}
```

## 🎯 **Recommendation**

### **Current Approach is Good For:**
- ✅ **Rapid development** - Standard D3.js patterns
- ✅ **Small-medium graphs** (<50k nodes)
- ✅ **Interactive features** - Easy DOM manipulation
- ✅ **Debugging** - Standard web dev tools
- ✅ **Compatibility** - Works in all browsers

### **True GPU Rendering Needed For:**
- 🚀 **Million+ node graphs** with smooth 60fps
- 🚀 **Real-time layout animation**
- 🚀 **Complex visual effects** (particles, trails)
- 🚀 **VR/AR graph visualization**
- 🚀 **Multi-touch interaction** on large displays

## 💡 **Hybrid Solution**

The optimal approach combines both:

```javascript
// Intelligent renderer selection
const selectRenderer = (nodeCount) => {
    if (nodeCount < 10000) {
        return new D3SVGRenderer();   // CPU DOM rendering
    } else if (nodeCount < 100000) {
        return new ThreeJSRenderer(); // WebGL with Three.js
    } else {
        return new WebGLRenderer();   // Custom GPU shaders
    }
};
```

**Current Status:** Your remote service provides **GPU-accelerated data processing** with **CPU-based rendering** - which is perfect for most use cases and much easier to develop/maintain than full GPU rendering.
@ -1,26 +0,0 @@
FROM ubuntu:22.04

# Install required packages
RUN apt-get update && apt-get install -y \
    curl \
    docker.io \
    bc \
    && rm -rf /var/lib/apt/lists/*

# Copy the monitoring script
COPY gpu_memory_monitor.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/gpu_memory_monitor.sh

# Create a non-root user
RUN useradd -m -s /bin/bash monitor

# Set environment variables with defaults
ENV CHECK_INTERVAL=60
ENV MIN_AVAILABLE_PERCENT=70
ENV AUTO_FIX=true

# Run as non-root user
USER monitor
WORKDIR /home/monitor

CMD ["/usr/local/bin/gpu_memory_monitor.sh"]
@ -1,252 +0,0 @@
# NVIDIA MPS Guide for Ollama GPU Optimization

## 🚀 Overview

NVIDIA Multi-Process Service (MPS) is a game-changing technology that enables multiple processes to share a single GPU context, eliminating expensive context switching overhead and dramatically improving concurrent workload performance.

This guide documents our discovery: **MPS transforms the DGX Spark from a single-threaded bottleneck into a high-throughput powerhouse**, achieving **3x concurrent performance** with near-perfect scaling.

## 📊 Performance Results Summary

### Triple Extraction Benchmark (llama3.1:8b)

| System | Mode | Individual Performance | Aggregate Throughput | Scaling Efficiency |
|--------|------|----------------------|---------------------|-------------------|
| **RTX 5090** | Single | ~300 tok/s | 300 tok/s | 100% (baseline) |
| **Mac M4 Pro** | Single | ~45 tok/s | 45 tok/s | 100% (baseline) |
| **DGX Spark** | Single (MPS) | 33.3 tok/s | 33.3 tok/s | 100% (baseline) |
| **DGX Spark** | 2x Concurrent | ~33.2 tok/s each | **66.4 tok/s** | **~99.7% efficiency** |
| **DGX Spark** | 3x Concurrent | ~33.1 tok/s each | **99.4 tok/s** | **~99% efficiency** |

### 🏆 Key Achievement
**DGX Spark + MPS triples its own aggregate throughput (33.3 → 99.4 tok/s) and delivers 2.2x the Mac M4 Pro's throughput in multi-request scenarios!**

## 🛠️ MPS Setup Instructions

### 1. Start MPS Server

```bash
# Set MPS directory
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
mkdir -p /tmp/nvidia-mps

# Start MPS control daemon
sudo env "CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps" nvidia-cuda-mps-control -d
```

### 2. Restart Ollama with MPS Support

```bash
# Stop current Ollama
cd /path/to/ollama
docker compose down

# Start Ollama with MPS environment
sudo env "CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps" docker compose up -d
```

### 3. Verify MPS is Working

```bash
# Check MPS processes
ps aux | grep mps

# Expected output:
# root nvidia-cuda-mps-control -d
# root nvidia-cuda-mps-server -force-tegra

# Check Ollama processes show M+C flag
nvidia-smi
# Look for M+C in the Type column for Ollama processes
```

### 4. Stop MPS (when needed)

```bash
sudo nvidia-cuda-mps-control quit
```

## 🔬 Technical Architecture

### CUDA MPS Architecture
```
┌─────────────────────────────────────────┐
│ GPU (Single CUDA Context)               │
│  ├── MPS Server (Resource Manager)      │
│  ├── Ollama Process 1 ──┐               │
│  ├── Ollama Process 2 ──┼── Shared      │
│  └── Ollama Process 3 ──┘   Context     │
└─────────────────────────────────────────┘
```

### Traditional Multi-Process Architecture
```
┌─────────────────────────────────────────┐
│ GPU                                     │
│  ├── Process 1 (Context 1) ─────────── │
│  ├── Process 2 (Context 2) ─────────── │
│  └── Process 3 (Context 3) ─────────── │
│      ↑ Context Switching Overhead       │
└─────────────────────────────────────────┘
```

## ⚖️ MPS vs Multiple API Servers Comparison

### 🚀 CUDA MPS Advantages

**Performance:**
- ✅ No context switching overhead (single shared context)
- ✅ Concurrent kernel execution from different processes
- ✅ Lower latency for small requests
- ✅ Better GPU utilization (kernels can overlap)

**Memory Efficiency:**
- ✅ Shared GPU memory management
- ✅ No duplicate driver overhead per process
- ✅ More efficient memory allocation
- ✅ Can fit more models in same memory

**Resource Management:**
- ✅ Single point of GPU resource control
- ✅ Automatic load balancing across processes
- ✅ Better thermal management
- ✅ Unified monitoring and debugging

### 🏢 Multiple API Servers Advantages

**Isolation & Reliability:**
- ✅ Process isolation (one crash doesn't affect others)
- ✅ Independent scaling per service
- ✅ Different models can have different configurations
- ✅ Easier to update/restart individual services

**Flexibility:**
- ✅ Different frameworks (vLLM, TensorRT-LLM, etc.)
- ✅ Per-service optimization
- ✅ Independent monitoring and logging
- ✅ Service-specific resource limits

**Operational:**
- ✅ Standard container orchestration (K8s, Docker)
- ✅ Familiar DevOps patterns
- ✅ Load balancing at HTTP level
- ✅ Rolling updates and deployments

## 🎯 Decision Framework

### Use CUDA MPS When:
- 🏆 Maximum GPU utilization is critical
- ⚡ Low latency is paramount
- 💰 Cost optimization (more models per GPU)
- 🔄 Same framework/runtime (e.g., all Ollama)
- 📊 Predictable, homogeneous workloads
- 🎮 Single-tenant environments

### Use Multiple API Servers When:
- 🛡️ High availability/fault tolerance required
- 🔧 Different models need different optimizations
- 📈 Independent scaling per service needed
- 🌐 Multi-tenant production environments
- 🔄 Frequent model updates/deployments
- 👥 Different teams managing different models

## 📊 Performance Impact Analysis

| Metric | CUDA MPS | Multiple Servers |
|--------|----------|------------------|
| Context Switch Overhead | ~0% | ~5-15% |
| Memory Efficiency | ~95% | ~80-85% |
| Latency (small requests) | Lower | Higher |
| Throughput (concurrent) | Higher | Lower |
| Fault Isolation | Lower | Higher |
| Operational Complexity | Lower | Higher |

## 🔍 Memory Capacity Analysis

### Model Memory Requirements
- **llama3.1:8b (Q4_K_M)**: ~4.9GB per instance

### System Comparison
| System | Total Memory | Theoretical Max | Practical Max |
|--------|--------------|-----------------|---------------|
| **RTX 5090** | 24GB VRAM | 4-5 models | 2-3 models |
| **DGX Spark** | 120GB Unified | 20+ models | 10+ models |

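
The theoretical maxima in the table follow directly from dividing total memory by the ~4.9GB per-instance footprint:

```python
MODEL_GB = 4.9  # llama3.1:8b (Q4_K_M) footprint per instance

def theoretical_max(total_gb: float, model_gb: float = MODEL_GB) -> int:
    """How many model instances fit in memory, ignoring driver/runtime overhead."""
    return int(total_gb // model_gb)

print(theoretical_max(24))   # RTX 5090: 4 instances
print(theoretical_max(120))  # DGX Spark: 24 instances
```

The practical-max column is lower because driver overhead, KV caches, and memory fragmentation all eat into the headroom.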
### RTX 5090 Limitations:
- ❌ Limited to 24GB VRAM (hard ceiling)
- ❌ Driver overhead reduces available memory
- ❌ Memory fragmentation issues
- ❌ Thermal throttling under concurrent load
- ❌ Context switching still expensive

### DGX Spark Advantages:
- ✅ 5x more memory capacity (120GB vs 24GB)
- ✅ Unified memory architecture
- ✅ Better thermal design for sustained loads
- ✅ Can scale to 10+ concurrent models
- ✅ No VRAM bottleneck

## 🧪 Testing Concurrent Performance

### Single Instance Baseline
```bash
curl -X POST http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Your prompt here"}],
    "stream": false
  }'
```

### Concurrent Testing
```bash
# Run multiple requests simultaneously
curl [request1] & curl [request2] & curl [request3] & wait
```
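
Scaling efficiency in these results is aggregate throughput divided by n times the single-instance baseline. A small helper to compute it from measured rates (this function is illustrative, not part of the benchmark scripts):

```python
def scaling_efficiency(per_request_toks: list[float], baseline_toks: float) -> float:
    """Aggregate throughput relative to perfect linear scaling."""
    aggregate = sum(per_request_toks)
    ideal = len(per_request_toks) * baseline_toks
    return aggregate / ideal

# Three concurrent requests at ~33.1 tok/s each against a 33.3 tok/s baseline
print(round(scaling_efficiency([33.1, 33.1, 33.1], 33.3), 2))  # → 0.99
```
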

### Expected Results with MPS:
- **1 instance**: 33.3 tok/s
- **2 concurrent**: ~66.4 tok/s total (~99.7% efficiency)
- **3 concurrent**: ~99.4 tok/s total (~99% efficiency)

## 🎯 Recommendations

### For Triple Extraction Workloads:
**MPS is the optimal choice because:**
1. **Homogeneous workload** - same model (llama3.1:8b)
2. **Performance critical** - maximum throughput needed
3. **Cost optimization** - more concurrent requests per GPU
4. **Predictable usage** - biomedical triple extraction

### Hybrid Approach:
Consider running:
- **MPS in production** for maximum throughput
- **Separate dev/test servers** for experimentation
- **Different models** on separate instances when needed

## 🚨 Important Notes

1. **MPS requires careful setup** - ensure proper environment variables
2. **Monitor GPU temperature** under heavy concurrent loads
3. **Test thoroughly** before production deployment
4. **Have a fallback plan** to standard single-process mode
5. **Consider workload patterns** - MPS excels with consistent concurrent requests

## 🔗 Related Files

- `docker-compose.yml` - Ollama service configuration
- `ollama_gpu_benchmark.py` - Performance testing script
- `clear_cache_and_restart.sh` - Memory optimization script
- `gpu_memory_monitor.sh` - GPU monitoring script

## 📚 Additional Resources

- [NVIDIA MPS Documentation](https://docs.nvidia.com/deploy/mps/index.html)
- [CUDA Multi-Process Service Guide](https://docs.nvidia.com/cuda/mps/index.html)
- [Ollama Documentation](https://ollama.ai/docs)

---

**Last Updated**: October 2, 2025
**Tested On**: DGX Spark with 120GB unified memory, CUDA 13.0, Ollama latest
@ -1,78 +0,0 @@
# Ollama GPU Memory Monitoring

This setup includes automatic monitoring and fixing of GPU memory detection issues that can occur on unified memory systems (like DGX Spark, Jetson, etc.).

## The Problem

On unified memory systems, Ollama sometimes can't detect the full amount of available GPU memory due to buffer cache not being reclaimable. This causes models to fall back to CPU inference, dramatically reducing performance.

**Symptoms:**
- Ollama logs show low "available" vs "total" GPU memory
- Models show mixed CPU/GPU processing instead of 100% GPU
- Performance is much slower than expected

## The Solution

This Docker Compose setup includes an optional GPU memory monitor that:

1. **Monitors** Ollama's GPU memory detection every 60 seconds
2. **Detects** when available memory drops below 70% of total
3. **Automatically fixes** the issue by clearing buffer cache and restarting Ollama
4. **Logs** all actions for debugging

## Usage

### Standard Setup (Most Systems)
```bash
docker compose up -d
```

### Unified Memory Systems (DGX Spark, Jetson, etc.)
```bash
docker compose --profile unified-memory up -d
```

This will start both Ollama and the GPU memory monitor.

## Configuration

The monitor can be configured via environment variables:

- `CHECK_INTERVAL=60` - How often to check (seconds)
- `MIN_AVAILABLE_PERCENT=70` - Threshold for triggering fixes (percentage)
- `AUTO_FIX=true` - Whether to automatically fix issues

## Manual Commands

You can still use the manual scripts if needed:

```bash
# Check current GPU memory status
./monitor_gpu_memory.sh

# Manually clear cache and restart
./clear_cache_and_restart.sh
```

## Monitoring Logs

To see what the monitor is doing:

```bash
docker logs ollama-gpu-monitor -f
```

## When to Use

Use the unified memory profile if you experience:
- Inconsistent Ollama performance
- Models loading on CPU instead of GPU
- GPU memory showing as much lower than system RAM
- A system with unified memory (DGX, Jetson, etc.)

## Performance Impact

The monitor has minimal performance impact:
- Runs one check every 60 seconds
- Only takes action when issues are detected
- Automatic fixes typically resolve issues within 30 seconds
@ -1,66 +0,0 @@
version: '3.8'

services:
  ollama:
    build:
      context: .
      dockerfile: Dockerfile
    image: ollama-custom:latest
    container_name: ollama-server
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_KEEP_ALIVE=30m
      - OLLAMA_CUDA=1
      # Performance tuning for large models like Llama3 70B
      - OLLAMA_LLM_LIBRARY=cuda
      - OLLAMA_NUM_PARALLEL=1      # Favor latency/stability for 70B; increase for smaller models
      - OLLAMA_MAX_LOADED_MODELS=1 # Avoid VRAM contention
      - OLLAMA_KV_CACHE_TYPE=q8_0  # Reduce KV cache VRAM with minimal perf impact
      # Removed restrictive settings for 70B model testing:
      # - OLLAMA_CONTEXT_LENGTH=8192 (let Ollama auto-detect)
      # - OLLAMA_NUM_PARALLEL=4 (let Ollama decide)
      # - OLLAMA_MAX_LOADED=1 (allow multiple models)
      # - OLLAMA_NUM_THREADS=16 (may force CPU usage)
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

  # GPU Memory Monitor - only for unified memory systems like DGX Spark
  gpu-monitor:
    build:
      context: .
      dockerfile: Dockerfile.monitor
    container_name: ollama-gpu-monitor
    depends_on:
      - ollama
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      - CHECK_INTERVAL=60        # Check every 60 seconds
      - MIN_AVAILABLE_PERCENT=70 # Alert if less than 70% GPU memory available
      - AUTO_FIX=true            # Automatically fix buffer cache issues
    privileged: true             # Required to clear buffer cache and restart containers
    restart: unless-stopped
    profiles:
      - unified-memory           # Only start with --profile unified-memory

volumes:
  ollama_models:
    driver: local
@ -8,10 +8,10 @@ OLLAMA_PID=$!

# Wait for Ollama to be ready
echo "Waiting for Ollama to be ready..."
max_attempts=30
max_attempts=120
attempt=0
while [ $attempt -lt $max_attempts ]; do
    if curl -s http://localhost:11434/api/tags > /dev/null 2>&1; then
    if /bin/ollama list > /dev/null 2>&1; then
        echo "Ollama is ready!"
        break
    fi
@ -26,9 +26,8 @@ fi

# Check if any models are present
echo "Checking for existing models..."
MODELS=$(curl -s http://localhost:11434/api/tags | grep -o '"models":\s*\[\]' || echo "has_models")

if [[ "$MODELS" == *'"models": []'* ]]; then
if ! /bin/ollama list | grep -q llama3.1:8b; then
    echo "No models found. Pulling llama3.1:8b..."
    /bin/ollama pull llama3.1:8b
    echo "Successfully pulled llama3.1:8b"
@ -38,5 +37,4 @@ fi

# Keep the container running
echo "Setup complete. Ollama is running."
wait $OLLAMA_PID

wait $OLLAMA_PID
@ -1,108 +0,0 @@
#!/bin/bash
#
# Ollama GPU Memory Monitor - runs inside a sidecar container
# Automatically detects and fixes unified memory buffer cache issues
#

set -e

# Configuration
CHECK_INTERVAL=${CHECK_INTERVAL:-60}               # Check every 60 seconds
MIN_AVAILABLE_PERCENT=${MIN_AVAILABLE_PERCENT:-70} # Alert if less than 70% available
AUTO_FIX=${AUTO_FIX:-true}                         # Automatically fix issues

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"
}

check_ollama_memory() {
    # Wait for Ollama to be ready
    if ! curl -s http://ollama:11434/api/tags > /dev/null 2>&1; then
        log "Ollama not ready, skipping check"
        return 0
    fi

    # Get Ollama logs to find inference compute info
    local compute_log=$(docker logs ollama-server 2>&1 | grep "inference compute" | tail -1)

    if [ -z "$compute_log" ]; then
        log "No inference compute logs found"
        return 0
    fi

    # Extract memory info
    local total_mem=$(echo "$compute_log" | grep -o 'total="[^"]*"' | cut -d'"' -f2)
    local available_mem=$(echo "$compute_log" | grep -o 'available="[^"]*"' | cut -d'"' -f2)

    if [ -z "$total_mem" ] || [ -z "$available_mem" ]; then
        log "Could not parse memory information"
        return 0
    fi

    # Convert to numeric (assuming GiB)
    local total_num=$(echo "$total_mem" | sed 's/ GiB//')
    local available_num=$(echo "$available_mem" | sed 's/ GiB//')

    # Calculate percentage
    local available_percent=$(echo "scale=1; $available_num * 100 / $total_num" | bc)

    log "GPU Memory: $available_mem / $total_mem available (${available_percent}%)"

    # Check if we need to take action
    if (( $(echo "$available_percent < $MIN_AVAILABLE_PERCENT" | bc -l) )); then
        log "WARNING: Low GPU memory availability detected (${available_percent}%)"

        if [ "$AUTO_FIX" = "true" ]; then
            log "Attempting to fix by clearing buffer cache..."
            fix_memory_issue
        else
            log "Auto-fix disabled. Manual intervention required."
        fi

        return 1
    else
        log "GPU memory availability OK (${available_percent}%)"
        return 0
    fi
}

fix_memory_issue() {
    log "Clearing system buffer cache..."

    # Clear buffer cache from host (requires privileged container)
    echo 1 > /proc/sys/vm/drop_caches 2>/dev/null || {
        log "Cannot clear buffer cache from container. Trying host command..."
        # Alternative: use nsenter to run on host
        nsenter -t 1 -m -p sh -c 'sync && echo 1 > /proc/sys/vm/drop_caches' 2>/dev/null || {
            log "Failed to clear buffer cache. Manual intervention required."
            return 1
        }
    }

    # Wait a moment
    sleep 5

    # Restart Ollama container
    log "Restarting Ollama container..."
    docker restart ollama-server

    # Wait for restart
    sleep 15

    log "Fix applied. Ollama should have better memory detection now."
}

main() {
    log "Starting Ollama GPU Memory Monitor"
    log "Check interval: ${CHECK_INTERVAL}s, Min available: ${MIN_AVAILABLE_PERCENT}%, Auto-fix: ${AUTO_FIX}"

    while true; do
        check_ollama_memory || true # Don't exit on check failures
        sleep "$CHECK_INTERVAL"
    done
}

# Handle signals gracefully
trap 'log "Shutting down monitor..."; exit 0' SIGTERM SIGINT

main
@ -1,79 +0,0 @@
|
||||
#!/bin/bash
#
# Monitor Ollama GPU memory usage and alert when buffer cache is consuming too much
# This helps detect when the unified memory issue is occurring
#

set -e

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

# Thresholds
MIN_AVAILABLE_PERCENT=70 # Alert if less than 70% GPU memory available

echo "🔍 Ollama GPU Memory Monitor"
echo "================================"

# Check if Ollama container is running
if ! docker ps | grep -q ollama-server; then
    echo -e "${RED}❌ Ollama container is not running${NC}"
    exit 1
fi

# Get the latest inference compute log
COMPUTE_LOG=$(docker logs ollama-server 2>&1 | grep "inference compute" | tail -1)

if [ -z "$COMPUTE_LOG" ]; then
    echo -e "${YELLOW}⚠️ No inference compute logs found. Model may not be loaded.${NC}"
    exit 1
fi

echo "Latest GPU memory status:"
echo "$COMPUTE_LOG"

# Extract total and available memory
TOTAL_MEM=$(echo "$COMPUTE_LOG" | grep -o 'total="[^"]*"' | cut -d'"' -f2)
AVAILABLE_MEM=$(echo "$COMPUTE_LOG" | grep -o 'available="[^"]*"' | cut -d'"' -f2)

# Convert to numeric values (assuming GiB)
TOTAL_NUM=$(echo "$TOTAL_MEM" | sed 's/ GiB//')
AVAILABLE_NUM=$(echo "$AVAILABLE_MEM" | sed 's/ GiB//')

# Calculate percentage
AVAILABLE_PERCENT=$(echo "scale=1; $AVAILABLE_NUM * 100 / $TOTAL_NUM" | bc)

echo ""
echo "Memory Analysis:"
echo " Total GPU Memory: $TOTAL_MEM"
echo " Available Memory: $AVAILABLE_MEM"
echo " Available Percentage: ${AVAILABLE_PERCENT}%"

# Check if we need to alert
if (( $(echo "$AVAILABLE_PERCENT < $MIN_AVAILABLE_PERCENT" | bc -l) )); then
    echo ""
    echo -e "${RED}🚨 WARNING: Low GPU memory availability detected!${NC}"
    echo -e "${RED} Only ${AVAILABLE_PERCENT}% of GPU memory is available${NC}"
    echo -e "${YELLOW} This may cause models to run on CPU instead of GPU${NC}"
    echo ""
    echo "💡 Recommended action:"
    echo " Run: ./clear_cache_and_restart.sh"
    echo ""

    # Show current system memory usage
    echo "Current system memory usage:"
    free -h

    exit 1
else
    echo ""
    echo -e "${GREEN}✅ GPU memory availability looks good (${AVAILABLE_PERCENT}%)${NC}"
fi

# Show current model status
echo ""
echo "Current loaded models:"
docker exec ollama-server ollama ps

@ -1,92 +1,153 @@
# vLLM NVFP4 Deployment
# vLLM Service

This setup deploys the NVIDIA Llama 4 Scout model with NVFP4 quantization using vLLM, optimized for Blackwell and Hopper GPU architectures.
This service provides advanced GPU-accelerated LLM inference using vLLM with FP8 quantization, offering higher throughput than Ollama for production workloads.

## Overview

vLLM is an optional service that complements Ollama by providing:
- Higher throughput for concurrent requests
- Advanced quantization (FP8)
- PagedAttention for efficient memory usage
- OpenAI-compatible API

## Quick Start

1. **Set up your HuggingFace token:**
```bash
cp env.example .env
# Edit .env and add your HF_TOKEN
```
### Using the Complete Stack

2. **Build and run:**
```bash
docker-compose up --build
```

3. **Test the deployment:**
```bash
curl -X POST "http://localhost:8001/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Llama-4-Scout-17B-16E-Instruct-FP4",
    "messages": [{"role": "user", "content": "Hello! How are you?"}],
    "max_tokens": 100
  }'
```

## Model Information

- **Model**: `nvidia/Llama-4-Scout-17B-16E-Instruct-FP4`
- **Quantization**: NVFP4 (optimized for Blackwell architecture)
- **Alternative**: `nvidia/Llama-4-Scout-17B-16E-Instruct-FP8` (for Hopper architecture)

## Performance Tuning

The startup script automatically detects your GPU architecture and applies optimal settings:

### Blackwell (Compute Capability 10.0)
- Enables FlashInfer backend
- Uses NVFP4 quantization
- Enables async scheduling
- Applies fusion optimizations

### Hopper (Compute Capability 9.0)
- Uses FP8 quantization
- Disables async scheduling (due to vLLM limitations)
- Standard optimization settings

### Configuration Options

Adjust these environment variables in your `.env` file:

- `VLLM_TENSOR_PARALLEL_SIZE`: Number of GPUs to use (default: 2)
- `VLLM_MAX_NUM_SEQS`: Batch size (default: 128)
- `VLLM_MAX_NUM_BATCHED_TOKENS`: Token batching limit (default: 8192)
- `VLLM_GPU_MEMORY_UTILIZATION`: GPU memory usage (default: 0.9)

### Performance Scenarios

- **Maximum Throughput**: `VLLM_TENSOR_PARALLEL_SIZE=1`, increase `VLLM_MAX_NUM_SEQS`
- **Minimum Latency**: `VLLM_TENSOR_PARALLEL_SIZE=4-8`, `VLLM_MAX_NUM_SEQS=8`
- **Balanced**: `VLLM_TENSOR_PARALLEL_SIZE=2`, `VLLM_MAX_NUM_SEQS=128` (default)
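As a sketch, the balanced scenario translates into `.env`-style entries like the following. The values are just the documented defaults restated, not numbers tuned for any particular GPU:

```shell
# Balanced profile from the scenario list above; values are the documented defaults.
export VLLM_TENSOR_PARALLEL_SIZE=2
export VLLM_MAX_NUM_SEQS=128
export VLLM_MAX_NUM_BATCHED_TOKENS=8192
export VLLM_GPU_MEMORY_UTILIZATION=0.9

echo "$VLLM_TENSOR_PARALLEL_SIZE"
```

For the latency-oriented profile, drop `VLLM_MAX_NUM_SEQS` to 8 and raise the tensor-parallel size, as the list suggests.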

## Benchmarking

To benchmark performance:
The easiest way to run vLLM is with the complete stack:

```bash
docker exec -it vllm-nvfp4-server vllm bench serve \
  --host 0.0.0.0 \
  --port 8001 \
  --model nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency 128 \
  --num-prompts 1280
# From project root
./start.sh --complete
```

This starts vLLM along with all other optional services.

### Manual Docker Compose

```bash
# From project root
docker compose -f deploy/compose/docker-compose.complete.yml up -d vllm
```

### Testing the Deployment

```bash
# Check health
curl http://localhost:8001/v1/models

# Test chat completion
curl -X POST "http://localhost:8001/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Hello! How are you?"}],
    "max_tokens": 100
  }'
```

## Default Configuration

- **Model**: `meta-llama/Llama-3.2-3B-Instruct`
- **Quantization**: FP8 (optimized for compute efficiency)
- **Port**: 8001
- **API**: OpenAI-compatible endpoints

## Configuration Options

Environment variables configured in `docker-compose.complete.yml`:

- `VLLM_MODEL`: Model to load (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `VLLM_TENSOR_PARALLEL_SIZE`: Number of GPUs to use (default: 1)
- `VLLM_MAX_MODEL_LEN`: Maximum sequence length (default: 4096)
- `VLLM_GPU_MEMORY_UTILIZATION`: GPU memory usage (default: 0.9)
- `VLLM_QUANTIZATION`: Quantization method (default: fp8)
- `VLLM_KV_CACHE_DTYPE`: KV cache data type (default: fp8)

## Frontend Integration

The txt2kg frontend automatically detects and uses vLLM when available:

1. Triple extraction: `/api/vllm` endpoint
2. RAG queries: Automatically uses vLLM if configured
3. Model selection: Choose vLLM models in the UI

## Using Different Models

To use a different model, edit the `VLLM_MODEL` environment variable in `docker-compose.complete.yml`:

```yaml
environment:
  - VLLM_MODEL=meta-llama/Llama-3.1-8B-Instruct
```

Then restart the service:

```bash
docker compose -f deploy/compose/docker-compose.complete.yml restart vllm
```

## Performance Tips

1. **Single GPU**: Set `VLLM_TENSOR_PARALLEL_SIZE=1` for best single-GPU performance
2. **Multi-GPU**: Increase `VLLM_TENSOR_PARALLEL_SIZE` to use multiple GPUs
3. **Memory**: Adjust `VLLM_GPU_MEMORY_UTILIZATION` based on available VRAM
4. **Throughput**: For high throughput, use smaller models or more aggressive quantization

## Requirements

- NVIDIA GPU with Blackwell or Hopper architecture
- CUDA Driver 575 or above
- NVIDIA GPU with CUDA support (Ampere architecture or newer recommended)
- CUDA Driver 535 or above
- Docker with NVIDIA Container Toolkit
- HuggingFace token (for model access)
- At least 8GB VRAM for default model
- HuggingFace token for gated models (optional, cached in `~/.cache/huggingface`)

## Troubleshooting

- Check GPU compatibility: `nvidia-smi`
- View logs: `docker-compose logs -f vllm-nvfp4`
- Monitor GPU usage: `nvidia-smi -l 1`
### Check Service Status
```bash
# View logs
docker compose -f deploy/compose/docker-compose.complete.yml logs -f vllm

# Check health
curl http://localhost:8001/v1/models
```

### GPU Issues
```bash
# Check GPU availability
nvidia-smi

# Check vLLM container GPU access
docker exec vllm-service nvidia-smi
```

### Model Loading Issues
- Ensure sufficient VRAM for the model
- Check HuggingFace cache: `ls ~/.cache/huggingface/hub`
- For gated models, set HF_TOKEN environment variable

## Comparison with Ollama

| Feature | Ollama | vLLM |
|---------|--------|------|
| **Ease of Use** | ✅ Very easy | ⚠️ More complex |
| **Model Management** | ✅ Built-in pull/push | ❌ Manual download |
| **Throughput** | ⚠️ Moderate | ✅ High |
| **Quantization** | Q4_K_M | FP8, GPTQ |
| **Memory Efficiency** | ✅ Good | ✅ Excellent (PagedAttention) |
| **Use Case** | Development, small-scale | Production, high-throughput |

## When to Use vLLM

Use vLLM when:
- Processing large batches of requests
- Need maximum throughput
- Using multiple GPUs
- Deploying to production with high load

Use Ollama when:
- Getting started with the project
- Single-user development
- Simpler model management needed
- Don't need maximum performance

@ -4,14 +4,34 @@ This directory contains the Next.js frontend application for the txt2kg project.

## Structure

- **app/**: Next.js app directory with pages and routes
- **components/**: React components
- **contexts/**: React context providers
- **app/**: Next.js 15 app directory with pages and API routes
  - API routes for LLM providers (Ollama, vLLM, NVIDIA API)
  - Triple extraction and graph query endpoints
  - Settings and health check endpoints
- **components/**: React 19 components
  - Graph visualization (Three.js WebGPU)
  - PyGraphistry integration for GPU-accelerated rendering
  - RAG query interface
  - Document upload and processing
- **contexts/**: React context providers for state management
- **hooks/**: Custom React hooks
- **lib/**: Utility functions and shared logic
  - LLM service (Ollama, vLLM, NVIDIA API integration)
  - Graph database services (ArangoDB, Neo4j)
  - Pinecone vector database integration
  - RAG service for knowledge graph querying
- **public/**: Static assets
- **styles/**: CSS and styling files
- **types/**: TypeScript type definitions
- **types/**: TypeScript type definitions for graph data structures

## Technology Stack

- **Next.js 15**: React framework with App Router
- **React 19**: Latest React with improved concurrent features
- **TypeScript**: Type-safe development
- **Tailwind CSS**: Utility-first styling
- **Three.js**: WebGL/WebGPU 3D graph visualization
- **D3.js**: Data-driven visualizations
- **LangChain**: LLM orchestration and chaining

## Development

@ -23,9 +43,47 @@ npm install
npm run dev
```

Or use the start script from project root:

```bash
./start.sh --dev-frontend
```

The development server will run on http://localhost:3000

## Building for Production

```bash
cd frontend
npm run build
npm start
```

Or use Docker (recommended):

```bash
# From project root
./start.sh
```

The production app will run on http://localhost:3001

## Environment Variables

Required environment variables are configured in docker-compose files:

- `ARANGODB_URL`: ArangoDB connection URL
- `OLLAMA_BASE_URL`: Ollama API endpoint
- `VLLM_BASE_URL`: vLLM API endpoint (optional)
- `NVIDIA_API_KEY`: NVIDIA API key (optional)
- `PINECONE_HOST`: Local Pinecone host (optional)
- `SENTENCE_TRANSFORMER_URL`: Embeddings service URL (optional)
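For local development outside Docker, these can be exported in the shell before starting the frontend. The hosts and ports below are assumptions for a default local setup, taken from the service URLs printed elsewhere in this repo; adjust them to yours:

```shell
# Hedged sketch of a local dev environment; hosts/ports are assumed defaults.
export ARANGODB_URL="http://localhost:8529"
export OLLAMA_BASE_URL="http://localhost:11434"
export VLLM_BASE_URL="http://localhost:8001"    # optional
export PINECONE_HOST="http://localhost:5081"    # optional

echo "$OLLAMA_BASE_URL"
```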

## Features

- **Knowledge Graph Extraction**: Extract triples from text using LLMs
- **Graph Visualization**: Interactive 3D visualization with Three.js WebGPU
- **RAG Queries**: Query knowledge graphs with retrieval-augmented generation
- **Multiple LLM Providers**: Support for Ollama, vLLM, and NVIDIA API
- **GPU-Accelerated Rendering**: Optional PyGraphistry integration for large graphs
- **Vector Search**: Pinecone integration for semantic search

@ -1,9 +1,11 @@
# TXT2KG Pipeline with ArangoDB Integration
# GNN Training Pipeline (Experimental)

This project provides a two-stage pipeline for knowledge graph-based question answering:
**Status**: This is an experimental feature for training Graph Neural Network models for enhanced RAG retrieval.

1. **Data Preprocessing** (`preprocess_data.py`): Extracts knowledge graph triples from either ArangoDB or using TXT2KG, and prepares the dataset.
2. **Model Training & Testing** (`train_test_gnn.py`): Trains and evaluates a GNN-based retriever model on the preprocessed dataset.
This pipeline provides a two-stage process for training GNN-based knowledge graph retrieval models:

1. **Data Preprocessing** (`preprocess_data.py`): Extracts knowledge graph triples from ArangoDB and prepares training datasets.
2. **Model Training & Testing** (`train_test_gnn.py`): Trains and evaluates a GNN-based retriever model using PyTorch Geometric.

## Prerequisites

@ -20,10 +22,11 @@ This project provides a two-stage pipeline for knowledge graph-based question an
pip install -r scripts/requirements.txt
```

2. Ensure ArangoDB is running. You can use the docker-compose file:
2. Ensure ArangoDB is running. You can use the main start script:

```bash
docker-compose up -d arangodb arangodb-init
# From project root
./start.sh
```

## Usage
@ -57,13 +60,9 @@ You can specify custom ArangoDB connection parameters:
python scripts/preprocess_data.py --use_arango --arango_url "http://localhost:8529" --arango_db "your_db" --arango_user "username" --arango_password "password"
```

#### Using TXT2KG (original behavior)
#### Using Direct Triple Extraction

If you don't pass the `--use_arango` flag, the script will use the original TXT2KG approach:

```bash
python scripts/preprocess_data.py --NV_NIM_KEY "your-nvidia-api-key"
```
If you don't pass the `--use_arango` flag, the script will extract triples directly using the configured LLM provider.

### Stage 2: Model Training & Testing

@ -4,7 +4,6 @@

# Parse command line arguments
DEV_FRONTEND=false
USE_VLLM=false
USE_COMPLETE=false

while [[ $# -gt 0 ]]; do
@ -13,10 +12,6 @@ while [[ $# -gt 0 ]]; do
            DEV_FRONTEND=true
            shift
            ;;
        --vllm)
            USE_VLLM=true
            shift
            ;;
        --complete)
            USE_COMPLETE=true
            shift
@ -25,12 +20,15 @@ while [[ $# -gt 0 ]]; do
            echo "Usage: ./start.sh [OPTIONS]"
            echo ""
            echo "Options:"
            echo " --dev-frontend Run frontend in development mode (without Docker)"
            echo " --vllm Use vLLM instead of Ollama for LLM inference"
            echo " --complete Use complete stack with MinIO S3 storage"
            echo " --help, -h Show this help message"
            echo " --dev-frontend Run frontend in development mode (without Docker)"
            echo " --complete Use complete stack (vLLM, Pinecone, Sentence Transformers)"
            echo " --help, -h Show this help message"
            echo ""
            echo "Default: Starts with Ollama, ArangoDB, local Pinecone, and Next.js frontend"
            echo "Default: Starts minimal stack with Ollama, ArangoDB, and Next.js frontend"
            echo ""
            echo "Examples:"
            echo " ./start.sh # Start minimal demo (recommended)"
            echo " ./start.sh --complete # Start with all optional services"
            exit 0
            ;;
        *)
@ -81,15 +79,12 @@ else
fi

# Build the docker-compose command
if [ "$USE_VLLM" = true ]; then
    CMD="$DOCKER_COMPOSE_CMD -f $(pwd)/deploy/compose/docker-compose.vllm.yml"
    echo "Using vLLM for GPU-accelerated LLM inference with FP8 quantization..."
elif [ "$USE_COMPLETE" = true ]; then
if [ "$USE_COMPLETE" = true ]; then
    CMD="$DOCKER_COMPOSE_CMD -f $(pwd)/deploy/compose/docker-compose.complete.yml"
    echo "Using complete stack with MinIO S3 storage..."
    echo "Using complete stack (Ollama, vLLM, Pinecone, Sentence Transformers)..."
else
    CMD="$DOCKER_COMPOSE_CMD -f $(pwd)/deploy/compose/docker-compose.yml"
    echo "Using default configuration (Ollama + ArangoDB + local Pinecone)..."
    echo "Using minimal configuration (Ollama + ArangoDB only)..."
fi

# Execute the command
@ -104,14 +99,16 @@ echo "=========================================="
echo "txt2kg is now running!"
echo "=========================================="
echo ""
echo "Services:"
echo "Core Services:"
echo " • Web UI: http://localhost:3001"
echo " • ArangoDB: http://localhost:8529"
echo " • Ollama API: http://localhost:11434"
echo " • Local Pinecone: http://localhost:5081"
echo ""

if [ "$USE_VLLM" = true ]; then
if [ "$USE_COMPLETE" = true ]; then
    echo "Additional Services (Complete Stack):"
    echo " • Local Pinecone: http://localhost:5081"
    echo " • Sentence Transformers: http://localhost:8000"
    echo " • vLLM API: http://localhost:8001"
    echo ""
fi
@ -125,6 +122,6 @@ echo " 3. Upload documents and start building your knowledge graph!"
echo ""
echo "Other options:"
echo " • Run frontend in dev mode: ./start.sh --dev-frontend"
echo " • Use vLLM instead of Ollama: ./start.sh --vllm"
echo " • Use complete stack: ./start.sh --complete"
echo " • View logs: docker compose logs -f"
echo ""
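The option handling in start.sh follows the standard while/case/shift idiom. Stripped of the Docker logic, the pattern reduces to this minimal sketch (function and flag names are illustrative):

```shell
#!/bin/sh
# Minimal, standalone sketch of the while/case/shift parsing pattern above.
DEV_FRONTEND=false
USE_COMPLETE=false

parse_args() {
    while [ $# -gt 0 ]; do
        case "$1" in
            --dev-frontend) DEV_FRONTEND=true; shift ;;
            --complete)     USE_COMPLETE=true; shift ;;
            *)              echo "Unknown option: $1" >&2; shift ;;
        esac
    done
}

parse_args --complete
echo "$USE_COMPLETE"   # prints true
```

Each recognized flag sets a boolean and `shift` consumes the argument, so the loop terminates once `$#` reaches zero; unknown options are reported rather than aborting, matching the script's permissive style.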