chore: Regenerate all playbooks

This commit is contained in:
GitLab CI 2025-10-10 18:45:20 +00:00
parent f7f0a5ec85
commit 89b4835335
24 changed files with 699 additions and 1853 deletions

View File

@ -77,8 +77,10 @@ to the DGX Spark device
## Step 1. Install NVIDIA Sync
NVIDIA Sync is a desktop app that connects your computer to your DGX Spark over the local network.
It gives you a single interface to manage SSH access and launch development tools on your DGX Spark.
Download and install NVIDIA Sync on your computer to get started.
::spark-download
@ -115,30 +117,27 @@ interface for managing SSH connections and launching development tools on your D
## Step 2. Configure Apps
After starting NVIDIA Sync and agreeing to the EULA, select which development tools you want
to use.
Apps are desktop programs installed on your laptop that NVIDIA Sync can configure and launch with an automatic connection to your Spark.
You can change your app selections anytime in the Settings window. Apps that are marked "unavailable" must be installed before you can use them.
**Default apps:**
- **DGX Dashboard**: Web application pre-installed on DGX Spark for system management and integrated JupyterLab access
- **Terminal**: Your system's built-in terminal with automatic SSH connection
**Optional apps (require separate installation):**
- **VS Code**: Download from https://code.visualstudio.com/download
- **Cursor**: Download from https://cursor.com/downloads
- **NVIDIA AI Workbench**: Download from https://www.nvidia.com/workbench
## Step 3. Add your DGX Spark device
> [!NOTE]
> You must know either your hostname or IP address to connect.
>
> - The default hostname can be found on the Quick Start Guide included in the box. For example, `spark-abcd.local`
> - If you have a display connected to your device, you can find the hostname on the Settings page of the [DGX Dashboard](http://localhost:11000).
> - If `.local` (mDNS) hostnames don't work on your network you must use an IP address. This can be found in Ubuntu's network settings or by logging into the admin console of your router.
Finally, connect your DGX Spark by filling out the form:
@ -159,7 +158,8 @@ Click "Add" and NVIDIA Sync will automatically:
4. Create an SSH alias locally for future connections
5. Discard your username and password information
> [!IMPORTANT]
> After completing system setup for the first time, your device may take several minutes to update and become available on the network. If NVIDIA Sync fails to connect, please wait 3-4 minutes and try again.
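While you wait, you can poll SSH reachability from your laptop. A minimal sketch (bash-only, since it uses `/dev/tcp`; `spark-abcd.local` is a placeholder for your device's hostname):

```shell
# Poll until a TCP port accepts connections, or give up after a timeout.
# Usage: wait_for_port <host> <port> <timeout_seconds>
wait_for_port() {
  local host=$1 port=$2 timeout=$3 elapsed=0
  # bash-only: /dev/tcp opens a TCP connection; the subshell closes it again
  until (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
    sleep 5
    elapsed=$((elapsed + 5))
    [ "$elapsed" -ge "$timeout" ] && return 1
  done
  return 0
}

# Example with a placeholder hostname: wait up to 5 minutes for SSH (port 22)
# wait_for_port spark-abcd.local 22 300 && echo "SSH is reachable"
```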
## Step 4. Access your DGX Spark
@ -178,9 +178,10 @@ connection to your DGX Spark.
## Step 5. Validate SSH setup
NVIDIA Sync creates an SSH alias for your device so you can connect easily, whether manually or from other SSH-enabled apps.
Verify your local SSH configuration is correct by using the SSH alias. You should not be prompted for your
password when using the alias:
```bash
## Configured if you use mDNS hostname
@ -207,12 +208,21 @@ Exit the SSH session
exit
```
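For reference, the alias NVIDIA Sync creates behaves like an `~/.ssh/config` entry along these lines (the host name, user, and key path shown are illustrative, not the exact values Sync writes):

```
Host spark-abcd
    HostName spark-abcd.local
    User your-username
    IdentityFile ~/.ssh/id_ed25519
    IdentitiesOnly yes
```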
## Step 6. Troubleshooting
| Symptom | Cause | Fix |
|---------|--------|-----|
| Device name doesn't resolve | mDNS blocked on network | Use IP address instead of hostname.local |
| Connection refused/timeout | DGX Spark not booted or SSH not ready | Wait for device boot completion; SSH available after updates finish |
| Authentication failed | SSH key setup incomplete | Re-run device setup in NVIDIA Sync; check credentials |
## Step 7. Next steps
Test your setup by launching a development tool:
- Click the NVIDIA Sync system tray icon.
- Select "Terminal" to open a terminal session on your DGX Spark.
- Select "DGX Dashboard" to use JupyterLab and manage updates.
- Try [a custom port example with Open WebUI](/spark/open-webui/sync)
## Connect with Manual SSH

View File

@ -60,14 +60,18 @@ If you see a permission denied error (something like `permission denied while tr
```bash
sudo usermod -aG docker $USER
newgrp docker
```
> **Warning**: After running usermod, you must log out and log back in to start a new
> session with updated group permissions.
Test Docker access again. In the terminal, run:
```bash
docker ps
```
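To confirm the group change is active in your current session before retrying `docker ps`, a quick check (a sketch; `docker` is the group name created by Docker's packages):

```shell
# Report whether the "docker" group is active for the current session
if id -nG "$USER" | grep -qw docker; then
  echo "docker group: active in this session"
else
  echo "docker group: not active - log out and back in, then retry"
fi
```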
## Step 2. Verify Docker setup and pull container
Pull the Open WebUI container image with integrated Ollama:
```bash
docker pull ghcr.io/open-webui/open-webui:ollama
@ -130,7 +134,8 @@ Press Enter to send the message and wait for the model's response.
Steps to completely remove the Open WebUI installation and free up resources:
> [!WARNING]
> These commands will permanently delete all Open WebUI data and downloaded models.
Stop and remove the Open WebUI container:
@ -151,9 +156,6 @@ Remove persistent data volumes:
docker volume rm open-webui open-webui-ollama
```
## Step 9. Next steps
Try downloading different models from the Ollama library at https://ollama.com/library.
@ -168,7 +170,8 @@ docker pull ghcr.io/open-webui/open-webui:ollama
## Setup Open WebUI on Remote Spark with NVIDIA Sync
> [!TIP]
> If you haven't already installed NVIDIA Sync, [learn how here.](/spark/connect-to-your-spark/sync)
## Step 1. Configure Docker permissions
@ -184,17 +187,18 @@ If you see a permission denied error (something like `permission denied while tr
```bash
sudo usermod -aG docker $USER
newgrp docker
```
> **Warning**: After running usermod, you must close the terminal window completely to start a new
> session with updated group permissions.
Test Docker access again. In the terminal, run:
```bash
docker ps
```
## Step 2. Verify Docker setup and pull container
This step confirms Docker is working properly and downloads the Open WebUI container
image. This runs on the DGX Spark device and may take several minutes depending on network speed.
Open a new Terminal app from NVIDIA Sync and pull the Open WebUI container image with integrated Ollama on your DGX Spark:
```bash
docker pull ghcr.io/open-webui/open-webui:ollama
@ -204,18 +208,15 @@ Once the container image is downloaded, continue to setup NVIDIA Sync.
## Step 3. Open NVIDIA Sync Settings
- Click on the NVIDIA Sync icon in your system tray or taskbar to open the main application window.
- Click the gear icon in the top right corner to open the Settings window.
- Click on the "Custom" tab to access Custom Ports configuration.
## Step 4. Add Open WebUI custom port
This step creates a new entry in NVIDIA Sync that will manage the Open
WebUI container and create the necessary SSH tunnel.
Click the "Add New" button on the Custom tab.
Fill out the form with these values:
@ -270,22 +271,23 @@ echo "Running. Press Ctrl+C to stop ${NAME}."
while :; do sleep 86400; done
```
Click the "Add" button to save configuration to your DGX Spark.
## Step 5. Launch Open WebUI
This step starts the Open WebUI container on your DGX Spark and establishes the SSH
tunnel. The browser will open automatically if configured correctly.
Click on the NVIDIA Sync icon in your system tray or taskbar to open the main application window.
Under the "Custom" section, click on "Open WebUI".
Your default web browser should automatically open to the Open WebUI interface at `http://localhost:12000`.
> [!TIP]
> On first run, Open WebUI downloads models. This can delay server start and cause the page to fail to load in your browser. Simply wait and refresh the page.
> On future launches it will open quickly.
## Step 6. Create administrator account
To start using Open WebUI you must create an initial administrator account. This is a local account that you will use to access the Open WebUI interface.
In the Open WebUI interface, click the "Get Started" button at the bottom of the screen.
@ -295,14 +297,14 @@ Click the registration button to create your account and access the main interfa
## Step 7. Download and configure a model
Next, download a language model with Ollama and configure it for use in
Open WebUI. This download happens on your DGX Spark device and may take several minutes.
Click on the "Select a model" dropdown in the top left corner of the Open WebUI interface.
Type `gpt-oss:20b` in the search field.
Click the `Pull "gpt-oss:20b" from Ollama.com` button that appears.
Wait for the model download to complete. You can monitor progress in the interface.
@ -310,9 +312,6 @@ Once complete, select "gpt-oss:20b" from the model dropdown.
## Step 8. Test the model
In the chat textarea at the bottom of the Open WebUI interface, enter:
```
@ -331,11 +330,40 @@ Under the "Custom" section, click the `x` icon on the right of the "Open WebUI"
This will close the tunnel and stop the Open WebUI docker container.
## Step 10. Troubleshooting
Common issues and their solutions.
| Symptom | Cause | Fix |
|---------|-------|-----|
| Permission denied on docker ps | User not in docker group | Run Step 1 completely, including terminal restart |
| Browser doesn't open automatically | Auto-open setting disabled | Manually navigate to localhost:12000 |
| Model download fails | Network connectivity issues | Check internet connection, retry download |
| GPU not detected in container | Missing `--gpus=all` flag | Recreate container with correct start script |
| Port 12000 already in use | Another application using port | Change port in Custom App settings or stop conflicting service |
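For the last row, you can confirm whether something is already listening on the port before launching. A bash-only sketch using `/dev/tcp` (12000 is this playbook's default port):

```shell
# Check whether anything is listening on the playbook's default port
PORT=12000
if (exec 3<>"/dev/tcp/127.0.0.1/${PORT}") 2>/dev/null; then
  echo "port ${PORT}: in use - stop the conflicting service or choose another port"
else
  echo "port ${PORT}: free"
fi
```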
## Step 11. Next steps
Try downloading different models from the Ollama library at https://ollama.com/library.
You can monitor GPU and memory usage through the DGX Dashboard available in NVIDIA Sync as you try different models.
If Open WebUI reports an update is available, you can pull the new container image by running this in your terminal:
```bash
docker stop open-webui
docker rm open-webui
docker pull ghcr.io/open-webui/open-webui:ollama
```
After the update, launch Open WebUI again from NVIDIA Sync.
## Step 12. Cleanup and rollback
Steps to completely remove the Open WebUI installation and free up resources:
> [!WARNING]
> These commands will permanently delete all Open WebUI data and downloaded models.
Stop and remove the Open WebUI container:
@ -356,24 +384,8 @@ Remove persistent data volumes:
docker volume rm open-webui open-webui-ollama
```
Remove the Custom App from NVIDIA Sync by opening Settings > Custom tab and deleting the entry.
## Common issues with manual setup
@ -395,7 +407,8 @@ After the update, launch Open WebUI again from NVIDIA Sync.
| GPU not detected in container | Missing `--gpus=all` flag | Recreate container with correct start script |
| Port 12000 already in use | Another application using port | Change port in Custom App settings or stop conflicting service |
> [!NOTE]
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
```

View File

@ -28,55 +28,57 @@ By default, this playbook leverages **Ollama** for local LLM inference, providin
- Knowledge triple extraction from text documents
- Knowledge graph construction and visualization
- **Local-first architecture** with Ollama for LLM inference
- Graph database integration with ArangoDB
- Interactive knowledge graph visualization with Three.js WebGPU
- GPU-accelerated LLM inference with Ollama
- Fully containerized deployment with Docker Compose
- Decomposable and customizable
- Optional NVIDIA API integration for cloud-based models
- Optional vector search and advanced inference capabilities
- Optional graph-based RAG for contextual answers
## Software Components
The following are the default components included in this playbook:
### Core Components (Default)
* **LLM Inference**
* **Ollama**: Local LLM inference with GPU acceleration
* Default model: `llama3.1:8b`
* Supports any Ollama-compatible model
* **Knowledge Graph Database**
* **ArangoDB**: Graph database for storing knowledge triples (entities and relationships)
* Web interface on port 8529
* No authentication required (configurable)
* **Graph Visualization**
* **Three.js WebGPU**: Client-side GPU-accelerated graph rendering
* Optional remote WebGPU clustering for large graphs
* **Frontend & API**
* **Next.js**: Modern React framework with API routes
### Optional Components
* **Vector Database & Embedding** (with `--complete` flag)
* **SentenceTransformer**: Local embedding generation (model: `all-MiniLM-L6-v2`)
* **Pinecone**: Self-hosted vector storage and similarity search
* **Cloud Models** (configure separately)
* **NVIDIA API**: Cloud-based models via NVIDIA API Catalog
## Technical Diagram
### Default Architecture (Minimal Setup)
The core workflow for knowledge graph building and visualization:
1. User uploads documents through the txt2kg web UI
2. Documents are processed and chunked for analysis
3. **Ollama** extracts knowledge triples (subject-predicate-object) from the text using local LLM inference
4. Triples are stored in **ArangoDB** graph database
5. Knowledge graph is visualized with **Three.js WebGPU** rendering in the browser
6. Users can query the graph and generate insights using Ollama
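To make step 3 concrete, a knowledge triple is a subject-predicate-object statement. The rows below are illustrative examples only (the tab-separated layout is for readability, not txt2kg's storage format):

```shell
# Print illustrative subject<TAB>predicate<TAB>object triples
printf '%s\t%s\t%s\n' \
  "Marie Curie" "won" "Nobel Prize in Physics" \
  "Marie Curie" "born_in" "Warsaw" \
  "Nobel Prize in Physics" "first_awarded" "1901"
```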
### Future Enhancements
Additional capabilities can be added:
- **Vector search**: Add semantic similarity search with local Pinecone and SentenceTransformer embeddings
- **S3 storage**: MinIO for scalable document storage
- **GNN-based GraphRAG**: Graph Neural Networks for enhanced retrieval
## GPU-Accelerated LLM Inference
@ -86,7 +88,7 @@ This playbook includes **GPU-accelerated LLM inference** with Ollama:
- **Fully local inference**: No cloud dependencies or API keys required
- **GPU acceleration**: Automatic CUDA support with NVIDIA GPUs
- **Multiple model support**: Use any Ollama-compatible model
- **Optimized inference**: Flash attention, KV cache optimization, and quantization
- **Easy model management**: Pull and switch models with simple commands
- **Privacy-first**: All data processing happens on your hardware
@ -96,21 +98,10 @@ This playbook includes **GPU-accelerated LLM inference** with Ollama:
- Flash attention enabled
- Q8_0 KV cache for memory efficiency
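If you run Ollama outside this playbook's compose files, the same tuning can be applied through environment variables (names match the compose configuration in this repo; values are the defaults above):

```shell
# Ollama tuning used by this playbook, as environment variables
export OLLAMA_FLASH_ATTENTION=1   # enable flash attention
export OLLAMA_KV_CACHE_TYPE=q8_0  # Q8_0 KV cache for memory efficiency
export OLLAMA_KEEP_ALIVE=30m      # keep models loaded between requests
echo "KV cache type: ${OLLAMA_KV_CACHE_TYPE}"
```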
## Software Requirements
**OS Requirements:**
- Ubuntu 22.04 or later
- CUDA 12.0+
- Docker with NVIDIA Container Toolkit
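Before deploying, you can sanity-check the GPU prerequisite from a shell (a sketch that prints a hint instead of failing on machines without an NVIDIA driver):

```shell
# Report GPU driver info if available; print a hint otherwise
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader 2>/dev/null \
    || echo "nvidia-smi present but no GPU visible"
else
  echo "nvidia-smi not found - install the NVIDIA driver and Container Toolkit first"
fi
```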
## Deployment Guide
@ -120,9 +111,7 @@ This playbook includes **GPU-accelerated LLM inference** with Ollama:
The default configuration uses:
- Local Ollama (no API key needed)
- Local ArangoDB (no authentication by default)
Optional environment variables for customization:
```bash
@ -150,7 +139,6 @@ cd txt2kg
That's it! No configuration needed. The script will:
- Start all required services with Docker Compose
- Set up ArangoDB database
- Launch Ollama with GPU acceleration
- Start the Next.js frontend
@ -168,8 +156,6 @@ docker exec ollama-compose ollama pull llama3.1:8b
- **Switch Ollama models**: Use any model from Ollama's library (Llama, Mistral, Qwen, etc.)
- **Modify extraction prompts**: Customize how triples are extracted from text
- **Adjust embedding parameters**: Change the SentenceTransformer model
- **Implement custom entity relationships**: Define domain-specific relationship types
- **Add domain-specific knowledge sources**: Integrate external ontologies or taxonomies
- **Use NVIDIA API**: Connect to cloud models for specific use cases
@ -177,4 +163,4 @@ docker exec ollama-compose ollama pull llama3.1:8b
[MIT](LICENSE)
This project will download and install additional third-party open source software projects and containers.

View File

@ -5,34 +5,63 @@ This directory contains all deployment-related configuration for the txt2kg proj
## Structure
- **compose/**: Docker Compose files for local development and testing
- `docker-compose.yml`: Minimal Docker Compose configuration (Ollama + ArangoDB + Next.js)
- `docker-compose.complete.yml`: Complete stack with optional services (vLLM, Pinecone, Sentence Transformers)
- `docker-compose.optional.yml`: Additional optional services
- `docker-compose.vllm.yml`: Legacy vLLM configuration (use `--complete` flag instead)
- **docker/**: Docker-related files
- Dockerfile
- Initialization scripts for services
- **app/**: Frontend application Docker configuration
- Dockerfile for Next.js application
- **services/**: Containerized services
- **ollama/**: Ollama LLM inference service with GPU support
- **sentence-transformers/**: Sentence transformer service for embeddings (optional)
- **vllm/**: vLLM inference service with FP8 quantization (optional)
- **gpu-viz/**: GPU-accelerated graph visualization services (optional, run separately)
- **gnn_model/**: Graph Neural Network model service (experimental, not in default compose files)
## Usage
**Recommended: Use the start script**
```bash
# Minimal setup (Ollama + ArangoDB + Next.js frontend)
./start.sh
# Complete stack (includes vLLM, Pinecone, Sentence Transformers)
./start.sh --complete
# Development mode (run frontend without Docker)
./start.sh --dev-frontend
```
**Manual Docker Compose commands:**
To start the minimal services:
```bash
docker compose -f deploy/compose/docker-compose.yml up -d
```
To start the complete stack:
```bash
docker compose -f deploy/compose/docker-compose.complete.yml up -d
```
## Services Included
### Minimal Stack (default)
- **Next.js App**: Web UI on port 3001
- **ArangoDB**: Graph database on port 8529
- **Ollama**: Local LLM inference on port 11434
### Complete Stack (`--complete` flag)
All minimal services plus:
- **vLLM**: Advanced LLM inference on port 8001
- **Pinecone (Local)**: Vector embeddings on port 5081
- **Sentence Transformers**: Embedding generation on port 8000
### Optional Services (run separately)
- **GPU-Viz Services**: See `services/gpu-viz/README.md` for GPU-accelerated visualization
- **GNN Model Service**: See `services/gnn_model/README.md` for experimental GNN-based RAG

View File

@ -1,5 +1,3 @@
services:
app:
build:
@ -19,23 +17,26 @@ services:
- MODEL_NAME=all-MiniLM-L6-v2
- GRPC_SSL_CIPHER_SUITES=HIGH+ECDSA:HIGH+aRSA
- NODE_TLS_REJECT_UNAUTHORIZED=0
- OLLAMA_BASE_URL=http://ollama:11434/v1
- OLLAMA_MODEL=llama3.1:8b
- VLLM_BASE_URL=http://vllm:8001/v1
- VLLM_MODEL=meta-llama/Llama-3.2-3B-Instruct
- REMOTE_WEBGPU_SERVICE_URL=http://txt2kg-remote-webgpu:8083
# Node.js timeout configurations for large model processing
- NODE_OPTIONS=--max-http-header-size=80000
- UV_THREADPOOL_SIZE=128
- HTTP_TIMEOUT=1800000
- REQUEST_TIMEOUT=1800000
networks:
- default
- txt2kg-network
depends_on:
- arangodb
- ollama
- vllm
arangodb:
image: arangodb:latest
@ -89,52 +90,93 @@ services:
networks:
- default
ollama:
build:
context: ../services/ollama
dockerfile: Dockerfile
image: ollama-custom:latest
container_name: ollama-compose
ports:
- '11434:11434'
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_FLASH_ATTENTION=1
- OLLAMA_KEEP_ALIVE=30m
- OLLAMA_CUDA=1
- OLLAMA_LLM_LIBRARY=cuda
- OLLAMA_NUM_PARALLEL=1
- OLLAMA_MAX_LOADED_MODELS=1
- OLLAMA_KV_CACHE_TYPE=q8_0
- OLLAMA_GPU_LAYERS=999
- OLLAMA_GPU_MEMORY_FRACTION=0.9
- CUDA_VISIBLE_DEVICES=0
networks:
- default
restart: unless-stopped
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
vllm:
build:
context: ../../deploy/services/vllm
dockerfile: Dockerfile
container_name: vllm-service
ports:
- '8001:8001'
environment:
- VLLM_MODEL=meta-llama/Llama-3.2-3B-Instruct
- VLLM_TENSOR_PARALLEL_SIZE=1
- VLLM_MAX_MODEL_LEN=4096
- VLLM_GPU_MEMORY_UTILIZATION=0.9
- VLLM_QUANTIZATION=fp8
- VLLM_KV_CACHE_DTYPE=fp8
- VLLM_PORT=8001
- VLLM_HOST=0.0.0.0
- CUDA_VISIBLE_DEVICES=0
- NCCL_DEBUG=INFO
volumes:
- vllm_models:/app/models
- /tmp:/tmp
- ~/.cache/huggingface:/root/.cache/huggingface
networks:
- default
restart: unless-stopped
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8001/v1/models"]
interval: 30s
timeout: 10s
retries: 5
start_period: 120s
volumes:
arangodb_data:
arangodb_apps_data:
ollama_data:
vllm_models:
networks:
default:
driver: bridge
txt2kg-network:
driver: bridge

View File

@ -0,0 +1,86 @@
services:
app:
environment:
- PINECONE_HOST=entity-embeddings
- PINECONE_PORT=5081
- PINECONE_API_KEY=pclocal
- PINECONE_ENVIRONMENT=local
- SENTENCE_TRANSFORMER_URL=http://sentence-transformers:80
- MODEL_NAME=all-MiniLM-L6-v2
- VLLM_BASE_URL=http://vllm:8001/v1
- VLLM_MODEL=meta-llama/Llama-3.2-3B-Instruct
networks:
- pinecone-net
depends_on:
- entity-embeddings
- sentence-transformers
- vllm
entity-embeddings:
image: ghcr.io/pinecone-io/pinecone-index:latest
container_name: entity-embeddings
environment:
PORT: 5081
INDEX_TYPE: serverless
VECTOR_TYPE: dense
DIMENSION: 384
METRIC: cosine
INDEX_NAME: entity-embeddings
ports:
- "5081:5081"
platform: linux/amd64
networks:
- pinecone-net
restart: unless-stopped
sentence-transformers:
build:
context: ../../deploy/services/sentence-transformers
dockerfile: Dockerfile
ports:
- '8000:80'
environment:
- MODEL_NAME=all-MiniLM-L6-v2
networks:
- default
vllm:
build:
context: ../../deploy/services/vllm
dockerfile: Dockerfile
container_name: vllm-service
ports:
- '8001:8001'
environment:
- VLLM_MODEL=meta-llama/Llama-3.2-3B-Instruct
- VLLM_TENSOR_PARALLEL_SIZE=1
- VLLM_MAX_MODEL_LEN=4096
- VLLM_GPU_MEMORY_UTILIZATION=0.9
- VLLM_QUANTIZATION=fp8
- VLLM_KV_CACHE_DTYPE=fp8
- VLLM_PORT=8001
- VLLM_HOST=0.0.0.0
volumes:
- vllm_models:/app/models
- /tmp:/tmp
networks:
- default
restart: unless-stopped
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8001/v1/models"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
volumes:
vllm_models:
networks:
pinecone-net:
name: pinecone

View File

@ -1,3 +1,7 @@
# This is a legacy file - use the --complete flag instead
# The vLLM service is now included in docker-compose.optional.yml
# This file is kept for backwards compatibility
services:
app:
build:

View File

@ -8,20 +8,11 @@ services:
environment:
- ARANGODB_URL=http://arangodb:8529
- ARANGODB_DB=txt2kg
- GRPC_SSL_CIPHER_SUITES=HIGH+ECDSA:HIGH+aRSA
- NODE_TLS_REJECT_UNAUTHORIZED=0
- OLLAMA_BASE_URL=http://ollama:11434/v1
- OLLAMA_MODEL=llama3.1:8b
- REMOTE_WEBGPU_SERVICE_URL=http://txt2kg-remote-webgpu:8083
# Node.js timeout configurations for large model processing
- NODE_OPTIONS=--max-http-header-size=80000
@ -29,9 +20,11 @@ services:
- HTTP_TIMEOUT=1800000
- REQUEST_TIMEOUT=1800000
networks:
- default
- txt2kg-network
depends_on:
- arangodb
- ollama
arangodb:
image: arangodb:latest
ports:
@ -54,32 +47,6 @@ services:
echo 'Creating txt2kg database...' &&
arangosh --server.endpoint tcp://arangodb:8529 --server.authentication false --javascript.execute-string 'try { db._createDatabase(\"txt2kg\"); console.log(\"Database txt2kg created successfully!\"); } catch(e) { if(e.message.includes(\"duplicate\")) { console.log(\"Database txt2kg already exists\"); } else { throw e; } }'
"
ollama:
build:
context: ../services/ollama
@ -117,52 +84,14 @@ services:
timeout: 10s
retries: 3
start_period: 60s
volumes:
arangodb_data:
arangodb_apps_data:
ollama_data:
networks:
default:
driver: bridge
txt2kg-network:
driver: bridge

View File

@ -1,44 +1,66 @@
# GNN Model Service (Experimental)
**Status**: This is an experimental service for serving Graph Neural Network models trained for enhanced RAG retrieval.
**Note**: This service is **not included** in the default docker-compose configurations and must be deployed separately.
## Overview
The service exposes a simple API to:
- Load a pre-trained GNN model that combines graph structures with language models
- Process queries by incorporating graph-structured knowledge
- Return predictions that leverage both text and graph relationships
This service provides a REST API for serving predictions from a Graph Neural Network (GNN) model that enhances knowledge graph retrieval:
- Load pre-trained GNN models (GAT architecture)
- Process queries with graph-structured knowledge
- Combine GNN embeddings with LLM generation
- Compare GNN-based retrieval vs traditional RAG
## Getting Started
### Prerequisites
- Docker and Docker Compose
- The trained model file (created using `train_export.py`)
- Python 3.8+
- PyTorch and PyTorch Geometric
- A trained model file (created using `train_export.py` in `scripts/gnn/`)
- Docker (optional)
### Running the Service
### Training the Model
The service is included in the main docker-compose configuration. Simply run:
Before using the service, you must train a GNN model using the training pipeline:
```bash
docker-compose up -d
```
# See scripts/gnn/README.md for full instructions
This will start the GNN model service along with other services in the system.
# 1. Preprocess data from ArangoDB
python scripts/gnn/preprocess_data.py --use_arango --output_dir ./output
## Training the Model
# 2. Train the model
python scripts/gnn/train_test_gnn.py --output_dir ./output
Before using the service, you need to train the GNN model:
```bash
# Create the models directory if it doesn't exist
mkdir -p models
# Run the training script
# 3. Export model for serving
python deploy/services/gnn_model/train_export.py --output_dir models
```
This will create the `tech-qa-model.pt` file in the models directory, which the service will load.
This creates the `tech-qa-model.pt` file needed by the service.
### Running the Service
#### Option A: Direct Python
```bash
cd deploy/services/gnn_model
pip install -r requirements.txt
python app.py
```
Service runs on: http://localhost:5000
#### Option B: Docker
```bash
cd deploy/services/gnn_model
docker build -t gnn-model-service .
docker run -p 5000:5000 -v $(pwd)/models:/app/models gnn-model-service
```
## API Endpoints
The GNN model service uses:
- A Language Model (LLM) to generate answers
- A combined architecture (GRetriever) that leverages both components
## Integration with txt2kg

To integrate this service with the main txt2kg application:

1. Train a model using the GNN training pipeline
2. Deploy the GNN service on a separate port
3. Update the frontend to call the GNN service endpoints
4. Compare GNN-enhanced retrieval vs standard RAG

## Current Status

This is an experimental feature. The service code exists but requires:

- A trained GNN model
- Integration with the frontend query pipeline
- Graph construction from txt2kg knowledge graphs
- Performance benchmarking vs traditional RAG

## Future Enhancements

- Docker Compose integration for easier deployment
- Automatic model training from txt2kg graphs
- Real-time model updates as graphs grow
- Comparison UI in the frontend

# GPU Rendering Library Options for Remote Visualization
## 🎯 **Yes! Three.js is Perfect for Adding GPU Rendering**
Your existing **Three.js v0.176.0** stack is ideal for adding true GPU-accelerated WebGL rendering to the remote service. Here's a comprehensive comparison of options:
## 🚀 **Option 1: Three.js (Recommended)**
### **Why Three.js is Perfect**
- ✅ **Already in your stack** - Three.js v0.176.0 in package.json
- ✅ **Mature WebGL abstraction** - Handles GPU complexity
- ✅ **InstancedMesh for performance** - Single draw call for millions of nodes
- ✅ **Built-in optimizations** - Frustum culling, LOD, memory management
- ✅ **Easy development** - High-level API, good documentation
### **Three.js GPU Features for Graph Rendering**
#### **1. InstancedMesh for Mass Node Rendering**
```javascript
// Single GPU draw call for 100k+ nodes
const geometry = new THREE.CircleGeometry(1, 8);
const material = new THREE.MeshBasicMaterial({ vertexColors: true });
const instancedMesh = new THREE.InstancedMesh(geometry, material, nodeCount);
// Set position, scale, color for each instance
const matrix = new THREE.Matrix4();
const color = new THREE.Color();
nodes.forEach((node, i) => {
matrix.makeScale(node.size, node.size, 1);
matrix.setPosition(node.x, node.y, 0);
instancedMesh.setMatrixAt(i, matrix);
color.setHex(node.clusterColor);
instancedMesh.setColorAt(i, color);
});
// GPU renders all nodes in one call
scene.add(instancedMesh);
```
#### **2. BufferGeometry for Edge Performance**
```javascript
// GPU-optimized edge rendering
const positions = new Float32Array(edgeCount * 6);
const colors = new Float32Array(edgeCount * 6);
edges.forEach((edge, i) => {
const idx = i * 6;
// Source vertex
positions[idx] = edge.source.x;
positions[idx + 1] = edge.source.y;
// Target vertex
positions[idx + 3] = edge.target.x;
positions[idx + 4] = edge.target.y;
});
const geometry = new THREE.BufferGeometry();
geometry.setAttribute('position', new THREE.BufferAttribute(positions, 3));
geometry.setAttribute('color', new THREE.BufferAttribute(colors, 3));
const lineSegments = new THREE.LineSegments(geometry, material);
```
#### **3. Built-in Performance Optimizations**
```javascript
// Three.js GPU optimizations
renderer.sortObjects = false; // Disable expensive sorting
renderer.setPixelRatio(Math.min(devicePixelRatio, 2)); // Limit pixel density
// Frustum culling (automatic)
// Level-of-detail (LOD) support
// Automatic geometry merging
// GPU texture atlasing
```
### **Performance Comparison**
| Approach | 10k Nodes | 100k Nodes | 1M Nodes | FPS |
|----------|-----------|------------|----------|-----|
| **D3.js SVG** | ✅ Good | ❌ Slow | ❌ Unusable | 15fps |
| **Three.js Standard** | ✅ Excellent | ✅ Good | ❌ Slow | 45fps |
| **Three.js Instanced** | ✅ Excellent | ✅ Excellent | ✅ Good | 60fps |
## 🔧 **Option 2: deck.gl (For Data-Heavy Visualizations)**
### **Pros**
- ✅ **Built for large datasets** - Optimized for millions of points
- ✅ **WebGL2 compute shaders** - True GPU computation
- ✅ **Built-in graph layouts** - Force-directed on GPU
- ✅ **Excellent performance** - 1M+ nodes at 60fps
### **Cons**
- ❌ **Large bundle size** - Adds ~500KB
- ❌ **Complex API** - Steeper learning curve
- ❌ **React-focused** - Less suitable for iframe embedding
```javascript
// deck.gl GPU-accelerated approach
import { ScatterplotLayer, LineLayer } from '@deck.gl/layers';
const nodeLayer = new ScatterplotLayer({
data: nodes,
getPosition: d => [d.x, d.y],
getRadius: d => d.size,
getFillColor: d => d.color,
radiusUnits: 'pixels',
// GPU instancing automatically enabled
});
const edgeLayer = new LineLayer({
data: edges,
getSourcePosition: d => [d.source.x, d.source.y],
getTargetPosition: d => [d.target.x, d.target.y],
getColor: [100, 100, 100],
getWidth: 1
});
```
## ⚡ **Option 3: regl (Raw WebGL Performance)**
### **Pros**
- ✅ **Maximum performance** - Direct WebGL access
- ✅ **Small bundle** - ~50KB
- ✅ **Full control** - Custom shaders, compute pipelines
- ✅ **Functional API** - Clean, predictable
### **Cons**
- ❌ **Low-level complexity** - Manual memory management
- ❌ **Shader development** - GLSL programming required
- ❌ **More development time** - Everything custom
```javascript
// regl direct WebGL approach
const drawNodes = regl({
vert: `
attribute vec2 position;
attribute float size;
attribute vec3 color;
varying vec3 vColor;
void main() {
gl_Position = vec4(position, 0, 1);
gl_PointSize = size;
vColor = color;
}
`,
frag: `
precision mediump float;
varying vec3 vColor;
void main() {
gl_FragColor = vec4(vColor, 1);
}
`,
attributes: {
position: nodePositions,
size: nodeSizes,
color: nodeColors
},
count: nodeCount,
primitive: 'points'
});
```
## 🎮 **Option 4: WebGPU (Future-Proof)**
### **Pros**
- ✅ **Next-generation API** - Successor to WebGL
- ✅ **Compute shaders** - True parallel processing
- ✅ **Better performance** - Lower overhead
- ✅ **Multi-threading** - Parallel command buffers
### **Cons**
- ❌ **Limited browser support** - Chrome/Edge only (2024)
- ❌ **New API** - Rapidly changing specification
- ❌ **Complex setup** - More verbose than WebGL
```javascript
// WebGPU approach (future)
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();
const computePipeline = device.createComputePipeline({
compute: {
module: device.createShaderModule({
code: `
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) global_id : vec3<u32>) {
let index = global_id.x;
if (index >= arrayLength(&positions)) { return; }
// GPU-parallel force calculation
var force = vec2<f32>(0.0, 0.0);
for (var i = 0u; i < arrayLength(&positions); i++) {
if (i != index) {
let diff = positions[index] - positions[i];
let dist = length(diff);
force += normalize(diff) * (1.0 / (dist * dist));
}
}
velocities[index] += force * 0.01;
positions[index] += velocities[index] * 0.1;
}
`
}),
entryPoint: 'main'
}
});
```
## 🏆 **Recommendation: Three.js Integration**
### **For Your Use Case, Three.js is Optimal Because:**
1. **Already Available** - No new dependencies
2. **Proven Performance** - Handles 100k+ nodes smoothly
3. **Easy Integration** - Replace D3.js rendering with Three.js
4. **Maintenance** - Well-documented, stable API
5. **Development Speed** - Rapid implementation
### **Implementation Strategy**
#### **Phase 1: Basic Three.js WebGL (Week 1)**
```python
# Enhanced remote service with Three.js
def _generate_threejs_html(self, session_data, config):
return f"""
<script src="https://cdnjs.cloudflare.com/ajax/libs/three.js/0.176.0/three.min.js"></script>
<script>
// Basic Three.js WebGL rendering
const renderer = new THREE.WebGLRenderer({{
powerPreference: "high-performance"
}});
const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(75, width/height, 0.1, 1000);
// Render nodes and edges with GPU
createNodeVisualization();
createEdgeVisualization();
</script>
"""
```
#### **Phase 2: GPU Optimization (Week 2)**
- Add InstancedMesh for node rendering
- Implement BufferGeometry for edges
- Enable frustum culling and LOD
#### **Phase 3: Advanced Features (Week 3)**
- GPU-based interaction (raycasting)
- Smooth camera controls
- Real-time layout animation
### **Expected Performance Improvements**
| Feature | D3.js SVG | Three.js WebGL | Improvement |
|---------|-----------|----------------|-------------|
| **50k nodes** | 5 FPS | 60 FPS | **12x faster** |
| **Animation** | Choppy | Smooth | **Fluid motion** |
| **Memory usage** | 200MB DOM | 50MB GPU | **4x less memory** |
| **Interaction** | Laggy | Responsive | **Real-time** |
## 💡 **Implementation Roadmap**
### **Step 1: Replace HTML Template**
```python
# In remote_gpu_rendering_service.py
def _generate_interactive_html(self, session_data, config):
if config.get('use_webgl', True):
return self._generate_threejs_webgl_html(session_data, config)
else:
return self._generate_d3_svg_html(session_data, config) # Fallback
```
### **Step 2: Add WebGL Configuration**
```typescript
// In RemoteGPUViewer component
const processWithWebGLOptimization = async () => {
const config = {
use_webgl: nodeCount > 5000,
instanced_rendering: nodeCount > 10000,
lod_enabled: nodeCount > 25000,
render_quality: 'high'
};
// Process with enhanced GPU service
};
```
### **Step 3: Performance Monitoring**
```javascript
// Built-in Three.js performance monitoring
console.log('Render Info:', {
triangles: renderer.info.render.triangles,
calls: renderer.info.render.calls,
geometries: renderer.info.memory.geometries,
textures: renderer.info.memory.textures
});
```
**Result**: Your remote GPU service will provide **true GPU-accelerated rendering** with minimal development effort by leveraging your existing Three.js stack.

# JavaScript Library Stack Integration with Remote GPU Rendering
## 🚀 **Library Architecture Overview**
Your project leverages a sophisticated JavaScript stack optimized for graph visualization performance:
### **Core Visualization Libraries**
```json
{
"3d-force-graph": "^1.77.0", // WebGL 3D graph rendering
"three": "^0.176.0", // WebGL/WebGPU 3D engine
"d3": "^7.9.0", // Data binding & force simulation
"@types/d3": "^7.4.3", // TypeScript definitions
"@types/three": "^0.175.0" // Three.js TypeScript support
}
```
### **Frontend Framework**
```json
{
"next": "15.1.0", // React framework with SSR
"react": "^19", // Component architecture
"tailwindcss": "^3.4.17" // Utility-first CSS
}
```
## 🎯 **Performance Optimization Strategies**
### **1. Dynamic Import Strategy**
**Problem:** Large visualization libraries increase initial bundle size
**Solution:** Conditional loading based on graph complexity
```typescript
// ForceGraphWrapper.tsx - Dynamic loading pattern
const ForceGraph3D = (await import('3d-force-graph')).default;
// Benefits:
// - Reduces initial bundle by ~2MB
// - Enables GPU capability detection
// - Prevents SSR WebGL conflicts
```
### **2. GPU Capability Detection**
**Enhanced detection based on your library capabilities:**
```typescript
const shouldUseRemoteRendering = (nodeCount: number) => {
  const hasWebGPU = 'gpu' in navigator;
  const maxWebGLNodes = window.WebGL2RenderingContext ? 50000 : 10000;
  const maxWebGPUNodes = hasWebGPU ? 100000 : 25000;
  // Three.js geometry memory estimate (~64 bytes per node)
  const estimatedMemoryMB = (nodeCount * 64) / (1024 * 1024);
  const maxClientMemory = hasWebGPU ? 512 : 256; // MB
  return nodeCount > maxWebGLNodes || estimatedMemoryMB > maxClientMemory;
};
```
### **3. Library-Specific Optimizations**
#### **Three.js Renderer Settings**
```typescript
const optimizeForThreeJS = (nodeCount: number) => ({
// Instanced rendering for large graphs
instance_rendering: nodeCount > 10000,
// Texture optimization
texture_atlasing: nodeCount > 5000,
max_texture_size: nodeCount > 25000 ? 2048 : 1024,
// Performance culling
frustum_culling: nodeCount > 15000,
occlusion_culling: nodeCount > 25000,
// Level-of-detail for distant nodes
enable_lod: nodeCount > 25000
});
```
#### **D3.js Force Simulation Tuning**
```typescript
const optimizeForD3 = (nodeCount: number) => ({
// Reduced iterations for large graphs
physics_iterations: nodeCount > 50000 ? 100 : 300,
// Faster convergence
alpha_decay: nodeCount > 50000 ? 0.05 : 0.02,
// More damping for stability
velocity_decay: nodeCount > 50000 ? 0.6 : 0.4
});
```
## 🔧 **Remote GPU Service Integration**
### **Enhanced HTML Template Generation**
The remote GPU service now generates HTML compatible with your frontend:
```python
def _generate_interactive_html(self, session_data: dict, config: dict) -> str:
html_template = f"""
<!-- Using D3.js v7.9.0 consistent with frontend -->
<script src="https://d3js.org/d3.v7.min.js"></script>
<script>
// Configuration matching your library versions
const config = {{
d3_version: "7.9.0", // Match package.json
threejs_version: "0.176.0", // Match package.json
force_graph_version: "1.77.0", // Match package.json
// Performance settings based on render quality
maxParticles: {settings['particles']},
lineWidth: {settings['line_width']},
nodeDetail: {settings['node_detail']}
}};
// D3 force simulation with GPU-optimized parameters
this.simulation = d3.forceSimulation()
.force("link", d3.forceLink().id(d => d.id).distance(60))
.force("charge", d3.forceManyBody().strength(-120))
.force("center", d3.forceCenter(this.width / 2, this.height / 2))
.alphaDecay(0.02)
.velocityDecay(0.4);
</script>
"""
```
### **Frontend Component Integration**
```typescript
// RemoteGPUViewer.tsx - Library-aware processing
const processGraphWithLibraryOptimization = async () => {
const optimizedConfig = {
// Frontend library compatibility
d3_version: "7.9.0",
threejs_version: "0.176.0",
force_graph_version: "1.77.0",
// WebGL optimization features
webgl_features: {
instance_rendering: nodeCount > 10000,
texture_atlasing: nodeCount > 5000,
frustum_culling: nodeCount > 15000
},
// Performance tuning
progressive_loading: nodeCount > 25000,
gpu_memory_management: true
};
const response = await fetch('/api/render', {
method: 'POST',
body: JSON.stringify({ graph_data, config: optimizedConfig })
});
};
```
## 📊 **Performance Benchmarks by Library Stack**
### **Client-Side Rendering Limits**
| Library Stack | Max Nodes | Memory Usage | Performance |
|---------------|-----------|--------------|-------------|
| **D3.js + SVG** | 5,000 | ~50MB | Good interaction |
| **Three.js + WebGL** | 50,000 | ~256MB | Smooth 60fps |
| **Three.js + WebGPU** | 100,000 | ~512MB | GPU-accelerated |
| **Remote GPU** | 1M+ | ~100KB transfer | Server-rendered |
### **Rendering Strategy Decision Tree**
```typescript
const selectRenderingStrategy = (nodeCount: number) => {
  const hasWebGPU = 'gpu' in navigator;
  if (nodeCount < 5000) {
    return "local_svg";      // D3.js + SVG DOM
  } else if (nodeCount < 25000) {
    return "local_webgl";    // Three.js + WebGL
  } else if (nodeCount < 100000 && hasWebGPU) {
    return "local_webgpu";   // Three.js + WebGPU
  } else {
    return "remote_gpu";     // Remote cuGraph + GPU
  }
};
```
## 🚀 **Advanced Integration Features**
### **1. Progressive Loading**
```typescript
// For graphs >25k nodes, enable progressive loading
if (nodeCount > 25000) {
config.progressive_loading = true;
config.initial_load_size = 10000; // Load first 10k nodes
config.batch_size = 5000; // Load 5k at a time
}
```
### **2. WebSocket Real-time Updates**
```typescript
// Real-time parameter updates via WebSocket
const updateLayoutAlgorithm = (algorithm: string) => {
if (wsRef.current?.readyState === WebSocket.OPEN) {
wsRef.current.send(JSON.stringify({
type: "update_params",
layout_algorithm: algorithm
}));
}
};
```
### **3. Memory-Aware Quality Settings**
```typescript
const adjustQuality = (availableMemory: number, nodeCount: number) => {
if (availableMemory < 256) return "low"; // Mobile devices
if (availableMemory < 512) return "medium"; // Standard devices
if (nodeCount > 100000) return "high"; // Large graphs
return "ultra"; // High-end systems
};
```
## 💡 **Best Practices for Your Stack**
### **1. Bundle Optimization**
- Use dynamic imports for 3D libraries
- Lazy load based on graph size detection
- Implement service worker caching for repeated visualizations
### **2. Memory Management**
```typescript
// Cleanup Three.js resources
const cleanup = () => {
if (graphRef.current) {
graphRef.current.scene?.traverse((object) => {
if (object.geometry) object.geometry.dispose();
if (object.material) object.material.dispose();
});
graphRef.current.renderer?.dispose();
}
};
```
### **3. Responsive Rendering**
```typescript
// Adjust complexity based on device capabilities
const getDeviceCapabilities = () => ({
memory: (navigator as any).deviceMemory || 4, // GB
cores: navigator.hardwareConcurrency || 4,
gpu: 'gpu' in navigator ? 'webgpu' : 'webgl'
});
```
## 🎯 **Integration Results**
- ✅ **Seamless fallback** between local and remote rendering
- ✅ **Library version consistency** across client and server
- ✅ **Memory-aware quality adjustment** based on device capabilities
- ✅ **Progressive enhancement** from SVG → WebGL → WebGPU → Remote GPU
- ✅ **Real-time parameter updates** via WebSocket
- ✅ **Zero-config optimization** based on graph complexity
This integration provides the best of both worlds: the interactivity of your existing Three.js/D3.js stack for smaller graphs, and the scalability of remote GPU processing for large-scale visualizations.

# GPU Graph Visualization Services

## 🚀 Overview

This directory contains optional GPU-accelerated graph visualization services that run separately from the main txt2kg application. These services provide advanced visualization capabilities for large-scale graphs.

**Note**: These services are **optional** and not included in the default docker-compose configurations. They must be run separately.

## 📦 Available Services

### 1. Unified GPU Service (`unified_gpu_service.py`)

Combines **PyGraphistry Cloud** and **Local GPU (cuGraph)** processing into a single FastAPI service.

**Processing Modes:**

| Mode | Description | Requirements |
|------|-------------|--------------|
| **PyGraphistry Cloud** | Interactive GPU embeds in browser | API credentials |
| **Local GPU (cuGraph)** | Full GPU processing on your hardware | NVIDIA GPU + cuGraph |
| **Local CPU** | NetworkX fallback processing | None |

### 2. Remote GPU Rendering Service (`remote_gpu_rendering_service.py`)

Provides GPU-accelerated graph layout and rendering with iframe-embeddable visualizations.

### 3. Local GPU Service (`local_gpu_viz_service.py`)

Local GPU processing service with WebSocket support for real-time updates.

## 🛠️ Setup

### Prerequisites

- NVIDIA GPU with CUDA support (for GPU modes)
- RAPIDS cuGraph (for local GPU processing)
- PyGraphistry account (for cloud mode)

### Installation

```bash
# For PyGraphistry Cloud features (optional)
export GRAPHISTRY_PERSONAL_KEY="your_personal_key"
export GRAPHISTRY_SECRET_KEY="your_secret_key"

# Install dependencies
pip install -r deploy/services/gpu-viz/requirements.txt

# For remote WebGPU service
pip install -r deploy/services/gpu-viz/requirements-remote-webgpu.txt
```

### Running Services

#### Unified GPU Service

```bash
cd deploy/services/gpu-viz
python unified_gpu_service.py
```

Service runs on: http://localhost:8080

#### Remote GPU Rendering Service

```bash
cd deploy/services/gpu-viz
python remote_gpu_rendering_service.py
```

Service runs on: http://localhost:8082

#### Using Startup Script

```bash
cd deploy/services/gpu-viz
./start_remote_gpu_services.sh
```
## 📡 API Usage
## 🎯 Frontend Integration

### React Component Usage

The txt2kg frontend includes built-in components for GPU visualization:

- `UnifiedGPUViewer`: Connects to unified GPU service
- `PyGraphistryViewer`: Direct PyGraphistry cloud integration
- `ForceGraphWrapper`: Three.js WebGPU visualization (default)

```tsx
import { UnifiedGPUViewer } from '@/components/unified-gpu-viewer'

function MyApp() {
  const graphData = {
    nodes: [...],
    links: [...]
  }

  return (
    <UnifiedGPUViewer
      graphData={graphData}
      onError={(error) => console.error(error)}
    />
  )
}
```

### Using GPU Services in Frontend

The frontend has API routes that can connect to these services:

- `/api/pygraphistry/*`: PyGraphistry integration
- `/api/unified-gpu/*`: Unified GPU service integration

To use these services, ensure they are running separately and configure the frontend environment variables accordingly.
### Mode-Specific Processing

# True GPU Rendering vs Current Approach
## 🎯 **Current Remote GPU Service**
### **What Uses GPU (✅)**
- **Graph Layout**: cuGraph Force Atlas 2, Spectral Layout
- **Clustering**: cuGraph Leiden, Louvain algorithms
- **Centrality**: cuGraph PageRank, Betweenness Centrality
- **Data Processing**: Node positioning, edge bundling
### **What Uses CPU (❌)**
- **Visual Rendering**: D3.js SVG/Canvas drawing
- **Animation**: D3.js transitions and transforms
- **Interaction**: DOM event handling, hover, zoom
- **Text Rendering**: Node labels, tooltips
## 🔥 **True GPU Rendering (Like PyGraphistry)**
### **What Would Need GPU Acceleration**
#### **1. WebGL Compute Shaders**
```glsl
// Vertex shader for node positioning
attribute vec2 position;
attribute float size;
attribute vec3 color;
uniform mat4 projectionMatrix;
uniform float time;
void main() {
// GPU-accelerated node positioning
vec2 pos = position + computeForceLayout(time);
gl_Position = projectionMatrix * vec4(pos, 0.0, 1.0);
gl_PointSize = size;
}
```
#### **2. GPU Particle Systems**
```javascript
// WebGL-based node rendering
class GPUNodeRenderer {
constructor(gl, nodeCount) {
this.nodeCount = nodeCount;
// Create vertex buffers for GPU processing
this.positionBuffer = gl.createBuffer();
this.colorBuffer = gl.createBuffer();
this.sizeBuffer = gl.createBuffer();
// Compile GPU shaders
this.program = this.createShaderProgram(gl);
}
render(nodes) {
// Update GPU buffers - no CPU iteration
gl.bindBuffer(gl.ARRAY_BUFFER, this.positionBuffer);
gl.bufferData(gl.ARRAY_BUFFER, new Float32Array(positions), gl.DYNAMIC_DRAW);
// GPU draws all nodes in single call
gl.drawArrays(gl.POINTS, 0, this.nodeCount);
}
}
```
#### **3. GPU-Based Interaction**
```javascript
// GPU picking for node selection
class GPUPicker {
constructor(gl, nodeCount) {
// Render nodes to off-screen framebuffer with unique colors
this.pickingFramebuffer = gl.createFramebuffer();
this.pickingTexture = gl.createTexture();
}
getNodeAtPosition(x, y) {
// Read single pixel from GPU framebuffer
const pixel = new Uint8Array(4);
gl.readPixels(x, y, 1, 1, gl.RGBA, gl.UNSIGNED_BYTE, pixel);
// Decode node ID from color
return this.colorToNodeId(pixel);
}
}
```
## 📊 **Performance Comparison**
### **Current D3.js CPU Rendering**
```javascript
// CPU-bound operations
nodes.forEach(node => {
// For each node, update DOM element
d3.select(`#node-${node.id}`)
.attr("cx", node.x)
.attr("cy", node.y)
.attr("r", node.size);
});
// Performance: O(n) DOM operations
// 10k nodes = 10k DOM updates per frame
// Maximum ~60fps with heavy optimization
```
### **GPU WebGL Rendering**
```javascript
// GPU-accelerated operations
class GPURenderer {
updateNodes(nodeData) {
// Single buffer update for all nodes
gl.bufferSubData(gl.ARRAY_BUFFER, 0, nodeData);
// Single draw call for all nodes
gl.drawArraysInstanced(gl.TRIANGLES, 0, 6, nodeCount);
}
}
// Performance: O(1) GPU operations
// 1M nodes = 1 GPU draw call
// Can maintain 60fps with millions of nodes
```
## 🛠️ **Implementation Options**
### **Option 1: WebGL2 + Compute-Style Shaders**

Note: browser WebGL2 does not actually expose compute shaders (`COMPUTE_SHADER` existed only in the abandoned WebGL 2.0 Compute draft); in practice this stage is implemented with transform feedback or fragment-shader GPGPU. The sketch below is illustrative of the idea:
```html
<!-- Enhanced HTML template with WebGL -->
<canvas id="gpu-canvas" width="800" height="600"></canvas>
<script>
const canvas = document.getElementById('gpu-canvas');
const gl = canvas.getContext('webgl2');
// Load compute shaders for layout animation
const computeShader = gl.createShader(gl.COMPUTE_SHADER);
gl.shaderSource(computeShader, computeShaderSource);
// Render loop using GPU
function animate() {
// Update node positions on GPU
gl.useProgram(computeProgram);
gl.dispatchCompute(Math.ceil(nodeCount / 64), 1, 1);
// Render nodes on GPU
gl.useProgram(renderProgram);
gl.drawArraysInstanced(gl.POINTS, 0, 1, nodeCount);
requestAnimationFrame(animate);
}
</script>
```
### **Option 2: WebGPU (Future)**
```javascript
// Next-generation WebGPU API
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();
// GPU compute pipeline for layout
const computePipeline = device.createComputePipeline({
compute: {
module: device.createShaderModule({ code: layoutComputeShader }),
entryPoint: 'main'
}
});
// GPU render pipeline
const renderPipeline = device.createRenderPipeline({
vertex: { module: vertexShaderModule, entryPoint: 'main' },
fragment: { module: fragmentShaderModule, entryPoint: 'main' },
primitive: { topology: 'point-list' }
});
```
### **Option 3: Three.js GPU Optimization**
```javascript
// Use Three.js InstancedMesh for GPU instancing
import * as THREE from 'three';
class GPUGraphRenderer {
constructor(nodeCount) {
// Single geometry instanced for all nodes
const geometry = new THREE.CircleGeometry(1, 8);
const material = new THREE.MeshBasicMaterial();
// GPU-instanced mesh for all nodes
this.instancedMesh = new THREE.InstancedMesh(
geometry, material, nodeCount
);
// Position matrix for each instance
this.matrix = new THREE.Matrix4();
}
updateNode(index, x, y, scale, color) {
// Update single instance matrix
this.matrix.makeScale(scale, scale, 1);
this.matrix.setPosition(x, y, 0);
this.instancedMesh.setMatrixAt(index, this.matrix);
this.instancedMesh.setColorAt(index, color);
}
render() {
// Single GPU draw call for all nodes
this.instancedMesh.instanceMatrix.needsUpdate = true;
this.instancedMesh.instanceColor.needsUpdate = true;
}
}
```
## 🎯 **Recommendation**
### **Current Approach is Good For:**
- ✅ **Rapid development** - Standard D3.js patterns
- ✅ **Small-medium graphs** (<50k nodes)
- ✅ **Interactive features** - Easy DOM manipulation
- ✅ **Debugging** - Standard web dev tools
- ✅ **Compatibility** - Works in all browsers
### **True GPU Rendering Needed For:**
- 🚀 **Million+ node graphs** with smooth 60fps
- 🚀 **Real-time layout animation**
- 🚀 **Complex visual effects** (particles, trails)
- 🚀 **VR/AR graph visualization**
- 🚀 **Multi-touch interaction** on large displays
## 💡 **Hybrid Solution**
The optimal approach combines both:
```javascript
// Intelligent renderer selection
const selectRenderer = (nodeCount) => {
if (nodeCount < 10000) {
return new D3SVGRenderer(); // CPU DOM rendering
} else if (nodeCount < 100000) {
return new ThreeJSRenderer(); // WebGL with Three.js
} else {
return new WebGLRenderer(); // Custom GPU shaders
}
};
```
**Current Status:** Your remote service provides **GPU-accelerated data processing** with **CPU-based rendering** - which is perfect for most use cases and much easier to develop/maintain than full GPU rendering.

FROM ubuntu:22.04
# Install required packages
RUN apt-get update && apt-get install -y \
curl \
docker.io \
bc \
&& rm -rf /var/lib/apt/lists/*
# Copy the monitoring script
COPY gpu_memory_monitor.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/gpu_memory_monitor.sh
# Create a non-root user
RUN useradd -m -s /bin/bash monitor
# Set environment variables with defaults
ENV CHECK_INTERVAL=60
ENV MIN_AVAILABLE_PERCENT=70
ENV AUTO_FIX=true
# Run as non-root user
USER monitor
WORKDIR /home/monitor
CMD ["/usr/local/bin/gpu_memory_monitor.sh"]

# NVIDIA MPS Guide for Ollama GPU Optimization
## 🚀 Overview
NVIDIA Multi-Process Service (MPS) is a game-changing technology that enables multiple processes to share a single GPU context, eliminating expensive context switching overhead and dramatically improving concurrent workload performance.
This guide documents our discovery: **MPS transforms the DGX Spark from a single-threaded bottleneck into a high-throughput powerhouse**, achieving **3x concurrent performance** with near-perfect scaling.
## 📊 Performance Results Summary
### Triple Extraction Benchmark (llama3.1:8b)
| System | Mode | Individual Performance | Aggregate Throughput | Scaling Efficiency |
|--------|------|----------------------|---------------------|-------------------|
| **RTX 5090** | Single | ~300 tok/s | 300 tok/s | 100% (baseline) |
| **Mac M4 Pro** | Single | ~45 tok/s | 45 tok/s | 100% (baseline) |
| **DGX Spark** | Single (MPS) | 33.3 tok/s | 33.3 tok/s | 100% (baseline) |
| **DGX Spark** | 2x Concurrent | ~33.2 tok/s each | **66.4 tok/s** | **97% efficiency** |
| **DGX Spark** | 3x Concurrent | ~33.1 tok/s each | **99.4 tok/s** | **99% efficiency** |
### 🏆 Key Achievement
**DGX Spark + MPS delivers 2.2x higher aggregate throughput than RTX 5090 in multi-request scenarios!**
## 🛠️ MPS Setup Instructions
### 1. Start MPS Server
```bash
# Set MPS directory
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
mkdir -p /tmp/nvidia-mps
# Start MPS control daemon
sudo env "CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps" nvidia-cuda-mps-control -d
```
### 2. Restart Ollama with MPS Support
```bash
# Stop current Ollama
cd /path/to/ollama
docker compose down
# Start Ollama with MPS environment
sudo env "CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps" docker compose up -d
```
### 3. Verify MPS is Working
```bash
# Check MPS processes
ps aux | grep mps
# Expected output:
# root nvidia-cuda-mps-control -d
# root nvidia-cuda-mps-server -force-tegra
# Check Ollama processes show M+C flag
nvidia-smi
# Look for M+C in the Type column for Ollama processes
```
### 4. Stop MPS (when needed)
```bash
sudo nvidia-cuda-mps-control quit
```
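The start/verify/stop commands above can be wrapped into small shell functions so the pipe directory stays consistent between calls. This is a convenience sketch, not part of the official tooling; it assumes `nvidia-cuda-mps-control` is on `PATH` and that `sudo` is available:

```shell
# Convenience wrappers around the MPS lifecycle commands shown above.
MPS_PIPE_DIR=/tmp/nvidia-mps

mps_start() {
  export CUDA_MPS_PIPE_DIRECTORY="$MPS_PIPE_DIR"
  mkdir -p "$MPS_PIPE_DIR"
  sudo env "CUDA_MPS_PIPE_DIRECTORY=$MPS_PIPE_DIR" nvidia-cuda-mps-control -d
}

mps_status() {
  # MPS is up if the control daemon / server processes are running
  pgrep -f nvidia-cuda-mps > /dev/null && echo "MPS running" || echo "MPS stopped"
}

mps_stop() {
  sudo nvidia-cuda-mps-control quit
}
```

Source these into a shell, then run `mps_start` before bringing Ollama up and `mps_stop` when reverting to standard single-process mode.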
## 🔬 Technical Architecture
### CUDA MPS Architecture
```
┌─────────────────────────────────────────┐
│ GPU (Single CUDA Context) │
│ ├── MPS Server (Resource Manager) │
│ ├── Ollama Process 1 ──┐ │
│ ├── Ollama Process 2 ──┼── Shared │
│ └── Ollama Process 3 ──┘ Context │
└─────────────────────────────────────────┘
```
### Traditional Multi-Process Architecture
```
┌─────────────────────────────────────────┐
│ GPU │
│ ├── Process 1 (Context 1) ─────────────│
│ ├── Process 2 (Context 2) ─────────────│
│ └── Process 3 (Context 3) ─────────────│
│ ↑ Context Switching Overhead │
└─────────────────────────────────────────┘
```
## ⚖️ MPS vs Multiple API Servers Comparison
### 🚀 CUDA MPS Advantages
**Performance:**
- ✅ No context switching overhead (single shared context)
- ✅ Concurrent kernel execution from different processes
- ✅ Lower latency for small requests
- ✅ Better GPU utilization (kernels can overlap)
**Memory Efficiency:**
- ✅ Shared GPU memory management
- ✅ No duplicate driver overhead per process
- ✅ More efficient memory allocation
- ✅ Can fit more models in same memory
**Resource Management:**
- ✅ Single point of GPU resource control
- ✅ Automatic load balancing across processes
- ✅ Better thermal management
- ✅ Unified monitoring and debugging
### 🏢 Multiple API Servers Advantages
**Isolation & Reliability:**
- ✅ Process isolation (one crash doesn't affect others)
- ✅ Independent scaling per service
- ✅ Different models can have different configurations
- ✅ Easier to update/restart individual services
**Flexibility:**
- ✅ Different frameworks (vLLM, TensorRT-LLM, etc.)
- ✅ Per-service optimization
- ✅ Independent monitoring and logging
- ✅ Service-specific resource limits
**Operational:**
- ✅ Standard container orchestration (K8s, Docker)
- ✅ Familiar DevOps patterns
- ✅ Load balancing at HTTP level
- ✅ Rolling updates and deployments
## 🎯 Decision Framework
### Use CUDA MPS When:
- 🏆 Maximum GPU utilization is critical
- ⚡ Low latency is paramount
- 💰 Cost optimization (more models per GPU)
- 🔄 Same framework/runtime (e.g., all Ollama)
- 📊 Predictable, homogeneous workloads
- 🎮 Single-tenant environments
### Use Multiple API Servers When:
- 🛡️ High availability/fault tolerance required
- 🔧 Different models need different optimizations
- 📈 Independent scaling per service needed
- 🌐 Multi-tenant production environments
- 🔄 Frequent model updates/deployments
- 👥 Different teams managing different models
## 📊 Performance Impact Analysis
| Metric | CUDA MPS | Multiple Servers |
|--------|----------|------------------|
| Context Switch Overhead | ~0% | ~5-15% |
| Memory Efficiency | ~95% | ~80-85% |
| Latency (small requests) | Lower | Higher |
| Throughput (concurrent) | Higher | Lower |
| Fault Isolation | Lower | Higher |
| Operational Complexity | Lower | Higher |
## 🔍 Memory Capacity Analysis
### Model Memory Requirements
- **llama3.1:8b (Q4_K_M)**: ~4.9GB per instance
### System Comparison
| System | Total Memory | Theoretical Max | Practical Max |
|--------|--------------|----------------|---------------|
| **RTX 5090** | 32GB VRAM | 6 models | 3-4 models |
| **DGX Spark** | 120GB Unified | 20+ models | 10+ models |
### RTX 5090 Limitations:
- ❌ Limited to 32GB VRAM (hard ceiling)
- ❌ Driver overhead reduces available memory
- ❌ Memory fragmentation issues
- ❌ Thermal throttling under concurrent load
- ❌ Context switching still expensive
### DGX Spark Advantages:
- ✅ Nearly 4x more memory capacity (120GB vs 32GB)
- ✅ Unified memory architecture
- ✅ Better thermal design for sustained loads
- ✅ Can scale to 10+ concurrent models
- ✅ No VRAM bottleneck
## 🧪 Testing Concurrent Performance
### Single Instance Baseline
```bash
curl -X POST http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Your prompt here"}],
"stream": false
}'
```
### Concurrent Testing
```bash
# Run multiple requests simultaneously
curl [request1] & curl [request2] & curl [request3] & wait
```
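The bracketed placeholders above stand for full curl commands; a loop version that fires N identical requests (same endpoint and model as the baseline) might look like this sketch:

```shell
# Launch 3 identical chat requests in parallel and wait for all of them.
# Each response lands in /tmp/resp_N.json for later inspection.
N=3
for i in $(seq 1 "$N"); do
  curl -s -X POST http://localhost:11434/api/chat \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Your prompt here"}], "stream": false}' \
    > "/tmp/resp_$i.json" &
done
wait
```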
### Expected Results with MPS:
- **1 instance**: 33.3 tok/s
- **2 concurrent**: ~66.4 tok/s total (97% efficiency)
- **3 concurrent**: ~99.4 tok/s total (99% efficiency)
## 🎯 Recommendations
### For Triple Extraction Workloads:
**MPS is the optimal choice because:**
1. **Homogeneous workload** - same model (llama3.1:8b)
2. **Performance critical** - maximum throughput needed
3. **Cost optimization** - more concurrent requests per GPU
4. **Predictable usage** - biomedical triple extraction
### Hybrid Approach:
Consider running:
- **MPS in production** for maximum throughput
- **Separate dev/test servers** for experimentation
- **Different models** on separate instances when needed
## 🚨 Important Notes
1. **MPS requires careful setup** - ensure proper environment variables
2. **Monitor GPU temperature** under heavy concurrent loads
3. **Test thoroughly** before production deployment
4. **Have fallback plan** to standard single-process mode
5. **Consider workload patterns** - MPS excels with consistent concurrent requests
## 🔗 Related Files
- `docker-compose.yml` - Ollama service configuration
- `ollama_gpu_benchmark.py` - Performance testing script
- `clear_cache_and_restart.sh` - Memory optimization script
- `gpu_memory_monitor.sh` - GPU monitoring script
## 📚 Additional Resources
- [NVIDIA MPS Documentation](https://docs.nvidia.com/deploy/mps/index.html)
- [CUDA Multi-Process Service Guide](https://docs.nvidia.com/cuda/mps/index.html)
- [Ollama Documentation](https://ollama.ai/docs)
---
**Last Updated**: October 2, 2025
**Tested On**: DGX Spark with 120GB unified memory, CUDA 13.0, Ollama latest

View File

@ -1,78 +0,0 @@
# Ollama GPU Memory Monitoring
This setup includes automatic monitoring and fixing of GPU memory detection issues that can occur on unified memory systems (like DGX Spark, Jetson, etc.).
## The Problem
On unified memory systems, Ollama sometimes can't detect the full amount of available GPU memory due to buffer cache not being reclaimable. This causes models to fall back to CPU inference, dramatically reducing performance.
**Symptoms:**
- Ollama logs show low "available" vs "total" GPU memory
- Models show mixed CPU/GPU processing instead of 100% GPU
- Performance is much slower than expected
## The Solution
This Docker Compose setup includes an optional GPU memory monitor that:
1. **Monitors** Ollama's GPU memory detection every 60 seconds
2. **Detects** when available memory drops below 70% of total
3. **Automatically fixes** the issue by clearing buffer cache and restarting Ollama
4. **Logs** all actions for debugging
## Usage
### Standard Setup (Most Systems)
```bash
docker compose up -d
```
### Unified Memory Systems (DGX Spark, Jetson, etc.)
```bash
docker compose --profile unified-memory up -d
```
This will start both Ollama and the GPU memory monitor.
## Configuration
The monitor can be configured via environment variables:
- `CHECK_INTERVAL=60` - How often to check (seconds)
- `MIN_AVAILABLE_PERCENT=70` - Threshold for triggering fixes (percentage)
- `AUTO_FIX=true` - Whether to automatically fix issues
## Manual Commands
You can still use the manual scripts if needed:
```bash
# Check current GPU memory status
./monitor_gpu_memory.sh
# Manually clear cache and restart
./clear_cache_and_restart.sh
```
## Monitoring Logs
To see what the monitor is doing:
```bash
docker logs ollama-gpu-monitor -f
```
## When to Use
Use the unified memory profile if you experience:
- Inconsistent Ollama performance
- Models loading on CPU instead of GPU
- GPU memory showing as much lower than system RAM
- You're on a system with unified memory (DGX, Jetson, etc.)
## Performance Impact
The monitor has minimal performance impact:
- Runs one check every 60 seconds
- Only takes action when issues are detected
- Automatic fixes typically resolve issues within 30 seconds

View File

@ -1,66 +0,0 @@
version: '3.8'
services:
ollama:
build:
context: .
dockerfile: Dockerfile
image: ollama-custom:latest
container_name: ollama-server
ports:
- "11434:11434"
volumes:
- ollama_models:/root/.ollama
environment:
- OLLAMA_HOST=0.0.0.0:11434
- OLLAMA_FLASH_ATTENTION=1
- OLLAMA_KEEP_ALIVE=30m
- OLLAMA_CUDA=1
# Performance tuning for large models like Llama3 70B
- OLLAMA_LLM_LIBRARY=cuda
- OLLAMA_NUM_PARALLEL=1 # Favor latency/stability for 70B; increase for smaller models
- OLLAMA_MAX_LOADED_MODELS=1 # Avoid VRAM contention
- OLLAMA_KV_CACHE_TYPE=q8_0 # Reduce KV cache VRAM with minimal perf impact
# Removed restrictive settings for 70B model testing:
# - OLLAMA_CONTEXT_LENGTH=8192 (let Ollama auto-detect)
# - OLLAMA_NUM_PARALLEL=4 (let Ollama decide)
# - OLLAMA_MAX_LOADED=1 (allow multiple models)
# - OLLAMA_NUM_THREADS=16 (may force CPU usage)
runtime: nvidia
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
# GPU Memory Monitor - only for unified memory systems like DGX Spark
gpu-monitor:
build:
context: .
dockerfile: Dockerfile.monitor
container_name: ollama-gpu-monitor
depends_on:
- ollama
volumes:
- /var/run/docker.sock:/var/run/docker.sock:ro
environment:
- CHECK_INTERVAL=60 # Check every 60 seconds
- MIN_AVAILABLE_PERCENT=70 # Alert if less than 70% GPU memory available
- AUTO_FIX=true # Automatically fix buffer cache issues
privileged: true # Required to clear buffer cache and restart containers
restart: unless-stopped
profiles:
- unified-memory # Only start with --profile unified-memory
volumes:
ollama_models:
driver: local

View File

@ -8,10 +8,10 @@ OLLAMA_PID=$!
# Wait for Ollama to be ready
echo "Waiting for Ollama to be ready..."
max_attempts=30
max_attempts=120
attempt=0
while [ $attempt -lt $max_attempts ]; do
if curl -s http://localhost:11434/api/tags > /dev/null 2>&1; then
if /bin/ollama list > /dev/null 2>&1; then
echo "Ollama is ready!"
break
fi
@ -26,9 +26,8 @@ fi
# Check if any models are present
echo "Checking for existing models..."
MODELS=$(curl -s http://localhost:11434/api/tags | grep -o '"models":\s*\[\]' || echo "has_models")
if [[ "$MODELS" == *'"models": []'* ]]; then
if ! /bin/ollama list | grep -q llama3.1:8b; then
echo "No models found. Pulling llama3.1:8b..."
/bin/ollama pull llama3.1:8b
echo "Successfully pulled llama3.1:8b"
@ -38,5 +37,4 @@ fi
# Keep the container running
echo "Setup complete. Ollama is running."
wait $OLLAMA_PID
wait $OLLAMA_PID

View File

@ -1,108 +0,0 @@
#!/bin/bash
#
# Ollama GPU Memory Monitor - runs inside a sidecar container
# Automatically detects and fixes unified memory buffer cache issues
#
set -e
# Configuration
CHECK_INTERVAL=${CHECK_INTERVAL:-60} # Check every 60 seconds
MIN_AVAILABLE_PERCENT=${MIN_AVAILABLE_PERCENT:-70} # Alert if less than 70% available
AUTO_FIX=${AUTO_FIX:-true} # Automatically fix issues
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"
}
check_ollama_memory() {
# Wait for Ollama to be ready
if ! curl -s http://ollama:11434/api/tags > /dev/null 2>&1; then
log "Ollama not ready, skipping check"
return 0
fi
# Get Ollama logs to find inference compute info
local compute_log=$(docker logs ollama-server 2>&1 | grep "inference compute" | tail -1)
if [ -z "$compute_log" ]; then
log "No inference compute logs found"
return 0
fi
# Extract memory info
local total_mem=$(echo "$compute_log" | grep -o 'total="[^"]*"' | cut -d'"' -f2)
local available_mem=$(echo "$compute_log" | grep -o 'available="[^"]*"' | cut -d'"' -f2)
if [ -z "$total_mem" ] || [ -z "$available_mem" ]; then
log "Could not parse memory information"
return 0
fi
# Convert to numeric (assuming GiB)
local total_num=$(echo "$total_mem" | sed 's/ GiB//')
local available_num=$(echo "$available_mem" | sed 's/ GiB//')
# Calculate percentage
local available_percent=$(echo "scale=1; $available_num * 100 / $total_num" | bc)
log "GPU Memory: $available_mem / $total_mem available (${available_percent}%)"
# Check if we need to take action
if (( $(echo "$available_percent < $MIN_AVAILABLE_PERCENT" | bc -l) )); then
log "WARNING: Low GPU memory availability detected (${available_percent}%)"
if [ "$AUTO_FIX" = "true" ]; then
log "Attempting to fix by clearing buffer cache..."
fix_memory_issue
else
log "Auto-fix disabled. Manual intervention required."
fi
return 1
else
log "GPU memory availability OK (${available_percent}%)"
return 0
fi
}
fix_memory_issue() {
log "Clearing system buffer cache..."
# Clear buffer cache from host (requires privileged container)
echo 1 > /proc/sys/vm/drop_caches 2>/dev/null || {
log "Cannot clear buffer cache from container. Trying host command..."
# Alternative: use nsenter to run on host
nsenter -t 1 -m -p sh -c 'sync && echo 1 > /proc/sys/vm/drop_caches' 2>/dev/null || {
log "Failed to clear buffer cache. Manual intervention required."
return 1
}
}
# Wait a moment
sleep 5
# Restart Ollama container
log "Restarting Ollama container..."
docker restart ollama-server
# Wait for restart
sleep 15
log "Fix applied. Ollama should have better memory detection now."
}
main() {
log "Starting Ollama GPU Memory Monitor"
log "Check interval: ${CHECK_INTERVAL}s, Min available: ${MIN_AVAILABLE_PERCENT}%, Auto-fix: ${AUTO_FIX}"
while true; do
check_ollama_memory || true # Don't exit on check failures
sleep "$CHECK_INTERVAL"
done
}
# Handle signals gracefully
trap 'log "Shutting down monitor..."; exit 0' SIGTERM SIGINT
main

View File

@ -1,79 +0,0 @@
#!/bin/bash
#
# Monitor Ollama GPU memory usage and alert when buffer cache is consuming too much
# This helps detect when the unified memory issue is occurring
#
set -e
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Thresholds
MIN_AVAILABLE_PERCENT=70 # Alert if less than 70% GPU memory available
echo "🔍 Ollama GPU Memory Monitor"
echo "================================"
# Check if Ollama container is running
if ! docker ps | grep -q ollama-server; then
echo -e "${RED}❌ Ollama container is not running${NC}"
exit 1
fi
# Get the latest inference compute log
COMPUTE_LOG=$(docker logs ollama-server 2>&1 | grep "inference compute" | tail -1)
if [ -z "$COMPUTE_LOG" ]; then
echo -e "${YELLOW}⚠️ No inference compute logs found. Model may not be loaded.${NC}"
exit 1
fi
echo "Latest GPU memory status:"
echo "$COMPUTE_LOG"
# Extract total and available memory
TOTAL_MEM=$(echo "$COMPUTE_LOG" | grep -o 'total="[^"]*"' | cut -d'"' -f2)
AVAILABLE_MEM=$(echo "$COMPUTE_LOG" | grep -o 'available="[^"]*"' | cut -d'"' -f2)
# Convert to numeric values (assuming GiB)
TOTAL_NUM=$(echo "$TOTAL_MEM" | sed 's/ GiB//')
AVAILABLE_NUM=$(echo "$AVAILABLE_MEM" | sed 's/ GiB//')
# Calculate percentage
AVAILABLE_PERCENT=$(echo "scale=1; $AVAILABLE_NUM * 100 / $TOTAL_NUM" | bc)
echo ""
echo "Memory Analysis:"
echo " Total GPU Memory: $TOTAL_MEM"
echo " Available Memory: $AVAILABLE_MEM"
echo " Available Percentage: ${AVAILABLE_PERCENT}%"
# Check if we need to alert
if (( $(echo "$AVAILABLE_PERCENT < $MIN_AVAILABLE_PERCENT" | bc -l) )); then
echo ""
echo -e "${RED}🚨 WARNING: Low GPU memory availability detected!${NC}"
echo -e "${RED} Only ${AVAILABLE_PERCENT}% of GPU memory is available${NC}"
echo -e "${YELLOW} This may cause models to run on CPU instead of GPU${NC}"
echo ""
echo "💡 Recommended action:"
echo " Run: ./clear_cache_and_restart.sh"
echo ""
# Show current system memory usage
echo "Current system memory usage:"
free -h
exit 1
else
echo ""
echo -e "${GREEN}✅ GPU memory availability looks good (${AVAILABLE_PERCENT}%)${NC}"
fi
# Show current model status
echo ""
echo "Current loaded models:"
docker exec ollama-server ollama ps

View File

@ -1,92 +1,153 @@
# vLLM NVFP4 Deployment
# vLLM Service
This setup deploys the NVIDIA Llama 4 Scout model with NVFP4 quantization using vLLM, optimized for Blackwell and Hopper GPU architectures.
This service provides advanced GPU-accelerated LLM inference using vLLM with FP8 quantization, offering higher throughput than Ollama for production workloads.
## Overview
vLLM is an optional service that complements Ollama by providing:
- Higher throughput for concurrent requests
- Advanced quantization (FP8)
- PagedAttention for efficient memory usage
- OpenAI-compatible API
## Quick Start
1. **Set up your HuggingFace token:**
```bash
cp env.example .env
# Edit .env and add your HF_TOKEN
```
### Using the Complete Stack
2. **Build and run:**
```bash
docker-compose up --build
```
3. **Test the deployment:**
```bash
curl -X POST "http://localhost:8001/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/Llama-4-Scout-17B-16E-Instruct-FP4",
"messages": [{"role": "user", "content": "Hello! How are you?"}],
"max_tokens": 100
}'
```
## Model Information
- **Model**: `nvidia/Llama-4-Scout-17B-16E-Instruct-FP4`
- **Quantization**: NVFP4 (optimized for Blackwell architecture)
- **Alternative**: `nvidia/Llama-4-Scout-17B-16E-Instruct-FP8` (for Hopper architecture)
## Performance Tuning
The startup script automatically detects your GPU architecture and applies optimal settings:
### Blackwell (Compute Capability 10.0)
- Enables FlashInfer backend
- Uses NVFP4 quantization
- Enables async scheduling
- Applies fusion optimizations
### Hopper (Compute Capability 9.0)
- Uses FP8 quantization
- Disables async scheduling (due to vLLM limitations)
- Standard optimization settings
### Configuration Options
Adjust these environment variables in your `.env` file:
- `VLLM_TENSOR_PARALLEL_SIZE`: Number of GPUs to use (default: 2)
- `VLLM_MAX_NUM_SEQS`: Batch size (default: 128)
- `VLLM_MAX_NUM_BATCHED_TOKENS`: Token batching limit (default: 8192)
- `VLLM_GPU_MEMORY_UTILIZATION`: GPU memory usage (default: 0.9)
### Performance Scenarios
- **Maximum Throughput**: `VLLM_TENSOR_PARALLEL_SIZE=1`, increase `VLLM_MAX_NUM_SEQS`
- **Minimum Latency**: `VLLM_TENSOR_PARALLEL_SIZE=4-8`, `VLLM_MAX_NUM_SEQS=8`
- **Balanced**: `VLLM_TENSOR_PARALLEL_SIZE=2`, `VLLM_MAX_NUM_SEQS=128` (default)
## Benchmarking
To benchmark performance:
The easiest way to run vLLM is with the complete stack:
```bash
docker exec -it vllm-nvfp4-server vllm bench serve \
--host 0.0.0.0 \
--port 8001 \
--model nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 1024 \
--max-concurrency 128 \
--num-prompts 1280
# From project root
./start.sh --complete
```
This starts vLLM along with all other optional services.
### Manual Docker Compose
```bash
# From project root
docker compose -f deploy/compose/docker-compose.complete.yml up -d vllm
```
### Testing the Deployment
```bash
# Check health
curl http://localhost:8001/v1/models
# Test chat completion
curl -X POST "http://localhost:8001/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-3B-Instruct",
"messages": [{"role": "user", "content": "Hello! How are you?"}],
"max_tokens": 100
}'
```
## Default Configuration
- **Model**: `meta-llama/Llama-3.2-3B-Instruct`
- **Quantization**: FP8 (optimized for compute efficiency)
- **Port**: 8001
- **API**: OpenAI-compatible endpoints
## Configuration Options
Environment variables configured in `docker-compose.complete.yml`:
- `VLLM_MODEL`: Model to load (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `VLLM_TENSOR_PARALLEL_SIZE`: Number of GPUs to use (default: 1)
- `VLLM_MAX_MODEL_LEN`: Maximum sequence length (default: 4096)
- `VLLM_GPU_MEMORY_UTILIZATION`: GPU memory usage (default: 0.9)
- `VLLM_QUANTIZATION`: Quantization method (default: fp8)
- `VLLM_KV_CACHE_DTYPE`: KV cache data type (default: fp8)
## Frontend Integration
The txt2kg frontend automatically detects and uses vLLM when available:
1. Triple extraction: `/api/vllm` endpoint
2. RAG queries: Automatically uses vLLM if configured
3. Model selection: Choose vLLM models in the UI
## Using Different Models
To use a different model, edit the `VLLM_MODEL` environment variable in `docker-compose.complete.yml`:
```yaml
environment:
- VLLM_MODEL=meta-llama/Llama-3.1-8B-Instruct
```
Then restart the service:
```bash
docker compose -f deploy/compose/docker-compose.complete.yml restart vllm
```
## Performance Tips
1. **Single GPU**: Set `VLLM_TENSOR_PARALLEL_SIZE=1` for best single-GPU performance
2. **Multi-GPU**: Increase `VLLM_TENSOR_PARALLEL_SIZE` to use multiple GPUs
3. **Memory**: Adjust `VLLM_GPU_MEMORY_UTILIZATION` based on available VRAM
4. **Throughput**: For high throughput, use smaller models or increase quantization
## Requirements
- NVIDIA GPU with Blackwell or Hopper architecture
- CUDA Driver 575 or above
- NVIDIA GPU with CUDA support (Ampere architecture or newer recommended)
- CUDA Driver 535 or above
- Docker with NVIDIA Container Toolkit
- HuggingFace token (for model access)
- At least 8GB VRAM for default model
- HuggingFace token for gated models (optional, cached in `~/.cache/huggingface`)
## Troubleshooting
- Check GPU compatibility: `nvidia-smi`
- View logs: `docker-compose logs -f vllm-nvfp4`
- Monitor GPU usage: `nvidia-smi -l 1`
### Check Service Status
```bash
# View logs
docker compose -f deploy/compose/docker-compose.complete.yml logs -f vllm
# Check health
curl http://localhost:8001/v1/models
```
### GPU Issues
```bash
# Check GPU availability
nvidia-smi
# Check vLLM container GPU access
docker exec vllm-service nvidia-smi
```
### Model Loading Issues
- Ensure sufficient VRAM for the model
- Check HuggingFace cache: `ls ~/.cache/huggingface/hub`
- For gated models, set HF_TOKEN environment variable
## Comparison with Ollama
| Feature | Ollama | vLLM |
|---------|--------|------|
| **Ease of Use** | ✅ Very easy | ⚠️ More complex |
| **Model Management** | ✅ Built-in pull/push | ❌ Manual download |
| **Throughput** | ⚠️ Moderate | ✅ High |
| **Quantization** | Q4_K_M | FP8, GPTQ |
| **Memory Efficiency** | ✅ Good | ✅ Excellent (PagedAttention) |
| **Use Case** | Development, small-scale | Production, high-throughput |
## When to Use vLLM
Use vLLM when:
- Processing large batches of requests
- Need maximum throughput
- Using multiple GPUs
- Deploying to production with high load
Use Ollama when:
- Getting started with the project
- Single-user development
- Simpler model management needed
- Don't need maximum performance

View File

@ -4,14 +4,34 @@ This directory contains the Next.js frontend application for the txt2kg project.
## Structure
- **app/**: Next.js app directory with pages and routes
- **components/**: React components
- **contexts/**: React context providers
- **app/**: Next.js 15 app directory with pages and API routes
- API routes for LLM providers (Ollama, vLLM, NVIDIA API)
- Triple extraction and graph query endpoints
- Settings and health check endpoints
- **components/**: React 19 components
- Graph visualization (Three.js WebGPU)
- PyGraphistry integration for GPU-accelerated rendering
- RAG query interface
- Document upload and processing
- **contexts/**: React context providers for state management
- **hooks/**: Custom React hooks
- **lib/**: Utility functions and shared logic
- LLM service (Ollama, vLLM, NVIDIA API integration)
- Graph database services (ArangoDB, Neo4j)
- Pinecone vector database integration
- RAG service for knowledge graph querying
- **public/**: Static assets
- **styles/**: CSS and styling files
- **types/**: TypeScript type definitions
- **types/**: TypeScript type definitions for graph data structures
## Technology Stack
- **Next.js 15**: React framework with App Router
- **React 19**: Latest React with improved concurrent features
- **TypeScript**: Type-safe development
- **Tailwind CSS**: Utility-first styling
- **Three.js**: WebGL/WebGPU 3D graph visualization
- **D3.js**: Data-driven visualizations
- **LangChain**: LLM orchestration and chaining
## Development
@ -23,9 +43,47 @@ npm install
npm run dev
```
Or use the start script from project root:
```bash
./start.sh --dev-frontend
```
The development server will run on http://localhost:3000
## Building for Production
```bash
cd frontend
npm run build
npm start
```
Or use Docker (recommended):
```bash
# From project root
./start.sh
```
The production app will run on http://localhost:3001
## Environment Variables
Required environment variables are configured in docker-compose files:
- `ARANGODB_URL`: ArangoDB connection URL
- `OLLAMA_BASE_URL`: Ollama API endpoint
- `VLLM_BASE_URL`: vLLM API endpoint (optional)
- `NVIDIA_API_KEY`: NVIDIA API key (optional)
- `PINECONE_HOST`: Local Pinecone host (optional)
- `SENTENCE_TRANSFORMER_URL`: Embeddings service URL (optional)
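When running the frontend in dev mode (outside Docker), the same variables can be exported in the shell first. The values below mirror the default ports listed elsewhere in this repo and are assumptions for a local setup:

```shell
# Hypothetical local dev environment; ports mirror the compose defaults
export ARANGODB_URL=http://localhost:8529
export OLLAMA_BASE_URL=http://localhost:11434
export VLLM_BASE_URL=http://localhost:8001    # optional, complete stack only
# then: cd frontend && npm run dev
```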
## Features
- **Knowledge Graph Extraction**: Extract triples from text using LLMs
- **Graph Visualization**: Interactive 3D visualization with Three.js WebGPU
- **RAG Queries**: Query knowledge graphs with retrieval-augmented generation
- **Multiple LLM Providers**: Support for Ollama, vLLM, and NVIDIA API
- **GPU-Accelerated Rendering**: Optional PyGraphistry integration for large graphs
- **Vector Search**: Pinecone integration for semantic search

View File

@ -1,9 +1,11 @@
# TXT2KG Pipeline with ArangoDB Integration
# GNN Training Pipeline (Experimental)
This project provides a two-stage pipeline for knowledge graph-based question answering:
**Status**: This is an experimental feature for training Graph Neural Network models for enhanced RAG retrieval.
1. **Data Preprocessing** (`preprocess_data.py`): Extracts knowledge graph triples from either ArangoDB or using TXT2KG, and prepares the dataset.
2. **Model Training & Testing** (`train_test_gnn.py`): Trains and evaluates a GNN-based retriever model on the preprocessed dataset.
This pipeline provides a two-stage process for training GNN-based knowledge graph retrieval models:
1. **Data Preprocessing** (`preprocess_data.py`): Extracts knowledge graph triples from ArangoDB and prepares training datasets.
2. **Model Training & Testing** (`train_test_gnn.py`): Trains and evaluates a GNN-based retriever model using PyTorch Geometric.
## Prerequisites
@ -20,10 +22,11 @@ This project provides a two-stage pipeline for knowledge graph-based question an
pip install -r scripts/requirements.txt
```
2. Ensure ArangoDB is running. You can use the docker-compose file:
2. Ensure ArangoDB is running. You can use the main start script:
```bash
docker-compose up -d arangodb arangodb-init
# From project root
./start.sh
```
## Usage
@ -57,13 +60,9 @@ You can specify custom ArangoDB connection parameters:
python scripts/preprocess_data.py --use_arango --arango_url "http://localhost:8529" --arango_db "your_db" --arango_user "username" --arango_password "password"
```
#### Using TXT2KG (original behavior)
#### Using Direct Triple Extraction
If you don't pass the `--use_arango` flag, the script will use the original TXT2KG approach:
```bash
python scripts/preprocess_data.py --NV_NIM_KEY "your-nvidia-api-key"
```
If you don't pass the `--use_arango` flag, the script will extract triples directly using the configured LLM provider.
### Stage 2: Model Training & Testing

View File

@ -4,7 +4,6 @@
# Parse command line arguments
DEV_FRONTEND=false
USE_VLLM=false
USE_COMPLETE=false
while [[ $# -gt 0 ]]; do
@ -13,10 +12,6 @@ while [[ $# -gt 0 ]]; do
DEV_FRONTEND=true
shift
;;
--vllm)
USE_VLLM=true
shift
;;
--complete)
USE_COMPLETE=true
shift
@ -25,12 +20,15 @@ while [[ $# -gt 0 ]]; do
echo "Usage: ./start.sh [OPTIONS]"
echo ""
echo "Options:"
echo " --dev-frontend Run frontend in development mode (without Docker)"
echo " --vllm Use vLLM instead of Ollama for LLM inference"
echo " --complete Use complete stack with MinIO S3 storage"
echo " --help, -h Show this help message"
echo " --dev-frontend Run frontend in development mode (without Docker)"
echo " --complete Use complete stack (vLLM, Pinecone, Sentence Transformers)"
echo " --help, -h Show this help message"
echo ""
echo "Default: Starts with Ollama, ArangoDB, local Pinecone, and Next.js frontend"
echo "Default: Starts minimal stack with Ollama, ArangoDB, and Next.js frontend"
echo ""
echo "Examples:"
echo " ./start.sh # Start minimal demo (recommended)"
echo " ./start.sh --complete # Start with all optional services"
exit 0
;;
*)
@ -81,15 +79,12 @@ else
fi
# Build the docker-compose command
if [ "$USE_VLLM" = true ]; then
CMD="$DOCKER_COMPOSE_CMD -f $(pwd)/deploy/compose/docker-compose.vllm.yml"
echo "Using vLLM for GPU-accelerated LLM inference with FP8 quantization..."
elif [ "$USE_COMPLETE" = true ]; then
if [ "$USE_COMPLETE" = true ]; then
CMD="$DOCKER_COMPOSE_CMD -f $(pwd)/deploy/compose/docker-compose.complete.yml"
echo "Using complete stack with MinIO S3 storage..."
echo "Using complete stack (Ollama, vLLM, Pinecone, Sentence Transformers)..."
else
CMD="$DOCKER_COMPOSE_CMD -f $(pwd)/deploy/compose/docker-compose.yml"
echo "Using default configuration (Ollama + ArangoDB + local Pinecone)..."
echo "Using minimal configuration (Ollama + ArangoDB only)..."
fi
# Execute the command
@ -104,14 +99,16 @@ echo "=========================================="
echo "txt2kg is now running!"
echo "=========================================="
echo ""
echo "Services:"
echo "Core Services:"
echo " • Web UI: http://localhost:3001"
echo " • ArangoDB: http://localhost:8529"
echo " • Ollama API: http://localhost:11434"
echo " • Local Pinecone: http://localhost:5081"
echo ""
if [ "$USE_VLLM" = true ]; then
if [ "$USE_COMPLETE" = true ]; then
echo "Additional Services (Complete Stack):"
echo " • Local Pinecone: http://localhost:5081"
echo " • Sentence Transformers: http://localhost:8000"
echo " • vLLM API: http://localhost:8001"
echo ""
fi
@ -125,6 +122,6 @@ echo " 3. Upload documents and start building your knowledge graph!"
echo ""
echo "Other options:"
echo " • Run frontend in dev mode: ./start.sh --dev-frontend"
echo " • Use vLLM instead of Ollama: ./start.sh --vllm"
echo " • Use complete stack: ./start.sh --complete"
echo " • View logs: docker compose logs -f"
echo ""