Compare commits

...

7 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| Aaron Brewbaker | 3f8f8ed01c | Merge 2d52e1aab3 into 2022e2b24b | 2026-04-20 17:48:31 +00:00 |
| GitLab CI | 2022e2b24b | chore: Regenerate all playbooks | 2026-04-20 15:46:44 +00:00 |
| GitLab CI | 3ba4d58f1e | chore: Regenerate all playbooks | 2026-04-14 17:45:10 +00:00 |
| GitLab CI | 6e98abc3b0 | chore: Regenerate all playbooks | 2026-04-14 01:42:17 +00:00 |
| GitLab CI | 1d85b97d79 | chore: Regenerate all playbooks | 2026-04-14 00:52:53 +00:00 |
| GitLab CI | 6a4d122e92 | chore: Regenerate all playbooks | 2026-04-13 13:31:35 +00:00 |
| Aaron Brewbaker | 2d52e1aab3 | feat: add DGX Spark MCP Server playbook | 2025-11-25 19:22:44 -05:00 |

Commit 2d52e1aab3 message body:
This playbook installs the DGX Spark MCP Server, a tool for hardware-aware Spark optimization on DGX systems.
Includes:
- Installation script (npm based)
- Systemd service configuration
- Default configuration
- Documentation
9 changed files with 294 additions and 58 deletions

@ -39,7 +39,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
- [Connect Multiple DGX Spark through a Switch](nvidia/multi-sparks-through-switch/)
- [NCCL for Two Sparks](nvidia/nccl/)
- [Fine-tune with NeMo](nvidia/nemo-fine-tune/)
- [NemoClaw with Nemotron-3-Super and Telegram on DGX Spark](nvidia/nemoclaw/)
- [NemoClaw with Nemotron 3 Super and Telegram on DGX Spark](nvidia/nemoclaw/)
- [Nemotron-3-Nano with llama.cpp](nvidia/nemotron/)
- [NIM on Spark](nvidia/nim-llm/)
- [NVFP4 Quantization](nvidia/nvfp4-quantization/)

@ -0,0 +1,84 @@
# DGX Spark MCP Server Playbook
This playbook installs and configures the **DGX Spark MCP Server**, a tool that provides hardware-aware Apache Spark optimization for NVIDIA DGX systems via the Model Context Protocol (MCP).
## Overview
The DGX Spark MCP Server enables MCP clients (like Claude Desktop or Claude Code) to:
* **Detect Hardware**: Automatically read DGX GPU topology, memory, and CPU specs.
* **Optimize Spark**: Generate tuned Spark configurations (`spark-submit` args) based on detected hardware and workload type (ETL, ML Training, Inference).
* **Monitor**: Check real-time GPU availability before submitting jobs.
## Prerequisites
* **NVIDIA DGX System** (or compatible GPU server)
* **NVIDIA Drivers** installed (`nvidia-smi` available)
* **Node.js 18+**
* **Root access** (for systemd service installation)
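
The prerequisites above can be checked before running the installer. The following is a minimal sketch, not part of the playbook: the function names `node_major` and `check_prereqs` are hypothetical, and it assumes `nvidia-smi` and `node` are expected on `PATH`.

```bash
# Hypothetical pre-flight helpers; not shipped with the playbook.

# Print the major version from a string such as "v18.19.1" or "20.11.0".
node_major() {
    local version="${1#v}"          # drop a leading "v" if present
    printf '%s\n' "${version%%.*}"  # keep everything before the first dot
}

# Return 0 if the tools listed under Prerequisites are available, 1 otherwise.
check_prereqs() {
    local status=0
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: install the NVIDIA drivers" >&2
        status=1
    fi
    if ! command -v node >/dev/null 2>&1; then
        echo "node not found: install Node.js 18+" >&2
        status=1
    elif [ "$(node_major "$(node --version)")" -lt 18 ]; then
        echo "Node.js 18+ required, found $(node --version)" >&2
        status=1
    fi
    return "$status"
}
```

One possible use is `check_prereqs && sudo ./scripts/install.sh`, so the installer only runs when both tools are present.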
## Directory Structure
```
.
├── config/
│   └── default.json            # Default configuration
├── deploy/
│   └── dgx-spark-mcp.service   # Systemd service file
└── scripts/
    └── install.sh              # Automated installer
```
## Installation
1. **Run the installer**:
   ```bash
   sudo ./scripts/install.sh
   ```
   This script will:
   * Install `dgx-spark-mcp` globally via `npm`.
   * Create a dedicated system user (`dgx`).
   * Copy the default configuration to `/etc/dgx-spark-mcp/config.json`.
   * Set up the logging directory `/var/log/dgx-spark-mcp`.
   * Install and start the systemd service.
2. **Verify the installation**:
   ```bash
   systemctl status dgx-spark-mcp
   ```
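
`systemctl status` shows a single snapshot. In a script it can be more convenient to poll until the unit reports active. The helper below is a sketch under that assumption; `wait_for` is a hypothetical name and is not part of the playbook.

```bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_for <attempts> <delay-seconds> <command> [args...]
wait_for() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep between attempts, but not after the final one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, `wait_for 10 1 systemctl is-active --quiet dgx-spark-mcp` would check once per second for up to ten attempts and return non-zero if the service never becomes active.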
## Configuration
The configuration file is located at `/etc/dgx-spark-mcp/config.json`.
### Key Settings
* **`mcp.transport`**: `stdio` (default) or `sse`.
* **`hardware.enableGpuMonitoring`**: Set to `true` to enable real-time `nvidia-smi` queries.
* **`logging.level`**: `info` or `debug`.
## Usage with Claude Desktop
Add the following to your `claude_desktop_config.json`:
```json
{
"mcpServers": {
"dgx-spark": {
"command": "dgx-spark-mcp"
}
}
}
```
## Troubleshooting
**Service fails to start?**
Check logs:
```bash
journalctl -u dgx-spark-mcp -f
```
**Permission denied?**
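Ensure the `dgx` user has permission to run `nvidia-smi`. You may need to add the user to the `video` group, then restart the service so the new group membership takes effect:
```bash
sudo usermod -a -G video dgx
sudo systemctl restart dgx-spark-mcp
```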
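
To see whether the group change is needed at all, the membership can be checked first. This is a small sketch; `user_in_group` is a hypothetical helper name, and whether the `video` group actually gates GPU access depends on the system.

```bash
# Hypothetical helper: succeed if <user> belongs to <group>.
user_in_group() {
    local user="$1" group="$2" g
    # `id -nG` prints the user's group names separated by spaces.
    for g in $(id -nG "$user" 2>/dev/null); do
        if [ "$g" = "$group" ]; then
            return 0
        fi
    done
    return 1
}
```

A possible use is `user_in_group dgx video || sudo usermod -a -G video dgx`, which only modifies the account when the membership is missing.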

@ -0,0 +1,33 @@
{
"server": {
"port": 3000,
"host": "localhost",
"nodeEnv": "production"
},
"logging": {
"level": "info",
"format": "json",
"dir": "/var/log/dgx-spark-mcp",
"maxFiles": 10,
"maxSize": "10m"
},
"mcp": {
"serverName": "dgx-spark-mcp",
"serverVersion": "0.1.0",
"transport": "stdio"
},
"hardware": {
"nvidiaSmiPath": "/usr/bin/nvidia-smi",
"cacheTTL": 30000,
"enableGpuMonitoring": true
},
"spark": {},
"performance": {
"enableMetrics": true,
"metricsInterval": 60000,
"healthCheckInterval": 30000
},
"security": {
"enableAuth": false
}
}

@ -0,0 +1,48 @@
[Unit]
Description=DGX Spark MCP Server
Documentation=https://github.com/raibid-labs/dgx-spark-mcp
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=dgx
Group=dgx
# Environment variables
Environment="NODE_ENV=production"
Environment="DGX_MCP_CONFIG_PATH=/etc/dgx-spark-mcp/config.json"
# Start the service
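# Assumes installed globally via npm with the global prefix /usr/local
# (check with `npm prefix -g` and adjust the path below if it differs)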
ExecStart=/usr/local/bin/dgx-spark-mcp
# Restart policy
Restart=on-failure
RestartSec=10
StartLimitInterval=600
StartLimitBurst=5
# Resource limits
LimitNOFILE=65536
LimitNPROC=4096
# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
# Allow write access to logs
ReadWritePaths=/var/log/dgx-spark-mcp
# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=dgx-spark-mcp
# Process management
KillMode=mixed
KillSignal=SIGTERM
TimeoutStopSec=30
[Install]
WantedBy=multi-user.target

@ -0,0 +1,78 @@
#!/bin/bash
set -euo pipefail

# DGX Spark MCP Server - Playbook Installation Script
# Installs the server from NPM and configures systemd

# Configuration
PACKAGE_NAME="dgx-spark-mcp"
SERVICE_NAME="dgx-spark-mcp"
CONFIG_DIR="/etc/dgx-spark-mcp"
LOG_DIR="/var/log/dgx-spark-mcp"
USER="dgx"
GROUP="dgx"

# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
NC='\033[0m'

log_info() { echo -e "${GREEN}[INFO]${NC} $1"; }
# Errors go to stderr so they are not lost when stdout is redirected
log_error() { echo -e "${RED}[ERROR]${NC} $1" >&2; }

# Check root
if [[ $EUID -ne 0 ]]; then
    log_error "This script must be run as root"
    exit 1
fi

# 1. Check for Node.js 18+ (this script does not install it)
if ! command -v node &> /dev/null; then
    log_error "Node.js not found. Please install Node.js 18+."
    exit 1
fi
NODE_MAJOR="$(node -p 'process.versions.node.split(".")[0]')"
if (( NODE_MAJOR < 18 )); then
    log_error "Node.js 18+ is required (found $(node --version))."
    exit 1
fi

# 2. Install Package
log_info "Installing $PACKAGE_NAME from registry..."
npm install -g "$PACKAGE_NAME"

# 3. Create User (with a matching group, which the chown calls below rely on)
if ! id -u "$USER" &>/dev/null; then
    log_info "Creating user $USER..."
    useradd --system --user-group --no-create-home --shell /bin/false "$USER"
fi

# 4. Setup Directories
log_info "Setting up directories..."
mkdir -p "$CONFIG_DIR"
mkdir -p "$LOG_DIR"

# Copy config if provided in playbook
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
if [[ -f "$SCRIPT_DIR/../config/default.json" ]]; then
    cp "$SCRIPT_DIR/../config/default.json" "$CONFIG_DIR/config.json"
else
    log_info "No default config found, using internal defaults."
fi

# Permissions
chown -R "$USER:$GROUP" "$LOG_DIR"
chown -R "$USER:$GROUP" "$CONFIG_DIR"
chmod 755 "$LOG_DIR"
chmod 755 "$CONFIG_DIR"

# 5. Setup Service
log_info "Configuring systemd service..."
if [[ -f "$SCRIPT_DIR/../deploy/$SERVICE_NAME.service" ]]; then
    cp "$SCRIPT_DIR/../deploy/$SERVICE_NAME.service" "/etc/systemd/system/$SERVICE_NAME.service"
    systemctl daemon-reload
    systemctl enable "$SERVICE_NAME"
    systemctl restart "$SERVICE_NAME"
    log_info "Service started."
else
    log_error "Service file not found."
    exit 1
fi

log_info "Installation complete."
log_info "Status: systemctl status $SERVICE_NAME"

@ -1,4 +1,4 @@
# NemoClaw with Nemotron-3-Super and Telegram on DGX Spark
# NemoClaw with Nemotron 3 Super and Telegram on DGX Spark
> Install NemoClaw on DGX Spark with local Ollama inference and Telegram bot integration
@ -25,7 +25,7 @@
- [Step 6. Talk to the agent (CLI)](#step-6-talk-to-the-agent-cli)
- [Step 7. Interactive TUI](#step-7-interactive-tui)
- [Step 8. Exit the sandbox and access the Web UI](#step-8-exit-the-sandbox-and-access-the-web-ui)
- [Step 9. Prepare credentials](#step-9-prepare-credentials)
- [Step 9. Create a Telegram bot](#step-9-create-a-telegram-bot)
- [Step 10. Configure and start the Telegram bridge](#step-10-configure-and-start-the-telegram-bridge)
- [Step 11. Stop services](#step-11-stop-services)
- [Step 12. Uninstall NemoClaw](#step-12-uninstall-nemoclaw)
@ -192,14 +192,6 @@ Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
```
Verify it is running:
```bash
curl http://localhost:11434
```
Expected: `Ollama is running`. If not, start it: `ollama serve &`
Configure Ollama to listen on all interfaces so the sandbox container can reach it:
```bash
@ -209,6 +201,17 @@ sudo systemctl daemon-reload
sudo systemctl restart ollama
```
Verify it is running and reachable on all interfaces:
```bash
curl http://0.0.0.0:11434
```
Expected: `Ollama is running`. If not, start it with `sudo systemctl start ollama`.
> [!IMPORTANT]
> Always start Ollama via systemd (`sudo systemctl restart ollama`) — do not use `ollama serve &`. A manually started Ollama process does not pick up the `OLLAMA_HOST=0.0.0.0` setting above, and the NemoClaw sandbox will not be able to reach the inference server.
### Step 3. Pull the Nemotron 3 Super model
Download Nemotron 3 Super 120B (~87 GB; may take 15--30 minutes depending on network speed):
@ -237,10 +240,10 @@ You should see `nemotron-3-super:120b` in the output.
### Step 4. Install NemoClaw
This single command handles everything: installs Node.js (if needed), installs OpenShell, clones NemoClaw at the pinned stable release (`v0.0.1`), builds the CLI, and runs the onboard wizard to create a sandbox.
This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the latest stable NemoClaw release, builds the CLI, and runs the onboard wizard to create a sandbox.
```bash
curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_INSTALL_TAG=v0.0.4 bash
curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
```
The onboard wizard walks you through setup:
@ -358,14 +361,12 @@ http://127.0.0.1:18789/#token=<long-token-here>
## Phase 3: Telegram Bot
### Step 9. Prepare credentials
> [!NOTE]
> If you already configured Telegram during the NemoClaw onboarding wizard (step 5/8), you can skip this phase. These steps cover adding Telegram after the initial setup.
You need two items:
### Step 9. Create a Telegram bot
| Item | Where to get it |
|------|----------------|
| Telegram bot token | Open Telegram, find [@BotFather](https://t.me/BotFather), send `/newbot`, and follow the prompts. Copy the token it gives you. |
| NVIDIA API key | Go to [build.nvidia.com/settings/api-keys](https://build.nvidia.com/settings/api-keys) and create or copy a key (starts with `nvapi-`). |
Open Telegram, find [@BotFather](https://t.me/BotFather), send `/newbot`, and follow the prompts. Copy the bot token it gives you.
### Step 10. Configure and start the Telegram bridge
@ -376,6 +377,7 @@ Set the required environment variables. Replace the placeholders with your actua
```bash
export TELEGRAM_BOT_TOKEN=<your-bot-token>
export SANDBOX_NAME=my-assistant
export NVIDIA_API_KEY=<your-nvidia-api-key>
```
Add the Telegram network policy to the sandbox:
@ -384,34 +386,36 @@ Add the Telegram network policy to the sandbox:
nemoclaw my-assistant policy-add
```
When prompted, type `telegram` and hit **Y** to confirm.
When prompted, select `telegram` and hit **Y** to confirm.
Start the Telegram bridge. On first run it will ask for your NVIDIA API key:
Start the Telegram bridge.
```bash
export TELEGRAM_BOT_TOKEN=<your-bot-token>
nemoclaw start
```
Paste your `nvapi-` key when prompted.
The Telegram bridge starts only when the `TELEGRAM_BOT_TOKEN` environment variable is set. Verify the services are running:
You should see:
```text
[services] telegram-bridge started
Telegram: bridge running
```bash
nemoclaw status
```
Open Telegram, find your bot, and send it a message. The bot forwards it to the agent and replies.
> [!NOTE]
> The first response may include a debug log line like "gateway Running as non-root..." -- this is cosmetic and can be ignored.
> The first response may take 30--90 seconds for a 120B parameter model running locally.
> [!NOTE]
> If you need to restart the bridge, `nemoclaw stop` may not cleanly stop the process. If that happens, find and kill the bridge process via its PID file:
> If the bridge does not appear in `nemoclaw status`, make sure `TELEGRAM_BOT_TOKEN` is exported in the same shell session where you run `nemoclaw start`. You can also try stopping and restarting:
> ```bash
> kill -9 "$(cat /tmp/nemoclaw-services-${SANDBOX_NAME}/telegram-bridge.pid)"
> nemoclaw stop
> export TELEGRAM_BOT_TOKEN=<your-bot-token>
> nemoclaw start
> ```
> Then run `nemoclaw start` again.
> [!NOTE]
> For details on restricting which Telegram chats can interact with the agent, see the [NemoClaw Telegram bridge documentation](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html).
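
Because the bridge depends on `TELEGRAM_BOT_TOKEN` being exported in the shell that runs `nemoclaw start`, a guard can catch a missing or unexported variable before starting. This is a sketch only; `require_env` is a hypothetical helper and is not part of NemoClaw.

```bash
# Hypothetical helper: fail if any named variable is missing from the
# exported environment (a shell variable that was never exported also fails).
require_env() {
    local name value missing=0
    for name in "$@"; do
        # printenv only sees exported variables, which is what child
        # processes such as `nemoclaw start` inherit.
        value="$(printenv "$name")" || value=""
        if [ -z "$value" ]; then
            echo "missing exported environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A possible use is `require_env TELEGRAM_BOT_TOKEN && nemoclaw start`, which skips the start when the token is not visible to child processes.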
---
@ -419,7 +423,7 @@ Open Telegram, find your bot, and send it a message. The bot forwards it to the
### Step 11. Stop services
Stop any running auxiliary services (Telegram bridge, cloudflared):
Stop any running auxiliary services (Telegram bridge, cloudflared tunnel):
```bash
nemoclaw stop
@ -474,7 +478,7 @@ The uninstaller runs 6 steps:
| `nemoclaw my-assistant status` | Show sandbox status and inference config |
| `nemoclaw my-assistant logs --follow` | Stream sandbox logs in real time |
| `nemoclaw list` | List all registered sandboxes |
| `nemoclaw start` | Start auxiliary services (Telegram bridge) |
| `nemoclaw start` | Start auxiliary services (Telegram bridge, cloudflared) |
| `nemoclaw stop` | Stop auxiliary services |
| `openshell term` | Open the monitoring TUI on the host |
| `openshell forward list` | List active port forwards |

@ -214,34 +214,22 @@ Verify Ollama is running (it auto-starts as a service after installation). If no
ollama serve &
```
Configure Ollama to listen on all interfaces so the OpenShell gateway container can reach it. Create a systemd override:
```bash
mkdir -p /etc/systemd/system/ollama.service.d/
sudo nano /etc/systemd/system/ollama.service.d/override.conf
```
Add these lines to the file (create the file if it does not exist):
```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
```
Save and exit, then reload and restart Ollama:
Configure Ollama to listen on all interfaces so the OpenShell gateway container can reach it:
```bash
sudo mkdir -p /etc/systemd/system/ollama.service.d
printf '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0"\n' | sudo tee /etc/systemd/system/ollama.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
Verify Ollama is listening on all interfaces:
Verify Ollama is running and reachable on all interfaces:
```bash
ss -tlnp | grep 11434
curl http://0.0.0.0:11434
```
You should see `*:11434` in the output. If it only shows `127.0.0.1:11434`, confirm the override file contents and that you ran `systemctl daemon-reload` before restarting.
Expected: `Ollama is running`. If not, start it with `sudo systemctl start ollama`.
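
On Linux, a request to `0.0.0.0` is typically served by the local host, so a successful `curl` mainly confirms that Ollama is up. To confirm the bind address as well, the listening socket can be inspected with `ss -tln`. The helper below is a sketch that classifies the address shown in the `Local Address:Port` column; `listen_scope` is a hypothetical name.

```bash
# Hypothetical helper: classify a listen address as printed in the
# "Local Address:Port" column of `ss -tln`.
listen_scope() {
    case "$1" in
        '*:'*|'0.0.0.0:'*|'[::]:'*) echo "all-interfaces" ;;
        '127.'*|'[::1]:'*)          echo "loopback-only" ;;
        *)                          echo "specific-address" ;;
    esac
}
```

One way to use it is `ss -tln | awk '$4 ~ /:11434$/ {print $4}' | while read -r addr; do listen_scope "$addr"; done`. A result of `loopback-only` would suggest the `OLLAMA_HOST=0.0.0.0` override was not picked up.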
Next, run a model from Ollama (adjust the model name to match your choice from [the Ollama model library](https://ollama.com/library)). The `ollama run` command will pull the model automatically if it is not already present. Running the model here ensures it is loaded and ready when you use it with OpenClaw, reducing the chance of timeouts later. Example for nemotron-3-super:

@ -57,7 +57,7 @@ In short: two Sparks let you run models that are too large for one, while specul
- Docker with GPU support enabled
```bash
docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 nvidia-smi
docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12 nvidia-smi
```
- Active HuggingFace Token for model access
- Network connectivity for model downloads
@ -68,9 +68,9 @@ In short: two Sparks let you run models that are too large for one, while specul
* **Duration:** 10-20 minutes for setup, additional time for model downloads (varies by network speed)
* **Risks:** GPU memory exhaustion with large models, container registry access issues, network timeouts during downloads
* **Rollback:** Stop Docker containers and optionally clean up downloaded model cache.
* **Last Updated:** 01/02/2026
* Upgrade to latest container v1.2.0rc6
* Add EAGLE-3 Speculative Decoding example with GPT-OSS-120B
* **Last Updated:** 04/20/2026
* Upgrade to latest container 1.3.0rc12
* Add Speculative Decoding example with Qwen3-235B-A22B on Two Sparks
## Instructions
@ -111,7 +111,7 @@ docker run \
-v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
--rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
--gpus=all --ipc=host --network host \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12 \
bash -c '
hf download openai/gpt-oss-120b && \
hf download nvidia/gpt-oss-120b-Eagle3-long-context \
@ -172,7 +172,7 @@ docker run \
-e HF_TOKEN=$HF_TOKEN \
-v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
--rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
--gpus=all --ipc=host --network host nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
--gpus=all --ipc=host --network host nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12 \
bash -c "
# # Download models
hf download nvidia/Llama-3.3-70B-Instruct-FP4 && \
@ -309,7 +309,7 @@ docker run -d --rm \
-e TRITON_PTXAS_PATH="/usr/local/cuda/bin/ptxas" \
-v ~/.cache/huggingface/:/root/.cache/huggingface/ \
-v ~/.ssh:/tmp/.ssh:ro \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12 \
bash -c "curl https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/trt-llm/assets/trtllm-mn-entrypoint.sh | bash"
```

@ -685,6 +685,7 @@ docker rmi ghcr.io/open-webui/open-webui:main
| "invalid mount config for type 'bind'" | Missing or non-executable entrypoint script | Run `docker inspect <container_id>` to see full error message. Verify `trtllm-mn-entrypoint.sh` exists on both nodes in your home directory (`ls -la $HOME/trtllm-mn-entrypoint.sh`) and has executable permissions (`chmod +x $HOME/trtllm-mn-entrypoint.sh`) |
| "task: non-zero exit (255)" | Container exit with error code 255 | Check container logs with `docker ps -a --filter "name=trtllm-multinode_trtllm"` to get container ID, then `docker logs <container_id>` to see detailed error messages |
| Docker state stuck in "Pending" with "no suitable node (insufficien...)" | Docker daemon not properly configured for GPU access | Verify steps 2-4 were completed successfully and check that `/etc/docker/daemon.json` contains correct GPU configuration |
| Serving the model fails with `ptxas fatal` errors | Model needs runtime Triton kernel compilation | In Step 10, add `-x TRITON_PTXAS_PATH` to your `mpirun` command |
> [!NOTE]
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.