Compare commits

...

16 Commits

Author SHA1 Message Date
Raymond Lo
a71a8a9856
Merge 7456f3da1d into 8452a1c5b1 2026-04-08 05:14:21 +00:00
GitLab CI
8452a1c5b1 chore: Regenerate all playbooks 2026-04-08 02:41:59 +00:00
GitLab CI
9414a5141f chore: Regenerate all playbooks 2026-04-07 04:13:30 +00:00
GitLab CI
911ca6db8b chore: Regenerate all playbooks 2026-04-06 19:32:24 +00:00
GitLab CI
08c06d5bd9 chore: Regenerate all playbooks 2026-04-03 03:33:41 +00:00
GitLab CI
87796cfb06 chore: Regenerate all playbooks 2026-04-03 02:52:13 +00:00
GitLab CI
36ac5b74eb chore: Regenerate all playbooks 2026-04-02 22:45:52 +00:00
GitLab CI
c88578ffe1 chore: Regenerate all playbooks 2026-04-02 18:20:14 +00:00
GitLab CI
532624b364 chore: Regenerate all playbooks 2026-04-02 18:13:36 +00:00
GitLab CI
cfbe0f9631 chore: Regenerate all playbooks 2026-04-01 04:39:59 +00:00
GitLab CI
77b6255ba2 chore: Regenerate all playbooks 2026-04-01 04:26:10 +00:00
GitLab CI
de1110cdc7 chore: Regenerate all playbooks 2026-04-01 03:49:49 +00:00
GitLab CI
03dad8645b chore: Regenerate all playbooks 2026-03-31 13:33:01 +00:00
GitLab CI
0c03f4d204 chore: Regenerate all playbooks 2026-03-30 17:32:55 +00:00
GitLab CI
c3770ec3c7 chore: Regenerate all playbooks 2026-03-30 15:12:21 +00:00
Raymond Lo
7456f3da1d
Fix Isaac Sim links to use HTTPS
Updated the Isaac Sim links in the README to use HTTPS; otherwise they redirect to GitHub.
2026-01-20 11:01:16 -08:00
14 changed files with 840 additions and 313 deletions

View File

@ -31,6 +31,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
- [Install and Use Isaac Sim and Isaac Lab](nvidia/isaac/)
- [Optimized JAX](nvidia/jax/)
- [Live VLM WebUI](nvidia/live-vlm-webui/)
- [Run models with llama.cpp on DGX Spark](nvidia/llama-cpp/)
- [LLaMA Factory](nvidia/llama-factory/)
- [LM Studio on DGX Spark](nvidia/lm-studio/)
- [Build and Deploy a Multi-Agent Chatbot](nvidia/multi-agent-chatbot/)
@ -38,7 +39,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
- [Connect Multiple DGX Spark through a Switch](nvidia/multi-sparks-through-switch/)
- [NCCL for Two Sparks](nvidia/nccl/)
- [Fine-tune with NeMo](nvidia/nemo-fine-tune/)
- [NemoClaw with Nemotron-3-Super on DGX Spark](nvidia/nemoclaw/)
- [NemoClaw with Nemotron-3-Super and Telegram on DGX Spark](nvidia/nemoclaw/)
- [Nemotron-3-Nano with llama.cpp](nvidia/nemotron/)
- [NIM on Spark](nvidia/nim-llm/)
- [NVFP4 Quantization](nvidia/nvfp4-quantization/)

View File

@ -158,15 +158,15 @@ network:
enP2p1s0f0np0:
dhcp4: false
addresses:
- 192.168.0.2/24
- 192.168.1.1/24
enp1s0f1np1:
dhcp4: false
addresses:
- 192.168.1.1/24
- 192.168.2.1/24
enP2p1s0f1np1:
dhcp4: false
addresses:
- 192.168.1.2/24
- 192.168.3.1/24
EOF
## Set appropriate permissions
@ -186,19 +186,19 @@ network:
enp1s0f0np0:
dhcp4: false
addresses:
- 192.168.2.1/24
- 192.168.4.1/24
enP2p1s0f0np0:
dhcp4: false
addresses:
- 192.168.2.2/24
- 192.168.5.1/24
enp1s0f1np1:
dhcp4: false
addresses:
- 192.168.0.3/24
- 192.168.0.2/24
enP2p1s0f1np1:
dhcp4: false
addresses:
- 192.168.0.4/24
- 192.168.1.2/24
EOF
## Set appropriate permissions
@ -218,19 +218,19 @@ network:
enp1s0f0np0:
dhcp4: false
addresses:
- 192.168.1.3/24
- 192.168.2.2/24
enP2p1s0f0np0:
dhcp4: false
addresses:
- 192.168.1.4/24
- 192.168.3.2/24
enp1s0f1np1:
dhcp4: false
addresses:
- 192.168.2.3/24
- 192.168.4.2/24
enP2p1s0f1np1:
dhcp4: false
addresses:
- 192.168.2.4/24
- 192.168.5.2/24
EOF
## Set appropriate permissions
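Once a plan file is written, a quick sanity check helps before moving on (a sketch; `netplan apply` and the interface-name pattern are assumptions based on the DGX Spark naming used above):

```bash
# Apply the new plan and list the addresses on the Spark link interfaces
# (the grep pattern assumes the enp1s0f*/enP2p1s0f* naming shown above)
sudo netplan apply
ip -br addr show | grep -E 'en(P2)?p1s0f[01]np[01]'
```

Each configured interface should appear with the static address you assigned to it.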
@ -254,8 +254,8 @@ bash ./discover-sparks
Expected output similar to the below, with different IPs and node names. You may see more than one IP for each node as four interfaces (**enp1s0f0np0**, **enP2p1s0f0np0**, **enp1s0f1np1** and **enP2p1s0f1np1**) have IP addresses assigned. This is expected and does not cause any issues. The first time you run the script, you'll be prompted for your password for each node.
```
Found: 192.168.0.1 (dgx-spark-1.local)
Found: 192.168.0.3 (dgx-spark-2.local)
Found: 192.168.1.3 (dgx-spark-3.local)
Found: 192.168.0.2 (dgx-spark-2.local)
Found: 192.168.3.2 (dgx-spark-3.local)
Setting up bidirectional SSH access (local <-> remote nodes)...
You may be prompted for your password for each node.

View File

@ -122,7 +122,7 @@ ${ISAACSIM_PATH}/isaac-sim.sh
## Run Isaac Lab
## Step 1. Install Isaac Sim
If you haven't already done so, install [Isaac Sim](build.nvidia.com/spark/isaac/isaac-sim) first.
If you haven't already done so, install [Isaac Sim](https://build.nvidia.com/spark/isaac/isaac-sim) first.
## Step 2. Clone the Isaac Lab repository into your workspace
@ -140,7 +140,7 @@ cd IsaacLab
## Step 3. Create a symbolic link to the Isaac Sim installation
Be sure that you have already installed Isaac Sim from [Isaac Sim](build.nvidia.com/spark/isaac/isaac-sim) before running the following command.
Be sure that you have already installed Isaac Sim from [Isaac Sim](https://build.nvidia.com/spark/isaac/isaac-sim) before running the following command.
```bash
echo "ISAACSIM_PATH=$ISAACSIM_PATH"

nvidia/llama-cpp/README.md Normal file
View File

@ -0,0 +1,269 @@
# Run models with llama.cpp on DGX Spark
> Build llama.cpp with CUDA and serve models via an OpenAI-compatible API (Gemma 4 31B IT as the example)
## Table of Contents
- [Overview](#overview)
- [Instructions](#instructions)
- [Troubleshooting](#troubleshooting)
---
## Overview
## Basic idea
[llama.cpp](https://github.com/ggml-org/llama.cpp) is a lightweight C/C++ inference stack for large language models. You build it with CUDA so tensor work runs on the DGX Spark GB10 GPU, then load GGUF weights and expose chat through `llama-server`'s OpenAI-compatible HTTP API.
This playbook walks through that stack end to end. As the model example, it uses **Gemma 4 31B IT** - a frontier reasoning model built by Google DeepMind that llama.cpp supports, with strengths in coding, agentic workflows, and fine-tuning. The instructions download its **F16** GGUF from Hugging Face. The same build and server steps apply to other GGUFs (including other sizes in the support matrix below).
## What you'll accomplish
You will build llama.cpp with CUDA for GB10, download a Gemma 4 31B IT model checkpoint, and run **`llama-server`** with GPU offload. You get:
- Local inference through llama.cpp (no separate Python inference framework required)
- An OpenAI-compatible `/v1/chat/completions` endpoint for tools and apps
- A concrete validation that **Gemma 4 31B IT** runs on this stack on DGX Spark
## What to know before starting
- Basic familiarity with Linux command line and terminal commands
- Understanding of git and building from source with CMake
- Basic knowledge of REST APIs and cURL for testing
- Familiarity with Hugging Face Hub for downloading GGUF files
## Prerequisites
**Hardware requirements**
- NVIDIA DGX Spark with GB10 GPU
- Sufficient unified memory for the F16 checkpoint (on the order of **~62GB** for weights alone; more when KV cache and runtime overhead are included)
- At least **~70GB** free disk for the F16 download plus build artifacts (use a smaller quant from the same repo if you need less disk and VRAM)
**Software requirements**
- NVIDIA DGX OS
- Git: `git --version`
- CMake (3.14+): `cmake --version`
- CUDA Toolkit: `nvcc --version`
- Network access to GitHub and Hugging Face
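A quick capacity check against the numbers above can save a failed download (a minimal sketch using standard coreutils; it assumes the download lands on your home filesystem):

```bash
# Report unified memory and free disk on the home filesystem;
# compare against ~62GB for the F16 weights and ~70GB free disk noted above
free -g | awk '/^Mem:/ {print "unified memory (GB): " $2}'
df -BG --output=avail ~ | tail -1 | awk '{print "free disk (home): " $1}'
```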
## Model Support Matrix
The following models are supported with llama.cpp on Spark. All listed models are available and ready to use:
| Model | Support Status | HF Handle |
|-------|----------------|-----------|
| **Gemma 4 31B IT** | ✅ | `ggml-org/gemma-4-31B-it-GGUF` |
| **Gemma 4 26B A4B IT** | ✅ | `ggml-org/gemma-4-26B-A4B-it-GGUF` |
| **Gemma 4 E4B IT** | ✅ | `ggml-org/gemma-4-E4B-it-GGUF` |
| **Gemma 4 E2B IT** | ✅ | `ggml-org/gemma-4-E2B-it-GGUF` |
| **Nemotron-3-Nano** | ✅ | `unsloth/Nemotron-3-Nano-30B-A3B-GGUF` |
## Time & risk
* **Estimated time:** About 30 minutes, plus downloading the ~62GB example
* **Risk level:** Low — build is local to your clone; no system-wide installs required for the steps below
* **Rollback:** Remove the `llama.cpp` clone and the model directory under `~/models/` to reclaim disk space
* **Last updated:** 04/02/2026
* First Publication
## Instructions
## Step 1. Verify prerequisites
This walkthrough uses **Gemma 4 31B IT** (`gemma-4-31B-it-f16.gguf`) as the example checkpoint. You can substitute another GGUF from [`ggml-org/gemma-4-31B-it-GGUF`](https://huggingface.co/ggml-org/gemma-4-31B-it-GGUF) (for example `Q4_K_M` or `Q8_0`) by changing the `hf download` filename and `--model` path in later steps.
Ensure the required tools are installed:
```bash
git --version
cmake --version
nvcc --version
```
All commands should return version information. If any are missing, install them before continuing.
Install the Hugging Face CLI:
```bash
python3 -m venv llama-cpp-venv
source llama-cpp-venv/bin/activate
pip install -U "huggingface_hub[cli]"
```
Verify installation:
```bash
hf version
```
## Step 2. Clone the llama.cpp repository
Clone upstream llama.cpp—the framework you are building:
```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```
## Step 3. Build llama.cpp with CUDA
Configure CMake with CUDA and the GB10's **sm_121** architecture so GGML's CUDA backend matches your GPU:
```bash
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
make -j8
```
The build usually takes on the order of 5–10 minutes. When it finishes, binaries such as `llama-server` appear under `build/bin/`.
## Step 4. Download Gemma 4 31B IT GGUF (supported model example)
llama.cpp loads models in **GGUF** format. **gemma-4-31B-it** is available in GGUF from Hugging Face; this playbook uses an F16 variant that balances quality and memory on GB10-class hardware.
```bash
hf download ggml-org/gemma-4-31B-it-GGUF \
gemma-4-31B-it-f16.gguf \
--local-dir ~/models/gemma-4-31B-it-GGUF
```
The F16 file is large (**~62GB**). The download can be resumed if interrupted.
## Step 5. Start llama-server with Gemma 4 31B IT
From your `llama.cpp/build` directory, launch the OpenAI-compatible server with GPU offload:
```bash
./bin/llama-server \
--model ~/models/gemma-4-31B-it-GGUF/gemma-4-31B-it-f16.gguf \
--host 0.0.0.0 \
--port 30000 \
--n-gpu-layers 99 \
--ctx-size 8192 \
--threads 8
```
**Parameters (short):**
- `--host` / `--port`: bind address and port for the HTTP API
- `--n-gpu-layers 99`: offload layers to the GPU (adjust if you use a different model)
- `--ctx-size`: context length (can be increased up to model/server limits; uses more memory)
- `--threads`: CPU threads for non-GPU work
You should see log lines similar to:
```
llama_new_context_with_model: n_ctx = 8192
...
main: server is listening on 0.0.0.0:30000
```
**Keep this terminal open** while testing. Large GGUFs can take several minutes to load; until you see `server is listening`, nothing accepts connections on port 30000 (see Troubleshooting if `curl` reports connection refused).
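Rather than watching the log, you can poll the server's health route from another terminal until it reports ready (a small sketch; it assumes llama-server's default `/health` endpoint on the port chosen above):

```bash
# Wait until llama-server answers HTTP 200 on /health (up to ~10 minutes);
# while the model is still loading, the endpoint does not return success
for i in $(seq 1 120); do
  if curl -sf http://127.0.0.1:30000/health > /dev/null; then
    echo "server is ready"
    break
  fi
  sleep 5
done
```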
## Step 6. Test the API
Use a **second terminal on the same machine** that runs `llama-server` (for example another SSH session into DGX Spark). If you run `curl` on your laptop while the server runs only on Spark, use the Spark hostname or IP instead of `localhost`.
```bash
curl -X POST http://127.0.0.1:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4",
"messages": [{"role": "user", "content": "New York is a great city because..."}],
"max_tokens": 100
}'
```
If you see `curl: (7) Failed to connect`, the server is still loading, the process exited (check the server log for OOM or path errors), or you are not curling the host that runs `llama-server`.
Example shape of the response (fields vary by llama.cpp version; `message` may include extra keys):
```json
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"message": {
"role": "assistant",
"content": "New York is a great city because it's a living, breathing collage of cultures, ideas, and possibilities—all stacked into one vibrant, neversleeping metropolis. Here are just a few reasons that many people ("
}
}
],
"created": 1765916539,
"model": "gemma-4-31B-it-f16.gguf",
"object": "chat.completion",
"usage": {
"completion_tokens": 100,
"prompt_tokens": 25,
"total_tokens": 125
},
"id": "chatcmpl-...",
"timings": {
...
}
}
```
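For scripting, you often want only the assistant text rather than the whole JSON envelope (a sketch; it assumes the response shape shown above and uses `python3` for parsing in place of `jq`):

```bash
# Ask a question and print just the assistant message content
curl -s -X POST http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma4", "messages": [{"role": "user", "content": "Name one borough of New York."}], "max_tokens": 50}' \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```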
## Step 7. Longer completion (with example model)
Try a slightly longer prompt to confirm stable generation with **Gemma 4 31B IT**:
```bash
curl -X POST http://127.0.0.1:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4",
"messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
"max_tokens": 500
}'
```
## Step 8. Cleanup
Stop the server with `Ctrl+C` in the terminal where it is running.
To remove this tutorial's artifacts:
```bash
rm -rf ~/llama.cpp
rm -rf ~/models/gemma-4-31B-it-GGUF
```
Deactivate the Python venv if you no longer need `hf`:
```bash
deactivate
```
## Step 9. Next steps
1. **Context length:** Increase `--ctx-size` for longer chats (watch memory; 1M-token class contexts are possible only when the build, model, and hardware allow).
2. **Other models:** Point `--model` at any compatible GGUF; the llama.cpp server API stays the same.
3. **Integrations:** Point Open WebUI, Continue.dev, or custom clients at `http://<spark-host>:30000/v1` using the OpenAI client pattern.
The server implements the usual OpenAI-style chat features your llama.cpp build enables (including streaming and tool-related flows where supported).
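For example, streaming uses the standard OpenAI `stream` flag and returns server-sent events (a sketch against the endpoint started above, assuming your build enables streaming):

```bash
# Stream tokens as they are generated; -N disables curl's output buffering
curl -N -X POST http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma4", "stream": true, "messages": [{"role": "user", "content": "Count to five."}]}'
```

Chunks arrive as `data: {...}` lines in the OpenAI SSE format until generation finishes.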
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| `cmake` fails with "CUDA not found" | CUDA toolkit not in PATH | Run `export PATH=/usr/local/cuda/bin:$PATH` and re-run CMake from a clean build directory |
| Build errors mentioning wrong GPU arch | CMake `CMAKE_CUDA_ARCHITECTURES` does not match GB10 | Use `-DCMAKE_CUDA_ARCHITECTURES="121"` for DGX Spark GB10 as in the instructions |
| GGUF download fails or stalls | Network or Hugging Face availability | Re-run `hf download`; it resumes partial files |
| "CUDA out of memory" when starting `llama-server` | Model too large for current context or VRAM | Lower `--ctx-size` (e.g. 4096) or use a smaller quantization from the same repo |
| Server runs but latency is high | Layers not on GPU | Confirm `--n-gpu-layers` is high enough for your model; check `nvidia-smi` during a request |
| `curl: (7) Failed to connect` on port 30000 | No listener yet, wrong host, or crash | Wait for `server is listening`; run `curl` on the same host as `llama-server` (or use the Spark's IP); run `ss -tln` and confirm `:30000`; read server stderr for OOM or a bad `--model` path |
| Chat API errors or empty replies | Wrong `--model` path or incompatible GGUF | Verify the path to the `.gguf` file; update llama.cpp if the GGUF requires a newer format |
> [!NOTE]
> DGX Spark uses Unified Memory Architecture (UMA), which allows flexible sharing between GPU and CPU memory. Some software is still catching up to UMA behavior. If you hit memory pressure unexpectedly, you can try flushing the page cache (use with care on shared systems):
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
For the latest platform issues, see the [DGX Spark known issues](https://docs.nvidia.com/dgx/dgx-spark/known-issues.html) documentation.

View File

@ -688,13 +688,13 @@
"node_modules/@next/env": {
"version": "15.6.0-canary.60",
"resolved": "https://registry.npmjs.org/@next/env/-/env-15.6.0-canary.60.tgz",
"integrity": "sha512-d9jnRrkuOH7Mhi+LHav2XW91HOgTAWHxjMPkXMGBc9B2b7614P7kjt8tAplRvJpbSt4nbO1lugcT/kAaWzjlLQ==",
"integrity": "sha512-VVoy8NpkI2ngEr3RsD8jzum8SbXisObkvcCMBJeiWRSMOtACIm6kbnTYjgLWeGEwHHKsVBdeJpnf558rF6ktbw==",
"license": "MIT"
},
"node_modules/@next/eslint-plugin-next": {
"version": "15.6.0-canary.60",
"resolved": "https://registry.npmjs.org/@next/eslint-plugin-next/-/eslint-plugin-next-15.6.0-canary.60.tgz",
"integrity": "sha512-kRP7RjSxfTO13NE317ek3mSGzoZlI33nc/i5hs1KaWpK+egs85xg0DJ4p32QEiHnR0mVjuUfhRIun7awqfL7pQ==",
"integrity": "sha512-LjRcK6eL6jNseAiKJgGbYSddLN65SMmNoj90/ShVZx/+F7AkoFPKZWA5EVrnXAYIPBOkJ0jyB/NLHUQv5YQWyw==",
"dev": true,
"license": "MIT",
"dependencies": {
@ -2438,7 +2438,7 @@
"node_modules/eslint-config-next": {
"version": "15.6.0-canary.60",
"resolved": "https://registry.npmjs.org/eslint-config-next/-/eslint-config-next-15.6.0-canary.60.tgz",
"integrity": "sha512-zXoMnYUIy3XHaAoOhrcYkT9UQWvXqWju2K7NNsmb5wd/7XESDwof61eUdW4QhERr3eJ9Ko/vnXqIrj8kk/drYw==",
"integrity": "sha512-Z911jMoc4mD14qvhrTAN5MUh0DYl/vMBdn0eghuJDurOi4qbIkAFObg9ju9A77XLGPYfAaivjs7Sgtxh7uTzkw==",
"dev": true,
"license": "MIT",
"dependencies": {
@ -5231,7 +5231,7 @@
"node_modules/next": {
"version": "15.6.0-canary.60",
"resolved": "https://registry.npmjs.org/next/-/next-15.6.0-canary.60.tgz",
"integrity": "sha512-GNeINPGS9c6OZKCvKypbL8GTsT5GhWPp4DM0fzkXJuXMilOO2EeFxuAY6JZbtk6XIl6Ws10ag3xRINDjSO5+wg==",
"integrity": "sha512-E5gKHda+vdACO+/Bv53V3qcMrhmhH8UuctFSbNeCvnIfPoWPEUMHCB+hok7qxRlKwclXoQ5SUgUdgi6Ayzu5uA==",
"license": "MIT",
"dependencies": {
"@next/env": "15.6.0-canary.60",

View File

@ -268,6 +268,27 @@ def ip_for_2node_link(link_index: int, node_id: int, local_index_in_pair: int) -
host = 1 + (0 if node_id == 1 else 2) + local_index_in_pair
return f"192.168.{link_index}.{host}/24"
def ip_for_3node_ring_link(link_index: int, node_id: int, local_index_in_pair: int) -> str:
"""
/24 scheme for 3-node ring topology.
For each node_id:
network = 192.168.third_octet.node_id/24
third_octet = link_index * 2 + local_index_in_pair
Node 1:
192.168.[0, 1].1/24 -> Node 2
192.168.[2, 3].1/24 -> Node 3
Node 2:
192.168.[4, 5].1/24 -> Node 3
192.168.[0, 1].2/24 -> Node 1
Node 3:
192.168.[2, 3].2/24 -> Node 1
192.168.[4, 5].2/24 -> Node 2
"""
return f"192.168.{link_index * 2 + local_index_in_pair}.{node_id}/24"
def ip_for_switch_link(link_index: int, node_index: int, local_index_in_pair: int) -> str:
"""
@ -602,7 +623,7 @@ def main() -> bool:
node_id_link = 1 if local_machine_id < neighbor_machine else 2
for local_idx, cfg_iface in enumerate(config_ifaces):
ip_cidr = ip_for_2node_link(link_index, node_id_link, local_idx)
ip_cidr = ip_for_3node_ring_link(link_index, node_id_link, local_idx)
iface_to_ip[cfg_iface] = ip_cidr
print(

View File

@ -47,8 +47,8 @@ All necessary files for the playbook can be found [here on GitHub](https://githu
* **Duration:** 45-90 minutes for complete setup and initial model fine-tuning
* **Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting, distributed training setup complexity increases with multi-node configurations
* **Rollback:** Virtual environments can be completely removed; no system-level changes are made to the host system beyond package installations.
* **Last Updated:** 01/15/2026
* Fix qLoRA fine-tuning workflow
* **Last Updated:** 03/04/2026
* Recommend running Nemo finetune workflow via Docker
## Instructions

View File

@ -1,6 +1,7 @@
# NemoClaw with Nemotron-3-Super on DGX Spark
# NemoClaw with Nemotron-3-Super and Telegram on DGX Spark
> Install NemoClaw on DGX Spark with local Ollama inference and Telegram bot integration
> Run NemoClaw on DGX Spark with Nemotron-3-Super
## Table of Contents
@ -12,10 +13,22 @@
- [Isolation layers (OpenShell)](#isolation-layers-openshell)
- [What to know before starting](#what-to-know-before-starting)
- [Prerequisites](#prerequisites)
- [Have ready before you begin](#have-ready-before-you-begin)
- [Ancillary files](#ancillary-files)
- [Time and risk](#time-and-risk)
- [Instructions](#instructions)
- [Restarting the gateway (if needed)](#restarting-the-gateway-if-needed)
- [Step 1. Configure Docker and the NVIDIA container runtime](#step-1-configure-docker-and-the-nvidia-container-runtime)
- [Step 2. Install Ollama](#step-2-install-ollama)
- [Step 3. Pull the Nemotron 3 Super model](#step-3-pull-the-nemotron-3-super-model)
- [Step 4. Install NemoClaw](#step-4-install-nemoclaw)
- [Step 5. Connect to the sandbox and verify inference](#step-5-connect-to-the-sandbox-and-verify-inference)
- [Step 6. Talk to the agent (CLI)](#step-6-talk-to-the-agent-cli)
- [Step 7. Interactive TUI](#step-7-interactive-tui)
- [Step 8. Exit the sandbox and access the Web UI](#step-8-exit-the-sandbox-and-access-the-web-ui)
- [Step 9. Prepare credentials](#step-9-prepare-credentials)
- [Step 10. Configure and start the Telegram bridge](#step-10-configure-and-start-the-telegram-bridge)
- [Step 11. Stop services](#step-11-stop-services)
- [Step 12. Uninstall NemoClaw](#step-12-uninstall-nemoclaw)
- [Troubleshooting](#troubleshooting)
---
@ -26,16 +39,18 @@
### Basic idea
**NVIDIA OpenShell** is an open-source runtime for running autonomous AI agents in sandboxed environments with kernel-level isolation. **NVIDIA NemoClaw** is an OpenClaw plugin that packages OpenShell with an AI agent: it includes the `nemoclaw onboard` wizard to automate setup so you can get a browser-based chat interface running locally on your DGX Spark using Ollama (e.g. NVIDIA Nemotron 3 Super).
**NVIDIA NemoClaw** is an open-source reference stack that makes it simpler and safer to run OpenClaw always-on assistants. It installs the **NVIDIA OpenShell** runtime -- an environment designed for executing agents with additional security -- and open-source models like NVIDIA Nemotron. A single installer command handles Node.js, OpenShell, and the NemoClaw CLI, then walks you through an onboard wizard to create a sandboxed agent on your DGX Spark using Ollama with Nemotron 3 Super.
By the end of this playbook you will have a working AI agent inside an OpenShell sandbox, accessible via a dashboard URL, with inference routed to a local model on your Spark—all without exposing your host filesystem or network to the agent.
By the end of this playbook you will have a working AI agent inside an OpenShell sandbox, accessible via a web dashboard and a Telegram bot, with inference routed to a local Nemotron 3 Super 120B model on your Spark -- all without exposing your host filesystem or network to the agent.
### What you'll accomplish
- Install and configure Docker for OpenShell (including cgroup fix for DGX Spark)
- Install Node.js, Ollama, the OpenShell CLI, and the NemoClaw plugin
- Run the NemoClaw onboard wizard to create a sandbox and configure inference
- Start the OpenClaw web UI inside the sandbox and chat with Nemotron 3 Super (or another Ollama model) locally
- Configure Docker and the NVIDIA container runtime for OpenShell on DGX Spark
- Install Ollama, pull Nemotron 3 Super 120B, and configure it for sandbox access
- Install NemoClaw with a single command (handles Node.js, OpenShell, and the CLI)
- Run the onboard wizard to create a sandbox and configure local inference
- Chat with the agent via the CLI, TUI, and web UI
- Set up a Telegram bot that forwards messages to your sandboxed agent
### Notice and disclaimers
@ -49,14 +64,14 @@ By installing this demo, you accept responsibility for all third-party component
#### What you're getting
This experience is provided "AS IS" for demonstration purposes only—no warranties, no guarantees. This is a demo, not a production-ready solution. You will need to implement appropriate security controls for your environment and use case.
This experience is provided "AS IS" for demonstration purposes only -- no warranties, no guarantees. This is a demo, not a production-ready solution. You will need to implement appropriate security controls for your environment and use case.
#### Key risks with AI agents
- **Data leakage** — Any materials the agent accesses could be exposed, leaked, or stolen.
- **Malicious code execution** — The agent or its connected tools could expose your system to malicious code or cyber-attacks.
- **Unintended actions** — The agent might modify or delete files, send messages, or access services without explicit approval.
- **Prompt injection and manipulation** — External inputs or connected content could hijack the agent's behavior in unexpected ways.
- **Data leakage** -- Any materials the agent accesses could be exposed, leaked, or stolen.
- **Malicious code execution** -- The agent or its connected tools could expose your system to malicious code or cyber-attacks.
- **Unintended actions** -- The agent might modify or delete files, send messages, or access services without explicit approval.
- **Prompt injection and manipulation** -- External inputs or connected content could hijack the agent's behavior in unexpected ways.
#### Participant acknowledgement
@ -82,8 +97,8 @@ By participating in this demo, you acknowledge that you are solely responsible f
**Hardware and access:**
- A DGX Spark (GB10) with keyboard and monitor, or SSH access
- An **NVIDIA API key** from [build.nvidia.com](https://build.nvidia.com) (free; only required if using NVIDIA Cloud inference — not needed for local Ollama)
- A GitHub account with access to the NVIDIA organization (for installing the OpenShell CLI from GitHub releases)
- An **NVIDIA API key** from [build.nvidia.com](https://build.nvidia.com/settings/api-keys) (needed for the Telegram bridge)
- A **Telegram bot token** from [@BotFather](https://t.me/BotFather) (create one with `/newbot`)
**Software:**
@ -95,51 +110,47 @@ Verify your system before starting:
head -n 2 /etc/os-release
nvidia-smi
docker info --format '{{.ServerVersion}}'
python3 --version
```
Expected: Ubuntu 24.04, NVIDIA GB10 GPU, Docker server version, Python 3.12+.
Expected: Ubuntu 24.04, NVIDIA GB10 GPU, Docker 28.x+.
### Have ready before you begin
| Item | Where to get it |
|------|----------------|
| NVIDIA API key | [build.nvidia.com/settings/api-keys](https://build.nvidia.com/settings/api-keys) |
| Telegram bot token | [@BotFather](https://t.me/BotFather) on Telegram -- create with `/newbot` |
### Ancillary files
All required assets are in the [NemoClaw repository](https://github.com/NVIDIA/NemoClaw). You will clone it during the instructions to install NemoClaw.
All required assets are handled by the NemoClaw installer. No manual cloning is needed.
### Time and risk
- **Estimated time:** 45–90 minutes (including first-time gateway and sandbox build, and Nemotron 3 Super download of ~87GB).
- **Risk level:** Medium — you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts.
- **Rollback:** Remove the sandbox with `openshell sandbox delete <name>`, destroy the gateway with `openshell gateway destroy -g nemoclaw`, and uninstall NemoClaw with `sudo npm uninstall -g nemoclaw` and `rm -rf ~/.nemoclaw` (see Cleanup in Instructions).
- **Last Updated:** 03/17/2026
- Updated wizard step descriptions to match actual onboard behavior
- Simplified Step 8 (gateway already runs during sandbox creation)
- Fixed repository references (NemoClaw)
- Added troubleshooting entries for port conflicts and provider setup
- **Estimated time:** 20--30 minutes (with Ollama and model already downloaded). First-time model download adds ~15--30 minutes depending on network speed.
- **Risk level:** Medium -- you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts.
- **Last Updated:** 03/31/2026
* First Publication
## Instructions
## Step 1. Docker configuration
## Phase 1: Prerequisites
Verify Docker permissions and configure the NVIDIA runtime. OpenShell's gateway runs k3s inside Docker and on DGX Spark requires a cgroup setting so the gateway can start correctly.
These steps prepare a fresh DGX Spark for NemoClaw. If Docker, the NVIDIA runtime, and Ollama are already configured, skip to Phase 2.
Verify Docker:
### Step 1. Configure Docker and the NVIDIA container runtime
```bash
docker ps
```
OpenShell's gateway runs k3s inside Docker. On DGX Spark (Ubuntu 24.04, cgroup v2), Docker must be configured with the NVIDIA runtime and host cgroup namespace mode.
If you get a permission denied error, add your user to the Docker group:
```bash
sudo usermod -aG docker $USER
```
Log out and back in for the group to take effect.
Configure Docker for the NVIDIA runtime and set cgroup namespace mode for OpenShell on DGX Spark:
Configure the NVIDIA container runtime for Docker:
```bash
sudo nvidia-ctk runtime configure --runtime=docker
```
Set the cgroup namespace mode required by OpenShell on DGX Spark:
```bash
sudo python3 -c "
import json, os
path = '/etc/docker/daemon.json'
@ -147,31 +158,33 @@ d = json.load(open(path)) if os.path.exists(path) else {}
d['default-cgroupns-mode'] = 'host'
json.dump(d, open(path, 'w'), indent=2)
"
```
Restart Docker:
```bash
sudo systemctl restart docker
```
Verify:
Verify the NVIDIA runtime works:
```bash
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```
If you get a permission denied error on `docker`, add your user to the Docker group and activate the new group in your current session:
```bash
sudo usermod -aG docker $USER
newgrp docker
```
This applies the group change immediately. Alternatively, you can log out and back in instead of running `newgrp docker`.
> [!NOTE]
> DGX Spark uses cgroup v2. OpenShell's gateway embeds k3s inside Docker and needs host cgroup namespace access. Without `default-cgroupns-mode: host`, the gateway can fail with "Failed to start ContainerManager" errors.
## Step 2. Install Node.js
NemoClaw is installed via npm and requires Node.js.
```bash
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt-get install -y nodejs
```
Verify: `node --version` should show v22.x.x.
## Step 3. Install Ollama and download a model
### Step 2. Install Ollama
Install Ollama:
@ -187,306 +200,309 @@ curl http://localhost:11434
Expected: `Ollama is running`. If not, start it: `ollama serve &`
Download Nemotron 3 Super 120B (~87GB; may take several minutes):
```bash
ollama pull nemotron-3-super:120b
```
Run it briefly to pre-load weights (type `/bye` to exit):
```bash
ollama run nemotron-3-super:120b
```
Configure Ollama to listen on all interfaces so the sandbox container can reach it:
```bash
sudo mkdir -p /etc/systemd/system/ollama.service.d
printf '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0"\n' | sudo tee /etc/systemd/system/ollama.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
## Step 4. Install the OpenShell CLI
### Step 3. Pull the Nemotron 3 Super model
Install OpenShell using the install script:
Download Nemotron 3 Super 120B (~87 GB; may take 15--30 minutes depending on network speed):
```bash
curl -LsSf https://raw.githubusercontent.com/NVIDIA/OpenShell/main/install.sh | sh
ollama pull nemotron-3-super:120b
```
Verify: `openshell --version`
## Step 5. Install NemoClaw
Clone the NemoClaw repository and install the CLI globally:
Run it briefly to pre-load weights into memory (type `/bye` to exit):
```bash
cd ~
git clone https://github.com/NVIDIA/NemoClaw
cd NemoClaw
sudo npm install -g .
ollama run nemotron-3-super:120b
```
Verify: `nemoclaw --help`
> [!NOTE]
> OpenClaw (the AI agent) is installed **automatically inside the sandbox** during onboarding — it is built into the sandbox Docker image. You do not install it on the host.
## Step 6. Run the NemoClaw onboard wizard
Ensure Ollama is running (`curl http://localhost:11434` should return "Ollama is running"). From the directory where you cloned the repository in Step 5, run:
Verify the model is available:
```bash
cd ~/NemoClaw
nemoclaw onboard
ollama list
```
The wizard walks you through seven steps:
You should see `nemotron-3-super:120b` in the output.
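Beyond `ollama list`, a one-off request through Ollama's REST API confirms the model actually loads and responds (a sketch; port 11434 is Ollama's default, and `"stream": false` makes it return a single JSON object instead of a stream):

```bash
# One-off generation to confirm the model responds
curl -s http://localhost:11434/api/generate \
  -d '{"model": "nemotron-3-super:120b", "prompt": "Reply with the single word OK.", "stream": false}'
```

The first request after a restart may take a while, since it loads the ~87 GB of weights.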
1. **Preflight** — Checks Docker and the OpenShell CLI, and detects the GPU. Seeing "No GPU detected" or an unexpected VRAM count is normal on DGX Spark (GB10 reports unified memory differently).
2. **Starting OpenShell Gateway** — Destroys any old `nemoclaw` gateway and starts a new one (30–60 seconds on first run). If port 8080 is already in use by another container, see [Troubleshooting](troubleshooting.md).
3. **Creating Sandbox** — Enter a name or press Enter for the default (`my-assistant`). The wizard builds a Docker image from the NemoClaw Dockerfile (which includes OpenClaw, the NemoClaw plugin, and the `nemoclaw-start` entrypoint script), then creates a sandbox from that image. On creation, `nemoclaw-start` runs inside the sandbox to configure and launch the OpenClaw gateway. The wizard also sets up port forwarding from port 18789 on the host to the sandbox. The first build takes 2-5 minutes.
4. **Configuring Inference (NIM)** — Auto-detects local inference engine options.
- **Inference options**: If Ollama is running, the wizard will suggest that you select option 2 to use `localhost:11434`. No API key is needed for local Ollama. If no local engine is found, you will be prompted to choose the NVIDIA Endpoint API option (cloud API requires an NVIDIA API key).
   - **Choose model**: If you downloaded Nemotron 3 Super 120B in Step 3, the onboarding wizard will default to that model for the inference route. Otherwise, it will default to `nemotron-3-nano:30b`.
5. **Inference provider** — Creates the `ollama-local` provider on the gateway and sets the inference route.
6. **OpenClaw** — Already configured inside the sandbox during step 3.
7. **Policies** — Press Enter or Y to accept suggested presets (pypi, npm).
---
When complete you will see something like:
## Phase 2: Install and Run NemoClaw
### Step 4. Install NemoClaw
This single command handles everything: it installs Node.js (if needed), installs OpenShell, clones NemoClaw at the pinned stable release (`v0.0.4`, set via `NEMOCLAW_INSTALL_TAG` below), builds the CLI, and runs the onboard wizard to create a sandbox.
```bash
curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_INSTALL_TAG=v0.0.4 bash
```
The onboard wizard walks you through setup:
1. **Sandbox name** -- Pick a name (e.g. `my-assistant`). Names must be lowercase alphanumeric with hyphens only.
2. **Inference provider** -- Select **Local Ollama** (option 7).
3. **Model** -- Select **nemotron-3-super:120b** (option 1).
4. **Policy presets** -- Accept the suggested presets when prompted (hit **Y**).
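The naming rule in step 1 can be sketched as a regex (the exact pattern is an assumption; the CLI's own validation is authoritative):

```python
import re

# lowercase alphanumeric segments separated by single hyphens (assumed pattern)
VALID_NAME = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")

print(bool(VALID_NAME.match("my-assistant")))   # True: valid
print(bool(VALID_NAME.match("My_Assistant")))   # False: uppercase and underscore
```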
When complete you will see output like:
```text
──────────────────────────────────────────────────
Dashboard http://localhost:18789/
Sandbox my-assistant (Landlock + seccomp + netns)
Model nemotron-3-super:120b (Local Ollama)
NIM not running
──────────────────────────────────────────────────
Run: nemoclaw my-assistant connect
Status: nemoclaw my-assistant status
Logs: nemoclaw my-assistant logs --follow
──────────────────────────────────────────────────
```
## Step 7. Configure inference for Nemotron 3 Super
If you did not download Nemotron 3 Super 120B in Step 3, then the onboarding wizard will default to `nemotron-3-nano:30b`.
If the wizard did not create the `ollama-local` provider (you will see `provider 'ollama-local' not found` when running the next command), create it manually first:
```bash
openshell provider create \
--name ollama-local \
--type openai \
--credential "OPENAI_API_KEY=ollama" \
--config "OPENAI_BASE_URL=http://host.openshell.internal:11434/v1"
```
Then set the inference route:
```bash
openshell inference set --provider ollama-local --model nemotron-3-super:120b --no-verify
```
The `--no-verify` flag is needed because `host.openshell.internal` only resolves from inside the sandbox, not from the host.
Verify:
```bash
openshell inference get
```
Expected: `provider: ollama-local` and `model: nemotron-3-super:120b`.
## Step 8. Get the dashboard URL
The onboard wizard in Step 6 already launched the OpenClaw gateway inside the sandbox and set up port forwarding on port 18789. Verify the port forward is active:
```bash
openshell forward list
```
You should see `my-assistant` with port `18789` and status `running`. If the forward is not active or shows `dead`, restart it:
```bash
openshell forward start --background 18789 my-assistant
```
Now get the dashboard URL (which includes an authentication token). Connect to the sandbox and run `openclaw dashboard`:
```bash
openshell sandbox connect my-assistant
```
Inside the sandbox:
```bash
openclaw dashboard
```
This prints something like:
```text
Dashboard URL: http://127.0.0.1:18789/#token=YOUR_UNIQUE_TOKEN
```
**Save this URL.** Type `exit` to leave the sandbox (the gateway keeps running inside the sandbox).
### Restarting the gateway (if needed)
If the OpenClaw gateway inside the sandbox stopped (e.g. after a sandbox restart), connect and re-launch it:
```bash
openshell sandbox connect my-assistant
```
Inside the sandbox:
```bash
export NVIDIA_API_KEY=local-ollama
export ANTHROPIC_API_KEY=local-ollama
nemoclaw-start
```
The `nemoclaw-start` script configures OpenClaw and launches the gateway. After you see the `[gateway]` log lines, type `exit` to leave the sandbox.
## Step 9. Open the chat interface
Open the dashboard URL from Step 8 in your Spark's web browser:
```text
http://127.0.0.1:18789/#token=YOUR_UNIQUE_TOKEN
──────────────────────────────────────────────────
Dashboard http://localhost:18789/
Sandbox my-assistant (Landlock + seccomp + netns)
Model nemotron-3-super:120b (Local Ollama)
──────────────────────────────────────────────────
Run: nemoclaw my-assistant connect
Status: nemoclaw my-assistant status
Logs: nemoclaw my-assistant logs --follow
──────────────────────────────────────────────────
```
> [!IMPORTANT]
> The token is in the URL as a hash fragment (`#token=...`), not a query parameter (`?token=`). Paste the full URL including `#token=...` into the address bar.
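A hash fragment never leaves the browser, which is why the token must be pasted as part of the full URL. A minimal sketch of the client-side parse (with the placeholder token value):

```python
from urllib.parse import urlsplit, parse_qs

url = "http://127.0.0.1:18789/#token=YOUR_UNIQUE_TOKEN"
parts = urlsplit(url)

# the fragment is only visible to client-side code; it is never sent in the HTTP request
token = parse_qs(parts.fragment).get("token", [None])[0]
print(token)        # YOUR_UNIQUE_TOKEN
print(parts.query)  # empty: nothing arrived in '?token=' form
```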
You should see the OpenClaw dashboard with **Version** and **Health: OK**. Click **Chat** in the left sidebar and send a message to your agent.
Try: *"Hello! What can you help me with?"* or *"How many rs are there in the word strawberry?"*
> Save the tokenized Web UI URL printed at the end -- you will need it in Step 8. It looks like:
> `http://127.0.0.1:18789/#token=<long-token-here>`
> [!NOTE]
> Nemotron 3 Super 120B responses may take 30--90 seconds. This is normal for a 120B parameter model running locally.
> If `nemoclaw` is not found after install, run `source ~/.bashrc` to reload your shell path.
## Step 10. Using the agent from the command line
### Step 5. Connect to the sandbox and verify inference
Connect to the sandbox:
```bash
openshell sandbox connect my-assistant
nemoclaw my-assistant connect
```
Run a prompt:
You will see `sandbox@my-assistant:~$` -- you are now inside the sandboxed environment.
Verify that the inference route is working:
```bash
export NVIDIA_API_KEY=local-ollama
export ANTHROPIC_API_KEY=local-ollama
openclaw agent --agent main --local -m "How many rs are there in strawberry?" --session-id s1
curl -sf https://inference.local/v1/models
```
To test the sandbox isolation, try the following:
Expected: JSON listing `nemotron-3-super:120b`.
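If you want to script that check, the response follows the standard OpenAI `/v1/models` shape (a sketch with a stubbed payload; the field names are an assumption based on that API):

```python
import json

# stubbed response standing in for the curl call to /v1/models
raw = '{"object": "list", "data": [{"id": "nemotron-3-super:120b", "object": "model"}]}'

models = {m["id"] for m in json.loads(raw)["data"]}
print("nemotron-3-super:120b" in models)  # True
```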
### Step 6. Talk to the agent (CLI)
Still inside the sandbox, send a test message:
```bash
curl -sI https://httpbin.org/get
openclaw agent --agent main --local -m "hello" --session-id test
```
Since the network policy blocks this request, the expected output is:
The agent will respond using Nemotron 3 Super. First responses may take 30--90 seconds for a 120B parameter model running locally.
### Step 7. Interactive TUI
Launch the terminal UI for an interactive chat session:
```bash
HTTP/1.1 403 Forbidden
openclaw tui
```
Type `exit` to leave the sandbox.
Press **Ctrl+C** to exit the TUI.
## Step 11. Monitoring with the OpenShell TUI
### Step 8. Exit the sandbox and access the Web UI
In a separate terminal on the host:
Exit the sandbox to return to the host:
```bash
openshell term
exit
```
First, press any key to proceed. Press `f` to follow live output, `s` to filter by source, `q` to quit.
**If accessing the Web UI directly on the Spark** (keyboard and monitor attached), open a browser and navigate to the tokenized URL from Step 4:
## Step 12. Cleanup
```text
http://127.0.0.1:18789/#token=<long-token-here>
```
Remove the sandbox and destroy the NemoClaw gateway:
**If accessing the Web UI from a remote machine**, you need to set up port forwarding.
First, find your Spark's IP address. On the Spark, run:
```bash
openshell sandbox delete my-assistant
openshell provider delete ollama-local
openshell gateway destroy -g nemoclaw
hostname -I | awk '{print $1}'
```
To fully uninstall NemoClaw:
This prints the primary IP address (e.g. `192.168.1.42`). You can also find it in **Settings > Wi-Fi** or **Settings > Network** on the Spark's desktop, or check your router's connected-devices list.
Start the port forward on the Spark host:
```bash
sudo npm uninstall -g nemoclaw
rm -rf ~/.nemoclaw
openshell forward start 18789 my-assistant --background
```
Verify:
Then from your remote machine, create an SSH tunnel to the Spark (replace `<your-spark-ip>` with the IP address from above):
```bash
which nemoclaw # Should report "not found"
openshell status # Should report "No gateway configured"
ssh -L 18789:127.0.0.1:18789 <your-user>@<your-spark-ip>
```
You can then reinstall by starting again from Step 5 (Install NemoClaw).
Now open the tokenized URL in your remote machine's browser:
To also remove the Ollama model:
```text
http://127.0.0.1:18789/#token=<long-token-here>
```
> [!IMPORTANT]
> Use `127.0.0.1`, not `localhost` -- the gateway origin check requires an exact match.
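Browser origins are compared as strings, so `localhost` and `127.0.0.1` are different origins even on the same machine. A sketch of the kind of check the gateway performs (the actual implementation is an assumption):

```python
# hypothetical allow-list mirroring the gateway's exact-match origin check
allowed = {"http://127.0.0.1:18789"}

print("http://127.0.0.1:18789" in allowed)  # True: exact match
print("http://localhost:18789" in allowed)  # False: same machine, different origin string
```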
---
## Phase 3: Telegram Bot
### Step 9. Prepare credentials
You need two items:
| Item | Where to get it |
|------|----------------|
| Telegram bot token | Open Telegram, find [@BotFather](https://t.me/BotFather), send `/newbot`, and follow the prompts. Copy the token it gives you. |
| NVIDIA API key | Go to [build.nvidia.com/settings/api-keys](https://build.nvidia.com/settings/api-keys) and create or copy a key (starts with `nvapi-`). |
### Step 10. Configure and start the Telegram bridge
Make sure you are on the **host** (not inside the sandbox). If you are inside the sandbox, run `exit` first.
Set the required environment variables. Replace the placeholders with your actual values. `SANDBOX_NAME` must match the sandbox name you chose during the onboard wizard:
```bash
ollama rm nemotron-3-super:120b
export TELEGRAM_BOT_TOKEN=<your-bot-token>
export SANDBOX_NAME=my-assistant
```
## Step 13. Optional: Remote access via SSH
If you access the Spark remotely, forward port 18789 to your machine.
**SSH tunnel** (from your local machine, not the Spark):
Add the Telegram network policy to the sandbox:
```bash
ssh -L 18789:127.0.0.1:18789 your-user@your-spark-ip
nemoclaw my-assistant policy-add
```
Then open the dashboard URL in your local browser.
When prompted, type `telegram` and hit **Y** to confirm.
**Cursor / VS Code:** Open the **Ports** tab in the bottom panel, click **Forward a Port**, enter **18789**, then open the dashboard URL in your browser.
Start the Telegram bridge. On first run it will ask for your NVIDIA API key:
```bash
nemoclaw start
```
Paste your `nvapi-` key when prompted.
You should see:
```text
[services] telegram-bridge started
Telegram: bridge running
```
Open Telegram, find your bot, and send it a message. The bot forwards it to the agent and replies.
> [!NOTE]
> The first response may include a debug log line like "gateway Running as non-root..." -- this is cosmetic and can be ignored.
> [!NOTE]
> If you need to restart the bridge, `nemoclaw stop` may not cleanly stop the process. If that happens, find and kill the bridge process via its PID file:
> ```bash
> kill -9 "$(cat /tmp/nemoclaw-services-${SANDBOX_NAME}/telegram-bridge.pid)"
> ```
> Then run `nemoclaw start` again.
---
## Phase 4: Cleanup and Uninstall
### Step 11. Stop services
Stop any running auxiliary services (Telegram bridge, cloudflared):
```bash
nemoclaw stop
```
Stop the port forward:
```bash
openshell forward list # find active forwards
openshell forward stop 18789 # stop the dashboard forward
```
### Step 12. Uninstall NemoClaw
Run the uninstaller from the cloned source directory. It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and Ollama are preserved.
```bash
cd ~/.nemoclaw/source
./uninstall.sh
```
**Uninstaller flags:**
| Flag | Effect |
|------|--------|
| `--yes` | Skip the confirmation prompt |
| `--keep-openshell` | Leave the `openshell` binary in place |
| `--delete-models` | Also remove the Ollama models pulled by NemoClaw |
To remove everything including the Ollama model:
```bash
./uninstall.sh --yes --delete-models
```
The uninstaller runs 6 steps:
1. Stop NemoClaw helper services and port-forward processes
2. Delete all OpenShell sandboxes, the NemoClaw gateway, and providers
3. Remove the global `nemoclaw` npm package
4. Remove NemoClaw/OpenShell Docker containers, images, and volumes
5. Remove Ollama models (only with `--delete-models`)
6. Remove state directories (`~/.nemoclaw`, `~/.config/openshell`, `~/.config/nemoclaw`) and the OpenShell binary
> [!NOTE]
> The source clone at `~/.nemoclaw/source` is removed as part of state cleanup in step 6. If you want to keep a local copy, move or back it up before running the uninstaller.
## Useful commands
| Command | Description |
|---------|-------------|
| `openshell status` | Check gateway health |
| `openshell sandbox list` | List all running sandboxes |
| `openshell sandbox connect my-assistant` | Shell into the sandbox |
| `openshell term` | Open the monitoring TUI |
| `openshell inference get` | Show current inference routing |
| `nemoclaw my-assistant connect` | Shell into the sandbox |
| `nemoclaw my-assistant status` | Show sandbox status and inference config |
| `nemoclaw my-assistant logs --follow` | Stream sandbox logs in real time |
| `nemoclaw list` | List all registered sandboxes |
| `nemoclaw start` | Start auxiliary services (Telegram bridge) |
| `nemoclaw stop` | Stop auxiliary services |
| `openshell term` | Open the monitoring TUI on the host |
| `openshell forward list` | List active port forwards |
| `nemoclaw my-assistant connect` | Connect to sandbox (alternate) |
| `nemoclaw my-assistant status` | Show sandbox status |
| `openshell forward start 18789 my-assistant --background` | Restart port forwarding for Web UI |
| `cd ~/.nemoclaw/source && ./uninstall.sh` | Remove NemoClaw (preserves Docker, Node.js, Ollama) |
| `cd ~/.nemoclaw/source && ./uninstall.sh --delete-models` | Remove NemoClaw and Ollama models |
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| Gateway fails with "cannot start gateway: port 8080 is held by container..." | Another OpenShell gateway or container is already using port 8080 | Stop the conflicting container: `openshell gateway destroy -g <old-gateway-name>` or `docker stop <container-name> && docker rm <container-name>`, then retry `nemoclaw onboard` |
| Gateway fails with cgroup / "Failed to start ContainerManager" errors | Docker not configured for host cgroup namespace on DGX Spark | Run the cgroup fix: `sudo python3 -c "import json, os; path='/etc/docker/daemon.json'; d=json.load(open(path)) if os.path.exists(path) else {}; d['default-cgroupns-mode']='host'; json.dump(d, open(path,'w'), indent=2)"` then `sudo systemctl restart docker` |
| "No GPU detected" during onboard | DGX Spark GB10 reports unified memory differently | Expected on DGX Spark. The wizard still works and will use Ollama for inference. |
| "provider 'ollama-local' not found" when running `openshell inference set` | The onboard wizard did not complete the inference provider setup | Create the provider manually: `openshell provider create --name ollama-local --type openai --credential "OPENAI_API_KEY=ollama" --config "OPENAI_BASE_URL=http://host.openshell.internal:11434/v1"` then retry the inference set command |
| Sandbox created with a random name instead of the one you wanted | Name passed as a positional argument instead of using `--name` flag | Use `--name` flag: `openshell sandbox create --name my-assistant`. Delete the random sandbox with `openshell sandbox delete <random-name>` |
| "unauthorized: gateway token missing" | Dashboard URL used without token or wrong format | Paste the **full URL** including `#token=...` (hash fragment, not `?token=`). Run `openclaw dashboard` inside the sandbox to get the URL again. |
| "No API key found for provider anthropic" | API key env vars not set when starting gateway in sandbox | Inside the sandbox, set both before running the gateway: `export NVIDIA_API_KEY=local-ollama` and `export ANTHROPIC_API_KEY=local-ollama` |
| Agent gives no response | Model not loaded or Nemotron 3 Super is slow | Nemotron 3 Super can take 30-90 seconds per response. Verify Ollama: `curl http://localhost:11434`. Ensure inference is set: `openshell inference get` |
| Port forward dies or dashboard unreachable | Forward not active or wrong port | List forwards: `openshell forward list`. Restart: `openshell forward stop 18789 my-assistant` then `openshell forward start --background 18789 my-assistant` |
| `nemoclaw: command not found` after install | Shell PATH not updated | Run `source ~/.bashrc` (or `source ~/.zshrc` for zsh), or open a new terminal window. |
| Installer fails with Node.js version error | Node.js version below 20 | Install Node.js 20+: `curl -fsSL https://deb.nodesource.com/setup_22.x \| sudo -E bash - && sudo apt-get install -y nodejs` then re-run the installer. |
| npm install fails with `EACCES` permission error | npm global directory not writable | `mkdir -p ~/.npm-global && npm config set prefix ~/.npm-global && export PATH=~/.npm-global/bin:$PATH` then re-run the installer. Add the `export` line to `~/.bashrc` to make it permanent. |
| Docker permission denied | User not in docker group | `sudo usermod -aG docker $USER`, then log out and back in. |
| Ollama not reachable from sandbox (503 / timeout) | Ollama bound to localhost only or firewall blocking 11434 | Ensure Ollama listens on all interfaces: add `Environment="OLLAMA_HOST=0.0.0.0"` in `sudo systemctl edit ollama.service`, then `sudo systemctl daemon-reload` and `sudo systemctl restart ollama`. If using UFW: `sudo ufw allow 11434/tcp comment 'Ollama for NemoClaw'` and `sudo ufw reload` |
| OpenClaw UI shows error message `origin not allowed` | OpenClaw gateway inside the sandbox rejects remote access connections | Run `openshell sandbox connect my-assistant` to enter sandbox. Inside the sandbox, run `sed -i 's/"allowedOrigins": \[\]/"allowedOrigins": ["*"]/' /root/.openclaw/gateway.json 2>/dev/null` to allow the origin. Restart OpenClaw gateway inside sandbox by running `export NVIDIA_API_KEY=local-ollama; export ANTHROPIC_API_KEY=local-ollama; nemoclaw-start` |
| Gateway fails with cgroup / "Failed to start ContainerManager" errors | Docker not configured for host cgroup namespace on DGX Spark | Run the cgroup fix: `sudo python3 -c "import json, os; path='/etc/docker/daemon.json'; d=json.load(open(path)) if os.path.exists(path) else {}; d['default-cgroupns-mode']='host'; json.dump(d, open(path,'w'), indent=2)"` then `sudo systemctl restart docker`. Alternatively, run `sudo nemoclaw setup-spark` which applies this fix automatically. |
| Gateway fails with "port 8080 is held by container..." | Another OpenShell gateway or container is using port 8080 | Stop the conflicting container: `openshell gateway destroy -g <old-gateway-name>` or `docker stop <container-name> && docker rm <container-name>`, then retry `nemoclaw onboard`. |
| Sandbox creation fails | Stale gateway state or DNS not propagated | Run `openshell gateway destroy && openshell gateway start`, then re-run the installer or `nemoclaw onboard`. |
| CoreDNS crash loop | Known issue on some DGX Spark configurations | Run `sudo ./scripts/fix-coredns.sh` from the NemoClaw repo directory. |
| "No GPU detected" during onboard | DGX Spark GB10 reports unified memory differently | Expected on DGX Spark. The wizard still works and uses Ollama for inference. |
| Inference timeout or hangs | Ollama not running or not reachable | Check Ollama: `curl http://localhost:11434`. If not running: `ollama serve &`. If running but unreachable from sandbox, ensure Ollama is configured to listen on `0.0.0.0` (see Step 2 in Instructions). |
| Agent gives no response or is very slow | Normal for 120B model running locally | Nemotron 3 Super 120B can take 30--90 seconds per response. Verify inference route: `nemoclaw my-assistant status`. |
| Port 18789 already in use | Another process is bound to the port | `lsof -i :18789` then `kill <PID>`. If needed, `kill -9 <PID>` to force-terminate. |
| Web UI port forward dies or dashboard unreachable | Port forward not active | `openshell forward stop 18789 my-assistant` then `openshell forward start 18789 my-assistant --background`. |
| Web UI shows `origin not allowed` | Accessing via `localhost` instead of `127.0.0.1` | Use `http://127.0.0.1:18789/#token=...` in the browser. The gateway origin check requires `127.0.0.1` exactly. |
| Telegram bridge does not start | Missing environment variables | Ensure `TELEGRAM_BOT_TOKEN` and `SANDBOX_NAME` are set on the host. `SANDBOX_NAME` must match the sandbox name from onboarding. |
| Telegram bridge needs restart but `nemoclaw stop` does not work | Known bug in `nemoclaw stop` | Find the PID from the `nemoclaw start` output, force-kill with `kill -9 <PID>`, then run `nemoclaw start` again. |
| Telegram bot receives messages but does not reply | Telegram policy not added to sandbox | Run `nemoclaw my-assistant policy-add`, type `telegram`, hit Y. Then restart the bridge with `nemoclaw start`. |
> [!NOTE]
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'`.

View File

@ -31,12 +31,14 @@ Spark & Reachy Photo Booth is an interactive and event-driven photo booth demo t
- **User position tracking** built with `facebookresearch/detectron2` and `FoundationVision/ByteTrack`
- **MinIO** for storing captured/generated images as well as sharing them via QR-code
The demo is based on a several services that communicate through a message bus.
The demo is based on several services that communicate through a message bus.
![Architecture diagram](assets/architecture-diagram.png)
See also the walk-through video for this playbook: [Video](https://www.youtube.com/watch?v=6f1x8ReGLjc)
> [!NOTE]
> This playbook applies to both the Reachy Mini and Reachy Mini Lite robots. For simplicity, well refer to the robot as Reachy throughout this playbook.
> This playbook applies to Reachy Mini Lite. Reachy Mini (with on-board Raspberry Pi) might require minor adaptations. For simplicity, well refer to the robot as Reachy throughout this playbook.
## What you'll accomplish
@ -57,7 +59,7 @@ You'll deploy a complete photo booth system on DGX Spark running multiple infere
> [!TIP]
> Make sure your Reachy robot firmware is up to date. You can find instructions to update it [here](https://huggingface.co/spaces/pollen-robotics/Reachy_Mini).
**Software Requirements:**
- The official DGX Spark OS image including all required utilities such as Git, Docker, NVIDIA drivers, and the NVIDIA Container Toolkit
- The official [DGX Spark OS](https://docs.nvidia.com/dgx/dgx-spark/dgx-os.html) image including all required utilities such as Git, Docker, NVIDIA drivers, and the NVIDIA Container Toolkit
- An internet connection for the DGX Spark
- NVIDIA NGC Personal API Key (**`NVIDIA_API_KEY`**). [Create a key](https://org.ngc.nvidia.com/setup/api-keys) if necessary. Make sure to enable the `NGC Catalog` scope when creating the key.
- Hugging Face access token (**`HF_TOKEN`**). [Create a token](https://huggingface.co/settings/tokens) if necessary. Make sure to create a token with _Read access to contents of all public gated repos you can access_ permission.
@ -77,8 +79,9 @@ All required assets can be found in the [Spark & Reachy Photo Booth repository](
* **Estimated time:** 2 hours including hardware setup, container building, and model downloads
* **Risk level:** Medium
* **Rollback:** Docker containers can be stopped and removed to free resources. Downloaded models can be deleted from cache directories. Robot and peripheral connections can be safely disconnected. Network configurations can be reverted by removing custom settings.
* **Last Updated:** 01/27/2026
* 1.0.0 First Publication
* **Last Updated:** 04/01/2026
* 1.0.0 First publication
* 1.0.1 Documentation improvements
## Governing terms
Your use of the Spark Playbook scripts is governed by [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0) and enables use of separate open source and proprietary software governed by their respective licenses: [Flux.1-Kontext NIM](https://catalog.ngc.nvidia.com/orgs/nim/teams/black-forest-labs/containers/flux.1-kontext-dev?version=1.1), [Parakeet 1.1b CTC en-US ASR NIM](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/parakeet-1-1b-ctc-en-us?version=1.4), [TensorRT-LLM](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release?version=1.3.0rc1), [minio/minio](https://hub.docker.com/r/minio/minio), [arizephoenix/phoenix](https://hub.docker.com/r/arizephoenix/phoenix), [grafana/otel-lgtm](https://hub.docker.com/r/grafana/otel-lgtm), [Python](https://hub.docker.com/_/python), [Node.js](https://hub.docker.com/_/node), [nginx](https://hub.docker.com/_/nginx), [busybox](https://hub.docker.com/_/busybox), [UV Python Packager](https://docs.astral.sh/uv/), [Redpanda](https://www.redpanda.com/), [Redpanda Console](https://www.redpanda.com/), [gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b), [FLUX.1-Kontext-dev](https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev), [FLUX.1-Kontext-dev-onnx](https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev-onnx).
@ -277,7 +280,7 @@ uv sync --all-packages
Every folder suffixed by `-service` is a standalone Python program that runs in its own container. You must always start the services by interacting with the `docker-compose.yaml` at the root of the repository. You can enable code hot reloading for all the Python services by running:
```bash
docker compose up -d --build --watch
docker compose up --build --watch
```
Whenever you change some Python code in the repository the associated container will be updated and automatically restarted.
@ -315,6 +318,7 @@ The [Writing Your First Service](https://github.com/NVIDIA/spark-reachy-photo-bo
|---------|-------|-----|
| No audio from robot (low volume) | Reachy speaker volume set too low by default | Increase Reachy speaker volume to maximum |
| No audio from robot (device conflict) | Another application capturing Reachy speaker | Check `animation-compositor` logs for "Error querying device (-1)", verify Reachy speaker is not set as system default in Ubuntu sound settings, ensure no other apps are capturing the speaker, then restart the demo |
| Image-generation fails on first start | Transient initialization issue | Rerun `docker compose up --build -d` to resolve the issue |
If you have any issues with Reachy that are not covered by this guide, please read [Hugging Face's official troubleshooting guide](https://huggingface.co/docs/reachy_mini/troubleshooting).

View File

@ -442,7 +442,7 @@ Replace the IP addresses with your actual node IPs.
On **each node** (primary and worker), run the following command to start the TRT-LLM container:
```bash
docker run -d --rm \
docker run -d --rm \
--name trtllm-multinode \
--gpus '"device=all"' \
--network host \
@ -456,9 +456,11 @@ docker run -d --rm \
-e OMPI_MCA_rmaps_ppr_n_pernode="1" \
-e OMPI_ALLOW_RUN_AS_ROOT="1" \
-e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM="1" \
-e CPATH=/usr/local/cuda/include \
-e TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas \
-v ~/.cache/huggingface/:/root/.cache/huggingface/ \
-v ~/.ssh:/tmp/.ssh:ro \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5 \
sh -c "curl https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/trt-llm/assets/trtllm-mn-entrypoint.sh | sh"
```
@ -477,7 +479,7 @@ You should see output similar to:
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
abc123def456 nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 "sh -c 'curl https:…" 10 seconds ago Up 8 seconds trtllm-multinode
abc123def456 nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5 "sh -c 'curl https:…" 10 seconds ago Up 8 seconds trtllm-multinode
```
### Step 6. Copy hostfile to primary container

View File

@ -27,8 +27,8 @@ services:
# Ollama configuration
- OLLAMA_BASE_URL=http://ollama:11434/v1
- OLLAMA_MODEL=llama3.1:8b
# Disable vLLM
- VLLM_BASE_URL=http://localhost:8001/v1
# vLLM disabled in default Ollama mode
# - VLLM_BASE_URL=http://localhost:8001/v1
- VLLM_MODEL=disabled
# Vector DB configuration
- QDRANT_URL=http://qdrant:6333

View File

@ -108,7 +108,7 @@ export class TextProcessor {
// Determine which LLM provider to use based on configuration
// Priority: vLLM > NVIDIA > Ollama
if (process.env.VLLM_BASE_URL) {
if (process.env.VLLM_BASE_URL && process.env.VLLM_MODEL && process.env.VLLM_MODEL !== 'disabled') {
this.selectedLLMProvider = 'vllm';
} else if (process.env.NVIDIA_API_KEY) {
this.selectedLLMProvider = 'nvidia';

View File

@ -8,6 +8,7 @@
- [Instructions](#instructions)
- [Run on two Sparks](#run-on-two-sparks)
- [Step 11. (Optional) Launch 405B inference server](#step-11-optional-launch-405b-inference-server)
- [Run on multiple Sparks through a switch](#run-on-multiple-sparks-through-a-switch)
- [Troubleshooting](#troubleshooting)
---
@ -53,6 +54,11 @@ The following models are supported with vLLM on Spark. All listed models are ava
| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **Gemma 4 31B IT** | Base | ✅ | [`google/gemma-4-31B-it`](https://huggingface.co/google/gemma-4-31B-it) |
| **Gemma 4 31B IT** | NVFP4 | ✅ | [`nvidia/Gemma-4-31B-IT-NVFP4`](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) |
| **Gemma 4 26B A4B IT** | Base | ✅ | [`google/gemma-4-26B-A4B-it`](https://huggingface.co/google/gemma-4-26B-A4B-it) |
| **Gemma 4 E4B IT** | Base | ✅ | [`google/gemma-4-E4B-it`](https://huggingface.co/google/gemma-4-E4B-it) |
| **Gemma 4 E2B IT** | Base | ✅ | [`google/gemma-4-E2B-it`](https://huggingface.co/google/gemma-4-E2B-it) |
| **Nemotron-3-Super-120B** | NVFP4 | ✅ | [`nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4`](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4) |
| **GPT-OSS-20B** | MXFP4 | ✅ | [`openai/gpt-oss-20b`](https://huggingface.co/openai/gpt-oss-20b) |
| **GPT-OSS-120B** | MXFP4 | ✅ | [`openai/gpt-oss-120b`](https://huggingface.co/openai/gpt-oss-120b) |
@ -88,9 +94,8 @@ Reminder: not all model architectures are supported for NVFP4 quantization.
* **Duration:** 30 minutes for Docker approach
* **Risks:** Container registry access requires internal credentials
* **Rollback:** Container approach is non-destructive.
* **Last Updated:** 03/12/2026
* Added support for Nemotron-3-Super-120B model
* Updated container to Feb 2026 release (26.02-py3)
* **Last Updated:** 04/02/2026
* Add support for Gemma 4 model family
## Instructions
@ -116,13 +121,21 @@ Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/
```bash
export LATEST_VLLM_VERSION=<latest_container_version>
## example
## export LATEST_VLLM_VERSION=26.02-py3
export HF_MODEL_HANDLE=<HF_HANDLE>
## example
## export HF_MODEL_HANDLE=openai/gpt-oss-20b
docker pull nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION}
```
For Gemma 4 model family, use vLLM custom containers:
```bash
docker pull vllm/vllm-openai:gemma4-cu130
```
## Step 3. Test vLLM in container
Launch the container and start vLLM server with a test model to verify basic functionality.
@ -130,7 +143,13 @@ Launch the container and start vLLM server with a test model to verify basic fun
```bash
docker run -it --gpus all -p 8000:8000 \
nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} \
vllm serve "Qwen/Qwen2.5-Math-1.5B-Instruct"
vllm serve ${HF_MODEL_HANDLE}
```
To run models from the Gemma 4 model family (e.g., `google/gemma-4-31B-it`):
```bash
docker run -it --gpus all -p 8000:8000 \
vllm/vllm-openai:gemma4-cu130 ${HF_MODEL_HANDLE}
```
Expected output should include:
@@ -144,7 +163,7 @@ In another terminal, test the server:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Math-1.5B-Instruct",
"model": "'"${HF_MODEL_HANDLE}"'",
"messages": [{"role": "user", "content": "12*17"}],
"max_tokens": 500
}'
@@ -394,6 +413,194 @@ http://<head-node-ip>:8265
## - Alternative quantization methods (FP8, INT4)
```
## Run on multiple Sparks through a switch
## Step 1. Configure network connectivity
Follow the network setup instructions from the [Multi Sparks through switch](https://build.nvidia.com/spark/multi-sparks-through-switch) playbook to establish connectivity between your DGX Spark nodes.
This includes:
- Physical QSFP cable connections between Sparks and Switch
- Network interface configuration (automatic or manual IP assignment)
- Passwordless SSH setup
- Network connectivity verification
- NCCL Bandwidth test
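Before moving on, it helps to confirm which high-speed interface is actually up. The sketch below is a minimal, illustrative helper assuming `ibdev2netdev` reports ports in its usual `mlx5_X port N ==> <netdev> (Up|Down)` format; the interface names are examples and may differ on your system:

```bash
## Hypothetical helper: print the first netdev that ibdev2netdev reports as Up.
pick_up_if() {
  awk '/\(Up\)/ {print $5; exit}'
}

## On real hardware you would run: ibdev2netdev | pick_up_if
## Canned example output for illustration:
printf '%s\n' \
  'mlx5_0 port 1 ==> enp1s0f0np0 (Down)' \
  'mlx5_1 port 1 ==> enp1s0f1np1 (Up)' | pick_up_if
## → enp1s0f1np1
```

The netdev printed here is the value to export as `MN_IF_NAME` in the cluster steps that follow.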
## Step 2. Download cluster deployment script
Download the vLLM cluster deployment script on all nodes. This script orchestrates the Ray cluster setup required for distributed inference.
```bash
## Download on all nodes
wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/online_serving/run_cluster.sh
chmod +x run_cluster.sh
```
## Step 3. Pull the NVIDIA vLLM Image from NGC
Do this step on all nodes.
First, you will need to configure Docker to pull from NGC.
If this is your first time using Docker, run:
```bash
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
```
After this, you should be able to run docker commands without using `sudo`.
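A quick way to confirm the group change took effect (this assumes the Docker daemon is running):

```bash
## Should succeed without sudo once group membership is active.
if docker info --format '{{.ServerVersion}}' >/dev/null 2>&1; then
  echo "docker OK without sudo"
else
  echo "re-login or run 'newgrp docker' first"
fi
```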
Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm
```bash
docker pull nvcr.io/nvidia/vllm:26.02-py3
export VLLM_IMAGE=nvcr.io/nvidia/vllm:26.02-py3
```
## Step 4. Start Ray head node
Launch the Ray cluster head node on Node 1. This node coordinates the distributed inference and serves the API endpoint.
```bash
## On Node 1, start head node
## Get the IP address of the high-speed interface
## Use the interface that shows "(Up)" from ibdev2netdev (enp1s0f0np0 or enp1s0f1np1)
export MN_IF_NAME=enp1s0f1np1
export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
echo "Using interface $MN_IF_NAME with IP $VLLM_HOST_IP"
bash run_cluster.sh $VLLM_IMAGE $VLLM_HOST_IP --head ~/.cache/huggingface \
-e VLLM_HOST_IP=$VLLM_HOST_IP \
-e UCX_NET_DEVICES=$MN_IF_NAME \
-e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
-e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
-e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
-e TP_SOCKET_IFNAME=$MN_IF_NAME \
-e RAY_memory_monitor_refresh_ms=0 \
-e MASTER_ADDR=$VLLM_HOST_IP
```
## Step 5. Start Ray worker nodes
Connect the rest of the nodes to the Ray cluster as worker nodes. This provides additional GPU resources for tensor parallelism.
```bash
## On other Nodes, join as workers
## Set the interface name (same as Node 1)
export MN_IF_NAME=enp1s0f1np1
## Get Node's own IP address
export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
## IMPORTANT: Set HEAD_NODE_IP to Node 1's IP address
## You must get this value from Node 1 (run: echo $VLLM_HOST_IP on Node 1)
export HEAD_NODE_IP=<NODE_1_IP_ADDRESS>
echo "Worker IP: $VLLM_HOST_IP, connecting to head node at: $HEAD_NODE_IP"
bash run_cluster.sh $VLLM_IMAGE $HEAD_NODE_IP --worker ~/.cache/huggingface \
-e VLLM_HOST_IP=$VLLM_HOST_IP \
-e UCX_NET_DEVICES=$MN_IF_NAME \
-e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
-e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
-e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
-e TP_SOCKET_IFNAME=$MN_IF_NAME \
-e RAY_memory_monitor_refresh_ms=0 \
-e MASTER_ADDR=$HEAD_NODE_IP
```
> **Note:** Replace `<NODE_1_IP_ADDRESS>` with the actual IP address from Node 1, specifically the QSFP interface enp1s0f1np1 configured in the [Multi Sparks through switch](https://build.nvidia.com/spark/multi-sparks-through-switch) playbook.
## Step 6. Verify cluster status
Confirm all nodes are recognized and available in the Ray cluster.
```bash
## On Node 1 (head node)
## Find the vLLM container name (it will be node-<random_number>)
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
echo "Found container: $VLLM_CONTAINER"
docker exec $VLLM_CONTAINER ray status
```
Expected output shows all nodes with available GPU resources.
## Step 7. Download MiniMax M2.5 model
If you are running with four or more Sparks, you can comfortably run this model with tensor parallelism. Authenticate with Hugging Face and download the model.
```bash
## On all nodes, from within the docker containers created in previous steps, run the following
hf auth login
hf download MiniMaxAI/MiniMax-M2.5
```
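Since `run_cluster.sh` mounts `~/.cache/huggingface` into each container, you can sanity-check the download from the host. The snapshot path below is an assumption based on the default Hugging Face cache layout:

```bash
## Report the on-disk size of the downloaded snapshot, if present.
MODEL_DIR="$HOME/.cache/huggingface/hub/models--MiniMaxAI--MiniMax-M2.5"
if [ -d "$MODEL_DIR" ]; then
  du -sh "$MODEL_DIR"
else
  echo "model not found in cache yet"
fi
```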
## Step 8. Launch inference server for MiniMax M2.5
Start the vLLM inference server with tensor parallelism across all nodes.
```bash
## On Node 1, enter container and start server
## Assuming that you run on a 4 node cluster, set --tensor-parallel-size as 4
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
docker exec -it $VLLM_CONTAINER /bin/bash -c '
vllm serve MiniMaxAI/MiniMax-M2.5 \
--tensor-parallel-size 4 --max-model-len 129000 --max-num-seqs 4 --trust-remote-code'
```
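Loading and sharding a model of this size across four nodes can take several minutes. A small polling helper (the URL, retry count, and delay below are illustrative, not part of vLLM) avoids sending requests before the server is ready:

```bash
## Poll a health URL until it answers or we give up.
## usage: wait_ready <url> <max_tries> <delay_seconds>
wait_ready() {
  url=$1; tries=$2; delay=$3
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -sf "$url" >/dev/null; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed_out"
  return 1
}

## e.g. from Node 1, wait up to 10 minutes:
## wait_ready http://localhost:8000/health 60 10
```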
## Step 9. Test MiniMax M2.5 model inference
Verify the deployment with a sample inference request.
```bash
## Test from Node 1 or an external client.
## If testing from an external client, replace localhost with Node 1's management IP address.
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M2.5",
"prompt": "Write a haiku about a GPU",
"max_tokens": 32,
"temperature": 0.7
}'
```
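If `jq` is installed (an assumption; it is not part of the container image), you can pull just the generated text out of the OpenAI-compatible response instead of reading raw JSON:

```bash
## Extract choices[0].text from a completions response.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M2.5",
    "prompt": "Write a haiku about a GPU",
    "max_tokens": 32,
    "temperature": 0.7
  }' | jq -r '.choices[0].text'
```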
## Step 10. Validate deployment
Perform comprehensive validation of the distributed inference system.
```bash
## Check Ray cluster health
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
docker exec $VLLM_CONTAINER ray status
## Verify server health endpoint on Node 1
curl http://localhost:8000/health
## Monitor GPU utilization on all nodes
nvidia-smi
export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
docker exec $VLLM_CONTAINER nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```
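For scripted validation, checking the HTTP status code of the health endpoint gives a clean pass/fail signal; the URL assumes you are running this on Node 1:

```bash
## curl prints 000 for the status code when it cannot connect at all.
code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8000/health || true)
if [ "$code" = "200" ]; then
  echo "vLLM server healthy"
else
  echo "unhealthy (HTTP $code)"
fi
```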
## Step 11. Next steps
Access the Ray dashboard for cluster monitoring and explore additional features:
```bash
## Ray dashboard available at:
http://<head-node-ip>:8265
## Consider implementing for production:
## - Health checks and automatic restarts
## - Log rotation for long-running services
## - Persistent model caching across restarts
## - Other models which can fit on the cluster with different quantization methods (FP8, NVFP4)
```
## Troubleshooting
## Common issues for running on a single Spark


@@ -204,27 +204,34 @@ In this hybrid deployment, we would use NIMs from [build.nvidia.com](https://bui
## Start Standard VSS (Base)
export NGC_CLI_API_KEY='your_ngc_api_key'
export LLM_ENDPOINT_URL=https://your-llm-endpoint.com
scripts/dev-profile.sh up -p base -H DGX-SPARK --use-remote-llm
scripts/dev-profile.sh up -p base -H DGX-SPARK --use-remote-llm --llm <REMOTE LLM MODEL NAME>
## Start Standard VSS (Alert Verification)
export NGC_CLI_API_KEY='your_ngc_api_key'
export LLM_ENDPOINT_URL=https://your-llm-endpoint.com
scripts/dev-profile.sh up -p alerts -m verification -H DGX-SPARK --use-remote-llm
scripts/dev-profile.sh up -p alerts -m verification -H DGX-SPARK --use-remote-llm --llm <REMOTE LLM MODEL NAME>
## Start Standard VSS (Real-Time Alerts)
export NGC_CLI_API_KEY='your_ngc_api_key'
export LLM_ENDPOINT_URL=https://your-llm-endpoint.com
scripts/dev-profile.sh up -p alerts -m real-time -H DGX-SPARK --use-remote-llm
scripts/dev-profile.sh up -p alerts -m real-time -H DGX-SPARK --use-remote-llm --llm <REMOTE LLM MODEL NAME>
```
> [!NOTE]
> This step will take several minutes as containers are pulled and services initialize. The VSS backend requires additional startup time.
> The following the environment variable needs to be set first before any deployment:
> • NGC_CLI_API_KEY — (required) for vss deployment
> • LLM_ENDPOINT_URL — (required) when --use-remote-llm is passed, used as LLM base URL
> • NVIDIA_API_KEY — (optional) used for accessing remote LLM/VLM endpoints
> • OPENAI_API_KEY — (optional) used for accessing remote LLM/VLM endpoints
> • VLM_CUSTOM_WEIGHTS — (optional) absolute path to custom weights dir
> Set the following environment variables before deployment:
> • **NGC_CLI_API_KEY** — (required) NGC API key for pulling images and deployment
> • **LLM_ENDPOINT_URL** — (required when using `--use-remote-llm`) Base URL for the remote LLM
> • **NVIDIA_API_KEY** — (optional) For remote LLM/VLM endpoints that require it
> • **OPENAI_API_KEY** — (optional) For remote LLM/VLM endpoints that require it
> • **VLM_CUSTOM_WEIGHTS** — (optional) Absolute path to a custom weights directory
>
> Pass these additional flags to **`scripts/dev-profile.sh`** for remote LLM mode:
> • **`--use-remote-llm`** — (required) Use a remote LLM, the base URL is read from **`LLM_ENDPOINT_URL`** in the environment
> • **`--llm`** — (required) Remote LLM model name (for example: `nvidia/nvidia-nemotron-nano-9b-v2`). **Strongly recommended** for alert workflows (verification and real-time): use `nvidia/nvidia-nemotron-nano-9b-v2`. Omitting `--llm` may cause the script to use whatever model is returned by the remote endpoint.
>
> Run **`scripts/dev-profile.sh -h`** for a full list of supported arguments.
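Putting the note together, a complete base-profile launch with the recommended remote model looks like this (the API key and endpoint URL are placeholders for your own values):

```bash
export NGC_CLI_API_KEY='your_ngc_api_key'
export LLM_ENDPOINT_URL=https://your-llm-endpoint.com
scripts/dev-profile.sh up -p base -H DGX-SPARK \
  --use-remote-llm \
  --llm nvidia/nvidia-nemotron-nano-9b-v2
```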
**7.3 Validate Standard VSS deployment**