From b8cc262bede8d74a3c27fe5c07221b4a60b3540d Mon Sep 17 00:00:00 2001
From: GitLab CI
Date: Sat, 13 Jun 2026 02:56:21 +0000
Subject: [PATCH] chore: Regenerate all playbooks

---
 nvidia/hermes-agent/README.md |  95 ++++++++++-------------
 nvidia/nemoclaw/README.md     |  31 ++++----
 nvidia/openclaw/README.md     | 137 ++++++++--------------------------
 nvidia/openshell/README.md    | 119 +++++++++++------------------
 nvidia/vllm/README.md         |  95 ++++++++++++++++++++++-
 5 files changed, 223 insertions(+), 254 deletions(-)

diff --git a/nvidia/hermes-agent/README.md b/nvidia/hermes-agent/README.md
index 0c200c5..4df230b 100644
--- a/nvidia/hermes-agent/README.md
+++ b/nvidia/hermes-agent/README.md
@@ -21,10 +21,10 @@ Running Hermes and its LLM **fully on your DGX Spark** keeps your conversations
 
 ## What you'll accomplish
 
-You will have Hermes installed on your DGX Spark and connected to a local LLM served by Ollama. You can chat with the agent from the DGX Spark terminal and from Telegram on your phone or laptop. The gateway runs as a system service, so the agent stays reachable across reboots without anyone logging in.
+You will have Hermes installed on your DGX Spark and connected to a local LLM served by **vLLM** (the agent-ready `nvidia/Qwen3.6-35B-A3B-NVFP4` recipe). You can chat with the agent from the DGX Spark terminal and from Telegram on your phone or laptop. The gateway runs as a system service, so the agent stays reachable across reboots without anyone logging in.
 
-- Install Ollama and pull a local model
-- Install Hermes and configure it against the local Ollama endpoint
+- Serve a local model with vLLM
+- Install Hermes and configure it against the local vLLM endpoint
 - Set up a Telegram bot so you can message Hermes from any Telegram client
 - Resume past sessions, switch models, update, and uninstall using the `hermes` CLI
 
@@ -38,7 +38,7 @@ You will have Hermes installed on your DGX Spark and connected to a local LLM se
 ## What to know before starting
 
 - Basic use of the Linux terminal and a text editor
-- Familiarity with Ollama or willingness to follow the [Ollama on Spark playbook](https://build.nvidia.com/spark/ollama) first
+- Familiarity with Docker and vLLM, or willingness to follow the [vLLM for Inference playbook](https://build.nvidia.com/spark/vllm) first
 - A Telegram account if you want to use the messaging gateway
 - Awareness of the security considerations below
 
@@ -54,7 +54,7 @@ Main risks:
 You cannot eliminate all risk; proceed at your own risk. **Recommended security measures:**
 
 - **Restrict the Telegram bot** by entering one or more numeric Telegram user IDs at the *"Allowed user IDs"* prompt during install. Leaving this blank allows anyone who finds the bot to use it.
-- Keep the Ollama endpoint bound to **`localhost` only**; do not expose `http://:11434` to your LAN or the public internet without strong authentication.
+- Keep the vLLM endpoint bound to the Spark; do not forward `http://:8000` to your LAN or the public internet without strong authentication.
 - Run Hermes on a Spark dedicated to this purpose where possible, and only place files on it that the agent is allowed to access.
 - **Monitor activity**: Periodically review the gateway service logs (`sudo journalctl -u -e`) and the Hermes session history.
 
@@ -64,15 +64,16 @@ You cannot eliminate all risk; proceed at your own risk. **Recommended security
 - Terminal (SSH or local) access to the Spark
 - `curl` and `git` installed (verified in Step 1 of the instructions)
 - Interactive terminal access for the setup wizard and any `sudo` password prompts. Non-interactive SSH is supported with the config-command fallback in the Instructions tab.
-- Enough disk and GPU memory for the Ollama model you plan to serve (the playbook uses `qwen3.6:27b` as the example; pick a smaller model if you want a faster first install)
+- Docker with the NVIDIA Container Toolkit, plus a HuggingFace token to download the model (the playbook serves `nvidia/Qwen3.6-35B-A3B-NVFP4` with vLLM)
 - A Telegram account and the ability to create a bot via [@BotFather](https://t.me/BotFather) if you plan to use the messaging gateway
 
 ## Time and risk
 
 - **Duration**: About 30 minutes for install and first-time setup; model download time depends on size and network speed.
 - **Risk level**: **Medium** — the agent can execute commands, persist skills, and is reachable from Telegram. Risk increases if you skip the allowed-user-IDs restriction or expose the local model endpoint beyond `localhost`. Always follow the security measures above.
-- **Rollback**: Run `hermes uninstall` (with `sudo` if you installed the gateway as a system service) to remove Hermes, the gateway service, and the shell-profile entry. The data directory `~/.hermes` may still be present afterward; remove it manually if you want a full reset (see the Cleanup and Troubleshooting tabs). Uninstall Ollama separately if desired.
-- **Last Updated**: 2026-05-08
+- **Rollback**: Run `hermes uninstall` (with `sudo` if you installed the gateway as a system service) to remove Hermes, the gateway service, and the shell-profile entry. The data directory `~/.hermes` may still be present afterward; remove it manually if you want a full reset (see the Cleanup and Troubleshooting tabs). Stop the vLLM container separately (`docker rm`/`docker rmi`) if desired.
+- **Last Updated**: 2026-06-12
+  - Switch local inference backend to vLLM (agent-ready Qwen3.6 35B recipe)
   - First Publication
 
 ## Instructions
@@ -99,36 +100,22 @@ curl -sS --connect-timeout 10 -o /dev/null -w "HTTP %{http_code}\n" https://api.
 
 You should see an **HTTP status line** such as **`HTTP 404`**, **`HTTP 200`**, or **`HTTP 302`** (Telegram’s edge often answers bare `GET` requests with a short JSON or redirect). The important part is that the request **completes over TLS** without hanging. **Timeouts**, **“Could not resolve host”**, or **connection refused** mean the gateway will not reach Telegram from this network—try a path that allows that traffic (for example a personal hotspot) or ask your network administrator to allow **HTTPS to `api.telegram.org`**.
 
-## Step 2. Install Ollama and pull a model
+## Step 2. Serve a model with vLLM
 
-Hermes will be configured against a local Ollama endpoint, so Ollama must be installed and serving at least one model before you run the Hermes installer. If you have already completed the [Ollama on Spark playbook](https://build.nvidia.com/spark/ollama), you can skip this step.
+Hermes will be configured against a local, OpenAI-compatible endpoint, so a model server must be running before you launch the Hermes installer. This playbook uses **vLLM** with the agent-ready `nvidia/Qwen3.6-35B-A3B-NVFP4` recipe — the same one documented in the vLLM playbook's [Run Agent Ready Qwen3.6 35B Model with vLLM](https://build.nvidia.com/spark/vllm/agent-ready-qwen35b) tab.
 
-Install Ollama:
+Follow that tab to launch the server in a **separate terminal** on the Spark so it can run alongside Hermes. It serves `nvidia/Qwen3.6-35B-A3B-NVFP4` on an OpenAI-compatible API at `http://localhost:8000/v1`.
+
+Once the server reports `Application startup complete`, verify the API on **8000** in another terminal. A healthy server returns **JSON** with a top-level **`"data"`** array listing the served model:
 
 ```bash
-curl -fsSL https://ollama.com/install.sh | sh
+curl -sS http://localhost:8000/v1/models
 ```
 
+You should see `nvidia/Qwen3.6-35B-A3B-NVFP4` in the returned list.
+
 > [!NOTE]
-> During `install.sh` you might see a message that **systemd is not running** or that a service could not be enabled. On a normal DGX Spark appliance with systemd this is uncommon. If you are on a minimal container, chroot, or unusual environment, Ollama may still run via the `ollama` CLI once the binary is installed; on a standard Spark, prefer fixing the service (`systemctl status ollama`) if the installer warns. If Ollama otherwise starts and answers on port **11434**, you can treat a one-off installer warning as informational.
-
-Verify the Ollama daemon is running and the HTTP API on **11434** responds. The command below asks Ollama for the **list of pulled models** (`GET /api/tags`). A healthy daemon returns **JSON** with a top-level **`"models"`** array (it may be empty until you pull a model):
-
-```bash
-curl -sS http://localhost:11434/api/tags
-```
-
-Optional: confirm the daemon build string:
-
-```bash
-curl -sS http://localhost:11434/api/version
-```
-
-Pull the model you intend to use with Hermes (this playbook uses `qwen3.6:27b` as the example):
-
-```bash
-ollama pull qwen3.6:27b
-```
+> Keep the vLLM endpoint bound to the Spark only. The container publishes port `8000`; do not forward `http://:8000` to your LAN or the public internet without strong authentication.
 
 ## Step 3. Install Hermes
 
@@ -149,15 +136,15 @@ The installer will walk you through an interactive setup. Respond to each prompt
 
 3. **"Select Provider"** — Choose **Custom endpoint (enter URL manually)** so Hermes can be pointed at the model endpoint running on your DGX Spark.
 
-4. **"API base URL [e.g. https://api.example.com/v1]:"** — *If this prompt appears*, enter the URL of your local model server. For a local Ollama endpoint, use `http://localhost:11434/v1`. (Depending on installer version or prior config, this question is sometimes skipped when the endpoint is already inferred—continue with the prompts you do see.)
+4. **"API base URL [e.g. https://api.example.com/v1]:"** — *If this prompt appears*, enter the URL of your local model server. For the local vLLM endpoint from Step 2, use `http://localhost:8000/v1`. (Depending on installer version or prior config, this question is sometimes skipped when the endpoint is already inferred—continue with the prompts you do see.)
 
-5. **"API key [optional]"** — Leave blank and press **Enter**; no key is required for a local model.
+5. **"API key [optional]"** — Leave blank and press **Enter**; vLLM does not require a key for a local model.
 
-6. **Model selection** — The installer lists the models available from your local Ollama instance. Select one to use with Hermes (for example, `qwen3.6:27b`).
+6. **Model selection** — The installer lists the models served by your local endpoint (vLLM reports these via `/v1/models`). Select `nvidia/Qwen3.6-35B-A3B-NVFP4`.
 
-7. **"Context length in tokens [leave blank for auto-detect]:"** — Press **Enter** to let Hermes auto-detect the context length from the selected model.
+7. **"Context length in tokens [leave blank for auto-detect]:"** — Press **Enter** to let Hermes auto-detect the context length from the served model (the recipe serves `--max-model-len 262144`).
 
-8. **"Display name [Local (localhost:11434)]"** — Press **Enter** to accept the suggested label, or type a custom name to identify this endpoint in the Hermes UI.
+8. **"Display name [Local (localhost:8000)]"** — Press **Enter** to accept the suggested label, or type a custom name to identify this endpoint in the Hermes UI.
 
 9. **"Connect a messaging platform? (Telegram, Discord, etc.)"** — Choose **Set up messaging now (recommended)** to configure a gateway during installation.
 
@@ -190,17 +177,17 @@ The installer will walk you through an interactive setup. Respond to each prompt
 
 #### Non-interactive SSH fallback
 
-If the installer prints **"Setup wizard skipped (no terminal available)"**, or if you are validating the playbook through non-interactive SSH, configure the local Ollama endpoint with Hermes' config command:
+If the installer prints **"Setup wizard skipped (no terminal available)"**, or if you are validating the playbook through non-interactive SSH, configure the local vLLM endpoint with Hermes' config command:
 
 ```bash
 export PATH="$HOME/.local/bin:$PATH"
 hermes config set model.provider custom
-hermes config set model.base_url http://localhost:11434/v1
-hermes config set model.default qwen3.6:27b
+hermes config set model.base_url http://localhost:8000/v1
+hermes config set model.default nvidia/Qwen3.6-35B-A3B-NVFP4
 hermes -z "Reply exactly HERMES_OK"
 ```
 
-The last command should return `HERMES_OK`, confirming that Hermes can call the local Ollama model without opening the TUI.
+The last command should return `HERMES_OK`, confirming that Hermes can call the local vLLM model without opening the TUI.
 
 #### Sudo and `hermes` PATH
 
@@ -231,15 +218,11 @@ sudo journalctl -u -e --no-pager -n 50
 
 If `systemctl status` or `systemctl --user status` shows **active (running)** and logs are not repeating connection errors to Telegram, the service side is in good shape. If logs show TLS timeouts or “connection refused” to Telegram hosts, re-run the **outbound HTTPS** check at the top of this page.
 
-## Step 4. Switch to a different Ollama model (optional)
+## Step 4. Switch to a different model (optional)
 
-You configured an initial model during the Hermes install. To switch to a different one later, pull the new model with Ollama and then re-point Hermes at the same local endpoint.
+You configured an initial model during the Hermes install. To switch to a different one later, restart vLLM serving the new model handle, then re-point Hermes at the same local endpoint.
 
-1. Pull the new model with Ollama (replace `` with the model you want):
-
-   ```bash
-   ollama pull
-   ```
+1. Stop the current vLLM container (Ctrl+C in its terminal) and relaunch it with the new model handle in place of `nvidia/Qwen3.6-35B-A3B-NVFP4`. Use the same `docker run` invocation from the vLLM playbook's [Run Agent Ready Qwen3.6 35B Model with vLLM](https://build.nvidia.com/spark/vllm/agent-ready-qwen35b) tab, swapping the model handle (and any flags appropriate for the new model).
 
 2. Launch the Hermes model picker:
 
@@ -249,20 +232,20 @@ You configured an initial model during the Hermes install. To switch to a differ
 
 3. At the **"Select Provider"** prompt, choose **Custom endpoint (enter URL manually)**.
 
-4. **If you see the “API base URL” prompt**, enter the same local Ollama endpoint as before:
+4. **If you see the “API base URL” prompt**, enter the same local vLLM endpoint as before:
 
    ```
-   http://localhost:11434/v1
+   http://localhost:8000/v1
    ```
 
-5. When the installer lists the models served by Ollama, choose the one you just pulled. Hermes will use it for subsequent sessions.
+5. When Hermes lists the models served by the endpoint, choose the one you just started serving. Hermes will use it for subsequent sessions.
 
 If you are in a non-interactive SSH session, switch models with config commands instead:
 
 ```bash
 hermes config set model.provider custom
-hermes config set model.base_url http://localhost:11434/v1
-hermes config set model.default
+hermes config set model.base_url http://localhost:8000/v1
+hermes config set model.default
 hermes -z "Reply exactly MODEL_OK"
 ```
 
@@ -347,7 +330,7 @@ rm -rf ~/.hermes
 | `hermes: command not found` after install | Shell profile not reloaded in the current session | Run `source ~/.bashrc` (or `source ~/.zshrc`) and retry. Open a new terminal if the issue persists. |
 | `source ~/.bashrc` works in an interactive terminal, but `hermes` is still missing from a scripted SSH command | Many Ubuntu `.bashrc` files return early for non-interactive shells before the installer-added PATH lines run | In automation, run `export PATH="$HOME/.local/bin:$PATH"` before `hermes`, or call `~/.local/bin/hermes` directly. |
 | `sudo: hermes: command not found` during gateway install, uninstall, or printed `sudo hermes …` steps | `sudo` resets `PATH` and does not see the user-level `hermes` shim | Run `which hermes` as your normal user, then invoke that path with sudo, e.g. `sudo "$(which hermes)" uninstall` or `sudo /full/path/from/which/hermes gateway …`. |
-| Installer prints **"Setup wizard skipped (no terminal available)"** | The installer was launched from a non-interactive shell, CI job, or SSH command without a usable TTY | Either re-run `hermes setup` in an interactive terminal, or configure Ollama directly: `hermes config set model.provider custom`, `hermes config set model.base_url http://localhost:11434/v1`, and `hermes config set model.default qwen3.6:27b`. |
+| Installer prints **"Setup wizard skipped (no terminal available)"** | The installer was launched from a non-interactive shell, CI job, or SSH command without a usable TTY | Either re-run `hermes setup` in an interactive terminal, or configure the endpoint directly: `hermes config set model.provider custom`, `hermes config set model.base_url http://localhost:8000/v1`, and `hermes config set model.default nvidia/Qwen3.6-35B-A3B-NVFP4`. |
 | Installer cannot install `ripgrep` / `ffmpeg`, or prints `Non-interactive mode and no terminal available` | Optional helper install needs `sudo`, but the current shell cannot prompt for a password | Install manually in an interactive terminal with `sudo apt install -y ripgrep ffmpeg`. Hermes still runs without them, but file search is slower and TTS voice-message support is limited. |
 | Browser tools show `system dependency not met`, or Playwright Chromium install fails | Playwright needs Linux shared libraries installed through `sudo`, and the installer could not obtain sudo access | Core chat and Telegram can still work. To enable browser tools, run `cd ~/.hermes/hermes-agent && npx playwright install --with-deps chromium` in an interactive terminal and enter your sudo password. |
 | You want the gateway to start at boot, but `hermes gateway install` creates a user service | Current Hermes installs a user service by default unless `--system` is supplied | Use `sudo "$(which hermes)" gateway install --system --run-as-user "$USER"` (or replace `$(which hermes)` with `~/.local/bin/hermes` if needed). |
@@ -357,11 +340,11 @@ rm -rf ~/.hermes
 | Choosing **Telegram** during install immediately shows “setup complete” without token / user ID prompts | Stale or partial Hermes gateway config; installer short-circuit | After `source ~/.bashrc`, run **`hermes gateway setup`**, select Telegram, and complete token and allowed-user steps. Install or restart the systemd service using the printed commands (with `sudo "$(which hermes)"` if needed). |
 | `/start` shows “Unknown command” (or similar) in Telegram | Bot does not define a custom `/start` handler | Send a normal text message such as **`hello`** after `/start`. Hermes responds to conversational text, not necessarily slash commands. |
 | `~/.hermes` still exists after `uninstall` | Uninstaller preserves data unless you explicitly remove it | This is expected in some flows. Remove manually only if you want a full wipe: `rm -rf ~/.hermes` (see **Start over from scratch**). |
-| Hermes installer can't list any models at the model-selection prompt | Ollama is not running or has no models pulled | Sanity-check Ollama in another terminal: list installed models with `ollama list`, hit the API with `curl http://localhost:11434/api/tags`, and confirm a model can actually serve requests by running `ollama run ` (e.g. `ollama run qwen3.6:27b`) and sending a test prompt. If the list is empty or the API is unreachable, start Ollama and pull a model with `ollama pull `, then re-run the Hermes installer. |
-| `Connection refused` to `http://localhost:11434/v1` from Hermes | Ollama service not running on the default port | Start the Ollama service and confirm it is listening on `11434`. On systemd hosts: `systemctl status ollama` and `systemctl start ollama`. |
+| Hermes installer can't list any models at the model-selection prompt | vLLM is not running yet or is still loading the checkpoint | Sanity-check the endpoint in another terminal: `curl http://localhost:8000/v1/models` should return a `"data"` array containing `nvidia/Qwen3.6-35B-A3B-NVFP4`. If it is empty or unreachable, confirm the vLLM container is up and has finished loading (watch its terminal for `Application startup complete`), then re-run the Hermes installer. |
+| `Connection refused` to `http://localhost:8000/v1` from Hermes | vLLM server not running, still loading, or wrong port | Confirm the vLLM container is up and listening on `8000` (`docker ps`, then `curl http://localhost:8000/v1/models`). If it exited, relaunch it (see Instructions — Step 2). |
 | Pasting the Telegram bot token shows nothing on the screen | Expected — the installer hides token characters as a security measure | Paste the token, then press **Enter**. The installer should respond with `Telegram token saved`. |
 | Telegram bot does not reply when you send `hello` | Gateway service not running, your account is not in the allowed user IDs list, **or outbound HTTPS to Telegram is blocked** | (1) Confirm Telegram HTTPS from the Spark (Instructions — network check). (2) List Hermes units with `systemctl list-units --type=service --all`, locate the gateway unit by name, then `sudo systemctl status ` and `sudo journalctl -u -e --no-pager -n 80`. (3) If logs show reachability to Telegram but messages are ignored, verify your numeric user ID is in the allowed list via `hermes gateway setup` or the [Hermes messaging gateway docs](https://hermes-agent.nousresearch.com/docs/user-guide/messaging). |
-| Out-of-memory or very slow inference | Selected Ollama model is too large for available GPU memory, or other GPU workloads are competing | Check usage with `nvidia-smi`, free GPU memory by closing other workloads, or pull a smaller model with `ollama pull ` and switch to it via `hermes model`. |
+| Out-of-memory or very slow inference | Served model is too large for available GPU memory, or other GPU workloads are competing | Check usage with `nvidia-smi`, free GPU memory by closing other workloads, or relaunch vLLM with a lower `--gpu-memory-utilization` / `--max-model-len` (or a smaller model handle) and re-point Hermes via `hermes model`. |
 | `hermes update` fails or the gateway does not restart | Gateway service still bound to the previous version, or insufficient permissions on a system-service install | Re-run `sudo "$(which hermes)" update` if the gateway was installed as a **System service** and plain `hermes update` cannot restart it. If the service is stuck, restart it manually: `sudo systemctl restart `. |
 | Cannot resume a previous session | The `` value is missing or wrong | Use `hermes --resume ` with the exact ID Hermes printed when you `/exit` that chat. If the ID is lost, start a new session with `hermes` (omit `--resume`). |
 
diff --git a/nvidia/nemoclaw/README.md b/nvidia/nemoclaw/README.md
index aa3177e..10ed64d 100644
--- a/nvidia/nemoclaw/README.md
+++ b/nvidia/nemoclaw/README.md
@@ -1,6 +1,6 @@
 # Run NemoClaw with a Local LLM
 
-> Build your first local AI assistant on DGX Spark using NemoClaw and Ollama in a secure sandbox, with optional Telegram.
+> Build your first local AI assistant on DGX Spark using NemoClaw and vLLM in a secure sandbox, with optional Telegram.
 
 ## Table of Contents
 
@@ -31,9 +31,9 @@
 
 ## Basic idea
 
-**NVIDIA NemoClaw** is an open-source reference stack that simplifies running OpenClaw always-on assistants more safely. It installs the **NVIDIA OpenShell** runtime — an environment designed for executing agents with additional security — and connects them to **local Ollama** inference on your DGX Spark. A single installer command (`nemoclaw.sh`) handles Node.js, OpenShell, and the NemoClaw CLI; the **onboard** wizard then creates a sandboxed agent, optional **Brave Search**, optional **messaging channels** (Telegram, Discord, or Slack), and a **policy tier** with network presets.
+**NVIDIA NemoClaw** is an open-source reference stack that simplifies running OpenClaw always-on assistants more safely. It installs the **NVIDIA OpenShell** runtime — an environment designed for executing agents with additional security — and connects them to **local vLLM** inference on your DGX Spark. A single installer command (`nemoclaw.sh`) handles Node.js, OpenShell, and the NemoClaw CLI; the **onboard** wizard then creates a sandboxed agent, optional **Brave Search**, optional **messaging channels** (Telegram, Discord, or Slack), and a **policy tier** with network presets.
 
-By the end of this playbook you will have a working AI agent inside an OpenShell sandbox, reachable through the **Web UI** or **terminal TUI**, with inference routed to **local Ollama** on the Spark. You can optionally add **Telegram** (with **cloudflared** for a public webhook URL) and optional **web search** — all without exposing your host filesystem or network beyond what you explicitly allow in policy.
+By the end of this playbook you will have a working AI agent inside an OpenShell sandbox, reachable through the **Web UI** or **terminal TUI**, with inference routed to **local vLLM** on the Spark. You can optionally add **Telegram** (with **cloudflared** for a public webhook URL) and optional **web search** — all without exposing your host filesystem or network beyond what you explicitly allow in policy.
 
 ### What you'll accomplish
 
@@ -118,7 +118,8 @@ All required assets are handled by the NemoClaw installer. No manual cloning is
 
 - **Estimated time:** About 30–60 minutes for a first full pass (install, onboard, model download depending on choice and network). Optional Brave, Telegram, and cloudflared steps add time if you do them in a second session.
 - **Risk level:** Medium — you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts.
-- **Last Updated:** 06/01/2026
+- **Last Updated:** 06/12/2026
+  - Switch local inference backend to vLLM (agent-ready Qwen3.6 35B recipe)
   - Pin nemoclaw installer to v0.0.55, the latest stable version
 
 ## Instructions
@@ -147,8 +148,8 @@ The installer requires **Node.js 22.16+** (installed automatically if missing).
 
 During custom setup, the onboard wizard walks you through:
 
-1. **Configuring inference** -- Choose to set up local inference on your Spark by selecting **`7) Local Ollama`**.
-2. **Ollama models** -- Choose desired inference model. If no model is present locally, the installer will download **`qwen3.6:35b`** automatically.
+1. **Configuring inference** -- Choose to set up local inference on your Spark by selecting **`Local vLLM`** (the default).
+2. **vLLM models** -- Choose desired inference model. If no model is present locally, the installer will download **`nvidia/Qwen3.6-35B-A3B-NVFP4`** automatically.
 3. **Sandbox name** -- Pick a name (e.g. my-assistant). Each sandbox requires a unique name.
 4. **Apply this configuration** -- Enter `Y` to confirm setting up local inference.
 5. **Enable Brave Web Search** -- Optional. If you enable it, paste a [Brave Search API](https://brave.com/search/api/) key when prompted.
@@ -160,7 +161,7 @@ When complete you will see output like:
 ```text
 ──────────────────────────────────────────────────
 Sandbox my-assistant (Landlock + seccomp + netns)
-Model (Local Ollama)
+Model (Local vLLM)
 ──────────────────────────────────────────────────
 Run: nemoclaw my-assistant connect
 Status: nemoclaw my-assistant status
@@ -377,13 +378,13 @@ openshell forward stop # stop the dashboard forward (use the port shown
 
 ### Step 8. Uninstall NemoClaw
 
-The NemoClaw CLI includes a built-in uninstaller. It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and Ollama are preserved.
+The NemoClaw CLI includes a built-in uninstaller. It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and the vLLM container image are preserved.
 
 ```bash
 nemoclaw uninstall --yes
 ```
 
-To remove everything including the Ollama model:
+To remove everything including the downloaded model weights:
 
 ```bash
 nemoclaw uninstall --yes --delete-models
@@ -395,7 +396,7 @@ nemoclaw uninstall --yes --delete-models
 |------|--------|
 | `--yes` | Skip the confirmation prompt |
 | `--keep-openshell` | Leave the `openshell` binary in place |
-| `--delete-models` | Also remove the Ollama models pulled by NemoClaw |
+| `--delete-models` | Also remove the model weights pulled by NemoClaw |
 
 > [!NOTE]
 > If the `nemoclaw` CLI is not available (e.g. install failed partway), use the remote uninstaller as a fallback:
@@ -408,7 +409,7 @@ The uninstaller runs 6 steps:
 2. Delete all OpenShell sandboxes, the NemoClaw gateway, and providers
 3. Remove the global `nemoclaw` npm package
 4. Remove NemoClaw/OpenShell Docker containers, images, and volumes
-5. Remove Ollama models (only with `--delete-models`)
+5. Remove downloaded model weights (only with `--delete-models`)
 6. Remove state directories (`~/.nemoclaw`, `~/.config/openshell`, `~/.config/nemoclaw`) and the OpenShell binary
 
 > [!NOTE]
@@ -427,8 +428,8 @@ The uninstaller runs 6 steps:
 | `nemoclaw my-assistant dashboard-url --quiet` | Print the full tokenized Web UI URL (includes auto-assigned port) |
 | `openshell term` | Open the monitoring TUI on the host |
 | `openshell forward list` | List active port forwards |
-| `nemoclaw uninstall --yes` | Remove NemoClaw (preserves Docker, Node.js, Ollama) |
-| `nemoclaw uninstall --yes --delete-models` | Remove NemoClaw and Ollama models |
+| `nemoclaw uninstall --yes` | Remove NemoClaw (preserves Docker, Node.js, vLLM image) |
+| `nemoclaw uninstall --yes --delete-models` | Remove NemoClaw and downloaded model weights |
 
 ## Troubleshooting
 
@@ -442,8 +443,8 @@ The uninstaller runs 6 steps:
 | Gateway fails with "port 8080 is held by container..." | Another OpenShell gateway or container is using port 8080 | Stop the conflicting container: `openshell gateway destroy -g ` or `docker stop && docker rm `, then retry `nemoclaw onboard`. |
 | Sandbox creation fails | Stale gateway state or DNS not propagated | Run `openshell gateway destroy && openshell gateway start`, then re-run the installer or `nemoclaw onboard`. |
 | CoreDNS crash loop | Known issue on some DGX Spark configurations | Re-run the NemoClaw installer (`curl -fsSL https://www.nvidia.com/nemoclaw.sh \| bash`) which includes the CoreDNS fix. If the issue persists, see [NemoClaw troubleshooting](https://docs.nvidia.com/nemoclaw/latest/reference/troubleshooting.html). |
-| "No GPU detected" during onboard | DGX Spark GB10 reports unified memory differently | Expected on DGX Spark. The wizard still works and uses Ollama for inference. |
-| Inference timeout or hangs | Ollama not running or not reachable | Check Ollama: `curl http://127.0.0.1:11434`. If not running: `sudo systemctl restart ollama`. Verify the NemoClaw auth proxy is healthy: `curl http://127.0.0.1:11435/api/tags`. If both respond, check `nemoclaw my-assistant status` for the Inference health line. |
+| "No GPU detected" during onboard | DGX Spark GB10 reports unified memory differently | Expected on DGX Spark. The wizard still works and uses vLLM for inference. |
+| Inference timeout or hangs | vLLM not running or not reachable | Check the vLLM server: `curl http://127.0.0.1:8000/v1/models` should list `nvidia/Qwen3.6-35B-A3B-NVFP4`. If it hangs, the model may still be loading — wait for `Application startup complete`. Then check `nemoclaw my-assistant status` for the Inference health line. |
 | Agent gives no response or is very slow | First response can be slow, especially with larger models | Response time depends on model size (30B: a few seconds, 120B: 30–90 seconds). Verify inference route: `nemoclaw my-assistant status`. |
 | Port 18789 already in use | Another process is bound to the port | `lsof -i :18789` then `kill `. If needed, `kill -9 ` to force-terminate. |
 | Web UI port forward dies or dashboard unreachable | Port forward not active | `openshell forward stop 18789 my-assistant` then `openshell forward start 18789 my-assistant --background`. |
diff --git a/nvidia/openclaw/README.md b/nvidia/openclaw/README.md
index cdd7fc8..1823b68 100644
--- a/nvidia/openclaw/README.md
+++ b/nvidia/openclaw/README.md
@@ -1,6 +1,6 @@
 # OpenClaw 🦞
 
-> Run OpenClaw locally on DGX Spark with LM Studio or Ollama
+> Run OpenClaw locally on DGX Spark with a vLLM-served local model
 
 ## Table of Contents
 
@@ -20,7 +20,7 @@ Running OpenClaw and its LLMs **fully on your DGX Spark** keeps your data privat
 
 ## What you'll accomplish
 
-You will have OpenClaw installed on your DGX Spark and connected to a local LLM (via LM Studio or Ollama). You can use the OpenClaw web UI to chat with your agent, and optionally connect communication channels and skills. The agent and models run entirely on your Spark—no data leaves your machine unless you add cloud or external integrations.
+You will have OpenClaw installed on your DGX Spark and connected to a local LLM served by **vLLM** (the agent-ready `nvidia/Qwen3.6-35B-A3B-NVFP4` recipe). You can use the OpenClaw web UI to chat with your agent, and optionally connect communication channels and skills. The agent and models run entirely on your Spark—no data leaves your machine unless you add cloud or external integrations.
 
 ## Popular use cases
 
@@ -32,7 +32,7 @@ You will have OpenClaw installed on your DGX Spark and connected to a local LLM
 ## What to know before starting
 
 - Basic use of the Linux terminal and a text editor
-- Optional: familiarity with Ollama or LM Studio if you plan to use a local model
+- Optional: familiarity with Docker and vLLM if you plan to use a local model
 - Awareness of the security considerations below
 
 ## Important: security and risks
@@ -61,10 +61,11 @@ You cannot eliminate all risk; proceed at your own risk. **Critical security mea
 
 ## Time and risk
 
-- **Duration**: About 30 minutes for install and first-time model setup; model download time depends on size and network (gpt-oss-120b is ~65GB and may take longer on slower connections).
+- **Duration**: About 30 minutes for install and first-time model setup; model download time depends on size and network (the NVFP4 checkpoint is downloaded once and cached for later launches).
 - **Risk level**: **Medium to High**—the agent has access to whatever files, tools, and channels you configure. Risk increases significantly if you enable terminal/command execution skills or connect external accounts. Without proper isolation, this setup could expose sensitive data or allow code execution. **Always follow the security measures above.**
-- **Rollback**: You can stop the OpenClaw gateway and uninstall via the same install script or by removing its directory; uninstall Ollama or LM Studio separately if desired.
-- **Last Updated**: 03/11/2026
+- **Rollback**: You can stop the OpenClaw gateway and uninstall via the same install script or by removing its directory; stop the vLLM container separately (`docker rm`/`docker rmi`) if desired.
+- **Last Updated**: 06/12/2026
+  - Switch local inference backend to vLLM (agent-ready Qwen3.6 35B recipe)
   - First Publication
 
 ## Instructions
@@ -106,85 +107,21 @@ Work through the prompts as follows.
 
 You can now open the OpenClaw dashboard in a browser using the URL and token from the installer.
 
-## Step 3. Choose and install a local LLM backend
+## Step 3. Serve the model with vLLM on your DGX Spark
 
-OpenClaw can use a local LLM via **LM Studio** (best raw performance, uses Llama.cpp) or **Ollama** (simpler and good for deployment). Use a **separate terminal** on your DGX Spark for the backend so the gateway and the model server can run side by side.
+OpenClaw will connect to a local, OpenAI-compatible endpoint served by **vLLM**. This playbook uses the agent-ready `nvidia/Qwen3.6-35B-A3B-NVFP4` recipe — the same one documented in the vLLM playbook's [Run Agent Ready Qwen3.6 35B Model with vLLM](https://build.nvidia.com/spark/vllm/agent-ready-qwen35b) tab. The NVFP4 quantization and speculative decoding give strong tool-calling and reasoning quality while leaving headroom on DGX Spark's 128GB unified memory.
 
-**Install one of the following:**
+In a **separate terminal** on your DGX Spark, follow the vLLM playbook's [Run Agent Ready Qwen3.6 35B Model with vLLM](https://build.nvidia.com/spark/vllm/agent-ready-qwen35b) tab to launch the server. Run it on its own terminal so the gateway and the model server can run side by side. That tab serves `nvidia/Qwen3.6-35B-A3B-NVFP4` on an OpenAI-compatible API at `http://localhost:8000/v1`.
 
-**Option A – LM Studio**
+Once the server reports `Application startup complete`, verify it from another terminal before continuing:
 
 ```bash
-curl -fsSL https://lmstudio.ai/install.sh | bash
+curl http://localhost:8000/v1/models
 ```
 
-**Option B – Ollama**
+You should see `nvidia/Qwen3.6-35B-A3B-NVFP4` in the returned list.
 
-```bash
-curl -fsSL https://ollama.com/install.sh | sh
-```
-
-## Step 4. Select and download a model
-
-Model quality and capability scale with size. Free as much GPU memory as possible (avoid other GPU workloads, enable only the skills you need). DGX Spark has **128GB unified memory**, so you can run large models with room to spare.
-
-**Suggested models by GPU memory:**
-
-| GPU memory | Suggested model | Model size | Notes |
-|-------------|-------------------------------------|-----------|-------|
-| 8–12 GB | qwen3-4B-Thinking-2507 | ~5GB | — |
-| 16 GB | gpt-oss-20b | ~12GB | Lower latency, good for interactive use |
-| 24–48 GB | Nemotron-3-Nano-30B-A3B | ~20GB | — |
-| 128 GB | gpt-oss-120b | ~65GB | **Best quality on DGX Spark** (quantized); leaves ~63GB for context window and other processes; use 20B/30B if you prefer faster responses |
-
-**Quality vs. latency:** The 120B model gives the best accuracy and capability but has higher per-token latency.
If you prefer snappier replies, use **gpt-oss-20b** (or a 30B model) instead; both run comfortably on DGX Spark with plenty of memory headroom. - -**Download the model:** - -**LM Studio** - -```bash -lms get openai/gpt-oss-120b -``` - -**Ollama** - -```bash -ollama pull gpt-oss:120b -``` - -(Use the model name that matches your choice from the table; adjust the `lms get` or `ollama pull` command accordingly.) - -## Step 5. Run the model with a large context window - -OpenClaw works best with a context window of **32K tokens or more**. - -**LM Studio** - -```bash -lms load openai/gpt-oss-120b --context-length 32768 -``` - -**Ollama** - -```bash -ollama run gpt-oss:120b -``` - -Once the interactive prompt appears, set the context window (type the following at the Ollama prompt; do not include any `>>>` prefix): - -``` -/set parameter num_ctx 32768 -``` - -Keep this terminal (or process) running so the model stays loaded. You can now chat with the model or press Ctrl+D to exit the interactive mode while keeping the model server running. - -> [!TIP] -> **If you see out-of-memory (OOM) errors:** Try a smaller context (e.g. `16384`) or switch to a smaller model (e.g. gpt-oss-20b). Monitor memory with `nvidia-smi` while the model is loaded. - -## Step 6. Configure OpenClaw to use your local model - -**If you use LM Studio:** +## Step 4. Configure OpenClaw to use the vLLM server 1. Open the OpenClaw config file in your preferred editor (e.g. `nano`, `vim`, or a graphical editor). The config path is: ```bash @@ -195,21 +132,21 @@ Keep this terminal (or process) running so the model stays loaded. You can now c nano ~/.openclaw/openclaw.json ``` -2. Add or update the `models` section so it includes the LM Studio provider. Example for **gpt-oss-120b** (DGX Spark): +2. Add or update the `models` section so it includes the vLLM provider pointing at the endpoint from Step 3. 
vLLM does not require an API key, so any non-empty placeholder works: ```json "models": { "mode": "merge", "providers": { - "lmstudio": { - "baseUrl": "http://localhost:1234/v1", - "apiKey": "lmstudio", + "vllm": { + "baseUrl": "http://localhost:8000/v1", + "apiKey": "vllm", "api": "openai-responses", "models": [ { - "id": "openai/gpt-oss-120b", - "name": "openai/gpt-oss-120b", - "reasoning": false, + "id": "nvidia/Qwen3.6-35B-A3B-NVFP4", + "name": "nvidia/Qwen3.6-35B-A3B-NVFP4", + "reasoning": true, "input": ["text"], "cost": { "input": 0, @@ -217,8 +154,8 @@ Keep this terminal (or process) running so the model stays loaded. You can now c "cacheRead": 0, "cacheWrite": 0 }, - "contextWindow": 32768, - "maxTokens": 4096 + "contextWindow": 262144, + "maxTokens": 8192 } ] } @@ -226,30 +163,22 @@ Keep this terminal (or process) running so the model stays loaded. You can now c } ``` -For **gpt-oss-20b** or another model, use the same structure but set `id` and `name` to match the model you loaded (e.g. `openai/gpt-oss-20b`). Adjust `contextWindow` and `maxTokens` if needed. - -**If you use Ollama:** +The `id` and `name` must match the model handle served by vLLM (`nvidia/Qwen3.6-35B-A3B-NVFP4`). `contextWindow` matches the `--max-model-len` from Step 3. > [!NOTE] -> `ollama launch openclaw` requires **Ollama v0.15 or later**. If you see an "unknown command" error, upgrade Ollama (`ollama --version`) and retry. +> If OpenClaw reports an unsupported-endpoint error against the Responses API, change `"api": "openai-responses"` to the OpenAI chat-completions variant for your OpenClaw version — vLLM always exposes `/v1/chat/completions`. -Run: +3. If the OpenClaw gateway is already running, restart it so it reloads `~/.openclaw/openclaw.json` and picks up the new provider. -```bash -ollama launch openclaw -``` - -If the OpenClaw gateway is already running, it should pick up the new configuration automatically. 
You can add `--config` to configure without launching the gateway yet. - -## Step 7. Verify the setup +## Step 5. Verify the setup 1. In a browser, open the **OpenClaw dashboard URL** (and use the access token if required). 2. Start a **new** conversation and send a short message. 3. If you get a reply from the agent, the setup is working. -You can also ask OpenClaw which model it’s using. In the gateway chat UI you can switch models by typing: **`/model MODEL_NAME`**. +You can also ask OpenClaw which model it’s using. In the gateway chat UI you can switch models by typing: **`/model MODEL_NAME`** (e.g. `/model nvidia/Qwen3.6-35B-A3B-NVFP4`). -## Step 8. Optional: add skills and learn more +## Step 6. Optional: add skills and learn more - **Skills** add capabilities but also risk; only enable skills you trust (e.g., community-vetted ones). To add a skill: - Ask OpenClaw to configure a skill, or @@ -262,9 +191,9 @@ You can also ask OpenClaw which model it’s using. In the gateway chat UI you c | Symptom | Cause | Fix | |---------|--------|-----| -| OpenClaw dashboard URL not loading | Gateway not running or wrong host/port | **Restart the OpenClaw gateway:** For Ollama, run `ollama launch openclaw` to restart an already-configured gateway. For LM Studio, restart the OpenClaw gateway via the LM Studio UI or restart the OpenClaw service/container. **Verify:** Check that the gateway process is running with `pgrep -f openclaw` or `ps aux \| grep openclaw`. **Find URL/token:** Check the original installer output (scroll up in your terminal) or look in gateway logs (typically `~/.openclaw/logs/`) for the dashboard URL and access token | -| "Connection refused" to model (e.g. 
localhost:1234 or Ollama port) | LM Studio or Ollama not running, or wrong port | Start the model in a separate terminal (`lms load ...` or `ollama run ...`) and ensure the port in `openclaw.json` matches (1234 for LM Studio, 11434 for Ollama) | -| OpenClaw says no model available | Model provider not configured or model not loaded | Add the `models` section to `~/.openclaw/openclaw.json` for LM Studio, or run `ollama launch openclaw` for Ollama; ensure the model is loaded/running | -| Out-of-memory or very slow inference on DGX Spark | Model too large for available GPU memory or other GPU workloads | Free GPU memory (close other apps), choose a smaller model, or check usage with `nvidia-smi` | +| OpenClaw dashboard URL not loading | Gateway not running or wrong host/port | **Restart the OpenClaw gateway** so it reloads `~/.openclaw/openclaw.json`. **Verify:** Check that the gateway process is running with `pgrep -f openclaw` or `ps aux \| grep openclaw`. **Find URL/token:** Check the original installer output (scroll up in your terminal) or look in gateway logs (typically `~/.openclaw/logs/`) for the dashboard URL and access token | +| "Connection refused" to model (e.g. 
localhost:8000) | vLLM server not running, still loading, or wrong port | Confirm the vLLM container is up and finished loading (`curl http://localhost:8000/v1/models` lists the model) and that `baseUrl` in `openclaw.json` is `http://localhost:8000/v1` | +| OpenClaw says no model available | Provider not configured or model handle mismatch | Add the `vllm` provider to `~/.openclaw/openclaw.json` and ensure `id`/`name` exactly match the served handle (`nvidia/Qwen3.6-35B-A3B-NVFP4`) | +| Out-of-memory or very slow inference on DGX Spark | Model too large for available GPU memory or other GPU workloads | Lower `--gpu-memory-utilization` or `--max-model-len` when launching vLLM, free GPU memory (close other apps), or check usage with `nvidia-smi` | | Install script fails or dependencies missing | Missing system packages on Linux | Install curl and any required build tools; see [OpenClaw documentation](https://docs.openclaw.ai) for current requirements | | Config changes not applied | Gateway not reloaded | Restart the OpenClaw gateway so it reloads `~/.openclaw/openclaw.json` | diff --git a/nvidia/openshell/README.md b/nvidia/openshell/README.md index f5634d0..bc8d51f 100644 --- a/nvidia/openshell/README.md +++ b/nvidia/openshell/README.md @@ -83,22 +83,22 @@ You will install the OpenShell CLI (`openshell`), deploy a gateway on your DGX S - Comfort with the Linux terminal and SSH - Basic understanding of Docker (OpenShell runs a k3s cluster inside Docker) -- Familiarity with Ollama for local model serving +- Familiarity with Docker and vLLM for local model serving - Awareness of the security model: OpenShell reduces risk through isolation but cannot eliminate all risk. Review the [OpenShell documentation](https://pypi.org/project/openshell/) and [OpenClaw security guidance](https://docs.openclaw.ai/gateway/security). 
## Prerequisites **Hardware Requirements:** - NVIDIA DGX Spark with 128GB unified memory -- At least 70GB available memory for a large local model (e.g., gpt-oss:120b at ~65GB plus overhead), or 25GB+ for a smaller model (e.g., gpt-oss-20b) +- Enough unified memory for the served model plus KV cache (the playbook serves `nvidia/Qwen3.6-35B-A3B-NVFP4` with vLLM at `--gpu-memory-utilization 0.4`) **Software Requirements:** - NVIDIA DGX OS (Ubuntu 24.04 base) - Docker Desktop or Docker Engine running: `docker info` - Python 3.12 or later: `python3 --version` - `uv` package manager: `uv --version` (install with `curl -LsSf https://astral.sh/uv/install.sh | sh`) -- Ollama 0.17.0 or newer (latest recommended for gpt-oss MXFP4 support): `ollama --version` -- Network access to download Python packages from PyPI and model weights from Ollama +- NVIDIA Container Toolkit configured for Docker, plus a HuggingFace token to download the model +- Network access to download Python packages from PyPI and model weights from HuggingFace - Have [NVIDIA Sync](https://build.nvidia.com/spark/connect-to-your-spark) installed and configured for your DGX Spark ## Time & risk @@ -109,8 +109,9 @@ You will install the OpenShell CLI (`openshell`), deploy a gateway on your DGX S * OpenShell sandboxes enforce kernel-level isolation, significantly reducing the risk compared to running OpenClaw directly on the host. * The sandbox default policy denies all outbound traffic not explicitly allowed. Misconfigured policies may block legitimate agent traffic; use `openshell logs` to diagnose. * Large model downloads may fail on unstable networks. -* **Rollback:** Delete the sandbox with `openshell sandbox delete `, stop the gateway with `openshell gateway stop`, and optionally destroy it with `openshell gateway destroy`. Ollama models can be removed with `ollama rm `. 
-* **Last Updated:** 03/13/2026 +* **Rollback:** Delete the sandbox with `openshell sandbox delete `, stop the gateway with `openshell gateway stop`, and optionally destroy it with `openshell gateway destroy`. The vLLM container can be removed with `docker rm`/`docker rmi`. +* **Last Updated:** 06/12/2026 + * Switch local inference backend to vLLM (agent-ready Qwen3.6 35B recipe) ## Instructions @@ -191,63 +192,26 @@ openshell status > [!TIP] > If you want to manage the Spark gateway from a separate workstation, run `openshell gateway start --remote @.local` from that workstation instead. All subsequent commands will route through the SSH tunnel. -## Step 5. Install Ollama and pull a model +## Step 5. Serve a model with vLLM -Install Ollama (if not already present) and download a model for local inference. +Serve a model with **vLLM** for local inference. This playbook uses the agent-ready `nvidia/Qwen3.6-35B-A3B-NVFP4` recipe — the same one documented in the vLLM playbook's [Run Agent Ready Qwen3.6 35B Model with vLLM](https://build.nvidia.com/spark/vllm/agent-ready-qwen35b) tab. + +Follow that tab to launch the server in a **separate terminal**. It serves `nvidia/Qwen3.6-35B-A3B-NVFP4` on an OpenAI-compatible API at port `8000`. + +> [!IMPORTANT] +> The recipe binds `--host 0.0.0.0`, which is required here: the OpenShell gateway runs inside Docker and reaches the server over the Spark's IP address, not `localhost`. Keep the `--host 0.0.0.0` flag when you launch it. 
+ +Once the server reports `Application startup complete`, verify it is reachable on all interfaces: ```bash -curl -fsSL https://ollama.com/install.sh | sh -ollama --version +curl http://0.0.0.0:8000/v1/models ``` -DGX Spark's 128GB memory can run large models: - -| GPU memory available | Suggested model | Model size | Notes | -|---------------------|---------------------------|-----------|-------| -| 25–48 GB | nemotron-3-nano | ~24GB | Lower latency, good for interactive use | -| 48–80 GB | gpt-oss:120b | ~65GB | Good balance of quality and speed | -| 128 GB | nemotron-3-super:120b | ~86GB | Best quality on DGX Spark | - -Verify Ollama is running (it auto-starts as a service after installation). If not, start it manually: - -```bash -ollama serve & -``` - -Configure Ollama to listen on all interfaces so the OpenShell gateway container can reach it: - -```bash -sudo mkdir -p /etc/systemd/system/ollama.service.d -printf '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0"\n' | sudo tee /etc/systemd/system/ollama.service.d/override.conf -sudo systemctl daemon-reload -sudo systemctl restart ollama -``` - -Verify Ollama is running and reachable on all interfaces: - -```bash -curl http://0.0.0.0:11434 -``` - -Expected: `Ollama is running`. If not, start it with `sudo systemctl start ollama`. - -Next, run a model from Ollama (adjust the model name to match your choice from [the Ollama model library](https://ollama.com/library)). The `ollama run` command will pull the model automatically if it is not already present. Running the model here ensures it is loaded and ready when you use it with OpenClaw, reducing the chance of timeouts later. Example for nemotron-3-super: - -```bash -ollama run nemotron-3-super:120b -``` - -Type `/bye` to exit. - -Verify the model is available: - -```bash -ollama list -``` +Expected: a JSON `"data"` array listing `nvidia/Qwen3.6-35B-A3B-NVFP4`. If the request hangs, the model is likely still loading — wait for the startup line and retry. ## Step 6. 
Create an inference provider -We are going to create an OpenShell provider that points to your local Ollama server. This lets OpenShell route inference requests to your Spark-hosted model. +We are going to create an OpenShell provider that points to your local vLLM server. This lets OpenShell route inference requests to your Spark-hosted model. First, find the IP address of your DGX Spark: @@ -255,14 +219,14 @@ First, find the IP address of your DGX Spark: hostname -I | awk '{print $1}' ``` -Then create the provider, replacing `{Machine_IP}` with the IP address from the command above (e.g. `10.110.106.169`): +Then create the provider, replacing `{Machine_IP}` with the IP address from the command above (e.g. `10.110.106.169`). vLLM does not require an API key, so any non-empty placeholder works: ```bash openshell provider create \ - --name local-ollama \ + --name local-vllm \ --type openai \ --credential OPENAI_API_KEY=not-needed \ - --config OPENAI_BASE_URL=http://{Machine_IP}:11434/v1 + --config OPENAI_BASE_URL=http://{Machine_IP}:8000/v1 ``` > [!IMPORTANT] @@ -276,18 +240,18 @@ openshell provider list ## Step 7. Configure inference routing -Point the `inference.local` endpoint (available inside every sandbox) at your Ollama model. Replace the model name with your choice from Step 5: +Point the `inference.local` endpoint (available inside every sandbox) at your vLLM model. The model name must match the handle served in Step 5: ```bash openshell inference set \ - --provider local-ollama \ - --model nemotron-3-super:120b + --provider local-vllm \ + --model nvidia/Qwen3.6-35B-A3B-NVFP4 ``` -The output should confirm the route and show a validated endpoint URL, for example: `http://10.110.106.169:11434/v1/chat/completions (openai_chat_completions)`. +The output should confirm the route and show a validated endpoint URL, for example: `http://10.110.106.169:8000/v1/chat/completions (openai_chat_completions)`. 
> [!NOTE] -> If you see `failed to verify inference endpoint` or `failed to connect` (for example because the gateway cannot reach the host IP from inside its container), add `--no-verify` to skip endpoint verification: `openshell inference set --provider local-ollama --model nemotron-3-super:120b --no-verify`. Ensure Ollama is running and listening on all interfaces (see Step 5). +> If you see `failed to verify inference endpoint` or `failed to connect` (for example because the gateway cannot reach the host IP from inside its container), add `--no-verify` to skip endpoint verification: `openshell inference set --provider local-vllm --model nvidia/Qwen3.6-35B-A3B-NVFP4 --no-verify`. Ensure the vLLM server is running and reachable on the Spark's IP (see Step 5). Verify the configuration: @@ -295,7 +259,7 @@ Verify the configuration: openshell inference get ``` -Expected output should show `provider: local-ollama` and `model: nemotron-3-super:120b` (or whichever model you chose). +Expected output should show `provider: local-vllm` and `model: nvidia/Qwen3.6-35B-A3B-NVFP4`. ## Step 8. Deploy OpenShell Sandbox @@ -339,10 +303,10 @@ Use the arrow keys and Enter key to interact with the installation. - Model/auth Provider: Select **Custom Provider**, the second-to-last option. - API Base URL: update to https://inference.local/v1 - How do you want to provide this API key?: Paste API key for now. -- API key: please enter "ollama". +- API key: please enter "vllm" (vLLM does not validate the key; any non-empty value works). - Endpoint compatibility: select **OpenAI-compatible** and press Enter. -- Model ID: enter the model name you chose in Step 5 (e.g. `nemotron-3-super:120b`). - - This may take 1-2 minutes as the Ollama model is spun up in the background. +- Model ID: enter the model handle you served in Step 5: `nvidia/Qwen3.6-35B-A3B-NVFP4`. + - The first request may take a moment while vLLM warms up. - Endpoint ID: leave the default value. 
- Alias: enter the same model name (this is optional). - Channel: Select **Skip for now**. @@ -454,13 +418,13 @@ Now that OpenClaw has been configured within the OpenShell protected runtime, yo openshell sandbox connect $SANDBOX_NAME ``` -Once loaded into the sandbox terminal, you can test connectivity to the Ollama model with this command: +Once loaded into the sandbox terminal, you can test connectivity to the vLLM model with this command: ``` bash -curl https://inference.local/v1/responses \ +curl https://inference.local/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ - "instructions": "You are a helpful assistant.", - "input": "Hello!" + "model": "nvidia/Qwen3.6-35B-A3B-NVFP4", + "messages": [{"role": "user", "content": "Hello!"}] }' ``` @@ -511,7 +475,7 @@ openshell sandbox delete $SANDBOX_NAME Remove the inference provider you created in Step 6: ```bash -openshell provider delete local-ollama +openshell provider delete local-vllm ``` Stop the gateway (preserves state for later): @@ -527,10 +491,11 @@ openshell gateway stop openshell gateway destroy ``` -To also remove the Ollama model: +To also stop and remove the vLLM container and image (`-f` force-removes the container even if it is still running): ```bash -ollama rm nemotron-3-super:120b +docker rm -f $(docker ps -aq --filter ancestor=vllm/vllm-openai:nightly-aarch64) +docker rmi vllm/vllm-openai:nightly-aarch64 ``` ## Step 14. Next steps @@ -548,11 +513,11 @@ ollama rm nemotron-3-super:120b | `openshell status` shows gateway as unhealthy | Gateway container crashed or failed to initialize | Run `openshell gateway destroy` and then `openshell gateway start` to recreate it. Check Docker logs with `docker ps -a` and `docker logs ` for details | | `openshell sandbox create --from openclaw` fails to build | Network issue pulling the community sandbox or Dockerfile build failure | Check internet connectivity. Retry the command. 
If the build fails on a specific package, check if the base image is compatible with your Docker version | | Sandbox is in `Error` phase after creation | Policy validation failed or container startup crashed | Run `openshell logs ` to see error details. Common causes: invalid policy YAML, missing provider credentials, or port conflicts | -| Agent cannot reach `inference.local` inside the sandbox | Inference routing not configured or provider unreachable | Run `openshell inference get` to verify the provider and model are set. Test Ollama is accessible from the host: `curl http://localhost:11434/api/tags`. Ensure the provider URL uses `host.docker.internal` instead of `localhost` | -| 503 verification failed or timeout when gateway/sandbox accesses Ollama on the host | Ollama bound only to localhost, or host firewall blocking port 11434 | Make Ollama listen on all interfaces so the gateway container (e.g. on Docker network 172.17.x.x) can reach it: `OLLAMA_HOST=0.0.0.0 ollama serve &`. Allow port 11434 through the host firewall: `sudo ufw allow 11434/tcp comment 'Ollama for OpenShell Gateway'` (then `sudo ufw reload` if needed). | +| Agent cannot reach `inference.local` inside the sandbox | Inference routing not configured or provider unreachable | Run `openshell inference get` to verify the provider and model are set. Test the vLLM server from the host: `curl http://localhost:8000/v1/models`. Ensure the provider `OPENAI_BASE_URL` uses the Spark's IP address (not `localhost`), since the gateway runs inside Docker | +| 503 verification failed or timeout when gateway/sandbox accesses vLLM on the host | Provider URL points at `localhost`, or host firewall blocking port 8000 | The recipe already binds vLLM to all interfaces (`--host 0.0.0.0`). Confirm the provider `OPENAI_BASE_URL` uses the Spark's IP (from `hostname -I`) so the gateway container (e.g. on Docker network 172.17.x.x) can reach it. 
Allow port 8000 through the host firewall: `sudo ufw allow 8000/tcp comment 'vLLM for OpenShell Gateway'` (then `sudo ufw reload` if needed). | | Agent's outbound connections are all denied | Default policy does not include the required endpoints | Monitor denials with `openshell logs --tail --source sandbox`. Pull the current policy with `openshell policy get --full`, add the needed host/port under `network_policies`, and push with `openshell policy set --policy --wait` | | "Permission denied" or Landlock errors inside the sandbox | Agent trying to access a path not in `read_only` or `read_write` filesystem policy | Pull the current policy and add the path to `read_write` (or `read_only` if read access is sufficient). Push the updated policy. Note: filesystem policy is static and requires sandbox recreation | -| Ollama OOM or very slow inference | Model too large for available memory or GPU contention | Free GPU memory (close other GPU workloads), try a smaller model (e.g., `gpt-oss:20b`), or reduce context length. Monitor with `nvidia-smi` | +| vLLM OOM or very slow inference | Model too large for available memory or GPU contention | Free GPU memory (close other GPU workloads), or relaunch vLLM with a lower `--gpu-memory-utilization` / `--max-model-len` (or a smaller model handle). Monitor with `nvidia-smi` | | `openshell sandbox connect` hangs or times out | Sandbox not in `Ready` phase | Run `openshell sandbox get ` to check the phase. If stuck in `Provisioning`, wait or check logs. If in `Error`, delete and recreate the sandbox | | Policy push returns exit code 1 (validation failed) | Malformed YAML or invalid policy fields | Check the YAML syntax. Common issues: paths not starting with `/`, `..` traversal in paths, `root` as `run_as_user`, or endpoints missing required `host`/`port` fields. 
Fix and re-push | | `openshell gateway start` fails with "K8s namespace not ready" / timed out waiting for namespace | The k3s cluster inside the Docker container takes longer to bootstrap than the CLI timeout allows. The internal components (TLS secrets, Helm chart, namespace creation) may need extra time, especially on first run when images are pulled inside the container. | First, check whether the container is still running and progressing: `docker ps --filter name=openshell` (look for `health: starting`). Inspect k3s state inside the container: `docker exec sh -c "KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl get ns"` and `kubectl get pods -A`. If pods are in `ContainerCreating` and TLS secrets are missing (`navigator-server-tls`, `openshell-server-tls`), the cluster is still bootstrapping — wait a few minutes and run `openshell status` again. If it does not recover, destroy with `openshell gateway destroy` (and `docker rm -f ` if needed) and retry `openshell gateway start`. Ensure Docker has enough resources (memory and disk) for the k3s cluster. | diff --git a/nvidia/vllm/README.md b/nvidia/vllm/README.md index 308696d..5d44665 100644 --- a/nvidia/vllm/README.md +++ b/nvidia/vllm/README.md @@ -9,6 +9,7 @@ - [Run on two Sparks](#run-on-two-sparks) - [Step 11. (Optional) Launch 405B inference server](#step-11-optional-launch-405b-inference-server) - [Run on multiple Sparks through a switch](#run-on-multiple-sparks-through-a-switch) +- [Run Agent Ready Qwen3.6 35B Model with vLLM](#run-agent-ready-qwen36-35b-model-with-vllm) - [Troubleshooting](#troubleshooting) --- @@ -99,8 +100,8 @@ Reminder: not all model architectures are supported for NVFP4 quantization. * **Duration:** 30 minutes for Docker approach * **Risks:** Container registry access requires internal credentials * **Rollback:** Container approach is non-destructive. 
-* **Last Updated:** 06/10/2026 - * Add models +* **Last Updated:** 06/12/2026 + * Add Agent ready model recipe for Qwen3.6 35B ## Instructions @@ -658,6 +659,96 @@ http://:8265 ## - Other models which can fit on the cluster with different quantization methods (FP8, NVFP4) ``` +## Run Agent Ready Qwen3.6 35B Model with vLLM + +## Step 1. Configure Docker permissions + +To manage containers without sudo, your user must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo. + +Open a new terminal and test Docker access. In the terminal, run: +```bash +docker ps +``` + +If you see a permission denied error (for example, `permission denied while trying to connect to the Docker daemon socket`), add your user to the `docker` group so that you don't need to run Docker commands with sudo. + +```bash +sudo usermod -aG docker $USER +newgrp docker +``` + +## Step 2. Pull vLLM container image + +```bash +docker pull vllm/vllm-openai:nightly-aarch64 +``` + +## Step 3. Launch the Agent Ready Qwen3.6 35B server + +Launch the container and start the vLLM server with the agent-ready +`nvidia/Qwen3.6-35B-A3B-NVFP4` recipe. The `vllm/vllm-openai` image entrypoint is +`vllm serve`, so the model handle and flags are passed directly as container arguments. 
+ +```bash +## HuggingFace token (required to download the model) +## Get a token from https://huggingface.co/settings/tokens +export HF_TOKEN="your_huggingface_token" + +docker run -it --gpus all -p 8000:8000 \ + -e HF_TOKEN="$HF_TOKEN" \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + vllm/vllm-openai:nightly-aarch64 \ + nvidia/Qwen3.6-35B-A3B-NVFP4 \ + --host 0.0.0.0 \ + --port 8000 \ + --tensor-parallel-size 1 \ + --trust-remote-code \ + --kv-cache-dtype fp8 \ + --attention-backend flashinfer \ + --moe-backend marlin \ + --gpu-memory-utilization 0.4 \ + --max-model-len 262144 \ + --max-num-seqs 4 \ + --max-num-batched-tokens 8192 \ + --enable-chunked-prefill \ + --async-scheduling \ + --enable-prefix-caching \ + --speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}' \ + --load-format fastsafetensors \ + --reasoning-parser qwen3 \ + --tool-call-parser qwen3_xml \ + --enable-auto-tool-choice +``` + +Expected output should include: +- Model loading confirmation +- Server startup on port 8000 +- GPU memory allocation details + +In another terminal, test the server: + +```bash +curl http://localhost:8000/v1/chat/completions \ +-H "Content-Type: application/json" \ +-d '{ + "model": "nvidia/Qwen3.6-35B-A3B-NVFP4", + "messages": [{"role": "user", "content": "12*17"}], + "max_tokens": 500 +}' +``` + +The response should contain `"content": "204"` or an equivalent statement of the result. + + +## Step 4. Cleanup and rollback + +For the container approach (non-destructive): + +```bash +docker rm $(docker ps -aq --filter ancestor=vllm/vllm-openai:nightly-aarch64) +docker rmi vllm/vllm-openai:nightly-aarch64 +``` + ## Troubleshooting ## Common issues for running on a single Spark