mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-06-21 21:59:30 +00:00
Compare commits: 6461873c40...1c7bc103b1 (5 commits)
| Author | SHA1 | Date |
|---|---|---|
| | 1c7bc103b1 | |
| | b8cc262bed | |
| | 97ae853a23 | |
| | bc6bf2251e | |
| | 48fc5eb30e | |
@@ -21,10 +21,10 @@ Running Hermes and its LLM **fully on your DGX Spark** keeps your conversations

 ## What you'll accomplish

-You will have Hermes installed on your DGX Spark and connected to a local LLM served by Ollama. You can chat with the agent from the DGX Spark terminal and from Telegram on your phone or laptop. The gateway runs as a system service, so the agent stays reachable across reboots without anyone logging in.
+You will have Hermes installed on your DGX Spark and connected to a local LLM served by **vLLM** (the agent-ready `nvidia/Qwen3.6-35B-A3B-NVFP4` recipe). You can chat with the agent from the DGX Spark terminal and from Telegram on your phone or laptop. The gateway runs as a system service, so the agent stays reachable across reboots without anyone logging in.

-- Install Ollama and pull a local model
-- Install Hermes and configure it against the local Ollama endpoint
+- Serve a local model with vLLM
+- Install Hermes and configure it against the local vLLM endpoint
 - Set up a Telegram bot so you can message Hermes from any Telegram client
 - Resume past sessions, switch models, update, and uninstall using the `hermes` CLI

@@ -38,7 +38,7 @@ You will have Hermes installed on your DGX Spark and connected to a local LLM se
 ## What to know before starting

 - Basic use of the Linux terminal and a text editor
-- Familiarity with Ollama or willingness to follow the [Ollama on Spark playbook](https://build.nvidia.com/spark/ollama) first
+- Familiarity with Docker and vLLM, or willingness to follow the [vLLM for Inference playbook](https://build.nvidia.com/spark/vllm) first
 - A Telegram account if you want to use the messaging gateway
 - Awareness of the security considerations below

@@ -54,7 +54,7 @@ Main risks:
 You cannot eliminate all risk; proceed at your own risk. **Recommended security measures:**

 - **Restrict the Telegram bot** by entering one or more numeric Telegram user IDs at the *"Allowed user IDs"* prompt during install. Leaving this blank allows anyone who finds the bot to use it.
-- Keep the Ollama endpoint bound to **`localhost` only**; do not expose `http://<spark-ip>:11434` to your LAN or the public internet without strong authentication.
+- Keep the vLLM endpoint bound to the Spark; do not forward `http://<spark-ip>:8000` to your LAN or the public internet without strong authentication.
 - Run Hermes on a Spark dedicated to this purpose where possible, and only place files on it that the agent is allowed to access.
 - **Monitor activity**: Periodically review the gateway service logs (`sudo journalctl -u <hermes-gateway-unit> -e`) and the Hermes session history.

@@ -64,15 +64,16 @@ You cannot eliminate all risk; proceed at your own risk. **Recommended security
 - Terminal (SSH or local) access to the Spark
 - `curl` and `git` installed (verified in Step 1 of the instructions)
 - Interactive terminal access for the setup wizard and any `sudo` password prompts. Non-interactive SSH is supported with the config-command fallback in the Instructions tab.
-- Enough disk and GPU memory for the Ollama model you plan to serve (the playbook uses `qwen3.6:27b` as the example; pick a smaller model if you want a faster first install)
+- Docker with the NVIDIA Container Toolkit, plus a HuggingFace token to download the model (the playbook serves `nvidia/Qwen3.6-35B-A3B-NVFP4` with vLLM)
 - A Telegram account and the ability to create a bot via [@BotFather](https://t.me/BotFather) if you plan to use the messaging gateway

 ## Time and risk

 - **Duration**: About 30 minutes for install and first-time setup; model download time depends on size and network speed.
 - **Risk level**: **Medium** — the agent can execute commands, persist skills, and is reachable from Telegram. Risk increases if you skip the allowed-user-IDs restriction or expose the local model endpoint beyond `localhost`. Always follow the security measures above.
-- **Rollback**: Run `hermes uninstall` (with `sudo` if you installed the gateway as a system service) to remove Hermes, the gateway service, and the shell-profile entry. The data directory `~/.hermes` may still be present afterward; remove it manually if you want a full reset (see the Cleanup and Troubleshooting tabs). Uninstall Ollama separately if desired.
-- **Last Updated**: 2026-05-08
+- **Rollback**: Run `hermes uninstall` (with `sudo` if you installed the gateway as a system service) to remove Hermes, the gateway service, and the shell-profile entry. The data directory `~/.hermes` may still be present afterward; remove it manually if you want a full reset (see the Cleanup and Troubleshooting tabs). Stop the vLLM container separately (`docker rm`/`docker rmi`) if desired.
+- **Last Updated**: 2026-06-12
+  - Switch local inference backend to vLLM (agent-ready Qwen3.6 35B recipe)
   - First Publication

 ## Instructions
@ -99,36 +100,22 @@ curl -sS --connect-timeout 10 -o /dev/null -w "HTTP %{http_code}\n" https://api.
|
||||
|
||||
You should see an **HTTP status line** such as **`HTTP 404`**, **`HTTP 200`**, or **`HTTP 302`** (Telegram’s edge often answers bare `GET` requests with a short JSON or redirect). The important part is that the request **completes over TLS** without hanging. **Timeouts**, **“Could not resolve host”**, or **connection refused** mean the gateway will not reach Telegram from this network—try a path that allows that traffic (for example a personal hotspot) or ask your network administrator to allow **HTTPS to `api.telegram.org`**.
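When the check fails, curl's exit status usually tells you which of the failure modes above you hit. The sketch below assumes the standard curl exit codes (6 for a DNS failure, 7 for a failed or refused connection, 28 for a timeout, 35 and 60 for TLS problems). The helper name `explain_curl_exit` is hypothetical and not part of the playbook.

```bash
# Hypothetical helper: map a curl exit status to the likely network problem.
explain_curl_exit() {
  case "$1" in
    0)     echo "ok: request completed" ;;
    6)     echo "dns: could not resolve host" ;;
    7)     echo "connect: connection refused or failed" ;;
    28)    echo "timeout: request timed out" ;;
    35|60) echo "tls: handshake or certificate problem" ;;
    *)     echo "other: curl exit $1" ;;
  esac
}

# Usage (needs network access, so it is left commented out here):
#   curl -sS --connect-timeout 10 -o /dev/null https://api.telegram.org
#   explain_curl_exit "$?"
```

A `dns`, `connect`, or `timeout` result points at the network path rather than at Hermes, which matches the guidance above about switching networks or asking for HTTPS access to `api.telegram.org`.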
|
||||
|
||||
## Step 2. Install Ollama and pull a model
|
||||
## Step 2. Serve a model with vLLM
|
||||
|
||||
Hermes will be configured against a local Ollama endpoint, so Ollama must be installed and serving at least one model before you run the Hermes installer. If you have already completed the [Ollama on Spark playbook](https://build.nvidia.com/spark/ollama), you can skip this step.
|
||||
Hermes will be configured against a local, OpenAI-compatible endpoint, so a model server must be running before you launch the Hermes installer. This playbook uses **vLLM** with the agent-ready `nvidia/Qwen3.6-35B-A3B-NVFP4` recipe — the same one documented in the vLLM playbook's [Run Agent Ready Qwen3.6 35B Model with vLLM](https://build.nvidia.com/spark/vllm/agent-ready-qwen35b) tab.
|
||||
|
||||
Install Ollama:
|
||||
Follow that tab to launch the server in a **separate terminal** on the Spark so it can run alongside Hermes. It serves `nvidia/Qwen3.6-35B-A3B-NVFP4` on an OpenAI-compatible API at `http://localhost:8000/v1`.
|
||||
|
||||
Once the server reports `Application startup complete`, verify the API on **8000** in another terminal. A healthy server returns **JSON** with a top-level **`"data"`** array listing the served model:
|
||||
|
||||
```bash
|
||||
curl -fsSL https://ollama.com/install.sh | sh
|
||||
curl -sS http://localhost:8000/v1/models
|
||||
```
|
||||
|
||||
You should see `nvidia/Qwen3.6-35B-A3B-NVFP4` in the returned list.
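If you want a scripted version of this check, the sketch below looks for the quoted model ID in the JSON returned by `/v1/models`. It is a plain substring match rather than a real JSON parse, and the function name `model_is_served` is hypothetical.

```bash
# Hypothetical helper: succeed if the quoted model ID appears in the JSON on stdin.
# This is a fixed-string match, not a JSON parser; it only confirms the ID is present.
model_is_served() {
  grep -Fq "\"$1\""
}

# Usage against the live endpoint (requires the vLLM server from Step 2):
#   curl -sS http://localhost:8000/v1/models \
#     | model_is_served "nvidia/Qwen3.6-35B-A3B-NVFP4" && echo "model is served"
```

A non-zero exit status means the ID was not found, which usually indicates the server is still loading or is serving a different model handle.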
|
||||
|
||||
> [!NOTE]
|
||||
> During `install.sh` you might see a message that **systemd is not running** or that a service could not be enabled. On a normal DGX Spark appliance with systemd this is uncommon. If you are on a minimal container, chroot, or unusual environment, Ollama may still run via the `ollama` CLI once the binary is installed; on a standard Spark, prefer fixing the service (`systemctl status ollama`) if the installer warns. If Ollama otherwise starts and answers on port **11434**, you can treat a one-off installer warning as informational.
|
||||
|
||||
Verify the Ollama daemon is running and the HTTP API on **11434** responds. The command below asks Ollama for the **list of pulled models** (`GET /api/tags`). A healthy daemon returns **JSON** with a top-level **`"models"`** array (it may be empty until you pull a model):
|
||||
|
||||
```bash
|
||||
curl -sS http://localhost:11434/api/tags
|
||||
```
|
||||
|
||||
Optional: confirm the daemon build string:
|
||||
|
||||
```bash
|
||||
curl -sS http://localhost:11434/api/version
|
||||
```
|
||||
|
||||
Pull the model you intend to use with Hermes (this playbook uses `qwen3.6:27b` as the example):
|
||||
|
||||
```bash
|
||||
ollama pull qwen3.6:27b
|
||||
```
|
||||
> Keep the vLLM endpoint bound to the Spark only. The container publishes port `8000`; do not forward `http://<spark-ip>:8000` to your LAN or the public internet without strong authentication.
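
To see which address the published port is actually listening on, you can inspect the output of `ss -ltn`. The sketch below assumes the usual `ss` column layout, with the local address and port in the fourth column. The function name `port_exposed_beyond_localhost` is hypothetical.

```bash
# Hypothetical helper: read `ss -ltn` output on stdin and succeed if the given
# port is listening on any address other than loopback (127.x.x.x or [::1]).
port_exposed_beyond_localhost() {
  awk -v port="$1" '
    $4 ~ (":" port "$") && $4 !~ /^(127\.|\[::1\])/ { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Usage on the Spark (commented out because it inspects the live host):
#   ss -ltnH | port_exposed_beyond_localhost 8000 \
#     && echo "port 8000 is reachable beyond localhost; check your firewall"
```

A listener on `0.0.0.0:8000` or `[::]:8000` is not a problem by itself, but it means a firewall or network policy is the only thing keeping the endpoint off your LAN.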
|
||||
|
||||
## Step 3. Install Hermes
|
||||
|
||||
@ -149,15 +136,15 @@ The installer will walk you through an interactive setup. Respond to each prompt
|
||||
|
||||
3. **"Select Provider"** — Choose **Custom endpoint (enter URL manually)** so Hermes can be pointed at the model endpoint running on your DGX Spark.
|
||||
|
||||
4. **"API base URL [e.g. https://api.example.com/v1]:"** — *If this prompt appears*, enter the URL of your local model server. For a local Ollama endpoint, use `http://localhost:11434/v1`. (Depending on installer version or prior config, this question is sometimes skipped when the endpoint is already inferred—continue with the prompts you do see.)
|
||||
4. **"API base URL [e.g. https://api.example.com/v1]:"** — *If this prompt appears*, enter the URL of your local model server. For the local vLLM endpoint from Step 2, use `http://localhost:8000/v1`. (Depending on installer version or prior config, this question is sometimes skipped when the endpoint is already inferred—continue with the prompts you do see.)
|
||||
|
||||
5. **"API key [optional]"** — Leave blank and press **Enter**; no key is required for a local model.
|
||||
5. **"API key [optional]"** — Leave blank and press **Enter**; vLLM does not require a key for a local model.
|
||||
|
||||
6. **Model selection** — The installer lists the models available from your local Ollama instance. Select one to use with Hermes (for example, `qwen3.6:27b`).
|
||||
6. **Model selection** — The installer lists the models served by your local endpoint (vLLM reports these via `/v1/models`). Select `nvidia/Qwen3.6-35B-A3B-NVFP4`.
|
||||
|
||||
7. **"Context length in tokens [leave blank for auto-detect]:"** — Press **Enter** to let Hermes auto-detect the context length from the selected model.
|
||||
7. **"Context length in tokens [leave blank for auto-detect]:"** — Press **Enter** to let Hermes auto-detect the context length from the served model (the recipe serves `--max-model-len 262144`).
|
||||
|
||||
8. **"Display name [Local (localhost:11434)]"** — Press **Enter** to accept the suggested label, or type a custom name to identify this endpoint in the Hermes UI.
|
||||
8. **"Display name [Local (localhost:8000)]"** — Press **Enter** to accept the suggested label, or type a custom name to identify this endpoint in the Hermes UI.
|
||||
|
||||
9. **"Connect a messaging platform? (Telegram, Discord, etc.)"** — Choose **Set up messaging now (recommended)** to configure a gateway during installation.
|
||||
|
||||
@ -190,17 +177,17 @@ The installer will walk you through an interactive setup. Respond to each prompt
|
||||
|
||||
#### Non-interactive SSH fallback
|
||||
|
||||
If the installer prints **"Setup wizard skipped (no terminal available)"**, or if you are validating the playbook through non-interactive SSH, configure the local Ollama endpoint with Hermes' config command:
|
||||
If the installer prints **"Setup wizard skipped (no terminal available)"**, or if you are validating the playbook through non-interactive SSH, configure the local vLLM endpoint with Hermes' config command:
|
||||
|
||||
```bash
|
||||
export PATH="$HOME/.local/bin:$PATH"
|
||||
hermes config set model.provider custom
|
||||
hermes config set model.base_url http://localhost:11434/v1
|
||||
hermes config set model.default qwen3.6:27b
|
||||
hermes config set model.base_url http://localhost:8000/v1
|
||||
hermes config set model.default nvidia/Qwen3.6-35B-A3B-NVFP4
|
||||
hermes -z "Reply exactly HERMES_OK"
|
||||
```
|
||||
|
||||
The last command should return `HERMES_OK`, confirming that Hermes can call the local Ollama model without opening the TUI.
|
||||
The last command should return `HERMES_OK`, confirming that Hermes can call the local vLLM model without opening the TUI.
|
||||
|
||||
#### Sudo and `hermes` PATH
|
||||
|
||||
@ -231,15 +218,11 @@ sudo journalctl -u <hermes-gateway-unit> -e --no-pager -n 50
|
||||
|
||||
If `systemctl status` or `systemctl --user status` shows **active (running)** and logs are not repeating connection errors to Telegram, the service side is in good shape. If logs show TLS timeouts or “connection refused” to Telegram hosts, re-run the **outbound HTTPS** check at the top of this page.
|
||||
|
||||
## Step 4. Switch to a different Ollama model (optional)
|
||||
## Step 4. Switch to a different model (optional)
|
||||
|
||||
You configured an initial model during the Hermes install. To switch to a different one later, pull the new model with Ollama and then re-point Hermes at the same local endpoint.
|
||||
You configured an initial model during the Hermes install. To switch to a different one later, restart vLLM serving the new model handle, then re-point Hermes at the same local endpoint.
|
||||
|
||||
1. Pull the new model with Ollama (replace `<model-name>` with the model you want):
|
||||
|
||||
```bash
|
||||
ollama pull <model-name>
|
||||
```
|
||||
1. Stop the current vLLM container (Ctrl+C in its terminal) and relaunch it with the new model handle in place of `nvidia/Qwen3.6-35B-A3B-NVFP4`. Use the same `docker run` invocation from the vLLM playbook's [Run Agent Ready Qwen3.6 35B Model with vLLM](https://build.nvidia.com/spark/vllm/agent-ready-qwen35b) tab, swapping the model handle (and any flags appropriate for the new model).
|
||||
|
||||
2. Launch the Hermes model picker:
|
||||
|
||||
@ -249,20 +232,20 @@ You configured an initial model during the Hermes install. To switch to a differ
|
||||
|
||||
3. At the **"Select Provider"** prompt, choose **Custom endpoint (enter URL manually)**.
|
||||
|
||||
4. **If you see the “API base URL” prompt**, enter the same local Ollama endpoint as before:
|
||||
4. **If you see the “API base URL” prompt**, enter the same local vLLM endpoint as before:
|
||||
|
||||
```
|
||||
http://localhost:11434/v1
|
||||
http://localhost:8000/v1
|
||||
```
|
||||
|
||||
5. When the installer lists the models served by Ollama, choose the one you just pulled. Hermes will use it for subsequent sessions.
|
||||
5. When Hermes lists the models served by the endpoint, choose the one you just started serving. Hermes will use it for subsequent sessions.
|
||||
|
||||
If you are in a non-interactive SSH session, switch models with config commands instead:
|
||||
|
||||
```bash
|
||||
hermes config set model.provider custom
|
||||
hermes config set model.base_url http://localhost:11434/v1
|
||||
hermes config set model.default <model-name>
|
||||
hermes config set model.base_url http://localhost:8000/v1
|
||||
hermes config set model.default <new-model-handle>
|
||||
hermes -z "Reply exactly MODEL_OK"
|
||||
```
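If you switch models often, the three `hermes config set` commands can be generated from the model handle. The sketch below only prints the commands so you can review them before running anything. The function name `hermes_switch_commands` is hypothetical, and the default base URL is the vLLM endpoint used in this playbook.

```bash
# Hypothetical helper: print the config commands for a given model handle.
# Arguments: $1 = model handle, $2 = base URL (defaults to the local vLLM endpoint).
hermes_switch_commands() {
  model="$1"
  base_url="${2:-http://localhost:8000/v1}"
  printf 'hermes config set model.provider custom\n'
  printf 'hermes config set model.base_url %s\n' "$base_url"
  printf 'hermes config set model.default %s\n' "$model"
}

# Usage: print the commands, review them, then run them yourself.
#   hermes_switch_commands "nvidia/Qwen3.6-35B-A3B-NVFP4"
```

Printing rather than executing keeps the helper safe to use in a non-interactive session where you cannot easily undo a wrong model handle.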
|
||||
|
||||
@ -347,7 +330,7 @@ rm -rf ~/.hermes
|
||||
| `hermes: command not found` after install | Shell profile not reloaded in the current session | Run `source ~/.bashrc` (or `source ~/.zshrc`) and retry. Open a new terminal if the issue persists. |
|
||||
| `source ~/.bashrc` works in an interactive terminal, but `hermes` is still missing from a scripted SSH command | Many Ubuntu `.bashrc` files return early for non-interactive shells before the installer-added PATH lines run | In automation, run `export PATH="$HOME/.local/bin:$PATH"` before `hermes`, or call `~/.local/bin/hermes` directly. |
|
||||
| `sudo: hermes: command not found` during gateway install, uninstall, or printed `sudo hermes …` steps | `sudo` resets `PATH` and does not see the user-level `hermes` shim | Run `which hermes` as your normal user, then invoke that path with sudo, e.g. `sudo "$(which hermes)" uninstall` or `sudo /full/path/from/which/hermes gateway …`. |
|
||||
| Installer prints **"Setup wizard skipped (no terminal available)"** | The installer was launched from a non-interactive shell, CI job, or SSH command without a usable TTY | Either re-run `hermes setup` in an interactive terminal, or configure Ollama directly: `hermes config set model.provider custom`, `hermes config set model.base_url http://localhost:11434/v1`, and `hermes config set model.default qwen3.6:27b`. |
|
||||
| Installer prints **"Setup wizard skipped (no terminal available)"** | The installer was launched from a non-interactive shell, CI job, or SSH command without a usable TTY | Either re-run `hermes setup` in an interactive terminal, or configure the endpoint directly: `hermes config set model.provider custom`, `hermes config set model.base_url http://localhost:8000/v1`, and `hermes config set model.default nvidia/Qwen3.6-35B-A3B-NVFP4`. |
|
||||
| Installer cannot install `ripgrep` / `ffmpeg`, or prints `Non-interactive mode and no terminal available` | Optional helper install needs `sudo`, but the current shell cannot prompt for a password | Install manually in an interactive terminal with `sudo apt install -y ripgrep ffmpeg`. Hermes still runs without them, but file search is slower and TTS voice-message support is limited. |
|
||||
| Browser tools show `system dependency not met`, or Playwright Chromium install fails | Playwright needs Linux shared libraries installed through `sudo`, and the installer could not obtain sudo access | Core chat and Telegram can still work. To enable browser tools, run `cd ~/.hermes/hermes-agent && npx playwright install --with-deps chromium` in an interactive terminal and enter your sudo password. |
|
||||
| You want the gateway to start at boot, but `hermes gateway install` creates a user service | Current Hermes installs a user service by default unless `--system` is supplied | Use `sudo "$(which hermes)" gateway install --system --run-as-user "$USER"` (or replace `$(which hermes)` with `~/.local/bin/hermes` if needed). |
|
||||
@ -357,11 +340,11 @@ rm -rf ~/.hermes
|
||||
| Choosing **Telegram** during install immediately shows “setup complete” without token / user ID prompts | Stale or partial Hermes gateway config; installer short-circuit | After `source ~/.bashrc`, run **`hermes gateway setup`**, select Telegram, and complete token and allowed-user steps. Install or restart the systemd service using the printed commands (with `sudo "$(which hermes)"` if needed). |
|
||||
| `/start` shows “Unknown command” (or similar) in Telegram | Bot does not define a custom `/start` handler | Send a normal text message such as **`hello`** after `/start`. Hermes responds to conversational text, not necessarily slash commands. |
|
||||
| `~/.hermes` still exists after `uninstall` | Uninstaller preserves data unless you explicitly remove it | This is expected in some flows. Remove manually only if you want a full wipe: `rm -rf ~/.hermes` (see **Start over from scratch**). |
|
||||
| Hermes installer can't list any models at the model-selection prompt | Ollama is not running or has no models pulled | Sanity-check Ollama in another terminal: list installed models with `ollama list`, hit the API with `curl http://localhost:11434/api/tags`, and confirm a model can actually serve requests by running `ollama run <model-name>` (e.g. `ollama run qwen3.6:27b`) and sending a test prompt. If the list is empty or the API is unreachable, start Ollama and pull a model with `ollama pull <model-name>`, then re-run the Hermes installer. |
|
||||
| `Connection refused` to `http://localhost:11434/v1` from Hermes | Ollama service not running on the default port | Start the Ollama service and confirm it is listening on `11434`. On systemd hosts: `systemctl status ollama` and `systemctl start ollama`. |
|
||||
| Hermes installer can't list any models at the model-selection prompt | vLLM is not running yet or is still loading the checkpoint | Sanity-check the endpoint in another terminal: `curl http://localhost:8000/v1/models` should return a `"data"` array containing `nvidia/Qwen3.6-35B-A3B-NVFP4`. If it is empty or unreachable, confirm the vLLM container is up and has finished loading (watch its terminal for `Application startup complete`), then re-run the Hermes installer. |
|
||||
| `Connection refused` to `http://localhost:8000/v1` from Hermes | vLLM server not running, still loading, or wrong port | Confirm the vLLM container is up and listening on `8000` (`docker ps`, then `curl http://localhost:8000/v1/models`). If it exited, relaunch it (see Instructions — Step 2). |
|
||||
| Pasting the Telegram bot token shows nothing on the screen | Expected — the installer hides token characters as a security measure | Paste the token, then press **Enter**. The installer should respond with `Telegram token saved`. |
|
||||
| Telegram bot does not reply when you send `hello` | Gateway service not running, your account is not in the allowed user IDs list, **or outbound HTTPS to Telegram is blocked** | (1) Confirm Telegram HTTPS from the Spark (Instructions — network check). (2) List Hermes units with `systemctl list-units --type=service --all`, locate the gateway unit by name, then `sudo systemctl status <hermes-gateway-unit>` and `sudo journalctl -u <hermes-gateway-unit> -e --no-pager -n 80`. (3) If logs show reachability to Telegram but messages are ignored, verify your numeric user ID is in the allowed list via `hermes gateway setup` or the [Hermes messaging gateway docs](https://hermes-agent.nousresearch.com/docs/user-guide/messaging). |
|
||||
| Out-of-memory or very slow inference | Selected Ollama model is too large for available GPU memory, or other GPU workloads are competing | Check usage with `nvidia-smi`, free GPU memory by closing other workloads, or pull a smaller model with `ollama pull <smaller-model>` and switch to it via `hermes model`. |
|
||||
| Out-of-memory or very slow inference | Served model is too large for available GPU memory, or other GPU workloads are competing | Check usage with `nvidia-smi`, free GPU memory by closing other workloads, or relaunch vLLM with a lower `--gpu-memory-utilization` / `--max-model-len` (or a smaller model handle) and re-point Hermes via `hermes model`. |
|
||||
| `hermes update` fails or the gateway does not restart | Gateway service still bound to the previous version, or insufficient permissions on a system-service install | Re-run `sudo "$(which hermes)" update` if the gateway was installed as a **System service** and plain `hermes update` cannot restart it. If the service is stuck, restart it manually: `sudo systemctl restart <hermes-gateway-unit>`. |
|
||||
| Cannot resume a previous session | The `<sessionId>` value is missing or wrong | Use `hermes --resume <sessionId>` with the exact ID Hermes printed when you `/exit` that chat. If the ID is lost, start a new session with `hermes` (omit `--resume`). |
|
||||
|
||||
|
||||
@ -1,6 +1,6 @@
|
||||
# Run NemoClaw with a Local LLM
|
||||
|
||||
> Build your first local AI assistant on DGX Spark using NemoClaw and Ollama in a secure sandbox, with optional Telegram.
|
||||
> Build your first local AI assistant on DGX Spark using NemoClaw and vLLM in a secure sandbox, with optional Telegram.
|
||||
|
||||
|
||||
## Table of Contents
|
||||
@ -31,9 +31,9 @@
|
||||
|
||||
## Basic idea
|
||||
|
||||
**NVIDIA NemoClaw** is an open-source reference stack that simplifies running OpenClaw always-on assistants more safely. It installs the **NVIDIA OpenShell** runtime — an environment designed for executing agents with additional security — and connects them to **local Ollama** inference on your DGX Spark. A single installer command (`nemoclaw.sh`) handles Node.js, OpenShell, and the NemoClaw CLI; the **onboard** wizard then creates a sandboxed agent, optional **Brave Search**, optional **messaging channels** (Telegram, Discord, or Slack), and a **policy tier** with network presets.
|
||||
**NVIDIA NemoClaw** is an open-source reference stack that simplifies running OpenClaw always-on assistants more safely. It installs the **NVIDIA OpenShell** runtime — an environment designed for executing agents with additional security — and connects them to **local vLLM** inference on your DGX Spark. A single installer command (`nemoclaw.sh`) handles Node.js, OpenShell, and the NemoClaw CLI; the **onboard** wizard then creates a sandboxed agent, optional **Brave Search**, optional **messaging channels** (Telegram, Discord, or Slack), and a **policy tier** with network presets.
|
||||
|
||||
By the end of this playbook you will have a working AI agent inside an OpenShell sandbox, reachable through the **Web UI** or **terminal TUI**, with inference routed to **local Ollama** on the Spark. You can optionally add **Telegram** (with **cloudflared** for a public webhook URL) and optional **web search** — all without exposing your host filesystem or network beyond what you explicitly allow in policy.
|
||||
By the end of this playbook you will have a working AI agent inside an OpenShell sandbox, reachable through the **Web UI** or **terminal TUI**, with inference routed to **local vLLM** on the Spark. You can optionally add **Telegram** (with **cloudflared** for a public webhook URL) and optional **web search** — all without exposing your host filesystem or network beyond what you explicitly allow in policy.
|
||||
|
||||
### What you'll accomplish
|
||||
|
||||
@ -118,7 +118,8 @@ All required assets are handled by the NemoClaw installer. No manual cloning is
|
||||
|
||||
- **Estimated time:** About 30–60 minutes for a first full pass (install, onboard, model download depending on choice and network). Optional Brave, Telegram, and cloudflared steps add time if you do them in a second session.
|
||||
- **Risk level:** Medium — you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts.
|
||||
- **Last Updated:** 06/01/2026
|
||||
- **Last Updated:** 06/12/2026
|
||||
- Switch local inference backend to vLLM (agent-ready Qwen3.6 35B recipe)
|
||||
- Pin nemoclaw installer to v0.0.55, the latest stable version
|
||||
|
||||
## Instructions
|
||||
@ -147,8 +148,8 @@ The installer requires **Node.js 22.16+** (installed automatically if missing).
|
||||
|
||||
During custom setup, the onboard wizard walks you through:
|
||||
|
||||
1. **Configuring inference** -- Choose to set up local inference on your Spark by selecting **`7) Local Ollama`**.
|
||||
2. **Ollama models** -- Choose desired inference model. If no model is present locally, the installer will download **`qwen3.6:35b`** automatically.
|
||||
1. **Configuring inference** -- Choose to set up local inference on your Spark by selecting **`Local vLLM`** (the default).
|
||||
2. **vLLM models** -- Choose desired inference model. If no model is present locally, the installer will download **`nvidia/Qwen3.6-35B-A3B-NVFP4`** automatically.
|
||||
3. **Sandbox name** -- Pick a name (e.g. my-assistant). Each sandbox requires a unique name.
|
||||
4. **Apply this configuration** -- Enter `Y` to confirm setting up local inference.
|
||||
5. **Enable Brave Web Search** -- Optional. If you enable it, paste a [Brave Search API](https://brave.com/search/api/) key when prompted.
|
||||
@ -160,7 +161,7 @@ When complete you will see output like:
|
||||
```text
|
||||
──────────────────────────────────────────────────
|
||||
Sandbox my-assistant (Landlock + seccomp + netns)
|
||||
Model <your-selected-model> (Local Ollama)
|
||||
Model <your-selected-model> (Local vLLM)
|
||||
──────────────────────────────────────────────────
|
||||
Run: nemoclaw my-assistant connect
|
||||
Status: nemoclaw my-assistant status
|
||||
@ -377,13 +378,13 @@ openshell forward stop <port> # stop the dashboard forward (use the port shown
|
||||
|
||||
### Step 8. Uninstall NemoClaw
|
||||
|
||||
The NemoClaw CLI includes a built-in uninstaller. It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and Ollama are preserved.
|
||||
The NemoClaw CLI includes a built-in uninstaller. It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and the vLLM container image are preserved.
|
||||
|
||||
```bash
|
||||
nemoclaw uninstall --yes
|
||||
```
|
||||
|
||||
To remove everything including the Ollama model:
|
||||
To remove everything including the downloaded model weights:
|
||||
|
||||
```bash
|
||||
nemoclaw uninstall --yes --delete-models
|
||||
@ -395,7 +396,7 @@ nemoclaw uninstall --yes --delete-models
|
||||
|------|--------|
|
||||
| `--yes` | Skip the confirmation prompt |
|
||||
| `--keep-openshell` | Leave the `openshell` binary in place |
|
||||
| `--delete-models` | Also remove the Ollama models pulled by NemoClaw |
|
||||
| `--delete-models` | Also remove the model weights pulled by NemoClaw |
|
||||
|
||||
> [!NOTE]
|
||||
> If the `nemoclaw` CLI is not available (e.g. install failed partway), use the remote uninstaller as a fallback:
|
||||
@ -408,7 +409,7 @@ The uninstaller runs 6 steps:
|
||||
2. Delete all OpenShell sandboxes, the NemoClaw gateway, and providers
|
||||
3. Remove the global `nemoclaw` npm package
|
||||
4. Remove NemoClaw/OpenShell Docker containers, images, and volumes
|
||||
5. Remove Ollama models (only with `--delete-models`)
|
||||
5. Remove downloaded model weights (only with `--delete-models`)
|
||||
6. Remove state directories (`~/.nemoclaw`, `~/.config/openshell`, `~/.config/nemoclaw`) and the OpenShell binary
|
||||
|
||||
> [!NOTE]
|
||||
@ -427,8 +428,8 @@ The uninstaller runs 6 steps:
|
||||
| `nemoclaw my-assistant dashboard-url --quiet` | Print the full tokenized Web UI URL (includes auto-assigned port) |
|
||||
| `openshell term` | Open the monitoring TUI on the host |
|
||||
| `openshell forward list` | List active port forwards |
|
||||
| `nemoclaw uninstall --yes` | Remove NemoClaw (preserves Docker, Node.js, Ollama) |
|
||||
| `nemoclaw uninstall --yes --delete-models` | Remove NemoClaw and Ollama models |
|
||||
| `nemoclaw uninstall --yes` | Remove NemoClaw (preserves Docker, Node.js, vLLM image) |
|
||||
| `nemoclaw uninstall --yes --delete-models` | Remove NemoClaw and downloaded model weights |
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
@ -442,8 +443,8 @@ The uninstaller runs 6 steps:
|
||||
| Gateway fails with "port 8080 is held by container..." | Another OpenShell gateway or container is using port 8080 | Stop the conflicting container: `openshell gateway destroy -g <old-gateway-name>` or `docker stop <container-name> && docker rm <container-name>`, then retry `nemoclaw onboard`. |
|
||||
| Sandbox creation fails | Stale gateway state or DNS not propagated | Run `openshell gateway destroy && openshell gateway start`, then re-run the installer or `nemoclaw onboard`. |
|
||||
| CoreDNS crash loop | Known issue on some DGX Spark configurations | Re-run the NemoClaw installer (`curl -fsSL https://www.nvidia.com/nemoclaw.sh \| bash`) which includes the CoreDNS fix. If the issue persists, see [NemoClaw troubleshooting](https://docs.nvidia.com/nemoclaw/latest/reference/troubleshooting.html). |
|
||||
| "No GPU detected" during onboard | DGX Spark GB10 reports unified memory differently | Expected on DGX Spark. The wizard still works and uses Ollama for inference. |
|
||||
| Inference timeout or hangs | Ollama not running or not reachable | Check Ollama: `curl http://127.0.0.1:11434`. If not running: `sudo systemctl restart ollama`. Verify the NemoClaw auth proxy is healthy: `curl http://127.0.0.1:11435/api/tags`. If both respond, check `nemoclaw my-assistant status` for the Inference health line. |
|
||||
| "No GPU detected" during onboard | DGX Spark GB10 reports unified memory differently | Expected on DGX Spark. The wizard still works and uses vLLM for inference. |
|
||||
| Inference timeout or hangs | vLLM not running or not reachable | Check the vLLM server: `curl http://127.0.0.1:8000/v1/models` should list `nvidia/Qwen3.6-35B-A3B-NVFP4`. If it hangs, the model may still be loading — wait for `Application startup complete`. Then check `nemoclaw my-assistant status` for the Inference health line. |
|
||||
| Agent gives no response or is very slow | First response can be slow, especially with larger models | Response time depends on model size (30B: a few seconds, 120B: 30–90 seconds). Verify inference route: `nemoclaw my-assistant status`. |
|
||||
| Port 18789 already in use | Another process is bound to the port | `lsof -i :18789` then `kill <PID>`. If needed, `kill -9 <PID>` to force-terminate. |
|
||||
| Web UI port forward dies or dashboard unreachable | Port forward not active | `openshell forward stop 18789 my-assistant` then `openshell forward start 18789 my-assistant --background`. |
|
||||
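The "Inference timeout or hangs" row amounts to waiting until the models endpoint answers. A minimal sketch of that wait loop follows; the function name `wait_for_vllm`, the attempt count, and the delay are illustrative assumptions, not part of NemoClaw or vLLM.

```bash
# Hypothetical helper: poll the vLLM models endpoint until it answers.
# The default URL is the one from the troubleshooting row; the attempt
# count and delay are illustrative defaults, not values from the playbook.
wait_for_vllm() {
  local url="${1:-http://127.0.0.1:8000/v1/models}"
  local attempts="${2:-60}"
  local delay="${3:-5}"
  local i
  for ((i = 1; i <= attempts; i++)); do
    # --fail returns non-zero on HTTP errors; --max-time avoids hanging forever.
    if curl --silent --fail --max-time 5 "$url" >/dev/null; then
      echo "vLLM answered after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "vLLM did not answer at ${url} after ${attempts} attempts" >&2
  return 1
}

# Usage (uncomment to run against a live server):
# wait_for_vllm http://127.0.0.1:8000/v1/models 60 5 && nemoclaw my-assistant status
```

Polling the endpoint rather than sleeping for a fixed time keeps the check short when the model is already loaded and bounded when it is not.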
|
||||
@ -1,6 +1,6 @@
|
||||
# OpenClaw 🦞
|
||||
|
||||
> Run OpenClaw locally on DGX Spark with LM Studio or Ollama
|
||||
> Run OpenClaw locally on DGX Spark with a vLLM-served local model
|
||||
|
||||
## Table of Contents
|
||||
|
||||
@ -20,7 +20,7 @@ Running OpenClaw and its LLMs **fully on your DGX Spark** keeps your data privat
|
||||
|
||||
## What you'll accomplish
|
||||
|
||||
You will have OpenClaw installed on your DGX Spark and connected to a local LLM (via LM Studio or Ollama). You can use the OpenClaw web UI to chat with your agent, and optionally connect communication channels and skills. The agent and models run entirely on your Spark—no data leaves your machine unless you add cloud or external integrations.
|
||||
You will have OpenClaw installed on your DGX Spark and connected to a local LLM served by **vLLM** (the agent-ready `nvidia/Qwen3.6-35B-A3B-NVFP4` recipe). You can use the OpenClaw web UI to chat with your agent, and optionally connect communication channels and skills. The agent and models run entirely on your Spark—no data leaves your machine unless you add cloud or external integrations.
|
||||
|
||||
## Popular use cases
|
||||
|
||||
@ -32,7 +32,7 @@ You will have OpenClaw installed on your DGX Spark and connected to a local LLM
|
||||
## What to know before starting
|
||||
|
||||
- Basic use of the Linux terminal and a text editor
|
||||
- Optional: familiarity with Ollama or LM Studio if you plan to use a local model
|
||||
- Optional: familiarity with Docker and vLLM if you plan to use a local model
|
||||
- Awareness of the security considerations below
|
||||
|
||||
## Important: security and risks
|
||||
@ -61,10 +61,11 @@ You cannot eliminate all risk; proceed at your own risk. **Critical security mea
|
||||
|
||||
## Time and risk
|
||||
|
||||
- **Duration**: About 30 minutes for install and first-time model setup; model download time depends on size and network (gpt-oss-120b is ~65GB and may take longer on slower connections).
|
||||
- **Duration**: About 30 minutes for install and first-time model setup; model download time depends on size and network (the NVFP4 checkpoint is downloaded once and cached for later launches).
|
||||
- **Risk level**: **Medium to High**—the agent has access to whatever files, tools, and channels you configure. Risk increases significantly if you enable terminal/command execution skills or connect external accounts. Without proper isolation, this setup could expose sensitive data or allow code execution. **Always follow the security measures above.**
|
||||
- **Rollback**: You can stop the OpenClaw gateway and uninstall via the same install script or by removing its directory; uninstall Ollama or LM Studio separately if desired.
|
||||
- **Last Updated**: 03/11/2026
|
||||
- **Rollback**: You can stop the OpenClaw gateway and uninstall via the same install script or by removing its directory; stop the vLLM container separately (`docker rm`/`docker rmi`) if desired.
|
||||
- **Last Updated**: 06/12/2026
|
||||
- Switch local inference backend to vLLM (agent-ready Qwen3.6 35B recipe)
|
||||
- First Publication
|
||||
|
||||
## Instructions
|
||||
@ -106,85 +107,21 @@ Work through the prompts as follows.
|
||||
|
||||
You can now open the OpenClaw dashboard in a browser using the URL and token from the installer.
|
||||
|
||||
## Step 3. Choose and install a local LLM backend
|
||||
## Step 3. Serve the model with vLLM on your DGX Spark
|
||||
|
||||
OpenClaw can use a local LLM via **LM Studio** (best raw performance, uses Llama.cpp) or **Ollama** (simpler and good for deployment). Use a **separate terminal** on your DGX Spark for the backend so the gateway and the model server can run side by side.
|
||||
OpenClaw will connect to a local, OpenAI-compatible endpoint served by **vLLM**. This playbook uses the agent-ready `nvidia/Qwen3.6-35B-A3B-NVFP4` recipe — the same one documented in the vLLM playbook's [Run Agent Ready Qwen3.6 35B Model with vLLM](https://build.nvidia.com/spark/vllm/agent-ready-qwen35b) tab. The NVFP4 quantization and speculative decoding give strong tool-calling and reasoning quality while leaving headroom on DGX Spark's 128GB unified memory.
|
||||
|
||||
**Install one of the following:**
|
||||
In a **separate terminal** on your DGX Spark, follow the vLLM playbook's [Run Agent Ready Qwen3.6 35B Model with vLLM](https://build.nvidia.com/spark/vllm/agent-ready-qwen35b) tab to launch the server. Keeping it in its own terminal lets the gateway and the model server run side by side. That tab serves `nvidia/Qwen3.6-35B-A3B-NVFP4` on an OpenAI-compatible API at `http://localhost:8000/v1`.
|
||||
|
||||
**Option A – LM Studio**
|
||||
Once the server reports `Application startup complete`, verify it from another terminal before continuing:
|
||||
|
||||
```bash
|
||||
curl -fsSL https://lmstudio.ai/install.sh | bash
|
||||
curl http://localhost:8000/v1/models
|
||||
```
|
||||
|
||||
**Option B – Ollama**
|
||||
You should see `nvidia/Qwen3.6-35B-A3B-NVFP4` in the returned list.
|
||||
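The check above can also be scripted instead of read by eye. The sketch below assumes the endpoint returns the usual OpenAI-style list (a `data` array of objects with an `id` field); the helper name `model_is_served` is hypothetical.

```bash
# Hypothetical helper: succeed if an OpenAI-style /v1/models response on
# stdin lists the given model id. Whitespace is stripped first so compact
# and pretty-printed JSON both match; this is a plain string match, not a
# full JSON parse.
model_is_served() {
  local model="$1"
  tr -d '[:space:]' | grep -Fq "\"id\":\"${model}\""
}

# Usage against a live server:
# curl -s http://localhost:8000/v1/models \
#   | model_is_served nvidia/Qwen3.6-35B-A3B-NVFP4 && echo "model is being served"
```

A fixed-string match (`grep -F`) is used so the dots and slashes in the model handle are not interpreted as regular-expression characters.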
|
||||
```bash
|
||||
curl -fsSL https://ollama.com/install.sh | sh
|
||||
```
|
||||
|
||||
## Step 4. Select and download a model
|
||||
|
||||
Model quality and capability scale with size. Free as much GPU memory as possible (avoid other GPU workloads, enable only the skills you need). DGX Spark has **128GB unified memory**, so you can run large models with room to spare.
|
||||
|
||||
**Suggested models by GPU memory:**
|
||||
|
||||
| GPU memory | Suggested model | Model size | Notes |
|
||||
|-------------|-------------------------------------|-----------|-------|
|
||||
| 8–12 GB | qwen3-4B-Thinking-2507 | ~5GB | — |
|
||||
| 16 GB | gpt-oss-20b | ~12GB | Lower latency, good for interactive use |
|
||||
| 24–48 GB | Nemotron-3-Nano-30B-A3B | ~20GB | — |
|
||||
| 128 GB | gpt-oss-120b | ~65GB | **Best quality on DGX Spark** (quantized); leaves ~63GB for context window and other processes; use 20B/30B if you prefer faster responses |
|
||||
|
||||
**Quality vs. latency:** The 120B model gives the best accuracy and capability but has higher per-token latency. If you prefer snappier replies, use **gpt-oss-20b** (or a 30B model) instead; both run comfortably on DGX Spark with plenty of memory headroom.
|
||||
|
||||
**Download the model:**
|
||||
|
||||
**LM Studio**
|
||||
|
||||
```bash
|
||||
lms get openai/gpt-oss-120b
|
||||
```
|
||||
|
||||
**Ollama**
|
||||
|
||||
```bash
|
||||
ollama pull gpt-oss:120b
|
||||
```
|
||||
|
||||
(Use the model name that matches your choice from the table; adjust the `lms get` or `ollama pull` command accordingly.)
|
||||
|
||||
## Step 5. Run the model with a large context window
|
||||
|
||||
OpenClaw works best with a context window of **32K tokens or more**.
|
||||
|
||||
**LM Studio**
|
||||
|
||||
```bash
|
||||
lms load openai/gpt-oss-120b --context-length 32768
|
||||
```
|
||||
|
||||
**Ollama**
|
||||
|
||||
```bash
|
||||
ollama run gpt-oss:120b
|
||||
```
|
||||
|
||||
Once the interactive prompt appears, set the context window (type the following at the Ollama prompt; do not include any `>>>` prefix):
|
||||
|
||||
```
|
||||
/set parameter num_ctx 32768
|
||||
```
|
||||
|
||||
Keep this terminal (or process) running so the model stays loaded. You can now chat with the model or press Ctrl+D to exit the interactive mode while keeping the model server running.
|
||||
|
||||
> [!TIP]
|
||||
> **If you see out-of-memory (OOM) errors:** Try a smaller context (e.g. `16384`) or switch to a smaller model (e.g. gpt-oss-20b). Monitor memory with `nvidia-smi` while the model is loaded.
|
||||
|
||||
## Step 6. Configure OpenClaw to use your local model
|
||||
|
||||
**If you use LM Studio:**
|
||||
## Step 4. Configure OpenClaw to use the vLLM server
|
||||
|
||||
1. Open the OpenClaw config file in your preferred editor (e.g. `nano`, `vim`, or a graphical editor). The config path is:
|
||||
```bash
|
||||
@ -195,21 +132,21 @@ Keep this terminal (or process) running so the model stays loaded. You can now c
|
||||
nano ~/.openclaw/openclaw.json
|
||||
```
|
||||
|
||||
2. Add or update the `models` section so it includes the LM Studio provider. Example for **gpt-oss-120b** (DGX Spark):
|
||||
2. Add or update the `models` section so it includes the vLLM provider pointing at the endpoint from Step 3. vLLM does not require an API key, so any non-empty placeholder works:
|
||||
|
||||
```json
|
||||
"models": {
|
||||
"mode": "merge",
|
||||
"providers": {
|
||||
"lmstudio": {
|
||||
"baseUrl": "http://localhost:1234/v1",
|
||||
"apiKey": "lmstudio",
|
||||
"vllm": {
|
||||
"baseUrl": "http://localhost:8000/v1",
|
||||
"apiKey": "vllm",
|
||||
"api": "openai-responses",
|
||||
"models": [
|
||||
{
|
||||
"id": "openai/gpt-oss-120b",
|
||||
"name": "openai/gpt-oss-120b",
|
||||
"reasoning": false,
|
||||
"id": "nvidia/Qwen3.6-35B-A3B-NVFP4",
|
||||
"name": "nvidia/Qwen3.6-35B-A3B-NVFP4",
|
||||
"reasoning": true,
|
||||
"input": ["text"],
|
||||
"cost": {
|
||||
"input": 0,
|
||||
@ -217,8 +154,8 @@ Keep this terminal (or process) running so the model stays loaded. You can now c
|
||||
"cacheRead": 0,
|
||||
"cacheWrite": 0
|
||||
},
|
||||
"contextWindow": 32768,
|
||||
"maxTokens": 4096
|
||||
"contextWindow": 262144,
|
||||
"maxTokens": 8192
|
||||
}
|
||||
]
|
||||
}
|
||||
@ -226,30 +163,22 @@ Keep this terminal (or process) running so the model stays loaded. You can now c
|
||||
}
|
||||
```
|
||||
|
||||
For **gpt-oss-20b** or another model, use the same structure but set `id` and `name` to match the model you loaded (e.g. `openai/gpt-oss-20b`). Adjust `contextWindow` and `maxTokens` if needed.
|
||||
|
||||
**If you use Ollama:**
|
||||
The `id` and `name` must match the model handle served by vLLM (`nvidia/Qwen3.6-35B-A3B-NVFP4`). `contextWindow` matches the `--max-model-len` from Step 3.
|
||||
|
||||
> [!NOTE]
|
||||
> `ollama launch openclaw` requires **Ollama v0.15 or later**. If you see an "unknown command" error, upgrade Ollama (`ollama --version`) and retry.
|
||||
> If OpenClaw reports an unsupported-endpoint error against the Responses API, change `"api": "openai-responses"` to the OpenAI chat-completions variant for your OpenClaw version — vLLM always exposes `/v1/chat/completions`.
|
||||
|
||||
Run:
|
||||
3. If the OpenClaw gateway is already running, restart it so it reloads `~/.openclaw/openclaw.json` and picks up the new provider.
|
||||
|
||||
```bash
|
||||
ollama launch openclaw
|
||||
```
|
||||
|
||||
If the OpenClaw gateway is already running, it should pick up the new configuration automatically. You can add `--config` to configure without launching the gateway yet.
|
||||
|
||||
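Before involving the OpenClaw gateway, it can help to send one request straight to the `baseUrl` with the same model `id` used in `openclaw.json`. This is a minimal sketch, assuming the server is on `http://localhost:8000/v1`; the helper name `chat_payload` is hypothetical.

```bash
# Hypothetical helper: build a minimal chat-completions request body.
# The model id and prompt are inserted verbatim, so they must not contain
# double quotes or backslashes.
chat_payload() {
  local model="$1" prompt="$2"
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' \
    "$model" "$prompt"
}

# Usage against a live server (the model id must match openclaw.json):
# curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$(chat_payload nvidia/Qwen3.6-35B-A3B-NVFP4 'Hello!')"
```

If this direct request succeeds but OpenClaw still reports no model, the mismatch is more likely in the provider block (`baseUrl`, `api`, or `id`) than in the server.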
## Step 7. Verify the setup
|
||||
## Step 5. Verify the setup
|
||||
|
||||
1. In a browser, open the **OpenClaw dashboard URL** (and use the access token if required).
|
||||
2. Start a **new** conversation and send a short message.
|
||||
3. If you get a reply from the agent, the setup is working.
|
||||
|
||||
You can also ask OpenClaw which model it’s using. In the gateway chat UI you can switch models by typing: **`/model MODEL_NAME`**.
|
||||
You can also ask OpenClaw which model it’s using. In the gateway chat UI you can switch models by typing: **`/model MODEL_NAME`** (e.g. `/model nvidia/Qwen3.6-35B-A3B-NVFP4`).
|
||||
|
||||
## Step 8. Optional: add skills and learn more
|
||||
## Step 6. Optional: add skills and learn more
|
||||
|
||||
- **Skills** add capabilities but also risk; only enable skills you trust (e.g., community-vetted ones). To add a skill:
|
||||
- Ask OpenClaw to configure a skill, or
|
||||
@ -262,9 +191,9 @@ You can also ask OpenClaw which model it’s using. In the gateway chat UI you c
|
||||
|
||||
| Symptom | Cause | Fix |
|
||||
|---------|--------|-----|
|
||||
| OpenClaw dashboard URL not loading | Gateway not running or wrong host/port | **Restart the OpenClaw gateway:** For Ollama, run `ollama launch openclaw` to restart an already-configured gateway. For LM Studio, restart the OpenClaw gateway via the LM Studio UI or restart the OpenClaw service/container. **Verify:** Check that the gateway process is running with `pgrep -f openclaw` or `ps aux \| grep openclaw`. **Find URL/token:** Check the original installer output (scroll up in your terminal) or look in gateway logs (typically `~/.openclaw/logs/`) for the dashboard URL and access token |
|
||||
| "Connection refused" to model (e.g. localhost:1234 or Ollama port) | LM Studio or Ollama not running, or wrong port | Start the model in a separate terminal (`lms load ...` or `ollama run ...`) and ensure the port in `openclaw.json` matches (1234 for LM Studio, 11434 for Ollama) |
|
||||
| OpenClaw says no model available | Model provider not configured or model not loaded | Add the `models` section to `~/.openclaw/openclaw.json` for LM Studio, or run `ollama launch openclaw` for Ollama; ensure the model is loaded/running |
|
||||
| Out-of-memory or very slow inference on DGX Spark | Model too large for available GPU memory or other GPU workloads | Free GPU memory (close other apps), choose a smaller model, or check usage with `nvidia-smi` |
|
||||
| OpenClaw dashboard URL not loading | Gateway not running or wrong host/port | **Restart the OpenClaw gateway** so it reloads `~/.openclaw/openclaw.json`. **Verify:** Check that the gateway process is running with `pgrep -f openclaw` or `ps aux \| grep openclaw`. **Find URL/token:** Check the original installer output (scroll up in your terminal) or look in gateway logs (typically `~/.openclaw/logs/`) for the dashboard URL and access token |
|
||||
| "Connection refused" to model (e.g. localhost:8000) | vLLM server not running, still loading, or wrong port | Confirm the vLLM container is up and finished loading (`curl http://localhost:8000/v1/models` lists the model) and that `baseUrl` in `openclaw.json` is `http://localhost:8000/v1` |
|
||||
| OpenClaw says no model available | Provider not configured or model handle mismatch | Add the `vllm` provider to `~/.openclaw/openclaw.json` and ensure `id`/`name` exactly match the served handle (`nvidia/Qwen3.6-35B-A3B-NVFP4`) |
|
||||
| Out-of-memory or very slow inference on DGX Spark | Model too large for available GPU memory or other GPU workloads | Lower `--gpu-memory-utilization` or `--max-model-len` when launching vLLM, free GPU memory (close other apps), or check usage with `nvidia-smi` |
|
||||
| Install script fails or dependencies missing | Missing system packages on Linux | Install curl and any required build tools; see [OpenClaw documentation](https://docs.openclaw.ai) for current requirements |
|
||||
| Config changes not applied | Gateway not reloaded | Restart the OpenClaw gateway so it reloads `~/.openclaw/openclaw.json` |
|
||||
|
||||
@ -83,22 +83,22 @@ You will install the OpenShell CLI (`openshell`), deploy a gateway on your DGX S
|
||||
|
||||
- Comfort with the Linux terminal and SSH
|
||||
- Basic understanding of Docker (OpenShell runs a k3s cluster inside Docker)
|
||||
- Familiarity with Ollama for local model serving
|
||||
- Familiarity with Docker and vLLM for local model serving
|
||||
- Awareness of the security model: OpenShell reduces risk through isolation but cannot eliminate all risk. Review the [OpenShell documentation](https://pypi.org/project/openshell/) and [OpenClaw security guidance](https://docs.openclaw.ai/gateway/security).
|
||||
|
||||
## Prerequisites
|
||||
|
||||
**Hardware Requirements:**
|
||||
- NVIDIA DGX Spark with 128GB unified memory
|
||||
- At least 70GB available memory for a large local model (e.g., gpt-oss:120b at ~65GB plus overhead), or 25GB+ for a smaller model (e.g., gpt-oss-20b)
|
||||
- Enough unified memory for the served model plus KV cache (the playbook serves `nvidia/Qwen3.6-35B-A3B-NVFP4` with vLLM at `--gpu-memory-utilization 0.4`)
|
||||
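As a rough worked example of what `--gpu-memory-utilization 0.4` means on a 128GB system: the flag is a fraction of device memory that vLLM may claim for weights, activations, and KV cache. The sketch below assumes the full 128GB is visible to vLLM, which may overstate what the device actually reports; the helper name is hypothetical.

```bash
# Hypothetical helper: memory budget in GB for a given total and
# --gpu-memory-utilization fraction. LC_ALL=C keeps the decimal point
# independent of the user's locale.
vllm_budget_gb() {
  LC_ALL=C awk -v total="$1" -v util="$2" 'BEGIN { printf "%.1f\n", total * util }'
}

vllm_budget_gb 128 0.4   # prints 51.2
```

Under that assumption, roughly 51GB goes to vLLM and the remainder stays available for the gateway, sandbox, and other processes.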
|
||||
**Software Requirements:**
|
||||
- NVIDIA DGX OS (Ubuntu 24.04 base)
|
||||
- Docker Desktop or Docker Engine running: `docker info`
|
||||
- Python 3.12 or later: `python3 --version`
|
||||
- `uv` package manager: `uv --version` (install with `curl -LsSf https://astral.sh/uv/install.sh | sh`)
|
||||
- Ollama 0.17.0 or newer (latest recommended for gpt-oss MXFP4 support): `ollama --version`
|
||||
- Network access to download Python packages from PyPI and model weights from Ollama
|
||||
- NVIDIA Container Toolkit configured for Docker, plus a HuggingFace token to download the model
|
||||
- Network access to download Python packages from PyPI and model weights from HuggingFace
|
||||
- Have [NVIDIA Sync](https://build.nvidia.com/spark/connect-to-your-spark) installed and configured for your DGX Spark
|
||||
|
||||
## Time & risk
|
||||
@ -109,8 +109,9 @@ You will install the OpenShell CLI (`openshell`), deploy a gateway on your DGX S
|
||||
* OpenShell sandboxes enforce kernel-level isolation, significantly reducing the risk compared to running OpenClaw directly on the host.
|
||||
* The sandbox default policy denies all outbound traffic not explicitly allowed. Misconfigured policies may block legitimate agent traffic; use `openshell logs` to diagnose.
|
||||
* Large model downloads may fail on unstable networks.
|
||||
* **Rollback:** Delete the sandbox with `openshell sandbox delete <sandbox-name>`, stop the gateway with `openshell gateway stop`, and optionally destroy it with `openshell gateway destroy`. Ollama models can be removed with `ollama rm <model>`.
|
||||
* **Last Updated:** 03/13/2026
|
||||
* **Rollback:** Delete the sandbox with `openshell sandbox delete <sandbox-name>`, stop the gateway with `openshell gateway stop`, and optionally destroy it with `openshell gateway destroy`. The vLLM container can be removed with `docker rm`/`docker rmi`.
|
||||
* **Last Updated:** 06/12/2026
|
||||
* Switch local inference backend to vLLM (agent-ready Qwen3.6 35B recipe)
|
||||
|
||||
## Instructions
|
||||
|
||||
@ -191,63 +192,26 @@ openshell status
|
||||
> [!TIP]
|
||||
> If you want to manage the Spark gateway from a separate workstation, run `openshell gateway start --remote <username>@<spark-ssid>.local` from that workstation instead. All subsequent commands will route through the SSH tunnel.
|
||||
|
||||
## Step 5. Install Ollama and pull a model
|
||||
## Step 5. Serve a model with vLLM
|
||||
|
||||
Install Ollama (if not already present) and download a model for local inference.
|
||||
Serve a model with **vLLM** for local inference. This playbook uses the agent-ready `nvidia/Qwen3.6-35B-A3B-NVFP4` recipe — the same one documented in the vLLM playbook's [Run Agent Ready Qwen3.6 35B Model with vLLM](https://build.nvidia.com/spark/vllm/agent-ready-qwen35b) tab.
|
||||
|
||||
Follow that tab to launch the server in a **separate terminal**. It serves `nvidia/Qwen3.6-35B-A3B-NVFP4` on an OpenAI-compatible API at port `8000`.
|
||||
|
||||
> [!IMPORTANT]
|
||||
> The recipe binds `--host 0.0.0.0`, which is required here: the OpenShell gateway runs inside Docker and reaches the server over the Spark's IP address, not `localhost`. Keep the `--host 0.0.0.0` flag when you launch it.
|
||||
|
||||
Once the server reports `Application startup complete`, verify it is reachable on all interfaces:
|
||||
|
||||
```bash
|
||||
curl -fsSL https://ollama.com/install.sh | sh
|
||||
ollama --version
|
||||
curl http://0.0.0.0:8000/v1/models
|
||||
```
|
||||
|
||||
DGX Spark's 128GB memory can run large models:
|
||||
|
||||
| GPU memory available | Suggested model | Model size | Notes |
|
||||
|---------------------|---------------------------|-----------|-------|
|
||||
| 25–48 GB | nemotron-3-nano | ~24GB | Lower latency, good for interactive use |
|
||||
| 48–80 GB | gpt-oss:120b | ~65GB | Good balance of quality and speed |
|
||||
| 128 GB | nemotron-3-super:120b | ~86GB | Best quality on DGX Spark |
|
||||
|
||||
Verify Ollama is running (it auto-starts as a service after installation). If not, start it manually:
|
||||
|
||||
```bash
|
||||
ollama serve &
|
||||
```
|
||||
|
||||
Configure Ollama to listen on all interfaces so the OpenShell gateway container can reach it:
|
||||
|
||||
```bash
|
||||
sudo mkdir -p /etc/systemd/system/ollama.service.d
|
||||
printf '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0"\n' | sudo tee /etc/systemd/system/ollama.service.d/override.conf
|
||||
sudo systemctl daemon-reload
|
||||
sudo systemctl restart ollama
|
||||
```
|
||||
|
||||
Verify Ollama is running and reachable on all interfaces:
|
||||
|
||||
```bash
|
||||
curl http://0.0.0.0:11434
|
||||
```
|
||||
|
||||
Expected: `Ollama is running`. If not, start it with `sudo systemctl start ollama`.
|
||||
|
||||
Next, run a model from Ollama (adjust the model name to match your choice from [the Ollama model library](https://ollama.com/library)). The `ollama run` command will pull the model automatically if it is not already present. Running the model here ensures it is loaded and ready when you use it with OpenClaw, reducing the chance of timeouts later. Example for nemotron-3-super:
|
||||
|
||||
```bash
|
||||
ollama run nemotron-3-super:120b
|
||||
```
|
||||
|
||||
Type `/bye` to exit.
|
||||
|
||||
Verify the model is available:
|
||||
|
||||
```bash
|
||||
ollama list
|
||||
```
|
||||
Expected: a JSON `"data"` array listing `nvidia/Qwen3.6-35B-A3B-NVFP4`. If the request hangs, the model is likely still loading — wait for the startup line and retry.
|
||||
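On Linux, a request to `0.0.0.0` is normally answered over loopback, so it does not by itself show that the server is reachable on the address the gateway container will use. A minimal sketch of repeating the check against the Spark's own IP follows; the helper names are hypothetical, and `hostname -I | awk '{print $1}'` is the same lookup used in Step 6.

```bash
# Hypothetical helpers; the names are illustrative, not part of any CLI.

# Pick the first address from `hostname -I` output supplied on stdin.
first_ip() {
  awk '{print $1; exit}'
}

# Build the models URL for a given host and port (port defaults to 8000).
models_url() {
  printf 'http://%s:%s/v1/models\n' "$1" "${2:-8000}"
}

# Usage on the Spark (needs the vLLM server running):
# curl --fail --max-time 5 "$(models_url "$(hostname -I | first_ip)")"
```

If the loopback request works but this one fails, check the `--host 0.0.0.0` flag and the host firewall before continuing.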
|
||||
## Step 6. Create an inference provider
|
||||
|
||||
We are going to create an OpenShell provider that points to your local Ollama server. This lets OpenShell route inference requests to your Spark-hosted model.
|
||||
We are going to create an OpenShell provider that points to your local vLLM server. This lets OpenShell route inference requests to your Spark-hosted model.
|
||||
|
||||
First, find the IP address of your DGX Spark:
|
||||
|
||||
@ -255,14 +219,14 @@ First, find the IP address of your DGX Spark:
|
||||
hostname -I | awk '{print $1}'
|
||||
```
|
||||
|
||||
Then create the provider, replacing `{Machine_IP}` with the IP address from the command above (e.g. `10.110.106.169`):
|
||||
Then create the provider, replacing `{Machine_IP}` with the IP address from the command above (e.g. `10.110.106.169`). vLLM does not require an API key, so any non-empty placeholder works:
|
||||
|
||||
```bash
|
||||
openshell provider create \
|
||||
--name local-ollama \
|
||||
--name local-vllm \
|
||||
--type openai \
|
||||
--credential OPENAI_API_KEY=not-needed \
|
||||
--config OPENAI_BASE_URL=http://{Machine_IP}:11434/v1
|
||||
--config OPENAI_BASE_URL=http://{Machine_IP}:8000/v1
|
||||
```
|
||||
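To avoid hand-editing `{Machine_IP}`, the same command can be wrapped in a small function. This is a sketch only: the wrapper name and the loopback guard are assumptions, while the `openshell provider create` flags are copied from the command above.

```bash
# Hypothetical wrapper around the provider command shown above.
# It refuses loopback addresses because the gateway runs inside Docker
# and cannot reach the host through localhost.
create_local_vllm_provider() {
  local ip="$1" port="${2:-8000}"
  if [[ -z "$ip" || "$ip" == "localhost" || "$ip" == 127.* ]]; then
    echo "refusing loopback address '${ip}': the gateway container cannot reach it" >&2
    return 1
  fi
  openshell provider create \
    --name local-vllm \
    --type openai \
    --credential OPENAI_API_KEY=not-needed \
    --config "OPENAI_BASE_URL=http://${ip}:${port}/v1"
}

# Usage on the Spark:
# create_local_vllm_provider "$(hostname -I | awk '{print $1}')"
```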
|
||||
> [!IMPORTANT]
|
||||
@ -276,18 +240,18 @@ openshell provider list
|
||||
|
||||
## Step 7. Configure inference routing
|
||||
|
||||
Point the `inference.local` endpoint (available inside every sandbox) at your Ollama model. Replace the model name with your choice from Step 5:
|
||||
Point the `inference.local` endpoint (available inside every sandbox) at your vLLM model. The model name must match the handle served in Step 5:
|
||||
|
||||
```bash
|
||||
openshell inference set \
|
||||
--provider local-ollama \
|
||||
--model nemotron-3-super:120b
|
||||
--provider local-vllm \
|
||||
--model nvidia/Qwen3.6-35B-A3B-NVFP4
|
||||
```
|
||||
|
||||
The output should confirm the route and show a validated endpoint URL, for example: `http://10.110.106.169:11434/v1/chat/completions (openai_chat_completions)`.
|
||||
The output should confirm the route and show a validated endpoint URL, for example: `http://10.110.106.169:8000/v1/chat/completions (openai_chat_completions)`.
|
||||
|
||||
> [!NOTE]
|
||||
> If you see `failed to verify inference endpoint` or `failed to connect` (for example because the gateway cannot reach the host IP from inside its container), add `--no-verify` to skip endpoint verification: `openshell inference set --provider local-ollama --model nemotron-3-super:120b --no-verify`. Ensure Ollama is running and listening on all interfaces (see Step 5).
|
||||
> If you see `failed to verify inference endpoint` or `failed to connect` (for example because the gateway cannot reach the host IP from inside its container), add `--no-verify` to skip endpoint verification: `openshell inference set --provider local-vllm --model nvidia/Qwen3.6-35B-A3B-NVFP4 --no-verify`. Ensure the vLLM server is running and reachable on the Spark's IP (see Step 5).
|
||||
|
||||
Verify the configuration:
|
||||
|
||||
@ -295,7 +259,7 @@ Verify the configuration:
|
||||
openshell inference get
|
||||
```
|
||||
|
||||
Expected output should show `provider: local-ollama` and `model: nemotron-3-super:120b` (or whichever model you chose).
|
||||
Expected output should show `provider: local-vllm` and `model: nvidia/Qwen3.6-35B-A3B-NVFP4`.
|
||||
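The expected-output check can be automated with a plain substring match. The sketch below assumes only that the provider name and model handle appear somewhere in the `openshell inference get` output; the exact output layout is not assumed, and the helper name is hypothetical.

```bash
# Hypothetical helper: succeed if the text on stdin mentions both the
# provider name and the model handle. Fixed-string matching is used so the
# dots and slashes in the model handle are taken literally.
route_is_set() {
  local provider="$1" model="$2" output
  output="$(cat)"
  grep -Fq "$provider" <<<"$output" && grep -Fq "$model" <<<"$output"
}

# Usage:
# openshell inference get \
#   | route_is_set local-vllm nvidia/Qwen3.6-35B-A3B-NVFP4 && echo "route OK"
```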
|
||||
## Step 8. Deploy OpenShell Sandbox
|
||||
|
||||
@ -339,10 +303,10 @@ Use the arrow keys and Enter key to interact with the installation.
|
||||
- Model/auth Provider: Select **Custom Provider**, the second-to-last option.
|
||||
- API Base URL: update to `https://inference.local/v1`
|
||||
- How do you want to provide this API key?: Paste API key for now.
|
||||
- API key: please enter "ollama".
|
||||
- API key: please enter "vllm" (vLLM does not validate the key; any non-empty value works).
|
||||
- Endpoint compatibility: select **OpenAI-compatible** and press Enter.
|
||||
- Model ID: enter the model name you chose in Step 5 (e.g. `nemotron-3-super:120b`).
|
||||
- This may take 1-2 minutes as the Ollama model is spun up in the background.
|
||||
- Model ID: enter the model handle you served in Step 5: `nvidia/Qwen3.6-35B-A3B-NVFP4`.
|
||||
- The first request may take a moment while vLLM warms up.
|
||||
- Endpoint ID: leave the default value.
|
||||
- Alias: enter the same model name (this is optional).
|
||||
- Channel: Select **Skip for now**.
|
||||
@ -454,13 +418,13 @@ Now that OpenClaw has been configured within the OpenShell protected runtime, yo
|
||||
openshell sandbox connect $SANDBOX_NAME
|
||||
```
|
||||
|
||||
Once loaded into the sandbox terminal, you can test connectivity to the Ollama model with this command:
|
||||
Once connected to the sandbox terminal, you can test connectivity to the vLLM model with this command:
|
||||
```bash
|
||||
curl https://inference.local/v1/responses \
|
||||
curl https://inference.local/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"instructions": "You are a helpful assistant.",
|
||||
"input": "Hello!"
|
||||
"model": "nvidia/Qwen3.6-35B-A3B-NVFP4",
|
||||
"messages": [{"role": "user", "content": "Hello!"}]
|
||||
}'
|
||||
```
|
||||
|
||||
@ -511,7 +475,7 @@ openshell sandbox delete $SANDBOX_NAME
|
||||
Remove the inference provider you created in Step 6:
|
||||
|
||||
```bash
|
||||
openshell provider delete local-ollama
|
||||
openshell provider delete local-vllm
|
||||
```
|
||||
|
||||
Stop the gateway (preserves state for later):
|
||||
@ -527,10 +491,11 @@ openshell gateway stop
|
||||
openshell gateway destroy
|
||||
```
|
||||
|
||||
To also remove the Ollama model:
|
||||
To also stop and remove the vLLM container and image:
|
||||
|
||||
```bash
|
||||
ollama rm nemotron-3-super:120b
|
||||
docker rm -f $(docker ps -aq --filter ancestor=vllm/vllm-openai:nightly-aarch64)
|
||||
docker rmi vllm/vllm-openai:nightly-aarch64
|
||||
```
|
||||
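Two details matter when scripting this cleanup: `docker rm` exits with an error when given no container IDs, and it will not remove a running container unless forced. A minimal sketch that handles both cases follows; the function name is hypothetical and the default image tag is the one used above.

```bash
# Hypothetical helper: force-remove every container created from the
# given image, and do nothing (successfully) when none exist.
remove_vllm_containers() {
  local image="${1:-vllm/vllm-openai:nightly-aarch64}"
  local ids
  ids="$(docker ps -aq --filter "ancestor=${image}")"
  if [[ -z "$ids" ]]; then
    echo "no containers found for ${image}"
    return 0
  fi
  # Word-splitting of $ids is intended: one argument per container ID.
  # -f stops running containers before removing them.
  docker rm -f $ids
}

# Usage:
# remove_vllm_containers && docker rmi vllm/vllm-openai:nightly-aarch64
```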
|
||||
## Step 14. Next steps
|
||||
@ -548,11 +513,11 @@ ollama rm nemotron-3-super:120b
|
||||
| `openshell status` shows gateway as unhealthy | Gateway container crashed or failed to initialize | Run `openshell gateway destroy` and then `openshell gateway start` to recreate it. Check Docker logs with `docker ps -a` and `docker logs <container-id>` for details |
|
||||
| `openshell sandbox create --from openclaw` fails to build | Network issue pulling the community sandbox or Dockerfile build failure | Check internet connectivity. Retry the command. If the build fails on a specific package, check if the base image is compatible with your Docker version |
|
||||
| Sandbox is in `Error` phase after creation | Policy validation failed or container startup crashed | Run `openshell logs <sandbox-name>` to see error details. Common causes: invalid policy YAML, missing provider credentials, or port conflicts |
|
||||
| Agent cannot reach `inference.local` inside the sandbox | Inference routing not configured or provider unreachable | Run `openshell inference get` to verify the provider and model are set. Test Ollama is accessible from the host: `curl http://localhost:11434/api/tags`. Ensure the provider URL uses `host.docker.internal` instead of `localhost` |
|
||||
| 503 verification failed or timeout when gateway/sandbox accesses Ollama on the host | Ollama bound only to localhost, or host firewall blocking port 11434 | Make Ollama listen on all interfaces so the gateway container (e.g. on Docker network 172.17.x.x) can reach it: `OLLAMA_HOST=0.0.0.0 ollama serve &`. Allow port 11434 through the host firewall: `sudo ufw allow 11434/tcp comment 'Ollama for OpenShell Gateway'` (then `sudo ufw reload` if needed). |
|
||||
| Agent cannot reach `inference.local` inside the sandbox | Inference routing not configured or provider unreachable | Run `openshell inference get` to verify the provider and model are set. Test the vLLM server from the host: `curl http://localhost:8000/v1/models`. Ensure the provider `OPENAI_BASE_URL` uses the Spark's IP address (not `localhost`), since the gateway runs inside Docker |
|
||||
| 503 verification failed or timeout when gateway/sandbox accesses vLLM on the host | Provider URL points at `localhost`, or host firewall blocking port 8000 | The recipe already binds vLLM to all interfaces (`--host 0.0.0.0`). Confirm the provider `OPENAI_BASE_URL` uses the Spark's IP (from `hostname -I`) so the gateway container (e.g. on Docker network 172.17.x.x) can reach it. Allow port 8000 through the host firewall: `sudo ufw allow 8000/tcp comment 'vLLM for OpenShell Gateway'` (then `sudo ufw reload` if needed). |
|
||||
| Agent's outbound connections are all denied | Default policy does not include the required endpoints | Monitor denials with `openshell logs <sandbox-name> --tail --source sandbox`. Pull the current policy with `openshell policy get <sandbox-name> --full`, add the needed host/port under `network_policies`, and push with `openshell policy set <sandbox-name> --policy <file> --wait` |
|
||||
| "Permission denied" or Landlock errors inside the sandbox | Agent trying to access a path not in `read_only` or `read_write` filesystem policy | Pull the current policy and add the path to `read_write` (or `read_only` if read access is sufficient). Push the updated policy. Note: filesystem policy is static and requires sandbox recreation |
|
||||
| Ollama OOM or very slow inference | Model too large for available memory or GPU contention | Free GPU memory (close other GPU workloads), try a smaller model (e.g., `gpt-oss:20b`), or reduce context length. Monitor with `nvidia-smi` |
|
||||
| vLLM OOM or very slow inference | Model too large for available memory or GPU contention | Free GPU memory (close other GPU workloads), or relaunch vLLM with a lower `--gpu-memory-utilization` / `--max-model-len` (or a smaller model handle). Monitor with `nvidia-smi` |
|
||||
| `openshell sandbox connect` hangs or times out | Sandbox not in `Ready` phase | Run `openshell sandbox get <sandbox-name>` to check the phase. If stuck in `Provisioning`, wait or check logs. If in `Error`, delete and recreate the sandbox |
|
||||
| Policy push returns exit code 1 (validation failed) | Malformed YAML or invalid policy fields | Check the YAML syntax. Common issues: paths not starting with `/`, `..` traversal in paths, `root` as `run_as_user`, or endpoints missing required `host`/`port` fields. Fix and re-push |
|
||||
| `openshell gateway start` fails with "K8s namespace not ready" / timed out waiting for namespace | The k3s cluster inside the Docker container takes longer to bootstrap than the CLI timeout allows. The internal components (TLS secrets, Helm chart, namespace creation) may need extra time, especially on first run when images are pulled inside the container. | First, check whether the container is still running and progressing: `docker ps --filter name=openshell` (look for `health: starting`). Inspect k3s state inside the container: `docker exec <container> sh -c "KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl get ns"` and `kubectl get pods -A`. If pods are in `ContainerCreating` and TLS secrets are missing (`navigator-server-tls`, `openshell-server-tls`), the cluster is still bootstrapping — wait a few minutes and run `openshell status` again. If it does not recover, destroy with `openshell gateway destroy` (and `docker rm -f <container>` if needed) and retry `openshell gateway start`. Ensure Docker has enough resources (memory and disk) for the k3s cluster. |
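The two vLLM rows above both come down to giving the gateway a base URL that uses the host's IP rather than `localhost`. A minimal sketch of deriving that URL from `hostname -I` output follows. The helper name `vllm_base_url` is hypothetical, the `/v1` suffix is assumed from the `/v1/models` path used above, and it assumes the first address reported is the one the gateway container can reach.

```bash
# Hypothetical helper: build an OpenAI-style base URL from a list of host IPs.
# $1 = space-separated addresses (as printed by `hostname -I`), $2 = port (default 8000).
vllm_base_url() {
  local ips="$1" port="${2:-8000}" first
  # Keep only the first whitespace-separated address.
  first="${ips%% *}"
  if [ -z "$first" ]; then
    echo "no host IP found" >&2
    return 1
  fi
  printf 'http://%s:%s/v1\n' "$first" "$port"
}

# Usage on the host (not run here):
#   vllm_base_url "$(hostname -I)"
```

If the host has several interfaces, check that the first address is routable from the Docker network before using the result as `OPENAI_BASE_URL`.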
|
||||
|
||||
@ -82,21 +82,18 @@ spec:
|
||||
content: |
|
||||
# Step 1. Log in to Brev
|
||||
|
||||
Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you’re in the correct org (by clicking the org button on the top right-hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation.
|
||||
Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you’re in the correct org (by clicking the org button on the top right hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation.
|
||||
|
||||
Click the “Register Compute” button and follow the instructions in the pop-up window.
|
||||
|
||||
# Step 2. Complete Pop-up Instructions
|
||||
# Step 2. Complete Popup Instructions
|
||||
|
||||
* Install the Brev CLI
|
||||
* Configure your compute
|
||||
* Add a name for compute
|
||||
* To configure SSH, ensure the “Enable SSH access” toggle is on
|
||||
* To configure ssh, ensure the “Enable SSH access” toggle is on
|
||||
* Run the registration command
|
||||
|
||||
> [!IMPORTANT]
|
||||
> Run the Brev CLI install command **without `sudo`**. Prefixing the installer with `sudo` writes the `brev` binary into root's home directory, which is not on your user shell's `PATH` — the next command will fail with `brev: command not found`. Copy the install command from the pop-up and run it as your normal user.
|
||||
|
||||
# Step 3. Follow Registration Flow
|
||||
|
||||
In the CLI, you’ll be walked through registration. Go through the flow until registration is complete.
|
||||
@ -113,14 +110,10 @@ spec:
|
||||
|
||||
Now that your hardware is connected, you can:
|
||||
|
||||
* **Access your machine from anywhere:** Open the [Brev UI](https://brev.nvidia.com) and launch a session from [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
|
||||
* **Share access with others:** Invite teammates to your DGX Station from the Brev UI:
|
||||
* Go to the [Brev UI](https://brev.nvidia.com) and open [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
|
||||
* Find your DGX Station in the list and open the row's three-dot (⋯) menu.
|
||||
* Select **Share Access**.
|
||||
* Enter the email address of the person you want to share with.
|
||||
* Choose their role / permission level.
|
||||
* Confirm to send the invitation.
|
||||
* **Share Access Anywhere:** Access your machine from anywhere and share access with others through the Brev UI by:
|
||||
* Adding the user to your [Team](https://brev.nvidia.com/org/team)
|
||||
* Navigating to your instance in the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section
|
||||
* In the **SSH Access** section of the instance, search for the user you wish to add and click **Modify Access** to enable access
|
||||
|
||||
# Step 6. Cleanup
|
||||
|
||||
@ -135,7 +128,7 @@ spec:
|
||||
In the UI:
|
||||
* Go to the [Brev UI](https://brev.nvidia.com)
|
||||
* Navigate to the section listing “GPU Environments” and look under “Registered Compute”
|
||||
* Click the “Remove” menu item on the device you wish to delete from Brev.
|
||||
* Click the “Remove” menu item on the DGX Station you wish to delete from Brev.
|
||||
* Confirm your selection.
|
||||
|
||||
|
||||
|
||||
@ -174,7 +174,7 @@ openshell --version
|
||||
df -h /
|
||||
```
|
||||
|
||||
Expected: Blackwell Ultra GPU, Docker >= 23.0.1, **Node.js v22.x** (the DGX Station ships with v18 — see below), OpenShell >= 0.0.33, and **at least 200 GB free** on `/` (86 GB model + Docker images + working space).
|
||||
Expected: Blackwell Ultra GPU, Docker >= 23.0.1, **Node.js v22.x**, OpenShell >= 0.0.33, and **at least 200 GB free** on `/` (86 GB model + Docker images + working space).
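To check the minimum versions listed above from a script, a comparison built on `sort -V` (GNU coreutils) is usually enough. This is a minimal sketch: `version_ge` is a hypothetical helper, not part of the playbook's Makefile, and extracting the version string from each tool's output is left to you.

```bash
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Relies on GNU `sort -V` (version sort).
version_ge() {
  local have="${1#v}" want="${2#v}"   # tolerate a leading "v" as in "v22.3.0"
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Example checks against the minimums named above (version values are illustrative).
version_ge "v22.11.0" "22.0.0" && echo "Node.js OK"
version_ge "0.0.40"   "0.0.33" && echo "OpenShell OK"
version_ge "20.10.7"  "23.0.1" || echo "Docker too old"
```

The helper treats equal versions as passing, which matches the ">=" wording of the requirements.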
|
||||
|
||||
> [!WARNING]
|
||||
> If `openshell --version` says `command not found`, the binary is at `~/.local/bin/openshell` but isn't on PATH. Run the `export PATH=...` line above and re-source `~/.bashrc`. Without this, every `openshell` and `make` command in later steps fails.
|
||||
@ -182,10 +182,14 @@ Expected: Blackwell Ultra GPU, Docker >= 23.0.1, **Node.js v22.x** (the DGX Stat
|
||||
> [!TIP]
|
||||
> `make prereq` (run from `~/clinical-intelligence` after Step 2) bundles all of the checks below — Docker, Node version, OpenShell, disk space, GPU, port 11434, and NGC auth — into one command.
|
||||
|
||||
**If `node --version` reports v18.x or older**, install Node.js v22 before continuing:
|
||||
**If `node --version` reports v18.x, older, or `command not found`**, install Node.js v22 before continuing:
|
||||
|
||||
```bash
|
||||
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
|
||||
## Download the NodeSource setup script first, then run it with sudo.
|
||||
## Running it inline with `| sudo bash` does not work — the sudo context
|
||||
## needs to own the entire script execution.
|
||||
curl -fsSL https://deb.nodesource.com/setup_22.x -o /tmp/nodesource_setup.sh
|
||||
sudo bash /tmp/nodesource_setup.sh
|
||||
sudo apt-get install -y nodejs
|
||||
node --version # should now show v22.x
|
||||
```
|
||||
@ -204,10 +208,20 @@ ss -tlnp 2>/dev/null | grep 11434 || echo 'port 11434 free'
|
||||
|
||||
Expected: `port 11434 free`. If the line still shows a listener, something else (an old `ollama serve`, another container, etc.) owns the port — stop it, or change `OLLAMA_PORT` in `.env` (Step 2) to a free port such as `11435`. `make setup` sources `.env` and configures the sandbox provider against the override.
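A bare `grep 11434` can also match a longer port number or a PID in the process column. If you want a stricter check, one option is to match the local-address column exactly. A minimal sketch, assuming the default `ss -tln` column layout in which the fourth field is the local address and port; `port_listening` is a hypothetical helper that reads `ss` output on stdin.

```bash
# Hypothetical helper: exit 0 if any line on stdin has a local address ending in :PORT.
# Assumes `ss -tln` layout: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
port_listening() {
  local port="$1"
  awk -v p="$port" '$4 ~ (":" p "$") { found = 1 } END { exit !found }'
}

# Usage on the host (not run here):
#   ss -tln | port_listening 11434 && echo 'port 11434 in use' || echo 'port 11434 free'
```

Reading from stdin keeps the helper easy to test against captured `ss` output.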
|
||||
|
||||
**Stale OpenShell gateway?** If you previously ran the NemoClaw playbook, an existing gateway will be silently reused under the new name. To start clean:
|
||||
**Stale OpenShell gateway?** If you previously ran a playbook that started `openshell-gateway`, kill the process and remove the registration:
|
||||
|
||||
```bash
|
||||
openshell gateway destroy 2>/dev/null || true
|
||||
pkill -f openshell-gateway 2>/dev/null || true
|
||||
openshell gateway remove openshell 2>/dev/null || true
|
||||
```
|
||||
|
||||
**Previously ran the NemoClaw playbook?** NemoClaw installs `openclaw-gateway.service` as a systemd user service that binds port 18789. If it is still running, `make setup` fails with "Port 18789 is already in use". Stop and disable it before proceeding — `make setup` will also do this automatically, but stopping it here avoids a confusing error:
|
||||
|
||||
```bash
|
||||
systemctl --user stop openclaw-gateway.service 2>/dev/null || true
|
||||
systemctl --user disable openclaw-gateway.service 2>/dev/null || true
|
||||
## Verify the port is free
|
||||
ss -tlnp | grep 18789 || echo 'port 18789 free'
|
||||
```
|
||||
|
||||
## Step 2. Copy the assets and configure
|
||||
@ -290,8 +304,8 @@ make status
|
||||
Expected:
|
||||
|
||||
```
|
||||
Ollama: ✓ healthy
|
||||
OpenFold3: ✓ healthy
|
||||
Ollama (port 11434): ✓ healthy
|
||||
OpenFold3 (port 8000): ✓ healthy
|
||||
```
|
||||
|
||||
OpenFold3 takes ~3 minutes to load model weights on startup. If it shows "down (may still be loading)", wait and check again.
|
||||
@ -301,26 +315,30 @@ OpenFold3 takes ~3 minutes to load model weights on startup. If it shows "down (
|
||||
|
||||
## Step 4. Start the OpenShell gateway
|
||||
|
||||
The OpenShell gateway runs a lightweight k3s Kubernetes cluster inside Docker to manage sandboxes. On DGX Station, the kernel uses cgroup v2 with the systemd driver, but k3s defaults to cgroupfs. The flag below tells k3s to match the host:
|
||||
OpenShell >= 0.0.40 ships `openshell-gateway`, a standalone server binary installed alongside the CLI. Start it with the Docker driver (no Kubernetes required), then register it with the CLI:
|
||||
|
||||
```bash
|
||||
OPENSHELL_K3S_ARGS='--kubelet-arg=cgroup-driver=systemd' openshell gateway start
|
||||
## Start the gateway server in the background using the Docker compute driver.
|
||||
## --disable-tls is safe for local-only use (loopback-bound).
|
||||
nohup openshell-gateway \
|
||||
--disable-tls \
|
||||
--drivers docker \
|
||||
--bind-address 127.0.0.1 \
|
||||
--port 17670 \
|
||||
> /tmp/openshell-gateway.log 2>&1 &
|
||||
echo "Gateway PID: $!"
|
||||
|
||||
## Register the gateway with the CLI and set it as active.
|
||||
openshell gateway add http://127.0.0.1:17670 --name openshell
|
||||
```
|
||||
|
||||
Wait for the gateway's embedded k3s cluster to finish initializing (10–15 seconds after `gateway start` returns), then verify:
|
||||
Verify the gateway is connected:
|
||||
|
||||
```bash
|
||||
## Wait until the gateway accepts connections, fail after 60s
|
||||
for i in $(seq 1 30); do
|
||||
if openshell status 2>/dev/null | grep -q "Connected"; then
|
||||
echo "Gateway: Connected"; break
|
||||
fi
|
||||
sleep 2
|
||||
done
|
||||
openshell status
|
||||
```
|
||||
|
||||
Expected: `Status: Connected`. If the first `openshell status` immediately after `gateway start` reports `Connection reset by peer`, that is normal — k3s is still warming up. The loop above polls until it is ready.
|
||||
Expected: `Status: Connected`. If not connected, check `/tmp/openshell-gateway.log` for errors. The gateway typically starts in under 1 second.
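If you script this verification, a bounded retry that returns a non-zero status on timeout makes a failed start explicit. A minimal sketch; `wait_for` is a hypothetical helper, and the attempt count and delay are illustrative.

```bash
# Hypothetical helper: retry a command up to N times, sleeping DELAY seconds after each failed try.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_for() {
  local attempts="$1" delay="$2" i
  shift 2
  for i in $(seq 1 "$attempts"); do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Usage on the host (not run here), using the status check from this step:
#   wait_for 30 2 sh -c 'openshell status | grep -q Connected' \
#     || { echo "gateway not connected; see /tmp/openshell-gateway.log" >&2; exit 1; }
```

Because the helper only wraps a command and its exit status, the same function can guard other readiness checks in later steps.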
|
||||
|
||||
> [!NOTE]
|
||||
> Step 4 configures OpenShell infrastructure (gateway). Step 5 deploys the healthcare agent into this infrastructure.
|
||||
@ -457,7 +475,8 @@ Skill files are Markdown. Edit a threshold or drug classification — it takes e
|
||||
```bash
|
||||
openshell sandbox delete clinical-sandbox
|
||||
make down
|
||||
openshell gateway destroy
|
||||
pkill -f openshell-gateway 2>/dev/null || true
|
||||
openshell gateway remove openshell 2>/dev/null || true
|
||||
```
|
||||
|
||||
To also remove downloaded models and volumes:
|
||||
|
||||
@ -130,7 +130,8 @@ test-docker: ## Run tests inside a container
|
||||
teardown: ## Tear down sandbox, services, and gateway
|
||||
openshell sandbox delete $${SANDBOX_NAME:-clinical-sandbox} 2>/dev/null || true
|
||||
$(COMPOSE) down
|
||||
openshell gateway destroy 2>/dev/null || true
|
||||
pkill -f openshell-gateway 2>/dev/null || true
|
||||
openshell gateway remove openshell 2>/dev/null || true
|
||||
@echo "Teardown complete."
|
||||
|
||||
clean: ## Remove test results, PDB caches, and dangling images
|
||||
|
||||
@ -19,9 +19,7 @@
|
||||
# --local Bind gateway to 0.0.0.0 for local browser access (no SSH tunnel needed)
|
||||
# Default: loopback only (requires SSH tunnel from remote machine)
|
||||
#
|
||||
# Machine differences:
|
||||
# GB300: Docker bridge 172.18.0.1, no sg docker prefix
|
||||
# New Station: Docker bridge 172.17.0.1, needs sg docker prefix
|
||||
# The Docker bridge IP is auto-detected via 'ip -4 addr show docker0' below.
|
||||
set -euo pipefail
|
||||
|
||||
BIND_MODE="loopback"
|
||||
@ -129,6 +127,20 @@ if openshell sandbox list 2>/dev/null | grep -q "$SANDBOX_NAME"; then
|
||||
sleep 3
|
||||
fi
|
||||
|
||||
# Stop any host-level service that owns $PORT (e.g. openclaw-gateway.service
|
||||
# installed by the NemoClaw playbook as a systemd --user service). systemd
|
||||
# will respawn the process if only the PID is killed, so stop the unit first.
|
||||
if ss -tlnp 2>/dev/null | grep -qE "[: ]${PORT}[^0-9]"; then
|
||||
echo "Detected listener on host :$PORT — stopping before forwarding..."
|
||||
systemctl --user stop openclaw-gateway.service 2>/dev/null || true
|
||||
systemctl --user disable openclaw-gateway.service 2>/dev/null || true
|
||||
# Kill any remaining listener not managed by systemd (e.g. stale PID)
|
||||
if ss -tlnp 2>/dev/null | grep -qE "[: ]${PORT}[^0-9]"; then
|
||||
fuser -k "${PORT}/tcp" 2>/dev/null || true
|
||||
sleep 1
|
||||
fi
|
||||
fi
|
||||
|
||||
# Stop any stale port forwards on $PORT from prior (possibly deleted) sandboxes.
|
||||
# Stale forwards block re-creation with a cryptic error like
|
||||
# "× Port 18789 is already forwarded to sandbox 'dgx-demo'."
|
||||
@ -202,13 +214,37 @@ done
|
||||
echo ""
|
||||
|
||||
# --- Step 4: Upload repo into sandbox ---
|
||||
# Note: openshell sandbox upload (>= 0.0.44) copies the source *directory itself*
|
||||
# (like `cp -r src/ dest/` creates dest/src/), not just its contents. We therefore
|
||||
# upload to /sandbox/ so that the source directory `clinical-intelligence` lands at
|
||||
# /sandbox/clinical-intelligence/ rather than /sandbox/clinical-intelligence/clinical-intelligence/.
|
||||
echo "--- Step 4: Upload repo ---"
|
||||
openshell sandbox upload "$SANDBOX_NAME" "$REPO_DIR" /sandbox/clinical-intelligence
|
||||
openshell sandbox upload "$SANDBOX_NAME" "$REPO_DIR" /sandbox/
|
||||
|
||||
# Fix nested directories caused by upload (analysis-methods/analysis-methods/)
|
||||
|
||||
# Resolve the active gateway name for the ssh-proxy ProxyCommand.
|
||||
# Precedence: OPENSHELL_GATEWAY env var (set by the CLI for all subcommands) →
|
||||
# active gateway from `openshell status` → fallback to 'openshell'.
|
||||
# This prevents a failure when the user previously ran the NemoClaw playbook
|
||||
# (which registers its gateway as 'nemoclaw' instead of 'openshell').
|
||||
_gw_name() {
|
||||
if [ -n "${OPENSHELL_GATEWAY:-}" ]; then
|
||||
printf '%s' "$OPENSHELL_GATEWAY"
|
||||
return
|
||||
fi
|
||||
local name
|
||||
name=$(openshell status 2>/dev/null \
|
||||
| grep -oE 'Gateway:[[:space:]]+[A-Za-z0-9_-]+' \
|
||||
| awk '{print $NF}' | head -1)
|
||||
printf '%s' "${name:-openshell}"
|
||||
}
|
||||
GW_NAME="$(_gw_name)"
|
||||
|
||||
_sandbox() {
|
||||
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR \
|
||||
-o "ProxyCommand=openshell ssh-proxy --gateway-name openshell --name $SANDBOX_NAME" \
|
||||
-o ConnectTimeout=10 \
|
||||
-o "ProxyCommand=openshell ssh-proxy --gateway-name $GW_NAME --name $SANDBOX_NAME" \
|
||||
"sandbox@openshell-$SANDBOX_NAME" "$@"
|
||||
}
|
||||
|
||||
|
||||
@ -209,7 +209,7 @@ spec:
|
||||
df -h /
|
||||
```
|
||||
|
||||
Expected: Blackwell Ultra GPU, Docker >= 23.0.1, **Node.js v22.x** (the DGX Station ships with v18 — see below), OpenShell >= 0.0.33, and **at least 200 GB free** on `/` (86 GB model + Docker images + working space).
|
||||
Expected: Blackwell Ultra GPU, Docker >= 23.0.1, **Node.js v22.x**, OpenShell >= 0.0.33, and **at least 200 GB free** on `/` (86 GB model + Docker images + working space).
|
||||
|
||||
> [!WARNING]
|
||||
> If `openshell --version` says `command not found`, the binary is at `~/.local/bin/openshell` but isn't on PATH. Run the `export PATH=...` line above and re-source `~/.bashrc`. Without this, every `openshell` and `make` command in later steps fails.
|
||||
@ -217,10 +217,14 @@ spec:
|
||||
> [!TIP]
|
||||
> `make prereq` (run from `~/clinical-intelligence` after Step 2) bundles all of the checks below — Docker, Node version, OpenShell, disk space, GPU, port 11434, and NGC auth — into one command.
|
||||
|
||||
**If `node --version` reports v18.x or older**, install Node.js v22 before continuing:
|
||||
**If `node --version` reports v18.x, older, or `command not found`**, install Node.js v22 before continuing:
|
||||
|
||||
```bash
|
||||
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
|
||||
# Download the NodeSource setup script first, then run it with sudo.
|
||||
# Running it inline with `| sudo bash` does not work — the sudo context
|
||||
# needs to own the entire script execution.
|
||||
curl -fsSL https://deb.nodesource.com/setup_22.x -o /tmp/nodesource_setup.sh
|
||||
sudo bash /tmp/nodesource_setup.sh
|
||||
sudo apt-get install -y nodejs
|
||||
node --version # should now show v22.x
|
||||
```
|
||||
@ -239,10 +243,20 @@ spec:
|
||||
|
||||
Expected: `port 11434 free`. If the line still shows a listener, something else (an old `ollama serve`, another container, etc.) owns the port — stop it, or change `OLLAMA_PORT` in `.env` (Step 2) to a free port such as `11435`. `make setup` sources `.env` and configures the sandbox provider against the override.
|
||||
|
||||
**Stale OpenShell gateway?** If you previously ran the NemoClaw playbook, an existing gateway will be silently reused under the new name. To start clean:
|
||||
**Stale OpenShell gateway?** If you previously ran a playbook that started `openshell-gateway`, kill the process and remove the registration:
|
||||
|
||||
```bash
|
||||
openshell gateway destroy 2>/dev/null || true
|
||||
pkill -f openshell-gateway 2>/dev/null || true
|
||||
openshell gateway remove openshell 2>/dev/null || true
|
||||
```
|
||||
|
||||
**Previously ran the NemoClaw playbook?** NemoClaw installs `openclaw-gateway.service` as a systemd user service that binds port 18789. If it is still running, `make setup` fails with "Port 18789 is already in use". Stop and disable it before proceeding — `make setup` will also do this automatically, but stopping it here avoids a confusing error:
|
||||
|
||||
```bash
|
||||
systemctl --user stop openclaw-gateway.service 2>/dev/null || true
|
||||
systemctl --user disable openclaw-gateway.service 2>/dev/null || true
|
||||
# Verify the port is free
|
||||
ss -tlnp | grep 18789 || echo 'port 18789 free'
|
||||
```
|
||||
|
||||
# Step 2. Copy the assets and configure
|
||||
@ -325,8 +339,8 @@ spec:
|
||||
Expected:
|
||||
|
||||
```
|
||||
Ollama: ✓ healthy
|
||||
OpenFold3: ✓ healthy
|
||||
Ollama (port 11434): ✓ healthy
|
||||
OpenFold3 (port 8000): ✓ healthy
|
||||
```
|
||||
|
||||
OpenFold3 takes ~3 minutes to load model weights on startup. If it shows "down (may still be loading)", wait and check again.
|
||||
@ -336,26 +350,30 @@ spec:
|
||||
|
||||
# Step 4. Start the OpenShell gateway
|
||||
|
||||
The OpenShell gateway runs a lightweight k3s Kubernetes cluster inside Docker to manage sandboxes. On DGX Station, the kernel uses cgroup v2 with the systemd driver, but k3s defaults to cgroupfs. The flag below tells k3s to match the host:
|
||||
OpenShell >= 0.0.40 ships `openshell-gateway`, a standalone server binary installed alongside the CLI. Start it with the Docker driver (no Kubernetes required), then register it with the CLI:
|
||||
|
||||
```bash
|
||||
OPENSHELL_K3S_ARGS='--kubelet-arg=cgroup-driver=systemd' openshell gateway start
|
||||
# Start the gateway server in the background using the Docker compute driver.
|
||||
# --disable-tls is safe for local-only use (loopback-bound).
|
||||
nohup openshell-gateway \
|
||||
--disable-tls \
|
||||
--drivers docker \
|
||||
--bind-address 127.0.0.1 \
|
||||
--port 17670 \
|
||||
> /tmp/openshell-gateway.log 2>&1 &
|
||||
echo "Gateway PID: $!"
|
||||
|
||||
# Register the gateway with the CLI and set it as active.
|
||||
openshell gateway add http://127.0.0.1:17670 --name openshell
|
||||
```
|
||||
|
||||
Wait for the gateway's embedded k3s cluster to finish initializing (10–15 seconds after `gateway start` returns), then verify:
|
||||
Verify the gateway is connected:
|
||||
|
||||
```bash
|
||||
# Wait until the gateway accepts connections, fail after 60s
|
||||
for i in $(seq 1 30); do
|
||||
if openshell status 2>/dev/null | grep -q "Connected"; then
|
||||
echo "Gateway: Connected"; break
|
||||
fi
|
||||
sleep 2
|
||||
done
|
||||
openshell status
|
||||
```
|
||||
|
||||
Expected: `Status: Connected`. If the first `openshell status` immediately after `gateway start` reports `Connection reset by peer`, that is normal — k3s is still warming up. The loop above polls until it is ready.
|
||||
Expected: `Status: Connected`. If not connected, check `/tmp/openshell-gateway.log` for errors. The gateway typically starts in under 1 second.
|
||||
|
||||
> [!NOTE]
|
||||
> Step 4 configures OpenShell infrastructure (gateway). Step 5 deploys the healthcare agent into this infrastructure.
|
||||
@ -492,7 +510,8 @@ spec:
|
||||
```bash
|
||||
openshell sandbox delete clinical-sandbox
|
||||
make down
|
||||
openshell gateway destroy
|
||||
pkill -f openshell-gateway 2>/dev/null || true
|
||||
openshell gateway remove openshell 2>/dev/null || true
|
||||
```
|
||||
|
||||
To also remove downloaded models and volumes:
|
||||
|
||||
@ -2,7 +2,7 @@ kind: Playbook
|
||||
metadata:
|
||||
name: station-local-coding-agent
|
||||
displayName: Local Coding Agent
|
||||
shortDescription: Run local CLI coding agents with Ollama on DGX Station (NVIDIA GB300) using glm-4.7-flash (fast) or unsloth/GLM-4.7-GGUF:Q8_0 (best quality)
|
||||
shortDescription: Run local CLI coding agents with Ollama on DGX Station (GB300 Ultra) using GLM-4.7 and GLM-4.7-Flash
|
||||
|
||||
publisher: nvidia
|
||||
description: |
|
||||
@ -17,6 +17,8 @@ metadata:
|
||||
- LLM
|
||||
- Ollama
|
||||
- Claude Code
|
||||
- OpenCode
|
||||
- Codex
|
||||
|
||||
attributes:
|
||||
- key: DURATION
|
||||
@ -39,18 +41,24 @@ spec:
|
||||
content: |
|
||||
# Basic idea
|
||||
|
||||
Use Ollama on **DGX Station (NVIDIA GB300)** to run local coding models and connect a CLI coding agent. This
|
||||
playbook uses **Claude Code** to talk to Ollama for local inference, so you can work without external cloud APIs.
|
||||
Use Ollama on **DGX Station with GB300 Ultra** to run local coding models and connect a CLI coding agent. This
|
||||
playbook supports three options: **Claude Code**, **OpenCode**, and **Codex CLI**. Each
|
||||
agent talks to Ollama for local inference, so you can work without external cloud APIs.
|
||||
|
||||
The DGX Station GPU (reported as **NVIDIA GB300** in `nvidia-smi`) provides ample memory to run **glm-4.7-flash** (fast loading and testing) and larger models such as **unsloth/GLM-4.7-GGUF:Q8_0** (best quality), both supported on Ollama.
|
||||
The GB300 Ultra’s massive GPU memory lets you run **GLM-4.7** and **GLM-4.7-Flash** in high-quality variants (e.g. bf16, q8_0) for the best coding-assistant quality directly on the Station.
|
||||
|
||||
# CLI agent
|
||||
# Choose your CLI agent
|
||||
|
||||
This playbook uses **Claude Code** as the CLI agent, connected to a local Ollama model for inference.
|
||||
Pick the tab that matches the CLI agent you want to use:
|
||||
|
||||
- **Claude Code**: Fastest path to a working CLI agent with a local Ollama model.
|
||||
- **OpenCode**: Open-source CLI with provider configuration; this guide targets Ollama.
|
||||
- **Codex CLI**: OpenAI Codex CLI configured to run against Ollama locally.
|
||||
|
||||
# What you'll accomplish
|
||||
|
||||
You will run a local coding model on your **DGX Station (NVIDIA GB300)** with Ollama, connect Claude Code to it, and complete a small coding task end-to-end. Use **glm-4.7-flash** (including high-quality variants) or **unsloth/GLM-4.7-GGUF:Q8_0** for best quality.
|
||||
You will run a local coding model on your **DGX Station (GB300 Ultra)** with Ollama, connect it to your
|
||||
chosen CLI agent, and complete a small coding task end-to-end. You can use **GLM-4.7** or **GLM-4.7-Flash** (including high-quality variants) to take full advantage of the Station’s memory.
|
||||
|
||||
# What to know before starting
|
||||
|
||||
@ -60,14 +68,13 @@ spec:
|
||||
|
||||
# Prerequisites
|
||||
|
||||
- **DGX Station** with **NVIDIA GB300** (Grace Blackwell) and NVIDIA driver; `nvidia-smi` typically shows "NVIDIA GB300"
|
||||
- **DGX Station** with **GB300 Ultra** (Grace Blackwell) and NVIDIA driver
|
||||
- Internet access to download model weights
|
||||
- **Ollama 0.15.0 or newer** (required for GLM-4.7-Flash; do not pin to 0.14.3)
|
||||
- **GPU memory** on GB300 supports both recommended models:
|
||||
- **glm-4.7-flash**: ~19 GB (`latest`) to ~60 GB (bf16) — **recommended for fast loading and testing**
|
||||
- **unsloth/GLM-4.7-GGUF:Q8_0** (Hugging Face on Ollama): larger model — **recommended for best quality**
|
||||
- Other variants (e.g. `glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) fit on GB300
|
||||
- **Disk space** for model downloads: plan for ~19 GB for `glm-4.7-flash:latest`, plus additional space for the Q8_0 or bf16 variants if you use them
|
||||
- Ollama 0.14.3 or newer
|
||||
- **GPU memory** on GB300 Ultra supports GLM-4.7 and high-quality variants:
|
||||
- **GLM-4.7-Flash** (30B): ~19GB (latest) to ~60GB (bf16) — recommended default for coding
|
||||
- **GLM-4.7** (full): use `ollama pull glm-4.7` for higher quality when available
|
||||
- High-quality variants (e.g. `glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) fit comfortably on GB300 Ultra
|
||||
|
||||
# Time & risk
|
||||
|
||||
@ -76,8 +83,8 @@ spec:
|
||||
* Large model downloads can fail if network connectivity is unstable
|
||||
* Older Ollama versions will not load newer models
|
||||
* **Rollback**: Stop Ollama and delete the downloaded model from `~/.ollama/models`
|
||||
* **Last Updated:** 03/06/2026
|
||||
* Model set to glm-4.7-flash; Ollama 0.15.0+; cleanup order and docs refresh
|
||||
* **Last Updated:** February 2025
|
||||
* Tailored for DGX Station with GB300 Ultra; added large-model recommendations
|
||||
|
||||
|
||||
|
||||
@ -94,91 +101,51 @@ spec:
|
||||
nvidia-smi
|
||||
```
|
||||
|
||||
**Expected output** (example): A table showing driver version and GPU(s). On DGX Station, the GPU name may appear as **NVIDIA GB300** (without "Ultra"):
|
||||
|
||||
```text
|
||||
+-----------------------------------------------------------------------------+
|
||||
| NVIDIA-SMI 5xx.xx Driver Version: 5xx.xx CUDA Version: 12.x |
|
||||
|-------------------------------+----------------------+----------------------+
|
||||
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
|
||||
| 0 NVIDIA GB300 On | 00000000:06:00.0 Off | 0 |
|
||||
...
|
||||
```
|
||||
Expected output should show a detected GPU (e.g. GB300 Ultra).
|
||||
|
||||
# Step 2. Install or update Ollama
|
||||
|
||||
**Description**: Install Ollama or ensure it is recent enough for modern coding models.
|
||||
|
||||
```bash
|
||||
curl -fsSL https://ollama.com/install.sh | sh
|
||||
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3 sh
|
||||
ollama --version
|
||||
```
|
||||
|
||||
To install a specific version (e.g. 0.15.0 or newer, required for GLM-4.7-Flash):
|
||||
|
||||
```bash
|
||||
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.15.0 sh
|
||||
```
|
||||
|
||||
If Ollama is already present and the version is 0.15.0 or newer, simply run:
|
||||
If the ollama is already present and the version is 0.14.3 or newer, simply run:
|
||||
|
||||
```bash
|
||||
ollama --version
|
||||
```
|
||||
|
||||
**Expected output** (example):
|
||||
|
||||
```text
|
||||
ollama version is 0.15.0
|
||||
```
|
||||
Expected output should show `ollama --version` as 0.14.3 or newer.
|
||||
|
||||
# Step 3. Pull a coding model
|
||||
|
||||
**Description**: Download the model weights to your DGX Station. This playbook supports two model options on Ollama; choose one (or both) depending on whether you want **fast loading and testing** or **best quality**.
|
||||
**Description**: Download the model weights to your DGX Station. This playbook uses **GLM-4.7** where available.
|
||||
|
||||
**For fast loading and testing** — **glm-4.7-flash** (~19 GB for `latest`; loads quickly; ensure Ollama 0.15.0+):
|
||||
**Recommended: GLM-4.7**:
|
||||
|
||||
```bash
|
||||
ollama pull glm-4.7-flash
|
||||
ollama pull glm-4.7
|
||||
```
|
||||
|
||||
**For best quality** — **unsloth/GLM-4.7-GGUF:Q8_0** from Hugging Face (larger, higher quality; supported on Ollama):
|
||||
|
||||
```bash
|
||||
ollama pull hf.co/unsloth/GLM-4.7-GGUF:Q8_0
|
||||
```
|
||||
|
||||
**Other glm-4.7-flash variants** on GB300 (more GPU memory; bf16 is ~60 GB):
|
||||
**High-quality variants** on GB300 Ultra (use more GPU memory for better quality):
|
||||
|
||||
```bash
|
||||
ollama pull glm-4.7-flash:q8_0
|
||||
ollama pull glm-4.7-flash:bf16
|
||||
```
|
||||
|
||||
**Expected output** (example): Progress lines followed by "success" and the model in `ollama list`:
|
||||
|
||||
```bash
|
||||
ollama list
|
||||
```
|
||||
|
||||
```text
|
||||
NAME ID SIZE MODIFIED
|
||||
glm-4.7-flash:latest abc123... 19 GB 1 minute ago
|
||||
unsloth/GLM-4.7-GGUF:Q8_0 def456... ... ...
|
||||
```
|
||||
Expected output should show your model in `ollama list`.
|
||||
|
||||
# Step 4. Test local inference
|
||||
|
||||
**Description**: Run a quick prompt to confirm the model loads. Use the same model name you pulled (e.g. `glm-4.7-flash` for fast testing, or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0` for best quality).
|
||||
**Description**: Run a quick prompt to confirm the model loads. Use the same model name you pulled (e.g. `glm-4.7`).
|
||||
|
||||
```bash
|
||||
ollama run glm-4.7-flash
|
||||
```
|
||||
|
||||
Or, if you pulled the larger model:
|
||||
|
||||
```bash
|
||||
ollama run hf.co/unsloth/GLM-4.7-GGUF:Q8_0
|
||||
ollama run glm-4.7
|
||||
```
|
||||
|
||||
Try a prompt like:
|
||||
@ -187,9 +154,7 @@ spec:
|
||||
Write a short README checklist for a Python project.
|
||||
```
|
||||
|
||||
**Expected output**: GLM-4.7-Flash may show **Thinking...** and reasoning text before the final answer, then the model's response. This is normal; wait for the reply to complete.
|
||||
|
||||
**Exit the Ollama REPL** when done: type `/bye` or press **Ctrl+D**.
|
||||
Expected output should show the model responding in the terminal.
|
||||
|
||||
# Step 5. Install Claude Code
|
||||
|
||||
@ -199,24 +164,16 @@ spec:
|
||||
curl -fsSL https://claude.ai/install.sh | sh
|
||||
```
|
||||
|
||||
**Verify the installation**:
|
||||
|
||||
```bash
|
||||
claude --version
|
||||
```
|
||||
|
||||
**Expected output** (example): A version string such as `claude 0.x.x` or similar. If you see `claude: command not found`, ensure the install script added the CLI to your PATH (e.g. restart the terminal or source your shell profile); see [Troubleshooting](troubleshooting.md).
|
||||
|
||||
# Step 6. Increase context length (optional)
|
||||
|
||||
**Description**: Ollama defaults to a 4096-token context length. For coding agents and larger codebases, set it to 64K tokens. This increases memory usage.
|
||||
For more details on configuring context length and other parameters, see the Ollama documentation (context window and runtime options).
|
||||
For more details on configuring context length, see the [Ollama documentation](https://ollama.com/docs/faq#how-can-i-increase-the-context-length).
|
||||
|
||||
Set the context length per session in the Ollama REPL (use the same model name you pulled, e.g. `glm-4.7-flash` or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0`):
|
||||
Set the context length per session in the Ollama REPL:
|
||||
|
||||
```bash
|
||||
ollama run glm-4.7-flash
|
||||
ollama run glm-4.7
|
||||
```
|
||||
|
||||
Then, in the Ollama prompt:
|
||||
@ -226,8 +183,6 @@ spec:
|
||||
|
||||
```
|
||||
|
||||
**Exit when done**: type `/bye` or press **Ctrl+D**.
|
||||
|
||||
Optional method (set globally when serving Ollama):
|
||||
|
||||
```bash
|
||||
@ -239,35 +194,16 @@ spec:
|
||||
|
||||
# Step 7. Connect Claude Code to Ollama
|
||||
|
||||
**Description**: Point Claude Code to the local Ollama server and launch it. Use the model you pulled: `glm-4.7-flash` (fast) or `hf.co/unsloth/GLM-4.7-GGUF:Q8_0` (best quality).
|
||||
**Description**: Point Claude Code to the local Ollama server and launch it. Use the model you pulled (e.g. GLM-4.7 or GLM-4.7-Flash).
|
||||
|
||||
```bash
|
||||
export ANTHROPIC_AUTH_TOKEN=ollama
|
||||
export ANTHROPIC_BASE_URL=http://localhost:11434
|
||||
|
||||
claude --model glm-4.7-flash
|
||||
claude --model glm-4.7
|
||||
```
|
||||
|
||||
If you are using the larger model:
|
||||
|
||||
```bash
|
||||
claude --model hf.co/unsloth/GLM-4.7-GGUF:Q8_0
|
||||
```
|
||||
|
||||
- **`ANTHROPIC_AUTH_TOKEN=ollama`**: Claude Code treats the literal value `ollama` as a special token that means "use the local Ollama backend" instead of Anthropic's cloud API. No real API key is needed when using Ollama.
|
||||
- **`ANTHROPIC_BASE_URL`**: Tells Claude Code to send requests to your local Ollama server at port 11434.
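
Before launching Claude Code, it can help to confirm that the endpoint in `ANTHROPIC_BASE_URL` actually answers. The sketch below is one way to do that, assuming Ollama's model-listing route `/api/tags` is served under the same base URL. The function names `ollama_tags_url` and `check_ollama` are illustrative only and are not part of Ollama or Claude Code.

```bash
# Hypothetical helpers (not part of Ollama or Claude Code).

# Build the model-listing URL from a base URL, dropping one trailing slash.
# Falls back to the default local address when no argument is given.
ollama_tags_url() {
  local base="${1:-http://localhost:11434}"
  printf '%s/api/tags\n' "${base%/}"
}

# Return 0 if the server answers within 5 seconds, non-zero otherwise.
check_ollama() {
  curl -fsS --max-time 5 "$(ollama_tags_url "${1:-}")" >/dev/null
}

# Usage:
#   check_ollama "$ANTHROPIC_BASE_URL" && claude --model glm-4.7-flash
```

If the check fails, the `connection refused` row in the troubleshooting table is the likely place to start.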
|
||||
|
||||
**Persist these variables** (optional) so you don't have to re-export every terminal session. Add to `~/.bashrc` or your shell profile (e.g. `~/.zshrc`):
|
||||
|
||||
```bash
echo 'export ANTHROPIC_AUTH_TOKEN=ollama' >> ~/.bashrc
echo 'export ANTHROPIC_BASE_URL=http://localhost:11434' >> ~/.bashrc
source ~/.bashrc
```
|
||||
|
||||
**Expected output**: Claude Code starts and uses the local model.
|
||||
|
||||
**Exit Claude Code** when done: type `/exit` or press **Ctrl+C**.
|
||||
Expected output should show Claude Code starting and using the local model.
|
||||
|
||||
# Step 8. Complete a small coding task
|
||||
|
||||
@ -287,13 +223,13 @@ spec:
|
||||
python -m pip install -U pytest
|
||||
```
|
||||
|
||||
In Claude Code, enter:
|
||||
In Claude Code:
|
||||
|
||||
```text
|
||||
Please implement add() in math_utils.py and make sure the test passes.
|
||||
```
|
||||
|
||||
**Exit Claude Code** when finished: type `/exit` or press **Ctrl+C**, then run the test:
|
||||
Run the test:
|
||||
|
||||
```bash
|
||||
python -m pytest -q
|
||||
@ -303,37 +239,26 @@ spec:
|
||||
|
||||
# Step 9. Cleanup and rollback
|
||||
|
||||
**Description**: Remove the model and stop the Ollama service if you no longer need them. **Remove the model first** (while the Ollama server is running), then stop the service.
|
||||
**Description**: Remove the model and stop services if you no longer need them.
|
||||
|
||||
> [!WARNING]
|
||||
> The following removes the downloaded model files from disk.
|
||||
|
||||
**1. Remove the model** (Ollama must be running). Use the same name you pulled:
|
||||
|
||||
```bash
|
||||
ollama rm glm-4.7-flash
|
||||
```
|
||||
|
||||
Or, for the Hugging Face model:
|
||||
|
||||
```bash
|
||||
ollama rm hf.co/unsloth/GLM-4.7-GGUF:Q8_0
|
||||
```
|
||||
|
||||
Use the exact tag you pulled (e.g. `glm-4.7-flash:bf16` if you used that variant).
|
||||
|
||||
**2. Stop the Ollama service**:
|
||||
To stop the service:
|
||||
|
||||
```bash
|
||||
sudo systemctl stop ollama
|
||||
```
|
||||
|
||||
> [!WARNING]
|
||||
> This will delete the downloaded model files.
|
||||
|
||||
```bash
|
||||
ollama rm glm-4.7
|
||||
```
|
||||
|
||||
# Step 10. Next steps
|
||||
|
||||
- **Fast loading and testing:** use **glm-4.7-flash** for quick iteration and smaller downloads.
|
||||
- **Best quality:** use **unsloth/GLM-4.7-GGUF:Q8_0** (Hugging Face on Ollama) or **glm-4.7-flash** high-quality variants (`glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) on DGX Station (NVIDIA GB300).
|
||||
- Use larger context (e.g. 64K–198K) for big codebases.
|
||||
- Use Claude Code on multi-file refactors or test-generation tasks.
|
||||
- Use **GLM-4.7** or high-quality variants (`glm-4.7-flash:bf16`, `glm-4.7-flash:q8_0`) on GB300 Ultra for best quality
|
||||
- Use larger context (e.g. 64K–198K) for big codebases
|
||||
- Use Claude Code on multi-file refactors or test-generation tasks
|
||||
|
||||
|
||||
|
||||
@ -345,15 +270,18 @@ spec:
|
||||
| Symptom | Cause | Fix |
|
||||
|---------|-------|-----|
|
||||
| `ollama: command not found` | Ollama not installed or PATH not updated | Rerun `curl -fsSL https://ollama.com/install.sh | sh` and open a new shell |
|
||||
| Model load fails with version error | Ollama is older than 0.15.0 | Update Ollama to 0.15.0 or newer (required for GLM-4.7-Flash). Do not pin to 0.14.3. |
|
||||
| `model not found` in Claude Code | Model was not pulled | Run `ollama pull glm-4.7-flash` or `ollama pull hf.co/unsloth/GLM-4.7-GGUF:Q8_0` and retry. Use the same model name in `claude --model ...`. |
|
||||
| `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `sudo systemctl start ollama` |
|
||||
| Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station (NVIDIA GB300), ensure no other heavy GPU workloads. If OOM persists, use a smaller variant (e.g. `glm-4.7-flash:q8_0` or `glm-4.7-flash:q4_K_M`) or `OLLAMA_MAX_LOADED_MODELS=1`. |
|
||||
| `claude: command not found` after install | CLI not on PATH or install script did not complete | Restart the terminal or run `source ~/.bashrc` (or your shell profile). Check the install script output for the install path and add it to PATH. |
|
||||
| Claude Code install fails (Node.js / network) | Node.js missing or install script cannot download | Ensure Node.js is installed (`node --version`). If the install script fails with a network error, retry from a stable connection or download the Claude Code CLI from the official site. See [Claude Code documentation](https://claude.ai/docs) for alternatives. |
|
||||
| Model load fails with version error | Ollama is older than 0.14.3 | Update Ollama to 0.14.3 or newer |
|
||||
| `model not found` in Claude Code | Model was not pulled | Run `ollama pull glm-4.7-flash` or `ollama pull glm-4.7` and retry |
|
||||
| `opencode: command not found` | OpenCode not installed or PATH not updated | Install OpenCode and open a new shell |
|
||||
| OpenCode cannot reach Ollama | `baseURL` misconfigured or Ollama not running | Set `baseURL` to `http://localhost:11434/v1` and start Ollama |
|
||||
| `codex: command not found` | Codex CLI not installed or PATH not updated | Install Codex CLI and open a new shell |
|
||||
| Codex CLI uses the wrong model/provider | `~/.codex/config.toml` not pointing to Ollama | Set `model_provider = "ollama"` and `base_url = "http://localhost:11434/v1"` |
|
||||
| `connection refused` to localhost:11434 | Ollama service not running | Start with `ollama serve` or `systemctl start ollama` |
|
||||
| Slow responses or OOM | Insufficient GPU memory or fragmentation | On DGX Station GB300 Ultra, ensure no other heavy GPU workloads. If OOM persists, use a smaller variant (e.g. `glm-4.7-flash:q8_0` or `glm-4.7-flash:q4_K_M`) or `OLLAMA_MAX_LOADED_MODELS=1`. |
|
||||
|
||||
> [!NOTE]
|
||||
> DGX Station with **NVIDIA GB300** provides ample GPU memory for **glm-4.7-flash** (fast testing) and **unsloth/GLM-4.7-GGUF:Q8_0** (best quality), plus variants (e.g. `glm-4.7-flash:bf16`). Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.
|
||||
> DGX Station with GB300 Ultra provides ample GPU memory for **GLM-4.7** and **GLM-4.7-Flash** in high-quality
|
||||
> variants (e.g. `glm-4.7-flash:bf16`). Use `OLLAMA_MAX_LOADED_MODELS=1` if you hit memory limits with multiple models.
|
||||
|
||||
|
||||
|
||||
@ -363,15 +291,31 @@ spec:
|
||||
url: https://ollama.com/docs
|
||||
|
||||
|
||||
- name: GLM-4.7-Flash
|
||||
- name: GLM-4.7-Flash (Ollama)
|
||||
url: https://ollama.com/library/glm-4.7-flash
|
||||
|
||||
|
||||
- name: Unsloth GLM-4.7-GGUF
|
||||
url: https://huggingface.co/unsloth/GLM-4.7-GGUF
|
||||
- name: GLM-4.7 (Ollama)
|
||||
url: https://ollama.com/library/glm-4.7
|
||||
|
||||
|
||||
- name: Claude Code + Ollama Guide
|
||||
url: https://ollama.com/blog/claude
|
||||
|
||||
|
||||
- name: OpenCode Ollama Provider
|
||||
url: https://opencode.ai/docs/providers/#ollama
|
||||
|
||||
|
||||
- name: Codex + Ollama Guide
|
||||
url: https://ollama.com/blog/codex
|
||||
|
||||
|
||||
- name: DGX Station Documentation
|
||||
url: https://docs.nvidia.com/dgx/dgx-station
|
||||
|
||||
|
||||
- name: DGX Station Forum
|
||||
url: https://forums.developer.nvidia.com/c/accelerated-computing/dgx-station
|
||||
|
||||
|
||||
|
||||
@ -54,7 +54,7 @@ spec:
|
||||
|
||||
- **Enable MIG** on all B300 GPUs or on a per-GPU basis.
|
||||
- **Create a MIG layout** using B300 profile IDs (with a known-good example for multiple GPUs).
|
||||
- **Verify** the layout with `nvidia-smi -L` and `sudo nvidia-smi mig -lgi` / `-lci`.
|
||||
- **Verify** the layout with `nvidia-smi -L` and `nvidia-smi mig -lgi` / `-lci`.
|
||||
- **Run workloads** by setting `CUDA_VISIBLE_DEVICES` to a MIG UUID or by using the container/Kubernetes flows from the MIG User Guide.
|
||||
- **Disable MIG** when you need full-GPU mode and NVLink again.
|
||||
|
||||
@ -73,7 +73,7 @@ spec:
|
||||
|
||||
**Software:**
|
||||
|
||||
- NVIDIA driver and `nvidia-smi` installed and working: `nvidia-smi`. Use a driver version that supports MIG on B300 (see [Troubleshooting](troubleshooting.md) for version guidance; if `nvidia-smi -mig 1` reports "MIG mode not supported" or similar, the driver may be too old).
|
||||
- NVIDIA driver and `nvidia-smi` installed and working: `nvidia-smi`
|
||||
- Root or sudo access to run `nvidia-smi -mig 1`, `-mig 0`, and `nvidia-smi mig -cgi ... -C`
|
||||
- For containers/K8s: nvidia-container-toolkit and MIG support as described in the [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/)
|
||||
|
||||
@ -81,15 +81,14 @@ spec:
|
||||
|
||||
This playbook does not use repository assets; all steps use `nvidia-smi` and MIG commands on the DGX Station. For container and Kubernetes setup, use the official [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) (Getting Started with MIG and Kubernetes sections).
|
||||
|
||||
|
||||
# Time & risk
|
||||
|
||||
- **Estimated time:** About 15 minutes to enable MIG, create a layout, and verify. Layout design (which profiles per GPU) may take longer if you customize.
|
||||
- **Risk level:** Low to Medium
|
||||
- Enabling or disabling MIG requires sudo and affects all workloads on that GPU.
|
||||
- Disabling MIG removes all MIG instances; ensure Fabric Manager is running on DGX/HGX B200/B300 so NVLink/NVSwitch re-initialize correctly.
|
||||
- **Rollback:** Destroy all MIG instances with `sudo nvidia-smi mig -dci -i N` and `sudo nvidia-smi mig -dgi -i N` for each GPU index N, then run `sudo nvidia-smi -mig 0` to disable MIG and return to a single full-GPU instance per GB300. Ensure **Fabric Manager** is running after disabling MIG: `sudo systemctl status nvidia-fabricmanager` (start if needed: `sudo systemctl start nvidia-fabricmanager`).
|
||||
- **Last Updated:** 03/02/2026
|
||||
- **Rollback:** Run `sudo nvidia-smi -mig 0` to disable MIG and return to a single full-GPU instance per B300.
|
||||
- **Last Updated:** February 2025
|
||||
- First publication.
|
||||
|
||||
|
||||
@ -101,26 +100,18 @@ spec:
|
||||
content: |
|
||||
# Step 1. Prerequisites and verify B300 GPUs
|
||||
|
||||
Ensure your DGX Station has B300 GPUs (GB300 Ultra), a supported NVIDIA driver (see [Troubleshooting](troubleshooting.md) for driver requirements), and that `nvidia-smi` is available. You need root or sudo to enable MIG and create instances.
|
||||
|
||||
**Before enabling MIG:** All GPU processes must be stopped. Desktop environments (e.g. GNOME, Xwayland), NVIDIA services (e.g. nvsm_core, nvidia-pe, nv-hostengine), or workloads like vLLM can hold the GPU and cause "In use by another client" when you run MIG commands. Check what is using the GPUs:
|
||||
|
||||
```bash
sudo fuser -v /dev/nvidia*
```
|
||||
|
||||
Stop or suspend any processes that are using the GPUs before proceeding to Step 2.
|
||||
Ensure your DGX Station is running with B300 GPUs (GB300 Ultra) and that the NVIDIA driver and `nvidia-smi` are available. You need root or sudo to enable MIG and create instances.
|
||||
|
||||
```bash
|
||||
nvidia-smi
|
||||
nvidia-smi -L
|
||||
```
|
||||
|
||||
Expected output should list one or more **NVIDIA GB300** devices. If you see GB300 GPUs, you can proceed to enable MIG.
|
||||
Expected output should list one or more **NVIDIA B300** devices (e.g. `NVIDIA B300 SXM6 AC`). If you see B300 GPUs, you can proceed to enable MIG.
|
||||
|
||||
# Step 2. Enable MIG mode on the B300 GPUs
|
||||
|
||||
Ensure no GPU processes are running (see Step 1). Enable MIG for all GPUs or for a specific GPU. This must be done with elevated privileges.
|
||||
Enable MIG for all GPUs in the system or for a specific GPU. This must be done with elevated privileges.
|
||||
|
||||
**Enable MIG on all GPUs:**
|
||||
|
||||
@ -134,10 +125,6 @@ spec:
|
||||
sudo nvidia-smi -i 0 -mig 1
|
||||
```
|
||||
|
||||
**Expected output:** Success typically shows no error message; the command returns to the prompt. If you see "In use by another client", stop all GPU processes (e.g. desktop, services, containers) and run `sudo fuser -v /dev/nvidia*` to confirm nothing is using the GPUs, then retry.
|
||||
|
||||
If MIG mode shows **Pending** after enablement (e.g. in `nvidia-smi -q | grep -i mig`), wait a short time and run the command again, or reboot the system to allow the driver to apply the MIG state.
|
||||
|
||||
Enabling MIG partitions each B300 into multiple GPU Instances; you will create and assign profiles in the next steps.
|
||||
|
||||
# Step 3. Verify MIG mode and inspect B300 profiles
|
||||
@ -158,15 +145,15 @@ spec:
|
||||
nvidia-smi mig -lgip -i 0
|
||||
```
|
||||
|
||||
On GB300 you should see profiles such as the following (exact memory sizes may vary by driver; use the IDs in commands):
|
||||
On B300 you should see profiles such as:
|
||||
|
||||
- MIG 1g.35gb (ID 19)
|
||||
- MIG 1g.35gb+me (ID 20)
|
||||
- MIG 1g.70gb (ID 15)
|
||||
- MIG 2g.70gb (ID 14)
|
||||
- MIG 3g.139gb (ID 9)
|
||||
- MIG 4g.139gb (ID 5)
|
||||
- MIG 7g.278gb (ID 0)
|
||||
- MIG 1g.34gb (ID 19)
|
||||
- MIG 1g.34gb+me (ID 20)
|
||||
- MIG 1g.67gb (ID 15)
|
||||
- MIG 2g.67gb (ID 14)
|
||||
- MIG 3g.135gb (ID 9)
|
||||
- MIG 4g.135gb (ID 5)
|
||||
- MIG 7g.269gb (ID 0)
|
||||
|
||||
Note the **IDs**; you will pass them to `-cgi` when creating the layout.
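
Layouts that repeat one profile several times need the same ID written out once per instance. As a minimal sketch, the hypothetical helper `repeat_profile` below builds that comma-separated list; it only prints the resulting command so you can review it before running anything with `sudo`. The profile ID 19 and GPU index 0 are taken from the list above and may differ on your system.

```bash
# Hypothetical helper: repeat one profile ID N times, comma-separated.
repeat_profile() {
  local id="$1" count="$2" out="" i=0
  while [ "$i" -lt "$count" ]; do
    out="${out:+$out,}$id"   # add a comma only between entries
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

# Print (do not run) the command for seven instances of profile 19 on GPU 0.
echo "sudo nvidia-smi mig -cgi $(repeat_profile 19 7) -C -i 0"
```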
|
||||
|
||||
@ -178,29 +165,29 @@ spec:
|
||||
sudo nvidia-smi mig -cgi <profile_id,profile_id,...> -C -i <gpu_index>
|
||||
```
|
||||
|
||||
This example assumes a **6-GPU** DGX Station. If you have fewer GPUs (e.g. 1 or 2), run only the `-cgi` lines for the GPU indices that exist on your system (e.g. `-i 0` and `-i 1` only). Each GPU can have any combination of profiles that fits within its capacity:
|
||||
Example layout for a 6-GPU DGX Station (adjust GPU indices and counts to match your system). Each GPU can have any combination of profiles that fits within its capacity:
|
||||
|
||||
```bash
|
||||
# GPU 0: 7 × 1g.35gb
|
||||
# GPU 0: 7 × 1g.34gb
|
||||
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C -i 0
|
||||
|
||||
# GPU 1: 4 × 1g.70gb
|
||||
# GPU 1: 4 × 1g.67gb
|
||||
sudo nvidia-smi mig -cgi 15,15,15,15 -C -i 1
|
||||
|
||||
# GPU 2: 3 × 2g.70gb
|
||||
# GPU 2: 3 × 2g.67gb
|
||||
sudo nvidia-smi mig -cgi 14,14,14 -C -i 2
|
||||
|
||||
# GPU 3: 2 × 3g.139gb
|
||||
# GPU 3: 2 × 3g.135gb
|
||||
sudo nvidia-smi mig -cgi 9,9 -C -i 3
|
||||
|
||||
# GPU 4: 1 × 4g.139gb
|
||||
# GPU 4: 1 × 4g.135gb
|
||||
sudo nvidia-smi mig -cgi 5 -C -i 4
|
||||
|
||||
# GPU 5: 1 × 7g.278gb (full GPU as a single MIG instance)
|
||||
# GPU 5: 1 × 7g.269gb (full GPU as a single MIG instance)
|
||||
sudo nvidia-smi mig -cgi 0 -C -i 5
|
||||
```
|
||||
|
||||
You can choose any valid combination of profile IDs per GPU that fits within the GB300’s capacity; the above is a known-good example.
|
||||
You can choose any valid combination of profile IDs per GPU that fits within the B300’s capacity; the above is a known-good example. If your DGX Station has fewer than 6 GPUs, run only the `-i <N>` commands for GPUs that exist (e.g. 0 and 1 only).
|
||||
|
||||
# Step 5. Verify MIG instances
|
||||
|
||||
@ -210,23 +197,23 @@ spec:
|
||||
nvidia-smi -L
|
||||
```
|
||||
|
||||
You should see each physical GPU (e.g. **NVIDIA GB300**) followed by its MIG devices, for example:
|
||||
You should see each physical **NVIDIA B300 SXM6 AC** followed by its MIG devices, for example:
|
||||
|
||||
```
|
||||
GPU 0: NVIDIA GB300 (UUID: GPU-...)
|
||||
MIG 1g.35gb Device 0: (UUID: MIG-...)
|
||||
MIG 1g.35gb Device 1: (UUID: MIG-...)
|
||||
GPU 0: NVIDIA B300 SXM6 AC (UUID: GPU-...)
|
||||
MIG 1g.34gb Device 0: (UUID: MIG-...)
|
||||
MIG 1g.34gb Device 1: (UUID: MIG-...)
|
||||
...
|
||||
GPU 1: NVIDIA GB300 (UUID: GPU-...)
|
||||
MIG 1g.70gb Device 0: (UUID: MIG-...)
|
||||
GPU 1: NVIDIA B300 SXM6 AC (UUID: GPU-...)
|
||||
MIG 1g.67gb Device 0: (UUID: MIG-...)
|
||||
...
|
||||
```
|
||||
|
||||
To list GPU instances and compute instances (requires sudo):
|
||||
To list GPU instances and compute instances:
|
||||
|
||||
```bash
|
||||
sudo nvidia-smi mig -lgi # list GPU instances
|
||||
sudo nvidia-smi mig -lci # list compute instances
|
||||
nvidia-smi mig -lgi # list GPU instances
|
||||
nvidia-smi mig -lci # list compute instances
|
||||
```
|
||||
|
||||
# Step 6. Using the MIG devices
|
||||
@ -238,60 +225,20 @@ spec:
|
||||
./your_app
|
||||
```
|
||||
|
||||
**Verify a MIG instance is visible:** From the same shell where you set `CUDA_VISIBLE_DEVICES`, run `nvidia-smi`. You should see only the single MIG device (e.g. one "MIG 1g.35gb" device). Example:
|
||||
|
||||
```bash
export CUDA_VISIBLE_DEVICES=MIG-<uuid-from-nvidia-smi-L>
nvidia-smi
```
|
||||
|
||||
**Containers (Docker):** Use the MIG device UUID in the `--gpus` option. Example:
|
||||
|
||||
```bash
docker run --gpus '"device=MIG-<uuid>"' nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 nvidia-smi
```
|
||||
|
||||
Replace `<uuid>` with a full MIG UUID from `nvidia-smi -L`. For Kubernetes and nvidia-container-toolkit workflows, see the [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) (Getting Started with MIG and Kubernetes sections).
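
Copying a MIG UUID by hand is error-prone, so a small filter can pull the UUIDs out of the `nvidia-smi -L` listing. This is a sketch under the assumption that each UUID appears as `MIG-` followed by letters, digits, and hyphens, as in the example output above; `list_mig_uuids` is a hypothetical name, not an NVIDIA tool.

```bash
# Hypothetical helper: read `nvidia-smi -L` text on stdin and print one
# MIG UUID per line. Profile names such as "MIG 1g.35gb" are skipped
# because they contain a space instead of a hyphen after "MIG".
list_mig_uuids() {
  grep -o 'MIG-[0-9A-Za-z-]*'
}

# Usage (selects the first MIG device found):
#   export CUDA_VISIBLE_DEVICES="$(nvidia-smi -L | list_mig_uuids | head -n 1)"
```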
|
||||
|
||||
**Containers and Kubernetes:** use the NVIDIA MIG User Guide “Getting Started with MIG” and the Kubernetes sections. They cover the nvidia-container-toolkit, device plugin, and nvidia-mig-manager workflows for exposing MIG instances to containers.
|
||||
|
||||
# Step 7. Disabling MIG and restoring full GPU
|
||||
|
||||
When you need full NVLink P2P and a single full-GPU instance again, you must **destroy all MIG instances first**, then disable MIG. If you run `sudo nvidia-smi -mig 0` without destroying instances, it will fail with "In use by another client."
|
||||
|
||||
**1. Destroy compute instances and GPU instances on each GPU.** For each GPU index that has MIG instances, run (replace `N` with the GPU index, e.g. 0, 1, … 5 for a 6-GPU system):
|
||||
|
||||
```bash
# Destroy all compute instances on GPU N (required before destroying GPU instances)
sudo nvidia-smi mig -dci -i N

# Destroy all GPU instances on GPU N
sudo nvidia-smi mig -dgi -i N
```
|
||||
|
||||
Repeat for every GPU that has MIG instances. Example for a 6-GPU system:
|
||||
|
||||
```bash
for i in 0 1 2 3 4 5; do sudo nvidia-smi mig -dci -i "$i"; sudo nvidia-smi mig -dgi -i "$i"; done
```
|
||||
|
||||
**2. Disable MIG mode on all GPUs:**
|
||||
When you need full NVLink P2P and a single full-GPU instance again, disable MIG on all GPUs:
|
||||
|
||||
> [!WARNING]
|
||||
> This returns each GB300 to a single full-GPU instance. Any workloads using MIG UUIDs must be stopped first and will need to be reconfigured or restarted.
|
||||
> This removes all MIG instances and returns each B300 to a single full-GPU instance. Any workloads using MIG UUIDs will need to be reconfigured or restarted.
|
||||
|
||||
```bash
|
||||
sudo nvidia-smi -mig 0
|
||||
```
|
||||
|
||||
**3. Verify MIG is fully disabled:**
|
||||
|
||||
```bash
nvidia-smi -q | grep -A2 "MIG Mode"
```
|
||||
|
||||
Expected output should show `Current: Disabled` for each GPU.
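
On a multi-GPU system it is easy to miss one GPU in the `grep` output. The sketch below turns the check into a pass/fail result. It assumes the `nvidia-smi -q` layout in which a `MIG Mode` heading is followed by a `Current : <state>` line; `mig_all_disabled` is a hypothetical helper name.

```bash
# Hypothetical helper: read `nvidia-smi -q` text on stdin and succeed only
# if no "Current" line under a "MIG Mode" heading reports Enabled.
mig_all_disabled() {
  awk '
    /MIG Mode/          { in_mig = 1; next }
    in_mig && /Current/ { if ($NF == "Enabled") bad = 1; in_mig = 0 }
    END                 { exit bad }
  '
}

# Usage:
#   nvidia-smi -q | mig_all_disabled && echo "MIG is disabled on every GPU"
```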
|
||||
|
||||
On DGX/HGX B200/B300, ensure **Fabric Manager** is running after disabling MIG so NVLinks and NVSwitch fabric are re-initialized (see [Troubleshooting](troubleshooting.md)).
|
||||
This resets the GPUs. On DGX/HGX B200/B300, ensure **Fabric Manager** is running so that NVLinks and NVSwitch fabric routing are re-initialized after MIG is disabled.
|
||||
|
||||
|
||||
|
||||
@ -302,50 +249,12 @@ spec:
|
||||
content: |
|
||||
| Symptom | Cause | Fix |
|
||||
|--------|--------|-----|
|
||||
| `nvidia-smi -mig 1` fails or "MIG mode not supported" | Driver too old or GPU not MIG-capable | Use a driver version that supports MIG on GB300 (see [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) for supported versions). Check `nvidia-smi -q` for driver and GPU model. Update the driver if it is too old. |
|
||||
| "In use by another client" when running `-mig 1`, `-cgi`, or `-mig 0` | GPU is held by another process or MIG instances still exist | **For enable/create:** Stop all GPU processes (desktop, vLLM, nvsm_core, nvidia-pe, nv-hostengine, etc.). Run `sudo fuser -v /dev/nvidia*` to see what is using the GPUs; stop those processes and retry. **For disable:** You must destroy all MIG instances first: run `sudo nvidia-smi mig -dci -i N` then `sudo nvidia-smi mig -dgi -i N` for each GPU index N that has instances, then run `sudo nvidia-smi -mig 0`. |
|
||||
| `nvidia-smi mig -cgi ... -C -i N` fails (e.g. "Invalid combination") | Profile combination exceeds GPU capacity or invalid IDs | Run `nvidia-smi mig -lgip -i N` and use only listed profile IDs. Ensure the sum of instance sizes does not exceed the GB300's capacity for that GPU. |
|
||||
| MIG instances not visible after creation | Instances not created or wrong GPU index | Run `nvidia-smi -L` and `sudo nvidia-smi mig -lgi` to confirm. Re-run the `-cgi` commands for the correct `-i <gpu_index>`. |
|
||||
| App doesn't see MIG device when using CUDA_VISIBLE_DEVICES=MIG-<uuid> | Wrong UUID or app not using CUDA_VISIBLE_DEVICES | Get UUIDs from `nvidia-smi -L`. Export `CUDA_VISIBLE_DEVICES=MIG-<uuid>` in the same shell before launching the app. |
|
||||
| "Insufficient Permissions" when running `nvidia-smi mig -lgi` or `-lci` | Listing instances requires root | Use `sudo nvidia-smi mig -lgi` and `sudo nvidia-smi mig -lci`. |
|
||||
| After `nvidia-smi -mig 0`, NVLink or fabric issues on DGX/HGX | Fabric Manager not re-initializing | Ensure Fabric Manager is running after disabling MIG: `sudo systemctl status nvidia-fabricmanager`; start if needed with `sudo systemctl start nvidia-fabricmanager`. See [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) for details. |
|
||||
| Permission denied when running nvidia-smi -mig or mig -cgi | Need root for MIG operations | Use `sudo` for `nvidia-smi -mig 1/0`, `nvidia-smi mig -cgi ... -C`, `-dci`, and `-dgi`. |
|
||||
|
||||
## MIG reconfiguration (day-2 operations)
|
||||
|
||||
To change the MIG layout (e.g. add or remove instances, or switch profiles), destroy existing instances on the affected GPU(s), then create the new layout:
|
||||
|
||||
1. **Destroy compute instances and GPU instances** on each GPU you want to reconfigure (replace `N` with the GPU index):
|
||||
```bash
|
||||
sudo nvidia-smi mig -dci -i N
|
||||
sudo nvidia-smi mig -dgi -i N
|
||||
```
|
||||
2. **Create the new layout** with `sudo nvidia-smi mig -cgi <profile_ids> -C -i N` as in the Instructions (Step 4).
|
||||
|
||||
Workloads using the old MIG UUIDs must be stopped before destroying instances; they will need to be restarted with the new UUIDs from `nvidia-smi -L` after recreation.
|
||||
|
||||
## Profile selection guidance
|
||||
|
||||
| Profile (typical name) | Use case |
|
||||
|------------------------|----------|
|
||||
| 1g.35gb (ID 19) | Small inference, dev/test, many concurrent small jobs |
|
||||
| 1g.70gb (ID 15) | Slightly larger inference or light training |
|
||||
| 2g.70gb (ID 14) | Medium inference or small training |
|
||||
| 3g.139gb (ID 9) | Larger inference or medium training |
|
||||
| 4g.139gb (ID 5) | Heavy inference or moderate training |
|
||||
| 7g.278gb (ID 0) | Full-GPU as single MIG instance; max memory per partition |
|
||||
|
||||
Exact profile names may vary by driver (e.g. 1g.34gb vs 1g.35gb); use the **profile IDs** from `nvidia-smi mig -lgip -i 0` in your `-cgi` commands.
|
||||
|
||||
## Post-disable verification
|
||||
|
||||
After running `sudo nvidia-smi -mig 0`, confirm MIG is fully disabled:
|
||||
|
||||
```bash
nvidia-smi -q | grep -A2 "MIG Mode"
```
|
||||
|
||||
Expected output should show `Current: Disabled` for each GPU. If you still see MIG devices in `nvidia-smi -L`, destroy any remaining instances with `-dci`/`-dgi` per GPU, then run `-mig 0` again.
|
||||
| `nvidia-smi -mig 1` fails or "MIG mode not supported" | Driver too old or GPU not MIG-capable | Ensure you have a B300 (or other MIG-capable GPU) and a driver version that supports MIG on B300. Check `nvidia-smi -q` and [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) for supported hardware/driver. |
|
||||
| `nvidia-smi mig -cgi ... -C -i N` fails (e.g. "Invalid combination") | Profile combination exceeds GPU capacity or invalid IDs | Run `nvidia-smi mig -lgip -i N` and use only listed profile IDs. Ensure the sum of instance sizes does not exceed the B300’s capacity for that GPU. |
|
||||
| MIG instances not visible after creation | Instances not created or wrong GPU index | Run `nvidia-smi -L` and `nvidia-smi mig -lgi` to confirm. Re-run the `-cgi` commands for the correct `-i <gpu_index>`. |
|
||||
| App doesn’t see MIG device when using CUDA_VISIBLE_DEVICES=MIG-<uuid> | Wrong UUID or app not using CUDA_VISIBLE_DEVICES | Get UUIDs from `nvidia-smi -L`. Export `CUDA_VISIBLE_DEVICES=MIG-<uuid>` in the same shell before launching the app. |
|
||||
| After `nvidia-smi -mig 0`, NVLink or fabric issues on DGX/HGX | Fabric Manager not re-initializing | On DGX/HGX B200/B300, ensure Fabric Manager is running after disabling MIG so NVLinks and NVSwitch fabric are re-initialized. |
|
||||
| Permission denied when running nvidia-smi -mig or mig -cgi | Need root for MIG operations | Use `sudo` for `nvidia-smi -mig 1/0` and `nvidia-smi mig -cgi ... -C`. |
|
||||
|
||||
|
||||
|
||||
|
||||
@ -82,7 +82,6 @@ spec:
|
||||
df -h .
|
||||
```
|
||||
|
||||
|
||||
# Time & risk
|
||||
|
||||
* **Estimated duration**: 45-90 minutes depending on network speed and model size
|
||||
@ -91,8 +90,6 @@ spec:
|
||||
* Quantization process is memory-intensive and may fail on systems with insufficient GPU memory
|
||||
* Output files are large (several GB) and require adequate storage space
|
||||
* **Rollback**: Remove the output directory and any pulled Docker images to restore original state.
|
||||
* **Last Updated:** 03/02/2026
|
||||
* First Publication
|
||||
|
||||
|
||||
|
||||
@ -164,7 +161,7 @@ spec:
|
||||
In this example, the GB300 is device **1**. Note this number for use in Docker commands.
|
||||
|
||||
> [!NOTE]
|
||||
> The examples below assume the GB300 is device 1. If your GPU has a different ID, adjust the `--gpus "device=X"` parameter in the Docker commands accordingly.
|
||||
> The examples below assume the GB300 is device 1. If your GPU has a different ID, adjust the `--gpus '"device=X"'` parameter in the Docker commands accordingly.
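
The nested quoting in `--gpus '"device=X"'` is easy to get wrong when the device ID comes from a variable. As a sketch, the hypothetical helper `gpus_arg` below produces the value with the inner double quotes kept literal. To the best of our understanding those inner quotes matter mainly when several comma-separated devices are listed, so treat this as a convenience rather than a requirement.

```bash
# Hypothetical helper: print the value for docker's --gpus flag, keeping
# the inner double quotes as literal characters.
gpus_arg() {
  printf '"device=%s"\n' "$1"
}

# Usage (device ID 1 as in the examples; adjust to your system):
#   docker run --gpus "$(gpus_arg 1)" ... 
```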
|
||||
|
||||
# Step 5. Run the quantization process using TensorRT Model Optimizer
|
||||
|
||||
@ -194,7 +191,7 @@ spec:
|
||||
|
||||
This command:
|
||||
|
||||
- Runs the container with access to the specified GPU (device 1) and optimized shared memory settings
|
||||
- Runs the container with full GPU access and optimized shared memory settings
|
||||
- Mounts your output directory to persist quantized model files
|
||||
- Mounts your Hugging Face cache to avoid re-downloading the model
|
||||
- Clones and installs the TensorRT Model Optimizer from source
|
||||
@ -232,7 +229,7 @@ spec:
|
||||
-e HF_TOKEN=$HF_TOKEN \
|
||||
-v "$MODEL_PATH:/workspace/model" \
|
||||
--rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
|
||||
--gpus "device=1" --ipc=host --network host \
|
||||
--gpus '"device=1"' --ipc=host --network host \
|
||||
nvcr.io/nvidia/vllm:25.12.post1-py3 \
|
||||
vllm serve /workspace/model \
|
||||
--max-model-len 4096 \
|
||||
@ -255,7 +252,7 @@ spec:
|
||||
-e HF_TOKEN=$HF_TOKEN \
|
||||
-v "$MODEL_PATH:/workspace/model" \
|
||||
--rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
|
||||
--gpus "device=1" --ipc=host --network host \
|
||||
--gpus '"device=1"' --ipc=host --network host \
|
||||
nvcr.io/nvidia/vllm:25.12.post1-py3 \
|
||||
vllm serve /workspace/model \
|
||||
--backend pytorch \
|
||||
@ -264,13 +261,13 @@ spec:
|
||||
--port 8000
|
||||
```
|
||||
|
||||
When serving from a local path, vLLM exposes the model name as the path's last component (here, `model`). Run the following to test the server (use the same model name vLLM reports, e.g. from `curl http://localhost:8000/v1/models`):
|
||||
Run the following to test the server with a `curl` request:
|
||||
|
||||
```bash
|
||||
curl -X POST http://localhost:8000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "model",
|
||||
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
|
||||
"messages": [{"role": "user", "content": "What is artificial intelligence?"}],
|
||||
"max_tokens": 100,
|
||||
"temperature": 0.7,
|
||||
|
||||
@ -32,7 +32,7 @@ spec:
|
||||
|
||||
cta:
|
||||
text: View on GitHub
|
||||
url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-topic-modeling/
|
||||
url: https://github.com/NVIDIA/dgx-station-playbooks/blob/main/nvidia/station-topic-modeling/
|
||||
|
||||
|
||||
tabs:
|
||||
@ -80,11 +80,10 @@ spec:
|
||||
|
||||
# Ancillary files
|
||||
|
||||
All required assets are in the playbook directory `nvidia/station-topic-modeling/assets` (see [Instructions](https://build.nvidia.com/station/topic-modeling/instructions), Step 7). Key file:
|
||||
All required assets are in the playbook directory `nvidia/station-topic-modeling/assets` (see Step 7). Key file:
|
||||
|
||||
- `video_notebook_for_GPU_Accelerated_Machine_Learning_BERTopic_RTX6000_40M.ipynb` - Complete Jupyter notebook with GPU-accelerated topic modeling pipeline (filename reflects original demo hardware; the notebook runs on GB300 and other NVIDIA GPUs)
|
||||
|
||||
|
||||
# Time & risk
|
||||
|
||||
* **Estimated time:** 45 minutes (includes environment setup, dataset download, and embedding generation)
|
||||
@ -92,7 +91,7 @@ spec:
|
||||
* Large dataset download (~14GB) may take time depending on network speed
|
||||
* Embedding generation requires significant GPU memory
|
||||
* **Rollback:** Delete the downloaded dataset and any generated embedding files to restore state
|
||||
* **Last Updated:** 03/02/2026
|
||||
* **Last Updated:** 02/05/2026
|
||||
* First Publication
|
||||
|
||||
|
||||
@ -135,7 +134,6 @@ spec:
|
||||
# Step 4. Install machine learning packages
|
||||
|
||||
Install UMAP, HDBSCAN, BERTopic, and supporting libraries for topic modeling.
|
||||
Note: `datamapplot` will upgrade dask/distributed — the next step pins them back.
|
||||
|
||||
```bash
|
||||
pip install \
|
||||
@ -144,15 +142,7 @@ spec:
|
||||
scikit-learn==1.4.2 datamapplot
|
||||
```
|
||||
|
||||
Pin dask/distributed back to RAPIDS-compatible versions:
|
||||
|
||||
```bash
|
||||
pip install "dask==2025.9.1" "distributed==2025.9.1"
|
||||
```
|
||||
|
||||
These packages provide:
|
||||
- **dask**: Parallel computing library
|
||||
- **distributed**: Distributed task scheduler for dask
|
||||
- **sentence-transformers**: Generate text embeddings
|
||||
- **umap-learn / hdbscan**: Dimensionality reduction and clustering (GPU-accelerated via cuML)
|
||||
- **bertopic**: Topic modeling framework
|
||||
@ -196,8 +186,8 @@ spec:
|
||||
Clone the playbook repository and download the Amazon Electronics Reviews dataset.
|
||||
|
||||
```bash
|
||||
git clone https://github.com/NVIDIA/dgx-spark-playbooks
|
||||
cd dgx-spark-playbooks/nvidia/station-topic-modeling/assets
|
||||
git clone https://github.com/NVIDIA/dgx-station-playbooks
|
||||
cd dgx-station-playbooks/nvidia/station-topic-modeling/assets
|
||||
```
|
||||
|
||||
Download the dataset (~14GB compressed):
|
||||
@ -206,17 +196,7 @@ spec:
|
||||
wget https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/raw/review_categories/Electronics.jsonl.gz
|
||||
```
|
||||
|
||||
# Step 8. Pull Git LFS files (notebooks)
|
||||
|
||||
The notebook files are stored in Git LFS — without this step, JupyterLab will throw a `NotJSONError` when trying to open them.
|
||||
|
||||
```bash
|
||||
conda install -c conda-forge git-lfs
|
||||
git lfs install
|
||||
git lfs pull
|
||||
```
|
||||
|
||||
# Step 9. Launch JupyterLab
|
||||
# Step 8. Launch JupyterLab
|
||||
|
||||
Start JupyterLab from the assets directory:
|
||||
|
||||
@ -224,13 +204,13 @@ spec:
|
||||
jupyter lab
|
||||
```
|
||||
|
||||
# Step 10. Select the rapids-25.10 kernel
|
||||
# Step 9. Select the rapids-25.10 kernel
|
||||
|
||||
In JupyterLab, open the notebook `video_notebook_for_GPU_Accelerated_Machine_Learning_BERTopic_1M.ipynb`.
|
||||
|
||||
Select the **rapids-25.10** kernel from the kernel selector in the top right corner of the notebook interface.
|
||||
|
||||
# Step 11. Execute all cells
|
||||
# Step 10. Execute all cells
|
||||
|
||||
Run all cells in the notebook sequentially. The notebook will:
|
||||
|
||||
@ -241,7 +221,7 @@ spec:
|
||||
5. **Run BERTopic**: Cluster documents into topics using GPU-accelerated UMAP and HDBSCAN
|
||||
6. **Visualize results**: Generate interactive topic visualizations
|
||||
|
||||
# Step 12. Explore the results
|
||||
# Step 11. Explore the results
|
||||
|
||||
After the notebook completes, you'll have:
|
||||
|
||||
@ -251,7 +231,7 @@ spec:
|
||||
- **Heatmap**: Topic similarity matrix
|
||||
- **Document datamap**: Visual clustering of documents by topic
|
||||
|
||||
# Step 13. Cleanup (optional)
|
||||
# Step 12. Cleanup (optional)
|
||||
|
||||
Remove the conda environment when finished:
|
||||
|
||||
@ -266,16 +246,6 @@ spec:
|
||||
rm Electronics.jsonl.gz
|
||||
```
|
||||
|
||||
Remove generated embedding files and the cloned playbook directory if you no longer need them:
|
||||
|
||||
```bash
|
||||
# Optional: remove Hugging Face cache (embedding cache from the notebook)
|
||||
rm -rf ~/.cache/huggingface
|
||||
|
||||
# From the parent of dgx-spark-playbooks/, remove the cloned repo
|
||||
rm -rf dgx-spark-playbooks/
|
||||
```
|
||||
|
||||
# Next steps
|
||||
|
||||
Apply this workflow to your own datasets:
|
||||
@ -288,6 +258,31 @@ spec:
|
||||
|
||||
|
||||
|
||||
-
|
||||
id: troubleshooting
|
||||
|
||||
label: Troubleshooting
|
||||
content: |
|
||||
# Common issues
|
||||
|
||||
| Symptom | Cause | Fix |
|
||||
|---------|-------|-----|
|
||||
| "Permission denied" on `~/.cache/huggingface` or Hugging Face download fails | Cache dir owned by root or wrong permissions | Run `sudo chown -R $USER:$USER $HOME/.cache/huggingface` and `sudo chmod -R u+rwX $HOME/.cache/huggingface` (use your username if different). |
|
||||
| `PackagesNotFoundError` for `jupyterlab-widgets` with conda | Package not available for platform/channel | Install with pip: `pip install jupyterlab-widgets`. |
|
||||
| Pip reports dependency conflicts (dask, distributed, cuml, rapids-dask-dependency) after installing BERTopic stack | Pip downgrades dask/distributed; RAPIDS expects newer versions | BERTopic and the notebook typically still work. To avoid conflicts, install BERTopic/umap/hdbscan in a separate env, or accept the conflict if you do not need cuML + dask together. |
|
||||
| `CUDA out of memory` error during embedding generation | Insufficient GPU memory for batch size | Reduce batch size in `model.encode()` or process fewer documents by lowering `nrows` |
|
||||
| `ModuleNotFoundError: No module named 'cuml'` | cuML not installed or wrong environment | Verify `conda activate rapids-25.10` and run `%load_ext cuml.accel` before imports |
|
||||
| Notebook kernel dies during UMAP | Out of memory during dimensionality reduction | Reduce dataset size or use `low_memory=True` in UMAP parameters |
|
||||
| `wget` download fails or hangs | Network issues or firewall blocking | Check internet connection, try with `--retry-connrefused --waitretry=1 --read-timeout=20` |
|
||||
| Kernel not found in JupyterLab | rapids-25.10 kernel not registered | Run `python -m ipykernel install --user --name rapids-25.10` |
|
||||
| `cudf.pandas` not accelerating operations | Extension not loaded before pandas import | Restart kernel and ensure `%load_ext cudf.pandas` runs before `import pandas` |
|
||||
| Topic model produces too many/few topics | HDBSCAN parameters need tuning | Adjust `min_cluster_size` (larger = fewer topics) and `min_samples` |
|
||||
| Plotly visualizations not rendering | Renderer not configured for JupyterLab | Add `pio.renderers.default = "notebook"` after importing plotly |
|
||||
| `ResolvePackageNotFound` during conda install | Package version conflict or missing channel | Ensure `-c rapidsai -c conda-forge` channels are specified |
|
||||
| PyTorch not using GPU | Wrong PyTorch version or CUDA mismatch | Reinstall with correct CUDA version: `pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu130` |
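
The `wget` fix in the table above can be sketched as one resumable command. This is a minimal dry-run sketch, not part of the playbook: `--continue` and `--tries=5` are assumed additions to the flags listed in the table, and the variable names are illustrative. The URL is the one used in the dataset download step.

```bash
# Dataset URL from the download step of this playbook.
DATASET_URL="https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/raw/review_categories/Electronics.jsonl.gz"

# Retry flags from the troubleshooting table, plus two assumed additions:
# --continue resumes a partial file, --tries=5 caps the number of attempts.
RETRY_FLAGS="--continue --tries=5 --retry-connrefused --waitretry=1 --read-timeout=20"

# Dry run: print the command for review instead of starting a ~14GB download.
echo "wget $RETRY_FLAGS $DATASET_URL"
```

To start the download, run the printed command from the assets directory.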
|
||||
|
||||
|
||||
|
||||
|
||||
resources:
|
||||
- name: BERTopic Documentation
|
||||
|
||||
@ -35,7 +35,7 @@ spec:
|
||||
|
||||
cta:
|
||||
text: View on GitHub
|
||||
url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-txt2kg/
|
||||
url: https://github.com/NVIDIA/dgx-station-playbooks/blob/main/nvidia/station-txt2kg/
|
||||
|
||||
|
||||
tabs:
|
||||
@ -81,13 +81,12 @@ spec:
|
||||
|
||||
# Ancillary files
|
||||
|
||||
All required assets are in the playbook directory `nvidia/station-txt2kg/assets` (see Instructions, Step 1). Key files:
|
||||
All required assets are in the playbook directory `nvidia/station-txt2kg/assets` (see Step 1). Key files:
|
||||
|
||||
- `start.sh` - Launch script for all services
|
||||
- `stop.sh` - Stop script to shut down services
|
||||
- `deploy/compose/` - Docker Compose configurations
|
||||
|
||||
|
||||
# Time & risk
|
||||
|
||||
- **Duration**:
|
||||
@ -100,8 +99,8 @@ spec:
|
||||
- Document processing time scales with document size and complexity
|
||||
|
||||
- **Rollback**: Stop and remove Docker containers, delete downloaded models if needed
|
||||
* **Last Updated:** 03/02/2026
|
||||
* First Publication
|
||||
- **Last Updated**: 02/06/2026
|
||||
- First Publication
|
||||
|
||||
|
||||
|
||||
@ -115,61 +114,63 @@ spec:
|
||||
This playbook is for **DGX Station**. In a terminal, clone the repository and navigate to the project directory.
|
||||
|
||||
```bash
|
||||
git clone https://github.com/NVIDIA/dgx-spark-playbooks
|
||||
cd dgx-spark-playbooks/nvidia/station-txt2kg/assets
|
||||
git clone https://github.com/NVIDIA/dgx-station-playbooks
|
||||
cd dgx-station-playbooks/nvidia/station-txt2kg/assets
|
||||
```
|
||||
|
||||
# Step 2. Start the txt2kg services
|
||||
|
||||
The default backend is **vLLM** (supported on DGX Station). The script starts services and waits for the vLLM backend to be ready (model load can take 30+ minutes; progress is shown in the terminal). To use Ollama instead, run `./start.sh --ollama`.
|
||||
Use the provided start script to launch all required services. On DGX Station, if the default backend (Ollama) does not work, use the vLLM backend: `./start.sh --vllm`.
|
||||
|
||||
```bash
|
||||
./start.sh
|
||||
# Optional: ./start.sh --ollama # Use ArangoDB + Ollama instead of vLLM
|
||||
# Optional: ./start.sh --no-wait # Skip waiting for vLLM readiness
|
||||
# If the default backend fails: ./start.sh --vllm
|
||||
```
|
||||
|
||||
The script will:
|
||||
The script will automatically:
|
||||
- Check for GPU availability
|
||||
- Start Docker Compose services (Neo4j + vLLM by default)
|
||||
- Wait for vLLM to be ready and show elapsed time
|
||||
- Print the Web UI URL when ready
|
||||
- Start Docker Compose services
|
||||
- Set up ArangoDB database
|
||||
- Launch the web interface
|
||||
|
||||
# Step 3. Pull the model (Ollama only)
|
||||
# Step 3. Pull the Llama 3.1 405B model
|
||||
|
||||
If you started with **Ollama** (`./start.sh --ollama`), pull the Llama model:
|
||||
The default configuration uses Llama 3.1 405B, which leverages the GB300 Ultra's large GPU memory for maximum accuracy in knowledge extraction:
|
||||
|
||||
```bash
|
||||
docker exec ollama-compose ollama pull llama3.1:405b
|
||||
```
|
||||
|
||||
Browse available models at [https://ollama.com/search](https://ollama.com/search). With the default **vLLM** stack, the model is loaded automatically by the vLLM container.
|
||||
Browse available models at [https://ollama.com/search](https://ollama.com/search)
|
||||
|
||||
> [!NOTE]
|
||||
> The first model download may take 20-30 minutes depending on network speed. For faster initial testing, you can use `llama3.1:70b` or `llama3.1:8b` as alternatives.
|
||||
|
||||
|
||||
# Step 4. Access the web interface
|
||||
|
||||
> [!NOTE]
|
||||
> If you started with **vLLM** (`./start.sh --vllm`), the vLLM backend can take **30 minutes or more** to load the model and initialize. There may be no progress indicator in the CLI or web UI during this time; check container logs with `docker logs` to confirm the server is still loading.
|
||||
|
||||
Open your browser and navigate to:
|
||||
|
||||
```
|
||||
http://localhost:3001
|
||||
```
|
||||
|
||||
You can also access:
|
||||
- **Neo4j Browser** (vLLM default): http://localhost:7474
|
||||
- **vLLM API**: http://localhost:8001
|
||||
- **ArangoDB** (Ollama only): http://localhost:8529
|
||||
- **Ollama API** (Ollama only): http://localhost:11434
|
||||
You can also access individual services:
|
||||
- **ArangoDB Web Interface**: http://localhost:8529
|
||||
- **Ollama API**: http://localhost:11434
|
||||
|
||||
# Step 5. Upload documents and build knowledge graphs
|
||||
|
||||
The web UI defaults to **local** (vLLM or Ollama). If the backend is still loading, a banner and the model selector will show “Initializing…” until the backend is ready.
|
||||
|
||||
### 5.1. Document Upload
|
||||
- Use the web interface to upload text documents (markdown, text, CSV supported)
|
||||
- Documents are automatically chunked and processed for triple extraction
|
||||
|
||||
### 5.2. Knowledge Graph Generation
|
||||
- The system extracts subject-predicate-object triples using the selected LLM (vLLM or Ollama)
|
||||
- Triples are stored in Neo4j (vLLM) or ArangoDB (Ollama) for relationship querying
|
||||
- The system extracts subject-predicate-object triples using Ollama
|
||||
- Triples are stored in ArangoDB for relationship querying
|
||||
|
||||
### 5.3. Interactive Visualization
|
||||
- View your knowledge graph in 2D or 3D with GPU-accelerated rendering
|
||||
@ -184,28 +185,26 @@ spec:
|
||||
|
||||
# Step 6. Cleanup and rollback
|
||||
|
||||
Stop all services (use the same flags as when you started):
|
||||
Remove downloaded models while the container is still running, then stop services:
|
||||
|
||||
```bash
|
||||
# Stop services (default: vLLM stack)
|
||||
./stop.sh
|
||||
# If you started with Ollama: ./stop.sh --ollama
|
||||
# Remove downloaded models (optional; run before stopping containers)
|
||||
docker exec ollama-compose ollama rm llama3.1:405b
|
||||
|
||||
# Stop services
|
||||
docker compose down
|
||||
|
||||
# Remove containers and volumes (optional)
|
||||
# From assets dir: docker compose -f deploy/compose/docker-compose.vllm.yml down -v
|
||||
# Or with Ollama: docker compose -f deploy/compose/docker-compose.yml down -v
|
||||
|
||||
# Remove downloaded Ollama models (Ollama only)
|
||||
# docker exec ollama-compose ollama rm llama3.1:405b
|
||||
docker compose down -v
|
||||
```
|
||||
|
||||
# Step 7. Next steps
|
||||
|
||||
- Default is vLLM on DGX Station; use `./start.sh --ollama` for ArangoDB + Ollama.
|
||||
- The UI shows a readiness banner and “vLLM (Local) – Initializing…” until the backend is ready.
|
||||
- Experiment with different models for extraction quality and speed tradeoffs.
|
||||
- Customize triple extraction prompts for domain-specific knowledge.
|
||||
- Explore advanced graph querying and visualization features.
|
||||
- On DGX Station, use `./start.sh --vllm` if the default Ollama backend does not work; allow 30+ minutes for vLLM to initialize.
|
||||
- Experiment with different Ollama models for varied extraction quality and speed tradeoffs
|
||||
- The 405B model provides the highest accuracy; use 70B or 8B for faster processing
|
||||
- Customize triple extraction prompts for domain-specific knowledge
|
||||
- Explore advanced graph querying and visualization features
|
||||
|
||||
|
||||
|
||||
@ -224,8 +223,8 @@ spec:
|
||||
| ArangoDB connection refused | Service not fully started | Wait 30s after start.sh, verify with `docker ps` |
|
||||
| Container fails to start with GPU error | NVIDIA Container Toolkit not configured | Run `nvidia-ctk runtime configure --runtime=docker` and restart Docker |
|
||||
| Port already in use | Previous instance still running | Run `./stop.sh` first or use `docker compose down` |
|
||||
| Default is vLLM; need Ollama instead | Prefer ArangoDB + Ollama | Start with `./start.sh --ollama`. |
|
||||
| vLLM takes long to become ready | Model load can take 30+ minutes | The start script waits and shows elapsed time. The UI shows a banner and "vLLM (Local) – Initializing…" until ready. Check progress: `docker logs vllm-service -f`. |
|
||||
| Default backend (Ollama) doesn't work on DGX Station | Backend or model not available | Start with vLLM: `./start.sh --vllm`. Allow 30+ minutes for vLLM to load the model; there may be no progress message in the UI. |
|
||||
| No feedback while vLLM is starting | vLLM model load takes a long time | vLLM can take >30 minutes to initialize. Check `docker logs` for the vLLM container to confirm it is still loading. |
|
||||
|
||||
> [!NOTE]
|
||||
> DGX Station with GB300 Ultra provides massive GPU memory capacity, enabling you to run larger models (70B+)
|
||||
|
||||
@ -45,6 +45,8 @@ The following models are supported with vLLM on DGX Station. All listed models a
|
||||
|
||||
| Model | Quantization | Support Status | HF Handle |
|
||||
|-------|-------------|----------------|-----------|
|
||||
| **DiffusionGemma 26B A4B IT** | BF16 | ✅ | [`google/diffusiongemma-26B-A4B-it`](https://huggingface.co/google/diffusiongemma-26B-A4B-it) |
|
||||
| **DiffusionGemma 26B A4B IT** | NVFP4 | ✅ | [`nvidia/diffusiongemma-26B-A4B-it-NVFP4`](https://huggingface.co/nvidia/diffusiongemma-26B-A4B-it-NVFP4) |
|
||||
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
|
||||
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
|
||||
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
|
||||
@ -54,7 +56,7 @@ The following models are supported with vLLM on DGX Station. All listed models a
|
||||
* **Duration:** 30 minutes (longer on first run due to model download)
|
||||
* **Risks:** Model download requires HuggingFace authentication
|
||||
* **Rollback:** Stop and remove the container to restore state
|
||||
* **Last Updated:** 05/28/2026
|
||||
* **Last Updated:** 06/10/2026
|
||||
* Update models
|
||||
|
||||
## Instructions
|
||||
@ -92,6 +94,12 @@ Pull the vLLM container from NGC. Use the **26.01** image on DGX Station; the 25
|
||||
docker pull nvcr.io/nvidia/vllm:26.01-py3
|
||||
```
|
||||
|
||||
For DiffusionGemma, use the vLLM custom container:
|
||||
|
||||
```bash
|
||||
docker pull vllm/vllm-openai:gemma
|
||||
```
|
||||
|
||||
For Step-3.7-Flash models, pull the custom vLLM container:
|
||||
```bash
|
||||
docker pull vllm/vllm-openai:stepfun37
|
||||
@ -119,6 +127,34 @@ docker run -d \
|
||||
--gpu-memory-utilization 0.9
|
||||
```
|
||||
|
||||
For DiffusionGemma models (e.g. `google/diffusiongemma-26B-A4B-it`), run with the custom vLLM container:
|
||||
|
||||
```bash
|
||||
docker run -d \
|
||||
--name vllm-server \
|
||||
-p 8000:8000 \
|
||||
--gpus all \
|
||||
--shm-size=16g \
|
||||
--ulimit memlock=-1 \
|
||||
--ulimit stack=67108864 \
|
||||
-e VLLM_USE_V2_MODEL_RUNNER=1 \
|
||||
-e HF_TOKEN="$HF_TOKEN" \
|
||||
vllm/vllm-openai:gemma ${MODEL_HANDLE} \
|
||||
--gpu-memory-utilization 0.85 \
|
||||
--attention-backend TRITON_ATTN \
|
||||
--max-num-seqs 16 \
|
||||
--diffusion-config '{"canvas_length":256}' \
|
||||
--override-generation-config '{"max_new_tokens": null}' \
|
||||
--load-format fastsafetensors \
|
||||
--enable-prefix-caching \
|
||||
--reasoning-parser gemma4 \
|
||||
--default-chat-template-kwargs '{"enable_thinking": true}' \
|
||||
--enable-auto-tool-choice \
|
||||
--tool-call-parser gemma4
|
||||
|
||||
## For BF16 checkpoint add "--moe-backend triton" for better performance
|
||||
```
|
||||
|
||||
For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and the NVFP4 versions fit entirely in VRAM on the GB300.
|
||||
|
||||
```bash
|
||||
|
||||
@ -65,6 +65,8 @@ spec:
|
||||
|
||||
| Model | Quantization | Support Status | HF Handle |
|
||||
|-------|-------------|----------------|-----------|
|
||||
| **DiffusionGemma 26B A4B IT** | BF16 | ✅ | [`google/diffusiongemma-26B-A4B-it`](https://huggingface.co/google/diffusiongemma-26B-A4B-it) |
|
||||
| **DiffusionGemma 26B A4B IT** | NVFP4 | ✅ | [`nvidia/diffusiongemma-26B-A4B-it-NVFP4`](https://huggingface.co/nvidia/diffusiongemma-26B-A4B-it-NVFP4) |
|
||||
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
|
||||
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
|
||||
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
|
||||
@ -74,7 +76,7 @@ spec:
|
||||
* **Duration:** 30 minutes (longer on first run due to model download)
|
||||
* **Risks:** Model download requires HuggingFace authentication
|
||||
* **Rollback:** Stop and remove the container to restore state
|
||||
* **Last Updated:** 05/28/2026
|
||||
* **Last Updated:** 06/10/2026
|
||||
* Update models
|
||||
|
||||
|
||||
@ -117,6 +119,12 @@ spec:
|
||||
docker pull nvcr.io/nvidia/vllm:26.01-py3
|
||||
```
|
||||
|
||||
For DiffusionGemma, use the vLLM custom container:
|
||||
|
||||
```bash
|
||||
docker pull vllm/vllm-openai:gemma
|
||||
```
|
||||
|
||||
For Step-3.7-Flash models, pull the custom vLLM container:
|
||||
```bash
|
||||
docker pull vllm/vllm-openai:stepfun37
|
||||
@ -144,6 +152,34 @@ spec:
|
||||
--gpu-memory-utilization 0.9
|
||||
```
|
||||
|
||||
For DiffusionGemma models (e.g. `google/diffusiongemma-26B-A4B-it`), run with the custom vLLM container:
|
||||
|
||||
```bash
|
||||
docker run -d \
|
||||
--name vllm-server \
|
||||
-p 8000:8000 \
|
||||
--gpus all \
|
||||
--shm-size=16g \
|
||||
--ulimit memlock=-1 \
|
||||
--ulimit stack=67108864 \
|
||||
-e VLLM_USE_V2_MODEL_RUNNER=1 \
|
||||
-e HF_TOKEN="$HF_TOKEN" \
|
||||
vllm/vllm-openai:gemma ${MODEL_HANDLE} \
|
||||
--gpu-memory-utilization 0.85 \
|
||||
--attention-backend TRITON_ATTN \
|
||||
--max-num-seqs 16 \
|
||||
--diffusion-config '{"canvas_length":256}' \
|
||||
--override-generation-config '{"max_new_tokens": null}' \
|
||||
--load-format fastsafetensors \
|
||||
--enable-prefix-caching \
|
||||
--reasoning-parser gemma4 \
|
||||
--default-chat-template-kwargs '{"enable_thinking": true}' \
|
||||
--enable-auto-tool-choice \
|
||||
--tool-call-parser gemma4
|
||||
|
||||
# For BF16 checkpoint add "--moe-backend triton" for better performance
|
||||
```
|
||||
|
||||
For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and the NVFP4 versions fit entirely in VRAM on the GB300.
|
||||
|
||||
```bash
|
||||
|
||||
@ -171,10 +171,12 @@ Add additional model entries for any other Ollama models you wish to host remote
|
||||
|
||||
| Symptom | Cause | Fix |
|
||||
|---------|-------|-----|
|
||||
|Ollama not starting|GPU drivers may not be installed correctly|Run `nvidia-smi` in the terminal. If the command fails check DGX Dashboard for updates to your DGX Spark.|
|
||||
|Continue can't connect over the network|Port 11434 may not be open or accessible|Run command `ss -tuln \| grep 11434`. If the output does not reflect ` tcp LISTEN 0 4096 *:11434 *:* `, go back to step 2 and run the ufw command.|
|
||||
|Continue can't detect a locally running Ollama model|Configuration not properly set or detected|Check `OLLAMA_HOST` and `OLLAMA_ORIGINS` in `/etc/systemd/system/ollama.service.d/override.conf` file. If `OLLAMA_HOST` and `OLLAMA_ORIGINS` are set correctly, add these lines to your `~/.bashrc` file.|
|
||||
|High memory usage|Model size too big|Confirm no other large models or containers are running with `nvidia-smi`. Use smaller models such as `gpt-oss:20b` for lightweight usage.|
|
||||
| **WiFi connection drops or becomes unreachable** (especially in headless mode) | Aggressive WiFi power-saving settings in NetworkManager | Edit `/etc/NetworkManager/conf.d/default-wifi-powersave-on.conf`, set `wifi.powersave = 2`, and run `sudo systemctl restart NetworkManager`. |
|
||||
| **Random reboots and "00" error code on the display** | Watchdog timer module (`sbsa_gwdt`) not loaded | Add `sbsa_gwdt` to `/etc/modules-load.d/watchdog.conf` and reboot to ensure the hardware watchdog is correctly managed by the kernel. |
|
||||
| Ollama not starting | GPU drivers may not be installed correctly | Run `nvidia-smi` in the terminal. If the command fails, check DGX Dashboard for updates to your DGX Spark. |
|
||||
| Continue can't connect over the network | Port 11434 may not be open or accessible | Run command `ss -tuln \| grep 11434`. If the output does not reflect `tcp LISTEN 0 4096 *:11434 *:*`, go back to step 2 and run the ufw command. |
|
||||
| Continue can't detect a locally running Ollama model | Configuration not properly set or detected | Check `OLLAMA_HOST` and `OLLAMA_ORIGINS` in `/etc/systemd/system/ollama.service.d/override.conf` file. If `OLLAMA_HOST` and `OLLAMA_ORIGINS` are set correctly, add these lines to your `~/.bashrc` file. |
|
||||
| High memory usage | Model size too big | Confirm no other large models or containers are running with `nvidia-smi`. Use smaller models such as `gpt-oss:20b` for lightweight usage. |
|
||||
|
||||
> [!NOTE]
|
||||
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
|
||||
|
||||
@ -9,6 +9,7 @@
|
||||
- [Run on two Sparks](#run-on-two-sparks)
|
||||
- [Step 11. (Optional) Launch 405B inference server](#step-11-optional-launch-405b-inference-server)
|
||||
- [Run on multiple Sparks through a switch](#run-on-multiple-sparks-through-a-switch)
|
||||
- [Run Agent Ready Qwen3.6 35B Model with vLLM](#run-agent-ready-qwen36-35b-model-with-vllm)
|
||||
- [Troubleshooting](#troubleshooting)
|
||||
|
||||
---
|
||||
@ -54,6 +55,8 @@ The following models are supported with vLLM on Spark. All listed models are ava
|
||||
|
||||
| Model | Quantization | Support Status | HF Handle |
|
||||
|-------|-------------|----------------|-----------|
|
||||
| **DiffusionGemma 26B A4B IT** | BF16 | ✅ | [`google/diffusiongemma-26B-A4B-it`](https://huggingface.co/google/diffusiongemma-26B-A4B-it) |
|
||||
| **DiffusionGemma 26B A4B IT** | NVFP4 | ✅ | [`nvidia/diffusiongemma-26B-A4B-it-NVFP4`](https://huggingface.co/nvidia/diffusiongemma-26B-A4B-it-NVFP4) |
|
||||
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | BF16 | ✅ | [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16) |
|
||||
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | FP8 | ✅ | [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8) |
|
||||
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | NVFP4 | ✅ | [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4) |
|
||||
@ -97,8 +100,8 @@ Reminder: not all model architectures are supported for NVFP4 quantization.
|
||||
* **Duration:** 30 minutes for Docker approach
|
||||
* **Risks:** Container registry access requires internal credentials
|
||||
* **Rollback:** Container approach is non-destructive.
|
||||
* **Last Updated:** 04/28/2026
|
||||
* Add support for Nemotron-3-Nano-Omni reasoning BF16, FP8, NVFP4
|
||||
* **Last Updated:** 06/12/2026
|
||||
* Add Agent ready model recipe for Qwen3.6 35B
|
||||
|
||||
## Instructions
|
||||
|
||||
@ -133,9 +136,13 @@ newgrp docker
|
||||
Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm
|
||||
|
||||
```bash
|
||||
## HuggingFace token (required)
|
||||
## Get a token from https://huggingface.co/settings/tokens
|
||||
export HF_TOKEN="your_huggingface_token"
|
||||
|
||||
export LATEST_VLLM_VERSION=<latest_container_version>
|
||||
## example
|
||||
## export LATEST_VLLM_VERSION=26.02-py3
|
||||
## export LATEST_VLLM_VERSION=26.05.post1-py3
|
||||
|
||||
export HF_MODEL_HANDLE=<HF_HANDLE>
|
||||
## example
|
||||
@ -144,7 +151,12 @@ export HF_MODEL_HANDLE=<HF_HANDLE>
|
||||
docker pull nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION}
|
||||
```
|
||||
|
||||
For Gemma 4 model family, use vLLM custom containers:
|
||||
For DiffusionGemma models, use the vLLM custom container:
|
||||
```bash
|
||||
docker pull vllm/vllm-openai:gemma
|
||||
```
|
||||
|
||||
For the Gemma 4 model family, use the vLLM custom container:
|
||||
```bash
|
||||
docker pull vllm/vllm-openai:gemma4-cu130
|
||||
```
|
||||
@ -159,6 +171,31 @@ nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION} \
|
||||
vllm serve ${HF_MODEL_HANDLE}
|
||||
```
|
||||
|
||||
To run DiffusionGemma models (e.g. `google/diffusiongemma-26B-A4B-it`):
|
||||
```bash
|
||||
docker run -it \
|
||||
-p 8000:8000 \
|
||||
--gpus all \
|
||||
--shm-size=16g \
|
||||
-e HF_TOKEN="$HF_TOKEN" \
|
||||
-e VLLM_USE_V2_MODEL_RUNNER=1 \
|
||||
vllm/vllm-openai:gemma ${HF_MODEL_HANDLE} \
|
||||
--gpu-memory-utilization 0.8 \
|
||||
--max-model-len 262144 \
|
||||
--attention-backend TRITON_ATTN \
|
||||
--max-num-seqs 10 \
|
||||
--diffusion-config '{"canvas_length":256}' \
|
||||
--override-generation-config '{"max_new_tokens": null}' \
|
||||
--enable-auto-tool-choice \
|
||||
--tool-call-parser gemma4 \
|
||||
--reasoning-parser gemma4 \
|
||||
--enable-prefix-caching \
|
||||
--default-chat-template-kwargs '{"enable_thinking": true}' \
|
||||
--load-format fastsafetensors
|
||||
|
||||
## For BF16 checkpoint add "--moe-backend triton" for better performance
|
||||
```
|
||||
|
||||
To run models from the Gemma 4 model family (e.g. `google/gemma-4-31B-it`):
|
||||
```bash
|
||||
docker run -it --gpus all -p 8000:8000 \
|
||||
@ -188,11 +225,19 @@ Expected response should contain `"content": "204"` or similar mathematical calc
|
||||
|
||||
For container approach (non-destructive):
|
||||
|
||||
NGC Container:
|
||||
```bash
|
||||
docker rm $(docker ps -aq --filter ancestor=nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION})
|
||||
docker rmi nvcr.io/nvidia/vllm
|
||||
```
|
||||
|
||||
Upstream Container:
|
||||
```bash
|
||||
docker stop "<container name>"
|
||||
docker rm "<container name>"
|
||||
docker rmi "<container image name>"
|
||||
```
|
||||
|
||||
## Step 6. Next steps
|
||||
|
||||
- **Production deployment:** Configure vLLM with your specific model requirements
|
||||
@ -614,6 +659,96 @@ http://<head-node-ip>:8265
|
||||
## - Other models which can fit on the cluster with different quantization methods (FP8, NVFP4)
|
||||
```
|
||||
|
||||
## Run Agent Ready Qwen3.6 35B Model with vLLM
|
||||
|
||||
## Step 1. Configure Docker permissions
|
||||
|
||||
To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.
|
||||
|
||||
Open a new terminal and test Docker access. In the terminal, run:
|
||||
```bash
|
||||
docker ps
|
||||
```
|
||||
|
||||
If you see a permission denied error (for example, `permission denied while trying to connect to the Docker daemon socket`), add your user to the `docker` group so that you don't need to run Docker commands with sudo.
|
||||
|
||||
```bash
|
||||
sudo usermod -aG docker $USER
|
||||
newgrp docker
|
||||
```
|
||||
|
||||
## Step 2. Pull vLLM container image
|
||||
|
||||
```bash
|
||||
docker pull vllm/vllm-openai:nightly-aarch64
|
||||
```
|
||||
|
||||
## Step 3. Launch the Agent Ready Qwen3.6 35B server
|
||||
|
||||
Launch the container and start the vLLM server with the agent-ready
|
||||
`nvidia/Qwen3.6-35B-A3B-NVFP4` recipe. The `vllm/vllm-openai` image entrypoint is
|
||||
`vllm serve`, so the model handle and flags are passed directly as container arguments.
|
||||
|
||||
```bash
|
||||
## HuggingFace token (required to download the model)
|
||||
## Get a token from https://huggingface.co/settings/tokens
|
||||
export HF_TOKEN="your_huggingface_token"
|
||||
|
||||
docker run -it --gpus all -p 8000:8000 \
|
||||
-e HF_TOKEN="$HF_TOKEN" \
|
||||
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||
vllm/vllm-openai:nightly-aarch64 \
|
||||
nvidia/Qwen3.6-35B-A3B-NVFP4 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--tensor-parallel-size 1 \
|
||||
--trust-remote-code \
|
||||
--kv-cache-dtype fp8 \
|
||||
--attention-backend flashinfer \
|
||||
--moe-backend marlin \
|
||||
--gpu-memory-utilization 0.4 \
|
||||
--max-model-len 262144 \
|
||||
--max-num-seqs 4 \
|
||||
--max-num-batched-tokens 8192 \
|
||||
--enable-chunked-prefill \
|
||||
--async-scheduling \
|
||||
--enable-prefix-caching \
|
||||
--speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}' \
|
||||
--load-format fastsafetensors \
|
||||
--reasoning-parser qwen3 \
|
||||
--tool-call-parser qwen3_xml \
|
||||
--enable-auto-tool-choice
|
||||
```
|
||||
|
||||
The startup log should include:
|
||||
- Model loading confirmation
|
||||
- Server startup on port 8000
|
||||
- GPU memory allocation details
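
Downloading and loading the model can take a while, and requests sent before startup completes fail with a connection error. A minimal sketch for waiting until the server responds, assuming the `/health` endpoint of vLLM's OpenAI-compatible server on the port published above; `wait_for_vllm`, `BASE_URL`, and `TIMEOUT_SECONDS` are illustrative names, not part of the playbook:

```bash
# Assumed defaults; override by exporting the variables before calling.
BASE_URL="${BASE_URL:-http://localhost:8000}"
TIMEOUT_SECONDS="${TIMEOUT_SECONDS:-900}"

# Poll the health endpoint every 5 seconds until it answers or the timeout expires.
wait_for_vllm() {
  local deadline=$(( $(date +%s) + TIMEOUT_SECONDS ))
  until curl -sf --max-time 5 -o /dev/null "$BASE_URL/health"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "vLLM did not become ready within ${TIMEOUT_SECONDS}s" >&2
      return 1
    fi
    sleep 5
  done
  echo "vLLM is ready at $BASE_URL"
}
```

Running `wait_for_vllm` in a second terminal blocks until the server is up and returns a non-zero status on timeout, which makes it usable in scripts.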
|
||||
|
||||
In another terminal, test the server:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "nvidia/Qwen3.6-35B-A3B-NVFP4",
|
||||
"messages": [{"role": "user", "content": "12*17"}],
|
||||
"max_tokens": 500
|
||||
}'
|
||||
```
|
||||
|
||||
The response should contain the answer `204` in the assistant message's `content` field, possibly with surrounding explanation.
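
The raw JSON response is verbose. A small sketch for printing only the reply text, assuming the OpenAI-style response layout (`choices[0].message.content`) and that `python3` is available on the host; `extract_reply` is an illustrative helper name:

```bash
# Read a chat completion response on stdin and print only the assistant's reply text.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}
```

Pipe the test request through it by adding `-s` to the `curl` command and appending `| extract_reply`. If it prints `None`, the model may have used up `max_tokens` before producing a final answer; raising `max_tokens` is a reasonable first thing to try.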
|
||||
|
||||
|
||||
## Step 4. Cleanup and rollback
|
||||
|
||||
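Stop the server with Ctrl+C, then remove the stopped containers and the image. This is non-destructive: the model weights cached in `~/.cache/huggingface` stay on disk.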
|
||||
|
||||
```bash
|
||||
docker rm $(docker ps -aq --filter ancestor=vllm/vllm-openai:nightly-aarch64)
|
||||
docker rmi vllm/vllm-openai:nightly-aarch64
|
||||
```
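
`docker rm` refuses to remove a running container and exits with an error when it is given no container IDs. A guarded variant, sketched under the assumption that every container created from this image should be removed; `cleanup_vllm` and `IMAGE` are illustrative names:

```bash
# Assumed default; override by exporting IMAGE before calling.
IMAGE="${IMAGE:-vllm/vllm-openai:nightly-aarch64}"

# Stop and remove all containers created from IMAGE, then remove the image itself.
cleanup_vllm() {
  local ids
  ids=$(docker ps -aq --filter "ancestor=$IMAGE")
  if [ -n "$ids" ]; then
    # Word splitting is intentional: $ids may hold several container IDs.
    docker stop $ids
    docker rm $ids
  fi
  docker rmi "$IMAGE"
}
```

Call `cleanup_vllm` once you no longer need the server.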
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
## Common issues for running on a single Spark
|
||||
@ -623,6 +758,7 @@ http://<head-node-ip>:8265
|
||||
| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer |
|
||||
| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token |
|
||||
| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |
|
||||
| CUDA out of memory | Insufficient GPU memory | Reduce --max-model-len and --max-num-seqs parameters |
|
||||
|
||||
## Common issues for running on two Sparks
|
||||
| Symptom | Cause | Fix |
|
||||
@ -631,7 +767,7 @@ http://<head-node-ip>:8265
|
||||
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens) and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) in your web browser |
|
||||
| Model download fails | Authentication or network issue | Re-run `huggingface-cli login`, check internet access |
|
||||
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your HuggingFace token and request access to the gated model in your web browser |
|
||||
| CUDA out of memory with 405B | Insufficient GPU memory | Use 70B model or reduce max_model_len parameter |
|
||||
| CUDA out of memory | Insufficient GPU memory | Reduce --max-model-len and --max-num-seqs parameters |
|
||||
| Container startup fails | Missing ARM64 image | Rebuild vLLM image following ARM64 instructions |
|
||||
|
||||
> [!NOTE]
|
||||
|
||||