diff --git a/nvidia/station-ai-skills/endpoint-production.yaml b/nvidia/station-ai-skills/endpoint-production.yaml new file mode 100644 index 0000000..6ef089a --- /dev/null +++ b/nvidia/station-ai-skills/endpoint-production.yaml @@ -0,0 +1,413 @@ +kind: Playbook +metadata: + name: station-ai-skills + displayName: DGX Station AI Skills for Coding Agents + shortDescription: Give your coding agent (Claude Code, Codex, Gemini CLI, Cursor) DGX Station expertise via an AGENTS.md and on-demand Agent Skills + + publisher: nvidia + description: | + # REPLACE THIS WITH YOUR MODEL CARD + https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads + + labelsV2: + - gpuType:playbook:gpu_type_station + - DGX Station + - GB300 + - Blackwell + - AI Agents + - Agent Skills + - AGENTS.md + - Claude Code + - Codex + - Gemini CLI + - Cursor + - vLLM + - SGLang + - MIG + - Mixed Coherency + + attributes: + - key: DURATION + value: 15 MIN + +spec: + artifactName: station-ai-skills + nvcfFunctionId: None + attributes: + + showUnavailableBanner: false + apiDocsUrl: None + termsOfUse: | + + cta: + text: View on GitHub + url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-ai-skills/ + + + tabs: + - + id: overview + + label: Overview + content: | + # Basic idea + + Modern coding agents — Claude Code, OpenAI Codex CLI, Gemini CLI, Cursor — all support two extension mechanisms: a project-level **context file** that's loaded into every conversation, and **on-demand procedural workflows** (called skills, prompts, commands, or rules depending on the harness). This playbook ships both for DGX Station: + + - An **`AGENTS.md`** with the critical DGX Station constraints your agent should always know (mixed coherency, GPU targeting, common pitfalls). `AGENTS.md` is the cross-harness standard; an `install.sh` lays it down as `CLAUDE.md`, `GEMINI.md`, or `AGENTS.md` depending on the agent you use. 
+ - **Four Agent Skills** — `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose` — authored once in the [Anthropic Agent Skills format](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview) and installed into the right per-harness location (`.claude/skills/`, `.codex/prompts/`, `.gemini/commands/`, or `.cursor/rules/`). + + This approach keeps your agent's context lean in every conversation while giving it deep procedural knowledge on demand, regardless of which agent you use. + + ## AGENTS.md vs Agent Skill — why split? + + | | AGENTS.md | Agent Skill | + |---|---|---| + | **Loaded** | Every conversation, automatically | Only when invoked by name (or matched by description, in Claude) | + | **Best for** | Constraints, pitfalls, "never do X" rules | Step-by-step workflows, deployment procedures | + | **Context cost** | Consumed every time | Zero until invoked | + + The DGX Station mixed-coherency constraint (`--gpus all` will crash) should be in every conversation. The full vLLM deployment procedure should not. + + # What you'll accomplish + + - Install the `AGENTS.md` and four Agent Skills into your project directory for your chosen agent (Claude Code, Codex, Gemini CLI, or Cursor). + - Verify the agent loads the constraints automatically and the skills on demand. + - Invoke `vllm-setup` to deploy a vLLM inference server with validated configuration. + - Invoke `sglang-setup` to deploy an SGLang inference server. + - Invoke `mig-configure` to partition the GB300 into MIG instances. + - Invoke `dgx-diagnose` to troubleshoot common DGX Station issues. 
+ + # What to know before starting + + - Basic familiarity with one supported coding agent (running it, giving it prompts, using slash commands or rule references) + - General understanding of DGX Station (two GPUs, Docker-based workflows) + + # Prerequisites + + - NVIDIA DGX Station with GB300 + - One of the supported coding agents installed: + - **Claude Code:** `curl -fsSL https://claude.ai/install.sh | bash` + - **OpenAI Codex CLI:** `npm i -g @openai/codex` + - **Gemini CLI:** `npm i -g @google/gemini-cli` + - **Cursor:** download from `https://cursor.com/` + - A project directory where you do DGX Station work + + # Ancillary files + + - `assets/AGENTS.md` — canonical context file with critical constraints, GPU targeting, software versions, and common pitfalls. Cross-harness standard. + - `assets/skills/vllm-setup/SKILL.md` — skill: deploy vLLM with validated configuration. + - `assets/skills/sglang-setup/SKILL.md` — skill: deploy SGLang with validated configuration. + - `assets/skills/mig-configure/SKILL.md` — skill: configure MIG partitions on the GB300. + - `assets/skills/dgx-diagnose/SKILL.md` — skill: troubleshoot common DGX Station issues. + - `assets/install.sh` — per-harness installer (`claude`, `codex`, `gemini`, `cursor`, or `all`). + + # Time & risk + + * **Duration:** 10-15 minutes + * **Risk level:** Low — this playbook copies markdown files into your project directory + * **Rollback:** Delete the context file (`AGENTS.md` / `CLAUDE.md` / `GEMINI.md`) and remove the four installed skills from the harness-specific directory (`.claude/skills/`, `.codex/prompts/`, `.gemini/commands/`, or `.cursor/rules/`) in your project directory; leave any skills you wrote yourself in place + * **Last Updated:** 05/18/2026 + * Restructured as harness-agnostic Agent Skills (Claude Code, Codex, Gemini CLI, Cursor) + + + + - + id: instructions + + label: Instructions + content: | + # Step 1. Install your coding agent + + Pick whichever agent you prefer — the rest of this playbook works the same regardless. 
Install commands: + + | Agent | Install | + |-------|---------| + | Claude Code | `curl -fsSL https://claude.ai/install.sh \| bash` | + | OpenAI Codex CLI | `npm i -g @openai/codex` | + | Gemini CLI | `npm i -g @google/gemini-cli` | + | Cursor | Download from `https://cursor.com/` | + + Verify with `claude --version`, `codex --version`, `gemini --version`, or by launching Cursor. + + # Step 2. Install the skills into your project + + Navigate to the project where you want DGX Station expertise, then run the installer with the harness you use: + + ```bash + cd ~/your-project + + # Pick one: + /path/to/this/playbook/assets/install.sh claude + /path/to/this/playbook/assets/install.sh codex + /path/to/this/playbook/assets/install.sh gemini + /path/to/this/playbook/assets/install.sh cursor + + # Or install for all four at once: + /path/to/this/playbook/assets/install.sh all + ``` + + If you downloaded the playbook as a zip, the path is relative to the extracted directory: + + ```bash + station-ai-skills/assets/install.sh claude ~/your-project + ``` + + The installer is additive for skill directories (won't clobber existing skills you've written) and refuses to overwrite an existing context file (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`) unless you pass `--force`. + + **Resulting layout** (per harness): + + ```text + your-project/ + AGENTS.md or CLAUDE.md or GEMINI.md # context file (named for your agent) + .claude/skills/<skill>/SKILL.md # claude + .codex/prompts/<skill>.md # codex + .gemini/commands/<skill>.md # gemini + .cursor/rules/<skill>.mdc # cursor + ``` + + Where `<skill>` is each of `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose`. + + > [!NOTE] + > Every supported agent automatically reads the context file from the working directory at startup. Skills/prompts/rules in the harness-specific directory are discovered automatically — no additional configuration needed. + + # Step 3. 
Verify the setup + + Start your agent in the project directory and ask a question that requires constraint knowledge: + + ```text + Can I use --gpus all to run my CUDA workload on DGX Station? + ``` + + The agent should immediately warn about the mixed-coherency constraint and recommend `--gpus '"device=N"'` targeting. If you don't get the warning, the context file isn't being loaded — see Troubleshooting. + + Then verify the skills are discoverable: + + | Agent | How to check | + |-------|--------------| + | Claude Code | Type `/` — `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose` should appear in the autocomplete | + | Codex CLI | Type `/prompts:` — same four names appear | + | Gemini CLI | Type `/` — same four names appear | + | Cursor | Open the Rules panel — same four rules appear | + + # Step 4. Use vllm-setup to deploy an inference server + + Invoke the skill in your agent: + + | Agent | Invocation | + |-------|-----------| + | Claude Code | `/vllm-setup` (slash command) or just describe the task ("deploy vllm with Qwen3-8B") | + | Codex CLI | `/prompts:vllm-setup` | + | Gemini CLI | `/vllm-setup` | + | Cursor | In chat: "use the vllm-setup rule to deploy a vllm server" | + + The agent will walk you through deploying a vLLM server with a validated container image, correct GPU targeting, and recommended parameters. It will check your GPU index, ask which model you want to serve, and generate the full `docker run` command. + + # Step 5. Use sglang-setup to deploy SGLang + + Same invocation pattern, but for SGLang with the `cu130` container, RadixAttention prefix caching, and structured JSON output support. + + # Step 6. Use mig-configure to partition the GB300 + + The agent will query your current MIG state, show available profiles, help you choose a layout for your workloads, and execute the partitioning commands. + + # Step 7. Use dgx-diagnose to troubleshoot issues + + If you encounter problems, invoke `dgx-diagnose`. 
The agent will check GPU status, driver version, running processes, MIG state, and Fabric Manager to identify the issue. + + # Step 8. Customize + + Both the `AGENTS.md` and the skills are plain markdown — extend them freely. + + **Add project-specific constraints to `AGENTS.md`** (or your harness-specific context file): + + ```markdown + ## Project-specific + + - Our production MIG layout is 3g.139gb + 2g.70gb + 2g.70gb + - Always use port 8080 for inference (nginx proxy on 443) + - Model weights are cached at /data/models, mount with -v /data/models:/root/.cache/huggingface/hub + ``` + + **Create new skills** by adding a directory and `SKILL.md` to `assets/skills/`, then re-run `install.sh`: + + ```bash + mkdir -p assets/skills/run-benchmarks + cat > assets/skills/run-benchmarks/SKILL.md << 'EOF' + --- + name: run-benchmarks + description: Run our standard inference benchmark suite against the running vLLM or SGLang server and compare against the baseline. + --- + + # Run benchmarks + + 1. Check which inference server is running (vLLM on port 8000 or SGLang on port 30000) + 2. Run the appropriate benchmark script from ./benchmarks/ + 3. Report throughput (tokens/sec), latency (TTFT, ITL), and memory utilization + 4. Compare against the baseline in ./benchmarks/baseline.json + EOF + ``` + + > [!TIP] + > Keep `AGENTS.md` focused on constraints and pitfalls (things that break). Put procedural workflows in skills (things you do step-by-step). + + + + - + id: troubleshooting + + label: Troubleshooting + content: | + # Skills don't appear in autocomplete / aren't discoverable + + Each agent discovers skills from a harness-specific directory in the current directory (or a parent). 
Check the right one: + + | Agent | Expected location | + |-------|-------------------| + | Claude Code | `.claude/skills/<skill>/SKILL.md` | + | Codex CLI | `.codex/prompts/<skill>.md` | + | Gemini CLI | `.gemini/commands/<skill>.md` | + | Cursor | `.cursor/rules/<skill>.mdc` | + + ```bash + # Examples — check the directory for your agent + ls -la .claude/skills/ + ls -la .codex/prompts/ + ls -la .gemini/commands/ + ls -la .cursor/rules/ + ``` + + You should see entries for `vllm-setup`, `sglang-setup`, `mig-configure`, and `dgx-diagnose`. + + **Check you're in the right directory:** + + ```bash + pwd + ``` + + The agent must be started from the directory containing the harness directory, or a subdirectory of it. + + # Context file not loaded + + If the agent gives generic answers without DGX Station awareness, the context file isn't being picked up. The expected filename depends on the agent; verify that the one for your agent exists: + + | Agent | Expected filename | + |-------|-------------------| + | Claude Code | `CLAUDE.md` (also reads `AGENTS.md` as fallback) | + | Codex CLI | `AGENTS.md` | + | Gemini CLI | `GEMINI.md` | + | Cursor | `AGENTS.md` | + + ```bash + # Verify the file exists for your agent + head -5 AGENTS.md + head -5 CLAUDE.md + head -5 GEMINI.md + + # Restart the agent in the correct directory + cd ~/your-project + claude # or codex, gemini, etc. + ``` + + All four agents read the context file from the working directory (and parent directories up to the project root). + + # Skill gives outdated information + + The skills contain validated container versions and parameters as of the publication date. 
If a newer container is available, edit the canonical source and re-install: + + ```bash + nano /path/to/playbook/assets/skills/vllm-setup/SKILL.md + /path/to/playbook/assets/install.sh all --force + ``` + + Or edit the installed copy directly: + + ```bash + # Claude Code + nano .claude/skills/vllm-setup/SKILL.md + # Codex + nano .codex/prompts/vllm-setup.md + # Gemini CLI + nano .gemini/commands/vllm-setup.md + # Cursor + nano .cursor/rules/vllm-setup.mdc + ``` + + > [!TIP] + > Skills are plain markdown — you can version them in git alongside your project code. + + # "Both GPUs cannot be used" errors + + This is the mixed-coherency constraint working as intended. If you see CUDA errors when using `--gpus all`: + + ```bash + # Find the GB300 index + nvidia-smi --query-gpu=index,name --format=csv,noheader + + # Use device-specific targeting + docker run --gpus '"device=1"' ... + ``` + + The `AGENTS.md` covers this constraint, but if you removed that section, add it back — it's the most important piece of DGX Station knowledge. + + # Skills conflict with existing project directory + + If your project already has a `.claude/`, `.codex/`, `.gemini/`, or `.cursor/` directory with its own contents, `install.sh` is **additive** for skill directories — it adds the new skill files alongside whatever you already have and warns on collision rather than overwriting. + + For context files (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`), the installer **refuses** to overwrite an existing file. Pass `--force` to override, or merge the new content manually: + + ```bash + # See what would be written + diff /path/to/playbook/assets/AGENTS.md ./AGENTS.md + + # Force overwrite + /path/to/playbook/assets/install.sh claude . --force + ``` + + # Installer reports "WROTE" for some files but "SKIP" for others + + That's the safe-by-default behavior. The installer skips any file that already exists, prints a warning, and continues with the rest. To get a clean install, either: + + 1. 
Delete the existing files first: `rm -rf .claude/skills/{vllm-setup,sglang-setup,mig-configure,dgx-diagnose}` + 2. Or pass `--force` (only affects context files; skill files are still skipped if present) + + + + + resources: + - name: Anthropic Agent Skills Overview + url: https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview + + + - name: AGENTS.md Standard + url: https://agents.md/ + + + - name: Claude Code Documentation + url: https://docs.anthropic.com/en/docs/claude-code + + + - name: OpenAI Codex AGENTS.md Guide + url: https://developers.openai.com/codex/guides/agents-md + + + - name: Gemini CLI Custom Commands + url: https://geminicli.com/docs/cli/custom-commands/ + + + - name: Cursor Rules Documentation + url: https://docs.cursor.com/ + + + - name: vLLM Documentation + url: https://docs.vllm.ai/en/latest/ + + + - name: SGLang Documentation + url: https://docs.sglang.io/ + + + - name: MIG User Guide + url: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/ + + diff --git a/nvidia/station-brev/endpoint-production.yaml b/nvidia/station-brev/endpoint-production.yaml new file mode 100644 index 0000000..d08fa9c --- /dev/null +++ b/nvidia/station-brev/endpoint-production.yaml @@ -0,0 +1,160 @@ +kind: Playbook +metadata: + name: station-brev + displayName: Register DGX Station to Brev + shortDescription: Link your DGX Station to Brev for remote access and sharing + publisher: nvidia + description: | + # REPLACE THIS WITH YOUR MODEL CARD + https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads + + labelsV2: + - gpuType:playbook:gpu_type_station + - DGX Station + - Brev + + attributes: + - key: DURATION + value: 5 MIN + +spec: + artifactName: station-brev + nvcfFunctionId: None + attributes: + + showUnavailableBanner: false + apiDocsUrl: None + termsOfUse: | + + cta: + text: Brev Overview + url: https://docs.nvidia.com/brev/concepts/overview + + + tabs: + - + id: overview + + label: 
Overview + content: | + # Basic idea + + NVIDIA Brev is an AI development platform that makes GPU environments remotely accessible, shareable, and easy to standardize using preconfigured setups called Launchables. + + This walkthrough will help you connect your NVIDIA DGX Station to Brev so it shows up as a managed GPU environment in Brev. After a one-time registration, your Station becomes remotely accessible and shareable. + + # What you'll accomplish + + You’ll register your DGX Station with Brev and it will be visible as a healthy node in the Brev web UI and CLI, ready to share access and accept workloads whenever needed. + + # What to know before starting + + While Brev automates the complex configuration, understanding a few key concepts when establishing the initial connection will be useful: + + * **Terminal Basics**: + * Familiarity with command-line use to run a few simple setup commands. + + # Prerequisites + + You will also need the following: + + * NVIDIA DGX Station with GB300 GPU + * **Brev Account**: + * Have an NVIDIA Brev account. [Create an NVIDIA Brev account](https://login.brev.nvidia.com/signin) if you don’t have one. + + * **Permissions**: + * You have administrative (root or sudo) access on the DGX Station device to run the registration command. + + # Time & risk + + * **Estimated time:** 5-10 minutes + * **Risk level:** Low - Registration configures the Station for secure remote access without altering your existing workloads + * **Rollback:** The Brev configuration can be removed through the UI and CLI + * **Last Updated:** 05/29/2026 + * First Publication + + + + - + id: instructions + + label: Instructions + content: | + # Step 1. Log in to Brev + + Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you’re in the correct org (by clicking the org button on the top right-hand side of the page). 
Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation. + + Click the “Register Compute” button and follow the instructions in the pop-up window. + + # Step 2. Complete Pop-up Instructions + + * Install the Brev CLI + * Configure your compute + * Add a name for your compute + * To configure SSH, ensure the “Enable SSH access” toggle is on + * Run the registration command + + > [!IMPORTANT] + > Run the Brev CLI install command **without `sudo`**. Prefixing the installer with `sudo` writes the `brev` binary into root's home directory, which is not on your user shell's `PATH` — the next command will fail with `brev: command not found`. Copy the install command from the pop-up and run it as your normal user. + + # Step 3. Follow Registration Flow + + The CLI walks you through registration. Follow the prompts until registration is complete. + + # Step 4. Confirm DGX Station in Brev UI + + * Go to the [Brev UI](https://brev.nvidia.com) + * Navigate to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section + * Confirm that the DGX Station appears as a registered node with a **Connected** status + + # Step 5. Next Steps + + Your DGX Station is now integrated into Brev as a secure, remotely accessible GPU environment. + + Now that your hardware is connected, you can: + + * **Access your machine from anywhere:** Open the [Brev UI](https://brev.nvidia.com) and launch a session from [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute). + * **Share access with others:** Invite teammates to your DGX Station from the Brev UI: + * Go to the [Brev UI](https://brev.nvidia.com) and open [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute). + * Find your DGX Station in the list and open the row's three-dot (⋯) menu. + * Select **Share Access**. 
+ * Enter the email address of the person you want to share with. + * Choose their role / permission level. + * Confirm to send the invitation. + + # Step 6. Cleanup + + If you ever decide to deregister your DGX Station from Brev, you can do so through either the Brev UI or the Brev CLI. + + With the CLI, run: + + ```bash + brev deregister + ``` + + In the UI: + * Go to the [Brev UI](https://brev.nvidia.com) + * Navigate to the section listing “GPU Environments” and look under “Registered Compute” + * Click the “Remove” menu item on the device you wish to delete from Brev. + * Confirm your selection. + + + + - + id: troubleshooting + + label: Troubleshooting + content: | + | Symptom | Cause | Fix | + |---------|-------|-----| + | Your DGX Station is showing up in the wrong org | It was registered to the wrong org | Run `brev set` with the correct org, then redo the registration process. | + | Unable to connect with `brev shell` | Need to refresh | Run `brev refresh`, then retry. | + + + + + resources: + - name: Brev Documentation + url: https://docs.nvidia.com/brev/latest + + diff --git a/nvidia/station-nemoclaw/README.md b/nvidia/station-nemoclaw/README.md index 384cc45..2ab2f89 100644 --- a/nvidia/station-nemoclaw/README.md +++ b/nvidia/station-nemoclaw/README.md @@ -118,8 +118,8 @@ All required assets are handled by the NemoClaw installer. No manual cloning is - **Estimated time:** About 30–60 minutes for a first full pass (install, onboard, model download depending on choice and network). Optional Brave, Telegram, and cloudflared steps add time if you do them in a second session. - **Risk level:** Medium — you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts. 
-- **Last Updated:** 05/29/2026 - - Update to latest nemoclaw installer instructions +- **Last Updated:** 06/01/2026 + - Pin nemoclaw installer to v0.0.55, the latest stable version ## Instructions @@ -127,10 +127,10 @@ All required assets are handled by the NemoClaw installer. No manual cloning is ### Step 1. Install NemoClaw -This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the pinned NemoClaw **v0.55** release (set via `NEMOCLAW_VERSION`; v0.55 is the version the NemoClaw team currently recommends as the most stable), builds the CLI, and runs the onboard wizard to create a sandbox. +This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the pinned NemoClaw **v0.0.55** release (set via `NEMOCLAW_INSTALL_TAG`; v0.0.55 is the version the NemoClaw team currently recommends as the most stable), builds the CLI, and runs the onboard wizard to create a sandbox. ```bash -curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_VERSION=v0.55 bash +curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_INSTALL_TAG=v0.0.55 bash ``` The installation wizard walks you through setup: @@ -148,7 +148,7 @@ The installer requires **Node.js 22.16+** (installed automatically if missing). During custom setup, the onboard wizard walks you through: 1. **Configuring inference** -- Choose to set up local inference on your DGX Station by selecting **`7) Local Ollama`**. -2. **Ollama models** -- Choose desired inference model. If no model is present locally, the installer will provide options to download models to start. +2. **Ollama models** -- Choose desired inference model. If no model is present locally, the installer will download **`qwen3.6:35b`** automatically. 3. **Sandbox name** -- Pick a name (e.g. my-assistant). Each sandbox requires a unique name. 4. **Apply this configuration** -- Enter `Y` to confirm setting up local inference. 5. **Enable Brave Web Search** -- Optional. 
If you enable it, paste a [Brave Search API](https://brave.com/search/api/) key when prompted. @@ -324,7 +324,7 @@ Open Telegram, find your bot, and send a message. The bot should forward traffic The cloudflared tunnel provides a **public URL for the Web UI dashboard** — it is not related to Telegram messaging. -Install cloudflared (DGX Station is arm64): +Install cloudflared (DGX Station is aarch64): ```bash curl -L --output cloudflared.deb \ @@ -354,7 +354,7 @@ You should see `● cloudflared` with a `trycloudflare.com` public URL. Set up NemoClaw Agents in general require three steps: Configure NemoClaw security policy, Run Agent Workflow Prompt, Personalize the Workflow for your own use case. -Checkout these [Example NemoClaw Agents](https://build.nvidia.com/station/nemoclaw-applications) for reference. Consider sharing your NemoClaw agent setup with the community at [DGX Station Developer Forum](https://forums.developer.nvidia.com/c/accelerated-computing/dgx-station-gb300) +Check out these [Example NemoClaw Agents](https://build.nvidia.com/spark/nemoclaw-applications) for reference. --- diff --git a/nvidia/station-nemoclaw/endpoint-production.yaml b/nvidia/station-nemoclaw/endpoint-production.yaml index 54569b2..01a441c 100644 --- a/nvidia/station-nemoclaw/endpoint-production.yaml +++ b/nvidia/station-nemoclaw/endpoint-production.yaml @@ -1,8 +1,8 @@ kind: Playbook metadata: name: station-nemoclaw - displayName: NemoClaw with Nemotron-3-Super and vLLM on DGX Station - shortDescription: Install NemoClaw on DGX Station with local vLLM inference and Telegram bot integration + displayName: Run NemoClaw with a Local LLM + shortDescription: Build your first local AI assistant on DGX Station using NemoClaw in a secure sandbox, with optional Telegram. 
publisher: nvidia description: | @@ -11,19 +11,15 @@ metadata: labelsV2: - gpuType:playbook:gpu_type_station - - DGX - DGX Station - - GB300 - - AI Agent + - Agentic Workflow - OpenShell - - vLLM - - Nemotron-3-Super - NemoClaw - Telegram attributes: - key: DURATION - value: 30 MINS + value: 30 MIN spec: artifactName: station-nemoclaw @@ -45,22 +41,19 @@ spec: label: Overview content: | - ## Overview + # Basic idea - ## Basic idea + **NVIDIA NemoClaw** is an open-source reference stack that simplifies running OpenClaw always-on assistants more safely. It installs the **NVIDIA OpenShell** runtime — an environment designed for executing agents with additional security — and connects them to local inference on your DGX Station. A single installer command (`nemoclaw.sh`) handles Node.js, OpenShell, and the NemoClaw CLI; the **onboard** wizard then creates a sandboxed agent, optional **Brave Search**, optional **messaging channels** (Telegram, Discord, or Slack), and a **policy tier** with network presets. - **NVIDIA NemoClaw** is an open-source reference stack that simplifies running OpenClaw always-on assistants more safely. It installs the **NVIDIA OpenShell** runtime -- an environment designed for executing agents with additional security -- and open-source models like NVIDIA Nemotron. A single installer command handles Node.js, OpenShell, and the NemoClaw CLI, then walks you through an onboard wizard to create a sandboxed agent on your DGX Station using vLLM with Nemotron 3 Super. - - By the end of this playbook you will have a working AI agent inside an OpenShell sandbox, accessible via a web dashboard and a Telegram bot, with inference routed to a local Nemotron 3 Super 120B model served by vLLM on your DGX Station -- all without exposing your host filesystem or network to the agent. 
+ By the end of this playbook you will have a working AI agent inside an OpenShell sandbox, reachable through the **Web UI** or **terminal TUI**, with inference served locally on the DGX Station. You can optionally add **Telegram**, a **cloudflared** tunnel (a public URL for the Web UI), and **web search**, all without exposing your host filesystem or network beyond what you explicitly allow in policy. ## What you'll accomplish - - Configure Docker and the NVIDIA container runtime for OpenShell on DGX Station - - Pull Nemotron 3 Super 120B (NVFP4) from Hugging Face and serve it with vLLM - - Install NemoClaw with a single command (handles Node.js, OpenShell, and the CLI) - - Run the onboard wizard to create a sandbox and configure local vLLM inference - - Chat with the agent via the CLI, TUI, and web UI - - Set up a Telegram bot that forwards messages to your sandboxed agent + - Install **NemoClaw** with one command (`nemoclaw.sh`), which pulls Node.js, OpenShell, and the CLI as needed + - Walk through the `nemoclaw onboard` wizard with recommended settings + - Open the **Web UI** to interact with the agent + - Optionally enable **Brave Search** or **Telegram** after onboarding + - **Clean up and uninstall** with the documented `uninstall.sh` flags when finished ## Notice and disclaimers @@ -74,14 +67,14 @@ spec: ### What you're getting - This experience is provided "AS IS" for demonstration purposes only -- no warranties, no guarantees. This is a demo, not a production-ready solution. You will need to implement appropriate security controls for your environment and use case. + This experience is provided "AS IS" for demonstration purposes only — no warranties, no guarantees. This is a demo, not a production-ready solution. You will need to implement appropriate security controls for your environment and use case. ### Key risks with AI agents - - **Data leakage** -- Any materials the agent accesses could be exposed, leaked, or stolen. 
- - **Malicious code execution** -- The agent or its connected tools could expose your system to malicious code or cyber-attacks. - - **Unintended actions** -- The agent might modify or delete files, send messages, or access services without explicit approval. - - **Prompt injection and manipulation** -- External inputs or connected content could hijack the agent's behavior in unexpected ways. + - **Data leakage** — Any materials the agent accesses could be exposed, leaked, or stolen. + - **Malicious code execution** — The agent or its connected tools could expose your system to malicious code or cyber-attacks. + - **Unintended actions** — The agent might modify or delete files, send messages, or access services without explicit approval. + - **Prompt injection and manipulation** — External inputs or connected content could hijack the agent's behavior in unexpected ways. ### Participant acknowledgement @@ -91,23 +84,22 @@ spec: | Layer | What it protects | When it applies | |------------|----------------------------------------------------|-----------------------------| - | Filesystem | Prevents reads/writes outside allowed paths. | Locked at sandbox creation. | + | Filesystem | Prevents reads/writes outside allowed paths. | Locked at sandbox creation. | | Network | Blocks unauthorized outbound connections. | Hot-reloadable at runtime. | - | Process | Blocks privilege escalation and dangerous syscalls.| Locked at sandbox creation. | + | Process | Blocks privilege escalation and dangerous syscalls.| Locked at sandbox creation. | | Inference | Reroutes model API calls to controlled backends. | Hot-reloadable at runtime. 
| ## What to know before starting - Basic use of the Linux terminal and SSH - - Familiarity with Docker (permissions, `docker run`) + - Familiarity with Docker (permissions, `docker run`, optional `docker` group membership) - Awareness of the security and risk sections above ## Prerequisites - **Hardware and access:** + **Hardware:** - A DGX Station (GB300) with keyboard and monitor, or SSH access - - A **Telegram bot token** from [@BotFather](https://t.me/BotFather) (create one with `/newbot`) -- optional, for Phase 3 **Software:** @@ -119,16 +111,16 @@ spec: head -n 2 /etc/os-release nvidia-smi docker info --format '{{.ServerVersion}}' - df -h / /var/lib/docker 2>/dev/null | head -20 ``` - Expected: Ubuntu 24.04, NVIDIA GB300 GPU(s), Docker 28.x+, and **enough free disk** for Docker layers, the NemoClaw sandbox image, and Hugging Face cache (treat **~40 GB free** on the Docker data filesystem as a practical minimum; very low free space can surface as cryptic onboard errors such as “K8s namespace not ready”). + Expected: Ubuntu 24.04, NVIDIA GB300 GPU, Docker 28.x+. ## Have ready before you begin - | Item | Where to get it | - |------|----------------| - | Telegram bot token (optional) | [@BotFather](https://t.me/BotFather) on Telegram -- create with `/newbot` | + | Item | When you need it | + |------|------------------| + | **Telegram bot token** (optional) | Create with [@BotFather](https://t.me/BotFather) (`/newbot`). You can paste it during **onboarding** (Step 3) **or** when you run **`nemoclaw channels add telegram`** later. | + | **Brave Search API key** (optional) | From [Brave Search API](https://brave.com/search/api/) if you enable web search during onboarding or via **`nemoclaw onboard --fresh --gpu`** (`--fresh` re-prompts every onboarding question, including features you previously skipped; without `--fresh` the wizard resumes the previous session and will not re-prompt). 
| ## Ancillary files @@ -136,10 +128,10 @@ spec: ## Time and risk - - **Estimated time:** 20--30 minutes (with model already downloaded). First-time model download adds ~10--20 minutes depending on network speed. - - **Risk level:** Medium -- you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts. - - **Last Updated:** 04/27/2026 - * First publication for DGX Station with vLLM + - **Estimated time:** About 30–60 minutes for a first full pass (install, onboard, model download depending on choice and network). Optional Brave, Telegram, and cloudflared steps add time if you do them in a second session. + - **Risk level:** Medium — you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts. + - **Last Updated:** 05/29/2026 + - Update to latest nemoclaw installer instructions @@ -148,355 +140,111 @@ spec: label: Instructions content: | - # Phase 1: Prerequisites + # Phase 1: Install and Run NemoClaw - These steps prepare a fresh DGX Station for NemoClaw. If Docker, the NVIDIA runtime, and vLLM are already configured, skip to Phase 2. + ## Step 1. Install NemoClaw - > [!IMPORTANT] - > **Disk space:** NemoClaw’s onboard flow pulls a multi-gigabyte sandbox image and runs Docker, k3s, and the gateway together. If root or Docker’s data disk is nearly full (for example only a few gigabytes free), onboarding can fail with generic errors such as **“K8s namespace not ready”** with no clear hint about storage. Before you start, check free space: `df -h / /var/lib/docker`. NVIDIA recommends **at least 40 GB free** on the filesystem that holds Docker layers (often `/` or `/var/lib/docker`); treat **under ~15 GB** as high risk for first-time onboard failures. - - ## Step 1. 
Configure Docker and the NVIDIA container runtime - - OpenShell's gateway runs k3s inside Docker. On DGX Station (Ubuntu 24.04, cgroup v2), Docker must be configured with the NVIDIA runtime and host cgroup namespace mode. - - Configure the NVIDIA container runtime for Docker: + This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the pinned NemoClaw **v0.55** release (set via `NEMOCLAW_VERSION`; v0.55 is the version the NemoClaw team currently recommends as the most stable), builds the CLI, and runs the onboard wizard to create a sandbox. ```bash - sudo nvidia-ctk runtime configure --runtime=docker + curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_VERSION=v0.55 bash ``` - Expected: + The installation wizard walks you through setup: - ```text - INFO Loading config from /etc/docker/daemon.json - INFO Wrote updated config to /etc/docker/daemon.json - INFO It is recommended that docker daemon be restarted. - ``` + 1. **Accept NemoClaw license** -- Confirm by entering `yes` + 2. **Run express install** -- Confirm by entering `Y` - Set the cgroup namespace mode required by OpenShell on DGX Station: + The installer requires **Node.js 22.16+** (installed automatically if missing). It runs in three phases: Node.js, NemoClaw CLI, and Onboarding. The next step describes the Onboarding configuration in detail.
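If you want to know ahead of time whether the installer will need to install Node.js, you can compare the local version against the 22.16 minimum stated above. This is a minimal sketch, not part of the NemoClaw installer: the helper name `version_ge` is hypothetical, and it assumes GNU `sort -V` is available (as on Ubuntu).

```bash
# Hypothetical helper (not part of NemoClaw): succeed if version $1 >= version $2.
# Relies on GNU sort's -V (version sort): the smaller version sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Report whether the installed Node.js meets the 22.16 minimum.
check_node() {
  if ! command -v node >/dev/null 2>&1; then
    echo "Node.js not found; the installer will install it."
    return 0
  fi
  local have
  have="$(node --version)"   # prints e.g. v22.22.2
  have="${have#v}"           # drop the leading "v"
  if version_ge "$have" "22.16.0"; then
    echo "Node.js $have meets the 22.16+ requirement."
  else
    echo "Node.js $have is older than 22.16; the installer should handle the upgrade."
  fi
}
```

Run `check_node` on the DGX Station before the install command. It only reports and changes nothing on the system.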
- ```bash - sudo python3 -c " - import json, os - path = '/etc/docker/daemon.json' - d = json.load(open(path)) if os.path.exists(path) else {} - d['default-cgroupns-mode'] = 'host' - json.dump(d, open(path, 'w'), indent=2) - " - ``` - - Restart Docker: - - ```bash - sudo systemctl restart docker - ``` - - Verify the NVIDIA runtime works: - - ```bash - docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi - ``` - - Expected: - - ```text - +-----------------------------------------------------------------------------------------+ - | NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 | - +-----------------------------------------+------------------------+----------------------+ - | 0 NVIDIA GB300 On | 00000009:06:00.0 Off | 0 | - | N/A 46C P0 215W / 1300W | 18661MiB / 256703MiB | 0% Default | - +-----------------------------------------+------------------------+----------------------+ - ``` - - If you get a permission denied error on `docker`, add your user to the Docker group and activate the new group in your current session: - - ```bash - sudo usermod -aG docker $USER - newgrp docker - ``` - - This applies the group change immediately. Alternatively, you can log out and back in instead of running `newgrp docker`. + ## Step 2. NemoClaw Onboarding > [!NOTE] - > DGX Station uses cgroup v2. OpenShell's gateway embeds k3s inside Docker and needs host cgroup namespace access. Without `default-cgroupns-mode: host`, the gateway can fail with "Failed to start ContainerManager" errors. + > If you chose **express install** in Step 1, all settings are auto-configured with recommended defaults. Skip to Step 3. - ## Step 2. Pull the Nemotron-3-Super model + During custom setup, the onboard wizard walks you through: - Install pip and the Hugging Face CLI (if not already installed): + 1. **Configuring inference** -- Choose to set up local inference on your DGX Station by selecting **`7) Local Ollama`**. + 2. 
**Ollama models** -- Choose desired inference model. If no model is present locally, the installer will provide options to download models to start. + 3. **Sandbox name** -- Pick a name (e.g. my-assistant). Each sandbox requires a unique name. + 4. **Apply this configuration** -- Enter `Y` to confirm setting up local inference. + 5. **Enable Brave Web Search** -- Optional. If you enable it, paste a [Brave Search API](https://brave.com/search/api/) key when prompted. + 6. **Messaging channels** -- Optional. If you enable it, choose your desired bot (`telegram`, `discord` or `slack`) and paste your bot token when prompted. + 7. **Policy presets** -- Choose desired Policy tier (`Balanced` recommended) and accept/edit the suggested presets when prompted (confirm with **Enter**). - ```bash - sudo apt install -y python3-pip - pip3 install --break-system-packages huggingface-hub - ``` - - Download Nemotron 3 Super 120B in NVFP4 quantization (~60 GB; may take 10--20 minutes depending on network speed): - - ```bash - hf download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 - ``` - - Expected (on a fresh download; cached downloads complete instantly): - - ```text - Fetching 36 files: 100%|██████████| 36/36 [15:42<00:00, 26.18s/it] - /home/nvidia/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/snapshots/0d6fa3ecad422a... - ``` - - Verify the download completed: - - ```bash - ls ~/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/ - ``` - - Expected: - - ```text - blobs refs snapshots - ``` - - > [!NOTE] - > The NVFP4 quantization is chosen because it fits entirely in **one** GB300 GPU’s 256 GB HBM3e with room for KV cache. On a **two-GPU** station you can still use NVFP4 with `--tensor-parallel-size 1` and a single visible GPU, or shard with `--tensor-parallel-size 2`. For other quantization variants, see [Troubleshooting](troubleshooting.md). - - ## Step 3. 
Start the vLLM inference server - - Launch vLLM using the NVIDIA-optimized container image. - - **Single GPU (default on one-GPU systems, or pin to one GPU on multi-GPU stations):** vLLM can emit **mixed device** warnings if several GPUs are visible but the model is only meant to use one. Pinning avoids accidentally placing weights on an unexpected device. - - ```bash - docker run -d --name vllm-nemotron \ - --runtime nvidia --gpus '"device=0"' \ - -e CUDA_VISIBLE_DEVICES=0 \ - -v ~/.cache/huggingface:/root/.cache/huggingface \ - -p 8000:8000 \ - --restart unless-stopped \ - nvcr.io/nvidia/vllm:26.03-py3 \ - python3 -m vllm.entrypoints.openai.api_server \ - --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \ - --host 0.0.0.0 \ - --port 8000 \ - --tensor-parallel-size 1 \ - --trust-remote-code \ - --max-model-len 32768 \ - --enable-auto-tool-choice \ - --tool-call-parser qwen3_xml \ - --reasoning-parser nemotron_v3 - ``` - - **Two GPUs (tensor parallel):** If your DGX Station has two Blackwell GPUs and you want Nemotron sharded across both, use both devices and set tensor parallel size to `2` (VRAM is summed across the GPUs): - - ```bash - docker run -d --name vllm-nemotron \ - --runtime nvidia --gpus all \ - -e CUDA_VISIBLE_DEVICES=0,1 \ - -v ~/.cache/huggingface:/root/.cache/huggingface \ - -p 8000:8000 \ - --restart unless-stopped \ - nvcr.io/nvidia/vllm:26.03-py3 \ - python3 -m vllm.entrypoints.openai.api_server \ - --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \ - --host 0.0.0.0 \ - --port 8000 \ - --tensor-parallel-size 2 \ - --trust-remote-code \ - --max-model-len 32768 \ - --enable-auto-tool-choice \ - --tool-call-parser qwen3_xml \ - --reasoning-parser nemotron_v3 - ``` - - **Pick a GPU index by name (optional one-liner):** To print the device index of the first GPU whose name contains `GB300` (adjust the pattern if your `nvidia-smi` name string differs), run on the host: - - ```bash - nvidia-smi --query-gpu=index,name --format=csv,noheader | 
awk -F', ' '/GB300/ { gsub(/^ +/,"",$1); print $1; exit }' - ``` - - Use that index in Docker as `--gpus '"device=N"'` (replace `N` with the printed index). - - > [!NOTE] - > **`--tool-call-parser qwen3_xml`:** Nemotron’s tool-call wire format is exposed through vLLM’s **Qwen3-compatible XML tool parser** — the name refers to the parser implementation, not the base model. This pairing is what vLLM expects for correct function/tool calling with this checkpoint. - - The first startup loads ~70 GB of weights into GPU memory. Watch the logs until you see the model is ready: - - ```bash - docker logs -f vllm-nemotron - ``` - - Wait until you see the following in the logs (typically 3--5 minutes): - - ```text - INFO Loading weights took 55.47 seconds - INFO Model loading took 69.39 GiB memory and 71.31 seconds - INFO: Started server process [1] - INFO: Waiting for application startup. - INFO: Application startup complete. - ``` - - Then verify the API is responding: - - ```bash - curl -s http://localhost:8000/v1/models - ``` - - Expected: - - ```json - {"object":"list","data":[{"id":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","object":"model",...}]} - ``` - - Send a test request to warm up the model before proceeding to Step 4. The first inference request compiles CUDA graphs and can take 30--90 seconds: - - ```bash - curl -s --max-time 120 http://localhost:8000/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{"model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","messages":[{"role":"user","content":"Say hello."}],"max_tokens":10}' - ``` - - Expected (the first request may take 30--90 seconds; subsequent requests are much faster): - - ```json - {"id":"chatcmpl-...","object":"chat.completion","model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","choices":[{"index":0,"message":{"role":"assistant","content":"..."},"finish_reason":"length"}],...} - ``` - - > [!IMPORTANT] - > Warm up the model before running the NemoClaw installer. 
The onboard wizard validates the vLLM endpoint with a short timeout. If the model has not served at least one request, this validation will time out and the install will fail. - - > [!IMPORTANT] - > Always start vLLM via the Docker container -- do not run `vllm serve` directly on the host. The NVIDIA container image (`nvcr.io/nvidia/vllm:26.03-py3`) includes optimized kernels for the GB300's Blackwell architecture that are not available in the pip-installed version. - - > [!NOTE] - > Key flags explained: - > - `--tensor-parallel-size` -- `1` for a single visible GPU; `2` when you expose two GPUs for tensor-parallel sharding (see Step 3). - > - `--trust-remote-code` -- required for the Mamba2-Transformer hybrid architecture - > - `--max-model-len 32768` -- maximum context length (increase up to 1M if VRAM allows) - > - `--enable-auto-tool-choice --tool-call-parser qwen3_xml` -- enables function/tool calling for the agent (see the note above on the parser name). - > - `--reasoning-parser nemotron_v3` -- separates chain-of-thought reasoning from the response so the TUI/Web UI can display them cleanly - - --- - - # Phase 2: Install and Run NemoClaw - - ## Step 4. Install NemoClaw - - The installer script installs Node.js (if needed), OpenShell, the NemoClaw CLI, and runs onboarding to create a sandbox. The vLLM provider requires the **experimental** flag and an **extended inference timeout** (the default 15-second validation timeout is too short for a 120B model). - - ### Recommended: non-interactive install (copy-paste friendly) - - This path is best for SSH sessions, automation, and documentation — no arrow-key TUI in the terminal. 
- - ```bash - NEMOCLAW_EXPERIMENTAL=1 \ - NEMOCLAW_NON_INTERACTIVE=1 \ - NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \ - NEMOCLAW_SANDBOX_NAME=my-assistant \ - NEMOCLAW_PROVIDER=vllm \ - NEMOCLAW_MODEL="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4" \ - NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 \ - bash -c "$(curl -fsSL https://www.nvidia.com/nemoclaw.sh)" - ``` - - Optional: include **Telegram** in the first onboard without typing the token over SSH — export credentials on the host **before** running the installer (same variables the [NemoClaw Telegram bridge guide](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html) documents): - - ```bash - export TELEGRAM_BOT_TOKEN='' - # Optional DM allowlist (comma-separated Telegram user IDs): - # export TELEGRAM_ALLOWED_IDS='123456789,987654321' - ``` - - Use [Telegram Desktop](https://desktop.telegram.org/) or [web.telegram.org](https://web.telegram.org/) on a laptop to copy the token from [@BotFather](https://t.me/BotFather) and paste into your SSH session (or into a small env file you `source`). Typing a 46+ character token on a phone keyboard into a remote shell is error-prone. - - To **persist** `TELEGRAM_BOT_TOKEN` across reboots, keep it in a root-owned or user-only file and source it from your shell profile (example — adjust path and permissions): - - ```bash - install -m 600 /dev/null ~/.nemoclaw/telegram.env - nano ~/.nemoclaw/telegram.env # add: export TELEGRAM_BOT_TOKEN='...' - grep -q 'nemoclaw/telegram.env' ~/.bashrc || echo 'source ~/.nemoclaw/telegram.env 2>/dev/null' >> ~/.bashrc - ``` - - NemoClaw also stores messaging credentials in its credential store when you onboard or run `nemoclaw … channels add telegram`; the file above is mainly for **re-running scripts** or **non-interactive** flows that read the environment. 
- - ### Alternative: interactive installer - - If you prefer the wizard: - - ```bash - NEMOCLAW_EXPERIMENTAL=1 \ - NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 \ - bash -c "$(curl -fsSL https://www.nvidia.com/nemoclaw.sh)" - ``` - - The wizard asks **six** high-level prompts (third-party notice, inference provider, Brave Search, messaging channels, sandbox name, policy presets). In parallel, the installer prints **eight** numbered onboard sub-phases, `[1/8]` … `[8/8]` (preflight, gateway, inference detection, inference route, messaging channels, sandbox creation, OpenClaw inside sandbox, policy presets). **Those two numberings are different on purpose** — the `[n/8]` lines are internal progress steps; the numbered list above is what you answer in the TUI. - - 1. **Third-party software notice** -- Type `yes` to accept and continue. - 2. **Inference provider** -- The wizard detects vLLM running locally. Select option **8** (`Local vLLM [experimental] — running`). - 3. **Brave Web Search** -- Optional. Type `skip` if you don't have a Brave Search API key. - 4. **Messaging channels** -- Optional. Press **Enter** to skip, or toggle Telegram/Discord/Slack if desired (this is the step that corresponds to onboard phase **[5/8]** in the log). - 5. **Sandbox name** -- Pick a name (e.g. `my-assistant`). Names must be lowercase alphanumeric with hyphens only. - 6. **Policy presets** -- Use arrow keys to toggle presets. `pypi` and `npm` are selected by default. Press **Enter** to confirm. - - The install takes approximately 3 minutes. Example milestones in the output (wording may vary slightly by release): - - ```text - [1/3] Node.js - Node.js found: v22.22.2 - - [2/3] NemoClaw CLI - Installing NemoClaw from GitHub... 
- Verified: nemoclaw is available at /home/nvidia/.local/bin/nemoclaw - - [3/3] Onboarding - [1/8] Preflight checks - ✓ Docker is running - ✓ NVIDIA GPU detected: 2 GPU(s), 256703 MB VRAM # example on a two-GPU system - [2/8] Starting OpenShell gateway - ✓ Gateway is healthy - [3/8] Configuring inference (NIM) - ✓ Using existing vLLM on localhost:8000 - Detected model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 - [4/8] Setting up inference provider - ✓ Inference route set: vllm-local / nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 - [5/8] Messaging channels - (example) Telegram disabled — skipped - # or: Telegram enabled; token stored in credential store - [6/8] Creating sandbox - ✓ Sandbox 'my-assistant' created - [7/8] Setting up OpenClaw inside sandbox - ✓ OpenClaw gateway launched inside sandbox - [8/8] Policy presets - Applied preset: pypi - Applied preset: npm - ``` - - When complete you will see: + When complete you will see output like: ```text ────────────────────────────────────────────────── Sandbox my-assistant (Landlock + seccomp + netns) - Model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (Local vLLM) + Model (Local Ollama) ────────────────────────────────────────────────── Run: nemoclaw my-assistant connect Status: nemoclaw my-assistant status Logs: nemoclaw my-assistant logs --follow - - OpenClaw UI (tokenized URL; treat it like a password) - http://127.0.0.1:18789/#token= ────────────────────────────────────────────────── ``` - > [!IMPORTANT] - > Save the tokenized Web UI URL printed at the end -- you will need it in Step 8. It looks like: - > `http://127.0.0.1:18789/#token=` + > [!NOTE] + > - If `nemoclaw` is not found after install, run `source ~/.bashrc` to reload your shell path. + > - Time to finish **Onboarding** can vary, depending on the model choice and internet speed. + + NemoClaw Onboarding can be run repeatedly to create multiple sandboxes for independent use cases.
Use `--name ` to create an additional sandbox alongside any existing ones: + + ```bash + nemoclaw onboard --gpu --name + ``` > [!IMPORTANT] - > `NEMOCLAW_EXPERIMENTAL=1` is required for the vLLM provider. Without it, the installer will report "Requested provider 'vllm' is not available in this environment." + > Use `--name ` to create an additional sandbox without affecting existing ones. The `--fresh` flag is a destructive option reserved for starting a completely new onboard session — if a sandbox with the same name already exists, `--fresh` will **destroy and recreate it**. Only use `--fresh` when you intend to wipe and re-onboard (see Step 4 for an example where re-prompting is required). + + ## Step 3. Interact with OpenClaw + + There are two ways to interact with your OpenClaw, Web UI or terminal UI. + + ### Option 1. Web UI + + Get the full dashboard URL (includes the auto-assigned port and token): + + ```bash + nemoclaw my-assistant dashboard-url --quiet + ``` + + This prints a URL like `http://127.0.0.1:18790/#token=`. The port is auto-assigned (commonly 18789 or 18790) and may differ between installs. + + **If accessing the Web UI directly on the DGX Station** (keyboard and monitor attached), open the dashboard URL in a browser. + + **If accessing the Web UI from a remote machine**, you need to set up an SSH tunnel. + + First, note the port number from the dashboard URL above (e.g. `18790`). + + Find your DGX Station's IP address: + + ```bash + hostname -I | awk '{print $1}' + ``` + + This prints the primary IP address (e.g. `192.168.1.42`). You can also find it in **Settings > Wi-Fi** or **Settings > Network** on the DGX Station's desktop, or check your router's connected-devices list. + + From your remote machine, create an SSH tunnel using the port from above (replace `` and ``): + + ```bash + ssh -L :127.0.0.1: @ + ``` + + Now open the dashboard URL in your remote machine's browser. 
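Because the dashboard port is auto-assigned, it can help to derive the tunnel command from the dashboard URL instead of typing the port by hand. The following is a minimal sketch under stated assumptions: `dashboard_port` and `tunnel_command` are hypothetical helper names (not NemoClaw commands), and the URL is assumed to have the `http://127.0.0.1:PORT/#token=...` shape shown above, with an explicit port.

```bash
# Hypothetical helpers (not part of the NemoClaw or OpenShell CLIs).

# Print the port from a URL of the form http://HOST:PORT/#token=...
dashboard_port() {
  local url="$1"
  local hostport="${url#*://}"      # strip the scheme   -> 127.0.0.1:18790/#token=...
  hostport="${hostport%%/*}"        # strip the path     -> 127.0.0.1:18790
  printf '%s\n' "${hostport##*:}"   # keep text after the last colon
}

# Print the ssh command that forwards the dashboard port to the remote machine.
tunnel_command() {
  local url="$1" user="$2" host="$3"
  local port
  port="$(dashboard_port "$url")"
  printf 'ssh -L %s:127.0.0.1:%s %s@%s\n' "$port" "$port" "$user" "$host"
}
```

On the DGX Station, something like `tunnel_command "$(nemoclaw my-assistant dashboard-url --quiet)" your-user 192.168.1.42` prints the command to paste on the remote machine. The user name and IP address here are placeholders.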
> [!IMPORTANT] - > `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300` extends the validation timeout from the default 15 seconds to 300 seconds. Without this, the endpoint validation will fail on a cold 120B model, even if you warmed it up in Step 3 -- the installer sends its own test prompt which may be slower. + > Use `127.0.0.1`, not `localhost` -- the gateway origin check requires an exact match. > [!NOTE] - > If `nemoclaw` is not found after install, run `source ~/.bashrc` to reload your shell path. + > If the Web UI fails to load, the port forward may be stale. Get the port from `nemoclaw my-assistant dashboard-url --quiet` and reset the forward: + > ```bash + > openshell forward stop my-assistant || true + > openshell forward start my-assistant --background + > ``` - ## Step 5. Connect to the sandbox and verify inference + ### Option 2. Terminal UI Connect to the sandbox: @@ -504,207 +252,158 @@ spec: nemoclaw my-assistant connect ``` - Expected: - - ```text - sandbox@my-assistant:~$ - ``` - - You are now inside the sandboxed environment. Verify that the inference route is working: - - ```bash - curl -sf https://inference.local/v1/models - ``` - - Expected: - - ```json - {"object":"list","data":[{"id":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","object":"model",...}]} - ``` - - ## Step 6. Talk to the agent (CLI) - - Still inside the sandbox, send a test message **through the OpenClaw gateway** (the default path). The `--local` flag is **intentionally blocked** inside the NemoClaw OpenShell sandbox — it would bypass gateway controls — so the command you may see in generic OpenClaw quickstarts will fail here. - - ```bash - openclaw agent --agent main -m "hello" --session-id test - ``` - - Expected (the agent will think, then respond -- first response may take 30--90 seconds): streaming or printed assistant text ending with a normal reply. - - If you see a response from the agent, inference is working end-to-end. - - ## Step 7.
Interactive TUI - - Launch the terminal UI for an interactive chat session: + Then launch the terminal UI inside the sandbox: ```bash openclaw tui ``` - Press **Ctrl+C** to exit the TUI. + You can start chatting with OpenClaw. Press **Ctrl+C** to exit the terminal UI. - ## Step 8. Exit the sandbox and access the Web UI - - Exit the sandbox to return to the host: + To exit the sandbox: ```bash exit ``` - **If accessing the Web UI directly on the DGX Station** (keyboard and monitor attached), open a browser and navigate to the tokenized URL from Step 4. Prefer **`127.0.0.1`** in the URL bar (not `localhost`) so it matches strict gateway origin checks: - - ```text - http://127.0.0.1:18789/#token= - ``` - - **If accessing the Web UI from a remote machine**, you need to set up port forwarding. - - First, find your DGX Station's IP address. On the Station, run: - - ```bash - hostname -I | awk '{print $1}' - ``` - - Start the port forward on the DGX Station host: - - ```bash - openshell forward start 18789 my-assistant --background - ``` - - Expected: - - ```text - Forwarding 127.0.0.1:18789 -> my-assistant:18789 (background) - ``` - - If the forward was already started during onboarding, you will see: - - ```text - Error: Port 18789 is already forwarded to sandbox 'my-assistant'. - ``` - - This is fine -- the forward is already running. - - Then from your remote machine, create an SSH tunnel to the Station (replace `` with the IP address from above): - - ```bash - ssh -L 18789:127.0.0.1:18789 @ - ``` - - Now open the tokenized URL in your remote machine's browser. Either of these usually works on the **client** side because both bind to your loopback through the tunnel: - - ```text - http://127.0.0.1:18789/#token= - ``` - - > [!IMPORTANT] - > Use `127.0.0.1`, not `localhost` -- the gateway origin check requires an exact match. 
- --- - # Phase 3: Telegram Bot + # Phase 2: Modify NemoClaw Policy - Messaging (Telegram, Discord, Slack) is **wired during onboarding** — credentials are stored, OpenShell providers are created, and channel configuration is **baked into the sandbox image**. Runtime config under `/sandbox/.openclaw/` is not safely patchable from inside the running sandbox. + ## Step 4. Enable Brave Search in sandbox - **`nemoclaw start` does not start the Telegram bridge.** In current NemoClaw releases it starts **optional host services** such as the **cloudflared** tunnel when installed; Telegram delivery stays under OpenShell. See [NemoClaw commands](https://docs.nvidia.com/nemoclaw/latest/reference/commands.html) and [Set up Telegram bridge](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html). - - ## Step 9. Create a Telegram bot - - Open Telegram, find [@BotFather](https://t.me/BotFather), send `/newbot`, and follow the prompts. Copy the bot token. - - **Tip:** Use [Telegram Desktop](https://desktop.telegram.org/) or [web.telegram.org](https://web.telegram.org/) so you can **copy-paste** the token into your terminal or env file instead of typing 46+ characters from your phone into SSH. - - ## Step 10. Enable Telegram (first time or after skipping it) - - ### Path A — You have not installed yet, or you can re-run onboard - - Export the token on the **host**, then run the installer / onboard again (non-interactive variables from Step 4, plus `TELEGRAM_BOT_TOKEN`). The wizard’s **Messaging channels** step (installer phase **[5/8]**) is the right time to toggle Telegram interactively. - - Re-onboarding after a sandbox exists is supported; NemoClaw can detect token changes and rebuild the sandbox — see the official [Telegram bridge](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html) page. 
- - ### Path B — NemoClaw is already installed (recommended host command) - - On the **host** (run `exit` if you are inside `nemoclaw … connect`): - - 1. **Allow outbound access to the Telegram API** if you have not already — add the `telegram` network preset: + To add Brave Web Search to an existing sandbox, re-run the onboard wizard with `--fresh` to start a new session that re-prompts all options (including previously skipped features): ```bash - nemoclaw my-assistant policy-add + nemoclaw onboard --fresh --gpu ``` - When prompted, select `telegram` and confirm. + > [!NOTE] + > Without `--fresh`, the onboard wizard **resumes** the previous session and will not re-prompt for features you already skipped. - 2. **Register the bot token and rebuild** the sandbox image so Telegram is included: + When you reach **Enable Brave Web Search**, choose **yes** and paste the key from the [Brave Search API](https://brave.com/search/api/) console. Confirm the same sandbox name and inference choices where prompted. The wizard will **rebuild** the sandbox so the key is applied. + + > [!NOTE] + > Alternatively, set `BRAVE_API_KEY` in your environment before running the installer and Brave Search will be enabled automatically during onboard. + + To confirm web search is enabled, relaunch your OpenClaw WebUI or terminal UI. Ask the agent for something that needs **live web search**. If requests still fail, recheck **`policy-list`** and re-read the onboard output for Brave/API errors. + + ## Step 5. Set up Messaging Channel (Telegram Bot as an example) + + These steps apply when your sandbox exists but **Telegram was never configured** (you skipped **Messaging channels** in Step 2, or the sandbox policy tier never included Telegram-related egress). Replace `` with your sandbox (for example `my-assistant`). + + ### 1. Create a Telegram bot + + In Telegram, open [@BotFather](https://t.me/BotFather), send `/newbot`, and complete the prompts. 
Copy the **bot token** BotFather returns and keep it ready for the next step. + + ### 2. Register Telegram with NemoClaw and rebuild the sandbox ```bash - export TELEGRAM_BOT_TOKEN='' - nemoclaw my-assistant channels add telegram + nemoclaw channels add telegram ``` - Follow the prompts to rebuild when asked (or run `nemoclaw my-assistant rebuild --yes` afterward if non-interactive mode queued a rebuild — see `NEMOCLAW_NON_INTERACTIVE=1` behavior in the [commands reference](https://docs.nvidia.com/nemoclaw/latest/reference/commands.html)). + Paste the token when prompted. NemoClaw persists credentials and **rebuilds** the sandbox so OpenClaw can use Telegram as a messaging channel. - 3. **Pause or resume** Telegram delivery without changing credentials: use the **`nemoclaw channels stop`** / **`nemoclaw channels start`** patterns for the `telegram` channel described in [Set up Telegram bridge](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html) (exact subcommand spelling may vary slightly by NemoClaw version; use `nemoclaw --help` if in doubt). + ### 3. (If needed) Allow Telegram egress in the sandbox policy - Check overall status: + If messages fail with network or policy errors after the channel is registered, inspect presets and add Telegram-related egress if your tier omitted it: + + ```bash + nemoclaw policy-list + nemoclaw policy-add telegram + ``` + + Preset names follow your selected tier; confirm against [Network policies](https://docs.nvidia.com/nemoclaw/latest/reference/network-policies.html). + + ### 4. Verify Telegram + + Telegram uses long-polling (`getUpdates`) — the sandbox actively pulls messages from Telegram servers. **No public URL or cloudflared tunnel is required for Telegram to work.** + + Open Telegram, find your bot, and send a message. The bot should forward traffic to the agent in your NemoClaw sandbox and reply. 
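If the bot stays silent, one quick check is whether Telegram accepts the token at all, independent of NemoClaw. The sketch below calls the Telegram Bot API `getMe` method from the host. The helper names `telegram_api_url` and `check_bot_token` are hypothetical, and the check assumes the host itself (not the sandbox) has outbound access to `api.telegram.org`.

```bash
# Hypothetical helpers (not part of NemoClaw).

# Build a Telegram Bot API URL for a given token and method name.
telegram_api_url() {
  local token="$1" method="$2"
  printf 'https://api.telegram.org/bot%s/%s\n' "$token" "$method"
}

# Succeed only if Telegram accepts the token. getMe is read-only,
# so it does not consume messages the sandbox is waiting to poll.
check_bot_token() {
  curl -fsS --max-time 10 "$(telegram_api_url "$1" getMe)" | grep -q '"ok":true'
}
```

For example, `check_bot_token "$TELEGRAM_BOT_TOKEN" && echo "token accepted"`. A failure here points at the token or host network rather than at the sandbox policy.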
+ + > [!NOTE] + > The first response may take longer depending on model size (30B models respond in a few seconds; larger models may take longer on first inference). + + > [!NOTE] + > If the bot does not respond: + > - Run `nemoclaw status` to confirm the sandbox is running and inference is healthy. + > - Run `nemoclaw logs --follow` and look for Telegram-related errors. + > - If Telegram egress is missing, run `nemoclaw policy-add` and select `telegram`. + > - If the channel was never registered, run `nemoclaw channels add telegram`. + + > [!NOTE] + > The `channels add telegram` wizard also prompts for an optional **Telegram User ID** to restrict who can DM the bot. Send `/start` to [@userinfobot](https://t.me/userinfobot) on Telegram to get your numeric user ID. If you skip this, the bot will require device pairing (a terminal-based code confirmation) before responding to messages. + + > [!NOTE] + > For details on restricting which Telegram chats can interact with the agent, see the [NemoClaw Telegram bridge documentation](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html). + + ### 5. (Optional) Install cloudflared for remote Web UI access + + The cloudflared tunnel provides a **public URL for the Web UI dashboard** — it is not related to Telegram messaging. + + Install cloudflared (DGX Station is arm64): + + ```bash + curl -L --output cloudflared.deb \ + https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-arm64.deb + sudo dpkg -i cloudflared.deb + ``` + + Start the tunnel: + + ```bash + nemoclaw tunnel start + ``` + + Verify: ```bash nemoclaw status ``` - Open Telegram, find your bot, and send it a message. + You should see `● cloudflared` with a `trycloudflare.com` public URL. - > [!NOTE] - > The first response may take 30--90 seconds for a 120B parameter model running locally. 
+ --- - > [!NOTE] - > To **persist** `TELEGRAM_BOT_TOKEN` for shell-based flows, use a `chmod 600` env file and `source` it from `~/.bashrc` as shown in Step 4. + # Phase 3: Set Up NemoClaw Agent - > [!NOTE] - > For chat allowlists and advanced Telegram behavior, see [NemoClaw Telegram bridge documentation](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html). + ## Step 6. Set Up NemoClaw Agents + + Setting up a NemoClaw agent generally takes three steps: configure the NemoClaw security policy, run the agent workflow prompt, and personalize the workflow for your own use case. + + Check out these [Example NemoClaw Agents](https://build.nvidia.com/station/nemoclaw-applications) for reference. Consider sharing your NemoClaw agent setup with the community at the [DGX Station Developer Forum](https://forums.developer.nvidia.com/c/accelerated-computing/dgx-station-gb300). --- # Phase 4: Cleanup and Uninstall - ## Step 11. Stop services + ## Step 7. Stop services - Stop any running auxiliary services (Telegram bridge, cloudflared tunnel): + Stop the cloudflared tunnel: ```bash - nemoclaw stop + nemoclaw tunnel stop ``` - Expected: - - ```text - [services] All services stopped. - ``` - - Stop the port forward (always pass **port** and **sandbox name**): + Stop the port forward: ```bash - openshell forward list - openshell forward stop 18789 my-assistant + openshell forward list # find active forwards and their ports + openshell forward stop # stop the dashboard forward (use the port shown above) ``` - Stop and **remove** the vLLM container so the name `vllm-nemotron` is free for a future run. The playbook created the container with **`--restart unless-stopped`**, so `docker stop` alone is not enough: Docker would **restart it after reboot** and the container would keep reserving GPU memory. + ## Step 8. Uninstall NemoClaw + + The NemoClaw CLI includes a built-in uninstaller.
It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and Ollama are preserved. ```bash - docker update --restart=no vllm-nemotron 2>/dev/null || true - docker stop vllm-nemotron - docker rm vllm-nemotron + nemoclaw uninstall --yes ``` - To remove the container in one step even if it is running: `docker rm -f vllm-nemotron`. - - ## Step 12. Uninstall NemoClaw - - Run the uninstaller from the cloned source directory. It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and vLLM are preserved. + To remove everything including the Ollama model: ```bash - cd ~/.nemoclaw/source - ./uninstall.sh + nemoclaw uninstall --yes --delete-models ``` **Uninstaller flags:** @@ -713,15 +412,13 @@ spec: |------|--------| | `--yes` | Skip the confirmation prompt | | `--keep-openshell` | Leave the `openshell` binary in place | - | `--delete-models` | Removes **local inference models pulled by older NemoClaw flows** (the upstream flag name still references **Ollama**). It does **not** remove Hugging Face weights used by this playbook’s **vLLM** container — delete those separately (below). | + | `--delete-models` | Also remove the Ollama models pulled by NemoClaw | - To also remove the vLLM container and cached model weights: - - ```bash - ./uninstall.sh --yes - docker rm -f vllm-nemotron 2>/dev/null || true - rm -rf ~/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/ - ``` + > [!NOTE] + > If the `nemoclaw` CLI is not available (e.g. install failed partway), use the remote uninstaller as a fallback: + > ```bash + > curl -fsSL https://raw.githubusercontent.com/NVIDIA/NemoClaw/refs/heads/main/uninstall.sh | bash -s -- --yes + > ``` The uninstaller runs 6 steps: 1. Stop NemoClaw helper services and port-forward processes @@ -732,7 +429,7 @@ spec: 6. 
Remove state directories (`~/.nemoclaw`, `~/.config/openshell`, `~/.config/nemoclaw`) and the OpenShell binary > [!NOTE] - > The source clone at `~/.nemoclaw/source` is removed as part of state cleanup in step 6. If you want to keep a local copy, move or back it up before running the uninstaller. + > If you have a local clone at `~/.nemoclaw/source` you want to keep, move or back it up before running the uninstaller — it is removed as part of state cleanup in step 6. # Useful commands @@ -742,18 +439,13 @@ spec: | `nemoclaw my-assistant status` | Show sandbox status and inference config | | `nemoclaw my-assistant logs --follow` | Stream sandbox logs in real time | | `nemoclaw list` | List all registered sandboxes | - | `nemoclaw tunnel start` | Start optional host services such as **cloudflared** (public dashboard URL when installed); does **not** start Telegram | - | `nemoclaw start` | Deprecated alias for tunnel/aux host services — **not** for Telegram | - | `nemoclaw stop` | Stop host auxiliary services started by `nemoclaw tunnel start` / `nemoclaw start` | - | `nemoclaw channels add telegram` | Store Telegram token and rebuild sandbox (host) | + | `nemoclaw tunnel start` | Start cloudflared tunnel (public URL for remote Web UI access) | + | `nemoclaw tunnel stop` | Stop the cloudflared tunnel | + | `nemoclaw my-assistant dashboard-url --quiet` | Print the full tokenized Web UI URL (includes auto-assigned port) | | `openshell term` | Open the monitoring TUI on the host | | `openshell forward list` | List active port forwards | - | `openshell forward start 18789 my-assistant --background` | Start port forwarding for Web UI | - | `openshell forward stop 18789 my-assistant` | Stop Web UI port forward | - | `docker logs -f vllm-nemotron` | Stream vLLM inference server logs | - | `docker restart vllm-nemotron` | Restart the vLLM inference server | - | `curl http://localhost:8000/v1/models` | Check vLLM API status | - | `cd ~/.nemoclaw/source && ./uninstall.sh` | 
Remove NemoClaw (preserves Docker, Node.js, vLLM image) | + | `nemoclaw uninstall --yes` | Remove NemoClaw (preserves Docker, Node.js, Ollama) | + | `nemoclaw uninstall --yes --delete-models` | Remove NemoClaw and Ollama models | @@ -765,38 +457,72 @@ spec: | Symptom | Cause | Fix | |---------|-------|-----| - | `openclaw agent --local` fails or is blocked inside the sandbox | `--local` bypasses the NemoClaw gateway and is disallowed in the OpenShell sandbox | Use gateway mode: `openclaw agent --agent main -m "hello" --session-id test` (no `--local`). | - | Onboard fails with **“K8s namespace not ready”** (or similar) with no clear reason | Often **low disk space** on `/` or Docker’s data root; image push / k3s need headroom | Run `df -h / /var/lib/docker`. Free **at least ~40 GB** (see [NemoClaw quickstart prerequisites](https://docs.nvidia.com/nemoclaw/latest/get-started/quickstart.html)); prune Docker (`docker system prune`) or expand disk, then retry onboard. | - | vLLM warns about **mixed devices** or loads on an unexpected GPU | Multiple GPUs visible; default visibility does not match intent | Pin one GPU: `--gpus '"device=0"'` and `-e CUDA_VISIBLE_DEVICES=0` with `--tensor-parallel-size 1`, or use two GPUs explicitly with `--tensor-parallel-size 2` and `-e CUDA_VISIBLE_DEVICES=0,1` (see Step 3 in instructions). | | `nemoclaw: command not found` after install | Shell PATH not updated | Run `source ~/.bashrc` (or `source ~/.zshrc` for zsh), or open a new terminal window. | - | `pip: command not found` | pip not installed on DGX Station by default | Install pip: `sudo apt install -y python3-pip`. Then use `pip3 install --break-system-packages huggingface-hub`. | - | `huggingface-cli` is deprecated | Hugging Face CLI was renamed | Use `hf download` instead of `huggingface-cli download`. | - | vLLM container won't start or crashes | GPU memory issue or wrong image | Check logs: `docker logs vllm-nemotron`. 
If CUDA OOM, reduce context: recreate the container with `--max-model-len 8192`. Ensure you are using the NVIDIA container image (`nvcr.io/nvidia/vllm:26.03-py3`), not the community `vllm/vllm-openai` image. | - | vLLM logs show `Application startup complete.` but `curl` times out | vLLM still compiling CUDA graphs after startup | Wait 1--2 minutes after `Application startup complete.` before sending requests. The first request compiles CUDA graphs and may take 30--90 seconds. | - | NemoClaw onboard fails with "endpoint validation failed" | vLLM model not warmed up or validation timeout too short | Warm up the model first: `curl -s --max-time 120 http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","messages":[{"role":"user","content":"hello"}],"max_tokens":10}'`. Then re-run with `NEMOCLAW_EXPERIMENTAL=1 NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 nemoclaw onboard`. | - | NemoClaw reports "provider 'vllm' is not available" | Missing experimental flag | Set `NEMOCLAW_EXPERIMENTAL=1` before running the installer or `nemoclaw onboard`. The vLLM provider is currently an experimental feature. | + | Installer fails with Node.js version error | Node.js version below 22.16 | Install Node.js 22.16+: `curl -fsSL https://deb.nodesource.com/setup_22.x \| sudo -E bash - && sudo apt-get install -y nodejs` then re-run the installer. | + | npm install fails with `EACCES` permission error | npm global directory not writable | `mkdir -p ~/.npm-global && npm config set prefix ~/.npm-global && export PATH=~/.npm-global/bin:$PATH` then re-run the installer. Add the `export` line to `~/.bashrc` to make it permanent. | | Docker permission denied | User not in docker group | `sudo usermod -aG docker $USER`, then log out and back in. 
| - | Gateway fails with cgroup / "Failed to start ContainerManager" errors | Docker not configured for host cgroup namespace on DGX Station | Run the cgroup fix: `sudo python3 -c "import json, os; path='/etc/docker/daemon.json'; d=json.load(open(path)) if os.path.exists(path) else {}; d['default-cgroupns-mode']='host'; json.dump(d, open(path,'w'), indent=2)"` then `sudo systemctl restart docker`. | + | Gateway fails with cgroup / "Failed to start ContainerManager" errors | Older OpenShell or Docker still using a **private** cgroup namespace for the gateway so kubelet cannot see cgroup v2 controllers | First **upgrade OpenShell** (re-run the Phase 1 `nemoclaw.sh` install so you get a build that sets host cgroupns on the gateway container). If it still fails, force Docker's default to host mode by running the [daemon.json cgroup fix](#daemonjson-cgroup-fix) below, then run `sudo systemctl restart docker`. | | Gateway fails with "port 8080 is held by container..." | Another OpenShell gateway or container is using port 8080 | Stop the conflicting container: `openshell gateway destroy -g ` or `docker stop && docker rm `, then retry `nemoclaw onboard`. | - | Sandbox cannot reach the inference server | Using `localhost` instead of `host.openshell.internal` in endpoint URL | Inside the sandbox, `localhost` refers to the sandbox container, not the host. The onboard wizard configures `host.openshell.internal` automatically. Verify from inside the sandbox: `curl -sf https://inference.local/v1/models`. If this fails, check that vLLM is reachable from the host: `curl -s http://localhost:8000/v1/models`. | - | Agent gives no response or is very slow | Normal for 120B model running locally | Nemotron 3 Super 120B can take 30--90 seconds per response. Verify inference route: `nemoclaw my-assistant status`. 
| - | vLLM API returns empty or errors on tool calls | Missing tool-call flags | Verify that `--enable-auto-tool-choice` and `--tool-call-parser qwen3_xml` are set: `docker inspect vllm-nemotron --format '{{.Config.Cmd}}'`. | + | Sandbox creation fails | Stale gateway state or DNS not propagated | Run `openshell gateway destroy && openshell gateway start`, then re-run the installer or `nemoclaw onboard`. | + | CoreDNS crash loop | Known issue on some DGX Station configurations | Re-run the NemoClaw installer (`curl -fsSL https://www.nvidia.com/nemoclaw.sh \| bash`) which includes the CoreDNS fix. If the issue persists, see [NemoClaw troubleshooting](https://docs.nvidia.com/nemoclaw/latest/reference/troubleshooting.html). | + | "No GPU detected" during onboard | DGX Station GB300 reports unified memory differently | Expected on DGX Station. The wizard still works and uses Ollama for inference. | + | Inference timeout or hangs | Ollama not running or not reachable | Check Ollama: `curl http://127.0.0.1:11434`. If not running: `sudo systemctl restart ollama`. Verify the NemoClaw auth proxy is healthy: `curl http://127.0.0.1:11435/api/tags`. If both respond, check `nemoclaw my-assistant status` for the Inference health line. | + | Agent gives no response or is very slow | First response can be slow, especially with larger models | Response time depends on model size (30B: a few seconds, 120B: 30–90 seconds). Verify inference route: `nemoclaw my-assistant status`. | | Port 18789 already in use | Another process is bound to the port | `lsof -i :18789` then `kill `. If needed, `kill -9 ` to force-terminate. | - | Web UI port forward dies or dashboard unreachable | Port forward not active | `openshell forward stop 18789 my-assistant` then `openshell forward start 18789 my-assistant --background`. Always pass **port** and **sandbox name** to `openshell forward stop`. 
| - | Web UI shows `origin not allowed` | Browser origin does not match what the gateway expects | On the **DGX Station local desktop**, open `http://127.0.0.1:18789/#token=...` (not `localhost`). Through an **SSH tunnel** on another machine, `localhost` vs `127.0.0.1` in the client browser usually both work because the check applies to how you reach the forwarded port locally. | - | Telegram does not work after install; `nemoclaw start` does nothing for Telegram | **`nemoclaw start` starts optional host services (e.g. cloudflared), not the Telegram bridge** | Configure Telegram during onboard, or on the host run `nemoclaw my-assistant channels add telegram` (and rebuild), after `policy-add` for the `telegram` preset. See [Set up Telegram bridge](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html). | - | Telegram bot receives messages but does not reply | Telegram policy not added to sandbox | Run `nemoclaw my-assistant policy-add`, type `telegram`, hit Y. Ensure the channel was added with `nemoclaw my-assistant channels add telegram` so the image includes Telegram. | - | `docker: Error response from daemon: Conflict. The container name "/vllm-nemotron" is already in use` | Previous cleanup used `docker stop` only | `docker rm -f vllm-nemotron` (or `docker update --restart=no` then `docker stop` and `docker rm`). The playbook uses `--restart unless-stopped`; stopping alone leaves a restart policy and reserved name. | + | Web UI port forward dies or dashboard unreachable | Port forward not active | `openshell forward stop 18789 my-assistant` then `openshell forward start 18789 my-assistant --background`. | + | Web UI shows `origin not allowed` | Accessing via `localhost` instead of `127.0.0.1` | Use `http://127.0.0.1:18789/#token=...` in the browser. The gateway origin check requires `127.0.0.1` exactly. 
| + | Telegram bridge does not start | Telegram channel not registered with sandbox | Run `nemoclaw channels add telegram` to register the bot token and rebuild the sandbox. Verify with `nemoclaw status`. | + | Telegram stops responding after sandbox rebuild | Telegram long-polling session stale after rebuild | Run `nemoclaw recover` to restart the gateway. If still unresponsive, run `nemoclaw channels add telegram` to re-register and rebuild. | + | Telegram bot receives messages but does not reply | Telegram network egress policy not added | Run `nemoclaw policy-add`, select `telegram`, and confirm. This is a hot-reload — no rebuild needed. | - **Model variant guidance:** + ### daemon.json cgroup fix - | Variant | Size | VRAM Required | When to Use | - |---------|------|---------------|-------------| - | `NVFP4` | ~60 GB | ~80 GB | Default for DGX Station (GB300). Fits on single GPU with room for large KV cache. | - | `FP8` | ~120 GB | ~140 GB | Higher accuracy, still fits on GB300. Add `--kv-cache-dtype fp8` to the vLLM command. | - | `BF16` | ~240 GB | ~260 GB | Highest accuracy. Fits on GB300 but leaves little room for KV cache. Reduce `--max-model-len`. | + Use this script as the fallback for the cgroup / "Failed to start ContainerManager" row above. It validates any existing `/etc/docker/daemon.json`, writes a `.bak` backup, sets `default-cgroupns-mode` to `host`, and atomically replaces the file. It exits non-zero with an error on stderr if anything fails, leaving the original `daemon.json` untouched. - For the latest known issues, see [DGX Station documentation](https://docs.nvidia.com/dgx/dgx-station-user-guide/index.html). 
+ ```bash + sudo python3 - <<'PY' + import json, os, shutil, sys, tempfile + + path = '/etc/docker/daemon.json' + try: + if os.path.exists(path): + with open(path) as f: + data = json.load(f) + if not isinstance(data, dict): + raise ValueError(f'{path} is not a JSON object') + else: + data = {} + except (json.JSONDecodeError, ValueError, OSError) as e: + print(f'error: failed to read {path}: {e}', file=sys.stderr) + sys.exit(1) + + if os.path.exists(path): + try: + shutil.copy2(path, path + '.bak') + except OSError as e: + print(f'error: failed to back up {path}: {e}', file=sys.stderr) + sys.exit(1) + + data['default-cgroupns-mode'] = 'host' + + target_dir = os.path.dirname(path) or '/' + fd, tmp = tempfile.mkstemp(prefix='daemon.json.', dir=target_dir) + try: + with os.fdopen(fd, 'w') as f: + json.dump(data, f, indent=2) + f.write('\n') + os.chmod(tmp, 0o644) + os.replace(tmp, path) + except OSError as e: + if os.path.exists(tmp): + try: + os.unlink(tmp) + except OSError: + pass + print(f'error: failed to write {path}: {e}', file=sys.stderr) + sys.exit(1) + PY + ``` @@ -814,19 +540,3 @@ spec: url: https://docs.openclaw.ai - - name: vLLM Documentation - url: https://docs.vllm.ai - - - - name: Nemotron-3-Super on Hugging Face - url: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 - - - - name: DGX Station Documentation - url: https://docs.nvidia.com/dgx/dgx-station-user-guide/index.html - - - - name: DGX Station Forum - url: https://forums.developer.nvidia.com - - diff --git a/nvidia/station-nemoclaw/endpoint-test.yaml b/nvidia/station-nemoclaw/endpoint-test.yaml index e83eccf..a4c129c 100644 --- a/nvidia/station-nemoclaw/endpoint-test.yaml +++ b/nvidia/station-nemoclaw/endpoint-test.yaml @@ -1,6 +1,6 @@ kind: Playbook metadata: - name: nemoclaw + name: station-nemoclaw displayName: Run NemoClaw with a Local LLM shortDescription: Build your first local AI assistant on DGX Station using NemoClaw in a secure sandbox, with optional Telegram. 
@@ -22,8 +22,8 @@ metadata: value: 30 MIN spec: - artifactName: nemoclaw - nvcfFunctionId: 3b0ad962-7cfe-4370-9f4d-8024298a6d13 + artifactName: station-nemoclaw + nvcfFunctionId: None attributes: showUnavailableBanner: false @@ -130,8 +130,8 @@ spec: - **Estimated time:** About 30–60 minutes for a first full pass (install, onboard, model download depending on choice and network). Optional Brave, Telegram, and cloudflared steps add time if you do them in a second session. - **Risk level:** Medium — you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts. - - **Last Updated:** 05/29/2026 - - Update to latest nemoclaw installer instructions + - **Last Updated:** 06/01/2026 + - Pin nemoclaw installer to v0.0.55, the latest stable version @@ -144,10 +144,10 @@ spec: ## Step 1. Install NemoClaw - This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the pinned NemoClaw **v0.55** release (set via `NEMOCLAW_VERSION`; v0.55 is the version the NemoClaw team currently recommends as the most stable), builds the CLI, and runs the onboard wizard to create a sandbox. + This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the pinned NemoClaw **v0.0.55** release (set via `NEMOCLAW_INSTALL_TAG`; v0.0.55 is the version the NemoClaw team currently recommends as the most stable), builds the CLI, and runs the onboard wizard to create a sandbox. ```bash - curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_VERSION=v0.55 bash + curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_INSTALL_TAG=v0.0.55 bash ``` The installation wizard walks you through setup: @@ -165,7 +165,7 @@ spec: During custom setup, the onboard wizard walks you through: 1. **Configuring inference** -- Choose to set up local inference on your DGX Station by selecting **`7) Local Ollama`**. - 2. 
**Ollama models** -- Choose desired inference model. If no model is present locally, the installer will provide options to download models to start. + 2. **Ollama models** -- Choose the desired inference model. If no model is present locally, the installer will download **`qwen3.6:35b`** automatically. 3. **Sandbox name** -- Pick a name (e.g. my-assistant). Each sandbox requires a unique name. 4. **Apply this configuration** -- Enter `Y` to confirm setting up local inference. 5. **Enable Brave Web Search** -- Optional. If you enable it, paste a [Brave Search API](https://brave.com/search/api/) key when prompted. @@ -341,7 +341,7 @@ spec: The cloudflared tunnel provides a **public URL for the Web UI dashboard** — it is not related to Telegram messaging. - Install cloudflared (DGX Station is arm64): + Install cloudflared (DGX Station is aarch64): ```bash curl -L --output cloudflared.deb \ @@ -371,7 +371,7 @@ spec: Set up NemoClaw Agents in general require three steps: Configure NemoClaw security policy, Run Agent Workflow Prompt, Personalize the Workflow for your own use case. - Checkout these [Example NemoClaw Agents](https://build.nvidia.com/station/nemoclaw-applications) for reference. Consider sharing your NemoClaw agent setup with the community at [DGX Station Developer Forum](https://forums.developer.nvidia.com/c/accelerated-computing/dgx-station-gb300) + Check out these [Example NemoClaw Agents](https://build.nvidia.com/spark/nemoclaw-applications) for reference.
--- diff --git a/nvidia/station-vllm/endpoint-test.yaml b/nvidia/station-vllm/endpoint-test.yaml index d018424..e22f1c8 100644 --- a/nvidia/station-vllm/endpoint-test.yaml +++ b/nvidia/station-vllm/endpoint-test.yaml @@ -68,17 +68,14 @@ spec: | **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) | | **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) | | **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) | - | **Kimi-K2.5 (1T)** | NVFP4 | ✅ | [`nvidia/Kimi-K2.5-NVFP4`](https://huggingface.co/nvidia/Kimi-K2.5-NVFP4) | - | **DeepSeek-V4-Flash** | NVFP4 | ✅ | [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | # Time & risk * **Duration:** 30 minutes (longer on first run due to model download) * **Risks:** Model download requires HuggingFace authentication * **Rollback:** Stop and remove the container to restore state - * **Last Updated:** 05/29/2026 + * **Last Updated:** 05/28/2026 * Update models - * Add base configuration example, per-setting explanations, and DeepSeek-V4-Flash recipe @@ -125,23 +122,11 @@ spec: docker pull vllm/vllm-openai:stepfun37 ``` - For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, pull the **26.03** image, which includes the `--cpu-offload-params` support used below: - ```bash - docker pull nvcr.io/nvidia/vllm:26.03-py3 - ``` - - For DeepSeek-V4-Flash, pull the stable DeepSeek-V4 release container. Use the **cu130** build on DGX Station (Blackwell): - ```bash - docker pull vllm/vllm-openai:v0.20.0-cu130 - ``` - # Step 4. Start vLLM server Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`. 
- ## Base configuration (most models) - - This is the recommended starting point for any model that fits entirely in VRAM on the GB300. The Qwen3-235B-A22B-NVFP4 model, for example, runs directly with this configuration. + For the Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300. ```bash docker run -d \ @@ -159,12 +144,6 @@ spec: --gpu-memory-utilization 0.9 ``` - Settings used: - - `--max-model-len` — maximum context length (prompt + output) per request. Larger values reserve more GPU memory for the KV cache; size it to your workload. - - `--gpu-memory-utilization 0.9` — fraction of GPU memory vLLM may use for weights and KV cache. `0.9` leaves headroom for other processes; raise toward `0.95` to fit more KV cache if the GPU is dedicated. - - ## Step-3.7-Flash (FP8 / NVFP4) - For Step-3.7-Flash models, run with the custom VLLM container. The FP8 and the NVFP4 versions fit entirely in VRAM on the GB300. ```bash @@ -187,94 +166,6 @@ spec: --kv-cache-dtype fp8 ``` - Settings used (in addition to the base configuration): - - `--trust-remote-code` — allows the model's custom modeling code (shipped in its repo) to load. Required for Step-3.7. - - `--reasoning-parser step3p5` — parses the model's reasoning/thinking tokens into the dedicated `reasoning_content` response field. - - `--enable-auto-tool-choice` — lets the model decide when to call a tool, enabling OpenAI-compatible function calling. - - `--tool-call-parser step3p5` — parses the model's tool-call output into structured `tool_calls`. Pairs with `--enable-auto-tool-choice`. - - `--kv-cache-dtype fp8` — stores the KV cache in FP8, roughly halving KV-cache memory versus 16-bit and allowing more concurrent/longer sequences. - - ## Kimi-K2.5 NVFP4 (1T) — CPU offloading - - For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, run with the **26.03** NGC container.
This model does not fit entirely in VRAM, so the MoE expert weights are offloaded to CPU DRAM with `--cpu-offload-gb 375 --cpu-offload-params experts`. Ensure the system has enough free DRAM to hold the offloaded weights. - - ```bash - docker run -d \ - --name vllm-server \ - --gpus all \ - --ipc host \ - --ulimit memlock=-1 \ - --ulimit stack=67108864 \ - -p 8000:8000 \ - -e HF_TOKEN="$HF_TOKEN" \ - -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \ - nvcr.io/nvidia/vllm:26.03-py3 \ - vllm serve nvidia/Kimi-K2.5-NVFP4 \ - --host 0.0.0.0 \ - --port 8000 \ - --dtype auto \ - --kv-cache-dtype auto \ - --gpu-memory-utilization 0.95 \ - --served-model-name nvidia/Kimi-K2.5-NVFP4 \ - --tensor-parallel-size 1 \ - --no-enable-prefix-caching \ - --trust-remote-code \ - --max-model-len 40960 \ - --max-num-seqs 1 \ - --max-num-batched-tokens 32768 \ - --cpu-offload-gb 375 \ - --cpu-offload-params experts - ``` - - Settings used (in addition to the base configuration): - - `--cpu-offload-gb 375` — amount of CPU DRAM (in GiB) vLLM may use to hold weights that don't fit in VRAM. Must be large enough for the offloaded experts; the system needs at least this much free DRAM. - - `--cpu-offload-params experts` — offloads only the MoE expert weights (the bulk of a large MoE model) to DRAM, keeping attention and other hot weights in VRAM. - - `--tensor-parallel-size 1` — single GPU; the GB300 serves the whole model. - - `--max-num-seqs 1` / `--max-num-batched-tokens 32768` — caps concurrency to one sequence and the batch token budget. With expert weights paged from DRAM, throughput is offload-bound, so a low concurrency keeps latency predictable. - - `--no-enable-prefix-caching` — disables prefix-cache reuse. Offloaded experts make the memory budget tight, so the cache is turned off here rather than spent on KV reuse. - - `--kv-cache-dtype auto` / `--dtype auto` — let vLLM pick the KV-cache and compute dtypes from the model's quantization (NVFP4). 
- - ## DeepSeek-V4-Flash — MTP + agentic - - For DeepSeek-V4-Flash, run with the stable **v0.20.0-cu130** container. This recipe targets agentic workloads and enables Multi-Token Prediction (MTP) speculative decoding. On a single GB300 (TP1) the MoE expert-parallel path is used; the `deep_gemm_mega_moe` backend from some internal recipes is not needed at TP1 and is omitted here. - - ```bash - docker run -d \ - --name vllm-server \ - --gpus all \ - --ipc host \ - --ulimit memlock=-1 \ - --ulimit stack=67108864 \ - -p 8000:8000 \ - -e HF_TOKEN="$HF_TOKEN" \ - -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \ - vllm/vllm-openai:v0.20.0-cu130 \ - deepseek-ai/DeepSeek-V4-Flash \ - --enable-expert-parallel \ - --kv-cache-dtype fp8 \ - --trust-remote-code \ - --block-size 256 \ - --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \ - --attention_config.use_fp4_indexer_cache True \ - --tokenizer-mode deepseek_v4 \ - --tool-call-parser deepseek_v4 \ - --enable-auto-tool-choice \ - --reasoning-parser deepseek_v4 \ - --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \ - --max-model-len 32768 - ``` - - Settings used (in addition to the base configuration): - - `--enable-expert-parallel` — shards the MoE experts across the available GPU(s) using expert parallelism, the recommended MoE execution path for DeepSeek-V4. - - `--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'` — enables **MTP (Multi-Token Prediction)** speculative decoding: the model proposes 3 tokens per step that are verified in a single forward pass, cutting latency for accepted tokens. - - `--kv-cache-dtype fp8` — FP8 KV cache to fit more concurrent/longer sequences. - - `--block-size 256` — KV-cache page size in tokens. DeepSeek-V4 uses multiple KV-cache groups; `256` matches the recipe validated on Station. - - `--attention_config.use_fp4_indexer_cache True` — enables the FP4 indexer cache used by DeepSeek-V4's attention. 
(Drop this flag on platforms without native FP4, e.g. Hopper.) - - `--tokenizer-mode deepseek_v4` / `--tool-call-parser deepseek_v4` / `--reasoning-parser deepseek_v4` — DeepSeek-V4-specific tokenizer, tool-call, and reasoning parsers. - - `--enable-auto-tool-choice` — OpenAI-compatible function calling for agentic use. - - `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` — uses full + piecewise CUDA graph capture and enables all custom ops for lower per-step overhead. - - **Prefix caching is left enabled (the vLLM default).** For agentic workloads with large shared prefixes (e.g. a 32k system/context prefix) at low batch sizes (~BS 3–4), prefix caching gives a significant throughput boost by reusing the cached prefix across requests. Some internal recipes carry `--no-enable-prefix-caching`, but that was inherited from random-data benchmarking and is not recommended for agentic use here. - Check the server logs for startup progress: ```bash