mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git synced 2026-06-18 04:22:21 +00:00

NemoClaw with Nemotron-3-Super and vLLM on DGX Station

Install NemoClaw on DGX Station with local vLLM inference and Telegram bot integration

Overview

Basic idea

NVIDIA NemoClaw is an open-source reference stack that makes it simpler and safer to run always-on OpenClaw assistants. It installs the NVIDIA OpenShell runtime -- an environment designed for executing agents with additional security -- and open-source models like NVIDIA Nemotron. A single installer command handles Node.js, OpenShell, and the NemoClaw CLI, then walks you through an onboard wizard to create a sandboxed agent on your DGX Station using vLLM with Nemotron 3 Super.

By the end of this playbook you will have a working AI agent inside an OpenShell sandbox, accessible via a web dashboard and a Telegram bot, with inference routed to a local Nemotron 3 Super 120B model served by vLLM on your DGX Station -- all without exposing your host filesystem or network to the agent.

What you'll accomplish

Configure Docker and the NVIDIA container runtime for OpenShell on DGX Station
Pull Nemotron 3 Super 120B (NVFP4) from Hugging Face and serve it with vLLM
Install NemoClaw with a single command (handles Node.js, OpenShell, and the CLI)
Run the onboard wizard to create a sandbox and configure local vLLM inference
Chat with the agent via the CLI, TUI, and web UI
Set up a Telegram bot that forwards messages to your sandboxed agent

Notice and disclaimers

The following sections describe safety, risks, and your responsibilities when running this demo.

Quick start safety check

Use only a clean environment. Run this demo on a fresh device or VM with no personal data, confidential information, or sensitive credentials. Keep it isolated like a sandbox.

By installing this demo, you accept responsibility for all third-party components, including reviewing their licenses, terms, and security posture. Read and accept before you install or use.

What you're getting

This experience is provided "AS IS" for demonstration purposes only -- no warranties, no guarantees. This is a demo, not a production-ready solution. You will need to implement appropriate security controls for your environment and use case.

Key risks with AI agents

Data leakage -- Any materials the agent accesses could be exposed, leaked, or stolen.
Malicious code execution -- The agent or its connected tools could expose your system to malicious code or cyber-attacks.
Unintended actions -- The agent might modify or delete files, send messages, or access services without explicit approval.
Prompt injection and manipulation -- External inputs or connected content could hijack the agent's behavior in unexpected ways.

Participant acknowledgement

By participating in this demo, you acknowledge that you are solely responsible for your configuration and for any data, accounts, and tools you connect. To the maximum extent permitted by law, NVIDIA is not responsible for any loss of data, device damage, security incidents, or other harm arising from your configuration or use of NemoClaw demo materials, including OpenClaw or any connected tools or services.

Isolation layers (OpenShell)

Layer	What it protects	When it applies
Filesystem	Prevents reads/writes outside allowed paths.	Locked at sandbox creation.
Network	Blocks unauthorized outbound connections.	Hot-reloadable at runtime.
Process	Blocks privilege escalation and dangerous syscalls.	Locked at sandbox creation.
Inference	Reroutes model API calls to controlled backends.	Hot-reloadable at runtime.

What to know before starting

Basic use of the Linux terminal and SSH
Familiarity with Docker (permissions, docker run)
Awareness of the security and risk sections above

Prerequisites

Hardware and access:

A DGX Station (GB300) with keyboard and monitor, or SSH access
A Telegram bot token from @BotFather (create one with /newbot) -- optional, for Phase 3

Software:

Fresh install of DGX OS with latest updates

Verify your system before starting:

head -n 2 /etc/os-release
nvidia-smi
docker info --format '{{.ServerVersion}}'
df -h / /var/lib/docker 2>/dev/null | head -20

Expected: Ubuntu 24.04, NVIDIA GB300 GPU(s), Docker 28.x+, and enough free disk for Docker layers, the NemoClaw sandbox image, and Hugging Face cache (treat ~40 GB free on the Docker data filesystem as a practical minimum; very low free space can surface as cryptic onboard errors such as “K8s namespace not ready”).

Have ready before you begin

Item	Where to get it
Telegram bot token (optional)	@BotFather on Telegram -- create with `/newbot`

Ancillary files

All required assets are handled by the NemoClaw installer. No manual cloning is needed.

Time and risk

Estimated time: 20--30 minutes (with model already downloaded). First-time model download adds ~10--20 minutes depending on network speed.
Risk level: Medium -- you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts.
Last Updated: 04/27/2026
- First publication for DGX Station with vLLM

Instructions

Phase 1: Prerequisites

These steps prepare a fresh DGX Station for NemoClaw. If Docker, the NVIDIA runtime, and vLLM are already configured, skip to Phase 2.

Important

Disk space: NemoClaw’s onboard flow pulls a multi-gigabyte sandbox image and runs Docker, k3s, and the gateway together. If root or Docker’s data disk is nearly full (for example only a few gigabytes free), onboarding can fail with generic errors such as “K8s namespace not ready” with no clear hint about storage. Before you start, check free space: df -h / /var/lib/docker. NVIDIA recommends at least 40 GB free on the filesystem that holds Docker layers (often / or /var/lib/docker); treat under ~15 GB as high risk for first-time onboard failures.
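The thresholds above can be turned into a quick preflight check. The following is a minimal Python sketch, not part of NemoClaw: the function names and the three labels are hypothetical, and the 40 GB and 15 GB cut-offs are simply the figures quoted in this section (interpreted here as decimal gigabytes, which is an assumption).

```python
import shutil

GB = 1000 ** 3  # assumption: decimal gigabytes; df -h reports binary units


def classify_free_space(free_bytes: int) -> str:
    """Map free bytes to this playbook's guidance (labels are illustrative)."""
    if free_bytes >= 40 * GB:
        return "ok"
    if free_bytes >= 15 * GB:
        return "low"
    return "high-risk"


def check_paths(paths=("/", "/var/lib/docker")) -> dict:
    """Return a label per path; paths that cannot be inspected are skipped."""
    report = {}
    for path in paths:
        try:
            usage = shutil.disk_usage(path)
        except OSError:
            continue
        report[path] = classify_free_space(usage.free)
    return report
```

Calling check_paths() on the host returns a dictionary such as a label for / and one for /var/lib/docker; anything other than ok is a reason to free space before onboarding.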

Step 1. Configure Docker and the NVIDIA container runtime

OpenShell's gateway runs k3s inside Docker. On DGX Station (Ubuntu 24.04, cgroup v2), Docker must be configured with the NVIDIA runtime and host cgroup namespace mode.

Configure the NVIDIA container runtime for Docker:

sudo nvidia-ctk runtime configure --runtime=docker

Expected:

INFO Loading config from /etc/docker/daemon.json
INFO Wrote updated config to /etc/docker/daemon.json
INFO It is recommended that docker daemon be restarted.

Set the cgroup namespace mode required by OpenShell on DGX Station:

sudo python3 -c "
import json, os
path = '/etc/docker/daemon.json'
d = json.load(open(path)) if os.path.exists(path) else {}
d['default-cgroupns-mode'] = 'host'
json.dump(d, open(path, 'w'), indent=2)
"

Restart Docker:

sudo systemctl restart docker
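Before moving on, it can help to confirm that both changes actually landed in /etc/docker/daemon.json. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that nvidia-ctk registers the runtime under a top-level runtimes key with an nvidia entry.

```python
import json


def daemon_config_problems(config: dict) -> list:
    """List the settings this playbook expects that are absent from config."""
    problems = []
    # Assumption: nvidia-ctk writes the runtime under runtimes -> nvidia.
    if "nvidia" not in config.get("runtimes", {}):
        problems.append("runtimes.nvidia is missing")
    if config.get("default-cgroupns-mode") != "host":
        problems.append("default-cgroupns-mode is not 'host'")
    return problems


def load_daemon_config(path: str = "/etc/docker/daemon.json") -> dict:
    """Read daemon.json; a missing file is treated as an empty config."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}
```

An empty list from daemon_config_problems(load_daemon_config()) means both settings are present; otherwise each entry names what to fix.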

Verify the NVIDIA runtime works:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Expected:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
|   0  NVIDIA GB300                   On  |   00000009:06:00.0 Off |                    0 |
| N/A   46C    P0            215W / 1300W |   18661MiB / 256703MiB |      0%      Default |
+-----------------------------------------+------------------------+----------------------+

If you get a permission denied error on docker, add your user to the Docker group and activate the new group in your current session:

sudo usermod -aG docker $USER
newgrp docker

This applies the group change immediately. Alternatively, you can log out and back in instead of running newgrp docker.

Note

DGX Station uses cgroup v2. OpenShell's gateway embeds k3s inside Docker and needs host cgroup namespace access. Without default-cgroupns-mode: host, the gateway can fail with "Failed to start ContainerManager" errors.

Step 2. Pull the Nemotron-3-Super model

Install pip and the Hugging Face CLI (if not already installed):

sudo apt install -y python3-pip
pip3 install --break-system-packages huggingface-hub

Download Nemotron 3 Super 120B in NVFP4 quantization (~60 GB; may take 10--20 minutes depending on network speed):

hf download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

Expected (on a fresh download; cached downloads complete instantly):

Fetching 36 files: 100%|██████████| 36/36 [15:42<00:00, 26.18s/it]
/home/nvidia/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/snapshots/0d6fa3ecad422a...

Verify the download completed:

ls ~/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/

Expected:

blobs  refs  snapshots

Note

The NVFP4 quantization is chosen because it fits entirely in one GB300 GPU’s 256 GB HBM3e with room for KV cache. On a two-GPU station you can still use NVFP4 with --tensor-parallel-size 1 and a single visible GPU, or shard with --tensor-parallel-size 2. For other quantization variants, see Troubleshooting.
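The sizing behind this note is simple arithmetic. The sketch below is a rough estimate under stated assumptions: it treats every one of the 120B parameters as exactly 4 bits, and it reuses two sample figures that appear in this playbook's example output (256703 MiB of GPU memory from nvidia-smi and 69.39 GiB reported by vLLM after loading). The function names are hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def weight_bytes(num_params: float, bits_per_param: float) -> float:
    """Idealized weight footprint: parameters times bits, converted to bytes."""
    return num_params * bits_per_param / 8


def headroom_gib(total_mib: float, loaded_gib: float) -> float:
    """Memory left after loading, in GiB, ignoring vLLM's own utilization cap."""
    return total_mib * MIB / GIB - loaded_gib


ideal = weight_bytes(120e9, 4)      # 6.0e10 bytes, i.e. about 60 GB
left = headroom_gib(256703, 69.39)  # sample figures from this playbook's output
```

The idealized 60 GB matches the download size quoted in this step. The larger figure vLLM logs after loading is likely due to quantization scale factors and layers kept at higher precision, which this estimate deliberately ignores.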

Step 3. Start the vLLM inference server

Launch vLLM using the NVIDIA-optimized container image.

Single GPU (default on one-GPU systems, or pin to one GPU on multi-GPU stations): vLLM can emit mixed device warnings if several GPUs are visible but the model is only meant to use one. Pinning avoids accidentally placing weights on an unexpected device.

docker run -d --name vllm-nemotron \
  --runtime nvidia --gpus '"device=0"' \
  -e CUDA_VISIBLE_DEVICES=0 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --restart unless-stopped \
  nvcr.io/nvidia/vllm:26.03-py3 \
  python3 -m vllm.entrypoints.openai.api_server \
    --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --trust-remote-code \
    --max-model-len 32768 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --reasoning-parser nemotron_v3

Two GPUs (tensor parallel): If your DGX Station has two Blackwell GPUs and you want Nemotron sharded across both, use both devices and set tensor parallel size to 2 (VRAM is summed across the GPUs):

docker run -d --name vllm-nemotron \
  --runtime nvidia --gpus all \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --restart unless-stopped \
  nvcr.io/nvidia/vllm:26.03-py3 \
  python3 -m vllm.entrypoints.openai.api_server \
    --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --trust-remote-code \
    --max-model-len 32768 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --reasoning-parser nemotron_v3

Pick a GPU index by name (optional one-liner): To print the device index of the first GPU whose name contains GB300 (adjust the pattern if your nvidia-smi name string differs), run on the host:

nvidia-smi --query-gpu=index,name --format=csv,noheader | awk -F', ' '/GB300/ { gsub(/^ +/,"",$1); print $1; exit }'

Use that index in Docker as --gpus '"device=N"' (replace N with the printed index).
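The same lookup can be done in Python if you prefer to script it. This is a sketch under the same assumption as the one-liner (the GPU name contains GB300); the function names are hypothetical, and only the nvidia-smi query flags shown above are used.

```python
import subprocess


def first_gpu_index(csv_text: str, pattern: str = "GB300"):
    """Return the index of the first GPU whose name contains pattern, else None."""
    for line in csv_text.splitlines():
        index, sep, name = line.partition(",")
        if sep and pattern in name:
            return int(index.strip())
    return None


def query_gpus() -> str:
    """Run the same nvidia-smi query as the shell one-liner and return its output."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

On the host, first_gpu_index(query_gpus()) gives the value to substitute for N; a result of None means no GPU name matched and the pattern needs adjusting.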

Note

--tool-call-parser qwen3_xml: Nemotron’s tool-call wire format is exposed through vLLM’s Qwen3-compatible XML tool parser — the name refers to the parser implementation, not the base model. This pairing is what vLLM expects for correct function/tool calling with this checkpoint.

The first startup loads ~70 GB of weights into GPU memory. Watch the logs until you see the model is ready:

docker logs -f vllm-nemotron

Wait until you see the following in the logs (typically 3--5 minutes):

INFO Loading weights took 55.47 seconds
INFO Model loading took 69.39 GiB memory and 71.31 seconds
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Then verify the API is responding:

curl -s http://localhost:8000/v1/models

Expected:

{"object":"list","data":[{"id":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","object":"model",...}]}

Send a test request to warm up the model before proceeding to Step 4. The first inference request compiles CUDA graphs and can take 30--90 seconds:

curl -s --max-time 120 http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","messages":[{"role":"user","content":"Say hello."}],"max_tokens":10}'

Expected (the first request may take 30--90 seconds; subsequent requests are much faster):

{"id":"chatcmpl-...","object":"chat.completion","model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","choices":[{"index":0,"message":{"role":"assistant","content":"..."},"finish_reason":"length"}],...}

Important

Warm up the model before running the NemoClaw installer. The onboard wizard validates the vLLM endpoint with a short timeout. If the model has not served at least one request, this validation will time out and the install will fail.
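If you want to script the warm-up and see how long it took, the curl request above translates directly to the Python standard library. This is a minimal sketch, assuming vLLM is listening on localhost:8000 as configured in this step; the function names are hypothetical.

```python
import json
import time
import urllib.request

MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"


def build_warmup_payload(model: str = MODEL, max_tokens: int = 10) -> dict:
    """Build the same request body as the curl warm-up command."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": max_tokens,
    }


def warm_up(base_url: str = "http://localhost:8000", timeout: float = 120.0) -> float:
    """POST one chat completion and return the elapsed time in seconds."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_warmup_payload()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    started = time.monotonic()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        json.load(response)  # raises if the body is not valid JSON
    return time.monotonic() - started
```

Running warm_up() twice and comparing the two durations is a simple way to confirm that the slow first request is behind you before starting the installer.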

Important

Always start vLLM via the Docker container -- do not run vllm serve directly on the host. The NVIDIA container image (nvcr.io/nvidia/vllm:26.03-py3) includes optimized kernels for the GB300's Blackwell architecture that are not available in the pip-installed version.

Note

Key flags explained:

--tensor-parallel-size -- 1 for a single visible GPU; 2 when you expose two GPUs for tensor-parallel sharding (see the two commands earlier in this step).

--trust-remote-code -- required for the Mamba2-Transformer hybrid architecture

--max-model-len 32768 -- maximum context length (increase up to 1M if VRAM allows)

--enable-auto-tool-choice --tool-call-parser qwen3_xml -- enables function/tool calling for the agent (see the note above on the parser name).

--reasoning-parser nemotron_v3 -- separates chain-of-thought reasoning from the response so the TUI/Web UI can display them cleanly
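To see how the flags fit together, the two docker run commands in this step can be generated from a single function. This is an illustrative sketch, not an official launcher: the function name is hypothetical, and it only reproduces the one-GPU and two-GPU variants shown in this playbook.

```python
import os

IMAGE = "nvcr.io/nvidia/vllm:26.03-py3"
MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"


def build_vllm_command(tensor_parallel_size: int = 1, max_model_len: int = 32768) -> list:
    """Return the docker run argument list for the one- or two-GPU variant."""
    if tensor_parallel_size == 1:
        gpus, visible = '"device=0"', "0"  # pin to a single GPU
    elif tensor_parallel_size == 2:
        gpus, visible = "all", "0,1"       # shard across both GPUs
    else:
        raise ValueError("this playbook only covers 1 or 2 GPUs")
    cache = os.path.expanduser("~/.cache/huggingface")
    return [
        "docker", "run", "-d", "--name", "vllm-nemotron",
        "--runtime", "nvidia", "--gpus", gpus,
        "-e", f"CUDA_VISIBLE_DEVICES={visible}",
        "-v", f"{cache}:/root/.cache/huggingface",
        "-p", "8000:8000",
        "--restart", "unless-stopped",
        IMAGE,
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--model", MODEL,
        "--host", "0.0.0.0",
        "--port", "8000",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--trust-remote-code",
        "--max-model-len", str(max_model_len),
        "--enable-auto-tool-choice",
        "--tool-call-parser", "qwen3_xml",
        "--reasoning-parser", "nemotron_v3",
    ]
```

Passing the list to subprocess.run starts the container without going through a shell, which is why the inner double quotes around device=0 are kept as part of the argument itself. Lowering max_model_len (for example to 8192) is the adjustment the Troubleshooting section suggests for CUDA out-of-memory errors.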

Phase 2: Install and Run NemoClaw

Step 4. Install NemoClaw

The installer script installs Node.js (if needed), OpenShell, and the NemoClaw CLI, then runs onboarding to create a sandbox. The vLLM provider requires the experimental flag and an extended inference timeout (the default 15-second validation timeout is too short for a 120B model).

Recommended: non-interactive install (copy-paste friendly)

This path is best for SSH sessions, automation, and documentation — no arrow-key TUI in the terminal.

NEMOCLAW_EXPERIMENTAL=1 \
NEMOCLAW_NON_INTERACTIVE=1 \
NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \
NEMOCLAW_SANDBOX_NAME=my-assistant \
NEMOCLAW_PROVIDER=vllm \
NEMOCLAW_MODEL="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4" \
NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 \
bash -c "$(curl -fsSL https://www.nvidia.com/nemoclaw.sh)"

Optional: include Telegram in the first onboard without typing the token over SSH — export credentials on the host before running the installer (same variables the NemoClaw Telegram bridge guide documents):

export TELEGRAM_BOT_TOKEN='<paste-token-here>'
# Optional DM allowlist (comma-separated Telegram user IDs):
# export TELEGRAM_ALLOWED_IDS='123456789,987654321'

Use Telegram Desktop or web.telegram.org on a laptop to copy the token from @BotFather and paste into your SSH session (or into a small env file you source). Typing a 46+ character token on a phone keyboard into a remote shell is error-prone.

To persist TELEGRAM_BOT_TOKEN across reboots, keep it in a root-owned or user-only file and source it from your shell profile (example — adjust path and permissions):

mkdir -p ~/.nemoclaw   # the directory may not exist before the first install
install -m 600 /dev/null ~/.nemoclaw/telegram.env
nano ~/.nemoclaw/telegram.env   # add: export TELEGRAM_BOT_TOKEN='...'
grep -q 'nemoclaw/telegram.env' ~/.bashrc || echo 'source ~/.nemoclaw/telegram.env 2>/dev/null' >> ~/.bashrc

NemoClaw also stores messaging credentials in its credential store when you onboard or run nemoclaw … channels add telegram; the file above is mainly for re-running scripts or non-interactive flows that read the environment.
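If you generate the env file from a script rather than an editor, the important details are the owner-only permissions and correct shell quoting of the token. The following is a minimal sketch using only the Python standard library; the function name is hypothetical and the file path is whatever you choose to source from your shell profile.

```python
import os
import shlex


def write_token_env(path: str, token: str) -> None:
    """Create or replace an env file that only its owner can read."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, mode=0o700, exist_ok=True)
    line = f"export TELEGRAM_BOT_TOKEN={shlex.quote(token)}\n"
    # O_TRUNC replaces old contents; the 0o600 mode applies only on creation.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(line)
    os.chmod(path, 0o600)  # tighten permissions if the file already existed
```

Opening the file with mode 0o600 from the start avoids a window in which the token is briefly readable by other users, which can happen if you write first and chmod afterwards.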

Alternative: interactive installer

If you prefer the wizard:

NEMOCLAW_EXPERIMENTAL=1 \
NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 \
bash -c "$(curl -fsSL https://www.nvidia.com/nemoclaw.sh)"

The wizard asks six high-level prompts (third-party notice, inference provider, Brave Search, messaging channels, sandbox name, policy presets), listed below. Separately, the installer log prints eight numbered onboard sub-phases, [1/8] through [8/8] (preflight, gateway, inference detection, inference route, messaging channels, sandbox creation, OpenClaw inside sandbox, policy presets). The two numberings differ on purpose: the [n/8] lines are internal progress steps, while the list below is what you answer in the TUI.

Third-party software notice -- Type yes to accept and continue.
Inference provider -- The wizard detects vLLM running locally. Select option 8 (Local vLLM [experimental] — running).
Brave Web Search -- Optional. Type skip if you don't have a Brave Search API key.
Messaging channels -- Optional. Press Enter to skip, or toggle Telegram/Discord/Slack if desired (this is the step that corresponds to onboard phase [5/8] in the log).
Sandbox name -- Pick a name (e.g. my-assistant). Names must be lowercase alphanumeric with hyphens only.
Policy presets -- Use arrow keys to toggle presets. pypi and npm are selected by default. Press Enter to confirm.
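For non-interactive installs it is convenient to validate the sandbox name before passing it in NEMOCLAW_SANDBOX_NAME. The sketch below encodes only the rule stated in the list above (lowercase letters, digits, and hyphens); NemoClaw may enforce additional constraints such as length limits, so treat this as a first-pass check. The function name is hypothetical.

```python
import re

# Only the rule stated in this playbook: lowercase alphanumerics and hyphens.
_NAME_RE = re.compile(r"[a-z0-9-]+")


def is_valid_sandbox_name(name: str) -> bool:
    """Return True if name uses only lowercase letters, digits, and hyphens."""
    return _NAME_RE.fullmatch(name) is not None
```

For example, my-assistant passes, while My_Assistant fails on both the uppercase letters and the underscore.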

The install takes approximately 3 minutes. Example milestones in the output (wording may vary slightly by release):

[1/3] Node.js
  Node.js found: v22.22.2

[2/3] NemoClaw CLI
  Installing NemoClaw from GitHub...
  Verified: nemoclaw is available at /home/nvidia/.local/bin/nemoclaw

[3/3] Onboarding
  [1/8] Preflight checks
    ✓ Docker is running
    ✓ NVIDIA GPU detected: 2 GPU(s), 256703 MB VRAM   # example on a two-GPU system
  [2/8] Starting OpenShell gateway
    ✓ Gateway is healthy
  [3/8] Configuring inference (NIM)
    ✓ Using existing vLLM on localhost:8000
    Detected model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
  [4/8] Setting up inference provider
    ✓ Inference route set: vllm-local / nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
  [5/8] Messaging channels
    (example) Telegram disabled — skipped
    # or: Telegram enabled; token stored in credential store
  [6/8] Creating sandbox
    ✓ Sandbox 'my-assistant' created
  [7/8] Setting up OpenClaw inside sandbox
    ✓ OpenClaw gateway launched inside sandbox
  [8/8] Policy presets
    Applied preset: pypi
    Applied preset: npm
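Because the log mixes the outer [n/3] installer steps with the inner [n/8] onboard phases, a small parser can help when you capture installer output and want to see where a run stopped. This is a sketch that assumes the bracketed n/total prefix shown in the example above; the wording of each line may vary by release, and the function names are hypothetical.

```python
import re

# Matches lines such as "  [2/8] Starting OpenShell gateway".
_STEP_RE = re.compile(r"^\s*\[(\d+)/(\d+)\]\s+(.*)$")


def parse_steps(log_text: str) -> list:
    """Return (current, total, title) for every bracketed progress line."""
    steps = []
    for line in log_text.splitlines():
        match = _STEP_RE.match(line)
        if match:
            current, total, title = match.groups()
            steps.append((int(current), int(total), title.strip()))
    return steps


def last_phase(log_text: str, total: int = 8):
    """Return the last step seen for the numbering with the given total, or None."""
    phases = [step for step in parse_steps(log_text) if step[1] == total]
    return phases[-1] if phases else None
```

If an install fails, last_phase(captured_log) points at the onboard phase that was running, which narrows down which Troubleshooting row applies.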

When complete you will see:

──────────────────────────────────────────────────
Sandbox      my-assistant (Landlock + seccomp + netns)
Model        nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (Local vLLM)
──────────────────────────────────────────────────
Run:         nemoclaw my-assistant connect
Status:      nemoclaw my-assistant status
Logs:        nemoclaw my-assistant logs --follow

OpenClaw UI (tokenized URL; treat it like a password)
http://127.0.0.1:18789/#token=<long-token-here>
──────────────────────────────────────────────────

Important

Save the tokenized Web UI URL printed at the end -- you will need it in Step 8. It looks like: http://127.0.0.1:18789/#token=<long-token-here>

Important

NEMOCLAW_EXPERIMENTAL=1 is required for the vLLM provider. Without it, the installer will report "Requested provider 'vllm' is not available in this environment."

Important

NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 extends the validation timeout from the default 15 seconds to 300 seconds. Without it, endpoint validation can fail on a 120B model even if you warmed it up in Step 3, because the installer sends its own test prompt, which may be slower than the warm-up request.

Note

If nemoclaw is not found after install, run source ~/.bashrc to reload your shell path.

Step 5. Connect to the sandbox and verify inference

Connect to the sandbox:

nemoclaw my-assistant connect

Expected:

sandbox@my-assistant:~$

You are now inside the sandboxed environment. Verify that the inference route is working:

curl -sf https://inference.local/v1/models

Expected:

{"object":"list","data":[{"id":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","object":"model",...}]}

Step 6. Talk to the agent (CLI)

Still inside the sandbox, send a test message through the OpenClaw gateway (the default path). The --local flag is intentionally blocked inside the NemoClaw OpenShell sandbox — it would bypass gateway controls — so the command you may see in generic OpenClaw quickstarts will fail here.

openclaw agent --agent main -m "hello" --session-id test

Expected (the agent will think, then respond -- first response may take 30--90 seconds): streaming or printed assistant text ending with a normal reply.

If you see a response from the agent, inference is working end-to-end.

Step 7. Interactive TUI

Launch the terminal UI for an interactive chat session:

openclaw tui

Press Ctrl+C to exit the TUI.

Step 8. Exit the sandbox and access the Web UI

Exit the sandbox to return to the host:

exit

If accessing the Web UI directly on the DGX Station (keyboard and monitor attached), open a browser and navigate to the tokenized URL from Step 4. Prefer 127.0.0.1 in the URL bar (not localhost) so it matches strict gateway origin checks:

http://127.0.0.1:18789/#token=<long-token-here>

If accessing the Web UI from a remote machine, you need to set up port forwarding.

First, find your DGX Station's IP address. On the Station, run:

hostname -I | awk '{print $1}'

Start the port forward on the DGX Station host:

openshell forward start 18789 my-assistant --background

Expected:

Forwarding 127.0.0.1:18789 -> my-assistant:18789 (background)

If the forward was already started during onboarding, you will see:

Error: Port 18789 is already forwarded to sandbox 'my-assistant'.

This is fine -- the forward is already running.

Then from your remote machine, create an SSH tunnel to the Station (replace <your-user> with your login name and <your-station-ip> with the IP address from above):

ssh -L 18789:127.0.0.1:18789 <your-user>@<your-station-ip>

Now open the tokenized URL in your remote machine's browser. The tunnel binds to your local loopback interface, so the same 127.0.0.1 address works on the client side:

http://127.0.0.1:18789/#token=<long-token-here>

Important

Use 127.0.0.1, not localhost -- the gateway origin check requires an exact match.
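If a saved or bookmarked dashboard URL uses localhost, it can be rewritten mechanically instead of retyped. The sketch below uses only urllib.parse from the Python standard library; the function names are hypothetical, and it assumes the token is carried in the URL fragment as token=... exactly as printed by the installer.

```python
from urllib.parse import urlsplit, urlunsplit


def to_loopback_ip(url: str) -> str:
    """Replace a localhost host with 127.0.0.1, keeping port, path, and fragment."""
    parts = urlsplit(url)
    if parts.hostname != "localhost":
        return url
    netloc = "127.0.0.1" if parts.port is None else f"127.0.0.1:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


def dashboard_token(url: str):
    """Return the token value from the URL fragment, or None if absent."""
    for item in urlsplit(url).fragment.split("&"):
        key, _, value = item.partition("=")
        if key == "token" and value:
            return value
    return None
```

Because the token lives in the fragment, browsers do not send it to the server as part of the request line; it is still a credential, so avoid pasting the full URL into logs or chat.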

Phase 3: Telegram Bot

Messaging (Telegram, Discord, Slack) is wired during onboarding — credentials are stored, OpenShell providers are created, and channel configuration is baked into the sandbox image. Runtime config under /sandbox/.openclaw/ is not safely patchable from inside the running sandbox.

nemoclaw start does not start the Telegram bridge. In current NemoClaw releases it starts optional host services such as the cloudflared tunnel when installed; Telegram delivery stays under OpenShell. See NemoClaw commands and Set up Telegram bridge.

Step 9. Create a Telegram bot

Open Telegram, find @BotFather, send /newbot, and follow the prompts. Copy the bot token.

Tip: Use Telegram Desktop or web.telegram.org so you can copy-paste the token into your terminal or env file instead of typing 46+ characters from your phone into SSH.

Step 10. Enable Telegram (first time or after skipping it)

Path A — You have not installed yet, or you can re-run onboard

Export the token on the host, then run the installer / onboard again (non-interactive variables from Step 4, plus TELEGRAM_BOT_TOKEN). The wizard’s Messaging channels step (installer phase [5/8]) is the right time to toggle Telegram interactively.

Re-onboarding after a sandbox exists is supported; NemoClaw can detect token changes and rebuild the sandbox — see the official Telegram bridge page.

Path B — NemoClaw is already installed (recommended host command)

On the host (run exit if you are inside nemoclaw … connect):

Allow outbound access to the Telegram API if you have not already — add the telegram network preset:

nemoclaw my-assistant policy-add

When prompted, select telegram and confirm.

Register the bot token and rebuild the sandbox image so Telegram is included:

export TELEGRAM_BOT_TOKEN='<your-bot-token>'
nemoclaw my-assistant channels add telegram

Follow the prompts to rebuild when asked (or run nemoclaw my-assistant rebuild --yes afterward if non-interactive mode queued a rebuild — see NEMOCLAW_NON_INTERACTIVE=1 behavior in the commands reference).

Pause or resume Telegram delivery without changing credentials: use the nemoclaw channels stop / nemoclaw channels start patterns for the telegram channel described in Set up Telegram bridge (exact subcommand spelling may vary slightly by NemoClaw version; use nemoclaw --help if in doubt).

Check overall status:

nemoclaw status

Open Telegram, find your bot, and send it a message.

Note

The first response may take 30--90 seconds for a 120B parameter model running locally.

Note

To persist TELEGRAM_BOT_TOKEN for shell-based flows, use a chmod 600 env file and source it from ~/.bashrc as shown in Step 4.

Note

For chat allowlists and advanced Telegram behavior, see NemoClaw Telegram bridge documentation.

Phase 4: Cleanup and Uninstall

Step 11. Stop services

Stop any running auxiliary host services (for example the cloudflared tunnel):

nemoclaw stop

Expected:

[services] All services stopped.

Stop the port forward (always pass port and sandbox name):

openshell forward list
openshell forward stop 18789 my-assistant

Stop and remove the vLLM container so the name vllm-nemotron is free for a future run. The playbook created the container with --restart unless-stopped, so a container that is left running comes back after every reboot or Docker daemon restart and keeps reserving GPU memory. An explicit docker stop prevents that; clearing the restart policy first is an extra safeguard.

docker update --restart=no vllm-nemotron 2>/dev/null || true
docker stop vllm-nemotron
docker rm vllm-nemotron

To remove the container in one step even if it is running: docker rm -f vllm-nemotron.

Step 12. Uninstall NemoClaw

Run the uninstaller from the cloned source directory. It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and vLLM are preserved.

cd ~/.nemoclaw/source
./uninstall.sh

Uninstaller flags:

Flag	Effect
`--yes`	Skip the confirmation prompt
`--keep-openshell`	Leave the `openshell` binary in place
`--delete-models`	Removes local inference models pulled by older NemoClaw flows (the upstream flag name still references Ollama). It does not remove Hugging Face weights used by this playbook’s vLLM container — delete those separately (below).

To also remove the vLLM container and cached model weights:

./uninstall.sh --yes
docker rm -f vllm-nemotron 2>/dev/null || true
rm -rf ~/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/

The uninstaller runs 6 steps:

Stop NemoClaw helper services and port-forward processes
Delete all OpenShell sandboxes, the NemoClaw gateway, and providers
Remove the global nemoclaw npm package
Remove NemoClaw/OpenShell Docker containers, images, and volumes
Remove Ollama models (only with --delete-models)
Remove state directories (~/.nemoclaw, ~/.config/openshell, ~/.config/nemoclaw) and the OpenShell binary

Note

The source clone at ~/.nemoclaw/source is removed as part of state cleanup in step 6. If you want to keep a local copy, move or back it up before running the uninstaller.

Useful commands

Command	Description
`nemoclaw my-assistant connect`	Shell into the sandbox
`nemoclaw my-assistant status`	Show sandbox status and inference config
`nemoclaw my-assistant logs --follow`	Stream sandbox logs in real time
`nemoclaw list`	List all registered sandboxes
`nemoclaw tunnel start`	Start optional host services such as cloudflared (public dashboard URL when installed); does not start Telegram
`nemoclaw start`	Deprecated alias for tunnel/aux host services — not for Telegram
`nemoclaw stop`	Stop host auxiliary services started by `nemoclaw tunnel start` / `nemoclaw start`
`nemoclaw <sandbox> channels add telegram`	Store Telegram token and rebuild sandbox (host)
`openshell term`	Open the monitoring TUI on the host
`openshell forward list`	List active port forwards
`openshell forward start 18789 my-assistant --background`	Start port forwarding for Web UI
`openshell forward stop 18789 my-assistant`	Stop Web UI port forward
`docker logs -f vllm-nemotron`	Stream vLLM inference server logs
`docker restart vllm-nemotron`	Restart the vLLM inference server
`curl http://localhost:8000/v1/models`	Check vLLM API status
`cd ~/.nemoclaw/source && ./uninstall.sh`	Remove NemoClaw (preserves Docker, Node.js, vLLM image)

Troubleshooting

Symptom	Cause	Fix
`openclaw agent --local` fails or is blocked inside the sandbox	`--local` bypasses the NemoClaw gateway and is disallowed in the OpenShell sandbox	Use gateway mode: `openclaw agent --agent main -m "hello" --session-id test` (no `--local`).
Onboard fails with “K8s namespace not ready” (or similar) with no clear reason	Often low disk space on `/` or Docker’s data root; image push / k3s need headroom	Run `df -h / /var/lib/docker`. Free at least ~40 GB (see NemoClaw quickstart prerequisites); prune Docker (`docker system prune`) or expand disk, then retry onboard.
vLLM warns about mixed devices or loads on an unexpected GPU	Multiple GPUs visible; default visibility does not match intent	Pin one GPU: `--gpus '"device=0"'` and `-e CUDA_VISIBLE_DEVICES=0` with `--tensor-parallel-size 1`, or use two GPUs explicitly with `--tensor-parallel-size 2` and `-e CUDA_VISIBLE_DEVICES=0,1` (see Step 3 in instructions).
`nemoclaw: command not found` after install	Shell PATH not updated	Run `source ~/.bashrc` (or `source ~/.zshrc` for zsh), or open a new terminal window.
`pip: command not found`	pip not installed on DGX Station by default	Install pip: `sudo apt install -y python3-pip`. Then use `pip3 install --break-system-packages huggingface-hub`.
`huggingface-cli` is deprecated	Hugging Face CLI was renamed	Use `hf download` instead of `huggingface-cli download`.
vLLM container won't start or crashes	GPU memory issue or wrong image	Check logs: `docker logs vllm-nemotron`. If CUDA OOM, reduce context: recreate the container with `--max-model-len 8192`. Ensure you are using the NVIDIA container image (`nvcr.io/nvidia/vllm:26.03-py3`), not the community `vllm/vllm-openai` image.
vLLM logs show `Application startup complete.` but `curl` times out	vLLM still compiling CUDA graphs after startup	Wait 1--2 minutes after `Application startup complete.` before sending requests. The first request compiles CUDA graphs and may take 30--90 seconds.
NemoClaw onboard fails with "endpoint validation failed"	vLLM model not warmed up or validation timeout too short	Warm up the model first: `curl -s --max-time 120 http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","messages":[{"role":"user","content":"hello"}],"max_tokens":10}'`. Then re-run with `NEMOCLAW_EXPERIMENTAL=1 NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 nemoclaw onboard`.
NemoClaw reports "provider 'vllm' is not available"	Missing experimental flag	Set `NEMOCLAW_EXPERIMENTAL=1` before running the installer or `nemoclaw onboard`. The vLLM provider is currently an experimental feature.
Docker permission denied	User not in docker group	`sudo usermod -aG docker $USER`, then log out and back in.
Gateway fails with cgroup / "Failed to start ContainerManager" errors	Docker not configured for host cgroup namespace on DGX Station	Run the cgroup fix: `sudo python3 -c "import json, os; path='/etc/docker/daemon.json'; d=json.load(open(path)) if os.path.exists(path) else {}; d['default-cgroupns-mode']='host'; json.dump(d, open(path,'w'), indent=2)"` then `sudo systemctl restart docker`.
Gateway fails with "port 8080 is held by container..."	Another OpenShell gateway or container is using port 8080	Stop the conflicting container: `openshell gateway destroy -g <old-gateway-name>` or `docker stop <container-name> && docker rm <container-name>`, then retry `nemoclaw onboard`.
Sandbox cannot reach the inference server	Using `localhost` instead of `host.openshell.internal` in endpoint URL	Inside the sandbox, `localhost` refers to the sandbox container, not the host. The onboard wizard configures `host.openshell.internal` automatically. Verify from inside the sandbox: `curl -sf https://inference.local/v1/models`. If this fails, check that vLLM is reachable from the host: `curl -s http://localhost:8000/v1/models`.
Agent gives no response or is very slow	Normal for 120B model running locally	Nemotron 3 Super 120B can take 30--90 seconds per response. Verify inference route: `nemoclaw my-assistant status`.
vLLM API returns empty or errors on tool calls	Missing tool-call flags	Verify that `--enable-auto-tool-choice` and `--tool-call-parser qwen3_xml` are set: `docker inspect vllm-nemotron --format '{{.Config.Cmd}}'`.
Port 18789 already in use	Another process is bound to the port	`lsof -i :18789` then `kill <PID>`. If needed, `kill -9 <PID>` to force-terminate.
Web UI port forward dies or dashboard unreachable	Port forward not active	`openshell forward stop 18789 my-assistant` then `openshell forward start 18789 my-assistant --background`. Always pass port and sandbox name to `openshell forward stop`.
Web UI shows `origin not allowed`	Browser origin does not match what the gateway expects	On the DGX Station local desktop, open `http://127.0.0.1:18789/#token=...` (not `localhost`). When connecting through an SSH tunnel from another machine, both `localhost` and `127.0.0.1` usually work in the client browser, because the origin check applies to how you reach the forwarded port locally.
Telegram does not work after install; `nemoclaw start` does nothing for Telegram	`nemoclaw start` starts optional host services (e.g. cloudflared), not the Telegram bridge	Configure Telegram during onboard. Alternatively, on the host, first run `nemoclaw my-assistant policy-add` for the `telegram` preset, then run `nemoclaw my-assistant channels add telegram` and rebuild. See Set up Telegram bridge.
Telegram bot receives messages but does not reply	Telegram policy not added to sandbox	Run `nemoclaw my-assistant policy-add`, enter `telegram`, and confirm with `Y`. Also make sure the channel was added with `nemoclaw my-assistant channels add telegram` so that the sandbox image includes Telegram.
`docker: Error response from daemon: Conflict. The container name "/vllm-nemotron" is already in use`	Previous cleanup used `docker stop` only	`docker rm -f vllm-nemotron` (or `docker update --restart=no vllm-nemotron`, then `docker stop vllm-nemotron` and `docker rm vllm-nemotron`). The playbook uses `--restart unless-stopped`; stopping the container alone leaves its restart policy in place and its name reserved.
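
The two rows above about `curl` timing out after startup and "endpoint validation failed" both come down to waiting for vLLM and then warming it up. A minimal sketch of that sequence follows. The function names, the `VLLM_URL` and `PROBE_CMD` variables, and the retry values are illustrative assumptions, not part of NemoClaw or vLLM; the warm-up request body is the one from the table.

```shell
# Sketch only: poll vLLM until it answers, then send one warm-up request.
# VLLM_URL, PROBE_CMD, and the function names are hypothetical helpers.

VLLM_URL="${VLLM_URL:-http://localhost:8000}"
MODEL="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"

# Decide whether the server is up. Set PROBE_CMD to substitute another
# command, for example to exercise the retry loop without a server.
probe() {
  if [ -n "${PROBE_CMD:-}" ]; then
    $PROBE_CMD
  else
    curl -sf --max-time 10 "$VLLM_URL/v1/models" > /dev/null
  fi
}

# wait_for_vllm <max_attempts> <sleep_seconds>
# Returns 0 as soon as the probe succeeds, 1 if every attempt fails.
wait_for_vllm() {
  local max_attempts="$1" delay="$2" attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if probe; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Send the same short chat completion as the warm-up command in the table.
warm_up() {
  curl -s --max-time 120 "$VLLM_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"hello\"}],\"max_tokens\":10}"
}
```

One possible use is `wait_for_vllm 30 10 && warm_up`, followed by `NEMOCLAW_EXPERIMENTAL=1 NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 nemoclaw onboard` once the warm-up request has returned.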

Model variant guidance:

Variant	Size	VRAM Required	When to Use
`NVFP4`	~60 GB	~80 GB	Default for DGX Station (GB300). Fits on single GPU with room for large KV cache.
`FP8`	~120 GB	~140 GB	Higher accuracy, still fits on GB300. Add `--kv-cache-dtype fp8` to the vLLM command.
`BF16`	~240 GB	~260 GB	Highest accuracy. Fits on GB300 but leaves little room for KV cache. Reduce `--max-model-len`.
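
As a rough illustration of the table above, the sketch below lists which variants fit a given amount of GPU memory. The thresholds are copied from the "VRAM Required" column; the helper name `variants_that_fit` is hypothetical, and the approximate figures in the table should be treated as guidance rather than hard limits.

```shell
# Sketch only: map available GPU memory (whole GB) to the variants that fit,
# using the approximate "VRAM Required" values from the table.
# variants_that_fit is a hypothetical helper, not part of any NVIDIA tool.
variants_that_fit() {
  local vram_gb="$1" fits=""
  if [ "$vram_gb" -ge 80 ]; then
    fits="NVFP4"
  fi
  if [ "$vram_gb" -ge 140 ]; then
    fits="$fits FP8"
  fi
  if [ "$vram_gb" -ge 260 ]; then
    fits="$fits BF16"
  fi
  # Print "none" when even the smallest variant does not fit.
  echo "${fits:-none}"
}
```

For example, `variants_that_fit 150` prints `NVFP4 FP8`. On systems where `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits` reports a value, that value is in MiB and would need to be divided by 1024 before being passed in.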

For the latest known issues, see DGX Station documentation.
