NemoClaw with Nemotron-3-Super and vLLM on DGX Station
Install NemoClaw on DGX Station with local vLLM inference and Telegram bot integration
Table of Contents
- Overview
- Instructions
- Step 1. Configure Docker and the NVIDIA container runtime
- Step 2. Pull the Nemotron-3-Super model
- Step 3. Start the vLLM inference server
- Step 4. Install NemoClaw
- Step 5. Connect to the sandbox and verify inference
- Step 6. Talk to the agent (CLI)
- Step 7. Interactive TUI
- Step 8. Exit the sandbox and access the Web UI
- Step 9. Create a Telegram bot
- Step 10. Enable Telegram (first time or after skipping it)
- Step 11. Stop services
- Step 12. Uninstall NemoClaw
- Troubleshooting
Overview
Basic idea
NVIDIA NemoClaw is an open-source reference stack that simplifies running OpenClaw always-on assistants more safely. It installs the NVIDIA OpenShell runtime -- an environment designed for executing agents with additional security -- and open-source models like NVIDIA Nemotron. A single installer command handles Node.js, OpenShell, and the NemoClaw CLI, then walks you through an onboard wizard to create a sandboxed agent on your DGX Station using vLLM with Nemotron 3 Super.
By the end of this playbook you will have a working AI agent inside an OpenShell sandbox, accessible via a web dashboard and a Telegram bot, with inference routed to a local Nemotron 3 Super 120B model served by vLLM on your DGX Station -- all without exposing your host filesystem or network to the agent.
What you'll accomplish
- Configure Docker and the NVIDIA container runtime for OpenShell on DGX Station
- Pull Nemotron 3 Super 120B (NVFP4) from Hugging Face and serve it with vLLM
- Install NemoClaw with a single command (handles Node.js, OpenShell, and the CLI)
- Run the onboard wizard to create a sandbox and configure local vLLM inference
- Chat with the agent via the CLI, TUI, and web UI
- Set up a Telegram bot that forwards messages to your sandboxed agent
Notice and disclaimers
The following sections describe safety, risks, and your responsibilities when running this demo.
Quick start safety check
Use only a clean environment. Run this demo on a fresh device or VM with no personal data, confidential information, or sensitive credentials. Keep it isolated like a sandbox.
By installing this demo, you accept responsibility for all third-party components, including reviewing their licenses, terms, and security posture. Read and accept before you install or use.
What you're getting
This experience is provided "AS IS" for demonstration purposes only -- no warranties, no guarantees. This is a demo, not a production-ready solution. You will need to implement appropriate security controls for your environment and use case.
Key risks with AI agents
- Data leakage -- Any materials the agent accesses could be exposed, leaked, or stolen.
- Malicious code execution -- The agent or its connected tools could expose your system to malicious code or cyber-attacks.
- Unintended actions -- The agent might modify or delete files, send messages, or access services without explicit approval.
- Prompt injection and manipulation -- External inputs or connected content could hijack the agent's behavior in unexpected ways.
Participant acknowledgement
By participating in this demo, you acknowledge that you are solely responsible for your configuration and for any data, accounts, and tools you connect. To the maximum extent permitted by law, NVIDIA is not responsible for any loss of data, device damage, security incidents, or other harm arising from your configuration or use of NemoClaw demo materials, including OpenClaw or any connected tools or services.
Isolation layers (OpenShell)
| Layer | What it protects | When it applies |
|---|---|---|
| Filesystem | Prevents reads/writes outside allowed paths. | Locked at sandbox creation. |
| Network | Blocks unauthorized outbound connections. | Hot-reloadable at runtime. |
| Process | Blocks privilege escalation and dangerous syscalls. | Locked at sandbox creation. |
| Inference | Reroutes model API calls to controlled backends. | Hot-reloadable at runtime. |
What to know before starting
- Basic use of the Linux terminal and SSH
- Familiarity with Docker (permissions, `docker run`)
- Awareness of the security and risk sections above
Prerequisites
Hardware and access:
- A DGX Station (GB300) with keyboard and monitor, or SSH access
- A Telegram bot token from @BotFather (create one with `/newbot`) -- optional, for Phase 3
Software:
- Fresh install of DGX OS with latest updates
Verify your system before starting:
head -n 2 /etc/os-release
nvidia-smi
docker info --format '{{.ServerVersion}}'
df -h / /var/lib/docker 2>/dev/null | head -20
Expected: Ubuntu 24.04, NVIDIA GB300 GPU(s), Docker 28.x+, and enough free disk for Docker layers, the NemoClaw sandbox image, and Hugging Face cache (treat ~40 GB free on the Docker data filesystem as a practical minimum; very low free space can surface as cryptic onboard errors such as “K8s namespace not ready”).
Have ready before you begin
| Item | Where to get it |
|---|---|
| Telegram bot token (optional) | @BotFather on Telegram -- create with /newbot |
Ancillary files
All required assets are handled by the NemoClaw installer. No manual cloning is needed.
Time and risk
- Estimated time: 20--30 minutes (with model already downloaded). First-time model download adds ~10--20 minutes depending on network speed.
- Risk level: Medium -- you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts.
- Last Updated: 04/27/2026
- First publication for DGX Station with vLLM
Instructions
Phase 1: Prerequisites
These steps prepare a fresh DGX Station for NemoClaw. If Docker, the NVIDIA runtime, and vLLM are already configured, skip to Phase 2.
Important
Disk space: NemoClaw’s onboard flow pulls a multi-gigabyte sandbox image and runs Docker, k3s, and the gateway together. If root or Docker’s data disk is nearly full (for example only a few gigabytes free), onboarding can fail with generic errors such as “K8s namespace not ready” with no clear hint about storage. Before you start, check free space with `df -h / /var/lib/docker`. NVIDIA recommends at least 40 GB free on the filesystem that holds Docker layers (often `/` or `/var/lib/docker`); treat under ~15 GB as high risk for first-time onboard failures.
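The disk-space guidance above can be turned into a small preflight check. The following is a minimal sketch, not part of NemoClaw: the function names and the "ok" / "tight" / "high-risk" labels are illustrative, and it assumes decimal gigabytes (10^9 bytes), which differs slightly from the units `df -h` prints.

```python
import shutil

# Thresholds copied from the guidance above; everything else is illustrative.
RECOMMENDED_FREE_GB = 40
HIGH_RISK_FREE_GB = 15


def classify_free_space(free_bytes: int) -> str:
    """Map a free-space figure in bytes to 'ok', 'tight', or 'high-risk'."""
    free_gb = free_bytes / 1e9
    if free_gb >= RECOMMENDED_FREE_GB:
        return "ok"
    if free_gb >= HIGH_RISK_FREE_GB:
        return "tight"
    return "high-risk"


def check_paths(paths=("/", "/var/lib/docker")):
    """Return (path, free_gb, verdict) for each path that can be inspected."""
    results = []
    for path in paths:
        try:
            usage = shutil.disk_usage(path)
        except OSError:
            # Path missing or not accessible; skip it rather than fail.
            continue
        results.append((path, round(usage.free / 1e9, 1), classify_free_space(usage.free)))
    return results
```

Calling `check_paths()` on the host before onboarding and looking for any verdict other than "ok" gives an early warning before the installer reaches the gateway phase.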
Step 1. Configure Docker and the NVIDIA container runtime
OpenShell's gateway runs k3s inside Docker. On DGX Station (Ubuntu 24.04, cgroup v2), Docker must be configured with the NVIDIA runtime and host cgroup namespace mode.
Configure the NVIDIA container runtime for Docker:
sudo nvidia-ctk runtime configure --runtime=docker
Expected:
INFO Loading config from /etc/docker/daemon.json
INFO Wrote updated config to /etc/docker/daemon.json
INFO It is recommended that docker daemon be restarted.
Set the cgroup namespace mode required by OpenShell on DGX Station:
sudo python3 -c "
import json, os
path = '/etc/docker/daemon.json'
d = json.load(open(path)) if os.path.exists(path) else {}
d['default-cgroupns-mode'] = 'host'
json.dump(d, open(path, 'w'), indent=2)
"
Restart Docker:
sudo systemctl restart docker
Verify the NVIDIA runtime works:
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Expected:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| 0 NVIDIA GB300 On | 00000009:06:00.0 Off | 0 |
| N/A 46C P0 215W / 1300W | 18661MiB / 256703MiB | 0% Default |
+-----------------------------------------+------------------------+----------------------+
If you get a permission denied error on docker, add your user to the Docker group and activate the new group in your current session:
sudo usermod -aG docker $USER
newgrp docker
This applies the group change immediately. Alternatively, you can log out and back in instead of running newgrp docker.
Note
DGX Station uses cgroup v2. OpenShell's gateway embeds k3s inside Docker and needs host cgroup namespace access. Without `default-cgroupns-mode: host`, the gateway can fail with "Failed to start ContainerManager" errors.
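To confirm that both Docker settings from this step are in place, a short check along these lines may help. It is a sketch under one assumption: that `nvidia-ctk runtime configure` registers the runtime under a `runtimes` → `nvidia` key in `daemon.json`. The function names are hypothetical.

```python
import json


def missing_docker_settings(config: dict) -> list:
    """Return the settings this playbook expects that are absent from config."""
    missing = []
    # Assumption: nvidia-ctk writes the runtime under "runtimes" -> "nvidia".
    if "nvidia" not in config.get("runtimes", {}):
        missing.append("runtimes.nvidia")
    if config.get("default-cgroupns-mode") != "host":
        missing.append("default-cgroupns-mode=host")
    return missing


def check_daemon_json(path: str = "/etc/docker/daemon.json") -> list:
    """Load daemon.json and report which expected settings are missing."""
    with open(path) as fh:
        return missing_docker_settings(json.load(fh))
```

An empty list from `check_daemon_json()` suggests the file matches what Step 1 configures; Docker still needs a restart for changes to take effect.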
Step 2. Pull the Nemotron-3-Super model
Install pip and the Hugging Face CLI (if not already installed):
sudo apt install -y python3-pip
pip3 install --break-system-packages huggingface-hub
Download Nemotron 3 Super 120B in NVFP4 quantization (~60 GB; may take 10--20 minutes depending on network speed):
hf download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
Expected (on a fresh download; cached downloads complete instantly):
Fetching 36 files: 100%|██████████| 36/36 [15:42<00:00, 26.18s/it]
/home/nvidia/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/snapshots/0d6fa3ecad422a...
Verify the download completed:
ls ~/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/
Expected:
blobs refs snapshots
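Seeing the three directories does not show how much data arrived. As a rough cross-check against the ~60 GB figure above, the sketch below sums the regular files in the cache directory. It assumes the usual Hugging Face cache layout on Linux, where `snapshots` holds symlinks into `blobs`, so symlinks are skipped to avoid double counting. The function name is illustrative.

```python
import os


def cache_size_bytes(root: str) -> int:
    """Sum the sizes of regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total
```

For example, `cache_size_bytes(os.path.expanduser("~/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4")) / 1e9` gives the size in gigabytes; a value far below the expected download size points to an interrupted download.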
Note
The NVFP4 quantization is chosen because it fits entirely in one GB300 GPU’s 256 GB HBM3e with room for KV cache. On a two-GPU station you can still use NVFP4 with `--tensor-parallel-size 1` and a single visible GPU, or shard with `--tensor-parallel-size 2`. For other quantization variants, see Troubleshooting.
Step 3. Start the vLLM inference server
Launch vLLM using the NVIDIA-optimized container image.
Single GPU (default on one-GPU systems, or pin to one GPU on multi-GPU stations): vLLM can emit mixed device warnings if several GPUs are visible but the model is only meant to use one. Pinning avoids accidentally placing weights on an unexpected device.
docker run -d --name vllm-nemotron \
--runtime nvidia --gpus '"device=0"' \
-e CUDA_VISIBLE_DEVICES=0 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--restart unless-stopped \
nvcr.io/nvidia/vllm:26.03-py3 \
python3 -m vllm.entrypoints.openai.api_server \
--model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--trust-remote-code \
--max-model-len 32768 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--reasoning-parser nemotron_v3
Two GPUs (tensor parallel): If your DGX Station has two Blackwell GPUs and you want Nemotron sharded across both, use both devices and set tensor parallel size to 2 (VRAM is summed across the GPUs):
docker run -d --name vllm-nemotron \
--runtime nvidia --gpus all \
-e CUDA_VISIBLE_DEVICES=0,1 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--restart unless-stopped \
nvcr.io/nvidia/vllm:26.03-py3 \
python3 -m vllm.entrypoints.openai.api_server \
--model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--trust-remote-code \
--max-model-len 32768 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--reasoning-parser nemotron_v3
Pick a GPU index by name (optional one-liner): To print the device index of the first GPU whose name contains GB300 (adjust the pattern if your nvidia-smi name string differs), run on the host:
nvidia-smi --query-gpu=index,name --format=csv,noheader | awk -F', ' '/GB300/ { gsub(/^ +/,"",$1); print $1; exit }'
Use that index in Docker as --gpus '"device=N"' (replace N with the printed index).
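If you prefer to do the same lookup from a script, the sketch below parses the same `nvidia-smi` CSV output. It is an illustration rather than a replacement for the one-liner; the function names are hypothetical, and the `GB300` pattern may need adjusting to match your device name.

```python
import subprocess


def first_gpu_index(csv_text: str, pattern: str = "GB300"):
    """Return the index of the first GPU whose name contains pattern, else None."""
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Each line looks like "<index>, <name>".
        index, _, name = line.partition(",")
        if pattern in name:
            return int(index.strip())
    return None


def query_gpus() -> str:
    """Run the same nvidia-smi query as the one-liner above and return its output."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
```

On the host, `first_gpu_index(query_gpus())` returns the index to use as N, or `None` when no device name matches.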
Note
--tool-call-parser qwen3_xml: Nemotron’s tool-call wire format is exposed through vLLM’s Qwen3-compatible XML tool parser — the name refers to the parser implementation, not the base model. This pairing is what vLLM expects for correct function/tool calling with this checkpoint.
The first startup loads ~70 GB of weights into GPU memory. Watch the logs until you see the model is ready:
docker logs -f vllm-nemotron
Wait until you see the following in the logs (typically 3--5 minutes):
INFO Loading weights took 55.47 seconds
INFO Model loading took 69.39 GiB memory and 71.31 seconds
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
Then verify the API is responding:
curl -s http://localhost:8000/v1/models
Expected:
{"object":"list","data":[{"id":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","object":"model",...}]}
Send a test request to warm up the model before proceeding to Step 4. The first inference request compiles CUDA graphs and can take 30--90 seconds:
curl -s --max-time 120 http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","messages":[{"role":"user","content":"Say hello."}],"max_tokens":10}'
Expected (the first request may take 30--90 seconds; subsequent requests are much faster):
{"id":"chatcmpl-...","object":"chat.completion","model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","choices":[{"index":0,"message":{"role":"assistant","content":"..."},"finish_reason":"length"}],...}
Important
Warm up the model before running the NemoClaw installer. The onboard wizard validates the vLLM endpoint with a short timeout. If the model has not served at least one request, this validation will time out and the install will fail.
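For unattended setups, the readiness check and warm-up request can be scripted with the Python standard library. This is a minimal sketch: the function names, the polling interval, and the attempt count are assumptions, while the URL, model name, and request body mirror the `curl` commands above.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"
MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"


def fetch_model_ids(base_url: str = BASE_URL, timeout: float = 5.0) -> list:
    """Return model ids from /v1/models, or [] if the server is not answering yet."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, or a non-JSON body: treat as "not ready".
        return []
    return [entry["id"] for entry in payload.get("data", [])]


def wait_for_model(model: str = MODEL, attempts: int = 60, delay: float = 5.0,
                   fetch=fetch_model_ids, sleep=time.sleep) -> bool:
    """Poll until the model is listed; give up after the given number of attempts."""
    for attempt in range(attempts):
        if model in fetch():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


def build_warmup_request(model: str = MODEL, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same chat-completions request as the curl warm-up above."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 10,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A typical sequence would be `wait_for_model()` followed by `urllib.request.urlopen(build_warmup_request(), timeout=120).read()`, using the same 120-second ceiling as the `curl` example.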
Important
Always start vLLM via the Docker container -- do not run `vllm serve` directly on the host. The NVIDIA container image (`nvcr.io/nvidia/vllm:26.03-py3`) includes optimized kernels for the GB300's Blackwell architecture that are not available in the pip-installed version.
Note
Key flags explained:
- `--tensor-parallel-size` -- `1` for a single visible GPU; `2` when you expose two GPUs for tensor-parallel sharding (see Step 3).
- `--trust-remote-code` -- required for the Mamba2-Transformer hybrid architecture
- `--max-model-len 32768` -- maximum context length (increase up to 1M if VRAM allows)
- `--enable-auto-tool-choice --tool-call-parser qwen3_xml` -- enables function/tool calling for the agent (see the note above on the parser name).
- `--reasoning-parser nemotron_v3` -- separates chain-of-thought reasoning from the response so the TUI/Web UI can display them cleanly
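The only server flag that differs between the single-GPU and two-GPU commands in this step is the tensor-parallel size. The sketch below makes that explicit by generating the server arguments from one parameter. The helper name is hypothetical and the flag values are copied from the commands above; it covers only the vLLM portion, not the surrounding `docker run` options.

```python
import shlex

MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"


def vllm_server_args(tensor_parallel_size: int = 1, max_model_len: int = 32768) -> list:
    """Return the vLLM server argv used in this step for 1 or 2 GPUs."""
    if tensor_parallel_size not in (1, 2):
        raise ValueError("this playbook only covers 1 or 2 GPUs")
    return [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--model", MODEL,
        "--host", "0.0.0.0",
        "--port", "8000",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--trust-remote-code",
        "--max-model-len", str(max_model_len),
        "--enable-auto-tool-choice",
        "--tool-call-parser", "qwen3_xml",
        "--reasoning-parser", "nemotron_v3",
    ]
```

`print(shlex.join(vllm_server_args(2)))` prints a shell-ready command line that can be appended after the image name in `docker run`.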
Phase 2: Install and Run NemoClaw
Step 4. Install NemoClaw
The installer script installs Node.js (if needed), OpenShell, the NemoClaw CLI, and runs onboarding to create a sandbox. The vLLM provider requires the experimental flag and an extended inference timeout (the default 15-second validation timeout is too short for a 120B model).
Recommended: non-interactive install (copy-paste friendly)
This path is best for SSH sessions, automation, and documentation — no arrow-key TUI in the terminal.
NEMOCLAW_EXPERIMENTAL=1 \
NEMOCLAW_NON_INTERACTIVE=1 \
NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \
NEMOCLAW_SANDBOX_NAME=my-assistant \
NEMOCLAW_PROVIDER=vllm \
NEMOCLAW_MODEL="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4" \
NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 \
bash -c "$(curl -fsSL https://www.nvidia.com/nemoclaw.sh)"
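When the installer is launched from an automation script instead of a shell, the same variables can be assembled as an environment mapping. This is a sketch only: the helper name is hypothetical, the variable names and values are the ones shown above, and setting `NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE` still means you accept the third-party notice.

```python
import os


def installer_env(sandbox_name: str, model: str, timeout_s: int = 300, base=None) -> dict:
    """Return an environment for a non-interactive install with the vLLM provider."""
    env = dict(os.environ if base is None else base)
    env.update({
        "NEMOCLAW_EXPERIMENTAL": "1",  # required for the vLLM provider
        "NEMOCLAW_NON_INTERACTIVE": "1",
        "NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE": "1",
        "NEMOCLAW_SANDBOX_NAME": sandbox_name,
        "NEMOCLAW_PROVIDER": "vllm",
        "NEMOCLAW_MODEL": model,
        "NEMOCLAW_LOCAL_INFERENCE_TIMEOUT": str(timeout_s),
    })
    return env
```

The resulting mapping can be passed as `env=` to `subprocess.run` together with the same `bash -c` installer command used above.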
Optional: include Telegram in the first onboard without typing the token over SSH — export credentials on the host before running the installer (same variables the NemoClaw Telegram bridge guide documents):
export TELEGRAM_BOT_TOKEN='<paste-token-here>'
# Optional DM allowlist (comma-separated Telegram user IDs):
# export TELEGRAM_ALLOWED_IDS='123456789,987654321'
Use Telegram Desktop or web.telegram.org on a laptop to copy the token from @BotFather and paste into your SSH session (or into a small env file you source). Typing a 46+ character token on a phone keyboard into a remote shell is error-prone.
To persist TELEGRAM_BOT_TOKEN across reboots, keep it in a root-owned or user-only file and source it from your shell profile (example — adjust path and permissions):
mkdir -p ~/.nemoclaw   # the directory may not exist before the first install
install -m 600 /dev/null ~/.nemoclaw/telegram.env
nano ~/.nemoclaw/telegram.env # add: export TELEGRAM_BOT_TOKEN='...'
grep -q 'nemoclaw/telegram.env' ~/.bashrc || echo 'source ~/.nemoclaw/telegram.env 2>/dev/null' >> ~/.bashrc
NemoClaw also stores messaging credentials in its credential store when you onboard or run nemoclaw … channels add telegram; the file above is mainly for re-running scripts or non-interactive flows that read the environment.
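If you create the env file from a provisioning script rather than an editor, it helps to create it with restrictive permissions from the start. The following sketch shows one way to do that; the function name is hypothetical, and the file format (`export TELEGRAM_BOT_TOKEN=...`) follows the example above.

```python
import os
import shlex


def write_token_env(path: str, token: str) -> None:
    """Create or overwrite an env file that only the current user can read."""
    os.makedirs(os.path.dirname(path) or ".", mode=0o700, exist_ok=True)
    line = f"export TELEGRAM_BOT_TOKEN={shlex.quote(token)}\n"
    # Creating the file with mode 0o600 avoids a window where it is world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(line)
    # Also tighten permissions if the file already existed with a looser mode.
    os.chmod(path, 0o600)
```

For instance, `write_token_env(os.path.expanduser("~/.nemoclaw/telegram.env"), token)` produces a file that can be sourced from `~/.bashrc` as described above.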
Alternative: interactive installer
If you prefer the wizard:
NEMOCLAW_EXPERIMENTAL=1 \
NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 \
bash -c "$(curl -fsSL https://www.nvidia.com/nemoclaw.sh)"
The wizard asks six high-level prompts (third-party notice, inference provider, Brave Search, messaging channels, sandbox name, policy presets). In parallel, the installer prints eight numbered onboard sub-phases, [1/8] … [8/8] (preflight, gateway, inference detection, inference route, messaging channels, sandbox creation, OpenClaw inside sandbox, policy presets). The two numberings differ on purpose: the [n/8] lines are internal progress steps, while the six prompts listed below are what you answer in the TUI.
- Third-party software notice -- Type `yes` to accept and continue.
- Inference provider -- The wizard detects vLLM running locally. Select option 8 (`Local vLLM [experimental] — running`).
- Brave Web Search -- Optional. Type `skip` if you don't have a Brave Search API key.
- Messaging channels -- Optional. Press Enter to skip, or toggle Telegram/Discord/Slack if desired (this is the step that corresponds to onboard phase [5/8] in the log).
- Sandbox name -- Pick a name (e.g. `my-assistant`). Names must be lowercase alphanumeric with hyphens only.
- Policy presets -- Use arrow keys to toggle presets. `pypi` and `npm` are selected by default. Press Enter to confirm.
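The sandbox naming rule can be checked before you run the installer, which is useful when the name comes from a script or variable. The sketch below encodes only the rule stated above (lowercase letters, digits, and hyphens); the real CLI may apply further limits, such as a maximum length, that are not covered here.

```python
import re

# Assumption: mirrors only the documented rule; the CLI may be stricter.
_NAME_RE = re.compile(r"[a-z0-9-]+")


def is_valid_sandbox_name(name: str) -> bool:
    """Return True if name uses only lowercase letters, digits, and hyphens."""
    return _NAME_RE.fullmatch(name) is not None
```

With this check, `my-assistant` passes while `My_Assistant` and the empty string are rejected.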
The install takes approximately 3 minutes. Example milestones in the output (wording may vary slightly by release):
[1/3] Node.js
Node.js found: v22.22.2
[2/3] NemoClaw CLI
Installing NemoClaw from GitHub...
Verified: nemoclaw is available at /home/nvidia/.local/bin/nemoclaw
[3/3] Onboarding
[1/8] Preflight checks
✓ Docker is running
✓ NVIDIA GPU detected: 2 GPU(s), 256703 MB VRAM # example on a two-GPU system
[2/8] Starting OpenShell gateway
✓ Gateway is healthy
[3/8] Configuring inference (NIM)
✓ Using existing vLLM on localhost:8000
Detected model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
[4/8] Setting up inference provider
✓ Inference route set: vllm-local / nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
[5/8] Messaging channels
(example) Telegram disabled — skipped
# or: Telegram enabled; token stored in credential store
[6/8] Creating sandbox
✓ Sandbox 'my-assistant' created
[7/8] Setting up OpenClaw inside sandbox
✓ OpenClaw gateway launched inside sandbox
[8/8] Policy presets
Applied preset: pypi
Applied preset: npm
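When onboarding fails, the last `[n/8]` line in the captured output tells you which sub-phase was running. The sketch below extracts that from saved installer output. The function names are illustrative, and since the log wording may vary by release, only the `[n/total]` prefix is relied on.

```python
import re

_PHASE_RE = re.compile(r"^\s*\[(\d+)/(\d+)\]\s+(.*\S)\s*$")


def parse_phase(line: str):
    """Return (step, total, title) for a '[n/m] Title' line, else None."""
    match = _PHASE_RE.match(line)
    if not match:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


def last_onboard_phase(log_text: str, total: int = 8):
    """Return the last '[n/8]' phase seen in the log, or None if there is none."""
    last = None
    for line in log_text.splitlines():
        phase = parse_phase(line)
        if phase and phase[1] == total:
            last = phase
    return last
```

Filtering on `total` keeps the outer `[n/3]` installer stages from being mistaken for onboard sub-phases.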
When complete you will see:
──────────────────────────────────────────────────
Sandbox my-assistant (Landlock + seccomp + netns)
Model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (Local vLLM)
──────────────────────────────────────────────────
Run: nemoclaw my-assistant connect
Status: nemoclaw my-assistant status
Logs: nemoclaw my-assistant logs --follow
OpenClaw UI (tokenized URL; treat it like a password)
http://127.0.0.1:18789/#token=<long-token-here>
──────────────────────────────────────────────────
Important
Save the tokenized Web UI URL printed at the end -- you will need it in Step 8. It looks like:
http://127.0.0.1:18789/#token=<long-token-here>
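Because the URL carries the access token in its fragment, it is worth masking before the URL ends up in logs, tickets, or chat. The sketch below shows one way to do that with the standard library. The function name and the `REDACTED` placeholder are illustrative, and it assumes the fragment has the `token=<value>` form shown above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def redact_dashboard_url(url: str) -> str:
    """Return url with the value of the 'token' fragment parameter masked."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.fragment, keep_blank_values=True)
    redacted = [(key, "REDACTED" if key == "token" else value) for key, value in pairs]
    return urlunsplit(parts._replace(fragment=urlencode(redacted)))
```

Keep the original URL somewhere private, such as a password manager, and share only the redacted form.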
Important
`NEMOCLAW_EXPERIMENTAL=1` is required for the vLLM provider. Without it, the installer will report "Requested provider 'vllm' is not available in this environment."
Important
`NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300` extends the validation timeout from the default 15 seconds to 300 seconds. Without it, endpoint validation can fail even if you warmed up the model in Step 3, because the installer sends its own test prompt, which may be slower.
Note
If `nemoclaw` is not found after install, run `source ~/.bashrc` to reload your shell path.
Step 5. Connect to the sandbox and verify inference
Connect to the sandbox:
nemoclaw my-assistant connect
Expected:
sandbox@my-assistant:~$
You are now inside the sandboxed environment. Verify that the inference route is working:
curl -sf https://inference.local/v1/models
Expected:
{"object":"list","data":[{"id":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","object":"model",...}]}
Step 6. Talk to the agent (CLI)
Still inside the sandbox, send a test message through the OpenClaw gateway (the default path). The --local flag is intentionally blocked inside the NemoClaw OpenShell sandbox — it would bypass gateway controls — so the command you may see in generic OpenClaw quickstarts will fail here.
openclaw agent --agent main -m "hello" --session-id test
Expected (the agent will think, then respond -- first response may take 30--90 seconds): streaming or printed assistant text ending with a normal reply.
If you see a response from the agent, inference is working end-to-end.
Step 7. Interactive TUI
Launch the terminal UI for an interactive chat session:
openclaw tui
Press Ctrl+C to exit the TUI.
Step 8. Exit the sandbox and access the Web UI
Exit the sandbox to return to the host:
exit
If accessing the Web UI directly on the DGX Station (keyboard and monitor attached), open a browser and navigate to the tokenized URL from Step 4. Prefer 127.0.0.1 in the URL bar (not localhost) so it matches strict gateway origin checks:
http://127.0.0.1:18789/#token=<long-token-here>
If accessing the Web UI from a remote machine, you need to set up port forwarding.
First, find your DGX Station's IP address. On the Station, run:
hostname -I | awk '{print $1}'
Start the port forward on the DGX Station host:
openshell forward start 18789 my-assistant --background
Expected:
Forwarding 127.0.0.1:18789 -> my-assistant:18789 (background)
If the forward was already started during onboarding, you will see:
Error: Port 18789 is already forwarded to sandbox 'my-assistant'.
This is fine -- the forward is already running.
Then from your remote machine, create an SSH tunnel to the Station (replace <your-station-ip> with the IP address from above):
ssh -L 18789:127.0.0.1:18789 <your-user>@<your-station-ip>
Now open the tokenized URL in your remote machine's browser. Through the tunnel, both 127.0.0.1 and localhost usually work on the client side because both resolve to your loopback interface, but 127.0.0.1 is the safer choice:
http://127.0.0.1:18789/#token=<long-token-here>
Important
Use `127.0.0.1`, not `localhost` -- the gateway origin check requires an exact match.
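If a bookmarked or pasted URL uses `localhost`, it can be rewritten to the loopback IP without touching the port or token. This is a small sketch with a hypothetical function name; it leaves any other host unchanged.

```python
from urllib.parse import urlsplit, urlunsplit


def use_loopback_ip(url: str) -> str:
    """Replace a 'localhost' host with 127.0.0.1, keeping port, path, and fragment."""
    parts = urlsplit(url)
    if parts.hostname != "localhost":
        return url
    netloc = "127.0.0.1" if parts.port is None else f"127.0.0.1:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))
```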
Phase 3: Telegram Bot
Messaging (Telegram, Discord, Slack) is wired during onboarding — credentials are stored, OpenShell providers are created, and channel configuration is baked into the sandbox image. Runtime config under /sandbox/.openclaw/ is not safely patchable from inside the running sandbox.
nemoclaw start does not start the Telegram bridge. In current NemoClaw releases it starts optional host services such as the cloudflared tunnel when installed; Telegram delivery stays under OpenShell. See NemoClaw commands and Set up Telegram bridge.
Step 9. Create a Telegram bot
Open Telegram, find @BotFather, send /newbot, and follow the prompts. Copy the bot token.
Tip: Use Telegram Desktop or web.telegram.org so you can copy-paste the token into your terminal or env file instead of typing 46+ characters from your phone into SSH.
Step 10. Enable Telegram (first time or after skipping it)
Path A — You have not installed yet, or you can re-run onboard
Export the token on the host, then run the installer / onboard again (non-interactive variables from Step 4, plus TELEGRAM_BOT_TOKEN). The wizard’s Messaging channels step (installer phase [5/8]) is the right time to toggle Telegram interactively.
Re-onboarding after a sandbox exists is supported; NemoClaw can detect token changes and rebuild the sandbox — see the official Telegram bridge page.
Path B — NemoClaw is already installed (recommended host command)
On the host (run exit if you are inside nemoclaw … connect):
- Allow outbound access to the Telegram API if you have not already -- add the `telegram` network preset:
nemoclaw my-assistant policy-add
When prompted, select telegram and confirm.
- Register the bot token and rebuild the sandbox image so Telegram is included:
export TELEGRAM_BOT_TOKEN='<your-bot-token>'
nemoclaw my-assistant channels add telegram
Follow the prompts to rebuild when asked (or run nemoclaw my-assistant rebuild --yes afterward if non-interactive mode queued a rebuild — see NEMOCLAW_NON_INTERACTIVE=1 behavior in the commands reference).
- Pause or resume Telegram delivery without changing credentials: use the `nemoclaw channels stop` / `nemoclaw channels start` patterns for the `telegram` channel described in Set up Telegram bridge (exact subcommand spelling may vary slightly by NemoClaw version; use `nemoclaw --help` if in doubt).
Check overall status:
nemoclaw status
Open Telegram, find your bot, and send it a message.
Note
The first response may take 30--90 seconds for a 120B parameter model running locally.
Note
To persist `TELEGRAM_BOT_TOKEN` for shell-based flows, use a `chmod 600` env file and `source` it from `~/.bashrc` as shown in Step 4.
Note
For chat allowlists and advanced Telegram behavior, see NemoClaw Telegram bridge documentation.
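If you set `TELEGRAM_ALLOWED_IDS`, a quick sanity check of the value before onboarding can catch typos. The sketch below assumes only what Step 4 states, namely that the variable holds comma-separated numeric Telegram user IDs. How NemoClaw itself parses the value is not covered here, and the function name is hypothetical.

```python
def parse_allowed_ids(raw: str) -> list:
    """Split a comma-separated ID string into integers, rejecting non-numeric parts."""
    ids = []
    for part in raw.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas and surrounding spaces
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"not a numeric Telegram user ID: {part!r}")
        ids.append(int(part))
    return ids
```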
Phase 4: Cleanup and Uninstall
Step 11. Stop services
Stop any running host auxiliary services (for example the cloudflared tunnel):
nemoclaw stop
Expected:
[services] All services stopped.
Stop the port forward (always pass port and sandbox name):
openshell forward list
openshell forward stop 18789 my-assistant
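Stop and remove the vLLM container so the name vllm-nemotron is free for a future run. docker stop alone is not enough: a stopped container still exists and keeps its name, so a later docker run --name vllm-nemotron fails with a name conflict. Because the playbook created the container with --restart unless-stopped, clear the restart policy first so Docker cannot bring the container back (for example after a daemon restart) before you remove it.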
docker update --restart=no vllm-nemotron 2>/dev/null || true
docker stop vllm-nemotron
docker rm vllm-nemotron
To remove the container in one step even if it is running: docker rm -f vllm-nemotron.
Step 12. Uninstall NemoClaw
Run the uninstaller from the cloned source directory. It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and vLLM are preserved.
cd ~/.nemoclaw/source
./uninstall.sh
Uninstaller flags:
| Flag | Effect |
|---|---|
| `--yes` | Skip the confirmation prompt |
| `--keep-openshell` | Leave the openshell binary in place |
| `--delete-models` | Removes local inference models pulled by older NemoClaw flows (the upstream flag name still references Ollama). It does not remove Hugging Face weights used by this playbook’s vLLM container; delete those separately (below). |
To also remove the vLLM container and cached model weights:
./uninstall.sh --yes
docker rm -f vllm-nemotron 2>/dev/null || true
rm -rf ~/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/
The uninstaller runs 6 steps:
- Stop NemoClaw helper services and port-forward processes
- Delete all OpenShell sandboxes, the NemoClaw gateway, and providers
- Remove the global `nemoclaw` npm package
- Remove NemoClaw/OpenShell Docker containers, images, and volumes
- Remove Ollama models (only with `--delete-models`)
- Remove state directories (`~/.nemoclaw`, `~/.config/openshell`, `~/.config/nemoclaw`) and the OpenShell binary
Note
The source clone at `~/.nemoclaw/source` is removed as part of state cleanup in step 6. If you want to keep a local copy, move or back it up before running the uninstaller.
Useful commands
| Command | Description |
|---|---|
| `nemoclaw my-assistant connect` | Shell into the sandbox |
| `nemoclaw my-assistant status` | Show sandbox status and inference config |
| `nemoclaw my-assistant logs --follow` | Stream sandbox logs in real time |
| `nemoclaw list` | List all registered sandboxes |
| `nemoclaw tunnel start` | Start optional host services such as cloudflared (public dashboard URL when installed); does not start Telegram |
| `nemoclaw start` | Deprecated alias for tunnel/aux host services -- not for Telegram |
| `nemoclaw stop` | Stop host auxiliary services started by `nemoclaw tunnel start` / `nemoclaw start` |
| `nemoclaw <sandbox> channels add telegram` | Store Telegram token and rebuild sandbox (host) |
| `openshell term` | Open the monitoring TUI on the host |
| `openshell forward list` | List active port forwards |
| `openshell forward start 18789 my-assistant --background` | Start port forwarding for Web UI |
| `openshell forward stop 18789 my-assistant` | Stop Web UI port forward |
| `docker logs -f vllm-nemotron` | Stream vLLM inference server logs |
| `docker restart vllm-nemotron` | Restart the vLLM inference server |
| `curl http://localhost:8000/v1/models` | Check vLLM API status |
| `cd ~/.nemoclaw/source && ./uninstall.sh` | Remove NemoClaw (preserves Docker, Node.js, vLLM image) |
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| `openclaw agent --local` fails or is blocked inside the sandbox | `--local` bypasses the NemoClaw gateway and is disallowed in the OpenShell sandbox | Use gateway mode: `openclaw agent --agent main -m "hello" --session-id test` (no `--local`). |
| Onboard fails with “K8s namespace not ready” (or similar) with no clear reason | Often low disk space on `/` or Docker’s data root; image push / k3s need headroom | Run `df -h / /var/lib/docker`. Free at least ~40 GB (see NemoClaw quickstart prerequisites); prune Docker (`docker system prune`) or expand disk, then retry onboard. |
| vLLM warns about mixed devices or loads on an unexpected GPU | Multiple GPUs visible; default visibility does not match intent | Pin one GPU: --gpus '"device=0"' and -e CUDA_VISIBLE_DEVICES=0 with --tensor-parallel-size 1, or use two GPUs explicitly with --tensor-parallel-size 2 and -e CUDA_VISIBLE_DEVICES=0,1 (see Step 3 in instructions). |
| `nemoclaw: command not found` after install | Shell PATH not updated | Run `source ~/.bashrc` (or `source ~/.zshrc` for zsh), or open a new terminal window. |
| `pip: command not found` | pip not installed on DGX Station by default | Install pip: `sudo apt install -y python3-pip`. Then use `pip3 install --break-system-packages huggingface-hub`. |
| `huggingface-cli` is deprecated | Hugging Face CLI was renamed | Use `hf download` instead of `huggingface-cli download`. |
| vLLM container won't start or crashes | GPU memory issue or wrong image | Check logs: docker logs vllm-nemotron. If CUDA OOM, reduce context: recreate the container with --max-model-len 8192. Ensure you are using the NVIDIA container image (nvcr.io/nvidia/vllm:26.03-py3), not the community vllm/vllm-openai image. |
| vLLM logs show `Application startup complete.` but `curl` times out | vLLM still compiling CUDA graphs after startup | Wait 1--2 minutes after `Application startup complete.` before sending requests. The first request compiles CUDA graphs and may take 30--90 seconds. |
| NemoClaw onboard fails with "endpoint validation failed" | vLLM model not warmed up or validation timeout too short | Warm up the model first: curl -s --max-time 120 http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","messages":[{"role":"user","content":"hello"}],"max_tokens":10}'. Then re-run with NEMOCLAW_EXPERIMENTAL=1 NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 nemoclaw onboard. |
| NemoClaw reports "provider 'vllm' is not available" | Missing experimental flag | Set NEMOCLAW_EXPERIMENTAL=1 before running the installer or nemoclaw onboard. The vLLM provider is currently an experimental feature. |
| Docker permission denied | User not in docker group | sudo usermod -aG docker $USER, then log out and back in. |
| Gateway fails with cgroup / "Failed to start ContainerManager" errors | Docker not configured for host cgroup namespace on DGX Station | Run the cgroup fix: sudo python3 -c "import json, os; path='/etc/docker/daemon.json'; d=json.load(open(path)) if os.path.exists(path) else {}; d['default-cgroupns-mode']='host'; json.dump(d, open(path,'w'), indent=2)" then sudo systemctl restart docker. |
| Gateway fails with "port 8080 is held by container..." | Another OpenShell gateway or container is using port 8080 | Stop the conflicting container: openshell gateway destroy -g <old-gateway-name> or docker stop <container-name> && docker rm <container-name>, then retry nemoclaw onboard. |
| Sandbox cannot reach the inference server | Using `localhost` instead of `host.openshell.internal` in endpoint URL | Inside the sandbox, `localhost` refers to the sandbox container, not the host. The onboard wizard configures `host.openshell.internal` automatically. Verify from inside the sandbox: `curl -sf https://inference.local/v1/models`. If this fails, check that vLLM is reachable from the host: `curl -s http://localhost:8000/v1/models`. |
| Agent gives no response or is very slow | Normal for 120B model running locally | Nemotron 3 Super 120B can take 30--90 seconds per response. Verify inference route: nemoclaw my-assistant status. |
| vLLM API returns empty or errors on tool calls | Missing tool-call flags | Verify that --enable-auto-tool-choice and --tool-call-parser qwen3_xml are set: docker inspect vllm-nemotron --format '{{.Config.Cmd}}'. |
| Port 18789 already in use | Another process is bound to the port | lsof -i :18789 then kill <PID>. If needed, kill -9 <PID> to force-terminate. |
| Web UI port forward dies or dashboard unreachable | Port forward not active | openshell forward stop 18789 my-assistant then openshell forward start 18789 my-assistant --background. Always pass port and sandbox name to openshell forward stop. |
| Web UI shows `origin not allowed` | Browser origin does not match what the gateway expects | On the DGX Station local desktop, open `http://127.0.0.1:18789/#token=...` (not `localhost`). Through an SSH tunnel on another machine, `localhost` vs `127.0.0.1` in the client browser usually both work because the check applies to how you reach the forwarded port locally. |
| Telegram does not work after install; `nemoclaw start` does nothing for Telegram | `nemoclaw start` starts optional host services (e.g. cloudflared), not the Telegram bridge | Configure Telegram during onboard, or on the host run `nemoclaw my-assistant channels add telegram` (and rebuild), after `policy-add` for the `telegram` preset. See Set up Telegram bridge. |
| Telegram bot receives messages but does not reply | Telegram policy not added to sandbox | Run nemoclaw my-assistant policy-add, type telegram, hit Y. Ensure the channel was added with nemoclaw my-assistant channels add telegram so the image includes Telegram. |
| `docker: Error response from daemon: Conflict. The container name "/vllm-nemotron" is already in use` | Previous cleanup used `docker stop` only | `docker rm -f vllm-nemotron` (or `docker update --restart=no` then `docker stop` and `docker rm`). The playbook uses `--restart unless-stopped`; stopping alone leaves a restart policy and reserved name. |
Model variant guidance:
| Variant | Size | VRAM Required | When to Use |
|---|---|---|---|
| NVFP4 | ~60 GB | ~80 GB | Default for DGX Station (GB300). Fits on single GPU with room for large KV cache. |
| FP8 | ~120 GB | ~140 GB | Higher accuracy, still fits on GB300. Add `--kv-cache-dtype fp8` to the vLLM command. |
| BF16 | ~240 GB | ~260 GB | Highest accuracy. Fits on GB300 but leaves little room for KV cache. Reduce `--max-model-len`. |
For the latest known issues, see DGX Station documentation.