From 1d1a95b3cb74fe5db1a40a5531d966265796e726 Mon Sep 17 00:00:00 2001 From: GitLab CI Date: Sat, 30 May 2026 11:49:27 +0000 Subject: [PATCH] chore: Regenerate all playbooks --- nvidia/station-ai-skills/README.md | 327 +++++++ nvidia/station-ai-skills/assets/.DS_Store | Bin 0 -> 6148 bytes .../assets/.codex/prompts/dgx-diagnose.md | 85 ++ .../assets/.codex/prompts/mig-configure.md | 103 +++ .../assets/.codex/prompts/sglang-setup.md | 115 +++ .../assets/.codex/prompts/vllm-setup.md | 74 ++ nvidia/station-ai-skills/assets/AGENTS.md | 81 ++ nvidia/station-ai-skills/assets/install.sh | 218 +++++ .../assets/skills/dgx-diagnose/SKILL.md | 92 ++ .../assets/skills/mig-configure/SKILL.md | 110 +++ .../assets/skills/sglang-setup/SKILL.md | 122 +++ .../assets/skills/vllm-setup/SKILL.md | 81 ++ nvidia/station-ai-skills/endpoint-test.yaml | 413 +++++++++ nvidia/station-ai-skills/overview.md | 2 + nvidia/station-brev/README.md | 36 +- nvidia/station-brev/endpoint-test.yaml | 39 +- .../station-nanochat/endpoint-production.yaml | 193 ++-- nvidia/station-nemoclaw/README.md | 832 ++++++----------- nvidia/station-nemoclaw/endpoint-test.yaml | 840 ++++++------------ nvidia/station-vllm/endpoint-test.yaml | 176 +++- 20 files changed, 2709 insertions(+), 1230 deletions(-) create mode 100644 nvidia/station-ai-skills/README.md create mode 100644 nvidia/station-ai-skills/assets/.DS_Store create mode 100644 nvidia/station-ai-skills/assets/.codex/prompts/dgx-diagnose.md create mode 100644 nvidia/station-ai-skills/assets/.codex/prompts/mig-configure.md create mode 100644 nvidia/station-ai-skills/assets/.codex/prompts/sglang-setup.md create mode 100644 nvidia/station-ai-skills/assets/.codex/prompts/vllm-setup.md create mode 100644 nvidia/station-ai-skills/assets/AGENTS.md create mode 100644 nvidia/station-ai-skills/assets/install.sh create mode 100644 nvidia/station-ai-skills/assets/skills/dgx-diagnose/SKILL.md create mode 100644 nvidia/station-ai-skills/assets/skills/mig-configure/SKILL.md 
create mode 100644 nvidia/station-ai-skills/assets/skills/sglang-setup/SKILL.md create mode 100644 nvidia/station-ai-skills/assets/skills/vllm-setup/SKILL.md create mode 100644 nvidia/station-ai-skills/endpoint-test.yaml create mode 100644 nvidia/station-ai-skills/overview.md diff --git a/nvidia/station-ai-skills/README.md b/nvidia/station-ai-skills/README.md new file mode 100644 index 0000000..a7fda9e --- /dev/null +++ b/nvidia/station-ai-skills/README.md @@ -0,0 +1,327 @@ +# DGX Station AI Skills for Coding Agents + +> Give your coding agent (Claude Code, Codex, Gemini CLI, Cursor) DGX Station expertise via an AGENTS.md and on-demand Agent Skills + + +## Table of Contents + +- [Overview](#overview) + - [AGENTS.md vs Agent Skill — why split?](#agentsmd-vs-agent-skill-why-split) +- [Instructions](#instructions) + - [Project-specific](#project-specific) +- [Troubleshooting](#troubleshooting) + +--- + +## Overview + +## Basic idea + +Modern coding agents — Claude Code, OpenAI Codex CLI, Gemini CLI, Cursor — all support two extension mechanisms: a project-level **context file** that's loaded into every conversation, and **on-demand procedural workflows** (called skills, prompts, commands, or rules depending on the harness). This playbook ships both for DGX Station: + +- An **`AGENTS.md`** with the critical DGX Station constraints your agent should always know (mixed coherency, GPU targeting, common pitfalls). `AGENTS.md` is the cross-harness standard; an `install.sh` lays it down as `CLAUDE.md`, `GEMINI.md`, or `AGENTS.md` depending on the agent you use. +- **Four Agent Skills** — `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose` — authored once in the [Anthropic Agent Skills format](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview) and installed into the right per-harness location (`.claude/skills/`, `.codex/prompts/`, `.gemini/commands/`, or `.cursor/rules/`). 
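The per-harness mapping described in the two bullets above can be restated as a small lookup. This is only a sketch: the function name `harness_layout` is hypothetical, and it is not the logic of the shipped `install.sh`; it simply echoes the context file name and skill directory this README associates with each agent.

```shell
# Hypothetical helper (not part of install.sh): print the context file
# and the skill directory that this README lists for each harness.
harness_layout() {
  case "$1" in
    claude) echo "CLAUDE.md .claude/skills" ;;
    codex)  echo "AGENTS.md .codex/prompts" ;;
    gemini) echo "GEMINI.md .gemini/commands" ;;
    cursor) echo "AGENTS.md .cursor/rules" ;;
    *)      echo "unknown harness: $1" >&2; return 1 ;;
  esac
}
```

For example, `harness_layout gemini` prints `GEMINI.md .gemini/commands`, which is a quick way to know which two paths to inspect (or delete, for rollback) in a project.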
+ +This approach keeps your agent's context lean in every conversation while giving it deep procedural knowledge on demand, regardless of which agent you use. + +### AGENTS.md vs Agent Skill — why split? + +| | AGENTS.md | Agent Skill | +|---|---|---| +| **Loaded** | Every conversation, automatically | Only when invoked by name (or matched by description, in Claude) | +| **Best for** | Constraints, pitfalls, "never do X" rules | Step-by-step workflows, deployment procedures | +| **Context cost** | Consumed every time | Zero until invoked | + +The DGX Station mixed-coherency constraint (`--gpus all` will crash) should be in every conversation. The full vLLM deployment procedure should not. + +## What you'll accomplish + +- Install the `AGENTS.md` and four Agent Skills into your project directory for your chosen agent (Claude Code, Codex, Gemini CLI, or Cursor). +- Verify the agent loads the constraints automatically and the skills on demand. +- Invoke `vllm-setup` to deploy a vLLM inference server with validated configuration. +- Invoke `sglang-setup` to deploy an SGLang inference server. +- Invoke `mig-configure` to partition the GB300 into MIG instances. +- Invoke `dgx-diagnose` to troubleshoot common DGX Station issues. 
+ +## What to know before starting + +- Basic familiarity with one supported coding agent (running it, giving it prompts, using slash commands or rule references) +- General understanding of DGX Station (two GPUs, Docker-based workflows) + +## Prerequisites + +- NVIDIA DGX Station with GB300 +- One of the supported coding agents installed: + - **Claude Code:** `curl -fsSL https://claude.ai/install.sh | sh` + - **OpenAI Codex CLI:** `npm i -g @openai/codex` + - **Gemini CLI:** `npm i -g @google/gemini-cli` + - **Cursor:** download from `https://cursor.com/` +- A project directory where you do DGX Station work + +## Ancillary files + +- `assets/AGENTS.md` — canonical context file with critical constraints, GPU targeting, software versions, and common pitfalls. Cross-harness standard. +- `assets/skills/vllm-setup/SKILL.md` — skill: deploy vLLM with validated configuration. +- `assets/skills/sglang-setup/SKILL.md` — skill: deploy SGLang with validated configuration. +- `assets/skills/mig-configure/SKILL.md` — skill: configure MIG partitions on the GB300. +- `assets/skills/dgx-diagnose/SKILL.md` — skill: troubleshoot common DGX Station issues. +- `assets/install.sh` — per-harness installer (`claude`, `codex`, `gemini`, `cursor`, or `all`). + +## Time & risk + +* **Duration:** 10-15 minutes +* **Risk level:** Low — this playbook copies markdown files into your project directory +* **Rollback:** Delete the context file (`AGENTS.md` / `CLAUDE.md` / `GEMINI.md`) and the harness-specific skill directory (`.claude/skills/`, `.codex/prompts/`, `.gemini/commands/`, or `.cursor/rules/`) from your project directory +* **Last Updated:** 05/18/2026 + * Restructured as harness-agnostic Agent Skills (Claude Code, Codex, Gemini CLI, Cursor) + +## Instructions + +## Step 1. Install your coding agent + +Pick whichever agent you prefer — the rest of this playbook works the same regardless. 
Install commands: + +| Agent | Install | +|-------|---------| +| Claude Code | `curl -fsSL https://claude.ai/install.sh \| sh` | +| OpenAI Codex CLI | `npm i -g @openai/codex` | +| Gemini CLI | `npm i -g @google/gemini-cli` | +| Cursor | Download from `https://cursor.com/` | + +Verify with `claude --version`, `codex --version`, `gemini --version`, or by launching Cursor. + +## Step 2. Install the skills into your project + +Navigate to the project where you want DGX Station expertise, then run the installer with the harness you use: + +```bash +cd ~/your-project + +## Pick one: +/path/to/this/playbook/assets/install.sh claude +/path/to/this/playbook/assets/install.sh codex +/path/to/this/playbook/assets/install.sh gemini +/path/to/this/playbook/assets/install.sh cursor + +## Or install for all four at once: +/path/to/this/playbook/assets/install.sh all +``` + +If you downloaded the playbook as a zip, the path is relative to the extracted directory: + +```bash +station-ai-skills/assets/install.sh claude ~/your-project +``` + +The installer is additive for skill directories (won't clobber existing skills you've written) and refuses to overwrite an existing context file (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`) unless you pass `--force`. + +**Resulting layout** (per harness): + +```text +your-project/ + AGENTS.md or CLAUDE.md or GEMINI.md # context file (named for your agent) + .claude/skills/<name>/SKILL.md # claude + .codex/prompts/<name>.md # codex + .gemini/commands/<name>.md # gemini + .cursor/rules/<name>.mdc # cursor +``` + +Where `<name>` is each of `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose`. + +> [!NOTE] +> Every supported agent automatically reads the context file from the working directory at startup. Skills/prompts/rules in the harness-specific directory are discovered automatically — no additional configuration needed. + +## Step 3. 
Verify the setup + +Start your agent in the project directory and ask a question that requires constraint knowledge: + +```text +Can I use --gpus all to run my CUDA workload on DGX Station? +``` + +The agent should immediately warn about the mixed-coherency constraint and recommend `--gpus '"device=N"'` targeting. If you don't get the warning, the context file isn't being loaded — see Troubleshooting. + +Then verify the skills are discoverable: + +| Agent | How to check | +|-------|--------------| +| Claude Code | Type `/` — `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose` should appear in the autocomplete | +| Codex CLI | Type `/prompts:` — same four names appear | +| Gemini CLI | Type `/` — same four names appear | +| Cursor | Open the Rules panel — same four rules appear | + +## Step 4. Use vllm-setup to deploy an inference server + +Invoke the skill in your agent: + +| Agent | Invocation | +|-------|-----------| +| Claude Code | `/vllm-setup` (slash command) or just describe the task ("deploy vllm with Qwen3-8B") | +| Codex CLI | `/prompts:vllm-setup` | +| Gemini CLI | `/vllm-setup` | +| Cursor | In chat: "use the vllm-setup rule to deploy a vllm server" | + +The agent will walk you through deploying a vLLM server with a validated container image, correct GPU targeting, and recommended parameters. It will check your GPU index, ask which model you want to serve, and generate the full `docker run` command. + +## Step 5. Use sglang-setup to deploy SGLang + +Same invocation pattern, but for SGLang with the `cu130` container, RadixAttention prefix caching, and structured JSON output support. + +## Step 6. Use mig-configure to partition the GB300 + +The agent will query your current MIG state, show available profiles, help you choose a layout for your workloads, and execute the partitioning commands. + +## Step 7. Use dgx-diagnose to troubleshoot issues + +If you encounter problems, invoke `dgx-diagnose`. 
The agent will check GPU status, driver version, running processes, MIG state, and Fabric Manager to identify the issue. + +## Step 8. Customize + +Both the `AGENTS.md` and the skills are plain markdown — extend them freely. + +**Add project-specific constraints to `AGENTS.md`** (or your harness-specific context file): + +```markdown +### Project-specific + +- Our production MIG layout is 3g.139gb + 2g.70gb + 2g.70gb +- Always use port 8080 for inference (nginx proxy on 443) +- Model weights are cached at /data/models, mount with -v /data/models:/root/.cache/huggingface/hub +``` + +**Create new skills** by adding a directory and `SKILL.md` to `assets/skills/`, then re-run `install.sh`: + +```bash +mkdir -p assets/skills/run-benchmarks +cat > assets/skills/run-benchmarks/SKILL.md << 'EOF' +--- +name: run-benchmarks +description: Run our standard inference benchmark suite against the running vLLM or SGLang server and compare against the baseline. +--- + +## Run benchmarks + +1. Check which inference server is running (vLLM on port 8000 or SGLang on port 30000) +2. Run the appropriate benchmark script from ./benchmarks/ +3. Report throughput (tokens/sec), latency (TTFT, ITL), and memory utilization +4. Compare against the baseline in ./benchmarks/baseline.json +EOF +``` + +> [!TIP] +> Keep `AGENTS.md` focused on constraints and pitfalls (things that break). Put procedural workflows in skills (things you do step-by-step). + +## Troubleshooting + +## Skills don't appear in autocomplete / aren't discoverable + +Each agent discovers skills from a harness-specific directory in the current directory (or a parent). 
Check the right one: + +| Agent | Expected location | +|-------|-------------------| +| Claude Code | `.claude/skills/<name>/SKILL.md` | +| Codex CLI | `.codex/prompts/<name>.md` | +| Gemini CLI | `.gemini/commands/<name>.md` | +| Cursor | `.cursor/rules/<name>.mdc` | + +```bash +## Examples — check the directory for your agent +ls -la .claude/skills/ +ls -la .codex/prompts/ +ls -la .gemini/commands/ +ls -la .cursor/rules/ +``` + +You should see entries for `vllm-setup`, `sglang-setup`, `mig-configure`, and `dgx-diagnose`. + +**Check you're in the right directory:** + +```bash +pwd +``` + +The agent must be started from the directory containing the harness directory, or a subdirectory of it. + +## Context file not loaded + +If the agent gives generic answers without DGX Station awareness, the context file isn't being picked up. Each agent reads a different filename — verify the one for your agent exists: + +| Agent | Expected filename | +|-------|-------------------| +| Claude Code | `CLAUDE.md` (also reads `AGENTS.md` as fallback) | +| Codex CLI | `AGENTS.md` | +| Gemini CLI | `GEMINI.md` | +| Cursor | `AGENTS.md` | + +```bash +## Verify the file exists for your agent +head -5 AGENTS.md +head -5 CLAUDE.md +head -5 GEMINI.md + +## Restart the agent in the correct directory +cd ~/your-project +claude # or codex, gemini, etc. +``` + +All four agents read the context file from the working directory (and parent directories up to the project root). + +## Skill gives outdated information + +The skills contain validated container versions and parameters as of the publication date. 
If a newer container is available, edit the canonical source and re-install: + +```bash +nano /path/to/playbook/assets/skills/vllm-setup/SKILL.md +/path/to/playbook/assets/install.sh all --force +``` + +Or edit the installed copy directly: + +```bash +## Claude Code +nano .claude/skills/vllm-setup/SKILL.md +## Codex +nano .codex/prompts/vllm-setup.md +## Gemini CLI +nano .gemini/commands/vllm-setup.md +## Cursor +nano .cursor/rules/vllm-setup.mdc +``` + +> [!TIP] +> Skills are plain markdown — you can version them in git alongside your project code. + +## "Both GPUs cannot be used" errors + +This is the mixed-coherency constraint working as intended. If you see CUDA errors when using `--gpus all`: + +```bash +## Find the GB300 index +nvidia-smi --query-gpu=index,name --format=csv,noheader + +## Use device-specific targeting +docker run --gpus '"device=1"' ... +``` + +The `AGENTS.md` covers this constraint, but if you removed that section, add it back — it's the most important piece of DGX Station knowledge. + +## Skills conflict with existing project directory + +If your project already has a `.claude/`, `.codex/`, `.gemini/`, or `.cursor/` directory with its own contents, `install.sh` is **additive** for skill directories — it adds the new skill files alongside whatever you already have and warns on collision rather than overwriting. + +For context files (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`), the installer **refuses** to overwrite an existing file. Pass `--force` to override, or merge the new content manually: + +```bash +## See what would be written +diff /path/to/playbook/assets/AGENTS.md ./AGENTS.md + +## Force overwrite +/path/to/playbook/assets/install.sh claude . --force +``` + +## Installer reports "WROTE" for some files but "SKIP" for others + +That's the safe-by-default behavior. The installer skips any file that already exists, prints a warning, and continues with the rest. To get a clean install, either: + +1. 
Delete the existing files first: `rm -rf .claude/skills/{vllm-setup,sglang-setup,mig-configure,dgx-diagnose}` +2. Or pass `--force` (only affects context files; skill files are still skipped if present) diff --git a/nvidia/station-ai-skills/assets/.DS_Store b/nvidia/station-ai-skills/assets/.DS_Store new file mode 100644 index 0000000000000000000000000000000000000000..2fb6377bff85d12be82b8118ce2ec2e66c362d38 GIT binary patch literal 6148 zcmeHKI|>3p3{6x-u-VdbuHX#@(Gz$9Q5$i=VzJ-Kb9pphJ_xc}SlGx5ByT2@H_N_a zvk?(pU5`tVMnqa0lP&G#k4Lb5A>uO>j7v z|1kf*lDMM+RN$`^(8+qcUg4Fpw+>#;dToL4;8t^on_=w~1aHScZ^zhJJ6?EE)D>If Wye9U6PDkG9K>iGvE;K6eYXu&OGZm=- literal 0 HcmV?d00001 diff --git a/nvidia/station-ai-skills/assets/.codex/prompts/dgx-diagnose.md b/nvidia/station-ai-skills/assets/.codex/prompts/dgx-diagnose.md new file mode 100644 index 0000000..1c4bf59 --- /dev/null +++ b/nvidia/station-ai-skills/assets/.codex/prompts/dgx-diagnose.md @@ -0,0 +1,85 @@ + +# DGX Station Diagnostics + +Diagnose common DGX Station issues. Run through the checks below to identify the problem. + +## Step 1. Gather system state + +Run these commands and analyze the output: + +```bash +# GPU status +nvidia-smi + +# GPU device list with indices +nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader + +# Driver version +nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1 + +# MIG state +nvidia-smi -i 1 -q 2>/dev/null | grep -i "MIG Mode" || echo "Could not query MIG on device 1" + +# Fabric Manager +systemctl is-active nvidia-fabricmanager + +# GPU processes +sudo fuser -v /dev/nvidia* 2>/dev/null || echo "No GPU processes found" + +# Docker containers using GPUs +docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}" 2>/dev/null +``` + +## Step 2. 
Match symptoms to known issues + +Based on the gathered state and the user's reported problem, check for these known issues: + +### CUDA crashes with `--gpus all` +**Cause:** Mixed coherency — GB300 (ATS) and RTX PRO (non-ATS) cannot share a CUDA context. +**Fix:** Use `--gpus '"device=N"'` targeting only the GB300. + +### Model running on wrong GPU (RTX PRO instead of GB300) +**Check:** The device index in the docker command vs actual GPU indices. +**Fix:** Verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader` and correct the `--gpus` flag. + +### vLLM crash / FlashInfer buffer overflow +**Check:** Container version — `docker inspect vllm-server | grep Image` +**Fix:** Use `nvcr.io/nvidia/vllm:26.01-py3`. Version 25.10 has a known FlashInfer bug on DGX Station. + +### SGLang CUDA errors +**Check:** Container tag — must be `cu130` for Blackwell SM103. +**Fix:** Use `lmsysorg/sglang:latest-cu130`. + +### CUDA OOM despite 279 GB HBM +**Check:** `--max-model-len` / `--context-length` and memory utilization settings. +**Fix:** Reduce context length or lower `--gpu-memory-utilization` / `--mem-fraction-static`. + +### `nvidia-smi -mig 1` returns "In use by another client" +**Check:** `sudo fuser -v /dev/nvidia*` — GPU processes must be stopped first. +**Fix:** Stop all GPU workloads, then retry. + +### NVLink errors after disabling MIG +**Check:** `systemctl is-active nvidia-fabricmanager` +**Fix:** `sudo systemctl start nvidia-fabricmanager` + +### X server crash after nvidia-xconfig -a +**Fix:** `sudo cp /etc/X11/xorg.conf.nvidia-xconfig-original /etc/X11/xorg.conf` + +### Vulkan VK_ERROR_INITIALIZATION_FAILED +**Cause:** CUDA initialized before Vulkan, binding to GB300. +**Fix:** Run CUDA and Vulkan workloads in separate processes. For Vulkan apps: `__GL_DeviceModalityPreference=2 ./your_app` + +### HuggingFace 401 / token errors +**Fix:** Pass token inline: `-e HF_TOKEN="hf_..."`. Don't rely on shell export for background Docker tasks. 
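As an illustration of the inline-token advice above, a pre-flight check can fail fast when no token is supplied instead of letting the container hit a 401 later. This is a minimal sketch; the function name `hf_token_flag` is hypothetical and not part of the playbook.

```shell
# Hypothetical pre-flight helper: refuse to continue without a token,
# otherwise print the flag to splice into the `docker run` command.
hf_token_flag() {
  token="$1"
  if [ -z "$token" ]; then
    echo "HF_TOKEN is empty; gated models will likely return 401" >&2
    return 1
  fi
  printf '%s\n' "-e HF_TOKEN=$token"
}
```

The printed flag is meant to be passed explicitly on the `docker run` line, so the container does not depend on a variable exported in some other shell session.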
+ +### Port already in use +**Check:** `lsof -i :<port>` +**Fix:** Stop the conflicting process or use a different host port: `-p 8001:8000`. + +## Step 3. Report findings + +Tell the user: +1. What the issue is +2. Why it happens (root cause) +3. The specific command to fix it +4. How to verify the fix worked diff --git a/nvidia/station-ai-skills/assets/.codex/prompts/mig-configure.md b/nvidia/station-ai-skills/assets/.codex/prompts/mig-configure.md new file mode 100644 index 0000000..9d9ef62 --- /dev/null +++ b/nvidia/station-ai-skills/assets/.codex/prompts/mig-configure.md @@ -0,0 +1,103 @@ + +# MIG Configuration on DGX Station + +Configure MIG (Multi-Instance GPU) partitions on the DGX Station GB300. + +## Steps + +1. **Find the GB300 GPU index.** Run: + ```bash + nvidia-smi --query-gpu=index,name --format=csv,noheader + ``` + +2. **Check current MIG state:** + ```bash + nvidia-smi -i <index> -q | grep -i "MIG Mode" + ``` + +3. **If MIG is already enabled, show current instances:** + ```bash + nvidia-smi mig -lgi -i <index> + nvidia-smi mig -lci -i <index> + ``` + If the user wants to reconfigure, destroy existing instances first (step 6). + +4. **If MIG is not enabled, enable it.** All GPU processes must be stopped first: + ```bash + # Check for running GPU processes + sudo fuser -v /dev/nvidia* + + # Enable MIG + sudo nvidia-smi -i <index> -mig 1 + + # Verify + nvidia-smi -i <index> -q | grep -i "MIG Mode" + ``` + +5. 
**Show available profiles and help the user choose a layout:** + ```bash + nvidia-smi mig -lgip -i <index> + ``` + + Common GB300 MIG profiles: + + | Profile | ID | Memory | Use case | + |---------|----|--------|----------| + | 1g.35gb | 19 | ~35 GB | Small models (7-8B), dev/test | + | 1g.35gb+me | 20 | ~35 GB | Same + media extensions | + | 1g.70gb | 15 | ~70 GB | Slightly larger inference | + | 2g.70gb | 14 | ~70 GB | Medium models (14-30B) | + | 3g.139gb | 9 | ~139 GB | Large models (70B quantized) | + | 4g.139gb | 5 | ~139 GB | Large models, more compute | + | 7g.278gb | 0 | ~278 GB | Full GPU as single instance | + + Suggest layouts based on the user's workload. Examples: + - **Two models (70B + 8B):** `3g.139gb + 2g.70gb + 2g.70gb` → IDs `9,14,14` + - **Many small models:** `7 × 1g.35gb` → IDs `19,19,19,19,19,19,19` + - **One large model with isolation:** `7g.278gb` → ID `0` + + Ask the user what models they want to run before suggesting a layout. + +6. **Create (or recreate) instances:** + + If reconfiguring, destroy existing instances first: + ```bash + sudo nvidia-smi mig -dci -i <index> + sudo nvidia-smi mig -dgi -i <index> + ``` + + Then create the new layout: + ```bash + sudo nvidia-smi mig -cgi <profile-ids> -C -i <index> + ``` + +7. **Get the MIG device UUIDs:** + ```bash + nvidia-smi -L + ``` + Note the `MIG-<uuid>` entries — these are used to target specific MIG instances. + +8. **Show the user how to use MIG devices:** + ```bash + # Bare metal + export CUDA_VISIBLE_DEVICES=MIG-<uuid> + + # Docker + docker run --gpus '"device=MIG-<uuid>"' ... + ``` + +9. **Report the final layout** to the user with UUIDs and suggested docker commands for each instance. 
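The UUID-collection step above can be sketched as a small filter over `nvidia-smi -L` output. The function name `mig_uuids` is hypothetical, and the pattern assumes MIG UUIDs start with `MIG-` followed by hex digits and dashes, which may not hold on every driver version.

```shell
# Hypothetical helper: read `nvidia-smi -L` output on stdin and print
# one MIG UUID per line. Assumes UUIDs look like MIG-<hex and dashes>;
# adjust the pattern if your driver prints a different format.
mig_uuids() {
  grep -oE 'MIG-[0-9a-fA-F-]+'
}
```

Typical use would be `nvidia-smi -L | mig_uuids`, with each printed value then passed to `CUDA_VISIBLE_DEVICES` or to the `--gpus '"device=..."'` flag for one instance.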
+ +## Disabling MIG + +If the user wants to return to full-GPU mode: + +```bash +# Stop all workloads using MIG instances first +sudo nvidia-smi mig -dci -i <index> +sudo nvidia-smi mig -dgi -i <index> +sudo nvidia-smi -i <index> -mig 0 + +# Ensure Fabric Manager is running for NVLink re-initialization +sudo systemctl start nvidia-fabricmanager +``` diff --git a/nvidia/station-ai-skills/assets/.codex/prompts/sglang-setup.md b/nvidia/station-ai-skills/assets/.codex/prompts/sglang-setup.md new file mode 100644 index 0000000..2a8c56a --- /dev/null +++ b/nvidia/station-ai-skills/assets/.codex/prompts/sglang-setup.md @@ -0,0 +1,115 @@ + +# SGLang Setup on DGX Station + +Deploy an SGLang inference server on DGX Station with validated configuration. + +## Steps + +1. **Find the GB300 GPU index.** Run: + ```bash + nvidia-smi --query-gpu=index,name --format=csv,noheader + ``` + Identify the device index for the GB300 (typically device 1). Use this index for `--gpus` below. Do NOT use `--gpus all` — mixed coherency will cause CUDA failures. + +2. **Ask the user which model to serve.** If they don't have a preference, suggest: + - `Qwen/Qwen3-8B` — small, fast, good for testing + - `Qwen/Qwen3-32B` — medium, good balance + - `meta-llama/Llama-3.1-70B-Instruct` — large general-purpose + +3. **Check if the user has an HF_TOKEN.** Pass inline with `-e HF_TOKEN="..."`. + +4. **Deploy the container.** Use this validated configuration: + + ```bash + docker pull lmsysorg/sglang:latest-cu130 + + docker run -d \ + --name sglang-server \ + --gpus '"device=<index>"' \ + --ipc host \ + --ulimit memlock=-1 \ + --ulimit stack=67108864 \ + -p 30000:30000 \ + -e HF_TOKEN="<token>" \ + -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \ + lmsysorg/sglang:latest-cu130 \ + sglang serve --model-path "<model>" \ + --host 0.0.0.0 \ + --port 30000 \ + --context-length 32768 \ + --mem-fraction-static 0.85 + ``` + + **Container version:** Use `lmsysorg/sglang:latest-cu130`. The `cu130` tag is required for Blackwell SM103 support. 
+ + **First launch** downloads the model and compiles kernels. This takes extra time — subsequent starts are faster. + +5. **Wait for the server to be ready.** Monitor logs: + ```bash + docker logs -f sglang-server + ``` + +6. **Test the server:** + ```bash + curl http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "<model>", + "messages": [{"role": "user", "content": "Hello"}], + "max_tokens": 64 + }' + ``` + +7. **Report the result** to the user, including: + - Model loaded and serving on port 30000 + - How to stop: `docker stop sglang-server && docker rm sglang-server` + +## Key features + +- **RadixAttention** — automatic KV cache reuse across requests sharing prefixes. On by default, no flag needed. Verify with: `docker logs sglang-server 2>&1 | grep "cached-token" | tail -5` +- **Structured JSON output** — use `response_format.json_schema` in API requests for guaranteed valid JSON. +- **Chunked prefill** — add `--chunked-prefill-size 8192` to break long prefills into chunks, reducing time-to-first-token. 
+ +## Tuning parameters + +| Parameter | Default | Agent workloads | Throughput workloads | +|-----------|---------|-----------------|---------------------| +| `--context-length` | 32768 | 32768-65536 | 8192-16384 | +| `--mem-fraction-static` | 0.85 | 0.80-0.85 | 0.85-0.88 | +| `--chunked-prefill-size` | off | 4096-8192 | 8192 | +| `--enable-metrics` | off | Optional | Recommended | + +## Structured output example + +```bash +curl http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "<model>", + "messages": [{"role": "user", "content": "List three programming languages."}], + "max_tokens": 512, + "response_format": { + "type": "json_schema", + "json_schema": { + "name": "languages", + "schema": { + "type": "object", + "properties": { + "languages": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": {"type": "string"}, + "primary_use": {"type": "string"} + }, + "required": ["name", "primary_use"] + } + } + }, + "required": ["languages"] + } + } + } + }' +``` diff --git a/nvidia/station-ai-skills/assets/.codex/prompts/vllm-setup.md b/nvidia/station-ai-skills/assets/.codex/prompts/vllm-setup.md new file mode 100644 index 0000000..1620997 --- /dev/null +++ b/nvidia/station-ai-skills/assets/.codex/prompts/vllm-setup.md @@ -0,0 +1,74 @@ + +# vLLM Setup on DGX Station + +Deploy a vLLM inference server on DGX Station with validated configuration. + +## Steps + +1. **Find the GB300 GPU index.** Run: + ```bash + nvidia-smi --query-gpu=index,name --format=csv,noheader + ``` + Identify the device index for the GB300 (typically device 1). Use this index for `--gpus` below. Do NOT use `--gpus all` — mixed coherency will cause CUDA failures. + +2. 
**Ask the user which model to serve.** If they don't have a preference, suggest: + - `nvidia/Qwen3-235B-A22B-NVFP4` — large MoE model, fits in 279 GB HBM + - `meta-llama/Llama-3.1-70B-Instruct` — solid general-purpose model + - `Qwen/Qwen3-8B` — small model for testing + +3. **Check if the user has an HF_TOKEN.** Many models require HuggingFace authentication. The token must be passed inline with `-e HF_TOKEN="..."` — do not rely on shell export in background Docker tasks. + +4. **Deploy the container.** Use this validated configuration: + + ```bash + docker pull nvcr.io/nvidia/vllm:26.01-py3 + + docker run -d \ + --name vllm-server \ + --gpus '"device=<index>"' \ + --ipc host \ + --ulimit memlock=-1 \ + --ulimit stack=67108864 \ + -p 8000:8000 \ + -e HF_TOKEN="<token>" \ + -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \ + nvcr.io/nvidia/vllm:26.01-py3 \ + vllm serve "<model>" \ + --max-model-len 32768 \ + --gpu-memory-utilization 0.9 + ``` + + **Container version:** Use `nvcr.io/nvidia/vllm:26.01-py3`. Do NOT use 25.10 — it has a FlashInfer buffer overflow on DGX Station. + +5. **Wait for the server to be ready.** Monitor logs: + ```bash + docker logs -f vllm-server + ``` + Wait for the line indicating the server is listening on port 8000. + +6. **Test the server:** + ```bash + curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "<model>", + "messages": [{"role": "user", "content": "Hello"}], + "max_tokens": 64 + }' + ``` + +7. 
**Report the result** to the user, including: + - Model loaded and serving on port 8000 + - GPU memory utilization + - How to stop: `docker stop vllm-server && docker rm vllm-server` + +## Tuning parameters + +Adjust these based on the user's workload: + +| Parameter | Default | Agent workloads | Throughput workloads | +|-----------|---------|-----------------|---------------------| +| `--max-model-len` | 32768 | 32768-65536 | 8192-16384 | +| `--gpu-memory-utilization` | 0.9 | 0.85-0.90 | 0.90-0.92 | +| `--enable-prefix-caching` | off | Enable (multi-turn reuse) | Enable | +| `--max-num-seqs` | default | 4-16 (lower latency) | 32+ (higher throughput) | diff --git a/nvidia/station-ai-skills/assets/AGENTS.md b/nvidia/station-ai-skills/assets/AGENTS.md new file mode 100644 index 0000000..6fc6d17 --- /dev/null +++ b/nvidia/station-ai-skills/assets/AGENTS.md @@ -0,0 +1,81 @@ +# DGX Station Essential Constraints + +This file gives your coding agent the critical constraints it needs to avoid breaking things on NVIDIA DGX Station. When you need a step-by-step workflow, invoke the bundled skills: `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose`. In Codex, install them into `$CODEX_HOME/skills` and mention them as `$vllm-setup` or plain text like "use vllm-setup"; in Claude Code or Gemini CLI, type `/`; in Cursor, reference the rule by name. + +## System architecture (quick reference) + +- **GB300 GPU** — Blackwell Ultra (SM103), up to 279 GB HBM3e, 20 PFLOPS sparse FP4. This is the AI compute GPU. +- **Grace CPU** — 72-core ARM Neoverse V2, up to 496 GB LPDDR5x. +- **RTX PRO 6000** — Discrete display GPU (PCIe, non-coherent). For graphics only. +- **NVLink C2C** — Coherent CPU-GPU link. CPU + GPU memory = up to 775 GB total. +- The GB300 is typically device **1** and RTX PRO is device **0**. 
Always verify: `nvidia-smi --query-gpu=index,name --format=csv,noheader` + +## Critical constraint: mixed coherency + +**CUDA cannot handle mixed-coherency GPUs in the same process.** The GB300 uses hardware-coherent memory (ATS) while the RTX PRO uses non-coherent memory (HMM via PCIe). A single CUDA context can use one or the other, not both. + +**Never use `--gpus all`** — it will cause CUDA assert failures. + +## GPU targeting + +There are three ways to target the GB300: + +**1. By device index** (most common): +```bash +export CUDA_VISIBLE_DEVICES=1 # bare metal +docker run --gpus '"device=1"' ... # Docker +``` + +**2. By coherency modality:** +```bash +export CUDA_DEVICE_MODALITY=ATS # GB300 (coherent) +export CUDA_DEVICE_MODALITY=NONATS # RTX PRO (non-coherent) +``` + +**3. By driver application profiles** in `~/.nv/nvidia-application-profiles-rc`: +```json +{ + "rules": [ + { "pattern": { "feature": "cmdline", "matches": "my_app" }, "profile": "UseATSGpuInMixedCoherencySystems" } + ] +} +``` + +## Display and graphics + +- The GB300 does not support X display. Display runs on RTX PRO only. +- **Do not run `nvidia-xconfig -a`** — it generates an invalid config. +- If CUDA initializes before Vulkan in a process, it may bind to the GB300, causing `VK_ERROR_INITIALIZATION_FAILED`. Run CUDA and Vulkan in separate processes. + +## Memory + +- GB300 HBM is in the system memory pool (NUMA node 1). `malloc` may allocate there. +- Use `numactl --membind=0` for CPU-only processes that shouldn't touch GPU memory. +- CPU can cache accesses to GB300 memory, but GB300 cannot cache accesses to CPU memory. 
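The `numactl --membind=0` advice in the Memory notes above could be wrapped so that scripts still run on machines without `numactl`. This is a sketch under that assumption; the function name `cpu_only_prefix` is hypothetical, and it relies on the statement above that CPU memory is NUMA node 0.

```shell
# Hypothetical helper: print the command prefix that binds allocations
# to NUMA node 0 (CPU memory, per the Memory notes), or print nothing
# when numactl is not installed.
cpu_only_prefix() {
  if command -v numactl >/dev/null 2>&1; then
    echo "numactl --membind=0"
  fi
}
```

A CPU-only job would then be launched as `$(cpu_only_prefix) ./my_cpu_job`, where `./my_cpu_job` is a placeholder for your own program.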
+ +## Software versions + +| Component | Validated version | Notes | +|-----------|-------------------|-------| +| NVIDIA Driver | 590.48.01 | Check with `nvidia-smi` | +| CUDA (driver) | 13.1 | Containers bring their own runtime | +| vLLM container | `nvcr.io/nvidia/vllm:26.01-py3` | **Avoid 25.10** (FlashInfer buffer overflow) | +| SGLang container | `lmsysorg/sglang:latest-cu130` | cu130 required for SM103 | +| CUDA base image | `nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04` | For custom containers | +| Ubuntu | 24.04 | Preinstalled | + +## Common pitfalls + +| Symptom | Cause | Fix | +|---------|-------|-----| +| `--gpus all` CUDA assert failure | Mixed coherency | Use `--gpus '"device=N"'` for the GB300 | +| vLLM 25.10 FlashInfer crash | Known DGX Station bug | Use `vllm:26.01-py3` or newer | +| SGLang CUDA errors | Wrong CUDA for Blackwell | Use `sglang:latest-cu130` | +| Model runs on RTX PRO | Wrong device index | Verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader` | +| `nvidia-smi -mig 1` "In use" | GPU processes running | `sudo fuser -v /dev/nvidia*` | +| NVLink errors after disabling MIG | Fabric Manager stopped | `sudo systemctl start nvidia-fabricmanager` | +| `malloc` lands in GPU memory | HBM in system pool | `numactl --membind=0` | +| X crash after `nvidia-xconfig -a` | Invalid mixed-coherency config | Restore from `/etc/X11/xorg.conf.nvidia-xconfig-original` | +| Vulkan `VK_ERROR_INITIALIZATION_FAILED` | CUDA bound GB300 first | Separate CUDA and Vulkan into different processes | +| HuggingFace 401 | Missing HF_TOKEN | Pass inline: `-e HF_TOKEN="hf_..."` | +| Port conflict | Port already in use | `lsof -i :PORT`, use different port | diff --git a/nvidia/station-ai-skills/assets/install.sh b/nvidia/station-ai-skills/assets/install.sh new file mode 100644 index 0000000..67397fd --- /dev/null +++ b/nvidia/station-ai-skills/assets/install.sh @@ -0,0 +1,218 @@ +#!/bin/sh +# install.sh — Install DGX Station AI Skills into a project for 
a chosen coding agent. +# +# Usage: ./install.sh [target-dir] [--force] +# harness: claude | codex | gemini | cursor | all +# target-dir: where to install (default: current directory) +# --force: overwrite existing context files (AGENTS.md, CLAUDE.md, GEMINI.md) +# +# Layout produced per harness: +# claude -> CLAUDE.md + .claude/skills//SKILL.md +# codex -> AGENTS.md + $CODEX_HOME/skills//SKILL.md +# gemini -> GEMINI.md + .gemini/commands/.md +# cursor -> AGENTS.md + .cursor/rules/.mdc +# all -> all of the above + +set -eu + +usage() { + cat < [target-dir] [--force] + +Harnesses: + claude Claude Code -> CLAUDE.md + .claude/skills//SKILL.md + codex OpenAI Codex CLI -> AGENTS.md + \$CODEX_HOME/skills//SKILL.md + gemini Gemini CLI -> GEMINI.md + .gemini/commands/.md + cursor Cursor -> AGENTS.md + .cursor/rules/.mdc + all Install for all four + +Options: + --force Overwrite existing context files instead of erroring +EOF +} + +if [ $# -lt 1 ]; then + usage >&2 + exit 2 +fi + +case "$1" in + -h|--help) usage; exit 0 ;; +esac + +HARNESS="$1" +shift + +TARGET="." +FORCE=0 +while [ $# -gt 0 ]; do + case "$1" in + --force) FORCE=1 ;; + -h|--help) usage; exit 0 ;; + *) TARGET="$1" ;; + esac + shift +done + +case "$HARNESS" in + claude|codex|gemini|cursor|all) ;; + *) printf 'Error: unknown harness "%s"\n\n' "$HARNESS" >&2; usage >&2; exit 2 ;; +esac + +ASSETS="$(cd "$(dirname "$0")" && pwd)" +SKILLS_DIR="$ASSETS/skills" +AGENTS_MD="$ASSETS/AGENTS.md" + +if [ ! -f "$AGENTS_MD" ]; then + printf 'Error: %s not found\n' "$AGENTS_MD" >&2 + exit 1 +fi + +if [ ! -d "$SKILLS_DIR" ]; then + printf 'Error: %s not found\n' "$SKILLS_DIR" >&2 + exit 1 +fi + +mkdir -p "$TARGET" + +SKILL_NAMES="vllm-setup sglang-setup mig-configure dgx-diagnose" + +# write_context +# Copies AGENTS.md to /, refusing to overwrite without --force. 
+write_context() { + fname="$1" + dest="$TARGET/$fname" + if [ -e "$dest" ] && [ "$FORCE" -ne 1 ]; then + printf ' SKIP %s (exists; pass --force to overwrite)\n' "$dest" >&2 + return 1 + fi + cp "$AGENTS_MD" "$dest" + printf ' WROTE %s\n' "$dest" +} + +# strip_frontmatter +# Emits the SKILL.md body (everything after the closing `---`) to . +# Note: POSIX sh has no local vars; use unique names to avoid clobbering callers. +strip_frontmatter() { + _sf_src="$1" + _sf_dest="$2" + awk 'BEGIN { in_fm=0; past_fm=0 } + past_fm == 1 { print; next } + /^---$/ && in_fm == 0 { in_fm=1; next } + /^---$/ && in_fm == 1 { past_fm=1; next } + in_fm == 0 && past_fm == 0 { past_fm=1; print }' "$_sf_src" > "$_sf_dest" +} + +# write_cursor_rule +# Writes a Cursor .mdc rule: replaces Anthropic frontmatter with Cursor's shape, keeps the body. +write_cursor_rule() { + _wc_src="$1" + _wc_dest="$2" + _wc_name="$3" + _wc_desc="$4" + { + printf -- '---\n' + printf 'description: %s\n' "$_wc_desc" + printf 'globs: ["**/*"]\n' + printf 'alwaysApply: false\n' + printf -- '---\n\n' + } > "$_wc_dest" + strip_frontmatter "$_wc_src" "$_wc_dest.body" + cat "$_wc_dest.body" >> "$_wc_dest" + rm -f "$_wc_dest.body" +} + +# extract_description +# Reads the description: line from the skill's SKILL.md frontmatter. 
+extract_description() { + _ed_name="$1" + awk '/^description: / { sub(/^description: /, ""); print; exit }' "$SKILLS_DIR/$_ed_name/SKILL.md" +} + +install_claude() { + printf 'Installing for Claude Code into %s/\n' "$TARGET" + write_context "CLAUDE.md" || true + for name in $SKILL_NAMES; do + dest_dir="$TARGET/.claude/skills/$name" + dest="$dest_dir/SKILL.md" + mkdir -p "$dest_dir" + if [ -e "$dest" ]; then + printf ' SKIP %s (exists)\n' "$dest" >&2 + continue + fi + cp "$SKILLS_DIR/$name/SKILL.md" "$dest" + printf ' WROTE %s\n' "$dest" + done + printf 'Next: cd %s && claude (type "/" to see vllm-setup, sglang-setup, mig-configure, dgx-diagnose)\n' "$TARGET" +} + +install_codex() { + printf 'Installing for OpenAI Codex CLI into %s/\n' "$TARGET" + write_context "AGENTS.md" || true + codex_home="${CODEX_HOME:-$HOME/.codex}" + codex_skills="$codex_home/skills" + mkdir -p "$codex_skills" + for name in $SKILL_NAMES; do + dest_dir="$codex_skills/$name" + dest="$dest_dir/SKILL.md" + if [ -e "$dest" ] && [ "$FORCE" -ne 1 ]; then + printf ' SKIP %s (exists)\n' "$dest" >&2 + continue + fi + mkdir -p "$dest_dir" + cp -R "$SKILLS_DIR/$name/." 
"$dest_dir/" + printf ' WROTE %s\n' "$dest_dir" + done + printf 'Next: cd %s && codex (mention $vllm-setup or "use vllm-setup"; restart Codex if it was already running)\n' "$TARGET" +} + +install_gemini() { + printf 'Installing for Gemini CLI into %s/\n' "$TARGET" + write_context "GEMINI.md" || true + mkdir -p "$TARGET/.gemini/commands" + for name in $SKILL_NAMES; do + dest="$TARGET/.gemini/commands/$name.md" + if [ -e "$dest" ]; then + printf ' SKIP %s (exists)\n' "$dest" >&2 + continue + fi + strip_frontmatter "$SKILLS_DIR/$name/SKILL.md" "$dest" + printf ' WROTE %s\n' "$dest" + done + printf 'Next: cd %s && gemini (type / to invoke a skill)\n' "$TARGET" +} + +install_cursor() { + printf 'Installing for Cursor into %s/\n' "$TARGET" + write_context "AGENTS.md" || true + mkdir -p "$TARGET/.cursor/rules" + for name in $SKILL_NAMES; do + dest="$TARGET/.cursor/rules/$name.mdc" + if [ -e "$dest" ]; then + printf ' SKIP %s (exists)\n' "$dest" >&2 + continue + fi + desc="$(extract_description "$name")" + write_cursor_rule "$SKILLS_DIR/$name/SKILL.md" "$dest" "$name" "$desc" + printf ' WROTE %s\n' "$dest" + done + printf 'Next: open %s in Cursor (reference rules by name in chat, e.g. 
"use the vllm-setup rule")\n' "$TARGET" +} + +case "$HARNESS" in + claude) install_claude ;; + codex) install_codex ;; + gemini) install_gemini ;; + cursor) install_cursor ;; + all) + install_claude + printf '\n' + install_codex + printf '\n' + install_gemini + printf '\n' + install_cursor + ;; +esac + +printf '\nDone.\n' diff --git a/nvidia/station-ai-skills/assets/skills/dgx-diagnose/SKILL.md b/nvidia/station-ai-skills/assets/skills/dgx-diagnose/SKILL.md new file mode 100644 index 0000000..d466214 --- /dev/null +++ b/nvidia/station-ai-skills/assets/skills/dgx-diagnose/SKILL.md @@ -0,0 +1,92 @@ +--- +name: dgx-diagnose +description: Diagnose common DGX Station GB300 issues — CUDA crashes, wrong-GPU targeting, vLLM/SGLang container bugs, MIG state problems, NVLink/Fabric Manager errors, X/Vulkan failures, HuggingFace auth, and port conflicts. Use when the user reports a GPU error, inference server crash, MIG problem, or any unexplained DGX Station failure. +metadata: + publisher: nvidia + hardware: DGX Station GB300 +--- + +# DGX Station Diagnostics + +Diagnose common DGX Station issues. Run through the checks below to identify the problem. + +## Step 1. Gather system state + +Run these commands and analyze the output: + +```bash +# GPU status +nvidia-smi + +# GPU device list with indices +nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader + +# Driver version +nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1 + +# MIG state +nvidia-smi -i 1 -q 2>/dev/null | grep -i "MIG Mode" || echo "Could not query MIG on device 1" + +# Fabric Manager +systemctl is-active nvidia-fabricmanager + +# GPU processes +sudo fuser -v /dev/nvidia* 2>/dev/null || echo "No GPU processes found" + +# Docker containers using GPUs +docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}" 2>/dev/null +``` + +## Step 2. 
Match symptoms to known issues + +Based on the gathered state and the user's reported problem, check for these known issues: + +### CUDA crashes with `--gpus all` +**Cause:** Mixed coherency — GB300 (ATS) and RTX PRO (non-ATS) cannot share a CUDA context. +**Fix:** Use `--gpus '"device=N"'` targeting only the GB300. + +### Model running on wrong GPU (RTX PRO instead of GB300) +**Check:** The device index in the docker command vs actual GPU indices. +**Fix:** Verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader` and correct the `--gpus` flag. + +### vLLM crash / FlashInfer buffer overflow +**Check:** Container version — `docker inspect vllm-server | grep Image` +**Fix:** Use `nvcr.io/nvidia/vllm:26.01-py3`. Version 25.10 has a known FlashInfer bug on DGX Station. + +### SGLang CUDA errors +**Check:** Container tag — must be `cu130` for Blackwell SM103. +**Fix:** Use `lmsysorg/sglang:latest-cu130`. + +### CUDA OOM despite 279 GB HBM +**Check:** `--max-model-len` / `--context-length` and memory utilization settings. +**Fix:** Reduce context length or lower `--gpu-memory-utilization` / `--mem-fraction-static`. + +### `nvidia-smi -mig 1` returns "In use by another client" +**Check:** `sudo fuser -v /dev/nvidia*` — GPU processes must be stopped first. +**Fix:** Stop all GPU workloads, then retry. + +### NVLink errors after disabling MIG +**Check:** `systemctl is-active nvidia-fabricmanager` +**Fix:** `sudo systemctl start nvidia-fabricmanager` + +### X server crash after nvidia-xconfig -a +**Fix:** `sudo cp /etc/X11/xorg.conf.nvidia-xconfig-original /etc/X11/xorg.conf` + +### Vulkan VK_ERROR_INITIALIZATION_FAILED +**Cause:** CUDA initialized before Vulkan, binding to GB300. +**Fix:** Run CUDA and Vulkan workloads in separate processes. For Vulkan apps: `__GL_DeviceModalityPreference=2 ./your_app` + +### HuggingFace 401 / token errors +**Fix:** Pass token inline: `-e HF_TOKEN="hf_..."`. Don't rely on shell export for background Docker tasks. 
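A launch wrapper can apply the inline-token fix above by refusing to start when no token is available, rather than letting the container fail later with a 401. The guard below is a minimal sketch, not part of the playbook: `require_hf_token` is a hypothetical helper, and it assumes the token is held in the calling shell's `HF_TOKEN` variable.

```shell
# Hypothetical guard (not part of the playbook): succeed only when HF_TOKEN
# is set to a non-empty value in the current shell.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is empty; gated models will fail with 401" >&2
    return 1
  fi
  return 0
}
```

A wrapper could then run `require_hf_token && docker run -e HF_TOKEN="$HF_TOKEN" ...` so the value is passed inline on the command that starts the container.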
+ +### Port already in use +**Check:** `lsof -i :PORT` +**Fix:** Stop the conflicting process or use a different host port: `-p 8001:8000`. + +## Step 3. Report findings + +Tell the user: +1. What the issue is +2. Why it happens (root cause) +3. The specific command to fix it +4. How to verify the fix worked diff --git a/nvidia/station-ai-skills/assets/skills/mig-configure/SKILL.md b/nvidia/station-ai-skills/assets/skills/mig-configure/SKILL.md new file mode 100644 index 0000000..272ebfa --- /dev/null +++ b/nvidia/station-ai-skills/assets/skills/mig-configure/SKILL.md @@ -0,0 +1,110 @@ +--- +name: mig-configure +description: Configure NVIDIA MIG (Multi-Instance GPU) partitions on the DGX Station GB300, including enabling MIG mode, choosing a profile layout, creating instances, and retrieving MIG UUIDs. Use when the user asks to partition the GB300, set up MIG, run multiple models in isolation on one GPU, or reconfigure existing MIG instances. +metadata: + publisher: nvidia + hardware: DGX Station GB300 +--- + +# MIG Configuration on DGX Station + +Configure MIG (Multi-Instance GPU) partitions on the DGX Station GB300. + +## Steps + +1. **Find the GB300 GPU index.** Run: + ```bash + nvidia-smi --query-gpu=index,name --format=csv,noheader + ``` + +2. **Check current MIG state** (replace `N` with the GB300 index from step 1): + ```bash + nvidia-smi -i N -q | grep -i "MIG Mode" + ``` + +3. **If MIG is already enabled, show current instances:** + ```bash + nvidia-smi mig -lgi -i N + nvidia-smi mig -lci -i N + ``` + If the user wants to reconfigure, destroy existing instances first (step 6). + +4. **If MIG is not enabled, enable it.** All GPU processes must be stopped first: + ```bash + # Check for running GPU processes + sudo fuser -v /dev/nvidia* + + # Enable MIG + sudo nvidia-smi -i N -mig 1 + + # Verify + nvidia-smi -i N -q | grep -i "MIG Mode" + ``` + +5.
**Show available profiles and help the user choose a layout** (`N` is the GB300 index): + ```bash + nvidia-smi mig -lgip -i N + ``` + + Common GB300 MIG profiles: + + | Profile | ID | Memory | Use case | + |---------|----|--------|----------| + | 1g.35gb | 19 | ~35 GB | Small models (7-8B), dev/test | + | 1g.35gb+me | 20 | ~35 GB | Same + media extensions | + | 1g.70gb | 15 | ~70 GB | Slightly larger inference | + | 2g.70gb | 14 | ~70 GB | Medium models (14-30B) | + | 3g.139gb | 9 | ~139 GB | Large models (70B quantized) | + | 4g.139gb | 5 | ~139 GB | Large models, more compute | + | 7g.278gb | 0 | ~278 GB | Full GPU as single instance | + + Suggest layouts based on the user's workload. Examples: + - **Two models (70B + 8B):** `3g.139gb + 2g.70gb + 2g.70gb` → IDs `9,14,14` + - **Many small models:** `7 × 1g.35gb` → IDs `19,19,19,19,19,19,19` + - **One large model with isolation:** `7g.278gb` → ID `0` + + Ask the user what models they want to run before suggesting a layout. + +6. **Create (or recreate) instances:** + + If reconfiguring, destroy existing instances first: + ```bash + sudo nvidia-smi mig -dci -i N + sudo nvidia-smi mig -dgi -i N + ``` + + Then create the new layout (`PROFILE_IDS` is the comma-separated ID list, for example `9,14,14`): + ```bash + sudo nvidia-smi mig -cgi PROFILE_IDS -C -i N + ``` + +7. **Get the MIG device UUIDs:** + ```bash + nvidia-smi -L + ``` + Note the `MIG-` entries; these are used to target specific MIG instances. + +8. **Show the user how to use MIG devices** (`MIG-UUID` stands for one full entry from step 7): + ```bash + # Bare metal + export CUDA_VISIBLE_DEVICES=MIG-UUID + + # Docker + docker run --gpus '"device=MIG-UUID"' ... + ``` + +9. **Report the final layout** to the user with UUIDs and suggested docker commands for each instance.
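Steps 7 through 9 lend themselves to a small script. The following is a minimal sketch, not part of the playbook: `mig_uuids` and `mig_docker_flags` are hypothetical helpers, and they assume `nvidia-smi -L` prints each MIG device with a `(UUID: MIG-...)` suffix made of hex digits and dashes, as recent drivers do.

```shell
# Hypothetical helper: read `nvidia-smi -L` output on stdin and print one
# MIG UUID per line. Assumes UUIDs look like MIG-<hex digits and dashes>.
mig_uuids() {
  grep -oE 'MIG-[0-9a-fA-F-]+'
}

# Hypothetical helper: print a docker --gpus flag for each MIG instance
# found on stdin, quoted the same way as in step 8.
mig_docker_flags() {
  mig_uuids | while read -r uuid; do
    printf '%s\n' "--gpus '\"device=$uuid\"'"
  done
}
```

On the workstation this would be driven with `nvidia-smi -L | mig_docker_flags`; each printed flag can be pasted into the `docker run` command for the model assigned to that instance.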
+ +## Disabling MIG + +If the user wants to return to full-GPU mode: + +```bash +# Stop all workloads using MIG instances first (N = GB300 index) +sudo nvidia-smi mig -dci -i N +sudo nvidia-smi mig -dgi -i N +sudo nvidia-smi -i N -mig 0 + +# Ensure Fabric Manager is running for NVLink re-initialization +sudo systemctl start nvidia-fabricmanager +``` diff --git a/nvidia/station-ai-skills/assets/skills/sglang-setup/SKILL.md b/nvidia/station-ai-skills/assets/skills/sglang-setup/SKILL.md new file mode 100644 index 0000000..41913a3 --- /dev/null +++ b/nvidia/station-ai-skills/assets/skills/sglang-setup/SKILL.md @@ -0,0 +1,122 @@ +--- +name: sglang-setup +description: Deploy an SGLang inference server on an NVIDIA DGX Station GB300 with the cu130 container, RadixAttention prefix caching, and structured JSON output support. Use when the user asks to serve a model with SGLang, start an SGLang endpoint, or needs structured-output inference on DGX Station. +metadata: + publisher: nvidia + hardware: DGX Station GB300 +--- + +# SGLang Setup on DGX Station + +Deploy an SGLang inference server on DGX Station with validated configuration. + +## Steps + +1. **Find the GB300 GPU index.** Run: + ```bash + nvidia-smi --query-gpu=index,name --format=csv,noheader + ``` + Identify the device index for the GB300 (typically device 1). Use this index for `--gpus` below. Do NOT use `--gpus all`: mixed coherency will cause CUDA failures. + +2. **Ask the user which model to serve.** If they don't have a preference, suggest: + - `Qwen/Qwen3-8B`: small, fast, good for testing + - `Qwen/Qwen3-32B`: medium, good balance + - `meta-llama/Llama-3.1-70B-Instruct`: large general-purpose + +3. **Check if the user has an HF_TOKEN.** Pass inline with `-e HF_TOKEN="..."`. + +4.
**Deploy the container.** Use this validated configuration (`N` is the GB300 index from step 1, `MODEL_ID` the model chosen in step 2): + + ```bash + docker pull lmsysorg/sglang:latest-cu130 + + docker run -d \ + --name sglang-server \ + --gpus '"device=N"' \ + --ipc host \ + --ulimit memlock=-1 \ + --ulimit stack=67108864 \ + -p 30000:30000 \ + -e HF_TOKEN="hf_..." \ + -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \ + lmsysorg/sglang:latest-cu130 \ + sglang serve --model-path "MODEL_ID" \ + --host 0.0.0.0 \ + --port 30000 \ + --context-length 32768 \ + --mem-fraction-static 0.85 + ``` + + **Container version:** Use `lmsysorg/sglang:latest-cu130`. The `cu130` tag is required for Blackwell SM103 support. + + **First launch** downloads the model and compiles kernels. This takes extra time; subsequent starts are faster. + +5. **Wait for the server to be ready.** Monitor logs: + ```bash + docker logs -f sglang-server + ``` + +6. **Test the server:** + ```bash + curl http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "MODEL_ID", + "messages": [{"role": "user", "content": "Hello"}], + "max_tokens": 64 + }' + ``` + +7. **Report the result** to the user, including: + - Model loaded and serving on port 30000 + - How to stop: `docker stop sglang-server && docker rm sglang-server` + +## Key features + +- **RadixAttention**: automatic KV cache reuse across requests sharing prefixes. On by default, no flag needed. Verify with: `docker logs sglang-server 2>&1 | grep "cached-token" | tail -5` +- **Structured JSON output**: use `response_format.json_schema` in API requests for guaranteed valid JSON. +- **Chunked prefill**: add `--chunked-prefill-size 8192` to break long prefills into chunks, reducing time-to-first-token.
+ +## Tuning parameters + +| Parameter | Default | Agent workloads | Throughput workloads | +|-----------|---------|-----------------|---------------------| +| `--context-length` | 32768 | 32768-65536 | 8192-16384 | +| `--mem-fraction-static` | 0.85 | 0.80-0.85 | 0.85-0.88 | +| `--chunked-prefill-size` | off | 4096-8192 | 8192 | +| `--enable-metrics` | off | Optional | Recommended | + +## Structured output example + +```bash +curl http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "", + "messages": [{"role": "user", "content": "List three programming languages."}], + "max_tokens": 512, + "response_format": { + "type": "json_schema", + "json_schema": { + "name": "languages", + "schema": { + "type": "object", + "properties": { + "languages": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": {"type": "string"}, + "primary_use": {"type": "string"} + }, + "required": ["name", "primary_use"] + } + } + }, + "required": ["languages"] + } + } + } + }' +``` diff --git a/nvidia/station-ai-skills/assets/skills/vllm-setup/SKILL.md b/nvidia/station-ai-skills/assets/skills/vllm-setup/SKILL.md new file mode 100644 index 0000000..15d2609 --- /dev/null +++ b/nvidia/station-ai-skills/assets/skills/vllm-setup/SKILL.md @@ -0,0 +1,81 @@ +--- +name: vllm-setup +description: Deploy a vLLM inference server on an NVIDIA DGX Station GB300 with validated container, GPU targeting, and tuning parameters. Use when the user asks to serve a model with vLLM, start a vLLM endpoint, or set up OpenAI-compatible inference on DGX Station. +metadata: + publisher: nvidia + hardware: DGX Station GB300 +--- + +# vLLM Setup on DGX Station + +Deploy a vLLM inference server on DGX Station with validated configuration. + +## Steps + +1. **Find the GB300 GPU index.** Run: + ```bash + nvidia-smi --query-gpu=index,name --format=csv,noheader + ``` + Identify the device index for the GB300 (typically device 1). 
Use this index for `--gpus` below. Do NOT use `--gpus all`: mixed coherency will cause CUDA failures. + +2. **Ask the user which model to serve.** If they don't have a preference, suggest: + - `nvidia/Qwen3-235B-A22B-NVFP4`: large MoE model, fits in 279 GB HBM + - `meta-llama/Llama-3.1-70B-Instruct`: solid general-purpose model + - `Qwen/Qwen3-8B`: small model for testing + +3. **Check if the user has an HF_TOKEN.** Many models require HuggingFace authentication. The token must be passed inline with `-e HF_TOKEN="..."`; do not rely on shell export in background Docker tasks. + +4. **Deploy the container.** Use this validated configuration (`N` is the GB300 index from step 1, `MODEL_ID` the model chosen in step 2): + + ```bash + docker pull nvcr.io/nvidia/vllm:26.01-py3 + + docker run -d \ + --name vllm-server \ + --gpus '"device=N"' \ + --ipc host \ + --ulimit memlock=-1 \ + --ulimit stack=67108864 \ + -p 8000:8000 \ + -e HF_TOKEN="hf_..." \ + -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \ + nvcr.io/nvidia/vllm:26.01-py3 \ + vllm serve "MODEL_ID" \ + --max-model-len 32768 \ + --gpu-memory-utilization 0.9 + ``` + + **Container version:** Use `nvcr.io/nvidia/vllm:26.01-py3`. Do NOT use 25.10; it has a FlashInfer buffer overflow on DGX Station. + +5. **Wait for the server to be ready.** Monitor logs: + ```bash + docker logs -f vllm-server + ``` + Wait for the line indicating the server is listening on port 8000. + +6. **Test the server:** + ```bash + curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "MODEL_ID", + "messages": [{"role": "user", "content": "Hello"}], + "max_tokens": 64 + }' + ``` + +7.
**Report the result** to the user, including: + - Model loaded and serving on port 8000 + - GPU memory utilization + - How to stop: `docker stop vllm-server && docker rm vllm-server` + +## Tuning parameters + +Adjust these based on the user's workload: + +| Parameter | Default | Agent workloads | Throughput workloads | +|-----------|---------|-----------------|---------------------| +| `--max-model-len` | 32768 | 32768-65536 | 8192-16384 | +| `--gpu-memory-utilization` | 0.9 | 0.85-0.90 | 0.90-0.92 | +| `--enable-prefix-caching` | off | Enable (multi-turn reuse) | Enable | +| `--max-num-seqs` | default | 4-16 (lower latency) | 32+ (higher throughput) | diff --git a/nvidia/station-ai-skills/endpoint-test.yaml b/nvidia/station-ai-skills/endpoint-test.yaml new file mode 100644 index 0000000..6ef089a --- /dev/null +++ b/nvidia/station-ai-skills/endpoint-test.yaml @@ -0,0 +1,413 @@ +kind: Playbook +metadata: + name: station-ai-skills + displayName: DGX Station AI Skills for Coding Agents + shortDescription: Give your coding agent (Claude Code, Codex, Gemini CLI, Cursor) DGX Station expertise via an AGENTS.md and on-demand Agent Skills + + publisher: nvidia + description: | + # REPLACE THIS WITH YOUR MODEL CARD + https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads + + labelsV2: + - gpuType:playbook:gpu_type_station + - DGX Station + - GB300 + - Blackwell + - AI Agents + - Agent Skills + - AGENTS.md + - Claude Code + - Codex + - Gemini CLI + - Cursor + - vLLM + - SGLang + - MIG + - Mixed Coherency + + attributes: + - key: DURATION + value: 15 MIN + +spec: + artifactName: station-ai-skills + nvcfFunctionId: None + attributes: + + showUnavailableBanner: false + apiDocsUrl: None + termsOfUse: | + + cta: + text: View on GitHub + url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-ai-skills/ + + + tabs: + - + id: overview + + label: Overview + content: | + # Basic idea + + Modern coding 
agents — Claude Code, OpenAI Codex CLI, Gemini CLI, Cursor — all support two extension mechanisms: a project-level **context file** that's loaded into every conversation, and **on-demand procedural workflows** (called skills, prompts, commands, or rules depending on the harness). This playbook ships both for DGX Station: + + - An **`AGENTS.md`** with the critical DGX Station constraints your agent should always know (mixed coherency, GPU targeting, common pitfalls). `AGENTS.md` is the cross-harness standard; an `install.sh` lays it down as `CLAUDE.md`, `GEMINI.md`, or `AGENTS.md` depending on the agent you use. + - **Four Agent Skills** — `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose` — authored once in the [Anthropic Agent Skills format](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview) and installed into the right per-harness location (`.claude/skills/`, `.codex/prompts/`, `.gemini/commands/`, or `.cursor/rules/`). + + This approach keeps your agent's context lean in every conversation while giving it deep procedural knowledge on demand, regardless of which agent you use. + + ## AGENTS.md vs Agent Skill — why split? + + | | AGENTS.md | Agent Skill | + |---|---|---| + | **Loaded** | Every conversation, automatically | Only when invoked by name (or matched by description, in Claude) | + | **Best for** | Constraints, pitfalls, "never do X" rules | Step-by-step workflows, deployment procedures | + | **Context cost** | Consumed every time | Zero until invoked | + + The DGX Station mixed-coherency constraint (`--gpus all` will crash) should be in every conversation. The full vLLM deployment procedure should not. + + # What you'll accomplish + + - Install the `AGENTS.md` and four Agent Skills into your project directory for your chosen agent (Claude Code, Codex, Gemini CLI, or Cursor). + - Verify the agent loads the constraints automatically and the skills on demand. 
+ - Invoke `vllm-setup` to deploy a vLLM inference server with validated configuration. + - Invoke `sglang-setup` to deploy an SGLang inference server. + - Invoke `mig-configure` to partition the GB300 into MIG instances. + - Invoke `dgx-diagnose` to troubleshoot common DGX Station issues. + + # What to know before starting + + - Basic familiarity with one supported coding agent (running it, giving it prompts, using slash commands or rule references) + - General understanding of DGX Station (two GPUs, Docker-based workflows) + + # Prerequisites + + - NVIDIA DGX Station with GB300 + - One of the supported coding agents installed: + - **Claude Code:** `curl -fsSL https://claude.ai/install.sh | sh` + - **OpenAI Codex CLI:** `npm i -g @openai/codex` + - **Gemini CLI:** `npm i -g @google/gemini-cli` + - **Cursor:** download from `https://cursor.com/` + - A project directory where you do DGX Station work + + # Ancillary files + + - `assets/AGENTS.md` — canonical context file with critical constraints, GPU targeting, software versions, and common pitfalls. Cross-harness standard. + - `assets/skills/vllm-setup/SKILL.md` — skill: deploy vLLM with validated configuration. + - `assets/skills/sglang-setup/SKILL.md` — skill: deploy SGLang with validated configuration. + - `assets/skills/mig-configure/SKILL.md` — skill: configure MIG partitions on the GB300. + - `assets/skills/dgx-diagnose/SKILL.md` — skill: troubleshoot common DGX Station issues. + - `assets/install.sh` — per-harness installer (`claude`, `codex`, `gemini`, `cursor`, or `all`). 
+ + # Time & risk + + * **Duration:** 10-15 minutes + * **Risk level:** Low — this playbook copies markdown files into your project directory + * **Rollback:** Delete the context file (`AGENTS.md` / `CLAUDE.md` / `GEMINI.md`) and the harness-specific skill directory (`.claude/skills/`, `.codex/prompts/`, `.gemini/commands/`, or `.cursor/rules/`) from your project directory + * **Last Updated:** 05/18/2026 + * Restructured as harness-agnostic Agent Skills (Claude Code, Codex, Gemini CLI, Cursor) + + + + - + id: instructions + + label: Instructions + content: | + # Step 1. Install your coding agent + + Pick whichever agent you prefer — the rest of this playbook works the same regardless. Install commands: + + | Agent | Install | + |-------|---------| + | Claude Code | `curl -fsSL https://claude.ai/install.sh \| sh` | + | OpenAI Codex CLI | `npm i -g @openai/codex` | + | Gemini CLI | `npm i -g @google/gemini-cli` | + | Cursor | Download from `https://cursor.com/` | + + Verify with `claude --version`, `codex --version`, `gemini --version`, or by launching Cursor. + + # Step 2. Install the skills into your project + + Navigate to the project where you want DGX Station expertise, then run the installer with the harness you use: + + ```bash + cd ~/your-project + + # Pick one: + /path/to/this/playbook/assets/install.sh claude + /path/to/this/playbook/assets/install.sh codex + /path/to/this/playbook/assets/install.sh gemini + /path/to/this/playbook/assets/install.sh cursor + + # Or install for all four at once: + /path/to/this/playbook/assets/install.sh all + ``` + + If you downloaded the playbook as a zip, the path is relative to the extracted directory: + + ```bash + station-ai-skills/assets/install.sh claude ~/your-project + ``` + + The installer is additive for skill directories (won't clobber existing skills you've written) and refuses to overwrite an existing context file (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`) unless you pass `--force`. 
+ + **Resulting layout** (per harness): + + ```text + your-project/ + AGENTS.md or CLAUDE.md or GEMINI.md # context file (named for your agent) + .claude/skills/NAME/SKILL.md # claude + .codex/prompts/NAME.md # codex + .gemini/commands/NAME.md # gemini + .cursor/rules/NAME.mdc # cursor + ``` + + Where `NAME` is each of `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose`. + + > [!NOTE] + > Every supported agent automatically reads the context file from the working directory at startup. Skills/prompts/rules in the harness-specific directory are discovered automatically; no additional configuration is needed. + + # Step 3. Verify the setup + + Start your agent in the project directory and ask a question that requires constraint knowledge: + + ```text + Can I use --gpus all to run my CUDA workload on DGX Station? + ``` + + The agent should immediately warn about the mixed-coherency constraint and recommend `--gpus '"device=N"'` targeting. If you don't get the warning, the context file isn't being loaded; see Troubleshooting. + + Then verify the skills are discoverable: + + | Agent | How to check | + |-------|--------------| + | Claude Code | Type `/`: `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose` should appear in the autocomplete | + | Codex CLI | Type `/prompts:`: the same four names appear | + | Gemini CLI | Type `/`: the same four names appear | + | Cursor | Open the Rules panel: the same four rules appear | + + # Step 4. Use vllm-setup to deploy an inference server + + Invoke the skill in your agent: + + | Agent | Invocation | + |-------|-----------| + | Claude Code | `/vllm-setup` (slash command) or just describe the task ("deploy vllm with Qwen3-8B") | + | Codex CLI | `/prompts:vllm-setup` | + | Gemini CLI | `/vllm-setup` | + | Cursor | In chat: "use the vllm-setup rule to deploy a vllm server" | + + The agent will walk you through deploying a vLLM server with a validated container image, correct GPU targeting, and recommended parameters.
It will check your GPU index, ask which model you want to serve, and generate the full `docker run` command. + + # Step 5. Use sglang-setup to deploy SGLang + + Same invocation pattern, but for SGLang with the `cu130` container, RadixAttention prefix caching, and structured JSON output support. + + # Step 6. Use mig-configure to partition the GB300 + + The agent will query your current MIG state, show available profiles, help you choose a layout for your workloads, and execute the partitioning commands. + + # Step 7. Use dgx-diagnose to troubleshoot issues + + If you encounter problems, invoke `dgx-diagnose`. The agent will check GPU status, driver version, running processes, MIG state, and Fabric Manager to identify the issue. + + # Step 8. Customize + + Both the `AGENTS.md` and the skills are plain markdown — extend them freely. + + **Add project-specific constraints to `AGENTS.md`** (or your harness-specific context file): + + ```markdown + ## Project-specific + + - Our production MIG layout is 3g.139gb + 2g.70gb + 2g.70gb + - Always use port 8080 for inference (nginx proxy on 443) + - Model weights are cached at /data/models, mount with -v /data/models:/root/.cache/huggingface/hub + ``` + + **Create new skills** by adding a directory and `SKILL.md` to `assets/skills/`, then re-run `install.sh`: + + ```bash + mkdir -p assets/skills/run-benchmarks + cat > assets/skills/run-benchmarks/SKILL.md << 'EOF' + --- + name: run-benchmarks + description: Run our standard inference benchmark suite against the running vLLM or SGLang server and compare against the baseline. + --- + + # Run benchmarks + + 1. Check which inference server is running (vLLM on port 8000 or SGLang on port 30000) + 2. Run the appropriate benchmark script from ./benchmarks/ + 3. Report throughput (tokens/sec), latency (TTFT, ITL), and memory utilization + 4. 
Compare against the baseline in ./benchmarks/baseline.json + EOF + ``` + + > [!TIP] + > Keep `AGENTS.md` focused on constraints and pitfalls (things that break). Put procedural workflows in skills (things you do step-by-step). + + + + - + id: troubleshooting + + label: Troubleshooting + content: | + # Skills don't appear in autocomplete / aren't discoverable + + Each agent discovers skills from a harness-specific directory in the current directory (or a parent). Check the right one: + + | Agent | Expected location | + |-------|-------------------| + | Claude Code | `.claude/skills//SKILL.md` | + | Codex CLI | `.codex/prompts/.md` | + | Gemini CLI | `.gemini/commands/.md` | + | Cursor | `.cursor/rules/.mdc` | + + ```bash + # Examples — check the directory for your agent + ls -la .claude/skills/ + ls -la .codex/prompts/ + ls -la .gemini/commands/ + ls -la .cursor/rules/ + ``` + + You should see entries for `vllm-setup`, `sglang-setup`, `mig-configure`, and `dgx-diagnose`. + + **Check you're in the right directory:** + + ```bash + pwd + ``` + + The agent must be started from the directory containing the harness directory, or a subdirectory of it. + + # Context file not loaded + + If the agent gives generic answers without DGX Station awareness, the context file isn't being picked up. Each agent reads a different filename — verify the one for your agent exists: + + | Agent | Expected filename | + |-------|-------------------| + | Claude Code | `CLAUDE.md` (also reads `AGENTS.md` as fallback) | + | Codex CLI | `AGENTS.md` | + | Gemini CLI | `GEMINI.md` | + | Cursor | `AGENTS.md` | + + ```bash + # Verify the file exists for your agent + cat AGENTS.md | head -5 + cat CLAUDE.md | head -5 + cat GEMINI.md | head -5 + + # Restart the agent in the correct directory + cd ~/your-project + claude # or codex, gemini, etc. + ``` + + All four agents read the context file from the working directory (and parent directories up to the project root). 
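The lookup just described (working directory first, then each parent directory) can be sketched as a small shell function for checking which context file is actually present. This is only an illustration: `find_context_file` is a hypothetical helper that is not shipped with the playbook or with any agent, and it assumes the search may continue up to the filesystem root rather than stopping at a project root.

```bash
# Hypothetical helper (not part of install.sh or any agent): walk from a
# start directory up toward / and print the first context file found.
find_context_file() {
  # Resolve the start directory to an absolute path; fail if it does not exist.
  _ctx_dir=$(cd "${1:-.}" 2>/dev/null && pwd) || return 1
  while :; do
    for _ctx_name in AGENTS.md CLAUDE.md GEMINI.md; do
      if [ -f "$_ctx_dir/$_ctx_name" ]; then
        printf '%s\n' "$_ctx_dir/$_ctx_name"
        return 0
      fi
    done
    # Nothing found all the way up to the filesystem root.
    [ "$_ctx_dir" = "/" ] && return 1
    _ctx_dir=$(dirname "$_ctx_dir")
  done
}
```

Running `find_context_file` from the directory where you start the agent prints the nearest context file, or returns non-zero if none exists. It reports only what is on disk; which filename a given agent reads is listed in the table above.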
+ + # Skill gives outdated information + + The skills contain validated container versions and parameters as of the publication date. If a newer container is available, edit the canonical source and re-install: + + ```bash + nano /path/to/playbook/assets/skills/vllm-setup/SKILL.md + /path/to/playbook/assets/install.sh all --force + ``` + + Or edit the installed copy directly: + + ```bash + # Claude Code + nano .claude/skills/vllm-setup/SKILL.md + # Codex + nano .codex/prompts/vllm-setup.md + # Gemini CLI + nano .gemini/commands/vllm-setup.md + # Cursor + nano .cursor/rules/vllm-setup.mdc + ``` + + > [!TIP] + > Skills are plain markdown — you can version them in git alongside your project code. + + # "Both GPUs cannot be used" errors + + This is the mixed-coherency constraint working as intended. If you see CUDA errors when using `--gpus all`: + + ```bash + # Find the GB300 index + nvidia-smi --query-gpu=index,name --format=csv,noheader + + # Use device-specific targeting + docker run --gpus '"device=1"' ... + ``` + + The `AGENTS.md` covers this constraint, but if you removed that section, add it back — it's the most important piece of DGX Station knowledge. + + # Skills conflict with existing project directory + + If your project already has a `.claude/`, `.codex/`, `.gemini/`, or `.cursor/` directory with its own contents, `install.sh` is **additive** for skill directories — it adds the new skill files alongside whatever you already have and warns on collision rather than overwriting. + + For context files (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`), the installer **refuses** to overwrite an existing file. Pass `--force` to override, or merge the new content manually: + + ```bash + # See what would be written + diff /path/to/playbook/assets/AGENTS.md ./AGENTS.md + + # Force overwrite + /path/to/playbook/assets/install.sh claude . --force + ``` + + # Installer reports "WROTE" for some files but "SKIP" for others + + That's the safe-by-default behavior. 
The installer skips any file that already exists, prints a warning, and continues with the rest. To get a clean install, either: + + 1. Delete the existing files first: `rm -rf .claude/skills/{vllm-setup,sglang-setup,mig-configure,dgx-diagnose}` + 2. Or pass `--force` (only affects context files; skill files are still skipped if present) + + + + + resources: + - name: Anthropic Agent Skills Overview + url: https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview + + + - name: AGENTS.md Standard + url: https://agents.md/ + + + - name: Claude Code Documentation + url: https://docs.anthropic.com/en/docs/claude-code + + + - name: OpenAI Codex AGENTS.md Guide + url: https://developers.openai.com/codex/guides/agents-md + + + - name: Gemini CLI Custom Commands + url: https://geminicli.com/docs/cli/custom-commands/ + + + - name: Cursor Rules Documentation + url: https://docs.cursor.com/ + + + - name: vLLM Documentation + url: https://docs.vllm.ai/en/latest/ + + + - name: SGLang Documentation + url: https://docs.sglang.io/ + + + - name: MIG User Guide + url: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/ + + diff --git a/nvidia/station-ai-skills/overview.md b/nvidia/station-ai-skills/overview.md new file mode 100644 index 0000000..958a829 --- /dev/null +++ b/nvidia/station-ai-skills/overview.md @@ -0,0 +1,2 @@ +# REPLACE THIS WITH YOUR MODEL CARD +https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads diff --git a/nvidia/station-brev/README.md b/nvidia/station-brev/README.md index c2569a4..c1ea1aa 100644 --- a/nvidia/station-brev/README.md +++ b/nvidia/station-brev/README.md @@ -1,4 +1,4 @@ -# Station Register to Brev +# Register DGX Station to Brev > Link your DGX Station to Brev for remote access and sharing @@ -27,7 +27,7 @@ You’ll register your DGX Station with Brev and it will be visible as a healthy While Brev automates the complex configuration, understanding a few key concepts 
when establishing the initial connection will be useful: * **Terminal Basics**: - * Familiarity with the command line to run a few simple setup commands + * Familiarity with command-line use to run a few simple setup commands. ## Prerequisites @@ -45,23 +45,28 @@ You will also need the following: * **Estimated time:** 5-10 minutes * **Risk level:** Low - Registration configures the Station for secure remote access without altering your existing workloads * **Rollback:** The Brev configuration can be removed through the UI and CLI +* **Last Updated:** 05/29/2026 + * First Publication ## Instructions -## Step 1. Login to Brev +## Step 1. Log in to Brev -Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you’re in the correct org (by clicking the org button on the top right hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation. +Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you’re in the correct org (by clicking the org button on the top right-hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation. Click the “Register Compute” button and follow the instructions in the pop-up window. -## Step 2. Complete Popup Instructions +## Step 2. Complete Pop-up Instructions * Install the Brev CLI * Configure your compute * Add a name for compute - * To configure ssh, ensure the “Enable SSH access” toggle is on + * To configure SSH, ensure the “Enable SSH access” toggle is on * Run the registration command +> [!IMPORTANT] +> Run the Brev CLI install command **without `sudo`**. Prefixing the installer with `sudo` writes the `brev` binary into root's home directory, which is not on your user shell's `PATH` — the next command will fail with `brev: command not found`. 
Copy the install command from the pop-up and run it as your normal user. + ## Step 3. Follow Registration Flow In the CLI, you’ll be walked through registration. Go through the flow until registration is complete. @@ -70,7 +75,7 @@ In the CLI, you’ll be walked through registration. Go through the flow until r * Go to the [Brev UI](https://brev.nvidia.com) * Navigate to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) -* Confirm that the DGX Station appears as a registered node with an **Available** status +* Confirm that the DGX Station appears as a registered node with a **Connected** status ## Step 5. Next Steps @@ -78,7 +83,14 @@ Your DGX Station is now integrated into Brev as a secure, remotely accessible GP Now that your hardware is connected, you can: -* **Share Access Anywhere:** Access your machine from anywhere and share access with others through the Brev UI under [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute). +* **Access your machine from anywhere:** Open the [Brev UI](https://brev.nvidia.com) and launch a session from [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute). +* **Share access with others:** Invite teammates to your DGX Station from the Brev UI: + * Go to the [Brev UI](https://brev.nvidia.com) and open [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute). + * Find your DGX Station in the list and open the row's three-dot (⋯) menu. + * Select **Share Access**. + * Enter the email address of the person you want to share with. + * Choose their role / permission level. + * Confirm to send the invitation. ## Step 6. 
Cleanup @@ -93,12 +105,12 @@ brev deregister In the UI: * Go to the [Brev UI](https://brev.nvidia.com) * Navigate to the section listing “GPU Environments” and look under “Registered Compute” -* Click the “Deregister” menu item on the device you wish to delete from Brev -* Confirm your selection +* Click the “Remove” menu item on the device you wish to delete from Brev. +* Confirm your selection. ## Troubleshooting | Symptom | Cause | Fix | |---------|-------|-----| -| Your DGX Station is showing up in the wrong org | It was registered to the wrong org | Run `brev set ` and then redo the registration process | -| Unable to `brev shell ` | Need to refresh | `brev refresh` | +| Your DGX Station is showing up in the wrong org | It was registered to the wrong org | Run `brev set ` and then redo the registration process. | +| Unable to `brev shell ` | Need to refresh | `brev refresh`. | diff --git a/nvidia/station-brev/endpoint-test.yaml b/nvidia/station-brev/endpoint-test.yaml index 73d7f57..d08fa9c 100644 --- a/nvidia/station-brev/endpoint-test.yaml +++ b/nvidia/station-brev/endpoint-test.yaml @@ -1,7 +1,7 @@ kind: Playbook metadata: name: station-brev - displayName: Station Register to Brev + displayName: Register DGX Station to Brev shortDescription: Link your DGX Station to Brev for remote access and sharing publisher: nvidia description: | @@ -10,8 +10,7 @@ metadata: labelsV2: - gpuType:playbook:gpu_type_station - - DGX - - Station + - DGX Station - Brev attributes: @@ -53,7 +52,7 @@ spec: While Brev automates the complex configuration, understanding a few key concepts when establishing the initial connection will be useful: * **Terminal Basics**: - * Familiarity with the command line to run a few simple setup commands + * Familiarity with command-line use to run a few simple setup commands. 
# Prerequisites @@ -71,6 +70,8 @@ spec: * **Estimated time:** 5-10 minutes * **Risk level:** Low - Registration configures the Station for secure remote access without altering your existing workloads * **Rollback:** The Brev configuration can be removed through the UI and CLI + * **Last Updated:** 05/29/2026 + * First Publication @@ -79,20 +80,23 @@ spec: label: Instructions content: | - # Step 1. Login to Brev + # Step 1. Log in to Brev - Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you’re in the correct org (by clicking the org button on the top right hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation. + Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you’re in the correct org (by clicking the org button on the top right-hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation. Click the “Register Compute” button and follow the instructions in the pop-up window. - # Step 2. Complete Popup Instructions + # Step 2. Complete Pop-up Instructions * Install the Brev CLI * Configure your compute * Add a name for compute - * To configure ssh, ensure the “Enable SSH access” toggle is on + * To configure SSH, ensure the “Enable SSH access” toggle is on * Run the registration command + > [!IMPORTANT] + > Run the Brev CLI install command **without `sudo`**. Prefixing the installer with `sudo` writes the `brev` binary into root's home directory, which is not on your user shell's `PATH` — the next command will fail with `brev: command not found`. Copy the install command from the pop-up and run it as your normal user. + # Step 3. Follow Registration Flow In the CLI, you’ll be walked through registration. Go through the flow until registration is complete. 
@@ -101,7 +105,7 @@ spec: * Go to the [Brev UI](https://brev.nvidia.com) * Navigate to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) - * Confirm that the DGX Station appears as a registered node with an **Available** status + * Confirm that the DGX Station appears as a registered node with a **Connected** status # Step 5. Next Steps @@ -109,7 +113,14 @@ spec: Now that your hardware is connected, you can: - * **Share Access Anywhere:** Access your machine from anywhere and share access with others through the Brev UI under [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute). + * **Access your machine from anywhere:** Open the [Brev UI](https://brev.nvidia.com) and launch a session from [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute). + * **Share access with others:** Invite teammates to your DGX Station from the Brev UI: + * Go to the [Brev UI](https://brev.nvidia.com) and open [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute). + * Find your DGX Station in the list and open the row's three-dot (⋯) menu. + * Select **Share Access**. + * Enter the email address of the person you want to share with. + * Choose their role / permission level. + * Confirm to send the invitation. # Step 6. Cleanup @@ -124,8 +135,8 @@ spec: In the UI: * Go to the [Brev UI](https://brev.nvidia.com) * Navigate to the section listing “GPU Environments” and look under “Registered Compute” - * Click the “Deregister” menu item on the device you wish to delete from Brev - * Confirm your selection + * Click the “Remove” menu item on the device you wish to delete from Brev. + * Confirm your selection. 
@@ -136,8 +147,8 @@ spec: content: | | Symptom | Cause | Fix | |---------|-------|-----| - | Your DGX Station is showing up in the wrong org | It was registered to the wrong org | Run `brev set ` and then redo the registration process | - | Unable to `brev shell ` | Need to refresh | `brev refresh` | + | Your DGX Station is showing up in the wrong org | It was registered to the wrong org | Run `brev set ` and then redo the registration process. | + | Unable to `brev shell ` | Need to refresh | `brev refresh`. | diff --git a/nvidia/station-nanochat/endpoint-production.yaml b/nvidia/station-nanochat/endpoint-production.yaml index 0da04f6..03a1063 100644 --- a/nvidia/station-nanochat/endpoint-production.yaml +++ b/nvidia/station-nanochat/endpoint-production.yaml @@ -45,18 +45,16 @@ spec: content: | # Basic idea - This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, midtraining (conversation format), supervised fine-tuning (SFT), and inference via CLI or web UI. + This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI. - The project uses the PyTorch NGC container, FineWeb for pretraining, SmolTalk for SFT, and Weights & Biases for logging. The default speedrun configuration trains a 561M-parameter (d20) model suitable for learning and experimentation. + The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision. 
# What you'll accomplish - You will have a working nanochat setup that trains a small LLM and serves it for chat. - - - **Environment:** Docker image with PyTorch and nanochat dependencies on your DGX Station. - - **Training pipeline:** Tokenizer (65K BPE), pretraining (~11.2B tokens), midtraining, SFT, and automated report generation. - - **Inference:** ChatGPT-style web UI and CLI to chat with the base, mid, or SFT checkpoints. - - **Monitoring:** W&B dashboards and `nanochat/report.md` with metrics and samples. + - **Environment:** Docker image with PyTorch NGC and nanochat dependencies on your DGX Station. + - **Training pipeline:** BPE tokenizer (65K vocab), base model pretraining with FP8, SFT, and automated report generation. + - **Inference:** ChatGPT-style web UI and CLI to chat with the base or SFT checkpoints. + - **Monitoring:** W&B dashboards and `nanochat_cache/report/report.md` with metrics and samples. # What to know before starting @@ -68,36 +66,58 @@ spec: **Hardware:** - - NVIDIA DGX Station with GB300 Ultra Superchip. - - Sufficient GPU memory for the chosen model (the GB300 Ultra provides ample memory for the d20 speedrun). - - Adequate storage for cache (~24GB+ for FineWeb data and checkpoints). + - NVIDIA DGX Station with GB300 Ultra Superchip (288GB VRAM). + - Adequate storage for cache (~25GB+ for FineWeb data and checkpoints). **Software:** - - Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi` - - Network access to download datasets (Hugging Face, FineWeb) and container images. + - Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi` + - Network access to download datasets (Hugging Face, FineWeb) and container images (nvcr.io) - [Weights & Biases](https://wandb.ai/) account and API key. - [Hugging Face](https://huggingface.co/docs/hub/en/security-tokens) token for evaluation datasets. 
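The software prerequisites above can be checked in one pass before starting. The following is a minimal sketch, assuming a POSIX-compatible shell; `check_prereqs` is a hypothetical helper and is not part of `setup.sh` or `launch.sh`. It confirms only that the commands are on `PATH` and that the two variables are non-empty, not that the keys are valid or that the container runtime can see the GPU.

```bash
# Hypothetical preflight helper (not shipped with the playbook).
# Prints one line per missing prerequisite and returns non-zero if any is missing.
check_prereqs() {
  _missing=0
  # Commands used by the playbook on the host.
  for _cmd in docker nvidia-smi; do
    if ! command -v "$_cmd" >/dev/null 2>&1; then
      echo "missing command: $_cmd"
      _missing=1
    fi
  done
  # API keys the playbook expects to be exported in the current shell.
  for _var in WANDB_API_KEY HF_TOKEN; do
    eval "_val=\${$_var:-}"
    if [ -z "$_val" ]; then
      echo "unset variable: $_var"
      _missing=1
    fi
  done
  return "$_missing"
}
```

For example, `check_prereqs && echo "prerequisites look OK"` prints the confirmation only when nothing is reported as missing.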
+ # Model architecture (d24) + + ``` + Layers: 24 + Attention Heads: 12 + Head Dimension: 128 + Context Length: 2048 tokens + Vocabulary Size: 65,536 (2^16, trained BPE) + Precision: FP8 (e4m3, tensorwise scaling) + ``` + + # Training stages + + | Stage | Description | + |-------|-------------| + | Tokenizer | Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb | + | Base pretraining | Pretrains d24 model on FineWeb with FP8, target data:param ratio of 8 | + | SFT | Fine-tunes on synthetic identity conversations + SmolTalk | + | Report | Generates `report.md` with metrics, samples, and system info | + # Ancillary files - All required assets are in the playbook directory `nvidia/station-nanochat/assets` (see the [dgx-spark-playbooks](https://github.com/NVIDIA/dgx-spark-playbooks) repository). - - - `assets/Dockerfile` – PyTorch NGC image plus nanochat dependencies and venv. - - `assets/setup.sh` – Clones nanochat, checks out the supported commit, and builds the Docker image. - - `assets/launch.sh` – Runs the training container on your DGX Station (runs the full pipeline: tokenizer, pretrain, midtrain, SFT, and report generation). - - `assets/README.md` – Additional detail on training stages, inference, and troubleshooting. + All required assets are in `nvidia/station-nanochat/assets/`: + - `Dockerfile` – PyTorch NGC image with nanochat pip dependencies. + - `setup.sh` – Clones nanochat, checks out the supported commit, copies `speedrun_station.sh`, and builds the Docker image. + - `launch.sh` – Runs the training container (full pipeline: tokenizer → pretrain → SFT → report). + - `speedrun_station.sh` – Modified speedrun script adapted for single-GPU DGX Station. # Time & risk - - **Estimated time:** About 30 minutes for clone, setup, and launching the run. Full d20 speedrun training time depends on your DGX Station configuration (hours to a day or more). 
- - **Risk level:** Medium - - Large downloads (FineWeb) can fail or be slow; ensure stable network and disk space. - - API keys (W&B, HF) must be set or the launch script will exit. - - **Rollback:** Stop containers with `docker stop`, remove caches under `~/.cache/nanochat` (or paths in `launch.sh`), and run `docker system prune -a` if needed. - * **Last Updated:** 03/02/2026 - * First Publication + - **Estimated time:** ~30 minutes for setup. Full d24 training takes on the order of 16+ hours on a single GB300 Ultra. + - **Risk level:** Medium + - Large downloads (FineWeb) can be slow; ensure stable network and disk space. + - API keys (W&B, HF) must be set or `launch.sh` will exit immediately. + - **Rollback:** Stop containers with `docker stop`, remove caches, and run `docker system prune -a` if needed. + + # Credits + + - [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy + - [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) by HuggingFace (pretraining data) + - [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) by HuggingFace (SFT data) @@ -108,69 +128,86 @@ spec: content: | # Step 1. Prerequisites and environment - This playbook is for **DGX Station** (single node). Ensure your DGX Station has Docker with NVIDIA runtime, GPU access, and required API keys. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets. + Ensure your DGX Station has Docker with NVIDIA runtime and GPU access. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets. ```bash # Verify GPU and Docker nvidia-smi - docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi + docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi ``` - Expected output should show your GPU(s) and driver version. 
Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you do not have them. + Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you don't have them. Export both keys in your shell: ```bash export WANDB_API_KEY= export HF_TOKEN= ``` - # Step 2. Clone the playbook and set up nanochat + # Step 2. Clone and set up - Clone the playbook repository and run the setup script to clone the nanochat repo and build the Docker image. + Clone the playbook repository and navigate to the assets directory: ```bash git clone https://github.com/NVIDIA/dgx-spark-playbooks cd dgx-spark-playbooks/nvidia/station-nanochat/assets ``` - From the `assets` directory, run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, and builds the `nanochat` Docker image (PyTorch NGC base with tiktoken, tokenizers, datasets, wandb, etc.). + Run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, copies the station-adapted `speedrun_station.sh`, and builds the `nanochat` Docker image (PyTorch NGC base with dependencies): ```bash ./setup.sh ``` - Setup may take several minutes while the image builds. Verify the image: + You should see the `nanochat` image listed if you run `docker images`. Your directory structure after setup should look like this: - ```bash - docker images | grep nanochat + ``` + assets/ + ├── Dockerfile + ├── launch.sh + ├── setup.sh + ├── speedrun_station.sh + └── nanochat/ ``` - You should see the `nanochat` image listed. + # Step 3. Launch training - # Step 3. Launch full training - - > [!NOTE] - > The default `launch.sh` uses cache directories under `/nanochat_cache`. If that path does not exist on your DGX Station, edit `launch.sh` and replace those paths with your own (e.g. 
`$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`), and create the directories before running. - - To run **full** training (d20 model, 240 shards, midtraining, SFT, report) for higher-quality results, use the full launcher. On a DGX Station with GB300 Ultra this can take on the order of 16 hours: + Ensure your API keys are exported, then launch: ```bash - export WANDB_API_KEY= - export HF_TOKEN= - ./launch_full.sh + ./launch.sh ``` - This runs `speedrun_full.sh` inside the container: full FineWeb download (240 shards), 561M-parameter (d20) pretraining, midtraining, supervised fine-tuning, and report generation. + The training runs inside the `nanochat` container and executes the full pipeline automatically: - # Step 4. Verify and use the model + 1. **Tokenizer training** — downloads ~2B characters from FineWeb, trains a 65K BPE tokenizer + 2. **Base model pretraining** — downloads additional FineWeb shards, pretrains a d24 model (~1B params) with FP8 + 3. **SFT** — downloads synthetic identity conversations, fine-tunes for chat + 4. **Report generation** — produces `report.md` with metrics and samples - After training completes, checkpoints and the tokenizer are under `~/.cache/nanochat/` (or the cache path used in `launch.sh`). Run inference from the nanochat directory (e.g. `assets/nanochat`) on your DGX Station. + Training on a single GB300 Ultra takes on the order of 16+ hours for the full d24 run. + + # Step 4. Monitor training + + **W&B dashboard:** + + Track training at [wandb.ai](https://wandb.ai/) under the `nanochat` project. The exact link to the wandb run would be provided in the training logs. Key metrics: + - Training loss + - Validation BPB + - Throughput (tokens/sec) + + # Step 5. Inference + + After training, checkpoints are saved under the `nanochat_cache/` directory. 
Run inference from inside the container or interactively: **Web UI (recommended):** ```bash - cd nanochat - source ../.venv/bin/activate # if using venv from container context; otherwise use the container - python -m scripts.chat_web + docker run --rm --gpus all --net=host \ + -v $(pwd)/nanochat:/workspace/nanochat \ + -v $(pwd)/nanochat_cache:/root/.cache/nanochat \ + -w /workspace/nanochat \ + nanochat \ + python -m scripts.chat_web ``` Open a browser to `http://:8000` where `` is your DGX Station’s IP address. @@ -178,14 +215,15 @@ spec: **CLI:** ```bash - cd nanochat - python -m scripts.chat_cli -p "Why is the sky blue?" - python -m scripts.chat_cli -i sft -p "Write a haiku about machine learning" + docker run --rm -it --gpus all \ + -v $(pwd)/nanochat:/workspace/nanochat \ + -v $(pwd)/nanochat_cache:/root/.cache/nanochat \ + -w /workspace/nanochat \ + nanochat \ + python -m scripts.chat_cli -p "Why is the sky blue?" ``` - A full report is generated at `nanochat/report.md` after the run. You can also monitor training at [wandb.ai](https://wandb.ai/) under your project. - - # Step 5. Cleanup + # Step 6. Cleanup To stop training early, interrupt the launch script or stop the container: @@ -195,23 +233,32 @@ spec: ```bash # If launch.sh is running: press Ctrl+C - # Or stop the container by name + # Or stop the container directly docker stop $(docker ps -q --filter ancestor=nanochat) ``` - To free disk space after training (use the same path as your cache if you set `NANOCHAT_CACHE`): + To free disk space: ```bash rm -rf ./nanochat_cache ./hf_cache docker system prune -a ``` - # Step 6. Next steps and customization + # Step 7. Customization - - **Small scale run:** `./launch.sh` can run a lite training by following the customization guide to make changes to `speedrun_station.sh`. This can potentially bring down the training time. - - **Custom cache paths:** Set `NANOCHAT_CACHE` and `HF_CACHE` before launching (e.g. 
`export NANOCHAT_CACHE=/path/to/nanochat_cache`) if you want cache outside the assets directory. - - **Monitoring:** Use `nvidia-smi` and W&B dashboards to watch GPU utilization and training metrics (loss, throughput). - - **Inference:** Try the web UI and CLI with different checkpoints (`base`, `mid`, `sft`) and prompts; see sample prompts in `assets/README.md`. + **Smaller/faster run:** Edit `speedrun_station.sh` before running setup to reduce data and model size: + + ```bash + # Fewer data shards (10 instead of default) + python -m nanochat.dataset -n 10 & + + # Smaller model (d4 instead of d24), smaller batch size + python -m scripts.base_train --depth=4 --device-batch-size=32 + ``` + + **Batch size:** The default `--device-batch-size=64` is tuned for the GB300's 288GB VRAM. Feel free to change the batch size if utilization is low or the training OOMs. + + Then re-run `./setup.sh` to rebuild with the changes. @@ -221,14 +268,16 @@ spec: label: Troubleshooting content: | | Symptom | Cause | Fix | - |--------|--------|-----| - | `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=` and `export HF_TOKEN=` in the same shell, then run `./launch.sh`. | - | `RuntimeError: CUDA out of memory` | Batch size or model too large for GPU | In the training script in the cloned nanochat repo (e.g. `speedrun.sh`), reduce `--device_batch_size` (e.g. `16`, `8`, `4`, `2`, or `1`). | - | Docker container not starting or no GPU | Docker or NVIDIA runtime misconfigured | Run `nvidia-smi` on your DGX Station. Check no other containers hold GPUs: `docker ps`. Test GPU in Docker: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`. | - | `Permission denied` or `No such file or directory` for cache paths in `launch.sh` | Paths like `/home/scratch.lramesh_dpt/...` don’t exist on your system | Edit `launch.sh`: set cache dirs to paths you can create (e.g. `$(pwd)/nanochat_cache`, `$(pwd)/hf_cache`). 
Run `mkdir -p ` and re-run `launch.sh`. | - | `nanochat` image not found when running `launch.sh` | Setup not run or build failed | From `nvidia/nanochat/assets`, run `./setup.sh` and confirm with `docker images` (look for the `nanochat` image). | - | Training exits immediately or script doesn’t wait | Container fails early (missing keys, paths, or OOM) | Check container logs: `docker ps -a` then `docker logs `. Fix env vars, cache paths, or batch size as above. | - | Wrong cache path or "No such file" when launching | `launch.sh` uses non-existent paths (e.g. `/home/scratch...`) | On DGX Station, edit `launch.sh`: replace cache dirs with `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`, then run `mkdir -p nanochat_cache hf_cache`. | + |---------|-------|-----| + | `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=` and `export HF_TOKEN=` in the same shell, then re-run `./launch.sh` | + | `RuntimeError: CUDA out of memory` | Batch size too large for available VRAM | Edit `speedrun_station.sh`: reduce `--device-batch-size` (try 64, 32, 16, 8). Re-run `./setup.sh` then `./launch.sh` | + | Docker container exits immediately | Missing env vars, bad cache paths, or build failure | Check logs: `docker ps -a` then `docker logs `. Fix env vars or paths as needed | + | `nanochat` image not found | Setup not run or Docker build failed | From the `assets/` directory, run `./setup.sh` and confirm with `docker images \| grep nanochat` | + | `No such file or directory` for cache paths | Cache directories don't exist | `launch.sh` creates them automatically under `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`. If using custom paths, create them: `mkdir -p $NANOCHAT_CACHE $HF_CACHE` | + | Training hangs at "Waiting for dataset download" | Network issue downloading FineWeb shards | Check network connectivity. The download can take time depending on bandwidth. 
If it persists, restart `./launch.sh` | + | W&B shows wrong user / stale login | Cached W&B credentials in container volume | `speedrun_station.sh` runs `wandb login --relogin` with your key automatically. Ensure `WANDB_API_KEY` is correct | + | Container runs but `launch.sh` says "Training complete!" immediately | Container failed fast and exited before the poll loop detected it | Check `docker ps -a` for the exited container and inspect logs with `docker logs ` | + | GPU not visible inside container | Docker NVIDIA runtime not configured | Test: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`. If it fails, install/configure [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) | diff --git a/nvidia/station-nemoclaw/README.md b/nvidia/station-nemoclaw/README.md index 19451e1..384cc45 100644 --- a/nvidia/station-nemoclaw/README.md +++ b/nvidia/station-nemoclaw/README.md @@ -1,13 +1,11 @@ -# NemoClaw with Nemotron-3-Super and vLLM on DGX Station +# Run NemoClaw with a Local LLM -> Install NemoClaw on DGX Station with local vLLM inference and Telegram bot integration +> Build your first local AI assistant on DGX Station using NemoClaw in a secure sandbox, with optional Telegram. ## Table of Contents - [Overview](#overview) - - [Overview](#overview) - - [Basic idea](#basic-idea) - [What you'll accomplish](#what-youll-accomplish) - [Notice and disclaimers](#notice-and-disclaimers) - [Isolation layers (OpenShell)](#isolation-layers-openshell) @@ -17,40 +15,33 @@ - [Ancillary files](#ancillary-files) - [Time and risk](#time-and-risk) - [Instructions](#instructions) - - [Step 1. Configure Docker and the NVIDIA container runtime](#step-1-configure-docker-and-the-nvidia-container-runtime) - - [Step 2. Pull the Nemotron-3-Super model](#step-2-pull-the-nemotron-3-super-model) - - [Step 3. Start the vLLM inference server](#step-3-start-the-vllm-inference-server) - - [Step 4. 
Install NemoClaw](#step-4-install-nemoclaw) - - [Step 5. Connect to the sandbox and verify inference](#step-5-connect-to-the-sandbox-and-verify-inference) - - [Step 6. Talk to the agent (CLI)](#step-6-talk-to-the-agent-cli) - - [Step 7. Interactive TUI](#step-7-interactive-tui) - - [Step 8. Exit the sandbox and access the Web UI](#step-8-exit-the-sandbox-and-access-the-web-ui) - - [Step 9. Create a Telegram bot](#step-9-create-a-telegram-bot) - - [Step 10. Enable Telegram (first time or after skipping it)](#step-10-enable-telegram-first-time-or-after-skipping-it) - - [Step 11. Stop services](#step-11-stop-services) - - [Step 12. Uninstall NemoClaw](#step-12-uninstall-nemoclaw) + - [Step 1. Install NemoClaw](#step-1-install-nemoclaw) + - [Step 2. NemoClaw Onboarding](#step-2-nemoclaw-onboarding) + - [Step 3. Interact with OpenClaw](#step-3-interact-with-openclaw) + - [Step 4. Enable Brave Search in sandbox](#step-4-enable-brave-search-in-sandbox) + - [Step 5. Set up Messaging Channel (Telegram Bot as an example)](#step-5-set-up-messaging-channel-telegram-bot-as-an-example) + - [Step 6. Set Up NemoClaw Agents](#step-6-set-up-nemoclaw-agents) + - [Step 7. Stop services](#step-7-stop-services) + - [Step 8. Uninstall NemoClaw](#step-8-uninstall-nemoclaw) - [Troubleshooting](#troubleshooting) --- ## Overview -### Overview +## Basic idea -### Basic idea +**NVIDIA NemoClaw** is an open-source reference stack that simplifies running OpenClaw always-on assistants more safely. It installs the **NVIDIA OpenShell** runtime — an environment designed for executing agents with additional security — and connects them to local inference on your DGX Station. A single installer command (`nemoclaw.sh`) handles Node.js, OpenShell, and the NemoClaw CLI; the **onboard** wizard then creates a sandboxed agent, optional **Brave Search**, optional **messaging channels** (Telegram, Discord, or Slack), and a **policy tier** with network presets. 
-**NVIDIA NemoClaw** is an open-source reference stack that simplifies running OpenClaw always-on assistants more safely. It installs the **NVIDIA OpenShell** runtime -- an environment designed for executing agents with additional security -- and open-source models like NVIDIA Nemotron. A single installer command handles Node.js, OpenShell, and the NemoClaw CLI, then walks you through an onboard wizard to create a sandboxed agent on your DGX Station using vLLM with Nemotron 3 Super. - -By the end of this playbook you will have a working AI agent inside an OpenShell sandbox, accessible via a web dashboard and a Telegram bot, with inference routed to a local Nemotron 3 Super 120B model served by vLLM on your DGX Station -- all without exposing your host filesystem or network to the agent. +By the end of this playbook you will have a working AI agent inside an OpenShell sandbox, reachable through the **Web UI** or the **terminal UI**, with inference routed to a local model on the DGX Station. You can optionally add **Telegram** (with **cloudflared** for a public webhook URL) and **web search**, all without exposing your host filesystem or network beyond what you explicitly allow in policy.
### What you'll accomplish -- Configure Docker and the NVIDIA container runtime for OpenShell on DGX Station -- Pull Nemotron 3 Super 120B (NVFP4) from Hugging Face and serve it with vLLM -- Install NemoClaw with a single command (handles Node.js, OpenShell, and the CLI) -- Run the onboard wizard to create a sandbox and configure local vLLM inference -- Chat with the agent via the CLI, TUI, and web UI -- Set up a Telegram bot that forwards messages to your sandboxed agent +- Install **NemoClaw** with one command (`nemoclaw.sh`), which pulls Node.js, OpenShell, and the CLI as needed +- Walk through the `nemoclaw onboard` wizard with recommended settings +- Open the **Web UI** to interact with the agent +- Optionally enable **Brave Search** or **Telegram** after onboarding +- **Clean up and uninstall** with the documented `uninstall.sh` flags when finished ### Notice and disclaimers @@ -64,14 +55,14 @@ By installing this demo, you accept responsibility for all third-party component #### What you're getting -This experience is provided "AS IS" for demonstration purposes only -- no warranties, no guarantees. This is a demo, not a production-ready solution. You will need to implement appropriate security controls for your environment and use case. +This experience is provided "AS IS" for demonstration purposes only — no warranties, no guarantees. This is a demo, not a production-ready solution. You will need to implement appropriate security controls for your environment and use case. #### Key risks with AI agents -- **Data leakage** -- Any materials the agent accesses could be exposed, leaked, or stolen. -- **Malicious code execution** -- The agent or its connected tools could expose your system to malicious code or cyber-attacks. -- **Unintended actions** -- The agent might modify or delete files, send messages, or access services without explicit approval.
-- **Prompt injection and manipulation** -- External inputs or connected content could hijack the agent's behavior in unexpected ways. +- **Data leakage** — Any materials the agent accesses could be exposed, leaked, or stolen. +- **Malicious code execution** — The agent or its connected tools could expose your system to malicious code or cyber-attacks. +- **Unintended actions** — The agent might modify or delete files, send messages, or access services without explicit approval. +- **Prompt injection and manipulation** — External inputs or connected content could hijack the agent's behavior in unexpected ways. #### Participant acknowledgement @@ -81,23 +72,22 @@ By participating in this demo, you acknowledge that you are solely responsible f | Layer | What it protects | When it applies | |------------|----------------------------------------------------|-----------------------------| -| Filesystem | Prevents reads/writes outside allowed paths. | Locked at sandbox creation. | +| Filesystem | Prevents reads/writes outside allowed paths. | Locked at sandbox creation. | | Network | Blocks unauthorized outbound connections. | Hot-reloadable at runtime. | -| Process | Blocks privilege escalation and dangerous syscalls.| Locked at sandbox creation. | +| Process | Blocks privilege escalation and dangerous syscalls.| Locked at sandbox creation. | | Inference | Reroutes model API calls to controlled backends. | Hot-reloadable at runtime. 
| ### What to know before starting - Basic use of the Linux terminal and SSH -- Familiarity with Docker (permissions, `docker run`) +- Familiarity with Docker (permissions, `docker run`, optional `docker` group membership) - Awareness of the security and risk sections above ### Prerequisites -**Hardware and access:** +**Hardware:** - A DGX Station (GB300) with keyboard and monitor, or SSH access -- A **Telegram bot token** from [@BotFather](https://t.me/BotFather) (create one with `/newbot`) -- optional, for Phase 3 **Software:** @@ -109,16 +99,16 @@ Verify your system before starting: head -n 2 /etc/os-release nvidia-smi docker info --format '{{.ServerVersion}}' -df -h / /var/lib/docker 2>/dev/null | head -20 ``` -Expected: Ubuntu 24.04, NVIDIA GB300 GPU(s), Docker 28.x+, and **enough free disk** for Docker layers, the NemoClaw sandbox image, and Hugging Face cache (treat **~40 GB free** on the Docker data filesystem as a practical minimum; very low free space can surface as cryptic onboard errors such as “K8s namespace not ready”). +Expected: Ubuntu 24.04, NVIDIA GB300 GPU, Docker 28.x+. ### Have ready before you begin -| Item | Where to get it | -|------|----------------| -| Telegram bot token (optional) | [@BotFather](https://t.me/BotFather) on Telegram -- create with `/newbot` | +| Item | When you need it | +|------|------------------| +| **Telegram bot token** (optional) | Create with [@BotFather](https://t.me/BotFather) (`/newbot`). You can paste it during **onboarding** (Step 2) **or** when you run **`nemoclaw channels add telegram`** later. | +| **Brave Search API key** (optional) | From [Brave Search API](https://brave.com/search/api/) if you enable web search during onboarding or via **`nemoclaw onboard --fresh --gpu`** (`--fresh` re-prompts every onboarding question, including features you previously skipped; without `--fresh` the wizard resumes the previous session and will not re-prompt).
| ### Ancillary files @@ -126,362 +116,118 @@ All required assets are handled by the NemoClaw installer. No manual cloning is ### Time and risk -- **Estimated time:** 20--30 minutes (with model already downloaded). First-time model download adds ~10--20 minutes depending on network speed. -- **Risk level:** Medium -- you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts. -- **Last Updated:** 04/27/2026 - * First publication for DGX Station with vLLM +- **Estimated time:** About 30–60 minutes for a first full pass (install, onboard, model download depending on choice and network). Optional Brave, Telegram, and cloudflared steps add time if you do them in a second session. +- **Risk level:** Medium — you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts. +- **Last Updated:** 05/29/2026 + - Update to latest nemoclaw installer instructions ## Instructions -## Phase 1: Prerequisites +## Phase 1: Install and Run NemoClaw -These steps prepare a fresh DGX Station for NemoClaw. If Docker, the NVIDIA runtime, and vLLM are already configured, skip to Phase 2. +### Step 1. Install NemoClaw -> [!IMPORTANT] -> **Disk space:** NemoClaw’s onboard flow pulls a multi-gigabyte sandbox image and runs Docker, k3s, and the gateway together. If root or Docker’s data disk is nearly full (for example only a few gigabytes free), onboarding can fail with generic errors such as **“K8s namespace not ready”** with no clear hint about storage. Before you start, check free space: `df -h / /var/lib/docker`. NVIDIA recommends **at least 40 GB free** on the filesystem that holds Docker layers (often `/` or `/var/lib/docker`); treat **under ~15 GB** as high risk for first-time onboard failures. - -### Step 1. 
Configure Docker and the NVIDIA container runtime - -OpenShell's gateway runs k3s inside Docker. On DGX Station (Ubuntu 24.04, cgroup v2), Docker must be configured with the NVIDIA runtime and host cgroup namespace mode. - -Configure the NVIDIA container runtime for Docker: +This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the pinned NemoClaw **v0.55** release (set via `NEMOCLAW_VERSION`; v0.55 is the version the NemoClaw team currently recommends as the most stable), builds the CLI, and runs the onboard wizard to create a sandbox. ```bash -sudo nvidia-ctk runtime configure --runtime=docker +curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_VERSION=v0.55 bash ``` -Expected: +The installation wizard walks you through setup: -```text -INFO Loading config from /etc/docker/daemon.json -INFO Wrote updated config to /etc/docker/daemon.json -INFO It is recommended that docker daemon be restarted. -``` +1. **Accept NemoClaw license** -- Confirm by entering `yes` +2. **Run express install** -- Confirm by entering `Y` -Set the cgroup namespace mode required by OpenShell on DGX Station: +The installer requires **Node.js 22.16+** (installed automatically if missing). It walks you through Node.js, NemoClaw CLI and Onboarding phases. See more details of Onboarding configuration in the next step. 
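The Node.js 22.16+ requirement mentioned above can be checked by hand before running the installer. The following is a minimal sketch, not part of the NemoClaw installer: `version_at_least` is a hypothetical helper name, and the comparison assumes GNU `sort -V` is available (as it is on Ubuntu).

```shell
#!/usr/bin/env bash
# Sketch only: version_at_least is a hypothetical helper, not a NemoClaw command.
# Returns success when version $1 (optionally prefixed with "v") is >= version $2.
version_at_least() {
  local have="${1#v}" want="$2"
  # sort -V orders version strings; if the smaller of the two is $want, then have >= want.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

if command -v node >/dev/null 2>&1; then
  if version_at_least "$(node --version)" "22.16.0"; then
    echo "Node.js $(node --version) meets the 22.16+ requirement"
  else
    echo "Node.js $(node --version) is older than 22.16"
  fi
else
  echo "Node.js not found on PATH"
fi
```

This only reports what is on the host; it does not change anything, so it is safe to run before or after the installer.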
-```bash -sudo python3 -c " -import json, os -path = '/etc/docker/daemon.json' -d = json.load(open(path)) if os.path.exists(path) else {} -d['default-cgroupns-mode'] = 'host' -json.dump(d, open(path, 'w'), indent=2) -" -``` - -Restart Docker: - -```bash -sudo systemctl restart docker -``` - -Verify the NVIDIA runtime works: - -```bash -docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi -``` - -Expected: - -```text -+-----------------------------------------------------------------------------------------+ -| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 | -+-----------------------------------------+------------------------+----------------------+ -| 0 NVIDIA GB300 On | 00000009:06:00.0 Off | 0 | -| N/A 46C P0 215W / 1300W | 18661MiB / 256703MiB | 0% Default | -+-----------------------------------------+------------------------+----------------------+ -``` - -If you get a permission denied error on `docker`, add your user to the Docker group and activate the new group in your current session: - -```bash -sudo usermod -aG docker $USER -newgrp docker -``` - -This applies the group change immediately. Alternatively, you can log out and back in instead of running `newgrp docker`. +### Step 2. NemoClaw Onboarding > [!NOTE] -> DGX Station uses cgroup v2. OpenShell's gateway embeds k3s inside Docker and needs host cgroup namespace access. Without `default-cgroupns-mode: host`, the gateway can fail with "Failed to start ContainerManager" errors. +> If you chose **express install** in Step 1, all settings are auto-configured with recommended defaults. Skip to Step 3. -### Step 2. Pull the Nemotron-3-Super model +During custom setup, the onboard wizard walks you through: -Install pip and the Hugging Face CLI (if not already installed): +1. **Configuring inference** -- Choose to set up local inference on your DGX Station by selecting **`7) Local Ollama`**. +2. **Ollama models** -- Choose desired inference model. 
If no model is present locally, the installer will provide options to download models to start. +3. **Sandbox name** -- Pick a name (e.g. my-assistant). Each sandbox requires a unique name. +4. **Apply this configuration** -- Enter `Y` to confirm setting up local inference. +5. **Enable Brave Web Search** -- Optional. If you enable it, paste a [Brave Search API](https://brave.com/search/api/) key when prompted. +6. **Messaging channels** -- Optional. If you enable it, choose your desired bot (`telegram`, `discord` or `slack`) and paste your bot token when prompted. +7. **Policy presets** -- Choose desired Policy tier (`Balanced` recommended) and accept/edit the suggested presets when prompted (confirm with **Enter**). -```bash -sudo apt install -y python3-pip -pip3 install --break-system-packages huggingface-hub -``` - -Download Nemotron 3 Super 120B in NVFP4 quantization (~60 GB; may take 10--20 minutes depending on network speed): - -```bash -hf download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 -``` - -Expected (on a fresh download; cached downloads complete instantly): - -```text -Fetching 36 files: 100%|██████████| 36/36 [15:42<00:00, 26.18s/it] -/home/nvidia/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/snapshots/0d6fa3ecad422a... -``` - -Verify the download completed: - -```bash -ls ~/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/ -``` - -Expected: - -```text -blobs refs snapshots -``` - -> [!NOTE] -> The NVFP4 quantization is chosen because it fits entirely in **one** GB300 GPU’s 256 GB HBM3e with room for KV cache. On a **two-GPU** station you can still use NVFP4 with `--tensor-parallel-size 1` and a single visible GPU, or shard with `--tensor-parallel-size 2`. For other quantization variants, see [Troubleshooting](troubleshooting.md). - -### Step 3. Start the vLLM inference server - -Launch vLLM using the NVIDIA-optimized container image. 
- -**Single GPU (default on one-GPU systems, or pin to one GPU on multi-GPU stations):** vLLM can emit **mixed device** warnings if several GPUs are visible but the model is only meant to use one. Pinning avoids accidentally placing weights on an unexpected device. - -```bash -docker run -d --name vllm-nemotron \ - --runtime nvidia --gpus '"device=0"' \ - -e CUDA_VISIBLE_DEVICES=0 \ - -v ~/.cache/huggingface:/root/.cache/huggingface \ - -p 8000:8000 \ - --restart unless-stopped \ - nvcr.io/nvidia/vllm:26.03-py3 \ - python3 -m vllm.entrypoints.openai.api_server \ - --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \ - --host 0.0.0.0 \ - --port 8000 \ - --tensor-parallel-size 1 \ - --trust-remote-code \ - --max-model-len 32768 \ - --enable-auto-tool-choice \ - --tool-call-parser qwen3_xml \ - --reasoning-parser nemotron_v3 -``` - -**Two GPUs (tensor parallel):** If your DGX Station has two Blackwell GPUs and you want Nemotron sharded across both, use both devices and set tensor parallel size to `2` (VRAM is summed across the GPUs): - -```bash -docker run -d --name vllm-nemotron \ - --runtime nvidia --gpus all \ - -e CUDA_VISIBLE_DEVICES=0,1 \ - -v ~/.cache/huggingface:/root/.cache/huggingface \ - -p 8000:8000 \ - --restart unless-stopped \ - nvcr.io/nvidia/vllm:26.03-py3 \ - python3 -m vllm.entrypoints.openai.api_server \ - --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \ - --host 0.0.0.0 \ - --port 8000 \ - --tensor-parallel-size 2 \ - --trust-remote-code \ - --max-model-len 32768 \ - --enable-auto-tool-choice \ - --tool-call-parser qwen3_xml \ - --reasoning-parser nemotron_v3 -``` - -**Pick a GPU index by name (optional one-liner):** To print the device index of the first GPU whose name contains `GB300` (adjust the pattern if your `nvidia-smi` name string differs), run on the host: - -```bash -nvidia-smi --query-gpu=index,name --format=csv,noheader | awk -F', ' '/GB300/ { gsub(/^ +/,"",$1); print $1; exit }' -``` - -Use that index in Docker as `--gpus 
'"device=N"'` (replace `N` with the printed index). - -> [!NOTE] -> **`--tool-call-parser qwen3_xml`:** Nemotron’s tool-call wire format is exposed through vLLM’s **Qwen3-compatible XML tool parser** — the name refers to the parser implementation, not the base model. This pairing is what vLLM expects for correct function/tool calling with this checkpoint. - -The first startup loads ~70 GB of weights into GPU memory. Watch the logs until you see the model is ready: - -```bash -docker logs -f vllm-nemotron -``` - -Wait until you see the following in the logs (typically 3--5 minutes): - -```text -INFO Loading weights took 55.47 seconds -INFO Model loading took 69.39 GiB memory and 71.31 seconds -INFO: Started server process [1] -INFO: Waiting for application startup. -INFO: Application startup complete. -``` - -Then verify the API is responding: - -```bash -curl -s http://localhost:8000/v1/models -``` - -Expected: - -```json -{"object":"list","data":[{"id":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","object":"model",...}]} -``` - -Send a test request to warm up the model before proceeding to Step 4. The first inference request compiles CUDA graphs and can take 30--90 seconds: - -```bash -curl -s --max-time 120 http://localhost:8000/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{"model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","messages":[{"role":"user","content":"Say hello."}],"max_tokens":10}' -``` - -Expected (the first request may take 30--90 seconds; subsequent requests are much faster): - -```json -{"id":"chatcmpl-...","object":"chat.completion","model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","choices":[{"index":0,"message":{"role":"assistant","content":"..."},"finish_reason":"length"}],...} -``` - -> [!IMPORTANT] -> Warm up the model before running the NemoClaw installer. The onboard wizard validates the vLLM endpoint with a short timeout. 
If the model has not served at least one request, this validation will time out and the install will fail. - -> [!IMPORTANT] -> Always start vLLM via the Docker container -- do not run `vllm serve` directly on the host. The NVIDIA container image (`nvcr.io/nvidia/vllm:26.03-py3`) includes optimized kernels for the GB300's Blackwell architecture that are not available in the pip-installed version. - -> [!NOTE] -> Key flags explained: -> - `--tensor-parallel-size` -- `1` for a single visible GPU; `2` when you expose two GPUs for tensor-parallel sharding (see Step 3). -> - `--trust-remote-code` -- required for the Mamba2-Transformer hybrid architecture -> - `--max-model-len 32768` -- maximum context length (increase up to 1M if VRAM allows) -> - `--enable-auto-tool-choice --tool-call-parser qwen3_xml` -- enables function/tool calling for the agent (see the note above on the parser name). -> - `--reasoning-parser nemotron_v3` -- separates chain-of-thought reasoning from the response so the TUI/Web UI can display them cleanly - ---- - -## Phase 2: Install and Run NemoClaw - -### Step 4. Install NemoClaw - -The installer script installs Node.js (if needed), OpenShell, the NemoClaw CLI, and runs onboarding to create a sandbox. The vLLM provider requires the **experimental** flag and an **extended inference timeout** (the default 15-second validation timeout is too short for a 120B model). - -#### Recommended: non-interactive install (copy-paste friendly) - -This path is best for SSH sessions, automation, and documentation — no arrow-key TUI in the terminal. 
- -```bash -NEMOCLAW_EXPERIMENTAL=1 \ -NEMOCLAW_NON_INTERACTIVE=1 \ -NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \ -NEMOCLAW_SANDBOX_NAME=my-assistant \ -NEMOCLAW_PROVIDER=vllm \ -NEMOCLAW_MODEL="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4" \ -NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 \ -bash -c "$(curl -fsSL https://www.nvidia.com/nemoclaw.sh)" -``` - -Optional: include **Telegram** in the first onboard without typing the token over SSH — export credentials on the host **before** running the installer (same variables the [NemoClaw Telegram bridge guide](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html) documents): - -```bash -export TELEGRAM_BOT_TOKEN='' -## Optional DM allowlist (comma-separated Telegram user IDs): -## export TELEGRAM_ALLOWED_IDS='123456789,987654321' -``` - -Use [Telegram Desktop](https://desktop.telegram.org/) or [web.telegram.org](https://web.telegram.org/) on a laptop to copy the token from [@BotFather](https://t.me/BotFather) and paste into your SSH session (or into a small env file you `source`). Typing a 46+ character token on a phone keyboard into a remote shell is error-prone. - -To **persist** `TELEGRAM_BOT_TOKEN` across reboots, keep it in a root-owned or user-only file and source it from your shell profile (example — adjust path and permissions): - -```bash -install -m 600 /dev/null ~/.nemoclaw/telegram.env -nano ~/.nemoclaw/telegram.env # add: export TELEGRAM_BOT_TOKEN='...' -grep -q 'nemoclaw/telegram.env' ~/.bashrc || echo 'source ~/.nemoclaw/telegram.env 2>/dev/null' >> ~/.bashrc -``` - -NemoClaw also stores messaging credentials in its credential store when you onboard or run `nemoclaw … channels add telegram`; the file above is mainly for **re-running scripts** or **non-interactive** flows that read the environment. 
- -#### Alternative: interactive installer - -If you prefer the wizard: - -```bash -NEMOCLAW_EXPERIMENTAL=1 \ -NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 \ -bash -c "$(curl -fsSL https://www.nvidia.com/nemoclaw.sh)" -``` - -The wizard asks **six** high-level prompts (third-party notice, inference provider, Brave Search, messaging channels, sandbox name, policy presets). In parallel, the installer prints **eight** numbered onboard sub-phases, `[1/8]` … `[8/8]` (preflight, gateway, inference detection, inference route, messaging channels, sandbox creation, OpenClaw inside sandbox, policy presets). **Those two numberings are different on purpose** — the `[n/8]` lines are internal progress steps; the numbered list above is what you answer in the TUI. - -1. **Third-party software notice** -- Type `yes` to accept and continue. -2. **Inference provider** -- The wizard detects vLLM running locally. Select option **8** (`Local vLLM [experimental] — running`). -3. **Brave Web Search** -- Optional. Type `skip` if you don't have a Brave Search API key. -4. **Messaging channels** -- Optional. Press **Enter** to skip, or toggle Telegram/Discord/Slack if desired (this is the step that corresponds to onboard phase **[5/8]** in the log). -5. **Sandbox name** -- Pick a name (e.g. `my-assistant`). Names must be lowercase alphanumeric with hyphens only. -6. **Policy presets** -- Use arrow keys to toggle presets. `pypi` and `npm` are selected by default. Press **Enter** to confirm. - -The install takes approximately 3 minutes. Example milestones in the output (wording may vary slightly by release): - -```text -[1/3] Node.js - Node.js found: v22.22.2 - -[2/3] NemoClaw CLI - Installing NemoClaw from GitHub... 
- Verified: nemoclaw is available at /home/nvidia/.local/bin/nemoclaw - -[3/3] Onboarding - [1/8] Preflight checks - ✓ Docker is running - ✓ NVIDIA GPU detected: 2 GPU(s), 256703 MB VRAM # example on a two-GPU system - [2/8] Starting OpenShell gateway - ✓ Gateway is healthy - [3/8] Configuring inference (NIM) - ✓ Using existing vLLM on localhost:8000 - Detected model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 - [4/8] Setting up inference provider - ✓ Inference route set: vllm-local / nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 - [5/8] Messaging channels - (example) Telegram disabled — skipped -# # or: Telegram enabled; token stored in credential store - [6/8] Creating sandbox - ✓ Sandbox 'my-assistant' created - [7/8] Setting up OpenClaw inside sandbox - ✓ OpenClaw gateway launched inside sandbox - [8/8] Policy presets - Applied preset: pypi - Applied preset: npm -``` - -When complete you will see: +When complete you will see output like: ```text ────────────────────────────────────────────────── Sandbox my-assistant (Landlock + seccomp + netns) -Model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (Local vLLM) +Model (Local Ollama) ────────────────────────────────────────────────── Run: nemoclaw my-assistant connect Status: nemoclaw my-assistant status Logs: nemoclaw my-assistant logs --follow - -OpenClaw UI (tokenized URL; treat it like a password) -http://127.0.0.1:18789/#token= ────────────────────────────────────────────────── ``` -> [!IMPORTANT] -> Save the tokenized Web UI URL printed at the end -- you will need it in Step 8. It looks like: -> `http://127.0.0.1:18789/#token=` +> [!NOTE] +> - If `nemoclaw` is not found after install, run `source ~/.bashrc` to reload your shell path. +> - Time to finish **Onboarding** can vary, depending on the model choice and internet speed. + +NemoClaw Onboarding can be run repeatedly to create multiple sandboxes for independent use cases.
Use `--name ` to create an additional sandbox alongside any existing ones: + +```bash +nemoclaw onboard --gpu --name +``` > [!IMPORTANT] -> `NEMOCLAW_EXPERIMENTAL=1` is required for the vLLM provider. Without it, the installer will report "Requested provider 'vllm' is not available in this environment." +> Use `--name ` to create an additional sandbox without affecting existing ones. The `--fresh` flag is a destructive option reserved for starting a completely new onboard session — if a sandbox with the same name already exists, `--fresh` will **destroy and recreate it**. Only use `--fresh` when you intend to wipe and re-onboard (see Step 4 for an example where re-prompting is required). + +### Step 3. Interact with OpenClaw + +There are two ways to interact with OpenClaw: the Web UI or the terminal UI. + +#### Option 1. Web UI + +Get the full dashboard URL (includes the auto-assigned port and token): + +```bash +nemoclaw my-assistant dashboard-url --quiet +``` + +This prints a URL like `http://127.0.0.1:18790/#token=`. The port is auto-assigned (commonly 18789 or 18790) and may differ between installs. + +**If accessing the Web UI directly on the DGX Station** (keyboard and monitor attached), open the dashboard URL in a browser. + +**If accessing the Web UI from a remote machine**, you need to set up an SSH tunnel. + +First, note the port number from the dashboard URL above (e.g. `18790`). + +Find your DGX Station's IP address: + +```bash +hostname -I | awk '{print $1}' +``` + +This prints the primary IP address (e.g. `192.168.1.42`). You can also find it in **Settings > Wi-Fi** or **Settings > Network** on the DGX Station's desktop, or check your router's connected-devices list. + +From your remote machine, create an SSH tunnel using the port from above (replace `` and ``): + +```bash +ssh -L :127.0.0.1: @ +``` + +Now open the dashboard URL in your remote machine's browser.
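As a rough illustration of the tunnel setup described above, the port can be pulled out of the dashboard URL with plain Bash parameter expansion and reused on both sides of `ssh -L`. Everything concrete here is an assumption for the example: the token `EXAMPLE`, the user `nvidia`, the address `192.168.1.42`, and the helper name `port_from_url` are all hypothetical; substitute the values from your own system.

```shell
#!/usr/bin/env bash
# Hypothetical values for illustration only; replace with your own.
dashboard_url='http://127.0.0.1:18790/#token=EXAMPLE'
station_user='nvidia'
station_ip='192.168.1.42'

# port_from_url is a hypothetical helper: it extracts the port from a URL
# of the form scheme://host:port/...
port_from_url() {
  local hostport="${1#*://}"   # drop the scheme ("http://")
  hostport="${hostport%%/*}"   # drop everything from the first "/" onward
  echo "${hostport##*:}"       # keep what follows the last ":"
}

port="$(port_from_url "$dashboard_url")"

# Print (rather than run) the tunnel command so it can be reviewed first.
echo "ssh -L ${port}:127.0.0.1:${port} ${station_user}@${station_ip}"
```

Using the same port number locally and remotely keeps the dashboard URL unchanged on the remote machine, so the `127.0.0.1` form of the URL can be pasted into the browser as printed.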
> [!IMPORTANT] -> `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300` extends the validation timeout from the default 15 seconds to 300 seconds. Without this, the endpoint validation will fail on a cold 120B model, even if you warmed it up in Step 3 -- the installer sends its own test prompt which may be slower. +> Use `127.0.0.1`, not `localhost` -- the gateway origin check requires an exact match. > [!NOTE] -> If `nemoclaw` is not found after install, run `source ~/.bashrc` to reload your shell path. +> If the Web UI fails to load and the port forward may be stale, get the port from `nemoclaw my-assistant dashboard-url --quiet` and reset: +> ```bash +> openshell forward stop my-assistant || true +> openshell forward start my-assistant --background +> ``` -### Step 5. Connect to the sandbox and verify inference +#### Option 2. Terminal UI Connect to the sandbox: @@ -489,207 +235,158 @@ Connect to the sandbox: nemoclaw my-assistant connect ``` -Expected: - -```text -sandbox@my-assistant:~$ -``` - -You are now inside the sandboxed environment. Verify that the inference route is working: - -```bash -curl -sf https://inference.local/v1/models -``` - -Expected: - -```json -{"object":"list","data":[{"id":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","object":"model",...}]} -``` - -### Step 6. Talk to the agent (CLI) - -Still inside the sandbox, send a test message **through the OpenClaw gateway** (the default path). The `--local` flag is **intentionally blocked** inside the NemoClaw OpenShell sandbox — it would bypass gateway controls — so the command you may see in generic OpenClaw quickstarts will fail here. - -```bash -openclaw agent --agent main -m "hello" --session-id test -``` - -Expected (the agent will think, then respond -- first response may take 30--90 seconds): streaming or printed assistant text ending with a normal reply. - -If you see a response from the agent, inference is working end-to-end. - -### Step 7. 
Interactive TUI - -Launch the terminal UI for an interactive chat session: +Then launch the terminal UI inside the sandbox: ```bash openclaw tui ``` -Press **Ctrl+C** to exit the TUI. +You can start chatting with OpenClaw. Press **Ctrl+C** to exit the terminal UI. -### Step 8. Exit the sandbox and access the Web UI - -Exit the sandbox to return to the host: +To exit the sandbox: ```bash exit ``` -**If accessing the Web UI directly on the DGX Station** (keyboard and monitor attached), open a browser and navigate to the tokenized URL from Step 4. Prefer **`127.0.0.1`** in the URL bar (not `localhost`) so it matches strict gateway origin checks: - -```text -http://127.0.0.1:18789/#token= -``` - -**If accessing the Web UI from a remote machine**, you need to set up port forwarding. - -First, find your DGX Station's IP address. On the Station, run: - -```bash -hostname -I | awk '{print $1}' -``` - -Start the port forward on the DGX Station host: - -```bash -openshell forward start 18789 my-assistant --background -``` - -Expected: - -```text -Forwarding 127.0.0.1:18789 -> my-assistant:18789 (background) -``` - -If the forward was already started during onboarding, you will see: - -```text -Error: Port 18789 is already forwarded to sandbox 'my-assistant'. -``` - -This is fine -- the forward is already running. - -Then from your remote machine, create an SSH tunnel to the Station (replace `` with the IP address from above): - -```bash -ssh -L 18789:127.0.0.1:18789 @ -``` - -Now open the tokenized URL in your remote machine's browser. Either of these usually works on the **client** side because both bind to your loopback through the tunnel: - -```text -http://127.0.0.1:18789/#token= -``` - -> [!IMPORTANT] -> Use `127.0.0.1`, not `localhost` -- the gateway origin check requires an exact match. 
- --- -## Phase 3: Telegram Bot +## Phase 2: Modify NemoClaw Policy -Messaging (Telegram, Discord, Slack) is **wired during onboarding** — credentials are stored, OpenShell providers are created, and channel configuration is **baked into the sandbox image**. Runtime config under `/sandbox/.openclaw/` is not safely patchable from inside the running sandbox. +### Step 4. Enable Brave Search in sandbox -**`nemoclaw start` does not start the Telegram bridge.** In current NemoClaw releases it starts **optional host services** such as the **cloudflared** tunnel when installed; Telegram delivery stays under OpenShell. See [NemoClaw commands](https://docs.nvidia.com/nemoclaw/latest/reference/commands.html) and [Set up Telegram bridge](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html). - -### Step 9. Create a Telegram bot - -Open Telegram, find [@BotFather](https://t.me/BotFather), send `/newbot`, and follow the prompts. Copy the bot token. - -**Tip:** Use [Telegram Desktop](https://desktop.telegram.org/) or [web.telegram.org](https://web.telegram.org/) so you can **copy-paste** the token into your terminal or env file instead of typing 46+ characters from your phone into SSH. - -### Step 10. Enable Telegram (first time or after skipping it) - -#### Path A — You have not installed yet, or you can re-run onboard - -Export the token on the **host**, then run the installer / onboard again (non-interactive variables from Step 4, plus `TELEGRAM_BOT_TOKEN`). The wizard’s **Messaging channels** step (installer phase **[5/8]**) is the right time to toggle Telegram interactively. - -Re-onboarding after a sandbox exists is supported; NemoClaw can detect token changes and rebuild the sandbox — see the official [Telegram bridge](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html) page. 
- -#### Path B — NemoClaw is already installed (recommended host command) - -On the **host** (run `exit` if you are inside `nemoclaw … connect`): - -1. **Allow outbound access to the Telegram API** if you have not already — add the `telegram` network preset: +To add Brave Web Search to an existing sandbox, re-run the onboard wizard with `--fresh` to start a new session that re-prompts all options (including previously skipped features): ```bash -nemoclaw my-assistant policy-add +nemoclaw onboard --fresh --gpu ``` -When prompted, select `telegram` and confirm. +> [!NOTE] +> Without `--fresh`, the onboard wizard **resumes** the previous session and will not re-prompt for features you already skipped. -2. **Register the bot token and rebuild** the sandbox image so Telegram is included: +When you reach **Enable Brave Web Search**, choose **yes** and paste the key from the [Brave Search API](https://brave.com/search/api/) console. Confirm the same sandbox name and inference choices where prompted. The wizard will **rebuild** the sandbox so the key is applied. + +> [!NOTE] +> Alternatively, set `BRAVE_API_KEY` in your environment before running the installer and Brave Search will be enabled automatically during onboard. + +To confirm web search is enabled, relaunch your OpenClaw WebUI or terminal UI. Ask the agent for something that needs **live web search**. If requests still fail, recheck **`policy-list`** and re-read the onboard output for Brave/API errors. + +### Step 5. Set up Messaging Channel (Telegram Bot as an example) + +These steps apply when your sandbox exists but **Telegram was never configured** (you skipped **Messaging channels** in Step 2, or the sandbox policy tier never included Telegram-related egress). Replace `` with your sandbox (for example `my-assistant`). + +#### 1. Create a Telegram bot + +In Telegram, open [@BotFather](https://t.me/BotFather), send `/newbot`, and complete the prompts. 
Copy the **bot token** BotFather returns and keep it ready for the next step. + +#### 2. Register Telegram with NemoClaw and rebuild the sandbox ```bash -export TELEGRAM_BOT_TOKEN='' -nemoclaw my-assistant channels add telegram +nemoclaw channels add telegram ``` -Follow the prompts to rebuild when asked (or run `nemoclaw my-assistant rebuild --yes` afterward if non-interactive mode queued a rebuild — see `NEMOCLAW_NON_INTERACTIVE=1` behavior in the [commands reference](https://docs.nvidia.com/nemoclaw/latest/reference/commands.html)). +Paste the token when prompted. NemoClaw persists credentials and **rebuilds** the sandbox so OpenClaw can use Telegram as a messaging channel. -3. **Pause or resume** Telegram delivery without changing credentials: use the **`nemoclaw channels stop`** / **`nemoclaw channels start`** patterns for the `telegram` channel described in [Set up Telegram bridge](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html) (exact subcommand spelling may vary slightly by NemoClaw version; use `nemoclaw --help` if in doubt). +#### 3. (If needed) Allow Telegram egress in the sandbox policy -Check overall status: +If messages fail with network or policy errors after the channel is registered, inspect presets and add Telegram-related egress if your tier omitted it: + +```bash +nemoclaw policy-list +nemoclaw policy-add telegram +``` + +Preset names follow your selected tier; confirm against [Network policies](https://docs.nvidia.com/nemoclaw/latest/reference/network-policies.html). + +#### 4. Verify Telegram + +Telegram uses long-polling (`getUpdates`) — the sandbox actively pulls messages from Telegram servers. **No public URL or cloudflared tunnel is required for Telegram to work.** + +Open Telegram, find your bot, and send a message. The bot should forward traffic to the agent in your NemoClaw sandbox and reply. 
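To illustrate why long-polling needs no inbound URL, the sketch below builds a single `getUpdates` request against the public Telegram Bot API. It is only an illustration of the polling pattern and is not NemoClaw's own implementation. The helper name `get_updates_url` is hypothetical, and the `TELEGRAM_BOT_TOKEN` variable in the commented call is assumed to hold the token from BotFather.

```bash
#!/usr/bin/env bash
# Illustration only: one long-poll request against the Telegram Bot API.
# get_updates_url is a hypothetical helper, not a NemoClaw command.

# Build a getUpdates URL. "offset" skips updates already seen, and "timeout"
# asks Telegram to hold the connection open for up to that many seconds.
get_updates_url() {
  local token="$1" offset="${2:-0}" timeout="${3:-30}"
  printf 'https://api.telegram.org/bot%s/getUpdates?offset=%s&timeout=%s\n' \
    "$token" "$offset" "$timeout"
}

# With a real token exported as TELEGRAM_BOT_TOKEN, a single poll would be:
#   curl -s --max-time 40 "$(get_updates_url "$TELEGRAM_BOT_TOKEN" 0 30)"
```

Because the client opens the outbound connection itself, only egress to `api.telegram.org` has to be allowed. Polling a bot by hand while the sandbox is already polling it will likely conflict with that session, so treat the commented `curl` line as a debugging aid for an unregistered bot.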
+ +> [!NOTE] +> The first response may take longer depending on model size (30B models respond in a few seconds; larger models may take longer on first inference). + +> [!NOTE] +> If the bot does not respond: +> - Run `nemoclaw status` to confirm the sandbox is running and inference is healthy. +> - Run `nemoclaw logs --follow` and look for Telegram-related errors. +> - If Telegram egress is missing, run `nemoclaw policy-add` and select `telegram`. +> - If the channel was never registered, run `nemoclaw channels add telegram`. + +> [!NOTE] +> The `channels add telegram` wizard also prompts for an optional **Telegram User ID** to restrict who can DM the bot. Send `/start` to [@userinfobot](https://t.me/userinfobot) on Telegram to get your numeric user ID. If you skip this, the bot will require device pairing (a terminal-based code confirmation) before responding to messages. + +> [!NOTE] +> For details on restricting which Telegram chats can interact with the agent, see the [NemoClaw Telegram bridge documentation](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html). + +#### 5. (Optional) Install cloudflared for remote Web UI access + +The cloudflared tunnel provides a **public URL for the Web UI dashboard** — it is not related to Telegram messaging. + +Install cloudflared (DGX Station is arm64): + +```bash +curl -L --output cloudflared.deb \ + https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-arm64.deb +sudo dpkg -i cloudflared.deb +``` + +Start the tunnel: + +```bash +nemoclaw tunnel start +``` + +Verify: ```bash nemoclaw status ``` -Open Telegram, find your bot, and send it a message. +You should see `● cloudflared` with a `trycloudflare.com` public URL. -> [!NOTE] -> The first response may take 30--90 seconds for a 120B parameter model running locally. 
+--- -> [!NOTE] -> To **persist** `TELEGRAM_BOT_TOKEN` for shell-based flows, use a `chmod 600` env file and `source` it from `~/.bashrc` as shown in Step 4. +## Phase 3: Set Up NemoClaw Agent -> [!NOTE] -> For chat allowlists and advanced Telegram behavior, see [NemoClaw Telegram bridge documentation](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html). +### Step 6. Set Up NemoClaw Agents + +Setting up a NemoClaw agent generally requires three steps: configure the NemoClaw security policy, run the agent workflow prompt, and personalize the workflow for your own use case. + +Check out these [Example NemoClaw Agents](https://build.nvidia.com/station/nemoclaw-applications) for reference. Consider sharing your NemoClaw agent setup with the community at the [DGX Station Developer Forum](https://forums.developer.nvidia.com/c/accelerated-computing/dgx-station-gb300). --- ## Phase 4: Cleanup and Uninstall -### Step 11. Stop services +### Step 7. Stop services -Stop any running auxiliary services (Telegram bridge, cloudflared tunnel): +Stop the cloudflared tunnel: ```bash -nemoclaw stop +nemoclaw tunnel stop ``` -Expected: - -```text -[services] All services stopped. -``` - -Stop the port forward (always pass **port** and **sandbox name**): +Stop the port forward: ```bash -openshell forward list -openshell forward stop 18789 my-assistant +openshell forward list # find active forwards and their ports +openshell forward stop # stop the dashboard forward (use the port shown above) ``` -Stop and **remove** the vLLM container so the name `vllm-nemotron` is free for a future run. The playbook created the container with **`--restart unless-stopped`**, so `docker stop` alone is not enough: Docker would **restart it after reboot** and the container would keep reserving GPU memory. +### Step 8. Uninstall NemoClaw + +The NemoClaw CLI includes a built-in uninstaller. It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files.
Docker, Node.js, npm, and Ollama are preserved. ```bash -docker update --restart=no vllm-nemotron 2>/dev/null || true -docker stop vllm-nemotron -docker rm vllm-nemotron +nemoclaw uninstall --yes ``` -To remove the container in one step even if it is running: `docker rm -f vllm-nemotron`. - -### Step 12. Uninstall NemoClaw - -Run the uninstaller from the cloned source directory. It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and vLLM are preserved. +To remove everything including the Ollama model: ```bash -cd ~/.nemoclaw/source -./uninstall.sh +nemoclaw uninstall --yes --delete-models ``` **Uninstaller flags:** @@ -698,15 +395,13 @@ cd ~/.nemoclaw/source |------|--------| | `--yes` | Skip the confirmation prompt | | `--keep-openshell` | Leave the `openshell` binary in place | -| `--delete-models` | Removes **local inference models pulled by older NemoClaw flows** (the upstream flag name still references **Ollama**). It does **not** remove Hugging Face weights used by this playbook’s **vLLM** container — delete those separately (below). | +| `--delete-models` | Also remove the Ollama models pulled by NemoClaw | -To also remove the vLLM container and cached model weights: - -```bash -./uninstall.sh --yes -docker rm -f vllm-nemotron 2>/dev/null || true -rm -rf ~/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/ -``` +> [!NOTE] +> If the `nemoclaw` CLI is not available (e.g. install failed partway), use the remote uninstaller as a fallback: +> ```bash +> curl -fsSL https://raw.githubusercontent.com/NVIDIA/NemoClaw/refs/heads/main/uninstall.sh | bash -s -- --yes +> ``` The uninstaller runs 6 steps: 1. Stop NemoClaw helper services and port-forward processes @@ -717,7 +412,7 @@ The uninstaller runs 6 steps: 6. 
Remove state directories (`~/.nemoclaw`, `~/.config/openshell`, `~/.config/nemoclaw`) and the OpenShell binary > [!NOTE] -> The source clone at `~/.nemoclaw/source` is removed as part of state cleanup in step 6. If you want to keep a local copy, move or back it up before running the uninstaller. +> If you have a local clone at `~/.nemoclaw/source` you want to keep, move or back it up before running the uninstaller — it is removed as part of state cleanup in step 6. ## Useful commands @@ -727,52 +422,81 @@ The uninstaller runs 6 steps: | `nemoclaw my-assistant status` | Show sandbox status and inference config | | `nemoclaw my-assistant logs --follow` | Stream sandbox logs in real time | | `nemoclaw list` | List all registered sandboxes | -| `nemoclaw tunnel start` | Start optional host services such as **cloudflared** (public dashboard URL when installed); does **not** start Telegram | -| `nemoclaw start` | Deprecated alias for tunnel/aux host services — **not** for Telegram | -| `nemoclaw stop` | Stop host auxiliary services started by `nemoclaw tunnel start` / `nemoclaw start` | -| `nemoclaw channels add telegram` | Store Telegram token and rebuild sandbox (host) | +| `nemoclaw tunnel start` | Start cloudflared tunnel (public URL for remote Web UI access) | +| `nemoclaw tunnel stop` | Stop the cloudflared tunnel | +| `nemoclaw my-assistant dashboard-url --quiet` | Print the full tokenized Web UI URL (includes auto-assigned port) | | `openshell term` | Open the monitoring TUI on the host | | `openshell forward list` | List active port forwards | -| `openshell forward start 18789 my-assistant --background` | Start port forwarding for Web UI | -| `openshell forward stop 18789 my-assistant` | Stop Web UI port forward | -| `docker logs -f vllm-nemotron` | Stream vLLM inference server logs | -| `docker restart vllm-nemotron` | Restart the vLLM inference server | -| `curl http://localhost:8000/v1/models` | Check vLLM API status | -| `cd ~/.nemoclaw/source && 
./uninstall.sh` | Remove NemoClaw (preserves Docker, Node.js, vLLM image) | +| `nemoclaw uninstall --yes` | Remove NemoClaw (preserves Docker, Node.js, Ollama) | +| `nemoclaw uninstall --yes --delete-models` | Remove NemoClaw and Ollama models | ## Troubleshooting | Symptom | Cause | Fix | |---------|-------|-----| -| `openclaw agent --local` fails or is blocked inside the sandbox | `--local` bypasses the NemoClaw gateway and is disallowed in the OpenShell sandbox | Use gateway mode: `openclaw agent --agent main -m "hello" --session-id test` (no `--local`). | -| Onboard fails with **“K8s namespace not ready”** (or similar) with no clear reason | Often **low disk space** on `/` or Docker’s data root; image push / k3s need headroom | Run `df -h / /var/lib/docker`. Free **at least ~40 GB** (see [NemoClaw quickstart prerequisites](https://docs.nvidia.com/nemoclaw/latest/get-started/quickstart.html)); prune Docker (`docker system prune`) or expand disk, then retry onboard. | -| vLLM warns about **mixed devices** or loads on an unexpected GPU | Multiple GPUs visible; default visibility does not match intent | Pin one GPU: `--gpus '"device=0"'` and `-e CUDA_VISIBLE_DEVICES=0` with `--tensor-parallel-size 1`, or use two GPUs explicitly with `--tensor-parallel-size 2` and `-e CUDA_VISIBLE_DEVICES=0,1` (see Step 3 in instructions). | | `nemoclaw: command not found` after install | Shell PATH not updated | Run `source ~/.bashrc` (or `source ~/.zshrc` for zsh), or open a new terminal window. | -| `pip: command not found` | pip not installed on DGX Station by default | Install pip: `sudo apt install -y python3-pip`. Then use `pip3 install --break-system-packages huggingface-hub`. | -| `huggingface-cli` is deprecated | Hugging Face CLI was renamed | Use `hf download` instead of `huggingface-cli download`. | -| vLLM container won't start or crashes | GPU memory issue or wrong image | Check logs: `docker logs vllm-nemotron`. 
If CUDA OOM, reduce context: recreate the container with `--max-model-len 8192`. Ensure you are using the NVIDIA container image (`nvcr.io/nvidia/vllm:26.03-py3`), not the community `vllm/vllm-openai` image. | -| vLLM logs show `Application startup complete.` but `curl` times out | vLLM still compiling CUDA graphs after startup | Wait 1--2 minutes after `Application startup complete.` before sending requests. The first request compiles CUDA graphs and may take 30--90 seconds. | -| NemoClaw onboard fails with "endpoint validation failed" | vLLM model not warmed up or validation timeout too short | Warm up the model first: `curl -s --max-time 120 http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","messages":[{"role":"user","content":"hello"}],"max_tokens":10}'`. Then re-run with `NEMOCLAW_EXPERIMENTAL=1 NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 nemoclaw onboard`. | -| NemoClaw reports "provider 'vllm' is not available" | Missing experimental flag | Set `NEMOCLAW_EXPERIMENTAL=1` before running the installer or `nemoclaw onboard`. The vLLM provider is currently an experimental feature. | +| Installer fails with Node.js version error | Node.js version below 22.16 | Install Node.js 22.16+: `curl -fsSL https://deb.nodesource.com/setup_22.x \| sudo -E bash - && sudo apt-get install -y nodejs` then re-run the installer. | +| npm install fails with `EACCES` permission error | npm global directory not writable | `mkdir -p ~/.npm-global && npm config set prefix ~/.npm-global && export PATH=~/.npm-global/bin:$PATH` then re-run the installer. Add the `export` line to `~/.bashrc` to make it permanent. | | Docker permission denied | User not in docker group | `sudo usermod -aG docker $USER`, then log out and back in. 
| -| Gateway fails with cgroup / "Failed to start ContainerManager" errors | Docker not configured for host cgroup namespace on DGX Station | Run the cgroup fix: `sudo python3 -c "import json, os; path='/etc/docker/daemon.json'; d=json.load(open(path)) if os.path.exists(path) else {}; d['default-cgroupns-mode']='host'; json.dump(d, open(path,'w'), indent=2)"` then `sudo systemctl restart docker`. | +| Gateway fails with cgroup / "Failed to start ContainerManager" errors | Older OpenShell or Docker still using a **private** cgroup namespace for the gateway so kubelet cannot see cgroup v2 controllers | First **upgrade OpenShell** (re-run the Phase 1 `nemoclaw.sh` install so you get a build that sets host cgroupns on the gateway container). If it still fails, force Docker's default to host mode by running the [daemon.json cgroup fix](#daemonjson-cgroup-fix) below, then run `sudo systemctl restart docker`. | | Gateway fails with "port 8080 is held by container..." | Another OpenShell gateway or container is using port 8080 | Stop the conflicting container: `openshell gateway destroy -g ` or `docker stop && docker rm `, then retry `nemoclaw onboard`. | -| Sandbox cannot reach the inference server | Using `localhost` instead of `host.openshell.internal` in endpoint URL | Inside the sandbox, `localhost` refers to the sandbox container, not the host. The onboard wizard configures `host.openshell.internal` automatically. Verify from inside the sandbox: `curl -sf https://inference.local/v1/models`. If this fails, check that vLLM is reachable from the host: `curl -s http://localhost:8000/v1/models`. | -| Agent gives no response or is very slow | Normal for 120B model running locally | Nemotron 3 Super 120B can take 30--90 seconds per response. Verify inference route: `nemoclaw my-assistant status`. 
| -| vLLM API returns empty or errors on tool calls | Missing tool-call flags | Verify that `--enable-auto-tool-choice` and `--tool-call-parser qwen3_xml` are set: `docker inspect vllm-nemotron --format '{{.Config.Cmd}}'`. | +| Sandbox creation fails | Stale gateway state or DNS not propagated | Run `openshell gateway destroy && openshell gateway start`, then re-run the installer or `nemoclaw onboard`. | +| CoreDNS crash loop | Known issue on some DGX Station configurations | Re-run the NemoClaw installer (`curl -fsSL https://www.nvidia.com/nemoclaw.sh \| bash`) which includes the CoreDNS fix. If the issue persists, see [NemoClaw troubleshooting](https://docs.nvidia.com/nemoclaw/latest/reference/troubleshooting.html). | +| "No GPU detected" during onboard | DGX Station GB300 reports unified memory differently | Expected on DGX Station. The wizard still works and uses Ollama for inference. | +| Inference timeout or hangs | Ollama not running or not reachable | Check Ollama: `curl http://127.0.0.1:11434`. If not running: `sudo systemctl restart ollama`. Verify the NemoClaw auth proxy is healthy: `curl http://127.0.0.1:11435/api/tags`. If both respond, check `nemoclaw my-assistant status` for the Inference health line. | +| Agent gives no response or is very slow | First response can be slow, especially with larger models | Response time depends on model size (30B: a few seconds, 120B: 30–90 seconds). Verify inference route: `nemoclaw my-assistant status`. | | Port 18789 already in use | Another process is bound to the port | `lsof -i :18789` then `kill `. If needed, `kill -9 ` to force-terminate. | -| Web UI port forward dies or dashboard unreachable | Port forward not active | `openshell forward stop 18789 my-assistant` then `openshell forward start 18789 my-assistant --background`. Always pass **port** and **sandbox name** to `openshell forward stop`. 
| -| Web UI shows `origin not allowed` | Browser origin does not match what the gateway expects | On the **DGX Station local desktop**, open `http://127.0.0.1:18789/#token=...` (not `localhost`). Through an **SSH tunnel** on another machine, `localhost` vs `127.0.0.1` in the client browser usually both work because the check applies to how you reach the forwarded port locally. | -| Telegram does not work after install; `nemoclaw start` does nothing for Telegram | **`nemoclaw start` starts optional host services (e.g. cloudflared), not the Telegram bridge** | Configure Telegram during onboard, or on the host run `nemoclaw my-assistant channels add telegram` (and rebuild), after `policy-add` for the `telegram` preset. See [Set up Telegram bridge](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html). | -| Telegram bot receives messages but does not reply | Telegram policy not added to sandbox | Run `nemoclaw my-assistant policy-add`, type `telegram`, hit Y. Ensure the channel was added with `nemoclaw my-assistant channels add telegram` so the image includes Telegram. | -| `docker: Error response from daemon: Conflict. The container name "/vllm-nemotron" is already in use` | Previous cleanup used `docker stop` only | `docker rm -f vllm-nemotron` (or `docker update --restart=no` then `docker stop` and `docker rm`). The playbook uses `--restart unless-stopped`; stopping alone leaves a restart policy and reserved name. | +| Web UI port forward dies or dashboard unreachable | Port forward not active | `openshell forward stop 18789 my-assistant` then `openshell forward start 18789 my-assistant --background`. | +| Web UI shows `origin not allowed` | Accessing via `localhost` instead of `127.0.0.1` | Use `http://127.0.0.1:18789/#token=...` in the browser. The gateway origin check requires `127.0.0.1` exactly. 
| +| Telegram bridge does not start | Telegram channel not registered with sandbox | Run `nemoclaw channels add telegram` to register the bot token and rebuild the sandbox. Verify with `nemoclaw status`. | +| Telegram stops responding after sandbox rebuild | Telegram long-polling session stale after rebuild | Run `nemoclaw recover` to restart the gateway. If still unresponsive, run `nemoclaw channels add telegram` to re-register and rebuild. | +| Telegram bot receives messages but does not reply | Telegram network egress policy not added | Run `nemoclaw policy-add`, select `telegram`, and confirm. This is a hot-reload — no rebuild needed. | -**Model variant guidance:** +#### daemon.json cgroup fix -| Variant | Size | VRAM Required | When to Use | -|---------|------|---------------|-------------| -| `NVFP4` | ~60 GB | ~80 GB | Default for DGX Station (GB300). Fits on single GPU with room for large KV cache. | -| `FP8` | ~120 GB | ~140 GB | Higher accuracy, still fits on GB300. Add `--kv-cache-dtype fp8` to the vLLM command. | -| `BF16` | ~240 GB | ~260 GB | Highest accuracy. Fits on GB300 but leaves little room for KV cache. Reduce `--max-model-len`. | +Use this script as the fallback for the cgroup / "Failed to start ContainerManager" row above. It validates any existing `/etc/docker/daemon.json`, writes a `.bak` backup, sets `default-cgroupns-mode` to `host`, and atomically replaces the file. It exits non-zero with an error on stderr if anything fails, leaving the original `daemon.json` untouched. -For the latest known issues, see [DGX Station documentation](https://docs.nvidia.com/dgx/dgx-station-user-guide/index.html). 
+```bash +sudo python3 - <<'PY' +import json, os, shutil, sys, tempfile + +path = '/etc/docker/daemon.json' +try: + if os.path.exists(path): + with open(path) as f: + data = json.load(f) + if not isinstance(data, dict): + raise ValueError(f'{path} is not a JSON object') + else: + data = {} +except (json.JSONDecodeError, ValueError, OSError) as e: + print(f'error: failed to read {path}: {e}', file=sys.stderr) + sys.exit(1) + +if os.path.exists(path): + try: + shutil.copy2(path, path + '.bak') + except OSError as e: + print(f'error: failed to back up {path}: {e}', file=sys.stderr) + sys.exit(1) + +data['default-cgroupns-mode'] = 'host' + +target_dir = os.path.dirname(path) or '/' +fd, tmp = tempfile.mkstemp(prefix='daemon.json.', dir=target_dir) +try: + with os.fdopen(fd, 'w') as f: + json.dump(data, f, indent=2) + f.write('\n') + os.chmod(tmp, 0o644) + os.replace(tmp, path) +except OSError as e: + if os.path.exists(tmp): + try: + os.unlink(tmp) + except OSError: + pass + print(f'error: failed to write {path}: {e}', file=sys.stderr) + sys.exit(1) +PY +``` diff --git a/nvidia/station-nemoclaw/endpoint-test.yaml b/nvidia/station-nemoclaw/endpoint-test.yaml index 54569b2..e83eccf 100644 --- a/nvidia/station-nemoclaw/endpoint-test.yaml +++ b/nvidia/station-nemoclaw/endpoint-test.yaml @@ -1,8 +1,8 @@ kind: Playbook metadata: - name: station-nemoclaw - displayName: NemoClaw with Nemotron-3-Super and vLLM on DGX Station - shortDescription: Install NemoClaw on DGX Station with local vLLM inference and Telegram bot integration + name: nemoclaw + displayName: Run NemoClaw with a Local LLM + shortDescription: Build your first local AI assistant on DGX Station using NemoClaw in a secure sandbox, with optional Telegram. 
publisher: nvidia description: | @@ -11,23 +11,19 @@ metadata: labelsV2: - gpuType:playbook:gpu_type_station - - DGX - DGX Station - - GB300 - - AI Agent + - Agentic Workflow - OpenShell - - vLLM - - Nemotron-3-Super - NemoClaw - Telegram attributes: - key: DURATION - value: 30 MINS + value: 30 MIN spec: - artifactName: station-nemoclaw - nvcfFunctionId: None + artifactName: nemoclaw + nvcfFunctionId: 3b0ad962-7cfe-4370-9f4d-8024298a6d13 attributes: showUnavailableBanner: false @@ -45,22 +41,19 @@ spec: label: Overview content: | - ## Overview + # Basic idea - ## Basic idea + **NVIDIA NemoClaw** is an open-source reference stack that simplifies running OpenClaw always-on assistants more safely. It installs the **NVIDIA OpenShell** runtime — an environment designed for executing agents with additional security — and connects them to local inference on your DGX Station. A single installer command (`nemoclaw.sh`) handles Node.js, OpenShell, and the NemoClaw CLI; the **onboard** wizard then creates a sandboxed agent, optional **Brave Search**, optional **messaging channels** (Telegram, Discord, or Slack), and a **policy tier** with network presets. - **NVIDIA NemoClaw** is an open-source reference stack that simplifies running OpenClaw always-on assistants more safely. It installs the **NVIDIA OpenShell** runtime -- an environment designed for executing agents with additional security -- and open-source models like NVIDIA Nemotron. A single installer command handles Node.js, OpenShell, and the NemoClaw CLI, then walks you through an onboard wizard to create a sandboxed agent on your DGX Station using vLLM with Nemotron 3 Super. - - By the end of this playbook you will have a working AI agent inside an OpenShell sandbox, accessible via a web dashboard and a Telegram bot, with inference routed to a local Nemotron 3 Super 120B model served by vLLM on your DGX Station -- all without exposing your host filesystem or network to the agent. 
+ By the end of this playbook you will have a working AI agent inside an OpenShell sandbox, reachable through the **Web UI** or **terminal TUI**, with model requests routed to local inference on the DGX Station. You can optionally add **Telegram** (with **cloudflared** for a public webhook URL) and optional **web search** — all without exposing your host filesystem or network beyond what you explicitly allow in policy. ## What you'll accomplish - - Configure Docker and the NVIDIA container runtime for OpenShell on DGX Station - - Pull Nemotron 3 Super 120B (NVFP4) from Hugging Face and serve it with vLLM - - Install NemoClaw with a single command (handles Node.js, OpenShell, and the CLI) - - Run the onboard wizard to create a sandbox and configure local vLLM inference - - Chat with the agent via the CLI, TUI, and web UI - - Set up a Telegram bot that forwards messages to your sandboxed agent + - Install **NemoClaw** with one command (`nemoclaw.sh`), which pulls Node.js, OpenShell, and the CLI as needed + - Walk through the `nemoclaw onboard` wizard with recommended settings + - Open the **Web UI** to interact with the agent + - Optionally enable **Brave Search** or **Telegram** after onboarding + - **Clean up and uninstall** with the documented `uninstall.sh` flags when finished ## Notice and disclaimers @@ -74,14 +67,14 @@ spec: ### What you're getting - This experience is provided "AS IS" for demonstration purposes only -- no warranties, no guarantees. This is a demo, not a production-ready solution. You will need to implement appropriate security controls for your environment and use case. + This experience is provided "AS IS" for demonstration purposes only — no warranties, no guarantees. This is a demo, not a production-ready solution. You will need to implement appropriate security controls for your environment and use case. ### Key risks with AI agents - - **Data leakage** -- Any materials the agent accesses could be exposed, leaked, or stolen.
- - **Malicious code execution** -- The agent or its connected tools could expose your system to malicious code or cyber-attacks. - - **Unintended actions** -- The agent might modify or delete files, send messages, or access services without explicit approval. - - **Prompt injection and manipulation** -- External inputs or connected content could hijack the agent's behavior in unexpected ways. + - **Data leakage** — Any materials the agent accesses could be exposed, leaked, or stolen. + - **Malicious code execution** — The agent or its connected tools could expose your system to malicious code or cyber-attacks. + - **Unintended actions** — The agent might modify or delete files, send messages, or access services without explicit approval. + - **Prompt injection and manipulation** — External inputs or connected content could hijack the agent's behavior in unexpected ways. ### Participant acknowledgement @@ -91,23 +84,22 @@ spec: | Layer | What it protects | When it applies | |------------|----------------------------------------------------|-----------------------------| - | Filesystem | Prevents reads/writes outside allowed paths. | Locked at sandbox creation. | + | Filesystem | Prevents reads/writes outside allowed paths. | Locked at sandbox creation. | | Network | Blocks unauthorized outbound connections. | Hot-reloadable at runtime. | - | Process | Blocks privilege escalation and dangerous syscalls.| Locked at sandbox creation. | + | Process | Blocks privilege escalation and dangerous syscalls.| Locked at sandbox creation. | | Inference | Reroutes model API calls to controlled backends. | Hot-reloadable at runtime. 
| ## What to know before starting - Basic use of the Linux terminal and SSH - - Familiarity with Docker (permissions, `docker run`) + - Familiarity with Docker (permissions, `docker run`, optional `docker` group membership) - Awareness of the security and risk sections above ## Prerequisites - **Hardware and access:** + **Hardware:** - A DGX Station (GB300) with keyboard and monitor, or SSH access - - A **Telegram bot token** from [@BotFather](https://t.me/BotFather) (create one with `/newbot`) -- optional, for Phase 3 **Software:** @@ -119,16 +111,16 @@ spec: head -n 2 /etc/os-release nvidia-smi docker info --format '{{.ServerVersion}}' - df -h / /var/lib/docker 2>/dev/null | head -20 ``` - Expected: Ubuntu 24.04, NVIDIA GB300 GPU(s), Docker 28.x+, and **enough free disk** for Docker layers, the NemoClaw sandbox image, and Hugging Face cache (treat **~40 GB free** on the Docker data filesystem as a practical minimum; very low free space can surface as cryptic onboard errors such as “K8s namespace not ready”). + Expected: Ubuntu 24.04, NVIDIA GB300 GPU, Docker 28.x+. ## Have ready before you begin - | Item | Where to get it | - |------|----------------| - | Telegram bot token (optional) | [@BotFather](https://t.me/BotFather) on Telegram -- create with `/newbot` | + | Item | When you need it | + |------|------------------| + | **Telegram bot token** (optional) | Create with [@BotFather](https://t.me/BotFather) (`/newbot`). You can paste it during **onboarding** (Step 3) **or** when you run **`nemoclaw channels add telegram`** later. | + | **Brave Search API key** (optional) | From [Brave Search API](https://brave.com/search/api/) if you enable web search during onboarding or via **`nemoclaw onboard --fresh --gpu`** (`--fresh` re-prompts every onboarding question, including features you previously skipped; without `--fresh` the wizard resumes the previous session and will not re-prompt). 
| ## Ancillary files @@ -136,10 +128,10 @@ spec: ## Time and risk - - **Estimated time:** 20--30 minutes (with model already downloaded). First-time model download adds ~10--20 minutes depending on network speed. - - **Risk level:** Medium -- you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts. - - **Last Updated:** 04/27/2026 - * First publication for DGX Station with vLLM + - **Estimated time:** About 30–60 minutes for a first full pass (install, onboard, model download depending on choice and network). Optional Brave, Telegram, and cloudflared steps add time if you do them in a second session. + - **Risk level:** Medium — you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts. + - **Last Updated:** 05/29/2026 + - Update to latest nemoclaw installer instructions @@ -148,355 +140,111 @@ spec: label: Instructions content: | - # Phase 1: Prerequisites + # Phase 1: Install and Run NemoClaw - These steps prepare a fresh DGX Station for NemoClaw. If Docker, the NVIDIA runtime, and vLLM are already configured, skip to Phase 2. + ## Step 1. Install NemoClaw - > [!IMPORTANT] - > **Disk space:** NemoClaw’s onboard flow pulls a multi-gigabyte sandbox image and runs Docker, k3s, and the gateway together. If root or Docker’s data disk is nearly full (for example only a few gigabytes free), onboarding can fail with generic errors such as **“K8s namespace not ready”** with no clear hint about storage. Before you start, check free space: `df -h / /var/lib/docker`. NVIDIA recommends **at least 40 GB free** on the filesystem that holds Docker layers (often `/` or `/var/lib/docker`); treat **under ~15 GB** as high risk for first-time onboard failures. - - ## Step 1. 
Configure Docker and the NVIDIA container runtime - - OpenShell's gateway runs k3s inside Docker. On DGX Station (Ubuntu 24.04, cgroup v2), Docker must be configured with the NVIDIA runtime and host cgroup namespace mode. - - Configure the NVIDIA container runtime for Docker: + This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the pinned NemoClaw **v0.55** release (set via `NEMOCLAW_VERSION`; v0.55 is the version the NemoClaw team currently recommends as the most stable), builds the CLI, and runs the onboard wizard to create a sandbox. ```bash - sudo nvidia-ctk runtime configure --runtime=docker + curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_VERSION=v0.55 bash ``` - Expected: + The installation wizard walks you through setup: - ```text - INFO Loading config from /etc/docker/daemon.json - INFO Wrote updated config to /etc/docker/daemon.json - INFO It is recommended that docker daemon be restarted. - ``` + 1. **Accept NemoClaw license** -- Confirm by entering `yes` + 2. **Run express install** -- Confirm by entering `Y` - Set the cgroup namespace mode required by OpenShell on DGX Station: + The installer requires **Node.js 22.16+** (installed automatically if missing). It walks you through Node.js, NemoClaw CLI and Onboarding phases. See more details of Onboarding configuration in the next step. 
- ```bash - sudo python3 -c " - import json, os - path = '/etc/docker/daemon.json' - d = json.load(open(path)) if os.path.exists(path) else {} - d['default-cgroupns-mode'] = 'host' - json.dump(d, open(path, 'w'), indent=2) - " - ``` - - Restart Docker: - - ```bash - sudo systemctl restart docker - ``` - - Verify the NVIDIA runtime works: - - ```bash - docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi - ``` - - Expected: - - ```text - +-----------------------------------------------------------------------------------------+ - | NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 | - +-----------------------------------------+------------------------+----------------------+ - | 0 NVIDIA GB300 On | 00000009:06:00.0 Off | 0 | - | N/A 46C P0 215W / 1300W | 18661MiB / 256703MiB | 0% Default | - +-----------------------------------------+------------------------+----------------------+ - ``` - - If you get a permission denied error on `docker`, add your user to the Docker group and activate the new group in your current session: - - ```bash - sudo usermod -aG docker $USER - newgrp docker - ``` - - This applies the group change immediately. Alternatively, you can log out and back in instead of running `newgrp docker`. + ## Step 2. NemoClaw Onboarding > [!NOTE] - > DGX Station uses cgroup v2. OpenShell's gateway embeds k3s inside Docker and needs host cgroup namespace access. Without `default-cgroupns-mode: host`, the gateway can fail with "Failed to start ContainerManager" errors. + > If you chose **express install** in Step 1, all settings are auto-configured with recommended defaults. Skip to Step 3. - ## Step 2. Pull the Nemotron-3-Super model + During custom setup, the onboard wizard walks you through: - Install pip and the Hugging Face CLI (if not already installed): + 1. **Configuring inference** -- Choose to set up local inference on your DGX Station by selecting **`7) Local Ollama`**. + 2. 
**Ollama models** -- Choose desired inference model. If no model is present locally, the installer will provide options to download models to start. + 3. **Sandbox name** -- Pick a name (e.g. my-assistant). Each sandbox requires a unique name. + 4. **Apply this configuration** -- Enter `Y` to confirm setting up local inference. + 5. **Enable Brave Web Search** -- Optional. If you enable it, paste a [Brave Search API](https://brave.com/search/api/) key when prompted. + 6. **Messaging channels** -- Optional. If you enable it, choose your desired bot (`telegram`, `discord` or `slack`) and paste your bot token when prompted. + 7. **Policy presets** -- Choose desired Policy tier (`Balanced` recommended) and accept/edit the suggested presets when prompted (confirm with **Enter**). - ```bash - sudo apt install -y python3-pip - pip3 install --break-system-packages huggingface-hub - ``` - - Download Nemotron 3 Super 120B in NVFP4 quantization (~60 GB; may take 10--20 minutes depending on network speed): - - ```bash - hf download nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 - ``` - - Expected (on a fresh download; cached downloads complete instantly): - - ```text - Fetching 36 files: 100%|██████████| 36/36 [15:42<00:00, 26.18s/it] - /home/nvidia/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/snapshots/0d6fa3ecad422a... - ``` - - Verify the download completed: - - ```bash - ls ~/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/ - ``` - - Expected: - - ```text - blobs refs snapshots - ``` - - > [!NOTE] - > The NVFP4 quantization is chosen because it fits entirely in **one** GB300 GPU’s 256 GB HBM3e with room for KV cache. On a **two-GPU** station you can still use NVFP4 with `--tensor-parallel-size 1` and a single visible GPU, or shard with `--tensor-parallel-size 2`. For other quantization variants, see [Troubleshooting](troubleshooting.md). - - ## Step 3. 
Start the vLLM inference server - - Launch vLLM using the NVIDIA-optimized container image. - - **Single GPU (default on one-GPU systems, or pin to one GPU on multi-GPU stations):** vLLM can emit **mixed device** warnings if several GPUs are visible but the model is only meant to use one. Pinning avoids accidentally placing weights on an unexpected device. - - ```bash - docker run -d --name vllm-nemotron \ - --runtime nvidia --gpus '"device=0"' \ - -e CUDA_VISIBLE_DEVICES=0 \ - -v ~/.cache/huggingface:/root/.cache/huggingface \ - -p 8000:8000 \ - --restart unless-stopped \ - nvcr.io/nvidia/vllm:26.03-py3 \ - python3 -m vllm.entrypoints.openai.api_server \ - --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \ - --host 0.0.0.0 \ - --port 8000 \ - --tensor-parallel-size 1 \ - --trust-remote-code \ - --max-model-len 32768 \ - --enable-auto-tool-choice \ - --tool-call-parser qwen3_xml \ - --reasoning-parser nemotron_v3 - ``` - - **Two GPUs (tensor parallel):** If your DGX Station has two Blackwell GPUs and you want Nemotron sharded across both, use both devices and set tensor parallel size to `2` (VRAM is summed across the GPUs): - - ```bash - docker run -d --name vllm-nemotron \ - --runtime nvidia --gpus all \ - -e CUDA_VISIBLE_DEVICES=0,1 \ - -v ~/.cache/huggingface:/root/.cache/huggingface \ - -p 8000:8000 \ - --restart unless-stopped \ - nvcr.io/nvidia/vllm:26.03-py3 \ - python3 -m vllm.entrypoints.openai.api_server \ - --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \ - --host 0.0.0.0 \ - --port 8000 \ - --tensor-parallel-size 2 \ - --trust-remote-code \ - --max-model-len 32768 \ - --enable-auto-tool-choice \ - --tool-call-parser qwen3_xml \ - --reasoning-parser nemotron_v3 - ``` - - **Pick a GPU index by name (optional one-liner):** To print the device index of the first GPU whose name contains `GB300` (adjust the pattern if your `nvidia-smi` name string differs), run on the host: - - ```bash - nvidia-smi --query-gpu=index,name --format=csv,noheader | 
awk -F', ' '/GB300/ { gsub(/^ +/,"",$1); print $1; exit }' - ``` - - Use that index in Docker as `--gpus '"device=N"'` (replace `N` with the printed index). - - > [!NOTE] - > **`--tool-call-parser qwen3_xml`:** Nemotron’s tool-call wire format is exposed through vLLM’s **Qwen3-compatible XML tool parser** — the name refers to the parser implementation, not the base model. This pairing is what vLLM expects for correct function/tool calling with this checkpoint. - - The first startup loads ~70 GB of weights into GPU memory. Watch the logs until you see the model is ready: - - ```bash - docker logs -f vllm-nemotron - ``` - - Wait until you see the following in the logs (typically 3--5 minutes): - - ```text - INFO Loading weights took 55.47 seconds - INFO Model loading took 69.39 GiB memory and 71.31 seconds - INFO: Started server process [1] - INFO: Waiting for application startup. - INFO: Application startup complete. - ``` - - Then verify the API is responding: - - ```bash - curl -s http://localhost:8000/v1/models - ``` - - Expected: - - ```json - {"object":"list","data":[{"id":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","object":"model",...}]} - ``` - - Send a test request to warm up the model before proceeding to Step 4. The first inference request compiles CUDA graphs and can take 30--90 seconds: - - ```bash - curl -s --max-time 120 http://localhost:8000/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{"model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","messages":[{"role":"user","content":"Say hello."}],"max_tokens":10}' - ``` - - Expected (the first request may take 30--90 seconds; subsequent requests are much faster): - - ```json - {"id":"chatcmpl-...","object":"chat.completion","model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","choices":[{"index":0,"message":{"role":"assistant","content":"..."},"finish_reason":"length"}],...} - ``` - - > [!IMPORTANT] - > Warm up the model before running the NemoClaw installer. 
The onboard wizard validates the vLLM endpoint with a short timeout. If the model has not served at least one request, this validation will time out and the install will fail. - - > [!IMPORTANT] - > Always start vLLM via the Docker container -- do not run `vllm serve` directly on the host. The NVIDIA container image (`nvcr.io/nvidia/vllm:26.03-py3`) includes optimized kernels for the GB300's Blackwell architecture that are not available in the pip-installed version. - - > [!NOTE] - > Key flags explained: - > - `--tensor-parallel-size` -- `1` for a single visible GPU; `2` when you expose two GPUs for tensor-parallel sharding (see Step 3). - > - `--trust-remote-code` -- required for the Mamba2-Transformer hybrid architecture - > - `--max-model-len 32768` -- maximum context length (increase up to 1M if VRAM allows) - > - `--enable-auto-tool-choice --tool-call-parser qwen3_xml` -- enables function/tool calling for the agent (see the note above on the parser name). - > - `--reasoning-parser nemotron_v3` -- separates chain-of-thought reasoning from the response so the TUI/Web UI can display them cleanly - - --- - - # Phase 2: Install and Run NemoClaw - - ## Step 4. Install NemoClaw - - The installer script installs Node.js (if needed), OpenShell, the NemoClaw CLI, and runs onboarding to create a sandbox. The vLLM provider requires the **experimental** flag and an **extended inference timeout** (the default 15-second validation timeout is too short for a 120B model). - - ### Recommended: non-interactive install (copy-paste friendly) - - This path is best for SSH sessions, automation, and documentation — no arrow-key TUI in the terminal. 
- - ```bash - NEMOCLAW_EXPERIMENTAL=1 \ - NEMOCLAW_NON_INTERACTIVE=1 \ - NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 \ - NEMOCLAW_SANDBOX_NAME=my-assistant \ - NEMOCLAW_PROVIDER=vllm \ - NEMOCLAW_MODEL="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4" \ - NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 \ - bash -c "$(curl -fsSL https://www.nvidia.com/nemoclaw.sh)" - ``` - - Optional: include **Telegram** in the first onboard without typing the token over SSH — export credentials on the host **before** running the installer (same variables the [NemoClaw Telegram bridge guide](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html) documents): - - ```bash - export TELEGRAM_BOT_TOKEN='' - # Optional DM allowlist (comma-separated Telegram user IDs): - # export TELEGRAM_ALLOWED_IDS='123456789,987654321' - ``` - - Use [Telegram Desktop](https://desktop.telegram.org/) or [web.telegram.org](https://web.telegram.org/) on a laptop to copy the token from [@BotFather](https://t.me/BotFather) and paste into your SSH session (or into a small env file you `source`). Typing a 46+ character token on a phone keyboard into a remote shell is error-prone. - - To **persist** `TELEGRAM_BOT_TOKEN` across reboots, keep it in a root-owned or user-only file and source it from your shell profile (example — adjust path and permissions): - - ```bash - install -m 600 /dev/null ~/.nemoclaw/telegram.env - nano ~/.nemoclaw/telegram.env # add: export TELEGRAM_BOT_TOKEN='...' - grep -q 'nemoclaw/telegram.env' ~/.bashrc || echo 'source ~/.nemoclaw/telegram.env 2>/dev/null' >> ~/.bashrc - ``` - - NemoClaw also stores messaging credentials in its credential store when you onboard or run `nemoclaw … channels add telegram`; the file above is mainly for **re-running scripts** or **non-interactive** flows that read the environment. 
- - ### Alternative: interactive installer - - If you prefer the wizard: - - ```bash - NEMOCLAW_EXPERIMENTAL=1 \ - NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 \ - bash -c "$(curl -fsSL https://www.nvidia.com/nemoclaw.sh)" - ``` - - The wizard asks **six** high-level prompts (third-party notice, inference provider, Brave Search, messaging channels, sandbox name, policy presets). In parallel, the installer prints **eight** numbered onboard sub-phases, `[1/8]` … `[8/8]` (preflight, gateway, inference detection, inference route, messaging channels, sandbox creation, OpenClaw inside sandbox, policy presets). **Those two numberings are different on purpose** — the `[n/8]` lines are internal progress steps; the numbered list above is what you answer in the TUI. - - 1. **Third-party software notice** -- Type `yes` to accept and continue. - 2. **Inference provider** -- The wizard detects vLLM running locally. Select option **8** (`Local vLLM [experimental] — running`). - 3. **Brave Web Search** -- Optional. Type `skip` if you don't have a Brave Search API key. - 4. **Messaging channels** -- Optional. Press **Enter** to skip, or toggle Telegram/Discord/Slack if desired (this is the step that corresponds to onboard phase **[5/8]** in the log). - 5. **Sandbox name** -- Pick a name (e.g. `my-assistant`). Names must be lowercase alphanumeric with hyphens only. - 6. **Policy presets** -- Use arrow keys to toggle presets. `pypi` and `npm` are selected by default. Press **Enter** to confirm. - - The install takes approximately 3 minutes. Example milestones in the output (wording may vary slightly by release): - - ```text - [1/3] Node.js - Node.js found: v22.22.2 - - [2/3] NemoClaw CLI - Installing NemoClaw from GitHub... 
- Verified: nemoclaw is available at /home/nvidia/.local/bin/nemoclaw - - [3/3] Onboarding - [1/8] Preflight checks - ✓ Docker is running - ✓ NVIDIA GPU detected: 2 GPU(s), 256703 MB VRAM # example on a two-GPU system - [2/8] Starting OpenShell gateway - ✓ Gateway is healthy - [3/8] Configuring inference (NIM) - ✓ Using existing vLLM on localhost:8000 - Detected model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 - [4/8] Setting up inference provider - ✓ Inference route set: vllm-local / nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 - [5/8] Messaging channels - (example) Telegram disabled — skipped - # or: Telegram enabled; token stored in credential store - [6/8] Creating sandbox - ✓ Sandbox 'my-assistant' created - [7/8] Setting up OpenClaw inside sandbox - ✓ OpenClaw gateway launched inside sandbox - [8/8] Policy presets - Applied preset: pypi - Applied preset: npm - ``` - - When complete you will see: + When complete you will see output like: ```text ────────────────────────────────────────────────── Sandbox my-assistant (Landlock + seccomp + netns) - Model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (Local vLLM) + Model (Local Ollama) ────────────────────────────────────────────────── Run: nemoclaw my-assistant connect Status: nemoclaw my-assistant status Logs: nemoclaw my-assistant logs --follow - - OpenClaw UI (tokenized URL; treat it like a password) - http://127.0.0.1:18789/#token= ────────────────────────────────────────────────── ``` - > [!IMPORTANT] - > Save the tokenized Web UI URL printed at the end -- you will need it in Step 8. It looks like: - > `http://127.0.0.1:18789/#token=` + > [!NOTE] + > - If `nemoclaw` is not found after install, run `source ~/.bashrc` to reload your shell path. + > - Time to finish **Onboarding** can vary, depending on the model choice and internet speed. + + NemoClaw Onboarding can be run repeatedly to create multiple sandboxes for independent usecases. 
Use `--name ` to create an additional sandbox alongside any existing ones: + + ```bash + nemoclaw onboard --gpu --name + ``` > [!IMPORTANT] - > `NEMOCLAW_EXPERIMENTAL=1` is required for the vLLM provider. Without it, the installer will report "Requested provider 'vllm' is not available in this environment." + > Use `--name ` to create an additional sandbox without affecting existing ones. The `--fresh` flag is a destructive option reserved for starting a completely new onboard session — if a sandbox with the same name already exists, `--fresh` will **destroy and recreate it**. Only use `--fresh` when you intend to wipe and re-onboard (see Step 4 for an example where re-prompting is required). + + ## Step 3. Interact with OpenClaw + + There are two ways to interact with your OpenClaw, Web UI or terminal UI. + + ### Option 1. Web UI + + Get the full dashboard URL (includes the auto-assigned port and token): + + ```bash + nemoclaw my-assistant dashboard-url --quiet + ``` + + This prints a URL like `http://127.0.0.1:18790/#token=`. The port is auto-assigned (commonly 18789 or 18790) and may differ between installs. + + **If accessing the Web UI directly on the DGX Station** (keyboard and monitor attached), open the dashboard URL in a browser. + + **If accessing the Web UI from a remote machine**, you need to set up an SSH tunnel. + + First, note the port number from the dashboard URL above (e.g. `18790`). + + Find your DGX Station's IP address: + + ```bash + hostname -I | awk '{print $1}' + ``` + + This prints the primary IP address (e.g. `192.168.1.42`). You can also find it in **Settings > Wi-Fi** or **Settings > Network** on the DGX Station's desktop, or check your router's connected-devices list. + + From your remote machine, create an SSH tunnel using the port from above (replace `` and ``): + + ```bash + ssh -L :127.0.0.1: @ + ``` + + Now open the dashboard URL in your remote machine's browser. 
> [!IMPORTANT] - > `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300` extends the validation timeout from the default 15 seconds to 300 seconds. Without this, the endpoint validation will fail on a cold 120B model, even if you warmed it up in Step 3 -- the installer sends its own test prompt which may be slower. + > Use `127.0.0.1`, not `localhost` -- the gateway origin check requires an exact match. > [!NOTE] - > If `nemoclaw` is not found after install, run `source ~/.bashrc` to reload your shell path. + > If the Web UI fails to load and the port forward may be stale, get the port from `nemoclaw my-assistant dashboard-url --quiet` and reset: + > ```bash + > openshell forward stop my-assistant || true + > openshell forward start my-assistant --background + > ``` - ## Step 5. Connect to the sandbox and verify inference + ### Option 2. Terminal UI Connect to the sandbox: @@ -504,207 +252,158 @@ spec: nemoclaw my-assistant connect ``` - Expected: - - ```text - sandbox@my-assistant:~$ - ``` - - You are now inside the sandboxed environment. Verify that the inference route is working: - - ```bash - curl -sf https://inference.local/v1/models - ``` - - Expected: - - ```json - {"object":"list","data":[{"id":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","object":"model",...}]} - ``` - - ## Step 6. Talk to the agent (CLI) - - Still inside the sandbox, send a test message **through the OpenClaw gateway** (the default path). The `--local` flag is **intentionally blocked** inside the NemoClaw OpenShell sandbox — it would bypass gateway controls — so the command you may see in generic OpenClaw quickstarts will fail here. - - ```bash - openclaw agent --agent main -m "hello" --session-id test - ``` - - Expected (the agent will think, then respond -- first response may take 30--90 seconds): streaming or printed assistant text ending with a normal reply. - - If you see a response from the agent, inference is working end-to-end. - - ## Step 7. 
Interactive TUI - - Launch the terminal UI for an interactive chat session: + Then launch the terminal UI inside the sandbox: ```bash openclaw tui ``` - Press **Ctrl+C** to exit the TUI. + You can start chatting with OpenClaw. Press **Ctrl+C** to exit the terminal UI. - ## Step 8. Exit the sandbox and access the Web UI - - Exit the sandbox to return to the host: + To exit the sandbox: ```bash exit ``` - **If accessing the Web UI directly on the DGX Station** (keyboard and monitor attached), open a browser and navigate to the tokenized URL from Step 4. Prefer **`127.0.0.1`** in the URL bar (not `localhost`) so it matches strict gateway origin checks: - - ```text - http://127.0.0.1:18789/#token= - ``` - - **If accessing the Web UI from a remote machine**, you need to set up port forwarding. - - First, find your DGX Station's IP address. On the Station, run: - - ```bash - hostname -I | awk '{print $1}' - ``` - - Start the port forward on the DGX Station host: - - ```bash - openshell forward start 18789 my-assistant --background - ``` - - Expected: - - ```text - Forwarding 127.0.0.1:18789 -> my-assistant:18789 (background) - ``` - - If the forward was already started during onboarding, you will see: - - ```text - Error: Port 18789 is already forwarded to sandbox 'my-assistant'. - ``` - - This is fine -- the forward is already running. - - Then from your remote machine, create an SSH tunnel to the Station (replace `` with the IP address from above): - - ```bash - ssh -L 18789:127.0.0.1:18789 @ - ``` - - Now open the tokenized URL in your remote machine's browser. Either of these usually works on the **client** side because both bind to your loopback through the tunnel: - - ```text - http://127.0.0.1:18789/#token= - ``` - - > [!IMPORTANT] - > Use `127.0.0.1`, not `localhost` -- the gateway origin check requires an exact match. 
- --- - # Phase 3: Telegram Bot + # Phase 2: Modify NemoClaw Policy - Messaging (Telegram, Discord, Slack) is **wired during onboarding** — credentials are stored, OpenShell providers are created, and channel configuration is **baked into the sandbox image**. Runtime config under `/sandbox/.openclaw/` is not safely patchable from inside the running sandbox. + ## Step 4. Enable Brave Search in sandbox - **`nemoclaw start` does not start the Telegram bridge.** In current NemoClaw releases it starts **optional host services** such as the **cloudflared** tunnel when installed; Telegram delivery stays under OpenShell. See [NemoClaw commands](https://docs.nvidia.com/nemoclaw/latest/reference/commands.html) and [Set up Telegram bridge](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html). - - ## Step 9. Create a Telegram bot - - Open Telegram, find [@BotFather](https://t.me/BotFather), send `/newbot`, and follow the prompts. Copy the bot token. - - **Tip:** Use [Telegram Desktop](https://desktop.telegram.org/) or [web.telegram.org](https://web.telegram.org/) so you can **copy-paste** the token into your terminal or env file instead of typing 46+ characters from your phone into SSH. - - ## Step 10. Enable Telegram (first time or after skipping it) - - ### Path A — You have not installed yet, or you can re-run onboard - - Export the token on the **host**, then run the installer / onboard again (non-interactive variables from Step 4, plus `TELEGRAM_BOT_TOKEN`). The wizard’s **Messaging channels** step (installer phase **[5/8]**) is the right time to toggle Telegram interactively. - - Re-onboarding after a sandbox exists is supported; NemoClaw can detect token changes and rebuild the sandbox — see the official [Telegram bridge](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html) page. 
- - ### Path B — NemoClaw is already installed (recommended host command) - - On the **host** (run `exit` if you are inside `nemoclaw … connect`): - - 1. **Allow outbound access to the Telegram API** if you have not already — add the `telegram` network preset: + To add Brave Web Search to an existing sandbox, re-run the onboard wizard with `--fresh` to start a new session that re-prompts all options (including previously skipped features): ```bash - nemoclaw my-assistant policy-add + nemoclaw onboard --fresh --gpu ``` - When prompted, select `telegram` and confirm. + > [!NOTE] + > Without `--fresh`, the onboard wizard **resumes** the previous session and will not re-prompt for features you already skipped. - 2. **Register the bot token and rebuild** the sandbox image so Telegram is included: + When you reach **Enable Brave Web Search**, choose **yes** and paste the key from the [Brave Search API](https://brave.com/search/api/) console. Confirm the same sandbox name and inference choices where prompted. The wizard will **rebuild** the sandbox so the key is applied. + + > [!NOTE] + > Alternatively, set `BRAVE_API_KEY` in your environment before running the installer and Brave Search will be enabled automatically during onboard. + + To confirm web search is enabled, relaunch your OpenClaw WebUI or terminal UI. Ask the agent for something that needs **live web search**. If requests still fail, recheck **`policy-list`** and re-read the onboard output for Brave/API errors. + + ## Step 5. Set up Messaging Channel (Telegram Bot as an example) + + These steps apply when your sandbox exists but **Telegram was never configured** (you skipped **Messaging channels** in Step 2, or the sandbox policy tier never included Telegram-related egress). Replace `` with your sandbox (for example `my-assistant`). + + ### 1. Create a Telegram bot + + In Telegram, open [@BotFather](https://t.me/BotFather), send `/newbot`, and complete the prompts. 
Copy the **bot token** BotFather returns and keep it ready for the next step. + + ### 2. Register Telegram with NemoClaw and rebuild the sandbox ```bash - export TELEGRAM_BOT_TOKEN='' - nemoclaw my-assistant channels add telegram + nemoclaw channels add telegram ``` - Follow the prompts to rebuild when asked (or run `nemoclaw my-assistant rebuild --yes` afterward if non-interactive mode queued a rebuild — see `NEMOCLAW_NON_INTERACTIVE=1` behavior in the [commands reference](https://docs.nvidia.com/nemoclaw/latest/reference/commands.html)). + Paste the token when prompted. NemoClaw persists credentials and **rebuilds** the sandbox so OpenClaw can use Telegram as a messaging channel. - 3. **Pause or resume** Telegram delivery without changing credentials: use the **`nemoclaw channels stop`** / **`nemoclaw channels start`** patterns for the `telegram` channel described in [Set up Telegram bridge](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html) (exact subcommand spelling may vary slightly by NemoClaw version; use `nemoclaw --help` if in doubt). + ### 3. (If needed) Allow Telegram egress in the sandbox policy - Check overall status: + If messages fail with network or policy errors after the channel is registered, inspect presets and add Telegram-related egress if your tier omitted it: + + ```bash + nemoclaw policy-list + nemoclaw policy-add telegram + ``` + + Preset names follow your selected tier; confirm against [Network policies](https://docs.nvidia.com/nemoclaw/latest/reference/network-policies.html). + + ### 4. Verify Telegram + + Telegram uses long-polling (`getUpdates`) — the sandbox actively pulls messages from Telegram servers. **No public URL or cloudflared tunnel is required for Telegram to work.** + + Open Telegram, find your bot, and send a message. The bot should forward traffic to the agent in your NemoClaw sandbox and reply. 
+ + > [!NOTE] + > The first response may take longer depending on model size (30B models respond in a few seconds; larger models may take longer on first inference). + + > [!NOTE] + > If the bot does not respond: + > - Run `nemoclaw status` to confirm the sandbox is running and inference is healthy. + > - Run `nemoclaw logs --follow` and look for Telegram-related errors. + > - If Telegram egress is missing, run `nemoclaw policy-add` and select `telegram`. + > - If the channel was never registered, run `nemoclaw channels add telegram`. + + > [!NOTE] + > The `channels add telegram` wizard also prompts for an optional **Telegram User ID** to restrict who can DM the bot. Send `/start` to [@userinfobot](https://t.me/userinfobot) on Telegram to get your numeric user ID. If you skip this, the bot will require device pairing (a terminal-based code confirmation) before responding to messages. + + > [!NOTE] + > For details on restricting which Telegram chats can interact with the agent, see the [NemoClaw Telegram bridge documentation](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html). + + ### 5. (Optional) Install cloudflared for remote Web UI access + + The cloudflared tunnel provides a **public URL for the Web UI dashboard** — it is not related to Telegram messaging. + + Install cloudflared (DGX Station is arm64): + + ```bash + curl -L --output cloudflared.deb \ + https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-arm64.deb + sudo dpkg -i cloudflared.deb + ``` + + Start the tunnel: + + ```bash + nemoclaw tunnel start + ``` + + Verify: ```bash nemoclaw status ``` - Open Telegram, find your bot, and send it a message. + You should see `● cloudflared` with a `trycloudflare.com` public URL. - > [!NOTE] - > The first response may take 30--90 seconds for a 120B parameter model running locally. 
+ --- - > [!NOTE] - > To **persist** `TELEGRAM_BOT_TOKEN` for shell-based flows, use a `chmod 600` env file and `source` it from `~/.bashrc` as shown in Step 4. + # Phase 3: Set Up NemoClaw Agent - > [!NOTE] - > For chat allowlists and advanced Telegram behavior, see [NemoClaw Telegram bridge documentation](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html). + ## Step 6. Set Up NemoClaw Agents + + Setting up NemoClaw agents generally requires three steps: configure the NemoClaw security policy, run the agent workflow prompt, and personalize the workflow for your own use case. + + Check out these [Example NemoClaw Agents](https://build.nvidia.com/station/nemoclaw-applications) for reference. Consider sharing your NemoClaw agent setup with the community at the [DGX Station Developer Forum](https://forums.developer.nvidia.com/c/accelerated-computing/dgx-station-gb300). --- # Phase 4: Cleanup and Uninstall - ## Step 11. Stop services + ## Step 7. Stop services - Stop any running auxiliary services (Telegram bridge, cloudflared tunnel): + Stop the cloudflared tunnel: ```bash - nemoclaw stop + nemoclaw tunnel stop ``` - Expected: - - ```text - [services] All services stopped. - ``` - - Stop the port forward (always pass **port** and **sandbox name**): + Stop the port forward: ```bash - openshell forward list - openshell forward stop 18789 my-assistant + openshell forward list # find active forwards and their ports + openshell forward stop # stop the dashboard forward (use the port shown above) ``` - Stop and **remove** the vLLM container so the name `vllm-nemotron` is free for a future run. The playbook created the container with **`--restart unless-stopped`**, so `docker stop` alone is not enough: Docker would **restart it after reboot** and the container would keep reserving GPU memory. + ## Step 8. Uninstall NemoClaw + + The NemoClaw CLI includes a built-in uninstaller.
It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and Ollama are preserved. ```bash - docker update --restart=no vllm-nemotron 2>/dev/null || true - docker stop vllm-nemotron - docker rm vllm-nemotron + nemoclaw uninstall --yes ``` - To remove the container in one step even if it is running: `docker rm -f vllm-nemotron`. - - ## Step 12. Uninstall NemoClaw - - Run the uninstaller from the cloned source directory. It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and vLLM are preserved. + To remove everything including the Ollama model: ```bash - cd ~/.nemoclaw/source - ./uninstall.sh + nemoclaw uninstall --yes --delete-models ``` **Uninstaller flags:** @@ -713,15 +412,13 @@ spec: |------|--------| | `--yes` | Skip the confirmation prompt | | `--keep-openshell` | Leave the `openshell` binary in place | - | `--delete-models` | Removes **local inference models pulled by older NemoClaw flows** (the upstream flag name still references **Ollama**). It does **not** remove Hugging Face weights used by this playbook’s **vLLM** container — delete those separately (below). | + | `--delete-models` | Also remove the Ollama models pulled by NemoClaw | - To also remove the vLLM container and cached model weights: - - ```bash - ./uninstall.sh --yes - docker rm -f vllm-nemotron 2>/dev/null || true - rm -rf ~/.cache/huggingface/hub/models--nvidia--NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/ - ``` + > [!NOTE] + > If the `nemoclaw` CLI is not available (e.g. install failed partway), use the remote uninstaller as a fallback: + > ```bash + > curl -fsSL https://raw.githubusercontent.com/NVIDIA/NemoClaw/refs/heads/main/uninstall.sh | bash -s -- --yes + > ``` The uninstaller runs 6 steps: 1. Stop NemoClaw helper services and port-forward processes @@ -732,7 +429,7 @@ spec: 6. 
Remove state directories (`~/.nemoclaw`, `~/.config/openshell`, `~/.config/nemoclaw`) and the OpenShell binary > [!NOTE] - > The source clone at `~/.nemoclaw/source` is removed as part of state cleanup in step 6. If you want to keep a local copy, move or back it up before running the uninstaller. + > If you have a local clone at `~/.nemoclaw/source` you want to keep, move or back it up before running the uninstaller — it is removed as part of state cleanup in step 6. # Useful commands @@ -742,18 +439,13 @@ spec: | `nemoclaw my-assistant status` | Show sandbox status and inference config | | `nemoclaw my-assistant logs --follow` | Stream sandbox logs in real time | | `nemoclaw list` | List all registered sandboxes | - | `nemoclaw tunnel start` | Start optional host services such as **cloudflared** (public dashboard URL when installed); does **not** start Telegram | - | `nemoclaw start` | Deprecated alias for tunnel/aux host services — **not** for Telegram | - | `nemoclaw stop` | Stop host auxiliary services started by `nemoclaw tunnel start` / `nemoclaw start` | - | `nemoclaw channels add telegram` | Store Telegram token and rebuild sandbox (host) | + | `nemoclaw tunnel start` | Start cloudflared tunnel (public URL for remote Web UI access) | + | `nemoclaw tunnel stop` | Stop the cloudflared tunnel | + | `nemoclaw my-assistant dashboard-url --quiet` | Print the full tokenized Web UI URL (includes auto-assigned port) | | `openshell term` | Open the monitoring TUI on the host | | `openshell forward list` | List active port forwards | - | `openshell forward start 18789 my-assistant --background` | Start port forwarding for Web UI | - | `openshell forward stop 18789 my-assistant` | Stop Web UI port forward | - | `docker logs -f vllm-nemotron` | Stream vLLM inference server logs | - | `docker restart vllm-nemotron` | Restart the vLLM inference server | - | `curl http://localhost:8000/v1/models` | Check vLLM API status | - | `cd ~/.nemoclaw/source && ./uninstall.sh` | 
Remove NemoClaw (preserves Docker, Node.js, vLLM image) | + | `nemoclaw uninstall --yes` | Remove NemoClaw (preserves Docker, Node.js, Ollama) | + | `nemoclaw uninstall --yes --delete-models` | Remove NemoClaw and Ollama models | @@ -765,38 +457,72 @@ spec: | Symptom | Cause | Fix | |---------|-------|-----| - | `openclaw agent --local` fails or is blocked inside the sandbox | `--local` bypasses the NemoClaw gateway and is disallowed in the OpenShell sandbox | Use gateway mode: `openclaw agent --agent main -m "hello" --session-id test` (no `--local`). | - | Onboard fails with **“K8s namespace not ready”** (or similar) with no clear reason | Often **low disk space** on `/` or Docker’s data root; image push / k3s need headroom | Run `df -h / /var/lib/docker`. Free **at least ~40 GB** (see [NemoClaw quickstart prerequisites](https://docs.nvidia.com/nemoclaw/latest/get-started/quickstart.html)); prune Docker (`docker system prune`) or expand disk, then retry onboard. | - | vLLM warns about **mixed devices** or loads on an unexpected GPU | Multiple GPUs visible; default visibility does not match intent | Pin one GPU: `--gpus '"device=0"'` and `-e CUDA_VISIBLE_DEVICES=0` with `--tensor-parallel-size 1`, or use two GPUs explicitly with `--tensor-parallel-size 2` and `-e CUDA_VISIBLE_DEVICES=0,1` (see Step 3 in instructions). | | `nemoclaw: command not found` after install | Shell PATH not updated | Run `source ~/.bashrc` (or `source ~/.zshrc` for zsh), or open a new terminal window. | - | `pip: command not found` | pip not installed on DGX Station by default | Install pip: `sudo apt install -y python3-pip`. Then use `pip3 install --break-system-packages huggingface-hub`. | - | `huggingface-cli` is deprecated | Hugging Face CLI was renamed | Use `hf download` instead of `huggingface-cli download`. | - | vLLM container won't start or crashes | GPU memory issue or wrong image | Check logs: `docker logs vllm-nemotron`. 
If CUDA OOM, reduce context: recreate the container with `--max-model-len 8192`. Ensure you are using the NVIDIA container image (`nvcr.io/nvidia/vllm:26.03-py3`), not the community `vllm/vllm-openai` image. | - | vLLM logs show `Application startup complete.` but `curl` times out | vLLM still compiling CUDA graphs after startup | Wait 1--2 minutes after `Application startup complete.` before sending requests. The first request compiles CUDA graphs and may take 30--90 seconds. | - | NemoClaw onboard fails with "endpoint validation failed" | vLLM model not warmed up or validation timeout too short | Warm up the model first: `curl -s --max-time 120 http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4","messages":[{"role":"user","content":"hello"}],"max_tokens":10}'`. Then re-run with `NEMOCLAW_EXPERIMENTAL=1 NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300 nemoclaw onboard`. | - | NemoClaw reports "provider 'vllm' is not available" | Missing experimental flag | Set `NEMOCLAW_EXPERIMENTAL=1` before running the installer or `nemoclaw onboard`. The vLLM provider is currently an experimental feature. | + | Installer fails with Node.js version error | Node.js version below 22.16 | Install Node.js 22.16+: `curl -fsSL https://deb.nodesource.com/setup_22.x \| sudo -E bash - && sudo apt-get install -y nodejs` then re-run the installer. | + | npm install fails with `EACCES` permission error | npm global directory not writable | `mkdir -p ~/.npm-global && npm config set prefix ~/.npm-global && export PATH=~/.npm-global/bin:$PATH` then re-run the installer. Add the `export` line to `~/.bashrc` to make it permanent. | | Docker permission denied | User not in docker group | `sudo usermod -aG docker $USER`, then log out and back in. 
| - | Gateway fails with cgroup / "Failed to start ContainerManager" errors | Docker not configured for host cgroup namespace on DGX Station | Run the cgroup fix: `sudo python3 -c "import json, os; path='/etc/docker/daemon.json'; d=json.load(open(path)) if os.path.exists(path) else {}; d['default-cgroupns-mode']='host'; json.dump(d, open(path,'w'), indent=2)"` then `sudo systemctl restart docker`. | + | Gateway fails with cgroup / "Failed to start ContainerManager" errors | Older OpenShell or Docker still using a **private** cgroup namespace for the gateway so kubelet cannot see cgroup v2 controllers | First **upgrade OpenShell** (re-run the Phase 1 `nemoclaw.sh` install so you get a build that sets host cgroupns on the gateway container). If it still fails, force Docker's default to host mode by running the [daemon.json cgroup fix](#daemonjson-cgroup-fix) below, then run `sudo systemctl restart docker`. | | Gateway fails with "port 8080 is held by container..." | Another OpenShell gateway or container is using port 8080 | Stop the conflicting container: `openshell gateway destroy -g ` or `docker stop && docker rm `, then retry `nemoclaw onboard`. | - | Sandbox cannot reach the inference server | Using `localhost` instead of `host.openshell.internal` in endpoint URL | Inside the sandbox, `localhost` refers to the sandbox container, not the host. The onboard wizard configures `host.openshell.internal` automatically. Verify from inside the sandbox: `curl -sf https://inference.local/v1/models`. If this fails, check that vLLM is reachable from the host: `curl -s http://localhost:8000/v1/models`. | - | Agent gives no response or is very slow | Normal for 120B model running locally | Nemotron 3 Super 120B can take 30--90 seconds per response. Verify inference route: `nemoclaw my-assistant status`. 
| - | vLLM API returns empty or errors on tool calls | Missing tool-call flags | Verify that `--enable-auto-tool-choice` and `--tool-call-parser qwen3_xml` are set: `docker inspect vllm-nemotron --format '{{.Config.Cmd}}'`. | + | Sandbox creation fails | Stale gateway state or DNS not propagated | Run `openshell gateway destroy && openshell gateway start`, then re-run the installer or `nemoclaw onboard`. | + | CoreDNS crash loop | Known issue on some DGX Station configurations | Re-run the NemoClaw installer (`curl -fsSL https://www.nvidia.com/nemoclaw.sh \| bash`) which includes the CoreDNS fix. If the issue persists, see [NemoClaw troubleshooting](https://docs.nvidia.com/nemoclaw/latest/reference/troubleshooting.html). | + | "No GPU detected" during onboard | DGX Station GB300 reports unified memory differently | Expected on DGX Station. The wizard still works and uses Ollama for inference. | + | Inference timeout or hangs | Ollama not running or not reachable | Check Ollama: `curl http://127.0.0.1:11434`. If not running: `sudo systemctl restart ollama`. Verify the NemoClaw auth proxy is healthy: `curl http://127.0.0.1:11435/api/tags`. If both respond, check `nemoclaw my-assistant status` for the Inference health line. | + | Agent gives no response or is very slow | First response can be slow, especially with larger models | Response time depends on model size (30B: a few seconds, 120B: 30–90 seconds). Verify inference route: `nemoclaw my-assistant status`. | | Port 18789 already in use | Another process is bound to the port | `lsof -i :18789` then `kill `. If needed, `kill -9 ` to force-terminate. | - | Web UI port forward dies or dashboard unreachable | Port forward not active | `openshell forward stop 18789 my-assistant` then `openshell forward start 18789 my-assistant --background`. Always pass **port** and **sandbox name** to `openshell forward stop`. 
| - | Web UI shows `origin not allowed` | Browser origin does not match what the gateway expects | On the **DGX Station local desktop**, open `http://127.0.0.1:18789/#token=...` (not `localhost`). Through an **SSH tunnel** on another machine, `localhost` vs `127.0.0.1` in the client browser usually both work because the check applies to how you reach the forwarded port locally. | - | Telegram does not work after install; `nemoclaw start` does nothing for Telegram | **`nemoclaw start` starts optional host services (e.g. cloudflared), not the Telegram bridge** | Configure Telegram during onboard, or on the host run `nemoclaw my-assistant channels add telegram` (and rebuild), after `policy-add` for the `telegram` preset. See [Set up Telegram bridge](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html). | - | Telegram bot receives messages but does not reply | Telegram policy not added to sandbox | Run `nemoclaw my-assistant policy-add`, type `telegram`, hit Y. Ensure the channel was added with `nemoclaw my-assistant channels add telegram` so the image includes Telegram. | - | `docker: Error response from daemon: Conflict. The container name "/vllm-nemotron" is already in use` | Previous cleanup used `docker stop` only | `docker rm -f vllm-nemotron` (or `docker update --restart=no` then `docker stop` and `docker rm`). The playbook uses `--restart unless-stopped`; stopping alone leaves a restart policy and reserved name. | + | Web UI port forward dies or dashboard unreachable | Port forward not active | `openshell forward stop 18789 my-assistant` then `openshell forward start 18789 my-assistant --background`. | + | Web UI shows `origin not allowed` | Accessing via `localhost` instead of `127.0.0.1` | Use `http://127.0.0.1:18789/#token=...` in the browser. The gateway origin check requires `127.0.0.1` exactly. 
| + | Telegram bridge does not start | Telegram channel not registered with sandbox | Run `nemoclaw channels add telegram` to register the bot token and rebuild the sandbox. Verify with `nemoclaw status`. | + | Telegram stops responding after sandbox rebuild | Telegram long-polling session stale after rebuild | Run `nemoclaw recover` to restart the gateway. If still unresponsive, run `nemoclaw channels add telegram` to re-register and rebuild. | + | Telegram bot receives messages but does not reply | Telegram network egress policy not added | Run `nemoclaw policy-add`, select `telegram`, and confirm. This is a hot-reload — no rebuild needed. | - **Model variant guidance:** + ### daemon.json cgroup fix - | Variant | Size | VRAM Required | When to Use | - |---------|------|---------------|-------------| - | `NVFP4` | ~60 GB | ~80 GB | Default for DGX Station (GB300). Fits on single GPU with room for large KV cache. | - | `FP8` | ~120 GB | ~140 GB | Higher accuracy, still fits on GB300. Add `--kv-cache-dtype fp8` to the vLLM command. | - | `BF16` | ~240 GB | ~260 GB | Highest accuracy. Fits on GB300 but leaves little room for KV cache. Reduce `--max-model-len`. | + Use this script as the fallback for the cgroup / "Failed to start ContainerManager" row above. It validates any existing `/etc/docker/daemon.json`, writes a `.bak` backup, sets `default-cgroupns-mode` to `host`, and atomically replaces the file. It exits non-zero with an error on stderr if anything fails, leaving the original `daemon.json` untouched. - For the latest known issues, see [DGX Station documentation](https://docs.nvidia.com/dgx/dgx-station-user-guide/index.html). 
+ ```bash + sudo python3 - <<'PY' + import json, os, shutil, sys, tempfile + + path = '/etc/docker/daemon.json' + try: + if os.path.exists(path): + with open(path) as f: + data = json.load(f) + if not isinstance(data, dict): + raise ValueError(f'{path} is not a JSON object') + else: + data = {} + except (json.JSONDecodeError, ValueError, OSError) as e: + print(f'error: failed to read {path}: {e}', file=sys.stderr) + sys.exit(1) + + if os.path.exists(path): + try: + shutil.copy2(path, path + '.bak') + except OSError as e: + print(f'error: failed to back up {path}: {e}', file=sys.stderr) + sys.exit(1) + + data['default-cgroupns-mode'] = 'host' + + target_dir = os.path.dirname(path) or '/' + fd, tmp = tempfile.mkstemp(prefix='daemon.json.', dir=target_dir) + try: + with os.fdopen(fd, 'w') as f: + json.dump(data, f, indent=2) + f.write('\n') + os.chmod(tmp, 0o644) + os.replace(tmp, path) + except OSError as e: + if os.path.exists(tmp): + try: + os.unlink(tmp) + except OSError: + pass + print(f'error: failed to write {path}: {e}', file=sys.stderr) + sys.exit(1) + PY + ``` @@ -814,19 +540,3 @@ spec: url: https://docs.openclaw.ai - - name: vLLM Documentation - url: https://docs.vllm.ai - - - - name: Nemotron-3-Super on Hugging Face - url: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 - - - - name: DGX Station Documentation - url: https://docs.nvidia.com/dgx/dgx-station-user-guide/index.html - - - - name: DGX Station Forum - url: https://forums.developer.nvidia.com - - diff --git a/nvidia/station-vllm/endpoint-test.yaml b/nvidia/station-vllm/endpoint-test.yaml index c003724..d018424 100644 --- a/nvidia/station-vllm/endpoint-test.yaml +++ b/nvidia/station-vllm/endpoint-test.yaml @@ -1,8 +1,8 @@ kind: Playbook metadata: name: station-vllm - displayName: Serve Qwen3-235B with vLLM - shortDescription: Set up vLLM server with Qwen3-235B on DGX Station + displayName: vLLM for Inference + shortDescription: Install and use vLLM on DGX Station publisher: 
nvidia description: | # REPLACE THIS WITH YOUR MODEL CARD @@ -15,7 +15,7 @@ metadata: attributes: - key: DURATION - value: 20 MIN + value: 30 MIN spec: artifactName: station-vllm @@ -42,7 +42,9 @@ spec: # What you'll accomplish - Serve the **Qwen3-235B-A22B-NVFP4** model using vLLM on NVIDIA DGX Station. This 235B parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU. + Serve a **supported model** using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models. + + You'll set up vLLM high-throughput LLM serving on NVIDIA DGX Station with Blackwell architecture. # What to know before starting @@ -57,21 +59,33 @@ spec: - HuggingFace account with access token - Network access to NGC and HuggingFace + # Model Support Matrix + + The following models are supported with vLLM on DGX Station. All listed models are available and ready to use: + + | Model | Quantization | Support Status | HF Handle | + |-------|-------------|----------------|-----------| + | **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) | + | **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) | + | **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) | + | **Kimi-K2.5 (1T)** | NVFP4 | ✅ | [`nvidia/Kimi-K2.5-NVFP4`](https://huggingface.co/nvidia/Kimi-K2.5-NVFP4) | + | **DeepSeek-V4-Flash** | NVFP4 | ✅ | [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | # Time & risk - * **Duration:** 15-20 minutes (longer on first run due to model download) + * **Duration:** 30 minutes (longer on first run due to model download) * **Risks:** Model download requires HuggingFace authentication * **Rollback:** Stop and remove the container to restore state - * **Last Updated:** 03/02/2026 - * First Publication + * 
**Last Updated:** 05/29/2026 + * Update models + * Add base configuration example, per-setting explanations, and DeepSeek-V4-Flash recipe - id: instructions - label: Serve Qwen3-235B + label: Instructions content: | # Step 1. Set up Docker permissions @@ -92,7 +106,7 @@ spec: export HF_TOKEN="your_huggingface_token" # Model to serve - export MODEL_HANDLE="nvidia/Qwen3-235B-A22B-NVFP4" + export MODEL_HANDLE="" # Maximum context length export MAX_MODEL_LEN=8192 @@ -106,9 +120,28 @@ spec: docker pull nvcr.io/nvidia/vllm:26.01-py3 ``` + For Step-3.7-Flash models, pull the custom vLLM container: + ```bash + docker pull vllm/vllm-openai:stepfun37 + ``` + + For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, pull the **26.03** image, which includes the `--cpu-offload-params` support used below: + ```bash + docker pull nvcr.io/nvidia/vllm:26.03-py3 + ``` + + For DeepSeek-V4-Flash, pull the stable DeepSeek-V4 release container. Use the **cu130** build on DGX Station (Blackwell): + ```bash + docker pull vllm/vllm-openai:v0.20.0-cu130 + ``` + # Step 4. Start vLLM server - Start the vLLM server with the Qwen3-235B model. This model fits entirely in VRAM on the GB300. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`. + Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`. + + ## Base configuration (most models) + + This is the recommended starting point for any model that fits entirely in VRAM on the GB300. The Qwen3-235B-A22B-NVFP4 model, for example, runs directly with this configuration.
```bash docker run -d \ @@ -126,6 +159,122 @@ spec: --gpu-memory-utilization 0.9 ``` + Settings used: + - `--max-model-len` — maximum context length (prompt + output) per request. Larger values reserve more GPU memory for the KV cache; size it to your workload. + - `--gpu-memory-utilization 0.9` — fraction of GPU memory vLLM may use for weights and KV cache. `0.9` leaves headroom for other processes; raise toward `0.95` to fit more KV cache if the GPU is dedicated. + + ## Step-3.7-Flash (FP8 / NVFP4) + + For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300. + + ```bash + docker run -d \ + --name vllm-server \ + --gpus all \ + --ipc host \ + --ulimit memlock=-1 \ + --ulimit stack=67108864 \ + -p 8000:8000 \ + -e HF_TOKEN="$HF_TOKEN" \ + -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \ + vllm/vllm-openai:stepfun37 \ + "$MODEL_HANDLE" \ + --gpu-memory-utilization 0.95 \ + --trust-remote-code \ + --reasoning-parser step3p5 \ + --enable-auto-tool-choice \ + --tool-call-parser step3p5 \ + --kv-cache-dtype fp8 + ``` + + Settings used (in addition to the base configuration): + - `--trust-remote-code` — allows the model's custom modeling code (shipped in its repo) to load. Required for Step-3.7. + - `--reasoning-parser step3p5` — parses the model's reasoning/thinking tokens into the dedicated `reasoning_content` response field. + - `--enable-auto-tool-choice` — lets the model decide when to call a tool, enabling OpenAI-compatible function calling. + - `--tool-call-parser step3p5` — parses the model's tool-call output into structured `tool_calls`. Pairs with `--enable-auto-tool-choice`. + - `--kv-cache-dtype fp8` — stores the KV cache in FP8, roughly halving KV-cache memory versus 16-bit and allowing more concurrent/longer sequences. + + ## Kimi-K2.5 NVFP4 (1T) — CPU offloading + + For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, run with the **26.03** NGC container.
This model does not fit entirely in VRAM, so the MoE expert weights are offloaded to CPU DRAM with `--cpu-offload-gb 375 --cpu-offload-params experts`. Ensure the system has enough free DRAM to hold the offloaded weights. + + ```bash + docker run -d \ + --name vllm-server \ + --gpus all \ + --ipc host \ + --ulimit memlock=-1 \ + --ulimit stack=67108864 \ + -p 8000:8000 \ + -e HF_TOKEN="$HF_TOKEN" \ + -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \ + nvcr.io/nvidia/vllm:26.03-py3 \ + vllm serve nvidia/Kimi-K2.5-NVFP4 \ + --host 0.0.0.0 \ + --port 8000 \ + --dtype auto \ + --kv-cache-dtype auto \ + --gpu-memory-utilization 0.95 \ + --served-model-name nvidia/Kimi-K2.5-NVFP4 \ + --tensor-parallel-size 1 \ + --no-enable-prefix-caching \ + --trust-remote-code \ + --max-model-len 40960 \ + --max-num-seqs 1 \ + --max-num-batched-tokens 32768 \ + --cpu-offload-gb 375 \ + --cpu-offload-params experts + ``` + + Settings used (in addition to the base configuration): + - `--cpu-offload-gb 375` — amount of CPU DRAM (in GiB) vLLM may use to hold weights that don't fit in VRAM. Must be large enough for the offloaded experts; the system needs at least this much free DRAM. + - `--cpu-offload-params experts` — offloads only the MoE expert weights (the bulk of a large MoE model) to DRAM, keeping attention and other hot weights in VRAM. + - `--tensor-parallel-size 1` — single GPU; the GB300 serves the whole model. + - `--max-num-seqs 1` / `--max-num-batched-tokens 32768` — caps concurrency to one sequence and the batch token budget. With expert weights paged from DRAM, throughput is offload-bound, so a low concurrency keeps latency predictable. + - `--no-enable-prefix-caching` — disables prefix-cache reuse. Offloaded experts make the memory budget tight, so the cache is turned off here rather than spent on KV reuse. + - `--kv-cache-dtype auto` / `--dtype auto` — let vLLM pick the KV-cache and compute dtypes from the model's quantization (NVFP4). 
+ + ## DeepSeek-V4-Flash — MTP + agentic + + For DeepSeek-V4-Flash, run with the stable **v0.20.0-cu130** container. This recipe targets agentic workloads and enables Multi-Token Prediction (MTP) speculative decoding. On a single GB300 (TP1) the MoE expert-parallel path is used; the `deep_gemm_mega_moe` backend from some internal recipes is not needed at TP1 and is omitted here. + + ```bash + docker run -d \ + --name vllm-server \ + --gpus all \ + --ipc host \ + --ulimit memlock=-1 \ + --ulimit stack=67108864 \ + -p 8000:8000 \ + -e HF_TOKEN="$HF_TOKEN" \ + -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \ + vllm/vllm-openai:v0.20.0-cu130 \ + deepseek-ai/DeepSeek-V4-Flash \ + --enable-expert-parallel \ + --kv-cache-dtype fp8 \ + --trust-remote-code \ + --block-size 256 \ + --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \ + --attention_config.use_fp4_indexer_cache True \ + --tokenizer-mode deepseek_v4 \ + --tool-call-parser deepseek_v4 \ + --enable-auto-tool-choice \ + --reasoning-parser deepseek_v4 \ + --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \ + --max-model-len 32768 + ``` + + Settings used (in addition to the base configuration): + - `--enable-expert-parallel` — shards the MoE experts across the available GPU(s) using expert parallelism, the recommended MoE execution path for DeepSeek-V4. + - `--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'` — enables **MTP (Multi-Token Prediction)** speculative decoding: the model proposes 3 tokens per step that are verified in a single forward pass, cutting latency for accepted tokens. + - `--kv-cache-dtype fp8` — FP8 KV cache to fit more concurrent/longer sequences. + - `--block-size 256` — KV-cache page size in tokens. DeepSeek-V4 uses multiple KV-cache groups; `256` matches the recipe validated on Station. + - `--attention_config.use_fp4_indexer_cache True` — enables the FP4 indexer cache used by DeepSeek-V4's attention. 
(Drop this flag on platforms without native FP4, e.g. Hopper.) + - `--tokenizer-mode deepseek_v4` / `--tool-call-parser deepseek_v4` / `--reasoning-parser deepseek_v4` — DeepSeek-V4-specific tokenizer, tool-call, and reasoning parsers. + - `--enable-auto-tool-choice` — OpenAI-compatible function calling for agentic use. + - `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` — uses full + piecewise CUDA graph capture and enables all custom ops for lower per-step overhead. + - **Prefix caching is left enabled (the vLLM default).** For agentic workloads with large shared prefixes (e.g. a 32k system/context prefix) at low batch sizes (~BS 3–4), prefix caching gives a significant throughput boost by reusing the cached prefix across requests. Some internal recipes carry `--no-enable-prefix-caching`, but that was inherited from random-data benchmarking and is not recommended for agentic use here. + Check the server logs for startup progress: ```bash @@ -135,7 +284,7 @@ spec: Expected output includes: - Model download progress (first run only) - Model loading into GPU memory - - `Uvicorn running on http://0.0.0.0:8000` + - `Application startup complete.` Press `Ctrl+C` to exit log view once the server is ready. @@ -166,9 +315,10 @@ spec: Optionally, remove the image and cached model: + For example: ```bash - docker rmi nvcr.io/nvidia/vllm:26.01-py3 - rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3-235B-A22B-NVFP4 + docker rmi "" + rm -rf $HOME/.cache/huggingface/hub/"" ```