chore: Regenerate all playbooks

GitLab CI 2026-06-02 14:46:57 +00:00
parent b849d2d191
commit 32cbd72374
6 changed files with 864 additions and 690 deletions


@ -0,0 +1,413 @@
kind: Playbook
metadata:
name: station-ai-skills
displayName: DGX Station AI Skills for Coding Agents
shortDescription: Give your coding agent (Claude Code, Codex, Gemini CLI, Cursor) DGX Station expertise via an AGENTS.md and on-demand Agent Skills
publisher: nvidia
description: |
# REPLACE THIS WITH YOUR MODEL CARD
https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads
labelsV2:
- gpuType:playbook:gpu_type_station
- DGX Station
- GB300
- Blackwell
- AI Agents
- Agent Skills
- AGENTS.md
- Claude Code
- Codex
- Gemini CLI
- Cursor
- vLLM
- SGLang
- MIG
- Mixed Coherency
attributes:
- key: DURATION
value: 15 MIN
spec:
artifactName: station-ai-skills
nvcfFunctionId: None
attributes:
showUnavailableBanner: false
apiDocsUrl: None
termsOfUse: |
cta:
text: View on GitHub
url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-ai-skills/
tabs:
-
id: overview
label: Overview
content: |
# Basic idea
Modern coding agents — Claude Code, OpenAI Codex CLI, Gemini CLI, Cursor — all support two extension mechanisms: a project-level **context file** that's loaded into every conversation, and **on-demand procedural workflows** (called skills, prompts, commands, or rules depending on the harness). This playbook ships both for DGX Station:
- An **`AGENTS.md`** with the critical DGX Station constraints your agent should always know (mixed coherency, GPU targeting, common pitfalls). `AGENTS.md` is the cross-harness standard; an `install.sh` lays it down as `CLAUDE.md`, `GEMINI.md`, or `AGENTS.md` depending on the agent you use.
- **Four Agent Skills** — `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose` — authored once in the [Anthropic Agent Skills format](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview) and installed into the right per-harness location (`.claude/skills/`, `.codex/prompts/`, `.gemini/commands/`, or `.cursor/rules/`).
This approach keeps your agent's context lean in every conversation while giving it deep procedural knowledge on demand, regardless of which agent you use.
## AGENTS.md vs Agent Skill — why split?
| | AGENTS.md | Agent Skill |
|---|---|---|
| **Loaded** | Every conversation, automatically | Only when invoked by name (or matched by description, in Claude) |
| **Best for** | Constraints, pitfalls, "never do X" rules | Step-by-step workflows, deployment procedures |
| **Context cost** | Consumed every time | Zero until invoked |
The DGX Station mixed-coherency constraint (`--gpus all` will crash) should be in every conversation. The full vLLM deployment procedure should not.
# What you'll accomplish
- Install the `AGENTS.md` and four Agent Skills into your project directory for your chosen agent (Claude Code, Codex, Gemini CLI, or Cursor).
- Verify the agent loads the constraints automatically and the skills on demand.
- Invoke `vllm-setup` to deploy a vLLM inference server with validated configuration.
- Invoke `sglang-setup` to deploy an SGLang inference server.
- Invoke `mig-configure` to partition the GB300 into MIG instances.
- Invoke `dgx-diagnose` to troubleshoot common DGX Station issues.
# What to know before starting
- Basic familiarity with one supported coding agent (running it, giving it prompts, using slash commands or rule references)
- General understanding of DGX Station (two GPUs, Docker-based workflows)
# Prerequisites
- NVIDIA DGX Station with GB300
- One of the supported coding agents installed:
- **Claude Code:** `curl -fsSL https://claude.ai/install.sh | sh`
- **OpenAI Codex CLI:** `npm i -g @openai/codex`
- **Gemini CLI:** `npm i -g @google/gemini-cli`
- **Cursor:** download from `https://cursor.com/`
- A project directory where you do DGX Station work
# Ancillary files
- `assets/AGENTS.md` — canonical context file with critical constraints, GPU targeting, software versions, and common pitfalls. Cross-harness standard.
- `assets/skills/vllm-setup/SKILL.md` — skill: deploy vLLM with validated configuration.
- `assets/skills/sglang-setup/SKILL.md` — skill: deploy SGLang with validated configuration.
- `assets/skills/mig-configure/SKILL.md` — skill: configure MIG partitions on the GB300.
- `assets/skills/dgx-diagnose/SKILL.md` — skill: troubleshoot common DGX Station issues.
- `assets/install.sh` — per-harness installer (`claude`, `codex`, `gemini`, `cursor`, or `all`).
# Time & risk
* **Duration:** 10-15 minutes
* **Risk level:** Low — this playbook copies markdown files into your project directory
* **Rollback:** Delete the context file (`AGENTS.md` / `CLAUDE.md` / `GEMINI.md`) and the harness-specific skill directory (`.claude/skills/`, `.codex/prompts/`, `.gemini/commands/`, or `.cursor/rules/`) from your project directory
* **Last Updated:** 05/18/2026
* Restructured as harness-agnostic Agent Skills (Claude Code, Codex, Gemini CLI, Cursor)
-
id: instructions
label: Instructions
content: |
# Step 1. Install your coding agent
Pick whichever agent you prefer — the rest of this playbook works the same regardless. Install commands:
| Agent | Install |
|-------|---------|
| Claude Code | `curl -fsSL https://claude.ai/install.sh \| sh` |
| OpenAI Codex CLI | `npm i -g @openai/codex` |
| Gemini CLI | `npm i -g @google/gemini-cli` |
| Cursor | Download from `https://cursor.com/` |
Verify with `claude --version`, `codex --version`, `gemini --version`, or by launching Cursor.
# Step 2. Install the skills into your project
Navigate to the project where you want DGX Station expertise, then run the installer with the harness you use:
```bash
cd ~/your-project
# Pick one:
/path/to/this/playbook/assets/install.sh claude
/path/to/this/playbook/assets/install.sh codex
/path/to/this/playbook/assets/install.sh gemini
/path/to/this/playbook/assets/install.sh cursor
# Or install for all four at once:
/path/to/this/playbook/assets/install.sh all
```
If you downloaded the playbook as a zip, the path is relative to the extracted directory:
```bash
station-ai-skills/assets/install.sh claude ~/your-project
```
The installer is additive for skill directories (won't clobber existing skills you've written) and refuses to overwrite an existing context file (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`) unless you pass `--force`.
**Resulting layout** (per harness):
```text
your-project/
AGENTS.md or CLAUDE.md or GEMINI.md # context file (named for your agent)
.claude/skills/<name>/SKILL.md # claude
.codex/prompts/<name>.md # codex
.gemini/commands/<name>.md # gemini
.cursor/rules/<name>.mdc # cursor
```
Where `<name>` is each of `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose`.
> [!NOTE]
> Every supported agent automatically reads the context file from the working directory at startup. Skills/prompts/rules in the harness-specific directory are discovered automatically — no additional configuration needed.
# Step 3. Verify the setup
Start your agent in the project directory and ask a question that requires constraint knowledge:
```text
Can I use --gpus all to run my CUDA workload on DGX Station?
```
The agent should immediately warn about the mixed-coherency constraint and recommend `--gpus '"device=N"'` targeting. If you don't get the warning, the context file isn't being loaded — see Troubleshooting.
Then verify the skills are discoverable:
| Agent | How to check |
|-------|--------------|
| Claude Code | Type `/` — `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose` should appear in the autocomplete |
| Codex CLI | Type `/prompts:` — same four names appear |
| Gemini CLI | Type `/` — same four names appear |
| Cursor | Open the Rules panel — same four rules appear |
# Step 4. Use vllm-setup to deploy an inference server
Invoke the skill in your agent:
| Agent | Invocation |
|-------|-----------|
| Claude Code | `/vllm-setup` (slash command) or just describe the task ("deploy vllm with Qwen3-8B") |
| Codex CLI | `/prompts:vllm-setup` |
| Gemini CLI | `/vllm-setup` |
| Cursor | In chat: "use the vllm-setup rule to deploy a vllm server" |
The agent will walk you through deploying a vLLM server with a validated container image, correct GPU targeting, and recommended parameters. It will check your GPU index, ask which model you want to serve, and generate the full `docker run` command.
# Step 5. Use sglang-setup to deploy SGLang
Use the same invocation pattern with `sglang-setup`. The skill deploys SGLang with the `cu130` container, RadixAttention prefix caching, and structured JSON output support.
# Step 6. Use mig-configure to partition the GB300
The agent will query your current MIG state, show available profiles, help you choose a layout for your workloads, and execute the partitioning commands.
# Step 7. Use dgx-diagnose to troubleshoot issues
If you encounter problems, invoke `dgx-diagnose`. The agent will check GPU status, driver version, running processes, MIG state, and Fabric Manager to identify the issue.
# Step 8. Customize
Both the `AGENTS.md` and the skills are plain markdown — extend them freely.
**Add project-specific constraints to `AGENTS.md`** (or your harness-specific context file):
```markdown
## Project-specific
- Our production MIG layout is 3g.139gb + 2g.70gb + 2g.70gb
- Always use port 8080 for inference (nginx proxy on 443)
- Model weights are cached at /data/models, mount with -v /data/models:/root/.cache/huggingface/hub
```
**Create new skills** by adding a directory and `SKILL.md` to `assets/skills/`, then re-run `install.sh`:
```bash
mkdir -p assets/skills/run-benchmarks
cat > assets/skills/run-benchmarks/SKILL.md << 'EOF'
---
name: run-benchmarks
description: Run our standard inference benchmark suite against the running vLLM or SGLang server and compare against the baseline.
---
# Run benchmarks
1. Check which inference server is running (vLLM on port 8000 or SGLang on port 30000)
2. Run the appropriate benchmark script from ./benchmarks/
3. Report throughput (tokens/sec), latency (TTFT, ITL), and memory utilization
4. Compare against the baseline in ./benchmarks/baseline.json
EOF
```
> [!TIP]
> Keep `AGENTS.md` focused on constraints and pitfalls (things that break). Put procedural workflows in skills (things you do step-by-step).
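
Before re-running the installer, it can help to confirm that a new `SKILL.md` opens with front matter containing both `name:` and `description:`, as in the example above. The following is a minimal sketch under that assumption; `check_skill_frontmatter` is a hypothetical helper, not part of the playbook's assets, and it checks only for the two keys shown here rather than the full Agent Skills format.

```bash
# Sketch: verify that a SKILL.md starts with YAML front matter that
# contains non-empty `name:` and `description:` keys.
# `check_skill_frontmatter` is a hypothetical helper name.
check_skill_frontmatter() {  # usage: check_skill_frontmatter <path/to/SKILL.md>
  awk '
    NR == 1 && $0 != "---" { exit 1 }              # first line must open the front matter
    NR > 1 && $0 == "---"  { closed = 1; exit }    # closing delimiter ends the scan
    /^name:[ ]*[^ ]/        { has_name = 1 }
    /^description:[ ]*[^ ]/ { has_desc = 1 }
    END { exit !(closed && has_name && has_desc) } # 0 only if everything was found
  ' "$1"
}

# Example (path taken from the step above):
#   check_skill_frontmatter assets/skills/run-benchmarks/SKILL.md && echo "front matter OK"
```

The function returns a non-zero status when the delimiters or either key is missing, so it can gate a re-install in a script.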
-
id: troubleshooting
label: Troubleshooting
content: |
# Skills don't appear in autocomplete / aren't discoverable
Each agent discovers skills from a harness-specific directory in the current directory (or a parent). Check the right one:
| Agent | Expected location |
|-------|-------------------|
| Claude Code | `.claude/skills/<name>/SKILL.md` |
| Codex CLI | `.codex/prompts/<name>.md` |
| Gemini CLI | `.gemini/commands/<name>.md` |
| Cursor | `.cursor/rules/<name>.mdc` |
```bash
# Examples — check the directory for your agent
ls -la .claude/skills/
ls -la .codex/prompts/
ls -la .gemini/commands/
ls -la .cursor/rules/
```
You should see entries for `vllm-setup`, `sglang-setup`, `mig-configure`, and `dgx-diagnose`.
**Check you're in the right directory:**
```bash
pwd
```
The agent must be started from the directory containing the harness directory, or a subdirectory of it.
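The location table above can be turned into a quick check that lists which of the four skills are missing for a given agent. This is a sketch only: `skill_path` and `missing_skills` are hypothetical helper names, and the paths are copied from the table rather than read from `install.sh`.

```bash
# Sketch: report which of the four skills are missing for one harness.
# `skill_path` and `missing_skills` are hypothetical helpers; the paths
# mirror the "Expected location" table.

skill_path() {  # usage: skill_path <harness> <skill-name>
  case "$1" in
    claude) printf '.claude/skills/%s/SKILL.md\n' "$2" ;;
    codex)  printf '.codex/prompts/%s.md\n' "$2" ;;
    gemini) printf '.gemini/commands/%s.md\n' "$2" ;;
    cursor) printf '.cursor/rules/%s.mdc\n' "$2" ;;
    *)      return 1 ;;   # unknown harness
  esac
}

missing_skills() {  # usage: missing_skills <harness> [project-dir]
  local harness="$1" root="${2:-.}" name rel
  for name in vllm-setup sglang-setup mig-configure dgx-diagnose; do
    rel="$(skill_path "$harness" "$name")" || return 1
    [ -f "$root/$rel" ] || echo "$rel"   # print only the paths that are absent
  done
}

# Example: missing_skills claude ~/your-project
```

Empty output means all four files are in place for that agent.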
# Context file not loaded
If the agent gives generic answers without DGX Station awareness, the context file isn't being picked up. Each agent reads a different filename — verify the one for your agent exists:
| Agent | Expected filename |
|-------|-------------------|
| Claude Code | `CLAUDE.md` (also reads `AGENTS.md` as fallback) |
| Codex CLI | `AGENTS.md` |
| Gemini CLI | `GEMINI.md` |
| Cursor | `AGENTS.md` |
```bash
# Verify the file exists for your agent
head -n 5 AGENTS.md
head -n 5 CLAUDE.md
head -n 5 GEMINI.md
# Restart the agent in the correct directory
cd ~/your-project
claude # or codex, gemini, etc.
```
All four agents read the context file from the working directory (and parent directories up to the project root).
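To see which context file an agent would pick up from a nested directory, you can mimic the parent-directory lookup described above. The sketch below is an illustration, not any agent's actual resolution logic: `find_context_file` is a hypothetical helper, and it walks all the way to the filesystem root instead of stopping at a project root.

```bash
# Sketch: walk from a starting directory up through its parents and print
# the first matching context file. `find_context_file` is a hypothetical
# helper; real agents may stop at the project root instead of "/".
find_context_file() {  # usage: find_context_file <filename> [start-dir]
  local name="$1" dir
  dir="$(cd "${2:-.}" && pwd)" || return 1
  while :; do
    if [ -f "$dir/$name" ]; then
      echo "$dir/$name"
      return 0
    fi
    [ "$dir" = "/" ] && return 1   # reached the top without a match
    dir="$(dirname "$dir")"
  done
}

# Example: find_context_file AGENTS.md ~/your-project/src/utils
```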
# Skill gives outdated information
The skills contain validated container versions and parameters as of the publication date. If a newer container is available, edit the canonical source and re-install:
```bash
nano /path/to/playbook/assets/skills/vllm-setup/SKILL.md
/path/to/playbook/assets/install.sh all --force
```
Or edit the installed copy directly:
```bash
# Claude Code
nano .claude/skills/vllm-setup/SKILL.md
# Codex
nano .codex/prompts/vllm-setup.md
# Gemini CLI
nano .gemini/commands/vllm-setup.md
# Cursor
nano .cursor/rules/vllm-setup.mdc
```
> [!TIP]
> Skills are plain markdown — you can version them in git alongside your project code.
# "Both GPUs cannot be used" errors
This is the mixed-coherency constraint working as intended. If you see CUDA errors when using `--gpus all`, find the GB300's index and target that device explicitly:
```bash
# Find the GB300 index
nvidia-smi --query-gpu=index,name --format=csv,noheader
# Use device-specific targeting
docker run --gpus '"device=1"' ...
```
The `AGENTS.md` covers this constraint, but if you removed that section, add it back — it's the most important piece of DGX Station knowledge.
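The two commands above can be combined so the device index is not typed by hand. This is a minimal sketch, assuming the GB300 row in the `nvidia-smi` output contains the string `GB300`; `gb300_index` and `gpus_arg` are hypothetical helper names, not part of the playbook.

```bash
# Sketch: extract the GB300 index from `nvidia-smi` CSV output and build
# the device-specific `--gpus` value. Assumes the GPU name contains "GB300".
# `gb300_index` and `gpus_arg` are hypothetical helpers.

# Reads "index, name" lines on stdin; prints the index of the first GB300 row.
gb300_index() {
  awk -F', *' '/GB300/ { print $1; exit }'
}

# Formats an index as the quoted device selector expected by `docker run --gpus`.
gpus_arg() {
  printf '"device=%s"' "$1"
}

# Query the driver only when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  idx="$(nvidia-smi --query-gpu=index,name --format=csv,noheader | gb300_index)"
  if [ -n "$idx" ]; then
    echo "docker run --gpus '$(gpus_arg "$idx")' ..."
  fi
fi
```

Keeping the parsing in a function that reads stdin makes it easy to try against saved `nvidia-smi` output on a machine without a GPU.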
# Skills conflict with existing project directory
If your project already has a `.claude/`, `.codex/`, `.gemini/`, or `.cursor/` directory with its own contents, `install.sh` is **additive** for skill directories — it adds the new skill files alongside whatever you already have and warns on collision rather than overwriting.
For context files (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`), the installer **refuses** to overwrite an existing file. Pass `--force` to override, or merge the new content manually:
```bash
# See what would be written
diff /path/to/playbook/assets/AGENTS.md ./AGENTS.md
# Force overwrite
/path/to/playbook/assets/install.sh claude . --force
```
# Installer reports "WROTE" for some files but "SKIP" for others
That's the safe-by-default behavior. The installer skips any file that already exists, prints a warning, and continues with the rest. To get a clean install, either:
1. Delete the existing files first: `rm -rf .claude/skills/{vllm-setup,sglang-setup,mig-configure,dgx-diagnose}`
2. Or pass `--force` (only affects context files; skill files are still skipped if present)
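
The skip-if-exists behavior can be pictured with a small function. This is a sketch modeled on the description above, not the actual `install.sh`: `copy_unless_exists` is a hypothetical name, and unlike the installer as described it honors `--force` for any file rather than only for context files.

```bash
# Sketch of skip-if-exists copying. NOT the real install.sh;
# `copy_unless_exists` and its output format are modeled on the text.
copy_unless_exists() {  # usage: copy_unless_exists <src> <dest> [--force]
  local src="$1" dest="$2" force="${3:-}"
  if [ -e "$dest" ] && [ "$force" != "--force" ]; then
    echo "SKIP  $dest (already exists)"
    return 0                      # skipping is not an error; carry on with the rest
  fi
  mkdir -p "$(dirname "$dest")"   # create the target directory if needed
  cp "$src" "$dest"
  echo "WROTE $dest"
}

# Example: copy_unless_exists /path/to/playbook/assets/AGENTS.md ./AGENTS.md
```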
resources:
- name: Anthropic Agent Skills Overview
url: https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview
- name: AGENTS.md Standard
url: https://agents.md/
- name: Claude Code Documentation
url: https://docs.anthropic.com/en/docs/claude-code
- name: OpenAI Codex AGENTS.md Guide
url: https://developers.openai.com/codex/guides/agents-md
- name: Gemini CLI Custom Commands
url: https://geminicli.com/docs/cli/custom-commands/
- name: Cursor Rules Documentation
url: https://docs.cursor.com/
- name: vLLM Documentation
url: https://docs.vllm.ai/en/latest/
- name: SGLang Documentation
url: https://docs.sglang.io/
- name: MIG User Guide
url: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/


@ -0,0 +1,160 @@
kind: Playbook
metadata:
name: station-brev
displayName: Register DGX Station to Brev
shortDescription: Link your DGX Station to Brev for remote access and sharing
publisher: nvidia
description: |
# REPLACE THIS WITH YOUR MODEL CARD
https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads
labelsV2:
- gpuType:playbook:gpu_type_station
- DGX Station
- Brev
attributes:
- key: DURATION
value: 5 MIN
spec:
artifactName: station-brev
nvcfFunctionId: None
attributes:
showUnavailableBanner: false
apiDocsUrl: None
termsOfUse: |
cta:
text: Brev Overview
url: https://docs.nvidia.com/brev/concepts/overview
tabs:
-
id: overview
label: Overview
content: |
# Basic idea
NVIDIA Brev is an AI development platform that makes GPU environments remotely accessible, shareable, and easy to standardize using preconfigured setups called Launchables.
This walkthrough will help you connect your NVIDIA DGX Station to Brev so it shows up as a managed GPU environment in Brev. After a one-time registration, your Station becomes remotely accessible and shareable.
# What you'll accomplish
You'll register your DGX Station with Brev, after which it appears as a healthy node in the Brev web UI and CLI, ready to share access and accept workloads.
# What to know before starting
While Brev automates the complex configuration, understanding a few key concepts when establishing the initial connection will be useful:
* **Terminal Basics**:
* Familiarity with command-line use to run a few simple setup commands.
# Prerequisites
You will also need the following:
* NVIDIA DGX Station with GB300 GPU
* **Brev Account**:
* Have an NVIDIA Brev account. [Create an NVIDIA Brev account](https://login.brev.nvidia.com/signin) if you don't have one.
* **Permissions**:
* You have administrative (root or sudo) access on the DGX Station device to run the registration command.
# Time & risk
* **Estimated time:** 5-10 minutes
* **Risk level:** Low - Registration configures the Station for secure remote access without altering your existing workloads
* **Rollback:** The Brev configuration can be removed through either the UI or the CLI
* **Last Updated:** 05/29/2026
* First Publication
-
id: instructions
label: Instructions
content: |
# Step 1. Log in to Brev
Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you're in the correct org (by clicking the org button on the top right-hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation.
Click the “Register Compute” button and follow the instructions in the pop-up window.
# Step 2. Complete Pop-up Instructions
* Install the Brev CLI
* Configure your compute
* Add a name for compute
* To configure SSH, ensure the “Enable SSH access” toggle is on
* Run the registration command
> [!IMPORTANT]
> Run the Brev CLI install command **without `sudo`**. Prefixing the installer with `sudo` writes the `brev` binary into root's home directory, which is not on your user shell's `PATH` — the next command will fail with `brev: command not found`. Copy the install command from the pop-up and run it as your normal user.
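
After the install, a quick check that the binary resolves for your normal user catches the `PATH` problem described in the note before you reach the registration command. The following is a sketch; `require_on_path` is a hypothetical helper name, not a Brev command.

```bash
# Sketch: confirm that a command resolves on the current user's PATH.
# `require_on_path` is a hypothetical helper, not part of the Brev CLI.
require_on_path() {  # usage: require_on_path <command-name>
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at $(command -v "$1")"
  else
    echo "$1 not found on PATH" >&2
    return 1
  fi
}

# Example: require_on_path brev
```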
# Step 3. Follow Registration Flow
In the CLI, you'll be walked through registration. Go through the flow until registration is complete.
# Step 4. Confirm DGX Station in Brev UI
* Go to the [Brev UI](https://brev.nvidia.com)
* Navigate to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section
* Confirm that the DGX Station appears as a registered node with a **Connected** status
# Step 5. Next Steps
Your DGX Station is now integrated into Brev as a secure, remotely accessible GPU environment.
Now that your hardware is connected, you can:
* **Access your machine from anywhere:** Open the [Brev UI](https://brev.nvidia.com) and launch a session from [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
* **Share access with others:** Invite teammates to your DGX Station from the Brev UI:
* Go to the [Brev UI](https://brev.nvidia.com) and open [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
* Find your DGX Station in the list and open the row's three-dot (⋯) menu.
* Select **Share Access**.
* Enter the email address of the person you want to share with.
* Choose their role / permission level.
* Confirm to send the invitation.
# Step 6. Cleanup
If you ever decide to unregister your DGX Station from Brev, you can do so through either the Brev UI or the Brev CLI.
With the CLI, run:
```bash
brev deregister
```
In the UI:
* Go to the [Brev UI](https://brev.nvidia.com)
* Navigate to the section listing “GPU Environments” and look under “Registered Compute”
* Click the “Remove” menu item on the device you wish to delete from Brev.
* Confirm your selection.
-
id: troubleshooting
label: Troubleshooting
content: |
| Symptom | Cause | Fix |
|---------|-------|-----|
| Your DGX Station is showing up in the wrong org | It was registered to the wrong org | Run `brev set <my-org>` and then redo the registration process. |
| Unable to run `brev shell <name>` | The local Brev state needs a refresh | Run `brev refresh`. |
resources:
- name: Brev Documentation
url: https://docs.nvidia.com/brev/latest


@ -118,8 +118,8 @@ All required assets are handled by the NemoClaw installer. No manual cloning is
- **Estimated time:** About 30-60 minutes for a first full pass (install, onboard, model download depending on choice and network). Optional Brave, Telegram, and cloudflared steps add time if you do them in a second session.
- **Risk level:** Medium — you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts.
- **Last Updated:** 05/29/2026
- Update to latest nemoclaw installer instructions
- **Last Updated:** 06/01/2026
- Pin nemoclaw installer to v0.0.55, the latest stable version
## Instructions
@ -127,10 +127,10 @@ All required assets are handled by the NemoClaw installer. No manual cloning is
### Step 1. Install NemoClaw
This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the pinned NemoClaw **v0.55** release (set via `NEMOCLAW_VERSION`; v0.55 is the version the NemoClaw team currently recommends as the most stable), builds the CLI, and runs the onboard wizard to create a sandbox.
This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the pinned NemoClaw **v0.0.55** release (set via `NEMOCLAW_INSTALL_TAG`; v0.0.55 is the version the NemoClaw team currently recommends as the most stable), builds the CLI, and runs the onboard wizard to create a sandbox.
```bash
curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_VERSION=v0.55 bash
curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_INSTALL_TAG=v0.0.55 bash
```
The installation wizard walks you through setup:
@ -148,7 +148,7 @@ The installer requires **Node.js 22.16+** (installed automatically if missing).
During custom setup, the onboard wizard walks you through:
1. **Configuring inference** -- Choose to set up local inference on your DGX Station by selecting **`7) Local Ollama`**.
2. **Ollama models** -- Choose desired inference model. If no model is present locally, the installer will provide options to download models to start.
2. **Ollama models** -- Choose desired inference model. If no model is present locally, the installer will download **`qwen3.6:35b`** automatically.
3. **Sandbox name** -- Pick a name (e.g. my-assistant). Each sandbox requires a unique name.
4. **Apply this configuration** -- Enter `Y` to confirm setting up local inference.
5. **Enable Brave Web Search** -- Optional. If you enable it, paste a [Brave Search API](https://brave.com/search/api/) key when prompted.
@ -324,7 +324,7 @@ Open Telegram, find your bot, and send a message. The bot should forward traffic
The cloudflared tunnel provides a **public URL for the Web UI dashboard** — it is not related to Telegram messaging.
Install cloudflared (DGX Station is arm64):
Install cloudflared (DGX Station is aarch64):
```bash
curl -L --output cloudflared.deb \
@ -354,7 +354,7 @@ You should see `● cloudflared` with a `trycloudflare.com` public URL.
Setting up a NemoClaw agent generally requires three steps: configure the NemoClaw security policy, run the agent workflow prompt, and personalize the workflow for your own use case.
Check out these [Example NemoClaw Agents](https://build.nvidia.com/station/nemoclaw-applications) for reference. Consider sharing your NemoClaw agent setup with the community at the [DGX Station Developer Forum](https://forums.developer.nvidia.com/c/accelerated-computing/dgx-station-gb300).
Check out these [Example NemoClaw Agents](https://build.nvidia.com/spark/nemoclaw-applications) for reference.
---

File diff suppressed because it is too large


@ -1,6 +1,6 @@
kind: Playbook
metadata:
name: nemoclaw
name: station-nemoclaw
displayName: Run NemoClaw with a Local LLM
shortDescription: Build your first local AI assistant on DGX Station using NemoClaw in a secure sandbox, with optional Telegram.
@ -22,8 +22,8 @@ metadata:
value: 30 MIN
spec:
artifactName: nemoclaw
nvcfFunctionId: 3b0ad962-7cfe-4370-9f4d-8024298a6d13
artifactName: station-nemoclaw
nvcfFunctionId: None
attributes:
showUnavailableBanner: false
@ -130,8 +130,8 @@ spec:
- **Estimated time:** About 30-60 minutes for a first full pass (install, onboard, model download depending on choice and network). Optional Brave, Telegram, and cloudflared steps add time if you do them in a second session.
- **Risk level:** Medium — you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts.
- **Last Updated:** 05/29/2026
- Update to latest nemoclaw installer instructions
- **Last Updated:** 06/01/2026
- Pin nemoclaw installer to v0.0.55, the latest stable version
@ -144,10 +144,10 @@ spec:
## Step 1. Install NemoClaw
This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the pinned NemoClaw **v0.55** release (set via `NEMOCLAW_VERSION`; v0.55 is the version the NemoClaw team currently recommends as the most stable), builds the CLI, and runs the onboard wizard to create a sandbox.
This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the pinned NemoClaw **v0.0.55** release (set via `NEMOCLAW_INSTALL_TAG`; v0.0.55 is the version the NemoClaw team currently recommends as the most stable), builds the CLI, and runs the onboard wizard to create a sandbox.
```bash
curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_VERSION=v0.55 bash
curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_INSTALL_TAG=v0.0.55 bash
```
The installation wizard walks you through setup:
@ -165,7 +165,7 @@ spec:
During custom setup, the onboard wizard walks you through:
1. **Configuring inference** -- Choose to set up local inference on your DGX Station by selecting **`7) Local Ollama`**.
2. **Ollama models** -- Choose desired inference model. If no model is present locally, the installer will provide options to download models to start.
2. **Ollama models** -- Choose desired inference model. If no model is present locally, the installer will download **`qwen3.6:35b`** automatically.
3. **Sandbox name** -- Pick a name (e.g. my-assistant). Each sandbox requires a unique name.
4. **Apply this configuration** -- Enter `Y` to confirm setting up local inference.
5. **Enable Brave Web Search** -- Optional. If you enable it, paste a [Brave Search API](https://brave.com/search/api/) key when prompted.
@ -341,7 +341,7 @@ spec:
The cloudflared tunnel provides a **public URL for the Web UI dashboard** — it is not related to Telegram messaging.
Install cloudflared (DGX Station is arm64):
Install cloudflared (DGX Station is aarch64):
```bash
curl -L --output cloudflared.deb \
@ -371,7 +371,7 @@ spec:
Setting up a NemoClaw agent generally requires three steps: configure the NemoClaw security policy, run the agent workflow prompt, and personalize the workflow for your own use case.
Check out these [Example NemoClaw Agents](https://build.nvidia.com/station/nemoclaw-applications) for reference. Consider sharing your NemoClaw agent setup with the community at the [DGX Station Developer Forum](https://forums.developer.nvidia.com/c/accelerated-computing/dgx-station-gb300).
Check out these [Example NemoClaw Agents](https://build.nvidia.com/spark/nemoclaw-applications) for reference.
---


@ -68,17 +68,14 @@ spec:
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
| **Kimi-K2.5 (1T)** | NVFP4 | ✅ | [`nvidia/Kimi-K2.5-NVFP4`](https://huggingface.co/nvidia/Kimi-K2.5-NVFP4) |
| **DeepSeek-V4-Flash** | NVFP4 | ✅ | [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) |
# Time & risk
* **Duration:** 30 minutes (longer on first run due to model download)
* **Risks:** Model download requires HuggingFace authentication
* **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 05/29/2026
* **Last Updated:** 05/28/2026
* Update models
* Add base configuration example, per-setting explanations, and DeepSeek-V4-Flash recipe
@ -125,23 +122,11 @@ spec:
docker pull vllm/vllm-openai:stepfun37
```
For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, pull the **26.03** image, which includes the `--cpu-offload-params` support used below:
```bash
docker pull nvcr.io/nvidia/vllm:26.03-py3
```
For DeepSeek-V4-Flash, pull the stable DeepSeek-V4 release container. Use the **cu130** build on DGX Station (Blackwell):
```bash
docker pull vllm/vllm-openai:v0.20.0-cu130
```
# Step 4. Start vLLM server
Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
## Base configuration (most models)
This is the recommended starting point for any model that fits entirely in VRAM on the GB300. The Qwen3-235B-A22B-NVFP4 model, for example, runs directly with this configuration.
For the Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.
```bash
docker run -d \
@ -159,12 +144,6 @@ spec:
--gpu-memory-utilization 0.9
```
Settings used:
- `--max-model-len` — maximum context length (prompt + output) per request. Larger values reserve more GPU memory for the KV cache; size it to your workload.
- `--gpu-memory-utilization 0.9` — fraction of GPU memory vLLM may use for weights and KV cache. `0.9` leaves headroom for other processes; raise toward `0.95` to fit more KV cache if the GPU is dedicated.
## Step-3.7-Flash (FP8 / NVFP4)
For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300.
```bash
@ -187,94 +166,6 @@ spec:
--kv-cache-dtype fp8
```
Settings used (in addition to the base configuration):
- `--trust-remote-code` — allows the model's custom modeling code (shipped in its repo) to load. Required for Step-3.7.
- `--reasoning-parser step3p5` — parses the model's reasoning/thinking tokens into the dedicated `reasoning_content` response field.
- `--enable-auto-tool-choice` — lets the model decide when to call a tool, enabling OpenAI-compatible function calling.
- `--tool-call-parser step3p5` — parses the model's tool-call output into structured `tool_calls`. Pairs with `--enable-auto-tool-choice`.
- `--kv-cache-dtype fp8` — stores the KV cache in FP8, roughly halving KV-cache memory versus 16-bit and allowing more concurrent/longer sequences.
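
The "roughly halving" claim for `--kv-cache-dtype fp8` follows from per-token KV-cache arithmetic. The sketch below uses the usual estimate for standard multi-head or grouped-query attention: 2 (K and V) x layers x KV heads x head dimension x bytes per element. The model dimensions are hypothetical placeholders, not Step-3.7-Flash values, and `kv_bytes_per_token` is an illustrative helper.

```bash
# Sketch: per-token KV-cache size for a standard attention layout.
# All model dimensions below are hypothetical placeholders.
kv_bytes_per_token() {  # usage: kv_bytes_per_token <layers> <kv_heads> <head_dim> <bytes_per_elem>
  echo $(( 2 * $1 * $2 * $3 * $4 ))   # factor 2 = one K and one V tensor per layer
}

fp16=$(kv_bytes_per_token 48 8 128 2)   # 16-bit cache: 2 bytes per element
fp8=$(kv_bytes_per_token 48 8 128 1)    # FP8 cache: 1 byte per element
echo "16-bit: $fp16 bytes/token, FP8: $fp8 bytes/token"
```

Multiplying the per-token figure by `--max-model-len` and the number of concurrent sequences gives a rough upper bound on the KV-cache memory a workload would need.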
## Kimi-K2.5 NVFP4 (1T) — CPU offloading
For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, run with the **26.03** NGC container. This model does not fit entirely in VRAM, so the MoE expert weights are offloaded to CPU DRAM with `--cpu-offload-gb 375 --cpu-offload-params experts`. Ensure the system has enough free DRAM to hold the offloaded weights.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
nvcr.io/nvidia/vllm:26.03-py3 \
vllm serve nvidia/Kimi-K2.5-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--dtype auto \
--kv-cache-dtype auto \
--gpu-memory-utilization 0.95 \
--served-model-name nvidia/Kimi-K2.5-NVFP4 \
--tensor-parallel-size 1 \
--no-enable-prefix-caching \
--trust-remote-code \
--max-model-len 40960 \
--max-num-seqs 1 \
--max-num-batched-tokens 32768 \
--cpu-offload-gb 375 \
--cpu-offload-params experts
```
Settings used (in addition to the base configuration):
- `--cpu-offload-gb 375` — amount of CPU DRAM (in GiB) vLLM may use to hold weights that don't fit in VRAM. Must be large enough for the offloaded experts; the system needs at least this much free DRAM.
- `--cpu-offload-params experts` — offloads only the MoE expert weights (the bulk of a large MoE model) to DRAM, keeping attention and other hot weights in VRAM.
- `--tensor-parallel-size 1` — single GPU; the GB300 serves the whole model.
- `--max-num-seqs 1` / `--max-num-batched-tokens 32768` — cap concurrency at one sequence and the per-step batch at 32768 tokens. With expert weights paged from DRAM, throughput is offload-bound, so low concurrency keeps latency predictable.
- `--no-enable-prefix-caching` — disables prefix-cache reuse. Offloading leaves a tight memory budget, so this recipe turns the prefix cache off rather than holding KV blocks for reuse.
- `--kv-cache-dtype auto` / `--dtype auto` — let vLLM pick the KV-cache and compute dtypes from the model's quantization (NVFP4).
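Because the recipe depends on enough free DRAM for the offloaded experts, a preflight check before `docker run` can catch an undersized host early. The sketch below is an assumption-laden illustration: `has_enough_dram` is a hypothetical helper, and it uses `MemAvailable` from `/proc/meminfo` as a proxy for free DRAM, which is an approximation of what vLLM can actually allocate.

```bash
#!/usr/bin/env bash
# Hypothetical preflight: succeed only if MemAvailable covers the offload size.
# /proc/meminfo reports values in kB (KiB); 1 GiB = 1024 * 1024 KiB.
has_enough_dram() {
  local required_gib="$1" meminfo="${2:-/proc/meminfo}"
  local available_kib
  available_kib="$(awk '/^MemAvailable:/ { print $2 }' "$meminfo")"
  [ -n "$available_kib" ] || return 2   # field missing or file unreadable
  [ "$available_kib" -ge $(( required_gib * 1024 * 1024 )) ]
}

# 375 matches --cpu-offload-gb in the recipe above.
if has_enough_dram 375; then
  echo "OK: at least 375 GiB of DRAM available"
else
  echo "WARNING: less than 375 GiB of DRAM available; offloading may fail" >&2
fi
```

The optional second argument exists only so the check can be pointed at a saved copy of `/proc/meminfo`. In normal use the default path is enough.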
## DeepSeek-V4-Flash — MTP + agentic
For DeepSeek-V4-Flash, run with the stable **v0.20.0-cu130** container. This recipe targets agentic workloads and enables Multi-Token Prediction (MTP) speculative decoding. On a single GB300 (TP1) the MoE expert-parallel path is used; the `deep_gemm_mega_moe` backend from some internal recipes is not needed at TP1 and is omitted here.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
vllm/vllm-openai:v0.20.0-cu130 \
deepseek-ai/DeepSeek-V4-Flash \
--enable-expert-parallel \
--kv-cache-dtype fp8 \
--trust-remote-code \
--block-size 256 \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
--attention_config.use_fp4_indexer_cache True \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
--max-model-len 32768
```
Settings used (in addition to the base configuration):
- `--enable-expert-parallel` — shards the MoE experts across the available GPU(s) using expert parallelism, the recommended MoE execution path for DeepSeek-V4.
- `--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'` — enables **MTP (Multi-Token Prediction)** speculative decoding: the model proposes 3 tokens per step that are verified in a single forward pass, cutting latency for accepted tokens.
- `--kv-cache-dtype fp8` — FP8 KV cache to fit more concurrent/longer sequences.
- `--block-size 256` — KV-cache page size in tokens. DeepSeek-V4 uses multiple KV-cache groups; `256` matches the recipe validated on Station.
- `--attention_config.use_fp4_indexer_cache True` — enables the FP4 indexer cache used by DeepSeek-V4's attention. (Drop this flag on platforms without native FP4, e.g. Hopper.)
- `--tokenizer-mode deepseek_v4` / `--tool-call-parser deepseek_v4` / `--reasoning-parser deepseek_v4` — DeepSeek-V4-specific tokenizer, tool-call, and reasoning parsers.
- `--enable-auto-tool-choice` — OpenAI-compatible function calling for agentic use.
- `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` — uses full + piecewise CUDA graph capture and enables all custom ops for lower per-step overhead.
- **Prefix caching is left enabled (the vLLM default).** For agentic workloads with large shared prefixes (e.g. a 32k system/context prefix) at low batch sizes (~BS 34), prefix caching gives a significant throughput boost by reusing the cached prefix across requests. Some internal recipes carry `--no-enable-prefix-caching`, but that was inherited from random-data benchmarking and is not recommended for agentic use here.
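To get a feel for what `num_speculative_tokens: 3` can buy, the following back-of-the-envelope sketch estimates how many tokens one verification pass emits on average. It assumes each drafted token is accepted independently with the same probability `p`, a simplification from the speculative decoding literature rather than a description of vLLM's internals. The helper name and the acceptance rates are hypothetical. Under that assumption the expected count is `(1 - p^(k+1)) / (1 - p)` for `k` drafted tokens.

```bash
#!/usr/bin/env bash
# Hypothetical estimate: expected tokens emitted per verification pass with
# k drafted tokens and an i.i.d. per-token acceptance probability p.
# The count includes the one token the verifier always contributes.
expected_tokens_per_step() {
  local k="$1" p="$2"
  awk -v k="$k" -v p="$p" 'BEGIN {
    if (p >= 1) { printf "%.2f\n", k + 1; exit }   # every draft accepted
    printf "%.2f\n", (1 - p ^ (k + 1)) / (1 - p)
  }'
}

expected_tokens_per_step 3 0.8   # prints 2.95
expected_tokens_per_step 3 0.6   # prints 2.18
```

Real acceptance rates depend on the model and the workload, so these figures are only a guide to how sensitive the speedup is to draft quality.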
Check the server logs for startup progress:
```bash