mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-06-18 04:22:21 +00:00
chore: Regenerate all playbooks
This commit is contained in:
parent
b849d2d191
commit
32cbd72374
413
nvidia/station-ai-skills/endpoint-production.yaml
Normal file
413
nvidia/station-ai-skills/endpoint-production.yaml
Normal file
@ -0,0 +1,413 @@
|
||||
kind: Playbook
|
||||
metadata:
|
||||
name: station-ai-skills
|
||||
displayName: DGX Station AI Skills for Coding Agents
|
||||
shortDescription: Give your coding agent (Claude Code, Codex, Gemini CLI, Cursor) DGX Station expertise via an AGENTS.md and on-demand Agent Skills
|
||||
|
||||
publisher: nvidia
|
||||
description: |
|
||||
# REPLACE THIS WITH YOUR MODEL CARD
|
||||
https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads
|
||||
|
||||
labelsV2:
|
||||
- gpuType:playbook:gpu_type_station
|
||||
- DGX Station
|
||||
- GB300
|
||||
- Blackwell
|
||||
- AI Agents
|
||||
- Agent Skills
|
||||
- AGENTS.md
|
||||
- Claude Code
|
||||
- Codex
|
||||
- Gemini CLI
|
||||
- Cursor
|
||||
- vLLM
|
||||
- SGLang
|
||||
- MIG
|
||||
- Mixed Coherency
|
||||
|
||||
attributes:
|
||||
- key: DURATION
|
||||
value: 15 MIN
|
||||
|
||||
spec:
|
||||
artifactName: station-ai-skills
|
||||
nvcfFunctionId: None
|
||||
attributes:
|
||||
|
||||
showUnavailableBanner: false
|
||||
apiDocsUrl: None
|
||||
termsOfUse: |
|
||||
|
||||
cta:
|
||||
text: View on GitHub
|
||||
url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-ai-skills/
|
||||
|
||||
|
||||
tabs:
|
||||
-
|
||||
id: overview
|
||||
|
||||
label: Overview
|
||||
content: |
|
||||
# Basic idea
|
||||
|
||||
Modern coding agents — Claude Code, OpenAI Codex CLI, Gemini CLI, Cursor — all support two extension mechanisms: a project-level **context file** that's loaded into every conversation, and **on-demand procedural workflows** (called skills, prompts, commands, or rules depending on the harness). This playbook ships both for DGX Station:
|
||||
|
||||
- An **`AGENTS.md`** with the critical DGX Station constraints your agent should always know (mixed coherency, GPU targeting, common pitfalls). `AGENTS.md` is the cross-harness standard; an `install.sh` lays it down as `CLAUDE.md`, `GEMINI.md`, or `AGENTS.md` depending on the agent you use.
|
||||
- **Four Agent Skills** — `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose` — authored once in the [Anthropic Agent Skills format](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview) and installed into the right per-harness location (`.claude/skills/`, `.codex/prompts/`, `.gemini/commands/`, or `.cursor/rules/`).
|
||||
|
||||
This approach keeps your agent's context lean in every conversation while giving it deep procedural knowledge on demand, regardless of which agent you use.
|
||||
|
||||
## AGENTS.md vs Agent Skill — why split?
|
||||
|
||||
| | AGENTS.md | Agent Skill |
|
||||
|---|---|---|
|
||||
| **Loaded** | Every conversation, automatically | Only when invoked by name (or matched by description, in Claude) |
|
||||
| **Best for** | Constraints, pitfalls, "never do X" rules | Step-by-step workflows, deployment procedures |
|
||||
| **Context cost** | Consumed every time | Zero until invoked |
|
||||
|
||||
The DGX Station mixed-coherency constraint (`--gpus all` will crash) should be in every conversation. The full vLLM deployment procedure should not.
|
||||
|
||||
# What you'll accomplish
|
||||
|
||||
- Install the `AGENTS.md` and four Agent Skills into your project directory for your chosen agent (Claude Code, Codex, Gemini CLI, or Cursor).
|
||||
- Verify the agent loads the constraints automatically and the skills on demand.
|
||||
- Invoke `vllm-setup` to deploy a vLLM inference server with validated configuration.
|
||||
- Invoke `sglang-setup` to deploy an SGLang inference server.
|
||||
- Invoke `mig-configure` to partition the GB300 into MIG instances.
|
||||
- Invoke `dgx-diagnose` to troubleshoot common DGX Station issues.
|
||||
|
||||
# What to know before starting
|
||||
|
||||
- Basic familiarity with one supported coding agent (running it, giving it prompts, using slash commands or rule references)
|
||||
- General understanding of DGX Station (two GPUs, Docker-based workflows)
|
||||
|
||||
# Prerequisites
|
||||
|
||||
- NVIDIA DGX Station with GB300
|
||||
- One of the supported coding agents installed:
|
||||
- **Claude Code:** `curl -fsSL https://claude.ai/install.sh | sh`
|
||||
- **OpenAI Codex CLI:** `npm i -g @openai/codex`
|
||||
- **Gemini CLI:** `npm i -g @google/gemini-cli`
|
||||
- **Cursor:** download from `https://cursor.com/`
|
||||
- A project directory where you do DGX Station work
|
||||
|
||||
# Ancillary files
|
||||
|
||||
- `assets/AGENTS.md` — canonical context file with critical constraints, GPU targeting, software versions, and common pitfalls. Cross-harness standard.
|
||||
- `assets/skills/vllm-setup/SKILL.md` — skill: deploy vLLM with validated configuration.
|
||||
- `assets/skills/sglang-setup/SKILL.md` — skill: deploy SGLang with validated configuration.
|
||||
- `assets/skills/mig-configure/SKILL.md` — skill: configure MIG partitions on the GB300.
|
||||
- `assets/skills/dgx-diagnose/SKILL.md` — skill: troubleshoot common DGX Station issues.
|
||||
- `assets/install.sh` — per-harness installer (`claude`, `codex`, `gemini`, `cursor`, or `all`).
|
||||
|
||||
# Time & risk
|
||||
|
||||
* **Duration:** 10-15 minutes
|
||||
* **Risk level:** Low — this playbook copies markdown files into your project directory
|
||||
* **Rollback:** Delete the context file (`AGENTS.md` / `CLAUDE.md` / `GEMINI.md`) and the harness-specific skill directory (`.claude/skills/`, `.codex/prompts/`, `.gemini/commands/`, or `.cursor/rules/`) from your project directory
|
||||
* **Last Updated:** 05/18/2026
|
||||
* Restructured as harness-agnostic Agent Skills (Claude Code, Codex, Gemini CLI, Cursor)
|
||||
|
||||
|
||||
|
||||
-
|
||||
id: instructions
|
||||
|
||||
label: Instructions
|
||||
content: |
|
||||
# Step 1. Install your coding agent
|
||||
|
||||
Pick whichever agent you prefer — the rest of this playbook works the same regardless. Install commands:
|
||||
|
||||
| Agent | Install |
|
||||
|-------|---------|
|
||||
| Claude Code | `curl -fsSL https://claude.ai/install.sh \| sh` |
|
||||
| OpenAI Codex CLI | `npm i -g @openai/codex` |
|
||||
| Gemini CLI | `npm i -g @google/gemini-cli` |
|
||||
| Cursor | Download from `https://cursor.com/` |
|
||||
|
||||
Verify with `claude --version`, `codex --version`, `gemini --version`, or by launching Cursor.
|
||||
|
||||
# Step 2. Install the skills into your project
|
||||
|
||||
Navigate to the project where you want DGX Station expertise, then run the installer with the harness you use:
|
||||
|
||||
```bash
|
||||
cd ~/your-project
|
||||
|
||||
# Pick one:
|
||||
/path/to/this/playbook/assets/install.sh claude
|
||||
/path/to/this/playbook/assets/install.sh codex
|
||||
/path/to/this/playbook/assets/install.sh gemini
|
||||
/path/to/this/playbook/assets/install.sh cursor
|
||||
|
||||
# Or install for all four at once:
|
||||
/path/to/this/playbook/assets/install.sh all
|
||||
```
|
||||
|
||||
If you downloaded the playbook as a zip, the path is relative to the extracted directory:
|
||||
|
||||
```bash
|
||||
station-ai-skills/assets/install.sh claude ~/your-project
|
||||
```
|
||||
|
||||
The installer is additive for skill directories (won't clobber existing skills you've written) and refuses to overwrite an existing context file (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`) unless you pass `--force`.
|
||||
|
||||
**Resulting layout** (per harness):
|
||||
|
||||
```text
|
||||
your-project/
|
||||
AGENTS.md or CLAUDE.md or GEMINI.md # context file (named for your agent)
|
||||
.claude/skills/<name>/SKILL.md # claude
|
||||
.codex/prompts/<name>.md # codex
|
||||
.gemini/commands/<name>.md # gemini
|
||||
.cursor/rules/<name>.mdc # cursor
|
||||
```
|
||||
|
||||
Where `<name>` is each of `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose`.
|
||||
|
||||
> [!NOTE]
|
||||
> Every supported agent automatically reads the context file from the working directory at startup. Skills/prompts/rules in the harness-specific directory are discovered automatically — no additional configuration needed.
|
||||
|
||||
# Step 3. Verify the setup
|
||||
|
||||
Start your agent in the project directory and ask a question that requires constraint knowledge:
|
||||
|
||||
```text
|
||||
Can I use --gpus all to run my CUDA workload on DGX Station?
|
||||
```
|
||||
|
||||
The agent should immediately warn about the mixed-coherency constraint and recommend `--gpus '"device=N"'` targeting. If you don't get the warning, the context file isn't being loaded — see Troubleshooting.
|
||||
|
||||
Then verify the skills are discoverable:
|
||||
|
||||
| Agent | How to check |
|
||||
|-------|--------------|
|
||||
| Claude Code | Type `/` — `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose` should appear in the autocomplete |
|
||||
| Codex CLI | Type `/prompts:` — same four names appear |
|
||||
| Gemini CLI | Type `/` — same four names appear |
|
||||
| Cursor | Open the Rules panel — same four rules appear |
|
||||
|
||||
# Step 4. Use vllm-setup to deploy an inference server
|
||||
|
||||
Invoke the skill in your agent:
|
||||
|
||||
| Agent | Invocation |
|
||||
|-------|-----------|
|
||||
| Claude Code | `/vllm-setup` (slash command) or just describe the task ("deploy vllm with Qwen3-8B") |
|
||||
| Codex CLI | `/prompts:vllm-setup` |
|
||||
| Gemini CLI | `/vllm-setup` |
|
||||
| Cursor | In chat: "use the vllm-setup rule to deploy a vllm server" |
|
||||
|
||||
The agent will walk you through deploying a vLLM server with a validated container image, correct GPU targeting, and recommended parameters. It will check your GPU index, ask which model you want to serve, and generate the full `docker run` command.
|
||||
|
||||
# Step 5. Use sglang-setup to deploy SGLang
|
||||
|
||||
Same invocation pattern, but for SGLang with the `cu130` container, RadixAttention prefix caching, and structured JSON output support.
|
||||
|
||||
# Step 6. Use mig-configure to partition the GB300
|
||||
|
||||
The agent will query your current MIG state, show available profiles, help you choose a layout for your workloads, and execute the partitioning commands.
|
||||
|
||||
# Step 7. Use dgx-diagnose to troubleshoot issues
|
||||
|
||||
If you encounter problems, invoke `dgx-diagnose`. The agent will check GPU status, driver version, running processes, MIG state, and Fabric Manager to identify the issue.
|
||||
|
||||
# Step 8. Customize
|
||||
|
||||
Both the `AGENTS.md` and the skills are plain markdown — extend them freely.
|
||||
|
||||
**Add project-specific constraints to `AGENTS.md`** (or your harness-specific context file):
|
||||
|
||||
```markdown
|
||||
## Project-specific
|
||||
|
||||
- Our production MIG layout is 3g.139gb + 2g.70gb + 2g.70gb
|
||||
- Always use port 8080 for inference (nginx proxy on 443)
|
||||
- Model weights are cached at /data/models, mount with -v /data/models:/root/.cache/huggingface/hub
|
||||
```
|
||||
|
||||
**Create new skills** by adding a directory and `SKILL.md` to `assets/skills/`, then re-run `install.sh`:
|
||||
|
||||
```bash
|
||||
mkdir -p assets/skills/run-benchmarks
|
||||
cat > assets/skills/run-benchmarks/SKILL.md << 'EOF'
|
||||
---
|
||||
name: run-benchmarks
|
||||
description: Run our standard inference benchmark suite against the running vLLM or SGLang server and compare against the baseline.
|
||||
---
|
||||
|
||||
# Run benchmarks
|
||||
|
||||
1. Check which inference server is running (vLLM on port 8000 or SGLang on port 30000)
|
||||
2. Run the appropriate benchmark script from ./benchmarks/
|
||||
3. Report throughput (tokens/sec), latency (TTFT, ITL), and memory utilization
|
||||
4. Compare against the baseline in ./benchmarks/baseline.json
|
||||
EOF
|
||||
```
|
||||
|
||||
> [!TIP]
|
||||
> Keep `AGENTS.md` focused on constraints and pitfalls (things that break). Put procedural workflows in skills (things you do step-by-step).
|
||||
|
||||
|
||||
|
||||
-
|
||||
id: troubleshooting
|
||||
|
||||
label: Troubleshooting
|
||||
content: |
|
||||
# Skills don't appear in autocomplete / aren't discoverable
|
||||
|
||||
Each agent discovers skills from a harness-specific directory in the current directory (or a parent). Check the right one:
|
||||
|
||||
| Agent | Expected location |
|
||||
|-------|-------------------|
|
||||
| Claude Code | `.claude/skills/<name>/SKILL.md` |
|
||||
| Codex CLI | `.codex/prompts/<name>.md` |
|
||||
| Gemini CLI | `.gemini/commands/<name>.md` |
|
||||
| Cursor | `.cursor/rules/<name>.mdc` |
|
||||
|
||||
```bash
|
||||
# Examples — check the directory for your agent
|
||||
ls -la .claude/skills/
|
||||
ls -la .codex/prompts/
|
||||
ls -la .gemini/commands/
|
||||
ls -la .cursor/rules/
|
||||
```
|
||||
|
||||
You should see entries for `vllm-setup`, `sglang-setup`, `mig-configure`, and `dgx-diagnose`.
|
||||
|
||||
**Check you're in the right directory:**
|
||||
|
||||
```bash
|
||||
pwd
|
||||
```
|
||||
|
||||
The agent must be started from the directory containing the harness directory, or a subdirectory of it.
|
||||
|
||||
# Context file not loaded
|
||||
|
||||
If the agent gives generic answers without DGX Station awareness, the context file isn't being picked up. Each agent reads a different filename — verify the one for your agent exists:
|
||||
|
||||
| Agent | Expected filename |
|
||||
|-------|-------------------|
|
||||
| Claude Code | `CLAUDE.md` (also reads `AGENTS.md` as fallback) |
|
||||
| Codex CLI | `AGENTS.md` |
|
||||
| Gemini CLI | `GEMINI.md` |
|
||||
| Cursor | `AGENTS.md` |
|
||||
|
||||
```bash
|
||||
# Verify the file exists for your agent
|
||||
cat AGENTS.md | head -5
|
||||
cat CLAUDE.md | head -5
|
||||
cat GEMINI.md | head -5
|
||||
|
||||
# Restart the agent in the correct directory
|
||||
cd ~/your-project
|
||||
claude # or codex, gemini, etc.
|
||||
```
|
||||
|
||||
All four agents read the context file from the working directory (and parent directories up to the project root).
|
||||
|
||||
# Skill gives outdated information
|
||||
|
||||
The skills contain validated container versions and parameters as of the publication date. If a newer container is available, edit the canonical source and re-install:
|
||||
|
||||
```bash
|
||||
nano /path/to/playbook/assets/skills/vllm-setup/SKILL.md
|
||||
/path/to/playbook/assets/install.sh all --force
|
||||
```
|
||||
|
||||
Or edit the installed copy directly:
|
||||
|
||||
```bash
|
||||
# Claude Code
|
||||
nano .claude/skills/vllm-setup/SKILL.md
|
||||
# Codex
|
||||
nano .codex/prompts/vllm-setup.md
|
||||
# Gemini CLI
|
||||
nano .gemini/commands/vllm-setup.md
|
||||
# Cursor
|
||||
nano .cursor/rules/vllm-setup.mdc
|
||||
```
|
||||
|
||||
> [!TIP]
|
||||
> Skills are plain markdown — you can version them in git alongside your project code.
|
||||
|
||||
# "Both GPUs cannot be used" errors
|
||||
|
||||
This is the mixed-coherency constraint working as intended. If you see CUDA errors when using `--gpus all`:
|
||||
|
||||
```bash
|
||||
# Find the GB300 index
|
||||
nvidia-smi --query-gpu=index,name --format=csv,noheader
|
||||
|
||||
# Use device-specific targeting
|
||||
docker run --gpus '"device=1"' ...
|
||||
```
|
||||
|
||||
The `AGENTS.md` covers this constraint, but if you removed that section, add it back — it's the most important piece of DGX Station knowledge.
|
||||
|
||||
# Skills conflict with existing project directory
|
||||
|
||||
If your project already has a `.claude/`, `.codex/`, `.gemini/`, or `.cursor/` directory with its own contents, `install.sh` is **additive** for skill directories — it adds the new skill files alongside whatever you already have and warns on collision rather than overwriting.
|
||||
|
||||
For context files (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`), the installer **refuses** to overwrite an existing file. Pass `--force` to override, or merge the new content manually:
|
||||
|
||||
```bash
|
||||
# See what would be written
|
||||
diff /path/to/playbook/assets/AGENTS.md ./AGENTS.md
|
||||
|
||||
# Force overwrite
|
||||
/path/to/playbook/assets/install.sh claude . --force
|
||||
```
|
||||
|
||||
# Installer reports "WROTE" for some files but "SKIP" for others
|
||||
|
||||
That's the safe-by-default behavior. The installer skips any file that already exists, prints a warning, and continues with the rest. To get a clean install, either:
|
||||
|
||||
1. Delete the existing files first: `rm -rf .claude/skills/{vllm-setup,sglang-setup,mig-configure,dgx-diagnose}`
|
||||
2. Or pass `--force` (only affects context files; skill files are still skipped if present)
|
||||
|
||||
|
||||
|
||||
|
||||
resources:
|
||||
- name: Anthropic Agent Skills Overview
|
||||
url: https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview
|
||||
|
||||
|
||||
- name: AGENTS.md Standard
|
||||
url: https://agents.md/
|
||||
|
||||
|
||||
- name: Claude Code Documentation
|
||||
url: https://docs.anthropic.com/en/docs/claude-code
|
||||
|
||||
|
||||
- name: OpenAI Codex AGENTS.md Guide
|
||||
url: https://developers.openai.com/codex/guides/agents-md
|
||||
|
||||
|
||||
- name: Gemini CLI Custom Commands
|
||||
url: https://geminicli.com/docs/cli/custom-commands/
|
||||
|
||||
|
||||
- name: Cursor Rules Documentation
|
||||
url: https://docs.cursor.com/
|
||||
|
||||
|
||||
- name: vLLM Documentation
|
||||
url: https://docs.vllm.ai/en/latest/
|
||||
|
||||
|
||||
- name: SGLang Documentation
|
||||
url: https://docs.sglang.io/
|
||||
|
||||
|
||||
- name: MIG User Guide
|
||||
url: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
|
||||
|
||||
|
||||
160
nvidia/station-brev/endpoint-production.yaml
Normal file
160
nvidia/station-brev/endpoint-production.yaml
Normal file
@ -0,0 +1,160 @@
|
||||
kind: Playbook
|
||||
metadata:
|
||||
name: station-brev
|
||||
displayName: Register DGX Station to Brev
|
||||
shortDescription: Link your DGX Station to Brev for remote access and sharing
|
||||
publisher: nvidia
|
||||
description: |
|
||||
# REPLACE THIS WITH YOUR MODEL CARD
|
||||
https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads
|
||||
|
||||
labelsV2:
|
||||
- gpuType:playbook:gpu_type_station
|
||||
- DGX Station
|
||||
- Brev
|
||||
|
||||
attributes:
|
||||
- key: DURATION
|
||||
value: 5 MIN
|
||||
|
||||
spec:
|
||||
artifactName: station-brev
|
||||
nvcfFunctionId: None
|
||||
attributes:
|
||||
|
||||
showUnavailableBanner: false
|
||||
apiDocsUrl: None
|
||||
termsOfUse: |
|
||||
|
||||
cta:
|
||||
text: Brev Overview
|
||||
url: https://docs.nvidia.com/brev/concepts/overview
|
||||
|
||||
|
||||
tabs:
|
||||
-
|
||||
id: overview
|
||||
|
||||
label: Overview
|
||||
content: |
|
||||
# Basic idea
|
||||
|
||||
NVIDIA Brev is an AI development platform that makes GPU environments remotely accessible, shareable, and easy to standardize using preconfigured setups called Launchables.
|
||||
|
||||
This walkthrough will help you connect your NVIDIA DGX Station to Brev so it shows up as a managed GPU environment in Brev. After a one-time registration, your Station becomes remotely accessible and shareable.
|
||||
|
||||
# What you'll accomplish
|
||||
|
||||
You’ll register your DGX Station with Brev and it will be visible as a healthy node in the Brev web UI and CLI, ready to share access and accept workloads whenever needed.
|
||||
|
||||
# What to know before starting
|
||||
|
||||
While Brev automates the complex configuration, understanding a few key concepts when establishing the initial connection will be useful:
|
||||
|
||||
* **Terminal Basics**:
|
||||
* Familiarity with command-line use to run a few simple setup commands.
|
||||
|
||||
# Prerequisites
|
||||
|
||||
You will also need the following:
|
||||
|
||||
* NVIDIA DGX Station with GB300 GPU
|
||||
* **Brev Account**:
|
||||
* Have an NVIDIA Brev account. [Create an NVIDIA Brev account](https://login.brev.nvidia.com/signin) if you don’t have one.
|
||||
|
||||
* **Permissions**:
|
||||
* You have administrative (root or sudo) access on the DGX Station device to run the registration command.
|
||||
|
||||
# Time & risk
|
||||
|
||||
* **Estimated time:** 5-10 minutes
|
||||
* **Risk level:** Low - Registration configures the Station for secure remote access without altering your existing workloads
|
||||
* **Rollback:** The Brev configuration can be removed through the UI and CLI
|
||||
* **Last Updated:** 05/29/2026
|
||||
* First Publication
|
||||
|
||||
|
||||
|
||||
-
|
||||
id: instructions
|
||||
|
||||
label: Instructions
|
||||
content: |
|
||||
# Step 1. Log in to Brev
|
||||
|
||||
Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you’re in the correct org (by clicking the org button on the top right-hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation.
|
||||
|
||||
Click the “Register Compute” button and follow the instructions in the pop-up window.
|
||||
|
||||
# Step 2. Complete Pop-up Instructions
|
||||
|
||||
* Install the Brev CLI
|
||||
* Configure your compute
|
||||
* Add a name for compute
|
||||
* To configure SSH, ensure the “Enable SSH access” toggle is on
|
||||
* Run the registration command
|
||||
|
||||
> [!IMPORTANT]
|
||||
> Run the Brev CLI install command **without `sudo`**. Prefixing the installer with `sudo` writes the `brev` binary into root's home directory, which is not on your user shell's `PATH` — the next command will fail with `brev: command not found`. Copy the install command from the pop-up and run it as your normal user.
|
||||
|
||||
# Step 3. Follow Registration Flow
|
||||
|
||||
In the CLI, you’ll be walked through registration. Go through the flow until registration is complete.
|
||||
|
||||
# Step 4. Confirm DGX Station in Brev UI
|
||||
|
||||
* Go to the [Brev UI](https://brev.nvidia.com)
|
||||
* Navigate to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute)
|
||||
* Confirm that the DGX Station appears as a registered node with a **Connected** status
|
||||
|
||||
# Step 5. Next Steps
|
||||
|
||||
Your DGX Station is now integrated into Brev as a secure, remotely accessible GPU environment.
|
||||
|
||||
Now that your hardware is connected, you can:
|
||||
|
||||
* **Access your machine from anywhere:** Open the [Brev UI](https://brev.nvidia.com) and launch a session from [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
|
||||
* **Share access with others:** Invite teammates to your DGX Station from the Brev UI:
|
||||
* Go to the [Brev UI](https://brev.nvidia.com) and open [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
|
||||
* Find your DGX Station in the list and open the row's three-dot (⋯) menu.
|
||||
* Select **Share Access**.
|
||||
* Enter the email address of the person you want to share with.
|
||||
* Choose their role / permission level.
|
||||
* Confirm to send the invitation.
|
||||
|
||||
# Step 6. Cleanup
|
||||
|
||||
If you ever decide to unregister your DGX Station with Brev, you can either do so through the Brev UI or the Brev CLI.
|
||||
|
||||
With the CLI simply run:
|
||||
|
||||
```bash
|
||||
brev deregister
|
||||
```
|
||||
|
||||
In the UI:
|
||||
* Go to the [Brev UI](https://brev.nvidia.com)
|
||||
* Navigate to the section listing “GPU Environments” and look under “Registered Compute”
|
||||
* Click the “Remove” menu item on the device you wish to delete from Brev.
|
||||
* Confirm your selection.
|
||||
|
||||
|
||||
|
||||
-
|
||||
id: troubleshooting
|
||||
|
||||
label: Troubleshooting
|
||||
content: |
|
||||
| Symptom | Cause | Fix |
|
||||
|---------|-------|-----|
|
||||
| Your DGX Station is showing up in the wrong org | It was registered to the wrong org | Run `brev set <my-org>` and then redo the registration process. |
|
||||
| Unable to `brev shell <name>` | Need to refresh | `brev refresh`. |
|
||||
|
||||
|
||||
|
||||
|
||||
resources:
|
||||
- name: Brev Documentation
|
||||
url: https://docs.nvidia.com/brev/latest
|
||||
|
||||
|
||||
@ -118,8 +118,8 @@ All required assets are handled by the NemoClaw installer. No manual cloning is
|
||||
|
||||
- **Estimated time:** About 30–60 minutes for a first full pass (install, onboard, model download depending on choice and network). Optional Brave, Telegram, and cloudflared steps add time if you do them in a second session.
|
||||
- **Risk level:** Medium — you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts.
|
||||
- **Last Updated:** 05/29/2026
|
||||
- Update to latest nemoclaw installer instructions
|
||||
- **Last Updated:** 06/01/2026
|
||||
- Pin nemoclaw installer to v0.0.55, the latest stable version
|
||||
|
||||
## Instructions
|
||||
|
||||
@ -127,10 +127,10 @@ All required assets are handled by the NemoClaw installer. No manual cloning is
|
||||
|
||||
### Step 1. Install NemoClaw
|
||||
|
||||
This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the pinned NemoClaw **v0.55** release (set via `NEMOCLAW_VERSION`; v0.55 is the version the NemoClaw team currently recommends as the most stable), builds the CLI, and runs the onboard wizard to create a sandbox.
|
||||
This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the pinned NemoClaw **v0.0.55** release (set via `NEMOCLAW_INSTALL_TAG`; v0.0.55 is the version the NemoClaw team currently recommends as the most stable), builds the CLI, and runs the onboard wizard to create a sandbox.
|
||||
|
||||
```bash
|
||||
curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_VERSION=v0.55 bash
|
||||
curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_INSTALL_TAG=v0.0.55 bash
|
||||
```
|
||||
|
||||
The installation wizard walks you through setup:
|
||||
@ -148,7 +148,7 @@ The installer requires **Node.js 22.16+** (installed automatically if missing).
|
||||
During custom setup, the onboard wizard walks you through:
|
||||
|
||||
1. **Configuring inference** -- Choose to set up local inference on your DGX Station by selecting **`7) Local Ollama`**.
|
||||
2. **Ollama models** -- Choose desired inference model. If no model is present locally, the installer will provide options to download models to start.
|
||||
2. **Ollama models** -- Choose desired inference model. If no model is present locally, the installer will download **`qwen3.6:35b`** automatically.
|
||||
3. **Sandbox name** -- Pick a name (e.g. my-assistant). Each sandbox requires a unique name.
|
||||
4. **Apply this configuration** -- Enter `Y` to confirm setting up local inference.
|
||||
5. **Enable Brave Web Search** -- Optional. If you enable it, paste a [Brave Search API](https://brave.com/search/api/) key when prompted.
|
||||
@ -324,7 +324,7 @@ Open Telegram, find your bot, and send a message. The bot should forward traffic
|
||||
|
||||
The cloudflared tunnel provides a **public URL for the Web UI dashboard** — it is not related to Telegram messaging.
|
||||
|
||||
Install cloudflared (DGX Station is arm64):
|
||||
Install cloudflared (DGX Station is aarch64):
|
||||
|
||||
```bash
|
||||
curl -L --output cloudflared.deb \
|
||||
@ -354,7 +354,7 @@ You should see `● cloudflared` with a `trycloudflare.com` public URL.
|
||||
|
||||
Set up NemoClaw Agents in general require three steps: Configure NemoClaw security policy, Run Agent Workflow Prompt, Personalize the Workflow for your own use case.
|
||||
|
||||
Checkout these [Example NemoClaw Agents](https://build.nvidia.com/station/nemoclaw-applications) for reference. Consider sharing your NemoClaw agent setup with the community at [DGX Station Developer Forum](https://forums.developer.nvidia.com/c/accelerated-computing/dgx-station-gb300)
|
||||
Checkout these [Example NemoClaw Agents](https://build.nvidia.com/spark/nemoclaw-applications) for reference.
|
||||
|
||||
---
|
||||
|
||||
|
||||
File diff suppressed because it is too large
Load Diff
@ -1,6 +1,6 @@
|
||||
kind: Playbook
|
||||
metadata:
|
||||
name: nemoclaw
|
||||
name: station-nemoclaw
|
||||
displayName: Run NemoClaw with a Local LLM
|
||||
shortDescription: Build your first local AI assistant on DGX Station using NemoClaw in a secure sandbox, with optional Telegram.
|
||||
|
||||
@ -22,8 +22,8 @@ metadata:
|
||||
value: 30 MIN
|
||||
|
||||
spec:
|
||||
artifactName: nemoclaw
|
||||
nvcfFunctionId: 3b0ad962-7cfe-4370-9f4d-8024298a6d13
|
||||
artifactName: station-nemoclaw
|
||||
nvcfFunctionId: None
|
||||
attributes:
|
||||
|
||||
showUnavailableBanner: false
|
||||
@ -130,8 +130,8 @@ spec:
|
||||
|
||||
- **Estimated time:** About 30–60 minutes for a first full pass (install, onboard, model download depending on choice and network). Optional Brave, Telegram, and cloudflared steps add time if you do them in a second session.
|
||||
- **Risk level:** Medium — you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts.
|
||||
- **Last Updated:** 05/29/2026
|
||||
- Update to latest nemoclaw installer instructions
|
||||
- **Last Updated:** 06/01/2026
|
||||
- Pin nemoclaw installer to v0.0.55, the latest stable version
|
||||
|
||||
|
||||
|
||||
@ -144,10 +144,10 @@ spec:
|
||||
|
||||
## Step 1. Install NemoClaw
|
||||
|
||||
This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the pinned NemoClaw **v0.55** release (set via `NEMOCLAW_VERSION`; v0.55 is the version the NemoClaw team currently recommends as the most stable), builds the CLI, and runs the onboard wizard to create a sandbox.
|
||||
This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the pinned NemoClaw **v0.0.55** release (set via `NEMOCLAW_INSTALL_TAG`; v0.0.55 is the version the NemoClaw team currently recommends as the most stable), builds the CLI, and runs the onboard wizard to create a sandbox.
|
||||
|
||||
```bash
|
||||
curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_VERSION=v0.55 bash
|
||||
curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_INSTALL_TAG=v0.0.55 bash
|
||||
```
|
||||
|
||||
The installation wizard walks you through setup:
|
||||
@ -165,7 +165,7 @@ spec:
|
||||
During custom setup, the onboard wizard walks you through:
|
||||
|
||||
1. **Configuring inference** -- Choose to set up local inference on your DGX Station by selecting **`7) Local Ollama`**.
|
||||
2. **Ollama models** -- Choose desired inference model. If no model is present locally, the installer will provide options to download models to start.
|
||||
2. **Ollama models** -- Choose desired inference model. If no model is present locally, the installer will download **`qwen3.6:35b`** automatically.
|
||||
3. **Sandbox name** -- Pick a name (e.g. my-assistant). Each sandbox requires a unique name.
|
||||
4. **Apply this configuration** -- Enter `Y` to confirm setting up local inference.
|
||||
5. **Enable Brave Web Search** -- Optional. If you enable it, paste a [Brave Search API](https://brave.com/search/api/) key when prompted.
|
||||
@ -341,7 +341,7 @@ spec:
|
||||
|
||||
The cloudflared tunnel provides a **public URL for the Web UI dashboard** — it is not related to Telegram messaging.
|
||||
|
||||
Install cloudflared (DGX Station is arm64):
|
||||
Install cloudflared (DGX Station is aarch64):
|
||||
|
||||
```bash
|
||||
curl -L --output cloudflared.deb \
|
||||
@ -371,7 +371,7 @@ spec:
|
||||
|
||||
Set up NemoClaw Agents in general require three steps: Configure NemoClaw security policy, Run Agent Workflow Prompt, Personalize the Workflow for your own use case.
|
||||
|
||||
Checkout these [Example NemoClaw Agents](https://build.nvidia.com/station/nemoclaw-applications) for reference. Consider sharing your NemoClaw agent setup with the community at [DGX Station Developer Forum](https://forums.developer.nvidia.com/c/accelerated-computing/dgx-station-gb300)
|
||||
Checkout these [Example NemoClaw Agents](https://build.nvidia.com/spark/nemoclaw-applications) for reference.
|
||||
|
||||
---
|
||||
|
||||
|
||||
@ -68,17 +68,14 @@ spec:
|
||||
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
|
||||
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
|
||||
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
|
||||
| **Kimi-K2.5 (1T)** | NVFP4 | ✅ | [`nvidia/Kimi-K2.5-NVFP4`](https://huggingface.co/nvidia/Kimi-K2.5-NVFP4) |
|
||||
| **DeepSeek-V4-Flash** | NVFP4 | ✅ | [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) |
|
||||
|
||||
# Time & risk
|
||||
|
||||
* **Duration:** 30 minutes (longer on first run due to model download)
|
||||
* **Risks:** Model download requires HuggingFace authentication
|
||||
* **Rollback:** Stop and remove the container to restore state
|
||||
* **Last Updated:** 05/29/2026
|
||||
* **Last Updated:** 05/28/2026
|
||||
* Update models
|
||||
* Add base configuration example, per-setting explanations, and DeepSeek-V4-Flash recipe
|
||||
|
||||
|
||||
|
||||
@ -125,23 +122,11 @@ spec:
|
||||
docker pull vllm/vllm-openai:stepfun37
|
||||
```
|
||||
|
||||
For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, pull the **26.03** image, which includes the `--cpu-offload-params` support used below:
|
||||
```bash
|
||||
docker pull nvcr.io/nvidia/vllm:26.03-py3
|
||||
```
|
||||
|
||||
For DeepSeek-V4-Flash, pull the stable DeepSeek-V4 release container. Use the **cu130** build on DGX Station (Blackwell):
|
||||
```bash
|
||||
docker pull vllm/vllm-openai:v0.20.0-cu130
|
||||
```
|
||||
|
||||
# Step 4. Start vLLM server
|
||||
|
||||
Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
|
||||
|
||||
## Base configuration (most models)
|
||||
|
||||
This is the recommended starting point for any model that fits entirely in VRAM on the GB300. The Qwen3-235B-A22B-NVFP4 model, for example, runs directly with this configuration.
|
||||
For Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.
|
||||
|
||||
```bash
|
||||
docker run -d \
|
||||
@ -159,12 +144,6 @@ spec:
|
||||
--gpu-memory-utilization 0.9
|
||||
```
|
||||
|
||||
Settings used:
|
||||
- `--max-model-len` — maximum context length (prompt + output) per request. Larger values reserve more GPU memory for the KV cache; size it to your workload.
|
||||
- `--gpu-memory-utilization 0.9` — fraction of GPU memory vLLM may use for weights and KV cache. `0.9` leaves headroom for other processes; raise toward `0.95` to fit more KV cache if the GPU is dedicated.
|
||||
|
||||
## Step-3.7-Flash (FP8 / NVFP4)
|
||||
|
||||
For Step-3.7-Flash models, run with the custom VLLM container. The FP8 and the NVFP4 versions fit entirely in VRAM on the GB300.
|
||||
|
||||
```bash
|
||||
@ -187,94 +166,6 @@ spec:
|
||||
--kv-cache-dtype fp8
|
||||
```
|
||||
|
||||
Settings used (in addition to the base configuration):
|
||||
- `--trust-remote-code` — allows the model's custom modeling code (shipped in its repo) to load. Required for Step-3.7.
|
||||
- `--reasoning-parser step3p5` — parses the model's reasoning/thinking tokens into the dedicated `reasoning_content` response field.
|
||||
- `--enable-auto-tool-choice` — lets the model decide when to call a tool, enabling OpenAI-compatible function calling.
|
||||
- `--tool-call-parser step3p5` — parses the model's tool-call output into structured `tool_calls`. Pairs with `--enable-auto-tool-choice`.
|
||||
- `--kv-cache-dtype fp8` — stores the KV cache in FP8, roughly halving KV-cache memory versus 16-bit and allowing more concurrent/longer sequences.
|
||||
|
||||
## Kimi-K2.5 NVFP4 (1T) — CPU offloading
|
||||
|
||||
For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, run with the **26.03** NGC container. This model does not fit entirely in VRAM, so the MoE expert weights are offloaded to CPU DRAM with `--cpu-offload-gb 375 --cpu-offload-params experts`. Ensure the system has enough free DRAM to hold the offloaded weights.
|
||||
|
||||
```bash
|
||||
docker run -d \
|
||||
--name vllm-server \
|
||||
--gpus all \
|
||||
--ipc host \
|
||||
--ulimit memlock=-1 \
|
||||
--ulimit stack=67108864 \
|
||||
-p 8000:8000 \
|
||||
-e HF_TOKEN="$HF_TOKEN" \
|
||||
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
|
||||
nvcr.io/nvidia/vllm:26.03-py3 \
|
||||
vllm serve nvidia/Kimi-K2.5-NVFP4 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--dtype auto \
|
||||
--kv-cache-dtype auto \
|
||||
--gpu-memory-utilization 0.95 \
|
||||
--served-model-name nvidia/Kimi-K2.5-NVFP4 \
|
||||
--tensor-parallel-size 1 \
|
||||
--no-enable-prefix-caching \
|
||||
--trust-remote-code \
|
||||
--max-model-len 40960 \
|
||||
--max-num-seqs 1 \
|
||||
--max-num-batched-tokens 32768 \
|
||||
--cpu-offload-gb 375 \
|
||||
--cpu-offload-params experts
|
||||
```
|
||||
|
||||
Settings used (in addition to the base configuration):
|
||||
- `--cpu-offload-gb 375` — amount of CPU DRAM (in GiB) vLLM may use to hold weights that don't fit in VRAM. Must be large enough for the offloaded experts; the system needs at least this much free DRAM.
|
||||
- `--cpu-offload-params experts` — offloads only the MoE expert weights (the bulk of a large MoE model) to DRAM, keeping attention and other hot weights in VRAM.
|
||||
- `--tensor-parallel-size 1` — single GPU; the GB300 serves the whole model.
|
||||
- `--max-num-seqs 1` / `--max-num-batched-tokens 32768` — caps concurrency to one sequence and the batch token budget. With expert weights paged from DRAM, throughput is offload-bound, so a low concurrency keeps latency predictable.
|
||||
- `--no-enable-prefix-caching` — disables prefix-cache reuse. Offloaded experts make the memory budget tight, so the cache is turned off here rather than spent on KV reuse.
|
||||
- `--kv-cache-dtype auto` / `--dtype auto` — let vLLM pick the KV-cache and compute dtypes from the model's quantization (NVFP4).
|
||||
|
||||
## DeepSeek-V4-Flash — MTP + agentic
|
||||
|
||||
For DeepSeek-V4-Flash, run with the stable **v0.20.0-cu130** container. This recipe targets agentic workloads and enables Multi-Token Prediction (MTP) speculative decoding. On a single GB300 (TP1) the MoE expert-parallel path is used; the `deep_gemm_mega_moe` backend from some internal recipes is not needed at TP1 and is omitted here.
|
||||
|
||||
```bash
|
||||
docker run -d \
|
||||
--name vllm-server \
|
||||
--gpus all \
|
||||
--ipc host \
|
||||
--ulimit memlock=-1 \
|
||||
--ulimit stack=67108864 \
|
||||
-p 8000:8000 \
|
||||
-e HF_TOKEN="$HF_TOKEN" \
|
||||
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
|
||||
vllm/vllm-openai:v0.20.0-cu130 \
|
||||
deepseek-ai/DeepSeek-V4-Flash \
|
||||
--enable-expert-parallel \
|
||||
--kv-cache-dtype fp8 \
|
||||
--trust-remote-code \
|
||||
--block-size 256 \
|
||||
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
|
||||
--attention_config.use_fp4_indexer_cache True \
|
||||
--tokenizer-mode deepseek_v4 \
|
||||
--tool-call-parser deepseek_v4 \
|
||||
--enable-auto-tool-choice \
|
||||
--reasoning-parser deepseek_v4 \
|
||||
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
|
||||
--max-model-len 32768
|
||||
```
|
||||
|
||||
Settings used (in addition to the base configuration):
|
||||
- `--enable-expert-parallel` — shards the MoE experts across the available GPU(s) using expert parallelism, the recommended MoE execution path for DeepSeek-V4.
|
||||
- `--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'` — enables **MTP (Multi-Token Prediction)** speculative decoding: the model proposes 3 tokens per step that are verified in a single forward pass, cutting latency for accepted tokens.
|
||||
- `--kv-cache-dtype fp8` — FP8 KV cache to fit more concurrent/longer sequences.
|
||||
- `--block-size 256` — KV-cache page size in tokens. DeepSeek-V4 uses multiple KV-cache groups; `256` matches the recipe validated on Station.
|
||||
- `--attention_config.use_fp4_indexer_cache True` — enables the FP4 indexer cache used by DeepSeek-V4's attention. (Drop this flag on platforms without native FP4, e.g. Hopper.)
|
||||
- `--tokenizer-mode deepseek_v4` / `--tool-call-parser deepseek_v4` / `--reasoning-parser deepseek_v4` — DeepSeek-V4-specific tokenizer, tool-call, and reasoning parsers.
|
||||
- `--enable-auto-tool-choice` — OpenAI-compatible function calling for agentic use.
|
||||
- `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` — uses full + piecewise CUDA graph capture and enables all custom ops for lower per-step overhead.
|
||||
- **Prefix caching is left enabled (the vLLM default).** For agentic workloads with large shared prefixes (e.g. a 32k system/context prefix) at low batch sizes (~BS 3–4), prefix caching gives a significant throughput boost by reusing the cached prefix across requests. Some internal recipes carry `--no-enable-prefix-caching`, but that was inherited from random-data benchmarking and is not recommended for agentic use here.
|
||||
|
||||
Check the server logs for startup progress:
|
||||
|
||||
```bash
|
||||
|
||||
Loading…
Reference in New Issue
Block a user