mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-06-18 04:22:21 +00:00
chore: Regenerate all playbooks
This commit is contained in:
parent
718d8288e3
commit
1d1a95b3cb
327
nvidia/station-ai-skills/README.md
Normal file
@ -0,0 +1,327 @@
|
||||
# DGX Station AI Skills for Coding Agents
|
||||
|
||||
> Give your coding agent (Claude Code, Codex, Gemini CLI, Cursor) DGX Station expertise via an AGENTS.md and on-demand Agent Skills
|
||||
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Overview](#overview)
|
||||
- [AGENTS.md vs Agent Skill — why split?](#agentsmd-vs-agent-skill-why-split)
|
||||
- [Instructions](#instructions)
|
||||
- [Troubleshooting](#troubleshooting)
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
## Basic idea
|
||||
|
||||
Modern coding agents — Claude Code, OpenAI Codex CLI, Gemini CLI, Cursor — all support two extension mechanisms: a project-level **context file** that's loaded into every conversation, and **on-demand procedural workflows** (called skills, prompts, commands, or rules depending on the harness). This playbook ships both for DGX Station:
|
||||
|
||||
- An **`AGENTS.md`** with the critical DGX Station constraints your agent should always know (mixed coherency, GPU targeting, common pitfalls). `AGENTS.md` is the cross-harness standard; an `install.sh` lays it down as `CLAUDE.md`, `GEMINI.md`, or `AGENTS.md` depending on the agent you use.
|
||||
- **Four Agent Skills** — `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose` — authored once in the [Anthropic Agent Skills format](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview) and installed into the right per-harness location (`.claude/skills/`, `.codex/prompts/`, `.gemini/commands/`, or `.cursor/rules/`).
|
||||
|
||||
This approach keeps your agent's context lean in every conversation while giving it deep procedural knowledge on demand, regardless of which agent you use.
|
||||
|
||||
### AGENTS.md vs Agent Skill — why split?
|
||||
|
||||
| | AGENTS.md | Agent Skill |
|---|---|---|
| **Loaded** | Every conversation, automatically | Only when invoked by name (or matched by description, in Claude) |
| **Best for** | Constraints, pitfalls, "never do X" rules | Step-by-step workflows, deployment procedures |
| **Context cost** | Consumed every time | Zero until invoked |
|
||||
|
||||
The DGX Station mixed-coherency constraint (`--gpus all` will crash) should be in every conversation. The full vLLM deployment procedure should not.
|
||||
|
||||
## What you'll accomplish
|
||||
|
||||
- Install the `AGENTS.md` and four Agent Skills into your project directory for your chosen agent (Claude Code, Codex, Gemini CLI, or Cursor).
|
||||
- Verify the agent loads the constraints automatically and the skills on demand.
|
||||
- Invoke `vllm-setup` to deploy a vLLM inference server with validated configuration.
|
||||
- Invoke `sglang-setup` to deploy an SGLang inference server.
|
||||
- Invoke `mig-configure` to partition the GB300 into MIG instances.
|
||||
- Invoke `dgx-diagnose` to troubleshoot common DGX Station issues.
|
||||
|
||||
## What to know before starting
|
||||
|
||||
- Basic familiarity with one supported coding agent (running it, giving it prompts, using slash commands or rule references)
|
||||
- General understanding of DGX Station (two GPUs, Docker-based workflows)
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- NVIDIA DGX Station with GB300
|
||||
- One of the supported coding agents installed:
|
||||
- **Claude Code:** `curl -fsSL https://claude.ai/install.sh | sh`
|
||||
- **OpenAI Codex CLI:** `npm i -g @openai/codex`
|
||||
- **Gemini CLI:** `npm i -g @google/gemini-cli`
|
||||
- **Cursor:** download from `https://cursor.com/`
|
||||
- A project directory where you do DGX Station work
|
||||
|
||||
## Ancillary files
|
||||
|
||||
- `assets/AGENTS.md` — canonical context file with critical constraints, GPU targeting, software versions, and common pitfalls. Cross-harness standard.
|
||||
- `assets/skills/vllm-setup/SKILL.md` — skill: deploy vLLM with validated configuration.
|
||||
- `assets/skills/sglang-setup/SKILL.md` — skill: deploy SGLang with validated configuration.
|
||||
- `assets/skills/mig-configure/SKILL.md` — skill: configure MIG partitions on the GB300.
|
||||
- `assets/skills/dgx-diagnose/SKILL.md` — skill: troubleshoot common DGX Station issues.
|
||||
- `assets/install.sh` — per-harness installer (`claude`, `codex`, `gemini`, `cursor`, or `all`).
|
||||
|
||||
## Time & risk
|
||||
|
||||
* **Duration:** 10-15 minutes
|
||||
* **Risk level:** Low — this playbook copies markdown files into your project directory
|
||||
* **Rollback:** Delete the context file (`AGENTS.md` / `CLAUDE.md` / `GEMINI.md`) and the harness-specific skill directory (`.claude/skills/`, `.codex/prompts/`, `.gemini/commands/`, or `.cursor/rules/`) from your project directory
|
||||
* **Last Updated:** 05/18/2026
|
||||
* Restructured as harness-agnostic Agent Skills (Claude Code, Codex, Gemini CLI, Cursor)
|
||||
|
||||
## Instructions
|
||||
|
||||
## Step 1. Install your coding agent
|
||||
|
||||
Pick whichever agent you prefer — the rest of this playbook works the same regardless. Install commands:
|
||||
|
||||
| Agent | Install |
|-------|---------|
| Claude Code | `curl -fsSL https://claude.ai/install.sh \| sh` |
| OpenAI Codex CLI | `npm i -g @openai/codex` |
| Gemini CLI | `npm i -g @google/gemini-cli` |
| Cursor | Download from `https://cursor.com/` |
|
||||
|
||||
Verify with `claude --version`, `codex --version`, `gemini --version`, or by launching Cursor.
|
||||
|
||||
## Step 2. Install the skills into your project
|
||||
|
||||
Navigate to the project where you want DGX Station expertise, then run the installer with the harness you use:
|
||||
|
||||
```bash
cd ~/your-project

# Pick one:
/path/to/this/playbook/assets/install.sh claude
/path/to/this/playbook/assets/install.sh codex
/path/to/this/playbook/assets/install.sh gemini
/path/to/this/playbook/assets/install.sh cursor

# Or install for all four at once:
/path/to/this/playbook/assets/install.sh all
```
|
||||
|
||||
If you downloaded the playbook as a zip, the path is relative to the extracted directory:
|
||||
|
||||
```bash
station-ai-skills/assets/install.sh claude ~/your-project
```
|
||||
|
||||
The installer is additive for skill directories (won't clobber existing skills you've written) and refuses to overwrite an existing context file (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`) unless you pass `--force`.
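The skip-unless-forced behavior described above can be pictured with a minimal sketch. The `install_file` helper is hypothetical and is not taken from the real `install.sh`, which may work differently; it only illustrates the rule for a single context file, using the `WROTE` and `SKIP` labels the installer reports.

```bash
# Hypothetical helper (not the real install.sh): copy SRC to DST unless DST
# already exists; overwrite an existing DST only when the third argument
# is "--force".
install_file() {
  local src="$1" dst="$2" force="${3:-}"
  if [ -e "$dst" ] && [ "$force" != "--force" ]; then
    echo "SKIP  $dst (already exists)"
    return 0
  fi
  mkdir -p "$(dirname "$dst")"
  cp "$src" "$dst"
  echo "WROTE $dst"
}

# Example (paths are placeholders):
#   install_file /path/to/this/playbook/assets/AGENTS.md ./AGENTS.md
#   install_file /path/to/this/playbook/assets/AGENTS.md ./AGENTS.md --force
```

Skipping by default means a context file you have customized is never lost by re-running the installer.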
|
||||
|
||||
**Resulting layout** (per harness):
|
||||
|
||||
```text
your-project/
  AGENTS.md or CLAUDE.md or GEMINI.md   # context file (named for your agent)
  .claude/skills/<name>/SKILL.md        # claude
  .codex/prompts/<name>.md              # codex
  .gemini/commands/<name>.md            # gemini
  .cursor/rules/<name>.mdc              # cursor
```
|
||||
|
||||
Where `<name>` is each of `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose`.
|
||||
|
||||
> [!NOTE]
> Every supported agent automatically reads the context file from the working directory at startup. Skills/prompts/rules in the harness-specific directory are discovered automatically — no additional configuration needed.
|
||||
|
||||
## Step 3. Verify the setup
|
||||
|
||||
Start your agent in the project directory and ask a question that requires constraint knowledge:
|
||||
|
||||
```text
Can I use --gpus all to run my CUDA workload on DGX Station?
```
|
||||
|
||||
The agent should immediately warn about the mixed-coherency constraint and recommend `--gpus '"device=N"'` targeting. If you don't get the warning, the context file isn't being loaded — see Troubleshooting.
|
||||
|
||||
Then verify the skills are discoverable:
|
||||
|
||||
| Agent | How to check |
|-------|--------------|
| Claude Code | Type `/` — `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose` should appear in the autocomplete |
| Codex CLI | Type `/prompts:` — same four names appear |
| Gemini CLI | Type `/` — same four names appear |
| Cursor | Open the Rules panel — same four rules appear |
|
||||
|
||||
## Step 4. Use vllm-setup to deploy an inference server
|
||||
|
||||
Invoke the skill in your agent:
|
||||
|
||||
| Agent | Invocation |
|-------|-----------|
| Claude Code | `/vllm-setup` (slash command) or just describe the task ("deploy vllm with Qwen3-8B") |
| Codex CLI | `/prompts:vllm-setup` |
| Gemini CLI | `/vllm-setup` |
| Cursor | In chat: "use the vllm-setup rule to deploy a vllm server" |
|
||||
|
||||
The agent will walk you through deploying a vLLM server with a validated container image, correct GPU targeting, and recommended parameters. It will check your GPU index, ask which model you want to serve, and generate the full `docker run` command.
|
||||
|
||||
## Step 5. Use sglang-setup to deploy SGLang
|
||||
|
||||
Same invocation pattern, but for SGLang with the `cu130` container, RadixAttention prefix caching, and structured JSON output support.
|
||||
|
||||
## Step 6. Use mig-configure to partition the GB300
|
||||
|
||||
The agent will query your current MIG state, show available profiles, help you choose a layout for your workloads, and execute the partitioning commands.
|
||||
|
||||
## Step 7. Use dgx-diagnose to troubleshoot issues
|
||||
|
||||
If you encounter problems, invoke `dgx-diagnose`. The agent will check GPU status, driver version, running processes, MIG state, and Fabric Manager to identify the issue.
|
||||
|
||||
## Step 8. Customize
|
||||
|
||||
Both the `AGENTS.md` and the skills are plain markdown — extend them freely.
|
||||
|
||||
**Add project-specific constraints to `AGENTS.md`** (or your harness-specific context file):
|
||||
|
||||
```markdown
### Project-specific

- Our production MIG layout is 3g.139gb + 2g.70gb + 2g.70gb
- Always use port 8080 for inference (nginx proxy on 443)
- Model weights are cached at /data/models, mount with -v /data/models:/root/.cache/huggingface/hub
```
|
||||
|
||||
**Create new skills** by adding a directory and `SKILL.md` to `assets/skills/`, then re-run `install.sh`:
|
||||
|
||||
```bash
mkdir -p assets/skills/run-benchmarks
cat > assets/skills/run-benchmarks/SKILL.md << 'EOF'
---
name: run-benchmarks
description: Run our standard inference benchmark suite against the running vLLM or SGLang server and compare against the baseline.
---

## Run benchmarks

1. Check which inference server is running (vLLM on port 8000 or SGLang on port 30000)
2. Run the appropriate benchmark script from ./benchmarks/
3. Report throughput (tokens/sec), latency (TTFT, ITL), and memory utilization
4. Compare against the baseline in ./benchmarks/baseline.json
EOF
```
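
Before re-running the installer, it can help to confirm that a new `SKILL.md` starts with the front matter shown above. The sketch below uses a hypothetical `check_skill_frontmatter` helper and assumes only what the example shows: a leading `---` block that holds non-empty `name:` and `description:` keys.

```bash
# Sketch: succeed only if FILE begins with a '---' front-matter block that
# is closed by a second '---' and contains non-empty name: and description:.
check_skill_frontmatter() {
  awk '
    NR == 1 { if ($0 != "---") exit 1; next }
    $0 == "---" { closed = 1; exit }
    /^name:[ \t]*[^ \t]/ { name = 1 }
    /^description:[ \t]*[^ \t]/ { desc = 1 }
    END { exit !(closed && name && desc) }
  ' "$1"
}

# Example:
#   check_skill_frontmatter assets/skills/run-benchmarks/SKILL.md && echo "front matter ok"
```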
|
||||
|
||||
> [!TIP]
> Keep `AGENTS.md` focused on constraints and pitfalls (things that break). Put procedural workflows in skills (things you do step-by-step).
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
## Skills don't appear in autocomplete / aren't discoverable
|
||||
|
||||
Each agent discovers skills from a harness-specific directory in the current directory (or a parent). Check the right one:
|
||||
|
||||
| Agent | Expected location |
|-------|-------------------|
| Claude Code | `.claude/skills/<name>/SKILL.md` |
| Codex CLI | `.codex/prompts/<name>.md` |
| Gemini CLI | `.gemini/commands/<name>.md` |
| Cursor | `.cursor/rules/<name>.mdc` |
|
||||
|
||||
```bash
# Examples — check the directory for your agent
ls -la .claude/skills/
ls -la .codex/prompts/
ls -la .gemini/commands/
ls -la .cursor/rules/
```
|
||||
|
||||
You should see entries for `vllm-setup`, `sglang-setup`, `mig-configure`, and `dgx-diagnose`.
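
The same check can be scripted. This is a minimal sketch with a hypothetical `check_skills` helper; it assumes the per-harness paths listed in the table above and nothing more about how each agent discovers files.

```bash
# Sketch: report which of the four skills are present for one harness.
# Usage: check_skills <claude|codex|gemini|cursor> [project_dir]
# Returns 0 if all four files exist, 1 if any is missing, 2 on a bad harness.
check_skills() {
  local harness="$1" root="${2:-.}" missing=0 name path
  for name in vllm-setup sglang-setup mig-configure dgx-diagnose; do
    case "$harness" in
      claude) path="$root/.claude/skills/$name/SKILL.md" ;;
      codex)  path="$root/.codex/prompts/$name.md" ;;
      gemini) path="$root/.gemini/commands/$name.md" ;;
      cursor) path="$root/.cursor/rules/$name.mdc" ;;
      *) echo "unknown harness: $harness" >&2; return 2 ;;
    esac
    if [ -f "$path" ]; then
      echo "ok      $path"
    else
      echo "MISSING $path"
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   check_skills claude ~/your-project
```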
|
||||
|
||||
**Check you're in the right directory:**
|
||||
|
||||
```bash
pwd
```
|
||||
|
||||
The agent must be started from the directory containing the harness directory, or a subdirectory of it.
|
||||
|
||||
## Context file not loaded
|
||||
|
||||
If the agent gives generic answers without DGX Station awareness, the context file isn't being picked up. Each agent reads a different filename — verify the one for your agent exists:
|
||||
|
||||
| Agent | Expected filename |
|-------|-------------------|
| Claude Code | `CLAUDE.md` (also reads `AGENTS.md` as fallback) |
| Codex CLI | `AGENTS.md` |
| Gemini CLI | `GEMINI.md` |
| Cursor | `AGENTS.md` |
|
||||
|
||||
```bash
# Verify the file exists for your agent
head -5 AGENTS.md
head -5 CLAUDE.md
head -5 GEMINI.md

# Restart the agent in the correct directory
cd ~/your-project
claude   # or codex, gemini, etc.
```
|
||||
|
||||
All four agents read the context file from the working directory (and parent directories up to the project root).
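
To see which context file an agent would find from a given directory, the upward search can be approximated in the shell. This is only a sketch: the `find_context_file` helper is hypothetical, and it walks all the way to the filesystem root rather than stopping at the project root, because how each agent detects the project root is not covered here.

```bash
# Sketch: starting at START_DIR (default: current directory), walk upward
# and print the first directory entry named FILE. Returns 1 if none found.
# Usage: find_context_file <AGENTS.md|CLAUDE.md|GEMINI.md> [start_dir]
find_context_file() {
  local file="$1" dir
  dir="$(cd "${2:-.}" && pwd)" || return 2
  while :; do
    if [ -f "$dir/$file" ]; then
      echo "$dir/$file"
      return 0
    fi
    [ "$dir" = "/" ] && return 1
    dir="$(dirname "$dir")"
  done
}

# Example:
#   find_context_file AGENTS.md ~/your-project/src
```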
|
||||
|
||||
## Skill gives outdated information
|
||||
|
||||
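The skills contain validated container versions and parameters as of the publication date. If a newer container is available, edit the canonical source and re-install (remove the installed copy of the skill first, since the installer skips skill files that already exist):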
|
||||
|
||||
```bash
nano /path/to/playbook/assets/skills/vllm-setup/SKILL.md
/path/to/playbook/assets/install.sh all --force
```
|
||||
|
||||
Or edit the installed copy directly:
|
||||
|
||||
```bash
# Claude Code
nano .claude/skills/vllm-setup/SKILL.md
# Codex
nano .codex/prompts/vllm-setup.md
# Gemini CLI
nano .gemini/commands/vllm-setup.md
# Cursor
nano .cursor/rules/vllm-setup.mdc
```
|
||||
|
||||
> [!TIP]
> Skills are plain markdown — you can version them in git alongside your project code.
|
||||
|
||||
## "Both GPUs cannot be used" errors
|
||||
|
||||
This is the mixed-coherency constraint working as intended. If you see CUDA errors when using `--gpus all`, find the GB300 index and target that device explicitly:
|
||||
|
||||
```bash
# Find the GB300 index
nvidia-smi --query-gpu=index,name --format=csv,noheader

# Use device-specific targeting
docker run --gpus '"device=1"' ...
```
|
||||
|
||||
The `AGENTS.md` covers this constraint, but if you removed that section, add it back — it's the most important piece of DGX Station knowledge.
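
If you script your launches, the index lookup can be automated rather than hard-coded. The sketch below is an illustration only: the `gb300_index` helper is hypothetical, and it assumes the GPU name reported by `nvidia-smi` contains the string `GB300`, which you should confirm against your own output.

```bash
# Sketch: read "index, name" CSV lines on stdin, the shape produced by
#   nvidia-smi --query-gpu=index,name --format=csv,noheader
# and print the index of the first GPU whose name contains "GB300".
# Exits non-zero if no such line is found.
gb300_index() {
  awk -F', *' '/GB300/ { print $1; found = 1; exit } END { exit !found }'
}

# Example:
#   idx="$(nvidia-smi --query-gpu=index,name --format=csv,noheader | gb300_index)"
#   docker run --gpus "\"device=${idx}\"" ...
```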
|
||||
|
||||
## Skills conflict with existing project directory
|
||||
|
||||
If your project already has a `.claude/`, `.codex/`, `.gemini/`, or `.cursor/` directory with its own contents, `install.sh` is **additive** for skill directories — it adds the new skill files alongside whatever you already have and warns on collision rather than overwriting.
|
||||
|
||||
For context files (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`), the installer **refuses** to overwrite an existing file. Pass `--force` to override, or merge the new content manually:
|
||||
|
||||
```bash
# See what would be written
diff /path/to/playbook/assets/AGENTS.md ./AGENTS.md

# Force overwrite
/path/to/playbook/assets/install.sh claude . --force
```
|
||||
|
||||
## Installer reports "WROTE" for some files but "SKIP" for others
|
||||
|
||||
That's the safe-by-default behavior. The installer skips any file that already exists, prints a warning, and continues with the rest. To get a clean install, either:
|
||||
|
||||
1. Delete the existing files first: `rm -rf .claude/skills/{vllm-setup,sglang-setup,mig-configure,dgx-diagnose}`
|
||||
2. Or pass `--force` (only affects context files; skill files are still skipped if present)
|
||||
BIN
nvidia/station-ai-skills/assets/.DS_Store
vendored
Normal file
Binary file not shown.
@ -0,0 +1,85 @@
|
||||
|
||||
# DGX Station Diagnostics
|
||||
|
||||
Diagnose common DGX Station issues. Run through the checks below to identify the problem.
|
||||
|
||||
## Step 1. Gather system state
|
||||
|
||||
Run these commands and analyze the output:
|
||||
|
||||
```bash
# GPU status
nvidia-smi

# GPU device list with indices
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader

# Driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1

# MIG state
nvidia-smi -i 1 -q 2>/dev/null | grep -i "MIG Mode" || echo "Could not query MIG on device 1"

# Fabric Manager
systemctl is-active nvidia-fabricmanager

# GPU processes
sudo fuser -v /dev/nvidia* 2>/dev/null || echo "No GPU processes found"

# Docker containers using GPUs
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}" 2>/dev/null
```
|
||||
|
||||
## Step 2. Match symptoms to known issues
|
||||
|
||||
Based on the gathered state and the user's reported problem, check for these known issues:
|
||||
|
||||
### CUDA crashes with `--gpus all`
|
||||
**Cause:** Mixed coherency — GB300 (ATS) and RTX PRO (non-ATS) cannot share a CUDA context.
|
||||
**Fix:** Use `--gpus '"device=N"'` targeting only the GB300.
|
||||
|
||||
### Model running on wrong GPU (RTX PRO instead of GB300)
|
||||
**Check:** The device index in the docker command vs actual GPU indices.
|
||||
**Fix:** Verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader` and correct the `--gpus` flag.
|
||||
|
||||
### vLLM crash / FlashInfer buffer overflow
|
||||
**Check:** Container version — `docker inspect vllm-server | grep Image`
|
||||
**Fix:** Use `nvcr.io/nvidia/vllm:26.01-py3`. Version 25.10 has a known FlashInfer bug on DGX Station.
|
||||
|
||||
### SGLang CUDA errors
|
||||
**Check:** Container tag — must be `cu130` for Blackwell SM103.
|
||||
**Fix:** Use `lmsysorg/sglang:latest-cu130`.
|
||||
|
||||
### CUDA OOM despite 279 GB HBM
|
||||
**Check:** `--max-model-len` / `--context-length` and memory utilization settings.
|
||||
**Fix:** Reduce context length or lower `--gpu-memory-utilization` / `--mem-fraction-static`.
|
||||
|
||||
### `nvidia-smi -mig 1` returns "In use by another client"
|
||||
**Check:** `sudo fuser -v /dev/nvidia*` — GPU processes must be stopped first.
|
||||
**Fix:** Stop all GPU workloads, then retry.
|
||||
|
||||
### NVLink errors after disabling MIG
|
||||
**Check:** `systemctl is-active nvidia-fabricmanager`
|
||||
**Fix:** `sudo systemctl start nvidia-fabricmanager`
|
||||
|
||||
### X server crash after nvidia-xconfig -a
|
||||
**Fix:** `sudo cp /etc/X11/xorg.conf.nvidia-xconfig-original /etc/X11/xorg.conf`
|
||||
|
||||
### Vulkan VK_ERROR_INITIALIZATION_FAILED
|
||||
**Cause:** CUDA initialized before Vulkan, binding to GB300.
|
||||
**Fix:** Run CUDA and Vulkan workloads in separate processes. For Vulkan apps: `__GL_DeviceModalityPreference=2 ./your_app`
|
||||
|
||||
### HuggingFace 401 / token errors
|
||||
**Fix:** Pass token inline: `-e HF_TOKEN="hf_..."`. Don't rely on shell export for background Docker tasks.
|
||||
|
||||
### Port already in use
|
||||
**Check:** `lsof -i :<PORT>`
|
||||
**Fix:** Stop the conflicting process or use a different host port: `-p 8001:8000`.
|
||||
|
||||
## Step 3. Report findings
|
||||
|
||||
Tell the user:
|
||||
1. What the issue is
|
||||
2. Why it happens (root cause)
|
||||
3. The specific command to fix it
|
||||
4. How to verify the fix worked
|
||||
103
nvidia/station-ai-skills/assets/.codex/prompts/mig-configure.md
Normal file
@ -0,0 +1,103 @@
|
||||
|
||||
# MIG Configuration on DGX Station
|
||||
|
||||
Configure MIG (Multi-Instance GPU) partitions on the DGX Station GB300.
|
||||
|
||||
## Steps
|
||||
|
||||
1. **Find the GB300 GPU index.** Run:
|
||||
```bash
nvidia-smi --query-gpu=index,name --format=csv,noheader
```
|
||||
|
||||
2. **Check current MIG state:**
|
||||
```bash
nvidia-smi -i <GB300_INDEX> -q | grep -i "MIG Mode"
```
|
||||
|
||||
3. **If MIG is already enabled, show current instances:**
|
||||
```bash
nvidia-smi mig -lgi -i <GB300_INDEX>
nvidia-smi mig -lci -i <GB300_INDEX>
```
|
||||
If the user wants to reconfigure, destroy existing instances first (step 6).
|
||||
|
||||
4. **If MIG is not enabled, enable it.** All GPU processes must be stopped first:
|
||||
```bash
# Check for running GPU processes
sudo fuser -v /dev/nvidia*

# Enable MIG
sudo nvidia-smi -i <GB300_INDEX> -mig 1

# Verify
nvidia-smi -i <GB300_INDEX> -q | grep -i "MIG Mode"
```
|
||||
|
||||
5. **Show available profiles and help the user choose a layout:**
|
||||
```bash
nvidia-smi mig -lgip -i <GB300_INDEX>
```
|
||||
|
||||
Common GB300 MIG profiles:
|
||||
|
||||
| Profile | ID | Memory | Use case |
|---------|----|--------|----------|
| 1g.35gb | 19 | ~35 GB | Small models (7-8B), dev/test |
| 1g.35gb+me | 20 | ~35 GB | Same + media extensions |
| 1g.70gb | 15 | ~70 GB | Slightly larger inference |
| 2g.70gb | 14 | ~70 GB | Medium models (14-30B) |
| 3g.139gb | 9 | ~139 GB | Large models (70B quantized) |
| 4g.139gb | 5 | ~139 GB | Large models, more compute |
| 7g.278gb | 0 | ~278 GB | Full GPU as single instance |
|
||||
|
||||
Suggest layouts based on the user's workload. Examples:
|
||||
- **Two models (70B + 8B):** `3g.139gb + 2g.70gb + 2g.70gb` → IDs `9,14,14`
|
||||
- **Many small models:** `7 × 1g.35gb` → IDs `19,19,19,19,19,19,19`
|
||||
- **One large model with isolation:** `7g.278gb` → ID `0`
|
||||
|
||||
Ask the user what models they want to run before suggesting a layout.
|
||||
|
||||
6. **Create (or recreate) instances:**
|
||||
|
||||
If reconfiguring, destroy existing instances first:
|
||||
```bash
sudo nvidia-smi mig -dci -i <GB300_INDEX>
sudo nvidia-smi mig -dgi -i <GB300_INDEX>
```
|
||||
|
||||
Then create the new layout:
|
||||
```bash
sudo nvidia-smi mig -cgi <PROFILE_IDS> -C -i <GB300_INDEX>
```
|
||||
|
||||
7. **Get the MIG device UUIDs:**
|
||||
```bash
nvidia-smi -L
```
|
||||
Note the `MIG-<uuid>` entries — these are used to target specific MIG instances.
|
||||
|
||||
8. **Show the user how to use MIG devices:**
|
||||
```bash
# Bare metal
export CUDA_VISIBLE_DEVICES=MIG-<uuid>

# Docker
docker run --gpus '"device=MIG-<uuid>"' ...
```
|
||||
|
||||
9. **Report the final layout** to the user with UUIDs and suggested docker commands for each instance.
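
A layout suggested in step 5 can be sanity-checked before it is created. The sketch below uses a hypothetical `mig_slices` helper and rests on two assumptions that are not stated in this skill: that the number before `g.` in a profile name is its compute-slice count, and that the GPU offers seven slices in total, as the `7g.278gb` full-GPU profile suggests. Passing this check is necessary but not sufficient, since the driver also enforces placement rules; `nvidia-smi mig -lgip` remains the authority.

```bash
# Sketch: print the total compute slices of a layout and fail if it exceeds
# seven. Usage: mig_slices <profile>...   e.g. mig_slices 3g.139gb 2g.70gb 2g.70gb
mig_slices() {
  local total=0 p
  for p in "$@"; do
    # "${p%%g.*}" keeps the digits before "g.", e.g. "3" from "3g.139gb".
    total=$(( total + ${p%%g.*} ))
  done
  echo "$total"
  [ "$total" -le 7 ]
}

# Example:
#   mig_slices 3g.139gb 2g.70gb 2g.70gb && echo "slice budget ok"
```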
|
||||
|
||||
## Disabling MIG
|
||||
|
||||
If the user wants to return to full-GPU mode:
|
||||
|
||||
```bash
# Stop all workloads using MIG instances first
sudo nvidia-smi mig -dci -i <GB300_INDEX>
sudo nvidia-smi mig -dgi -i <GB300_INDEX>
sudo nvidia-smi -i <GB300_INDEX> -mig 0

# Ensure Fabric Manager is running for NVLink re-initialization
sudo systemctl start nvidia-fabricmanager
```
|
||||
115
nvidia/station-ai-skills/assets/.codex/prompts/sglang-setup.md
Normal file
@ -0,0 +1,115 @@
|
||||
|
||||
# SGLang Setup on DGX Station
|
||||
|
||||
Deploy an SGLang inference server on DGX Station with validated configuration.
|
||||
|
||||
## Steps
|
||||
|
||||
1. **Find the GB300 GPU index.** Run:
|
||||
```bash
nvidia-smi --query-gpu=index,name --format=csv,noheader
```
|
||||
Identify the device index for the GB300 (typically device 1). Use this index for `--gpus` below. Do NOT use `--gpus all` — mixed coherency will cause CUDA failures.
|
||||
|
||||
2. **Ask the user which model to serve.** If they don't have a preference, suggest:
|
||||
- `Qwen/Qwen3-8B` — small, fast, good for testing
|
||||
- `Qwen/Qwen3-32B` — medium, good balance
|
||||
- `meta-llama/Llama-3.1-70B-Instruct` — large general-purpose
|
||||
|
||||
3. **Check if the user has an HF_TOKEN.** Pass inline with `-e HF_TOKEN="..."`.
|
||||
|
||||
4. **Deploy the container.** Use this validated configuration:
|
||||
|
||||
```bash
docker pull lmsysorg/sglang:latest-cu130

docker run -d \
  --name sglang-server \
  --gpus '"device=<GB300_INDEX>"' \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 30000:30000 \
  -e HF_TOKEN="<TOKEN>" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  lmsysorg/sglang:latest-cu130 \
  sglang serve --model-path "<MODEL>" \
  --host 0.0.0.0 \
  --port 30000 \
  --context-length 32768 \
  --mem-fraction-static 0.85
```
|
||||
|
||||
**Container version:** Use `lmsysorg/sglang:latest-cu130`. The `cu130` tag is required for Blackwell SM103 support.
|
||||
|
||||
**First launch** downloads the model and compiles kernels. This takes extra time — subsequent starts are faster.
|
||||
|
||||
5. **Wait for the server to be ready.** Monitor logs:
|
||||
```bash
docker logs -f sglang-server
```
|
||||
|
||||
6. **Test the server:**
|
||||
```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<MODEL>",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'
```
|
||||
|
||||
7. **Report the result** to the user, including:
|
||||
- Model loaded and serving on port 30000
|
||||
- How to stop: `docker stop sglang-server && docker rm sglang-server`
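
Step 5 relies on watching the logs by hand. As an alternative, a small polling helper can block until the server answers. This is a sketch under assumptions: `wait_until` is a hypothetical helper, and the `/v1/models` probe in the usage comment assumes the server exposes that OpenAI-compatible route on port 30000.

```bash
# Sketch: retry a probe command about once per second until it succeeds or
# roughly TIMEOUT seconds have passed. Returns 0 on success, 1 on timeout.
# Usage: wait_until <timeout_seconds> <command> [args...]
wait_until() {
  local timeout="$1" waited=0
  shift
  until "$@" >/dev/null 2>&1; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 0
}

# Example (the endpoint is an assumption, adjust to your server):
#   wait_until 600 curl -fsS http://localhost:30000/v1/models && echo "server ready"
```

A generous timeout is sensible on first launch, since the model download and kernel compilation happen before the server starts listening.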
|
||||
|
||||
## Key features
|
||||
|
||||
- **RadixAttention** — automatic KV cache reuse across requests sharing prefixes. On by default, no flag needed. Verify with: `docker logs sglang-server 2>&1 | grep "cached-token" | tail -5`
|
||||
- **Structured JSON output** — use `response_format.json_schema` in API requests for guaranteed valid JSON.
|
||||
- **Chunked prefill** — add `--chunked-prefill-size 8192` to break long prefills into chunks, reducing time-to-first-token.
|
||||
|
||||
## Tuning parameters
|
||||
|
||||
| Parameter | Default | Agent workloads | Throughput workloads |
|-----------|---------|-----------------|---------------------|
| `--context-length` | 32768 | 32768-65536 | 8192-16384 |
| `--mem-fraction-static` | 0.85 | 0.80-0.85 | 0.85-0.88 |
| `--chunked-prefill-size` | off | 4096-8192 | 8192 |
| `--enable-metrics` | off | Optional | Recommended |
|
||||
|
||||
## Structured output example
|
||||
|
||||
```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<MODEL>",
    "messages": [{"role": "user", "content": "List three programming languages."}],
    "max_tokens": 512,
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "languages",
        "schema": {
          "type": "object",
          "properties": {
            "languages": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "name": {"type": "string"},
                  "primary_use": {"type": "string"}
                },
                "required": ["name", "primary_use"]
              }
            }
          },
          "required": ["languages"]
        }
      }
    }
  }'
```
|
||||
74
nvidia/station-ai-skills/assets/.codex/prompts/vllm-setup.md
Normal file
@ -0,0 +1,74 @@
|
||||
|
||||
# vLLM Setup on DGX Station
|
||||
|
||||
Deploy a vLLM inference server on DGX Station with validated configuration.
|
||||
|
||||
## Steps
|
||||
|
||||
1. **Find the GB300 GPU index.** Run:
|
||||
```bash
nvidia-smi --query-gpu=index,name --format=csv,noheader
```
|
||||
Identify the device index for the GB300 (typically device 1). Use this index for `--gpus` below. Do NOT use `--gpus all` — mixed coherency will cause CUDA failures.
|
||||
|
||||
2. **Ask the user which model to serve.** If they don't have a preference, suggest:
|
||||
- `nvidia/Qwen3-235B-A22B-NVFP4` — large MoE model, fits in 279 GB HBM
|
||||
- `meta-llama/Llama-3.1-70B-Instruct` — solid general-purpose model
|
||||
- `Qwen/Qwen3-8B` — small model for testing
|
||||
|
||||
3. **Check if the user has an HF_TOKEN.** Many models require HuggingFace authentication. The token must be passed inline with `-e HF_TOKEN="..."` — do not rely on shell export in background Docker tasks.
|
||||
|
||||
4. **Deploy the container.** Use this validated configuration:
|
||||
|
||||
```bash
docker pull nvcr.io/nvidia/vllm:26.01-py3

docker run -d \
  --name vllm-server \
  --gpus '"device=<GB300_INDEX>"' \
  --ipc host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e HF_TOKEN="<TOKEN>" \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve "<MODEL>" \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9
```
|
||||
|
||||
**Container version:** Use `nvcr.io/nvidia/vllm:26.01-py3`. Do NOT use 25.10 — it has a FlashInfer buffer overflow on DGX Station.
|
||||
|
||||
5. **Wait for the server to be ready.** Monitor logs:
|
||||
```bash
docker logs -f vllm-server
```
|
||||
Wait for the line indicating the server is listening on port 8000.
|
||||
|
||||
6. **Test the server:**
|
||||
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<MODEL>",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'
```
|
||||
|
||||
7. **Report the result** to the user, including:
|
||||
- Model loaded and serving on port 8000
|
||||
- GPU memory utilization
|
||||
- How to stop: `docker stop vllm-server && docker rm vllm-server`
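
The "fits in 279 GB HBM" note in step 2 can be checked with rough arithmetic before pulling a model. This is a back-of-the-envelope sketch, not a sizing tool: the `weights_gb` and `budget_gb` helpers are hypothetical, the estimate assumes 4 bits per parameter for an NVFP4 checkpoint and 16 bits for an unquantized one, and it ignores quantization scales, KV cache, and activations.

```bash
# Sketch: approximate weight footprint in GB for PARAMS_B billion parameters
# stored at BITS bits each. Usage: weights_gb <params_in_billions> <bits>
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Sketch: memory in GB available to the server when it may use FRACTION of
# HBM_GB (the --gpu-memory-utilization value). Usage: budget_gb <hbm_gb> <fraction>
budget_gb() {
  awk -v m="$1" -v f="$2" 'BEGIN { printf "%.1f\n", m * f }'
}

weights_gb 235 4    # prints 117.5
weights_gb 70 16    # prints 140.0
budget_gb 279 0.9   # prints 251.1
```

Whatever is left of the budget after the weights is what remains for KV cache, which is why a lower `--max-model-len` helps when memory is tight.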
|
||||
|
||||
## Tuning parameters
|
||||
|
||||
Adjust these based on the user's workload:
|
||||
|
||||
| Parameter | Default | Agent workloads | Throughput workloads |
|-----------|---------|-----------------|---------------------|
| `--max-model-len` | 32768 | 32768-65536 | 8192-16384 |
| `--gpu-memory-utilization` | 0.9 | 0.85-0.90 | 0.90-0.92 |
| `--enable-prefix-caching` | off | Enable (multi-turn reuse) | Enable |
| `--max-num-seqs` | default | 4-16 (lower latency) | 32+ (higher throughput) |
|
||||
81
nvidia/station-ai-skills/assets/AGENTS.md
Normal file
@ -0,0 +1,81 @@
# DGX Station Essential Constraints

This file gives your coding agent the critical constraints it needs to avoid breaking things on NVIDIA DGX Station. When you need a step-by-step workflow, invoke the bundled skills: `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose`. In Codex, install them into `$CODEX_HOME/skills` and mention them as `$vllm-setup` or plain text like "use vllm-setup"; in Claude Code or Gemini CLI, type `/<name>`; in Cursor, reference the rule by name.

## System architecture (quick reference)

- **GB300 GPU** — Blackwell Ultra (SM103), up to 279 GB HBM3e, 20 PFLOPS sparse FP4. This is the AI compute GPU.
- **Grace CPU** — 72-core ARM Neoverse V2, up to 496 GB LPDDR5x.
- **RTX PRO 6000** — Discrete display GPU (PCIe, non-coherent). For graphics only.
- **NVLink C2C** — Coherent CPU-GPU link. CPU + GPU memory = up to 775 GB total.
- The GB300 is typically device **1** and RTX PRO is device **0**. Always verify: `nvidia-smi --query-gpu=index,name --format=csv,noheader`
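
Because the device order is only "typically" 1 and 0, a script can look the index up instead of hard-coding it. The helper below is a minimal sketch: `find_gb300_index` is a hypothetical name, and it assumes the GB300's reported name contains the substring `GB300`, which should be confirmed against real `nvidia-smi` output on the machine.

```bash
# Print the index of the first GPU whose name contains "GB300".
# Reads the output of
#   nvidia-smi --query-gpu=index,name --format=csv,noheader
# on stdin (lines such as "1, <name>"). Exits non-zero if no line matches.
# Assumption: the GB300's name string includes "GB300".
find_gb300_index() {
  awk -F', *' '
    $2 ~ /GB300/ { print $1; found = 1; exit }
    END { if (!found) exit 1 }
  '
}

# Usage on the Station (not run here):
#   GB300_INDEX="$(nvidia-smi --query-gpu=index,name --format=csv,noheader | find_gb300_index)" \
#     || echo "GB300 not found" >&2
```

Reading from stdin keeps the parsing separate from the `nvidia-smi` call, so the function can be checked against saved output.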

## Critical constraint: mixed coherency

**CUDA cannot handle mixed-coherency GPUs in the same process.** The GB300 uses hardware-coherent memory (ATS) while the RTX PRO uses non-coherent memory (HMM via PCIe). A single CUDA context can use one or the other, not both.

**Never use `--gpus all`** — it will cause CUDA assert failures.
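
A wrapper script can enforce this rule before Docker is ever invoked. The following is a sketch only: `check_gpu_args` is a hypothetical helper that inspects the arguments you intend to pass to `docker run` and recognizes just the two spellings `--gpus all` and `--gpus=all`.

```bash
# Return 0 if the argument list is safe, 1 if it requests every GPU.
# Note: POSIX sh has no local variables, so _cg_prev and _cg_arg are globals.
check_gpu_args() {
  _cg_prev=""
  for _cg_arg in "$@"; do
    if [ "$_cg_arg" = "--gpus=all" ] ||
       { [ "$_cg_prev" = "--gpus" ] && [ "$_cg_arg" = "all" ]; }; then
      echo "refusing: --gpus all mixes coherent and non-coherent GPUs" >&2
      return 1
    fi
    _cg_prev="$_cg_arg"
  done
  return 0
}

# Usage (not run here):
#   check_gpu_args "$@" && docker run "$@"
```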

## GPU targeting

There are three ways to target the GB300:

**1. By device index** (most common):
```bash
export CUDA_VISIBLE_DEVICES=1          # bare metal
docker run --gpus '"device=1"' ...     # Docker
```

**2. By coherency modality:**
```bash
export CUDA_DEVICE_MODALITY=ATS        # GB300 (coherent)
export CUDA_DEVICE_MODALITY=NONATS     # RTX PRO (non-coherent)
```

**3. By driver application profiles** in `~/.nv/nvidia-application-profiles-rc`:
```json
{
  "rules": [
    { "pattern": { "feature": "cmdline", "matches": "my_app" }, "profile": "UseATSGpuInMixedCoherencySystems" }
  ]
}
```

## Display and graphics

- The GB300 does not support X display. Display runs on RTX PRO only.
- **Do not run `nvidia-xconfig -a`** — it generates an invalid config.
- If CUDA initializes before Vulkan in a process, it may bind to the GB300, causing `VK_ERROR_INITIALIZATION_FAILED`. Run CUDA and Vulkan in separate processes.

## Memory

- GB300 HBM is in the system memory pool (NUMA node 1). `malloc` may allocate there.
- Use `numactl --membind=0` for CPU-only processes that shouldn't touch GPU memory.
- CPU can cache accesses to GB300 memory, but GB300 cannot cache accesses to CPU memory.
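
The `numactl` advice can be wrapped so CPU-only jobs get the binding by default. This is a sketch under the assumption stated above that NUMA node 0 is CPU memory; `cpu_only` is a hypothetical helper name, and the fallback branch exists only so the wrapper still runs on machines without `numactl`.

```bash
# Run a CPU-only command with its memory bound to NUMA node 0.
# Falls back to running the command unbound, with a warning on stderr,
# when numactl is not installed.
cpu_only() {
  if command -v numactl >/dev/null 2>&1; then
    numactl --membind=0 -- "$@"
  else
    echo "warning: numactl not found; running without NUMA binding" >&2
    "$@"
  fi
}

# Usage (not run here):
#   cpu_only python3 preprocess.py
```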

## Software versions

| Component | Validated version | Notes |
|-----------|-------------------|-------|
| NVIDIA Driver | 590.48.01 | Check with `nvidia-smi` |
| CUDA (driver) | 13.1 | Containers bring their own runtime |
| vLLM container | `nvcr.io/nvidia/vllm:26.01-py3` | **Avoid 25.10** (FlashInfer buffer overflow) |
| SGLang container | `lmsysorg/sglang:latest-cu130` | cu130 required for SM103 |
| CUDA base image | `nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04` | For custom containers |
| Ubuntu | 24.04 | Preinstalled |

## Common pitfalls

| Symptom | Cause | Fix |
|---------|-------|-----|
| `--gpus all` CUDA assert failure | Mixed coherency | Use `--gpus '"device=N"'` for the GB300 |
| vLLM 25.10 FlashInfer crash | Known DGX Station bug | Use `vllm:26.01-py3` or newer |
| SGLang CUDA errors | Wrong CUDA for Blackwell | Use `sglang:latest-cu130` |
| Model runs on RTX PRO | Wrong device index | Verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader` |
| `nvidia-smi -mig 1` "In use" | GPU processes running | `sudo fuser -v /dev/nvidia*` |
| NVLink errors after disabling MIG | Fabric Manager stopped | `sudo systemctl start nvidia-fabricmanager` |
| `malloc` lands in GPU memory | HBM in system pool | `numactl --membind=0` |
| X crash after `nvidia-xconfig -a` | Invalid mixed-coherency config | Restore from `/etc/X11/xorg.conf.nvidia-xconfig-original` |
| Vulkan `VK_ERROR_INITIALIZATION_FAILED` | CUDA bound GB300 first | Separate CUDA and Vulkan into different processes |
| HuggingFace 401 | Missing HF_TOKEN | Pass inline: `-e HF_TOKEN="hf_..."` |
| Port conflict | Port already in use | `lsof -i :PORT`, use different port |
218
nvidia/station-ai-skills/assets/install.sh
Normal file
@ -0,0 +1,218 @@
#!/bin/sh
# install.sh — Install DGX Station AI Skills into a project for a chosen coding agent.
#
# Usage: ./install.sh <harness> [target-dir] [--force]
#   harness:    claude | codex | gemini | cursor | all
#   target-dir: where to install (default: current directory)
#   --force:    overwrite existing context files (AGENTS.md, CLAUDE.md, GEMINI.md)
#
# Layout produced per harness:
#   claude  -> CLAUDE.md + .claude/skills/<name>/SKILL.md
#   codex   -> AGENTS.md + $CODEX_HOME/skills/<name>/SKILL.md
#   gemini  -> GEMINI.md + .gemini/commands/<name>.md
#   cursor  -> AGENTS.md + .cursor/rules/<name>.mdc
#   all     -> all of the above

set -eu

usage() {
  cat <<EOF
Usage: $0 <harness> [target-dir] [--force]

Harnesses:
  claude   Claude Code       -> CLAUDE.md + .claude/skills/<name>/SKILL.md
  codex    OpenAI Codex CLI  -> AGENTS.md + \$CODEX_HOME/skills/<name>/SKILL.md
  gemini   Gemini CLI        -> GEMINI.md + .gemini/commands/<name>.md
  cursor   Cursor            -> AGENTS.md + .cursor/rules/<name>.mdc
  all      Install for all four

Options:
  --force  Overwrite existing context files instead of erroring
EOF
}

if [ $# -lt 1 ]; then
  usage >&2
  exit 2
fi

case "$1" in
  -h|--help) usage; exit 0 ;;
esac

HARNESS="$1"
shift

TARGET="."
FORCE=0
while [ $# -gt 0 ]; do
  case "$1" in
    --force) FORCE=1 ;;
    -h|--help) usage; exit 0 ;;
    *) TARGET="$1" ;;
  esac
  shift
done

case "$HARNESS" in
  claude|codex|gemini|cursor|all) ;;
  *) printf 'Error: unknown harness "%s"\n\n' "$HARNESS" >&2; usage >&2; exit 2 ;;
esac

ASSETS="$(cd "$(dirname "$0")" && pwd)"
SKILLS_DIR="$ASSETS/skills"
AGENTS_MD="$ASSETS/AGENTS.md"

if [ ! -f "$AGENTS_MD" ]; then
  printf 'Error: %s not found\n' "$AGENTS_MD" >&2
  exit 1
fi

if [ ! -d "$SKILLS_DIR" ]; then
  printf 'Error: %s not found\n' "$SKILLS_DIR" >&2
  exit 1
fi

mkdir -p "$TARGET"

SKILL_NAMES="vllm-setup sglang-setup mig-configure dgx-diagnose"

# write_context <target-filename>
# Copies AGENTS.md to <target-dir>/<target-filename>, refusing to overwrite without --force.
write_context() {
  fname="$1"
  dest="$TARGET/$fname"
  if [ -e "$dest" ] && [ "$FORCE" -ne 1 ]; then
    printf '  SKIP  %s (exists; pass --force to overwrite)\n' "$dest" >&2
    return 1
  fi
  cp "$AGENTS_MD" "$dest"
  printf '  WROTE %s\n' "$dest"
}

# strip_frontmatter <src> <dest>
# Emits the SKILL.md body (everything after the closing `---`) to <dest>.
# Note: POSIX sh has no local vars; use unique names to avoid clobbering callers.
strip_frontmatter() {
  _sf_src="$1"
  _sf_dest="$2"
  awk 'BEGIN { in_fm=0; past_fm=0 }
       past_fm == 1 { print; next }
       /^---$/ && in_fm == 0 { in_fm=1; next }
       /^---$/ && in_fm == 1 { past_fm=1; next }
       in_fm == 0 && past_fm == 0 { past_fm=1; print }' "$_sf_src" > "$_sf_dest"
}

# write_cursor_rule <src> <dest> <name> <description>
# Writes a Cursor .mdc rule: replaces Anthropic frontmatter with Cursor's shape, keeps the body.
write_cursor_rule() {
  _wc_src="$1"
  _wc_dest="$2"
  _wc_name="$3"
  _wc_desc="$4"
  {
    printf -- '---\n'
    printf 'description: %s\n' "$_wc_desc"
    printf 'globs: ["**/*"]\n'
    printf 'alwaysApply: false\n'
    printf -- '---\n\n'
  } > "$_wc_dest"
  strip_frontmatter "$_wc_src" "$_wc_dest.body"
  cat "$_wc_dest.body" >> "$_wc_dest"
  rm -f "$_wc_dest.body"
}

# extract_description <skill-name>
# Reads the description: line from the skill's SKILL.md frontmatter.
extract_description() {
  _ed_name="$1"
  awk '/^description: / { sub(/^description: /, ""); print; exit }' "$SKILLS_DIR/$_ed_name/SKILL.md"
}

install_claude() {
  printf 'Installing for Claude Code into %s/\n' "$TARGET"
  write_context "CLAUDE.md" || true
  for name in $SKILL_NAMES; do
    dest_dir="$TARGET/.claude/skills/$name"
    dest="$dest_dir/SKILL.md"
    mkdir -p "$dest_dir"
    if [ -e "$dest" ]; then
      printf '  SKIP  %s (exists)\n' "$dest" >&2
      continue
    fi
    cp "$SKILLS_DIR/$name/SKILL.md" "$dest"
    printf '  WROTE %s\n' "$dest"
  done
  printf 'Next: cd %s && claude   (type "/" to see vllm-setup, sglang-setup, mig-configure, dgx-diagnose)\n' "$TARGET"
}

install_codex() {
  printf 'Installing for OpenAI Codex CLI into %s/\n' "$TARGET"
  write_context "AGENTS.md" || true
  codex_home="${CODEX_HOME:-$HOME/.codex}"
  codex_skills="$codex_home/skills"
  mkdir -p "$codex_skills"
  for name in $SKILL_NAMES; do
    dest_dir="$codex_skills/$name"
    dest="$dest_dir/SKILL.md"
    if [ -e "$dest" ] && [ "$FORCE" -ne 1 ]; then
      printf '  SKIP  %s (exists)\n' "$dest" >&2
      continue
    fi
    mkdir -p "$dest_dir"
    cp -R "$SKILLS_DIR/$name/." "$dest_dir/"
    printf '  WROTE %s\n' "$dest_dir"
  done
  printf 'Next: cd %s && codex   (mention $vllm-setup or "use vllm-setup"; restart Codex if it was already running)\n' "$TARGET"
}

install_gemini() {
  printf 'Installing for Gemini CLI into %s/\n' "$TARGET"
  write_context "GEMINI.md" || true
  mkdir -p "$TARGET/.gemini/commands"
  for name in $SKILL_NAMES; do
    dest="$TARGET/.gemini/commands/$name.md"
    if [ -e "$dest" ]; then
      printf '  SKIP  %s (exists)\n' "$dest" >&2
      continue
    fi
    strip_frontmatter "$SKILLS_DIR/$name/SKILL.md" "$dest"
    printf '  WROTE %s\n' "$dest"
  done
  printf 'Next: cd %s && gemini   (type /<name> to invoke a skill)\n' "$TARGET"
}

install_cursor() {
  printf 'Installing for Cursor into %s/\n' "$TARGET"
  write_context "AGENTS.md" || true
  mkdir -p "$TARGET/.cursor/rules"
  for name in $SKILL_NAMES; do
    dest="$TARGET/.cursor/rules/$name.mdc"
    if [ -e "$dest" ]; then
      printf '  SKIP  %s (exists)\n' "$dest" >&2
      continue
    fi
    desc="$(extract_description "$name")"
    write_cursor_rule "$SKILLS_DIR/$name/SKILL.md" "$dest" "$name" "$desc"
    printf '  WROTE %s\n' "$dest"
  done
  printf 'Next: open %s in Cursor   (reference rules by name in chat, e.g. "use the vllm-setup rule")\n' "$TARGET"
}

case "$HARNESS" in
  claude) install_claude ;;
  codex)  install_codex ;;
  gemini) install_gemini ;;
  cursor) install_cursor ;;
  all)
    install_claude
    printf '\n'
    install_codex
    printf '\n'
    install_gemini
    printf '\n'
    install_cursor
    ;;
esac

printf '\nDone.\n'
92
nvidia/station-ai-skills/assets/skills/dgx-diagnose/SKILL.md
Normal file
@ -0,0 +1,92 @@
---
name: dgx-diagnose
description: Diagnose common DGX Station GB300 issues — CUDA crashes, wrong-GPU targeting, vLLM/SGLang container bugs, MIG state problems, NVLink/Fabric Manager errors, X/Vulkan failures, HuggingFace auth, and port conflicts. Use when the user reports a GPU error, inference server crash, MIG problem, or any unexplained DGX Station failure.
metadata:
  publisher: nvidia
  hardware: DGX Station GB300
---

# DGX Station Diagnostics

Diagnose common DGX Station issues. Run through the checks below to identify the problem.

## Step 1. Gather system state

Run these commands and analyze the output:

```bash
# GPU status
nvidia-smi

# GPU device list with indices
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader

# Driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1

# MIG state
nvidia-smi -i 1 -q 2>/dev/null | grep -i "MIG Mode" || echo "Could not query MIG on device 1"

# Fabric Manager
systemctl is-active nvidia-fabricmanager

# GPU processes
sudo fuser -v /dev/nvidia* 2>/dev/null || echo "No GPU processes found"

# Docker containers using GPUs
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}" 2>/dev/null
```

## Step 2. Match symptoms to known issues

Based on the gathered state and the user's reported problem, check for these known issues:

### CUDA crashes with `--gpus all`
**Cause:** Mixed coherency — GB300 (ATS) and RTX PRO (non-ATS) cannot share a CUDA context.
**Fix:** Use `--gpus '"device=N"'` targeting only the GB300.

### Model running on wrong GPU (RTX PRO instead of GB300)
**Check:** The device index in the docker command vs actual GPU indices.
**Fix:** Verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader` and correct the `--gpus` flag.

### vLLM crash / FlashInfer buffer overflow
**Check:** Container version — `docker inspect vllm-server | grep Image`
**Fix:** Use `nvcr.io/nvidia/vllm:26.01-py3`. Version 25.10 has a known FlashInfer bug on DGX Station.

### SGLang CUDA errors
**Check:** Container tag — must be `cu130` for Blackwell SM103.
**Fix:** Use `lmsysorg/sglang:latest-cu130`.

### CUDA OOM despite 279 GB HBM
**Check:** `--max-model-len` / `--context-length` and memory utilization settings.
**Fix:** Reduce context length or lower `--gpu-memory-utilization` / `--mem-fraction-static`.

### `nvidia-smi -mig 1` returns "In use by another client"
**Check:** `sudo fuser -v /dev/nvidia*` — GPU processes must be stopped first.
**Fix:** Stop all GPU workloads, then retry.

### NVLink errors after disabling MIG
**Check:** `systemctl is-active nvidia-fabricmanager`
**Fix:** `sudo systemctl start nvidia-fabricmanager`

### X server crash after `nvidia-xconfig -a`
**Fix:** `sudo cp /etc/X11/xorg.conf.nvidia-xconfig-original /etc/X11/xorg.conf`

### Vulkan `VK_ERROR_INITIALIZATION_FAILED`
**Cause:** CUDA initialized before Vulkan, binding to GB300.
**Fix:** Run CUDA and Vulkan workloads in separate processes. For Vulkan apps: `__GL_DeviceModalityPreference=2 ./your_app`

### HuggingFace 401 / token errors
**Fix:** Pass token inline: `-e HF_TOKEN="hf_..."`. Don't rely on shell export for background Docker tasks.

### Port already in use
**Check:** `lsof -i :<PORT>`
**Fix:** Stop the conflicting process or use a different host port: `-p 8001:8000`.
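
The vLLM version check above can be reduced to a string test on the image reference. The function below is a sketch: `is_known_bad_vllm_image` is a hypothetical name, and the match on `vllm:25.10` assumes the 25.10 container follows the same tag scheme as `vllm:26.01-py3`.

```bash
# Exit 0 when the image reference points at the vLLM 25.10 container
# flagged above, exit 1 for anything else.
is_known_bad_vllm_image() {
  case "$1" in
    *vllm:25.10*) return 0 ;;
    *)            return 1 ;;
  esac
}

# Usage (not run here):
#   image="$(docker inspect --format '{{.Config.Image}}' vllm-server)"
#   if is_known_bad_vllm_image "$image"; then
#     echo "vllm-server runs $image; switch to nvcr.io/nvidia/vllm:26.01-py3" >&2
#   fi
```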

## Step 3. Report findings

Tell the user:
1. What the issue is
2. Why it happens (root cause)
3. The specific command to fix it
4. How to verify the fix worked
110
nvidia/station-ai-skills/assets/skills/mig-configure/SKILL.md
Normal file
@ -0,0 +1,110 @@
---
name: mig-configure
description: Configure NVIDIA MIG (Multi-Instance GPU) partitions on the DGX Station GB300, including enabling MIG mode, choosing a profile layout, creating instances, and retrieving MIG UUIDs. Use when the user asks to partition the GB300, set up MIG, run multiple models in isolation on one GPU, or reconfigure existing MIG instances.
metadata:
  publisher: nvidia
  hardware: DGX Station GB300
---

# MIG Configuration on DGX Station

Configure MIG (Multi-Instance GPU) partitions on the DGX Station GB300.

## Steps

1. **Find the GB300 GPU index.** Run:
   ```bash
   nvidia-smi --query-gpu=index,name --format=csv,noheader
   ```

2. **Check current MIG state:**
   ```bash
   nvidia-smi -i <GB300_INDEX> -q | grep -i "MIG Mode"
   ```

3. **If MIG is already enabled, show current instances:**
   ```bash
   nvidia-smi mig -lgi -i <GB300_INDEX>
   nvidia-smi mig -lci -i <GB300_INDEX>
   ```
   If the user wants to reconfigure, destroy existing instances first (step 6).

4. **If MIG is not enabled, enable it.** All GPU processes must be stopped first:
   ```bash
   # Check for running GPU processes
   sudo fuser -v /dev/nvidia*

   # Enable MIG
   sudo nvidia-smi -i <GB300_INDEX> -mig 1

   # Verify
   nvidia-smi -i <GB300_INDEX> -q | grep -i "MIG Mode"
   ```

5. **Show available profiles and help the user choose a layout:**
   ```bash
   nvidia-smi mig -lgip -i <GB300_INDEX>
   ```

   Common GB300 MIG profiles:

   | Profile | ID | Memory | Use case |
   |---------|----|--------|----------|
   | 1g.35gb | 19 | ~35 GB | Small models (7-8B), dev/test |
   | 1g.35gb+me | 20 | ~35 GB | Same + media extensions |
   | 1g.70gb | 15 | ~70 GB | Slightly larger inference |
   | 2g.70gb | 14 | ~70 GB | Medium models (14-30B) |
   | 3g.139gb | 9 | ~139 GB | Large models (70B quantized) |
   | 4g.139gb | 5 | ~139 GB | Large models, more compute |
   | 7g.278gb | 0 | ~278 GB | Full GPU as single instance |

   Suggest layouts based on the user's workload. Examples:
   - **Two models (70B + 8B):** `3g.139gb + 2g.70gb + 2g.70gb` → IDs `9,14,14`
   - **Many small models:** `7 × 1g.35gb` → IDs `19,19,19,19,19,19,19`
   - **One large model with isolation:** `7g.278gb` → ID `0`

   Ask the user what models they want to run before suggesting a layout.

6. **Create (or recreate) instances:**

   If reconfiguring, destroy existing instances first:
   ```bash
   sudo nvidia-smi mig -dci -i <GB300_INDEX>
   sudo nvidia-smi mig -dgi -i <GB300_INDEX>
   ```

   Then create the new layout:
   ```bash
   sudo nvidia-smi mig -cgi <PROFILE_IDS> -C -i <GB300_INDEX>
   ```

7. **Get the MIG device UUIDs:**
   ```bash
   nvidia-smi -L
   ```
   Note the `MIG-<uuid>` entries — these are used to target specific MIG instances.

8. **Show the user how to use MIG devices:**
   ```bash
   # Bare metal
   export CUDA_VISIBLE_DEVICES=MIG-<uuid>

   # Docker
   docker run --gpus '"device=MIG-<uuid>"' ...
   ```

9. **Report the final layout** to the user with UUIDs and suggested docker commands for each instance.
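
Before passing `<PROFILE_IDS>` to `nvidia-smi mig -cgi` in step 6, a layout can be sanity-checked by adding up compute slices. The sketch below uses hypothetical helper names and reads the slice count from the `Ng` prefix of each profile name in the step 5 table. It assumes a budget of 7 compute slices (inferred from `7g.278gb` being the full GPU) and does not check memory, so a layout that passes here can still be rejected by the driver.

```bash
# Map a MIG profile ID from the step 5 table to its compute-slice count.
# IDs that are not in that table are rejected with a non-zero exit.
mig_slices_for_id() {
  case "$1" in
    19|20|15) echo 1 ;;   # 1g.35gb, 1g.35gb+me, 1g.70gb
    14)       echo 2 ;;   # 2g.70gb
    9)        echo 3 ;;   # 3g.139gb
    5)        echo 4 ;;   # 4g.139gb
    0)        echo 7 ;;   # 7g.278gb
    *)        return 1 ;;
  esac
}

# Print the total compute slices for a comma-separated ID list, e.g. "9,14,14".
# Note: POSIX sh has no local variables, so the _ml_* names are globals.
mig_layout_slices() {
  _ml_total=0
  _ml_old_ifs="$IFS"
  IFS=','
  for _ml_id in $1; do
    _ml_n="$(mig_slices_for_id "$_ml_id")" || {
      IFS="$_ml_old_ifs"
      echo "unknown profile ID: $_ml_id" >&2
      return 1
    }
    _ml_total=$((_ml_total + _ml_n))
  done
  IFS="$_ml_old_ifs"
  echo "$_ml_total"
}

# Usage:
#   [ "$(mig_layout_slices 9,14,14)" -le 7 ] && echo "layout fits the compute budget"
```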

## Disabling MIG

If the user wants to return to full-GPU mode:

```bash
# Stop all workloads using MIG instances first
sudo nvidia-smi mig -dci -i <GB300_INDEX>
sudo nvidia-smi mig -dgi -i <GB300_INDEX>
sudo nvidia-smi -i <GB300_INDEX> -mig 0

# Ensure Fabric Manager is running for NVLink re-initialization
sudo systemctl start nvidia-fabricmanager
```
122
nvidia/station-ai-skills/assets/skills/sglang-setup/SKILL.md
Normal file
@ -0,0 +1,122 @@
---
name: sglang-setup
description: Deploy an SGLang inference server on an NVIDIA DGX Station GB300 with the cu130 container, RadixAttention prefix caching, and structured JSON output support. Use when the user asks to serve a model with SGLang, start an SGLang endpoint, or needs structured-output inference on DGX Station.
metadata:
  publisher: nvidia
  hardware: DGX Station GB300
---

# SGLang Setup on DGX Station

Deploy an SGLang inference server on DGX Station with validated configuration.

## Steps

1. **Find the GB300 GPU index.** Run:
   ```bash
   nvidia-smi --query-gpu=index,name --format=csv,noheader
   ```
   Identify the device index for the GB300 (typically device 1). Use this index for `--gpus` below. Do NOT use `--gpus all` — mixed coherency will cause CUDA failures.

2. **Ask the user which model to serve.** If they don't have a preference, suggest:
   - `Qwen/Qwen3-8B` — small, fast, good for testing
   - `Qwen/Qwen3-32B` — medium, good balance
   - `meta-llama/Llama-3.1-70B-Instruct` — large general-purpose

3. **Check if the user has an HF_TOKEN.** Pass inline with `-e HF_TOKEN="..."`.

4. **Deploy the container.** Use this validated configuration:

   ```bash
   docker pull lmsysorg/sglang:latest-cu130

   docker run -d \
     --name sglang-server \
     --gpus '"device=<GB300_INDEX>"' \
     --ipc host \
     --ulimit memlock=-1 \
     --ulimit stack=67108864 \
     -p 30000:30000 \
     -e HF_TOKEN="<TOKEN>" \
     -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
     lmsysorg/sglang:latest-cu130 \
     sglang serve --model-path "<MODEL>" \
       --host 0.0.0.0 \
       --port 30000 \
       --context-length 32768 \
       --mem-fraction-static 0.85
   ```

   **Container version:** Use `lmsysorg/sglang:latest-cu130`. The `cu130` tag is required for Blackwell SM103 support.

   **First launch** downloads the model and compiles kernels. This takes extra time — subsequent starts are faster.

5. **Wait for the server to be ready.** Monitor logs:
   ```bash
   docker logs -f sglang-server
   ```

6. **Test the server:**
   ```bash
   curl http://localhost:30000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "<MODEL>",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 64
     }'
   ```

7. **Report the result** to the user, including:
   - Model loaded and serving on port 30000
   - How to stop: `docker stop sglang-server && docker rm sglang-server`
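
Step 5 relies on watching logs by hand. For unattended use, the wait can be scripted as a retry loop around any probe command. This is a sketch: `wait_for_ready` and the `WAIT_INTERVAL` variable are hypothetical names, and the `/v1/models` probe in the usage comment assumes the server exposes that OpenAI-compatible route on port 30000.

```bash
# Retry a probe command until it succeeds or the attempts run out.
#   $1      = maximum number of attempts
#   $2...   = probe command and its arguments
# WAIT_INTERVAL sets the pause between attempts in seconds (default 5).
# Note: POSIX sh has no local variables, so the _wr_* names are globals.
wait_for_ready() {
  _wr_attempts="$1"
  shift
  _wr_i=0
  while [ "$_wr_i" -lt "$_wr_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    _wr_i=$((_wr_i + 1))
    sleep "${WAIT_INTERVAL:-5}"
  done
  return 1
}

# Usage (not run here): allow up to 10 minutes for download and kernel compile.
#   wait_for_ready 120 curl -fsS http://localhost:30000/v1/models \
#     || docker logs --tail 50 sglang-server
```

Taking the probe as arguments keeps the loop independent of the endpoint, so the same function works for the vLLM server on port 8000.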

## Key features

- **RadixAttention** — automatic KV cache reuse across requests sharing prefixes. On by default, no flag needed. Verify with: `docker logs sglang-server 2>&1 | grep "cached-token" | tail -5`
- **Structured JSON output** — use `response_format.json_schema` in API requests for guaranteed valid JSON.
- **Chunked prefill** — add `--chunked-prefill-size 8192` to break long prefills into chunks, reducing time-to-first-token.

## Tuning parameters

| Parameter | Default | Agent workloads | Throughput workloads |
|-----------|---------|-----------------|---------------------|
| `--context-length` | 32768 | 32768-65536 | 8192-16384 |
| `--mem-fraction-static` | 0.85 | 0.80-0.85 | 0.85-0.88 |
| `--chunked-prefill-size` | off | 4096-8192 | 8192 |
| `--enable-metrics` | off | Optional | Recommended |

## Structured output example

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<MODEL>",
    "messages": [{"role": "user", "content": "List three programming languages."}],
    "max_tokens": 512,
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "languages",
        "schema": {
          "type": "object",
          "properties": {
            "languages": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "name": {"type": "string"},
                  "primary_use": {"type": "string"}
                },
                "required": ["name", "primary_use"]
              }
            }
          },
          "required": ["languages"]
        }
      }
    }
  }'
```
81
nvidia/station-ai-skills/assets/skills/vllm-setup/SKILL.md
Normal file
@ -0,0 +1,81 @@
---
name: vllm-setup
description: Deploy a vLLM inference server on an NVIDIA DGX Station GB300 with validated container, GPU targeting, and tuning parameters. Use when the user asks to serve a model with vLLM, start a vLLM endpoint, or set up OpenAI-compatible inference on DGX Station.
metadata:
  publisher: nvidia
  hardware: DGX Station GB300
---

# vLLM Setup on DGX Station

Deploy a vLLM inference server on DGX Station with validated configuration.

## Steps

1. **Find the GB300 GPU index.** Run:
   ```bash
   nvidia-smi --query-gpu=index,name --format=csv,noheader
   ```
   Identify the device index for the GB300 (typically device 1). Use this index for `--gpus` below. Do NOT use `--gpus all` — mixed coherency will cause CUDA failures.

2. **Ask the user which model to serve.** If they don't have a preference, suggest:
   - `nvidia/Qwen3-235B-A22B-NVFP4` — large MoE model, fits in 279 GB HBM
   - `meta-llama/Llama-3.1-70B-Instruct` — solid general-purpose model
   - `Qwen/Qwen3-8B` — small model for testing

3. **Check if the user has an HF_TOKEN.** Many models require HuggingFace authentication. The token must be passed inline with `-e HF_TOKEN="..."` — do not rely on shell export in background Docker tasks.

4. **Deploy the container.** Use this validated configuration:

   ```bash
   docker pull nvcr.io/nvidia/vllm:26.01-py3

   docker run -d \
     --name vllm-server \
     --gpus '"device=<GB300_INDEX>"' \
     --ipc host \
     --ulimit memlock=-1 \
     --ulimit stack=67108864 \
     -p 8000:8000 \
     -e HF_TOKEN="<TOKEN>" \
     -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
     nvcr.io/nvidia/vllm:26.01-py3 \
     vllm serve "<MODEL>" \
       --max-model-len 32768 \
       --gpu-memory-utilization 0.9
   ```

   **Container version:** Use `nvcr.io/nvidia/vllm:26.01-py3`. Do NOT use 25.10 — it has a FlashInfer buffer overflow on DGX Station.

5. **Wait for the server to be ready.** Monitor logs:
   ```bash
   docker logs -f vllm-server
   ```
   Wait for the line indicating the server is listening on port 8000.

6. **Test the server:**
   ```bash
   curl http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "<MODEL>",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 64
     }'
   ```

7. **Report the result** to the user, including:
   - Model loaded and serving on port 8000
   - GPU memory utilization
   - How to stop: `docker stop vllm-server && docker rm vllm-server`

## Tuning parameters

Adjust these based on the user's workload:

| Parameter | Default | Agent workloads | Throughput workloads |
|-----------|---------|-----------------|---------------------|
| `--max-model-len` | 32768 | 32768-65536 | 8192-16384 |
| `--gpu-memory-utilization` | 0.9 | 0.85-0.90 | 0.90-0.92 |
| `--enable-prefix-caching` | off | Enable (multi-turn reuse) | Enable |
| `--max-num-seqs` | default | 4-16 (lower latency) | 32+ (higher throughput) |
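
The table can be turned into a small lookup so a launch script picks a consistent flag set per workload. This is a sketch: `vllm_flags_for` is a hypothetical helper, and the single values chosen inside each range (for example 16 versus 32 sequences) are illustrative picks, not validated settings.

```bash
# Print `vllm serve` flags for a workload class, using values taken from
# the ranges in the tuning table. Exits 2 on an unknown class.
vllm_flags_for() {
  case "$1" in
    agent)
      echo "--max-model-len 32768 --gpu-memory-utilization 0.90 --enable-prefix-caching --max-num-seqs 16"
      ;;
    throughput)
      echo "--max-model-len 16384 --gpu-memory-utilization 0.90 --enable-prefix-caching --max-num-seqs 32"
      ;;
    *)
      echo "usage: vllm_flags_for agent|throughput" >&2
      return 2
      ;;
  esac
}

# Usage (not run here): the unquoted expansion is deliberate so the flags
# split into separate arguments.
#   docker run ... nvcr.io/nvidia/vllm:26.01-py3 \
#     vllm serve "<MODEL>" $(vllm_flags_for agent)
```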
413
nvidia/station-ai-skills/endpoint-test.yaml
Normal file
@ -0,0 +1,413 @@
kind: Playbook
metadata:
name: station-ai-skills
displayName: DGX Station AI Skills for Coding Agents
shortDescription: Give your coding agent (Claude Code, Codex, Gemini CLI, Cursor) DGX Station expertise via an AGENTS.md and on-demand Agent Skills

publisher: nvidia
description: |
# REPLACE THIS WITH YOUR MODEL CARD
https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads

labelsV2:
- gpuType:playbook:gpu_type_station
- DGX Station
- GB300
- Blackwell
- AI Agents
- Agent Skills
- AGENTS.md
- Claude Code
- Codex
- Gemini CLI
- Cursor
- vLLM
- SGLang
- MIG
- Mixed Coherency

attributes:
- key: DURATION
value: 15 MIN

spec:
artifactName: station-ai-skills
nvcfFunctionId: None
attributes:

showUnavailableBanner: false
apiDocsUrl: None
termsOfUse: |

cta:
text: View on GitHub
url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-ai-skills/


tabs:
-
id: overview

label: Overview
content: |
# Basic idea

Modern coding agents — Claude Code, OpenAI Codex CLI, Gemini CLI, Cursor — all support two extension mechanisms: a project-level **context file** that's loaded into every conversation, and **on-demand procedural workflows** (called skills, prompts, commands, or rules depending on the harness). This playbook ships both for DGX Station:

- An **`AGENTS.md`** with the critical DGX Station constraints your agent should always know (mixed coherency, GPU targeting, common pitfalls). `AGENTS.md` is the cross-harness standard; an `install.sh` lays it down as `CLAUDE.md`, `GEMINI.md`, or `AGENTS.md` depending on the agent you use.
- **Four Agent Skills** — `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose` — authored once in the [Anthropic Agent Skills format](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview) and installed into the right per-harness location (`.claude/skills/`, `$CODEX_HOME/skills/`, `.gemini/commands/`, or `.cursor/rules/`).

This approach keeps your agent's context lean in every conversation while giving it deep procedural knowledge on demand, regardless of which agent you use.

## AGENTS.md vs Agent Skill — why split?

| | AGENTS.md | Agent Skill |
|---|---|---|
| **Loaded** | Every conversation, automatically | Only when invoked by name (or matched by description, in Claude) |
| **Best for** | Constraints, pitfalls, "never do X" rules | Step-by-step workflows, deployment procedures |
| **Context cost** | Consumed every time | Zero until invoked |

The DGX Station mixed-coherency constraint (`--gpus all` will crash) should be in every conversation. The full vLLM deployment procedure should not.

# What you'll accomplish

- Install the `AGENTS.md` and four Agent Skills into your project directory for your chosen agent (Claude Code, Codex, Gemini CLI, or Cursor).
- Verify the agent loads the constraints automatically and the skills on demand.
- Invoke `vllm-setup` to deploy a vLLM inference server with validated configuration.
- Invoke `sglang-setup` to deploy an SGLang inference server.
- Invoke `mig-configure` to partition the GB300 into MIG instances.
- Invoke `dgx-diagnose` to troubleshoot common DGX Station issues.

# What to know before starting

- Basic familiarity with one supported coding agent (running it, giving it prompts, using slash commands or rule references)
- General understanding of DGX Station (two GPUs, Docker-based workflows)
|
||||
# Prerequisites
|
||||
|
||||
- NVIDIA DGX Station with GB300
|
||||
- One of the supported coding agents installed:
|
||||
- **Claude Code:** `curl -fsSL https://claude.ai/install.sh | sh`
|
||||
- **OpenAI Codex CLI:** `npm i -g @openai/codex`
|
||||
- **Gemini CLI:** `npm i -g @google/gemini-cli`
|
||||
- **Cursor:** download from `https://cursor.com/`
|
||||
- A project directory where you do DGX Station work
|
||||
|
||||
# Ancillary files
|
||||
|
||||
- `assets/AGENTS.md` — canonical context file with critical constraints, GPU targeting, software versions, and common pitfalls. Cross-harness standard.
|
||||
- `assets/skills/vllm-setup/SKILL.md` — skill: deploy vLLM with validated configuration.
|
||||
- `assets/skills/sglang-setup/SKILL.md` — skill: deploy SGLang with validated configuration.
|
||||
- `assets/skills/mig-configure/SKILL.md` — skill: configure MIG partitions on the GB300.
|
||||
- `assets/skills/dgx-diagnose/SKILL.md` — skill: troubleshoot common DGX Station issues.
|
||||
- `assets/install.sh` — per-harness installer (`claude`, `codex`, `gemini`, `cursor`, or `all`).
|
||||
|
||||
# Time & risk
|
||||
|
||||
* **Duration:** 10-15 minutes
|
||||
* **Risk level:** Low — this playbook copies markdown files into your project directory
|
||||
* **Rollback:** Delete the context file (`AGENTS.md` / `CLAUDE.md` / `GEMINI.md`) and the harness-specific skill directory (`.claude/skills/`, `.codex/prompts/`, `.gemini/commands/`, or `.cursor/rules/`) from your project directory
|
||||
* **Last Updated:** 05/18/2026
|
||||
* Restructured as harness-agnostic Agent Skills (Claude Code, Codex, Gemini CLI, Cursor)
|
||||
|
||||
|
||||
|
||||
-
|
||||
id: instructions
|
||||
|
||||
label: Instructions
|
||||
content: |
|
||||
# Step 1. Install your coding agent
|
||||
|
||||
Pick whichever agent you prefer — the rest of this playbook works the same regardless. Install commands:
|
||||
|
||||
| Agent | Install |
|
||||
|-------|---------|
|
||||
| Claude Code | `curl -fsSL https://claude.ai/install.sh \| sh` |
|
||||
| OpenAI Codex CLI | `npm i -g @openai/codex` |
|
||||
| Gemini CLI | `npm i -g @google/gemini-cli` |
|
||||
| Cursor | Download from `https://cursor.com/` |
|
||||
|
||||
Verify with `claude --version`, `codex --version`, `gemini --version`, or by launching Cursor.
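To see at a glance which of the CLI agents are already on your `PATH`, a small loop around `command -v` is enough. This is only a sketch: `agent_installed` and `report_agents` are hypothetical helper names, and Cursor is left out because this playbook verifies it by launching the application.

```bash
# Illustrative helper (not part of the playbook): succeeds if a command is on PATH.
agent_installed() {
  command -v "$1" >/dev/null 2>&1
}

# Print one status line for each CLI agent named in this playbook.
report_agents() {
  local agent
  for agent in claude codex gemini; do
    if agent_installed "$agent"; then
      echo "$agent: found"
    else
      echo "$agent: not found"
    fi
  done
}
```

Paste the functions into your shell and run `report_agents`; any agent reported as `not found` still needs the install command from the table above.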
|
||||
|
||||
# Step 2. Install the skills into your project
|
||||
|
||||
Navigate to the project where you want DGX Station expertise, then run the installer with the harness you use:
|
||||
|
||||
```bash
|
||||
cd ~/your-project
|
||||
|
||||
# Pick one:
|
||||
/path/to/this/playbook/assets/install.sh claude
|
||||
/path/to/this/playbook/assets/install.sh codex
|
||||
/path/to/this/playbook/assets/install.sh gemini
|
||||
/path/to/this/playbook/assets/install.sh cursor
|
||||
|
||||
# Or install for all four at once:
|
||||
/path/to/this/playbook/assets/install.sh all
|
||||
```
|
||||
|
||||
If you downloaded the playbook as a zip, run the installer from the extracted directory and pass your project directory as the second argument:
|
||||
|
||||
```bash
|
||||
station-ai-skills/assets/install.sh claude ~/your-project
|
||||
```
|
||||
|
||||
The installer is additive for skill directories (won't clobber existing skills you've written) and refuses to overwrite an existing context file (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`) unless you pass `--force`.
|
||||
|
||||
**Resulting layout** (per harness):
|
||||
|
||||
```text
|
||||
your-project/
|
||||
AGENTS.md or CLAUDE.md or GEMINI.md # context file (named for your agent)
|
||||
.claude/skills/<name>/SKILL.md # claude
|
||||
.codex/prompts/<name>.md # codex
|
||||
.gemini/commands/<name>.md # gemini
|
||||
.cursor/rules/<name>.mdc # cursor
|
||||
```
|
||||
|
||||
Where `<name>` is each of `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose`.
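The layout above can be checked mechanically. The sketch below maps a harness and skill name to the path shown in the listing, then reports which of the four skill files are missing from a project. The function names are hypothetical and are not provided by `install.sh`.

```bash
# Illustrative helpers (hypothetical names, not part of install.sh).

# Print the path where a skill is expected for a given harness,
# following the layout shown above.
expected_skill_path() {
  local harness="$1" name="$2"
  case "$harness" in
    claude) echo ".claude/skills/$name/SKILL.md" ;;
    codex)  echo ".codex/prompts/$name.md" ;;
    gemini) echo ".gemini/commands/$name.md" ;;
    cursor) echo ".cursor/rules/$name.mdc" ;;
    *)      echo "unknown harness: $harness" >&2; return 1 ;;
  esac
}

# Print each expected skill file that is missing under a project directory.
missing_skills() {
  local harness="$1" project_dir="${2:-.}" name path
  for name in vllm-setup sglang-setup mig-configure dgx-diagnose; do
    path="$(expected_skill_path "$harness" "$name")" || return 1
    if [ ! -f "$project_dir/$path" ]; then
      echo "$path"
    fi
  done
}
```

For example, `missing_skills claude ~/your-project` prints nothing when all four skills are in place.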
|
||||
|
||||
> [!NOTE]
|
||||
> Every supported agent automatically reads the context file from the working directory at startup. Skills/prompts/rules in the harness-specific directory are discovered automatically — no additional configuration needed.
|
||||
|
||||
# Step 3. Verify the setup
|
||||
|
||||
Start your agent in the project directory and ask a question that requires constraint knowledge:
|
||||
|
||||
```text
|
||||
Can I use --gpus all to run my CUDA workload on DGX Station?
|
||||
```
|
||||
|
||||
The agent should immediately warn about the mixed-coherency constraint and recommend `--gpus '"device=N"'` targeting. If you don't get the warning, the context file isn't being loaded — see Troubleshooting.
|
||||
|
||||
Then verify the skills are discoverable:
|
||||
|
||||
| Agent | How to check |
|
||||
|-------|--------------|
|
||||
| Claude Code | Type `/` — `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose` should appear in the autocomplete |
|
||||
| Codex CLI | Type `/prompts:` — same four names appear |
|
||||
| Gemini CLI | Type `/` — same four names appear |
|
||||
| Cursor | Open the Rules panel — same four rules appear |
|
||||
|
||||
# Step 4. Use vllm-setup to deploy an inference server
|
||||
|
||||
Invoke the skill in your agent:
|
||||
|
||||
| Agent | Invocation |
|
||||
|-------|-----------|
|
||||
| Claude Code | `/vllm-setup` (slash command) or just describe the task ("deploy vllm with Qwen3-8B") |
|
||||
| Codex CLI | `/prompts:vllm-setup` |
|
||||
| Gemini CLI | `/vllm-setup` |
|
||||
| Cursor | In chat: "use the vllm-setup rule to deploy a vllm server" |
|
||||
|
||||
The agent will walk you through deploying a vLLM server with a validated container image, correct GPU targeting, and recommended parameters. It will check your GPU index, ask which model you want to serve, and generate the full `docker run` command.
|
||||
|
||||
# Step 5. Use sglang-setup to deploy SGLang
|
||||
|
||||
Use the same invocation pattern as in Step 4, substituting `sglang-setup`. The skill deploys SGLang with the `cu130` container, RadixAttention prefix caching, and structured JSON output support.
|
||||
|
||||
# Step 6. Use mig-configure to partition the GB300
|
||||
|
||||
Invoke `mig-configure` using the same pattern as in Step 4. The agent will query your current MIG state, show the available profiles, help you choose a layout for your workloads, and execute the partitioning commands.
|
||||
|
||||
# Step 7. Use dgx-diagnose to troubleshoot issues
|
||||
|
||||
If you encounter problems, invoke `dgx-diagnose`. The agent will check GPU status, driver version, running processes, MIG state, and Fabric Manager to identify the issue.
|
||||
|
||||
# Step 8. Customize
|
||||
|
||||
Both the `AGENTS.md` and the skills are plain markdown — extend them freely.
|
||||
|
||||
**Add project-specific constraints to `AGENTS.md`** (or your harness-specific context file):
|
||||
|
||||
```markdown
|
||||
## Project-specific
|
||||
|
||||
- Our production MIG layout is 3g.139gb + 2g.70gb + 2g.70gb
|
||||
- Always use port 8080 for inference (nginx proxy on 443)
|
||||
- Model weights are cached at /data/models, mount with -v /data/models:/root/.cache/huggingface/hub
|
||||
```
|
||||
|
||||
**Create new skills** by adding a directory and `SKILL.md` to `assets/skills/`, then re-run `install.sh`:
|
||||
|
||||
```bash
|
||||
mkdir -p assets/skills/run-benchmarks
|
||||
cat > assets/skills/run-benchmarks/SKILL.md << 'EOF'
|
||||
---
|
||||
name: run-benchmarks
|
||||
description: Run our standard inference benchmark suite against the running vLLM or SGLang server and compare against the baseline.
|
||||
---
|
||||
|
||||
# Run benchmarks
|
||||
|
||||
1. Check which inference server is running (vLLM on port 8000 or SGLang on port 30000)
|
||||
2. Run the appropriate benchmark script from ./benchmarks/
|
||||
3. Report throughput (tokens/sec), latency (TTFT, ITL), and memory utilization
|
||||
4. Compare against the baseline in ./benchmarks/baseline.json
|
||||
EOF
|
||||
```
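Before re-running the installer, it can be worth checking that a new skill's frontmatter is well formed. The sketch below reads simple single-line `key: value` entries from the frontmatter and checks that `description` is present and that `name` matches the directory name. The matching rule follows the example above and is an assumption, not something this playbook states as a requirement. `skill_field` and `check_skill` are hypothetical names.

```bash
# Illustrative helper (hypothetical name): print the value of a top-level
# key from the YAML frontmatter at the start of a SKILL.md file.
# Only handles simple single-line "key: value" entries.
skill_field() {
  local file="$1" key="$2"
  awk -v key="$key" '
    NR == 1 && $0 != "---" { exit }        # no frontmatter at all
    NR > 1 && $0 == "---"  { exit }        # end of frontmatter
    NR > 1 && index($0, key ":") == 1 {
      value = substr($0, length(key) + 2)
      sub(/^[ \t]+/, "", value)
      print value
      exit
    }
  ' "$file"
}

# Succeed only if the skill has a description and its name matches its directory.
check_skill() {
  local skill_dir="${1%/}" name
  name="$(skill_field "$skill_dir/SKILL.md" name)"
  [ "$name" = "$(basename "$skill_dir")" ] &&
    [ -n "$(skill_field "$skill_dir/SKILL.md" description)" ]
}
```

A call such as `check_skill assets/skills/run-benchmarks && echo ok` gives a quick pass/fail signal.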
|
||||
|
||||
> [!TIP]
|
||||
> Keep `AGENTS.md` focused on constraints and pitfalls (things that break). Put procedural workflows in skills (things you do step-by-step).
|
||||
|
||||
|
||||
|
||||
-
|
||||
id: troubleshooting
|
||||
|
||||
label: Troubleshooting
|
||||
content: |
|
||||
# Skills don't appear in autocomplete / aren't discoverable
|
||||
|
||||
Each agent discovers skills from a harness-specific directory in the current directory (or a parent). Check the right one:
|
||||
|
||||
| Agent | Expected location |
|
||||
|-------|-------------------|
|
||||
| Claude Code | `.claude/skills/<name>/SKILL.md` |
|
||||
| Codex CLI | `.codex/prompts/<name>.md` |
|
||||
| Gemini CLI | `.gemini/commands/<name>.md` |
|
||||
| Cursor | `.cursor/rules/<name>.mdc` |
|
||||
|
||||
```bash
|
||||
# Examples — check the directory for your agent
|
||||
ls -la .claude/skills/
|
||||
ls -la .codex/prompts/
|
||||
ls -la .gemini/commands/
|
||||
ls -la .cursor/rules/
|
||||
```
|
||||
|
||||
You should see entries for `vllm-setup`, `sglang-setup`, `mig-configure`, and `dgx-diagnose`.
|
||||
|
||||
**Check you're in the right directory:**
|
||||
|
||||
```bash
|
||||
pwd
|
||||
```
|
||||
|
||||
The agent must be started from the directory containing the harness directory, or a subdirectory of it.
|
||||
|
||||
# Context file not loaded
|
||||
|
||||
If the agent gives generic answers without DGX Station awareness, the context file isn't being picked up. The expected filename depends on the agent, so verify that the one for your agent exists:
|
||||
|
||||
| Agent | Expected filename |
|
||||
|-------|-------------------|
|
||||
| Claude Code | `CLAUDE.md` (also reads `AGENTS.md` as fallback) |
|
||||
| Codex CLI | `AGENTS.md` |
|
||||
| Gemini CLI | `GEMINI.md` |
|
||||
| Cursor | `AGENTS.md` |
|
||||
|
||||
```bash
|
||||
# Verify the file exists for your agent
|
||||
cat AGENTS.md | head -5
|
||||
cat CLAUDE.md | head -5
|
||||
cat GEMINI.md | head -5
|
||||
|
||||
# Restart the agent in the correct directory
|
||||
cd ~/your-project
|
||||
claude # or codex, gemini, etc.
|
||||
```
|
||||
|
||||
All four agents read the context file from the working directory (and parent directories up to the project root).
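The lookup described above can be approximated in the shell to confirm that a context file is reachable from where you start the agent. This is a sketch under stated assumptions: the filename mapping is taken from the table above, the walk goes all the way up to `/` rather than stopping at a project root, and both function names are hypothetical.

```bash
# Illustrative helpers (hypothetical names), based on the filename table above.

# Print the context filename expected for an agent.
context_file_for() {
  case "$1" in
    claude)        echo "CLAUDE.md" ;;
    gemini)        echo "GEMINI.md" ;;
    codex|cursor)  echo "AGENTS.md" ;;
    *)             echo "unknown agent: $1" >&2; return 1 ;;
  esac
}

# Walk from a starting directory up to / and print the first match found.
find_context_file() {
  local agent="$1" dir file
  file="$(context_file_for "$agent")" || return 1
  dir="$(cd "${2:-.}" && pwd)" || return 1
  while :; do
    if [ -f "$dir/$file" ]; then
      echo "$dir/$file"
      return 0
    fi
    if [ "$dir" = "/" ]; then
      return 1
    fi
    dir="$(dirname "$dir")"
  done
}
```

Running `find_context_file gemini .` from a subdirectory of your project prints the path of the `GEMINI.md` that would apply, or fails if none exists.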
|
||||
|
||||
# Skill gives outdated information
|
||||
|
||||
The skills contain validated container versions and parameters as of the publication date. If a newer container is available, edit the canonical source and re-install:
|
||||
|
||||
```bash
|
||||
nano /path/to/playbook/assets/skills/vllm-setup/SKILL.md
|
||||
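# From your project directory: install.sh skips skill files that already exist
# (even with --force), so remove the stale installed copies first
rm -rf .claude/skills/vllm-setup .codex/prompts/vllm-setup.md .gemini/commands/vllm-setup.md .cursor/rules/vllm-setup.mdc
/path/to/playbook/assets/install.sh all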
|
||||
```
|
||||
|
||||
Or edit the installed copy directly:
|
||||
|
||||
```bash
|
||||
# Claude Code
|
||||
nano .claude/skills/vllm-setup/SKILL.md
|
||||
# Codex
|
||||
nano .codex/prompts/vllm-setup.md
|
||||
# Gemini CLI
|
||||
nano .gemini/commands/vllm-setup.md
|
||||
# Cursor
|
||||
nano .cursor/rules/vllm-setup.mdc
|
||||
```
|
||||
|
||||
> [!TIP]
|
||||
> Skills are plain markdown — you can version them in git alongside your project code.
|
||||
|
||||
# "Both GPUs cannot be used" errors
|
||||
|
||||
This is the mixed-coherency constraint working as intended. If you see CUDA errors when using `--gpus all`, find the index of the GB300 and target that device explicitly:
|
||||
|
||||
```bash
|
||||
# Find the GB300 index
|
||||
nvidia-smi --query-gpu=index,name --format=csv,noheader
|
||||
|
||||
# Use device-specific targeting
|
||||
docker run --gpus '"device=1"' ...
|
||||
```
|
||||
|
||||
The `AGENTS.md` covers this constraint, but if you removed that section, add it back — it's the most important piece of DGX Station knowledge.
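The two commands above can be chained so the device index is not hard-coded. The sketch below assumes the GPU's name as reported by `nvidia-smi` contains the string `GB300`; check your own output before relying on that. `gb300_index` is a hypothetical helper name.

```bash
# Illustrative helper (hypothetical name): read "index, name" CSV lines on
# stdin, as printed by the nvidia-smi query above, and print the index of
# the first GPU whose name contains "GB300". Fails if no such line exists.
gb300_index() {
  awk -F', ' '$2 ~ /GB300/ { print $1; found = 1; exit } END { exit !found }'
}

# Usage sketch (needs a GPU, so it is left commented out here):
#   idx="$(nvidia-smi --query-gpu=index,name --format=csv,noheader | gb300_index)" &&
#     docker run --gpus "\"device=${idx}\"" ...
```

The nested quoting in the usage line is equivalent to the `'"device=1"'` form shown above, with the index substituted from the variable.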
|
||||
|
||||
# Skills conflict with existing project directory
|
||||
|
||||
If your project already has a `.claude/`, `.codex/`, `.gemini/`, or `.cursor/` directory with its own contents, `install.sh` is **additive** for skill directories — it adds the new skill files alongside whatever you already have and warns on collision rather than overwriting.
|
||||
|
||||
For context files (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`), the installer **refuses** to overwrite an existing file. Pass `--force` to override, or merge the new content manually:
|
||||
|
||||
```bash
|
||||
# See what would be written
|
||||
diff /path/to/playbook/assets/AGENTS.md ./AGENTS.md
|
||||
|
||||
# Force overwrite
|
||||
/path/to/playbook/assets/install.sh claude . --force
|
||||
```
|
||||
|
||||
# Installer reports "WROTE" for some files but "SKIP" for others
|
||||
|
||||
That's the safe-by-default behavior. The installer skips any file that already exists, prints a warning, and continues with the rest. To get a clean install, either:
|
||||
|
||||
1. Delete the existing files first: `rm -rf .claude/skills/{vllm-setup,sglang-setup,mig-configure,dgx-diagnose}`
|
||||
2. Or pass `--force` (only affects context files; skill files are still skipped if present)
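Before deleting anything, it can help to see which installed skills actually differ from the playbook source. The sketch below compares files byte for byte, which assumes the installed `SKILL.md` is a verbatim copy of the source. That is only plausible for the `.claude/skills/` layout, since the other harnesses use different filenames and possibly a converted format. `stale_claude_skills` is a hypothetical name.

```bash
# Illustrative helper (hypothetical name): print the names of skills whose
# installed Claude Code copy is missing or differs from the playbook source.
# Arguments: <playbook assets/skills directory> [project directory]
stale_claude_skills() {
  local src_root="$1" project_dir="${2:-.}" src name
  for src in "$src_root"/*/SKILL.md; do
    [ -f "$src" ] || continue
    name="$(basename "$(dirname "$src")")"
    # cmp -s is silent and returns non-zero when files differ or one is missing.
    if ! cmp -s "$src" "$project_dir/.claude/skills/$name/SKILL.md" 2>/dev/null; then
      echo "$name"
    fi
  done
}
```

For example, `stale_claude_skills /path/to/playbook/assets/skills .` lists only the skills that a re-install would need to replace.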
|
||||
|
||||
|
||||
|
||||
|
||||
resources:
|
||||
- name: Anthropic Agent Skills Overview
|
||||
url: https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview
|
||||
|
||||
|
||||
- name: AGENTS.md Standard
|
||||
url: https://agents.md/
|
||||
|
||||
|
||||
- name: Claude Code Documentation
|
||||
url: https://docs.anthropic.com/en/docs/claude-code
|
||||
|
||||
|
||||
- name: OpenAI Codex AGENTS.md Guide
|
||||
url: https://developers.openai.com/codex/guides/agents-md
|
||||
|
||||
|
||||
- name: Gemini CLI Custom Commands
|
||||
url: https://geminicli.com/docs/cli/custom-commands/
|
||||
|
||||
|
||||
- name: Cursor Rules Documentation
|
||||
url: https://docs.cursor.com/
|
||||
|
||||
|
||||
- name: vLLM Documentation
|
||||
url: https://docs.vllm.ai/en/latest/
|
||||
|
||||
|
||||
- name: SGLang Documentation
|
||||
url: https://docs.sglang.io/
|
||||
|
||||
|
||||
- name: MIG User Guide
|
||||
url: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
|
||||
|
||||
|
||||
2
nvidia/station-ai-skills/overview.md
Normal file
@ -0,0 +1,2 @@
|
||||
# REPLACE THIS WITH YOUR MODEL CARD
|
||||
https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads
|
||||
@ -1,4 +1,4 @@
|
||||
# Station Register to Brev
|
||||
# Register DGX Station to Brev
|
||||
|
||||
> Link your DGX Station to Brev for remote access and sharing
|
||||
|
||||
@ -27,7 +27,7 @@ You’ll register your DGX Station with Brev and it will be visible as a healthy
|
||||
While Brev automates the complex configuration, understanding a few key concepts when establishing the initial connection will be useful:
|
||||
|
||||
* **Terminal Basics**:
|
||||
* Familiarity with the command line to run a few simple setup commands
|
||||
* Familiarity with command-line use to run a few simple setup commands.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
@ -45,23 +45,28 @@ You will also need the following:
|
||||
* **Estimated time:** 5-10 minutes
|
||||
* **Risk level:** Low - Registration configures the Station for secure remote access without altering your existing workloads
|
||||
* **Rollback:** The Brev configuration can be removed through the UI and CLI
|
||||
* **Last Updated:** 05/29/2026
|
||||
* First Publication
|
||||
|
||||
## Instructions
|
||||
|
||||
## Step 1. Login to Brev
|
||||
## Step 1. Log in to Brev
|
||||
|
||||
Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you’re in the correct org (by clicking the org button on the top right hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation.
|
||||
Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you’re in the correct org (by clicking the org button on the top right-hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation.
|
||||
|
||||
Click the “Register Compute” button and follow the instructions in the pop-up window.
|
||||
|
||||
## Step 2. Complete Popup Instructions
|
||||
## Step 2. Complete Pop-up Instructions
|
||||
|
||||
* Install the Brev CLI
|
||||
* Configure your compute
|
||||
* Add a name for compute
|
||||
* To configure ssh, ensure the “Enable SSH access” toggle is on
|
||||
* To configure SSH, ensure the “Enable SSH access” toggle is on
|
||||
* Run the registration command
|
||||
|
||||
> [!IMPORTANT]
|
||||
> Run the Brev CLI install command **without `sudo`**. Prefixing the installer with `sudo` writes the `brev` binary into root's home directory, which is not on your user shell's `PATH` — the next command will fail with `brev: command not found`. Copy the install command from the pop-up and run it as your normal user.
|
||||
|
||||
## Step 3. Follow Registration Flow
|
||||
|
||||
In the CLI, you’ll be walked through registration. Go through the flow until registration is complete.
|
||||
@ -70,7 +75,7 @@ In the CLI, you’ll be walked through registration. Go through the flow until r
|
||||
|
||||
* Go to the [Brev UI](https://brev.nvidia.com)
|
||||
* Navigate to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute)
|
||||
* Confirm that the DGX Station appears as a registered node with an **Available** status
|
||||
* Confirm that the DGX Station appears as a registered node with a **Connected** status
|
||||
|
||||
## Step 5. Next Steps
|
||||
|
||||
@ -78,7 +83,14 @@ Your DGX Station is now integrated into Brev as a secure, remotely accessible GP
|
||||
|
||||
Now that your hardware is connected, you can:
|
||||
|
||||
* **Share Access Anywhere:** Access your machine from anywhere and share access with others through the Brev UI under [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
|
||||
* **Access your machine from anywhere:** Open the [Brev UI](https://brev.nvidia.com) and launch a session from [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
|
||||
* **Share access with others:** Invite teammates to your DGX Station from the Brev UI:
|
||||
* Go to the [Brev UI](https://brev.nvidia.com) and open [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
|
||||
* Find your DGX Station in the list and open the row's three-dot (⋯) menu.
|
||||
* Select **Share Access**.
|
||||
* Enter the email address of the person you want to share with.
|
||||
* Choose their role / permission level.
|
||||
* Confirm to send the invitation.
|
||||
|
||||
## Step 6. Cleanup
|
||||
|
||||
@ -93,12 +105,12 @@ brev deregister
|
||||
In the UI:
|
||||
* Go to the [Brev UI](https://brev.nvidia.com)
|
||||
* Navigate to the section listing “GPU Environments” and look under “Registered Compute”
|
||||
* Click the “Deregister” menu item on the device you wish to delete from Brev
|
||||
* Confirm your selection
|
||||
* Click the “Remove” menu item on the device you wish to delete from Brev.
|
||||
* Confirm your selection.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Symptom | Cause | Fix |
|
||||
|---------|-------|-----|
|
||||
| Your DGX Station is showing up in the wrong org | It was registered to the wrong org | Run `brev set <my-org>` and then redo the registration process |
|
||||
| Unable to `brev shell <name>` | Need to refresh | `brev refresh` |
|
||||
| Your DGX Station is showing up in the wrong org | It was registered to the wrong org | Run `brev set <my-org>` and then redo the registration process. |
|
||||
| Unable to `brev shell <name>` | Need to refresh | `brev refresh`. |
|
||||
|
||||
@ -1,7 +1,7 @@
|
||||
kind: Playbook
|
||||
metadata:
|
||||
name: station-brev
|
||||
displayName: Station Register to Brev
|
||||
displayName: Register DGX Station to Brev
|
||||
shortDescription: Link your DGX Station to Brev for remote access and sharing
|
||||
publisher: nvidia
|
||||
description: |
|
||||
@ -10,8 +10,7 @@ metadata:
|
||||
|
||||
labelsV2:
|
||||
- gpuType:playbook:gpu_type_station
|
||||
- DGX
|
||||
- Station
|
||||
- DGX Station
|
||||
- Brev
|
||||
|
||||
attributes:
|
||||
@ -53,7 +52,7 @@ spec:
|
||||
While Brev automates the complex configuration, understanding a few key concepts when establishing the initial connection will be useful:
|
||||
|
||||
* **Terminal Basics**:
|
||||
* Familiarity with the command line to run a few simple setup commands
|
||||
* Familiarity with command-line use to run a few simple setup commands.
|
||||
|
||||
# Prerequisites
|
||||
|
||||
@ -71,6 +70,8 @@ spec:
|
||||
* **Estimated time:** 5-10 minutes
|
||||
* **Risk level:** Low - Registration configures the Station for secure remote access without altering your existing workloads
|
||||
* **Rollback:** The Brev configuration can be removed through the UI and CLI
|
||||
* **Last Updated:** 05/29/2026
|
||||
* First Publication
|
||||
|
||||
|
||||
|
||||
@ -79,20 +80,23 @@ spec:
|
||||
|
||||
label: Instructions
|
||||
content: |
|
||||
# Step 1. Login to Brev
|
||||
# Step 1. Log in to Brev
|
||||
|
||||
Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you’re in the correct org (by clicking the org button on the top right hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation.
|
||||
Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you’re in the correct org (by clicking the org button on the top right-hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation.
|
||||
|
||||
Click the “Register Compute” button and follow the instructions in the pop-up window.
|
||||
|
||||
# Step 2. Complete Popup Instructions
|
||||
# Step 2. Complete Pop-up Instructions
|
||||
|
||||
* Install the Brev CLI
|
||||
* Configure your compute
|
||||
* Add a name for compute
|
||||
* To configure ssh, ensure the “Enable SSH access” toggle is on
|
||||
* To configure SSH, ensure the “Enable SSH access” toggle is on
|
||||
* Run the registration command
|
||||
|
||||
> [!IMPORTANT]
|
||||
> Run the Brev CLI install command **without `sudo`**. Prefixing the installer with `sudo` writes the `brev` binary into root's home directory, which is not on your user shell's `PATH` — the next command will fail with `brev: command not found`. Copy the install command from the pop-up and run it as your normal user.
|
||||
|
||||
# Step 3. Follow Registration Flow
|
||||
|
||||
In the CLI, you’ll be walked through registration. Go through the flow until registration is complete.
|
||||
@ -101,7 +105,7 @@ spec:
|
||||
|
||||
* Go to the [Brev UI](https://brev.nvidia.com)
|
||||
* Navigate to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute)
|
||||
* Confirm that the DGX Station appears as a registered node with an **Available** status
|
||||
* Confirm that the DGX Station appears as a registered node with a **Connected** status
|
||||
|
||||
# Step 5. Next Steps
|
||||
|
||||
@ -109,7 +113,14 @@ spec:
|
||||
|
||||
Now that your hardware is connected, you can:
|
||||
|
||||
* **Share Access Anywhere:** Access your machine from anywhere and share access with others through the Brev UI under [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
|
||||
* **Access your machine from anywhere:** Open the [Brev UI](https://brev.nvidia.com) and launch a session from [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
|
||||
* **Share access with others:** Invite teammates to your DGX Station from the Brev UI:
|
||||
* Go to the [Brev UI](https://brev.nvidia.com) and open [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
|
||||
* Find your DGX Station in the list and open the row's three-dot (⋯) menu.
|
||||
* Select **Share Access**.
|
||||
* Enter the email address of the person you want to share with.
|
||||
* Choose their role / permission level.
|
||||
* Confirm to send the invitation.
|
||||
|
||||
# Step 6. Cleanup
|
||||
|
||||
@ -124,8 +135,8 @@ spec:
|
||||
In the UI:
|
||||
* Go to the [Brev UI](https://brev.nvidia.com)
|
||||
* Navigate to the section listing “GPU Environments” and look under “Registered Compute”
|
||||
* Click the “Deregister” menu item on the device you wish to delete from Brev
|
||||
* Confirm your selection
|
||||
* Click the “Remove” menu item on the device you wish to delete from Brev.
|
||||
* Confirm your selection.
|
||||
|
||||
|
||||
|
||||
@ -136,8 +147,8 @@ spec:
|
||||
content: |
|
||||
| Symptom | Cause | Fix |
|
||||
|---------|-------|-----|
|
||||
| Your DGX Station is showing up in the wrong org | It was registered to the wrong org | Run `brev set <my-org>` and then redo the registration process |
|
||||
| Unable to `brev shell <name>` | Need to refresh | `brev refresh` |
|
||||
| Your DGX Station is showing up in the wrong org | It was registered to the wrong org | Run `brev set <my-org>` and then redo the registration process. |
|
||||
| Unable to `brev shell <name>` | Need to refresh | `brev refresh`. |
|
||||
|
||||
|
||||
|
||||
|
||||
@ -45,18 +45,16 @@ spec:
|
||||
content: |
|
||||
# Basic idea
|
||||
|
||||
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, midtraining (conversation format), supervised fine-tuning (SFT), and inference via CLI or web UI.
|
||||
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.
|
||||
|
||||
The project uses the PyTorch NGC container, FineWeb for pretraining, SmolTalk for SFT, and Weights & Biases for logging. The default speedrun configuration trains a 561M-parameter (d20) model suitable for learning and experimentation.
|
||||
The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision.
|
||||
|
||||
# What you'll accomplish
|
||||
|
||||
You will have a working nanochat setup that trains a small LLM and serves it for chat.
|
||||
|
||||
- **Environment:** Docker image with PyTorch and nanochat dependencies on your DGX Station.
|
||||
- **Training pipeline:** Tokenizer (65K BPE), pretraining (~11.2B tokens), midtraining, SFT, and automated report generation.
|
||||
- **Inference:** ChatGPT-style web UI and CLI to chat with the base, mid, or SFT checkpoints.
|
||||
- **Monitoring:** W&B dashboards and `nanochat/report.md` with metrics and samples.
|
||||
- **Environment:** Docker image with PyTorch NGC and nanochat dependencies on your DGX Station.
|
||||
- **Training pipeline:** BPE tokenizer (65K vocab), base model pretraining with FP8, SFT, and automated report generation.
|
||||
- **Inference:** ChatGPT-style web UI and CLI to chat with the base or SFT checkpoints.
|
||||
- **Monitoring:** W&B dashboards and `nanochat_cache/report/report.md` with metrics and samples.
|
||||
|
||||
# What to know before starting
|
||||
|
||||
@ -68,36 +66,58 @@ spec:
|
||||
|
||||
**Hardware:**
|
||||
|
||||
- NVIDIA DGX Station with GB300 Ultra Superchip.
|
||||
- Sufficient GPU memory for the chosen model (the GB300 Ultra provides ample memory for the d20 speedrun).
|
||||
- Adequate storage for cache (~24GB+ for FineWeb data and checkpoints).
|
||||
- NVIDIA DGX Station with GB300 Ultra Superchip (288GB VRAM).
|
||||
- Adequate storage for cache (~25GB+ for FineWeb data and checkpoints).
|
||||
|
||||
**Software:**
|
||||
|
||||
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi`
|
||||
- Network access to download datasets (Hugging Face, FineWeb) and container images.
|
||||
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`
|
||||
- Network access to download datasets (Hugging Face, FineWeb) and container images (nvcr.io)
|
||||
- [Weights & Biases](https://wandb.ai/) account and API key.
|
||||
- [Hugging Face](https://huggingface.co/docs/hub/en/security-tokens) token for evaluation datasets.
|
||||
|
||||
# Model architecture (d24)
|
||||
|
||||
```
|
||||
Layers: 24
|
||||
Attention Heads: 12
|
||||
Head Dimension: 128
|
||||
Context Length: 2048 tokens
|
||||
Vocabulary Size: 65,536 (2^16, trained BPE)
|
||||
Precision: FP8 (e4m3, tensorwise scaling)
|
||||
```
|
||||
|
||||
# Training stages
|
||||
|
||||
| Stage | Description |
|
||||
|-------|-------------|
|
||||
| Tokenizer | Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb |
|
||||
| Base pretraining | Pretrains d24 model on FineWeb with FP8, target data:param ratio of 8 |
|
||||
| SFT | Fine-tunes on synthetic identity conversations + SmolTalk |
|
||||
| Report | Generates `report.md` with metrics, samples, and system info |
|
||||
|
||||
# Ancillary files
|
||||
|
||||
All required assets are in the playbook directory `nvidia/station-nanochat/assets` (see the [dgx-spark-playbooks](https://github.com/NVIDIA/dgx-spark-playbooks) repository).
|
||||
|
||||
- `assets/Dockerfile` – PyTorch NGC image plus nanochat dependencies and venv.
|
||||
- `assets/setup.sh` – Clones nanochat, checks out the supported commit, and builds the Docker image.
|
||||
- `assets/launch.sh` – Runs the training container on your DGX Station (runs the full pipeline: tokenizer, pretrain, midtrain, SFT, and report generation).
|
||||
- `assets/README.md` – Additional detail on training stages, inference, and troubleshooting.
|
||||
All required assets are in `nvidia/station-nanochat/assets/`:
|
||||
|
||||
- `Dockerfile` – PyTorch NGC image with nanochat pip dependencies.
|
||||
- `setup.sh` – Clones nanochat, checks out the supported commit, copies `speedrun_station.sh`, and builds the Docker image.
|
||||
- `launch.sh` – Runs the training container (full pipeline: tokenizer → pretrain → SFT → report).
|
||||
- `speedrun_station.sh` – Modified speedrun script adapted for single-GPU DGX Station.
|
||||
|
||||
# Time & risk
|
||||
|
||||
- **Estimated time:** About 30 minutes for clone, setup, and launching the run. Full d20 speedrun training time depends on your DGX Station configuration (hours to a day or more).
|
||||
- **Risk level:** Medium
|
||||
- Large downloads (FineWeb) can fail or be slow; ensure stable network and disk space.
|
||||
- API keys (W&B, HF) must be set or the launch script will exit.
|
||||
- **Rollback:** Stop containers with `docker stop`, remove caches under `~/.cache/nanochat` (or paths in `launch.sh`), and run `docker system prune -a` if needed.
|
||||
* **Last Updated:** 03/02/2026
|
||||
* First Publication
|
||||
- **Estimated time:** ~30 minutes for setup. Full d24 training takes on the order of 16+ hours on a single GB300 Ultra.
|
||||
- **Risk level:** Medium
|
||||
- Large downloads (FineWeb) can be slow; ensure stable network and disk space.
|
||||
- API keys (W&B, HF) must be set or `launch.sh` will exit immediately.
|
||||
- **Rollback:** Stop containers with `docker stop`, remove caches, and run `docker system prune -a` if needed.
|
||||
|
||||
# Credits
|
||||
|
||||
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy
|
||||
- [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) by HuggingFace (pretraining data)
|
||||
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) by HuggingFace (SFT data)
|
||||
|
||||
|
||||
|
||||
@ -108,69 +128,86 @@ spec:
|
||||
content: |
|
||||
# Step 1. Prerequisites and environment
|
||||
|
||||
This playbook is for **DGX Station** (single node). Ensure your DGX Station has Docker with NVIDIA runtime, GPU access, and required API keys. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
|
||||
Ensure your DGX Station has Docker with NVIDIA runtime and GPU access. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
|
||||
|
||||
```bash
|
||||
# Verify GPU and Docker
|
||||
nvidia-smi
|
||||
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi
|
||||
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi
|
||||
```
|
||||
|
||||
Expected output should show your GPU(s) and driver version. Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you do not have them.
|
||||
Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you don't have them. Export both keys in your shell:
|
||||
|
||||
```bash
|
||||
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
|
||||
export HF_TOKEN=<YOUR_HF_TOKEN>
|
||||
```
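Before launching, it can help to confirm that both keys are really exported in the current shell. The sketch below is a minimal check; `require_env` is a hypothetical helper and is not part of `launch.sh` or `setup.sh`.

```bash
# Hypothetical helper (not shipped with the playbook): succeed only if every
# named variable is exported and non-empty in the current shell.
require_env() {
  missing=0
  for name in "$@"; do
    # printenv only sees exported variables, so a plain assignment
    # without `export` is reported as missing.
    if [ -z "$(printenv "$name")" ]; then
      echo "$name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A possible use is `require_env WANDB_API_KEY HF_TOKEN && ./launch.sh`, which skips the launch if either key is missing.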
|
||||
|
||||
# Step 2. Clone the playbook and set up nanochat
|
||||
# Step 2. Clone and set up
|
||||
|
||||
Clone the playbook repository and run the setup script to clone the nanochat repo and build the Docker image.
|
||||
Clone the playbook repository and navigate to the assets directory:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/NVIDIA/dgx-spark-playbooks
|
||||
cd dgx-spark-playbooks/nvidia/station-nanochat/assets
|
||||
```
|
||||
|
||||
From the `assets` directory, run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, and builds the `nanochat` Docker image (PyTorch NGC base with tiktoken, tokenizers, datasets, wandb, etc.).
|
||||
Run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, copies the station-adapted `speedrun_station.sh`, and builds the `nanochat` Docker image (PyTorch NGC base with dependencies):
|
||||
|
||||
```bash
|
||||
./setup.sh
|
||||
```
|
||||
|
||||
Setup may take several minutes while the image builds. Verify the image:
|
||||
You should see the `nanochat` image listed if you run `docker images`. Your directory structure after setup should look like this:
|
||||
|
||||
```bash
|
||||
docker images | grep nanochat
|
||||
```
|
||||
assets/
|
||||
├── Dockerfile
|
||||
├── launch.sh
|
||||
├── setup.sh
|
||||
├── speedrun_station.sh
|
||||
└── nanochat/
|
||||
```
|
||||
|
||||
You should see the `nanochat` image listed.
|
||||
# Step 3. Launch training
|
||||
|
||||
# Step 3. Launch full training
|
||||
|
||||
> [!NOTE]
|
||||
> The default `launch.sh` uses cache directories under `/nanochat_cache`. If that path does not exist on your DGX Station, edit `launch.sh` and replace those paths with your own (e.g. `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`), and create the directories before running.
|
||||
|
||||
To run **full** training (d20 model, 240 shards, midtraining, SFT, report) for higher-quality results, use the full launcher. On a DGX Station with GB300 Ultra this can take on the order of 16 hours:
|
||||
Ensure your API keys are exported, then launch:
|
||||
|
||||
```bash
|
||||
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
|
||||
export HF_TOKEN=<YOUR_HF_TOKEN>
|
||||
./launch_full.sh
|
||||
./launch.sh
|
||||
```
|
||||
|
||||
This runs `speedrun_full.sh` inside the container: full FineWeb download (240 shards), 561M-parameter (d20) pretraining, midtraining, supervised fine-tuning, and report generation.
|
||||
The training runs inside the `nanochat` container and executes the full pipeline automatically:
|
||||
|
||||
# Step 4. Verify and use the model
|
||||
1. **Tokenizer training** — downloads ~2B characters from FineWeb, trains a 65K BPE tokenizer
|
||||
2. **Base model pretraining** — downloads additional FineWeb shards, pretrains a d24 model (~1B params) with FP8
|
||||
3. **SFT** — downloads synthetic identity conversations, fine-tunes for chat
|
||||
4. **Report generation** — produces `report.md` with metrics and samples
|
||||
|
||||
After training completes, checkpoints and the tokenizer are under `~/.cache/nanochat/` (or the cache path used in `launch.sh`). Run inference from the nanochat directory (e.g. `assets/nanochat`) on your DGX Station.
|
||||
Training on a single GB300 Ultra takes on the order of 16+ hours for the full d24 run.
|
||||
|
||||
# Step 4. Monitor training
|
||||
|
||||
**W&B dashboard:**
|
||||
|
||||
Track training at [wandb.ai](https://wandb.ai/) under the `nanochat` project. The exact link to the W&B run is printed in the training logs. Key metrics:
|
||||
- Training loss
|
||||
- Validation BPB
|
||||
- Throughput (tokens/sec)
|
||||
|
||||
# Step 5. Inference
|
||||
|
||||
After training, checkpoints are saved under the `nanochat_cache/` directory. Run inference from inside the container or interactively:
|
||||
|
||||
**Web UI (recommended):**
|
||||
|
||||
```bash
|
||||
cd nanochat
|
||||
source ../.venv/bin/activate # if using venv from container context; otherwise use the container
|
||||
python -m scripts.chat_web
|
||||
docker run --rm --gpus all --net=host \
|
||||
-v $(pwd)/nanochat:/workspace/nanochat \
|
||||
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
|
||||
-w /workspace/nanochat \
|
||||
nanochat \
|
||||
python -m scripts.chat_web
|
||||
```
|
||||
|
||||
Open a browser to `http://<STATION_IP>:8000` where `<STATION_IP>` is your DGX Station’s IP address.
|
||||
@ -178,14 +215,15 @@ spec:
|
||||
**CLI:**
|
||||
|
||||
```bash
|
||||
cd nanochat
|
||||
python -m scripts.chat_cli -p "Why is the sky blue?"
|
||||
python -m scripts.chat_cli -i sft -p "Write a haiku about machine learning"
|
||||
docker run --rm -it --gpus all \
|
||||
-v $(pwd)/nanochat:/workspace/nanochat \
|
||||
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
|
||||
-w /workspace/nanochat \
|
||||
nanochat \
|
||||
python -m scripts.chat_cli -p "Why is the sky blue?"
|
||||
```
|
||||
|
||||
A full report is generated at `nanochat/report.md` after the run. You can also monitor training at [wandb.ai](https://wandb.ai/) under your project.
|
||||
|
||||
# Step 5. Cleanup
|
||||
# Step 6. Cleanup
|
||||
|
||||
To stop training early, interrupt the launch script or stop the container:
|
||||
|
||||
@ -195,23 +233,32 @@ spec:
|
||||
```bash
|
||||
# If launch.sh is running: press Ctrl+C
|
||||
|
||||
# Or stop the container by name
|
||||
# Or stop the container directly
|
||||
docker stop $(docker ps -q --filter ancestor=nanochat)
|
||||
```
|
||||
|
||||
To free disk space after training (use the same path as your cache if you set `NANOCHAT_CACHE`):
|
||||
To free disk space:
|
||||
|
||||
```bash
|
||||
rm -rf ./nanochat_cache ./hf_cache
|
||||
docker system prune -a
|
||||
```
|
||||
|
||||
# Step 6. Next steps and customization
|
||||
# Step 7. Customization
|
||||
|
||||
- **Small scale run:** To shorten training, edit `speedrun_station.sh` as described in the customization guide, then run `./launch.sh`.
|
||||
- **Custom cache paths:** Set `NANOCHAT_CACHE` and `HF_CACHE` before launching (e.g. `export NANOCHAT_CACHE=/path/to/nanochat_cache`) if you want cache outside the assets directory.
|
||||
- **Monitoring:** Use `nvidia-smi` and W&B dashboards to watch GPU utilization and training metrics (loss, throughput).
|
||||
- **Inference:** Try the web UI and CLI with different checkpoints (`base`, `mid`, `sft`) and prompts; see sample prompts in `assets/README.md`.
|
||||
**Smaller/faster run:** Edit `speedrun_station.sh` before running setup to reduce data and model size:
|
||||
|
||||
```bash
|
||||
# Fewer data shards (10 instead of default)
|
||||
python -m nanochat.dataset -n 10 &
|
||||
|
||||
# Smaller model (d4 instead of d24), smaller batch size
|
||||
python -m scripts.base_train --depth=4 --device-batch-size=32
|
||||
```
|
||||
|
||||
**Batch size:** The default `--device-batch-size=64` is tuned for the GB300's 288GB VRAM. Lower it if training runs out of memory, or raise it if GPU utilization is low.
|
||||
|
||||
Then re-run `./setup.sh` to rebuild with the changes.
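The batch-size edit can also be scripted. This is a minimal sketch, assuming the flag appears in the script in the `--device-batch-size=<number>` form shown above; `set_batch_size` is a hypothetical helper, not something the playbook provides.

```bash
# Hypothetical helper: rewrite the --device-batch-size value in a script.
# Assumes the flag is written as --device-batch-size=<number>.
set_batch_size() {
  file=$1
  size=$2
  # Keep a .bak copy, then write the edited text back into the original file
  # so its permissions (e.g. the executable bit) are preserved.
  cp "$file" "$file.bak" &&
    sed "s/--device-batch-size=[0-9][0-9]*/--device-batch-size=${size}/" "$file.bak" > "$file"
}
```

For example, `set_batch_size speedrun_station.sh 32` would halve the default; the previous version stays in `speedrun_station.sh.bak`.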
|
||||
|
||||
|
||||
|
||||
@ -221,14 +268,16 @@ spec:
|
||||
label: Troubleshooting
|
||||
content: |
|
||||
| Symptom | Cause | Fix |
|
||||
|--------|--------|-----|
|
||||
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<your_key>` and `export HF_TOKEN=<your_token>` in the same shell, then run `./launch.sh`. |
|
||||
| `RuntimeError: CUDA out of memory` | Batch size or model too large for GPU | In the training script in the cloned nanochat repo (e.g. `speedrun.sh`), reduce `--device_batch_size` (e.g. `16`, `8`, `4`, `2`, or `1`). |
|
||||
| Docker container not starting or no GPU | Docker or NVIDIA runtime misconfigured | Run `nvidia-smi` on your DGX Station. Check no other containers hold GPUs: `docker ps`. Test GPU in Docker: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`. |
|
||||
| `Permission denied` or `No such file or directory` for cache paths in `launch.sh` | Paths like `/home/scratch.lramesh_dpt/...` don’t exist on your system | Edit `launch.sh`: set cache dirs to paths you can create (e.g. `$(pwd)/nanochat_cache`, `$(pwd)/hf_cache`). Run `mkdir -p <your_cache_dirs>` and re-run `launch.sh`. |
|
||||
| `nanochat` image not found when running `launch.sh` | Setup not run or build failed | From `nvidia/nanochat/assets`, run `./setup.sh` and confirm with `docker images` (look for the `nanochat` image). |
|
||||
| Training exits immediately or script doesn’t wait | Container fails early (missing keys, paths, or OOM) | Check container logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars, cache paths, or batch size as above. |
|
||||
| Wrong cache path or "No such file" when launching | `launch.sh` uses non-existent paths (e.g. `/home/scratch...`) | On DGX Station, edit `launch.sh`: replace cache dirs with `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`, then run `mkdir -p nanochat_cache hf_cache`. |
|
||||
|---------|-------|-----|
|
||||
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<key>` and `export HF_TOKEN=<token>` in the same shell, then re-run `./launch.sh` |
|
||||
| `RuntimeError: CUDA out of memory` | Batch size too large for available VRAM | Edit `speedrun_station.sh`: reduce `--device-batch-size` from the default 64 (try 32, 16, 8). Re-run `./setup.sh` then `./launch.sh` |
|
||||
| Docker container exits immediately | Missing env vars, bad cache paths, or build failure | Check logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars or paths as needed |
|
||||
| `nanochat` image not found | Setup not run or Docker build failed | From the `assets/` directory, run `./setup.sh` and confirm with `docker images \| grep nanochat` |
|
||||
| `No such file or directory` for cache paths | Cache directories don't exist | `launch.sh` creates them automatically under `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`. If using custom paths, create them: `mkdir -p $NANOCHAT_CACHE $HF_CACHE` |
|
||||
| Training hangs at "Waiting for dataset download" | Network issue downloading FineWeb shards | Check network connectivity. The download can take time depending on bandwidth. If it persists, restart `./launch.sh` |
|
||||
| W&B shows wrong user / stale login | Cached W&B credentials in container volume | `speedrun_station.sh` runs `wandb login --relogin` with your key automatically. Ensure `WANDB_API_KEY` is correct |
|
||||
| Container runs but `launch.sh` says "Training complete!" immediately | Container failed fast and exited before the poll loop detected it | Check `docker ps -a` for the exited container and inspect logs with `docker logs <id>` |
|
||||
| GPU not visible inside container | Docker NVIDIA runtime not configured | Test: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`. If it fails, install/configure [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) |
|
||||
|
||||
|
||||
|
||||
|
||||
@ -1,8 +1,8 @@
|
||||
kind: Playbook
|
||||
metadata:
|
||||
name: station-vllm
|
||||
displayName: Serve Qwen3-235B with vLLM
|
||||
shortDescription: Set up vLLM server with Qwen3-235B on DGX Station
|
||||
displayName: vLLM for Inference
|
||||
shortDescription: Install and use vLLM on DGX Station
|
||||
publisher: nvidia
|
||||
description: |
|
||||
# REPLACE THIS WITH YOUR MODEL CARD
|
||||
@ -15,7 +15,7 @@ metadata:
|
||||
|
||||
attributes:
|
||||
- key: DURATION
|
||||
value: 20 MIN
|
||||
value: 30 MIN
|
||||
|
||||
spec:
|
||||
artifactName: station-vllm
|
||||
@ -42,7 +42,9 @@ spec:
|
||||
|
||||
# What you'll accomplish
|
||||
|
||||
Serve the **Qwen3-235B-A22B-NVFP4** model using vLLM on NVIDIA DGX Station. This 235B parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU.
|
||||
Serve a **supported model** using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models.
|
||||
|
||||
You'll set up vLLM high-throughput LLM serving on NVIDIA DGX Station with Blackwell architecture.
|
||||
|
||||
# What to know before starting
|
||||
|
||||
@ -57,21 +59,33 @@ spec:
|
||||
- HuggingFace account with access token
|
||||
- Network access to NGC and HuggingFace
|
||||
|
||||
# Model Support Matrix
|
||||
|
||||
The following models are supported with vLLM on DGX Station. All listed models are available and ready to use:
|
||||
|
||||
| Model | Quantization | Support Status | HF Handle |
|
||||
|-------|-------------|----------------|-----------|
|
||||
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
|
||||
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
|
||||
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
|
||||
| **Kimi-K2.5 (1T)** | NVFP4 | ✅ | [`nvidia/Kimi-K2.5-NVFP4`](https://huggingface.co/nvidia/Kimi-K2.5-NVFP4) |
|
||||
| **DeepSeek-V4-Flash** | NVFP4 | ✅ | [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) |
|
||||
|
||||
# Time & risk
|
||||
|
||||
* **Duration:** 15-20 minutes (longer on first run due to model download)
|
||||
* **Duration:** 30 minutes (longer on first run due to model download)
|
||||
* **Risks:** Model download requires HuggingFace authentication
|
||||
* **Rollback:** Stop and remove the container to restore state
|
||||
* **Last Updated:** 03/02/2026
|
||||
* First Publication
|
||||
* **Last Updated:** 05/29/2026
|
||||
* Update models
|
||||
* Add base configuration example, per-setting explanations, and DeepSeek-V4-Flash recipe
|
||||
|
||||
|
||||
|
||||
-
|
||||
id: instructions
|
||||
|
||||
label: Serve Qwen3-235B
|
||||
label: Instructions
|
||||
content: |
|
||||
# Step 1. Set up Docker permissions
|
||||
|
||||
@ -92,7 +106,7 @@ spec:
|
||||
export HF_TOKEN="your_huggingface_token"
|
||||
|
||||
# Model to serve
|
||||
export MODEL_HANDLE="nvidia/Qwen3-235B-A22B-NVFP4"
|
||||
export MODEL_HANDLE="<HF_HANDLE>"
|
||||
|
||||
# Maximum context length
|
||||
export MAX_MODEL_LEN=8192
|
||||
@ -106,9 +120,28 @@ spec:
|
||||
docker pull nvcr.io/nvidia/vllm:26.01-py3
|
||||
```
|
||||
|
||||
For Step-3.7-Flash models, pull the custom vLLM container:
|
||||
```bash
|
||||
docker pull vllm/vllm-openai:stepfun37
|
||||
```
|
||||
|
||||
For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, pull the **26.03** image, which includes the `--cpu-offload-params` support used below:
|
||||
```bash
|
||||
docker pull nvcr.io/nvidia/vllm:26.03-py3
|
||||
```
|
||||
|
||||
For DeepSeek-V4-Flash, pull the stable DeepSeek-V4 release container. Use the **cu130** build on DGX Station (Blackwell):
|
||||
```bash
|
||||
docker pull vllm/vllm-openai:v0.20.0-cu130
|
||||
```
|
||||
|
||||
# Step 4. Start vLLM server
|
||||
|
||||
Start the vLLM server with the Qwen3-235B model. This model fits entirely in VRAM on the GB300. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
|
||||
Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
|
||||
|
||||
## Base configuration (most models)
|
||||
|
||||
This is the recommended starting point for any model that fits entirely in VRAM on the GB300. The Qwen3-235B-A22B-NVFP4 model, for example, runs directly with this configuration.
|
||||
|
||||
```bash
|
||||
docker run -d \
|
||||
@ -126,6 +159,122 @@ spec:
|
||||
--gpu-memory-utilization 0.9
|
||||
```
|
||||
|
||||
Settings used:
|
||||
- `--max-model-len` — maximum context length (prompt + output) per request. Larger values reserve more GPU memory for the KV cache; size it to your workload.
|
||||
- `--gpu-memory-utilization 0.9` — fraction of GPU memory vLLM may use for weights and KV cache. `0.9` leaves headroom for other processes; raise toward `0.95` to fit more KV cache if the GPU is dedicated.
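
As a rough illustration of how `--gpu-memory-utilization` translates into a memory budget, the sketch below multiplies total GPU memory by the utilization fraction. The helper name `vllm_memory_budget_gb` is hypothetical, and the total is an input you supply (read the actual value from `nvidia-smi`).

```bash
# Hypothetical helper: GPU memory (in GB) that vLLM may claim at a given
# utilization fraction. Both arguments are supplied by the caller.
vllm_memory_budget_gb() {
  awk -v total="$1" -v util="$2" 'BEGIN { printf "%.1f\n", total * util }'
}
```

Assuming a 288 GB GPU, `vllm_memory_budget_gb 288 0.9` prints `259.2` and `vllm_memory_budget_gb 288 0.95` prints `273.6`. Weights and KV cache share this budget.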
|
||||
|
||||
## Step-3.7-Flash (FP8 / NVFP4)
|
||||
|
||||
For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300.
|
||||
|
||||
```bash
|
||||
docker run -d \
|
||||
--name vllm-server \
|
||||
--gpus all \
|
||||
--ipc host \
|
||||
--ulimit memlock=-1 \
|
||||
--ulimit stack=67108864 \
|
||||
-p 8000:8000 \
|
||||
-e HF_TOKEN="$HF_TOKEN" \
|
||||
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
|
||||
vllm/vllm-openai:stepfun37 \
|
||||
"$MODEL_HANDLE" \
|
||||
--gpu-memory-utilization 0.95 \
|
||||
--trust-remote-code \
|
||||
--reasoning-parser step3p5 \
|
||||
--enable-auto-tool-choice \
|
||||
--tool-call-parser step3p5 \
|
||||
--kv-cache-dtype fp8
|
||||
```
|
||||
|
||||
Settings used (in addition to the base configuration):
|
||||
- `--trust-remote-code` — allows the model's custom modeling code (shipped in its repo) to load. Required for Step-3.7.
|
||||
- `--reasoning-parser step3p5` — parses the model's reasoning/thinking tokens into the dedicated `reasoning_content` response field.
|
||||
- `--enable-auto-tool-choice` — lets the model decide when to call a tool, enabling OpenAI-compatible function calling.
|
||||
- `--tool-call-parser step3p5` — parses the model's tool-call output into structured `tool_calls`. Pairs with `--enable-auto-tool-choice`.
|
||||
- `--kv-cache-dtype fp8` — stores the KV cache in FP8, roughly halving KV-cache memory versus 16-bit and allowing more concurrent/longer sequences.
|
||||
|
||||
## Kimi-K2.5 NVFP4 (1T) — CPU offloading
|
||||
|
||||
For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, run with the **26.03** NGC container. This model does not fit entirely in VRAM, so the MoE expert weights are offloaded to CPU DRAM with `--cpu-offload-gb 375 --cpu-offload-params experts`. Ensure the system has enough free DRAM to hold the offloaded weights.
|
||||
|
||||
```bash
|
||||
docker run -d \
|
||||
--name vllm-server \
|
||||
--gpus all \
|
||||
--ipc host \
|
||||
--ulimit memlock=-1 \
|
||||
--ulimit stack=67108864 \
|
||||
-p 8000:8000 \
|
||||
-e HF_TOKEN="$HF_TOKEN" \
|
||||
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
|
||||
nvcr.io/nvidia/vllm:26.03-py3 \
|
||||
vllm serve nvidia/Kimi-K2.5-NVFP4 \
|
||||
--host 0.0.0.0 \
|
||||
--port 8000 \
|
||||
--dtype auto \
|
||||
--kv-cache-dtype auto \
|
||||
--gpu-memory-utilization 0.95 \
|
||||
--served-model-name nvidia/Kimi-K2.5-NVFP4 \
|
||||
--tensor-parallel-size 1 \
|
||||
--no-enable-prefix-caching \
|
||||
--trust-remote-code \
|
||||
--max-model-len 40960 \
|
||||
--max-num-seqs 1 \
|
||||
--max-num-batched-tokens 32768 \
|
||||
--cpu-offload-gb 375 \
|
||||
--cpu-offload-params experts
|
||||
```
|
||||
|
||||
Settings used (in addition to the base configuration):
|
||||
- `--cpu-offload-gb 375` — amount of CPU DRAM (in GiB) vLLM may use to hold weights that don't fit in VRAM. Must be large enough for the offloaded experts; the system needs at least this much free DRAM.
|
||||
- `--cpu-offload-params experts` — offloads only the MoE expert weights (the bulk of a large MoE model) to DRAM, keeping attention and other hot weights in VRAM.
|
||||
- `--tensor-parallel-size 1` — single GPU; the GB300 serves the whole model.
|
||||
- `--max-num-seqs 1` / `--max-num-batched-tokens 32768` — caps concurrency to one sequence and the batch token budget. With expert weights paged from DRAM, throughput is offload-bound, so a low concurrency keeps latency predictable.
|
||||
- `--no-enable-prefix-caching` — disables prefix-cache reuse. Offloaded experts make the memory budget tight, so the cache is turned off here rather than spent on KV reuse.
|
||||
- `--kv-cache-dtype auto` / `--dtype auto` — let vLLM pick the KV-cache and compute dtypes from the model's quantization (NVFP4).
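
Because `--cpu-offload-gb 375` needs that much DRAM to be available, a quick pre-flight check can save a failed start. The sketch below is one way to do it on Linux; `has_available_dram_gib` is a hypothetical helper, and it relies on the kernel's `MemAvailable` estimate in `/proc/meminfo`, which is an approximation rather than a guarantee.

```bash
# Hypothetical helper: succeed if MemAvailable is at least the requested GiB.
# The second argument defaults to /proc/meminfo, which reports values in kB (KiB).
has_available_dram_gib() {
  need_gib=$1
  meminfo=${2:-/proc/meminfo}
  awk -v need="$need_gib" '
    /^MemAvailable:/ { avail_kib = $2 }
    END { exit !(avail_kib >= need * 1024 * 1024) }
  ' "$meminfo"
}
```

For example, `has_available_dram_gib 375 || echo "Not enough available DRAM for offloading"` warns before the container is started.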
|
||||
|
||||
## DeepSeek-V4-Flash — MTP + agentic
|
||||
|
||||
For DeepSeek-V4-Flash, run with the stable **v0.20.0-cu130** container. This recipe targets agentic workloads and enables Multi-Token Prediction (MTP) speculative decoding. On a single GB300 (TP1) the MoE expert-parallel path is used; the `deep_gemm_mega_moe` backend from some internal recipes is not needed at TP1 and is omitted here.
|
||||
|
||||
```bash
|
||||
docker run -d \
|
||||
--name vllm-server \
|
||||
--gpus all \
|
||||
--ipc host \
|
||||
--ulimit memlock=-1 \
|
||||
--ulimit stack=67108864 \
|
||||
-p 8000:8000 \
|
||||
-e HF_TOKEN="$HF_TOKEN" \
|
||||
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
|
||||
vllm/vllm-openai:v0.20.0-cu130 \
|
||||
deepseek-ai/DeepSeek-V4-Flash \
|
||||
--enable-expert-parallel \
|
||||
--kv-cache-dtype fp8 \
|
||||
--trust-remote-code \
|
||||
--block-size 256 \
|
||||
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
|
||||
--attention_config.use_fp4_indexer_cache True \
|
||||
--tokenizer-mode deepseek_v4 \
|
||||
--tool-call-parser deepseek_v4 \
|
||||
--enable-auto-tool-choice \
|
||||
--reasoning-parser deepseek_v4 \
|
||||
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
|
||||
--max-model-len 32768
|
||||
```
|
||||
|
||||
Settings used (in addition to the base configuration):
|
||||
- `--enable-expert-parallel` — shards the MoE experts across the available GPU(s) using expert parallelism, the recommended MoE execution path for DeepSeek-V4.
|
||||
- `--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'` — enables **MTP (Multi-Token Prediction)** speculative decoding: the model proposes 3 tokens per step that are verified in a single forward pass, cutting latency for accepted tokens.
|
||||
- `--kv-cache-dtype fp8` — FP8 KV cache to fit more concurrent/longer sequences.
|
||||
- `--block-size 256` — KV-cache page size in tokens. DeepSeek-V4 uses multiple KV-cache groups; `256` matches the recipe validated on Station.
|
||||
- `--attention_config.use_fp4_indexer_cache True` — enables the FP4 indexer cache used by DeepSeek-V4's attention. (Drop this flag on platforms without native FP4, e.g. Hopper.)
|
||||
- `--tokenizer-mode deepseek_v4` / `--tool-call-parser deepseek_v4` / `--reasoning-parser deepseek_v4` — DeepSeek-V4-specific tokenizer, tool-call, and reasoning parsers.
|
||||
- `--enable-auto-tool-choice` — OpenAI-compatible function calling for agentic use.
|
||||
- `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` — uses full + piecewise CUDA graph capture and enables all custom ops for lower per-step overhead.
|
||||
- **Prefix caching is left enabled (the vLLM default).** For agentic workloads with large shared prefixes (e.g. a 32k system/context prefix) at low batch sizes (~BS 3–4), prefix caching gives a significant throughput boost by reusing the cached prefix across requests. Some internal recipes carry `--no-enable-prefix-caching`, but that was inherited from random-data benchmarking and is not recommended for agentic use here.
|
||||
|
||||
Check the server logs for startup progress:
|
||||
|
||||
```bash
|
||||
@ -135,7 +284,7 @@ spec:
|
||||
Expected output includes:
|
||||
- Model download progress (first run only)
|
||||
- Model loading into GPU memory
|
||||
- `Uvicorn running on http://0.0.0.0:8000`
|
||||
- `Application startup complete.`
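
Instead of watching the log by eye, the startup message can be detected with a small filter. This is a minimal sketch; `server_ready` is a hypothetical helper that only looks for the `Application startup complete` text listed above.

```bash
# Hypothetical helper: read log text on stdin and succeed if the
# startup-complete message appears anywhere in it.
server_ready() {
  grep -q "Application startup complete"
}
```

With the container name used in this playbook, a possible use is `docker logs vllm-server 2>&1 | server_ready && echo "server is ready"`.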
|
||||
|
||||
Press `Ctrl+C` to exit log view once the server is ready.
|
||||
|
||||
@ -166,9 +315,10 @@ spec:
|
||||
|
||||
Optionally, remove the image and cached model:
|
||||
|
||||
For example:
|
||||
```bash
|
||||
docker rmi nvcr.io/nvidia/vllm:26.01-py3
|
||||
rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3-235B-A22B-NVFP4
|
||||
docker rmi "<docker image name>"
|
||||
rm -rf $HOME/.cache/huggingface/hub/"<downloaded model name>"
|
||||
```
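
The cached model directory name follows the Hugging Face hub pattern `models--<org>--<name>`. The sketch below derives it from a model handle; `hf_cache_dir` is a hypothetical helper, so verify the resulting path with `ls` before deleting anything.

```bash
# Hypothetical helper: map a Hugging Face handle (org/name) to its hub cache
# directory name by prefixing "models--" and replacing "/" with "--".
hf_cache_dir() {
  printf 'models--%s\n' "$(printf '%s' "$1" | sed 's#/#--#g')"
}
```

For example, `hf_cache_dir nvidia/Qwen3-235B-A22B-NVFP4` prints `models--nvidia--Qwen3-235B-A22B-NVFP4`, the directory to look for under `$HOME/.cache/huggingface/hub/`.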
|
||||
|
||||
|
||||
|
||||