chore: Regenerate all playbooks

GitLab CI 2026-05-30 11:49:27 +00:00
parent 718d8288e3
commit 1d1a95b3cb
20 changed files with 2709 additions and 1230 deletions


@ -0,0 +1,327 @@
# DGX Station AI Skills for Coding Agents
> Give your coding agent (Claude Code, Codex, Gemini CLI, Cursor) DGX Station expertise via an AGENTS.md and on-demand Agent Skills
## Table of Contents
- [Overview](#overview)
- [AGENTS.md vs Agent Skill — why split?](#agentsmd-vs-agent-skill-why-split)
- [Instructions](#instructions)
- [Troubleshooting](#troubleshooting)
---
## Overview
## Basic idea
Modern coding agents — Claude Code, OpenAI Codex CLI, Gemini CLI, Cursor — all support two extension mechanisms: a project-level **context file** that's loaded into every conversation, and **on-demand procedural workflows** (called skills, prompts, commands, or rules depending on the harness). This playbook ships both for DGX Station:
- An **`AGENTS.md`** with the critical DGX Station constraints your agent should always know (mixed coherency, GPU targeting, common pitfalls). `AGENTS.md` is the cross-harness standard; an `install.sh` lays it down as `CLAUDE.md`, `GEMINI.md`, or `AGENTS.md` depending on the agent you use.
- **Four Agent Skills** (`vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose`), authored once in the [Anthropic Agent Skills format](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview) and installed into the right per-harness location (`.claude/skills/`, `$CODEX_HOME/skills/`, `.gemini/commands/`, or `.cursor/rules/`).
This approach keeps your agent's context lean in every conversation while giving it deep procedural knowledge on demand, regardless of which agent you use.
### AGENTS.md vs Agent Skill — why split?
| | AGENTS.md | Agent Skill |
|---|---|---|
| **Loaded** | Every conversation, automatically | Only when invoked by name (or matched by description, in Claude) |
| **Best for** | Constraints, pitfalls, "never do X" rules | Step-by-step workflows, deployment procedures |
| **Context cost** | Consumed every time | Zero until invoked |
The DGX Station mixed-coherency constraint (`--gpus all` will crash) should be in every conversation. The full vLLM deployment procedure should not.
## What you'll accomplish
- Install the `AGENTS.md` and four Agent Skills into your project directory for your chosen agent (Claude Code, Codex, Gemini CLI, or Cursor).
- Verify the agent loads the constraints automatically and the skills on demand.
- Invoke `vllm-setup` to deploy a vLLM inference server with validated configuration.
- Invoke `sglang-setup` to deploy an SGLang inference server.
- Invoke `mig-configure` to partition the GB300 into MIG instances.
- Invoke `dgx-diagnose` to troubleshoot common DGX Station issues.
## What to know before starting
- Basic familiarity with one supported coding agent (running it, giving it prompts, using slash commands or rule references)
- General understanding of DGX Station (two GPUs, Docker-based workflows)
## Prerequisites
- NVIDIA DGX Station with GB300
- One of the supported coding agents installed:
- **Claude Code:** `curl -fsSL https://claude.ai/install.sh | bash`
- **OpenAI Codex CLI:** `npm i -g @openai/codex`
- **Gemini CLI:** `npm i -g @google/gemini-cli`
- **Cursor:** download from `https://cursor.com/`
- A project directory where you do DGX Station work
## Ancillary files
- `assets/AGENTS.md` — canonical context file with critical constraints, GPU targeting, software versions, and common pitfalls. Cross-harness standard.
- `assets/skills/vllm-setup/SKILL.md` — skill: deploy vLLM with validated configuration.
- `assets/skills/sglang-setup/SKILL.md` — skill: deploy SGLang with validated configuration.
- `assets/skills/mig-configure/SKILL.md` — skill: configure MIG partitions on the GB300.
- `assets/skills/dgx-diagnose/SKILL.md` — skill: troubleshoot common DGX Station issues.
- `assets/install.sh` — per-harness installer (`claude`, `codex`, `gemini`, `cursor`, or `all`).
## Time & risk
* **Duration:** 10-15 minutes
* **Risk level:** Low — this playbook copies markdown files into your project directory
* **Rollback:** Delete the context file (`AGENTS.md` / `CLAUDE.md` / `GEMINI.md`) from your project directory, along with the harness-specific skill location (`.claude/skills/`, `.gemini/commands/`, or `.cursor/rules/` in the project; `$CODEX_HOME/skills/<name>/` for Codex)
* **Last Updated:** 05/18/2026
* Restructured as harness-agnostic Agent Skills (Claude Code, Codex, Gemini CLI, Cursor)
## Instructions
## Step 1. Install your coding agent
Pick whichever agent you prefer — the rest of this playbook works the same regardless. Install commands:
| Agent | Install |
|-------|---------|
| Claude Code | `curl -fsSL https://claude.ai/install.sh \| bash` |
| OpenAI Codex CLI | `npm i -g @openai/codex` |
| Gemini CLI | `npm i -g @google/gemini-cli` |
| Cursor | Download from `https://cursor.com/` |
Verify with `claude --version`, `codex --version`, `gemini --version`, or by launching Cursor.
## Step 2. Install the skills into your project
Navigate to the project where you want DGX Station expertise, then run the installer with the harness you use:
```bash
cd ~/your-project
# Pick one:
/path/to/this/playbook/assets/install.sh claude
/path/to/this/playbook/assets/install.sh codex
/path/to/this/playbook/assets/install.sh gemini
/path/to/this/playbook/assets/install.sh cursor
# Or install for all four at once:
/path/to/this/playbook/assets/install.sh all
```
If you downloaded the playbook as a zip, the path is relative to the extracted directory:
```bash
station-ai-skills/assets/install.sh claude ~/your-project
```
The installer is additive for skill directories (won't clobber existing skills you've written) and refuses to overwrite an existing context file (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`) unless you pass `--force`.
**Resulting layout** (per harness):
```text
your-project/
AGENTS.md or CLAUDE.md or GEMINI.md # context file (named for your agent)
.claude/skills/<name>/SKILL.md # claude
$CODEX_HOME/skills/<name>/SKILL.md # codex (default ~/.codex, outside the project)
.gemini/commands/<name>.md # gemini
.cursor/rules/<name>.mdc # cursor
```
Where `<name>` is each of `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose`.
> [!NOTE]
> Every supported agent automatically reads the context file from the working directory at startup. Skills/prompts/rules in the harness-specific directory are discovered automatically — no additional configuration needed.
## Step 3. Verify the setup
Start your agent in the project directory and ask a question that requires constraint knowledge:
```text
Can I use --gpus all to run my CUDA workload on DGX Station?
```
The agent should immediately warn about the mixed-coherency constraint and recommend `--gpus '"device=N"'` targeting. If you don't get the warning, the context file isn't being loaded — see Troubleshooting.
Then verify the skills are discoverable:
| Agent | How to check |
|-------|--------------|
| Claude Code | Type `/`; `vllm-setup`, `sglang-setup`, `mig-configure`, and `dgx-diagnose` should appear in the autocomplete |
| Codex CLI | Mention `$vllm-setup` in a prompt; the skill should be picked up from `$CODEX_HOME/skills/` (restart Codex if it was already running) |
| Gemini CLI | Type `/` — same four names appear |
| Cursor | Open the Rules panel — same four rules appear |
## Step 4. Use vllm-setup to deploy an inference server
Invoke the skill in your agent:
| Agent | Invocation |
|-------|-----------|
| Claude Code | `/vllm-setup` (slash command) or just describe the task ("deploy vllm with Qwen3-8B") |
| Codex CLI | `$vllm-setup`, or plain text such as "use vllm-setup" |
| Gemini CLI | `/vllm-setup` |
| Cursor | In chat: "use the vllm-setup rule to deploy a vllm server" |
The agent will walk you through deploying a vLLM server with a validated container image, correct GPU targeting, and recommended parameters. It will check your GPU index, ask which model you want to serve, and generate the full `docker run` command.
## Step 5. Use sglang-setup to deploy SGLang
Invoke `sglang-setup` with the same pattern as in Step 4. The skill deploys SGLang from the `cu130` container and covers RadixAttention prefix caching and structured JSON output.
## Step 6. Use mig-configure to partition the GB300
The agent will query your current MIG state, show available profiles, help you choose a layout for your workloads, and execute the partitioning commands.
## Step 7. Use dgx-diagnose to troubleshoot issues
If you encounter problems, invoke `dgx-diagnose`. The agent will check GPU status, driver version, running processes, MIG state, and Fabric Manager to identify the issue.
## Step 8. Customize
Both the `AGENTS.md` and the skills are plain markdown — extend them freely.
**Add project-specific constraints to `AGENTS.md`** (or your harness-specific context file):
```markdown
### Project-specific
- Our production MIG layout is 3g.139gb + 2g.70gb + 2g.70gb
- Always use port 8080 for inference (nginx proxy on 443)
- Model weights are cached at /data/models, mount with -v /data/models:/root/.cache/huggingface/hub
```
**Create new skills** by adding a directory and `SKILL.md` to `assets/skills/`, then re-run `install.sh`:
```bash
mkdir -p assets/skills/run-benchmarks
cat > assets/skills/run-benchmarks/SKILL.md << 'EOF'
---
name: run-benchmarks
description: Run our standard inference benchmark suite against the running vLLM or SGLang server and compare against the baseline.
---
## Run benchmarks
1. Check which inference server is running (vLLM on port 8000 or SGLang on port 30000)
2. Run the appropriate benchmark script from ./benchmarks/
3. Report throughput (tokens/sec), latency (TTFT, ITL), and memory utilization
4. Compare against the baseline in ./benchmarks/baseline.json
EOF
```
> [!TIP]
> Keep `AGENTS.md` focused on constraints and pitfalls (things that break). Put procedural workflows in skills (things you do step-by-step).
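Before re-running `install.sh`, it can help to confirm that a new `SKILL.md` has the frontmatter the installer reads. The following is a minimal sketch, not part of the shipped tooling: `skill_frontmatter_ok` is a hypothetical helper, and it assumes the frontmatter is the first block in the file, delimited by `---` lines, with `name:` and `description:` keys as in the example above.

```bash
# Hypothetical check: succeed only if a SKILL.md opens with a frontmatter block
# that contains both a name: and a description: line.
skill_frontmatter_ok() {
    awk '
        NR == 1 && $0 != "---" { exit 1 }           # file must open with ---
        NR > 1 && $0 == "---" { closed = 1; exit }  # closing delimiter reached
        /^name: ./        { has_name = 1 }
        /^description: ./ { has_desc = 1 }
        END { exit !(closed && has_name && has_desc) }
    ' "$1"
}

# Example:
# skill_frontmatter_ok assets/skills/run-benchmarks/SKILL.md && echo "frontmatter looks fine"
```

The check is deliberately shallow: it does not validate YAML, only the two keys that `install.sh` and the agents rely on.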
## Troubleshooting
## Skills don't appear in autocomplete / aren't discoverable
Each agent discovers skills from a harness-specific directory in the current directory (or a parent). Check the right one:
| Agent | Expected location |
|-------|-------------------|
| Claude Code | `.claude/skills/<name>/SKILL.md` |
| Codex CLI | `$CODEX_HOME/skills/<name>/SKILL.md` (default `~/.codex/skills/`) |
| Gemini CLI | `.gemini/commands/<name>.md` |
| Cursor | `.cursor/rules/<name>.mdc` |
```bash
# Examples: check the directory for your agent
ls -la .claude/skills/
ls -la "${CODEX_HOME:-$HOME/.codex}/skills/"
ls -la .gemini/commands/
ls -la .cursor/rules/
```
You should see entries for `vllm-setup`, `sglang-setup`, `mig-configure`, and `dgx-diagnose`.
**Check you're in the right directory:**
```bash
pwd
```
The agent must be started from the directory containing the harness directory, or a subdirectory of it.
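The directory checks above can be folded into one function that lists whichever of the four skills is missing. This is a sketch under stated assumptions: `missing_skills` is a hypothetical helper, the per-harness paths follow the layout comments in the `install.sh` header (including the Codex location under `$CODEX_HOME/skills/`), and the project directory defaults to the current one.

```bash
# Hypothetical helper: print the name of each skill that is not installed.
# Usage: missing_skills <harness> [project-dir]
missing_skills() {
    harness="$1"
    root="${2:-.}"
    for name in vllm-setup sglang-setup mig-configure dgx-diagnose; do
        case "$harness" in
            claude) path="$root/.claude/skills/$name/SKILL.md" ;;
            codex)  path="${CODEX_HOME:-$HOME/.codex}/skills/$name/SKILL.md" ;;
            gemini) path="$root/.gemini/commands/$name.md" ;;
            cursor) path="$root/.cursor/rules/$name.mdc" ;;
            *)      printf 'unknown harness: %s\n' "$harness" >&2; return 2 ;;
        esac
        # Report the skill only when its file is absent.
        [ -f "$path" ] || printf '%s\n' "$name"
    done
}

# Example:
# missing_skills claude ~/your-project
```

Empty output means all four skill files are where the agent expects them.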
## Context file not loaded
If the agent gives generic answers without DGX Station awareness, the context file isn't being picked up. Each agent reads a different filename — verify the one for your agent exists:
| Agent | Expected filename |
|-------|-------------------|
| Claude Code | `CLAUDE.md` (also reads `AGENTS.md` as fallback) |
| Codex CLI | `AGENTS.md` |
| Gemini CLI | `GEMINI.md` |
| Cursor | `AGENTS.md` |
```bash
# Verify the file exists for your agent (only the one your agent reads needs to be present)
head -n 5 AGENTS.md
head -n 5 CLAUDE.md
head -n 5 GEMINI.md
# Restart the agent in the correct directory
cd ~/your-project
claude # or codex, gemini, etc.
```
All four agents read the context file from the working directory (and parent directories up to the project root).
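To see which copy of a context file applies from a given subdirectory, the upward search can be approximated in shell. This is only an illustration: `find_context_file` is a hypothetical helper, and it walks all the way to the filesystem root because each agent's own definition of "project root" is not specified here.

```bash
# Hypothetical helper: walk upward from a start directory and print the path of
# the first matching context file (for example AGENTS.md). Fails if none exists.
find_context_file() {
    fname="$1"
    dir=$(cd "${2:-.}" && pwd) || return 1
    while :; do
        if [ -f "$dir/$fname" ]; then
            printf '%s\n' "$dir/$fname"
            return 0
        fi
        # Stop once the filesystem root has been checked.
        [ "$dir" = "/" ] && return 1
        dir=$(dirname "$dir")
    done
}

# Example:
# find_context_file CLAUDE.md ~/your-project/src
```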
## Skill gives outdated information
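The skills contain validated container versions and parameters as of the publication date. If a newer container is available, edit the canonical source and re-install. The installer skips skill files that already exist (for Claude Code, even with `--force`), so remove the installed copy first:
```bash
nano /path/to/playbook/assets/skills/vllm-setup/SKILL.md
rm -rf .claude/skills/vllm-setup   # Claude Code example; adjust the path for your harness
/path/to/playbook/assets/install.sh all --force
```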
Or edit the installed copy directly:
```bash
# Claude Code
nano .claude/skills/vllm-setup/SKILL.md
# Codex
nano "${CODEX_HOME:-$HOME/.codex}/skills/vllm-setup/SKILL.md"
# Gemini CLI
nano .gemini/commands/vllm-setup.md
# Cursor
nano .cursor/rules/vllm-setup.mdc
```
> [!TIP]
> Skills are plain markdown — you can version them in git alongside your project code.
## "Both GPUs cannot be used" errors
This is the mixed-coherency constraint working as intended. If you see CUDA errors when using `--gpus all`, find the GB300 index and target that device explicitly:
```bash
# Find the GB300 index
nvidia-smi --query-gpu=index,name --format=csv,noheader
# Use device-specific targeting (replace 1 with the GB300 index reported above)
docker run --gpus '"device=1"' ...
```
The `AGENTS.md` covers this constraint, but if you removed that section, add it back — it's the most important piece of DGX Station knowledge.
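The index lookup can also be scripted so the device number is never hard-coded. A minimal sketch, assuming `nvidia-smi` prints one `index, name` pair per line and that the GB300's reported name contains the string `GB300`; `gb300_index` is a hypothetical helper and the sample names in the comments are illustrative rather than captured output.

```bash
# Hypothetical helper: read "index, name" CSV lines on stdin and print the index
# of the first GPU whose name contains "GB300". Exits non-zero if none is found.
gb300_index() {
    awk -F', *' '$2 ~ /GB300/ { print $1; found = 1; exit } END { exit !found }'
}

# Example input (illustrative names):
#   0, NVIDIA RTX PRO 6000
#   1, NVIDIA GB300
#
# Example use:
# idx=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | gb300_index) \
#     && echo "Use: --gpus '\"device=$idx\"'"
```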
## Skills conflict with existing project directory
If your project already has a `.claude/`, `.codex/`, `.gemini/`, or `.cursor/` directory with its own contents, `install.sh` is **additive** for skill directories — it adds the new skill files alongside whatever you already have and warns on collision rather than overwriting.
For context files (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`), the installer **refuses** to overwrite an existing file. Pass `--force` to override, or merge the new content manually:
```bash
# See how the shipped file differs from yours (Claude Code shown; compare against ./AGENTS.md or ./GEMINI.md for other agents)
diff /path/to/playbook/assets/AGENTS.md ./CLAUDE.md
# Force overwrite
/path/to/playbook/assets/install.sh claude . --force
```
## Installer reports "WROTE" for some files but "SKIP" for others
That's the safe-by-default behavior. The installer skips any file that already exists, prints a warning, and continues with the rest. To get a clean install, either:
1. Delete the existing files first: `rm -rf .claude/skills/{vllm-setup,sglang-setup,mig-configure,dgx-diagnose}`
2. Or pass `--force`. This overwrites context files and, for Codex, the installed skills; Claude Code skill files are still skipped if present, so delete them first.

Binary file not shown.


@ -0,0 +1,85 @@
# DGX Station Diagnostics
Diagnose common DGX Station issues. Run through the checks below to identify the problem.
## Step 1. Gather system state
Run these commands and analyze the output:
```bash
# GPU status
nvidia-smi
# GPU device list with indices
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader
# Driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1
# MIG state
nvidia-smi -i 1 -q 2>/dev/null | grep -i "MIG Mode" || echo "Could not query MIG on device 1"
# Fabric Manager
systemctl is-active nvidia-fabricmanager
# GPU processes (fuser -v writes its process table to stderr, so keep stderr visible)
sudo fuser -v /dev/nvidia* || echo "No GPU processes found"
# Docker containers using GPUs
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}" 2>/dev/null
```
## Step 2. Match symptoms to known issues
Based on the gathered state and the user's reported problem, check for these known issues:
### CUDA crashes with `--gpus all`
**Cause:** Mixed coherency — GB300 (ATS) and RTX PRO (non-ATS) cannot share a CUDA context.
**Fix:** Use `--gpus '"device=N"'` targeting only the GB300.
### Model running on wrong GPU (RTX PRO instead of GB300)
**Check:** The device index in the docker command vs actual GPU indices.
**Fix:** Verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader` and correct the `--gpus` flag.
### vLLM crash / FlashInfer buffer overflow
**Check:** Container version — `docker inspect vllm-server | grep Image`
**Fix:** Use `nvcr.io/nvidia/vllm:26.01-py3`. Version 25.10 has a known FlashInfer bug on DGX Station.
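The container check can be reduced to a small test on the image reference. This is a sketch only: `vllm_image_ok` is a hypothetical helper, it matches the tag by substring, and it knows about the 25.10 warning from this file and nothing else.

```bash
# Hypothetical check: flag the vLLM container tag this playbook warns against.
# Takes an image reference such as nvcr.io/nvidia/vllm:26.01-py3.
vllm_image_ok() {
    case "$1" in
        *vllm:25.10*)
            printf 'avoid %s: known FlashInfer issue on DGX Station\n' "$1" >&2
            return 1 ;;
        *)
            return 0 ;;
    esac
}

# Example:
# vllm_image_ok "$(docker inspect --format '{{.Config.Image}}' vllm-server)"
```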
### SGLang CUDA errors
**Check:** Container tag — must be `cu130` for Blackwell SM103.
**Fix:** Use `lmsysorg/sglang:latest-cu130`.
### CUDA OOM despite 279 GB HBM
**Check:** `--max-model-len` / `--context-length` and memory utilization settings.
**Fix:** Reduce context length or lower `--gpu-memory-utilization` / `--mem-fraction-static`.
### `nvidia-smi -mig 1` returns "In use by another client"
**Check:** `sudo fuser -v /dev/nvidia*` — GPU processes must be stopped first.
**Fix:** Stop all GPU workloads, then retry.
### NVLink errors after disabling MIG
**Check:** `systemctl is-active nvidia-fabricmanager`
**Fix:** `sudo systemctl start nvidia-fabricmanager`
### X server crash after nvidia-xconfig -a
**Fix:** `sudo cp /etc/X11/xorg.conf.nvidia-xconfig-original /etc/X11/xorg.conf`
### Vulkan VK_ERROR_INITIALIZATION_FAILED
**Cause:** CUDA initialized before Vulkan, binding to GB300.
**Fix:** Run CUDA and Vulkan workloads in separate processes. For Vulkan apps: `__GL_DeviceModalityPreference=2 ./your_app`
### HuggingFace 401 / token errors
**Fix:** Pass token inline: `-e HF_TOKEN="hf_..."`. Don't rely on shell export for background Docker tasks.
### Port already in use
**Check:** `lsof -i :<PORT>`
**Fix:** Stop the conflicting process or use a different host port: `-p 8001:8000`.
## Step 3. Report findings
Tell the user:
1. What the issue is
2. Why it happens (root cause)
3. The specific command to fix it
4. How to verify the fix worked


@ -0,0 +1,103 @@
# MIG Configuration on DGX Station
Configure MIG (Multi-Instance GPU) partitions on the DGX Station GB300.
## Steps
1. **Find the GB300 GPU index.** Run:
```bash
nvidia-smi --query-gpu=index,name --format=csv,noheader
```
2. **Check current MIG state:**
```bash
nvidia-smi -i <GB300_INDEX> -q | grep -i "MIG Mode"
```
3. **If MIG is already enabled, show current instances:**
```bash
nvidia-smi mig -lgi -i <GB300_INDEX>
nvidia-smi mig -lci -i <GB300_INDEX>
```
If the user wants to reconfigure, destroy existing instances first (step 6).
4. **If MIG is not enabled, enable it.** All GPU processes must be stopped first:
```bash
# Check for running GPU processes
sudo fuser -v /dev/nvidia*
# Enable MIG
sudo nvidia-smi -i <GB300_INDEX> -mig 1
# Verify
nvidia-smi -i <GB300_INDEX> -q | grep -i "MIG Mode"
```
5. **Show available profiles and help the user choose a layout:**
```bash
nvidia-smi mig -lgip -i <GB300_INDEX>
```
Common GB300 MIG profiles:
| Profile | ID | Memory | Use case |
|---------|----|--------|----------|
| 1g.35gb | 19 | ~35 GB | Small models (7-8B), dev/test |
| 1g.35gb+me | 20 | ~35 GB | Same + media extensions |
| 1g.70gb | 15 | ~70 GB | Slightly larger inference |
| 2g.70gb | 14 | ~70 GB | Medium models (14-30B) |
| 3g.139gb | 9 | ~139 GB | Large models (70B quantized) |
| 4g.139gb | 5 | ~139 GB | Large models, more compute |
| 7g.278gb | 0 | ~278 GB | Full GPU as single instance |
Suggest layouts based on the user's workload. Examples:
- **Two models (70B + 8B):** `3g.139gb + 2g.70gb + 2g.70gb` → IDs `9,14,14`
- **Many small models:** `7 × 1g.35gb` → IDs `19,19,19,19,19,19,19`
- **One large model with isolation:** `7g.278gb` → ID `0`
Ask the user what models they want to run before suggesting a layout.
6. **Create (or recreate) instances:**
If reconfiguring, destroy existing instances first:
```bash
sudo nvidia-smi mig -dci -i <GB300_INDEX>
sudo nvidia-smi mig -dgi -i <GB300_INDEX>
```
Then create the new layout:
```bash
sudo nvidia-smi mig -cgi <PROFILE_IDS> -C -i <GB300_INDEX>
```
7. **Get the MIG device UUIDs:**
```bash
nvidia-smi -L
```
Note the `MIG-<uuid>` entries — these are used to target specific MIG instances.
8. **Show the user how to use MIG devices:**
```bash
# Bare metal
export CUDA_VISIBLE_DEVICES=MIG-<uuid>
# Docker
docker run --gpus '"device=MIG-<uuid>"' ...
```
9. **Report the final layout** to the user with UUIDs and suggested docker commands for each instance.
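The layouts suggested in step 5 can be sanity-checked before running `-cgi`. The sketch below sums the compute slices (the leading `N` in `Ng.Mgb`) and compares the total against 7, on the assumption that the `7g.278gb` profile represents the whole GPU. `mig_compute_slices` and `mig_layout_fits` are hypothetical helpers; they do not model memory-slice limits or placement rules, so `nvidia-smi mig -cgi` remains the authority on whether a layout is valid.

```bash
# Hypothetical helper: total the compute slices in a list of profile names.
# Usage: mig_compute_slices 3g.139gb 2g.70gb 2g.70gb
mig_compute_slices() {
    # Splitting on "g" leaves the slice count in the first field.
    printf '%s\n' "$@" | awk -F'g' '{ total += $1 } END { print total + 0 }'
}

# Hypothetical check: succeed if the layout needs at most 7 compute slices.
mig_layout_fits() {
    [ "$(mig_compute_slices "$@")" -le 7 ]
}

# Example:
# mig_layout_fits 3g.139gb 2g.70gb 2g.70gb && echo "compute slices fit"
```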
## Disabling MIG
If the user wants to return to full-GPU mode:
```bash
# Stop all workloads using MIG instances first
sudo nvidia-smi mig -dci -i <GB300_INDEX>
sudo nvidia-smi mig -dgi -i <GB300_INDEX>
sudo nvidia-smi -i <GB300_INDEX> -mig 0
# Ensure Fabric Manager is running for NVLink re-initialization
sudo systemctl start nvidia-fabricmanager
```


@ -0,0 +1,115 @@
# SGLang Setup on DGX Station
Deploy an SGLang inference server on DGX Station with validated configuration.
## Steps
1. **Find the GB300 GPU index.** Run:
```bash
nvidia-smi --query-gpu=index,name --format=csv,noheader
```
Identify the device index for the GB300 (typically device 1). Use this index for `--gpus` below. Do NOT use `--gpus all` — mixed coherency will cause CUDA failures.
2. **Ask the user which model to serve.** If they don't have a preference, suggest:
- `Qwen/Qwen3-8B` — small, fast, good for testing
- `Qwen/Qwen3-32B` — medium, good balance
- `meta-llama/Llama-3.1-70B-Instruct` — large general-purpose
3. **Check if the user has an HF_TOKEN.** Pass inline with `-e HF_TOKEN="..."`.
4. **Deploy the container.** Use this validated configuration:
```bash
docker pull lmsysorg/sglang:latest-cu130
docker run -d \
--name sglang-server \
--gpus '"device=<GB300_INDEX>"' \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 30000:30000 \
-e HF_TOKEN="<TOKEN>" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
lmsysorg/sglang:latest-cu130 \
sglang serve --model-path "<MODEL>" \
--host 0.0.0.0 \
--port 30000 \
--context-length 32768 \
--mem-fraction-static 0.85
```
**Container version:** Use `lmsysorg/sglang:latest-cu130`. The `cu130` tag is required for Blackwell SM103 support.
**First launch** downloads the model and compiles kernels. This takes extra time — subsequent starts are faster.
5. **Wait for the server to be ready.** Monitor logs:
```bash
docker logs -f sglang-server
```
6. **Test the server:**
```bash
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<MODEL>",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 64
}'
```
7. **Report the result** to the user, including:
- Model loaded and serving on port 30000
- How to stop: `docker stop sglang-server && docker rm sglang-server`
## Key features
- **RadixAttention** — automatic KV cache reuse across requests sharing prefixes. On by default, no flag needed. Verify with: `docker logs sglang-server 2>&1 | grep "cached-token" | tail -5`
- **Structured JSON output** — use `response_format.json_schema` in API requests for guaranteed valid JSON.
- **Chunked prefill** — add `--chunked-prefill-size 8192` to break long prefills into chunks, reducing time-to-first-token.
## Tuning parameters
| Parameter | Default | Agent workloads | Throughput workloads |
|-----------|---------|-----------------|---------------------|
| `--context-length` | 32768 | 32768-65536 | 8192-16384 |
| `--mem-fraction-static` | 0.85 | 0.80-0.85 | 0.85-0.88 |
| `--chunked-prefill-size` | off | 4096-8192 | 8192 |
| `--enable-metrics` | off | Optional | Recommended |
## Structured output example
```bash
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<MODEL>",
"messages": [{"role": "user", "content": "List three programming languages."}],
"max_tokens": 512,
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "languages",
"schema": {
"type": "object",
"properties": {
"languages": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"primary_use": {"type": "string"}
},
"required": ["name", "primary_use"]
}
}
},
"required": ["languages"]
}
}
}
}'
```


@ -0,0 +1,74 @@
# vLLM Setup on DGX Station
Deploy a vLLM inference server on DGX Station with validated configuration.
## Steps
1. **Find the GB300 GPU index.** Run:
```bash
nvidia-smi --query-gpu=index,name --format=csv,noheader
```
Identify the device index for the GB300 (typically device 1). Use this index for `--gpus` below. Do NOT use `--gpus all` — mixed coherency will cause CUDA failures.
2. **Ask the user which model to serve.** If they don't have a preference, suggest:
- `nvidia/Qwen3-235B-A22B-NVFP4` — large MoE model, fits in 279 GB HBM
- `meta-llama/Llama-3.1-70B-Instruct` — solid general-purpose model
- `Qwen/Qwen3-8B` — small model for testing
3. **Check if the user has an HF_TOKEN.** Many models require HuggingFace authentication. The token must be passed inline with `-e HF_TOKEN="..."` — do not rely on shell export in background Docker tasks.
4. **Deploy the container.** Use this validated configuration:
```bash
docker pull nvcr.io/nvidia/vllm:26.01-py3
docker run -d \
--name vllm-server \
--gpus '"device=<GB300_INDEX>"' \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="<TOKEN>" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
nvcr.io/nvidia/vllm:26.01-py3 \
vllm serve "<MODEL>" \
--max-model-len 32768 \
--gpu-memory-utilization 0.9
```
**Container version:** Use `nvcr.io/nvidia/vllm:26.01-py3`. Do NOT use 25.10 — it has a FlashInfer buffer overflow on DGX Station.
5. **Wait for the server to be ready.** Monitor logs:
```bash
docker logs -f vllm-server
```
Wait for the line indicating the server is listening on port 8000.
6. **Test the server:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<MODEL>",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 64
}'
```
7. **Report the result** to the user, including:
- Model loaded and serving on port 8000
- GPU memory utilization
- How to stop: `docker stop vllm-server && docker rm vllm-server`
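Instead of watching the logs in step 5 by hand, readiness can be polled. The following is a generic retry loop, offered as a sketch: `wait_until` is a hypothetical helper, and the commented example assumes the OpenAI-compatible `/v1/models` route responds once the server has finished loading.

```bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Output is discarded; only the exit status of the probe matters.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example (up to 60 probes, 5 seconds apart):
# wait_until 60 5 curl -fsS http://localhost:8000/v1/models && echo "server is ready"
```

Large models can take longer than a few minutes to load on first launch, so the attempt count may need raising.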
## Tuning parameters
Adjust these based on the user's workload:
| Parameter | Default | Agent workloads | Throughput workloads |
|-----------|---------|-----------------|---------------------|
| `--max-model-len` | 32768 | 32768-65536 | 8192-16384 |
| `--gpu-memory-utilization` | 0.9 | 0.85-0.90 | 0.90-0.92 |
| `--enable-prefix-caching` | off | Enable (multi-turn reuse) | Enable |
| `--max-num-seqs` | default | 4-16 (lower latency) | 32+ (higher throughput) |
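One way to apply the table is to keep a preset per workload type. The sketch below is illustrative: `vllm_flags` is a hypothetical helper, and each value is a single pick from the ranges in the table above rather than a validated setting.

```bash
# Hypothetical presets: print vllm serve flags for a workload type.
# Values are single picks from the tuning table ranges, not validated settings.
vllm_flags() {
    case "$1" in
        agent)
            printf '%s\n' '--max-model-len 32768 --gpu-memory-utilization 0.90 --enable-prefix-caching --max-num-seqs 8' ;;
        throughput)
            printf '%s\n' '--max-model-len 16384 --gpu-memory-utilization 0.90 --enable-prefix-caching --max-num-seqs 32' ;;
        *)
            printf 'unknown workload: %s (expected agent or throughput)\n' "$1" >&2
            return 2 ;;
    esac
}

# Example: append the preset to the serve command used in step 4.
# vllm serve "<MODEL>" $(vllm_flags agent)
```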


@ -0,0 +1,81 @@
# DGX Station Essential Constraints
This file gives your coding agent the critical constraints it needs to avoid breaking things on NVIDIA DGX Station. When you need a step-by-step workflow, invoke the bundled skills: `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose`. In Codex, install them into `$CODEX_HOME/skills` and mention them as `$vllm-setup` or plain text like "use vllm-setup"; in Claude Code or Gemini CLI, type `/<name>`; in Cursor, reference the rule by name.
## System architecture (quick reference)
- **GB300 GPU** — Blackwell Ultra (SM103), up to 279 GB HBM3e, 20 PFLOPS sparse FP4. This is the AI compute GPU.
- **Grace CPU** — 72-core ARM Neoverse V2, up to 496 GB LPDDR5x.
- **RTX PRO 6000** — Discrete display GPU (PCIe, non-coherent). For graphics only.
- **NVLink C2C** — Coherent CPU-GPU link. CPU + GPU memory = up to 775 GB total.
- The GB300 is typically device **1** and RTX PRO is device **0**. Always verify: `nvidia-smi --query-gpu=index,name --format=csv,noheader`
## Critical constraint: mixed coherency
**CUDA cannot handle mixed-coherency GPUs in the same process.** The GB300 uses hardware-coherent memory (ATS) while the RTX PRO uses non-coherent memory (HMM via PCIe). A single CUDA context can use one or the other, not both.
**Never use `--gpus all`** — it will cause CUDA assert failures.
## GPU targeting
There are three ways to target the GB300:
**1. By device index** (most common):
```bash
export CUDA_VISIBLE_DEVICES=1 # bare metal
docker run --gpus '"device=1"' ... # Docker
```
**2. By coherency modality:**
```bash
export CUDA_DEVICE_MODALITY=ATS # GB300 (coherent)
export CUDA_DEVICE_MODALITY=NONATS # RTX PRO (non-coherent)
```
**3. By driver application profiles** in `~/.nv/nvidia-application-profiles-rc`:
```json
{
"rules": [
{ "pattern": { "feature": "cmdline", "matches": "my_app" }, "profile": "UseATSGpuInMixedCoherencySystems" }
]
}
```
## Display and graphics
- The GB300 does not support X display. Display runs on RTX PRO only.
- **Do not run `nvidia-xconfig -a`** — it generates an invalid config.
- If CUDA initializes before Vulkan in a process, it may bind to the GB300, causing `VK_ERROR_INITIALIZATION_FAILED`. Run CUDA and Vulkan in separate processes.
## Memory
- GB300 HBM is in the system memory pool (NUMA node 1). `malloc` may allocate there.
- Use `numactl --membind=0` for CPU-only processes that shouldn't touch GPU memory.
- CPU can cache accesses to GB300 memory, but GB300 cannot cache accesses to CPU memory.
## Software versions
| Component | Validated version | Notes |
|-----------|-------------------|-------|
| NVIDIA Driver | 590.48.01 | Check with `nvidia-smi` |
| CUDA (driver) | 13.1 | Containers bring their own runtime |
| vLLM container | `nvcr.io/nvidia/vllm:26.01-py3` | **Avoid 25.10** (FlashInfer buffer overflow) |
| SGLang container | `lmsysorg/sglang:latest-cu130` | cu130 required for SM103 |
| CUDA base image | `nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04` | For custom containers |
| Ubuntu | 24.04 | Preinstalled |
## Common pitfalls
| Symptom | Cause | Fix |
|---------|-------|-----|
| `--gpus all` CUDA assert failure | Mixed coherency | Use `--gpus '"device=N"'` for the GB300 |
| vLLM 25.10 FlashInfer crash | Known DGX Station bug | Use `vllm:26.01-py3` or newer |
| SGLang CUDA errors | Wrong CUDA for Blackwell | Use `sglang:latest-cu130` |
| Model runs on RTX PRO | Wrong device index | Verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader` |
| `nvidia-smi -mig 1` "In use" | GPU processes running | `sudo fuser -v /dev/nvidia*` |
| NVLink errors after disabling MIG | Fabric Manager stopped | `sudo systemctl start nvidia-fabricmanager` |
| `malloc` lands in GPU memory | HBM in system pool | `numactl --membind=0` |
| X crash after `nvidia-xconfig -a` | Invalid mixed-coherency config | Restore from `/etc/X11/xorg.conf.nvidia-xconfig-original` |
| Vulkan `VK_ERROR_INITIALIZATION_FAILED` | CUDA bound GB300 first | Separate CUDA and Vulkan into different processes |
| HuggingFace 401 | Missing HF_TOKEN | Pass inline: `-e HF_TOKEN="hf_..."` |
| Port conflict | Port already in use | `lsof -i :PORT`, use different port |


@ -0,0 +1,218 @@
#!/bin/sh
# install.sh — Install DGX Station AI Skills into a project for a chosen coding agent.
#
# Usage: ./install.sh <harness> [target-dir] [--force]
# harness: claude | codex | gemini | cursor | all
# target-dir: where to install (default: current directory)
# --force: overwrite existing context files (AGENTS.md, CLAUDE.md, GEMINI.md)
#
# Layout produced per harness:
# claude -> CLAUDE.md + .claude/skills/<name>/SKILL.md
# codex -> AGENTS.md + $CODEX_HOME/skills/<name>/SKILL.md
# gemini -> GEMINI.md + .gemini/commands/<name>.md
# cursor -> AGENTS.md + .cursor/rules/<name>.mdc
# all -> all of the above
set -eu
usage() {
cat <<EOF
Usage: $0 <harness> [target-dir] [--force]
Harnesses:
claude Claude Code -> CLAUDE.md + .claude/skills/<name>/SKILL.md
codex OpenAI Codex CLI -> AGENTS.md + \$CODEX_HOME/skills/<name>/SKILL.md
gemini Gemini CLI -> GEMINI.md + .gemini/commands/<name>.md
cursor Cursor -> AGENTS.md + .cursor/rules/<name>.mdc
all Install for all four
Options:
--force Overwrite existing context files instead of erroring
EOF
}
if [ $# -lt 1 ]; then
usage >&2
exit 2
fi
case "$1" in
-h|--help) usage; exit 0 ;;
esac
HARNESS="$1"
shift
TARGET="."
FORCE=0
while [ $# -gt 0 ]; do
case "$1" in
--force) FORCE=1 ;;
-h|--help) usage; exit 0 ;;
*) TARGET="$1" ;;
esac
shift
done
case "$HARNESS" in
claude|codex|gemini|cursor|all) ;;
*) printf 'Error: unknown harness "%s"\n\n' "$HARNESS" >&2; usage >&2; exit 2 ;;
esac
ASSETS="$(cd "$(dirname "$0")" && pwd)"
SKILLS_DIR="$ASSETS/skills"
AGENTS_MD="$ASSETS/AGENTS.md"
if [ ! -f "$AGENTS_MD" ]; then
printf 'Error: %s not found\n' "$AGENTS_MD" >&2
exit 1
fi
if [ ! -d "$SKILLS_DIR" ]; then
printf 'Error: %s not found\n' "$SKILLS_DIR" >&2
exit 1
fi
mkdir -p "$TARGET"
SKILL_NAMES="vllm-setup sglang-setup mig-configure dgx-diagnose"
# write_context <target-filename>
# Copies AGENTS.md to <target-dir>/<target-filename>, refusing to overwrite without --force.
write_context() {
fname="$1"
dest="$TARGET/$fname"
if [ -e "$dest" ] && [ "$FORCE" -ne 1 ]; then
printf ' SKIP %s (exists; pass --force to overwrite)\n' "$dest" >&2
return 1
fi
cp "$AGENTS_MD" "$dest"
printf ' WROTE %s\n' "$dest"
}
# strip_frontmatter <src> <dest>
# Emits the SKILL.md body (everything after the closing `---`) to <dest>.
# Note: POSIX sh has no local vars; use unique names to avoid clobbering callers.
strip_frontmatter() {
_sf_src="$1"
_sf_dest="$2"
awk 'BEGIN { in_fm=0; past_fm=0 }
past_fm == 1 { print; next }
/^---$/ && in_fm == 0 { in_fm=1; next }
/^---$/ && in_fm == 1 { past_fm=1; next }
in_fm == 0 && past_fm == 0 { past_fm=1; print }' "$_sf_src" > "$_sf_dest"
}
# write_cursor_rule <src> <dest> <name> <description>
# Writes a Cursor .mdc rule: replaces Anthropic frontmatter with Cursor's shape, keeps the body.
write_cursor_rule() {
_wc_src="$1"
_wc_dest="$2"
_wc_name="$3"
_wc_desc="$4"
{
printf -- '---\n'
printf 'description: %s\n' "$_wc_desc"
printf 'globs: ["**/*"]\n'
printf 'alwaysApply: false\n'
printf -- '---\n\n'
} > "$_wc_dest"
strip_frontmatter "$_wc_src" "$_wc_dest.body"
cat "$_wc_dest.body" >> "$_wc_dest"
rm -f "$_wc_dest.body"
}
# extract_description <skill-name>
# Reads the description: line from the skill's SKILL.md frontmatter.
extract_description() {
_ed_name="$1"
awk '/^description: / { sub(/^description: /, ""); print; exit }' "$SKILLS_DIR/$_ed_name/SKILL.md"
}
install_claude() {
printf 'Installing for Claude Code into %s/\n' "$TARGET"
write_context "CLAUDE.md" || true
for name in $SKILL_NAMES; do
dest_dir="$TARGET/.claude/skills/$name"
dest="$dest_dir/SKILL.md"
mkdir -p "$dest_dir"
if [ -e "$dest" ]; then
printf ' SKIP %s (exists)\n' "$dest" >&2
continue
fi
cp "$SKILLS_DIR/$name/SKILL.md" "$dest"
printf ' WROTE %s\n' "$dest"
done
printf 'Next: cd %s && claude (type "/" to see vllm-setup, sglang-setup, mig-configure, dgx-diagnose)\n' "$TARGET"
}
install_codex() {
printf 'Installing for OpenAI Codex CLI into %s/\n' "$TARGET"
write_context "AGENTS.md" || true
codex_home="${CODEX_HOME:-$HOME/.codex}"
codex_skills="$codex_home/skills"
mkdir -p "$codex_skills"
for name in $SKILL_NAMES; do
dest_dir="$codex_skills/$name"
dest="$dest_dir/SKILL.md"
if [ -e "$dest" ] && [ "$FORCE" -ne 1 ]; then
printf ' SKIP %s (exists)\n' "$dest" >&2
continue
fi
mkdir -p "$dest_dir"
cp -R "$SKILLS_DIR/$name/." "$dest_dir/"
printf ' WROTE %s\n' "$dest_dir"
done
printf 'Next: cd %s && codex (mention $vllm-setup or "use vllm-setup"; restart Codex if it was already running)\n' "$TARGET"
}
install_gemini() {
printf 'Installing for Gemini CLI into %s/\n' "$TARGET"
write_context "GEMINI.md" || true
mkdir -p "$TARGET/.gemini/commands"
for name in $SKILL_NAMES; do
dest="$TARGET/.gemini/commands/$name.md"
if [ -e "$dest" ]; then
printf ' SKIP %s (exists)\n' "$dest" >&2
continue
fi
strip_frontmatter "$SKILLS_DIR/$name/SKILL.md" "$dest"
printf ' WROTE %s\n' "$dest"
done
printf 'Next: cd %s && gemini (type /<name> to invoke a skill)\n' "$TARGET"
}
install_cursor() {
printf 'Installing for Cursor into %s/\n' "$TARGET"
write_context "AGENTS.md" || true
mkdir -p "$TARGET/.cursor/rules"
for name in $SKILL_NAMES; do
dest="$TARGET/.cursor/rules/$name.mdc"
if [ -e "$dest" ]; then
printf ' SKIP %s (exists)\n' "$dest" >&2
continue
fi
desc="$(extract_description "$name")"
write_cursor_rule "$SKILLS_DIR/$name/SKILL.md" "$dest" "$name" "$desc"
printf ' WROTE %s\n' "$dest"
done
printf 'Next: open %s in Cursor (reference rules by name in chat, e.g. "use the vllm-setup rule")\n' "$TARGET"
}
case "$HARNESS" in
claude) install_claude ;;
codex) install_codex ;;
gemini) install_gemini ;;
cursor) install_cursor ;;
all)
install_claude
printf '\n'
install_codex
printf '\n'
install_gemini
printf '\n'
install_cursor
;;
esac
printf '\nDone.\n'


@ -0,0 +1,92 @@
---
name: dgx-diagnose
description: Diagnose common DGX Station GB300 issues — CUDA crashes, wrong-GPU targeting, vLLM/SGLang container bugs, MIG state problems, NVLink/Fabric Manager errors, X/Vulkan failures, HuggingFace auth, and port conflicts. Use when the user reports a GPU error, inference server crash, MIG problem, or any unexplained DGX Station failure.
metadata:
publisher: nvidia
hardware: DGX Station GB300
---
# DGX Station Diagnostics
Diagnose common DGX Station issues. Run through the checks below to identify the problem.
## Step 1. Gather system state
Run these commands and analyze the output:
```bash
# GPU status
nvidia-smi
# GPU device list with indices
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader
# Driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1
# MIG state
nvidia-smi -i 1 -q 2>/dev/null | grep -i "MIG Mode" || echo "Could not query MIG on device 1"
# Fabric Manager
systemctl is-active nvidia-fabricmanager
# GPU processes
sudo fuser -v /dev/nvidia* 2>/dev/null || echo "No GPU processes found"
# Docker containers using GPUs
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}" 2>/dev/null
```
## Step 2. Match symptoms to known issues
Based on the gathered state and the user's reported problem, check for these known issues:
### CUDA crashes with `--gpus all`
**Cause:** Mixed coherency — GB300 (ATS) and RTX PRO (non-ATS) cannot share a CUDA context.
**Fix:** Use `--gpus '"device=N"'` targeting only the GB300.
### Model running on wrong GPU (RTX PRO instead of GB300)
**Check:** The device index in the docker command vs actual GPU indices.
**Fix:** Verify with `nvidia-smi --query-gpu=index,name --format=csv,noheader` and correct the `--gpus` flag.
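The index lookup can also be scripted. The following is a minimal sketch, assuming the GB300's name as reported by `nvidia-smi` contains the substring `GB300` (confirm this on your system); `gb300_index` is a hypothetical helper, not part of the playbook.

```bash
# Hypothetical helper: print the index of the first GPU whose name contains "GB300".
# Reads "index, name" CSV lines on stdin, as produced by:
#   nvidia-smi --query-gpu=index,name --format=csv,noheader
# Assumption: the reported GPU name contains the substring "GB300".
# Exits non-zero if no matching line is found.
gb300_index() {
    awk -F', *' '$2 ~ /GB300/ { print $1; found = 1; exit } END { exit !found }'
}

# Example usage (requires nvidia-smi and docker on the host):
#   idx="$(nvidia-smi --query-gpu=index,name --format=csv,noheader | gb300_index)" &&
#       docker run --gpus "\"device=$idx\"" ...
```

Failing loudly when no GB300 line is found is deliberate in this sketch: falling back to a default index could silently target the RTX PRO.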
### vLLM crash / FlashInfer buffer overflow
**Check:** Container version — `docker inspect vllm-server | grep Image`
**Fix:** Use `nvcr.io/nvidia/vllm:26.01-py3`. Version 25.10 has a known FlashInfer bug on DGX Station.
### SGLang CUDA errors
**Check:** Container tag — must be `cu130` for Blackwell SM103.
**Fix:** Use `lmsysorg/sglang:latest-cu130`.
### CUDA OOM despite 279 GB HBM
**Check:** `--max-model-len` / `--context-length` and memory utilization settings.
**Fix:** Reduce context length or lower `--gpu-memory-utilization` / `--mem-fraction-static`.
### `nvidia-smi -mig 1` returns "In use by another client"
**Check:** `sudo fuser -v /dev/nvidia*` — GPU processes must be stopped first.
**Fix:** Stop all GPU workloads, then retry.
### NVLink errors after disabling MIG
**Check:** `systemctl is-active nvidia-fabricmanager`
**Fix:** `sudo systemctl start nvidia-fabricmanager`
### X server crash after nvidia-xconfig -a
**Fix:** `sudo cp /etc/X11/xorg.conf.nvidia-xconfig-original /etc/X11/xorg.conf`
### Vulkan VK_ERROR_INITIALIZATION_FAILED
**Cause:** CUDA initialized before Vulkan, binding to GB300.
**Fix:** Run CUDA and Vulkan workloads in separate processes. For Vulkan apps: `__GL_DeviceModalityPreference=2 ./your_app`
### HuggingFace 401 / token errors
**Fix:** Pass token inline: `-e HF_TOKEN="hf_..."`. Don't rely on shell export for background Docker tasks.
### Port already in use
**Check:** `lsof -i :<PORT>`
**Fix:** Stop the conflicting process or use a different host port: `-p 8001:8000`.
## Step 3. Report findings
Tell the user:
1. What the issue is
2. Why it happens (root cause)
3. The specific command to fix it
4. How to verify the fix worked


@ -0,0 +1,110 @@
---
name: mig-configure
description: Configure NVIDIA MIG (Multi-Instance GPU) partitions on the DGX Station GB300, including enabling MIG mode, choosing a profile layout, creating instances, and retrieving MIG UUIDs. Use when the user asks to partition the GB300, set up MIG, run multiple models in isolation on one GPU, or reconfigure existing MIG instances.
metadata:
publisher: nvidia
hardware: DGX Station GB300
---
# MIG Configuration on DGX Station
Configure MIG (Multi-Instance GPU) partitions on the DGX Station GB300.
## Steps
1. **Find the GB300 GPU index.** Run:
```bash
nvidia-smi --query-gpu=index,name --format=csv,noheader
```
2. **Check current MIG state:**
```bash
nvidia-smi -i <GB300_INDEX> -q | grep -i "MIG Mode"
```
3. **If MIG is already enabled, show current instances:**
```bash
nvidia-smi mig -lgi -i <GB300_INDEX>
nvidia-smi mig -lci -i <GB300_INDEX>
```
If the user wants to reconfigure, destroy existing instances first (step 6).
4. **If MIG is not enabled, enable it.** All GPU processes must be stopped first:
```bash
# Check for running GPU processes
sudo fuser -v /dev/nvidia*
# Enable MIG
sudo nvidia-smi -i <GB300_INDEX> -mig 1
# Verify
nvidia-smi -i <GB300_INDEX> -q | grep -i "MIG Mode"
```
5. **Show available profiles and help the user choose a layout:**
```bash
nvidia-smi mig -lgip -i <GB300_INDEX>
```
Common GB300 MIG profiles:
| Profile | ID | Memory | Use case |
|---------|----|--------|----------|
| 1g.35gb | 19 | ~35 GB | Small models (7-8B), dev/test |
| 1g.35gb+me | 20 | ~35 GB | Same + media extensions |
| 1g.70gb | 15 | ~70 GB | Slightly larger inference |
| 2g.70gb | 14 | ~70 GB | Medium models (14-30B) |
| 3g.139gb | 9 | ~139 GB | Large models (70B quantized) |
| 4g.139gb | 5 | ~139 GB | Large models, more compute |
| 7g.278gb | 0 | ~278 GB | Full GPU as single instance |
Suggest layouts based on the user's workload. Examples:
- **Two models (70B + 8B):** `3g.139gb + 2g.70gb + 2g.70gb` → IDs `9,14,14`
- **Many small models:** `7 × 1g.35gb` → IDs `19,19,19,19,19,19,19`
- **One large model with isolation:** `7g.278gb` → ID `0`
Ask the user what models they want to run before suggesting a layout.
6. **Create (or recreate) instances:**
If reconfiguring, destroy existing instances first:
```bash
sudo nvidia-smi mig -dci -i <GB300_INDEX>
sudo nvidia-smi mig -dgi -i <GB300_INDEX>
```
Then create the new layout:
```bash
sudo nvidia-smi mig -cgi <PROFILE_IDS> -C -i <GB300_INDEX>
```
7. **Get the MIG device UUIDs:**
```bash
nvidia-smi -L
```
Note the `MIG-<uuid>` entries — these are used to target specific MIG instances.
8. **Show the user how to use MIG devices:**
```bash
# Bare metal
export CUDA_VISIBLE_DEVICES=MIG-<uuid>
# Docker
docker run --gpus '"device=MIG-<uuid>"' ...
```
9. **Report the final layout** to the user with UUIDs and suggested docker commands for each instance.
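To avoid transcription mistakes when turning a chosen layout into the `-cgi` argument, the name-to-ID mapping can be sketched as a small function. This is an illustration only: `mig_profile_ids` is a hypothetical helper, and the IDs are copied from the profile table in step 5, so they should be checked against `nvidia-smi mig -lgip` before use.

```bash
# Hypothetical helper: translate MIG profile names into the comma-separated ID list
# expected by `nvidia-smi mig -cgi`. IDs are copied from the profile table in step 5;
# confirm them with `nvidia-smi mig -lgip -i <GB300_INDEX>` on your system.
mig_profile_ids() {
    ids=""
    for profile in "$@"; do
        case "$profile" in
            1g.35gb)    id=19 ;;
            1g.35gb+me) id=20 ;;
            1g.70gb)    id=15 ;;
            2g.70gb)    id=14 ;;
            3g.139gb)   id=9 ;;
            4g.139gb)   id=5 ;;
            7g.278gb)   id=0 ;;
            *) printf 'unknown MIG profile: %s\n' "$profile" >&2; return 1 ;;
        esac
        # Append with a comma separator, except before the first ID.
        ids="${ids:+$ids,}$id"
    done
    printf '%s\n' "$ids"
}

# Example usage:
#   sudo nvidia-smi mig -cgi "$(mig_profile_ids 3g.139gb 2g.70gb 2g.70gb)" -C -i <GB300_INDEX>
```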
## Disabling MIG
If the user wants to return to full-GPU mode:
```bash
# Stop all workloads using MIG instances first
sudo nvidia-smi mig -dci -i <GB300_INDEX>
sudo nvidia-smi mig -dgi -i <GB300_INDEX>
sudo nvidia-smi -i <GB300_INDEX> -mig 0
# Ensure Fabric Manager is running for NVLink re-initialization
sudo systemctl start nvidia-fabricmanager
```


@ -0,0 +1,122 @@
---
name: sglang-setup
description: Deploy an SGLang inference server on an NVIDIA DGX Station GB300 with the cu130 container, RadixAttention prefix caching, and structured JSON output support. Use when the user asks to serve a model with SGLang, start an SGLang endpoint, or needs structured-output inference on DGX Station.
metadata:
publisher: nvidia
hardware: DGX Station GB300
---
# SGLang Setup on DGX Station
Deploy an SGLang inference server on DGX Station with validated configuration.
## Steps
1. **Find the GB300 GPU index.** Run:
```bash
nvidia-smi --query-gpu=index,name --format=csv,noheader
```
Identify the device index for the GB300 (typically device 1). Use this index for `--gpus` below. Do NOT use `--gpus all` — mixed coherency will cause CUDA failures.
2. **Ask the user which model to serve.** If they don't have a preference, suggest:
- `Qwen/Qwen3-8B` — small, fast, good for testing
- `Qwen/Qwen3-32B` — medium, good balance
- `meta-llama/Llama-3.1-70B-Instruct` — large general-purpose
3. **Check if the user has an HF_TOKEN.** Gated models require HuggingFace authentication. Pass the token inline with `-e HF_TOKEN="..."`.
4. **Deploy the container.** Use this validated configuration:
```bash
docker pull lmsysorg/sglang:latest-cu130
docker run -d \
--name sglang-server \
--gpus '"device=<GB300_INDEX>"' \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 30000:30000 \
-e HF_TOKEN="<TOKEN>" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
lmsysorg/sglang:latest-cu130 \
sglang serve --model-path "<MODEL>" \
--host 0.0.0.0 \
--port 30000 \
--context-length 32768 \
--mem-fraction-static 0.85
```
**Container version:** Use `lmsysorg/sglang:latest-cu130`. The `cu130` tag is required for Blackwell SM103 support.
**First launch** downloads the model and compiles kernels. This takes extra time — subsequent starts are faster.
5. **Wait for the server to be ready.** Monitor logs:
```bash
docker logs -f sglang-server
```
6. **Test the server:**
```bash
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<MODEL>",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 64
}'
```
7. **Report the result** to the user, including:
- Model loaded and serving on port 30000
- How to stop: `docker stop sglang-server && docker rm sglang-server`
## Key features
- **RadixAttention** — automatic KV cache reuse across requests sharing prefixes. On by default, no flag needed. Verify with: `docker logs sglang-server 2>&1 | grep "cached-token" | tail -5`
- **Structured JSON output** — use `response_format.json_schema` in API requests for guaranteed valid JSON.
- **Chunked prefill** — add `--chunked-prefill-size 8192` to break long prefills into chunks, reducing time-to-first-token.
## Tuning parameters
| Parameter | Default | Agent workloads | Throughput workloads |
|-----------|---------|-----------------|---------------------|
| `--context-length` | 32768 | 32768-65536 | 8192-16384 |
| `--mem-fraction-static` | 0.85 | 0.80-0.85 | 0.85-0.88 |
| `--chunked-prefill-size` | off | 4096-8192 | 8192 |
| `--enable-metrics` | off | Optional | Recommended |
## Structured output example
```bash
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<MODEL>",
"messages": [{"role": "user", "content": "List three programming languages."}],
"max_tokens": 512,
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "languages",
"schema": {
"type": "object",
"properties": {
"languages": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"primary_use": {"type": "string"}
},
"required": ["name", "primary_use"]
}
}
},
"required": ["languages"]
}
}
}
}'
```


@ -0,0 +1,81 @@
---
name: vllm-setup
description: Deploy a vLLM inference server on an NVIDIA DGX Station GB300 with validated container, GPU targeting, and tuning parameters. Use when the user asks to serve a model with vLLM, start a vLLM endpoint, or set up OpenAI-compatible inference on DGX Station.
metadata:
publisher: nvidia
hardware: DGX Station GB300
---
# vLLM Setup on DGX Station
Deploy a vLLM inference server on DGX Station with validated configuration.
## Steps
1. **Find the GB300 GPU index.** Run:
```bash
nvidia-smi --query-gpu=index,name --format=csv,noheader
```
Identify the device index for the GB300 (typically device 1). Use this index for `--gpus` below. Do NOT use `--gpus all` — mixed coherency will cause CUDA failures.
2. **Ask the user which model to serve.** If they don't have a preference, suggest:
- `nvidia/Qwen3-235B-A22B-NVFP4` — large MoE model, fits in 279 GB HBM
- `meta-llama/Llama-3.1-70B-Instruct` — solid general-purpose model
- `Qwen/Qwen3-8B` — small model for testing
3. **Check if the user has an HF_TOKEN.** Many models require HuggingFace authentication. The token must be passed inline with `-e HF_TOKEN="..."` — do not rely on shell export in background Docker tasks.
4. **Deploy the container.** Use this validated configuration:
```bash
docker pull nvcr.io/nvidia/vllm:26.01-py3
docker run -d \
--name vllm-server \
--gpus '"device=<GB300_INDEX>"' \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="<TOKEN>" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
nvcr.io/nvidia/vllm:26.01-py3 \
vllm serve "<MODEL>" \
--max-model-len 32768 \
--gpu-memory-utilization 0.9
```
**Container version:** Use `nvcr.io/nvidia/vllm:26.01-py3`. Do NOT use 25.10 — it has a FlashInfer buffer overflow on DGX Station.
5. **Wait for the server to be ready.** Monitor logs:
```bash
docker logs -f vllm-server
```
Wait for the line indicating the server is listening on port 8000.
6. **Test the server:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<MODEL>",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 64
}'
```
7. **Report the result** to the user, including:
- Model loaded and serving on port 8000
- GPU memory utilization
- How to stop: `docker stop vllm-server && docker rm vllm-server`
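Instead of watching the logs by hand in step 5, readiness can be polled. The sketch below is a generic retry loop, not part of the skill; `retry_until` is a hypothetical helper, and the example assumes the server answers on the OpenAI-compatible `/v1/models` route once it is ready.

```bash
# Hypothetical helper: run a command repeatedly until it succeeds or attempts run out.
# Usage: retry_until <max_attempts> <delay_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 after <max_attempts> failures.
retry_until() {
    _ru_max="$1"
    _ru_delay="$2"
    shift 2
    _ru_n=1
    while ! "$@"; do
        if [ "$_ru_n" -ge "$_ru_max" ]; then
            return 1
        fi
        _ru_n=$((_ru_n + 1))
        sleep "$_ru_delay"
    done
}

# Example usage (assumes the /v1/models route responds once the model is loaded):
#   retry_until 60 10 curl -fsS -o /dev/null http://localhost:8000/v1/models \
#       || docker logs --tail 50 vllm-server
```

The attempt count and delay are placeholders; large models can take much longer to load on first launch.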
## Tuning parameters
Adjust these based on the user's workload:
| Parameter | Default | Agent workloads | Throughput workloads |
|-----------|---------|-----------------|---------------------|
| `--max-model-len` | 32768 | 32768-65536 | 8192-16384 |
| `--gpu-memory-utilization` | 0.9 | 0.85-0.90 | 0.90-0.92 |
| `--enable-prefix-caching` | off | Enable (multi-turn reuse) | Enable |
| `--max-num-seqs` | default | 4-16 (lower latency) | 32+ (higher throughput) |


@ -0,0 +1,413 @@
kind: Playbook
metadata:
name: station-ai-skills
displayName: DGX Station AI Skills for Coding Agents
shortDescription: Give your coding agent (Claude Code, Codex, Gemini CLI, Cursor) DGX Station expertise via an AGENTS.md and on-demand Agent Skills
publisher: nvidia
description: |
# REPLACE THIS WITH YOUR MODEL CARD
https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads
labelsV2:
- gpuType:playbook:gpu_type_station
- DGX Station
- GB300
- Blackwell
- AI Agents
- Agent Skills
- AGENTS.md
- Claude Code
- Codex
- Gemini CLI
- Cursor
- vLLM
- SGLang
- MIG
- Mixed Coherency
attributes:
- key: DURATION
value: 15 MIN
spec:
artifactName: station-ai-skills
nvcfFunctionId: None
attributes:
showUnavailableBanner: false
apiDocsUrl: None
termsOfUse: |
cta:
text: View on GitHub
url: https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/station-ai-skills/
tabs:
-
id: overview
label: Overview
content: |
# Basic idea
Modern coding agents — Claude Code, OpenAI Codex CLI, Gemini CLI, Cursor — all support two extension mechanisms: a project-level **context file** that's loaded into every conversation, and **on-demand procedural workflows** (called skills, prompts, commands, or rules depending on the harness). This playbook ships both for DGX Station:
- An **`AGENTS.md`** with the critical DGX Station constraints your agent should always know (mixed coherency, GPU targeting, common pitfalls). `AGENTS.md` is the cross-harness standard; an `install.sh` lays it down as `CLAUDE.md`, `GEMINI.md`, or `AGENTS.md` depending on the agent you use.
- **Four Agent Skills** — `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose` — authored once in the [Anthropic Agent Skills format](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview) and installed into the right per-harness location (`.claude/skills/`, `$CODEX_HOME/skills/`, `.gemini/commands/`, or `.cursor/rules/`).
This approach keeps your agent's context lean in every conversation while giving it deep procedural knowledge on demand, regardless of which agent you use.
## AGENTS.md vs Agent Skill — why split?
| | AGENTS.md | Agent Skill |
|---|---|---|
| **Loaded** | Every conversation, automatically | Only when invoked by name (or matched by description, in Claude) |
| **Best for** | Constraints, pitfalls, "never do X" rules | Step-by-step workflows, deployment procedures |
| **Context cost** | Consumed every time | Zero until invoked |
The DGX Station mixed-coherency constraint (`--gpus all` will crash) should be in every conversation. The full vLLM deployment procedure should not.
# What you'll accomplish
- Install the `AGENTS.md` and four Agent Skills into your project directory for your chosen agent (Claude Code, Codex, Gemini CLI, or Cursor).
- Verify the agent loads the constraints automatically and the skills on demand.
- Invoke `vllm-setup` to deploy a vLLM inference server with validated configuration.
- Invoke `sglang-setup` to deploy an SGLang inference server.
- Invoke `mig-configure` to partition the GB300 into MIG instances.
- Invoke `dgx-diagnose` to troubleshoot common DGX Station issues.
# What to know before starting
- Basic familiarity with one supported coding agent (running it, giving it prompts, using slash commands or rule references)
- General understanding of DGX Station (two GPUs, Docker-based workflows)
# Prerequisites
- NVIDIA DGX Station with GB300
- One of the supported coding agents installed:
- **Claude Code:** `curl -fsSL https://claude.ai/install.sh | bash`
- **OpenAI Codex CLI:** `npm i -g @openai/codex`
- **Gemini CLI:** `npm i -g @google/gemini-cli`
- **Cursor:** download from `https://cursor.com/`
- A project directory where you do DGX Station work
# Ancillary files
- `assets/AGENTS.md` — canonical context file with critical constraints, GPU targeting, software versions, and common pitfalls. Cross-harness standard.
- `assets/skills/vllm-setup/SKILL.md` — skill: deploy vLLM with validated configuration.
- `assets/skills/sglang-setup/SKILL.md` — skill: deploy SGLang with validated configuration.
- `assets/skills/mig-configure/SKILL.md` — skill: configure MIG partitions on the GB300.
- `assets/skills/dgx-diagnose/SKILL.md` — skill: troubleshoot common DGX Station issues.
- `assets/install.sh` — per-harness installer (`claude`, `codex`, `gemini`, `cursor`, or `all`).
# Time & risk
* **Duration:** 10-15 minutes
* **Risk level:** Low — this playbook copies markdown files into your project directory
* **Rollback:** Delete the context file (`AGENTS.md` / `CLAUDE.md` / `GEMINI.md`) and the harness-specific skill directory (`.claude/skills/`, `.gemini/commands/`, or `.cursor/rules/`) from your project directory. For Codex, delete the four skill directories under `$CODEX_HOME/skills/` (default `~/.codex/skills/`)
* **Last Updated:** 05/18/2026
* Restructured as harness-agnostic Agent Skills (Claude Code, Codex, Gemini CLI, Cursor)
-
id: instructions
label: Instructions
content: |
# Step 1. Install your coding agent
Pick whichever agent you prefer — the rest of this playbook works the same regardless. Install commands:
| Agent | Install |
|-------|---------|
| Claude Code | `curl -fsSL https://claude.ai/install.sh \| bash` |
| OpenAI Codex CLI | `npm i -g @openai/codex` |
| Gemini CLI | `npm i -g @google/gemini-cli` |
| Cursor | Download from `https://cursor.com/` |
Verify with `claude --version`, `codex --version`, `gemini --version`, or by launching Cursor.
# Step 2. Install the skills into your project
Navigate to the project where you want DGX Station expertise, then run the installer with the harness you use:
```bash
cd ~/your-project
# Pick one:
/path/to/this/playbook/assets/install.sh claude
/path/to/this/playbook/assets/install.sh codex
/path/to/this/playbook/assets/install.sh gemini
/path/to/this/playbook/assets/install.sh cursor
# Or install for all four at once:
/path/to/this/playbook/assets/install.sh all
```
If you downloaded the playbook as a zip, the path is relative to the extracted directory:
```bash
station-ai-skills/assets/install.sh claude ~/your-project
```
The installer is additive for skill directories (won't clobber existing skills you've written) and refuses to overwrite an existing context file (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`) unless you pass `--force`.
**Resulting layout** (per harness):
```text
your-project/
AGENTS.md or CLAUDE.md or GEMINI.md # context file (named for your agent)
.claude/skills/<name>/SKILL.md # claude
$CODEX_HOME/skills/<name>/SKILL.md # codex (default ~/.codex, outside the project)
.gemini/commands/<name>.md # gemini
.cursor/rules/<name>.mdc # cursor
```
Where `<name>` is each of `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose`.
> [!NOTE]
> Every supported agent automatically reads the context file from the working directory at startup. Skills/prompts/rules in the harness-specific directory are discovered automatically — no additional configuration needed.
# Step 3. Verify the setup
Start your agent in the project directory and ask a question that requires constraint knowledge:
```text
Can I use --gpus all to run my CUDA workload on DGX Station?
```
The agent should immediately warn about the mixed-coherency constraint and recommend `--gpus '"device=N"'` targeting. If you don't get the warning, the context file isn't being loaded — see Troubleshooting.
Then verify the skills are discoverable:
| Agent | How to check |
|-------|--------------|
| Claude Code | Type `/` — `vllm-setup`, `sglang-setup`, `mig-configure`, `dgx-diagnose` should appear in the autocomplete |
| Codex CLI | Mention `$vllm-setup` (or ask it to "use vllm-setup"); restart Codex first if it was already running |
| Gemini CLI | Type `/` — same four names appear |
| Cursor | Open the Rules panel — same four rules appear |
# Step 4. Use vllm-setup to deploy an inference server
Invoke the skill in your agent:
| Agent | Invocation |
|-------|-----------|
| Claude Code | `/vllm-setup` (slash command) or just describe the task ("deploy vllm with Qwen3-8B") |
| Codex CLI | Mention `$vllm-setup` in your prompt, or ask it to "use vllm-setup" |
| Gemini CLI | `/vllm-setup` |
| Cursor | In chat: "use the vllm-setup rule to deploy a vllm server" |
The agent will walk you through deploying a vLLM server with a validated container image, correct GPU targeting, and recommended parameters. It will check your GPU index, ask which model you want to serve, and generate the full `docker run` command.
# Step 5. Use sglang-setup to deploy SGLang
Same invocation pattern, but for SGLang with the `cu130` container, RadixAttention prefix caching, and structured JSON output support.
# Step 6. Use mig-configure to partition the GB300
The agent will query your current MIG state, show available profiles, help you choose a layout for your workloads, and execute the partitioning commands.
# Step 7. Use dgx-diagnose to troubleshoot issues
If you encounter problems, invoke `dgx-diagnose`. The agent will check GPU status, driver version, running processes, MIG state, and Fabric Manager to identify the issue.
# Step 8. Customize
Both the `AGENTS.md` and the skills are plain markdown — extend them freely.
**Add project-specific constraints to `AGENTS.md`** (or your harness-specific context file):
```markdown
## Project-specific
- Our production MIG layout is 3g.139gb + 2g.70gb + 2g.70gb
- Always use port 8080 for inference (nginx proxy on 443)
- Model weights are cached at /data/models, mount with -v /data/models:/root/.cache/huggingface/hub
```
**Create new skills** by adding a directory and `SKILL.md` to `assets/skills/`, then re-run `install.sh`:
```bash
mkdir -p assets/skills/run-benchmarks
cat > assets/skills/run-benchmarks/SKILL.md << 'EOF'
---
name: run-benchmarks
description: Run our standard inference benchmark suite against the running vLLM or SGLang server and compare against the baseline.
---
# Run benchmarks
1. Check which inference server is running (vLLM on port 8000 or SGLang on port 30000)
2. Run the appropriate benchmark script from ./benchmarks/
3. Report throughput (tokens/sec), latency (TTFT, ITL), and memory utilization
4. Compare against the baseline in ./benchmarks/baseline.json
EOF
```
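Because the installer reads the `description:` line from each skill's frontmatter when generating Cursor rules, it can help to check a new `SKILL.md` before re-running `install.sh`. The following is a minimal sketch under that assumption; `check_frontmatter` is a hypothetical helper and does not validate YAML beyond the two keys.

```bash
# Hypothetical helper: succeed only if the file starts with a `---` frontmatter block
# that contains both a `name:` and a `description:` key before the closing `---`.
check_frontmatter() {
    awk '
        NR == 1 && $0 != "---" { exit 1 }
        NR == 1                { next }
        $0 == "---"            { closed = 1; exit }
        /^name: /              { has_name = 1 }
        /^description: /       { has_desc = 1 }
        END { exit !(closed && has_name && has_desc) }
    ' "$1"
}

# Example usage:
#   check_frontmatter assets/skills/run-benchmarks/SKILL.md || echo "frontmatter incomplete" >&2
```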
> [!TIP]
> Keep `AGENTS.md` focused on constraints and pitfalls (things that break). Put procedural workflows in skills (things you do step-by-step).
-
id: troubleshooting
label: Troubleshooting
content: |
# Skills don't appear in autocomplete / aren't discoverable
Each agent discovers skills from a harness-specific directory in the current directory (or a parent). Check the right one:
| Agent | Expected location |
|-------|-------------------|
| Claude Code | `.claude/skills/<name>/SKILL.md` |
| Codex CLI | `$CODEX_HOME/skills/<name>/SKILL.md` (default `~/.codex/skills/`) |
| Gemini CLI | `.gemini/commands/<name>.md` |
| Cursor | `.cursor/rules/<name>.mdc` |
```bash
# Examples — check the directory for your agent
ls -la .claude/skills/
ls -la "${CODEX_HOME:-$HOME/.codex}/skills/"
ls -la .gemini/commands/
ls -la .cursor/rules/
```
You should see entries for `vllm-setup`, `sglang-setup`, `mig-configure`, and `dgx-diagnose`.
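The same check can be scripted for the Claude Code layout. This is a sketch only: `missing_skills` is a hypothetical helper, and it assumes the `<dir>/<name>/SKILL.md` structure shown in the table above.

```bash
# Hypothetical helper: report which of the four skills are missing from a
# <dir>/<name>/SKILL.md layout. Prints one missing name per line and returns
# non-zero if any skill is missing.
missing_skills() {
    _ms_dir="$1"
    _ms_status=0
    for _ms_name in vllm-setup sglang-setup mig-configure dgx-diagnose; do
        if [ ! -f "$_ms_dir/$_ms_name/SKILL.md" ]; then
            printf '%s\n' "$_ms_name"
            _ms_status=1
        fi
    done
    return "$_ms_status"
}

# Example usage:
#   missing_skills .claude/skills || echo "re-run install.sh claude" >&2
```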
**Check you're in the right directory:**
```bash
pwd
```
The agent must be started from the directory containing the harness directory, or a subdirectory of it.
# Context file not loaded
If the agent gives generic answers without DGX Station awareness, the context file isn't being picked up. Each agent reads a different filename — verify the one for your agent exists:
| Agent | Expected filename |
|-------|-------------------|
| Claude Code | `CLAUDE.md` (also reads `AGENTS.md` as fallback) |
| Codex CLI | `AGENTS.md` |
| Gemini CLI | `GEMINI.md` |
| Cursor | `AGENTS.md` |
```bash
# Verify the file exists for your agent
head -n 5 AGENTS.md
head -n 5 CLAUDE.md
head -n 5 GEMINI.md
# Restart the agent in the correct directory
cd ~/your-project
claude # or codex, gemini, etc.
```
All four agents read the context file from the working directory (and parent directories up to the project root).
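To see which context file an agent started in a subdirectory would be expected to find, the upward search can be approximated as follows. This is a rough sketch: `find_context_file` is a hypothetical helper that walks up to the filesystem root, whereas each agent's actual discovery rules (for example, where it considers the project root to be) may differ.

```bash
# Hypothetical helper: look for <filename> in <start-dir> (default: current directory)
# and then in each parent directory, printing the first match.
# Returns non-zero if no match is found before reaching "/".
find_context_file() {
    _fc_name="$1"
    _fc_dir="$(cd "${2:-.}" && pwd)" || return 1
    while :; do
        if [ -f "$_fc_dir/$_fc_name" ]; then
            printf '%s\n' "$_fc_dir/$_fc_name"
            return 0
        fi
        if [ "$_fc_dir" = "/" ]; then
            return 1
        fi
        _fc_dir="$(dirname "$_fc_dir")"
    done
}

# Example usage:
#   find_context_file AGENTS.md || echo "no AGENTS.md found above $(pwd)" >&2
```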
# Skill gives outdated information
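The skills contain validated container versions and parameters as of the publication date. If a newer container is available, edit the canonical source and re-install. For Claude Code, Gemini CLI, and Cursor, the installer skips skill files that already exist even with `--force`, so delete the installed copy of the skill before re-running it: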
```bash
nano /path/to/playbook/assets/skills/vllm-setup/SKILL.md
/path/to/playbook/assets/install.sh all --force
```
Or edit the installed copy directly:
```bash
# Claude Code
nano .claude/skills/vllm-setup/SKILL.md
# Codex
nano "${CODEX_HOME:-$HOME/.codex}/skills/vllm-setup/SKILL.md"
# Gemini CLI
nano .gemini/commands/vllm-setup.md
# Cursor
nano .cursor/rules/vllm-setup.mdc
```
> [!TIP]
> Skills are plain markdown — you can version them in git alongside your project code.
# "Both GPUs cannot be used" errors
This is the mixed-coherency constraint working as intended. If you see CUDA errors when using `--gpus all`:
```bash
# Find the GB300 index
nvidia-smi --query-gpu=index,name --format=csv,noheader
# Use device-specific targeting (replace 1 with the GB300 index found above)
docker run --gpus '"device=1"' ...
```
The `AGENTS.md` covers this constraint, but if you removed that section, add it back — it's the most important piece of DGX Station knowledge.
# Skills conflict with existing project directory
If your project already has a `.claude/`, `.codex/`, `.gemini/`, or `.cursor/` directory with its own contents, `install.sh` is **additive** for skill directories — it adds the new skill files alongside whatever you already have and warns on collision rather than overwriting.
For context files (`AGENTS.md`, `CLAUDE.md`, `GEMINI.md`), the installer **refuses** to overwrite an existing file. Pass `--force` to override, or merge the new content manually:
```bash
# See what would be written
diff /path/to/playbook/assets/AGENTS.md ./AGENTS.md
# Force overwrite
/path/to/playbook/assets/install.sh claude . --force
```
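If you would rather keep your existing context file than overwrite it, the manual merge can be sketched as an idempotent append. This is one possible approach, not something `install.sh` does; `append_context` and its marker comment are hypothetical.

```bash
# Hypothetical helper: append the playbook's context to an existing context file once,
# instead of overwriting it. The marker string is an arbitrary choice for this sketch;
# it lets a second run detect that the content was already merged.
append_context() {
    _ac_src="$1"
    _ac_dest="$2"
    _ac_marker="<!-- dgx-station-context -->"
    if [ -f "$_ac_dest" ] && grep -qF -- "$_ac_marker" "$_ac_dest"; then
        return 0   # already merged; nothing to do
    fi
    {
        printf '\n%s\n' "$_ac_marker"
        cat "$_ac_src"
    } >> "$_ac_dest"
}

# Example usage:
#   append_context /path/to/playbook/assets/AGENTS.md ./AGENTS.md
```

Appending keeps your own rules first in the file; review the result for duplicated or conflicting constraints afterwards.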
# Installer reports "WROTE" for some files but "SKIP" for others
That's the safe-by-default behavior. The installer skips any file that already exists, prints a warning, and continues with the rest. To get a clean install, either:
1. Delete the existing files first: `rm -rf .claude/skills/{vllm-setup,sglang-setup,mig-configure,dgx-diagnose}`
2. Or pass `--force` (overwrites context files and Codex skills; skill files for Claude Code, Gemini CLI, and Cursor are still skipped if present)
resources:
- name: Anthropic Agent Skills Overview
url: https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview
- name: AGENTS.md Standard
url: https://agents.md/
- name: Claude Code Documentation
url: https://docs.anthropic.com/en/docs/claude-code
- name: OpenAI Codex AGENTS.md Guide
url: https://developers.openai.com/codex/guides/agents-md
- name: Gemini CLI Custom Commands
url: https://geminicli.com/docs/cli/custom-commands/
- name: Cursor Rules Documentation
url: https://docs.cursor.com/
- name: vLLM Documentation
url: https://docs.vllm.ai/en/latest/
- name: SGLang Documentation
url: https://docs.sglang.io/
- name: MIG User Guide
url: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/


@ -0,0 +1,2 @@
# REPLACE THIS WITH YOUR MODEL CARD
https://gitlab-master.nvidia.com/api-catalog/examples/-/blob/main/modelcard-example-mixtral8x7b.md?ref_type=heads


@ -1,4 +1,4 @@
# Station Register to Brev
# Register DGX Station to Brev
> Link your DGX Station to Brev for remote access and sharing
@ -27,7 +27,7 @@ You'll register your DGX Station with Brev and it will be visible as a healthy
While Brev automates the complex configuration, understanding a few key concepts when establishing the initial connection will be useful:
* **Terminal Basics**:
* Familiarity with the command line to run a few simple setup commands
* Familiarity with command-line use to run a few simple setup commands.
## Prerequisites
@ -45,23 +45,28 @@ You will also need the following:
* **Estimated time:** 5-10 minutes
* **Risk level:** Low - Registration configures the Station for secure remote access without altering your existing workloads
* **Rollback:** The Brev configuration can be removed through the UI and CLI
* **Last Updated:** 05/29/2026
* First Publication
## Instructions
## Step 1. Login to Brev
## Step 1. Log in to Brev
Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you're in the correct org (by clicking the org button on the top right hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation.
Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you're in the correct org (by clicking the org button on the top right-hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation.
Click the “Register Compute” button and follow the instructions in the pop-up window.
## Step 2. Complete Popup Instructions
## Step 2. Complete Pop-up Instructions
* Install the Brev CLI
* Configure your compute
* Add a name for compute
* To configure ssh, ensure the “Enable SSH access” toggle is on
* To configure SSH, ensure the “Enable SSH access” toggle is on
* Run the registration command
> [!IMPORTANT]
> Run the Brev CLI install command **without `sudo`**. Prefixing the installer with `sudo` writes the `brev` binary into root's home directory, which is not on your user shell's `PATH` — the next command will fail with `brev: command not found`. Copy the install command from the pop-up and run it as your normal user.
## Step 3. Follow Registration Flow
In the CLI, you'll be walked through registration. Go through the flow until registration is complete.
@ -70,7 +75,7 @@ In the CLI, you'll be walked through registration. Go through the flow until r
* Go to the [Brev UI](https://brev.nvidia.com)
* Navigate to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute)
* Confirm that the DGX Station appears as a registered node with an **Available** status
* Confirm that the DGX Station appears as a registered node with a **Connected** status
## Step 5. Next Steps
@ -78,7 +83,14 @@ Your DGX Station is now integrated into Brev as a secure, remotely accessible GP
Now that your hardware is connected, you can:
* **Share Access Anywhere:** Access your machine from anywhere and share access with others through the Brev UI under [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
* **Access your machine from anywhere:** Open the [Brev UI](https://brev.nvidia.com) and launch a session from [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
* **Share access with others:** Invite teammates to your DGX Station from the Brev UI:
* Go to the [Brev UI](https://brev.nvidia.com) and open [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
* Find your DGX Station in the list and open the row's three-dot (⋯) menu.
* Select **Share Access**.
* Enter the email address of the person you want to share with.
* Choose their role / permission level.
* Confirm to send the invitation.
## Step 6. Cleanup
@ -93,12 +105,12 @@ brev deregister
In the UI:
* Go to the [Brev UI](https://brev.nvidia.com)
* Navigate to the section listing “GPU Environments” and look under “Registered Compute”
* Click the “Deregister” menu item on the device you wish to delete from Brev
* Confirm your selection
* Click the “Remove” menu item on the device you wish to delete from Brev.
* Confirm your selection.
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| Your DGX Station is showing up in the wrong org | It was registered to the wrong org | Run `brev set <my-org>` and then redo the registration process |
| Unable to `brev shell <name>` | Need to refresh | `brev refresh` |
| Your DGX Station is showing up in the wrong org | It was registered to the wrong org | Run `brev set <my-org>` and then redo the registration process. |
| Unable to `brev shell <name>` | Need to refresh | `brev refresh`. |

View File

@ -1,7 +1,7 @@
kind: Playbook
metadata:
name: station-brev
displayName: Station Register to Brev
displayName: Register DGX Station to Brev
shortDescription: Link your DGX Station to Brev for remote access and sharing
publisher: nvidia
description: |
@ -10,8 +10,7 @@ metadata:
labelsV2:
- gpuType:playbook:gpu_type_station
- DGX
- Station
- DGX Station
- Brev
attributes:
@ -53,7 +52,7 @@ spec:
While Brev automates the complex configuration, understanding a few key concepts when establishing the initial connection will be useful:
* **Terminal Basics**:
* Familiarity with the command line to run a few simple setup commands
* Familiarity with command-line use to run a few simple setup commands.
# Prerequisites
@ -71,6 +70,8 @@ spec:
* **Estimated time:** 5-10 minutes
* **Risk level:** Low - Registration configures the Station for secure remote access without altering your existing workloads
* **Rollback:** The Brev configuration can be removed through the UI and CLI
* **Last Updated:** 05/29/2026
* First Publication
@ -79,20 +80,23 @@ spec:
label: Instructions
content: |
# Step 1. Login to Brev
# Step 1. Log in to Brev
Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you're in the correct org (by clicking the org button on the top right hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation.
Go to the [Brev UI](https://brev.nvidia.com), log in, and confirm you're in the correct org (by clicking the org button on the top right-hand side of the page). Once logged in, go to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section under the "GPU" tab in the main navigation.
Click the “Register Compute” button and follow the instructions in the pop-up window.
# Step 2. Complete Popup Instructions
# Step 2. Complete Pop-up Instructions
* Install the Brev CLI
* Configure your compute
* Add a name for compute
* To configure ssh, ensure the “Enable SSH access” toggle is on
* To configure SSH, ensure the “Enable SSH access” toggle is on
* Run the registration command
> [!IMPORTANT]
> Run the Brev CLI install command **without `sudo`**. Prefixing the installer with `sudo` writes the `brev` binary into root's home directory, which is not on your user shell's `PATH` — the next command will fail with `brev: command not found`. Copy the install command from the pop-up and run it as your normal user.
# Step 3. Follow Registration Flow
In the CLI, you'll be walked through registration. Go through the flow until registration is complete.
@ -101,7 +105,7 @@ spec:
* Go to the [Brev UI](https://brev.nvidia.com)
* Navigate to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute)
* Confirm that the DGX Station appears as a registered node with an **Available** status
* Confirm that the DGX Station appears as a registered node with a **Connected** status
# Step 5. Next Steps
@ -109,7 +113,14 @@ spec:
Now that your hardware is connected, you can:
* **Share Access Anywhere:** Access your machine from anywhere and share access with others through the Brev UI under [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
* **Access your machine from anywhere:** Open the [Brev UI](https://brev.nvidia.com) and launch a session from [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
* **Share access with others:** Invite teammates to your DGX Station from the Brev UI:
* Go to the [Brev UI](https://brev.nvidia.com) and open [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
* Find your DGX Station in the list and open the row's three-dot (⋯) menu.
* Select **Share Access**.
* Enter the email address of the person you want to share with.
* Choose their role / permission level.
* Confirm to send the invitation.
# Step 6. Cleanup
@ -124,8 +135,8 @@ spec:
In the UI:
* Go to the [Brev UI](https://brev.nvidia.com)
* Navigate to the section listing “GPU Environments” and look under “Registered Compute”
* Click the “Deregister” menu item on the device you wish to delete from Brev
* Confirm your selection
* Click the “Remove” menu item on the device you wish to delete from Brev.
* Confirm your selection.
@ -136,8 +147,8 @@ spec:
content: |
| Symptom | Cause | Fix |
|---------|-------|-----|
| Your DGX Station is showing up in the wrong org | It was registered to the wrong org | Run `brev set <my-org>` and then redo the registration process |
| Unable to `brev shell <name>` | Need to refresh | `brev refresh` |
| Your DGX Station is showing up in the wrong org | It was registered to the wrong org | Run `brev set <my-org>` and then redo the registration process. |
| Unable to `brev shell <name>` | Need to refresh | `brev refresh`. |

View File

@ -45,18 +45,16 @@ spec:
content: |
# Basic idea
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, midtraining (conversation format), supervised fine-tuning (SFT), and inference via CLI or web UI.
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.
The project uses the PyTorch NGC container, FineWeb for pretraining, SmolTalk for SFT, and Weights & Biases for logging. The default speedrun configuration trains a 561M-parameter (d20) model suitable for learning and experimentation.
The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision.
# What you'll accomplish
You will have a working nanochat setup that trains a small LLM and serves it for chat.
- **Environment:** Docker image with PyTorch and nanochat dependencies on your DGX Station.
- **Training pipeline:** Tokenizer (65K BPE), pretraining (~11.2B tokens), midtraining, SFT, and automated report generation.
- **Inference:** ChatGPT-style web UI and CLI to chat with the base, mid, or SFT checkpoints.
- **Monitoring:** W&B dashboards and `nanochat/report.md` with metrics and samples.
- **Environment:** Docker image with PyTorch NGC and nanochat dependencies on your DGX Station.
- **Training pipeline:** BPE tokenizer (65K vocab), base model pretraining with FP8, SFT, and automated report generation.
- **Inference:** ChatGPT-style web UI and CLI to chat with the base or SFT checkpoints.
- **Monitoring:** W&B dashboards and `nanochat_cache/report/report.md` with metrics and samples.
# What to know before starting
@ -68,36 +66,58 @@ spec:
**Hardware:**
- NVIDIA DGX Station with GB300 Ultra Superchip.
- Sufficient GPU memory for the chosen model (the GB300 Ultra provides ample memory for the d20 speedrun).
- Adequate storage for cache (~24GB+ for FineWeb data and checkpoints).
- NVIDIA DGX Station with GB300 Ultra Superchip (288GB VRAM).
- Adequate storage for cache (~25GB+ for FineWeb data and checkpoints).
**Software:**
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi`
- Network access to download datasets (Hugging Face, FineWeb) and container images.
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`
- Network access to download datasets (Hugging Face, FineWeb) and container images (nvcr.io)
- [Weights & Biases](https://wandb.ai/) account and API key.
- [Hugging Face](https://huggingface.co/docs/hub/en/security-tokens) token for evaluation datasets.
# Model architecture (d24)
```
Layers: 24
Attention Heads: 12
Head Dimension: 128
Context Length: 2048 tokens
Vocabulary Size: 65,536 (2^16, trained BPE)
Precision: FP8 (e4m3, tensorwise scaling)
```
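As a rough cross-check of the "~1B parameter" figure, the values above can be plugged into a common back-of-the-envelope estimate. This sketch assumes a standard Transformer block (about 12·d² weights per layer for attention plus a 4x MLP) and untied input/output embeddings. nanochat's exact count may differ, since this ignores any extra parameters the implementation adds.

```bash
# Back-of-the-envelope parameter estimate for the d24 configuration.
# Assumptions (not from the playbook): ~12 * d_model^2 weights per layer,
# plus separate input and output embedding tables.
layers=24
heads=12
head_dim=128
vocab=65536

d_model=$((heads * head_dim))                       # 12 * 128 = 1536
block_params=$((12 * layers * d_model * d_model))   # attention + MLP weights
embed_params=$((2 * vocab * d_model))               # input + output embeddings
total_params=$((block_params + embed_params))

echo "d_model=$d_model"
echo "total_params=$total_params"                   # 880803840, roughly 0.9B
```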
# Training stages
| Stage | Description |
|-------|-------------|
| Tokenizer | Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb |
| Base pretraining | Pretrains d24 model on FineWeb with FP8, target data:param ratio of 8 |
| SFT | Fine-tunes on synthetic identity conversations + SmolTalk |
| Report | Generates `report.md` with metrics, samples, and system info |
# Ancillary files
All required assets are in the playbook directory `nvidia/station-nanochat/assets` (see the [dgx-spark-playbooks](https://github.com/NVIDIA/dgx-spark-playbooks) repository).
- `assets/Dockerfile`: PyTorch NGC image plus nanochat dependencies and venv.
- `assets/setup.sh`: Clones nanochat, checks out the supported commit, and builds the Docker image.
- `assets/launch.sh`: Runs the training container on your DGX Station (runs the full pipeline: tokenizer, pretrain, midtrain, SFT, and report generation).
- `assets/README.md`: Additional detail on training stages, inference, and troubleshooting.
All required assets are in `nvidia/station-nanochat/assets/`:
- `Dockerfile`: PyTorch NGC image with nanochat pip dependencies.
- `setup.sh`: Clones nanochat, checks out the supported commit, copies `speedrun_station.sh`, and builds the Docker image.
- `launch.sh`: Runs the training container (full pipeline: tokenizer → pretrain → SFT → report).
- `speedrun_station.sh`: Modified speedrun script adapted for single-GPU DGX Station.
# Time & risk
- **Estimated time:** About 30 minutes for clone, setup, and launching the run. Full d20 speedrun training time depends on your DGX Station configuration (hours to a day or more).
- **Risk level:** Medium
- Large downloads (FineWeb) can fail or be slow; ensure stable network and disk space.
- API keys (W&B, HF) must be set or the launch script will exit.
- **Rollback:** Stop containers with `docker stop`, remove caches under `~/.cache/nanochat` (or paths in `launch.sh`), and run `docker system prune -a` if needed.
* **Last Updated:** 03/02/2026
* First Publication
- **Estimated time:** ~30 minutes for setup. Full d24 training takes on the order of 16+ hours on a single GB300 Ultra.
- **Risk level:** Medium
- Large downloads (FineWeb) can be slow; ensure stable network and disk space.
- API keys (W&B, HF) must be set or `launch.sh` will exit immediately.
- **Rollback:** Stop containers with `docker stop`, remove caches, and run `docker system prune -a` if needed.
# Credits
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy
- [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) by HuggingFace (pretraining data)
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) by HuggingFace (SFT data)
@ -108,69 +128,86 @@ spec:
content: |
# Step 1. Prerequisites and environment
This playbook is for **DGX Station** (single node). Ensure your DGX Station has Docker with NVIDIA runtime, GPU access, and required API keys. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
Ensure your DGX Station has Docker with NVIDIA runtime and GPU access. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
```bash
# Verify GPU and Docker
nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi
```
Expected output should show your GPU(s) and driver version. Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you do not have them.
Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you don't have them. Export both keys in your shell:
```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
```
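Because `launch.sh` exits when either key is missing, it can be convenient to check both in the current shell first. The following is a minimal sketch: `require_env` is a hypothetical helper name, not something the playbook provides, and it relies on bash indirect expansion.

```bash
# Hypothetical helper: return non-zero if any named variable is unset or empty.
require_env() {
  local name missing=0
  for name in "$@"; do
    # ${!name:-} expands the variable whose name is stored in $name (bash only)
    if [ -z "${!name:-}" ]; then
      echo "$name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: require_env WANDB_API_KEY HF_TOKEN && ./launch.sh
```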
# Step 2. Clone the playbook and set up nanochat
# Step 2. Clone and set up
Clone the playbook repository and run the setup script to clone the nanochat repo and build the Docker image.
Clone the playbook repository and navigate to the assets directory:
```bash
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-nanochat/assets
```
From the `assets` directory, run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, and builds the `nanochat` Docker image (PyTorch NGC base with tiktoken, tokenizers, datasets, wandb, etc.).
Run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, copies the station-adapted `speedrun_station.sh`, and builds the `nanochat` Docker image (PyTorch NGC base with dependencies):
```bash
./setup.sh
```
Setup may take several minutes while the image builds. Verify the image:
You should see the `nanochat` image listed if you run `docker images`. Your directory structure after setup should look like this:
```bash
docker images | grep nanochat
```
assets/
├── Dockerfile
├── launch.sh
├── setup.sh
├── speedrun_station.sh
└── nanochat/
```
You should see the `nanochat` image listed.
# Step 3. Launch training
# Step 3. Launch full training
> [!NOTE]
> The default `launch.sh` uses cache directories under `/nanochat_cache`. If that path does not exist on your DGX Station, edit `launch.sh` and replace those paths with your own (e.g. `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`), and create the directories before running.
To run **full** training (d20 model, 240 shards, midtraining, SFT, report) for higher-quality results, use the full launcher. On a DGX Station with GB300 Ultra this can take on the order of 16 hours:
Ensure your API keys are exported, then launch:
```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
./launch_full.sh
./launch.sh
```
This runs `speedrun_full.sh` inside the container: full FineWeb download (240 shards), 561M-parameter (d20) pretraining, midtraining, supervised fine-tuning, and report generation.
The training runs inside the `nanochat` container and executes the full pipeline automatically:
# Step 4. Verify and use the model
1. **Tokenizer training** — downloads ~2B characters from FineWeb, trains a 65K BPE tokenizer
2. **Base model pretraining** — downloads additional FineWeb shards, pretrains a d24 model (~1B params) with FP8
3. **SFT** — downloads synthetic identity conversations, fine-tunes for chat
4. **Report generation** — produces `report.md` with metrics and samples
After training completes, checkpoints and the tokenizer are under `~/.cache/nanochat/` (or the cache path used in `launch.sh`). Run inference from the nanochat directory (e.g. `assets/nanochat`) on your DGX Station.
Training on a single GB300 Ultra takes on the order of 16+ hours for the full d24 run.
# Step 4. Monitor training
**W&B dashboard:**
Track training at [wandb.ai](https://wandb.ai/) under the `nanochat` project. The training logs print the exact link to the W&B run. Key metrics:
- Training loss
- Validation BPB
- Throughput (tokens/sec)
# Step 5. Inference
After training, checkpoints are saved under the `nanochat_cache/` directory. Run inference from inside the container or interactively:
**Web UI (recommended):**
```bash
cd nanochat
source ../.venv/bin/activate # if using venv from container context; otherwise use the container
python -m scripts.chat_web
docker run --rm --gpus all --net=host \
-v $(pwd)/nanochat:/workspace/nanochat \
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
-w /workspace/nanochat \
nanochat \
python -m scripts.chat_web
```
Open a browser to `http://<STATION_IP>:8000` where `<STATION_IP>` is your DGX Station's IP address.
@ -178,14 +215,15 @@ spec:
**CLI:**
```bash
cd nanochat
python -m scripts.chat_cli -p "Why is the sky blue?"
python -m scripts.chat_cli -i sft -p "Write a haiku about machine learning"
docker run --rm -it --gpus all \
-v $(pwd)/nanochat:/workspace/nanochat \
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
-w /workspace/nanochat \
nanochat \
python -m scripts.chat_cli -p "Why is the sky blue?"
```
A full report is generated at `nanochat/report.md` after the run. You can also monitor training at [wandb.ai](https://wandb.ai/) under your project.
# Step 5. Cleanup
# Step 6. Cleanup
To stop training early, interrupt the launch script or stop the container:
@ -195,23 +233,32 @@ spec:
```bash
# If launch.sh is running: press Ctrl+C
# Or stop the container by name
# Or stop the container directly
docker stop $(docker ps -q --filter ancestor=nanochat)
```
To free disk space after training (use the same path as your cache if you set `NANOCHAT_CACHE`):
To free disk space:
```bash
rm -rf ./nanochat_cache ./hf_cache
docker system prune -a
```
# Step 6. Next steps and customization
# Step 7. Customization
- **Small scale run:** `./launch.sh` can run a lite training by following the customization guide to make changes to `speedrun_station.sh`. This can potentially bring down the training time.
- **Custom cache paths:** Set `NANOCHAT_CACHE` and `HF_CACHE` before launching (e.g. `export NANOCHAT_CACHE=/path/to/nanochat_cache`) if you want cache outside the assets directory.
- **Monitoring:** Use `nvidia-smi` and W&B dashboards to watch GPU utilization and training metrics (loss, throughput).
- **Inference:** Try the web UI and CLI with different checkpoints (`base`, `mid`, `sft`) and prompts; see sample prompts in `assets/README.md`.
**Smaller/faster run:** Edit `speedrun_station.sh` before running setup to reduce data and model size:
```bash
# Fewer data shards (10 instead of default)
python -m nanochat.dataset -n 10 &
# Smaller model (d4 instead of d24), smaller batch size
python -m scripts.base_train --depth=4 --device-batch-size=32
```
**Batch size:** The default `--device-batch-size=64` is tuned for the GB300's 288GB VRAM. Raise it if GPU utilization is low, or lower it if training runs out of memory (OOM).
Then re-run `./setup.sh` to rebuild with the changes.
@ -221,14 +268,16 @@ spec:
label: Troubleshooting
content: |
| Symptom | Cause | Fix |
|--------|--------|-----|
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<your_key>` and `export HF_TOKEN=<your_token>` in the same shell, then run `./launch.sh`. |
| `RuntimeError: CUDA out of memory` | Batch size or model too large for GPU | In the training script in the cloned nanochat repo (e.g. `speedrun.sh`), reduce `--device_batch_size` (e.g. `16`, `8`, `4`, `2`, or `1`). |
| Docker container not starting or no GPU | Docker or NVIDIA runtime misconfigured | Run `nvidia-smi` on your DGX Station. Check no other containers hold GPUs: `docker ps`. Test GPU in Docker: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`. |
| `Permission denied` or `No such file or directory` for cache paths in `launch.sh` | Paths like `/home/scratch.lramesh_dpt/...` don't exist on your system | Edit `launch.sh`: set cache dirs to paths you can create (e.g. `$(pwd)/nanochat_cache`, `$(pwd)/hf_cache`). Run `mkdir -p <your_cache_dirs>` and re-run `launch.sh`. |
| `nanochat` image not found when running `launch.sh` | Setup not run or build failed | From `nvidia/nanochat/assets`, run `./setup.sh` and confirm with `docker images` (look for the `nanochat` image). |
| Training exits immediately or script doesn't wait | Container fails early (missing keys, paths, or OOM) | Check container logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars, cache paths, or batch size as above. |
| Wrong cache path or "No such file" when launching | `launch.sh` uses non-existent paths (e.g. `/home/scratch...`) | On DGX Station, edit `launch.sh`: replace cache dirs with `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`, then run `mkdir -p nanochat_cache hf_cache`. |
|---------|-------|-----|
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<key>` and `export HF_TOKEN=<token>` in the same shell, then re-run `./launch.sh` |
| `RuntimeError: CUDA out of memory` | Batch size too large for available VRAM | Edit `speedrun_station.sh`: reduce `--device-batch-size` below the default of 64 (try 32, 16, 8). Re-run `./setup.sh` then `./launch.sh` |
| Docker container exits immediately | Missing env vars, bad cache paths, or build failure | Check logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars or paths as needed |
| `nanochat` image not found | Setup not run or Docker build failed | From the `assets/` directory, run `./setup.sh` and confirm with `docker images \| grep nanochat` |
| `No such file or directory` for cache paths | Cache directories don't exist | `launch.sh` creates them automatically under `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`. If using custom paths, create them: `mkdir -p $NANOCHAT_CACHE $HF_CACHE` |
| Training hangs at "Waiting for dataset download" | Network issue downloading FineWeb shards | Check network connectivity. The download can take time depending on bandwidth. If it persists, restart `./launch.sh` |
| W&B shows wrong user / stale login | Cached W&B credentials in container volume | `speedrun_station.sh` runs `wandb login --relogin` with your key automatically. Ensure `WANDB_API_KEY` is correct |
| Container runs but `launch.sh` says "Training complete!" immediately | Container failed fast and exited before the poll loop detected it | Check `docker ps -a` for the exited container and inspect logs with `docker logs <id>` |
| GPU not visible inside container | Docker NVIDIA runtime not configured | Test: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`. If it fails, install/configure [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) |

File diff suppressed because it is too large Load Diff

View File

@ -1,8 +1,8 @@
kind: Playbook
metadata:
name: station-vllm
displayName: Serve Qwen3-235B with vLLM
shortDescription: Set up vLLM server with Qwen3-235B on DGX Station
displayName: vLLM for Inference
shortDescription: Install and use vLLM on DGX Station
publisher: nvidia
description: |
# REPLACE THIS WITH YOUR MODEL CARD
@ -15,7 +15,7 @@ metadata:
attributes:
- key: DURATION
value: 20 MIN
value: 30 MIN
spec:
artifactName: station-vllm
@ -42,7 +42,9 @@ spec:
# What you'll accomplish
Serve the **Qwen3-235B-A22B-NVFP4** model using vLLM on NVIDIA DGX Station. This 235B parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU.
Serve a **supported model** using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models.
You'll set up vLLM high-throughput LLM serving on NVIDIA DGX Station with Blackwell architecture.
# What to know before starting
@ -57,21 +59,33 @@ spec:
- HuggingFace account with access token
- Network access to NGC and HuggingFace
# Model Support Matrix
The following models are supported with vLLM on DGX Station. All listed models are available and ready to use:
| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
| **Kimi-K2.5 (1T)** | NVFP4 | ✅ | [`nvidia/Kimi-K2.5-NVFP4`](https://huggingface.co/nvidia/Kimi-K2.5-NVFP4) |
| **DeepSeek-V4-Flash** | NVFP4 | ✅ | [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) |
# Time & risk
* **Duration:** 15-20 minutes (longer on first run due to model download)
* **Duration:** 30 minutes (longer on first run due to model download)
* **Risks:** Model download requires HuggingFace authentication
* **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 03/02/2026
* First Publication
* **Last Updated:** 05/29/2026
* Update models
* Add base configuration example, per-setting explanations, and DeepSeek-V4-Flash recipe
-
id: instructions
label: Serve Qwen3-235B
label: Instructions
content: |
# Step 1. Set up Docker permissions
@ -92,7 +106,7 @@ spec:
export HF_TOKEN="your_huggingface_token"
# Model to serve
export MODEL_HANDLE="nvidia/Qwen3-235B-A22B-NVFP4"
export MODEL_HANDLE="<HF_HANDLE>"
# Maximum context length
export MAX_MODEL_LEN=8192
@ -106,9 +120,28 @@ spec:
docker pull nvcr.io/nvidia/vllm:26.01-py3
```
For Step-3.7-Flash models, pull the custom vLLM container:
```bash
docker pull vllm/vllm-openai:stepfun37
```
For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, pull the **26.03** image, which includes the `--cpu-offload-params` support used below:
```bash
docker pull nvcr.io/nvidia/vllm:26.03-py3
```
For DeepSeek-V4-Flash, pull the stable DeepSeek-V4 release container. Use the **cu130** build on DGX Station (Blackwell):
```bash
docker pull vllm/vllm-openai:v0.20.0-cu130
```
# Step 4. Start vLLM server
Start the vLLM server with the Qwen3-235B model. This model fits entirely in VRAM on the GB300. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
## Base configuration (most models)
This is the recommended starting point for any model that fits entirely in VRAM on the GB300. The Qwen3-235B-A22B-NVFP4 model, for example, runs directly with this configuration.
```bash
docker run -d \
@ -126,6 +159,122 @@ spec:
--gpu-memory-utilization 0.9
```
Settings used:
- `--max-model-len` — maximum context length (prompt + output) per request. Larger values reserve more GPU memory for the KV cache; size it to your workload.
- `--gpu-memory-utilization 0.9` — fraction of GPU memory vLLM may use for weights and KV cache. `0.9` leaves headroom for other processes; raise toward `0.95` to fit more KV cache if the GPU is dedicated.
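To get a feel for what the utilization fraction means in absolute terms, the budget can be computed directly. This sketch assumes a GPU with 288 GB of memory (an illustrative figure for the GB300; read your actual total from `nvidia-smi`), and `vram_budget_gb` is a hypothetical helper name.

```bash
# Hypothetical helper: memory vLLM may claim for weights + KV cache,
# given total GPU memory (GB) and a --gpu-memory-utilization fraction.
vram_budget_gb() {
  awk -v total="$1" -v util="$2" 'BEGIN { printf "%.1f\n", total * util }'
}

vram_budget_gb 288 0.9    # prints 259.2
vram_budget_gb 288 0.95   # prints 273.6
```

Under these assumptions, moving from `0.9` to `0.95` frees roughly 14 GB more for the KV cache.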
## Step-3.7-Flash (FP8 / NVFP4)
For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
vllm/vllm-openai:stepfun37 \
"$MODEL_HANDLE" \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--kv-cache-dtype fp8
```
Settings used (in addition to the base configuration):
- `--trust-remote-code` — allows the model's custom modeling code (shipped in its repo) to load. Required for Step-3.7.
- `--reasoning-parser step3p5` — parses the model's reasoning/thinking tokens into the dedicated `reasoning_content` response field.
- `--enable-auto-tool-choice` — lets the model decide when to call a tool, enabling OpenAI-compatible function calling.
- `--tool-call-parser step3p5` — parses the model's tool-call output into structured `tool_calls`. Pairs with `--enable-auto-tool-choice`.
- `--kv-cache-dtype fp8` — stores the KV cache in FP8, roughly halving KV-cache memory versus 16-bit and allowing more concurrent/longer sequences.
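A request that exercises the tool-calling and reasoning flags might look like the following sketch. It assumes the server started above is reachable on `localhost:8000`; the helper name `build_tool_request` and the `get_weather` tool are hypothetical and exist only for illustration.

```bash
# Print an OpenAI-style chat completion payload that offers one tool.
# Arg 1: the served model name (for example "$MODEL_HANDLE").
build_tool_request() {
  local model="$1"
  cat <<EOF
{
  "model": "${model}",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Hypothetical example tool",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "tool_choice": "auto"
}
EOF
}

# Example call (commented out because it needs the running server):
# build_tool_request "$MODEL_HANDLE" | curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" -d @-
```

If the parsers are working, the response message should carry structured `tool_calls`, with the model's thinking separated into `reasoning_content` rather than mixed into the answer text.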
## Kimi-K2.5 NVFP4 (1T) — CPU offloading
For Kimi-K2.5 NVFP4 (1T) with DRAM offloading, run with the **26.03** NGC container. This model does not fit entirely in VRAM, so the MoE expert weights are offloaded to CPU DRAM with `--cpu-offload-gb 375 --cpu-offload-params experts`. Ensure the system has enough free DRAM to hold the offloaded weights.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
nvcr.io/nvidia/vllm:26.03-py3 \
vllm serve nvidia/Kimi-K2.5-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--dtype auto \
--kv-cache-dtype auto \
--gpu-memory-utilization 0.95 \
--served-model-name nvidia/Kimi-K2.5-NVFP4 \
--tensor-parallel-size 1 \
--no-enable-prefix-caching \
--trust-remote-code \
--max-model-len 40960 \
--max-num-seqs 1 \
--max-num-batched-tokens 32768 \
--cpu-offload-gb 375 \
--cpu-offload-params experts
```
Settings used (in addition to the base configuration):
- `--cpu-offload-gb 375` — amount of CPU DRAM (in GiB) vLLM may use to hold weights that don't fit in VRAM. Must be large enough for the offloaded experts; the system needs at least this much free DRAM.
- `--cpu-offload-params experts` — offloads only the MoE expert weights (the bulk of a large MoE model) to DRAM, keeping attention and other hot weights in VRAM.
- `--tensor-parallel-size 1` — single GPU; the GB300 serves the whole model.
- `--max-num-seqs 1` / `--max-num-batched-tokens 32768` — caps concurrency at one sequence and the per-step token budget at 32,768 tokens. With expert weights paged in from DRAM, throughput is offload-bound, so low concurrency keeps latency predictable.
- `--no-enable-prefix-caching` — disables prefix-cache reuse. Offloading the experts leaves a tight memory budget, so prefix caching is turned off instead of holding KV blocks for reuse.
- `--kv-cache-dtype auto` / `--dtype auto` — let vLLM pick the KV-cache and compute dtypes from the model's quantization (NVFP4).
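Because the offloaded experts need that much free DRAM, a quick pre-flight check before `docker run` can save a failed start. The sketch below reads `MemAvailable` from `/proc/meminfo` on a Linux host; the function name `has_free_dram_gib` is a hypothetical helper, and the 375 GiB threshold simply mirrors the `--cpu-offload-gb` value above.

```bash
# Return 0 if MemAvailable is at least the requested number of GiB.
# Arg 1: required GiB. Arg 2 (optional): meminfo file, default /proc/meminfo.
has_free_dram_gib() {
  local required_gib="$1"
  local meminfo="${2:-/proc/meminfo}"
  local available_kib
  available_kib=$(awk '/^MemAvailable:/ {print $2}' "$meminfo")
  # /proc/meminfo reports values in KiB; 1 GiB = 1048576 KiB.
  [ "${available_kib:-0}" -ge $(( required_gib * 1048576 )) ]
}

if has_free_dram_gib 375; then
  echo "OK: at least 375 GiB of DRAM is available for expert offloading"
else
  echo "WARNING: less than 375 GiB of DRAM is available" >&2
fi
```

The check only covers the offload buffer itself; the host still needs some DRAM beyond that for the OS and the container, so treat a narrow pass as a warning sign.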
## DeepSeek-V4-Flash — MTP + agentic
For DeepSeek-V4-Flash, run with the stable **v0.20.0-cu130** container. This recipe targets agentic workloads and enables Multi-Token Prediction (MTP) speculative decoding. On a single GB300 (TP1) the MoE expert-parallel path is used; the `deep_gemm_mega_moe` backend from some internal recipes is not needed at TP1 and is omitted here.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
vllm/vllm-openai:v0.20.0-cu130 \
deepseek-ai/DeepSeek-V4-Flash \
--enable-expert-parallel \
--kv-cache-dtype fp8 \
--trust-remote-code \
--block-size 256 \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
--attention_config.use_fp4_indexer_cache True \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
--max-model-len 32768
```
Settings used (in addition to the base configuration):
- `--enable-expert-parallel` — shards the MoE experts across the available GPU(s) using expert parallelism, the recommended MoE execution path for DeepSeek-V4.
- `--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'` — enables **MTP (Multi-Token Prediction)** speculative decoding: the model proposes 3 tokens per step that are verified in a single forward pass, cutting latency for accepted tokens.
- `--kv-cache-dtype fp8` — FP8 KV cache to fit more concurrent/longer sequences.
- `--block-size 256` — KV-cache page size in tokens. DeepSeek-V4 uses multiple KV-cache groups; `256` matches the recipe validated on Station.
- `--attention_config.use_fp4_indexer_cache True` — enables the FP4 indexer cache used by DeepSeek-V4's attention. (Drop this flag on platforms without native FP4, e.g. Hopper.)
- `--tokenizer-mode deepseek_v4` / `--tool-call-parser deepseek_v4` / `--reasoning-parser deepseek_v4` — DeepSeek-V4-specific tokenizer, tool-call, and reasoning parsers.
- `--enable-auto-tool-choice` — OpenAI-compatible function calling for agentic use.
- `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` — uses full + piecewise CUDA graph capture and enables all custom ops for lower per-step overhead.
- **Prefix caching is left enabled (the vLLM default).** For agentic workloads with large shared prefixes (e.g. a 32k system/context prefix) at low batch sizes (~BS 34), prefix caching gives a significant throughput boost by reusing the cached prefix across requests. Some internal recipes carry `--no-enable-prefix-caching`, but that was inherited from random-data benchmarking and is not recommended for agentic use here.
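As a back-of-the-envelope illustration of what MTP with `num_speculative_tokens` set to 3 can buy, the expected number of tokens emitted per verification step can be estimated from a per-token acceptance rate. This uses the common simplified model in which every drafted token is accepted independently with the same probability; the helper name `mtp_tokens_per_step` and the 0.8 acceptance rate are assumptions for illustration, not measurements from DGX Station.

```bash
# Expected tokens per verification step under an independent-acceptance model.
# Arg 1: k, the number of speculative tokens. Arg 2: a, acceptance rate (0 <= a < 1).
# Formula: (1 - a^(k+1)) / (1 - a), which counts the token the verifier always emits.
mtp_tokens_per_step() {
  awk -v k="$1" -v a="$2" 'BEGIN { printf "%.2f\n", (1 - a^(k + 1)) / (1 - a) }'
}

mtp_tokens_per_step 3 0.8   # hypothetical 80% acceptance rate
```

Real acceptance rates depend on the prompt and the model, so measure them on your own workload before tuning `num_speculative_tokens`.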
Check the server logs for startup progress:
```bash
@ -135,7 +284,7 @@ spec:
Expected output includes:
- Model download progress (first run only)
- Model loading into GPU memory
- `Uvicorn running on http://0.0.0.0:8000`
- `Application startup complete.`
Press `Ctrl+C` to exit log view once the server is ready.
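For scripted setups, watching the logs can be replaced by a small readiness poll. The sketch below assumes the server is published on `localhost:8000` as in the commands above and that the vLLM OpenAI-compatible server answers on `/health` once it is up; the function name `wait_for_vllm` and the retry defaults are hypothetical.

```bash
# Poll the health endpoint until it returns success or the attempts run out.
# Arg 1: base URL. Arg 2: number of attempts. Arg 3: seconds between attempts.
wait_for_vllm() {
  local base_url="${1:-http://localhost:8000}"
  local attempts="${2:-60}"
  local delay="${3:-10}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors; -s keeps it quiet.
    if curl -sf -o /dev/null "${base_url}/health"; then
      echo "vLLM is ready after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$(( i + 1 ))
  done
  echo "vLLM did not become ready after ${attempts} attempts" >&2
  return 1
}
```

Calling `wait_for_vllm` with no arguments waits up to roughly ten minutes; large models that download on first run may need a higher attempt count.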
@ -166,9 +315,10 @@ spec:
Optionally, remove the image and cached model:
For example:
```bash
docker rmi nvcr.io/nvidia/vllm:26.01-py3
rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3-235B-A22B-NVFP4
docker rmi "<docker image name>"
rm -rf "$HOME/.cache/huggingface/hub/<downloaded model name>"
```