mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-22 01:53:53 +00:00
feat: scaffold skills plugin from DGX Spark playbooks
Adds a Claude Code plugin structure that exposes each NVIDIA DGX Spark
playbook as a triggerable skill, with an index skill ('dgx-spark') that
routes users to the right leaf based on intent and encodes the
relationship graph between playbooks (prerequisites, alternatives,
composes-with, upgrade paths).
Structure:
- overrides/*.md: hand-curated frontmatter + Related sections
- scripts/generate.mjs: zero-dep Node generator (nvidia + overrides → skills)
- scripts/install.sh: symlinks skills into ~/.claude/skills (--plugin mode available)
- skills/: committed, browsable, installable without Node
- .github/workflows/: auto-regenerates skills/ when playbooks/overrides change
Initial curated leaves: ollama, open-webui, vllm, connect-to-your-spark.
Remaining 37 leaves use generator fallback (title + tagline + summary
extracted from README) and can be curated incrementally via overrides/.
This commit is contained in:
parent 3ba4d58f1e · commit a680d0472b
47 .github/workflows/regenerate-skills.yml (vendored, new file)
@@ -0,0 +1,47 @@

name: Regenerate skills

on:
  push:
    branches: [main]
    paths:
      - 'nvidia/**/README.md'
      - 'overrides/**'
      - 'scripts/generate.mjs'
      - 'plugin.json'
  workflow_dispatch:

concurrency:
  group: regenerate-skills-${{ github.ref }}
  cancel-in-progress: true

permissions:
  contents: write

jobs:
  regenerate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          persist-credentials: true

      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Regenerate skills
        run: node scripts/generate.mjs

      - name: Commit if changed
        run: |
          if git diff --quiet skills/; then
            echo "No changes to skills/ - skipping commit"
            exit 0
          fi

          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"

          git add skills/
          git commit -m "chore: regenerate skills/ from upstream playbooks [skip ci]"
          git push

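The workflow's final step commits only when `skills/` actually changed, keyed on the exit status of `git diff --quiet`. A standalone sketch of that guard, runnable in a throwaway repo (all paths and file names here are illustrative, not the real repo):

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q
mkdir skills
echo one > skills/x.md
git add skills/
git -c user.name=t -c user.email=t@t commit -qm init

# Clean tree: git diff --quiet exits 0, so the workflow would skip the commit.
if git diff --quiet skills/; then
  echo "No changes to skills/ - skipping commit"
fi

# Modify a tracked file: git diff --quiet now exits 1, so a commit would proceed.
echo two > skills/x.md
git diff --quiet skills/ || echo "changes detected - would commit"
```

Note that `git diff --quiet` compares the worktree against the index, which is exactly what the CI job needs right after the generator rewrites `skills/`.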
2 .gitignore (vendored, new file)
@@ -0,0 +1,2 @@

node_modules/
.DS_Store

120 overrides/_index.md (new file)
@@ -0,0 +1,120 @@

---
description: Catalog and router for NVIDIA DGX Spark playbooks — use when a user asks about setting up their DGX Spark, wants an overview of what they can run on Spark hardware, or needs help choosing between inference runtimes, fine-tuning frameworks, or networking setups. Lists all available dgx-spark-* skills and encodes the relationships between them (prerequisites, alternatives, composes-with, upgrade paths).
---

# DGX Spark Playbooks — Index

Use this catalog to route the user to the right specific `dgx-spark-*` skill. Each entry below names a leaf skill; invoke it when the user's intent matches.

## Categories

### Inference runtimes (serve models)
- `dgx-spark-ollama` — easiest, good default for most users
- `dgx-spark-vllm` — higher throughput, production-grade serving
- `dgx-spark-trt-llm` — maximum Blackwell performance, most complex setup
- `dgx-spark-sglang` — structured generation, batched inference
- `dgx-spark-llama-cpp` — lightweight, CPU/GPU flexibility
- `dgx-spark-lm-studio` — GUI-based model management
- `dgx-spark-nim-llm` — NVIDIA NIM microservices

### Chat & UI
- `dgx-spark-open-webui` — web chat UI, pairs with Ollama
- `dgx-spark-live-vlm-webui` — vision-language model interface
- `dgx-spark-dgx-dashboard` — GPU/system monitoring

### Fine-tuning
- `dgx-spark-pytorch-fine-tune` — baseline PyTorch fine-tuning
- `dgx-spark-nemo-fine-tune` — NVIDIA NeMo framework
- `dgx-spark-unsloth` — memory-efficient fine-tuning
- `dgx-spark-llama-factory` — multi-model fine-tuning framework
- `dgx-spark-flux-finetuning` — FLUX.1 Dreambooth LoRA (image models)

### Networking & multi-Spark
- `dgx-spark-connect-to-your-spark` — **foundational: local network access setup**
- `dgx-spark-tailscale` — VPN-based remote access
- `dgx-spark-connect-two-sparks` — link two Sparks
- `dgx-spark-connect-three-sparks` — ring topology
- `dgx-spark-multi-sparks-through-switch` — switched multi-Spark
- `dgx-spark-nccl` — collective communication across Sparks

### Dev environments & tooling
- `dgx-spark-vscode` — VS Code remote setup
- `dgx-spark-vibe-coding` — agentic coding in VS Code
- `dgx-spark-rag-ai-workbench` — RAG app in AI Workbench
- `dgx-spark-openshell` — secure long-running agents
- `dgx-spark-openclaw` — (advanced agent setup)
- `dgx-spark-nemoclaw` — Nemotron + Telegram agent

### Specialized workloads
- `dgx-spark-comfy-ui` — image generation UI
- `dgx-spark-isaac` — Isaac Sim / Isaac Lab (robotics)
- `dgx-spark-jax` — JAX on Spark
- `dgx-spark-cuda-x-data-science` — RAPIDS / data science
- `dgx-spark-multi-agent-chatbot` — multi-agent deployment
- `dgx-spark-multi-modal-inference` — multi-modal models
- `dgx-spark-nemotron` — Nemotron-3-Nano with llama.cpp
- `dgx-spark-nvfp4-quantization` — FP4 quantization workflows
- `dgx-spark-portfolio-optimization` — finance example
- `dgx-spark-single-cell` — single-cell RNA sequencing
- `dgx-spark-speculative-decoding` — speculative decoding inference
- `dgx-spark-spark-reachy-photo-booth` — Reachy robot demo
- `dgx-spark-txt2kg` — text-to-knowledge-graph
- `dgx-spark-vss` — video search & summarization agent

## Relationship graph

Edge types: `→prereq→` (must do first), `→pairs→` (composes naturally), `→alt→` (pick one, roughly equivalent choice), `→upgrade→` (next step when outgrowing this).

### Networking (almost everything depends on this)
- `connect-to-your-spark` →prereq→ **all remote-access playbooks**
- `tailscale` →alt→ `connect-to-your-spark`
- `connect-two-sparks` →prereq→ `nccl`
- `connect-two-sparks` →upgrade→ `connect-three-sparks` →upgrade→ `multi-sparks-through-switch`

### Inference stack
- `ollama` →pairs→ `open-webui` *(most common pairing — chat UI on top of Ollama)*
- `ollama` →alt→ `lm-studio` *(GUI vs CLI; roughly equivalent for local single-user use)*
- `ollama` →alt→ `llama-cpp` *(lower-level control)*
- `ollama` →upgrade→ `vllm` *(when throughput / OpenAI-compatible API matters)*
- `vllm` →alt→ `trt-llm` *(different use case — trt-llm for lowest latency with compiled engines; not strictly an upgrade)*
- `vllm` →composes→ `connect-two-sparks` + `nccl` *(multi-Spark serving for very large models)*
- `nemotron` →pairs→ `llama-cpp` *(playbook specifically uses llama.cpp runtime)*
- `nim-llm` →alt→ `vllm` *(NIM microservices vs. raw vLLM serving)*

### Fine-tuning pipelines
- `pytorch-fine-tune` →prereq→ `flux-finetuning` *(need baseline PyTorch setup first)*
- `nemo-fine-tune` →pairs→ `nim-llm` *(deploy tuned NeMo models via NIM)*
- `unsloth` →related→ `llama-factory` *(related but different specialties — unsloth for memory-efficient LoRA/QLoRA; llama-factory is a broader multi-technique framework)*

### Monitoring & observability
- `dgx-dashboard` →pairs→ **all inference/fine-tuning skills** *(GPU/system monitoring during workloads)*

### Performance tuning (compose with inference)
- `speculative-decoding` →composes→ `vllm`, `trt-llm` *(inference acceleration technique)*
- `nvfp4-quantization` →composes→ `vllm`, `trt-llm` *(quantize first, then serve)*

### Agent & automation stacks
- `nemoclaw` →pairs→ `nemotron` *(nemoclaw uses Nemotron internally)*
- `openclaw` →pairs→ `openshell` *(agent security pattern)*

### Dev env dependencies
- `vscode` →prereq→ `vibe-coding` *(vibe-coding builds on VS Code remote setup)*

## Suggestion rules

When the user's request is broad, narrow it with these questions before invoking a leaf:

| User says... | Ask / suggest |
|---|---|
| "chat with a model on Spark" | Default: `dgx-spark-ollama` + `dgx-spark-open-webui`. Ask: CLI-only or web UI? |
| "fastest inference" | `dgx-spark-trt-llm`, but warn it's the most complex. Ask if `vllm` would suffice. |
| "train" / "fine-tune a model" | Ask: from scratch (`nemo-fine-tune`) or adapt existing (`unsloth`, `llama-factory`)? Image model? → `flux-finetuning`. |
| "connect to my Spark" / "remote access" | `dgx-spark-connect-to-your-spark` first. Suggest `tailscale` as alternative for VPN use. |
| "multiple Sparks" | Ask: 2 (`connect-two-sparks`), 3 (`connect-three-sparks`), or more via switch? NCCL after physical link. |
| "I just got my Spark, what can I do" | List categories above. Suggest starting with `connect-to-your-spark` → `ollama`. |

## Curation notes

Edges above are a working starting point. Revise as real usage reveals which pairings matter most. In particular:
- `vllm →alt→ trt-llm` is inferred from the READMEs' positioning — confirm with users whether they see these as alternatives or a progression
- `speculative-decoding` and `nvfp4-quantization` compose with serving runtimes but the exact integration path may vary — check if users typically apply them standalone or as part of vllm/trt-llm setup

32 overrides/connect-to-your-spark.md (new file)
@@ -0,0 +1,32 @@

---
description: Set up SSH access to an NVIDIA DGX Spark from a laptop using NVIDIA Sync (recommended) or manual SSH. Use when a user is new to their Spark and needs to connect remotely, before doing anything else. This is a prerequisite for nearly every other dgx-spark-* skill — if a user hasn't set this up, do this first.
---

## When to use this skill
- User just got their DGX Spark and wants to use it from their laptop
- Any other dgx-spark-* skill needs SSH access and the user hasn't configured it yet
- User reports "can't connect to my Spark" or "SSH hangs / can't resolve spark-abcd.local"

## Two paths — help the user pick
- **NVIDIA Sync (recommended)** — GUI, handles SSH key generation + aliasing + port forwarding for apps. Required if they want one-click app launchers (DGX Dashboard, VS Code, Open WebUI tunnels).
- **Manual SSH** — if they prefer a CLI-only workflow, or Sync isn't supported on their platform.

Most users should use NVIDIA Sync unless they have a specific reason not to.

## Key decisions
- **Hostname vs IP** — default is mDNS hostname (`spark-abcd.local`). On corporate networks that block mDNS, they'll need to use the IP address from their router's admin panel. Quick test: `ping spark-abcd.local` — if it hangs, mDNS is blocked.
- **First-boot wait** — after initial system setup, the Spark can take 3–4 minutes to finish updates before SSH becomes available. Don't diagnose connection issues in this window.

## Non-obvious gotchas
- NVIDIA Sync's password prompt happens **once** — it uses the password only to install the SSH key, then discards it. If auth fails, the key install didn't complete; re-run the add-device flow.
- mDNS `.local` resolution is OS + network-stack specific. Works on most home Wi-Fi; often broken on corporate VPNs or guest networks.
- Port-forwarding for web apps is a separate step (SSH `-L` flag or Custom Ports in Sync) — connecting to SSH alone doesn't give laptop browsers access to web UIs running on the Spark.

## Related skills
- **Alternative**: `dgx-spark-tailscale` — use Tailscale VPN for remote access instead of local-network SSH. Works off-network.
- **Follow-ups (what users typically do next)**:
  - `dgx-spark-ollama` — run a local LLM
  - `dgx-spark-open-webui` — web chat UI
  - `dgx-spark-vscode` — remote development
  - `dgx-spark-dgx-dashboard` — system monitoring (already pre-installed, just needs the tunnel)
- **Multi-Spark setups depend on this first**: `dgx-spark-connect-two-sparks`, `dgx-spark-connect-three-sparks`, `dgx-spark-multi-sparks-through-switch`

24 overrides/ollama.md (new file)
@@ -0,0 +1,24 @@

---
description: Install Ollama on an NVIDIA DGX Spark and expose its API to a local laptop via NVIDIA Sync SSH tunnel. Use when a user wants to run LLM inference on DGX Spark hardware and call the API from their laptop on localhost:11434 without exposing ports on their network.
---

## When to use this skill
- User has an NVIDIA DGX Spark with NVIDIA Sync installed on their laptop
- Wants Ollama running on Spark, API accessible from their laptop
- Wants an easy-to-use inference runtime (vs. the complexity of vLLM or TRT-LLM)

## Key decisions to confirm before executing
- **Model choice** — default in the playbook is `qwen2.5:32b` (~18GB, optimized for Blackwell). Ask the user if they want a smaller model (`qwen2.5:7b`, `llama3.1:8b`, `phi3.5:3.8b`) for lower VRAM or faster download.
- **Check first** — run `ollama --version` on the Spark before installing; skip installation if already present.

## Non-obvious gotchas
- The SSH tunnel must be re-activated after NVIDIA Sync restarts — `localhost:11434` only works while the "Ollama Server" custom app is active in Sync.
- Uninstall is destructive: `sudo rm -rf /usr/share/ollama` removes all downloaded models (often tens of GB). Confirm with the user before running cleanup.
- Streaming responses (`"stream": true`) arrive as incremental chunks rather than one final JSON object — use `curl -N` to see them as they arrive.

## Related skills
- **Prerequisite**: `dgx-spark-connect-to-your-spark` — NVIDIA Sync + local network access basics. If the user hasn't set this up yet, do it first.
- **Composes with**: `dgx-spark-open-webui` — web chat UI on top of Ollama. Most common follow-up.
- **Alternative**: `dgx-spark-lm-studio` — GUI-based model management instead of Ollama's CLI.
- **Alternative**: `dgx-spark-llama-cpp` — lower-level control over inference.
- **Upgrade path**: `dgx-spark-vllm` — when the user needs higher throughput or is serving multiple concurrent users.

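The "check first" guidance above is a one-line shell test. A hedged sketch (on a machine without Ollama installed this takes the "not installed" branch, which is itself the signal to proceed with installation):

```shell
# Detect an existing Ollama install before re-installing.
if command -v ollama >/dev/null 2>&1; then
  status="present: $(ollama --version)"
else
  status="not installed"
fi
echo "ollama $status"
```

The same pattern works for any prerequisite binary check in these playbooks; `command -v` is POSIX and avoids the aliasing pitfalls of `which`.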
22 overrides/open-webui.md (new file)
@@ -0,0 +1,22 @@

---
description: Install Open WebUI on NVIDIA DGX Spark for a web-based chat interface to LLMs running on the Spark GPU. Use when a user wants a browser UI for chatting with local models — most commonly paired with Ollama, either bundled inside Open WebUI or as a separate backend.
---

## When to use this skill
- User has Spark SSH access (`dgx-spark-connect-to-your-spark`) and wants a web chat UI, not just CLI
- User already has Ollama and wants to chat through a browser
- User wants a self-hosted ChatGPT-like interface running entirely on their own hardware

## Key decisions
- **Bundled Ollama or separate?** — Open WebUI ships with an integrated Ollama option (single Docker container). Simpler for first-time users. If the user already ran `dgx-spark-ollama` separately, configure Open WebUI to connect to that existing Ollama instead of running two copies.
- **Sync-managed or manual Docker?** — NVIDIA Sync can manage the SSH tunnel + custom-port setup automatically. Manual Docker gives more control but requires the user to handle port forwarding themselves.

## Non-obvious gotchas
- User must be in the `docker` group on the Spark — `docker ps` without sudo must work. If not, add via `sudo usermod -aG docker $USER` and **log out/in to apply** (a new SSH session is not enough — the session must be fully re-established).
- The Open WebUI container stores user accounts and chat history in a named volume. Don't `docker rm -v` the container unless you intend to lose history.
- First run creates an admin account from whoever signs up first — if the UI is port-forwarded somewhere other users can reach, sign up immediately before anyone else does.

## Related skills
- **Prerequisite**: `dgx-spark-connect-to-your-spark` — SSH + Sync setup
- **Pairs with**: `dgx-spark-ollama` — the most common backend. Open WebUI can bundle its own Ollama, but if `dgx-spark-ollama` was already set up, reuse it (saves disk, one set of models).
- **Alternative UIs**: `dgx-spark-lm-studio` (desktop GUI, not web) · `dgx-spark-live-vlm-webui` (vision-language models specifically)

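The docker-group gotcha above can be checked non-destructively before touching `usermod`. A sketch (group membership is read from the *current* session's credentials, which is exactly why a full log-out/log-in is needed after `usermod`):

```shell
# Does the current session's group list include 'docker'?
if id -nG | tr ' ' '\n' | grep -qx docker; then
  docker_ok="yes"
  echo "current session is in the docker group"
else
  docker_ok="no"
  echo "not in docker group - run: sudo usermod -aG docker \$USER, then log out and back in"
fi
```

On a host without Docker installed the check simply reports "no", which is harmless.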
39 overrides/vllm.md (new file)
@@ -0,0 +1,39 @@

---
description: Install and run vLLM for high-throughput LLM inference on NVIDIA DGX Spark, including multi-Spark serving for very large models (e.g., Llama 405B across two Sparks). Use when a user needs an OpenAI-compatible API, higher throughput than Ollama, or wants to run models too large for a single Spark. Significantly more complex setup than Ollama — ensure the user actually needs what vLLM offers before recommending it.
---

## When to use this skill
- User's current runtime (usually Ollama) can't handle their throughput requirements
- User wants an OpenAI-compatible API to plug applications into
- User wants to run a model too large for one Spark (vLLM supports tensor parallelism across 2+ Sparks)
- User specifically asked for vLLM

## When NOT to use this skill
- User is just exploring — `dgx-spark-ollama` is far simpler
- User needs single-user chat — Ollama + Open WebUI covers that case
- User needs absolute lowest latency with pre-compiled models — that's `dgx-spark-trt-llm` territory

## Key decisions
- **Docker container or build from source?** — The pre-built container is the recommended path. A source build is only needed if the user has a specific reason (custom patches, bleeding-edge vLLM version not yet in the container).
- **Single-Spark or multi-Spark?** — Multi-Spark adds major complexity: networking (`dgx-spark-connect-two-sparks` or `dgx-spark-multi-sparks-through-switch`) + NCCL (`dgx-spark-nccl`) must be working first. Only pursue for 120B+ param models that don't fit on one Spark.
- **Model + quantization** — the playbook's support matrix lists specific NVFP4/FP8/MXFP4 combinations. Don't assume any HF model works — check the matrix.

## Prerequisites (hard requirements)
- CUDA 13.0 toolkit installed (`nvcc --version`)
- Docker + NVIDIA Container Toolkit configured
- Python 3.12 available
- `dgx-spark-connect-to-your-spark` for remote access

## Non-obvious gotchas
- This is ARM64 + Blackwell. PyPI wheels built for x86_64 CUDA 12.x **will not work** — the playbook's container has ARM64-specific LLVM/Triton patches.
- vLLM's default GPU memory utilization is high (~0.9). On a Spark that's also running other workloads, drop to 0.7–0.8 or the container will OOM.
- Multi-Spark serving is sensitive to NCCL configuration and link quality — a single flaky cable will destroy throughput. Validate `dgx-spark-nccl` first before assuming vLLM is the problem.

## Related skills
- **Prerequisite**: `dgx-spark-connect-to-your-spark`
- **Simpler alternative**: `dgx-spark-ollama` — recommend this first unless the user needs vLLM's specific capabilities
- **Alternative for max perf**: `dgx-spark-trt-llm` — TensorRT-LLM with compiled engines. Different use case (lowest latency, more setup cost), not strictly an upgrade path
- **Multi-Spark composition**:
  - `dgx-spark-connect-two-sparks` or `dgx-spark-multi-sparks-through-switch` (physical link)
  - `dgx-spark-nccl` (collective comms)
- **Pairs with**: `dgx-spark-dgx-dashboard` for GPU monitoring during serving

9 plugin.json (new file)
@@ -0,0 +1,9 @@

{
  "name": "dgx-spark-playbooks",
  "version": "0.1.0",
  "description": "Skills for setting up AI/ML workloads on NVIDIA DGX Spark hardware. Derived from the official DGX Spark playbooks.",
  "author": {
    "name": "Jason Kneen",
    "email": "jason@bouncingfish.com"
  }
}

151 scripts/generate.mjs (new executable file)
@@ -0,0 +1,151 @@

#!/usr/bin/env node
// Reads nvidia/<name>/README.md + overrides/<name>.md → writes skills/dgx-spark-<name>/SKILL.md
// Overrides provide hand-curated frontmatter description and extra body sections (Related, etc.).
// Generator-owned content is bounded by GENERATED markers and rewritten on every run; override
// content is appended verbatim and preserved across regenerations.

import { readdir, readFile, writeFile, mkdir, rm } from 'node:fs/promises'
import { existsSync } from 'node:fs'
import { join, dirname, resolve } from 'node:path'
import { fileURLToPath } from 'node:url'

const REPO = resolve(dirname(fileURLToPath(import.meta.url)), '..')
const NVIDIA = join(REPO, 'nvidia')
const OVERRIDES = join(REPO, 'overrides')
const SKILLS_OUT = join(REPO, 'skills')

async function main() {
  // skills/ is entirely generator-owned — hand edits belong in overrides/
  await rm(SKILLS_OUT, { recursive: true, force: true })
  await mkdir(SKILLS_OUT, { recursive: true })

  const leafNames = (await readdir(NVIDIA, { withFileTypes: true }))
    .filter(d => d.isDirectory())
    .map(d => d.name)
    .sort()

  await writeIndex()
  for (const name of leafNames) await writeLeaf(name)

  console.log(`✓ Generated ${leafNames.length + 1} skills in ${SKILLS_OUT}`)
  console.log(`  • dgx-spark (index)`)
  console.log(`  • ${leafNames.length} leaves: ${leafNames.slice(0, 3).join(', ')}, ...`)
}

async function writeIndex() {
  const override = await readOverride('_index')
  if (!override) {
    throw new Error('overrides/_index.md is required (contains the catalog and relationship graph)')
  }
  const content = `---
name: dgx-spark
description: ${inlineDescription(override.fm.description)}
---

${override.body.trim()}
`
  await writeSkill('dgx-spark', content)
}

async function writeLeaf(name) {
  const readme = await readFile(join(NVIDIA, name, 'README.md'), 'utf8')
  const override = await readOverride(name)
  const description = override?.fm.description ?? fallbackDescription(name, readme)
  const generated = extractGeneratedBody(name, readme)

  const content = `---
name: dgx-spark-${name}
description: ${inlineDescription(description)}
---

<!-- GENERATED:BEGIN from nvidia/${name}/README.md -->
${generated}
<!-- GENERATED:END -->
${override?.body ? '\n' + override.body.trim() + '\n' : ''}`

  await writeSkill(`dgx-spark-${name}`, content)
}

function extractGeneratedBody(name, readme) {
  const title = firstMatch(readme, /^#\s+(.+)$/m) ?? name
  const tagline = firstMatch(readme, /^>\s+(.+)$/m) ?? ''
  const basicIdea = extractSection(readme, 'Basic idea') || extractSection(readme, 'Overview')
  const accomplish = extractSection(readme, "What you'll accomplish")
  const duration = firstMatch(readme, /\*\*Duration\*\*:\s*(.+)$/m)
  const risk = firstMatch(readme, /\*\*Risk level\*\*:\s*(.+)$/m)

  const parts = [`# ${title}`]
  if (tagline) parts.push(`> ${tagline}`)
  if (basicIdea) parts.push(basicIdea)
  if (accomplish) parts.push(`**Outcome**: ${accomplish}`)
  if (duration || risk) {
    const meta = []
    if (duration) meta.push(`Duration: ${duration}`)
    if (risk) meta.push(`Risk: ${risk}`)
    parts.push(meta.join(' · '))
  }
  parts.push(`**Full playbook**: \`${join(NVIDIA, name, 'README.md')}\``)
  return parts.join('\n\n')
}

function fallbackDescription(name, readme) {
  const tagline = firstMatch(readme, /^>\s+(.+)$/m)
  if (tagline) return `${tagline} — on NVIDIA DGX Spark. Use when setting up ${name} on Spark hardware.`
  return `Set up ${name} on NVIDIA DGX Spark. Use when the user wants to install or configure ${name} on Spark hardware.`
}

function inlineDescription(desc) {
  // YAML-safe single-line description. Our parser doesn't handle multi-line, so collapse + escape quotes.
  const collapsed = desc.replace(/\s+/g, ' ').trim()
  if (collapsed.includes(':') || collapsed.includes('#')) {
    return JSON.stringify(collapsed) // YAML accepts double-quoted strings
  }
  return collapsed
}

function firstMatch(s, re) {
  const m = s.match(re)
  return m ? m[1].trim() : null
}

function extractSection(md, heading) {
  const re = new RegExp(`##\\s+${heading}\\s*\\n+([\\s\\S]*?)(?=\\n##\\s|\\n---|$)`, 'i')
  const m = md.match(re)
  if (!m) return ''
  return m[1].trim().split('\n').slice(0, 8).join('\n').trim()
}

async function readOverride(name) {
  const path = join(OVERRIDES, `${name}.md`)
  if (!existsSync(path)) return null
  const text = await readFile(path, 'utf8')
  return parseFrontmatter(text)
}

function parseFrontmatter(md) {
  const m = md.match(/^---\n([\s\S]*?)\n---\n?([\s\S]*)$/)
  if (!m) return { fm: {}, body: md }
  const fm = {}
  for (const line of m[1].split('\n')) {
    const idx = line.indexOf(':')
    if (idx === -1) continue
    const key = line.slice(0, idx).trim()
    let value = line.slice(idx + 1).trim()
    if ((value.startsWith('"') && value.endsWith('"')) || (value.startsWith("'") && value.endsWith("'"))) {
      value = value.slice(1, -1)
    }
    fm[key] = value
  }
  return { fm, body: m[2] }
}

async function writeSkill(name, content) {
  const dir = join(SKILLS_OUT, name)
  await mkdir(dir, { recursive: true })
  await writeFile(join(dir, 'SKILL.md'), content)
}

main().catch(err => {
  console.error('generate failed:', err)
  process.exit(1)
})

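The quoting rule in `inlineDescription` can be sketched in shell. This is a simplified sketch: the real generator uses `JSON.stringify`, which also escapes any embedded double quotes, while this version only decides *whether* to quote (the sample descriptions are illustrative):

```shell
# Quote the YAML value only when it contains ':' or '#', mirroring inlineDescription().
emit_description() {
  case "$1" in
    *:*|*'#'*) printf 'description: "%s"\n' "$1" ;;
    *)         printf 'description: %s\n' "$1" ;;
  esac
}

plain=$(emit_description "Run a local LLM on Spark")
quoted=$(emit_description "Catalog and router: use when choosing a runtime")
echo "$plain"
echo "$quoted"
```

Unquoted YAML scalars containing `: ` would otherwise be parsed as nested mappings, and `#` would start a comment, which is why these two characters trigger quoting.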
89 scripts/install.sh (new executable file)
@@ -0,0 +1,89 @@

#!/usr/bin/env bash
# Install dgx-spark-playbooks into Claude Code.
#
# Modes:
#   --skills (default)  Symlink each generated skill into ~/.claude/skills/
#   --plugin            Symlink the whole repo into ~/.claude/plugins/ as a plugin
#
# Env overrides:
#   CLAUDE_SKILLS   target for individual skills (default: ~/.claude/skills)
#   CLAUDE_PLUGINS  target for plugin install (default: ~/.claude/plugins)

set -euo pipefail

REPO="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
SKILLS_DIR="${CLAUDE_SKILLS:-$HOME/.claude/skills}"
PLUGINS_DIR="${CLAUDE_PLUGINS:-$HOME/.claude/plugins}"

MODE="skills"
for arg in "$@"; do
  case "$arg" in
    --skills) MODE="skills" ;;
    --plugin) MODE="plugin" ;;
    -h|--help)
      sed -n '2,11p' "$0" | sed 's/^# \{0,1\}//'
      exit 0
      ;;
    *)
      echo "unknown argument: $arg" >&2
      exit 1
      ;;
  esac
done

# Regenerate skills/ if Node is available; otherwise use what's already committed.
if command -v node >/dev/null 2>&1; then
  NODE_MAJOR=$(node -p "process.versions.node.split('.')[0]")
  if [ "$NODE_MAJOR" -ge 18 ]; then
    echo "→ Regenerating skills/ from overrides/ + nvidia/*/README.md"
    node "$REPO/scripts/generate.mjs"
  else
    echo "  node >= 18 required to regenerate (have $(node -v)) — using existing skills/"
  fi
else
  echo "  node not found — using existing skills/"
fi

if [ ! -d "$REPO/skills" ]; then
  echo "error: $REPO/skills does not exist and could not be regenerated" >&2
  exit 1
fi

# Clean previous installs from BOTH targets so switching modes stays clean.
cleanup_skills() {
  [ -d "$SKILLS_DIR" ] || return 0
  for link in "$SKILLS_DIR/dgx-spark" "$SKILLS_DIR/dgx-spark-"*; do
    [ -L "$link" ] && rm "$link"
  done
  return 0  # keep set -e happy when the last glob entry isn't a symlink
}
cleanup_plugin() {
  local link="$PLUGINS_DIR/dgx-spark-playbooks"
  [ -L "$link" ] && rm "$link"
  return 0
}
cleanup_skills
cleanup_plugin

if [ "$MODE" = "plugin" ]; then
  mkdir -p "$PLUGINS_DIR"
  ln -s "$REPO" "$PLUGINS_DIR/dgx-spark-playbooks"
  echo "✓ Installed as plugin: $PLUGINS_DIR/dgx-spark-playbooks → $REPO"
else
  mkdir -p "$SKILLS_DIR"
  count=0
  for dir in "$REPO/skills"/*/; do
    name=$(basename "$dir")
    link="$SKILLS_DIR/$name"
    if [ -e "$link" ] && [ ! -L "$link" ]; then
      echo "  ! $name already exists as a real directory — skipping (remove manually to replace)"
      continue
    fi
    ln -s "$dir" "$link"
    count=$((count + 1))
  done
  echo "✓ Installed $count skills: $SKILLS_DIR/dgx-spark-*"
fi

echo ""
echo "Update:    cd $REPO && git pull && ./scripts/install.sh --$MODE"
echo "Uninstall: $REPO/scripts/uninstall.sh"

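The installer's cleanup only ever removes symlinks, so a real directory that happens to share the name prefix survives. A local sketch of that invariant under a temp directory (the entry names are illustrative):

```shell
tmp=$(mktemp -d)
mkdir "$tmp/dgx-spark-real"          # real directory: must survive cleanup
ln -s /tmp "$tmp/dgx-spark-ollama"   # symlink: should be removed

# Symlink-only removal, as in cleanup_skills(); '|| true' keeps the loop
# benign when an entry is not a symlink.
for link in "$tmp/dgx-spark-"*; do
  [ -L "$link" ] && rm "$link" || true
done

ls "$tmp"
```

This is the property that makes repeated installs and mode switches idempotent: cleanup can run unconditionally without risking user data.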
26 scripts/uninstall.sh (new executable file)
@@ -0,0 +1,26 @@

#!/usr/bin/env bash
# Remove dgx-spark-playbooks symlinks from both skill-mode and plugin-mode install targets.
# Only removes symlinks — never touches real directories.

set -euo pipefail

SKILLS_DIR="${CLAUDE_SKILLS:-$HOME/.claude/skills}"
PLUGINS_DIR="${CLAUDE_PLUGINS:-$HOME/.claude/plugins}"

count=0

if [ -d "$SKILLS_DIR" ]; then
  for link in "$SKILLS_DIR/dgx-spark" "$SKILLS_DIR/dgx-spark-"*; do
    [ -L "$link" ] || continue
    rm "$link"
    count=$((count + 1))
  done
fi

plugin_link="$PLUGINS_DIR/dgx-spark-playbooks"
if [ -L "$plugin_link" ]; then
  rm "$plugin_link"
  count=$((count + 1))
fi

echo "✓ Removed $count symlinks"

20 skills/dgx-spark-comfy-ui/SKILL.md (new file)
@@ -0,0 +1,20 @@

---
name: dgx-spark-comfy-ui
description: Install and use Comfy UI to generate images — on NVIDIA DGX Spark. Use when setting up comfy-ui on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/comfy-ui/README.md -->
# Comfy UI

> Install and use Comfy UI to generate images

ComfyUI is an open-source web server application for AI image generation using diffusion-based models like SDXL, Flux, and others. It has a browser-based UI that lets you create, edit, and run image generation and editing workflows with multiple steps. These generation and editing steps (e.g., loading a model, adding text or sampling) are configurable in the UI as a node, and you connect nodes with wires to form a workflow.

ComfyUI uses the host's GPU for inference, so you can install it on your DGX Spark and do all of your image generation and editing directly on your device.

Workflows are saved as JSON files, so you can version them for future work, collaboration, and reproducibility.

**Outcome**: You'll install and configure ComfyUI on your NVIDIA DGX Spark device so you can use the unified memory to work with large models.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/comfy-ui/README.md`
<!-- GENERATED:END -->

20
skills/dgx-spark-connect-three-sparks/SKILL.md
Normal file
@ -0,0 +1,20 @@
---
name: dgx-spark-connect-three-sparks
description: Connect and set up three DGX Spark devices in a ring topology — on NVIDIA DGX Spark. Use when setting up connect-three-sparks on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/connect-three-sparks/README.md -->
# Connect Three DGX Spark in a Ring Topology

> Connect and set up three DGX Spark devices in a ring topology

Configure three DGX Spark systems in a ring topology for high-speed inter-node communication using 200GbE direct QSFP connections. This setup enables distributed workloads across three DGX Spark nodes by establishing network connectivity and configuring SSH authentication.

**Outcome**: You will physically connect three DGX Spark devices with QSFP cables, configure network interfaces for cluster communication, and establish passwordless SSH between nodes to create a functional distributed computing environment.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/connect-three-sparks/README.md`
<!-- GENERATED:END -->
55
skills/dgx-spark-connect-to-your-spark/SKILL.md
Normal file
@ -0,0 +1,55 @@
---
name: dgx-spark-connect-to-your-spark
description: Set up SSH access to an NVIDIA DGX Spark from a laptop using NVIDIA Sync (recommended) or manual SSH. Use when a user is new to their Spark and needs to connect remotely, before doing anything else. This is a prerequisite for nearly every other dgx-spark-* skill — if a user hasn't set this up, do this first.
---

<!-- GENERATED:BEGIN from nvidia/connect-to-your-spark/README.md -->
# Set Up Local Network Access

> NVIDIA Sync helps set up and configure SSH access

If you primarily work on another system, such as a laptop, and want to use your DGX Spark as a remote resource, this playbook shows you how to connect and work over SSH. With SSH, you can securely open a terminal session or tunnel ports to access web apps and APIs on your DGX Spark from your local machine.

There are two approaches: **NVIDIA Sync (recommended)** for streamlined device management, or **manual SSH** for direct command-line control.

**Outcome**: You will establish secure SSH access to your DGX Spark device using either NVIDIA Sync or a manual SSH configuration. NVIDIA Sync provides a graphical interface for device management with integrated app launching, while manual SSH gives you direct command-line control with port forwarding capabilities. Both approaches enable you to run terminal commands, access web applications, and manage your DGX Spark remotely from your laptop.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/connect-to-your-spark/README.md`
<!-- GENERATED:END -->

## When to use this skill
- User just got their DGX Spark and wants to use it from their laptop
- Any other dgx-spark-* skill needs SSH access and the user hasn't configured it yet
- User reports "can't connect to my Spark" or "SSH hangs / can't resolve spark-abcd.local"

## Two paths — help the user pick
- **NVIDIA Sync (recommended)** — GUI, handles SSH key generation + aliasing + port forwarding for apps. Required if they want one-click app launchers (DGX Dashboard, VS Code, Open WebUI tunnels).
- **Manual SSH** — if they prefer a CLI-only workflow, or Sync isn't supported on their platform.

Most users should use NVIDIA Sync unless they have a specific reason not to.

## Key decisions
- **Hostname vs IP** — default is the mDNS hostname (`spark-abcd.local`). On corporate networks that block mDNS, they'll need to use the IP address from their router's admin panel. Quick test: `ping spark-abcd.local` — if it hangs, mDNS is blocked.
- **First-boot wait** — after initial system setup, the Spark can take 3–4 minutes to finish updates before SSH becomes available. Don't diagnose connection issues in this window.

## Non-obvious gotchas
- NVIDIA Sync's password prompt happens **once** — it uses the password only to install the SSH key, then discards it. If auth fails, the key install didn't complete; re-run the add-device flow.
- mDNS `.local` resolution is OS- and network-stack-specific. It works on most home Wi-Fi; it is often broken on corporate VPNs or guest networks.
- Port forwarding for web apps is a separate step (SSH `-L` flag or Custom Ports in Sync) — connecting over SSH alone doesn't give laptop browsers access to web UIs running on the Spark.

## Related skills
- **Alternative**: `dgx-spark-tailscale` — use Tailscale VPN for remote access instead of local-network SSH. Works off-network.
- **Follow-ups (what users typically do next)**:
  - `dgx-spark-ollama` — run a local LLM
  - `dgx-spark-open-webui` — web chat UI
  - `dgx-spark-vscode` — remote development
  - `dgx-spark-dgx-dashboard` — system monitoring (already pre-installed, just needs the tunnel)
- **Multi-Spark setups depend on this first**: `dgx-spark-connect-two-sparks`, `dgx-spark-connect-three-sparks`, `dgx-spark-multi-sparks-through-switch`
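For the manual-SSH path, alias and port forwarding can live together in one `~/.ssh/config` entry. This is a minimal sketch; the hostname `spark-abcd.local`, the login user, and port `8080` are placeholders, not values from the playbook:

```
# ~/.ssh/config — alias the Spark and forward a web-UI port in one place
Host spark
    HostName spark-abcd.local          # or the IP from your router if mDNS is blocked
    User nvidia                        # placeholder login user
    LocalForward 8080 localhost:8080   # web UI on the Spark -> http://localhost:8080 locally
```

With this in place, `ssh spark` opens a shell and establishes the tunnel in one step, covering the port-forwarding gotcha above.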
20
skills/dgx-spark-connect-two-sparks/SKILL.md
Normal file
@ -0,0 +1,20 @@
---
name: dgx-spark-connect-two-sparks
description: Connect two Spark devices and set them up for inference and fine-tuning — on NVIDIA DGX Spark. Use when setting up connect-two-sparks on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/connect-two-sparks/README.md -->
# Connect Two Sparks

> Connect two Spark devices and set them up for inference and fine-tuning

Configure two DGX Spark systems for high-speed inter-node communication using 200GbE direct QSFP connections. This setup enables distributed workloads across multiple DGX Spark nodes by establishing network connectivity and configuring SSH authentication.

**Outcome**: You will physically connect two DGX Spark devices with a QSFP cable, configure network interfaces for cluster communication, and establish passwordless SSH between nodes to create a functional distributed computing environment.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/connect-two-sparks/README.md`
<!-- GENERATED:END -->
21
skills/dgx-spark-cuda-x-data-science/SKILL.md
Normal file
@ -0,0 +1,21 @@
---
name: dgx-spark-cuda-x-data-science
description: Install and use NVIDIA cuML and NVIDIA cuDF to accelerate UMAP, HDBSCAN, pandas and more with zero code changes — on NVIDIA DGX Spark. Use when setting up cuda-x-data-science on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/cuda-x-data-science/README.md -->
# CUDA-X Data Science

> Install and use NVIDIA cuML and NVIDIA cuDF to accelerate UMAP, HDBSCAN, pandas and more with zero code changes

This playbook includes two example notebooks that demonstrate the acceleration of key machine learning algorithms and core pandas operations using CUDA-X Data Science libraries:

- **NVIDIA cuDF:** Accelerates operations for data preparation and core data processing of 8GB of strings data, with no code changes.
- **NVIDIA cuML:** Accelerates popular, compute-intensive machine learning algorithms in scikit-learn (LinearSVC), UMAP, and HDBSCAN, with no code changes.

CUDA-X Data Science (formerly RAPIDS) is an open-source library collection that accelerates the data science and data processing ecosystem. These libraries accelerate popular Python tools like scikit-learn and pandas with zero code changes. On DGX Spark, these libraries maximize performance at your desk with your existing code.

**Outcome**: You will accelerate popular machine learning algorithms and data analytics operations on the GPU. You will understand how to accelerate popular Python tools, and the value of running data science workflows on your DGX Spark.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/cuda-x-data-science/README.md`
<!-- GENERATED:END -->
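The zero-code-change claim is easy to see in a tiny sketch. The pandas code below (illustrative data, not from the playbook) is identical for CPU and GPU; per the cuDF documentation, launching the same file with `python -m cudf.pandas script.py` routes supported operations to the GPU without editing the code.

```python
# Ordinary pandas code — unchanged whether run plain (CPU) or
# via `python -m cudf.pandas script.py` (GPU-accelerated by cuDF).
import pandas as pd

df = pd.DataFrame({
    "word": ["spark", "gpu", "spark", "cuda", "gpu", "spark"],
    "count": [1, 2, 3, 4, 5, 6],
})
# Group, sum, and rank — the kind of core data-processing step the notebook accelerates
totals = df.groupby("word")["count"].sum().sort_values(ascending=False)
print(totals.to_dict())  # {'spark': 10, 'gpu': 7, 'cuda': 4}
```

The same launcher trick is what the 8GB strings-data notebook relies on: the script stays pure pandas.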
16
skills/dgx-spark-dgx-dashboard/SKILL.md
Normal file
@ -0,0 +1,16 @@
---
name: dgx-spark-dgx-dashboard
description: Monitor your DGX system and launch JupyterLab — on NVIDIA DGX Spark. Use when setting up dgx-dashboard on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/dgx-dashboard/README.md -->
# DGX Dashboard

> Monitor your DGX system and launch JupyterLab

The DGX Dashboard is a web application that runs locally on DGX Spark devices, providing a graphical interface for system updates, resource monitoring, and an integrated JupyterLab environment. Users can access the dashboard locally from the app launcher or remotely through NVIDIA Sync or SSH tunneling. The dashboard is the easiest way to update system packages and firmware when working remotely.

**Outcome**: You will learn how to access and use the DGX Dashboard on your DGX Spark device. By the end of this walkthrough, you will be able to launch JupyterLab instances with pre-configured Python environments, monitor GPU performance, manage system updates, and run a sample AI workload using Stable Diffusion. You'll understand multiple access methods including desktop shortcuts, NVIDIA Sync, and manual SSH tunneling.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/dgx-dashboard/README.md`
<!-- GENERATED:END -->
28
skills/dgx-spark-flux-finetuning/SKILL.md
Normal file
@ -0,0 +1,28 @@
---
name: dgx-spark-flux-finetuning
description: Fine-tune FLUX.1-dev 12B model using Dreambooth LoRA for custom image generation — on NVIDIA DGX Spark. Use when setting up flux-finetuning on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/flux-finetuning/README.md -->
# FLUX.1 Dreambooth LoRA Fine-tuning

> Fine-tune FLUX.1-dev 12B model using Dreambooth LoRA for custom image generation

This playbook demonstrates how to fine-tune the FLUX.1-dev 12B model using multi-concept Dreambooth LoRA (Low-Rank Adaptation) for custom image generation on DGX Spark. With 128GB of unified memory and powerful GPU acceleration, DGX Spark provides an ideal environment for training an image generation model with multiple models loaded in memory, such as the Diffusion Transformer, CLIP Text Encoder, T5 Text Encoder, and the Autoencoder.

Multi-concept Dreambooth LoRA fine-tuning allows you to teach FLUX.1 new concepts, characters, and styles. The trained LoRA weights can be easily integrated into existing ComfyUI workflows, making it perfect for prototyping and experimentation. Moreover, this playbook demonstrates how DGX Spark can not only load several models in memory, but also train and generate high-resolution images at 1024px and above.

**Outcome**: You will have a fine-tuned FLUX.1 model capable of generating images with your custom concepts, readily available for ComfyUI workflows.
The setup includes:
- FLUX.1-dev model fine-tuning using the Dreambooth LoRA technique
- Training on custom concepts ("tjtoy" toy and "sparkgpu" GPU)
- High-resolution 1K diffusion training and inference
- ComfyUI integration for intuitive visual workflows
- Docker containerization for reproducible environments

Duration: 30–45 minutes for initial setup, plus model download time

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/flux-finetuning/README.md`
<!-- GENERATED:END -->
18
skills/dgx-spark-isaac/SKILL.md
Normal file
@ -0,0 +1,18 @@
---
name: dgx-spark-isaac
description: Build Isaac Sim and Isaac Lab from source for Spark — on NVIDIA DGX Spark. Use when setting up isaac on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/isaac/README.md -->
# Install and Use Isaac Sim and Isaac Lab

> Build Isaac Sim and Isaac Lab from source for Spark

Isaac Sim is a robotics simulation platform built on NVIDIA Omniverse that enables photorealistic, physically accurate simulations of robots and environments. It provides a comprehensive toolkit for robotics development, including physics simulation, sensor simulation, and visualization capabilities. Isaac Lab is a reinforcement learning framework built on top of Isaac Sim, designed for training and deploying RL policies for robotics applications.

Isaac Sim uses GPU-accelerated physics simulation to enable fast, realistic robot simulations that can run faster than real-time. Isaac Lab extends this with pre-built RL environments, training scripts, and evaluation tools for common robotics tasks like locomotion, manipulation, and navigation. Together, they provide an end-to-end solution for developing, training, and testing robotics applications entirely in simulation before deploying to real hardware.

**Outcome**: You'll build Isaac Sim from source on your NVIDIA DGX Spark device and set up Isaac Lab for reinforcement learning experiments. This includes compiling the Isaac Sim engine, configuring the development environment, and running a sample RL training task to verify the installation.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/isaac/README.md`
<!-- GENERATED:END -->
25
skills/dgx-spark-jax/SKILL.md
Normal file
@ -0,0 +1,25 @@
---
name: dgx-spark-jax
description: Optimize JAX to run on Spark — on NVIDIA DGX Spark. Use when setting up jax on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/jax/README.md -->
# Optimized JAX

> Optimize JAX to run on Spark

JAX lets you write **NumPy-style Python code** and run it fast on GPUs without writing CUDA. It does this by:

- **NumPy on accelerators**: Use `jax.numpy` just like NumPy, but arrays live on the GPU.
- **Function transformations**:
  - `jit` → Compiles your function into fast GPU code
  - `grad` → Gives you automatic differentiation
  - `vmap` → Vectorizes your function across batches
  - `pmap` → Runs across multiple GPUs in parallel

**Outcome**: You'll set up a JAX development environment on NVIDIA Spark with Blackwell architecture that enables high-performance machine learning prototyping using familiar NumPy-like abstractions, complete with GPU acceleration and performance optimization capabilities.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/jax/README.md`
<!-- GENERATED:END -->
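The transformations listed above compose on ordinary functions. A minimal sketch (illustrative, not from the playbook; runs on CPU or GPU depending on the installed `jaxlib`):

```python
# jit / grad / vmap on a NumPy-style function
import jax
import jax.numpy as jnp

def loss(x):
    return jnp.sum(x ** 2)           # plain NumPy-style code; arrays live on the accelerator

grad_loss = jax.grad(loss)            # automatic differentiation: d/dx sum(x^2) = 2x
fast_loss = jax.jit(loss)             # compiled version of the same function
shifted = jax.vmap(lambda v: v + 1)   # vectorize an elementwise function over a batch

x = jnp.arange(3.0)                   # [0., 1., 2.]
print(float(fast_loss(x)))            # 5.0
print(grad_loss(x))                   # [0. 2. 4.]
print(shifted(x))                     # [1. 2. 3.]
```

The same `loss` function is reused untouched by each transformation, which is the prototyping workflow this playbook optimizes for Spark.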
24
skills/dgx-spark-live-vlm-webui/SKILL.md
Normal file
@ -0,0 +1,24 @@
---
name: dgx-spark-live-vlm-webui
description: Real-time Vision Language Model interaction with webcam streaming — on NVIDIA DGX Spark. Use when setting up live-vlm-webui on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/live-vlm-webui/README.md -->
# Live VLM WebUI

> Real-time Vision Language Model interaction with webcam streaming

Live VLM WebUI is a universal web interface for real-time Vision Language Model (VLM) interaction and benchmarking. It enables you to stream your webcam directly to any VLM backend (Ollama, vLLM, SGLang, or cloud APIs) and receive live AI-powered analysis. This tool is perfect for testing VLM models, benchmarking performance across different hardware configurations, and exploring vision AI capabilities.

The interface provides WebRTC-based video streaming, integrated GPU monitoring, customizable prompts, and support for multiple VLM backends. It works seamlessly with the powerful Blackwell GPU in your DGX Spark, enabling real-time vision inference at impressive speeds.

**Outcome**: You'll set up a complete real-time vision AI testing environment on your DGX Spark that allows you to:

- Stream webcam video and get instant VLM analysis through a web browser
- Test and compare different vision language models (Gemma 3, Llama Vision, Qwen VL, etc.)
- Monitor GPU and system performance in real-time while models process video frames
- Customize prompts for various use cases (object detection, scene description, OCR, safety monitoring)
- Access the interface from any device on your network with a web browser

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/live-vlm-webui/README.md`
<!-- GENERATED:END -->
22
skills/dgx-spark-llama-cpp/SKILL.md
Normal file
@ -0,0 +1,22 @@
---
name: dgx-spark-llama-cpp
description: Build llama.cpp with CUDA and serve models via an OpenAI-compatible API (Gemma 4 31B IT as example) — on NVIDIA DGX Spark. Use when setting up llama-cpp on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/llama-cpp/README.md -->
# Run models with llama.cpp on DGX Spark

> Build llama.cpp with CUDA and serve models via an OpenAI-compatible API (Gemma 4 31B IT as example)

[llama.cpp](https://github.com/ggml-org/llama.cpp) is a lightweight C/C++ inference stack for large language models. You build it with CUDA so tensor work runs on the DGX Spark GB10 GPU, then load GGUF weights and expose chat through `llama-server`'s OpenAI-compatible HTTP API.

This playbook walks through that stack end to end. As the model example, it uses **Gemma 4 31B IT** - a frontier reasoning model built by Google DeepMind that llama.cpp supports, with strengths in coding, agentic workflows, and fine-tuning. The instructions download its **F16** GGUF from Hugging Face. The same build and server steps apply to other GGUFs (including other sizes in the support matrix below).

**Outcome**: You will build llama.cpp with CUDA for GB10, download a Gemma 4 31B IT model checkpoint, and run **`llama-server`** with GPU offload. You get:

- Local inference through llama.cpp (no separate Python inference framework required)
- An OpenAI-compatible `/v1/chat/completions` endpoint for tools and apps
- A concrete validation that **Gemma 4 31B IT** runs on this stack on DGX Spark

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/llama-cpp/README.md`
<!-- GENERATED:END -->
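As a sketch of what "OpenAI-compatible" means in practice, here is the shape of a chat request you could POST to `llama-server` once it's running. The endpoint path comes from the text above; the port, model name, and message contents are placeholder assumptions:

```python
# Build the request body for llama-server's /v1/chat/completions endpoint.
import json

payload = {
    "model": "local",  # llama-server serves whichever GGUF it was started with
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello from DGX Spark."},
    ],
    "temperature": 0.7,
}
body = json.dumps(payload)
# POST body to e.g. http://localhost:8080/v1/chat/completions
# with header Content-Type: application/json (port is an assumption)
parsed = json.loads(body)
print(parsed["messages"][1]["role"])  # round-trips cleanly as JSON
```

Because the schema matches OpenAI's chat API, existing OpenAI SDK clients can usually be pointed at the local server by changing only the base URL.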
22
skills/dgx-spark-llama-factory/SKILL.md
Normal file
@ -0,0 +1,22 @@
---
name: dgx-spark-llama-factory
description: Install and fine-tune models with LLaMA Factory — on NVIDIA DGX Spark. Use when setting up llama-factory on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/llama-factory/README.md -->
# LLaMA Factory

> Install and fine-tune models with LLaMA Factory

LLaMA Factory is an open-source framework that simplifies the process of training and fine-tuning large language models. It offers a unified interface for a variety of cutting-edge methods such as SFT, RLHF, and QLoRA techniques. It also supports a wide range of LLM architectures such as LLaMA, Mistral, and Qwen. This playbook demonstrates how to fine-tune large language models using the LLaMA Factory CLI on your NVIDIA Spark device.

**Outcome**: You'll set up LLaMA Factory on NVIDIA Spark with Blackwell architecture to fine-tune large language models using LoRA, QLoRA, and full fine-tuning methods. This enables efficient model adaptation for specialized domains while leveraging hardware-specific optimizations.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/llama-factory/README.md`
<!-- GENERATED:END -->
25
skills/dgx-spark-lm-studio/SKILL.md
Normal file
@ -0,0 +1,25 @@
---
name: dgx-spark-lm-studio
description: Deploy LM Studio and serve LLMs on a Spark device; use LM Link to access models remotely. — on NVIDIA DGX Spark. Use when setting up lm-studio on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/lm-studio/README.md -->
# LM Studio on DGX Spark

> Deploy LM Studio and serve LLMs on a Spark device; use LM Link to access models remotely.

LM Studio is an application for discovering, running, and serving large language models entirely on your own hardware. You can run local LLMs like gpt-oss, Qwen3, Gemma3, DeepSeek, and many more models privately and for free.

This playbook shows you how to deploy LM Studio on an NVIDIA DGX Spark device to run LLMs locally with GPU acceleration. Running LM Studio on DGX Spark enables Spark to act as your own private, high-performance LLM server.

**LM Link** (optional) lets you use your Spark's models from another machine as if they were local. You can link your DGX Spark and your laptop (or other devices) over an end-to-end encrypted connection, so you can load and run models on the Spark from your laptop without being on the same LAN or opening network access. See [LM Link](https://lmstudio.ai/link) and Step 3b in the Instructions.

**Outcome**: You'll deploy LM Studio on an NVIDIA DGX Spark device to run gpt-oss 120B, and use the model from your laptop. More specifically, you will:

- Install **llmster**, a fully headless, terminal-native LM Studio, on the Spark
- Run LLM inference locally on DGX Spark via API
- Interact with models from your laptop using the LM Studio SDK
- Optionally use **LM Link** to connect Spark and laptop over an encrypted link so remote models appear as local (no same-network or bind setup required)

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/lm-studio/README.md`
<!-- GENERATED:END -->
27
skills/dgx-spark-multi-agent-chatbot/SKILL.md
Normal file
@ -0,0 +1,27 @@
---
name: dgx-spark-multi-agent-chatbot
description: Deploy a multi-agent chatbot system and chat with agents on your Spark — on NVIDIA DGX Spark. Use when setting up multi-agent-chatbot on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/multi-agent-chatbot/README.md -->
# Build and Deploy a Multi-Agent Chatbot

> Deploy a multi-agent chatbot system and chat with agents on your Spark

This playbook shows you how to use DGX Spark to prototype, build, and deploy a fully local multi-agent system. With 128GB of unified memory, DGX Spark can run multiple LLMs and VLMs in parallel — enabling interactions across agents.

At the core is a supervisor agent powered by gpt-oss-120B, orchestrating specialized downstream agents for coding, retrieval-augmented generation (RAG), and image understanding. Thanks to DGX Spark's out-of-the-box support for popular AI frameworks and libraries, development and prototyping are fast and frictionless. Together, these components demonstrate how complex, multimodal workflows can be executed efficiently on local, high-performance hardware.

**Outcome**: You will have a full-stack multi-agent chatbot system running on your DGX Spark, accessible through your local web browser.
The setup includes:
- LLM and VLM model serving using llama.cpp servers and TensorRT LLM servers
- GPU acceleration for both model inference and document retrieval
- Multi-agent system orchestration using a supervisor agent powered by gpt-oss-120B
- MCP (Model Context Protocol) servers as tools for the supervisor agent

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/multi-agent-chatbot/README.md`
<!-- GENERATED:END -->
23
skills/dgx-spark-multi-modal-inference/SKILL.md
Normal file
@ -0,0 +1,23 @@
---
name: dgx-spark-multi-modal-inference
description: Set up multi-modal inference with TensorRT — on NVIDIA DGX Spark. Use when setting up multi-modal-inference on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/multi-modal-inference/README.md -->
# Multi-modal Inference

> Set up multi-modal inference with TensorRT

Multi-modal inference combines different data types, such as **text, images, and audio**, within a single model pipeline to generate or interpret richer outputs. Instead of processing one input type at a time, multi-modal systems share representations across modalities, enabling tasks like **text-to-image generation**, **image captioning**, and **vision-language reasoning**.

On GPUs, this enables **parallel processing across modalities** for faster, higher-fidelity results on tasks that combine language and vision.

**Outcome**: You'll deploy GPU-accelerated multi-modal inference capabilities on NVIDIA Spark using TensorRT to run Flux.1 and SDXL diffusion models with optimized performance across multiple precision formats (FP16, FP8, FP4).

Duration: 45–90 minutes depending on model downloads and optimization steps

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/multi-modal-inference/README.md`
<!-- GENERATED:END -->
14
skills/dgx-spark-multi-sparks-through-switch/SKILL.md
Normal file
@ -0,0 +1,14 @@
---
name: dgx-spark-multi-sparks-through-switch
description: Set up a cluster of DGX Spark devices that are connected through Switch — on NVIDIA DGX Spark. Use when setting up multi-sparks-through-switch on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/multi-sparks-through-switch/README.md -->
# Connect Multiple DGX Spark through a Switch

> Set up a cluster of DGX Spark devices that are connected through Switch

Configure four DGX Spark systems for high-speed inter-node communication using 200Gbps QSFP connections through a QSFP switch. This setup enables distributed workloads across multiple DGX Spark nodes by establishing network connectivity and configuring SSH authentication.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/multi-sparks-through-switch/README.md`
<!-- GENERATED:END -->
23
skills/dgx-spark-nccl/SKILL.md
Normal file
@ -0,0 +1,23 @@
---
name: dgx-spark-nccl
description: Install and test NCCL on two Sparks — on NVIDIA DGX Spark. Use when setting up nccl on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/nccl/README.md -->
# NCCL for Two Sparks

> Install and test NCCL on two Sparks

NCCL (NVIDIA Collective Communication Library) enables high-performance GPU-to-GPU communication across multiple nodes. This walkthrough sets up NCCL for multi-node distributed training on DGX Spark systems with Blackwell architecture. You'll configure networking, build NCCL from source with Blackwell support, and validate communication between nodes.

**Outcome**: You'll have a working multi-node NCCL environment that enables high-bandwidth GPU communication across DGX Spark systems for distributed training workloads, with validated network performance and proper GPU topology detection.

Duration: 30 minutes for setup and validation · Risk: Medium - involves network configuration changes

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/nccl/README.md`
<!-- GENERATED:END -->
16
skills/dgx-spark-nemo-fine-tune/SKILL.md
Normal file
@ -0,0 +1,16 @@
---
name: dgx-spark-nemo-fine-tune
description: Use NVIDIA NeMo to fine-tune models locally — on NVIDIA DGX Spark. Use when setting up nemo-fine-tune on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/nemo-fine-tune/README.md -->
# Fine-tune with NeMo

> Use NVIDIA NeMo to fine-tune models locally

This playbook guides you through setting up and using NVIDIA NeMo AutoModel for fine-tuning large language models and vision-language models on NVIDIA Spark devices. NeMo AutoModel provides GPU-accelerated, end-to-end training for Hugging Face models with native PyTorch support, enabling instant fine-tuning without conversion delays. The framework supports distributed training across single GPU to multi-node clusters, with optimized kernels and memory-efficient recipes specifically designed for ARM64 architecture and Blackwell GPU systems.

**Outcome**: You'll establish a complete fine-tuning environment for large language models (1-70B parameters) and vision-language models using NeMo AutoModel on your NVIDIA Spark device. By the end, you'll have a working installation that supports parameter-efficient fine-tuning (PEFT), supervised fine-tuning (SFT), and distributed training capabilities with FP8 precision optimizations, all while maintaining compatibility with the Hugging Face ecosystem.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/nemo-fine-tune/README.md`
<!-- GENERATED:END -->
30
skills/dgx-spark-nemoclaw/SKILL.md
Normal file
30
skills/dgx-spark-nemoclaw/SKILL.md
Normal file
@ -0,0 +1,30 @@
|
||||
---
|
||||
name: dgx-spark-nemoclaw
|
||||
description: Install NemoClaw on DGX Spark with local Ollama inference and Telegram bot integration — on NVIDIA DGX Spark. Use when setting up nemoclaw on Spark hardware.
|
||||
---
|
||||
|
||||
<!-- GENERATED:BEGIN from nvidia/nemoclaw/README.md -->
|
||||
# NemoClaw with Nemotron 3 Super and Telegram on DGX Spark
|
||||
|
||||
> Install NemoClaw on DGX Spark with local Ollama inference and Telegram bot integration
|
||||
|
||||
**NVIDIA NemoClaw** is an open-source reference stack that simplifies running OpenClaw always-on assistants more safely. It installs the **NVIDIA OpenShell** runtime -- an environment designed for executing agents with additional security -- and open-source models like NVIDIA Nemotron. A single installer command handles Node.js, OpenShell, and the NemoClaw CLI, then walks you through an onboard wizard to create a sandboxed agent on your DGX Spark using Ollama with Nemotron 3 Super.
|
||||
|
||||
By the end of this playbook you will have a working AI agent inside an OpenShell sandbox, accessible via a web dashboard and a Telegram bot, with inference routed to a local Nemotron 3 Super 120B model on your Spark -- all without exposing your host filesystem or network to the agent.
|
||||
|
||||
### What you'll accomplish
|
||||
|
||||
- Configure Docker and the NVIDIA container runtime for OpenShell on DGX Spark
|
||||
- Install Ollama, pull Nemotron 3 Super 120B, and configure it for sandbox access
|
||||
|
||||
**Outcome**: - Configure Docker and the NVIDIA container runtime for OpenShell on DGX Spark
|
||||
- Install Ollama, pull Nemotron 3 Super 120B, and configure it for sandbox access
|
||||
- Install NemoClaw with a single command (handles Node.js, OpenShell, and the CLI)
|
||||
- Run the onboard wizard to create a sandbox and configure local inference
|
||||
- Chat with the agent via the CLI, TUI, and web UI
|
||||
- Set up a Telegram bot that forwards messages to your sandboxed agent
|
||||
|
||||
### Notice and disclaimers
|
||||
|
||||
**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/nemoclaw/README.md`
|
||||
<!-- GENERATED:END -->
22
skills/dgx-spark-nemotron/SKILL.md
Normal file
@ -0,0 +1,22 @@
---
name: dgx-spark-nemotron
description: Run Nemotron-3-Nano-30B model using llama.cpp on DGX Spark — on NVIDIA DGX Spark. Use when setting up nemotron on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/nemotron/README.md -->
# Nemotron-3-Nano with llama.cpp

> Run Nemotron-3-Nano-30B model using llama.cpp on DGX Spark

Nemotron-3-Nano-30B-A3B is NVIDIA's powerful language model featuring a 30 billion parameter Mixture of Experts (MoE) architecture with only 3 billion active parameters. This efficient design enables high-quality inference with lower computational requirements, making it ideal for DGX Spark's GB10 GPU.

This playbook demonstrates how to run Nemotron-3-Nano using llama.cpp, which compiles CUDA kernels at build time specifically for your GPU architecture. The model includes built-in reasoning (thinking mode) and tool calling support via the chat template.

**Outcome**: You will have a fully functional Nemotron-3-Nano-30B-A3B inference server running on your DGX Spark, accessible via an OpenAI-compatible API. This setup enables:

- Local LLM inference
- OpenAI-compatible API endpoint for easy integration with existing tools
- Built-in reasoning and tool calling capabilities

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/nemotron/README.md`
<!-- GENERATED:END -->
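Talking to the OpenAI-compatible endpoint looks like this. A minimal sketch: the port (8080) and model name are assumptions that must match how you launched the llama.cpp server, and the request/response shapes follow the standard chat-completions format.

```python
import json

# Hypothetical endpoint; match host/port to your llama.cpp server flags.
BASE_URL = "http://localhost:8080/v1/chat/completions"

def build_request(prompt, model="nemotron-3-nano-30b"):
    """Serialize a standard chat-completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
    })

def extract_reply(response_body):
    """Pull the assistant text out of a chat-completions response."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]

# Shape of a typical (non-streaming) response, for illustration:
sample = json.dumps(
    {"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}
)
print(extract_reply(sample))  # Hello!
```

Because the endpoint is OpenAI-compatible, any OpenAI client library pointed at `BASE_URL` with a dummy API key should work the same way.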
29
skills/dgx-spark-nim-llm/SKILL.md
Normal file
@ -0,0 +1,29 @@
---
name: dgx-spark-nim-llm
description: Deploy a NIM on Spark — on NVIDIA DGX Spark. Use when setting up nim-llm on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/nim-llm/README.md -->
# NIM on Spark

> Deploy a NIM on Spark

NVIDIA NIM is containerized software for fast, reliable AI model serving and inference on NVIDIA GPUs. This playbook demonstrates how to run NIM microservices for LLMs on DGX Spark devices, enabling local GPU inference through a simple Docker workflow. You'll authenticate with NVIDIA's registry, launch the NIM inference microservice, and perform basic inference testing to verify functionality.

### What you'll accomplish

You'll launch a NIM container on your DGX Spark device to expose a GPU-accelerated HTTP endpoint for text completions. While these instructions feature the Llama 3.1 8B NIM, additional NIMs, including the [Qwen3-32B NIM](https://catalog.ngc.nvidia.com/orgs/nim/teams/qwen/containers/qwen3-32b-dgx-spark), are available for DGX Spark (see them [here](https://docs.nvidia.com/nim/large-language-models/1.14.0/release-notes.html#new-language-models)).

### What to know before starting

- Working in a terminal environment
- Using Docker commands and GPU-enabled containers
- Basic familiarity with REST APIs and curl commands
- Understanding of NVIDIA GPU environments and CUDA

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/nim-llm/README.md`
<!-- GENERATED:END -->
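A NIM container takes a while to download weights and warm up before it serves requests, so a polling helper is handy before the first curl. This is a generic sketch with the probe injected as a function; the idea is that in practice the probe would be an HTTP GET against the NIM's readiness endpoint (assumed here, check your NIM's docs for the exact path).

```python
import time

def wait_until_ready(probe, timeout_s=600.0, interval_s=2.0, sleep=time.sleep):
    """Poll `probe()` until it returns True or the timeout elapses.

    `probe` is any zero-argument callable, e.g. a requests.get against the
    container's health/readiness route wrapped to return a bool.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False

# Stubbed probe for demonstration: reports "ready" on the third poll.
calls = {"n": 0}
def fake_probe():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until_ready(fake_probe, timeout_s=5.0, sleep=lambda _: None))  # True
```

Injecting `sleep` keeps the helper testable without real delays; in a shell you would achieve the same with a `curl` retry loop.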
27
skills/dgx-spark-nvfp4-quantization/SKILL.md
Normal file
@ -0,0 +1,27 @@
---
name: dgx-spark-nvfp4-quantization
description: Quantize a model to NVFP4 to run on Spark using TensorRT Model Optimizer — on NVIDIA DGX Spark. Use when setting up nvfp4-quantization on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/nvfp4-quantization/README.md -->
# NVFP4 Quantization

> Quantize a model to NVFP4 to run on Spark using TensorRT Model Optimizer

NVFP4 is a 4-bit floating-point format introduced with NVIDIA Blackwell GPUs to maintain model accuracy while reducing memory bandwidth and storage requirements for inference workloads. Unlike uniform INT4 quantization, NVFP4 retains floating-point semantics with a shared exponent and a compact mantissa, allowing higher dynamic range and more stable convergence. NVIDIA Blackwell Tensor Cores natively support mixed-precision execution across FP16, FP8, and FP4, enabling models to use FP4 for weights and activations while accumulating in higher precision (typically FP16). This design minimizes quantization error during matrix multiplications and supports efficient conversion pipelines in TensorRT-LLM for fine-tuned layer-wise quantization.

Immediate benefits are:

- Cut memory use ~3.5x vs FP16 and ~1.8x vs FP8
- Maintain accuracy close to FP8 (usually <1% loss)

**Outcome**: You'll quantize the DeepSeek-R1-Distill-Llama-8B model using NVIDIA's TensorRT Model Optimizer inside a TensorRT-LLM container, producing an NVFP4 quantized model for deployment on NVIDIA DGX Spark.

The examples use NVIDIA FP4 quantized models, which reduce model size by approximately 2x by lowering the precision of model layers. This quantization approach aims to preserve accuracy while providing significant throughput improvements. However, quantization can impact model accuracy - we recommend running evaluations to verify that the quantized model maintains acceptable performance for your use case.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/nvfp4-quantization/README.md`
<!-- GENERATED:END -->
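The memory numbers above can be sanity-checked with back-of-envelope arithmetic. Assuming NVFP4 stores one shared 8-bit scale per 16-element block (an illustrative block size, not a spec quote), each weight costs 4 + 8/16 = 4.5 bits:

```python
# Effective bits per weight; the 16-element block with an FP8 scale is an
# assumption for illustration.
bits = {"fp16": 16.0, "fp8": 8.0, "nvfp4": 4.0 + 8.0 / 16}

def model_gib(n_params, fmt):
    """Approximate weight-storage size in GiB for n_params parameters."""
    return n_params * bits[fmt] / 8 / 2**30

n = 8e9  # e.g. an 8B-parameter model like DeepSeek-R1-Distill-Llama-8B
print(round(bits["fp16"] / bits["nvfp4"], 2))  # 3.56 -> the "~3.5x vs FP16"
print(round(bits["fp8"] / bits["nvfp4"], 2))   # 1.78 -> the "~1.8x vs FP8"
print(round(model_gib(n, "nvfp4"), 1))         # roughly 4.2 GiB of weights
```

Activations, KV cache, and runtime buffers come on top of this, which is why end-to-end savings land nearer the "approximately 2x" figure quoted for whole models.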
46
skills/dgx-spark-ollama/SKILL.md
Normal file
@ -0,0 +1,46 @@
---
name: dgx-spark-ollama
description: "Install Ollama on an NVIDIA DGX Spark and expose its API to a local laptop via NVIDIA Sync SSH tunnel. Use when a user wants to run LLM inference on DGX Spark hardware and call the API from their laptop on localhost:11434 without exposing ports on their network."
---

<!-- GENERATED:BEGIN from nvidia/ollama/README.md -->
# Ollama

> Install and use Ollama

This playbook demonstrates how to set up remote access to an Ollama server running on your NVIDIA Spark device using NVIDIA Sync's Custom Apps feature. You'll install Ollama on your Spark device, configure NVIDIA Sync to create an SSH tunnel, and access the Ollama API from your local machine. This eliminates the need to expose ports on your network while enabling AI inference from your laptop through a secure SSH tunnel.

**Outcome**: You will have Ollama running on your NVIDIA Spark with Blackwell architecture and accessible via API calls from your local laptop. This setup allows you to build applications or use tools on your local machine that communicate with the Ollama API for large language model inference, leveraging the powerful GPU capabilities of your Spark device without complex network configuration.

Duration: 10-15 minutes for initial setup, 2-3 minutes for model download (varies by model size) · Risk: Low - No system-level changes, easily reversible by stopping the custom app

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/ollama/README.md`
<!-- GENERATED:END -->

## When to use this skill

- User has an NVIDIA DGX Spark with NVIDIA Sync installed on their laptop
- Wants Ollama running on Spark, API accessible from their laptop
- Wants an easy-to-use inference runtime (vs. the complexity of vLLM or TRT-LLM)

## Key decisions to confirm before executing

- **Model choice** — default in the playbook is `qwen2.5:32b` (~18GB, optimized for Blackwell). Ask the user if they want a smaller model (`qwen2.5:7b`, `llama3.1:8b`, `phi3.5:3.8b`) for lower VRAM or faster download.
- **Check first** — run `ollama --version` on the Spark before installing; skip installation if already present.

## Non-obvious gotchas

- The SSH tunnel must be re-activated after NVIDIA Sync restarts — `localhost:11434` only works while the "Ollama Server" custom app is active in Sync.
- Uninstall is destructive: `sudo rm -rf /usr/share/ollama` removes all downloaded models (often tens of GB). Confirm with user before running cleanup.
- Streaming responses (`"stream": true`) behave differently than non-streaming — use `curl -N` to see them.
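
The streaming behavior above can be handled programmatically. A minimal sketch: with `"stream": true`, Ollama's `/api/generate` endpoint returns one JSON object per line, each carrying a `"response"` fragment and a `"done"` flag on the final line.

```python
import json

def join_stream(lines):
    """Reassemble an Ollama NDJSON stream into (full_text, finished)."""
    text, done = [], False
    for line in lines:
        if not line.strip():
            continue  # tolerate blank keep-alive lines
        obj = json.loads(line)
        text.append(obj.get("response", ""))
        done = obj.get("done", False)
    return "".join(text), done

# Example lines as they would arrive over the tunnel on localhost:11434:
sample = [
    '{"response": "Hel", "done": false}',
    '{"response": "lo!", "done": false}',
    '{"response": "", "done": true}',
]
print(join_stream(sample))  # ('Hello!', True)
```

The final `"done": true` object also carries timing and token-count statistics, which are worth logging when comparing models.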

## Related skills

- **Prerequisite**: `dgx-spark-connect-to-your-spark` — NVIDIA Sync + local network access basics. If the user hasn't set this up yet, do it first.
- **Composes with**: `dgx-spark-open-webui` — web chat UI on top of Ollama. Most common follow-up.
- **Alternative**: `dgx-spark-lm-studio` — GUI-based model management instead of Ollama's CLI.
- **Alternative**: `dgx-spark-llama-cpp` — lower-level control over inference.
- **Upgrade path**: `dgx-spark-vllm` — when the user needs higher throughput or is serving multiple concurrent users.
38
skills/dgx-spark-open-webui/SKILL.md
Normal file
@ -0,0 +1,38 @@
---
name: dgx-spark-open-webui
description: Install Open WebUI on NVIDIA DGX Spark for a web-based chat interface to LLMs running on Spark GPU. Use when a user wants a browser UI for chatting with local models — most commonly paired with Ollama, either bundled inside Open WebUI or as a separate backend.
---

<!-- GENERATED:BEGIN from nvidia/open-webui/README.md -->
# Open WebUI with Ollama

> Install Open WebUI and use Ollama to chat with models on your Spark

Open WebUI is an extensible, self-hosted AI interface that operates entirely offline. This playbook shows you how to deploy Open WebUI with an integrated Ollama server on your DGX Spark device, letting you access the web interface from your local browser while the models run on Spark's GPU.

**Outcome**: You will have a fully functional Open WebUI installation running on your DGX Spark. This will be accessible through your local web browser either via **NVIDIA Sync's managed SSH tunneling (recommended)** or via manual setup. The setup includes integrated Ollama for model management, persistent data storage, and GPU acceleration for model inference.

Duration: 15-20 minutes for initial setup, plus model download time (varies by model size)

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/open-webui/README.md`
<!-- GENERATED:END -->

## When to use this skill

- User has Spark SSH access (`dgx-spark-connect-to-your-spark`) and wants a web chat UI, not just CLI
- User already has Ollama and wants to chat through a browser
- User wants a self-hosted ChatGPT-like interface running entirely on their own hardware

## Key decisions

- **Bundled Ollama or separate?** — Open WebUI ships with an integrated Ollama option (single Docker container). Simpler for first-time users. If the user already ran `dgx-spark-ollama` separately, configure Open WebUI to connect to that existing Ollama instead of running two copies.
- **Sync-managed or manual Docker?** — NVIDIA Sync can manage the SSH tunnel + custom-port setup automatically. Manual Docker gives more control but requires the user to handle port forwarding themselves.

## Non-obvious gotchas

- User must be in the `docker` group on the Spark — `docker ps` without sudo must work. If not, add via `sudo usermod -aG docker $USER` and **log out/in to apply** (a new SSH session is not enough — the session must be fully re-established).
- The Open WebUI container stores user accounts and chat history in a named volume. Don't `docker rm -v` the container unless you intend to lose history.
- First-run creates an admin account from whoever signs up first — if the UI is port-forwarded somewhere other users can reach, sign up immediately before anyone else does.

## Related skills

- **Prerequisite**: `dgx-spark-connect-to-your-spark` — SSH + Sync setup
- **Pairs with**: `dgx-spark-ollama` — the most common backend. Open WebUI can bundle its own Ollama, but if `dgx-spark-ollama` was already set up, reuse it (saves disk, one set of models).
- **Alternative UIs**: `dgx-spark-lm-studio` (desktop GUI, not web) · `dgx-spark-live-vlm-webui` (vision-language models specifically)
20
skills/dgx-spark-openclaw/SKILL.md
Normal file
@ -0,0 +1,20 @@
---
name: dgx-spark-openclaw
description: Run OpenClaw locally on DGX Spark with LM Studio or Ollama — on NVIDIA DGX Spark. Use when setting up openclaw on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/openclaw/README.md -->
# OpenClaw 🦞

> Run OpenClaw locally on DGX Spark with LM Studio or Ollama

OpenClaw (formerly Clawdbot & Moltbot) is a **local-first** AI agent that runs on your machine. It combines multiple capabilities into a single assistant: it remembers conversations, adapts to your usage, runs continuously, uses context from your files and apps, and can be extended with community **skills**.

Running OpenClaw and its LLMs **fully on your DGX Spark** keeps your data private and avoids ongoing cloud API costs. DGX Spark is well suited for this: it runs Linux, is designed to stay on, and has **128GB memory**, so you can run large local models for better accuracy and more capable behavior.

**Outcome**: You will have OpenClaw installed on your DGX Spark and connected to a local LLM (via LM Studio or Ollama). You can use the OpenClaw web UI to chat with your agent, and optionally connect communication channels and skills. The agent and models run entirely on your Spark—no data leaves your machine unless you add cloud or external integrations.

Duration: About 30 minutes for install and first-time model setup; model download time depends on size and network (gpt-oss-120b is ~65GB and may take longer on slower connections). · Risk: **Medium to High**—the agent has access to whatever files, tools, and channels you configure. Risk increases significantly if you enable terminal/command execution skills or connect external accounts. Without proper isolation, this setup could expose sensitive data or allow code execution. **Always follow the security measures above.**

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/openclaw/README.md`
<!-- GENERATED:END -->
23
skills/dgx-spark-openshell/SKILL.md
Normal file
@ -0,0 +1,23 @@
---
name: dgx-spark-openshell
description: Run OpenClaw with local models in an NVIDIA OpenShell sandbox on DGX Spark — on NVIDIA DGX Spark. Use when setting up openshell on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/openshell/README.md -->
# Secure Long Running AI Agents with OpenShell on DGX Spark

> Run OpenClaw with local models in an NVIDIA OpenShell sandbox on DGX Spark

OpenClaw is a local-first AI agent that runs on your machine, combining memory, file access, tool use, and community skills into a persistent assistant. Running it directly on your system means the agent can access your files, credentials, and network—creating real security risks.

**NVIDIA OpenShell** solves this problem. It is an open-source sandbox runtime that wraps the agent in kernel-level isolation with declarative YAML policies. OpenShell controls what the agent can read on disk, which network endpoints it can reach, and what privileges it has—without disabling the capabilities that make the agent useful.

By combining OpenClaw with OpenShell on DGX Spark, you get the full power of a local AI agent backed by 128GB of unified memory for large models, while enforcing explicit controls over filesystem access, network egress, and credential handling.

### Notice & Disclaimers

#### Quick Start Safety Check

**Outcome**: You will install the OpenShell CLI (`openshell`), deploy a gateway on your DGX Spark, and launch OpenClaw inside a sandboxed environment using the pre-built OpenClaw community sandbox. The sandbox enforces filesystem, network, and process isolation by default. You will also configure local inference routing so OpenClaw uses a model running on your Spark without needing external API keys.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/openshell/README.md`
<!-- GENERATED:END -->
23
skills/dgx-spark-portfolio-optimization/SKILL.md
Normal file
@ -0,0 +1,23 @@
---
name: dgx-spark-portfolio-optimization
description: GPU-Accelerated portfolio optimization using cuOpt and cuML — on NVIDIA DGX Spark. Use when setting up portfolio-optimization on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/portfolio-optimization/README.md -->
# Portfolio Optimization

> GPU-Accelerated portfolio optimization using cuOpt and cuML

This playbook demonstrates an end-to-end GPU-accelerated workflow using NVIDIA cuOpt and NVIDIA cuML to solve large-scale portfolio optimization problems, using the Mean-CVaR (Conditional Value-at-Risk) model, in near real-time.

Portfolio Optimization (PO) involves solving high-dimensional, non-linear numerical optimization problems to balance risk and return. Modern portfolios often contain thousands of assets, making traditional CPU-based solvers too slow for advanced workflows. By moving the computational heavy lifting to the GPU, this solution dramatically reduces computation time.

**Outcome**: You will implement a pipeline that provides tools for performance evaluation, strategy backtesting, benchmarking, and visualization. The workflow includes:

- **GPU-Accelerated Optimization:** Leveraging NVIDIA cuOpt LP/MILP solvers
- **Data-Driven Risk Modeling:** Implementing CVaR as a scenario-based risk measure that models tail risks without making assumptions about asset return distributions.
- **Scenario Generation:** Using GPU-accelerated Kernel Density Estimation (KDE) via NVIDIA cuML to model return distributions.
- **Real-World Constraint Management:** Implementing constraints including concentration limits, leverage constraints, turnover limits, and cardinality constraints.
- **Comprehensive Backtesting:** Evaluating portfolio performance with specific tools for testing rebalancing strategies.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/portfolio-optimization/README.md`
<!-- GENERATED:END -->
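The scenario-based CVaR mentioned above is simple to state: CVaR at level alpha is the average loss over the worst (1 - alpha) fraction of scenarios. A toy pure-Python illustration on made-up scenario losses (the playbook solves this at scale with cuOpt LP/MILP formulations rather than direct averaging):

```python
def cvar(losses, alpha=0.95):
    """Average of the worst (1 - alpha) fraction of scenario losses."""
    worst = sorted(losses, reverse=True)
    k = max(1, int(round(len(losses) * (1 - alpha))))
    return sum(worst[:k]) / k

# Ten hypothetical scenario losses for one portfolio (positive = loss):
losses = [0.5, 1.0, -0.2, 3.0, 0.1, 2.0, -1.0, 0.4, 0.9, 5.0]
print(cvar(losses, alpha=0.90))  # mean of worst 1 scenario: 5.0
print(cvar(losses, alpha=0.80))  # mean of worst 2: (5.0 + 3.0) / 2 = 4.0
```

Because CVaR only averages tail scenarios, it needs no distributional assumption about asset returns — which is exactly why the workflow pairs it with KDE-generated scenarios.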
17
skills/dgx-spark-pytorch-fine-tune/SKILL.md
Normal file
@ -0,0 +1,17 @@
---
name: dgx-spark-pytorch-fine-tune
description: Use Pytorch to fine-tune models locally — on NVIDIA DGX Spark. Use when setting up pytorch-fine-tune on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/pytorch-fine-tune/README.md -->
# Fine-tune with Pytorch

> Use Pytorch to fine-tune models locally

This playbook guides you through setting up and using Pytorch for fine-tuning large language models on NVIDIA Spark devices.

**Outcome**: You'll establish a complete fine-tuning environment for large language models (1-70B parameters) on your NVIDIA Spark device. By the end, you'll have a working installation that supports parameter-efficient fine-tuning (PEFT) and supervised fine-tuning (SFT).

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/pytorch-fine-tune/README.md`
<!-- GENERATED:END -->
24
skills/dgx-spark-rag-ai-workbench/SKILL.md
Normal file
@ -0,0 +1,24 @@
---
name: dgx-spark-rag-ai-workbench
description: Install and use AI Workbench to clone and run a reproducible RAG application — on NVIDIA DGX Spark. Use when setting up rag-ai-workbench on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/rag-ai-workbench/README.md -->
# RAG Application in AI Workbench

> Install and use AI Workbench to clone and run a reproducible RAG application

This walkthrough demonstrates how to set up and run an agentic retrieval-augmented generation (RAG) project using NVIDIA AI Workbench. You'll use AI Workbench to clone and run a pre-built agentic RAG application that intelligently routes queries, evaluates responses for relevancy and hallucination, and iterates through evaluation and generation cycles. The project uses a Gradio web interface and can work with both NVIDIA-hosted API endpoints and self-hosted models.

**Outcome**: You'll have a fully functional agentic RAG application running in NVIDIA AI Workbench with a web interface where you can submit queries and receive intelligent responses. The system will demonstrate advanced RAG capabilities including query routing, response evaluation, and iterative refinement, giving you hands-on experience with both AI Workbench's development environment and sophisticated RAG architectures.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/rag-ai-workbench/README.md`
<!-- GENERATED:END -->
22
skills/dgx-spark-sglang/SKILL.md
Normal file
@ -0,0 +1,22 @@
---
name: dgx-spark-sglang
description: Install and use SGLang on DGX Spark — on NVIDIA DGX Spark. Use when setting up sglang on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/sglang/README.md -->
# SGLang for Inference

> Install and use SGLang on DGX Spark

SGLang is a fast serving framework for large language models and vision language models that makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language. This setup uses the optimized NVIDIA SGLang NGC Container on a single NVIDIA Spark device with Blackwell architecture, providing GPU-accelerated inference with all dependencies pre-installed.

**Outcome**: You'll deploy SGLang in both server and offline inference modes on your NVIDIA Spark device, enabling high-performance LLM serving with support for text generation, chat completion, and vision-language tasks using models like DeepSeek-V2-Lite.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/sglang/README.md`
<!-- GENERATED:END -->
24
skills/dgx-spark-single-cell/SKILL.md
Normal file
@ -0,0 +1,24 @@
---
name: dgx-spark-single-cell
description: An end-to-end GPU-powered workflow for scRNA-seq using RAPIDS — on NVIDIA DGX Spark. Use when setting up single-cell on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/single-cell/README.md -->
# Single-cell RNA Sequencing

> An end-to-end GPU-powered workflow for scRNA-seq using RAPIDS

Single-cell RNA sequencing (scRNA-seq) lets researchers study gene activity in each cell on its own, exposing variation, cell types, and cell states that bulk methods hide. But these large, high-dimensional datasets take heavy compute to handle.

This playbook shows an end-to-end GPU-powered workflow for scRNA-seq using [RAPIDS-singlecell](https://rapids-singlecell.readthedocs.io/en/latest/), a RAPIDS-powered library in the [scverse® ecosystem](https://github.com/scverse). It follows the familiar [Scanpy API](https://scanpy.readthedocs.io/en/stable/) and lets researchers run the steps of data preprocessing, quality control (QC) and cleanup, visualization, and investigation faster than CPU tools by working with sparse count matrices directly on the GPU.

**Outcome**:

1. GPU-Accelerated Data Loading & Preprocessing
2. QC cells visually to understand the data
3. Filter unusual cells
4. Remove unwanted sources of variation
5. Cluster and visualize PCA and UMAP data
6. Batch Correction and analysis using Harmony, k-nearest neighbors, UMAP, and tSNE
7. Explore the biological information from the data with differential expression analysis and trajectory analysis

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/single-cell/README.md`
<!-- GENERATED:END -->
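The QC and filtering steps (2-3 above) can be sketched conceptually in plain Python. This is an illustrative stand-in only — the real workflow uses rapids-singlecell on GPU sparse matrices, and the thresholds (minimum genes detected per cell, maximum mitochondrial read fraction) are example values you would tune to your dataset:

```python
def qc_filter(cells, min_genes=200, max_mito=0.10):
    """Keep cells with enough detected genes and a low mitochondrial fraction.

    `cells` is a hypothetical list of dicts with per-gene `counts` and a
    `mito_counts` total; real pipelines compute these metrics on the GPU.
    """
    kept = []
    for cell in cells:
        n_genes = sum(1 for c in cell["counts"] if c > 0)
        total = sum(cell["counts"]) or 1
        mito_frac = cell["mito_counts"] / total
        if n_genes >= min_genes and mito_frac <= max_mito:
            kept.append(cell["id"])
    return kept

cells = [
    {"id": "AAAC", "counts": [1] * 300, "mito_counts": 10},  # passes both checks
    {"id": "TTTG", "counts": [1] * 50,  "mito_counts": 0},   # too few genes
    {"id": "GGCA", "counts": [1] * 300, "mito_counts": 90},  # mito fraction 0.3
]
print(qc_filter(cells))  # ['AAAC']
```

High mitochondrial fractions typically flag stressed or dying cells, and very low gene counts flag empty droplets — which is why these two filters come before clustering.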
23
skills/dgx-spark-spark-reachy-photo-booth/SKILL.md
Normal file
@ -0,0 +1,23 @@
---
name: dgx-spark-spark-reachy-photo-booth
description: AI augmented photo booth using the DGX Spark and Reachy Mini. — on NVIDIA DGX Spark. Use when setting up spark-reachy-photo-booth on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/spark-reachy-photo-booth/README.md -->
# Spark & Reachy Photo Booth

> AI augmented photo booth using the DGX Spark and Reachy Mini.



Spark & Reachy Photo Booth is an interactive and event-driven photo booth demo that combines the **DGX Spark™** with the **Reachy Mini** robot to create an engaging multimodal AI experience. The system showcases:

- **A multi-modal agent** built with the `NeMo Agent Toolkit`
- **A ReAct loop** driven by the `openai/gpt-oss-20b` LLM powered by `TensorRT-LLM`
- **Voice interaction** based on `nvidia/riva-parakeet-ctc-1.1B` and `hexgrad/Kokoro-82M`
- **Image generation** with `black-forest-labs/FLUX.1-Kontext-dev` for image-to-image restyling

**Outcome**: You'll deploy a complete photo booth system on DGX Spark running multiple inference models locally — LLM, image generation, speech recognition, speech generation, and computer vision — all without cloud dependencies. The Reachy robot interacts with users through natural conversation, captures photos, and generates custom images based on prompts, demonstrating real-time multimodal AI processing on edge hardware.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/spark-reachy-photo-booth/README.md`
<!-- GENERATED:END -->
18
skills/dgx-spark-speculative-decoding/SKILL.md
Normal file
@ -0,0 +1,18 @@
---
name: dgx-spark-speculative-decoding
description: Learn how to set up speculative decoding for fast inference on Spark — on NVIDIA DGX Spark. Use when setting up speculative-decoding on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/speculative-decoding/README.md -->
# Speculative Decoding

> Learn how to set up speculative decoding for fast inference on Spark

Speculative decoding speeds up text generation by using a **small, fast model** to draft several tokens ahead, then having the **larger model** quickly verify or adjust them. This way, the big model doesn't need to predict every token step-by-step, reducing latency while keeping output quality.

**Outcome**: You'll explore speculative decoding using TensorRT-LLM on NVIDIA Spark using two approaches: EAGLE-3 and Draft-Target. These examples demonstrate how to accelerate large language model inference while maintaining output quality.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/speculative-decoding/README.md`
<!-- GENERATED:END -->
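The draft-then-verify mechanism described above can be illustrated with a toy loop. Both "models" here are deterministic stand-ins over integer tokens, not real LLMs, and real systems (like TensorRT-LLM's Draft-Target mode) verify all drafted tokens in one batched target pass rather than one call per token:

```python
def speculative_decode(draft, target, prompt, k=4, max_len=12):
    """Greedy draft-target loop: accept the agreeing prefix of each k-token
    draft, plus one corrected token from the target at the first mismatch."""
    out = list(prompt)
    while len(out) < max_len:
        # 1) Small model drafts k tokens autoregressively (cheap).
        proposed = []
        for _ in range(k):
            proposed.append(draft(out + proposed))
        # 2) Target verifies them; in real engines this is one parallel pass.
        accepted = []
        for i, tok in enumerate(proposed):
            expected = target(out + proposed[:i])
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # take the target's token, stop here
                break
        out.extend(accepted)
    return out[:max_len]

# Stand-ins: the target counts upward; the draft agrees except it errs
# whenever the next token would be a multiple of 5.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + 1 if (seq[-1] + 1) % 5 else 0

print(speculative_decode(draft, target, [1], k=4, max_len=8))
# [1, 2, 3, 4, 5, 6, 7, 8]
```

The output matches greedy decoding with the target alone — the speedup comes entirely from the target verifying several tokens per step instead of generating one at a time.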
26
skills/dgx-spark-tailscale/SKILL.md
Normal file
@ -0,0 +1,26 @@
---
name: dgx-spark-tailscale
description: Use Tailscale to connect to your Spark on your home network no matter where you are — on NVIDIA DGX Spark. Use when setting up tailscale on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/tailscale/README.md -->
# Set up Tailscale on Your Spark

> Use Tailscale to connect to your Spark on your home network no matter where you are

Tailscale creates an encrypted peer-to-peer mesh network that allows secure access to your NVIDIA DGX Spark device from anywhere without complex firewall configurations or port forwarding. By installing Tailscale on both your DGX Spark and client devices, you establish a private "tailnet" where each device gets a stable private IP address and hostname, enabling seamless SSH access whether you're at home, work, or a coffee shop.

**Outcome**: You will set up Tailscale on your DGX Spark device and client machines to create secure remote access. After completion, you'll be able to SSH into your DGX Spark from anywhere using simple commands like `ssh user@spark-hostname`, with all traffic automatically encrypted and NAT traversal handled transparently.

Duration: 15-30 minutes for initial setup, 5 minutes per additional device

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/tailscale/README.md`
<!-- GENERATED:END -->
|
||||
skills/dgx-spark-trt-llm/SKILL.md (new file, +23 lines)
---
name: dgx-spark-trt-llm
description: Install and use TensorRT-LLM on DGX Spark — on NVIDIA DGX Spark. Use when setting up trt-llm on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/trt-llm/README.md -->
# TRT-LLM for Inference

> Install and use TensorRT-LLM on DGX Spark

**NVIDIA TensorRT-LLM (TRT-LLM)** is an open-source library for optimizing and accelerating large language model (LLM) inference on NVIDIA GPUs.

It provides highly efficient kernels, memory management, and parallelism strategies—like tensor, pipeline, and sequence parallelism—so developers can serve LLMs with lower latency and higher throughput. TRT-LLM integrates with frameworks like Hugging Face and PyTorch, making it easier to deploy state-of-the-art models at scale.

**Outcome**: You'll set up TensorRT-LLM to optimize and deploy large language models on your DGX Spark, achieving significantly higher throughput and lower latency than standard PyTorch inference through kernel-level optimizations, efficient memory layouts, and advanced quantization.

Duration: 45-60 minutes for setup and API server deployment · Risk: Medium - container pulls and model downloads may fail due to network issues

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/trt-llm/README.md`
<!-- GENERATED:END -->
skills/dgx-spark-txt2kg/SKILL.md (new file, +30 lines)
---
name: dgx-spark-txt2kg
description: Transform unstructured text into interactive knowledge graphs with LLM inference and graph visualization — on NVIDIA DGX Spark. Use when setting up txt2kg on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/txt2kg/README.md -->
# Text to Knowledge Graph on DGX Spark

> Transform unstructured text into interactive knowledge graphs with LLM inference and graph visualization

This playbook demonstrates how to build and deploy a comprehensive knowledge graph generation and visualization solution that serves as a reference for knowledge graph extraction. The unified memory architecture enables running larger, more accurate models that produce higher-quality knowledge graphs and deliver superior downstream GraphRAG performance.

This txt2kg playbook transforms unstructured text documents into structured knowledge graphs using:
- **Knowledge Triple Extraction**: Ollama with GPU acceleration for local LLM inference to extract subject-predicate-object relationships
- **Graph Database Storage**: ArangoDB for storing and querying knowledge triples with relationship traversal
- **GPU-Accelerated Visualization**: Three.js WebGPU rendering for interactive 2D/3D graph exploration
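The triple-extraction step boils down to prompting an LLM for structured output and parsing it into (subject, predicate, object) tuples. A minimal parser sketch, assuming the model was prompted to emit one pipe-separated triple per line (the actual txt2kg prompt and output format may differ):

```python
def parse_triples(llm_output: str):
    """Parse 'subject | predicate | object' lines into knowledge triples.

    Assumes the LLM was instructed to emit one pipe-separated triple per
    line; malformed lines are skipped. The real txt2kg pipeline's output
    format may differ from this assumption.
    """
    triples = []
    for line in llm_output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples
```

The resulting tuples map directly onto graph-database inserts: subject and object become vertices, the predicate becomes a labeled edge.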

**Outcome**: You will have a fully functional system capable of processing documents, generating and editing knowledge graphs, and answering queries, accessible through an interactive web interface. The setup includes:
- **Local LLM Inference**: Ollama for GPU-accelerated LLM inference with no API keys required
- **Graph Database**: ArangoDB for storing and querying triples with relationship traversal
- **Interactive Visualization**: GPU-accelerated graph rendering with Three.js WebGPU
- **Modern Web Interface**: Next.js frontend with document management and query interface
- **Fully Containerized**: Reproducible deployment with Docker Compose and GPU support

Duration: 2-3 minutes for initial setup and container deployment

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/txt2kg/README.md`
<!-- GENERATED:END -->
skills/dgx-spark-unsloth/SKILL.md (new file, +24 lines)
---
name: dgx-spark-unsloth
description: Optimized fine-tuning with Unsloth — on NVIDIA DGX Spark. Use when setting up unsloth on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/unsloth/README.md -->
# Unsloth on DGX Spark

> Optimized fine-tuning with Unsloth

- **Performance-first**: It claims to speed up training (e.g. 2× faster on a single GPU, up to 30× in multi-GPU setups) and reduce memory usage compared to standard methods.
- **Kernel-level optimizations**: Core compute is built with custom kernels (e.g. with Triton) and hand-optimized math to boost throughput and efficiency.
- **Quantization & model formats**: Supports dynamic quantization (4-bit, 16-bit) and GGUF formats to reduce footprint while aiming to retain accuracy.
- **Broad model support**: Works with many LLMs (LLaMA, Mistral, Qwen, DeepSeek, etc.) and allows training, fine-tuning, and exporting to targets such as Ollama, vLLM, GGUF, and Hugging Face.
- **Simplified interface**: Provides easy-to-use notebooks and tools so users can fine-tune models with minimal boilerplate.

**Outcome**: You'll set up Unsloth for optimized fine-tuning of large language models on NVIDIA Spark devices, achieving up to 2x faster training speeds with reduced memory usage through efficient parameter-efficient fine-tuning methods like LoRA and QLoRA.

Duration: 30-60 minutes for initial setup and test run

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/unsloth/README.md`
<!-- GENERATED:END -->
skills/dgx-spark-vibe-coding/SKILL.md (new file, +30 lines)
---
name: dgx-spark-vibe-coding
description: Use DGX Spark as a local or remote Vibe Coding assistant with Ollama and Continue — on NVIDIA DGX Spark. Use when setting up vibe-coding on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/vibe-coding/README.md -->
# Vibe Coding in VS Code

> Use DGX Spark as a local or remote Vibe Coding assistant with Ollama and Continue

This playbook walks you through setting up DGX Spark as a **Vibe Coding assistant** — locally or as a remote coding companion for VS Code with Continue.dev. This guide uses **Ollama** with **GPT-OSS 120B** for easy deployment of a coding assistant into VS Code. Advanced instructions are included for making the Ollama-served assistant available over your local network. This guide is also written against a **fresh installation** of the OS; if your OS is not freshly installed and you run into issues, see the troubleshooting tab.

**Outcome**: You'll have a fully configured DGX Spark system capable of:
- Running local code assistance through Ollama.
- Serving models remotely for Continue and VS Code integration.
- Hosting large LLMs like GPT-OSS 120B using unified memory.

### Prerequisites

- DGX Spark (128GB unified memory recommended)

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/vibe-coding/README.md`
<!-- GENERATED:END -->
skills/dgx-spark-vllm/SKILL.md (new file, +58 lines)
---
name: dgx-spark-vllm
description: Install and run vLLM for high-throughput LLM inference on NVIDIA DGX Spark, including multi-Spark serving for very large models (e.g., Llama 405B across two Sparks). Use when a user needs an OpenAI-compatible API, higher throughput than Ollama, or wants to run models too large for a single Spark. Significantly more complex setup than Ollama — ensure the user actually needs what vLLM offers before recommending.
---

<!-- GENERATED:BEGIN from nvidia/vllm/README.md -->
# vLLM for Inference

> Install and use vLLM on DGX Spark

vLLM is an inference engine designed to run large language models efficiently. The key idea is **maximizing throughput and minimizing memory waste** when serving LLMs.

- It uses a memory-efficient attention algorithm called **PagedAttention** to handle long sequences without running out of GPU memory.
- New requests can be added to a batch already in progress through **continuous batching** to keep GPUs fully utilized.
- It has an **OpenAI-compatible API**, so applications built for the OpenAI API can switch to a vLLM backend with little or no modification.
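Because the API is OpenAI-compatible, any HTTP client (or the standard OpenAI SDK pointed at the server's base URL) can talk to a running vLLM instance. A minimal sketch of the request shape; the model name and localhost URL here are placeholder assumptions, so substitute whatever `vllm serve` was actually started with:

```python
import json


def chat_request(prompt: str,
                 model: str = "meta-llama/Llama-3.1-8B-Instruct",
                 base_url: str = "http://localhost:8000/v1"):
    """Build an OpenAI-style chat-completions request for a vLLM server.

    Returns (url, json_body). POST the body to the URL with
    Content-Type: application/json, e.g. via urllib.request or any
    OpenAI-compatible client library. Model name and base_url are
    illustrative placeholders, not values from the playbook.
    """
    url = f"{base_url}/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return url, json.dumps(body)
```

An application already written against the OpenAI API only needs its base URL (and model name) changed to use this endpoint.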

**Outcome**: You'll set up vLLM for high-throughput LLM serving on DGX Spark with the Blackwell architecture, either using a pre-built Docker container or building from source with custom LLVM/Triton support for ARM64.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/vllm/README.md`
<!-- GENERATED:END -->

## When to use this skill
- User's current runtime (usually Ollama) can't handle their throughput requirements
- User wants an OpenAI-compatible API to plug applications into
- User wants to run a model too large for one Spark (vLLM supports tensor parallelism across 2+ Sparks)
- User specifically asked for vLLM

## When NOT to use this skill
- User is just exploring — `dgx-spark-ollama` is far simpler
- User needs single-user chat — Ollama + Open WebUI covers that case
- User needs absolute lowest latency with pre-compiled models — that's `dgx-spark-trt-llm` territory

## Key decisions
- **Docker container or build from source?** — The pre-built container is the recommended path. A source build is only needed if the user has a specific reason (custom patches, or a bleeding-edge vLLM version not yet in the container).
- **Single-Spark or multi-Spark?** — Multi-Spark adds major complexity: networking (`dgx-spark-connect-two-sparks` or `dgx-spark-multi-sparks-through-switch`) plus NCCL (`dgx-spark-nccl`) must be working first. Only pursue this for 120B+ parameter models that don't fit on one Spark.
- **Model + quantization** — the playbook's support matrix lists specific NVFP4/FP8/MXFP4 combinations. Don't assume any HF model works — check the matrix.

## Prerequisites (hard requirements)
- CUDA 13.0 toolkit installed (`nvcc --version`)
- Docker + NVIDIA Container Toolkit configured
- Python 3.12 available
- `dgx-spark-connect-to-your-spark` for remote access

## Non-obvious gotchas
- This is ARM64 + Blackwell. PyPI wheels built for x86_64 CUDA 12.x **will not work** — the playbook's container has ARM64-specific LLVM/Triton patches.
- vLLM's default GPU memory utilization is high (~0.9). On a Spark that's also running other workloads, drop it to 0.7–0.8 (`--gpu-memory-utilization 0.75`) or the container will OOM.
- Multi-Spark serving is sensitive to NCCL configuration and link quality — a single flaky cable will destroy throughput. Validate `dgx-spark-nccl` first before assuming vLLM is the problem.

## Related skills
- **Prerequisite**: `dgx-spark-connect-to-your-spark`
- **Simpler alternative**: `dgx-spark-ollama` — recommend this first unless the user needs vLLM's specific capabilities
- **Alternative for max perf**: `dgx-spark-trt-llm` — TensorRT-LLM with compiled engines. Different use case (lowest latency, more setup cost), not strictly an upgrade path
- **Multi-Spark composition**:
  - `dgx-spark-connect-two-sparks` or `dgx-spark-multi-sparks-through-switch` (physical link)
  - `dgx-spark-nccl` (collective comms)
- **Pairs with**: `dgx-spark-dgx-dashboard` for GPU monitoring during serving
skills/dgx-spark-vscode/SKILL.md (new file, +20 lines)
---
name: dgx-spark-vscode
description: Install and use VS Code locally or remotely — on NVIDIA DGX Spark. Use when setting up vscode on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/vscode/README.md -->
# VS Code

> Install and use VS Code locally or remotely

This walkthrough will help you set up Visual Studio Code, a full-featured IDE with extensions, an integrated terminal, and Git integration, while leveraging your DGX Spark device for development and testing. There are two approaches to using VS Code:

* **Direct Installation**: Install the VS Code development environment directly on your ARM64-based Spark system for local development on the target hardware without remote-development overhead.

* **Access with NVIDIA Sync**: Set up NVIDIA Sync to remotely connect to Spark over SSH and configure VS Code as one of your development tools.

**Outcome**: You will have VS Code set up for development on your DGX Spark device with access to the system's ARM64 architecture and GPU resources. This setup enables direct code development, debugging, and execution.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/vscode/README.md`
<!-- GENERATED:END -->
skills/dgx-spark-vss/SKILL.md (new file, +16 lines)
---
name: dgx-spark-vss
description: Run the VSS Blueprint on your Spark — on NVIDIA DGX Spark. Use when setting up vss on Spark hardware.
---

<!-- GENERATED:BEGIN from nvidia/vss/README.md -->
# Build a Video Search and Summarization (VSS) Agent

> Run the VSS Blueprint on your Spark

Deploy NVIDIA's Video Search and Summarization (VSS) AI Blueprint to build intelligent video analytics systems that combine vision language models, large language models, and retrieval-augmented generation. The system transforms raw video content into real-time actionable insights with video summarization, Q&A, and real-time alerts. You'll set up either a completely local Event Reviewer deployment or a hybrid deployment using remote model endpoints.

**Outcome**: You will deploy NVIDIA's VSS AI Blueprint on NVIDIA Spark hardware with Blackwell architecture, choosing between two deployment scenarios: VSS Event Reviewer (completely local with VLM pipeline) or Standard VSS (hybrid deployment with remote LLM/embedding endpoints). This includes setting up Alert Bridge, VLM Pipeline, Alert Inspector UI, Video Storage Toolkit, and optional DeepStream CV pipeline for automated video analysis and event review.

**Full playbook**: `/Users/jkneen/Documents/GitHub/dgx-spark-playbooks/nvidia/vss/README.md`
<!-- GENERATED:END -->
skills/dgx-spark/SKILL.md (new file, +121 lines)
---
name: dgx-spark
description: Catalog and router for NVIDIA DGX Spark playbooks — use when a user asks about setting up their DGX Spark, wants an overview of what they can run on Spark hardware, or needs help choosing between inference runtimes, fine-tuning frameworks, or networking setups. Lists all available dgx-spark-* skills and encodes the relationships between them (prerequisites, alternatives, composes-with, upgrade paths).
---

# DGX Spark Playbooks — Index

Use this catalog to route the user to the right specific `dgx-spark-*` skill. Each entry below names a leaf skill; invoke it when the user's intent matches.
## Categories

### Inference runtimes (serve models)
- `dgx-spark-ollama` — easiest, good default for most users
- `dgx-spark-vllm` — higher throughput, production-grade serving
- `dgx-spark-trt-llm` — maximum Blackwell performance, most complex setup
- `dgx-spark-sglang` — structured generation, batched inference
- `dgx-spark-llama-cpp` — lightweight, CPU/GPU flexibility
- `dgx-spark-lm-studio` — GUI-based model management
- `dgx-spark-nim-llm` — NVIDIA NIM microservices

### Chat & UI
- `dgx-spark-open-webui` — web chat UI, pairs with Ollama
- `dgx-spark-live-vlm-webui` — vision-language model interface
- `dgx-spark-dgx-dashboard` — GPU/system monitoring

### Fine-tuning
- `dgx-spark-pytorch-fine-tune` — baseline PyTorch fine-tuning
- `dgx-spark-nemo-fine-tune` — NVIDIA NeMo framework
- `dgx-spark-unsloth` — memory-efficient fine-tuning
- `dgx-spark-llama-factory` — multi-model fine-tuning framework
- `dgx-spark-flux-finetuning` — FLUX.1 Dreambooth LoRA (image models)

### Networking & multi-Spark
- `dgx-spark-connect-to-your-spark` — **foundational: local network access setup**
- `dgx-spark-tailscale` — VPN-based remote access
- `dgx-spark-connect-two-sparks` — link two Sparks
- `dgx-spark-connect-three-sparks` — ring topology
- `dgx-spark-multi-sparks-through-switch` — switched multi-Spark
- `dgx-spark-nccl` — collective communication across Sparks

### Dev environments & tooling
- `dgx-spark-vscode` — VS Code remote setup
- `dgx-spark-vibe-coding` — agentic coding in VS Code
- `dgx-spark-rag-ai-workbench` — RAG app in AI Workbench
- `dgx-spark-openshell` — secure long-running agents
- `dgx-spark-openclaw` — (advanced agent setup)
- `dgx-spark-nemoclaw` — Nemotron + Telegram agent

### Specialized workloads
- `dgx-spark-comfy-ui` — image generation UI
- `dgx-spark-isaac` — Isaac Sim / Isaac Lab (robotics)
- `dgx-spark-jax` — JAX on Spark
- `dgx-spark-cuda-x-data-science` — RAPIDS / data science
- `dgx-spark-multi-agent-chatbot` — multi-agent deployment
- `dgx-spark-multi-modal-inference` — multi-modal models
- `dgx-spark-nemotron` — Nemotron-3-Nano with llama.cpp
- `dgx-spark-nvfp4-quantization` — FP4 quantization workflows
- `dgx-spark-portfolio-optimization` — finance example
- `dgx-spark-single-cell` — single-cell RNA sequencing
- `dgx-spark-speculative-decoding` — speculative decoding inference
- `dgx-spark-spark-reachy-photo-booth` — Reachy robot demo
- `dgx-spark-txt2kg` — text-to-knowledge-graph
- `dgx-spark-vss` — video search & summarization agent
## Relationship graph

Edge types: `→prereq→` (must do first), `→pairs→` (composes naturally), `→alt→` (pick one, roughly equivalent choice), `→upgrade→` (next step when outgrowing this), `→composes→` (used together as one stack), `→related→` (adjacent but distinct).

### Networking (almost everything depends on this)
- `connect-to-your-spark` →prereq→ **all remote-access playbooks**
- `tailscale` →alt→ `connect-to-your-spark`
- `connect-two-sparks` →prereq→ `nccl`
- `connect-two-sparks` →upgrade→ `connect-three-sparks` →upgrade→ `multi-sparks-through-switch`

### Inference stack
- `ollama` →pairs→ `open-webui` *(most common pairing — chat UI on top of Ollama)*
- `ollama` →alt→ `lm-studio` *(GUI vs CLI; roughly equivalent for local single-user use)*
- `ollama` →alt→ `llama-cpp` *(lower-level control)*
- `ollama` →upgrade→ `vllm` *(when throughput / OpenAI-compatible API matters)*
- `vllm` →alt→ `trt-llm` *(different use case — trt-llm for lowest latency with compiled engines; not strictly an upgrade)*
- `vllm` →composes→ `connect-two-sparks` + `nccl` *(multi-Spark serving for very large models)*
- `nemotron` →pairs→ `llama-cpp` *(playbook specifically uses the llama.cpp runtime)*
- `nim-llm` →alt→ `vllm` *(NIM microservices vs. raw vLLM serving)*

### Fine-tuning pipelines
- `pytorch-fine-tune` →prereq→ `flux-finetuning` *(need a baseline PyTorch setup first)*
- `nemo-fine-tune` →pairs→ `nim-llm` *(deploy tuned NeMo models via NIM)*
- `unsloth` →related→ `llama-factory` *(related but different specialties — unsloth for memory-efficient LoRA/QLoRA; llama-factory is a broader multi-technique framework)*

### Monitoring & observability
- `dgx-dashboard` →pairs→ **all inference/fine-tuning skills** *(GPU/system monitoring during workloads)*

### Performance tuning (compose with inference)
- `speculative-decoding` →composes→ `vllm`, `trt-llm` *(inference acceleration technique)*
- `nvfp4-quantization` →composes→ `vllm`, `trt-llm` *(quantize first, then serve)*

### Agent & automation stacks
- `nemoclaw` →pairs→ `nemotron` *(nemoclaw uses Nemotron internally)*
- `openclaw` →pairs→ `openshell` *(agent security pattern)*

### Dev env dependencies
- `vscode` →prereq→ `vibe-coding` *(vibe-coding builds on the VS Code remote setup)*
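A relationship graph like the one above can be encoded as a plain edge list so an agent can look up prerequisites mechanically. A sketch with only a few illustrative edges (not the full graph):

```python
# Edge list mirroring a few of the edges described above.
# Format: (source_skill, edge_kind, destination_skill).
EDGES = [
    ("connect-to-your-spark", "prereq", "ollama"),
    ("connect-to-your-spark", "prereq", "vllm"),
    ("connect-two-sparks", "prereq", "nccl"),
    ("ollama", "upgrade", "vllm"),
    ("ollama", "pairs", "open-webui"),
]


def prereqs_of(skill: str, edges=EDGES):
    """Return the direct prerequisites of a skill.

    Follows 'prereq' edges backwards: every edge (src, 'prereq', skill)
    means src must be completed before skill.
    """
    return sorted(src for src, kind, dst in edges
                  if kind == "prereq" and dst == skill)
```

For example, `prereqs_of("nccl")` returns `["connect-two-sparks"]`, matching the networking edges above. The same structure extends naturally to `alt`, `pairs`, and `upgrade` lookups.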

## Suggestion rules

When the user's request is broad, narrow it with these questions before invoking a leaf:

| User says... | Ask / suggest |
|---|---|
| "chat with a model on Spark" | Default: `dgx-spark-ollama` + `dgx-spark-open-webui`. Ask: CLI-only or web UI? |
| "fastest inference" | `dgx-spark-trt-llm`, but warn it's the most complex. Ask if `vllm` would suffice. |
| "train" / "fine-tune a model" | Ask: from scratch (`nemo-fine-tune`) or adapt existing (`unsloth`, `llama-factory`)? Image model? → `flux-finetuning`. |
| "connect to my Spark" / "remote access" | `dgx-spark-connect-to-your-spark` first. Suggest `tailscale` as an alternative for VPN use. |
| "multiple Sparks" | Ask: 2 (`connect-two-sparks`), 3 (`connect-three-sparks`), or more via switch? NCCL after the physical link. |
| "I just got my Spark, what can I do" | List the categories above. Suggest starting with `connect-to-your-spark` → `ollama`. |

## Curation notes

Edges above are a working starting point. Revise as real usage reveals which pairings matter most. In particular:
- `vllm →alt→ trt-llm` is inferred from the READMEs' positioning — confirm with users whether they see these as alternatives or a progression
- `speculative-decoding` and `nvfp4-quantization` compose with serving runtimes, but the exact integration path may vary — check whether users typically apply them standalone or as part of a vllm/trt-llm setup