dgx-spark-playbooks/overrides/vllm.md
Jason Kneen a680d0472b feat: scaffold skills plugin from DGX Spark playbooks
Adds a Claude Code plugin structure that exposes each NVIDIA DGX Spark
playbook as a triggerable skill, with an index skill ('dgx-spark') that
routes users to the right leaf based on intent and encodes the
relationship graph between playbooks (prerequisites, alternatives,
composes-with, upgrade paths).

Structure:
- overrides/*.md       hand-curated frontmatter + Related sections
- scripts/generate.mjs zero-dep Node generator: nvidia + overrides → skills
- scripts/install.sh   symlinks skills into ~/.claude/skills (--plugin mode available)
- skills/              committed, browsable, installable without Node
- .github/workflows/   auto-regenerates skills/ when playbooks/overrides change

Initial curated leaves: ollama, open-webui, vllm, connect-to-your-spark.
Remaining 37 leaves use generator fallback (title + tagline + summary
extracted from README) and can be curated incrementally via overrides/.
2026-04-19 10:22:08 +01:00

40 lines
3.0 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
description: Install and run vLLM for high-throughput LLM inference on NVIDIA DGX Spark, including multi-Spark serving for very large models (e.g., Llama 405B across two Sparks). Use when a user needs an OpenAI-compatible API, higher throughput than Ollama, or wants to run models too large for a single Spark. Significantly more complex setup than Ollama — ensure user actually needs what vLLM offers before recommending.
---
## When to use this skill
- User's current runtime (usually Ollama) can't handle their throughput requirements
- User wants an OpenAI-compatible API to plug applications into
- User wants to run a model too large for one Spark (vLLM supports tensor-parallel across 2+ Sparks)
- User specifically asked for vLLM
## When NOT to use this skill
- User is just exploring — `dgx-spark-ollama` is far simpler
- User needs single-user chat — Ollama + Open WebUI covers that case
- User needs absolute lowest latency with pre-compiled models — that's `dgx-spark-trt-llm` territory
## Key decisions
- **Docker container or build from source?** — Pre-built container is the recommended path. Source build is only needed if the user has a specific reason (custom patches, bleeding-edge vLLM version not yet in the container).
- **Single-Spark or multi-Spark?** — Multi-Spark adds major complexity: networking (`dgx-spark-connect-two-sparks` or `dgx-spark-multi-sparks-through-switch`) + NCCL (`dgx-spark-nccl`) must be working first. Only pursue for 120B+ param models that don't fit on one Spark.
- **Model + quantization** — the playbook's support matrix lists specific NVFP4/FP8/MXFP4 combinations. Don't assume any HF model works — check the matrix.
## Prerequisites (hard requirements)
- CUDA 13.0 toolkit installed (`nvcc --version`)
- Docker + NVIDIA Container Toolkit configured
- Python 3.12 available
- `dgx-spark-connect-to-your-spark` for remote access
## Non-obvious gotchas
- This is ARM64 + Blackwell. PyPI wheels built for x86_64 CUDA 12.x **will not work** — the playbook's container has ARM64-specific LLVM/Triton patches.
- vLLM's default GPU memory utilization is high (~0.9). On a Spark that's also running other workloads, drop to 0.70.8 or the container will OOM.
- Multi-Spark serving is sensitive to NCCL configuration and link quality — a single flaky cable will destroy throughput. Validate `dgx-spark-nccl` first before assuming vLLM is the problem.
## Related skills
- **Prerequisite**: `dgx-spark-connect-to-your-spark`
- **Simpler alternative**: `dgx-spark-ollama` — recommend this first unless the user needs vLLM's specific capabilities
- **Alternative for max perf**: `dgx-spark-trt-llm` — TensorRT-LLM with compiled engines. Different use case (lowest latency, more setup cost), not strictly an upgrade path
- **Multi-Spark composition**:
- `dgx-spark-connect-two-sparks` or `dgx-spark-multi-sparks-through-switch` (physical link)
- `dgx-spark-nccl` (collective comms)
- **Pairs with**: `dgx-spark-dgx-dashboard` for GPU monitoring during serving