dgx-spark-playbooks/vllm.md at a680d0472b85d9766fc00025e5dae308d27d35da

mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git synced 2026-04-23 02:23:53 +00:00

Jason Kneen a680d0472b feat: scaffold skills plugin from DGX Spark playbooks

Adds a Claude Code plugin structure that exposes each NVIDIA DGX Spark
playbook as a triggerable skill, with an index skill ('dgx-spark') that
routes users to the right leaf based on intent and encodes the
relationship graph between playbooks (prerequisites, alternatives,
composes-with, upgrade paths).

Structure:
- overrides/*.md       hand-curated frontmatter + Related sections
- scripts/generate.mjs zero-dep Node generator: nvidia + overrides → skills
- scripts/install.sh   symlinks skills into ~/.claude/skills (--plugin mode available)
- skills/              committed, browsable, installable without Node
- .github/workflows/   auto-regenerates skills/ when playbooks/overrides change

Initial curated leaves: ollama, open-webui, vllm, connect-to-your-spark.
Remaining 37 leaves use generator fallback (title + tagline + summary
extracted from README) and can be curated incrementally via overrides/.

2026-04-19 10:22:08 +01:00

3.0 KiB

Raw Blame History

description
Install and run vLLM for high-throughput LLM inference on NVIDIA DGX Spark, including multi-Spark serving for very large models (e.g., Llama 405B across two Sparks). Use when a user needs an OpenAI-compatible API, higher throughput than Ollama, or wants to run models too large for a single Spark. Significantly more complex setup than Ollama — ensure user actually needs what vLLM offers before recommending.

description

Install and run vLLM for high-throughput LLM inference on NVIDIA DGX Spark, including multi-Spark serving for very large models (e.g., Llama 405B across two Sparks). Use when a user needs an OpenAI-compatible API, higher throughput than Ollama, or wants to run models too large for a single Spark. Significantly more complex setup than Ollama — ensure user actually needs what vLLM offers before recommending.

When to use this skill

User's current runtime (usually Ollama) can't handle their throughput requirements
User wants an OpenAI-compatible API to plug applications into
User wants to run a model too large for one Spark (vLLM supports tensor-parallel across 2+ Sparks)
User specifically asked for vLLM

When NOT to use this skill

User is just exploring — dgx-spark-ollama is far simpler
User needs single-user chat — Ollama + Open WebUI covers that case
User needs absolute lowest latency with pre-compiled models — that's dgx-spark-trt-llm territory

Key decisions

Docker container or build from source? — Pre-built container is the recommended path. Source build is only needed if the user has a specific reason (custom patches, bleeding-edge vLLM version not yet in the container).
Single-Spark or multi-Spark? — Multi-Spark adds major complexity: networking (dgx-spark-connect-two-sparks or dgx-spark-multi-sparks-through-switch) + NCCL (dgx-spark-nccl) must be working first. Only pursue for 120B+ param models that don't fit on one Spark.
Model + quantization — the playbook's support matrix lists specific NVFP4/FP8/MXFP4 combinations. Don't assume any HF model works — check the matrix.

Prerequisites (hard requirements)

CUDA 13.0 toolkit installed (nvcc --version)
Docker + NVIDIA Container Toolkit configured
Python 3.12 available
dgx-spark-connect-to-your-spark for remote access

Non-obvious gotchas

This is ARM64 + Blackwell. PyPI wheels built for x86_64 CUDA 12.x will not work — the playbook's container has ARM64-specific LLVM/Triton patches.
vLLM's default GPU memory utilization is high (~0.9). On a Spark that's also running other workloads, drop to 0.7–0.8 or the container will OOM.
Multi-Spark serving is sensitive to NCCL configuration and link quality — a single flaky cable will destroy throughput. Validate dgx-spark-nccl first before assuming vLLM is the problem.

Prerequisite: dgx-spark-connect-to-your-spark
Simpler alternative: dgx-spark-ollama — recommend this first unless the user needs vLLM's specific capabilities
Alternative for max perf: dgx-spark-trt-llm — TensorRT-LLM with compiled engines. Different use case (lowest latency, more setup cost), not strictly an upgrade path
Multi-Spark composition:
- dgx-spark-connect-two-sparks or dgx-spark-multi-sparks-through-switch (physical link)
- dgx-spark-nccl (collective comms)
Pairs with: dgx-spark-dgx-dashboard for GPU monitoring during serving

3.0 KiB Raw Blame History Unescape Escape