dgx-spark-playbooks/skills/dgx-spark-vllm/SKILL.md

---
name: dgx-spark-vllm
description: Install and run vLLM for high-throughput LLM inference on NVIDIA DGX Spark, including multi-Spark serving for very large models (e.g., Llama 405B across two Sparks). Use when a user needs an OpenAI-compatible API, higher throughput than Ollama, or wants to run models too large for a single Spark. Significantly more complex setup than Ollama — ensure user actually needs what vLLM offers before recommending.
---

<!-- GENERATED:BEGIN from nvidia/vllm/README.md -->
# vLLM for Inference

> Install and use vLLM on DGX Spark

vLLM is an inference engine designed to run large language models efficiently. The key idea is **maximizing throughput and minimizing memory waste** when serving LLMs.

- It uses a memory-efficient attention algoritm called **PagedAttention** to handle long sequences without running out of GPU memory.
- New requests can be added to a batch already in process through **continuous batching** to keep GPUs fully utilized.
- It has an **OpenAI-compatible API** so applications built for the OpenAI API can switch to a vLLM backend with little or no modification.

**Outcome**: You'll set up vLLM high-throughput LLM serving on DGX Spark with Blackwell architecture,
either using a pre-built Docker container or building from source with custom LLVM/Triton
support for ARM64.

**Full playbook**: `/home/runner/work/dgx-spark-playbooks/dgx-spark-playbooks/nvidia/vllm/README.md`
<!-- GENERATED:END -->

## When to use this skill
- User's current runtime (usually Ollama) can't handle their throughput requirements
- User wants an OpenAI-compatible API to plug applications into
- User wants to run a model too large for one Spark (vLLM supports tensor-parallel across 2+ Sparks)
- User specifically asked for vLLM

## When NOT to use this skill
- User is just exploring — `dgx-spark-ollama` is far simpler
- User needs single-user chat — Ollama + Open WebUI covers that case
- User needs absolute lowest latency with pre-compiled models — that's `dgx-spark-trt-llm` territory

## Key decisions
- **Docker container or build from source?** — Pre-built container is the recommended path. Source build is only needed if the user has a specific reason (custom patches, bleeding-edge vLLM version not yet in the container).
- **Single-Spark or multi-Spark?** — Multi-Spark adds major complexity: networking (`dgx-spark-connect-two-sparks` or `dgx-spark-multi-sparks-through-switch`) + NCCL (`dgx-spark-nccl`) must be working first. Only pursue for 120B+ param models that don't fit on one Spark.
- **Model + quantization** — the playbook's support matrix lists specific NVFP4/FP8/MXFP4 combinations. Don't assume any HF model works — check the matrix.

## Prerequisites (hard requirements)
- CUDA 13.0 toolkit installed (`nvcc --version`)
- Docker + NVIDIA Container Toolkit configured
- Python 3.12 available
- `dgx-spark-connect-to-your-spark` for remote access

## Non-obvious gotchas
- This is ARM64 + Blackwell. PyPI wheels built for x86_64 CUDA 12.x **will not work** — the playbook's container has ARM64-specific LLVM/Triton patches.
- vLLM's default GPU memory utilization is high (~0.9). On a Spark that's also running other workloads, drop to 0.7–0.8 or the container will OOM.
- Multi-Spark serving is sensitive to NCCL configuration and link quality — a single flaky cable will destroy throughput. Validate `dgx-spark-nccl` first before assuming vLLM is the problem.

## Related skills
- **Prerequisite**: `dgx-spark-connect-to-your-spark`
- **Simpler alternative**: `dgx-spark-ollama` — recommend this first unless the user needs vLLM's specific capabilities
- **Alternative for max perf**: `dgx-spark-trt-llm` — TensorRT-LLM with compiled engines. Different use case (lowest latency, more setup cost), not strictly an upgrade path
- **Multi-Spark composition**:
  - `dgx-spark-connect-two-sparks` or `dgx-spark-multi-sparks-through-switch` (physical link)
  - `dgx-spark-nccl` (collective comms)
- **Pairs with**: `dgx-spark-dgx-dashboard` for GPU monitoring during serving
-												feat: scaffold skills plugin from DGX Spark playbooks

Adds a Claude Code plugin structure that exposes each NVIDIA DGX Spark
playbook as a triggerable skill, with an index skill ('dgx-spark') that
routes users to the right leaf based on intent and encodes the
relationship graph between playbooks (prerequisites, alternatives,
composes-with, upgrade paths).

Structure:
- overrides/*.md       hand-curated frontmatter + Related sections
- scripts/generate.mjs zero-dep Node generator: nvidia + overrides → skills
- scripts/install.sh   symlinks skills into ~/.claude/skills (--plugin mode available)
- skills/              committed, browsable, installable without Node
- .github/workflows/   auto-regenerates skills/ when playbooks/overrides change

Initial curated leaves: ollama, open-webui, vllm, connect-to-your-spark.
Remaining 37 leaves use generator fallback (title + tagline + summary
extracted from README) and can be curated incrementally via overrides/.

											
										
										
											2026-04-19 09:22:08 +00:00
+								---
 								name: dgx-spark-vllm
 								description: Install and run vLLM for high-throughput LLM inference on NVIDIA DGX Spark, including multi-Spark serving for very large models (e.g., Llama 405B across two Sparks). Use when a user needs an OpenAI-compatible API, higher throughput than Ollama, or wants to run models too large for a single Spark. Significantly more complex setup than Ollama — ensure user actually needs what vLLM offers before recommending.
 								---
 								<!-- GENERATED:BEGIN from nvidia/vllm/README.md -->
 								# vLLM for Inference
 								> Install and use vLLM on DGX Spark
 								vLLM is an inference engine designed to run large language models efficiently. The key idea is **maximizing throughput and minimizing memory waste** when serving LLMs.
 								- It uses a memory-efficient attention algoritm called **PagedAttention** to handle long sequences without running out of GPU memory.
 								- New requests can be added to a batch already in process through **continuous batching** to keep GPUs fully utilized.
 								- It has an **OpenAI-compatible API** so applications built for the OpenAI API can switch to a vLLM backend with little or no modification.
 								**Outcome**: You'll set up vLLM high-throughput LLM serving on DGX Spark with Blackwell architecture,
 								either using a pre-built Docker container or building from source with custom LLVM/Triton
 								support for ARM64.
-												chore: regenerate skills/ from upstream playbooks [skip ci]

											
										
										
											2026-04-19 09:25:00 +00:00
+								**Full playbook**: `/home/runner/work/dgx-spark-playbooks/dgx-spark-playbooks/nvidia/vllm/README.md`
-												feat: scaffold skills plugin from DGX Spark playbooks

Adds a Claude Code plugin structure that exposes each NVIDIA DGX Spark
playbook as a triggerable skill, with an index skill ('dgx-spark') that
routes users to the right leaf based on intent and encodes the
relationship graph between playbooks (prerequisites, alternatives,
composes-with, upgrade paths).

Structure:
- overrides/*.md       hand-curated frontmatter + Related sections
- scripts/generate.mjs zero-dep Node generator: nvidia + overrides → skills
- scripts/install.sh   symlinks skills into ~/.claude/skills (--plugin mode available)
- skills/              committed, browsable, installable without Node
- .github/workflows/   auto-regenerates skills/ when playbooks/overrides change

Initial curated leaves: ollama, open-webui, vllm, connect-to-your-spark.
Remaining 37 leaves use generator fallback (title + tagline + summary
extracted from README) and can be curated incrementally via overrides/.

											
										
										
											2026-04-19 09:22:08 +00:00
+								<!-- GENERATED:END -->
 								## When to use this skill
 								- User's current runtime (usually Ollama) can't handle their throughput requirements
 								- User wants an OpenAI-compatible API to plug applications into
 								- User wants to run a model too large for one Spark (vLLM supports tensor-parallel across 2+ Sparks)
 								- User specifically asked for vLLM
 								## When NOT to use this skill
 								- User is just exploring — `dgx-spark-ollama` is far simpler
 								- User needs single-user chat — Ollama + Open WebUI covers that case
 								- User needs absolute lowest latency with pre-compiled models — that's `dgx-spark-trt-llm` territory
 								## Key decisions
 								- **Docker container or build from source?** — Pre-built container is the recommended path. Source build is only needed if the user has a specific reason (custom patches, bleeding-edge vLLM version not yet in the container).
 								- **Single-Spark or multi-Spark?** — Multi-Spark adds major complexity: networking (`dgx-spark-connect-two-sparks` or `dgx-spark-multi-sparks-through-switch`) + NCCL (`dgx-spark-nccl`) must be working first. Only pursue for 120B+ param models that don't fit on one Spark.
 								- **Model + quantization** — the playbook's support matrix lists specific NVFP4/FP8/MXFP4 combinations. Don't assume any HF model works — check the matrix.
 								## Prerequisites (hard requirements)
 								- CUDA 13.0 toolkit installed (`nvcc --version`)
 								- Docker + NVIDIA Container Toolkit configured
 								- Python 3.12 available
 								- `dgx-spark-connect-to-your-spark` for remote access
 								## Non-obvious gotchas
 								- This is ARM64 + Blackwell. PyPI wheels built for x86_64 CUDA 12.x **will not work** — the playbook's container has ARM64-specific LLVM/Triton patches.
 								- vLLM's default GPU memory utilization is high (~0.9). On a Spark that's also running other workloads, drop to 0.7–0.8 or the container will OOM.
 								- Multi-Spark serving is sensitive to NCCL configuration and link quality — a single flaky cable will destroy throughput. Validate `dgx-spark-nccl` first before assuming vLLM is the problem.
 								## Related skills
 								- **Prerequisite**: `dgx-spark-connect-to-your-spark`
 								- **Simpler alternative**: `dgx-spark-ollama` — recommend this first unless the user needs vLLM's specific capabilities
 								- **Alternative for max perf**: `dgx-spark-trt-llm` — TensorRT-LLM with compiled engines. Different use case (lowest latency, more setup cost), not strictly an upgrade path
 								- **Multi-Spark composition**:
 								  - `dgx-spark-connect-two-sparks` or `dgx-spark-multi-sparks-through-switch` (physical link)
 								  - `dgx-spark-nccl` (collective comms)
 								- **Pairs with**: `dgx-spark-dgx-dashboard` for GPU monitoring during serving