sarman/dgx-spark-playbooks

Fork 0

mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git synced 2026-04-23 18:33:54 +00:00

github-actions[bot] 4d0d20d39f chore: regenerate skills/ from upstream playbooks [skip ci]

2026-04-19 09:25:00 +00:00

4.0 KiB

Raw Blame History

name	description
dgx-spark-vllm	Install and run vLLM for high-throughput LLM inference on NVIDIA DGX Spark, including multi-Spark serving for very large models (e.g., Llama 405B across two Sparks). Use when a user needs an OpenAI-compatible API, higher throughput than Ollama, or wants to run models too large for a single Spark. Significantly more complex setup than Ollama — ensure user actually needs what vLLM offers before recommending.

name

description

dgx-spark-vllm

Install and run vLLM for high-throughput LLM inference on NVIDIA DGX Spark, including multi-Spark serving for very large models (e.g., Llama 405B across two Sparks). Use when a user needs an OpenAI-compatible API, higher throughput than Ollama, or wants to run models too large for a single Spark. Significantly more complex setup than Ollama — ensure user actually needs what vLLM offers before recommending.

vLLM for Inference

Install and use vLLM on DGX Spark

vLLM is an inference engine designed to run large language models efficiently. The key idea is maximizing throughput and minimizing memory waste when serving LLMs.

It uses a memory-efficient attention algoritm called PagedAttention to handle long sequences without running out of GPU memory.
New requests can be added to a batch already in process through continuous batching to keep GPUs fully utilized.
It has an OpenAI-compatible API so applications built for the OpenAI API can switch to a vLLM backend with little or no modification.

Outcome: You'll set up vLLM high-throughput LLM serving on DGX Spark with Blackwell architecture, either using a pre-built Docker container or building from source with custom LLVM/Triton support for ARM64.

Full playbook: /home/runner/work/dgx-spark-playbooks/dgx-spark-playbooks/nvidia/vllm/README.md

When to use this skill

User's current runtime (usually Ollama) can't handle their throughput requirements
User wants an OpenAI-compatible API to plug applications into
User wants to run a model too large for one Spark (vLLM supports tensor-parallel across 2+ Sparks)
User specifically asked for vLLM

When NOT to use this skill

User is just exploring — dgx-spark-ollama is far simpler
User needs single-user chat — Ollama + Open WebUI covers that case
User needs absolute lowest latency with pre-compiled models — that's dgx-spark-trt-llm territory

Key decisions

Docker container or build from source? — Pre-built container is the recommended path. Source build is only needed if the user has a specific reason (custom patches, bleeding-edge vLLM version not yet in the container).
Single-Spark or multi-Spark? — Multi-Spark adds major complexity: networking (dgx-spark-connect-two-sparks or dgx-spark-multi-sparks-through-switch) + NCCL (dgx-spark-nccl) must be working first. Only pursue for 120B+ param models that don't fit on one Spark.
Model + quantization — the playbook's support matrix lists specific NVFP4/FP8/MXFP4 combinations. Don't assume any HF model works — check the matrix.

Prerequisites (hard requirements)

CUDA 13.0 toolkit installed (nvcc --version)
Docker + NVIDIA Container Toolkit configured
Python 3.12 available
dgx-spark-connect-to-your-spark for remote access

Non-obvious gotchas

This is ARM64 + Blackwell. PyPI wheels built for x86_64 CUDA 12.x will not work — the playbook's container has ARM64-specific LLVM/Triton patches.
vLLM's default GPU memory utilization is high (~0.9). On a Spark that's also running other workloads, drop to 0.7–0.8 or the container will OOM.
Multi-Spark serving is sensitive to NCCL configuration and link quality — a single flaky cable will destroy throughput. Validate dgx-spark-nccl first before assuming vLLM is the problem.

Prerequisite: dgx-spark-connect-to-your-spark
Simpler alternative: dgx-spark-ollama — recommend this first unless the user needs vLLM's specific capabilities
Alternative for max perf: dgx-spark-trt-llm — TensorRT-LLM with compiled engines. Different use case (lowest latency, more setup cost), not strictly an upgrade path
Multi-Spark composition:
- dgx-spark-connect-two-sparks or dgx-spark-multi-sparks-through-switch (physical link)
- dgx-spark-nccl (collective comms)
Pairs with: dgx-spark-dgx-dashboard for GPU monitoring during serving

4.0 KiB Raw Blame History Unescape Escape