mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git synced 2026-04-22 18:13:52 +00:00

github-actions[bot] 4d0d20d39f chore: regenerate skills/ from upstream playbooks [skip ci]

2026-04-19 09:25:00 +00:00

1.6 KiB

Raw Blame History

name	description
dgx-spark-llama-cpp	Build llama.cpp with CUDA and serve models via an OpenAI-compatible API (Gemma 4 31B IT as example) — on NVIDIA DGX Spark. Use when setting up llama-cpp on Spark hardware.

Run models with llama.cpp on DGX Spark

Build llama.cpp with CUDA and serve models via an OpenAI-compatible API (Gemma 4 31B IT as example)

llama.cpp is a lightweight C/C++ inference stack for large language models. You build it with CUDA so tensor work runs on the DGX Spark GB10 GPU, then load GGUF weights and expose chat through llama-server’s OpenAI-compatible HTTP API.

This playbook walks through that stack end to end. As the model example, it uses Gemma 4 31B IT - a frontier reasoning model built by Google DeepMind that llama.cpp supports, with strengths in coding, agentic workflows, and fine-tuning. The instructions download its F16 GGUF from Hugging Face. The same build and server steps apply to other GGUFs (including other sizes in the support matrix below).

Outcome: You will build llama.cpp with CUDA for GB10, download a Gemma 4 31B IT model checkpoint, and run llama-server with GPU offload. You get:

Local inference through llama.cpp (no separate Python inference framework required)
An OpenAI-compatible /v1/chat/completions endpoint for tools and apps
A concrete validation that Gemma 4 31B IT runs on this stack on DGX Spark

Full playbook: /home/runner/work/dgx-spark-playbooks/dgx-spark-playbooks/nvidia/llama-cpp/README.md

1.6 KiB Raw Blame History Unescape Escape

Run models with llama.cpp on DGX Spark

1.6 KiB

Raw Blame History