Compare commits

..

1 Commits

Author SHA1 Message Date
Omar Obando
19ac40d9e2
Merge 48fc5eb30e into 231a45230d 2026-06-04 00:17:33 +09:00
2 changed files with 110 additions and 88 deletions

View File

@ -15,148 +15,167 @@
## Basic idea ## Basic idea
[llama.cpp](https://github.com/ggml-org/llama.cpp) is a lightweight C/C++ inference stack for large language models. You build it with CUDA so it fully utilizes the DGX Spark GB10 GPU, then load GGUF weights and expose chat through `llama-server`s OpenAI-compatible HTTP API. [llama.cpp](https://github.com/ggml-org/llama.cpp) is a lightweight C/C++ inference stack for large language models. You build it with CUDA so tensor work runs on the DGX Spark GB10 GPU, then load GGUF weights and expose chat through `llama-server`s OpenAI-compatible HTTP API.
This playbook walks through that stack end to end using MTP-enabled **Qwen3.6-35B-A3B** as the hands-on example. Checkpoint choices and paths for all supported models are summarized in the matrix below; commands are in the instructions. This playbook walks through that stack end to end using **Nemotron 3 Nano Omni** as the hands-on example: an NVIDIA MoE family that runs well from quantized GGUF on Spark. Checkpoint choices and paths for all supported models are summarized in the matrix below; commands are in the instructions.
## What you'll accomplish ## What you'll accomplish
You will build llama.cpp with CUDA for GB10, download a **Qwen3.6-35B-A3B** checkpoint, and run **`llama-server`** with GPU offload. You get: You will build llama.cpp with CUDA for GB10, download a **Nemotron 3 Nano Omni** example checkpoint, and run **`llama-server`** with GPU offload. You get:
- Local inference through llama.cpp (no separate Python inference framework required) - Local inference through llama.cpp (no separate Python inference framework required)
- An OpenAI-compatible `/v1/chat/completions` endpoint for tools and apps - An OpenAI-compatible `/v1/chat/completions` endpoint for tools and apps
- A concrete validation that the **Qwen3.6-35B-A3B** example runs on this stack on DGX Spark with MTP support. - A concrete validation that the **Nemotron 3 Nano Omni** example runs on this stack on DGX Spark
## What to know before starting ## What to know before starting
- Basic familiarity with Linux command line and terminal commands - Basic familiarity with Linux command line and terminal commands
- Understanding of git and building from source with CMake - Understanding of git and building from source with CMake
- Basic knowledge of REST APIs and cURL for testing - Basic knowledge of REST APIs and cURL for testing
- Familiarity with Hugging Face Hub for downloading GGUF files
## Prerequisites ## Prerequisites
**Hardware requirements** **Hardware requirements**
- NVIDIA DGX Spark with GB10 GPU - NVIDIA DGX Spark with GB10 GPU
- Sufficient unified memory for the model and the KV-Cache being utilized (about 30GB free RAM for the model in the example) - Sufficient unified memory for the example **Q8_0** checkpoint (weights on the order of **~35GB**, plus KV cache and runtime overhead—scale up if you pick a larger quant or longer context)
- At least **\~40GB** free disk for the example download plus build artifacts (more if you keep multiple GGUFs) - At least **~40GB** free disk for the example download plus build artifacts (more if you keep multiple GGUFs)
**Software requirements** **Software requirements**
- NVIDIA DGX OS - NVIDIA DGX OS
- Git: `git --version` - Git: `git --version`
- CMake (3.14+): `cmake --version` - CMake (3.14+): `cmake --version`
- CUDA Toolkit: `nvcc --version` - CUDA Toolkit: `nvcc --version`
- Network access to GitHub and Hugging Face - Network access to GitHub and Hugging Face
## Model support matrix ## Model support matrix
DGX Spark supports any GGUF format model checkpoint with llama.cpp, as long as the system has memory available to host and run the checkpoint. The following models are supported with llama.cpp on Spark. The instructions use the **Nemotron 3 Nano Omni** example row by default.
| Model | Support Status | HF Handle |
|-------|----------------|-----------|
| **Nemotron 3 Nano Omni** (example walkthrough) | ✅ | `ggml-org/NVIDIA-Nemotron-3-Nano-Omni` |
| **Qwen3.6-35B-A3B** | ✅ | `unsloth/Qwen3.6-35B-A3B-GGUF` |
| **Qwen3.6-27B** | ✅ | `unsloth/Qwen3.6-27B-GGUF` |
| **Gemma 4 31B IT** | ✅ | `ggml-org/gemma-4-31B-it-GGUF` |
| **Gemma 4 26B A4B IT** | ✅ | `ggml-org/gemma-4-26B-A4B-it-GGUF` |
| **Gemma 4 E4B IT** | ✅ | `ggml-org/gemma-4-E4B-it-GGUF` |
| **Gemma 4 E2B IT** | ✅ | `ggml-org/gemma-4-E2B-it-GGUF` |
| **Nemotron-3-Nano** | ✅ | `unsloth/Nemotron-3-Nano-30B-A3B-GGUF` |
## Time & risk ## Time & risk
* **Estimated time:** About 30 minutes, plus downloading the example GGUF (\~35GB order of magnitude for the default quant) * **Estimated time:** About 30 minutes, plus downloading the example GGUF (~35GB order of magnitude for the default quant)
* **Risk level:** Low — build is local to your clone; no system-wide installs required for the steps below * **Risk level:** Low — build is local to your clone; no system-wide installs required for the steps below
* **Rollback:** Remove the `llama.cpp` clone and the model directory under `~/.cache/huggingface/hub/` to reclaim disk space * **Rollback:** Remove the `llama.cpp` clone and the model directory under `~/models/` to reclaim disk space
* **Last updated:** 06/03/2026 * **Last updated:** 04/28/2026
* Walkthrough now uses Qwen3.6-35B-A3B as an example * Walkthrough now uses Nemotron Omni; other model rows stay available
## Instructions ## Instructions
## Step 1. Install the dependencies ## Step 1. Verify prerequisites
Install the required dependencies: The **example** checkpoint is **`nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf`** from Hugging Face repo **`ggml-org/NVIDIA-Nemotron-3-Nano-Omni`** (full handle: `ggml-org/NVIDIA-Nemotron-3-Nano-Omni/nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf`). Other supported GGUFs—including Qwen3.6, Gemma, and alternate Nemotron Omni builds—use the same build and server steps; change `hf download` and `--model` paths (see the [overview model matrix](overview.md)).
```shell Ensure the required tools are installed:
sudo apt install -y git clang cmake libcurl4-openssl-dev libssl-dev
```bash
git --version
cmake --version
nvcc --version
```
All commands should return version information. If any are missing, install them before continuing.
Install the Hugging Face CLI:
```bash
python3 -m venv llama-cpp-venv
source llama-cpp-venv/bin/activate
pip install -U "huggingface_hub[cli]"
```
Verify installation:
```bash
hf version
``` ```
## Step 2. Clone the llama.cpp repository ## Step 2. Clone the llama.cpp repository
Clone upstream llama.cpp—the framework you are building: Clone upstream llama.cpp—the framework you are building:
```shell ```bash
git clone https://github.com/ggml-org/llama.cpp git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp cd llama.cpp
``` ```
## Step 3. Build llama.cpp with CUDA ## Step 3. Build llama.cpp with CUDA
Configure CMake with CUDA and GB10s **sm\_121** architecture so GGMLs CUDA backend matches your GPU: Configure CMake with CUDA and GB10s **sm_121** architecture so GGMLs CUDA backend matches your GPU:
```shell ```bash
cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DGGML_CURL=ON -DGGML_RPC=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real mkdir build && cd build
cmake --build build --config Release -j cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
make -j8
``` ```
The build usually takes on the order of 510 minutes. When it finishes, binaries such as `llama-server` appear under `build/bin/`. The build usually takes on the order of 510 minutes. When it finishes, binaries such as `llama-server` appear under `build/bin/`.
## Step 4. Start llama-server with a model ## Step 4. Download example Nemotron 3 Nano Omni GGUF
llama.cpp loads models in **GGUF** format. This playbook uses the **Q4\_K\_XL** checkpoint from `unsloth/Qwen3.6-35B-A3B-MTP-GGUF`, which provides a good balance between quality and speed on DGX Spark. llama.cpp loads models in **GGUF** format. This playbook uses the **Q8_0** checkpoint from `ggml-org/NVIDIA-Nemotron-3-Nano-Omni`, which balances quality and memory on DGX Spark GB10 unified memory.
From your `llama.cpp/build` directory, launch the OpenAI-compatible server with GPU offload. It will load the model from HuggingFace first if it hasnt been downloaded before or if there are any updates. ```bash
hf download ggml-org/NVIDIA-Nemotron-3-Nano-Omni \
All models are saved in the default HuggingFace cache directory in \~/.cache/huggingface/hub. For instance, this model will be saved into \~/.cache/huggingface/hub/models--unsloth--Qwen3.6-35B-A3B-MTP-GGUF nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf \
--local-dir ~/models/NVIDIA-Nemotron-3-Nano-Omni
It will also automatically load mmproj file to enable vision capabilities if supported by the model. By default, llama-server will try to fit full model context with ability to serve 4 concurrent requests, but it will adjust parameters automatically if needed.
```shell
./bin/llama-server \
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
--host 0.0.0.0 \
--port 30000
``` ```
To run with MTP speculative decoding, provide additional parameters as shown in the example below. MTP requires a compatible model, like `unsloth/Qwen3.6-35B-A3B-MTP-GGUF` used in this example. The following example also sets “preserve\_thinking” flag that allows Qwen models to use so-called “interleaved thinking” by preserving all prior thinking blocks in the history which can be useful for agentic workflows. The file is on the order of **~35GB** (exact size may vary). The download can be resumed if interrupted.
```shell ## Step 5. Start llama-server with Nemotron 3 Nano Omni
From your `llama.cpp/build` directory, launch the OpenAI-compatible server with GPU offload:
```bash
./bin/llama-server \ ./bin/llama-server \
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \ --model ~/models/NVIDIA-Nemotron-3-Nano-Omni/nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf \
--host 0.0.0.0 \ --host 0.0.0.0 \
--port 30000 \ --port 30000 \
--chat-template-kwargs '{"preserve_thinking": true}' \ --n-gpu-layers 99 \
--spec-type draft-mtp \ --ctx-size 8192 \
--spec-draft-n-max 3 --threads 8
``` ```
**Parameters (short):** **Parameters (short):**
- `--host` / `--port`: bind address and port for the HTTP API - `--host` / `--port`: bind address and port for the HTTP API
- `--chat-template-kwargs`: sets additional params for the json template parser, must be a valid json object string - `--n-gpu-layers 99`: offload layers to the GPU (adjust if you use a different model)
- `--spec-type`: comma-separated list of types of speculative decoding to use (default: none, most MTP-compatible models will use “draft-mtp”, but you need to check the model card first) - `--ctx-size`: context length (can be increased up to model/server limits; uses more memory)
- `--spec-draft-n-max`: number of tokens to draft for speculative decoding (default: 3\) - `--threads`: CPU threads for non-GPU work
You should see log lines similar to: You should see log lines similar to:
``` ```
0.14.322.968 I srv load_model: speculative decoding context initialized llama_new_context_with_model: n_ctx = 8192
0.14.322.970 I slot load_model: id 0 | task -1 | new slot, n_ctx = 262144
0.14.322.972 I slot load_model: id 1 | task -1 | new slot, n_ctx = 262144
0.14.322.972 I slot load_model: id 2 | task -1 | new slot, n_ctx = 262144
0.14.322.973 I slot load_model: id 3 | task -1 | new slot, n_ctx = 262144
0.14.323.063 I srv load_model: prompt cache is enabled, size limit: 8192 MiB
... ...
0.14.342.935 I srv llama_server: model loaded main: server is listening on 0.0.0.0:30000
0.14.342.939 I srv llama_server: server is listening on http://0.0.0.0:30000
0.14.342.944 I srv update_slots: all slots are idle
``` ```
**Keep this terminal open** while testing. Large GGUFs can take a minute or more to load, and initial model download can take a while if the model is not downloaded yet. You will see a progress bar when model is being downloaded. **Keep this terminal open** while testing. Large GGUFs can take a minute or more to load; until you see `server is listening`, nothing accepts connections on port 30000 (see Troubleshooting if `curl` reports connection refused).
The server is only ready to accept incoming connections on port 30000 after you see `server is listening` message (see Troubleshooting if `curl` reports connection refused). ## Step 6. Test the API
## Step 5. Test the API Use a **second terminal on the same machine** that runs `llama-server` (for example another SSH session into DGX Spark). If you run `curl` on your laptop while the server runs only on Spark, use the Spark hostname or IP instead of `localhost`.
Use a **second terminal on the same machine** that runs `llama-server` (for example another SSH session into DGX Spark). If you run `curl` on your laptop while the server runs only on Spark, use the Spark hostname or IP instead of `localhost`. ```bash
```shell
curl -X POST http://127.0.0.1:30000/v1/chat/completions \ curl -X POST http://127.0.0.1:30000/v1/chat/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"model": "unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL", "model": "nemotron",
"messages": [{"role": "user", "content": "New York is a great city because..."}], "messages": [{"role": "user", "content": "New York is a great city because..."}],
"max_tokens": 100 "max_tokens": 100
}' }'
@ -179,7 +198,7 @@ Example shape of the response (fields vary by llama.cpp version; `message` may i
} }
], ],
"created": 1765916539, "created": 1765916539,
"model": "$MODEL_PATH", "model": "nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf",
"object": "chat.completion", "object": "chat.completion",
"usage": { "usage": {
"completion_tokens": 100, "completion_tokens": 100,
@ -193,35 +212,41 @@ Example shape of the response (fields vary by llama.cpp version; `message` may i
} }
``` ```
## Step 6. Longer completion (with Qwen3.6-35B-A3B) ## Step 7. Longer completion (with Nemotron 3 Nano Omni)
Try a slightly longer prompt to confirm stable generation with **Qwen3.6-35B-A3B**: Try a slightly longer prompt to confirm stable generation with **Nemotron 3 Nano Omni**:
```shell ```bash
curl -X POST http://127.0.0.1:30000/v1/chat/completions \ curl -X POST http://127.0.0.1:30000/v1/chat/completions \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{ -d '{
"model": "unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL", "model": "nemotron",
"messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}], "messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
"max_tokens": 500 "max_tokens": 500
}' }'
``` ```
## Step 7. Cleanup ## Step 8. Cleanup
Stop the server with `Ctrl+C` in the terminal where it is running. Stop the server with `Ctrl+C` in the terminal where it is running.
To remove this tutorials artifacts: To remove this tutorials artifacts:
```shell ```bash
rm -rf ~/llama.cpp rm -rf ~/llama.cpp
rm -rf ~/.cache/huggingface/hub/models--unsloth--Qwen3.6-35B-A3B-MTP-GGUF rm -rf ~/models/NVIDIA-Nemotron-3-Nano-Omni
``` ```
## Step 8. Next steps Deactivate the Python venv if you no longer need `hf`:
1. **Context length:** By default, llama.cpp tries to allocate maximum context size supported for the model if possible, but you can also set it manually using `--ctx-size` (or `-c`) to adjust for your needs. For agentic or coding needs you need a minimum of 32768 tokens, preferably 100000 or more. ```bash
2. **Other models:** You can use `--model` to load any compatible GGUF downloaded locally; the llama.cpp server API stays the same. Use `-hf` to let llama.cpp automatically manage downloads/updates. Please note that if you use `--model` with multi-modal models, you need to provide a path to .mmproj file using `--mmproj` parameter. If you use `-hf` it will load the mmproj file automatically. deactivate
```
## Step 9. Next steps
1. **Context length:** Increase `--ctx-size` for longer chats (watch memory; 1M-token class contexts are possible only when the build, model, and hardware allow).
2. **Other models:** Point `--model` at any compatible GGUF; the llama.cpp server API stays the same.
3. **Integrations:** Point Open WebUI, Continue.dev, or custom clients at `http://<spark-host>:30000/v1` using the OpenAI client pattern. 3. **Integrations:** Point Open WebUI, Continue.dev, or custom clients at `http://<spark-host>:30000/v1` using the OpenAI client pattern.
The server implements the usual OpenAI-style chat features your llama.cpp build enables (including streaming and tool-related flows where supported). The server implements the usual OpenAI-style chat features your llama.cpp build enables (including streaming and tool-related flows where supported).

View File

@ -69,7 +69,7 @@ In the CLI, youll be walked through registration. Go through the flow until r
* Go to the [Brev UI](https://brev.nvidia.com) * Go to the [Brev UI](https://brev.nvidia.com)
* Navigate to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) * Navigate to the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute)
* Confirm that the DGX Spark appears as a registered node with a **Connected** status * Confirm that the DGX Spark appears as a registered node with an **Available** status
## Step 5. Next Steps ## Step 5. Next Steps
@ -77,10 +77,7 @@ Your Spark is now integrated into Brev as a secure, remotely accessible GPU envi
Now that your hardware is connected, you can: Now that your hardware is connected, you can:
* **Share Access Anywhere:** Access your machine from anywhere and share access with others through the Brev UI by: * **Share Access Anywhere:** Access your machine from anywhere and share access with others through the Brev UI under [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute).
* Adding the user to your [Team](https://brev.nvidia.com/org/team)
* Navigating to your instance in the [Registered Compute](https://brev.nvidia.com/org/environments?tab=registered-compute) section
* In **SSH Access** section of the instance, search for the user you wish to add and click **Modify Access** to enable access
## Step 6. Cleanup ## Step 6. Cleanup
@ -95,7 +92,7 @@ brev deregister
In the UI: In the UI:
* Go to the [Brev UI](https://brev.nvidia.com) * Go to the [Brev UI](https://brev.nvidia.com)
* Navigate to the section listing “GPU Environments” and look under “Registered Compute” * Navigate to the section listing “GPU Environments” and look under “Registered Compute”
* Click the “Remove” menu item on the Spark you wish to delete from Brev. * Click the “Deregister” menu item on the Spark you wish to delete from Brev.
* Confirm your selection. * Confirm your selection.
## Troubleshooting ## Troubleshooting