> Run Nemotron-3-Nano-30B model using llama.cpp on DGX Spark
## Table of Contents
- [Overview](#overview)
- [Instructions](#instructions)
- [Troubleshooting](#troubleshooting)
---
## Overview
Nemotron-3-Nano-30B-A3B is NVIDIA's language model built on a 30-billion-parameter Mixture of Experts (MoE) architecture that activates only about 3 billion parameters per token. This efficient design delivers high-quality inference at lower computational cost, making it well suited to DGX Spark's GB10 GPU.
This playbook demonstrates how to run Nemotron-3-Nano using llama.cpp, which compiles CUDA kernels at build time specifically for your GPU architecture. The model includes built-in reasoning (thinking mode) and tool calling support via the chat template.
## What you'll accomplish
You will have a fully functional Nemotron-3-Nano-30B-A3B inference server running on your DGX Spark, accessible via an OpenAI-compatible API. This setup enables:
- Local LLM inference
- OpenAI-compatible API endpoint for easy integration with existing tools
- Built-in reasoning and tool calling capabilities
## What to know before starting
- Basic familiarity with Linux command line and terminal commands
- Understanding of git and working with branches
- Experience building software from source with CMake
- Basic knowledge of REST APIs and cURL for testing
- Familiarity with Hugging Face Hub for model downloads
## Prerequisites
**Hardware Requirements:**
- NVIDIA DGX Spark with GB10 GPU
- At least 40GB available GPU memory (model uses ~38GB VRAM)
- At least 50GB available storage space for model downloads and build artifacts
**Software Requirements:**
- NVIDIA DGX OS
- Git: `git --version`
- CMake (3.14+): `cmake --version`
- CUDA Toolkit: `nvcc --version`
- Network access to GitHub and Hugging Face
## Time & risk
* **Estimated time:** 30 minutes (including model download of ~38GB)
* **Risk level:** Low
* Build process compiles from source but doesn't modify system files
## Instructions
## Step 1. Verify prerequisites and install the Hugging Face CLI
Ensure you have the required tools installed on your DGX Spark before proceeding.
```bash
git --version
cmake --version
nvcc --version
```
All commands should return version information. If any are missing, install them before continuing.
Install the Hugging Face CLI:
```bash
python3 -m venv nemotron-venv
source nemotron-venv/bin/activate
pip install -U "huggingface_hub[cli]"
```
Verify installation:
```bash
hf version
```
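Later steps assume the model's GGUF weights are available locally. With the CLI installed, a download sketch looks like the following; the repository name below is a placeholder, so substitute the actual GGUF repository for Nemotron-3-Nano-30B-A3B on Hugging Face.

```bash
# Placeholder repo id -- replace with the real GGUF repository for the model
hf download some-org/Nemotron-3-Nano-30B-A3B-GGUF \
  --local-dir ./models
```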
## Step 2. Clone llama.cpp repository
Clone the llama.cpp repository which provides the inference framework for running Nemotron models.
```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```
## Step 3. Build llama.cpp with CUDA support
Build llama.cpp with CUDA enabled and targeting the GB10's sm_121 compute architecture. This compiles CUDA kernels specifically optimized for your DGX Spark GPU.
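A typical CUDA build for this step looks like the sketch below. The compute-architecture value `121` is an assumption based on the GB10's sm_121 target mentioned above; adjust it if `nvcc` reports a different capability on your system.

```bash
# Configure with CUDA enabled, targeting sm_121 (assumed GB10 architecture)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121

# Compile; binaries land in build/bin
cmake --build build --config Release -j"$(nproc)"
```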
Key `llama-server` flags:
- `--host 0.0.0.0`: Listen on all network interfaces
- `--port 30000`: API server port
- `--n-gpu-layers 99`: Offload all layers to the GPU
- `--ctx-size 8192`: Context window size (can be increased up to 1M)
- `--threads 8`: CPU threads for non-GPU operations
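Putting the flags together, a server launch looks like the sketch below. The model path is a placeholder; substitute the GGUF file you actually downloaded.

```bash
# Hypothetical model path -- point this at your downloaded GGUF file
./build/bin/llama-server \
  --model ./models/nemotron-3-nano-30b-a3b.gguf \
  --host 0.0.0.0 \
  --port 30000 \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --threads 8
```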
You should see server startup messages indicating the model is loaded and ready:
```
llama_new_context_with_model: n_ctx = 8192
...
main: server is listening on 0.0.0.0:30000
```
## Step 6. Test the API
Open a new terminal and test the inference server using the OpenAI-compatible chat completions endpoint.
```bash
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron",
"messages": [{"role": "user", "content": "New York is a great city because..."}],
"max_tokens": 100
}'
```
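Beyond the chat completion above, recent llama.cpp builds expose a `/health` endpoint for liveness checks, and the assistant text can be pulled out of the JSON response with `jq` (assumed installed):

```bash
# Liveness check; reports a ready status once the model is loaded
curl -s http://localhost:30000/health

# Extract only the assistant's reply text from a completion (requires jq)
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nemotron", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}' \
  | jq -r '.choices[0].message.content'
```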
Expected response format:
```json
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "We need to respond to user statement: \"New York is a great city because...\". Probably they want continuation, maybe a discussion. It's a simple open-ended prompt. Provide reasons why New York is great. No policy issues. Just respond creatively.",
"content": "New York is a great city because it's a living, breathing collage of cultures, ideas, and possibilities—all stacked into one vibrant, never‑sleeping metropolis. Here are just a few reasons that many people ("