> Run Nemotron-3-Nano-30B model using llama.cpp on DGX Spark
## Table of Contents
- [Overview](#overview)
- [Instructions](#instructions)
- [Troubleshooting](#troubleshooting)
---
## Overview
Nemotron-3-Nano-30B-A3B is NVIDIA's language model built on a 30-billion-parameter Mixture of Experts (MoE) architecture that activates only about 3 billion parameters per token. This efficient design delivers high-quality inference at lower computational cost, making it well suited to DGX Spark's GB10 GPU.
This playbook demonstrates how to run Nemotron-3-Nano using llama.cpp, which compiles CUDA kernels at build time specifically for your GPU architecture. The model includes built-in reasoning (thinking mode) and tool calling support via the chat template.
## What you'll accomplish
You will have a fully functional Nemotron-3-Nano-30B-A3B inference server running on your DGX Spark, accessible via an OpenAI-compatible API. This setup enables:
- Local LLM inference
- OpenAI-compatible API endpoint for easy integration with existing tools
- Built-in reasoning and tool calling capabilities
## What to know before starting
- Basic familiarity with Linux command line and terminal commands
- Understanding of git and working with branches
- Experience building software from source with CMake
- Basic knowledge of REST APIs and cURL for testing
- Familiarity with Hugging Face Hub for model downloads
## Prerequisites
**Hardware Requirements:**
- NVIDIA DGX Spark with GB10 GPU
- At least 40GB available GPU memory (model uses ~38GB VRAM)
- At least 50GB available storage space for model downloads and build artifacts
**Software Requirements:**
- NVIDIA DGX OS
- Git: `git --version`
- CMake (3.14+): `cmake --version`
- CUDA Toolkit: `nvcc --version`
- Network access to GitHub and Hugging Face
## Time & risk
* **Estimated time:** 30 minutes (including model download of ~38GB)
* **Risk level:** Low
* Build process compiles from source but doesn't modify system files
## Instructions
## Step 1. Verify prerequisites and install the Hugging Face CLI
Ensure you have the required tools installed on your DGX Spark before proceeding.
```bash
git --version
cmake --version
nvcc --version
```
All commands should return version information. If any are missing, install them before continuing.
Install the Hugging Face CLI:
```bash
python3 -m venv nemotron-venv
source nemotron-venv/bin/activate
pip install -U "huggingface_hub[cli]"
```
Verify installation:
```bash
hf version
```
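Later steps assume the model's GGUF weights are available locally. With the CLI installed, a download sketch looks like the following; the repository name below is a placeholder, so substitute the actual GGUF repository for Nemotron-3-Nano-30B-A3B on Hugging Face.

```bash
# Placeholder repo id -- replace with the real GGUF repository for the model
hf download some-org/Nemotron-3-Nano-30B-A3B-GGUF \
  --local-dir ./models
```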
## Step 2. Clone llama.cpp repository
Clone the llama.cpp repository which provides the inference framework for running Nemotron models.
```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```
## Step 3. Build llama.cpp with CUDA support
Build llama.cpp with CUDA enabled and targeting the GB10's sm_121 compute architecture. This compiles CUDA kernels specifically optimized for your DGX Spark GPU.
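A typical CUDA build for this step looks like the sketch below. The compute-architecture value `121` is an assumption based on the GB10's sm_121 target mentioned above; adjust it if `nvcc` reports a different capability on your system.

```bash
# Configure with CUDA enabled, targeting sm_121 (assumed GB10 architecture)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121

# Compile; binaries land in build/bin
cmake --build build --config Release -j"$(nproc)"
```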
Key `llama-server` flags:
- `--host 0.0.0.0`: Listen on all network interfaces
- `--port 30000`: API server port
- `--n-gpu-layers 99`: Offload all layers to the GPU
- `--ctx-size 8192`: Context window size (can be increased up to 1M)
- `--threads 8`: CPU threads for non-GPU operations
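Putting the flags together, a server launch looks like the sketch below. The model path is a placeholder; substitute the GGUF file you actually downloaded.

```bash
# Hypothetical model path -- point this at your downloaded GGUF file
./build/bin/llama-server \
  --model ./models/nemotron-3-nano-30b-a3b.gguf \
  --host 0.0.0.0 \
  --port 30000 \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --threads 8
```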
You should see server startup messages indicating the model is loaded and ready:
```
llama_new_context_with_model: n_ctx = 8192
...
main: server is listening on 0.0.0.0:30000
```
## Step 6. Test the API
Open a new terminal and test the inference server using the OpenAI-compatible chat completions endpoint.
```bash
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron",
"messages": [{"role": "user", "content": "New York is a great city because..."}],
"max_tokens": 100
}'
```
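Beyond the chat completion above, recent llama.cpp builds expose a `/health` endpoint for liveness checks, and the assistant text can be pulled out of the JSON response with `jq` (assumed installed):

```bash
# Liveness check; reports a ready status once the model is loaded
curl -s http://localhost:30000/health

# Extract only the assistant's reply text from a completion (requires jq)
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nemotron", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}' \
  | jq -r '.choices[0].message.content'
```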
Expected response format:
```json
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "We need to respond to user statement: \"New York is a great city because...\". Probably they want continuation, maybe a discussion. It's a simple open-ended prompt. Provide reasons why New York is great. No policy issues. Just respond creatively.",
"content": "New York is a great city because it's a living, breathing collage of cultures, ideas, and possibilities—all stacked into one vibrant, never‑sleeping metropolis. Here are just a few reasons that many people ("