
Nemotron-3-Nano with llama.cpp

Run the Nemotron-3-Nano-30B model using llama.cpp on DGX Spark


Overview

Basic idea

Nemotron-3-Nano-30B-A3B is an NVIDIA language model built on a Mixture of Experts (MoE) architecture: it has 30 billion parameters in total, of which only 3 billion are active for any given token. Because each token exercises only a fraction of the weights, inference needs less compute than a dense model of the same size, which makes the model a good fit for DGX Spark's GB10 GPU.

This playbook demonstrates how to run Nemotron-3-Nano using llama.cpp, which compiles CUDA kernels at build time specifically for your GPU architecture. The model includes built-in reasoning (thinking mode) and tool calling support via the chat template.

What you'll accomplish

You will have a fully functional Nemotron-3-Nano-30B-A3B inference server running on your DGX Spark, accessible via an OpenAI-compatible API. This setup enables:

  • Local LLM inference
  • OpenAI-compatible API endpoint for easy integration with existing tools
  • Built-in reasoning and tool calling capabilities

What to know before starting

  • Basic familiarity with Linux command line and terminal commands
  • Understanding of git and working with branches
  • Experience building software from source with CMake
  • Basic knowledge of REST APIs and cURL for testing
  • Familiarity with Hugging Face Hub for model downloads

Prerequisites

Hardware Requirements:

  • NVIDIA DGX Spark with GB10 GPU
  • At least 40GB available GPU memory (model uses ~38GB VRAM)
  • At least 50GB available storage space for model downloads and build artifacts
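The storage requirement can be checked up front. The following is a minimal sketch, assuming the model and build will live under your home directory; has_free_space is a hypothetical helper name, not part of any tool used in this playbook.

```shell
# Hypothetical helper: succeed if the filesystem holding DIR has at least MIN_GB free.
has_free_space() {
  dir="$1"
  min_gb="$2"
  # df -Pk prints POSIX-format output in 1024-byte blocks; field 4 is "Available".
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Assumption: downloads and build artifacts go under $HOME.
if has_free_space "$HOME" 50; then
  echo "OK: at least 50GB free under $HOME"
else
  echo "WARNING: less than 50GB free under $HOME"
fi
```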

Software Requirements:

  • NVIDIA DGX OS
  • Git: git --version
  • CMake (3.14+): cmake --version
  • CUDA Toolkit: nvcc --version
  • Network access to GitHub and Hugging Face

Time & risk

  • Estimated time: 30 minutes (including model download of ~38GB)
  • Risk level: Low
    • Build process compiles from source but doesn't modify system files
    • Model downloads can be resumed if interrupted
  • Rollback: Delete the cloned llama.cpp directory and downloaded model files to fully remove the installation
  • Last Updated: 12/17/2025
    • First Publication

Instructions

Step 1. Verify prerequisites

Ensure you have the required tools installed on your DGX Spark before proceeding.

git --version
cmake --version
nvcc --version

All commands should return version information. If any are missing, install them before continuing.
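If you prefer a single pass/fail check over reading three version strings, a sketch like the following reports which tools are missing. check_tools is a hypothetical helper; it only tests that each command is on your PATH and does not validate versions.

```shell
# Report each required tool; return non-zero if any is not on PATH.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "MISSING: $tool"
      missing=1
    fi
  done
  return "$missing"
}

check_tools git cmake nvcc || echo "Install the missing tools before continuing."
```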

Install the Hugging Face CLI:

python3 -m venv nemotron-venv
source nemotron-venv/bin/activate
pip install -U "huggingface_hub[cli]"

Verify installation:

hf version

Step 2. Clone llama.cpp repository

Clone the llama.cpp repository, which provides the inference framework for running Nemotron models.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Step 3. Build llama.cpp with CUDA support

Build llama.cpp with CUDA enabled and targeting the GB10's sm_121 compute architecture. This compiles CUDA kernels specifically optimized for your DGX Spark GPU.

mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
make -j8

The build process takes approximately 5-10 minutes. You should see compilation progress and eventually a successful build message.

Step 4. Download the Nemotron GGUF model

Download the Q8 quantized GGUF model from Hugging Face. This model provides excellent quality while fitting within the GB10's memory capacity.

hf download unsloth/Nemotron-3-Nano-30B-A3B-GGUF \
  Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
  --local-dir ~/models/nemotron3-gguf

This downloads approximately 38GB. The download can be resumed if interrupted.
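Before starting the server, it can help to confirm that the file actually arrived at roughly the expected size. This is a sketch only: file_at_least is a hypothetical helper, and the 30 GiB threshold is an assumed lower bound chosen to sit safely below the ~38GB stated above, not an exact size.

```shell
# Hypothetical helper: succeed if FILE exists and is at least MIN_BYTES long.
file_at_least() {
  file="$1"
  min_bytes="$2"
  [ -f "$file" ] || return 1
  # wc -c reports the byte count; tr strips padding some implementations add.
  size=$(wc -c < "$file" | tr -d ' ')
  [ "$size" -ge "$min_bytes" ]
}

model="$HOME/models/nemotron3-gguf/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf"
# Assumed threshold: 30 GiB, well under the ~38GB download size.
if file_at_least "$model" $((30 * 1024 * 1024 * 1024)); then
  echo "Model file looks complete: $model"
else
  echo "Model file is missing or smaller than expected: $model"
fi
```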

Step 5. Start the llama.cpp server

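Launch the inference server with the Nemotron model. The server provides an OpenAI-compatible API endpoint. Run the command from the llama.cpp/build directory you were left in after Step 3, because the path to the binary is relative.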

./bin/llama-server \
  --model ~/models/nemotron3-gguf/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
  --host 0.0.0.0 \
  --port 30000 \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --threads 8

Parameter explanation:

  • --host 0.0.0.0: Listen on all network interfaces
  • --port 30000: API server port
  • --n-gpu-layers 99: Offload all layers to GPU
  • --ctx-size 8192: Context window size in tokens (can be increased up to 1M at the cost of more memory)
  • --threads 8: CPU threads for non-GPU operations

You should see server startup messages indicating the model is loaded and ready:

llama_new_context_with_model: n_ctx = 8192
...
main: server is listening on 0.0.0.0:30000
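Loading a model of this size takes a while, so scripts that start the server usually need to wait before sending requests. The sketch below polls llama-server's /health endpoint, which returns an error status while the model is still loading and a success status once it is ready. wait_for_server is a hypothetical helper, and the one-second interval and five-second request timeout are arbitrary choices.

```shell
# Poll URL until it returns an HTTP success status, up to TRIES attempts, one second apart.
wait_for_server() {
  url="$1"
  tries="$2"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl exit non-zero on HTTP errors such as 503 (model still loading).
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

For example, wait_for_server http://localhost:30000/health 120 waits up to roughly two minutes and returns a non-zero status if the server never becomes ready.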

Step 6. Test the API

Open a new terminal and test the inference server using the OpenAI-compatible chat completions endpoint.

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron",
    "messages": [{"role": "user", "content": "New York is a great city because..."}],
    "max_tokens": 100
  }'

Expected response format:

{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "We need to respond to user statement: \"New York is a great city because...\". Probably they want continuation, maybe a discussion. It's a simple open-ended prompt. Provide reasons why New York is great. No policy issues. Just respond creatively.",
        "content": "New York is a great city because it's a living, breathing collage of cultures, ideas, and possibilities—all stacked into one vibrant, neversleeping metropolis. Here are just a few reasons that many people ("
      }
    }
  ],
  "created": 1765916539,
  "model": "Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 100,
    "prompt_tokens": 25,
    "total_tokens": 125
  },
  "id": "chatcmpl-...",
  "timings": {
    ...
  }
}
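The response separates the model's reasoning (reasoning_content) from its answer (content), so scripts typically want just one of the two. The sketch below pulls a single field out of a response in the format shown above. message_field is a hypothetical helper; it uses python3 for JSON parsing because Step 1 already relies on Python, though a tool such as jq would work equally well if you have it installed.

```shell
# Print one field of the first choice's message from a chat completion read on stdin.
message_field() {
  python3 -c '
import json, sys
message = json.load(sys.stdin)["choices"][0]["message"]
print(message.get(sys.argv[1], ""))
' "$1"
}

# A trimmed response in the format shown above (values shortened for illustration).
response='{"choices":[{"message":{"role":"assistant","reasoning_content":"Provide reasons.","content":"New York is a great city because..."}}]}'

printf '%s' "$response" | message_field content
printf '%s' "$response" | message_field reasoning_content
```

Against the live server, pipe the output of curl -s into message_field content to print only the final answer.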

Step 7. Test reasoning capabilities

Nemotron-3-Nano includes built-in reasoning capabilities. Test with a more complex prompt:

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron",
    "messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
    "max_tokens": 500
  }'

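The model produces a reasoning chain before giving the final answer. As in the Step 6 response, the reasoning is returned in the message's reasoning_content field and the answer in content.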

Step 8. Cleanup

To stop the server, press Ctrl+C in the terminal where it's running.

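To completely remove the installation, delete the cloned repository, the downloaded model, and the Python virtual environment from Step 1. The paths below assume llama.cpp was cloned and nemotron-venv was created in your home directory; adjust them if you worked elsewhere.

# Remove the llama.cpp clone and build
rm -rf ~/llama.cpp

# Remove downloaded models
rm -rf ~/models/nemotron3-gguf

# Remove the Hugging Face CLI virtual environment
rm -rf ~/nemotron-venv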

Step 9. Next steps

  1. Increase context size: For longer conversations, increase --ctx-size up to 1048576 (1M tokens), though this will use more memory
  2. Integrate with applications: Use the OpenAI-compatible API with tools like Open WebUI, Continue.dev, or custom applications

The server exposes an OpenAI-compatible API that covers streaming responses, function calling, and multi-turn conversations.
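As an illustration of streaming, the sketch below reassembles the text from a streamed response. It assumes the server emits OpenAI-style server-sent events, that is, lines of the form data: {...} carrying choices[0].delta.content and a final data: [DONE] line. print_stream_content is a hypothetical helper, and the sample input is hand-written for illustration rather than captured from the server.

```shell
# Read an OpenAI-style event stream on stdin and print the concatenated content deltas.
print_stream_content() {
  python3 -c '
import json, sys
for line in sys.stdin:
    line = line.strip()
    if not line.startswith("data: "):
        continue  # skip blank separator lines and anything that is not a data event
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break
    choices = json.loads(payload).get("choices") or [{}]
    delta = choices[0].get("delta", {})
    sys.stdout.write(delta.get("content") or "")
    sys.stdout.flush()
print()
'
}

# Hand-written sample events (not real server output).
sample='data: {"choices":[{"delta":{"content":"Hel"}}]}
data: {"choices":[{"delta":{"content":"lo"}}]}
data: [DONE]'

printf '%s\n' "$sample" | print_stream_content
```

To use it against the running server, add "stream": true to the request body from Step 6 and pipe curl -sN into print_stream_content; the -N flag disables curl's output buffering so tokens appear as they arrive.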

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| cmake fails with "CUDA not found" | CUDA toolkit not in PATH | Run export PATH=/usr/local/cuda/bin:$PATH and retry |
| Model download fails or is interrupted | Network issues | Re-run the hf download command; it resumes from where it stopped |
| "CUDA out of memory" when starting server | Insufficient GPU memory | Reduce --ctx-size to 4096 or use a smaller quantization (Q4_K_M) |
| Server starts but inference is slow | Model not fully loaded to GPU | Verify --n-gpu-layers 99 is set and check that nvidia-smi shows GPU usage |
| "Connection refused" on port 30000 | Server not running or wrong port | Verify the server is running and check the --port parameter |
| "model not found" in API response | Wrong model path | Verify that the path in the --model parameter matches the downloaded file location |

Note

DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. Because many applications are still being updated to take advantage of UMA, you may encounter memory issues even when usage is within DGX Spark's memory capacity. If that happens, manually flush the buffer cache with:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
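To see whether the flush had any effect, you can compare available memory before and after running it. The sketch below reads the MemAvailable value from /proc/meminfo (Linux only); mem_available_gib is a hypothetical helper name.

```shell
# Print MemAvailable from /proc/meminfo in GiB, to one decimal place (Linux only).
mem_available_gib() {
  # /proc/meminfo reports the value in kB, so divide twice by 1024.
  awk '/^MemAvailable:/ {printf "%.1f\n", $2 / 1024 / 1024}' /proc/meminfo
}

echo "Available memory: $(mem_available_gib) GiB"
```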

For the latest known issues, review the DGX Spark User Guide.