Nemotron-3-Nano with llama.cpp
Run the Nemotron-3-Nano-30B model using llama.cpp on DGX Spark
Overview
Basic idea
Nemotron-3-Nano-30B-A3B is an NVIDIA language model with a 30 billion parameter Mixture of Experts (MoE) architecture in which only 3 billion parameters are active per token. Because each token uses only a fraction of the weights, inference needs far less compute than a dense model of the same size, which makes it a good fit for DGX Spark's GB10 GPU.
This playbook demonstrates how to run Nemotron-3-Nano using llama.cpp, which compiles CUDA kernels at build time specifically for your GPU architecture. The model includes built-in reasoning (thinking mode) and tool calling support via the chat template.
What you'll accomplish
You will have a fully functional Nemotron-3-Nano-30B-A3B inference server running on your DGX Spark, accessible via an OpenAI-compatible API. This setup enables:
- Local LLM inference
- OpenAI-compatible API endpoint for easy integration with existing tools
- Built-in reasoning and tool calling capabilities
What to know before starting
- Basic familiarity with Linux command line and terminal commands
- Understanding of git and working with branches
- Experience building software from source with CMake
- Basic knowledge of REST APIs and cURL for testing
- Familiarity with Hugging Face Hub for model downloads
Prerequisites
Hardware Requirements:
- NVIDIA DGX Spark with GB10 GPU
- At least 40GB of available memory for the GPU (the model uses ~38GB; on DGX Spark the GPU and CPU share one unified memory pool)
- At least 50GB available storage space for model downloads and build artifacts
Software Requirements:
- NVIDIA DGX OS
- Git: git --version
- CMake (3.14+): cmake --version
- CUDA Toolkit: nvcc --version
- Network access to GitHub and Hugging Face
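The storage requirement above can be checked from the shell before you start. This is a minimal sketch, not part of the original playbook: free_disk_gib and has_free_disk are hypothetical helper names, and the check assumes the model and build will live on the filesystem that holds your home directory.

```shell
# Hypothetical helper: print free space (whole GiB) on the filesystem holding DIR.
free_disk_gib() {
  local dir="${1:-$HOME}"
  # -P forces one POSIX-format line per filesystem; -k reports 1024-byte blocks.
  df -Pk "$dir" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Hypothetical helper: succeed only if DIR has at least MIN_GIB GiB free.
has_free_disk() {
  local dir="$1" min_gib="$2"
  [ "$(free_disk_gib "$dir")" -ge "$min_gib" ]
}

# Usage: the playbook asks for about 50GB of free storage.
# 50 GiB is slightly more than 50GB, so this check errs on the safe side.
#   has_free_disk "$HOME" 50 && echo "enough disk space" || echo "free up disk space first"
```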
Time & risk
- Estimated time: 30 minutes (including model download of ~38GB)
- Risk level: Low
  - Build process compiles from source but doesn't modify system files
  - Model downloads can be resumed if interrupted
- Rollback: Delete the cloned llama.cpp directory and downloaded model files to fully remove the installation
- Last Updated: 12/17/2025
  - First Publication
Instructions
Step 1. Verify prerequisites
Ensure you have the required tools installed on your DGX Spark before proceeding.
git --version
cmake --version
nvcc --version
All commands should return version information. If any are missing, install them before continuing.
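If you prefer a single pass over all required tools, the check can be scripted. The sketch below is an illustration only: check_tools is a hypothetical helper name, and it tests only that each command is on your PATH, not that its version is recent enough.

```shell
# Hypothetical helper: report which of the given commands are missing from PATH.
# Returns 0 if every command is found, 1 if at least one is missing.
check_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok:      $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage for this playbook (python3 is needed for the virtual environment below):
#   check_tools git cmake nvcc python3
```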
Install the Hugging Face CLI:
python3 -m venv nemotron-venv
source nemotron-venv/bin/activate
pip install -U "huggingface_hub[cli]"
Verify installation:
hf version
Step 2. Clone llama.cpp repository
Clone the llama.cpp repository, which provides the inference framework for running Nemotron models.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
Step 3. Build llama.cpp with CUDA support
Build llama.cpp with CUDA enabled and targeting the GB10's sm_121 compute architecture. This compiles CUDA kernels specifically optimized for your DGX Spark GPU.
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
make -j8
The build process takes approximately 5-10 minutes. You should see compilation progress and eventually a successful build message.
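A quick way to confirm the build produced the server binary is to test for the file before moving on. This is a minimal sketch under the assumption that you are still in the llama.cpp/build directory; verify_binary is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the given path is an executable file.
verify_binary() {
  local bin="$1"
  if [ -x "$bin" ] && [ ! -d "$bin" ]; then
    echo "found: $bin"
  else
    echo "missing or not executable: $bin"
    return 1
  fi
}

# Usage, from the llama.cpp/build directory:
#   verify_binary ./bin/llama-server && ./bin/llama-server --version
```

If the binary is missing, scroll back through the make output for the first error rather than the last one, since later errors are often consequences of the first.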
Step 4. Download the Nemotron GGUF model
Download the 8-bit quantized (UD-Q8_K_XL) GGUF model from Hugging Face. This quantization keeps output quality high while fitting within the GB10's memory capacity.
hf download unsloth/Nemotron-3-Nano-30B-A3B-GGUF \
Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
--local-dir ~/models/nemotron3-gguf
This downloads approximately 38GB. The download can be resumed if interrupted.
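Because an interrupted download can leave a partial file behind, it can help to check the file size before starting the server. The sketch below is an assumption-laden illustration: check_model_size is a hypothetical helper, and the 30 GiB threshold in the usage line is only a loose lower bound chosen because the download is described as approximately 38GB.

```shell
# Hypothetical helper: succeed if FILE exists and is at least MIN_GIB GiB in size.
check_model_size() {
  local file="$1" min_gib="$2"
  if [ ! -f "$file" ]; then
    echo "not found: $file"
    return 1
  fi
  local bytes min_bytes
  # The arithmetic expansion strips any padding that wc adds on some systems.
  bytes=$(( $(wc -c < "$file") ))
  min_bytes=$(( min_gib * 1024 * 1024 * 1024 ))
  if [ "$bytes" -ge "$min_bytes" ]; then
    echo "ok: $file is $bytes bytes"
  else
    echo "too small: $file is $bytes bytes, expected at least $min_bytes"
    return 1
  fi
}

# Usage (the threshold of 30 is an assumed lower bound, not an exact size):
#   check_model_size ~/models/nemotron3-gguf/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf 30
```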
Step 5. Start the llama.cpp server
Launch the inference server with the Nemotron model from the llama.cpp/build directory, where Step 3 left you. The server provides an OpenAI-compatible API endpoint.
./bin/llama-server \
--model ~/models/nemotron3-gguf/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
--host 0.0.0.0 \
--port 30000 \
--n-gpu-layers 99 \
--ctx-size 8192 \
--threads 8
Parameter explanation:
- --host 0.0.0.0: Listen on all network interfaces
- --port 30000: API server port
- --n-gpu-layers 99: Offload all layers to GPU
- --ctx-size 8192: Context window size (can increase up to 1M)
- --threads 8: CPU threads for non-GPU operations
You should see server startup messages indicating the model is loaded and ready:
llama_new_context_with_model: n_ctx = 8192
...
main: server is listening on 0.0.0.0:30000
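Loading a model of this size takes a while, so scripts that talk to the server usually need to wait until it is ready. The following sketch polls the server's /health endpoint, which llama-server is expected to answer with an error status while the model is loading and with success once it is ready. wait_for_server is a hypothetical helper name, and the timeout is approximate because it counts polling attempts rather than wall-clock seconds.

```shell
# Hypothetical helper: poll URL about once per second until it returns success
# or roughly TIMEOUT seconds have passed (default 120).
wait_for_server() {
  local url="$1" timeout="${2:-120}" waited=0
  # -f makes curl fail on HTTP error codes; --max-time bounds each attempt.
  until curl -sf --max-time 2 -o /dev/null "$url"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "server not ready after ${timeout}s: $url"
      return 1
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  echo "server ready: $url"
}

# Usage, from a second terminal:
#   wait_for_server http://localhost:30000/health 300
```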
Step 6. Test the API
Open a new terminal and test the inference server using the OpenAI-compatible chat completions endpoint.
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron",
"messages": [{"role": "user", "content": "New York is a great city because..."}],
"max_tokens": 100
}'
Expected response format:
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "We need to respond to user statement: \"New York is a great city because...\". Probably they want continuation, maybe a discussion. It's a simple open-ended prompt. Provide reasons why New York is great. No policy issues. Just respond creatively.",
"content": "New York is a great city because it's a living, breathing collage of cultures, ideas, and possibilities—all stacked into one vibrant, never‑sleeping metropolis. Here are just a few reasons that many people ("
}
}
],
"created": 1765916539,
"model": "Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf",
"object": "chat.completion",
"usage": {
"completion_tokens": 100,
"prompt_tokens": 25,
"total_tokens": 125
},
"id": "chatcmpl-...",
"timings": {
...
}
}
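In practice you usually want only the answer text, or only the reasoning, rather than the whole JSON document. The sketch below pulls a single field out of choices[0].message using the python3 interpreter already required in Step 1. message_field is a hypothetical helper name, and the field names are taken from the response shown above.

```shell
# Hypothetical helper: read a chat completion on stdin and print one field
# of choices[0].message (for example "content" or "reasoning_content").
message_field() {
  python3 -c '
import json, sys
message = json.load(sys.stdin)["choices"][0]["message"]
# Missing or null fields print as an empty line.
print(message.get(sys.argv[1]) or "")
' "$1"
}

# Usage, with the server running:
#   curl -s http://localhost:30000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "nemotron", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}' \
#     | message_field content
```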
Step 7. Test reasoning capabilities
Nemotron-3-Nano includes built-in reasoning capabilities. Test with a more complex prompt:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron",
"messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
"max_tokens": 500
}'
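The model works through a detailed reasoning chain before giving the final answer. As in the response format from Step 6, the reasoning is returned in reasoning_content and the answer in content.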
Step 8. Cleanup
To stop the server, press Ctrl+C in the terminal where it's running.
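To completely remove the installation, run the commands below. They assume you cloned llama.cpp into your home directory; adjust the first path if you cloned it elsewhere. Also delete the nemotron-venv directory created in Step 1 if you no longer need the Hugging Face CLI.
# Remove the llama.cpp source and build
rm -rf ~/llama.cpp
# Remove downloaded models
rm -rf ~/models/nemotron3-gguf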
Step 9. Next steps
- Increase context size: For longer conversations, increase --ctx-size up to 1048576 (1M tokens), though this will use more memory
- Integrate with applications: Use the OpenAI-compatible API with tools like Open WebUI, Continue.dev, or custom applications
The server's OpenAI-compatible endpoints support streaming responses, function calling, and multi-turn conversations.
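As an illustration of function calling, the sketch below builds a request in the OpenAI tools format and sends it to the chat completions endpoint. Everything tool-specific here is an assumption: get_weather is a made-up function that exists only in this example, chat_request is a hypothetical helper name, and your application would be responsible for actually running any tool the model asks for.

```shell
# Request body with one hypothetical tool definition (get_weather is illustrative only).
tool_request=$(cat <<'EOF'
{
  "model": "nemotron",
  "messages": [
    {"role": "user", "content": "What is the weather in Paris right now?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {"type": "string", "description": "City name"}
          },
          "required": ["city"]
        }
      }
    }
  ],
  "max_tokens": 500
}
EOF
)

# Hypothetical helper: POST a JSON body to the chat completions endpoint at BASE_URL.
chat_request() {
  local base_url="$1" body="$2"
  curl -s "$base_url/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$body"
}

# Usage, with the server from Step 5 running:
#   chat_request http://localhost:30000 "$tool_request"
```

In the OpenAI format, a model that decides to use the tool returns the call under choices[0].message.tool_calls instead of a plain text answer. Depending on your llama.cpp version, tool calling may require starting the server with the --jinja flag; check ./bin/llama-server --help for your build.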
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| cmake fails with "CUDA not found" | CUDA toolkit not in PATH | Run export PATH=/usr/local/cuda/bin:$PATH and retry |
| Model download fails or is interrupted | Network issues | Re-run the hf download command - it will resume from where it stopped |
| "CUDA out of memory" when starting server | Insufficient GPU memory | Reduce --ctx-size to 4096 or use a smaller quantization (Q4_K_M) |
| Server starts but inference is slow | Model not fully loaded to GPU | Verify --n-gpu-layers 99 is set and check nvidia-smi shows GPU usage |
| "Connection refused" on port 30000 | Server not running or wrong port | Verify server is running and check the --port parameter |
| "model not found" in API response | Wrong model path | Verify the model path in --model parameter matches the downloaded file location |
Note
DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
For the latest known issues, please review the DGX Spark User Guide.
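To see whether flushing the buffer cache actually freed memory, you can compare the available memory before and after the command. This is a minimal sketch that reads the MemAvailable line from /proc/meminfo on Linux; mem_available_gib is a hypothetical helper name.

```shell
# Hypothetical helper: print MemAvailable from /proc/meminfo in GiB (Linux only).
# /proc/meminfo reports the value in kB, so divide by 1024 twice.
mem_available_gib() {
  awk '/^MemAvailable:/ { printf "%.1f\n", $2 / 1024 / 1024 }' /proc/meminfo
}

# Usage: compare the value before and after flushing the buffer cache.
#   mem_available_gib
#   sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
#   mem_available_gib
```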