diff --git a/README.md b/README.md
index 30b0bfc..d515422 100644
--- a/README.md
+++ b/README.md
@@ -33,6 +33,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
 - [Multi-modal Inference](nvidia/multi-modal-inference/)
 - [NCCL for Two Sparks](nvidia/nccl/)
 - [Fine-tune with NeMo](nvidia/nemo-fine-tune/)
+- [Nemotron-3-Nano with llama.cpp](nvidia/nemotron/)
 - [NIM on Spark](nvidia/nim-llm/)
 - [NVFP4 Quantization](nvidia/nvfp4-quantization/)
 - [Ollama](nvidia/ollama/)
diff --git a/nvidia/nemotron/README.md b/nvidia/nemotron/README.md
new file mode 100644
index 0000000..2db18d3
--- /dev/null
+++ b/nvidia/nemotron/README.md
@@ -0,0 +1,250 @@
+# Nemotron-3-Nano with llama.cpp
+
+> Run the Nemotron-3-Nano-30B-A3B model using llama.cpp on DGX Spark
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Instructions](#instructions)
+- [Troubleshooting](#troubleshooting)
+
+---
+
+## Overview
+
+## Basic idea
+
+Nemotron-3-Nano-30B-A3B is an NVIDIA language model with a 30-billion-parameter Mixture of Experts (MoE) architecture in which only about 3 billion parameters are active at a time. This design delivers high-quality inference at a lower computational cost, which makes it a good fit for DGX Spark's GB10 GPU.
+
+This playbook demonstrates how to run Nemotron-3-Nano using llama.cpp, which compiles CUDA kernels at build time specifically for your GPU architecture. The model includes built-in reasoning (thinking mode) and tool calling support via the chat template.
+
+## What you'll accomplish
+
+You will have a fully functional Nemotron-3-Nano-30B-A3B inference server running on your DGX Spark, accessible via an OpenAI-compatible API. This setup enables:
+
+- Local LLM inference
+- OpenAI-compatible API endpoint for easy integration with existing tools
+- Built-in reasoning and tool calling capabilities
+
+## What to know before starting
+
+- Basic familiarity with Linux command line and terminal commands
+- Understanding of git and working with branches
+- Experience building software from source with CMake
+- Basic knowledge of REST APIs and cURL for testing
+- Familiarity with Hugging Face Hub for model downloads
+
+## Prerequisites
+
+**Hardware Requirements:**
+- NVIDIA DGX Spark with GB10 GPU
+- At least 40GB available GPU memory (model uses ~38GB VRAM)
+- At least 50GB available storage space for model downloads and build artifacts
+
+**Software Requirements:**
+- NVIDIA DGX OS
+- Git: `git --version`
+- CMake (3.14+): `cmake --version`
+- CUDA Toolkit: `nvcc --version`
+- Network access to GitHub and Hugging Face
+
+## Time & risk
+
+* **Estimated time:** 30 minutes (including model download of ~38GB)
+* **Risk level:** Low
+  * Build process compiles from source but doesn't modify system files
+  * Model downloads can be resumed if interrupted
+* **Rollback:** Delete the cloned llama.cpp directory and downloaded model files to fully remove the installation
+* **Last Updated:** 12/17/2025
+  * First Publication
+
+## Instructions
+
+## Step 1. Verify prerequisites
+
+Ensure you have the required tools installed on your DGX Spark before proceeding.
+
+```bash
+git --version
+cmake --version
+nvcc --version
+```
+
+All commands should return version information. If any are missing, install them before continuing.
+
+Install the Hugging Face CLI:
+
+```bash
+python3 -m venv nemotron-venv
+source nemotron-venv/bin/activate
+pip install -U "huggingface_hub[cli]"
+```
+
+Verify installation:
+
+```bash
+hf version
+```
+
+## Step 2. Clone llama.cpp repository
+
+Clone the llama.cpp repository, which provides the inference framework for running Nemotron models.
+
+```bash
+git clone https://github.com/ggml-org/llama.cpp
+cd llama.cpp
+```
+
+## Step 3. Build llama.cpp with CUDA support
+
+Build llama.cpp with CUDA enabled and targeting the GB10's sm_121 compute architecture. This compiles CUDA kernels specifically optimized for your DGX Spark GPU.
+
+```bash
+mkdir build && cd build
+cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
+make -j8
+```
+
+The build process takes approximately 5-10 minutes. You should see compilation progress and eventually a successful build message.
+
+## Step 4. Download the Nemotron GGUF model
+
+Download the Q8 quantized GGUF model from Hugging Face. This model provides excellent quality while fitting within the GB10's memory capacity.
+
+```bash
+hf download unsloth/Nemotron-3-Nano-30B-A3B-GGUF \
+  Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
+  --local-dir ~/models/nemotron3-gguf
+```
+
+This downloads approximately 38GB. The download can be resumed if interrupted.
+
+## Step 5. Start the llama.cpp server
+
+From the `build` directory created in Step 3, launch the inference server with the Nemotron model. The server provides an OpenAI-compatible API endpoint.
+
+```bash
+./bin/llama-server \
+  --model ~/models/nemotron3-gguf/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
+  --host 0.0.0.0 \
+  --port 30000 \
+  --n-gpu-layers 99 \
+  --ctx-size 8192 \
+  --threads 8
+```
+
+**Parameter explanation:**
+- `--host 0.0.0.0`: Listen on all network interfaces
+- `--port 30000`: API server port
+- `--n-gpu-layers 99`: Offload all layers to GPU
+- `--ctx-size 8192`: Context window size (can increase up to 1M)
+- `--threads 8`: CPU threads for non-GPU operations
+
+You should see server startup messages indicating the model is loaded and ready:
+```
+llama_new_context_with_model: n_ctx = 8192
+...
+main: server is listening on 0.0.0.0:30000
+```
+
+## Step 6. Test the API
+
+Open a new terminal and test the inference server using the OpenAI-compatible chat completions endpoint.
+
+```bash
+curl http://localhost:30000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "nemotron",
+    "messages": [{"role": "user", "content": "New York is a great city because..."}],
+    "max_tokens": 100
+  }'
+```
+
+Expected response format:
+```json
+{
+  "choices": [
+    {
+      "finish_reason": "length",
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "reasoning_content": "We need to respond to user statement: \"New York is a great city because...\". Probably they want continuation, maybe a discussion. It's a simple open-ended prompt. Provide reasons why New York is great. No policy issues. Just respond creatively.",
+        "content": "New York is a great city because it's a living, breathing collage of cultures, ideas, and possibilities—all stacked into one vibrant, never‑sleeping metropolis. Here are just a few reasons that many people ("
+      }
+    }
+  ],
+  "created": 1765916539,
+  "model": "Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf",
+  "object": "chat.completion",
+  "usage": {
+    "completion_tokens": 100,
+    "prompt_tokens": 25,
+    "total_tokens": 125
+  },
+  "id": "chatcmpl-...",
+  "timings": {
+    ...
+  }
+}
+```
+
+## Step 7. Test reasoning capabilities
+
+Nemotron-3-Nano includes built-in reasoning capabilities. Test with a more complex prompt:
+
+```bash
+curl http://localhost:30000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "nemotron",
+    "messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
+    "max_tokens": 500
+  }'
+```
+
+The response carries the model's reasoning chain in the `reasoning_content` field, followed by the final answer in `content`.
+
+## Step 8. Cleanup
+
+To stop the server, press `Ctrl+C` in the terminal where it's running.
+
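+To completely remove the installation, delete the cloned repository and the downloaded model (adjust the first path if you did not clone llama.cpp into your home directory):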
+
+```bash
+# Remove the llama.cpp clone and its build directory
+rm -rf ~/llama.cpp
+
+# Remove the downloaded model files
+rm -rf ~/models/nemotron3-gguf
+```
+
+## Step 9. Next steps
+
+1. **Increase context size**: For longer conversations, increase `--ctx-size` up to 1048576 (1M tokens). Larger context windows use more memory.
+2. **Integrate with applications**: Use the OpenAI-compatible API with tools like Open WebUI, Continue.dev, or custom applications
+
+The server exposes OpenAI-compatible endpoints that support streaming responses, function calling, and multi-turn conversations.
+
+## Troubleshooting
+
+| Symptom | Cause | Fix |
+|---------|-------|-----|
+| `cmake` fails with "CUDA not found" | CUDA toolkit not in PATH | Run `export PATH=/usr/local/cuda/bin:$PATH` and retry |
+| Model download fails or is interrupted | Network issues | Re-run the `hf download` command; it resumes from where it stopped |
+| "CUDA out of memory" when starting server | Insufficient GPU memory | Reduce `--ctx-size` to 4096 or use a smaller quantization (Q4_K_M) |
+| Server starts but inference is slow | Model not fully offloaded to GPU | Verify that `--n-gpu-layers 99` is set and that `nvidia-smi` shows GPU usage |
+| "Connection refused" on port 30000 | Server not running or wrong port | Verify server is running and check the `--port` parameter |
+| "model not found" in API response | Wrong model path | Verify the model path in `--model` parameter matches the downloaded file location |
+
+
+> [!NOTE] 
+> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. 
+> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within 
+> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
+```bash
+sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
+```
+
+For the latest known issues, review the [DGX Spark User Guide](https://docs.nvidia.com/dgx/dgx-spark/known-issues.html).
diff --git a/nvidia/pytorch-fine-tune/README.md b/nvidia/pytorch-fine-tune/README.md
index 1ca65d7..d414426 100644
--- a/nvidia/pytorch-fine-tune/README.md
+++ b/nvidia/pytorch-fine-tune/README.md
@@ -269,7 +269,7 @@ For multi-node runs, we provide 2 configuration files:
 
 These configuration files need to be adapted:
 - Set `machine_rank` on each of your nodes according to its rank. Your master node should have a rank `0`. The second node has a rank `1`.
-- Set the correct IP address of your master node. Use `ifconfig` to find the correct value for your CX-7 IP address.
+- Set `main_process_ip` to the IP address of your master node, and ensure that both configuration files use the same value. Run `ifconfig` on the master node to find its CX-7 IP address.
 - Set a port number that can be used on your main node.
 
 The fields that need to be filled in your YAML files:
@@ -280,6 +280,8 @@ main_process_ip: < TODO: specify IP >
 main_process_port: < TODO: specify port >
 ```
 
+All the scripts and configuration files are available in this [**repository**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune/assets).
+
 ### Step 10. Run finetuning scripts
 
 Once you successfully run the previous steps, you can use one of the `run-multi-llama_*` scripts for finetuning. Here is an example for Llama3 70B using LoRa for finetuning and FSDP2.
@@ -295,6 +297,8 @@ docker exec \
   accelerate launch --config_file=/workspace/configs/config_fsdp_lora.yaml /workspace/Llama3_70B_LoRA_finetuning.py'
 ```
 
+During the run, the finetuning progress bar appears only on your main node's stdout. This is expected behavior: `accelerate` wraps `tqdm` so that progress is displayed on the main process only, as explained [**here**](https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/tqdm.py#L25). Running `nvidia-smi` on the worker node should show that its GPU is in use.
+
 ### Step 14. Cleanup and rollback
 
 Stop and remove containers by using the following command on the leader node:
diff --git a/nvidia/pytorch-fine-tune/assets/docker-compose.yml b/nvidia/pytorch-fine-tune/assets/docker-compose.yml
index dcdf116..460c444 100644
--- a/nvidia/pytorch-fine-tune/assets/docker-compose.yml
+++ b/nvidia/pytorch-fine-tune/assets/docker-compose.yml
@@ -2,7 +2,7 @@ version: '3.8'
 
 services:
   finetunine:
-    image: nvcr.io/nvidia/pytorch:25.10-py3
+    image: nvcr.io/nvidia/pytorch:25.11-py3
     deploy:
       replicas: 2
       restart_policy: