diff --git a/README.md b/README.md
index 30b0bfc..d515422 100644
--- a/README.md
+++ b/README.md
@@ -33,6 +33,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
 - [Multi-modal Inference](nvidia/multi-modal-inference/)
 - [NCCL for Two Sparks](nvidia/nccl/)
 - [Fine-tune with NeMo](nvidia/nemo-fine-tune/)
+- [Nemotron-3-Nano with llama.cpp](nvidia/nemotron/)
 - [NIM on Spark](nvidia/nim-llm/)
 - [NVFP4 Quantization](nvidia/nvfp4-quantization/)
 - [Ollama](nvidia/ollama/)
diff --git a/nvidia/nemotron/README.md b/nvidia/nemotron/README.md
new file mode 100644
index 0000000..2db18d3
--- /dev/null
+++ b/nvidia/nemotron/README.md
@@ -0,0 +1,250 @@
+# Nemotron-3-Nano with llama.cpp
+
+> Run the Nemotron-3-Nano-30B model using llama.cpp on DGX Spark
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Instructions](#instructions)
+- [Troubleshooting](#troubleshooting)
+
+---
+
+## Overview
+
+## Basic idea
+
+Nemotron-3-Nano-30B-A3B is an NVIDIA language model with a 30-billion-parameter Mixture of Experts (MoE) architecture in which only about 3 billion parameters are active at a time. Because only a fraction of the weights is used for each token, inference needs less compute than a dense model of the same size, which makes it a good fit for DGX Spark's GB10 GPU.
+
+This playbook demonstrates how to run Nemotron-3-Nano using llama.cpp, which compiles CUDA kernels at build time specifically for your GPU architecture. The model includes built-in reasoning (thinking mode) and tool calling support via the chat template.
+
+## What you'll accomplish
+
+You will have a fully functional Nemotron-3-Nano-30B-A3B inference server running on your DGX Spark, accessible via an OpenAI-compatible API. This setup enables:
+
+- Local LLM inference
+- OpenAI-compatible API endpoint for easy integration with existing tools
+- Built-in reasoning and tool calling capabilities
+
+## What to know before starting
+
+- Basic familiarity with Linux command line and terminal commands
+- Understanding of git and working with branches
+- Experience building software from source with CMake
+- Basic knowledge of REST APIs and cURL for testing
+- Familiarity with Hugging Face Hub for model downloads
+
+## Prerequisites
+
+**Hardware Requirements:**
+- NVIDIA DGX Spark with GB10 GPU
+- At least 40GB available GPU memory (model uses ~38GB VRAM)
+- At least 50GB available storage space for model downloads and build artifacts
+
+**Software Requirements:**
+- NVIDIA DGX OS
+- Git: `git --version`
+- CMake (3.14+): `cmake --version`
+- CUDA Toolkit: `nvcc --version`
+- Network access to GitHub and Hugging Face
+
+## Time & risk
+
+* **Estimated time:** 30 minutes (including model download of ~38GB)
+* **Risk level:** Low
+  * Build process compiles from source but doesn't modify system files
+  * Model downloads can be resumed if interrupted
+* **Rollback:** Delete the cloned llama.cpp directory and downloaded model files to fully remove the installation
+* **Last Updated:** 12/17/2025
+  * First Publication
+
+## Instructions
+
+## Step 1. Verify prerequisites
+
+Ensure you have the required tools installed on your DGX Spark before proceeding.
+
+```bash
+git --version
+cmake --version
+nvcc --version
+```
+
+All commands should return version information. If any are missing, install them before continuing.
+
+Install the Hugging Face CLI:
+
+```bash
+python3 -m venv nemotron-venv
+source nemotron-venv/bin/activate
+pip install -U "huggingface_hub[cli]"
+```
+
+Verify installation:
+
+```bash
+hf version
+```
+
+## Step 2. Clone llama.cpp repository
+
+Clone the llama.cpp repository, which provides the inference framework for running Nemotron models.
+
+```bash
+git clone https://github.com/ggml-org/llama.cpp
+cd llama.cpp
+```
+
+## Step 3. Build llama.cpp with CUDA support
+
+Build llama.cpp with CUDA enabled and targeting the GB10's sm_121 compute architecture. This compiles CUDA kernels specifically optimized for your DGX Spark GPU.
+
+```bash
+mkdir build && cd build
+cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
+make -j8
+```
+
+The build process takes approximately 5-10 minutes. You should see compilation progress and eventually a successful build message.
+
+## Step 4. Download the Nemotron GGUF model
+
+Download the Q8 quantized GGUF model from Hugging Face. This model provides excellent quality while fitting within the GB10's memory capacity.
+
+```bash
+hf download unsloth/Nemotron-3-Nano-30B-A3B-GGUF \
+  Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
+  --local-dir ~/models/nemotron3-gguf
+```
+
+This downloads approximately 38GB. The download can be resumed if interrupted.
+
+## Step 5. Start the llama.cpp server
+
+From the `build` directory, launch the inference server with the Nemotron model. The server provides an OpenAI-compatible API endpoint.
+
+```bash
+./bin/llama-server \
+  --model ~/models/nemotron3-gguf/Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf \
+  --host 0.0.0.0 \
+  --port 30000 \
+  --n-gpu-layers 99 \
+  --ctx-size 8192 \
+  --threads 8
+```
+
+**Parameter explanation:**
+- `--host 0.0.0.0`: Listen on all network interfaces
+- `--port 30000`: API server port
+- `--n-gpu-layers 99`: Offload all layers to GPU
+- `--ctx-size 8192`: Context window size in tokens (can be increased up to 1M)
+- `--threads 8`: CPU threads for non-GPU operations
+
+You should see server startup messages indicating the model is loaded and ready:
+```
+llama_new_context_with_model: n_ctx = 8192
+...
+main: server is listening on 0.0.0.0:30000
+```
+
+## Step 6. Test the API
+
+Open a new terminal and test the inference server using the OpenAI-compatible chat completions endpoint.
+
+```bash
+curl http://localhost:30000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "nemotron",
+    "messages": [{"role": "user", "content": "New York is a great city because..."}],
+    "max_tokens": 100
+  }'
+```
+
+Expected response format:
+```json
+{
+  "choices": [
+    {
+      "finish_reason": "length",
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "reasoning_content": "We need to respond to user statement: \"New York is a great city because...\". Probably they want continuation, maybe a discussion. It's a simple open-ended prompt. Provide reasons why New York is great. No policy issues. Just respond creatively.",
+        "content": "New York is a great city because it's a living, breathing collage of cultures, ideas, and possibilities—all stacked into one vibrant, never‑sleeping metropolis. Here are just a few reasons that many people ("
+      }
+    }
+  ],
+  "created": 1765916539,
+  "model": "Nemotron-3-Nano-30B-A3B-UD-Q8_K_XL.gguf",
+  "object": "chat.completion",
+  "usage": {
+    "completion_tokens": 100,
+    "prompt_tokens": 25,
+    "total_tokens": 125
+  },
+  "id": "chatcmpl-...",
+  "timings": {
+    ...
+  }
+}
+```
+
+## Step 7. Test reasoning capabilities
+
+Nemotron-3-Nano includes built-in reasoning capabilities. Test with a more complex prompt:
+
+```bash
+curl http://localhost:30000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "nemotron",
+    "messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
+    "max_tokens": 500
+  }'
+```
+
+The model will provide a detailed reasoning chain before giving the final answer.
+
+## Step 8. Cleanup
+
+To stop the server, press `Ctrl+C` in the terminal where it's running.
+
+To completely remove the installation (adjust the first path if you cloned llama.cpp somewhere other than your home directory):
+
+```bash
+## Remove llama.cpp build
+rm -rf ~/llama.cpp
+
+## Remove downloaded models
+rm -rf ~/models/nemotron3-gguf
+```
+
+## Step 9. Next steps
+
+1. **Increase context size**: For longer conversations, increase `--ctx-size` up to 1048576 (1M tokens), though this will use more memory
+2. **Integrate with applications**: Use the OpenAI-compatible API with tools like Open WebUI, Continue.dev, or custom applications
+
+The server exposes OpenAI-compatible endpoints with support for streaming responses, function calling, and multi-turn conversations.
+
+## Troubleshooting
+
+| Symptom | Cause | Fix |
+|---------|-------|-----|
+| `cmake` fails with "CUDA not found" | CUDA toolkit not in PATH | Run `export PATH=/usr/local/cuda/bin:$PATH` and retry |
+| Model download fails or is interrupted | Network issues | Re-run the `hf download` command; it resumes from where it stopped |
+| "CUDA out of memory" when starting server | Insufficient GPU memory | Reduce `--ctx-size` to 4096 or use a smaller quantization (Q4_K_M) |
+| Server starts but inference is slow | Model not fully loaded to GPU | Verify `--n-gpu-layers 99` is set and check that `nvidia-smi` shows GPU usage |
+| "Connection refused" on port 30000 | Server not running or wrong port | Verify the server is running and check the `--port` parameter |
+| "model not found" in API response | Wrong model path | Verify that the path in the `--model` parameter matches the downloaded file location |
+
+
+> [!NOTE]
+> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
+> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
+> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
+```bash
+sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
+```
+
+For the latest known issues, please review the [DGX Spark User Guide](https://docs.nvidia.com/dgx/dgx-spark/known-issues.html).
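Review note on the new `nvidia/nemotron/README.md`: Steps 6 and 7 exercise the server with cURL only. For readers who want to call it from a script, a minimal Python sketch of the same request is shown here. It assumes the server from Step 5 is listening on `localhost:30000` and returns the response shape shown in Step 6. The helper names `build_chat_request`, `split_reasoning`, and `ask` are hypothetical and are not part of llama.cpp or the playbook.

```python
import json
import urllib.request

# Endpoint taken from the curl examples in Steps 6 and 7 of the playbook.
API_URL = "http://localhost:30000/v1/chat/completions"


def build_chat_request(prompt, max_tokens=100, model="nemotron"):
    """Build the JSON body for a single-turn chat completion (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def split_reasoning(response):
    """Return (reasoning, answer) from a chat.completion response dict.

    Assumes the response shape shown in Step 6; missing fields become "".
    """
    message = response["choices"][0]["message"]
    return message.get("reasoning_content") or "", message.get("content") or ""


def ask(prompt, max_tokens=100, url=API_URL, timeout=120):
    """POST the prompt to a running llama-server and return (reasoning, answer)."""
    body = json.dumps(build_chat_request(prompt, max_tokens)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return split_reasoning(json.load(reply))
```

With the server running, `reasoning, answer = ask("Solve this step by step: ...", max_tokens=500)` mirrors the Step 7 request. Keeping the reasoning chain separate from the final answer lets an application log one and display the other.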
diff --git a/nvidia/pytorch-fine-tune/README.md b/nvidia/pytorch-fine-tune/README.md
index 1ca65d7..d414426 100644
--- a/nvidia/pytorch-fine-tune/README.md
+++ b/nvidia/pytorch-fine-tune/README.md
@@ -269,7 +269,7 @@ For multi-node runs, we provide 2 configuration files:
 
 These configuration files need to be adapted:
 - Set `machine_rank` on each of your nodes according to its rank. Your master node should have a rank `0`. The second node has a rank `1`.
-- Set the correct IP address of your master node. Use `ifconfig` to find the correct value for your CX-7 IP address.
+- Set `main_process_ip` to the IP address of your master node, and make sure both configuration files use the same value. Run `ifconfig` on the master node to find its CX-7 IP address.
 - Set a port number that can be used on your main node.
 
 The fields that need to be filled in your YAML files:
@@ -280,6 +280,8 @@ main_process_ip: < TODO: specify IP >
 main_process_port: < TODO: specify port >
 ```
 
+All the scripts and configuration files are available in this [**repository**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune/assets).
+
 ### Step 10. Run finetuning scripts
 
 Once you successfully run the previous steps, you can use one of the `run-multi-llama_*` scripts for finetuning. Here is an example for Llama3 70B using LoRa for finetuning and FSDP2.
@@ -295,6 +297,8 @@ docker exec \
   accelerate launch --config_file=/workspace/configs/config_fsdp_lora.yaml /workspace/Llama3_70B_LoRA_finetuning.py'
 ```
 
+During the run, the finetuning progress bar appears only on your main node's stdout. This is expected: `accelerate` uses a wrapper around `tqdm` that displays progress on the main process only, as explained [**here**](https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/tqdm.py#L25). To confirm that the worker node is participating, run `nvidia-smi` on it and check that the GPU is in use.
+
 ### Step 14. Cleanup and rollback
 
 Stop and remove containers by using the following command on the leader node:
diff --git a/nvidia/pytorch-fine-tune/assets/docker-compose.yml b/nvidia/pytorch-fine-tune/assets/docker-compose.yml
index dcdf116..460c444 100644
--- a/nvidia/pytorch-fine-tune/assets/docker-compose.yml
+++ b/nvidia/pytorch-fine-tune/assets/docker-compose.yml
@@ -2,7 +2,7 @@ version: '3.8'
 
 services:
   finetunine:
-    image: nvcr.io/nvidia/pytorch:25.10-py3
+    image: nvcr.io/nvidia/pytorch:25.11-py3
     deploy:
       replicas: 2
       restart_policy: