Mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git (synced 2026-04-29 05:03:52 +00:00)

chore: Regenerate all playbooks

commit 5a9d5d1f2a (parent 90fe8c7cae)
@ -1,6 +1,6 @@
|
||||
# Run models with llama.cpp on DGX Spark
|
||||
|
||||
> Build llama.cpp with CUDA and serve models via an OpenAI-compatible API (Qwen3.6 as example)
|
||||
> Build llama.cpp with CUDA and serve models via an OpenAI-compatible API (Nemotron 3 Nano Omni as example)
|
||||
|
||||
|
||||
## Table of Contents
|
||||
@ -17,15 +17,15 @@
|
||||
|
||||
[llama.cpp](https://github.com/ggml-org/llama.cpp) is a lightweight C/C++ inference stack for large language models. You build it with CUDA so tensor work runs on the DGX Spark GB10 GPU, then load GGUF weights and expose chat through `llama-server`’s OpenAI-compatible HTTP API.
|
||||
|
||||
This playbook walks through that stack end to end using **Qwen3.6** as the hands-on example: a current-generation family that runs well from quantized GGUF on Spark. Checkpoint choices and paths for all supported models are summarized in the matrix below; commands are in the instructions.
|
||||
This playbook walks through that stack end to end using **Nemotron 3 Nano Omni** as the hands-on example: an NVIDIA MoE family that runs well from quantized GGUF on Spark. Checkpoint choices and paths for all supported models are summarized in the matrix below; commands are in the instructions.
|
||||
|
||||
## What you'll accomplish
|
||||
|
||||
You will build llama.cpp with CUDA for GB10, download a **Qwen3.6** example checkpoint, and run **`llama-server`** with GPU offload. You get:
|
||||
You will build llama.cpp with CUDA for GB10, download a **Nemotron 3 Nano Omni** example checkpoint, and run **`llama-server`** with GPU offload. You get:
|
||||
|
||||
- Local inference through llama.cpp (no separate Python inference framework required)
|
||||
- An OpenAI-compatible `/v1/chat/completions` endpoint for tools and apps
|
||||
- A concrete validation that the **Qwen3.6** example runs on this stack on DGX Spark
|
||||
- A concrete validation that the **Nemotron 3 Nano Omni** example runs on this stack on DGX Spark
|
||||
|
||||
## What to know before starting
|
||||
|
||||
@ -39,8 +39,8 @@ You will build llama.cpp with CUDA for GB10, download a **Qwen3.6** example chec
|
||||
**Hardware requirements**
|
||||
|
||||
- NVIDIA DGX Spark with GB10 GPU
|
||||
- Sufficient unified memory for the example **UD-Q4_K_M** MoE checkpoint (weights on the order of **~20GB**, plus KV cache and runtime overhead—scale up if you pick a larger quant or longer context)
|
||||
- At least **~30GB** free disk for the example download plus build artifacts (more if you keep multiple GGUFs)
|
||||
- Sufficient unified memory for the example **Q8_0** checkpoint (weights on the order of **~35GB**, plus KV cache and runtime overhead—scale up if you pick a larger quant or longer context)
|
||||
- At least **~40GB** free disk for the example download plus build artifacts (more if you keep multiple GGUFs)
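A quick way to confirm you have headroom before downloading (unified memory and free disk on the home filesystem):

```bash
free -h     # available unified memory
df -h ~     # free disk space where the model will be stored
```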
|
||||
|
||||
**Software requirements**
|
||||
|
||||
@ -52,12 +52,13 @@ You will build llama.cpp with CUDA for GB10, download a **Qwen3.6** example chec
|
||||
|
||||
## Model support matrix
|
||||
|
||||
The following models are supported with llama.cpp on Spark. The instructions use the **Qwen3.6** example row by default.
|
||||
The following models are supported with llama.cpp on Spark. The instructions use the **Nemotron 3 Nano Omni** example row by default.
|
||||
|
||||
| Model | Support Status | HF Handle |
|
||||
|-------|----------------|-----------|
|
||||
| **Qwen3.6-35B-A3B** (example walkthrough) | ✅ | `unsloth/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf` |
|
||||
| **Qwen3.6-27B** | ✅ | `unsloth/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q4_K_M.gguf` |
|
||||
| **Nemotron 3 Nano Omni** (example walkthrough) | ✅ | `ggml-org/NVIDIA-Nemotron-3-Nano-Omni` |
|
||||
| **Qwen3.6-35B-A3B** | ✅ | `unsloth/Qwen3.6-35B-A3B-GGUF` |
|
||||
| **Qwen3.6-27B** | ✅ | `unsloth/Qwen3.6-27B-GGUF` |
|
||||
| **Gemma 4 31B IT** | ✅ | `ggml-org/gemma-4-31B-it-GGUF` |
|
||||
| **Gemma 4 26B A4B IT** | ✅ | `ggml-org/gemma-4-26B-A4B-it-GGUF` |
|
||||
| **Gemma 4 E4B IT** | ✅ | `ggml-org/gemma-4-E4B-it-GGUF` |
|
||||
@ -66,17 +67,17 @@ The following models are supported with llama.cpp on Spark. The instructions use
|
||||
|
||||
## Time & risk
|
||||
|
||||
* **Estimated time:** About 30 minutes, plus downloading the example GGUF (~20GB order of magnitude for the default quant)
|
||||
* **Estimated time:** About 30 minutes, plus downloading the example GGUF (~35GB order of magnitude for the default quant)
|
||||
* **Risk level:** Low — build is local to your clone; no system-wide installs required for the steps below
|
||||
* **Rollback:** Remove the `llama.cpp` clone and the model directory under `~/models/` to reclaim disk space
|
||||
* **Last updated:** 04/27/2026
|
||||
* We now walk you through Qwen3.6 first; other models remain in the list
|
||||
* **Last updated:** 04/28/2026
|
||||
* Walkthrough now uses Nemotron Omni; other model rows stay available
|
||||
|
||||
## Instructions
|
||||
|
||||
## Step 1. Verify prerequisites
|
||||
|
||||
The **example** checkpoint is **`Qwen3.6-35B-A3B-UD-Q4_K_M.gguf`** from Hugging Face repo **`unsloth/Qwen3.6-35B-A3B-GGUF`** (full handle: `unsloth/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf`). The other supported file is **`Qwen3.6-27B-Q4_K_M.gguf`** from **`unsloth/Qwen3.6-27B-GGUF`**—use the same build and server steps, changing `hf download` and `--model` paths (see the [overview model matrix](overview.md)).
|
||||
The **example** checkpoint is **`nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf`** from Hugging Face repo **`ggml-org/NVIDIA-Nemotron-3-Nano-Omni`** (full handle: `ggml-org/NVIDIA-Nemotron-3-Nano-Omni/nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf`). Other supported GGUFs—including Qwen3.6, Gemma, and alternate Nemotron Omni builds—use the same build and server steps; change `hf download` and `--model` paths (see the [overview model matrix](overview.md)).
|
||||
|
||||
Ensure the required tools are installed:
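At minimum, confirm the build toolchain and download tooling are on your PATH (a sketch; your environment may need more):

```bash
git --version        # to clone llama.cpp
cmake --version      # build system used below
nvcc --version       # CUDA toolkit for the GPU backend
command -v hf        # Hugging Face CLI used for `hf download`
```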
|
||||
|
||||
@ -123,25 +124,25 @@ make -j8
|
||||
|
||||
The build usually takes on the order of 5–10 minutes. When it finishes, binaries such as `llama-server` appear under `build/bin/`.
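A quick sanity check, run from the llama.cpp checkout root, that the server binary was produced:

```bash
ls -lh build/bin/ | grep llama-server
./build/bin/llama-server --help | head -n 5
```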
|
||||
|
||||
## Step 4. Download example Qwen3.6-35B-A3B GGUF
|
||||
## Step 4. Download example Nemotron 3 Nano Omni GGUF
|
||||
|
||||
llama.cpp loads models in **GGUF** format. This playbook uses the **UD-Q4_K_M** quantized MoE checkpoint from Unsloth, which fits comfortably on DGX Spark GB10 unified memory while keeping strong quality.
|
||||
llama.cpp loads models in **GGUF** format. This playbook uses the **Q8_0** checkpoint from `ggml-org/NVIDIA-Nemotron-3-Nano-Omni`, which balances quality and memory on DGX Spark GB10 unified memory.
|
||||
|
||||
```bash
|
||||
hf download unsloth/Qwen3.6-35B-A3B-GGUF \
|
||||
Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
|
||||
--local-dir ~/models/Qwen3.6-35B-A3B-GGUF
|
||||
hf download ggml-org/NVIDIA-Nemotron-3-Nano-Omni \
|
||||
nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf \
|
||||
--local-dir ~/models/NVIDIA-Nemotron-3-Nano-Omni
|
||||
```
|
||||
|
||||
The file is on the order of **~20GB** (exact size may vary). The download can be resumed if interrupted.
|
||||
The file is on the order of **~35GB** (exact size may vary). The download can be resumed if interrupted.
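To confirm the download completed, check the file size on disk:

```bash
ls -lh ~/models/NVIDIA-Nemotron-3-Nano-Omni/*.gguf
```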
|
||||
|
||||
## Step 5. Start llama-server with Qwen3.6-35B-A3B
|
||||
## Step 5. Start llama-server with Nemotron 3 Nano Omni
|
||||
|
||||
From your `llama.cpp/build` directory, launch the OpenAI-compatible server with GPU offload:
|
||||
|
||||
```bash
|
||||
./bin/llama-server \
|
||||
--model ~/models/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
|
||||
--model ~/models/NVIDIA-Nemotron-3-Nano-Omni/nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf \
|
||||
--host 0.0.0.0 \
|
||||
--port 30000 \
|
||||
--n-gpu-layers 99 \
|
||||
@ -174,7 +175,7 @@ Use a **second terminal on the same machine** that runs `llama-server` (for exam
|
||||
curl -X POST http://127.0.0.1:30000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "gemma4",
|
||||
"model": "nemotron",
|
||||
"messages": [{"role": "user", "content": "New York is a great city because..."}],
|
||||
"max_tokens": 100
|
||||
}'
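You can also hit the server's health and model-listing routes as lighter-weight checks (both are part of llama-server's HTTP API in recent builds):

```bash
curl http://127.0.0.1:30000/health      # returns a simple status payload
curl http://127.0.0.1:30000/v1/models   # lists the loaded GGUF
```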
|
||||
@ -197,7 +198,7 @@ Example shape of the response (fields vary by llama.cpp version; `message` may i
|
||||
}
|
||||
],
|
||||
"created": 1765916539,
|
||||
"model": "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf",
|
||||
"model": "nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf",
|
||||
"object": "chat.completion",
|
||||
"usage": {
|
||||
"completion_tokens": 100,
|
||||
@ -211,15 +212,15 @@ Example shape of the response (fields vary by llama.cpp version; `message` may i
|
||||
}
|
||||
```
|
||||
|
||||
## Step 7. Longer completion (with Qwen3.6)
|
||||
## Step 7. Longer completion (with Nemotron 3 Nano Omni)
|
||||
|
||||
Try a slightly longer prompt to confirm stable generation with **Qwen3.6-35B-A3B**:
|
||||
Try a slightly longer prompt to confirm stable generation with **Nemotron 3 Nano Omni**:
|
||||
|
||||
```bash
|
||||
curl -X POST http://127.0.0.1:30000/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "qwen3",
|
||||
"model": "nemotron",
|
||||
"messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
|
||||
"max_tokens": 500
|
||||
}'
|
||||
@ -233,7 +234,7 @@ To remove this tutorial’s artifacts:
|
||||
|
||||
```bash
|
||||
rm -rf ~/llama.cpp
|
||||
rm -rf ~/models/Qwen3.6-35B-A3B-GGUF
|
||||
rm -rf ~/models/NVIDIA-Nemotron-3-Nano-Omni
|
||||
```
|
||||
|
||||
Deactivate the Python venv if you no longer need `hf`:
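In the shell where the venv is active, the standard command is:

```bash
deactivate
```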
|
||||
|
||||
@ -27,7 +27,7 @@ This playbook shows you how to deploy LM Studio on an NVIDIA DGX Spark device to
|
||||
|
||||
## What you'll accomplish
|
||||
|
||||
You'll deploy LM Studio on an NVIDIA DGX Spark device to run gpt-oss 120B, and use the model from your laptop. More specifically, you will:
|
||||
You'll deploy LM Studio on an NVIDIA DGX Spark device to run **Nemotron 3 Nano Omni** (`nvidia/nemotron-3-nano-omni`), and use the model from your laptop. More specifically, you will:
|
||||
|
||||
- Install **llmster**, a fully headless, terminal-native LM Studio, on the Spark
|
||||
- Run LLM inference locally on DGX Spark via API
|
||||
@ -55,7 +55,13 @@ You'll deploy LM Studio on an NVIDIA DGX Spark device to run gpt-oss 120B, and u
|
||||
- Network access to download packages and models
|
||||
|
||||
## Model support matrix
|
||||
To explore supported models in LM Studio, check out [LM Studio model catalog](https://lmstudio.ai/models) page.
|
||||
To explore all supported models in LM Studio, check out the [LM Studio model catalog](https://lmstudio.ai/models) page.
|
||||
|
||||
| Model | Support Status | Model Path |
|
||||
|-------|----------------|-----------|
|
||||
| **Nemotron 3 Nano Omni** | ✅ | `nvidia/nemotron-3-nano-omni` |
|
||||
| **Qwen3.6-35B-A3B** | ✅ | `qwen/qwen3.6-35b-a3b` |
|
||||
| **GPT-OSS-120B** | ✅ | `openai/gpt-oss-120b` |
|
||||
|
||||
## LM Link (optional)
|
||||
|
||||
@ -69,7 +75,7 @@ If you use LM Link, you can skip binding the server to `0.0.0.0` and using the S
|
||||
|
||||
## Ancillary files
|
||||
|
||||
All required assets can be found below. These sample scripts can be used in Step 6 of Instructions.
|
||||
All required assets can be found below. These sample scripts can be used in Step 7 of Instructions.
|
||||
|
||||
- [run.js](https://github.com/lmstudio-ai/docs/blob/main/_assets/nvidia-spark-playbook/js/run.js) - JavaScript script for sending a test prompt to Spark
|
||||
- [run.py](https://github.com/lmstudio-ai/docs/blob/main/_assets/nvidia-spark-playbook/py/run.py) - Python script for sending a test prompt to Spark
|
||||
@ -83,8 +89,8 @@ All required assets can be found below. These sample scripts can be used in Step
|
||||
* **Rollback:**
|
||||
* Downloaded models can be removed manually from the models directory.
|
||||
* Uninstall LM Studio or llmster
|
||||
* **Last Updated:** 04/27/2026
|
||||
* Introduce Qwen3.6 35B as example
|
||||
* **Last Updated:** 04/28/2026
|
||||
* Introduce Nemotron Omni as example
|
||||
|
||||
## Instructions
|
||||
|
||||
@ -141,22 +147,22 @@ where `<SPARK_IP>` is your device's IP address. You can find your Spark’s IP a
|
||||
hostname -I
|
||||
```
|
||||
|
||||
## Step 3b. (Optional) Connect with LM Link
|
||||
## Step 4. (Optional) Connect with LM Link
|
||||
|
||||
**LM Link** lets you use your Spark’s models from your laptop (or other devices) as if they were local, over an end-to-end encrypted connection. You don’t need to be on the same local network or bind the server to `0.0.0.0`.
|
||||
|
||||
1. **Create a Link** — Go to [lmstudio.ai/link](https://lmstudio.ai/link) and follow **Create your Link** to set up your private LM Link network.
|
||||
2. **Link both devices** — On your DGX Spark (llmster) and on your laptop, sign in and join the same Link. LM Link uses Tailscale mesh VPNs; devices communicate without opening ports to the internet.
|
||||
3. **Use remote models** — On your laptop, open LM Studio (or use the local server). Remote models from your Spark appear in the model loader. Any tool that connects to `localhost:1234` — including the LM Studio SDK, Codex, Claude Code, OpenCode, and the scripts in Step 6 — can use those models without changing the endpoint.
|
||||
3. **Use remote models** — On your laptop, open LM Studio (or use the local server). Remote models from your Spark appear in the model loader. Any tool that connects to `localhost:1234` — including the LM Studio SDK, Codex, Claude Code, OpenCode, and the scripts in Step 7 — can use those models without changing the endpoint.
|
||||
|
||||
LM Link is in **Preview** and is free for up to 2 users, 5 devices each. For details and limits, see [LM Link](https://lmstudio.ai/link).
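As a quick check after linking, the laptop's local endpoint should list the Spark's models (assuming LM Studio's server is on its default port 1234):

```bash
curl http://localhost:1234/v1/models
```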
|
||||
|
||||
## Step 4. Download a model to your Spark
|
||||
## Step 5. Download a model to your Spark
|
||||
|
||||
As an example, let's download and run gpt-oss 120B, one of the best open source models from OpenAI. This model is too large for many laptops due to memory limitations, which makes this a fantastic use case for the Spark.
|
||||
As an example, download **NVIDIA Nemotron 3 Nano Omni** from the LM Studio catalog (`nvidia/nemotron-3-nano-omni`) so you can run it on Spark with plenty of unified memory.
|
||||
|
||||
```bash
|
||||
lms get qwen/qwen3.6-35b-a3b
|
||||
lms get nvidia/nemotron-3-nano-omni
|
||||
```
|
||||
|
||||
This download will take a while due to its large size. Verify that the model has been successfully downloaded by listing your models:
|
||||
@ -165,15 +171,15 @@ This download will take a while due to its large size. Verify that the model has
|
||||
lms ls
|
||||
```
|
||||
|
||||
## Step 5. Load the model
|
||||
## Step 6. Load the model
|
||||
|
||||
Load the model on your Spark so that it is ready to respond to requests from your laptop.
|
||||
|
||||
```bash
|
||||
lms load qwen/qwen3.6-35b-a3b
|
||||
lms load nvidia/nemotron-3-nano-omni
|
||||
```
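To confirm the model is loaded and ready to serve requests, list loaded models with `lms ps`:

```bash
lms ps
```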
|
||||
|
||||
## Step 6. Set up a simple program that uses LM Studio SDK on the laptop
|
||||
## Step 7. Set up a simple program that uses LM Studio SDK on the laptop
|
||||
|
||||
Install the LM Studio SDKs and use a simple script to send a prompt to your Spark and validate the response. To get started quickly, we provide simple scripts below for Python, JavaScript, and Bash. Download the scripts from the Overview page of this playbook and run the corresponding command from the directory containing it.
|
||||
|
||||
@ -205,12 +211,12 @@ Pre-reqs: User has installed `jq` and `curl`
|
||||
bash run.sh
|
||||
```
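If you want an end-to-end check without the sample scripts, a minimal `curl` + `jq` equivalent looks like this (assumes the server from Step 3 is reachable at `<SPARK_IP>:1234` and the model from Step 6 is loaded; adjust the address if you use LM Link):

```bash
curl -s http://<SPARK_IP>:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-nano-omni",
    "messages": [{"role": "user", "content": "Say hello from DGX Spark."}],
    "max_tokens": 50
  }' | jq -r '.choices[0].message.content'
```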
|
||||
|
||||
## Step 7. Next Steps
|
||||
## Step 8. Next Steps
|
||||
|
||||
- Try downloading and serving different models from the [LM Studio model catalog](https://lmstudio.ai/models).
|
||||
- Use [LM Link](https://lmstudio.ai/link) to connect more devices and use your Spark’s models from anywhere with end-to-end encryption.
|
||||
|
||||
## Step 8. Cleanup and rollback
|
||||
## Step 9. Cleanup and rollback
|
||||
Remove and uninstall LM Studio completely if needed. Note that LM Studio stores models separately from the application. Uninstalling LM Studio will not remove downloaded models unless you explicitly delete them.
|
||||
|
||||
If you want to remove the entire LM Studio application, quit LM Studio from the tray first, then move the application to trash.
|
||||
|
||||
@ -26,7 +26,7 @@
|
||||
- [Step 7. Interactive TUI](#step-7-interactive-tui)
|
||||
- [Step 8. Exit the sandbox and access the Web UI](#step-8-exit-the-sandbox-and-access-the-web-ui)
|
||||
- [Step 9. Create a Telegram bot](#step-9-create-a-telegram-bot)
|
||||
- [Step 10. Configure and start the Telegram bridge](#step-10-configure-and-start-the-telegram-bridge)
|
||||
- [Step 10. Install cloudflared and start the Telegram bridge](#step-10-install-cloudflared-and-start-the-telegram-bridge)
|
||||
- [Step 11. Stop services](#step-11-stop-services)
|
||||
- [Step 12. Uninstall NemoClaw](#step-12-uninstall-nemoclaw)
|
||||
- [Troubleshooting](#troubleshooting)
|
||||
@ -97,8 +97,7 @@ By participating in this demo, you acknowledge that you are solely responsible f
|
||||
**Hardware and access:**
|
||||
|
||||
- A DGX Spark (GB10) with keyboard and monitor, or SSH access
|
||||
- An **NVIDIA API key** from [build.nvidia.com](https://build.nvidia.com/settings/api-keys) (needed for the Telegram bridge)
|
||||
- A **Telegram bot token** from [@BotFather](https://t.me/BotFather) (create one with `/newbot`)
|
||||
- A **Telegram bot token** from [@BotFather](https://t.me/BotFather) (create one with `/newbot`) -- only needed if you want the Telegram bot. Have it ready *before* running the installer; the onboard wizard prompts for it.
|
||||
|
||||
**Software:**
|
||||
|
||||
@ -118,8 +117,7 @@ Expected: Ubuntu 24.04, NVIDIA GB10 GPU, Docker 28.x+.
|
||||
|
||||
| Item | Where to get it |
|
||||
|------|----------------|
|
||||
| NVIDIA API key | [build.nvidia.com/settings/api-keys](https://build.nvidia.com/settings/api-keys) |
|
||||
| Telegram bot token | [@BotFather](https://t.me/BotFather) on Telegram -- create with `/newbot` |
|
||||
| Telegram bot token (optional) | [@BotFather](https://t.me/BotFather) on Telegram -- create with `/newbot`. Required only for the Telegram bot; have it ready before running the installer. |
|
||||
|
||||
### Ancillary files
|
||||
|
||||
@ -129,8 +127,8 @@ All required assets are handled by the NemoClaw installer. No manual cloning is
|
||||
|
||||
- **Estimated time:** 20--30 minutes (with Ollama and model already downloaded). First-time model download adds ~15--30 minutes depending on network speed.
|
||||
- **Risk level:** Medium -- you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts.
|
||||
- **Last Updated:** 03/31/2026
|
||||
* First Publication
|
||||
- **Last Updated:** 04/28/2026
|
||||
* Updated for NemoClaw v0.0.22+: revised Telegram setup, renamed tunnel commands, refreshed uninstall instructions.
|
||||
|
||||
## Instructions
|
||||
|
||||
@ -249,9 +247,13 @@ curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
|
||||
The onboard wizard walks you through setup:
|
||||
|
||||
1. **Sandbox name** -- Pick a name (e.g. `my-assistant`). Names must be lowercase alphanumeric with hyphens only.
|
||||
2. **Inference provider** -- Select **Local Ollama** (option 7).
|
||||
3. **Model** -- Select **nemotron-3-super:120b** (option 1).
|
||||
4. **Policy presets** -- Accept the suggested presets when prompted (hit **Y**).
|
||||
2. **Inference provider** -- Select **Local Ollama**.
|
||||
3. **Model** -- Select **nemotron-3-super:120b**.
|
||||
4. **Messaging channels** -- If you want a Telegram bot, select `telegram` here and paste your bot token when prompted. Create the bot first via [@BotFather](https://t.me/BotFather) in Telegram (see Step 9). If you skip this, you can re-run the installer later to recreate the sandbox with Telegram enabled.
|
||||
5. **Policy presets** -- Accept the suggested presets when prompted (hit **Y**).
|
||||
|
||||
> [!IMPORTANT]
|
||||
> Telegram must be configured at this step. The channel plugin and bot token are wired into the sandbox container during onboarding — they cannot be added to an existing sandbox by exporting environment variables on the host.
|
||||
|
||||
When complete you will see output like:
|
||||
|
||||
@ -362,26 +364,22 @@ http://127.0.0.1:18789/#token=<long-token-here>
|
||||
|
||||
## Phase 3: Telegram Bot
|
||||
|
||||
> [!NOTE]
|
||||
> If you already configured Telegram during the NemoClaw onboarding wizard (step 5/8), you can skip this phase. These steps cover adding Telegram after the initial setup.
|
||||
> [!IMPORTANT]
|
||||
> Telegram must be enabled in the **NemoClaw onboard wizard** (Step 4 → Messaging channels). The channel plugin and bot token are wired into the sandbox container at sandbox creation time — `policy-add` only opens network egress and is not enough on its own. If you skipped Telegram during onboard, re-run the installer to recreate the sandbox with Telegram enabled.
|
||||
|
||||
### Step 9. Create a Telegram bot
|
||||
|
||||
Open Telegram, find [@BotFather](https://t.me/BotFather), send `/newbot`, and follow the prompts. Copy the bot token it gives you.
|
||||
Do this **before** running the NemoClaw installer in Step 4 so you have your bot token ready when the wizard prompts for it.
|
||||
|
||||
### Step 10. Configure and start the Telegram bridge
|
||||
Open Telegram, find [@BotFather](https://t.me/BotFather), send `/newbot`, and follow the prompts. Copy the bot token it gives you and paste it into the wizard when you reach the **Messaging channels** step.
|
||||
|
||||
### Step 10. Install cloudflared and start the Telegram bridge
|
||||
|
||||
The Telegram bridge needs a public webhook URL so Telegram can deliver messages to your bot. NemoClaw uses [cloudflared](https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/) to create a free `trycloudflare.com` tunnel.
|
||||
|
||||
Make sure you are on the **host** (not inside the sandbox). If you are inside the sandbox, run `exit` first.
|
||||
|
||||
Add the Telegram network policy to the sandbox so it can reach the Telegram API:
|
||||
|
||||
```bash
|
||||
nemoclaw my-assistant policy-add
|
||||
```
|
||||
|
||||
When prompted, select `telegram` and hit **Y** to confirm.
|
||||
|
||||
The Telegram bridge uses cloudflared to expose a public webhook URL. Install cloudflared on the Spark host (arm64):
|
||||
Install cloudflared (DGX Spark is arm64):
|
||||
|
||||
```bash
|
||||
curl -L --output cloudflared.deb \
|
||||
@ -389,14 +387,13 @@ curl -L --output cloudflared.deb \
|
||||
sudo dpkg -i cloudflared.deb
|
||||
```
|
||||
|
||||
Set the bot token and start auxiliary services:
|
||||
Start the tunnel:
|
||||
|
||||
```bash
|
||||
export TELEGRAM_BOT_TOKEN=<your-bot-token>
|
||||
nemoclaw start
|
||||
nemoclaw tunnel start
|
||||
```
|
||||
|
||||
The Telegram bridge starts only when the `TELEGRAM_BOT_TOKEN` environment variable is set. Verify the services are running and note the public URL:
|
||||
Verify the public URL is live:
|
||||
|
||||
```bash
|
||||
nemoclaw status
|
||||
@ -407,16 +404,16 @@ You should see `● cloudflared` with a `trycloudflare.com` public URL (e.g. `ht
|
||||
Open Telegram, find your bot, and send it a message. The bot forwards it to the agent and replies.
|
||||
|
||||
> [!NOTE]
|
||||
> If `nemoclaw start` prints `cloudflared not found — no public URL`, the cloudflared install above did not complete successfully. Re-run the install, then restart services:
|
||||
> If `nemoclaw tunnel start` prints `cloudflared not found — no public URL`, the cloudflared install above did not complete successfully. Re-run the install, then restart the tunnel:
|
||||
> ```bash
|
||||
> nemoclaw stop && nemoclaw start
|
||||
> nemoclaw tunnel stop && nemoclaw tunnel start
|
||||
> ```
|
||||
|
||||
> [!NOTE]
|
||||
> The first response may take 30--90 seconds for a 120B parameter model running locally.
|
||||
|
||||
> [!NOTE]
|
||||
> If the bridge does not appear in `nemoclaw status`, make sure `TELEGRAM_BOT_TOKEN` is exported in the same shell session where you run `nemoclaw start`.
|
||||
> If sending a message returns `Error: Channel is unavailable: telegram`, the channel was not enabled during onboard. Re-run the installer to recreate the sandbox with Telegram selected at the **Messaging channels** step.
|
||||
|
||||
> [!NOTE]
|
||||
> For details on restricting which Telegram chats can interact with the agent, see the [NemoClaw Telegram bridge documentation](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html).
|
||||
@ -427,10 +424,10 @@ Open Telegram, find your bot, and send it a message. The bot forwards it to the
|
||||
|
||||
### Step 11. Stop services
|
||||
|
||||
Stop any running auxiliary services (Telegram bridge, cloudflared tunnel):
|
||||
Stop the cloudflared tunnel:
|
||||
|
||||
```bash
|
||||
nemoclaw stop
|
||||
nemoclaw tunnel stop
|
||||
```
|
||||
|
||||
Stop the port forward:
|
||||
@ -442,14 +439,13 @@ openshell forward stop 18789 # stop the dashboard forward
|
||||
|
||||
### Step 12. Uninstall NemoClaw
|
||||
|
||||
Run the uninstaller from the cloned source directory. It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and Ollama are preserved.
|
||||
Run the uninstaller via curl (matches the [NemoClaw README](https://github.com/NVIDIA/NemoClaw)). It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and Ollama are preserved.
|
||||
|
||||
```bash
|
||||
cd ~/.nemoclaw/source
|
||||
./uninstall.sh
|
||||
curl -fsSL https://raw.githubusercontent.com/NVIDIA/NemoClaw/refs/heads/main/uninstall.sh | bash
|
||||
```
|
||||
|
||||
**Uninstaller flags:**
|
||||
**Uninstaller flags** (pass via `bash -s -- <flags>`):
|
||||
|
||||
| Flag | Effect |
|
||||
|------|--------|
|
||||
@ -457,10 +453,10 @@ cd ~/.nemoclaw/source
|
||||
| `--keep-openshell` | Leave the `openshell` binary in place |
|
||||
| `--delete-models` | Also remove the Ollama models pulled by NemoClaw |
|
||||
|
||||
To remove everything including the Ollama model:
|
||||
To remove everything including the Ollama model, non-interactively:
|
||||
|
||||
```bash
|
||||
./uninstall.sh --yes --delete-models
|
||||
curl -fsSL https://raw.githubusercontent.com/NVIDIA/NemoClaw/refs/heads/main/uninstall.sh | bash -s -- --yes --delete-models
|
||||
```
|
||||
|
||||
The uninstaller runs 6 steps:
|
||||
@ -472,7 +468,7 @@ The uninstaller runs 6 steps:
|
||||
6. Remove state directories (`~/.nemoclaw`, `~/.config/openshell`, `~/.config/nemoclaw`) and the OpenShell binary
|
||||
|
||||
> [!NOTE]
|
||||
> The source clone at `~/.nemoclaw/source` is removed as part of state cleanup in step 6. If you want to keep a local copy, move or back it up before running the uninstaller.
|
||||
> If you have a local clone at `~/.nemoclaw/source` you want to keep, move or back it up before running the uninstaller — it is removed as part of state cleanup in step 6.
|
||||
|
||||
## Useful commands
|
||||
|
||||
@ -482,13 +478,13 @@ The uninstaller runs 6 steps:
|
||||
| `nemoclaw my-assistant status` | Show sandbox status and inference config |
|
||||
| `nemoclaw my-assistant logs --follow` | Stream sandbox logs in real time |
|
||||
| `nemoclaw list` | List all registered sandboxes |
|
||||
| `nemoclaw start` | Start auxiliary services (Telegram bridge, cloudflared) |
|
||||
| `nemoclaw stop` | Stop auxiliary services |
|
||||
| `nemoclaw tunnel start` | Start cloudflared tunnel (public URL for Telegram webhooks) |
|
||||
| `nemoclaw tunnel stop` | Stop the cloudflared tunnel |
|
||||
| `openshell term` | Open the monitoring TUI on the host |
|
||||
| `openshell forward list` | List active port forwards |
|
||||
| `openshell forward start 18789 my-assistant --background` | Restart port forwarding for Web UI |
|
||||
| `cd ~/.nemoclaw/source && ./uninstall.sh` | Remove NemoClaw (preserves Docker, Node.js, Ollama) |
|
||||
| `cd ~/.nemoclaw/source && ./uninstall.sh --delete-models` | Remove NemoClaw and Ollama models |
|
||||
| `curl -fsSL https://raw.githubusercontent.com/NVIDIA/NemoClaw/refs/heads/main/uninstall.sh \| bash` | Remove NemoClaw (preserves Docker, Node.js, Ollama) |
|
||||
| `curl -fsSL https://raw.githubusercontent.com/NVIDIA/NemoClaw/refs/heads/main/uninstall.sh \| bash -s -- --delete-models` | Remove NemoClaw and Ollama models |
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
|
||||
@ -53,6 +53,7 @@ The following models are supported with SGLang on Spark. All listed models are a
|
||||
|
||||
| Model | Quantization | Support Status | HF Handle |
|
||||
|-------|-------------|----------------|-----------|
|
||||
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | BF16 | ✅ | [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16) |
|
||||
| **GPT-OSS-20B** | MXFP4 | ✅ | `openai/gpt-oss-20b` |
|
||||
| **GPT-OSS-120B** | MXFP4 | ✅ | `openai/gpt-oss-120b` |
|
||||
| **Llama-3.1-8B-Instruct** | FP8 | ✅ | `nvidia/Llama-3.1-8B-Instruct-FP8` |
|
||||
@ -75,12 +76,19 @@ Note: for NVFP4 models, add the `--quantization modelopt_fp4` flag.
|
||||
* **Estimated time:** 30 minutes for initial setup and validation
|
||||
* **Risk level:** Low - Uses pre-built, validated SGLang container with minimal configuration
|
||||
* **Rollback:** Stop and remove containers with `docker stop` and `docker rm` commands
|
||||
* **Last Updated:** 03/15/2026
|
||||
* Use latest NGC SGLang container: nvcr.io/nvidia/sglang:26.02-py3
|
||||
* **Last Updated:** 04/28/2026
|
||||
* Introduce Nemotron-3-Nano-Omni reasoning BF16 support
|
||||
|
||||
## Instructions
|
||||
|
||||
## Step 1. Verify system prerequisites
|
||||
## Step 1. Use model specific deployment guide
|
||||
|
||||
Certain models require special deployment configurations. Refer to their respective model cards for instructions on running them on DGX Spark:
|
||||
| Model | Quantization | HF Model Card Link |
|
||||
|-------|-------------|----------------|
|
||||
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | BF16 | https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 |
|
||||
|
||||
## Step 2. Verify system prerequisites
|
||||
|
||||
Check that your NVIDIA Spark device meets all requirements before proceeding. This step runs on
|
||||
your host system and ensures Docker, GPU drivers, and container toolkit are properly configured.
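A quick pass over these prerequisites might look like the following (assumes the NVIDIA Container Toolkit is already installed):

```bash
nvidia-smi                     # driver loaded, GB10 GPU visible
docker --version               # Docker present
docker info | grep -i runtime  # nvidia runtime registered with Docker
```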
|
||||
@ -108,7 +116,7 @@ sudo usermod -aG docker $USER
|
||||
newgrp docker
|
||||
```
|
||||
|
||||
## Step 2. Pull the SGLang Container
|
||||
## Step 3. Pull the SGLang Container
|
||||
|
||||
Download the latest SGLang container. This step runs on the host and may take
|
||||
several minutes depending on your network connection.
|
||||
@ -122,7 +130,7 @@ docker pull nvcr.io/nvidia/sglang:26.02-py3
|
||||
docker images | grep sglang
|
||||
```
|
||||
|
||||
## Step 3. Launch SGLang container for server mode
|
||||
## Step 4. Launch SGLang container for server mode
|
||||
|
||||
Start the SGLang container in server mode to enable HTTP API access. This runs the inference
|
||||
server inside the container, exposing it on port 30000 for client connections.
|
||||
@ -136,7 +144,7 @@ docker run --gpus all -it --rm \
|
||||
bash
|
||||
```
|
||||
|
||||
## Step 4. Start the SGLang inference server
|
||||
## Step 5. Start the SGLang inference server
|
||||
|
||||
Inside the container, launch the HTTP inference server with a supported model. This step runs
|
||||
inside the Docker container and starts the SGLang server daemon.
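A representative launch has the following shape; this is a sketch only, so substitute a model handle from the support matrix for `<HF_MODEL_HANDLE>` and check the model card for any extra flags:

```bash
python3 -m sglang.launch_server \
  --model-path <HF_MODEL_HANDLE> \
  --host 0.0.0.0 \
  --port 30000
```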
|
||||
@ -159,7 +167,7 @@ sleep 30
|
||||
curl http://localhost:30000/health
|
||||
```
|
||||
|
||||
## Step 5. Test client-server inference
|
||||
## Step 6. Test client-server inference
|
||||
|
||||
From a new terminal on your host system, test the SGLang server API to ensure it's working
|
||||
correctly. This validates that the server is accepting requests and generating responses.
|
||||
@ -177,7 +185,7 @@ curl -X POST http://localhost:30000/generate \
|
||||
}'
|
||||
```
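SGLang also serves OpenAI-compatible routes, so the same server can be exercised with a chat-completions request (the `model` field is typically the handle passed at launch):

```bash
curl -X POST http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<HF_MODEL_HANDLE>",
    "messages": [{"role": "user", "content": "Give me a one-sentence fun fact."}],
    "max_tokens": 64
  }'
```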
|
||||
|
||||
## Step 6. Test Python client API
|
||||
## Step 7. Test Python client API
|
||||
|
||||
Create a simple Python script to test programmatic access to the SGLang server. This runs on
|
||||
the host system and demonstrates how to integrate SGLang into applications.
|
||||
@ -197,7 +205,7 @@ response = requests.post('http://localhost:30000/generate', json={
|
||||
print(f"Response: {response.json()['text']}")
|
||||
```
|
||||
|
||||
## Step 7. Validate installation
|
||||
## Step 8. Validate installation
|
||||
|
||||
Confirm that both server and offline modes are working correctly. This step verifies the
|
||||
complete SGLang setup and ensures reliable operation.
|
||||
@ -213,7 +221,7 @@ docker ps
|
||||
docker logs <CONTAINER_ID>
|
||||
```
|
||||
|
||||
## Step 8. Cleanup and rollback
|
||||
## Step 9. Cleanup and rollback
|
||||
|
||||
Stop and remove containers to clean up resources. This step returns your system to its
|
||||
original state.
|
||||
@ -232,7 +240,7 @@ docker container prune -f
|
||||
docker rmi nvcr.io/nvidia/sglang:26.02-py3
|
||||
```
|
||||
|
||||
## Step 9. Next steps
|
||||
## Step 10. Next steps
|
||||
|
||||
With SGLang successfully deployed, you can now:
|
||||
|
||||
|
||||
@ -57,7 +57,7 @@ inference through kernel-level optimizations, efficient memory layouts, and adva
|
||||
|
||||
- DGX Spark device
|
||||
- NVIDIA drivers compatible with CUDA 12.x: `nvidia-smi`
|
||||
- Docker installed and GPU support configured: `docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 nvidia-smi`
|
||||
- Docker installed and GPU support configured: `docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5 nvidia-smi`
|
||||
- Hugging Face account with token for model access: `echo $HF_TOKEN`
|
||||
- Sufficient GPU VRAM (40GB+ recommended for 70B models)
|
||||
- Internet connectivity for downloading models and container images
|
||||
@ -136,7 +136,7 @@ models and containers.
|
||||
nvidia-smi
|
||||
|
||||
## Verify Docker GPU support
|
||||
docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 nvidia-smi
|
||||
docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5 nvidia-smi
|
||||
|
||||
```
|
||||
|
||||
@ -146,7 +146,7 @@ docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 nvidia-s
|
||||
## Set `HF_TOKEN` for model access.
|
||||
export HF_TOKEN=<your-huggingface-token>
|
||||
|
||||
export DOCKER_IMAGE="nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6"
|
||||
export DOCKER_IMAGE="nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc5"
|
||||
```
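With the variables exported, you can pre-pull the image so the validation step below starts immediately:

```bash
docker pull "$DOCKER_IMAGE"
```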
|
||||
|
||||
## Step 4. Validate TensorRT-LLM installation
|
||||
@ -161,8 +161,8 @@ docker run --rm -it --gpus all \
|
||||
|
||||
Expected output:
|
||||
```
|
||||
[TensorRT-LLM] TensorRT-LLM version: 1.2.0rc6
|
||||
TensorRT-LLM version: 1.2.0rc6
|
||||
[TensorRT-LLM] TensorRT-LLM version: 1.3.0rc5
|
||||
TensorRT-LLM version: 1.3.0rc5
|
||||
```
|
||||
|
||||
## Step 5. Create cache directory
|
||||
|
||||
@ -54,6 +54,9 @@ The following models are supported with vLLM on Spark. All listed models are ava
|
||||
|
||||
| Model | Quantization | Support Status | HF Handle |
|
||||
|-------|-------------|----------------|-----------|
|
||||
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | BF16 | ✅ | [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16) |
|
||||
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | FP8 | ✅ | [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8) |
|
||||
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | NVFP4 | ✅ | [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4) |
|
||||
| **Gemma 4 31B IT** | Base | ✅ | [`google/gemma-4-31B-it`](https://huggingface.co/google/gemma-4-31B-it) |
|
||||
| **Gemma 4 31B IT** | NVFP4 | ✅ | [`nvidia/Gemma-4-31B-IT-NVFP4`](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) |
|
||||
| **Gemma 4 26B A4B IT** | Base | ✅ | [`google/gemma-4-26B-A4B-it`](https://huggingface.co/google/gemma-4-26B-A4B-it) |
|
||||
@ -94,12 +97,22 @@ Reminder: not all model architectures are supported for NVFP4 quantization.
|
||||
* **Duration:** 30 minutes for Docker approach
|
||||
* **Risks:** Container registry access requires internal credentials
|
||||
* **Rollback:** Container approach is non-destructive.
|
||||
* **Last Updated:** 04/02/2026
|
||||
* Add support for Gemma 4 model family
|
||||
* **Last Updated:** 04/28/2026
|
||||
* Add support for Nemotron-3-Nano-Omni reasoning BF16, FP8, NVFP4
|
||||
|
||||
## Instructions
|
||||
|
||||
## Step 1. Configure Docker permissions
|
||||
## Step 1. Use model specific deployment guide
|
||||
|
||||
Certain models require special deployment configurations. Refer to their respective model cards for instructions on running them on DGX Spark:
|
||||
|
||||
| Model | Quantization | HF Model Card Link |
|
||||
|-------|-------------|----------------|
|
||||
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | BF16 | https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 |
|
||||
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | FP8 | https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 |
|
||||
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | NVFP4 | https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 |
|
||||
|
||||
## Step 2. Configure Docker permissions
|
||||
|
||||
To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.
|
||||
|
||||
@ -115,7 +128,7 @@ sudo usermod -aG docker $USER
|
||||
newgrp docker
|
||||
```
|
||||
|
||||
## Step 2. Pull vLLM container image
|
||||
## Step 3. Pull vLLM container image
|
||||
|
||||
Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm
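For example (the tag below is a placeholder; use whichever build the catalog lists as latest):

```bash
export LATEST_VLLM_VERSION=<latest-tag-from-catalog>
docker pull nvcr.io/nvidia/vllm:${LATEST_VLLM_VERSION}
```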
|
||||
|
||||
@ -136,7 +149,7 @@ For Gemma 4 model family, use vLLM custom containers:
|
||||
docker pull vllm/vllm-openai:gemma4-cu130
|
||||
```
|
||||
|
||||
## Step 3. Test vLLM in container
|
||||
## Step 4. Test vLLM in container
|
||||
|
||||
Launch the container and start vLLM server with a test model to verify basic functionality.
|
||||
|
||||
@ -171,7 +184,7 @@ curl http://localhost:8000/v1/chat/completions \
|
||||
|
||||
Expected response should contain `"content": "204"` or similar mathematical calculation.
|
||||
|
||||
## Step 4. Cleanup and rollback
|
||||
## Step 5. Cleanup and rollback
|
||||
|
||||
For container approach (non-destructive):
|
||||
|
||||
@ -180,7 +193,7 @@ docker rm $(docker ps -aq --filter ancestor=nvcr.io/nvidia/vllm:${LATEST_VLLM_VE
|
||||
docker rmi nvcr.io/nvidia/vllm
|
||||
```
|
||||
|
||||
## Step 5. Next steps
|
||||
## Step 6. Next steps
|
||||
|
||||
- **Production deployment:** Configure vLLM with your specific model requirements
|
||||
- **Performance tuning:** Adjust batch sizes and memory settings for your workload
|
||||
|
||||