Compare commits

...

4 Commits

Author SHA1 Message Date
Omar Obando
c7d068ccdc
Merge 48fc5eb30e into 718d8288e3 2026-05-29 23:02:55 -05:00
GitLab CI
718d8288e3 chore: Regenerate all playbooks 2026-05-30 03:20:45 +00:00
GitLab CI
6942395d72 chore: Regenerate all playbooks 2026-05-29 15:56:45 +00:00
Omar Obando
48fc5eb30e
Add troubleshooting tips for WiFi and watchdog issues 2026-03-09 17:19:09 -06:00
15 changed files with 631 additions and 586 deletions

View File

@ -40,7 +40,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
- [Connect Multiple DGX Spark through a Switch](nvidia/multi-sparks-through-switch/)
- [NCCL for Two Sparks](nvidia/nccl/)
- [Fine-tune with NeMo](nvidia/nemo-fine-tune/)
- [NemoClaw with Nemotron 3 Super and Telegram on DGX Spark](nvidia/nemoclaw/)
- [Run NemoClaw with a Local LLM](nvidia/nemoclaw/)
- [Nemotron-3-Nano with llama.cpp](nvidia/nemotron/)
- [NIM on Spark](nvidia/nim-llm/)
- [NVFP4 Quantization](nvidia/nvfp4-quantization/)

View File

@ -370,7 +370,7 @@ docker run \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
nvcr.io/nvidia/sglang:25.12-py3 \
lmsysorg/sglang:latest-cu130 \
bash -lc '
python3 -m sglang.bench_offline_throughput \
--model-path "$MODEL_HANDLE" \
@ -394,7 +394,7 @@ docker run \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
nvcr.io/nvidia/sglang:25.12-py3 \
lmsysorg/sglang:latest-cu130 \
bash -lc '
python3 -m sglang.launch_server \
--model-path "$MODEL_HANDLE" \
@ -417,7 +417,7 @@ docker run \
--network host \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
nvcr.io/nvidia/sglang:25.12-py3 \
lmsysorg/sglang:latest-cu130 \
bash -lc '
python3 -m sglang.bench_serving \
--backend sglang \

View File

@ -1,13 +1,11 @@
# NemoClaw with Nemotron 3 Super and Telegram on DGX Spark
# Run NemoClaw with a Local LLM
> Install NemoClaw on DGX Spark with local Ollama inference and Telegram bot integration
> Build your first local AI assistant on DGX Spark using NemoClaw and Ollama in a secure sandbox, with optional Telegram.
## Table of Contents
- [Overview](#overview)
- [Overview](#overview)
- [Basic idea](#basic-idea)
- [What you'll accomplish](#what-youll-accomplish)
- [Notice and disclaimers](#notice-and-disclaimers)
- [Isolation layers (OpenShell)](#isolation-layers-openshell)
@ -17,40 +15,33 @@
- [Ancillary files](#ancillary-files)
- [Time and risk](#time-and-risk)
- [Instructions](#instructions)
- [Step 1. Configure Docker and the NVIDIA container runtime](#step-1-configure-docker-and-the-nvidia-container-runtime)
- [Step 2. Install Ollama](#step-2-install-ollama)
- [Step 3. Pull the Nemotron 3 Super model](#step-3-pull-the-nemotron-3-super-model)
- [Step 4. Install NemoClaw](#step-4-install-nemoclaw)
- [Step 5. Connect to the sandbox and verify inference](#step-5-connect-to-the-sandbox-and-verify-inference)
- [Step 6. Talk to the agent (CLI)](#step-6-talk-to-the-agent-cli)
- [Step 7. Interactive TUI](#step-7-interactive-tui)
- [Step 8. Exit the sandbox and access the Web UI](#step-8-exit-the-sandbox-and-access-the-web-ui)
- [Step 9. Create a Telegram bot](#step-9-create-a-telegram-bot)
- [Step 10. Install cloudflared and start the Telegram bridge](#step-10-install-cloudflared-and-start-the-telegram-bridge)
- [Step 11. Stop services](#step-11-stop-services)
- [Step 12. Uninstall NemoClaw](#step-12-uninstall-nemoclaw)
- [Step 1. Install NemoClaw](#step-1-install-nemoclaw)
- [Step 2. NemoClaw Onboarding](#step-2-nemoclaw-onboarding)
- [Step 3. Interact with OpenClaw](#step-3-interact-with-openclaw)
- [Step 4. Enable Brave Search in sandbox](#step-4-enable-brave-search-in-sandbox)
- [Step 5. Set up Messaging Channel (Telegram Bot as an example)](#step-5-set-up-messaging-channel-telegram-bot-as-an-example)
- [Step 6. Set Up NemoClaw Agents](#step-6-set-up-nemoclaw-agents)
- [Step 7. Stop services](#step-7-stop-services)
- [Step 8. Uninstall NemoClaw](#step-8-uninstall-nemoclaw)
- [Troubleshooting](#troubleshooting)
---
## Overview
### Overview
## Basic idea
### Basic idea
**NVIDIA NemoClaw** is an open-source reference stack that simplifies running OpenClaw always-on assistants more safely. It installs the **NVIDIA OpenShell** runtime — an environment designed for executing agents with additional security — and connects them to **local Ollama** inference on your DGX Spark. A single installer command (`nemoclaw.sh`) handles Node.js, OpenShell, and the NemoClaw CLI; the **onboard** wizard then creates a sandboxed agent, optional **Brave Search**, optional **messaging channels** (Telegram, Discord, or Slack), and a **policy tier** with network presets.
**NVIDIA NemoClaw** is an open-source reference stack that simplifies running OpenClaw always-on assistants more safely. It installs the **NVIDIA OpenShell** runtime -- an environment designed for executing agents with additional security -- and open-source models like NVIDIA Nemotron. A single installer command handles Node.js, OpenShell, and the NemoClaw CLI, then walks you through an onboard wizard to create a sandboxed agent on your DGX Spark using Ollama with Nemotron 3 Super.
By the end of this playbook you will have a working AI agent inside an OpenShell sandbox, accessible via a web dashboard and a Telegram bot, with inference routed to a local Nemotron 3 Super 120B model on your Spark -- all without exposing your host filesystem or network to the agent.
By the end of this playbook you will have a working AI agent inside an OpenShell sandbox, reachable through the **Web UI** or **terminal TUI**, with inference routed to **local Ollama** on the Spark. You can optionally add **Telegram** (with **cloudflared** for a public webhook URL) and optional **web search** — all without exposing your host filesystem or network beyond what you explicitly allow in policy.
### What you'll accomplish
- Configure Docker and the NVIDIA container runtime for OpenShell on DGX Spark
- Install Ollama, pull Nemotron 3 Super 120B, and configure it for sandbox access
- Install NemoClaw with a single command (handles Node.js, OpenShell, and the CLI)
- Run the onboard wizard to create a sandbox and configure local inference
- Chat with the agent via the CLI, TUI, and web UI
- Set up a Telegram bot that forwards messages to your sandboxed agent
- Install **NemoClaw** with one command (`nemoclaw.sh`), which pulls Node.js, OpenShell, and the CLI as needed
- Walk through the `nemoclaw onboard` wizard with the recommended settings
- Open the **Web UI** to interact with the agent
- Optionally enable **Brave Search** or **Telegram** after onboarding
- **Cleanup and uninstall** with the documented `uninstall.sh` flags when finished
### Notice and disclaimers
@ -64,14 +55,14 @@ By installing this demo, you accept responsibility for all third-party component
#### What you're getting
This experience is provided "AS IS" for demonstration purposes only -- no warranties, no guarantees. This is a demo, not a production-ready solution. You will need to implement appropriate security controls for your environment and use case.
This experience is provided "AS IS" for demonstration purposes only: no warranties, no guarantees. This is a demo, not a production-ready solution. You will need to implement appropriate security controls for your environment and use case.
#### Key risks with AI agents
- **Data leakage** -- Any materials the agent accesses could be exposed, leaked, or stolen.
- **Malicious code execution** -- The agent or its connected tools could expose your system to malicious code or cyber-attacks.
- **Unintended actions** -- The agent might modify or delete files, send messages, or access services without explicit approval.
- **Prompt injection and manipulation** -- External inputs or connected content could hijack the agent's behavior in unexpected ways.
- **Data leakage**: Any materials the agent accesses could be exposed, leaked, or stolen.
- **Malicious code execution**: The agent or its connected tools could expose your system to malicious code or cyber-attacks.
- **Unintended actions**: The agent might modify or delete files, send messages, or access services without explicit approval.
- **Prompt injection and manipulation**: External inputs or connected content could hijack the agent's behavior in unexpected ways.
#### Participant acknowledgement
@ -81,23 +72,22 @@ By participating in this demo, you acknowledge that you are solely responsible f
| Layer | What it protects | When it applies |
|------------|----------------------------------------------------|-----------------------------|
| Filesystem | Prevents reads/writes outside allowed paths. | Locked at sandbox creation. |
| Filesystem | Prevents reads/writes outside allowed paths. | Locked at sandbox creation. |
| Network | Blocks unauthorized outbound connections. | Hot-reloadable at runtime. |
| Process | Blocks privilege escalation and dangerous syscalls.| Locked at sandbox creation. |
| Process | Blocks privilege escalation and dangerous syscalls.| Locked at sandbox creation. |
| Inference | Reroutes model API calls to controlled backends. | Hot-reloadable at runtime. |
### What to know before starting
- Basic use of the Linux terminal and SSH
- Familiarity with Docker (permissions, `docker run`)
- Familiarity with Docker (permissions, `docker run`, optional `docker` group membership)
- Awareness of the security and risk sections above
### Prerequisites
**Hardware and access:**
**Hardware:**
- A DGX Spark (GB10) with keyboard and monitor, or SSH access
- A **Telegram bot token** from [@BotFather](https://t.me/BotFather) (create one with `/newbot`) -- only needed if you want the Telegram bot. Have it ready *before* running the installer; the onboard wizard prompts for it.
**Software:**
@ -115,9 +105,10 @@ Expected: Ubuntu 24.04, NVIDIA GB10 GPU, Docker 28.x+.
### Have ready before you begin
| Item | Where to get it |
|------|----------------|
| Telegram bot token (optional) | [@BotFather](https://t.me/BotFather) on Telegram -- create with `/newbot`. Required only for the Telegram bot; have it ready before running the installer. |
| Item | When you need it |
|------|------------------|
| **Telegram bot token** (optional) | Create with [@BotFather](https://t.me/BotFather) (`/newbot`). You can paste it during **onboarding** (Step 2) **or** when you run **`nemoclaw <sandbox> channels add telegram`** later. |
| **Brave Search API key** (optional) | From [Brave Search API](https://brave.com/search/api/) if you enable web search during onboarding or via **`nemoclaw onboard --fresh --gpu`** (`--fresh` re-prompts every onboarding question, including features you previously skipped; without `--fresh` the wizard resumes the previous session and will not re-prompt). |
### Ancillary files
@ -125,143 +116,51 @@ All required assets are handled by the NemoClaw installer. No manual cloning is
### Time and risk
- **Estimated time:** 20--30 minutes (with Ollama and model already downloaded). First-time model download adds ~15--30 minutes depending on network speed.
- **Risk level:** Medium -- you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts.
- **Last Updated:** 04/28/2026
* Updated for NemoClaw v0.0.22+: revised Telegram setup, renamed tunnel commands, refreshed uninstall instructions.
- **Estimated time:** About 30-60 minutes for a first full pass (install, onboard, and model download, which varies with model choice and network speed). Optional Brave, Telegram, and cloudflared steps add time if you do them in a second session.
- **Risk level:** Medium: you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts.
- **Last Updated:** 05/29/2026
- Update to latest nemoclaw installer instructions
## Instructions
## Phase 1: Prerequisites
## Phase 1: Install and Run NemoClaw
These steps prepare a fresh DGX Spark for NemoClaw. If Docker, the NVIDIA runtime, and Ollama are already configured, skip to Phase 2.
### Step 1. Install NemoClaw
### Step 1. Configure Docker and the NVIDIA container runtime
OpenShell's gateway runs k3s inside Docker. On DGX Spark (Ubuntu 24.04, cgroup v2), Docker must be configured with the NVIDIA runtime and host cgroup namespace mode.
Configure the NVIDIA container runtime for Docker:
This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the pinned NemoClaw **v0.55** release (set via `NEMOCLAW_VERSION`; v0.55 is the version the NemoClaw team currently recommends as the most stable), builds the CLI, and runs the onboard wizard to create a sandbox.
```bash
sudo nvidia-ctk runtime configure --runtime=docker
curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_VERSION=v0.55 bash
```
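The pinned install command places `NEMOCLAW_VERSION=v0.55` in front of `bash`, not in front of `curl`. A minimal sketch of why that placement matters, using a hypothetical one-line stand-in for the downloaded script (the real `nemoclaw.sh` is not reproduced here, and its default version behavior is an assumption):

```bash
# Stand-in for the downloaded script (hypothetical; not the real nemoclaw.sh).
# It only reports which version it was asked to install.
fake_installer='echo "installing NemoClaw ${NEMOCLAW_VERSION:-default}"'

# The assignment prefixes `bash`, so it is exported to that one process only.
pinned="$(printf '%s\n' "$fake_installer" | NEMOCLAW_VERSION=v0.55 bash)"

# Without the prefix (and with the variable unset), the script uses its fallback.
unpinned="$(unset NEMOCLAW_VERSION; printf '%s\n' "$fake_installer" | bash)"

echo "$pinned"     # installing NemoClaw v0.55
echo "$unpinned"   # installing NemoClaw default
```

Because the variable is scoped to the interpreter that reads the script, it does not linger in your shell after the install finishes.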
Set the cgroup namespace mode required by OpenShell on DGX Spark:
The installation wizard walks you through setup:
```bash
sudo python3 -c "
import json, os
path = '/etc/docker/daemon.json'
d = json.load(open(path)) if os.path.exists(path) else {}
d['default-cgroupns-mode'] = 'host'
json.dump(d, open(path, 'w'), indent=2)
"
```
1. **Accept NemoClaw license** -- Confirm by entering `yes`
2. **Run express install** -- Confirm by entering `Y`
Restart Docker:
The installer requires **Node.js 22.16+** (installed automatically if missing). It then proceeds through the Node.js, NemoClaw CLI, and onboarding phases. The next step describes the onboarding options in detail.
```bash
sudo systemctl restart docker
```
Verify the NVIDIA runtime works:
```bash
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```
If you get a permission denied error on `docker`, add your user to the Docker group and activate the new group in your current session:
```bash
sudo usermod -aG docker $USER
newgrp docker
```
This applies the group change immediately. Alternatively, you can log out and back in instead of running `newgrp docker`.
### Step 2. NemoClaw Onboarding
> [!NOTE]
> DGX Spark uses cgroup v2. OpenShell's gateway embeds k3s inside Docker and needs host cgroup namespace access. Without `default-cgroupns-mode: host`, the gateway can fail with "Failed to start ContainerManager" errors.
> If you chose **express install** in Step 1, all settings are auto-configured with recommended defaults. Skip to Step 3.
### Step 2. Install Ollama
During custom setup, the onboard wizard walks you through:
Install Ollama:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Configure Ollama to listen on all interfaces so the sandbox container can reach it:
```bash
sudo mkdir -p /etc/systemd/system/ollama.service.d
printf '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0"\n' | sudo tee /etc/systemd/system/ollama.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
Verify it is running and reachable on all interfaces:
```bash
curl http://0.0.0.0:11434
```
Expected: `Ollama is running`. If not, start it with `sudo systemctl start ollama`.
> [!IMPORTANT]
> Always start Ollama via systemd (`sudo systemctl restart ollama`) — do not use `ollama serve &`. A manually started Ollama process does not pick up the `OLLAMA_HOST=0.0.0.0` setting above, and the NemoClaw sandbox will not be able to reach the inference server.
### Step 3. Pull the Nemotron 3 Super model
Download Nemotron 3 Super 120B (~87 GB; may take 15--30 minutes depending on network speed):
```bash
ollama pull nemotron-3-super:120b
```
Run it briefly to pre-load weights into memory (type `/bye` to exit):
```bash
ollama run nemotron-3-super:120b
```
Verify the model is available:
```bash
ollama list
```
You should see `nemotron-3-super:120b` in the output.
---
## Phase 2: Install and Run NemoClaw
### Step 4. Install NemoClaw
This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the latest stable NemoClaw release, builds the CLI, and runs the onboard wizard to create a sandbox.
```bash
curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
```
The onboard wizard walks you through setup:
1. **Sandbox name** -- Pick a name (e.g. `my-assistant`). Names must be lowercase alphanumeric with hyphens only.
2. **Inference provider** -- Select **Local Ollama**.
3. **Model** -- Select **nemotron-3-super:120b**.
4. **Messaging channels** -- If you want a Telegram bot, select `telegram` here and paste your bot token when prompted. Create the bot first via [@BotFather](https://t.me/BotFather) in Telegram (see Step 9). If you skip this, you can re-run the installer later to recreate the sandbox with Telegram enabled.
5. **Policy presets** -- Accept the suggested presets when prompted (hit **Y**).
> [!IMPORTANT]
> Telegram must be configured at this step. The channel plugin and bot token are wired into the sandbox container during onboarding — they cannot be added to an existing sandbox by exporting environment variables on the host.
1. **Configuring inference** -- Choose to set up local inference on your Spark by selecting **`7) Local Ollama`**.
2. **Ollama models** -- Choose desired inference model. If no model is present locally, the installer will download **`qwen3:30b`** automatically.
3. **Sandbox name** -- Pick a name (e.g. my-assistant). Each sandbox requires a unique name.
4. **Apply this configuration** -- Enter `Y` to confirm setting up local inference.
5. **Enable Brave Web Search** -- Optional. If you enable it, paste a [Brave Search API](https://brave.com/search/api/) key when prompted.
6. **Messaging channels** -- Optional. If you enable it, choose your desired bot (`telegram`, `discord` or `slack`) and paste your bot token when prompted.
7. **Policy presets** -- Choose desired Policy tier (`Balanced` recommended) and accept/edit the suggested presets when prompted (confirm with **Enter**).
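The wizard asks for a sandbox name and performs its own validation. The earlier wording of this step required lowercase alphanumeric characters and hyphens only; assuming that rule still applies, a hypothetical pre-check (not part of the `nemoclaw` CLI) might look like this:

```bash
# Hypothetical pre-check; the onboard wizard remains the source of truth.
# Assumed rule: lowercase letters, digits, and hyphens only.
is_valid_sandbox_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9-]+$'
}

for name in my-assistant My_Assistant; do
  if is_valid_sandbox_name "$name"; then
    echo "$name: ok"
  else
    echo "$name: rejected"
  fi
done
```

`LC_ALL=C` keeps the `a-z` range from matching uppercase letters in locales with mixed-case collation.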
When complete you will see output like:
```text
──────────────────────────────────────────────────
Dashboard http://localhost:18789/
Sandbox my-assistant (Landlock + seccomp + netns)
Model nemotron-3-super:120b (Local Ollama)
Model <your-selected-model> (Local Ollama)
──────────────────────────────────────────────────
Run: nemoclaw my-assistant connect
Status: nemoclaw my-assistant status
@ -269,68 +168,40 @@ Logs: nemoclaw my-assistant logs --follow
──────────────────────────────────────────────────
```
> [!IMPORTANT]
> Save the tokenized Web UI URL printed at the end -- you will need it in Step 8. It looks like:
> `http://127.0.0.1:18789/#token=<long-token-here>`
> [!NOTE]
> If `nemoclaw` is not found after install, run `source ~/.bashrc` to reload your shell path.
> - If `nemoclaw` is not found after install, run `source ~/.bashrc` to reload your shell path.
> - Time to finish **Onboarding** can vary, depending on the model choice and internet speed.
### Step 5. Connect to the sandbox and verify inference
Connect to the sandbox:
NemoClaw onboarding can be run repeatedly to create multiple sandboxes for independent use cases. Use `--name <new-name>` to create an additional sandbox alongside any existing ones:
```bash
nemoclaw my-assistant connect
nemoclaw onboard --gpu --name <new-name>
```
You will see `sandbox@my-assistant:~$` -- you are now inside the sandboxed environment.
> [!IMPORTANT]
> Use `--name <new-name>` to create an additional sandbox without affecting existing ones. The `--fresh` flag is a destructive option reserved for starting a completely new onboard session — if a sandbox with the same name already exists, `--fresh` will **destroy and recreate it**. Only use `--fresh` when you intend to wipe and re-onboard (see Step 4 for an example where re-prompting is required).
Verify that the inference route is working:
### Step 3. Interact with OpenClaw
There are two ways to interact with OpenClaw: the Web UI or the terminal UI.
#### Option 1. Web UI
Get the full dashboard URL (includes the auto-assigned port and token):
```bash
curl -sf https://inference.local/v1/models
nemoclaw my-assistant dashboard-url --quiet
```
Expected: JSON listing `nemotron-3-super:120b`.
This prints a URL like `http://127.0.0.1:18790/#token=<token>`. The port is auto-assigned (commonly 18789 or 18790) and may differ between installs.
### Step 6. Talk to the agent (CLI)
**If accessing the Web UI directly on the Spark** (keyboard and monitor attached), open the dashboard URL in a browser.
Still inside the sandbox, send a test message:
**If accessing the Web UI from a remote machine**, you need to set up an SSH tunnel.
```bash
openclaw agent --agent main -m "hello" --session-id test
```
First, note the port number from the dashboard URL above (e.g. `18790`).
The agent will respond using Nemotron 3 Super. First responses may take 30--90 seconds for a 120B parameter model running locally.
### Step 7. Interactive TUI
Launch the terminal UI for an interactive chat session:
```bash
openclaw tui
```
Press **Ctrl+C** to exit the TUI.
### Step 8. Exit the sandbox and access the Web UI
Exit the sandbox to return to the host:
```bash
exit
```
**If accessing the Web UI directly on the Spark** (keyboard and monitor attached), open a browser and navigate to the tokenized URL from Step 4:
```text
http://127.0.0.1:18789/#token=<long-token-here>
```
**If accessing the Web UI from a remote machine**, you need to set up an SSH tunnel. The NemoClaw onboard wizard already created the port 18789 forward on the Spark, so you only need to tunnel from your remote machine.
First, find your Spark's IP address. On the Spark, run:
Find your Spark's IP address:
```bash
hostname -I | awk '{print $1}'
@ -338,46 +209,120 @@ hostname -I | awk '{print $1}'
This prints the primary IP address (e.g. `192.168.1.42`). You can also find it in **Settings > Wi-Fi** or **Settings > Network** on the Spark's desktop, or check your router's connected-devices list.
From your remote machine, create an SSH tunnel to the Spark (replace `<your-spark-ip>` with the IP address from above):
From your remote machine, create an SSH tunnel using the port from above (replace `<port>` and `<your-spark-ip>`):
```bash
ssh -L 18789:127.0.0.1:18789 <your-user>@<your-spark-ip>
ssh -L <port>:127.0.0.1:<port> <your-user>@<your-spark-ip>
```
Now open the tokenized URL in your remote machine's browser:
```text
http://127.0.0.1:18789/#token=<long-token-here>
```
Now open the dashboard URL in your remote machine's browser.
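Since the port is auto-assigned, it can help to derive the tunnel command from the dashboard URL instead of typing the port by hand. The sketch below uses two hypothetical helper functions (not part of the `nemoclaw` or `openshell` CLIs) and a placeholder URL. It assumes `dashboard-url --quiet` prints only the URL, and it rewrites `localhost` to `127.0.0.1` because the gateway origin check requires that exact host.

```bash
# Hypothetical helpers; not part of the nemoclaw or openshell CLIs.

# Print the port from a URL such as http://127.0.0.1:18790/#token=...
dashboard_port() {
  hostport="${1#*://}"        # drop the scheme
  hostport="${hostport%%/*}"  # drop the path and fragment
  printf '%s\n' "${hostport##*:}"
}

# Rewrite a localhost URL to 127.0.0.1; leave any other URL unchanged.
use_loopback_ip() {
  case "$1" in
    http://localhost:*) printf 'http://127.0.0.1:%s\n' "${1#http://localhost:}" ;;
    *) printf '%s\n' "$1" ;;
  esac
}

# Placeholder URL. On the Spark you would capture the real one with:
#   url="$(nemoclaw my-assistant dashboard-url --quiet)"
url="http://localhost:18790/#token=example-token"
url="$(use_loopback_ip "$url")"
port="$(dashboard_port "$url")"

echo "Tunnel: ssh -L ${port}:127.0.0.1:${port} <your-user>@<your-spark-ip>"
echo "Open:   ${url}"
```

Using the same port on both sides of `-L` keeps the tokenized URL valid on the remote machine without editing it.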
> [!IMPORTANT]
> Use `127.0.0.1`, not `localhost` -- the gateway origin check requires an exact match.
> [!NOTE]
> If the Web UI fails to load and the port forward may be stale, reset it on the Spark host:
> If the Web UI fails to load, the port forward may be stale. Get the port from `nemoclaw my-assistant dashboard-url --quiet` and reset the forward:
> ```bash
> openshell forward stop 18789 my-assistant || true
> openshell forward start 18789 my-assistant --background
> openshell forward stop <port> my-assistant || true
> openshell forward start <port> my-assistant --background
> ```
#### Option 2. Terminal UI
Connect to the sandbox:
```bash
nemoclaw my-assistant connect
```
Then launch the terminal UI inside the sandbox:
```bash
openclaw tui
```
You can start chatting with OpenClaw. Press **Ctrl+C** to exit the terminal UI.
To exit the sandbox:
```bash
exit
```
---
## Phase 3: Telegram Bot
## Phase 2: Modify NemoClaw Policy
> [!IMPORTANT]
> Telegram must be enabled in the **NemoClaw onboard wizard** (Step 4 → Messaging channels). The channel plugin and bot token are wired into the sandbox container at sandbox creation time — `policy-add` only opens network egress and is not enough on its own. If you skipped Telegram during onboard, re-run the installer to recreate the sandbox with Telegram enabled.
### Step 4. Enable Brave Search in sandbox
### Step 9. Create a Telegram bot
To add Brave Web Search to an existing sandbox, re-run the onboard wizard with `--fresh` to start a new session that re-prompts all options (including previously skipped features):
Do this **before** running the NemoClaw installer in Step 4 so you have your bot token ready when the wizard prompts for it.
```bash
nemoclaw onboard --fresh --gpu
```
Open Telegram, find [@BotFather](https://t.me/BotFather), send `/newbot`, and follow the prompts. Copy the bot token it gives you and paste it into the wizard when you reach the **Messaging channels** step.
> [!NOTE]
> Without `--fresh`, the onboard wizard **resumes** the previous session and will not re-prompt for features you already skipped.
### Step 10. Install cloudflared and start the Telegram bridge
When you reach **Enable Brave Web Search**, choose **yes** and paste the key from the [Brave Search API](https://brave.com/search/api/) console. Confirm the same sandbox name and inference choices where prompted. The wizard will **rebuild** the sandbox so the key is applied.
The Telegram bridge needs a public webhook URL so Telegram can deliver messages to your bot. NemoClaw uses [cloudflared](https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/) to create a free `trycloudflare.com` tunnel.
> [!NOTE]
> Alternatively, set `BRAVE_API_KEY` in your environment before running the installer and Brave Search will be enabled automatically during onboard.
Make sure you are on the **host** (not inside the sandbox). If you are inside the sandbox, run `exit` first.
To confirm web search is enabled, relaunch the OpenClaw Web UI or terminal UI and ask the agent for something that needs **live web search**. If requests still fail, recheck the output of `nemoclaw <sandbox-name> policy-list` and re-read the onboard output for Brave/API errors.
### Step 5. Set up Messaging Channel (Telegram Bot as an example)
These steps apply when your sandbox exists but **Telegram was never configured** (you skipped **Messaging channels** in Step 2, or the sandbox policy tier never included Telegram-related egress). Replace `<sandbox-name>` with your sandbox (for example `my-assistant`).
#### 1. Create a Telegram bot
In Telegram, open [@BotFather](https://t.me/BotFather), send `/newbot`, and complete the prompts. Copy the **bot token** BotFather returns and keep it ready for the next step.
#### 2. Register Telegram with NemoClaw and rebuild the sandbox
```bash
nemoclaw <sandbox-name> channels add telegram
```
Paste the token when prompted. NemoClaw persists credentials and **rebuilds** the sandbox so OpenClaw can use Telegram as a messaging channel.
#### 3. (If needed) Allow Telegram egress in the sandbox policy
If messages fail with network or policy errors after the channel is registered, inspect presets and add Telegram-related egress if your tier omitted it:
```bash
nemoclaw <sandbox-name> policy-list
nemoclaw <sandbox-name> policy-add telegram
```
Preset names follow your selected tier; confirm against [Network policies](https://docs.nvidia.com/nemoclaw/latest/reference/network-policies.html).
#### 4. Verify Telegram
Telegram uses long-polling (`getUpdates`) — the sandbox actively pulls messages from Telegram servers. **No public URL or cloudflared tunnel is required for Telegram to work.**
Open Telegram, find your bot, and send a message. The bot should forward traffic to the agent in your NemoClaw sandbox and reply.
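To illustrate what long-polling means here, the sketch below builds the Telegram Bot API `getUpdates` request that a polling client issues. The helper name and the token are placeholders for illustration; how NemoClaw issues the call internally is not shown in this playbook.

```bash
# Hypothetical helper: build a Bot API getUpdates URL.
# A timeout greater than 0 makes the request a long poll: Telegram holds the
# connection open until an update arrives or the timeout (seconds) expires.
get_updates_url() {
  printf 'https://api.telegram.org/bot%s/getUpdates?offset=%s&timeout=%s\n' \
    "$1" "${2:-0}" "${3:-30}"
}

# Placeholder token, for illustration only.
get_updates_url '123456:EXAMPLE-TOKEN'

# A manual poll would look like this (requires a real token and network access):
#   curl -s "$(get_updates_url "$TELEGRAM_BOT_TOKEN")"
```

Because the client opens the connection outbound, no inbound port or public URL is involved. Telegram accepts only one active `getUpdates` consumer per bot, so a manual poll is best tried while the sandbox is stopped.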
> [!NOTE]
> The first response may take longer depending on model size (30B models respond in a few seconds; larger models may take longer on first inference).
> [!NOTE]
> If the bot does not respond:
> - Run `nemoclaw <sandbox-name> status` to confirm the sandbox is running and inference is healthy.
> - Run `nemoclaw <sandbox-name> logs --follow` and look for Telegram-related errors.
> - If Telegram egress is missing, run `nemoclaw <sandbox-name> policy-add` and select `telegram`.
> - If the channel was never registered, run `nemoclaw <sandbox-name> channels add telegram`.
> [!NOTE]
> The `channels add telegram` wizard also prompts for an optional **Telegram User ID** to restrict who can DM the bot. Send `/start` to [@userinfobot](https://t.me/userinfobot) on Telegram to get your numeric user ID. If you skip this, the bot will require device pairing (a terminal-based code confirmation) before responding to messages.
> [!NOTE]
> For details on restricting which Telegram chats can interact with the agent, see the [NemoClaw Telegram bridge documentation](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html).
#### 5. (Optional) Install cloudflared for remote Web UI access
The cloudflared tunnel provides a **public URL for the Web UI dashboard** — it is not related to Telegram messaging.
Install cloudflared (DGX Spark is arm64):
@ -393,36 +338,29 @@ Start the tunnel:
nemoclaw tunnel start
```
Verify the public URL is live:
Verify:
```bash
nemoclaw status
```
You should see `● cloudflared` with a `trycloudflare.com` public URL (e.g. `https://assembled-peer-persian-kitty.trycloudflare.com`).
You should see `● cloudflared` with a `trycloudflare.com` public URL.
Open Telegram, find your bot, and send it a message. The bot forwards it to the agent and replies.
---
> [!NOTE]
> If `nemoclaw tunnel start` prints `cloudflared not found — no public URL`, the cloudflared install above did not complete successfully. Re-run the install, then restart the tunnel:
> ```bash
> nemoclaw tunnel stop && nemoclaw tunnel start
> ```
## Phase 3: Set Up NemoClaw Agent
> [!NOTE]
> The first response may take 30--90 seconds for a 120B parameter model running locally.
### Step 6. Set Up NemoClaw Agents
> [!NOTE]
> If sending a message returns `Error: Channel is unavailable: telegram`, the channel was not enabled during onboard. Re-run the installer to recreate the sandbox with Telegram selected at the **Messaging channels** step.
Setting up NemoClaw agents generally requires three steps: configure the NemoClaw security policy, run the agent workflow prompt, and personalize the workflow for your own use case.
> [!NOTE]
> For details on restricting which Telegram chats can interact with the agent, see the [NemoClaw Telegram bridge documentation](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html).
Check out these [Example NemoClaw Agents](https://build.nvidia.com/spark/nemoclaw-applications) for reference. Consider sharing your NemoClaw agent setup with the community at the [DGX Spark Developer Forum](https://forums.developer.nvidia.com/c/accelerated-computing/dgx-spark-gb10).
---
## Phase 4: Cleanup and Uninstall
### Step 11. Stop services
### Step 7. Stop services
Stop the cloudflared tunnel:
@ -433,19 +371,25 @@ nemoclaw tunnel stop
Stop the port forward:
```bash
openshell forward list # find active forwards
openshell forward stop 18789 # stop the dashboard forward
openshell forward list # find active forwards and their ports
openshell forward stop <port> # stop the dashboard forward (use the port shown above)
```
### Step 12. Uninstall NemoClaw
### Step 8. Uninstall NemoClaw
Run the uninstaller via curl (matches the [NemoClaw README](https://github.com/NVIDIA/NemoClaw)). It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and Ollama are preserved.
The NemoClaw CLI includes a built-in uninstaller. It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and Ollama are preserved.
```bash
curl -fsSL https://raw.githubusercontent.com/NVIDIA/NemoClaw/refs/heads/main/uninstall.sh | bash
nemoclaw uninstall --yes
```
**Uninstaller flags** (pass via `bash -s -- <flags>`):
To remove everything including the Ollama model:
```bash
nemoclaw uninstall --yes --delete-models
```
**Uninstaller flags:**
| Flag | Effect |
|------|--------|
@ -453,11 +397,11 @@ curl -fsSL https://raw.githubusercontent.com/NVIDIA/NemoClaw/refs/heads/main/uni
| `--keep-openshell` | Leave the `openshell` binary in place |
| `--delete-models` | Also remove the Ollama models pulled by NemoClaw |
To remove everything including the Ollama model, non-interactively:
```bash
curl -fsSL https://raw.githubusercontent.com/NVIDIA/NemoClaw/refs/heads/main/uninstall.sh | bash -s -- --yes --delete-models
```
> [!NOTE]
> If the `nemoclaw` CLI is not available (e.g. install failed partway), use the remote uninstaller as a fallback:
> ```bash
> curl -fsSL https://raw.githubusercontent.com/NVIDIA/NemoClaw/refs/heads/main/uninstall.sh | bash -s -- --yes
> ```
The uninstaller runs 6 steps:
1. Stop NemoClaw helper services and port-forward processes
@ -478,35 +422,84 @@ The uninstaller runs 6 steps:
| `nemoclaw my-assistant status` | Show sandbox status and inference config |
| `nemoclaw my-assistant logs --follow` | Stream sandbox logs in real time |
| `nemoclaw list` | List all registered sandboxes |
| `nemoclaw tunnel start` | Start cloudflared tunnel (public URL for Telegram webhooks) |
| `nemoclaw tunnel start` | Start cloudflared tunnel (public URL for remote Web UI access) |
| `nemoclaw tunnel stop` | Stop the cloudflared tunnel |
| `nemoclaw my-assistant dashboard-url --quiet` | Print the full tokenized Web UI URL (includes auto-assigned port) |
| `openshell term` | Open the monitoring TUI on the host |
| `openshell forward list` | List active port forwards |
| `openshell forward start 18789 my-assistant --background` | Restart port forwarding for Web UI |
| `curl -fsSL https://raw.githubusercontent.com/NVIDIA/NemoClaw/refs/heads/main/uninstall.sh \| bash` | Remove NemoClaw (preserves Docker, Node.js, Ollama) |
| `curl -fsSL https://raw.githubusercontent.com/NVIDIA/NemoClaw/refs/heads/main/uninstall.sh \| bash -s -- --delete-models` | Remove NemoClaw and Ollama models |
| `nemoclaw uninstall --yes` | Remove NemoClaw (preserves Docker, Node.js, Ollama) |
| `nemoclaw uninstall --yes --delete-models` | Remove NemoClaw and Ollama models |
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| `nemoclaw: command not found` after install | Shell PATH not updated | Run `source ~/.bashrc` (or `source ~/.zshrc` for zsh), or open a new terminal window. |
| Installer fails with Node.js version error | Node.js version below 20 | Install Node.js 20+: `curl -fsSL https://deb.nodesource.com/setup_22.x \| sudo -E bash - && sudo apt-get install -y nodejs` then re-run the installer. |
| Installer fails with Node.js version error | Node.js version below 22.16 | Install Node.js 22.16+: `curl -fsSL https://deb.nodesource.com/setup_22.x \| sudo -E bash - && sudo apt-get install -y nodejs` then re-run the installer. |
| npm install fails with `EACCES` permission error | npm global directory not writable | `mkdir -p ~/.npm-global && npm config set prefix ~/.npm-global && export PATH=~/.npm-global/bin:$PATH` then re-run the installer. Add the `export` line to `~/.bashrc` to make it permanent. |
| Docker permission denied | User not in docker group | `sudo usermod -aG docker $USER`, then log out and back in. |
| Gateway fails with cgroup / "Failed to start ContainerManager" errors | Docker not configured for host cgroup namespace on DGX Spark | Run the cgroup fix: `sudo python3 -c "import json, os; path='/etc/docker/daemon.json'; d=json.load(open(path)) if os.path.exists(path) else {}; d['default-cgroupns-mode']='host'; json.dump(d, open(path,'w'), indent=2)"` then `sudo systemctl restart docker`. Alternatively, run `sudo nemoclaw setup-spark` which applies this fix automatically. |
| Gateway fails with cgroup / "Failed to start ContainerManager" errors | Older OpenShell or Docker still using a **private** cgroup namespace for the gateway so kubelet cannot see cgroup v2 controllers | First **upgrade OpenShell** (re-run the Phase 1 `nemoclaw.sh` install so you get a build that sets host cgroupns on the gateway container). If it still fails, force Docker's default to host mode by running the [daemon.json cgroup fix](#daemonjson-cgroup-fix) below, then run `sudo systemctl restart docker`. |
| Gateway fails with "port 8080 is held by container..." | Another OpenShell gateway or container is using port 8080 | Stop the conflicting container: `openshell gateway destroy -g <old-gateway-name>` or `docker stop <container-name> && docker rm <container-name>`, then retry `nemoclaw onboard`. |
| Sandbox creation fails | Stale gateway state or DNS not propagated | Run `openshell gateway destroy && openshell gateway start`, then re-run the installer or `nemoclaw onboard`. |
| CoreDNS crash loop | Known issue on some DGX Spark configurations | Run `sudo ./scripts/fix-coredns.sh` from the NemoClaw repo directory. |
| CoreDNS crash loop | Known issue on some DGX Spark configurations | Re-run the NemoClaw installer (`curl -fsSL https://www.nvidia.com/nemoclaw.sh \| bash`) which includes the CoreDNS fix. If the issue persists, see [NemoClaw troubleshooting](https://docs.nvidia.com/nemoclaw/latest/reference/troubleshooting.html). |
| "No GPU detected" during onboard | DGX Spark GB10 reports unified memory differently | Expected on DGX Spark. The wizard still works and uses Ollama for inference. |
| Inference timeout or hangs | Ollama not running or not reachable | Check Ollama: `curl http://localhost:11434`. If not running: `ollama serve &`. If running but unreachable from sandbox, ensure Ollama is configured to listen on `0.0.0.0` (see Step 2 in Instructions). |
| Agent gives no response or is very slow | Normal for 120B model running locally | Nemotron 3 Super 120B can take 30--90 seconds per response. Verify inference route: `nemoclaw my-assistant status`. |
| Inference timeout or hangs | Ollama not running or not reachable | Check Ollama: `curl http://127.0.0.1:11434`. If not running: `sudo systemctl restart ollama`. Verify the NemoClaw auth proxy is healthy: `curl http://127.0.0.1:11435/api/tags`. If both respond, check `nemoclaw my-assistant status` for the Inference health line. |
| Agent gives no response or is very slow | First response can be slow, especially with larger models | Response time depends on model size (30B: a few seconds, 120B: 30-90 seconds). Verify inference route: `nemoclaw my-assistant status`. |
| Port 18789 already in use | Another process is bound to the port | `lsof -i :18789` then `kill <PID>`. If needed, `kill -9 <PID>` to force-terminate. |
| Web UI port forward dies or dashboard unreachable | Port forward not active | `openshell forward stop 18789 my-assistant` then `openshell forward start 18789 my-assistant --background`. |
| Web UI shows `origin not allowed` | Accessing via `localhost` instead of `127.0.0.1` | Use `http://127.0.0.1:18789/#token=...` in the browser. The gateway origin check requires `127.0.0.1` exactly. |
| Telegram bridge does not start | Missing environment variables | Ensure `TELEGRAM_BOT_TOKEN` and `SANDBOX_NAME` are set on the host. `SANDBOX_NAME` must match the sandbox name from onboarding. |
| Telegram bridge needs restart but `nemoclaw stop` does not work | Known bug in `nemoclaw stop` | Find the PID from the `nemoclaw start` output, force-kill with `kill -9 <PID>`, then run `nemoclaw start` again. |
| Telegram bot receives messages but does not reply | Telegram policy not added to sandbox | Run `nemoclaw my-assistant policy-add`, type `telegram`, hit Y. Then restart the bridge with `nemoclaw start`. |
| Telegram bridge does not start | Telegram channel not registered with sandbox | Run `nemoclaw <sandbox-name> channels add telegram` to register the bot token and rebuild the sandbox. Verify with `nemoclaw <sandbox-name> status`. |
| Telegram stops responding after sandbox rebuild | Telegram long-polling session stale after rebuild | Run `nemoclaw <sandbox-name> recover` to restart the gateway. If still unresponsive, run `nemoclaw <sandbox-name> channels add telegram` to re-register and rebuild. |
| Telegram bot receives messages but does not reply | Telegram network egress policy not added | Run `nemoclaw <sandbox-name> policy-add`, select `telegram`, and confirm. This is a hot-reload — no rebuild needed. |
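
The two `curl` checks in the "Inference timeout or hangs" row can also be scripted. The following is a minimal sketch, assuming the ports shown in that row (11434 for Ollama, 11435 for the NemoClaw auth proxy); the helper name `is_reachable` is hypothetical and not part of NemoClaw.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at `url`, even with an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered (for example with 401 or 404), so it is running.
        return True
    except OSError:
        # Connection refused, timeout, DNS failure, and similar.
        return False


if __name__ == "__main__":
    # Endpoints taken from the troubleshooting row above.
    checks = [
        ("Ollama", "http://127.0.0.1:11434"),
        ("NemoClaw auth proxy", "http://127.0.0.1:11435/api/tags"),
    ]
    for name, url in checks:
        state = "ok" if is_reachable(url) else "unreachable"
        print(f"{name}: {state}")
```

If both endpoints report `ok` and inference still hangs, the Inference health line in `nemoclaw my-assistant status` is the next place to look.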
#### daemon.json cgroup fix
Use this script as the fallback for the cgroup / "Failed to start ContainerManager" row above. It validates any existing `/etc/docker/daemon.json`, writes a `.bak` backup, sets `default-cgroupns-mode` to `host`, and atomically replaces the file. It exits non-zero with an error on stderr if anything fails, leaving the original `daemon.json` untouched.
```bash
sudo python3 - <<'PY'
import json, os, shutil, sys, tempfile
path = '/etc/docker/daemon.json'
try:
if os.path.exists(path):
with open(path) as f:
data = json.load(f)
if not isinstance(data, dict):
raise ValueError(f'{path} is not a JSON object')
else:
data = {}
except (json.JSONDecodeError, ValueError, OSError) as e:
print(f'error: failed to read {path}: {e}', file=sys.stderr)
sys.exit(1)
if os.path.exists(path):
try:
shutil.copy2(path, path + '.bak')
except OSError as e:
print(f'error: failed to back up {path}: {e}', file=sys.stderr)
sys.exit(1)
data['default-cgroupns-mode'] = 'host'
target_dir = os.path.dirname(path) or '/'
fd, tmp = tempfile.mkstemp(prefix='daemon.json.', dir=target_dir)
try:
with os.fdopen(fd, 'w') as f:
json.dump(data, f, indent=2)
f.write('\n')
os.chmod(tmp, 0o644)
os.replace(tmp, path)
except OSError as e:
if os.path.exists(tmp):
try:
os.unlink(tmp)
except OSError:
pass
print(f'error: failed to write {path}: {e}', file=sys.stderr)
sys.exit(1)
PY
```
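After running the fix and before restarting Docker, it can help to confirm that the key was actually written. This is a small read-only sketch, assuming the same `/etc/docker/daemon.json` path as the script above; the function name `cgroupns_mode` is hypothetical.

```python
import json


def cgroupns_mode(path: str = "/etc/docker/daemon.json"):
    """Return the configured default-cgroupns-mode, or None if unset or unreadable."""
    try:
        with open(path) as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        return None
    if not isinstance(data, dict):
        return None
    return data.get("default-cgroupns-mode")


if __name__ == "__main__":
    mode = cgroupns_mode()
    if mode == "host":
        print("default-cgroupns-mode is set to host")
    else:
        print(f"default-cgroupns-mode is not host (found: {mode!r})")
```

The check only reads the file, so it does not need `sudo` on systems where `daemon.json` is world-readable.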
> [!NOTE]
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

View File

@ -17,7 +17,7 @@
SGLang is a fast serving framework for large language models and vision language models that makes
your interaction with models faster and more controllable by co-designing the backend runtime and
frontend language. This setup uses the optimized NVIDIA SGLang NGC Container on a single NVIDIA
frontend language. This setup uses the optimized SGLang CUDA container on a single NVIDIA
Spark device with Blackwell architecture, providing GPU-accelerated inference with all dependencies
pre-installed.
@ -39,9 +39,9 @@ vision-language tasks using models like DeepSeek-V2-Lite.
- NVIDIA Spark device with Blackwell architecture
- Docker Engine installed and running: `docker --version`
- NVIDIA GPU drivers installed: `nvidia-smi`
- NVIDIA Container Toolkit configured: `docker run --rm --gpus all nvcr.io/nvidia/sglang:26.02-py3 nvidia-smi`
- NVIDIA Container Toolkit configured: `docker run --rm --gpus all lmsysorg/sglang@sha256:ceaf8b16e02d165143633ac228bbb994a05fe77d7e0526cf035ae4bbf4eacc36 nvidia-smi`
- Sufficient disk space (>20GB available): `df -h`
- Network connectivity for pulling NGC containers: `ping nvcr.io`
- Network connectivity for pulling containers: `docker pull lmsysorg/sglang@sha256:ceaf8b16e02d165143633ac228bbb994a05fe77d7e0526cf035ae4bbf4eacc36`
## Ancillary files
@ -103,7 +103,7 @@ docker --version
nvidia-smi
## Verify Docker GPU support
docker run --rm --gpus all nvcr.io/nvidia/sglang:26.02-py3 nvidia-smi
docker run --rm --gpus all lmsysorg/sglang@sha256:ceaf8b16e02d165143633ac228bbb994a05fe77d7e0526cf035ae4bbf4eacc36 nvidia-smi
## Check available disk space
df -h /
@ -124,7 +124,7 @@ several minutes depending on your network connection.
```bash
## Pull the SGLang container
docker pull nvcr.io/nvidia/sglang:26.02-py3
docker pull lmsysorg/sglang@sha256:ceaf8b16e02d165143633ac228bbb994a05fe77d7e0526cf035ae4bbf4eacc36
## Verify the image was downloaded
docker images | grep sglang
@ -140,7 +140,7 @@ server inside the container, exposing it on port 30000 for client connections.
docker run --gpus all -it --rm \
-p 30000:30000 \
-v /tmp:/tmp \
nvcr.io/nvidia/sglang:26.02-py3 \
lmsysorg/sglang@sha256:ceaf8b16e02d165143633ac228bbb994a05fe77d7e0526cf035ae4bbf4eacc36 \
bash
```
@ -237,7 +237,7 @@ docker ps | grep sglang | awk '{print $1}' | xargs docker stop
docker container prune -f
## Remove SGLang images (optional)
docker rmi nvcr.io/nvidia/sglang:26.02-py3
docker rmi lmsysorg/sglang@sha256:ceaf8b16e02d165143633ac228bbb994a05fe77d7e0526cf035ae4bbf4eacc36
```
## Step 10. Next steps

View File

@ -15,18 +15,16 @@
## Basic idea
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, midtraining (conversation format), supervised fine-tuning (SFT), and inference via CLI or web UI.
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.
The project uses the PyTorch NGC container, FineWeb for pretraining, SmolTalk for SFT, and Weights & Biases for logging. The default speedrun configuration trains a 561M-parameter (d20) model suitable for learning and experimentation.
The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision.
## What you'll accomplish
You will have a working nanochat setup that trains a small LLM and serves it for chat.
- **Environment:** Docker image with PyTorch and nanochat dependencies on your DGX Station.
- **Training pipeline:** Tokenizer (65K BPE), pretraining (~11.2B tokens), midtraining, SFT, and automated report generation.
- **Inference:** ChatGPT-style web UI and CLI to chat with the base, mid, or SFT checkpoints.
- **Monitoring:** W&B dashboards and `nanochat/report.md` with metrics and samples.
- **Environment:** Docker image with PyTorch NGC and nanochat dependencies on your DGX Station.
- **Training pipeline:** BPE tokenizer (65K vocab), base model pretraining with FP8, SFT, and automated report generation.
- **Inference:** ChatGPT-style web UI and CLI to chat with the base or SFT checkpoints.
- **Monitoring:** W&B dashboards and `nanochat_cache/report/report.md` with metrics and samples.
## What to know before starting
@ -38,104 +36,143 @@ You will have a working nanochat setup that trains a small LLM and serves it for
**Hardware:**
- NVIDIA DGX Station with GB300 Ultra Superchip.
- Sufficient GPU memory for the chosen model (the GB300 Ultra provides ample memory for the d20 speedrun).
- Adequate storage for cache (~24GB+ for FineWeb data and checkpoints).
- NVIDIA DGX Station with GB300 Ultra Superchip (288GB VRAM).
- Adequate storage for cache (~25GB+ for FineWeb data and checkpoints).
**Software:**
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi`
- Network access to download datasets (Hugging Face, FineWeb) and container images.
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`
- Network access to download datasets (Hugging Face, FineWeb) and container images (nvcr.io).
- [Weights & Biases](https://wandb.ai/) account and API key.
- [Hugging Face](https://huggingface.co/docs/hub/en/security-tokens) token for evaluation datasets.
## Model architecture (d24)
```
Layers: 24
Attention Heads: 12
Head Dimension: 128
Context Length: 2048 tokens
Vocabulary Size: 65,536 (2^16, trained BPE)
Precision: FP8 (e4m3, tensorwise scaling)
```
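To see how these numbers relate to the "~1B parameter" figure, here is a rough back-of-the-envelope estimate. It is a sketch under stated assumptions (a standard Transformer block with a 4x MLP expansion, and separate input and output embedding matrices), not nanochat's own parameter count, which may include additional components.

```python
# Values from the architecture block above.
n_layers = 24
n_heads = 12
head_dim = 128
vocab_size = 65_536

# Model width follows from heads x head dimension.
d_model = n_heads * head_dim

# Assumption: ~4*d^2 attention weights plus ~8*d^2 MLP weights per block.
block_params = 12 * n_layers * d_model**2

# Assumption: untied input embedding and output projection.
embedding_params = 2 * vocab_size * d_model

total = block_params + embedding_params
print(f"d_model    = {d_model}")
print(f"blocks     = {block_params / 1e6:.0f}M")
print(f"embeddings = {embedding_params / 1e6:.0f}M")
print(f"total      = {total / 1e9:.2f}B")
```

Under these assumptions the estimate comes to roughly 0.88B parameters, which is in the same range as the stated ~1B.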
## Training stages
| Stage | Description |
|-------|-------------|
| Tokenizer | Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb |
| Base pretraining | Pretrains d24 model on FineWeb with FP8, target data:param ratio of 8 |
| SFT | Fine-tunes on synthetic identity conversations + SmolTalk |
| Report | Generates `report.md` with metrics, samples, and system info |
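
The "data:param ratio of 8" in the pretraining row sets the token budget as a multiple of the parameter count. A minimal worked example, assuming the nominal ~1B parameter figure from this playbook (the exact count that nanochat uses for the ratio may differ); `token_budget` is a hypothetical helper:

```python
def token_budget(n_params: float, ratio: float = 8.0) -> float:
    """Number of training tokens implied by a data:param ratio."""
    return ratio * n_params


nominal_params = 1e9  # "~1B" as stated in this playbook; an approximation
tokens = token_budget(nominal_params)
print(f"{tokens / 1e9:.0f}B tokens")
```

With these inputs the budget is about 8B tokens.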
## Ancillary files
All required assets are in the playbook directory `nvidia/station-nanochat/assets` (see the [dgx-spark-playbooks](https://github.com/NVIDIA/dgx-spark-playbooks) repository).
- `assets/Dockerfile`: PyTorch NGC image plus nanochat dependencies and venv.
- `assets/setup.sh`: Clones nanochat, checks out the supported commit, and builds the Docker image.
- `assets/launch.sh`: Runs the training container on your DGX Station (runs the full pipeline: tokenizer, pretrain, midtrain, SFT, and report generation).
- `assets/README.md`: Additional detail on training stages, inference, and troubleshooting.
All required assets are in `nvidia/station-nanochat/assets/`:
- `Dockerfile`: PyTorch NGC image with nanochat pip dependencies.
- `setup.sh`: Clones nanochat, checks out the supported commit, copies `speedrun_station.sh`, and builds the Docker image.
- `launch.sh`: Runs the training container (full pipeline: tokenizer → pretrain → SFT → report).
- `speedrun_station.sh`: Modified speedrun script adapted for single-GPU DGX Station.
## Time & risk
- **Estimated time:** About 30 minutes for clone, setup, and launching the run. Full d20 speedrun training time depends on your DGX Station configuration (hours to a day or more).
- **Risk level:** Medium
- Large downloads (FineWeb) can fail or be slow; ensure stable network and disk space.
- API keys (W&B, HF) must be set or the launch script will exit.
- **Rollback:** Stop containers with `docker stop`, remove caches under `~/.cache/nanochat` (or paths in `launch.sh`), and run `docker system prune -a` if needed.
* **Last Updated:** 03/02/2026
* First Publication
- **Estimated time:** ~30 minutes for setup. Full d24 training takes on the order of 16+ hours on a single GB300 Ultra.
- **Risk level:** Medium
- Large downloads (FineWeb) can be slow; ensure stable network and disk space.
- API keys (W&B, HF) must be set or `launch.sh` will exit immediately.
- **Rollback:** Stop containers with `docker stop`, remove caches, and run `docker system prune -a` if needed.
## Credits
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy
- [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) by HuggingFace (pretraining data)
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) by HuggingFace (SFT data)
## Instructions
## Step 1. Prerequisites and environment
This playbook is for **DGX Station** (single node). Ensure your DGX Station has Docker with NVIDIA runtime, GPU access, and required API keys. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
Ensure your DGX Station has Docker with NVIDIA runtime and GPU access. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
```bash
## Verify GPU and Docker
nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi
```
Expected output should show your GPU(s) and driver version. Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you do not have them.
Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you don't have them. Export both keys in your shell:
```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
```
## Step 2. Clone the playbook and set up nanochat
## Step 2. Clone and set up
Clone the playbook repository and run the setup script to clone the nanochat repo and build the Docker image.
Clone the playbook repository and navigate to the assets directory:
```bash
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-nanochat/assets
```
From the `assets` directory, run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, and builds the `nanochat` Docker image (PyTorch NGC base with tiktoken, tokenizers, datasets, wandb, etc.).
Run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, copies the station-adapted `speedrun_station.sh`, and builds the `nanochat` Docker image (PyTorch NGC base with dependencies):
```bash
./setup.sh
```
Setup may take several minutes while the image builds. Verify the image:
You should see the `nanochat` image listed if you run `docker images`. Your directory structure after setup should look like this:
```bash
docker images | grep nanochat
```
assets/
├── Dockerfile
├── launch.sh
├── setup.sh
├── speedrun_station.sh
└── nanochat/
```
You should see the `nanochat` image listed.
## Step 3. Launch training
## Step 3. Launch full training
> [!NOTE]
> The default `launch.sh` uses cache directories under `/nanochat_cache`. If that path does not exist on your DGX Station, edit `launch.sh` and replace those paths with your own (e.g. `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`), and create the directories before running.
To run **full** training (d20 model, 240 shards, midtraining, SFT, report) for higher-quality results, use the full launcher. On a DGX Station with GB300 Ultra this can take on the order of 16 hours:
Ensure your API keys are exported, then launch:
```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
./launch_full.sh
./launch.sh
```
This runs `speedrun_full.sh` inside the container: full FineWeb download (240 shards), 561M-parameter (d20) pretraining, midtraining, supervised fine-tuning, and report generation.
The training runs inside the `nanochat` container and executes the full pipeline automatically:
## Step 4. Verify and use the model
1. **Tokenizer training** — downloads ~2B characters from FineWeb, trains a 65K BPE tokenizer
2. **Base model pretraining** — downloads additional FineWeb shards, pretrains a d24 model (~1B params) with FP8
3. **SFT** — downloads synthetic identity conversations, fine-tunes for chat
4. **Report generation** — produces `report.md` with metrics and samples
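The FP8 pretraining in step 2 relies on scaling tensors into the narrow FP8 range. As a rough illustration of tensorwise scaling in general (not nanochat's or PyTorch's actual implementation), a single scale is chosen per tensor so that its largest magnitude maps to the largest finite E4M3 value, 448. The function name and sample values below are hypothetical.

```python
E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 (FN) format


def tensorwise_scale(values):
    """One scale for the whole tensor so its largest magnitude maps to E4M3_MAX."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 1.0  # all-zero tensor: any scale works
    return E4M3_MAX / amax


weights = [0.02, -1.75, 0.5, 3.5]  # made-up sample values
scale = tensorwise_scale(weights)
scaled = [w * scale for w in weights]  # these would then be cast to FP8
print(scale, max(abs(s) for s in scaled))
```

Here the scale is 128.0 and the largest scaled magnitude is exactly 448.0; the scale is kept alongside the tensor so results can be rescaled after the FP8 matmul.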
After training completes, checkpoints and the tokenizer are under `~/.cache/nanochat/` (or the cache path used in `launch.sh`). Run inference from the nanochat directory (e.g. `assets/nanochat`) on your DGX Station.
Training on a single GB300 Ultra takes on the order of 16+ hours for the full d24 run.
## Step 4. Monitor training
**W&B dashboard:**
Track training at [wandb.ai](https://wandb.ai/) under the `nanochat` project. The exact link to the W&B run is printed in the training logs. Key metrics:
- Training loss
- Validation BPB
- Throughput (tokens/sec)
## Step 5. Inference
After training, checkpoints are saved under the `nanochat_cache/` directory. Run inference from inside the container or interactively:
**Web UI (recommended):**
```bash
cd nanochat
source ../.venv/bin/activate # if using venv from container context; otherwise use the container
python -m scripts.chat_web
docker run --rm --gpus all --net=host \
-v $(pwd)/nanochat:/workspace/nanochat \
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
-w /workspace/nanochat \
nanochat \
python -m scripts.chat_web
```
Open a browser to `http://<STATION_IP>:8000` where `<STATION_IP>` is your DGX Station's IP address.
@ -143,14 +180,15 @@ Open a browser to `http://<STATION_IP>:8000` where `<STATION_IP>` is your DGX St
**CLI:**
```bash
cd nanochat
python -m scripts.chat_cli -p "Why is the sky blue?"
python -m scripts.chat_cli -i sft -p "Write a haiku about machine learning"
docker run --rm -it --gpus all \
-v $(pwd)/nanochat:/workspace/nanochat \
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
-w /workspace/nanochat \
nanochat \
python -m scripts.chat_cli -p "Why is the sky blue?"
```
A full report is generated at `nanochat/report.md` after the run. You can also monitor training at [wandb.ai](https://wandb.ai/) under your project.
## Step 5. Cleanup
## Step 6. Cleanup
To stop training early, interrupt the launch script or stop the container:
@ -160,32 +198,43 @@ To stop training early, interrupt the launch script or stop the container:
```bash
## If launch.sh is running: press Ctrl+C
## Or stop the container by name
## Or stop the container directly
docker stop $(docker ps -q --filter ancestor=nanochat)
```
To free disk space after training (use the same path as your cache if you set `NANOCHAT_CACHE`):
To free disk space:
```bash
rm -rf ./nanochat_cache ./hf_cache
docker system prune -a
```
## Step 6. Next steps and customization
## Step 7. Customization
- **Small scale run:** To run a lighter training job, edit `speedrun_station.sh` as described in the customization guide, then run `./launch.sh`. This can reduce training time.
- **Custom cache paths:** Set `NANOCHAT_CACHE` and `HF_CACHE` before launching (e.g. `export NANOCHAT_CACHE=/path/to/nanochat_cache`) if you want cache outside the assets directory.
- **Monitoring:** Use `nvidia-smi` and W&B dashboards to watch GPU utilization and training metrics (loss, throughput).
- **Inference:** Try the web UI and CLI with different checkpoints (`base`, `mid`, `sft`) and prompts; see sample prompts in `assets/README.md`.
**Smaller/faster run:** Edit `speedrun_station.sh` before running setup to reduce data and model size:
```bash
## Fewer data shards (10 instead of default)
python -m nanochat.dataset -n 10 &
## Smaller model (d4 instead of d24), smaller batch size
python -m scripts.base_train --depth=4 --device-batch-size=32
```
**Batch size:** The default `--device-batch-size=64` is tuned for the GB300's 288GB VRAM. Lower it if training runs out of memory, or raise it if GPU utilization is low.
Then re-run `./setup.sh` to rebuild with the changes.
## Troubleshooting
| Symptom | Cause | Fix |
|--------|--------|-----|
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<your_key>` and `export HF_TOKEN=<your_token>` in the same shell, then run `./launch.sh`. |
| `RuntimeError: CUDA out of memory` | Batch size or model too large for GPU | In the training script in the cloned nanochat repo (e.g. `speedrun.sh`), reduce `--device_batch_size` (e.g. `16`, `8`, `4`, `2`, or `1`). |
| Docker container not starting or no GPU | Docker or NVIDIA runtime misconfigured | Run `nvidia-smi` on your DGX Station. Check no other containers hold GPUs: `docker ps`. Test GPU in Docker: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`. |
| `Permission denied` or `No such file or directory` for cache paths in `launch.sh` | Paths like `/home/scratch.lramesh_dpt/...` don't exist on your system | Edit `launch.sh`: set cache dirs to paths you can create (e.g. `$(pwd)/nanochat_cache`, `$(pwd)/hf_cache`). Run `mkdir -p <your_cache_dirs>` and re-run `launch.sh`. |
| `nanochat` image not found when running `launch.sh` | Setup not run or build failed | From `nvidia/station-nanochat/assets`, run `./setup.sh` and confirm with `docker images` (look for the `nanochat` image). |
| Training exits immediately or script doesn't wait | Container fails early (missing keys, paths, or OOM) | Check container logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars, cache paths, or batch size as above. |
| Wrong cache path or "No such file" when launching | `launch.sh` uses non-existent paths (e.g. `/home/scratch...`) | On DGX Station, edit `launch.sh`: replace cache dirs with `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`, then run `mkdir -p nanochat_cache hf_cache`. |
|---------|-------|-----|
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<key>` and `export HF_TOKEN=<token>` in the same shell, then re-run `./launch.sh` |
| `RuntimeError: CUDA out of memory` | Batch size too large for available VRAM | Edit `speedrun_station.sh`: reduce `--device-batch-size` below the default of 64 (try 32, 16, 8). Re-run `./setup.sh` then `./launch.sh` |
| Docker container exits immediately | Missing env vars, bad cache paths, or build failure | Check logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars or paths as needed |
| `nanochat` image not found | Setup not run or Docker build failed | From the `assets/` directory, run `./setup.sh` and confirm with `docker images \| grep nanochat` |
| `No such file or directory` for cache paths | Cache directories don't exist | `launch.sh` creates them automatically under `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`. If using custom paths, create them: `mkdir -p $NANOCHAT_CACHE $HF_CACHE` |
| Training hangs at "Waiting for dataset download" | Network issue downloading FineWeb shards | Check network connectivity. The download can take time depending on bandwidth. If it persists, restart `./launch.sh` |
| W&B shows wrong user / stale login | Cached W&B credentials in container volume | `speedrun_station.sh` runs `wandb login --relogin` with your key automatically. Ensure `WANDB_API_KEY` is correct |
| Container runs but `launch.sh` says "Training complete!" immediately | Container failed fast and exited before the poll loop detected it | Check `docker ps -a` for the exited container and inspect logs with `docker logs <id>` |
| GPU not visible inside container | Docker NVIDIA runtime not configured | Test: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`. If it fails, install/configure [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) |

View File

@ -1,11 +1,15 @@
FROM nvcr.io/nvidia/pytorch:25.09-py3
FROM nvcr.io/nvidia/pytorch:26.04-py3
WORKDIR /workspace
# Install dependencies globally so torchrun (which uses /usr/bin/python) can access them
RUN /usr/bin/python -m pip install tiktoken tokenizers datasets psutil files-to-prompt regex setuptools uvicorn wandb maturin
# Create venv with --system-site-packages so it inherits global packages
RUN /usr/bin/python -m venv --system-site-packages .venv
RUN pip install \
datasets \
tokenizers \
wandb \
tiktoken \
psutil \
files-to-prompt \
uvicorn \
rustbpe
CMD ["/bin/bash"]

View File

@ -3,7 +3,6 @@
# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Lite training (default). Runs speedrun.sh, which setup copies from speedrun_lite.sh.
# Get wandb API key
export WANDB_API_KEY=$WANDB_API_KEY
@ -11,7 +10,6 @@ if [ -z "$WANDB_API_KEY" ]; then
echo "WANDB_API_KEY is not set"
exit 1
fi
export WANDB_RUN=${WANDB_RUN:-speedrun}
# Get Hugging Face API key
@ -21,26 +19,23 @@ if [ -z "$HF_TOKEN" ]; then
exit 1
fi
# Cleanup function to stop containers
# Use local cache dirs so no root paths are required
workdir=$(pwd)
NANOCHAT_CACHE="$(pwd)/nanochat_cache"
HF_CACHE="$(pwd)/hf_cache"
cleanup() {
echo
echo "Stopping containers..."
docker stop $(docker ps -q --filter ancestor=nanochat) 2>/dev/null || true
echo "Interrupted training!"
echo -e "\nStopping training container..."
docker stop $(docker ps -q --filter ancestor=nanochat) 2>/dev/null
echo "Cleanup complete."
exit 0
}
workdir=$(pwd)
# DGX Station: use local cache dirs so no root paths are required
NANOCHAT_CACHE="${NANOCHAT_CACHE:-$(pwd)/nanochat_cache}"
HF_CACHE="${HF_CACHE:-$(pwd)/hf_cache}"
mkdir -p "$NANOCHAT_CACHE" "$HF_CACHE"
trap cleanup SIGINT SIGTERM
cmd="
mkdir -p /nanochat_cache && \
mkdir -p /hf_cache && \
chmod 777 /nanochat_cache && \
chmod 777 /hf_cache && \
# Launch Nanochat training
cmd="mkdir -p $NANOCHAT_CACHE $HF_CACHE && \
chmod u+rwx $NANOCHAT_CACHE $HF_CACHE && \
docker run \
--rm \
--runtime=nvidia \
@ -57,16 +52,8 @@ docker run \
-v $HF_CACHE:/root/.cache/huggingface \
-w /workspace/nanochat \
nanochat \
bash speedrun.sh"
bash runs/speedrun.sh"
sh -c "$cmd" &
sleep 5
while true; do
if ! docker ps | grep -q "nanochat"; then
echo
echo "Training complete!"
exit 0
fi
sleep 1
done
wait
echo -e "\nTraining complete!"

View File

@ -11,10 +11,10 @@ assets_dir="$(cd "$(dirname "$0")" && pwd)"
cmd="cd $workdir && \
git clone https://github.com/karpathy/nanochat.git && \
cd nanochat && \
git checkout c6b7ab744055d5915e6ccb61088de80c10cbaff9 && \
cp ../speedrun_spark.sh ./speedrun.sh && \
git checkout 0aaca56805eb13f6e6e1fff789a08086902f12ab && \
cp ../speedrun_station.sh ./runs/speedrun.sh && \
cd .. && \
chmod +x launch_full.sh 2>/dev/null || true && \
chmod +x launch.sh 2>/dev/null || true && \
docker build -t nanochat ."
sh -c "$cmd"

View File

@ -1,15 +1,14 @@
#!/bin/bash
set -e
# This script is the "Best ChatGPT clone that $100 can buy",
# It is designed to run in ~4 hours on 8XH100 node at $3/GPU/hour.
# This script is configured to train your own GPT-2 grade LLM (pretraining + finetuning)
# It is designed to run on a blank 8XH100 GPU node and takes approximately 3 hours to complete.
# 1) Example launch (simplest):
# bash speedrun.sh
# 2) Example launch in a screen session (because the run takes ~4 hours):
# screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh
# bash runs/speedrun.sh
# 2) Example launch in a screen session (because the run takes ~3 hours):
# screen -L -Logfile runs/speedrun.log -S speedrun bash runs/speedrun.sh
# 3) Example launch with wandb logging, but see below for setting up wandb first:
# WANDB_RUN=speedrun screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh
# WANDB_RUN=speedrun screen -L -Logfile runs/speedrun.log -S speedrun bash runs/speedrun.sh
# Default intermediate artifacts directory is in ~/.cache/nanochat
export OMP_NUM_THREADS=1
@ -26,7 +25,7 @@ mkdir -p $NANOCHAT_BASE_DIR
# install the repo dependencies
# uv sync --extra gpu
# activate venv so that `python` uses the project's venv instead of system python
source ../.venv/bin/activate
# source .venv/bin/activate
# -----------------------------------------------------------------------------
# wandb setup
@ -49,70 +48,41 @@ python -m nanochat.report reset
# -----------------------------------------------------------------------------
# Tokenizer
# Install Rust / Cargo
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"
# Build the rustbpe Tokenizer
# unset VIRTUAL_ENV
maturin develop --release --manifest-path rustbpe/Cargo.toml
# Download the first ~2B characters of pretraining dataset
# look at dev/repackage_data_reference.py for details on how this data was prepared
# each data shard is ~250M chars
# so we download 2e9 / 250e6 = 8 data shards at this point
# each shard is ~100MB of text (compressed), so this is about ~800MB of data on disk
# look at dev/repackage_data_reference.py for details on how this data was prepared
python -m nanochat.dataset -n 8
# Immediately also kick off downloading more shards in the background while tokenizer trains
# See comment below for why 240 is the right number here
python -m nanochat.dataset -n 240 &
# Approximately 150 shards are needed for GPT-2 capability pretraining, add 20 for padding.
# The maximum total number of shards available in the entire dataset is 6542.
python -m nanochat.dataset -n 170 &
DATASET_DOWNLOAD_PID=$!
# train the tokenizer with vocab size 2**16 = 65536 on ~2B characters of data
python -m scripts.tok_train --max_chars=2000000000
# train the tokenizer with vocab size 2**15 = 32768 on ~2B characters of data
python -m scripts.tok_train
# evaluate the tokenizer (report compression ratio etc.)
python -m scripts.tok_eval
# -----------------------------------------------------------------------------
# Base model (pretraining)
# The d20 model is 561M parameters.
# Chinchilla says #tokens = 20X #params, so we need 561e6 * 20 = 11.2B tokens.
# Assume our tokenizer is 4.8 chars/token, this is 11.2B * 4.8 ~= 54B chars.
# At 250M chars/shard, this is 54B / 250M ~= 216 shards needed for pretraining.
# Round up to 240 for safety. At ~100MB/shard, this downloads ~24GB of data to disk.
# (The total number of shards available in the entire dataset is 1822.)
echo "Waiting for dataset download to complete..."
wait $DATASET_DOWNLOAD_PID
source ../.venv/bin/activate
# pretrain the d20 model
python -m scripts.base_train --depth=20 --run=$WANDB_RUN
# evaluate the model on a larger chunk of train/val data and draw some samples
python -m scripts.base_loss
# evaluate the model on CORE tasks
python -m scripts.base_eval
sleep 5
# d24 model (slightly undertrained to beat GPT-2 => decrease data:params ratio from compute optimal 10.5 (default) to 8)
python -m scripts.base_train --depth=24 --target-param-data-ratio=8 --device-batch-size=64 --fp8 --run=$WANDB_RUN
# evaluate the model: CORE metric, BPB on train/val, and draw samples
python -m scripts.base_eval --device-batch-size=64
# -----------------------------------------------------------------------------
# Midtraining (teach the model conversation special tokens, tool use, multiple choice)
# SFT (teach the model conversation special tokens, tool use, multiple choice)
# download 2.3MB of synthetic identity conversations to impart a personality to nanochat
# see dev/gen_sft_data.py for details on how this data was prepared and to get a sense of how you can easily tune it
# see dev/gen_synthetic_data.py for details on how this data was prepared and to get a sense of how you can easily tune it
curl -L -o $NANOCHAT_BASE_DIR/identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl
# run midtraining and eval the model
python -m scripts.mid_train --run=$WANDB_RUN
python -m scripts.chat_eval -i mid
sleep 5
# -----------------------------------------------------------------------------
# Supervised Finetuning (domain adaptation to each sequence all by itself per row)
# train sft and re-eval right away (should see a small bump)
python -m scripts.chat_sft --run=$WANDB_RUN
# run SFT and eval the model
python -m scripts.chat_sft --device-batch-size=64 --run=$WANDB_RUN
python -m scripts.chat_eval -i sft
# chat with the model over CLI! Leave out the -p to chat interactively
@ -121,15 +91,6 @@ python -m scripts.chat_eval -i sft
# even better, chat with your model over a pretty WebUI ChatGPT style
# python -m scripts.chat_web
# -----------------------------------------------------------------------------
# Reinforcement Learning. Optional, and currently only on GSM8K
# (optional)
# run reinforcement learning
# python -m scripts.chat_rl --run=$WANDB_RUN
# eval the RL model only on GSM8K
# python -m scripts.chat_eval --i rl -a GSM8K
# -----------------------------------------------------------------------------
# Generate the full report by putting together all the sections
# report.md is the output and will be copied to current directory for convenience

View File

@ -45,18 +45,16 @@ spec:
content: |
# Basic idea
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, midtraining (conversation format), supervised fine-tuning (SFT), and inference via CLI or web UI.
This playbook demonstrates training of [nanochat](https://github.com/karpathy/nanochat) on DGX Station with the GB300 Ultra Superchip. You run the full pipeline on a single system: custom BPE tokenizer training, base model pretraining, supervised fine-tuning (SFT), and inference via CLI or web UI.
The project uses the PyTorch NGC container, FineWeb for pretraining, SmolTalk for SFT, and Weights & Biases for logging. The default speedrun configuration trains a 561M-parameter (d20) model suitable for learning and experimentation.
The project uses the PyTorch NGC container, FineWeb for pretraining data, SmolTalk for SFT, and Weights & Biases for logging. The default configuration trains a ~1B parameter (d24) Transformer model with FP8 precision.
# What you'll accomplish
You will have a working nanochat setup that trains a small LLM and serves it for chat.
- **Environment:** Docker image with PyTorch and nanochat dependencies on your DGX Station.
- **Training pipeline:** Tokenizer (65K BPE), pretraining (~11.2B tokens), midtraining, SFT, and automated report generation.
- **Inference:** ChatGPT-style web UI and CLI to chat with the base, mid, or SFT checkpoints.
- **Monitoring:** W&B dashboards and `nanochat/report.md` with metrics and samples.
- **Environment:** Docker image with PyTorch NGC and nanochat dependencies on your DGX Station.
- **Training pipeline:** BPE tokenizer (65K vocab), base model pretraining with FP8, SFT, and automated report generation.
- **Inference:** ChatGPT-style web UI and CLI to chat with the base or SFT checkpoints.
- **Monitoring:** W&B dashboards and `nanochat_cache/report/report.md` with metrics and samples.
# What to know before starting
@ -68,36 +66,58 @@ spec:
**Hardware:**
- NVIDIA DGX Station with GB300 Ultra Superchip.
- Sufficient GPU memory for the chosen model (the GB300 Ultra provides ample memory for the d20 speedrun).
- Adequate storage for cache (~24GB+ for FineWeb data and checkpoints).
- NVIDIA DGX Station with GB300 Ultra Superchip (288GB VRAM).
- Adequate storage for cache (~25GB+ for FineWeb data and checkpoints).
**Software:**
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi`
- Network access to download datasets (Hugging Face, FineWeb) and container images.
- Docker with NVIDIA Container Toolkit: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`
- Network access to download datasets (Hugging Face, FineWeb) and container images (nvcr.io).
- [Weights & Biases](https://wandb.ai/) account and API key.
- [Hugging Face](https://huggingface.co/docs/hub/en/security-tokens) token for evaluation datasets.
# Model architecture (d24)
```
Layers: 24
Attention Heads: 12
Head Dimension: 128
Context Length: 2048 tokens
Vocabulary Size: 65,536 (2^16, trained BPE)
Precision: FP8 (e4m3, tensorwise scaling)
```
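As a rough cross-check of the "~1B parameter" figure, the values above can be plugged into the common GPT-style estimate of about 12·d² weights per layer plus the embedding tables. This is only a back-of-the-envelope sketch: `estimate_params` is a hypothetical helper, the model width is assumed to be heads × head dimension, and it assumes untied input/output embeddings with no extra per-layer parameters, so nanochat's own count may differ.

```bash
# Rough parameter estimate for a GPT-style model (assumption: 12*d^2 weights
# per layer, plus separate input and output embedding tables).
estimate_params() {
  local layers="$1" heads="$2" head_dim="$3" vocab="$4"
  local d=$((heads * head_dim))            # model width
  local blocks=$((12 * layers * d * d))    # attention (4d^2) + MLP (8d^2) per layer
  local embed=$((2 * vocab * d))           # token embedding + output head
  echo $((blocks + embed))
}

estimate_params 24 12 128 65536   # d24 values from the block above
```

Under these assumptions the estimate comes to about 0.88B parameters, in the same ballpark as the ~1B quoted for d24.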
# Training stages
| Stage | Description |
|-------|-------------|
| Tokenizer | Trains BPE tokenizer (65K vocab) on ~2B characters from FineWeb |
| Base pretraining | Pretrains d24 model on FineWeb with FP8, target data:param ratio of 8 |
| SFT | Fine-tunes on synthetic identity conversations + SmolTalk |
| Report | Generates `report.md` with metrics, samples, and system info |
# Ancillary files
All required assets are in the playbook directory `nvidia/station-nanochat/assets` (see the [dgx-spark-playbooks](https://github.com/NVIDIA/dgx-spark-playbooks) repository).
- `assets/Dockerfile`: PyTorch NGC image plus nanochat dependencies and venv.
- `assets/setup.sh`: Clones nanochat, checks out the supported commit, and builds the Docker image.
- `assets/launch.sh`: Runs the training container on your DGX Station (runs the full pipeline: tokenizer, pretrain, midtrain, SFT, and report generation).
- `assets/README.md`: Additional detail on training stages, inference, and troubleshooting.
All required assets are in `nvidia/station-nanochat/assets/`:
- `Dockerfile`: PyTorch NGC image with nanochat pip dependencies.
- `setup.sh`: Clones nanochat, checks out the supported commit, copies `speedrun_station.sh`, and builds the Docker image.
- `launch.sh`: Runs the training container (full pipeline: tokenizer → pretrain → SFT → report).
- `speedrun_station.sh`: Modified speedrun script adapted for single-GPU DGX Station.
# Time & risk
- **Estimated time:** About 30 minutes for clone, setup, and launching the run. Full d20 speedrun training time depends on your DGX Station configuration (hours to a day or more).
- **Risk level:** Medium
- Large downloads (FineWeb) can fail or be slow; ensure stable network and disk space.
- API keys (W&B, HF) must be set or the launch script will exit.
- **Rollback:** Stop containers with `docker stop`, remove caches under `~/.cache/nanochat` (or paths in `launch.sh`), and run `docker system prune -a` if needed.
* **Last Updated:** 03/02/2026
* First Publication
- **Estimated time:** ~30 minutes for setup. Full d24 training takes on the order of 16+ hours on a single GB300 Ultra.
- **Risk level:** Medium
- Large downloads (FineWeb) can be slow; ensure stable network and disk space.
- API keys (W&B, HF) must be set or `launch.sh` will exit immediately.
- **Rollback:** Stop containers with `docker stop`, remove caches, and run `docker system prune -a` if needed.
# Credits
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy
- [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) by HuggingFace (pretraining data)
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) by HuggingFace (SFT data)
@ -108,69 +128,86 @@ spec:
content: |
# Step 1. Prerequisites and environment
This playbook is for **DGX Station** (single node). Ensure your DGX Station has Docker with NVIDIA runtime, GPU access, and required API keys. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
Ensure your DGX Station has Docker with NVIDIA runtime and GPU access. Nanochat uses Weights & Biases (W&B) for training visualization and a Hugging Face token for evaluation datasets.
```bash
# Verify GPU and Docker
nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.01-py3 nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi
```
Expected output should show your GPU(s) and driver version. Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you do not have them.
Create a [W&B account](https://wandb.ai/) and a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) if you don't have them. Export both keys in your shell:
```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
```
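Because `launch.sh` exits immediately when either key is missing, it can help to check them up front. The following is a minimal sketch; `require_env` is a hypothetical helper and not part of the playbook assets.

```bash
# Return non-zero and name the first variable that is unset or empty.
# Arguments must be valid shell variable names.
require_env() {
  local name value
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set" >&2
      return 1
    fi
  done
}

require_env WANDB_API_KEY HF_TOKEN && echo "API keys are set"
```

Running this in the same shell you will launch from confirms the exports are visible to `launch.sh`.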
# Step 2. Clone the playbook and set up nanochat
# Step 2. Clone and set up
Clone the playbook repository and run the setup script to clone the nanochat repo and build the Docker image.
Clone the playbook repository and navigate to the assets directory:
```bash
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/station-nanochat/assets
```
From the `assets` directory, run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, and builds the `nanochat` Docker image (PyTorch NGC base with tiktoken, tokenizers, datasets, wandb, etc.).
Run the setup script. It clones [nanochat](https://github.com/karpathy/nanochat), checks out the supported commit, copies the station-adapted `speedrun_station.sh`, and builds the `nanochat` Docker image (PyTorch NGC base with dependencies):
```bash
./setup.sh
```
Setup may take several minutes while the image builds. Verify the image:
You should see the `nanochat` image listed if you run `docker images`. Your directory structure after setup should look like this:
```bash
docker images | grep nanochat
```
assets/
├── Dockerfile
├── launch.sh
├── setup.sh
├── speedrun_station.sh
└── nanochat/
```
You should see the `nanochat` image listed.
# Step 3. Launch training
# Step 3. Launch full training
> [!NOTE]
> The default `launch.sh` uses cache directories under `/nanochat_cache`. If that path does not exist on your DGX Station, edit `launch.sh` and replace those paths with your own (e.g. `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`), and create the directories before running.
To run **full** training (d20 model, 240 shards, midtraining, SFT, report) for higher-quality results, use the full launcher. On a DGX Station with GB300 Ultra this can take on the order of 16 hours:
Ensure your API keys are exported, then launch:
```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
./launch_full.sh
./launch.sh
```
This runs `speedrun_full.sh` inside the container: full FineWeb download (240 shards), 561M-parameter (d20) pretraining, midtraining, supervised fine-tuning, and report generation.
The training runs inside the `nanochat` container and executes the full pipeline automatically:
# Step 4. Verify and use the model
1. **Tokenizer training** — downloads ~2B characters from FineWeb, trains a 65K BPE tokenizer
2. **Base model pretraining** — downloads additional FineWeb shards, pretrains a d24 model (~1B params) with FP8
3. **SFT** — downloads synthetic identity conversations, fine-tunes for chat
4. **Report generation** — produces `report.md` with metrics and samples
After training completes, checkpoints and the tokenizer are under `~/.cache/nanochat/` (or the cache path used in `launch.sh`). Run inference from the nanochat directory (e.g. `assets/nanochat`) on your DGX Station.
Training on a single GB300 Ultra takes on the order of 16+ hours for the full d24 run.
# Step 4. Monitor training
**W&B dashboard:**
Track training at [wandb.ai](https://wandb.ai/) under the `nanochat` project. The exact link to the W&B run is printed in the training logs. Key metrics:
- Training loss
- Validation BPB
- Throughput (tokens/sec)
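Alongside W&B, GPU utilization can be sampled on the host with `nvidia-smi`. A small sketch, assuming the CSV query output has the form `utilization, memory used, memory total`; the `summarize_gpu` helper is illustrative only.

```bash
# Turn one "util, used_MiB, total_MiB" CSV line into a readable summary.
summarize_gpu() {
  awk -F', *' '{
    pct = ($3 > 0) ? 100 * $2 / $3 : 0   # avoid dividing by zero
    printf "util=%s%% mem=%s/%s MiB (%.0f%%)\n", $1, $2, $3, pct
  }'
}

# Sample once if nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
    --format=csv,noheader,nounits | summarize_gpu
fi
```

Wrapping the call in `watch -n 5` or a simple loop gives a running view while training is in progress.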
# Step 5. Inference
After training, checkpoints are saved under the `nanochat_cache/` directory. Run inference from inside the container or interactively:
**Web UI (recommended):**
```bash
cd nanochat
source ../.venv/bin/activate # if using venv from container context; otherwise use the container
python -m scripts.chat_web
docker run --rm --gpus all --net=host \
-v $(pwd)/nanochat:/workspace/nanochat \
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
-w /workspace/nanochat \
nanochat \
python -m scripts.chat_web
```
Open a browser to `http://<STATION_IP>:8000` where `<STATION_IP>` is your DGX Station's IP address.
@ -178,14 +215,15 @@ spec:
**CLI:**
```bash
cd nanochat
python -m scripts.chat_cli -p "Why is the sky blue?"
python -m scripts.chat_cli -i sft -p "Write a haiku about machine learning"
docker run --rm -it --gpus all \
-v $(pwd)/nanochat:/workspace/nanochat \
-v $(pwd)/nanochat_cache:/root/.cache/nanochat \
-w /workspace/nanochat \
nanochat \
python -m scripts.chat_cli -p "Why is the sky blue?"
```
A full report is generated at `nanochat/report.md` after the run. You can also monitor training at [wandb.ai](https://wandb.ai/) under your project.
# Step 5. Cleanup
# Step 6. Cleanup
To stop training early, interrupt the launch script or stop the container:
@ -195,23 +233,32 @@ spec:
```bash
# If launch.sh is running: press Ctrl+C
# Or stop the container by name
# Or stop the container directly
docker stop $(docker ps -q --filter ancestor=nanochat)
```
To free disk space after training (use the same path as your cache if you set `NANOCHAT_CACHE`):
To free disk space:
```bash
rm -rf ./nanochat_cache ./hf_cache
docker system prune -a
```
# Step 6. Next steps and customization
# Step 7. Customization
- **Small scale run:** `./launch.sh` can run a lite training by following the customization guide to make changes to `speedrun_station.sh`. This can potentially bring down the training time.
- **Custom cache paths:** Set `NANOCHAT_CACHE` and `HF_CACHE` before launching (e.g. `export NANOCHAT_CACHE=/path/to/nanochat_cache`) if you want cache outside the assets directory.
- **Monitoring:** Use `nvidia-smi` and W&B dashboards to watch GPU utilization and training metrics (loss, throughput).
- **Inference:** Try the web UI and CLI with different checkpoints (`base`, `mid`, `sft`) and prompts; see sample prompts in `assets/README.md`.
**Smaller/faster run:** Edit `speedrun_station.sh` before running setup to reduce data and model size:
```bash
# Fewer data shards (10 instead of default)
python -m nanochat.dataset -n 10 &
# Smaller model (d4 instead of d24), smaller batch size
python -m scripts.base_train --depth=4 --device-batch-size=32
```
Then re-run `./setup.sh` to rebuild with the changes.
**Batch size:** The default `--device-batch-size=64` is tuned for the GB300's 288GB of VRAM. Lower it if training runs out of memory, or raise it if GPU utilization is low.
@ -221,14 +268,16 @@ spec:
label: Troubleshooting
content: |
| Symptom | Cause | Fix |
|--------|--------|-----|
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<your_key>` and `export HF_TOKEN=<your_token>` in the same shell, then run `./launch.sh`. |
| `RuntimeError: CUDA out of memory` | Batch size or model too large for GPU | In the training script in the cloned nanochat repo (e.g. `speedrun.sh`), reduce `--device_batch_size` (e.g. `16`, `8`, `4`, `2`, or `1`). |
| Docker container not starting or no GPU | Docker or NVIDIA runtime misconfigured | Run `nvidia-smi` on your DGX Station. Check no other containers hold GPUs: `docker ps`. Test GPU in Docker: `docker run --rm --gpus all nvidia/cuda:12.0-base nvidia-smi`. |
| `Permission denied` or `No such file or directory` for cache paths in `launch.sh` | Paths like `/home/scratch.lramesh_dpt/...` don't exist on your system | Edit `launch.sh`: set cache dirs to paths you can create (e.g. `$(pwd)/nanochat_cache`, `$(pwd)/hf_cache`). Run `mkdir -p <your_cache_dirs>` and re-run `launch.sh`. |
| `nanochat` image not found when running `launch.sh` | Setup not run or build failed | From `nvidia/nanochat/assets`, run `./setup.sh` and confirm with `docker images` (look for the `nanochat` image). |
| Training exits immediately or script doesn't wait | Container fails early (missing keys, paths, or OOM) | Check container logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars, cache paths, or batch size as above. |
| Wrong cache path or "No such file" when launching | `launch.sh` uses non-existent paths (e.g. `/home/scratch...`) | On DGX Station, edit `launch.sh`: replace cache dirs with `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`, then run `mkdir -p nanochat_cache hf_cache`. |
|---------|-------|-----|
| `WANDB_API_KEY is not set` or `HF_TOKEN is not set` | Required env vars not exported before `launch.sh` | `export WANDB_API_KEY=<key>` and `export HF_TOKEN=<token>` in the same shell, then re-run `./launch.sh` |
| `RuntimeError: CUDA out of memory` | Batch size too large for available VRAM | Edit `speedrun_station.sh`: reduce `--device-batch-size` below the default of 64 (try 32, 16, or 8). Re-run `./setup.sh` then `./launch.sh` |
| Docker container exits immediately | Missing env vars, bad cache paths, or build failure | Check logs: `docker ps -a` then `docker logs <container_id>`. Fix env vars or paths as needed |
| `nanochat` image not found | Setup not run or Docker build failed | From the `assets/` directory, run `./setup.sh` and confirm with `docker images \| grep nanochat` |
| `No such file or directory` for cache paths | Cache directories don't exist | `launch.sh` creates them automatically under `$(pwd)/nanochat_cache` and `$(pwd)/hf_cache`. If using custom paths, create them: `mkdir -p $NANOCHAT_CACHE $HF_CACHE` |
| Training hangs at "Waiting for dataset download" | Network issue downloading FineWeb shards | Check network connectivity. The download can take time depending on bandwidth. If it persists, restart `./launch.sh` |
| W&B shows wrong user / stale login | Cached W&B credentials in container volume | `speedrun_station.sh` runs `wandb login --relogin` with your key automatically. Ensure `WANDB_API_KEY` is correct |
| `launch.sh` prints "Training complete!" almost immediately | Container failed fast; `launch.sh` prints the message as soon as the container process exits, whatever its exit status | Check `docker ps -a` for the exited container and inspect logs with `docker logs <id>` |
| GPU not visible inside container | Docker NVIDIA runtime not configured | Test: `docker run --rm --gpus all nvcr.io/nvidia/pytorch:26.04-py3 nvidia-smi`. If it fails, install/configure [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) |
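Several rows above come down to the container exiting early while the launcher still reports success. One way to surface that is to check the exit status returned by `wait`. The sketch below is a hypothetical wrapper, not the shipped `launch.sh`:

```bash
# Run a command in the background, wait for it, and report its exit status.
run_and_report() {
  local status
  "$@" &
  if wait "$!"; then
    echo "Training complete!"
  else
    status=$?   # exit status of the waited-for command
    echo "Training failed with exit code $status" >&2
    return "$status"
  fi
}

run_and_report sh -c 'exit 0'
```

With a wrapper like this, a failed `docker run` would print a failure message and propagate a non-zero status instead of a completion message.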

View File

@ -87,7 +87,7 @@ spec:
- NVIDIA DGX Station with Blackwell architecture GPU (GB300 chip)
- Docker installed with GPU support
- NVIDIA Container Toolkit configured
- Megatron-Bridge installed (via the the NeMo Framework NGC container)
- Megatron-Bridge installed (via the NeMo Framework NGC container)
Verify your setup:
@ -139,7 +139,7 @@ spec:
nvcr.io/nvidia/nemo:${TAG}
```
All subsequent `torchrun` / `python` commands in this playbook are meant to be executed **from the shell inside this container** .
All subsequent `torchrun` / `python` commands in this playbook are meant to be executed **from the shell inside this container**.
# Step 2. Review the pretraining script
@ -279,7 +279,7 @@ spec:
| `CUDA out of memory` during model init | Insufficient GPU memory for Llama 3.1 8B + optimizer states | Reduce `micro_batch_size` or use `--nproc_per_node` for model parallelism |
| `torchrun` hangs or times out | NCCL communication failure between GPUs | Check `NCCL_DEBUG=INFO torchrun ...` for details; verify all GPUs are visible |
| Training loss is NaN | Precision instability | Increase `num_layers_at_end_in_bf16` (e.g., from 4 to 8) or reduce learning rate |
| `--no-fp4` works but NVFP4 crashes | Transformer Engine version mismatch | Ensure Transformer Engine supports NVFP4; update with `pip install --upgrade transformer-engine` |
| `--disable-fp4` works but NVFP4 crashes | Transformer Engine version mismatch | Ensure Transformer Engine supports NVFP4; update with `pip install --upgrade transformer-engine` |
| Slow training throughput | Not using Tensor Cores efficiently | Ensure batch dimensions are multiples of 8; check that `nvidia-smi` shows high GPU utilization |
| Permission denied on Docker | User not in docker group | Run `sudo usermod -aG docker $USER && newgrp docker` |

View File

@ -330,7 +330,7 @@ spec:
```bash
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-station-playbooks/nvidia/station-sglang-inference
cd dgx-spark-playbooks/nvidia/station-sglang-inference
```
> [!TIP]

View File

@ -1,8 +1,8 @@
kind: Playbook
metadata:
name: station-vllm
displayName: Serve Qwen3-235B with vLLM
shortDescription: Set up vLLM server with Qwen3-235B on DGX Station
displayName: vLLM for Inference
shortDescription: Install and use vLLM on DGX Station
publisher: nvidia
description: |
# REPLACE THIS WITH YOUR MODEL CARD
@ -15,7 +15,7 @@ metadata:
attributes:
- key: DURATION
value: 20 MIN
value: 30 MIN
spec:
artifactName: station-vllm
@ -42,7 +42,9 @@ spec:
# What you'll accomplish
Serve the **Qwen3-235B-A22B-NVFP4** model using vLLM on NVIDIA DGX Station. This 235B parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU.
Serve a **supported model** using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models.
You'll set up vLLM high-throughput LLM serving on NVIDIA DGX Station with Blackwell architecture.
# What to know before starting
@ -57,21 +59,30 @@ spec:
- HuggingFace account with access token
- Network access to NGC and HuggingFace
# Model Support Matrix
The following models are supported with vLLM on DGX Station. All listed models are available and ready to use:
| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
# Time & risk
* **Duration:** 15-20 minutes (longer on first run due to model download)
* **Duration:** 30 minutes (longer on first run due to model download)
* **Risks:** Model download requires HuggingFace authentication
* **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 03/02/2026
* First Publication
* **Last Updated:** 05/28/2026
* Update models
-
id: instructions
label: Serve Qwen3-235B
label: Instructions
content: |
# Step 1. Set up Docker permissions
@ -92,7 +103,7 @@ spec:
export HF_TOKEN="your_huggingface_token"
# Model to serve
export MODEL_HANDLE="nvidia/Qwen3-235B-A22B-NVFP4"
export MODEL_HANDLE="<HF_HANDLE>"
# Maximum context length
export MAX_MODEL_LEN=8192
@ -106,9 +117,16 @@ spec:
docker pull nvcr.io/nvidia/vllm:26.01-py3
```
For Step-3.7-Flash models, pull the custom vLLM container:
```bash
docker pull vllm/vllm-openai:stepfun37
```
# Step 4. Start vLLM server
Start the vLLM server with the Qwen3-235B model. This model fits entirely in VRAM on the GB300. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
For the Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.
```bash
docker run -d \
@ -126,6 +144,28 @@ spec:
--gpu-memory-utilization 0.9
```
For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
vllm/vllm-openai:stepfun37 \
"$MODEL_HANDLE" \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--kv-cache-dtype fp8
```
Check the server logs for startup progress:
```bash
@ -135,7 +175,7 @@ spec:
Expected output includes:
- Model download progress (first run only)
- Model loading into GPU memory
- `Uvicorn running on http://0.0.0.0:8000`
- `Application startup complete.`
Press `Ctrl+C` to exit log view once the server is ready.
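Rather than watching the log by hand, readiness can also be polled for the startup line. A minimal sketch, assuming the container is named `vllm-server`; `wait_for_line` is a hypothetical helper and the retry counts are arbitrary.

```bash
# Re-run a command until its output contains a fixed string, or give up.
wait_for_line() {
  local pattern="$1" tries="$2" delay="$3" i=0
  shift 3
  while [ "$i" -lt "$tries" ]; do
    if "$@" 2>&1 | grep -qF -- "$pattern"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (not run here): poll the server log every 5 seconds, up to 120 times.
# wait_for_line "Application startup complete." 120 5 docker logs vllm-server
```

The function returns non-zero if the line never appears, so it can gate follow-up requests in a script.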
@ -166,9 +206,10 @@ spec:
Optionally, remove the image and cached model:
For example, substituting your own image and model names:
```bash
docker rmi nvcr.io/nvidia/vllm:26.01-py3
rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3-235B-A22B-NVFP4
docker rmi "<docker image name>"
rm -rf $HOME/.cache/huggingface/hub/"<downloaded model name>"
```

View File

@ -1,8 +1,8 @@
kind: Playbook
metadata:
name: station-vllm
displayName: vLLM for Inference
shortDescription: Install and use vLLM on DGX Station
displayName: Serve Qwen3-235B with vLLM
shortDescription: Set up vLLM server with Qwen3-235B on DGX Station
publisher: nvidia
description: |
# REPLACE THIS WITH YOUR MODEL CARD
@ -15,7 +15,7 @@ metadata:
attributes:
- key: DURATION
value: 30 MIN
value: 20 MIN
spec:
artifactName: station-vllm
@ -42,9 +42,7 @@ spec:
# What you'll accomplish
Serve a **supported model** using vLLM on NVIDIA DGX Station. Refer to the table below to see the supported models.
You'll set up vLLM high-throughput LLM serving on NVIDIA DGX Station with Blackwell architecture.
Serve the **Qwen3-235B-A22B-NVFP4** model using vLLM on NVIDIA DGX Station. This 235B parameter model uses NVFP4 quantization and fits entirely in VRAM on the GB300 GPU.
# What to know before starting
@ -59,30 +57,21 @@ spec:
- HuggingFace account with access token
- Network access to NGC and HuggingFace
# Model Support Matrix
The following models are supported with vLLM on Spark. All listed models are available and ready to use:
| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **Step-3.7-Flash-FP8** | FP8 | ✅ | [`stepfun-ai/Step-3.7-Flash-FP8`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8) |
| **Step-3.7-Flash-NVFP4** | NVFP4 | ✅ | [`stepfun-ai/Step-3.7-Flash-NVFP4`](https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4) |
| **Qwen3-235B-A22B-NVFP4** | NVFP4 | ✅ | [`nvidia/Qwen3-235B-A22B-NVFP4`](https://huggingface.co/nvidia/Qwen3-235B-A22B-NVFP4) |
# Time & risk
* **Duration:** 30 minutes (longer on first run due to model download)
* **Duration:** 15-20 minutes (longer on first run due to model download)
* **Risks:** Model download requires HuggingFace authentication
* **Rollback:** Stop and remove the container to restore state
* **Last Updated:** 05/28/2026
* Update models
* **Last Updated:** 03/02/2026
* First Publication
-
id: instructions
label: Instructions
label: Serve Qwen3-235B
content: |
# Step 1. Set up Docker permissions
@ -103,7 +92,7 @@ spec:
export HF_TOKEN="your_huggingface_token"
# Model to serve
export MODEL_HANDLE="<HF_HANDLE>"
export MODEL_HANDLE="nvidia/Qwen3-235B-A22B-NVFP4"
# Maximum context length
export MAX_MODEL_LEN=8192
@ -117,16 +106,9 @@ spec:
docker pull nvcr.io/nvidia/vllm:26.01-py3
```
For Step-3.7-Flash models, pull the custom vLLM container:
```bash
docker pull vllm/vllm-openai:stepfun37
```
# Step 4. Start vLLM server
Start the vLLM server with the model. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
For the Qwen3-235B NVFP4 model, run with the NGC container. This model fits entirely in VRAM on the GB300.
Start the vLLM server with the Qwen3-235B model. This model fits entirely in VRAM on the GB300. On a single-GPU DGX Station, `--gpus all` uses the GB300; if you have multiple GPUs and want to use only the GB300, replace with `--gpus '"device=N"'` where N is the GB300 device ID from `nvidia-smi`.
```bash
docker run -d \
@ -144,28 +126,6 @@ spec:
--gpu-memory-utilization 0.9
```
For Step-3.7-Flash models, run with the custom vLLM container. Both the FP8 and NVFP4 versions fit entirely in VRAM on the GB300.
```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
vllm/vllm-openai:stepfun37 \
"$MODEL_HANDLE" \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--kv-cache-dtype fp8
```
Check the server logs for startup progress:
```bash
@ -175,7 +135,7 @@ spec:
Expected output includes:
- Model download progress (first run only)
- Model loading into GPU memory
- `Application startup complete.`
- `Uvicorn running on http://0.0.0.0:8000`
Press `Ctrl+C` to exit log view once the server is ready.
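
The readiness check above can also be scripted instead of watched by hand. The following is a minimal sketch, assuming the container is named `vllm-server` and that one of the two startup lines listed above appears in the logs. `vllm_ready` is a hypothetical helper name, not part of vLLM or Docker.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the log text on stdin contains
# one of the startup markers listed in this step.
vllm_ready() {
  grep -q -e 'Application startup complete' -e 'Uvicorn running on'
}

# Example use (assumes a container named vllm-server):
#   docker logs vllm-server 2>&1 | vllm_ready && echo "server ready"
```

The helper only inspects log text, so it can be tried against a saved log file before pointing it at a live container.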
@ -206,10 +166,9 @@ spec:
Optionally, remove the image and cached model:
For example:
```bash
docker rmi "<docker image name>"
rm -rf $HOME/.cache/huggingface/hub/"<downloaded model name>"
docker rmi nvcr.io/nvidia/vllm:26.01-py3
rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3-235B-A22B-NVFP4
```


@ -171,10 +171,12 @@ Add additional model entries for any other Ollama models you wish to host remote
| Symptom | Cause | Fix |
|---------|-------|-----|
|Ollama not starting|GPU drivers may not be installed correctly|Run `nvidia-smi` in the terminal. If the command fails, check the DGX Dashboard for updates to your DGX Spark.|
|Continue can't connect over the network|Port 11434 may not be open or accessible|Run `ss -tuln \| grep 11434`. If the output does not show ` tcp LISTEN 0 4096 *:11434 *:* `, go back to step 2 and run the ufw command.|
|Continue can't detect a locally running Ollama model|Configuration not properly set or detected|Check `OLLAMA_HOST` and `OLLAMA_ORIGINS` in the `/etc/systemd/system/ollama.service.d/override.conf` file. If `OLLAMA_HOST` and `OLLAMA_ORIGINS` are set correctly, add these lines to your `~/.bashrc` file.|
|High memory usage|Model is too large|Confirm no other large models or containers are running with `nvidia-smi`. Use smaller models such as `gpt-oss:20b` for lightweight usage.|
| **WiFi connection drops or becomes unreachable** (especially in headless mode) | Aggressive WiFi power-saving settings in NetworkManager | Edit `/etc/NetworkManager/conf.d/default-wifi-powersave-on.conf`, set `wifi.powersave = 2`, and run `sudo systemctl restart NetworkManager`. |
| **Random reboots and "00" error code on the display** | Watchdog timer module (`sbsa_gwdt`) not loaded | Add `sbsa_gwdt` to `/etc/modules-load.d/watchdog.conf` and reboot to ensure the hardware watchdog is correctly managed by the kernel. |
| Ollama not starting | GPU drivers may not be installed correctly | Run `nvidia-smi` in the terminal. If the command fails, check the DGX Dashboard for updates to your DGX Spark. |
| Continue can't connect over the network | Port 11434 may not be open or accessible | Run `ss -tuln \| grep 11434`. If the output does not show `tcp LISTEN 0 4096 *:11434 *:*`, go back to step 2 and run the ufw command. |
| Continue can't detect a locally running Ollama model | Configuration not properly set or detected | Check `OLLAMA_HOST` and `OLLAMA_ORIGINS` in the `/etc/systemd/system/ollama.service.d/override.conf` file. If `OLLAMA_HOST` and `OLLAMA_ORIGINS` are set correctly, add these lines to your `~/.bashrc` file. |
| High memory usage | Model is too large | Confirm no other large models or containers are running with `nvidia-smi`. Use smaller models such as `gpt-oss:20b` for lightweight usage. |
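
The WiFi power-saving fix from the table can be sketched as a small helper. This is a minimal sketch, assuming GNU `sed` and a config file that either already contains a `wifi.powersave` line or has no such setting at all. `set_wifi_powersave` is a hypothetical function name. In NetworkManager, the value `2` disables WiFi power saving.

```shell
#!/bin/sh
# Hypothetical helper: set wifi.powersave to 2 (power saving disabled) in a
# NetworkManager config file. The path is an argument so the change can be
# tried on a copy before touching the real file.
set_wifi_powersave() {
  conf="$1"
  if grep -q '^[[:space:]]*wifi\.powersave' "$conf" 2>/dev/null; then
    # Rewrite the existing setting in place (GNU sed -i).
    sed -i 's/^[[:space:]]*wifi\.powersave[[:space:]]*=.*/wifi.powersave = 2/' "$conf"
  else
    # No setting found: append a [connection] section carrying the value.
    printf '[connection]\nwifi.powersave = 2\n' >> "$conf"
  fi
}
```

On the device, the helper would be run with root privileges against `/etc/NetworkManager/conf.d/default-wifi-powersave-on.conf`, followed by `sudo systemctl restart NetworkManager` as described in the table.
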
> [!NOTE]
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.