mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-23 10:33:51 +00:00
Compare commits
4 Commits
367f892cf2...fd1510e368
| Author | SHA1 | Date |
|---|---|---|
| | fd1510e368 | |
| | 6e98abc3b0 | |
| | 1d85b97d79 | |
| | 48fc5eb30e | |
@@ -39,7 +39,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
 - [Connect Multiple DGX Spark through a Switch](nvidia/multi-sparks-through-switch/)
 - [NCCL for Two Sparks](nvidia/nccl/)
 - [Fine-tune with NeMo](nvidia/nemo-fine-tune/)
-- [NemoClaw with Nemotron-3-Super and Telegram on DGX Spark](nvidia/nemoclaw/)
+- [NemoClaw with Nemotron 3 Super and Telegram on DGX Spark](nvidia/nemoclaw/)
 - [Nemotron-3-Nano with llama.cpp](nvidia/nemotron/)
 - [NIM on Spark](nvidia/nim-llm/)
 - [NVFP4 Quantization](nvidia/nvfp4-quantization/)
@@ -1,4 +1,4 @@
-# NemoClaw with Nemotron-3-Super and Telegram on DGX Spark
+# NemoClaw with Nemotron 3 Super and Telegram on DGX Spark

 > Install NemoClaw on DGX Spark with local Ollama inference and Telegram bot integration
@@ -372,7 +372,15 @@ Open Telegram, find [@BotFather](https://t.me/BotFather), send `/newbot`, and fo

 Make sure you are on the **host** (not inside the sandbox). If you are inside the sandbox, run `exit` first.

-Add the Telegram network policy to the sandbox so it can reach the Telegram API:
+Set the required environment variables. Replace the placeholders with your actual values. `SANDBOX_NAME` must match the sandbox name you chose during the onboard wizard:
+
+```bash
+export TELEGRAM_BOT_TOKEN=<your-bot-token>
+export SANDBOX_NAME=my-assistant
+export NVIDIA_API_KEY=<your-nvidia-api-key>
+```
+
+Add the Telegram network policy to the sandbox:

 ```bash
 nemoclaw my-assistant policy-add
@@ -380,7 +388,7 @@ nemoclaw my-assistant policy-add

 When prompted, select `telegram` and hit **Y** to confirm.

-Set the bot token and start auxiliary services:
+Start the Telegram bridge.

 ```bash
 export TELEGRAM_BOT_TOKEN=<your-bot-token>
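The hunks above export `TELEGRAM_BOT_TOKEN`, `SANDBOX_NAME`, and `NVIDIA_API_KEY` before the bridge is started. A minimal sketch of a pre-flight check that reports which of them are still unset or empty. It assumes a POSIX-compatible shell, and `missing_vars` is a hypothetical helper, not part of NemoClaw:

```bash
# Hypothetical helper (not a NemoClaw command): print the names of the given
# variables that are unset or empty, separated by spaces.
missing_vars() {
  missing=""
  for name in "$@"; do
    # Indirect lookup via eval; the names are fixed literals, never user input.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="${missing:+$missing }$name"
    fi
  done
  printf '%s\n' "$missing"
}

# Report any variable from the playbook step that still needs a value.
missing="$(missing_vars TELEGRAM_BOT_TOKEN SANDBOX_NAME NVIDIA_API_KEY)"
if [ -n "$missing" ]; then
  echo "Missing required variables: $missing" >&2
fi
```

The check only prints a warning rather than exiting, so it can be pasted into an interactive shell without closing the session.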
@@ -685,6 +685,7 @@ docker rmi ghcr.io/open-webui/open-webui:main
 | "invalid mount config for type 'bind'" | Missing or non-executable entrypoint script | Run `docker inspect <container_id>` to see full error message. Verify `trtllm-mn-entrypoint.sh` exists on both nodes in your home directory (`ls -la $HOME/trtllm-mn-entrypoint.sh`) and has executable permissions (`chmod +x $HOME/trtllm-mn-entrypoint.sh`) |
 | "task: non-zero exit (255)" | Container exit with error code 255 | Check container logs with `docker ps -a --filter "name=trtllm-multinode_trtllm"` to get container ID, then `docker logs <container_id>` to see detailed error messages |
 | Docker state stuck in "Pending" with "no suitable node (insufficien...)" | Docker daemon not properly configured for GPU access | Verify steps 2-4 were completed successfully and check that `/etc/docker/daemon.json` contains correct GPU configuration |
+| Serving model fails `ptxas fatal` errors | Model needs runtime triton kernel compilation | In Step 10, add `-x TRITON_PTXAS_PATH` to your `mpirun` command |

 > [!NOTE]
 > DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
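The new `ptxas fatal` row relies on `mpirun -x TRITON_PTXAS_PATH`, which in Open MPI forwards an environment variable that is already set in the launching shell. A minimal sketch of a check that the variable points at an executable before it is forwarded. The diff does not say where `ptxas` lives, so no path is assumed here, and `ptxas_path_ok` is a hypothetical helper:

```bash
# Hypothetical check (not from the playbook): succeed only if
# TRITON_PTXAS_PATH is set and names an executable regular file.
ptxas_path_ok() {
  [ -n "${TRITON_PTXAS_PATH:-}" ] &&
    [ -f "$TRITON_PTXAS_PATH" ] &&
    [ -x "$TRITON_PTXAS_PATH" ]
}

if ptxas_path_ok; then
  echo "TRITON_PTXAS_PATH is usable: $TRITON_PTXAS_PATH"
else
  echo "TRITON_PTXAS_PATH is unset or not an executable file" >&2
fi
```

If the check fails, `-x TRITON_PTXAS_PATH` would have nothing useful to forward, so the variable needs to be exported first.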
@@ -171,10 +171,12 @@ Add additional model entries for any other Ollama models you wish to host remote

 | Symptom | Cause | Fix |
 |---------|-------|-----|
-|Ollama not starting|GPU drivers may not be installed correctly|Run `nvidia-smi` in the terminal. If the command fails check DGX Dashboard for updates to your DGX Spark.|
-|Continue can't connect over the network|Port 11434 may not be open or accessible|Run command `ss -tuln \| grep 11434`. If the output does not reflect ` tcp LISTEN 0 4096 *:11434 *:* `, go back to step 2 and run the ufw command.|
-|Continue can't detect a locally running Ollama model|Configuration not properly set or detected|Check `OLLAMA_HOST` and `OLLAMA_ORIGINS` in `/etc/systemd/system/ollama.service.d/override.conf` file. If `OLLAMA_HOST` and `OLLAMA_ORIGINS` are set correctly, add these lines to your `~/.bashrc` file.|
-|High memory usage|Model size too big|Confirm no other large models or containers are running with `nvidia-smi`. Use smaller models such as `gpt-oss:20b` for lightweight usage.|
+| **WiFi connection drops or becomes unreachable** (especially in headless mode) | Aggressive WiFi power-saving settings in NetworkManager | Edit `/etc/NetworkManager/conf.d/default-wifi-powersave-on.conf`, set `wifi.powersave = 2`, and run `sudo systemctl restart NetworkManager`. |
+| **Random reboots and "00" error code on the display** | Watchdog timer module (`sbsa_gwdt`) not loaded | Add `sbsa_gwdt` to `/etc/modules-load.d/watchdog.conf` and reboot to ensure the hardware watchdog is correctly managed by the kernel. |
+| Ollama not starting | GPU drivers may not be installed correctly | Run `nvidia-smi` in the terminal. If the command fails check DGX Dashboard for updates to your DGX Spark. |
+| Continue can't connect over the network | Port 11434 may not be open or accessible | Run command `ss -tuln \| grep 11434`. If the output does not reflect `tcp LISTEN 0 4096 *:11434 *:*`, go back to step 2 and run the ufw command. |
+| Continue can't detect a locally running Ollama model | Configuration not properly set or detected | Check `OLLAMA_HOST` and `OLLAMA_ORIGINS` in `/etc/systemd/system/ollama.service.d/override.conf` file. If `OLLAMA_HOST` and `OLLAMA_ORIGINS` are set correctly, add these lines to your `~/.bashrc` file. |
+| High memory usage | Model size too big | Confirm no other large models or containers are running with `nvidia-smi`. Use smaller models such as `gpt-oss:20b` for lightweight usage. |

 > [!NOTE]
 > DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
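The port check in the table above compares `ss -tuln` output by eye. A minimal sketch of the same check as a function that reads `ss -tuln` output on stdin and matches the port exactly, so a longer port number that merely contains `11434` is not a false positive. `port_is_listening` is a hypothetical helper, and the column layout is assumed to match the sample line quoted in the table:

```bash
# Hypothetical helper (not from the playbook): read `ss -tuln` output on stdin
# and succeed if a TCP socket is in LISTEN state on the given port.
port_is_listening() {
  port="$1"
  awk -v port="$port" '
    $1 == "tcp" && $2 == "LISTEN" {
      # Column 5 is the local address, e.g. *:11434 or [::]:11434.
      n = split($5, parts, ":")
      if (parts[n] == port) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}

# Usage: check the Ollama port named in the troubleshooting table.
if command -v ss >/dev/null 2>&1 && ss -tuln | port_is_listening 11434; then
  echo "Port 11434 is listening"
else
  echo "Port 11434 is not listening; revisit the ufw step" >&2
fi
```

Reading from stdin keeps the function independent of `ss` itself, so saved output from a remote machine can be checked the same way.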