Compare commits

...

5 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| TharunGaneshram | 02b1d1d36b | Merge e542e522c5 into 6e98abc3b0 | 2026-04-14 09:38:46 +02:00 |
| GitLab CI | 6e98abc3b0 | chore: Regenerate all playbooks | 2026-04-14 01:42:17 +00:00 |
| GitLab CI | 1d85b97d79 | chore: Regenerate all playbooks | 2026-04-14 00:52:53 +00:00 |
| GitLab CI | 6a4d122e92 | chore: Regenerate all playbooks | 2026-04-13 13:31:35 +00:00 |
| Tharun Ganeshram | e542e522c5 | feat: add 2x DGX Spark Nemo Automodel playbook | 2025-12-08 22:54:32 +00:00 |

Commit e542e522c5:

> Modify the Nemo Automodel playbook to support 2x Spark with RoCE for multi-node fine-tuning.
> source code here: https://github.com/TharunGaneshram/dgx-spark-playbooks/tree/2xSparkAutomodel
>
> Co-authored-by: Tharun Ganeshram <tganeshram@nvidia.com>
4 changed files with 340 additions and 33 deletions

View File

@ -39,7 +39,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
- [Connect Multiple DGX Spark through a Switch](nvidia/multi-sparks-through-switch/)
- [NCCL for Two Sparks](nvidia/nccl/)
- [Fine-tune with NeMo](nvidia/nemo-fine-tune/)
- [NemoClaw with Nemotron-3-Super and Telegram on DGX Spark](nvidia/nemoclaw/)
- [NemoClaw with Nemotron 3 Super and Telegram on DGX Spark](nvidia/nemoclaw/)
- [Nemotron-3-Nano with llama.cpp](nvidia/nemotron/)
- [NIM on Spark](nvidia/nim-llm/)
- [NVFP4 Quantization](nvidia/nvfp4-quantization/)

View File

@ -6,6 +6,7 @@
- [Overview](#overview)
- [Instructions](#instructions)
- [Run on two Sparks](#run-on-two-sparks)
- [Troubleshooting](#troubleshooting)
---
@ -296,8 +297,296 @@ python3 my_custom_training.py
Explore the [NeMo AutoModel GitHub repository](https://github.com/NVIDIA-NeMo/Automodel) for more recipes, documentation, and community examples. Consider setting up custom datasets, experimenting with different model architectures, and scaling to multi-node distributed training for larger models.
## Run on two Sparks
## Step 1. Configure network connectivity
Follow the network setup instructions from the [Connect two Sparks](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks) playbook to establish connectivity between your DGX Spark nodes.
This includes:
- Physical QSFP cable connection
- Network interface configuration (automatic or manual IP assignment)
- Passwordless SSH setup
- Network connectivity verification
> [!NOTE]
> Steps 2 to 8 must be conducted on each node.
## Step 2. Configure Docker permissions
To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.
Open a new terminal and test Docker access. In the terminal, run:
```bash
docker ps
```
If you see a permission-denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the `docker` group so that you don't need to run the command with `sudo`.
```bash
sudo usermod -aG docker $USER
newgrp docker
```
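To confirm the group change took effect, a quick membership check can be scripted. This is a sketch, not part of the official playbook; remember that group changes only apply to shells started after `newgrp docker` or a fresh login.

```shell
# Sketch: confirm the current user is now in the docker group.
if id -nG | grep -qw docker; then
  echo "user is in the docker group"
else
  echo "user is NOT in the docker group yet"
fi
```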
## Step 3. Install NVIDIA Container Toolkit & setup Docker environment
Ensure the NVIDIA drivers and the NVIDIA Container Toolkit are installed on each node (both manager and workers) that will provide GPU resources. This package enables Docker containers to access the host's GPU hardware. Ensure you complete the [installation steps](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), including the [Docker configuration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker) for NVIDIA Container Toolkit.
## Step 4. Deploy Docker Containers
Download the [**pytorch-ft-entrypoint.sh**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/pytorch-fine-tune/assets/pytorch-ft-entrypoint.sh) script into your home directory and run the following command to make it executable:
```bash
chmod +x $HOME/pytorch-ft-entrypoint.sh
```
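Before handing the script to `docker run`, it can be worth verifying the download and permissions succeeded; a minimal sketch:

```shell
# Sketch: verify the entrypoint script is present and executable
# before passing it to docker run in the next step.
SCRIPT="$HOME/pytorch-ft-entrypoint.sh"
if [ -x "$SCRIPT" ]; then
  echo "entrypoint ready: $SCRIPT"
else
  echo "entrypoint missing or not executable: $SCRIPT"
fi
```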
Deploy the docker container by running the following command:
```bash
docker run -d \
--name automodel-node \
--gpus all \
--network host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--device=/dev/infiniband \
-v "$PWD"/pytorch-ft-entrypoint.sh:/opt/pytorch-ft-entrypoint.sh \
-v "$HOME/.cache/huggingface/":/root/.cache/huggingface/ \
-v "$HOME/.ssh":/tmp/.ssh:ro \
-e UCX_NET_DEVICES=enp1s0f0np0,enp1s0f1np1 \
-e NCCL_SOCKET_IFNAME=enp1s0f0np0,enp1s0f1np1 \
-e GLOO_SOCKET_IFNAME=enp1s0f0np0,enp1s0f1np1 \
-e NCCL_DEBUG=INFO \
-e TORCH_NCCL_ASYNC_ERROR_HANDLING=1 \
-e TORCH_DISTRIBUTED_DEBUG=INFO \
-e CUDA_DEVICE_MAX_CONNECTIONS=1 \
-e CUDA_VISIBLE_DEVICES=0 \
nvcr.io/nvidia/pytorch:25.10-py3 \
/opt/pytorch-ft-entrypoint.sh
```
## Step 5. Install package management tools
Open an interactive shell in the Docker container on each node.
```bash
docker exec -it automodel-node bash
```
> [!NOTE]
> All subsequent steps and commands, other than "Cleanup and rollback", should be run from within the docker container terminal.
Install `uv` for efficient package management and virtual environment isolation. NeMo AutoModel uses `uv` for dependency management and automatic environment handling.
```bash
## Install uv package manager
pip3 install uv
## Verify installation
uv --version
```
## Step 6. Clone NeMo AutoModel repository
Clone the official NeMo AutoModel repository to access recipes and examples. This provides ready-to-use training configurations for various model types and training scenarios.
```bash
## Clone the repository
git clone https://github.com/NVIDIA-NeMo/Automodel.git
## Navigate to the repository
cd Automodel
```
## Step 7. Install NeMo AutoModel
Set up the virtual environment and install NeMo AutoModel. Choose between wheel package installation for stability or source installation for latest features.
**Install from wheel package (recommended):**
```bash
## Initialize virtual environment
uv venv --system-site-packages
## Install packages with uv
uv sync --inexact --frozen --all-extras \
--no-install-package torch \
--no-install-package torchvision \
--no-install-package triton \
--no-install-package nvidia-cublas-cu12 \
--no-install-package nvidia-cuda-cupti-cu12 \
--no-install-package nvidia-cuda-nvrtc-cu12 \
--no-install-package nvidia-cuda-runtime-cu12 \
--no-install-package nvidia-cudnn-cu12 \
--no-install-package nvidia-cufft-cu12 \
--no-install-package nvidia-cufile-cu12 \
--no-install-package nvidia-curand-cu12 \
--no-install-package nvidia-cusolver-cu12 \
--no-install-package nvidia-cusparse-cu12 \
--no-install-package nvidia-cusparselt-cu12 \
--no-install-package nvidia-nccl-cu12 \
--no-install-package transformer-engine \
--no-install-package nvidia-modelopt \
--no-install-package nvidia-modelopt-core \
--no-install-package flash-attn \
--no-install-package transformer-engine-cu12 \
--no-install-package transformer-engine-torch
## Install bitsandbytes
CMAKE_ARGS="-DCOMPUTE_BACKEND=cuda -DCOMPUTE_CAPABILITY=80;86;87;89;90" \
CMAKE_BUILD_PARALLEL_LEVEL=8 \
uv pip install --no-deps git+https://github.com/bitsandbytes-foundation/bitsandbytes.git@50be19c39698e038a1604daf3e1b939c9ac1c342
```
## Step 8. Verify installation
Confirm NeMo AutoModel is properly installed and accessible. This step validates the installation and checks for any missing dependencies.
```bash
## Test NeMo AutoModel import
uv run --frozen --no-sync python -c "import nemo_automodel; print('✅ NeMo AutoModel ready')"
```
> [!NOTE]
> You might see a warning stating `grouped_gemm is not available`. You can ignore this warning if you see '✅ NeMo AutoModel ready'.
> [!NOTE]
> Ensure steps 2 to 8 were conducted on all nodes for correct setup.
## Step 9. Run sample multi-node fine-tuning
The following commands show how to perform full fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) with LoRA across both Spark devices using `torch.distributed.run`.
First, export your HF_TOKEN on both nodes so that gated models can be downloaded.
```bash
export HF_TOKEN=<your_huggingface_token>
```
> [!NOTE]
> Replace `<your_huggingface_token>` with your personal Hugging Face access token. A valid token is required to download any gated model.
>
> - Generate a token: [Hugging Face tokens](https://huggingface.co/settings/tokens), guide available [here](https://huggingface.co/docs/hub/en/security-tokens).
> - Request and receive access on each model's page (and accept license/terms) before attempting downloads.
> - Llama-3.1-8B: [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B)
> - Qwen3-8B: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
> - Mixtral-8x7B: [mistralai/Mixtral-8x7B](https://huggingface.co/mistralai/Mixtral-8x7B)
>
> The same steps apply for any other gated model you use: visit its model card on Hugging Face, request access, accept the license, and wait for approval.
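As a quick sanity check before launching training, you can verify the token's shape; this is a sketch with a placeholder value, relying on the convention that user access tokens from huggingface.co start with `hf_`.

```shell
# Sketch: placeholder token; replace with your real token.
HF_TOKEN="hf_exampleplaceholder"
case "$HF_TOKEN" in
  hf_*) echo "token format looks plausible" ;;
  *)    echo "warning: token does not start with hf_" ;;
esac
```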
Next, export a few multi-node PyTorch configuration environment variables.
- `MASTER_ADDR`: IP address of your master node as set in [Connect two Sparks](https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks) (for example, 192.168.100.10).
- `MASTER_PORT`: a free port on your master node (for example, 12345).
- `NODE_RANK`: 0 on the Master node, 1 on the Worker node.
Run this on the Master node:
```bash
export MASTER_ADDR=<TODO: specify IP>
export MASTER_PORT=<TODO: specify port>
export NODE_RANK=0
```
Run this on the Worker node:
```bash
export MASTER_ADDR=<TODO: specify IP>
export MASTER_PORT=<TODO: specify port>
export NODE_RANK=1
```
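With the example values from this step, the static rendezvous endpoint that `torch.distributed.run` will use can be sanity-checked in advance (a sketch; substitute your own address and port):

```shell
# Sketch: compose and sanity-check the static rendezvous endpoint.
MASTER_ADDR=192.168.100.10   # example IP from the Connect two Sparks playbook
MASTER_PORT=12345            # example free port
NODE_RANK=0                  # 0 on the Master node, 1 on the Worker node

# Fail fast on an out-of-range port rather than at rendezvous time.
if [ "$MASTER_PORT" -lt 1024 ] || [ "$MASTER_PORT" -gt 65535 ]; then
  echo "MASTER_PORT out of range: $MASTER_PORT" >&2
  exit 1
fi
echo "rdzv endpoint: ${MASTER_ADDR}:${MASTER_PORT} (node_rank=${NODE_RANK})"
```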
**LoRA fine-tuning example:**
Execute a basic fine-tuning example to validate the complete setup. This demonstrates parameter-efficient fine-tuning using a small model suitable for testing.
For the examples below, we are using YAML for configuration, and parameter overrides are passed as command line arguments.
Run this on all nodes:
```bash
uv run --frozen --no-sync python -m torch.distributed.run \
--nnodes=2 \
--nproc_per_node=1 \
--node_rank=${NODE_RANK} \
--rdzv_backend=static \
--rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
examples/llm_finetune/finetune.py \
-c examples/llm_finetune/llama3_2/llama3_2_1b_squad_peft.yaml \
--model.pretrained_model_name_or_path meta-llama/Llama-3.1-8B \
--packed_sequence.packed_sequence_size 1024 \
--step_scheduler.max_steps 100
```
The following `torch.distributed.run` parameters configure our dual-node distributed PyTorch workload and communication:
- `--nnodes`: sets the total number of nodes participating in the distributed training. This is 2 for our dual-node case.
- `--nproc_per_node`: sets the number of processes to be executed on each node. 1 fine-tuning process will occur on each node in our example.
- `--node_rank`: sets the rank of the current node. Again, Master rank is set to 0 and Worker rank is set to 1.
- `--rdzv_backend`: sets the backend used for the rendezvous mechanism, which lets nodes discover each other and establish communication channels before the distributed workload begins. We use `static` for a pre-configured rendezvous setup.
- `--rdzv_endpoint`: sets the endpoint on which the rendezvous is expected to occur. This will be the Master node IP address and port specified earlier.
These config overrides ensure the Llama-3.1-8B LoRA run behaves as expected:
- `--model.pretrained_model_name_or_path`: selects the Llama-3.1-8B model to fine-tune from the Hugging Face model hub (weights fetched via your Hugging Face token).
- `--packed_sequence.packed_sequence_size`: sets the packed sequence size to 1024 to enable packed sequence training.
- `--step_scheduler.max_steps`: sets the maximum number of training steps. We set it to 100 for demonstration purposes; adjust it based on your needs.
> [!NOTE]
> `NCCL WARN NET/IB : roceP2p1s0f1:1 unknown event type (18)` logs during multi-node workloads can be ignored and are a sign that RoCE is functional.
**Full Fine-tuning example:**
Run this on all nodes:
```bash
uv run --frozen --no-sync python -m torch.distributed.run \
--nnodes=2 \
--nproc_per_node=1 \
--node_rank=${NODE_RANK} \
--rdzv_backend=static \
--rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
examples/llm_finetune/finetune.py \
-c examples/llm_finetune/qwen/qwen3_8b_squad_spark.yaml \
--model.pretrained_model_name_or_path Qwen/Qwen3-8B \
--step_scheduler.local_batch_size 1 \
--step_scheduler.max_steps 100 \
--packed_sequence.packed_sequence_size 1024
```
These config overrides ensure the Qwen3-8B SFT run behaves as expected:
- `--model.pretrained_model_name_or_path`: selects the Qwen/Qwen3-8B model to fine-tune from the Hugging Face model hub (weights fetched via your Hugging Face token). Adjust this if you want to fine-tune a different model.
- `--step_scheduler.max_steps`: sets the maximum number of training steps. We set it to 100 for demonstration purposes; adjust it based on your needs.
- `--step_scheduler.local_batch_size`: sets the per-GPU micro-batch size to 1 to fit in memory; overall effective batch size is still driven by gradient accumulation and data/tensor parallel settings from the recipe.
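As a back-of-envelope check, the relationship described above can be computed directly; the accumulation value here is hypothetical and depends on the recipe YAML:

```shell
# Hypothetical numbers: effective batch size for the dual-node run.
LOCAL_BATCH=1   # --step_scheduler.local_batch_size
GRAD_ACCUM=8    # assumed gradient-accumulation steps from the recipe
DP_WORLD=2      # 2 nodes x 1 process per node
echo "effective batch size: $((LOCAL_BATCH * GRAD_ACCUM * DP_WORLD))"
```

With these assumed values the run would see an effective batch size of 16 samples per optimizer step.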
## Step 10. Validate successful training completion
Validate the fine-tuned model by inspecting artifacts contained in the checkpoint directory on your Master node.
```bash
## Inspect logs and checkpoint output.
## LATEST is a symlink to the most recent checkpoint saved during training.
## Below is an example of the expected output (username and domain-users are placeholders).
ls -lah checkpoints/LATEST/
## root@gx10-f154:/workspace/Automodel# ls -lah checkpoints/LATEST/
## total 36K
## drwxr-xr-x 6 username domain-users 4.0K Dec 8 20:16 .
## drwxr-xr-x 3 username domain-users 4.0K Dec 8 20:16 ..
## -rw-r--r-- 1 username domain-users 1.6K Dec 8 20:16 config.yaml
## drwxr-xr-x 2 username domain-users 4.0K Dec 8 20:16 dataloader
## -rw-r--r-- 1 username domain-users 66 Dec 8 20:16 losses.json
## drwxr-xr-x 3 username domain-users 4.0K Dec 8 20:16 model
## drwxr-xr-x 2 username domain-users 4.0K Dec 8 20:16 optim
## drwxr-xr-x 2 username domain-users 4.0K Dec 8 20:16 rng
## -rw-r--r-- 1 username domain-users 1.3K Dec 8 20:16 step_scheduler.pt
```
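The same artifact check can be scripted. The sketch below builds a mock checkpoint layout so it runs anywhere; in practice you would point `CKPT_DIR` at `checkpoints/LATEST/` on the Master node instead.

```shell
# Sketch: verify that the key checkpoint artifacts exist.
CKPT_DIR="$(mktemp -d)/LATEST"                 # mock layout for illustration
mkdir -p "$CKPT_DIR/model" "$CKPT_DIR/optim"
touch "$CKPT_DIR/config.yaml" "$CKPT_DIR/losses.json"

missing=0
for item in config.yaml losses.json model optim; do
  [ -e "$CKPT_DIR/$item" ] || { echo "missing: $item"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "checkpoint looks complete"
```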
## Step 11. Cleanup and rollback
Stop and remove containers by using the following command on all nodes:
```bash
docker stop automodel-node
docker rm automodel-node
```
> [!WARNING]
> This removes all training data and performance reports. Copy `checkpoints/` out of the container in advance if you want to keep it.
## Troubleshooting
## Common issues for running on a single Spark
| Symptom | Cause | Fix |
|---------|--------|-----|
| `nvcc: command not found` | CUDA toolkit not in PATH | Add CUDA toolkit to PATH: `export PATH=/usr/local/cuda/bin:$PATH` |
@ -307,6 +596,19 @@ Explore the [NeMo AutoModel GitHub repository](https://github.com/NVIDIA-NeMo/Au
| ARM64 package compatibility issues | Package not available for ARM architecture | Use source installation or build from source with ARM64 flags |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
## Common issues for running on two Sparks
| Symptom | Cause | Fix |
|---------|-------|-----|
| `nvcc: command not found` | CUDA toolkit not in PATH | Add CUDA toolkit to PATH: `export PATH=/usr/local/cuda/bin:$PATH` |
| Container exits immediately | Missing entrypoint script | Ensure `pytorch-ft-entrypoint.sh` download succeeded and has executable permissions |
| `The container name "/automodel-node" is already in use` | Another docker container of the same name is in use on the node (likely forgotten during clean up) | Remove (or rename) the old container or rename the new one |
| GPU not detected in training | CUDA driver/runtime mismatch | Verify driver compatibility: `nvidia-smi` and reinstall CUDA if needed |
| Out of memory during training | Model too large for available GPU memory | Reduce batch size, enable gradient checkpointing, or use model parallelism |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
| Checkpoint loading failure when running fine-tuning examples consecutively: `No such file or directory: 'checkpoints/epoch_0_step_*/*'` | Fine-tuning script attempts to load old checkpoints unsuccessfully | Remove the `checkpoints/` directory before running again |
| `Unable to find address for: enp1s0f0np0` when attempting single node fine-tuning run on multi-node container | `enp1s0f0np0` is not configured with an IP | Verify network configuration or, if you configured the devices on `enp1s0f1np1`, set `NCCL_SOCKET_IFNAME` and `GLOO_SOCKET_IFNAME` to only `enp1s0f1np1` |
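For the last row above, pinning the collectives to the configured interface looks like this (sketch; the interface names match those used in this playbook's `docker run` command):

```shell
# Restrict NCCL and Gloo to the interface that actually has an IP.
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export GLOO_SOCKET_IFNAME=enp1s0f1np1
echo "pinned collectives to: $NCCL_SOCKET_IFNAME"
```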
> [!NOTE]
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within

View File

@ -1,4 +1,4 @@
# NemoClaw with Nemotron-3-Super and Telegram on DGX Spark
# NemoClaw with Nemotron 3 Super and Telegram on DGX Spark
> Install NemoClaw on DGX Spark with local Ollama inference and Telegram bot integration
@ -25,7 +25,7 @@
- [Step 6. Talk to the agent (CLI)](#step-6-talk-to-the-agent-cli)
- [Step 7. Interactive TUI](#step-7-interactive-tui)
- [Step 8. Exit the sandbox and access the Web UI](#step-8-exit-the-sandbox-and-access-the-web-ui)
- [Step 9. Prepare credentials](#step-9-prepare-credentials)
- [Step 9. Create a Telegram bot](#step-9-create-a-telegram-bot)
- [Step 10. Configure and start the Telegram bridge](#step-10-configure-and-start-the-telegram-bridge)
- [Step 11. Stop services](#step-11-stop-services)
- [Step 12. Uninstall NemoClaw](#step-12-uninstall-nemoclaw)
@ -192,14 +192,6 @@ Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
```
Verify it is running:
```bash
curl http://localhost:11434
```
Expected: `Ollama is running`. If not, start it: `ollama serve &`
Configure Ollama to listen on all interfaces so the sandbox container can reach it:
```bash
@ -209,6 +201,17 @@ sudo systemctl daemon-reload
sudo systemctl restart ollama
```
Verify it is running and reachable on all interfaces:
```bash
curl http://0.0.0.0:11434
```
Expected: `Ollama is running`. If not, start it with `sudo systemctl start ollama`.
> [!IMPORTANT]
> Always start Ollama via systemd (`sudo systemctl restart ollama`) — do not use `ollama serve &`. A manually started Ollama process does not pick up the `OLLAMA_HOST=0.0.0.0` setting above, and the NemoClaw sandbox will not be able to reach the inference server.
### Step 3. Pull the Nemotron 3 Super model
Download Nemotron 3 Super 120B (~87 GB; may take 15--30 minutes depending on network speed):
@ -237,10 +240,10 @@ You should see `nemotron-3-super:120b` in the output.
### Step 4. Install NemoClaw
This single command handles everything: installs Node.js (if needed), installs OpenShell, clones NemoClaw at the pinned stable release (`v0.0.1`), builds the CLI, and runs the onboard wizard to create a sandbox.
This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the latest stable NemoClaw release, builds the CLI, and runs the onboard wizard to create a sandbox.
```bash
curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_INSTALL_TAG=v0.0.4 bash
curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
```
The onboard wizard walks you through setup:
@ -358,14 +361,12 @@ http://127.0.0.1:18789/#token=<long-token-here>
## Phase 3: Telegram Bot
### Step 9. Prepare credentials
> [!NOTE]
> If you already configured Telegram during the NemoClaw onboarding wizard (step 5/8), you can skip this phase. These steps cover adding Telegram after the initial setup.
You need two items:
### Step 9. Create a Telegram bot
| Item | Where to get it |
|------|----------------|
| Telegram bot token | Open Telegram, find [@BotFather](https://t.me/BotFather), send `/newbot`, and follow the prompts. Copy the token it gives you. |
| NVIDIA API key | Go to [build.nvidia.com/settings/api-keys](https://build.nvidia.com/settings/api-keys) and create or copy a key (starts with `nvapi-`). |
Open Telegram, find [@BotFather](https://t.me/BotFather), send `/newbot`, and follow the prompts. Copy the bot token it gives you.
### Step 10. Configure and start the Telegram bridge
@ -376,6 +377,7 @@ Set the required environment variables. Replace the placeholders with your actua
```bash
export TELEGRAM_BOT_TOKEN=<your-bot-token>
export SANDBOX_NAME=my-assistant
export NVIDIA_API_KEY=<your-nvidia-api-key>
```
Add the Telegram network policy to the sandbox:
@ -384,34 +386,36 @@ Add the Telegram network policy to the sandbox:
nemoclaw my-assistant policy-add
```
When prompted, type `telegram` and hit **Y** to confirm.
When prompted, select `telegram` and hit **Y** to confirm.
Start the Telegram bridge. On first run it will ask for your NVIDIA API key:
Start the Telegram bridge.
```bash
export TELEGRAM_BOT_TOKEN=<your-bot-token>
nemoclaw start
```
Paste your `nvapi-` key when prompted.
The Telegram bridge starts only when the `TELEGRAM_BOT_TOKEN` environment variable is set. Verify the services are running:
```bash
nemoclaw status
```
You should see:
```text
[services] telegram-bridge started
Telegram: bridge running
```
Open Telegram, find your bot, and send it a message. The bot forwards it to the agent and replies.
> [!NOTE]
> The first response may include a debug log line like "gateway Running as non-root..." -- this is cosmetic and can be ignored.
> The first response may take 30--90 seconds for a 120B parameter model running locally.
> [!NOTE]
> If you need to restart the bridge, `nemoclaw stop` may not cleanly stop the process. If that happens, find and kill the bridge process via its PID file:
> If the bridge does not appear in `nemoclaw status`, make sure `TELEGRAM_BOT_TOKEN` is exported in the same shell session where you run `nemoclaw start`. You can also try stopping and restarting:
> ```bash
> kill -9 "$(cat /tmp/nemoclaw-services-${SANDBOX_NAME}/telegram-bridge.pid)"
> nemoclaw stop
> export TELEGRAM_BOT_TOKEN=<your-bot-token>
> nemoclaw start
> ```
> Then run `nemoclaw start` again.
> [!NOTE]
> For details on restricting which Telegram chats can interact with the agent, see the [NemoClaw Telegram bridge documentation](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html).
---
@ -419,7 +423,7 @@ Open Telegram, find your bot, and send it a message. The bot forwards it to the
### Step 11. Stop services
Stop any running auxiliary services (Telegram bridge, cloudflared):
Stop any running auxiliary services (Telegram bridge, cloudflared tunnel):
```bash
nemoclaw stop
@ -474,7 +478,7 @@ The uninstaller runs 6 steps:
| `nemoclaw my-assistant status` | Show sandbox status and inference config |
| `nemoclaw my-assistant logs --follow` | Stream sandbox logs in real time |
| `nemoclaw list` | List all registered sandboxes |
| `nemoclaw start` | Start auxiliary services (Telegram bridge) |
| `nemoclaw start` | Start auxiliary services (Telegram bridge, cloudflared) |
| `nemoclaw stop` | Stop auxiliary services |
| `openshell term` | Open the monitoring TUI on the host |
| `openshell forward list` | List active port forwards |

View File

@ -685,6 +685,7 @@ docker rmi ghcr.io/open-webui/open-webui:main
| "invalid mount config for type 'bind'" | Missing or non-executable entrypoint script | Run `docker inspect <container_id>` to see full error message. Verify `trtllm-mn-entrypoint.sh` exists on both nodes in your home directory (`ls -la $HOME/trtllm-mn-entrypoint.sh`) and has executable permissions (`chmod +x $HOME/trtllm-mn-entrypoint.sh`) |
| "task: non-zero exit (255)" | Container exit with error code 255 | Check container logs with `docker ps -a --filter "name=trtllm-multinode_trtllm"` to get container ID, then `docker logs <container_id>` to see detailed error messages |
| Docker state stuck in "Pending" with "no suitable node (insufficien...)" | Docker daemon not properly configured for GPU access | Verify steps 2-4 were completed successfully and check that `/etc/docker/daemon.json` contains correct GPU configuration |
| Serving model fails `ptxas fatal` errors | Model needs runtime triton kernel compilation | In Step 10, add `-x TRITON_PTXAS_PATH` to your `mpirun` command |
> [!NOTE]
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.