Compare commits

16 Commits

Author SHA1 Message Date
Csaba Kecskemeti
fd16fd4edc
Merge 59bedc4afe into 599cf838a0 2026-05-07 10:38:13 +00:00
GitLab CI
599cf838a0 chore: Regenerate all playbooks 2026-04-29 18:42:01 +00:00
GitLab CI
9809e38119 chore: Regenerate all playbooks 2026-04-29 18:29:39 +00:00
GitLab CI
5a9d5d1f2a chore: Regenerate all playbooks 2026-04-28 15:49:55 +00:00
GitLab CI
90fe8c7cae chore: Regenerate all playbooks 2026-04-27 17:19:18 +00:00
GitLab CI
2022e2b24b chore: Regenerate all playbooks 2026-04-20 15:46:44 +00:00
GitLab CI
3ba4d58f1e chore: Regenerate all playbooks 2026-04-14 17:45:10 +00:00
GitLab CI
6e98abc3b0 chore: Regenerate all playbooks 2026-04-14 01:42:17 +00:00
GitLab CI
1d85b97d79 chore: Regenerate all playbooks 2026-04-14 00:52:53 +00:00
GitLab CI
6a4d122e92 chore: Regenerate all playbooks 2026-04-13 13:31:35 +00:00
Csaba Kecskemeti
59bedc4afe playbook rev5 2026-01-23 19:26:14 -08:00
Csaba Kecskemeti
557fba70c8 playbook rev4 2026-01-23 19:09:38 -08:00
Csaba Kecskemeti
c2829a5421 playbook rev3 2026-01-23 19:05:48 -08:00
Csaba Kecskemeti
ae8c01eb9d add link to playbook 2026-01-23 19:01:29 -08:00
Csaba Kecskemeti
b254e6f632 playbook rev2 2026-01-23 18:59:01 -08:00
Csaba Kecskemeti
c5d7b777e1 initial version of the playbook 2026-01-23 18:39:02 -08:00
12 changed files with 1485 additions and 166 deletions

View File

@@ -28,6 +28,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
- [CUDA-X Data Science](nvidia/cuda-x-data-science/)
- [DGX Dashboard](nvidia/dgx-dashboard/)
- [FLUX.1 Dreambooth LoRA Fine-tuning](nvidia/flux-finetuning/)
- [Heterogeneous Distributed Inference over RDMA](nvidia/heterogeneous-distributed-inference-rdma/)
- [Install and Use Isaac Sim and Isaac Lab](nvidia/isaac/)
- [Optimized JAX](nvidia/jax/)
- [Live VLM WebUI](nvidia/live-vlm-webui/)
@@ -39,7 +40,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
- [Connect Multiple DGX Spark through a Switch](nvidia/multi-sparks-through-switch/)
- [NCCL for Two Sparks](nvidia/nccl/)
- [Fine-tune with NeMo](nvidia/nemo-fine-tune/)
- [NemoClaw with Nemotron-3-Super and Telegram on DGX Spark](nvidia/nemoclaw/)
- [NemoClaw with Nemotron 3 Super and Telegram on DGX Spark](nvidia/nemoclaw/)
- [Nemotron-3-Nano with llama.cpp](nvidia/nemotron/)
- [NIM on Spark](nvidia/nim-llm/)
- [NVFP4 Quantization](nvidia/nvfp4-quantization/)

View File

@@ -0,0 +1,532 @@
# Distributed Inference Guide
> Deploy and run distributed AI inference across DGX Spark and Linux Workstation using vLLM and Ray
## Table of Contents
- [Overview](#overview)
- [Instructions](#instructions)
- [Performance Benchmarks](#performance-benchmarks)
- [Troubleshooting](#troubleshooting)
- [Credits](#credits)
---
## Overview
## Basic idea
This guide walks you through deploying distributed inference across your heterogeneous RDMA cluster. Using Ray for orchestration and vLLM for inference, you can run large language models that exceed the memory capacity of any single GPU by distributing them across your DGX Spark and Linux workstation.
**Architecture:**
```
┌─────────────────────────────────┐    ┌───────────────────────────────────┐
│            DGX SPARK            │    │            WORKSTATION            │
│     (Grace Blackwell GB10)      │    │     (RTX 6000 Pro / RTX 5090)     │
│                                 │    │                                   │
│  ┌───────────────────────────┐  │    │  ┌───────────────────────────┐    │
│  │      vLLM Head Node       │  │    │  │        vLLM Worker        │    │
│  │   (API Server, Rank 0)    │  │    │  │     (Tensor Parallel)     │    │
│  └───────────────────────────┘  │    │  └───────────────────────────┘    │
│                │                │    │                │                  │
│  ┌───────────────────────────┐  │    │  ┌───────────────────────────┐    │
│  │      Ray Head (6379)      │◄─┼────┼──│        Ray Worker         │    │
│  └───────────────────────────┘  │    │  └───────────────────────────┘    │
│                │                │    │                │                  │
│  ┌───────────────────────────┐  │    │  ┌───────────────────────────┐    │
│  │      NCCL over RDMA       │◄─┼════┼──│      NCCL over RDMA       │    │
│  │       192.168.200.1       │  │    │  │       192.168.200.2       │    │
│  └───────────────────────────┘  │    │  └───────────────────────────┘    │
└─────────────────────────────────┘    └───────────────────────────────────┘
```
## What you'll accomplish
- Configure SSH and hostname resolution between nodes
- Test NCCL communication over RDMA
- Deploy RDMA-enabled Docker containers
- Establish a Ray cluster across both systems
- Run distributed inference with vLLM
- Benchmark performance across different configurations
## What to know before starting
- Familiarity with Docker and container networking
- Understanding of distributed computing concepts (Ray, tensor parallelism)
- Basic knowledge of LLM inference serving
## Prerequisites
- Completed [RDMA Network Setup](README.md) with validated 90+ Gbps bandwidth
- Docker installed on both systems: `docker --version`
- NVIDIA Container Toolkit installed
- Hugging Face account for model access (some models require authentication)
> [!NOTE]
> **Why we use the `nvcr.io/nvidia/vllm` container:** This tutorial uses the official NVIDIA vLLM container image (`nvcr.io/nvidia/vllm:25.09-py3`) on both nodes. This is important because:
> - **Version consistency:** Ray clusters are very sensitive to Python and Ray version mismatches between nodes. Running the same container image on both nodes keeps the versions identical on DGX Spark (ARM64) and the workstation (AMD64).
> - **Pre-installed dependencies:** NCCL, RDMA libraries, and all required packages are already configured.
> - **Multi-architecture support:** The same image tag works on both ARM64 (DGX Spark) and AMD64 (Workstation) architectures.
> - **vLLM ready:** No additional installation needed - just pull and run.
## Time & risk
- **Duration:** 1-2 hours including testing
- **Risk level:** Low - uses containers, non-destructive
- **Rollback:** Stop containers to revert
- **Last Updated:** 01/23/2026
---
## Instructions
## Step 1. Configure Hostnames
Add hostname aliases on both systems:
```bash
## Add hostname resolution on both DGX Spark and Workstation
sudo tee -a /etc/hosts > /dev/null <<EOF
192.168.200.1 dgx-spark
192.168.200.2 workstation
EOF
```
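Because `tee -a` appends unconditionally, running the block above twice leaves duplicate lines in `/etc/hosts`. The following Python sketch shows one way to work out which entries are still missing before appending. The helper name `missing_host_entries` and the sample file contents are illustrative assumptions, not part of the playbook.

```python
def missing_host_entries(hosts_text, wanted):
    """Return the 'ip name' lines from `wanted` that hosts_text does not already define."""
    present = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2:
            for name in fields[1:]:
                present.add((fields[0], name))
    return [f"{ip} {name}" for ip, name in wanted if (ip, name) not in present]


# The two aliases this guide adds
WANTED = [("192.168.200.1", "dgx-spark"), ("192.168.200.2", "workstation")]

# Hypothetical file contents in which only the first alias exists so far
SAMPLE_HOSTS = "127.0.0.1 localhost\n192.168.200.1 dgx-spark\n"

for entry in missing_host_entries(SAMPLE_HOSTS, WANTED):
    print(entry)  # lines that would still need to be appended
```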
## Step 2. Set Up SSH Access
Install SSH server if needed (common on workstations):
```bash
## Check SSH server status
sudo systemctl status ssh
## If not installed:
sudo apt update
sudo apt install openssh-server
sudo systemctl start ssh
sudo systemctl enable ssh
```
Configure passwordless SSH between nodes:
On DGX Spark:
```bash
## Check if SSH key exists
ls ~/.ssh/id_*.pub
## If no key exists, generate one:
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
## Copy key to workstation
ssh-copy-id <your-username>@workstation
```
On Workstation:
```bash
## Check if SSH key exists
ls ~/.ssh/id_*.pub
## If no key exists, generate one:
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
## Copy key to DGX Spark
ssh-copy-id <your-username>@dgx-spark
```
Verify passwordless SSH:
```bash
## From DGX Spark
ssh <your-username>@workstation hostname
## Expected output: workstation
## From Workstation
ssh <your-username>@dgx-spark hostname
## Expected output: dgx-spark
```
---
## Step 3. Test NCCL Communication
Create the NCCL test script on both systems:
```bash
## Create test script
cat > test_nccl.py << 'EOF'
import os
import torch
import torch.distributed as dist
import argparse
def test_nccl_communication():
    parser = argparse.ArgumentParser()
    parser.add_argument('--rank', type=int, required=True)
    parser.add_argument('--world_size', type=int, default=2)
    parser.add_argument('--master_addr', type=str, default='192.168.200.1')
    parser.add_argument('--master_port', type=str, default='29500')
    args = parser.parse_args()
    os.environ['RANK'] = str(args.rank)
    os.environ['WORLD_SIZE'] = str(args.world_size)
    os.environ['MASTER_ADDR'] = args.master_addr
    os.environ['MASTER_PORT'] = args.master_port
    os.environ['NCCL_SOCKET_IFNAME'] = 'enp1s0f0np0'
    print(f"Initializing process group - Rank: {args.rank}, World Size: {args.world_size}")
    print(f"Master: {args.master_addr}:{args.master_port}")
    dist.init_process_group(backend='nccl', rank=args.rank, world_size=args.world_size)
    print(f"Process group initialized - Rank: {dist.get_rank()}/{dist.get_world_size()}")
    device = torch.device('cuda:0')
    tensor = torch.ones(10, device=device) * (args.rank + 1)
    print(f"Rank {args.rank} - Before allreduce: {tensor}")
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f"Rank {args.rank} - After allreduce: {tensor}")
    print(f"Expected result: {torch.ones(10) * (1 + 2)}")
    dist.destroy_process_group()
    print(f"Rank {args.rank} - Test completed successfully!")
if __name__ == "__main__":
    test_nccl_communication()
EOF
```
Run NCCL test in Docker containers:
On DGX Spark (start first):
```bash
docker run -it --runtime=nvidia --gpus all --network host --ipc=host \
--privileged --ulimit memlock=-1 --ulimit stack=67108864 \
-v /dev/infiniband:/dev/infiniband -v /sys:/sys:ro \
-e NCCL_IB_DISABLE=0 -e NCCL_IB_HCA=rocep1s0f0:1 -e NCCL_IB_GID_INDEX=3 \
-e NCCL_SOCKET_IFNAME=enp1s0f0np0 -v $(pwd):/workspace \
nvcr.io/nvidia/vllm:25.09-py3 python /workspace/test_nccl.py --rank 0
```
On Workstation (connect to DGX):
```bash
docker run -it --runtime=nvidia --gpus all --network host --ipc=host \
--privileged --ulimit memlock=-1 --ulimit stack=67108864 \
-v /dev/infiniband:/dev/infiniband -v /sys:/sys:ro \
-e NCCL_IB_DISABLE=0 -e NCCL_IB_HCA=rocep1s0f0:1 -e NCCL_IB_GID_INDEX=3 \
-e NCCL_SOCKET_IFNAME=enp1s0f0np0 -v $(pwd):/workspace \
nvcr.io/nvidia/vllm:25.09-py3 python /workspace/test_nccl.py --rank 1
```
**Success indicators:**
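- With `NCCL_DEBUG=INFO` set (for example, by adding `-e NCCL_DEBUG=INFO` to the `docker run` commands above), the log includes a line such as `NCCL INFO Using network IBext_v10`; the exact plugin name can vary with the NCCL version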
- All-reduce operation completes successfully
- Final tensors show expected sum values (3.0 for each element)
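
The expected value of 3.0 follows from how the test script fills its tensors: rank `r` contributes `r + 1`, and a SUM all-reduce adds the contributions of every rank. A short sketch of that arithmetic, with the hypothetical helper `expected_allreduce_value`, shows what to expect if you change `--world_size` (note that the script itself prints a hard-coded `1 + 2`):

```python
def expected_allreduce_value(world_size):
    """Each rank contributes (rank + 1), so SUM yields 1 + 2 + ... + world_size."""
    return sum(rank + 1 for rank in range(world_size))


# Two nodes, as in this guide
print(expected_allreduce_value(2))
```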
---
## Step 4. Start RDMA-Enabled Containers
On DGX Spark:
```bash
docker run -it --runtime=nvidia --gpus all --network host --ipc=host --shm-size=10g \
--privileged \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v /dev/infiniband:/dev/infiniband \
-v /sys:/sys:ro \
-e CUDA_DEVICE_ORDER=PCI_BUS_ID \
-e GLOO_SOCKET_IFNAME=enp1s0f0np0 \
-e NCCL_IB_DISABLE=0 \
-e NCCL_IB_HCA=rocep1s0f0:1 \
-e NCCL_IB_GID_INDEX=3 \
-e NCCL_SOCKET_IFNAME=enp1s0f0np0 \
-e RAY_USE_MULTIPLE_IPS=0 \
-e RAY_NODE_IP_ADDRESS=192.168.200.1 \
-e RAY_OVERRIDE_NODE_IP=192.168.200.1 \
-e VLLM_HOST_IP=192.168.200.1 \
nvcr.io/nvidia/vllm:25.09-py3 bash
```
On Workstation:
```bash
docker run -it --runtime=nvidia --gpus all --network host --ipc=host --shm-size=10g \
--privileged \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v /dev/infiniband:/dev/infiniband \
-v /sys:/sys:ro \
-e CUDA_DEVICE_ORDER=PCI_BUS_ID \
-e GLOO_SOCKET_IFNAME=enp1s0f0np0 \
-e NCCL_IB_DISABLE=0 \
-e NCCL_IB_HCA=rocep1s0f0:1 \
-e NCCL_IB_GID_INDEX=3 \
-e NCCL_SOCKET_IFNAME=enp1s0f0np0 \
-e RAY_USE_MULTIPLE_IPS=0 \
-e RAY_NODE_IP_ADDRESS=192.168.200.2 \
-e RAY_OVERRIDE_NODE_IP=192.168.200.2 \
nvcr.io/nvidia/vllm:25.09-py3 bash
```
**Key parameters explained:**
- `--runtime=nvidia`: Required for GPU access
- `--network host`: Uses host networking (required for RDMA)
- `--privileged`: Needed for InfiniBand device access
- `--ulimit memlock=-1`: Unlimited memory locking for RDMA
- `-v /dev/infiniband:/dev/infiniband`: Mounts RDMA devices
- `NCCL_IB_HCA=rocep1s0f0:1`: Tells NCCL to use specific RDMA device
- `RAY_USE_MULTIPLE_IPS=0`: Prevents Ray IP detection issues
---
## Step 5. Establish Ray Cluster
On DGX Spark container (head node):
```bash
ray start --head \
--node-ip-address=192.168.200.1 \
--port=6379 \
--dashboard-host=192.168.200.1 \
--dashboard-port=8265 \
--num-gpus=1
```
Verify head node:
```bash
ray status
```
Expected output:
```
======== Autoscaler status: 2026-01-10 19:43:05.517578 ========
Node status
---------------------------------------------------------------
Active:
1 node_xxxxx
Resources
---------------------------------------------------------------
Total Usage:
0.0/20.0 CPU
0.0/1.0 GPU
```
On Workstation container (worker node):
```bash
ray start \
--address=192.168.200.1:6379 \
--node-ip-address=192.168.200.2 \
--num-gpus=1
```
> [!NOTE]
> Adjust `--num-gpus` based on your workstation configuration. In our case, we had 2 GPUs (RTX 6000 Pro + RTX 5090) but only used 1 for this tutorial.
Verify cluster formation:
```bash
ray status
```
Expected output (should show 2+ total GPUs depending on your setup):
```
======== Autoscaler status: 2026-01-10 19:46:26.274139 ========
Node status
---------------------------------------------------------------
Active:
1 node_xxxxx (head)
1 node_xxxxx (worker)
Resources
---------------------------------------------------------------
Total Usage:
0.0/68.0 CPU
0.0/2.0 GPU
```
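If you want to check the cluster size from a script rather than by eye, the totals can be pulled out of the `ray status` text. This is a minimal sketch that assumes the `used/total RESOURCE` line format shown above, which may differ between Ray versions; `total_resource` is a hypothetical helper, not a Ray API.

```python
import re


def total_resource(status_text, resource):
    """Return the total from a line like '0.0/2.0 GPU', or None if it is absent."""
    pattern = r"([\d.]+)/([\d.]+)\s+" + re.escape(resource) + r"\b"
    match = re.search(pattern, status_text)
    return float(match.group(2)) if match else None


# Resource lines copied from the expected output above
SAMPLE_STATUS = """\
Total Usage:
 0.0/68.0 CPU
 0.0/2.0 GPU
"""

gpus = total_resource(SAMPLE_STATUS, "GPU")
print("cluster ready" if gpus is not None and gpus >= 2 else "worker not joined yet")
```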
---
## Step 6. Run Validation Test (4B Model)
Start a small model for validation in the DGX Spark container:
```bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-4B-Instruct-2507 \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--gpu-memory-utilization 0.8 \
--host 192.168.200.1 \
--port 8000
```
Test from another terminal:
```bash
curl -X POST "http://192.168.200.1:8000/v1/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-4B-Instruct-2507",
"prompt": "Test distributed inference:",
"max_tokens": 500
}'
```
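The same request can be issued from Python with only the standard library, which is handy for scripted smoke tests. The sketch below builds the request without sending it; `build_completion_request` is a hypothetical helper, and the commented lines show how it could be sent once the server is up.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=500):
    """Build (but do not send) a POST request for the /v1/completions route."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request(
    "http://192.168.200.1:8000",
    "Qwen/Qwen3-4B-Instruct-2507",
    "Test distributed inference:",
)
print(req.get_method(), req.full_url)

# To send it once the server is running:
# with urllib.request.urlopen(req, timeout=120) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```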
---
## Step 7. Run FP8 Quantized Model (30B)
FP8 quantization stores each weight in 8 bits instead of the 16 bits used by BF16, roughly halving weight memory while keeping good throughput:
```bash
## Stop previous model (Ctrl+C), then start FP8 30B model
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--gpu-memory-utilization 0.8 \
--host 192.168.200.1 \
--port 8000
```
**Benefits of FP8:**
- Memory efficiency: Reduced footprint compared to BF16
- Performance: 341+ tok/s demonstrated
- Hardware compatibility: Fully supported on Blackwell GB10
---
## Step 8. Run Large Model (72B)
This step demonstrates the real power of distributed inference: running a model that **exceeds the memory capacity of any single GPU**.
| Component | Available VRAM | Sufficient for 72B? |
|-----------|---------------|---------------------|
| DGX Spark | 128 GB | No (~136GB needed) |
| RTX 6000 Pro | 96 GB | No (~136GB needed) |
| **Combined Cluster** | **224 GB** | **Yes** |
The Qwen2.5-72B-Instruct model requires ~136GB in BF16 precision - impossible to run on either GPU alone. This is where our RDMA cluster shines, aggregating memory across both systems.
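
The ~136GB figure can be approximated from the parameter count. The sketch below is a rough estimate for the weights only, assuming about 72.7 billion parameters (an assumption here, not a number from this guide) and an even split across the two tensor-parallel ranks; KV cache and activations need additional memory on top, which is why the launch command below also limits context length and concurrency.

```python
def weights_gib(num_params, bytes_per_param):
    """Size of the model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


PARAMS = 72.7e9   # assumption: approximate parameter count of Qwen2.5-72B
BF16_BYTES = 2    # BF16 stores each weight in 2 bytes
TP_SIZE = 2       # matches --tensor-parallel-size 2

total = weights_gib(PARAMS, BF16_BYTES)
per_gpu = total / TP_SIZE  # approximation: weights split evenly across ranks
print(f"weights: ~{total:.0f} GiB total, ~{per_gpu:.0f} GiB per GPU")
```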
Memory-optimized configuration for 136GB model:
```bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-72B-Instruct \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--gpu-memory-utilization 0.85 \
--host 192.168.200.1 \
--port 8000 \
--max-model-len 2048 \
--max-num-seqs 8 \
--disable-sliding-window \
--enforce-eager
```
**Memory optimization parameters:**
- `--gpu-memory-utilization 0.85`: Uses 85% of GPU memory
- `--max-model-len 2048`: Limits context length to save memory
- `--max-num-seqs 8`: Reduces concurrent sequences
- `--disable-sliding-window`: Disables memory-intensive sliding window attention
- `--enforce-eager`: Runs in eager mode and skips CUDA graph capture, which saves memory at some cost in throughput
Test 72B model:
```bash
curl -X POST "http://192.168.200.1:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-72B-Instruct",
"messages": [
{"role": "user", "content": "Explain the benefits of RDMA for AI workloads in one paragraph."}
],
"max_tokens": 500
}'
```
---
## Step 9. Monitor RDMA Traffic
Monitor RDMA activity during inference:
```bash
## Run on both systems (separate terminals)
watch -n 0.5 "
echo '=== RDMA Counters ===';
echo -n 'TX: '; cat /sys/class/infiniband/rocep1s0f0/ports/1/counters/port_xmit_data;
echo -n 'RX: '; cat /sys/class/infiniband/rocep1s0f0/ports/1/counters/port_rcv_data;
echo 'Timestamp: '; date;
"
```
During inference, you'll see counters increasing as tensors are communicated between GPUs.
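
Raw counter values are hard to read, so it can help to turn two readings into a rate. The sketch below assumes the `port_xmit_data` and `port_rcv_data` counters count 4-byte words, as the kernel's InfiniBand sysfs documentation describes; verify this for your driver. Both helper names are hypothetical, and the example readings are made up rather than measured.

```python
def read_counter(path):
    """Read one integer counter from sysfs."""
    with open(path) as f:
        return int(f.read().strip())


def counter_delta_to_gbps(before, after, seconds, bytes_per_unit=4):
    """Convert two counter readings taken `seconds` apart into Gbit/s.

    bytes_per_unit=4 is an assumption: the counter is taken to count 4-byte words.
    """
    return (after - before) * bytes_per_unit * 8 / seconds / 1e9


# Made-up readings taken one second apart (not measured values)
print(counter_delta_to_gbps(0, 312_500_000, 1.0))

# On a real system, sample the counter twice, for example:
# path = "/sys/class/infiniband/rocep1s0f0/ports/1/counters/port_xmit_data"
# first = read_counter(path); time.sleep(1); second = read_counter(path)
```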
---
## Performance Benchmarks
### Benchmark Commands
**Single-node testing:**
```bash
## On RTX 6000 Pro or DGX Spark
vllm bench latency --model Qwen/Qwen3-30B-A3B-Thinking-2507 --input-len 512 --output-len 2000 --num-iters 10
vllm bench throughput --model Qwen/Qwen3-30B-A3B-Thinking-2507 --input-len 512 --output-len 2000 --num-prompts 20
```
**Distributed testing:**
```bash
## 30B Model
vllm bench serve --host 192.168.200.1 --port 8000 --random-input-len 512 --random-output-len 2000 --num-prompts 20 --request-rate 2 --model Qwen/Qwen3-30B-A3B-Thinking-2507
## 72B Model
vllm bench serve --host 192.168.200.1 --port 8000 --random-input-len 256 --random-output-len 1500 --num-prompts 20 --request-rate 2 --model Qwen/Qwen2.5-72B-Instruct
```
### Performance Results Summary
| Configuration | Avg Latency | Output Throughput | Total Throughput |
|---------------|-------------|-------------------|------------------|
| **RTX 6000 Pro (Single)** | 36.87s | 679.88 tok/s | 853.90 tok/s |
| **DGX Spark (Single)** | 213.12s | 105.10 tok/s | 132.00 tok/s |
| **Distributed RDMA** | 191.09s | 205.83 tok/s | 259.41 tok/s |
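
To put the rows in relation to each other, the output throughput figures can be compared directly. This sketch only divides numbers from the table above; it makes no claim about which model or settings each row used, and `throughput_ratio` is a hypothetical helper.

```python
# Output throughput in tok/s, copied from the table above
OUTPUT_TOK_PER_S = {
    "RTX 6000 Pro (Single)": 679.88,
    "DGX Spark (Single)": 105.10,
    "Distributed RDMA": 205.83,
}


def throughput_ratio(config, baseline):
    """How many times faster `config` is than `baseline` (below 1 means slower)."""
    return OUTPUT_TOK_PER_S[config] / OUTPUT_TOK_PER_S[baseline]


for baseline in ("DGX Spark (Single)", "RTX 6000 Pro (Single)"):
    ratio = throughput_ratio("Distributed RDMA", baseline)
    print(f"Distributed vs {baseline}: {ratio:.2f}x")
```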
### What This Demonstrates
The key achievement of this tutorial is successfully running distributed inference across heterogeneous hardware (DGX Spark ARM64 + Linux Workstation AMD64) over RDMA. The distributed setup aggregates GPU memory from both systems, enabling models that wouldn't fit on either device alone.
### FP8 30B Model Results
```
============ Serving Benchmark Result ============
Successful requests: 20
Benchmark duration (s): 115.36
Output token throughput (tok/s): 341.15
Total Token throughput (tok/s): 429.89
Mean TTFT (ms): 171.00
Mean TPOT (ms): 53.08
==================================================
```
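As a sanity check, the reported throughputs and duration can be multiplied back into token counts and compared with the benchmark settings (`--num-prompts 20`, `--random-input-len 512`, `--random-output-len 2000`). The sketch below uses only the numbers printed above; the variable names are illustrative.

```python
DURATION_S = 115.36
OUTPUT_TOK_PER_S = 341.15
TOTAL_TOK_PER_S = 429.89
REQUESTS = 20

output_tokens = OUTPUT_TOK_PER_S * DURATION_S
total_tokens = TOTAL_TOK_PER_S * DURATION_S
input_tokens = total_tokens - output_tokens  # total counts prompt plus generated tokens

# Per-request averages, to compare with the 512 / 2000 benchmark settings
print(f"avg input tokens per request:  {input_tokens / REQUESTS:.0f}")
print(f"avg output tokens per request: {output_tokens / REQUESTS:.0f}")
```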
---
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| Ray worker can't connect to head | Firewall blocking port 6379 | `sudo ufw allow 6379/tcp` |
| NCCL timeout during model load | RDMA not working | Verify `ib_send_bw` test passes |
| "Placement group" errors | Ray cluster not formed | Check `ray status` on both nodes |
| OOM during 72B model load | Insufficient memory optimization | Add `--max-model-len 2048 --enforce-eager` |
| SSH connection refused | SSH server not running | `sudo systemctl start ssh` |
| Container can't access RDMA | Missing device mount | Ensure `-v /dev/infiniband:/dev/infiniband` |
| Wrong IP in Ray cluster | Multiple network interfaces | Set `RAY_USE_MULTIPLE_IPS=0` |
| Slow inference performance | NCCL using wrong interface | Verify `NCCL_SOCKET_IFNAME=enp1s0f0np0` |
---
## Credits
This playbook was contributed by **Csaba Kecskemeti** | [DevQuasar](https://devquasar.com/).
For a detailed walkthrough and additional context, see the original article:
[Distributed Inference Cluster: DGX Spark + RTX 6000 Pro](https://devquasar.com/ai/edge-ai/distributed-inference-cluster-dgx-spark-rtx-6000-pro/)

View File

@@ -0,0 +1,624 @@
# Heterogeneous Distributed Inference over RDMA
> Set up high-speed RDMA networking between DGX Spark (ConnectX-7) and a Linux Workstation (ConnectX-5) for distributed AI inference
## Table of Contents
- [Overview](#overview)
- [Instructions](#instructions)
- [Troubleshooting](#troubleshooting)
- [Next Steps](#next-steps)
- [Credits](#credits)
---
## Overview
## Basic idea
This playbook guides you through setting up a heterogeneous distributed computing environment using RDMA (Remote Direct Memory Access) over Converged Ethernet (RoCE v2). You will connect a DGX Spark system with a Linux workstation equipped with a Mellanox ConnectX network adapter, enabling high-speed GPU-to-GPU communication for distributed AI workloads.
With RDMA enabled, data flows directly between GPU memories:
```
GPU memory → PCIe → NIC (mlx5) → wire → NIC → PCIe → GPU memory
```
**Key properties:**
- **No CPU copies:** Data bypasses system memory
- **No kernel networking stack:** Direct hardware-to-hardware communication
- **Ultra-low latency:** Microsecond-level communication
- **High throughput:** 93+ Gbps validated over 100 Gbps link
## What you'll accomplish
- Enable low-latency, zero-copy GPU↔GPU communication between heterogeneous systems
- Configure RoCE v2 networking over 100 Gbps direct QSFP connection
- Validate RDMA performance (93+ Gbps achievable)
- Prepare both systems for multi-node inference and training with NCCL
## What to know before starting
- Basic understanding of Linux networking and command line
- Familiarity with network interface configuration (netplan)
- Understanding of PCIe and GPU computing concepts
- Basic knowledge of RDMA/InfiniBand terminology is helpful but not required
## Prerequisites
**Node A: DGX Spark**
- GPU: 128 GB unified memory (Grace Blackwell GB10)
- NIC: ConnectX-7 (QSFP56/QSFP112)
- OS: NVIDIA DGX OS (Ubuntu-based, ARM64)
**Node B: Linux Workstation**
- GPU: NVIDIA GPU with sufficient VRAM (e.g., RTX 6000 Pro, RTX 5090)
- NIC: ConnectX-5 or newer (e.g., MCX516A-CDAT for 100 GbE dual-port)
- OS: Ubuntu 20.04 / 22.04 / 24.04
- PCIe: Gen4 x16 slot recommended
**Physical Requirements:**
- One QSFP cable (QSFP56 ↔ QSFP28 compatible, 100 Gbps negotiated)
- Direct connection or dedicated switch
> [!NOTE]
> **About the hardware used in this tutorial:** We used a ConnectX-5 (MCX516A-CDAT, 100GbE dual-port) on the workstation because that's what we had available. This limits the link speed to 100 Gbps. If you use a ConnectX-7 NIC on the workstation side (matching the DGX Spark), you can achieve up to 200 Gbps. The setup process is the same or very similar - just with higher bandwidth.
> [!NOTE]
> Interface names (e.g., `enp1s0f0np0`, `rocep1s0f0`) are system-specific and will differ on your hardware. Use these commands to identify your interfaces:
> ```bash
> ## Find RDMA device to network interface mapping
> ibdev2netdev
>
> ## List all network interfaces
> ip link show
>
> ## Show detailed RDMA device info
> ibv_devinfo
> ```
## Ancillary files
All required files for this playbook can be found [here on GitHub](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/heterogeneous-distributed-inference-rdma/):
- [**test_nccl.py**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/heterogeneous-distributed-inference-rdma/assets/test_nccl.py) - NCCL communication test script
## Time & risk
- **Duration:** 2-3 hours including validation and testing
- **Risk level:** Medium - involves network reconfiguration
- **Rollback:** Network changes can be reversed by removing netplan configs or IP assignments
- **Last Updated:** 01/23/2026
---
## Instructions
## Step 1. Understand the Architecture
Your distributed inference system uses **two separate communication planes**:
| Component | Purpose | Protocol | Latency |
|-----------|---------|----------|---------|
| **Control Plane (Ray)** | Orchestration, scheduling, actor management | TCP/IP (gRPC) | Milliseconds |
| **Data Plane (NCCL)** | High-speed GPU tensor transfers | RoCE v2 (RDMA) | Microseconds |
Both planes use the same 100 Gbps ConnectX network in this configuration.
**RoCE vs InfiniBand:**
| Mode | What it is | Notes |
|------|------------|-------|
| **RoCE v2 (Ethernet)** | RDMA over Ethernet | Recommended for this setup |
| **InfiniBand** | Native IB fabric | Requires IB switches |
> [!NOTE]
> If your ConnectX-5 is Ethernet-only (not VPI), RoCE v2 is the correct and only supported mode.
**Core software components (required on both nodes):**
| Component | Purpose | Notes |
|-----------|---------|--------|
| `mlx5_core` | Main NIC driver | Kernel module |
| `mlx5_ib` | RDMA support | Kernel module |
| `rdma-core` | Userspace RDMA stack | Package: rdma-core |
| `infiniband-diags` | Diagnostics (`ibstat`) | Package: infiniband-diags |
| `mstflint` | Firmware inspection | Package: mstflint |
| `NCCL` | Multi-GPU collectives | Built into PyTorch/frameworks |
---
## Step 2. Set Up the Workstation (ConnectX-5)
**Hardware & BIOS checklist:**
1. Install the ConnectX card in a PCIe Gen3/4 x16 slot (CPU-direct, not via chipset)
2. **Cooling Requirements:** ConnectX-5/7 100GbE cards are primarily designed for server environments with active cooling. In a workstation, ensure adequate case airflow directed at the card, and consider adding a PCIe slot fan for sustained high-bandwidth workloads.
3. **BIOS settings:**
```
Above 4G Decoding: Enabled
ASPM (Power Management): Disabled
PCIe Speed: Auto / Gen4
SR-IOV: Enabled (optional, for virtualization)
```
Verify PCIe detection:
```bash
## Check if ConnectX card is detected
lspci -nn | grep -i mellanox
```
Expected output:
```
03:00.0 Ethernet controller [0200]: Mellanox MT27800 [ConnectX-5] [15b3:1017]
03:00.1 Ethernet controller [0200]: Mellanox MT27800 [ConnectX-5] [15b3:1017]
```
## Step 3. Install Drivers on Workstation
Check if mlx5 drivers are already installed:
```bash
## Check for existing Mellanox drivers
lsmod | grep mlx5
```
**Option 1: Ubuntu Inbox Drivers (Recommended)**
```bash
## Update package list
sudo apt update
## Install kernel modules
sudo apt install linux-modules-extra-$(uname -r)
## Load drivers
sudo modprobe mlx5_core mlx5_ib
```
**Option 2: NVIDIA MLNX_OFED (If inbox drivers insufficient)**
```bash
## Download from: https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/
wget https://content.mellanox.com/ofed/MLNX_OFED-24.01-0.3.3.1/MLNX_OFED_LINUX-24.01-0.3.3.1-ubuntu24.04-x86_64.tgz
## Extract and install
tar -xzf MLNX_OFED_LINUX-*.tgz
cd MLNX_OFED_LINUX-*
sudo ./mlnxofedinstall --upstream-libs --dpdk
sudo /etc/init.d/openibd restart
```
## Step 4. Install Required Packages on Workstation
```bash
## Update package list
sudo apt update
## Install RDMA and networking packages
sudo apt install -y \
rdma-core \
ibverbs-utils \
rdmacm-utils \
libibmad5 \
infiniband-diags \
perftest \
mstflint \
ethtool \
ibutils
```
## Step 5. Verify Workstation RDMA Stack
Verify kernel drivers are loaded:
```bash
## Check loaded drivers
lsmod | grep mlx5
```
You must see `mlx5_core` and `mlx5_ib`. If missing, load them:
```bash
## Load drivers manually
sudo modprobe mlx5_core mlx5_ib
## Make permanent
echo 'mlx5_core' | sudo tee -a /etc/modules
echo 'mlx5_ib' | sudo tee -a /etc/modules
```
Validate RDMA stack:
```bash
## Show RDMA device info
ibv_devinfo
```
Expected output:
```
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 16.35.2000
node_guid: xxxx:xxxx:xxxx:xxxx
vendor_id: 0x02c9
vendor_part_id: 4119
phys_port_cnt: 1
```
```bash
## Show adapter status
ibstat
```
Validate PCIe bandwidth (replace `03:00.0` with your actual bus address):
```bash
## Check PCIe link speed and width
sudo lspci -s 03:00.0 -vv | grep -E "LnkCap|LnkSta"
```
Target output:
```
LnkCap: Port #0, Speed 16GT/s, Width x16
LnkSta: Speed 16GT/s (ok), Width x16 (ok)
```
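The link speed and width matter because the slot has to carry at least the 100 Gbps the NIC can deliver. A rough sketch of the raw link-rate arithmetic follows, assuming the 128b/130b line encoding used by PCIe Gen3 and Gen4; it ignores packet and protocol overhead, so usable bandwidth is somewhat lower. `pcie_raw_gbps` is a hypothetical helper.

```python
def pcie_raw_gbps(gt_per_s, lanes, encoding=128 / 130):
    """Raw one-direction PCIe link rate in Gbit/s, before protocol overhead."""
    return gt_per_s * lanes * encoding


NIC_GBPS = 100  # link speed negotiated in this guide

for name, speed in (("Gen3", 8), ("Gen4", 16)):
    rate = pcie_raw_gbps(speed, 16)
    verdict = "enough" if rate > NIC_GBPS else "a bottleneck"
    print(f"{name} x16: ~{rate:.0f} Gbps raw, {verdict} for a {NIC_GBPS} Gbps NIC")
```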
---
## Step 6. Set Up DGX Spark (ConnectX-7)
**Fix repository signature issues (if needed):**
If you encounter GPG key errors:
```bash
## Remove problematic repository
sudo rm -f /etc/apt/sources.list.d/*ffmpeg* 2>/dev/null || true
## Download and install updated GPG key
curl -fsSL https://workbench.download.nvidia.com/stable/linux/gpgkey | \
gpg --dearmor | sudo tee /usr/share/keyrings/ai-workbench-desktop-key.gpg > /dev/null
## Update package list
sudo apt update
```
## Step 7. Install Required Packages on DGX Spark
```bash
## Update package list
sudo apt update
## Install RDMA packages
sudo apt install -y \
infiniband-diags \
rdma-core \
ibverbs-utils \
mstflint \
perftest \
ethtool
```
> [!NOTE]
> DOCA-OFED is **not required** for DGX Spark systems. The standard Ubuntu packages provide all necessary functionality.
## Step 8. Verify DGX Spark Interfaces
Verify network interfaces:
```bash
## Show network interfaces
ip link show | grep -E "enp|ib"
```
You should see ConnectX-7 ports like `enp1s0f0np0`, `enp1s0f1np1`, etc.
Verify RDMA interfaces:
```bash
## Show RDMA device to interface mapping
ibdev2netdev
```
Example output:
```
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
```
Check PCIe topology:
```bash
## Show GPU and NIC topology
nvidia-smi topo -m
```
This shows how GPUs and NICs are interconnected via PCIe.
---
## Step 9. Connect the QSFP Cable
**Cable type:** QSFP56 or QSFP28 cable (they are cross-compatible at 100 Gbps)
**Connection procedure:**
1. Identify ports: DGX Spark has 2 physical QSFP ports with 4 logical interfaces
2. Connect QSFP cable between any available ports
3. Cable compatibility: QSFP56 ↔ QSFP28 works (100 Gbps negotiated)
4. Link detection: Should be automatic within 10-20 seconds
Verify physical link detection on DGX Spark:
```bash
## Check link status
ibdev2netdev
```
Expected output (after cable connection):
```
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
```
> [!NOTE]
> If none of the interfaces are showing as 'Up', please check the QSFP cable connection, reboot the systems and try again.
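
When scripting the later steps, the active device and interface names can be taken from this output instead of being typed by hand. The sketch below assumes the `<device> port <n> ==> <interface> (<state>)` line format shown above; `up_devices` is a hypothetical helper.

```python
def up_devices(ibdev2netdev_output):
    """Return (rdma_device, network_interface) pairs whose link state is Up."""
    pairs = []
    for line in ibdev2netdev_output.splitlines():
        parts = line.split()
        # Expected shape: <rdma_dev> port <n> ==> <netdev> (<state>)
        if (len(parts) == 6 and parts[1] == "port"
                and parts[3] == "==>" and parts[5] == "(Up)"):
            pairs.append((parts[0], parts[4]))
    return pairs


# Sample copied from the expected output above
SAMPLE_OUTPUT = """\
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
"""

for rdma_dev, netdev in up_devices(SAMPLE_OUTPUT):
    print(f"{rdma_dev} -> {netdev}")
```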
Verify on Workstation:
```bash
## Check link status
ibdev2netdev
ip link show | grep -E "enp|mlx"
```
---
## Step 10. Configure Network Interfaces
**Network Configuration:**
- **RDMA Network:** 192.168.200.0/24
- **DGX Spark:** 192.168.200.1
- **Workstation:** 192.168.200.2
- **MTU:** 9000 (jumbo frames for optimal RDMA performance)
> [!NOTE]
> The management IP addresses shown in examples (192.168.1.x) are placeholders. Replace these with your actual network IP addresses that you see when running `ip addr show`.
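
If you pick different addresses, both nodes still need distinct addresses inside the same subnet. A small sketch using Python's standard `ipaddress` module can confirm that before you touch the interfaces; `valid_rdma_pair` is a hypothetical helper.

```python
import ipaddress

RDMA_NETWORK = ipaddress.ip_network("192.168.200.0/24")


def valid_rdma_pair(addr_a, addr_b, network=RDMA_NETWORK):
    """True if both addresses are distinct hosts inside the RDMA subnet."""
    a = ipaddress.ip_address(addr_a)
    b = ipaddress.ip_address(addr_b)
    return a != b and a in network and b in network


# Addresses used in this guide: DGX Spark and workstation
print(valid_rdma_pair("192.168.200.1", "192.168.200.2"))
```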
**Option 1: Temporary Configuration (Testing)**
> [!NOTE]
> These commands are temporary and will be lost on reboot!
On DGX Spark:
```bash
## Configure RDMA interface (use interface showing "Up" from ibdev2netdev)
sudo ip addr add 192.168.200.1/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
sudo ip link set enp1s0f0np0 mtu 9000
```
On Workstation:
```bash
## Configure RDMA interface
sudo ip addr add 192.168.200.2/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
sudo ip link set enp1s0f0np0 mtu 9000
```
**Option 2: Permanent Configuration**
First, identify your active internet interface on both systems:
```bash
## Find your internet interface
ip addr show | grep -A 2 "inet.*scope global"
ip link show | grep "state UP"
```
On DGX Spark:
```bash
## Create netplan configuration (REPLACE interface names with YOUR actual interfaces!)
sudo tee /etc/netplan/99-rdma.yaml > /dev/null <<EOF
network:
  version: 2
  renderer: networkd
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.200.1/24
      mtu: 9000
      dhcp4: false
    enP7s7: # Replace with YOUR actual internet interface!
      dhcp4: true
  wifis:
    wlP9s9: # WiFi - optional backup
      dhcp4: true
      access-points:
        "<your-wifi-ssid>":
          password: "<your-wifi-password>"
EOF
## Set permissions and apply
sudo chmod 600 /etc/netplan/99-rdma.yaml
sudo netplan apply
```
On Workstation:
```bash
## Create netplan configuration (REPLACE interface names with YOUR actual interfaces!)
sudo tee /etc/netplan/99-rdma.yaml > /dev/null <<EOF
network:
  version: 2
  renderer: networkd
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.200.2/24
      mtu: 9000
      dhcp4: false
    eno2np1: # Replace with YOUR actual internet interface!
      dhcp4: true
EOF
## Set permissions and apply
sudo chmod 600 /etc/netplan/99-rdma.yaml
sudo netplan apply
```
> [!IMPORTANT]
> Before applying netplan, identify your active internet interface to avoid losing connectivity. Interface names may change after applying netplan (e.g., `mlx5_0` to `rocep1s0f0`). Always verify current device names with `ibdev2netdev`.
## Step 11. Verify Network Connectivity
Test basic connectivity:
```bash
## From DGX Spark
ping -c 4 192.168.200.2
## From Workstation
ping -c 4 192.168.200.1
```
Expected output:
```
PING 192.168.200.2 (192.168.200.2) 56(84) bytes of data.
64 bytes from 192.168.200.2: icmp_seq=1 time=0.xxx ms
...
4 packets transmitted, 4 received, 0% packet loss
```
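A default `ping` sends small packets, so it does not confirm that 9000-byte jumbo frames pass end to end. One common check is to ping with the largest payload that fits in a single frame and with fragmentation forbidden. The sketch below works out that payload size, assuming a 20-byte IPv4 header without options and an 8-byte ICMP header, and prints a Linux `ping` command that uses it; `max_icmp_payload` is a hypothetical helper.

```python
def max_icmp_payload(mtu, ip_header=20, icmp_header=8):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - ip_header - icmp_header


payload = max_icmp_payload(9000)
# -M do forbids fragmentation, so the ping fails if any hop has a smaller MTU
print(f"ping -M do -s {payload} -c 4 192.168.200.2")
```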
---
## Step 12. Test RDMA Bandwidth
Identify correct device names:
```bash
## Check available RDMA devices
ibv_devinfo
ls /sys/class/infiniband/
```
**Device name mapping:**
- **DGX Spark:** Use `rocep1s0f0` or `roceP2p1s0f0`
- **Workstation:** Use `mlx5_0` or `mlx5_1` (or `rocep1s0f0` after persistent config)
Run bandwidth test:
On DGX Spark (server) - Start first:
```bash
## Start RDMA bandwidth test server
ib_send_bw -d rocep1s0f0
```
On Workstation (client) - Connect to server:
```bash
## Connect to server and run bandwidth test
ib_send_bw -d rocep1s0f0 192.168.200.1
```
Example successful output:
```
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : rocep1s0f0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
Link type : Ethernet
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 1000 11664.71 11664.25 0.186628
---------------------------------------------------------------------------------------
```
**Performance Analysis:**
- 11,664 MB/sec = ~93.3 Gbps
- Achieves >93% of 100 Gbps line rate
- Link type: Ethernet confirms RoCE v2 is working
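The conversion behind these figures can be reproduced as follows. This is a small sketch that assumes 1 MB = 10^6 bytes, which is what the 93.3 Gbps figure above implies; the function names are illustrative.

```python
def mb_per_sec_to_gbps(mb_per_sec: float) -> float:
    """Convert MB/sec (10^6 bytes per second) to Gbps (10^9 bits per second)."""
    return mb_per_sec * 8 / 1000


def line_rate_fraction(mb_per_sec: float, link_gbps: float = 100.0) -> float:
    """Fraction of the nominal link speed that the measured bandwidth reaches."""
    return mb_per_sec_to_gbps(mb_per_sec) / link_gbps


average = 11664.25  # "BW average[MB/sec]" from the example output
print(f"{mb_per_sec_to_gbps(average):.1f} Gbps, "
      f"{line_rate_fraction(average):.1%} of line rate")
# prints: 93.3 Gbps, 93.3% of line rate
```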
---
## Step 13. Configure Environment Variables for NCCL
Add to both systems (persistent across reboots):
```bash
## Add RDMA configuration to bashrc
echo '# RDMA Network Configuration' >> ~/.bashrc
echo 'export UCX_NET_DEVICES=enp1s0f0np0' >> ~/.bashrc
echo 'export NCCL_SOCKET_IFNAME=enp1s0f0np0' >> ~/.bashrc
echo 'export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0' >> ~/.bashrc
## Apply to current session
source ~/.bashrc
```
Verification:
```bash
## Check environment variables
echo $UCX_NET_DEVICES
echo $NCCL_SOCKET_IFNAME
## Both should show: enp1s0f0np0
```
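The same check can be scripted so it covers all three variables at once. The sketch below assumes the interface name used in this guide (`enp1s0f0np0`); substitute your own, and note that `check_rdma_env` is an illustrative helper name.

```python
import os

EXPECTED_IFACE = "enp1s0f0np0"  # interface name used in this guide; adjust to yours
REQUIRED_VARS = (
    "UCX_NET_DEVICES",
    "NCCL_SOCKET_IFNAME",
    "OMPI_MCA_btl_tcp_if_include",
)


def check_rdma_env(env, iface: str = EXPECTED_IFACE) -> dict:
    """Return {variable: current value or None} for every unset or mismatched variable."""
    return {var: env.get(var) for var in REQUIRED_VARS if env.get(var) != iface}


problems = check_rdma_env(os.environ)
if problems:
    for var, value in problems.items():
        print(f"{var} is {value!r}, expected {EXPECTED_IFACE!r}")
else:
    print("All RDMA environment variables are set")
```

Run it in a fresh shell on each node; any line it prints points at a variable that did not make it into `~/.bashrc` or was overridden later.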
---
## Step 14. Final Validation
At this point, you should have achieved:
- [ ] Physical link detected - `ibdev2netdev` shows "(Up)" status
- [ ] IP connectivity working - `ping 192.168.200.x` succeeds
- [ ] MTU set to 9000 - Jumbo frames enabled
- [ ] RDMA bandwidth >90 Gbps validated
- [ ] RoCE v2 confirmed - Link type: Ethernet
- [ ] Environment variables set for NCCL
Your RDMA setup is **fully operational** and ready for distributed AI workloads!
---
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| `ibdev2netdev` shows no devices | mlx5 drivers not loaded | `sudo modprobe mlx5_core mlx5_ib` |
| Interface shows "(Down)" after connecting the cable | Link not negotiated | Check cable, try different port, reboot |
| Ping fails between nodes | IP not configured or wrong interface | Verify `ip addr show`, check interface names |
| RDMA bandwidth <80 Gbps | MTU not set to 9000 | `sudo ip link set <interface> mtu 9000` |
| "mlx5_0 not found" error | Device name changed after netplan | Run `ibdev2netdev` to find current name |
| Permission denied on `/dev/infiniband` | Missing RDMA permissions | Run with `sudo` or add user to `rdma` group |
| GPG key errors on DGX Spark | Expired NVIDIA repository key | See Step 6 for fix |
| Lost internet after netplan apply | Wrong interface in netplan config | Identify correct interface with `ip link show` first |
---
## Next Steps
Continue to [**Distributed Inference Guide**](DISTRIBUTED-INFERENCE.md) to:
- Set up SSH and hostname configuration
- Configure NCCL for multi-node communication
- Deploy RDMA-enabled containers with Ray cluster
- Run distributed inference with vLLM
- Benchmark performance across configurations
---
## Credits
This playbook was contributed by **Csaba Kecskemeti** | [DevQuasar](https://devquasar.com/).
For a detailed walkthrough and additional context, see the original article:
[Distributed Inference Cluster: DGX Spark + RTX 6000 Pro](https://devquasar.com/ai/edge-ai/distributed-inference-cluster-dgx-spark-rtx-6000-pro/)


@ -0,0 +1,96 @@
#!/usr/bin/env python3
"""
NCCL Communication Test Script

Tests NCCL (NVIDIA Collective Communications Library) communication over RDMA
between two nodes in a distributed setup.

Usage:
    On Node 0 (head):   python test_nccl.py --rank 0
    On Node 1 (worker): python test_nccl.py --rank 1

Requirements:
    - PyTorch with CUDA support
    - NCCL backend available
    - RDMA network configured between nodes
"""
import argparse
import os
import sys

import torch
import torch.distributed as dist


def test_nccl_communication():
    parser = argparse.ArgumentParser(description='Test NCCL communication over RDMA')
    parser.add_argument('--rank', type=int, required=True,
                        help='Rank of this process (0 for head, 1 for worker)')
    parser.add_argument('--world_size', type=int, default=2,
                        help='Total number of processes')
    parser.add_argument('--master_addr', type=str, default='192.168.200.1',
                        help='IP address of the head node')
    parser.add_argument('--master_port', type=str, default='29500',
                        help='Port for distributed communication')
    parser.add_argument('--interface', type=str, default='enp1s0f0np0',
                        help='Network interface for NCCL socket')
    args = parser.parse_args()

    # Set environment variables for distributed communication
    os.environ['RANK'] = str(args.rank)
    os.environ['WORLD_SIZE'] = str(args.world_size)
    os.environ['MASTER_ADDR'] = args.master_addr
    os.environ['MASTER_PORT'] = args.master_port
    os.environ['NCCL_SOCKET_IFNAME'] = args.interface

    print("=" * 60)
    print("NCCL Communication Test")
    print("=" * 60)
    print(f"Rank: {args.rank}")
    print(f"World Size: {args.world_size}")
    print(f"Master: {args.master_addr}:{args.master_port}")
    print(f"Interface: {args.interface}")
    print("=" * 60)

    print(f"\n[Rank {args.rank}] Initializing process group...")
    # Initialize the process group with NCCL backend
    dist.init_process_group(
        backend='nccl',
        rank=args.rank,
        world_size=args.world_size
    )
    print(f"[Rank {args.rank}] Process group initialized successfully!")
    print(f"[Rank {args.rank}] Distributed rank: {dist.get_rank()}/{dist.get_world_size()}")

    # Create a tensor on GPU; each rank contributes (rank + 1)
    device = torch.device('cuda:0')
    tensor = torch.ones(10, device=device) * (args.rank + 1)
    print(f"\n[Rank {args.rank}] Before all_reduce: {tensor.tolist()}")

    # Perform all-reduce operation (sum across all ranks)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f"[Rank {args.rank}] After all_reduce: {tensor.tolist()}")

    # Calculate expected result: 1 + 2 + ... + world_size
    expected = sum(range(1, args.world_size + 1))
    expected_tensor = torch.ones(10) * expected
    print(f"[Rank {args.rank}] Expected result: {expected_tensor.tolist()}")

    # Verify result
    passed = torch.allclose(tensor.cpu(), expected_tensor)
    if passed:
        print(f"\n[Rank {args.rank}] ✓ All-reduce test PASSED!")
    else:
        print(f"\n[Rank {args.rank}] ✗ All-reduce test FAILED!")

    # Cleanup
    dist.destroy_process_group()
    print(f"[Rank {args.rank}] Test completed ({'PASSED' if passed else 'FAILED'}).")
    print("=" * 60)
    return passed


if __name__ == "__main__":
    # Exit non-zero on failure so wrapper scripts can detect it
    sys.exit(0 if test_nccl_communication() else 1)


@ -1,6 +1,6 @@
# Run models with llama.cpp on DGX Spark
> Build llama.cpp with CUDA and serve models via an OpenAI-compatible API (Gemma 4 31B IT as example)
> Build llama.cpp with CUDA and serve models via an OpenAI-compatible API (Nemotron 3 Nano Omni as example)
## Table of Contents
@ -17,15 +17,15 @@
[llama.cpp](https://github.com/ggml-org/llama.cpp) is a lightweight C/C++ inference stack for large language models. You build it with CUDA so tensor work runs on the DGX Spark GB10 GPU, then load GGUF weights and expose chat through `llama-server`'s OpenAI-compatible HTTP API.
This playbook walks through that stack end to end. As the model example, it uses **Gemma 4 31B IT** - a frontier reasoning model built by Google DeepMind that llama.cpp supports, with strengths in coding, agentic workflows, and fine-tuning. The instructions download its **F16** GGUF from Hugging Face. The same build and server steps apply to other GGUFs (including other sizes in the support matrix below).
This playbook walks through that stack end to end using **Nemotron 3 Nano Omni** as the hands-on example: an NVIDIA MoE family that runs well from quantized GGUF on Spark. Checkpoint choices and paths for all supported models are summarized in the matrix below; commands are in the instructions.
## What you'll accomplish
You will build llama.cpp with CUDA for GB10, download a Gemma 4 31B IT model checkpoint, and run **`llama-server`** with GPU offload. You get:
You will build llama.cpp with CUDA for GB10, download a **Nemotron 3 Nano Omni** example checkpoint, and run **`llama-server`** with GPU offload. You get:
- Local inference through llama.cpp (no separate Python inference framework required)
- An OpenAI-compatible `/v1/chat/completions` endpoint for tools and apps
- A concrete validation that **Gemma 4 31B IT** runs on this stack on DGX Spark
- A concrete validation that the **Nemotron 3 Nano Omni** example runs on this stack on DGX Spark
## What to know before starting
@ -39,8 +39,8 @@ You will build llama.cpp with CUDA for GB10, download a Gemma 4 31B IT model che
**Hardware requirements**
- NVIDIA DGX Spark with GB10 GPU
- Sufficient unified memory for the F16 checkpoint (on the order of **~62GB** for weights alone; more when KV cache and runtime overhead are included)
- At least **~70GB** free disk for the F16 download plus build artifacts (use a smaller quant from the same repo if you need less disk and VRAM)
- Sufficient unified memory for the example **Q8_0** checkpoint (weights on the order of **~35GB**, plus KV cache and runtime overhead—scale up if you pick a larger quant or longer context)
- At least **~40GB** free disk for the example download plus build artifacts (more if you keep multiple GGUFs)
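As a rough cross-check on figures like these, weight size can be estimated as parameter count times bits per weight. The sketch below is an approximation under stated assumptions: about 31 billion parameters for Gemma 4 31B (taken from the model name), 16 bits per weight for F16, and about 8.5 bits per weight for Q8_0 (8-bit values plus a per-block scale). Real GGUF files also carry metadata and may mix tensor types, so actual sizes differ somewhat.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes), ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed parameter count of ~31e9 for the Gemma 4 31B example
print(f"F16:  ~{weight_size_gb(31e9, 16):.0f} GB")
print(f"Q8_0: ~{weight_size_gb(31e9, 8.5):.0f} GB")
```

The F16 estimate lands on the ~62GB quoted for the Gemma checkpoint. KV cache and runtime buffers come on top of the weights, so leave headroom in unified memory.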
**Software requirements**
@ -50,12 +50,15 @@ You will build llama.cpp with CUDA for GB10, download a Gemma 4 31B IT model che
- CUDA Toolkit: `nvcc --version`
- Network access to GitHub and Hugging Face
## Model Support Matrix
## Model support matrix
The following models are supported with llama.cpp on Spark. All listed models are available and ready to use:
The following models are supported with llama.cpp on Spark. The instructions use the **Nemotron 3 Nano Omni** example row by default.
| Model | Support Status | HF Handle |
|-------|----------------|-----------|
| **Nemotron 3 Nano Omni** (example walkthrough) | ✅ | `ggml-org/NVIDIA-Nemotron-3-Nano-Omni` |
| **Qwen3.6-35B-A3B** | ✅ | `unsloth/Qwen3.6-35B-A3B-GGUF` |
| **Qwen3.6-27B** | ✅ | `unsloth/Qwen3.6-27B-GGUF` |
| **Gemma 4 31B IT** | ✅ | `ggml-org/gemma-4-31B-it-GGUF` |
| **Gemma 4 26B A4B IT** | ✅ | `ggml-org/gemma-4-26B-A4B-it-GGUF` |
| **Gemma 4 E4B IT** | ✅ | `ggml-org/gemma-4-E4B-it-GGUF` |
@ -64,17 +67,17 @@ The following models are supported with llama.cpp on Spark. All listed models ar
## Time & risk
* **Estimated time:** About 30 minutes, plus downloading the ~62GB example
* **Estimated time:** About 30 minutes, plus downloading the example GGUF (on the order of ~35GB for the default quant)
* **Risk level:** Low — build is local to your clone; no system-wide installs required for the steps below
* **Rollback:** Remove the `llama.cpp` clone and the model directory under `~/models/` to reclaim disk space
* **Last updated:** 04/02/2026
* First Publication
* **Last updated:** 04/28/2026
* Walkthrough now uses Nemotron Omni; other model rows stay available
## Instructions
## Step 1. Verify prerequisites
This walkthrough uses **Gemma 4 31B IT** (`gemma-4-31B-it-f16.gguf`) as the example checkpoint. You can substitute another GGUF from [`ggml-org/gemma-4-31B-it-GGUF`](https://huggingface.co/ggml-org/gemma-4-31B-it-GGUF) (for example `Q4_K_M` or `Q8_0`) by changing the `hf download` filename and `--model` path in later steps.
The **example** checkpoint is **`nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf`** from Hugging Face repo **`ggml-org/NVIDIA-Nemotron-3-Nano-Omni`** (full handle: `ggml-org/NVIDIA-Nemotron-3-Nano-Omni/nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf`). Other supported GGUFs—including Qwen3.6, Gemma, and alternate Nemotron Omni builds—use the same build and server steps; change `hf download` and `--model` paths (see the [overview model matrix](overview.md)).
Ensure the required tools are installed:
@ -121,25 +124,25 @@ make -j8
The build usually takes on the order of 5-10 minutes. When it finishes, binaries such as `llama-server` appear under `build/bin/`.
## Step 4. Download Gemma 4 31B IT GGUF (supported model example)
## Step 4. Download example Nemotron 3 Nano Omni GGUF
llama.cpp loads models in **GGUF** format. **gemma-4-31B-it** is available in GGUF from Hugging Face; this playbook uses a F16 variant that balances quality and memory on GB10-class hardware.
llama.cpp loads models in **GGUF** format. This playbook uses the **Q8_0** checkpoint from `ggml-org/NVIDIA-Nemotron-3-Nano-Omni`, which balances quality and memory on DGX Spark GB10 unified memory.
```bash
hf download ggml-org/gemma-4-31B-it-GGUF \
gemma-4-31B-it-f16.gguf \
--local-dir ~/models/gemma-4-31B-it-GGUF
hf download ggml-org/NVIDIA-Nemotron-3-Nano-Omni \
nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf \
--local-dir ~/models/NVIDIA-Nemotron-3-Nano-Omni
```
The F16 file is large (**~62GB**). The download can be resumed if interrupted.
The file is on the order of **~35GB** (exact size may vary). The download can be resumed if interrupted.
## Step 5. Start llama-server with Gemma 4 31B IT
## Step 5. Start llama-server with Nemotron 3 Nano Omni
From your `llama.cpp/build` directory, launch the OpenAI-compatible server with GPU offload:
```bash
./bin/llama-server \
--model ~/models/gemma-4-31B-it-GGUF/gemma-4-31B-it-f16.gguf \
--model ~/models/NVIDIA-Nemotron-3-Nano-Omni/nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf \
--host 0.0.0.0 \
--port 30000 \
--n-gpu-layers 99 \
@ -162,7 +165,7 @@ llama_new_context_with_model: n_ctx = 8192
main: server is listening on 0.0.0.0:30000
```
**Keep this terminal open** while testing. Large GGUFs can take several minutes to load; until you see `server is listening`, nothing accepts connections on port 30000 (see Troubleshooting if `curl` reports connection refused).
**Keep this terminal open** while testing. Large GGUFs can take a minute or more to load; until you see `server is listening`, nothing accepts connections on port 30000 (see Troubleshooting if `curl` reports connection refused).
## Step 6. Test the API
@ -172,7 +175,7 @@ Use a **second terminal on the same machine** that runs `llama-server` (for exam
curl -X POST http://127.0.0.1:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4",
"model": "nemotron",
"messages": [{"role": "user", "content": "New York is a great city because..."}],
"max_tokens": 100
}'
@ -195,7 +198,7 @@ Example shape of the response (fields vary by llama.cpp version; `message` may i
}
],
"created": 1765916539,
"model": "gemma-4-31B-it-f16.gguf",
"model": "nemotron-3-nano-omni-ga_v1.0-Q8_0.gguf",
"object": "chat.completion",
"usage": {
"completion_tokens": 100,
@ -209,15 +212,15 @@ Example shape of the response (fields vary by llama.cpp version; `message` may i
}
```
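The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server from Step 5 is listening on `127.0.0.1:30000` and returns the OpenAI-style `choices[0].message.content` shape shown above; the helper names are illustrative and not part of llama.cpp.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "nemotron", max_tokens: int = 100) -> dict:
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat(prompt: str, base_url: str = "http://127.0.0.1:30000", timeout: float = 300.0) -> str:
    """POST the prompt to /v1/chat/completions and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `print(chat("New York is a great city because..."))` should print the generated text. The generous timeout is deliberate, since the first request after a model load can be slow.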
## Step 7. Longer completion (with example model)
## Step 7. Longer completion (with Nemotron 3 Nano Omni)
Try a slightly longer prompt to confirm stable generation with **Gemma 4 31B IT**:
Try a slightly longer prompt to confirm stable generation with **Nemotron 3 Nano Omni**:
```bash
curl -X POST http://127.0.0.1:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4",
"model": "nemotron",
"messages": [{"role": "user", "content": "Solve this step by step: If a train travels 120 miles in 2 hours, what is its average speed?"}],
"max_tokens": 500
}'
@ -231,7 +234,7 @@ To remove this tutorial's artifacts:
```bash
rm -rf ~/llama.cpp
rm -rf ~/models/gemma-4-31B-it-GGUF
rm -rf ~/models/NVIDIA-Nemotron-3-Nano-Omni
```
Deactivate the Python venv if you no longer need `hf`:


@ -27,7 +27,7 @@ This playbook shows you how to deploy LM Studio on an NVIDIA DGX Spark device to
## What you'll accomplish
You'll deploy LM Studio on an NVIDIA DGX Spark device to run gpt-oss 120B, and use the model from your laptop. More specifically, you will:
You'll deploy LM Studio on an NVIDIA DGX Spark device to run **Nemotron 3 Nano Omni** (`nvidia/nemotron-3-nano-omni`), and use the model from your laptop. More specifically, you will:
- Install **llmster**, a fully headless, terminal-native build of LM Studio, on the Spark
- Run LLM inference locally on DGX Spark via API
@ -54,6 +54,15 @@ You'll deploy LM Studio on an NVIDIA DGX Spark device to run gpt-oss 120B, and u
- Laptop and DGX Spark must be on the same local network
- Network access to download packages and models
## Model support matrix
To explore all models supported in LM Studio, see the [LM Studio model catalog](https://lmstudio.ai/models).
| Model | Support Status | Model Path |
|-------|----------------|-----------|
| **Nemotron 3 Nano Omni** | ✅ | `nvidia/nemotron-3-nano-omni` |
| **Qwen3.6-35B-A3B** | ✅ | `qwen/qwen3.6-35b-a3b` |
| **GPT-OSS-120B** | ✅ | `openai/gpt-oss-120b` |
## LM Link (optional)
[LM Link](https://lmstudio.ai/link) lets you **use your local models remotely**. You link machines (e.g. your DGX Spark and your laptop), then load models on the Spark and use them from the laptop as if they were local.
@ -66,7 +75,7 @@ If you use LM Link, you can skip binding the server to `0.0.0.0` and using the S
## Ancillary files
All required assets can be found below. These sample scripts can be used in Step 6 of Instructions.
All required assets can be found below. These sample scripts can be used in Step 7 of Instructions.
- [run.js](https://github.com/lmstudio-ai/docs/blob/main/_assets/nvidia-spark-playbook/js/run.js) - JavaScript script for sending a test prompt to Spark
- [run.py](https://github.com/lmstudio-ai/docs/blob/main/_assets/nvidia-spark-playbook/py/run.py) - Python script for sending a test prompt to Spark
@ -80,8 +89,8 @@ All required assets can be found below. These sample scripts can be used in Step
* **Rollback:**
* Downloaded models can be removed manually from the models directory.
* Uninstall LM Studio or llmster
* **Last Updated:** 03/12/2026
* Add instructions for LM Link features
* **Last Updated:** 04/28/2026
* Introduce Nemotron Omni as example
## Instructions
@ -138,22 +147,22 @@ where `<SPARK_IP>` is your device's IP address. You can find your Spark's IP a
hostname -I
```
## Step 3b. (Optional) Connect with LM Link
## Step 4. (Optional) Connect with LM Link
**LM Link** lets you use your Spark's models from your laptop (or other devices) as if they were local, over an end-to-end encrypted connection. You don't need to be on the same local network or bind the server to `0.0.0.0`.
1. **Create a Link** — Go to [lmstudio.ai/link](https://lmstudio.ai/link) and follow **Create your Link** to set up your private LM Link network.
2. **Link both devices** — On your DGX Spark (llmster) and on your laptop, sign in and join the same Link. LM Link uses Tailscale mesh VPNs; devices communicate without opening ports to the internet.
3. **Use remote models** — On your laptop, open LM Studio (or use the local server). Remote models from your Spark appear in the model loader. Any tool that connects to `localhost:1234` — including the LM Studio SDK, Codex, Claude Code, OpenCode, and the scripts in Step 6 — can use those models without changing the endpoint.
3. **Use remote models** — On your laptop, open LM Studio (or use the local server). Remote models from your Spark appear in the model loader. Any tool that connects to `localhost:1234` — including the LM Studio SDK, Codex, Claude Code, OpenCode, and the scripts in Step 7 — can use those models without changing the endpoint.
LM Link is in **Preview** and is free for up to 2 users, 5 devices each. For details and limits, see [LM Link](https://lmstudio.ai/link).
## Step 4. Download a model to your Spark
## Step 5. Download a model to your Spark
As an example, let's download and run gpt-oss 120B, one of the best open source models from OpenAI. This model is too large for many laptops due to memory limitations, which makes this a fantastic use case for the Spark.
As an example, download **NVIDIA Nemotron 3 Nano Omni** from the LM Studio catalog (`nvidia/nemotron-3-nano-omni`) so you can run it on Spark with plenty of unified memory.
```bash
lms get openai/gpt-oss-120b
lms get nvidia/nemotron-3-nano-omni
```
This download will take a while due to its large size. Verify that the model has been successfully downloaded by listing your models:
@ -162,15 +171,15 @@ This download will take a while due to its large size. Verify that the model has
lms ls
```
## Step 5. Load the model
## Step 6. Load the model
Load the model on your Spark so that it is ready to respond to requests from your laptop.
```bash
lms load openai/gpt-oss-120b
lms load nvidia/nemotron-3-nano-omni
```
## Step 6. Set up a simple program that uses LM Studio SDK on the laptop
## Step 7. Set up a simple program that uses LM Studio SDK on the laptop
Install the LM Studio SDKs and use a simple script to send a prompt to your Spark and validate the response. To get started quickly, we provide simple scripts below for Python, JavaScript, and Bash. Download the scripts from the Overview page of this playbook and run the corresponding command from the directory containing it.
@ -202,12 +211,12 @@ Pre-reqs: User has installed `jq` and `curl`
bash run.sh
```
## Step 7. Next Steps
## Step 8. Next Steps
- Try downloading and serving different models from the [LM Studio model catalog](https://lmstudio.ai/models).
- Use [LM Link](https://lmstudio.ai/link) to connect more devices and use your Spark's models from anywhere with end-to-end encryption.
## Step 8. Cleanup and rollback
## Step 9. Cleanup and rollback
Remove and uninstall LM Studio completely if needed. Note that LM Studio stores models separately from the application. Uninstalling LM Studio will not remove downloaded models unless you explicitly delete them.
If you want to remove the entire LM Studio application, quit LM Studio from the tray first, then move the application to trash.

View File

@ -1,4 +1,4 @@
# NemoClaw with Nemotron-3-Super and Telegram on DGX Spark
# NemoClaw with Nemotron 3 Super and Telegram on DGX Spark
> Install NemoClaw on DGX Spark with local Ollama inference and Telegram bot integration
@ -25,8 +25,8 @@
- [Step 6. Talk to the agent (CLI)](#step-6-talk-to-the-agent-cli)
- [Step 7. Interactive TUI](#step-7-interactive-tui)
- [Step 8. Exit the sandbox and access the Web UI](#step-8-exit-the-sandbox-and-access-the-web-ui)
- [Step 9. Prepare credentials](#step-9-prepare-credentials)
- [Step 10. Configure and start the Telegram bridge](#step-10-configure-and-start-the-telegram-bridge)
- [Step 9. Create a Telegram bot](#step-9-create-a-telegram-bot)
- [Step 10. Install cloudflared and start the Telegram bridge](#step-10-install-cloudflared-and-start-the-telegram-bridge)
- [Step 11. Stop services](#step-11-stop-services)
- [Step 12. Uninstall NemoClaw](#step-12-uninstall-nemoclaw)
- [Troubleshooting](#troubleshooting)
@ -97,8 +97,7 @@ By participating in this demo, you acknowledge that you are solely responsible f
**Hardware and access:**
- A DGX Spark (GB10) with keyboard and monitor, or SSH access
- An **NVIDIA API key** from [build.nvidia.com](https://build.nvidia.com/settings/api-keys) (needed for the Telegram bridge)
- A **Telegram bot token** from [@BotFather](https://t.me/BotFather) (create one with `/newbot`)
- A **Telegram bot token** from [@BotFather](https://t.me/BotFather) (create one with `/newbot`) -- only needed if you want the Telegram bot. Have it ready *before* running the installer; the onboard wizard prompts for it.
**Software:**
@ -118,8 +117,7 @@ Expected: Ubuntu 24.04, NVIDIA GB10 GPU, Docker 28.x+.
| Item | Where to get it |
|------|----------------|
| NVIDIA API key | [build.nvidia.com/settings/api-keys](https://build.nvidia.com/settings/api-keys) |
| Telegram bot token | [@BotFather](https://t.me/BotFather) on Telegram -- create with `/newbot` |
| Telegram bot token (optional) | [@BotFather](https://t.me/BotFather) on Telegram -- create with `/newbot`. Required only for the Telegram bot; have it ready before running the installer. |
### Ancillary files
@ -129,8 +127,8 @@ All required assets are handled by the NemoClaw installer. No manual cloning is
- **Estimated time:** 20--30 minutes (with Ollama and model already downloaded). First-time model download adds ~15--30 minutes depending on network speed.
- **Risk level:** Medium -- you are running an AI agent in a sandbox; risks are reduced by isolation but not eliminated. Use a clean environment and do not connect sensitive data or production accounts.
- **Last Updated:** 03/31/2026
* First Publication
- **Last Updated:** 04/28/2026
* Updated for NemoClaw v0.0.22+: revised Telegram setup, renamed tunnel commands, refreshed uninstall instructions.
## Instructions
@ -192,14 +190,6 @@ Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
```
Verify it is running:
```bash
curl http://localhost:11434
```
Expected: `Ollama is running`. If not, start it: `ollama serve &`
Configure Ollama to listen on all interfaces so the sandbox container can reach it:
```bash
@ -209,6 +199,17 @@ sudo systemctl daemon-reload
sudo systemctl restart ollama
```
Verify it is running and reachable on all interfaces:
```bash
curl http://0.0.0.0:11434
```
Expected: `Ollama is running`. If not, start it with `sudo systemctl start ollama`.
> [!IMPORTANT]
> Always start Ollama via systemd (`sudo systemctl restart ollama`) — do not use `ollama serve &`. A manually started Ollama process does not pick up the `OLLAMA_HOST=0.0.0.0` setting above, and the NemoClaw sandbox will not be able to reach the inference server.
### Step 3. Pull the Nemotron 3 Super model
Download Nemotron 3 Super 120B (~87 GB; may take 15--30 minutes depending on network speed):
@ -237,18 +238,22 @@ You should see `nemotron-3-super:120b` in the output.
### Step 4. Install NemoClaw
This single command handles everything: installs Node.js (if needed), installs OpenShell, clones NemoClaw at the pinned stable release (`v0.0.1`), builds the CLI, and runs the onboard wizard to create a sandbox.
This single command handles everything: installs Node.js (if needed), installs OpenShell, clones the latest stable NemoClaw release, builds the CLI, and runs the onboard wizard to create a sandbox.
```bash
curl -fsSL https://www.nvidia.com/nemoclaw.sh | NEMOCLAW_INSTALL_TAG=v0.0.4 bash
curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
```
The onboard wizard walks you through setup:
1. **Sandbox name** -- Pick a name (e.g. `my-assistant`). Names must be lowercase alphanumeric with hyphens only.
2. **Inference provider** -- Select **Local Ollama** (option 7).
3. **Model** -- Select **nemotron-3-super:120b** (option 1).
4. **Policy presets** -- Accept the suggested presets when prompted (hit **Y**).
2. **Inference provider** -- Select **Local Ollama**.
3. **Model** -- Select **nemotron-3-super:120b**.
4. **Messaging channels** -- If you want a Telegram bot, select `telegram` here and paste your bot token when prompted. Create the bot first via [@BotFather](https://t.me/BotFather) in Telegram (see Step 9). If you skip this, you can re-run the installer later to recreate the sandbox with Telegram enabled.
5. **Policy presets** -- Accept the suggested presets when prompted (hit **Y**).
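The sandbox naming rule in item 1 can be checked before launching the installer. The sketch below encodes only what the text states (lowercase letters, digits, and hyphens); whether the wizard also rejects leading or trailing hyphens or enforces a length limit is not stated here, so treat this pattern as an assumption.

```python
import re

# Assumed rule from the wizard text: lowercase alphanumerics and hyphens only
SANDBOX_NAME_RE = re.compile(r"[a-z0-9-]+")


def is_valid_sandbox_name(name: str) -> bool:
    """Return True if the name uses only lowercase letters, digits, and hyphens."""
    return SANDBOX_NAME_RE.fullmatch(name) is not None


for candidate in ("my-assistant", "My_Assistant", "spark01"):
    print(candidate, is_valid_sandbox_name(candidate))
```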
> [!IMPORTANT]
> Telegram must be configured at this step. The channel plugin and bot token are wired into the sandbox container during onboarding — they cannot be added to an existing sandbox by exporting environment variables on the host.
When complete you will see output like:
@ -294,7 +299,7 @@ Expected: JSON listing `nemotron-3-super:120b`.
Still inside the sandbox, send a test message:
```bash
openclaw agent --agent main --local -m "hello" --session-id test
openclaw agent --agent main -m "hello" --session-id test
```
The agent will respond using Nemotron 3 Super. First responses may take 30--90 seconds for a 120B parameter model running locally.
@ -323,7 +328,7 @@ exit
http://127.0.0.1:18789/#token=<long-token-here>
```
**If accessing the Web UI from a remote machine**, you need to set up port forwarding.
**If accessing the Web UI from a remote machine**, you need to set up an SSH tunnel. The NemoClaw onboard wizard already created the port 18789 forward on the Spark, so you only need to tunnel from your remote machine.
First, find your Spark's IP address. On the Spark, run:
@ -333,13 +338,7 @@ hostname -I | awk '{print $1}'
This prints the primary IP address (e.g. `192.168.1.42`). You can also find it in **Settings > Wi-Fi** or **Settings > Network** on the Spark's desktop, or check your router's connected-devices list.
Start the port forward on the Spark host:
```bash
openshell forward start 18789 my-assistant --background
```
Then from your remote machine, create an SSH tunnel to the Spark (replace `<your-spark-ip>` with the IP address from above):
From your remote machine, create an SSH tunnel to the Spark (replace `<your-spark-ip>` with the IP address from above):
```bash
ssh -L 18789:127.0.0.1:18789 <your-user>@<your-spark-ip>
@ -354,64 +353,70 @@ http://127.0.0.1:18789/#token=<long-token-here>
> [!IMPORTANT]
> Use `127.0.0.1`, not `localhost` -- the gateway origin check requires an exact match.
> [!NOTE]
> If the Web UI fails to load, the port forward may be stale. Reset it on the Spark host:
> ```bash
> openshell forward stop 18789 my-assistant || true
> openshell forward start 18789 my-assistant --background
> ```
---
## Phase 3: Telegram Bot
### Step 9. Prepare credentials
> [!IMPORTANT]
> Telegram must be enabled in the **NemoClaw onboard wizard** (Step 4 → Messaging channels). The channel plugin and bot token are wired into the sandbox container at sandbox creation time — `policy-add` only opens network egress and is not enough on its own. If you skipped Telegram during onboard, re-run the installer to recreate the sandbox with Telegram enabled.
You need two items:
### Step 9. Create a Telegram bot
| Item | Where to get it |
|------|----------------|
| Telegram bot token | Open Telegram, find [@BotFather](https://t.me/BotFather), send `/newbot`, and follow the prompts. Copy the token it gives you. |
| NVIDIA API key | Go to [build.nvidia.com/settings/api-keys](https://build.nvidia.com/settings/api-keys) and create or copy a key (starts with `nvapi-`). |
Do this **before** running the NemoClaw installer in Step 4 so you have your bot token ready when the wizard prompts for it.
### Step 10. Configure and start the Telegram bridge
Open Telegram, find [@BotFather](https://t.me/BotFather), send `/newbot`, and follow the prompts. Copy the bot token it gives you and paste it into the wizard when you reach the **Messaging channels** step.
### Step 10. Install cloudflared and start the Telegram bridge
The Telegram bridge needs a public webhook URL so Telegram can deliver messages to your bot. NemoClaw uses [cloudflared](https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/) to create a free `trycloudflare.com` tunnel.
Make sure you are on the **host** (not inside the sandbox). If you are inside the sandbox, run `exit` first.
Set the required environment variables. Replace the placeholders with your actual values. `SANDBOX_NAME` must match the sandbox name you chose during the onboard wizard:
Install cloudflared (DGX Spark is arm64):
```bash
export TELEGRAM_BOT_TOKEN=<your-bot-token>
export SANDBOX_NAME=my-assistant
curl -L --output cloudflared.deb \
https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-arm64.deb
sudo dpkg -i cloudflared.deb
```
Add the Telegram network policy to the sandbox:
Start the tunnel:
```bash
nemoclaw my-assistant policy-add
nemoclaw tunnel start
```
When prompted, type `telegram` and hit **Y** to confirm.
Start the Telegram bridge. On first run it will ask for your NVIDIA API key:
Verify the public URL is live:
```bash
nemoclaw start
nemoclaw status
```
Paste your `nvapi-` key when prompted.
You should see:
```text
[services] telegram-bridge started
Telegram: bridge running
```
You should see `● cloudflared` with a `trycloudflare.com` public URL (e.g. `https://assembled-peer-persian-kitty.trycloudflare.com`).
Open Telegram, find your bot, and send it a message. The bot forwards it to the agent and replies.
> [!NOTE]
> The first response may include a debug log line like "gateway Running as non-root..." -- this is cosmetic and can be ignored.
> If `nemoclaw tunnel start` prints `cloudflared not found — no public URL`, the cloudflared install above did not complete successfully. Re-run the install, then restart the tunnel:
> ```bash
> nemoclaw tunnel stop && nemoclaw tunnel start
> ```
> [!NOTE]
> If you need to restart the bridge, `nemoclaw stop` may not cleanly stop the process. If that happens, find and kill the bridge process via its PID file:
> ```bash
> kill -9 "$(cat /tmp/nemoclaw-services-${SANDBOX_NAME}/telegram-bridge.pid)"
> ```
> Then run `nemoclaw start` again.
> The first response may take 30--90 seconds for a 120B-parameter model running locally.
> [!NOTE]
> If sending a message returns `Error: Channel is unavailable: telegram`, the channel was not enabled during onboard. Re-run the installer to recreate the sandbox with Telegram selected at the **Messaging channels** step.
> [!NOTE]
> For details on restricting which Telegram chats can interact with the agent, see the [NemoClaw Telegram bridge documentation](https://docs.nvidia.com/nemoclaw/latest/deployment/set-up-telegram-bridge.html).
---
@ -419,10 +424,10 @@ Open Telegram, find your bot, and send it a message. The bot forwards it to the
### Step 11. Stop services
Stop any running auxiliary services (Telegram bridge, cloudflared):
Stop the cloudflared tunnel:
```bash
nemoclaw stop
nemoclaw tunnel stop
```
Stop the port forward:
@ -434,14 +439,13 @@ openshell forward stop 18789 # stop the dashboard forward
### Step 12. Uninstall NemoClaw
Run the uninstaller from the cloned source directory. It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and Ollama are preserved.
Run the uninstaller via curl (matches the [NemoClaw README](https://github.com/NVIDIA/NemoClaw)). It removes all sandboxes, the OpenShell gateway, Docker containers/images/volumes, the CLI, and all state files. Docker, Node.js, npm, and Ollama are preserved.
```bash
cd ~/.nemoclaw/source
./uninstall.sh
curl -fsSL https://raw.githubusercontent.com/NVIDIA/NemoClaw/refs/heads/main/uninstall.sh | bash
```
**Uninstaller flags:**
**Uninstaller flags** (pass via `bash -s -- <flags>`):
| Flag | Effect |
|------|--------|
@ -449,10 +453,10 @@ cd ~/.nemoclaw/source
| `--keep-openshell` | Leave the `openshell` binary in place |
| `--delete-models` | Also remove the Ollama models pulled by NemoClaw |
To remove everything including the Ollama model:
To remove everything including the Ollama model, non-interactively:
```bash
./uninstall.sh --yes --delete-models
curl -fsSL https://raw.githubusercontent.com/NVIDIA/NemoClaw/refs/heads/main/uninstall.sh | bash -s -- --yes --delete-models
```
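The `bash -s -- <flags>` form works because `-s` tells bash to read the script from standard input, and everything after `--` becomes the script's positional parameters. The sketch below demonstrates the mechanism with a one-line stand-in script, not the real uninstaller; `demo_stdin_flags` is a hypothetical name:

```bash
# Demonstration only: pipe a tiny stand-in script into bash and pass flags to it.
demo_stdin_flags() {
  printf 'echo "received: $*"\n' | bash -s -- "$@"
}

demo_stdin_flags --yes --delete-models
# prints: received: --yes --delete-models
```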
The uninstaller runs 6 steps:
@ -464,7 +468,7 @@ The uninstaller runs 6 steps:
6. Remove state directories (`~/.nemoclaw`, `~/.config/openshell`, `~/.config/nemoclaw`) and the OpenShell binary
> [!NOTE]
> The source clone at `~/.nemoclaw/source` is removed as part of state cleanup in step 6. If you want to keep a local copy, move or back it up before running the uninstaller.
> If you have a local clone at `~/.nemoclaw/source` you want to keep, move or back it up before running the uninstaller — it is removed as part of state cleanup in step 6.
## Useful commands
@ -474,13 +478,13 @@ The uninstaller runs 6 steps:
| `nemoclaw my-assistant status` | Show sandbox status and inference config |
| `nemoclaw my-assistant logs --follow` | Stream sandbox logs in real time |
| `nemoclaw list` | List all registered sandboxes |
| `nemoclaw start` | Start auxiliary services (Telegram bridge) |
| `nemoclaw stop` | Stop auxiliary services |
| `nemoclaw tunnel start` | Start cloudflared tunnel (public URL for Telegram webhooks) |
| `nemoclaw tunnel stop` | Stop the cloudflared tunnel |
| `openshell term` | Open the monitoring TUI on the host |
| `openshell forward list` | List active port forwards |
| `openshell forward start 18789 my-assistant --background` | Restart port forwarding for Web UI |
| `cd ~/.nemoclaw/source && ./uninstall.sh` | Remove NemoClaw (preserves Docker, Node.js, Ollama) |
| `cd ~/.nemoclaw/source && ./uninstall.sh --delete-models` | Remove NemoClaw and Ollama models |
| `curl -fsSL https://raw.githubusercontent.com/NVIDIA/NemoClaw/refs/heads/main/uninstall.sh \| bash` | Remove NemoClaw (preserves Docker, Node.js, Ollama) |
| `curl -fsSL https://raw.githubusercontent.com/NVIDIA/NemoClaw/refs/heads/main/uninstall.sh \| bash -s -- --delete-models` | Remove NemoClaw and Ollama models |
## Troubleshooting

View File

@ -214,34 +214,22 @@ Verify Ollama is running (it auto-starts as a service after installation). If no
ollama serve &
```
Configure Ollama to listen on all interfaces so the OpenShell gateway container can reach it. Create a systemd override:
```bash
mkdir -p /etc/systemd/system/ollama.service.d/
sudo nano /etc/systemd/system/ollama.service.d/override.conf
```
Add these lines to the file (create the file if it does not exist):
```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
```
Save and exit, then reload and restart Ollama:
Configure Ollama to listen on all interfaces so the OpenShell gateway container can reach it:
```bash
sudo mkdir -p /etc/systemd/system/ollama.service.d
printf '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0"\n' | sudo tee /etc/systemd/system/ollama.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
Verify Ollama is listening on all interfaces:
Verify Ollama is running and reachable on all interfaces:
```bash
ss -tlnp | grep 11434
curl http://0.0.0.0:11434
```
You should see `*:11434` in the output. If it only shows `127.0.0.1:11434`, confirm the override file contents and that you ran `systemctl daemon-reload` before restarting.
Expected: `Ollama is running`. If not, start it with `sudo systemctl start ollama`.
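On Linux, a request to `0.0.0.0` is generally answered over loopback, so a successful `curl` alone may not prove that the listener is bound beyond `127.0.0.1`. As a complementary sketch, the listening address can be classified from `ss -tln` output. The helper name `ollama_bind_scope` is hypothetical, and it assumes the default Ollama port 11434:

```bash
# Hypothetical helper: read `ss -tln` output on stdin and report how
# port 11434 is bound: "all", "loopback", "other", or "none".
ollama_bind_scope() {
  awk '
    $4 ~ /:11434$/ {
      if ($4 ~ /^(\*|0\.0\.0\.0|\[::\]):11434$/) all = 1          # wildcard bind
      else if ($4 ~ /^(127\.0\.0\.1|\[::1\]):11434$/) loop = 1     # loopback only
      else other = 1                                               # a specific address
    }
    END {
      if (all) print "all"
      else if (other) print "other"
      else if (loop) print "loopback"
      else print "none"
    }
  '
}

# Usage:
#   ss -tln | ollama_bind_scope
```

A result of `loopback` suggests the systemd override was not picked up; re-check the override file and re-run `sudo systemctl daemon-reload` before restarting Ollama.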
Next, run a model from Ollama (adjust the model name to match your choice from [the Ollama model library](https://ollama.com/library)). The `ollama run` command will pull the model automatically if it is not already present. Running the model here ensures it is loaded and ready when you use it with OpenClaw, reducing the chance of timeouts later. Example for nemotron-3-super:

View File

@ -53,6 +53,7 @@ The following models are supported with SGLang on Spark. All listed models are a
| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | BF16 | ✅ | [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16) |
| **GPT-OSS-20B** | MXFP4 | ✅ | `openai/gpt-oss-20b` |
| **GPT-OSS-120B** | MXFP4 | ✅ | `openai/gpt-oss-120b` |
| **Llama-3.1-8B-Instruct** | FP8 | ✅ | `nvidia/Llama-3.1-8B-Instruct-FP8` |
@ -75,12 +76,19 @@ Note: for NVFP4 models, add the `--quantization modelopt_fp4` flag.
* **Estimated time:** 30 minutes for initial setup and validation
* **Risk level:** Low - Uses pre-built, validated SGLang container with minimal configuration
* **Rollback:** Stop and remove containers with `docker stop` and `docker rm` commands
* **Last Updated:** 03/15/2026
* Use latest NGC SGLang container: nvcr.io/nvidia/sglang:26.02-py3
* **Last Updated:** 04/28/2026
* Introduce Nemotron-3-Nano-Omni reasoning FP8 support
## Instructions
## Step 1. Verify system prerequisites
## Step 1. Use model-specific deployment guide
Certain models require special deployment configurations. Please refer to their respective model cards to run on DGX Spark:
| Model | Quantization | HF Model Card Link |
|-------|-------------|----------------|
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | BF16 | https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 |
## Step 2. Verify system prerequisites
Check that your NVIDIA Spark device meets all requirements before proceeding. This step runs on
your host system and ensures Docker, GPU drivers, and container toolkit are properly configured.
@ -108,7 +116,7 @@ sudo usermod -aG docker $USER
newgrp docker
```
## Step 2. Pull the SGLang Container
## Step 3. Pull the SGLang Container
Download the latest SGLang container. This step runs on the host and may take
several minutes depending on your network connection.
@ -122,7 +130,7 @@ docker pull nvcr.io/nvidia/sglang:26.02-py3
docker images | grep sglang
```
## Step 3. Launch SGLang container for server mode
## Step 4. Launch SGLang container for server mode
Start the SGLang container in server mode to enable HTTP API access. This runs the inference
server inside the container, exposing it on port 30000 for client connections.
@ -136,7 +144,7 @@ docker run --gpus all -it --rm \
bash
```
## Step 4. Start the SGLang inference server
## Step 5. Start the SGLang inference server
Inside the container, launch the HTTP inference server with a supported model. This step runs
inside the Docker container and starts the SGLang server daemon.
@ -159,7 +167,7 @@ sleep 30
curl http://localhost:30000/health
```
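Model load time varies, so a fixed `sleep 30` can be too short or needlessly long. A minimal sketch of polling the health endpoint instead is shown here; `wait_until` is a hypothetical helper, not part of SGLang:

```bash
# Hypothetical helper: retry a command once per second until it succeeds
# or the timeout (in seconds) elapses. Returns non-zero on timeout.
wait_until() {
  local timeout="$1" elapsed=0
  shift
  until "$@" >/dev/null 2>&1; do
    elapsed=$((elapsed + 1))
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 1
  done
}

# Usage:
#   wait_until 120 curl -fsS http://localhost:30000/health && echo "server ready"
```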
## Step 5. Test client-server inference
## Step 6. Test client-server inference
From a new terminal on your host system, test the SGLang server API to ensure it's working
correctly. This validates that the server is accepting requests and generating responses.
@ -177,7 +185,7 @@ curl -X POST http://localhost:30000/generate \
}'
```
## Step 6. Test Python client API
## Step 7. Test Python client API
Create a simple Python script to test programmatic access to the SGLang server. This runs on
the host system and demonstrates how to integrate SGLang into applications.
@ -197,7 +205,7 @@ response = requests.post('http://localhost:30000/generate', json={
print(f"Response: {response.json()['text']}")
```
## Step 7. Validate installation
## Step 8. Validate installation
Confirm that both server and offline modes are working correctly. This step verifies the
complete SGLang setup and ensures reliable operation.
@ -213,7 +221,7 @@ docker ps
docker logs <CONTAINER_ID>
```
## Step 8. Cleanup and rollback
## Step 9. Cleanup and rollback
Stop and remove containers to clean up resources. This step returns your system to its
original state.
@ -232,7 +240,7 @@ docker container prune -f
docker rmi nvcr.io/nvidia/sglang:26.02-py3
```
## Step 9. Next steps
## Step 10. Next steps
With SGLang successfully deployed, you can now:

View File

@ -57,7 +57,7 @@ In short: two Sparks let you run models that are too large for one, while specul
- Docker with GPU support enabled
```bash
docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 nvidia-smi
docker run --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12 nvidia-smi
```
- Active HuggingFace Token for model access
- Network connectivity for model downloads
@ -68,9 +68,9 @@ In short: two Sparks let you run models that are too large for one, while specul
* **Duration:** 10-20 minutes for setup, additional time for model downloads (varies by network speed)
* **Risks:** GPU memory exhaustion with large models, container registry access issues, network timeouts during downloads
* **Rollback:** Stop Docker containers and optionally clean up downloaded model cache.
* **Last Updated:** 01/02/2026
* Upgrade to latest container v1.2.0rc6
* Add EAGLE-3 Speculative Decoding example with GPT-OSS-120B
* **Last Updated:** 04/20/2026
* Upgrade to latest container 1.3.0rc12
* Add Speculative Decoding example with Qwen3-235B-A22B on Two Sparks
## Instructions
@ -111,7 +111,7 @@ docker run \
-v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
--rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
--gpus=all --ipc=host --network host \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12 \
bash -c '
hf download openai/gpt-oss-120b && \
hf download nvidia/gpt-oss-120b-Eagle3-long-context \
@ -172,7 +172,7 @@ docker run \
-e HF_TOKEN=$HF_TOKEN \
-v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
--rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
--gpus=all --ipc=host --network host nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
--gpus=all --ipc=host --network host nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12 \
bash -c "
# # Download models
hf download nvidia/Llama-3.3-70B-Instruct-FP4 && \
@ -309,7 +309,7 @@ docker run -d --rm \
-e TRITON_PTXAS_PATH="/usr/local/cuda/bin/ptxas" \
-v ~/.cache/huggingface/:/root/.cache/huggingface/ \
-v ~/.ssh:/tmp/.ssh:ro \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc12 \
bash -c "curl https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/trt-llm/assets/trtllm-mn-entrypoint.sh | bash"
```

View File

@ -57,7 +57,7 @@ inference through kernel-level optimizations, efficient memory layouts, and adva
- DGX Spark device
- NVIDIA drivers compatible with CUDA 12.x: `nvidia-smi`
- Docker installed and GPU support configured: `docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 nvidia-smi`
- Docker installed and GPU support configured: `docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc13 nvidia-smi`
- Hugging Face account with token for model access: `echo $HF_TOKEN`
- Sufficient GPU VRAM (40GB+ recommended for 70B models)
- Internet connectivity for downloading models and container images
@ -75,6 +75,9 @@ The following models are supported with TensorRT-LLM on Spark. All listed models
| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | BF16 | ✅ | `nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16` |
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | FP8 | ✅ | `nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8` |
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | NVFP4 | ✅ | `nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4` |
| **Nemotron-3-Super-120B** | NVFP4 | ✅ | `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4` |
| **GPT-OSS-20B** | MXFP4 | ✅ | `openai/gpt-oss-20b` |
| **GPT-OSS-120B** | MXFP4 | ✅ | `openai/gpt-oss-120b` |
@ -104,8 +107,8 @@ Reminder: not all model architectures are supported for NVFP4 quantization.
* **Duration**: 45-60 minutes for setup and API server deployment
* **Risk level**: Medium - container pulls and model downloads may fail due to network issues
* **Rollback**: Stop inference servers and remove downloaded models to free resources.
* **Last Updated:** 03/12/2026
* Introduce Nemotron-3-Super-120B support on TRT-LLM
* **Last Updated:** 04/28/2026
* Upgrade Docker image to 1.3.0rc13; add Nemotron-3-Nano-Omni reasoning BF16, FP8, and NVFP4 to the support matrix
## Single Spark
@ -136,7 +139,7 @@ models and containers.
nvidia-smi
## Verify Docker GPU support
docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc13 nvidia-smi
```
@ -146,7 +149,7 @@ docker run --rm --gpus all nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 nvidia-s
## Set `HF_TOKEN` for model access.
export HF_TOKEN=<your-huggingface-token>
export DOCKER_IMAGE="nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6"
export DOCKER_IMAGE="nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc13"
```
## Step 4. Validate TensorRT-LLM installation
@ -161,8 +164,8 @@ docker run --rm -it --gpus all \
Expected output:
```
[TensorRT-LLM] TensorRT-LLM version: 1.2.0rc6
TensorRT-LLM version: 1.2.0rc6
[TensorRT-LLM] TensorRT-LLM version: 1.3.0rc13
TensorRT-LLM version: 1.3.0rc13
```
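To confirm that the reported version matches the tag in `DOCKER_IMAGE`, the two strings can be compared mechanically. This is a sketch that assumes the output format shown above; both helper names are hypothetical:

```bash
# Hypothetical helper: extract the version from the validation output on stdin.
# Only the unprefixed "TensorRT-LLM version: ..." line is matched.
trtllm_version() {
  sed -n 's/^TensorRT-LLM version: //p' | head -n 1
}

# Hypothetical helper: print the tag portion of an image reference.
image_tag() {
  echo "${1##*:}"
}

# Usage: pipe the output of the validation command above into trtllm_version, then
#   [ "$(image_tag "$DOCKER_IMAGE")" = "<extracted version>" ] && echo "versions match"
```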
## Step 5. Create cache directory
@ -290,6 +293,43 @@ sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
Serve with OpenAI-compatible API via trtllm-serve:
#### Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
This example writes **`nano_v3.yaml`** for KV cache, MoE, and CUDA graph settings, then starts **`trtllm-serve`** on port **8355** with Nemotron Omni reasoning parsers.
```bash
export MODEL_HANDLE="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16"
docker run --name trtllm_llm_server --rm -it --gpus all --ipc host --network host \
-e HF_TOKEN=$HF_TOKEN \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
$DOCKER_IMAGE \
bash -c '
hf download $MODEL_HANDLE && \
cat > nano_v3.yaml <<EOF
kv_cache_config:
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.80
  mamba_ssm_cache_dtype: float32
moe_config:
  backend: CUTLASS
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1
max_batch_size: 1
EOF
PYTORCH_ALLOC_CONF=expandable_segments:True \
trtllm-serve serve "$MODEL_HANDLE" \
--host 0.0.0.0 \
--port 8355 \
--trust_remote_code \
--reasoning_parser nano-v3 \
--tool_parser qwen3_coder \
--extra_llm_api_options nano_v3.yaml
'
```
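Once the server is up, it can be exercised through the OpenAI-compatible chat route. The sketch below builds a minimal request body; the helper name `chat_payload` and the default `max_tokens` of 64 are assumptions for illustration, and the port is taken from the `--port 8355` flag in the command above:

```bash
# Hypothetical helper: build a minimal chat-completions request body.
# The prompt is inserted verbatim, so it must not contain double quotes
# or backslashes.
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":%d}\n' \
    "$1" "$2" "${3:-64}"
}

# Usage, once the server reports it is ready:
#   curl -sS http://localhost:8355/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d "$(chat_payload "$MODEL_HANDLE" "What is 12 times 17?")"
```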
#### Llama 3.1 8B Instruct
```bash
export MODEL_HANDLE="nvidia/Llama-3.1-8B-Instruct-FP4"
@ -685,6 +725,7 @@ docker rmi ghcr.io/open-webui/open-webui:main
| "invalid mount config for type 'bind'" | Missing or non-executable entrypoint script | Run `docker inspect <container_id>` to see full error message. Verify `trtllm-mn-entrypoint.sh` exists on both nodes in your home directory (`ls -la $HOME/trtllm-mn-entrypoint.sh`) and has executable permissions (`chmod +x $HOME/trtllm-mn-entrypoint.sh`) |
| "task: non-zero exit (255)" | Container exit with error code 255 | Check container logs with `docker ps -a --filter "name=trtllm-multinode_trtllm"` to get container ID, then `docker logs <container_id>` to see detailed error messages |
| Docker state stuck in "Pending" with "no suitable node (insufficien...)" | Docker daemon not properly configured for GPU access | Verify steps 2-4 were completed successfully and check that `/etc/docker/daemon.json` contains correct GPU configuration |
| Serving a model fails with `ptxas fatal` errors | Model needs runtime Triton kernel compilation | In Step 10, add `-x TRITON_PTXAS_PATH` to your `mpirun` command |
> [!NOTE]
> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.

View File

@ -54,6 +54,9 @@ The following models are supported with vLLM on Spark. All listed models are ava
| Model | Quantization | Support Status | HF Handle |
|-------|-------------|----------------|-----------|
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | BF16 | ✅ | [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16) |
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | FP8 | ✅ | [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8) |
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | NVFP4 | ✅ | [`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4`](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4) |
| **Gemma 4 31B IT** | Base | ✅ | [`google/gemma-4-31B-it`](https://huggingface.co/google/gemma-4-31B-it) |
| **Gemma 4 31B IT** | NVFP4 | ✅ | [`nvidia/Gemma-4-31B-IT-NVFP4`](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) |
| **Gemma 4 26B A4B IT** | Base | ✅ | [`google/gemma-4-26B-A4B-it`](https://huggingface.co/google/gemma-4-26B-A4B-it) |
@ -94,12 +97,22 @@ Reminder: not all model architectures are supported for NVFP4 quantization.
* **Duration:** 30 minutes for Docker approach
* **Risks:** Container registry access requires internal credentials
* **Rollback:** Container approach is non-destructive.
* **Last Updated:** 04/02/2026
* Add support for Gemma 4 model family
* **Last Updated:** 04/28/2026
* Add support for Nemotron-3-Nano-Omni reasoning BF16, FP8, NVFP4
## Instructions
## Step 1. Configure Docker permissions
## Step 1. Use model-specific deployment guide
Certain models require special deployment configurations. Please refer to their respective model cards to run on DGX Spark:
| Model | Quantization | HF Model Card Link |
|-------|-------------|----------------|
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | BF16 | https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 |
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | FP8 | https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 |
| **Nemotron-3-Nano-Omni-30B-A3B-Reasoning** | NVFP4 | https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 |
## Step 2. Configure Docker permissions
To easily manage containers without sudo, you must be in the `docker` group. If you choose to skip this step, you will need to run Docker commands with sudo.
@ -115,7 +128,7 @@ sudo usermod -aG docker $USER
newgrp docker
```
## Step 2. Pull vLLM container image
## Step 3. Pull vLLM container image
Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm
@ -136,7 +149,7 @@ For Gemma 4 model family, use vLLM custom containers:
docker pull vllm/vllm-openai:gemma4-cu130
```
## Step 3. Test vLLM in container
## Step 4. Test vLLM in container
Launch the container and start vLLM server with a test model to verify basic functionality.
@ -171,7 +184,7 @@ curl http://localhost:8000/v1/chat/completions \
Expected response should contain `"content": "204"` or similar mathematical calculation.
## Step 4. Cleanup and rollback
## Step 5. Cleanup and rollback
For container approach (non-destructive):
@ -180,7 +193,7 @@ docker rm $(docker ps -aq --filter ancestor=nvcr.io/nvidia/vllm:${LATEST_VLLM_VE
docker rmi nvcr.io/nvidia/vllm
```
## Step 5. Next steps
## Step 6. Next steps
- **Production deployment:** Configure vLLM with your specific model requirements
- **Performance tuning:** Adjust batch sizes and memory settings for your workload