# Heterogeneous Distributed Inference over RDMA

> Set up high-speed RDMA networking between DGX Spark (ConnectX-7) and a Linux Workstation (ConnectX-5) for distributed AI inference

## Table of Contents

- [Overview](#overview)
- [Instructions](#instructions)
- [Troubleshooting](#troubleshooting)
- [Next Steps](#next-steps)
- [Credits](#credits)

---
## Overview

## Basic idea

This playbook guides you through setting up a heterogeneous distributed computing environment using RDMA (Remote Direct Memory Access) over Converged Ethernet (RoCE v2). You will connect a DGX Spark system with a Linux workstation equipped with a Mellanox ConnectX network adapter, enabling high-speed GPU-to-GPU communication for distributed AI workloads.

With RDMA enabled, data flows directly between GPU memories:

```
GPU memory → PCIe → NIC (mlx5) → wire → NIC → PCIe → GPU memory
```

**Key properties:**

- **No CPU copies:** Data bypasses system memory
- **No kernel networking stack:** Direct hardware-to-hardware communication
- **Ultra-low latency:** Microsecond-level communication
- **High throughput:** 93+ Gbps validated over a 100 Gbps link
## What you'll accomplish

- Enable low-latency, zero-copy GPU↔GPU communication between heterogeneous systems
- Configure RoCE v2 networking over a 100 Gbps direct QSFP connection
- Validate RDMA performance (93+ Gbps achievable)
- Prepare both systems for multi-node inference and training with NCCL
## What to know before starting

- Basic understanding of Linux networking and the command line
- Familiarity with network interface configuration (netplan)
- Understanding of PCIe and GPU computing concepts
- Basic knowledge of RDMA/InfiniBand terminology is helpful but not required
## Prerequisites

**Node A: DGX Spark**
- GPU: 128 GB unified memory (Grace Blackwell GB10)
- NIC: ConnectX-7 (QSFP56/QSFP112)
- OS: NVIDIA DGX OS (Ubuntu-based, ARM64)

**Node B: Linux Workstation**
- GPU: NVIDIA GPU with sufficient VRAM (e.g., RTX 6000 Pro, RTX 5090)
- NIC: ConnectX-5 or newer (e.g., MCX516A-CDAT for 100 GbE dual-port)
- OS: Ubuntu 20.04 / 22.04 / 24.04
- PCIe: Gen4 x16 slot recommended

**Physical Requirements:**
- One QSFP cable (QSFP56 ↔ QSFP28 compatible, 100 Gbps negotiated)
- Direct connection or dedicated switch
> [!NOTE]
> **About the hardware used in this tutorial:** We used a ConnectX-5 (MCX516A-CDAT, 100GbE dual-port) on the workstation because that's what we had available. This limits the link speed to 100 Gbps. If you use a ConnectX-7 NIC on the workstation side (matching the DGX Spark), you can achieve up to 200 Gbps. The setup process is essentially the same, just with higher bandwidth.
> [!NOTE]
> Interface names (e.g., `enp1s0f0np0`, `rocep1s0f0`) are system-specific and will differ on your hardware. Use these commands to identify your interfaces:
> ```bash
> ## Find RDMA device to network interface mapping
> ibdev2netdev
>
> ## List all network interfaces
> ip link show
>
> ## Show detailed RDMA device info
> ibv_devinfo
> ```
## Ancillary files

All required files for this playbook can be found [here on GitHub](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/heterogeneous-distributed-inference-rdma/)

- [**test_nccl.py**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/heterogeneous-distributed-inference-rdma/assets/test_nccl.py) - NCCL communication test script
## Time & risk

- **Duration:** 2-3 hours including validation and testing
- **Risk level:** Medium - involves network reconfiguration
- **Rollback:** Network changes can be reversed by removing netplan configs or IP assignments
- **Last Updated:** 01/23/2026

---

## Instructions
## Step 1. Understand the Architecture

Your distributed inference system uses **two separate communication planes**:

| Component | Purpose | Protocol | Latency |
|-----------|---------|----------|---------|
| **Control Plane (Ray)** | Orchestration, scheduling, actor management | TCP/IP (gRPC) | Milliseconds |
| **Data Plane (NCCL)** | High-speed GPU tensor transfers | RoCE v2 (RDMA) | Microseconds |

Both planes use the same 100 Gbps ConnectX network in this configuration.

**RoCE vs InfiniBand:**

| Mode | What it is | Notes |
|------|------------|-------|
| **RoCE v2 (Ethernet)** | RDMA over Ethernet | Recommended for this setup |
| **InfiniBand** | Native IB fabric | Requires IB switches |

> [!NOTE]
> If your ConnectX-5 is Ethernet-only (not VPI), RoCE v2 is the correct and only supported mode.

**Core software components (required on both nodes):**

| Component | Purpose | Notes |
|-----------|---------|--------|
| `mlx5_core` | Main NIC driver | Kernel module |
| `mlx5_ib` | RDMA support | Kernel module |
| `rdma-core` | Userspace RDMA stack | Package: rdma-core |
| `infiniband-diags` | Diagnostics (`ibstat`) | Package: infiniband-diags |
| `mstflint` | Firmware inspection | Package: mstflint |
| `NCCL` | Multi-GPU collectives | Built into PyTorch/frameworks |
| `GPUDirect RDMA` | GPU↔NIC zero-copy | Requires nvidia-peermem |
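If you want a quick sanity check against this table before starting the per-node setup, the minimal sketch below verifies that the kernel modules and packages listed above are present. Package names come straight from the table; the installation itself is covered in the following steps.

```bash
## Quick sanity check for the components listed above (run on each node)
lsmod | grep -E 'mlx5_core|mlx5_ib'           # kernel modules loaded?
dpkg -l rdma-core infiniband-diags mstflint   # userspace packages installed?
```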
---
## Step 2. Set Up the Workstation (ConnectX-5)

**Hardware & BIOS checklist:**

1. Install the ConnectX card in a PCIe Gen3/4 x16 slot (CPU-direct, not via chipset)

2. **Cooling Requirements:** ConnectX-5/7 100GbE cards are primarily designed for server environments with active cooling. In a workstation, ensure adequate case airflow directed at the card, and consider adding a PCIe slot fan for sustained high-bandwidth workloads.

3. **BIOS settings:**

   ```
   Above 4G Decoding: Enabled
   ASPM (Power Management): Disabled
   PCIe Speed: Auto / Gen4
   SR-IOV: Enabled (optional, for virtualization)
   ```

Verify PCIe detection:

```bash
## Check if ConnectX card is detected
lspci -nn | grep -i mellanox
```

Expected output:
```
03:00.0 Ethernet controller [0200]: Mellanox MT27800 [ConnectX-5] [15b3:1017]
03:00.1 Ethernet controller [0200]: Mellanox MT27800 [ConnectX-5] [15b3:1017]
```
## Step 3. Install Drivers on Workstation

Check if mlx5 drivers are already installed:

```bash
## Check for existing Mellanox drivers
lsmod | grep mlx5
```

**Option 1: Ubuntu Inbox Drivers (Recommended)**

```bash
## Update package list
sudo apt update

## Install kernel modules
sudo apt install linux-modules-extra-$(uname -r)

## Load drivers
sudo modprobe mlx5_core mlx5_ib
```

**Option 2: NVIDIA MLNX_OFED (If inbox drivers are insufficient)**

```bash
## Download from: https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/
wget https://content.mellanox.com/ofed/MLNX_OFED-24.01-0.3.3.1/MLNX_OFED_LINUX-24.01-0.3.3.1-ubuntu24.04-x86_64.tgz

## Extract and install
tar -xzf MLNX_OFED_LINUX-*.tgz
cd MLNX_OFED_LINUX-*
sudo ./mlnxofedinstall --upstream-libs --dpdk
sudo /etc/init.d/openibd restart
```
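As an optional follow-up, you can confirm which driver and firmware the NIC is running once the modules are loaded. A minimal sketch (the interface name is an example; use one reported by `ip link show` or `ibdev2netdev`, and note that `ethtool` is included in the Step 4 package list if it is not already present):

```bash
## Report driver and firmware version for the ConnectX interface
## (replace enp1s0f0np0 with your actual interface name)
ethtool -i enp1s0f0np0
```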
## Step 4. Install Required Packages on Workstation

```bash
## Update package list
sudo apt update

## Install RDMA and networking packages
sudo apt install -y \
    rdma-core \
    ibverbs-utils \
    rdmacm-utils \
    libibmad5 \
    infiniband-diags \
    perftest \
    mstflint \
    ethtool \
    ibutils
```
## Step 5. Verify Workstation RDMA Stack

Verify kernel drivers are loaded:

```bash
## Check loaded drivers
lsmod | grep mlx5
```

You must see `mlx5_core` and `mlx5_ib`. If missing, load them:

```bash
## Load drivers manually
sudo modprobe mlx5_core mlx5_ib

## Make permanent
echo 'mlx5_core' | sudo tee -a /etc/modules
echo 'mlx5_ib' | sudo tee -a /etc/modules
```

Validate RDMA stack:

```bash
## Show RDMA device info
ibv_devinfo
```

Expected output:
```
hca_id: mlx5_0
    transport: InfiniBand (0)
    fw_ver: 16.35.2000
    node_guid: xxxx:xxxx:xxxx:xxxx
    vendor_id: 0x02c9
    vendor_part_id: 4119
    phys_port_cnt: 1
```

```bash
## Show adapter status
ibstat
```

Validate PCIe bandwidth (replace `03:00.0` with your actual bus address):

```bash
## Check PCIe link speed and width
sudo lspci -s 03:00.0 -vv | grep -E "LnkCap|LnkSta"
```

Target output:
```
LnkCap: Port #0, Speed 16GT/s, Width x16
LnkSta: Speed 16GT/s (ok), Width x16 (ok)
```
---
## Step 6. Set Up DGX Spark (ConnectX-7)

**Fix repository signature issues (if needed):**

If you encounter GPG key errors:

```bash
## Remove problematic repository
sudo rm -f /etc/apt/sources.list.d/*ffmpeg* 2>/dev/null || true

## Download and install updated GPG key
curl -fsSL https://workbench.download.nvidia.com/stable/linux/gpgkey | \
    gpg --dearmor | sudo tee /usr/share/keyrings/ai-workbench-desktop-key.gpg > /dev/null

## Update package list
sudo apt update
```
## Step 7. Install Required Packages on DGX Spark

```bash
## Update package list
sudo apt update

## Install RDMA packages
sudo apt install -y \
    infiniband-diags \
    rdma-core \
    ibverbs-utils \
    mstflint \
    perftest \
    ethtool
```

> [!NOTE]
> DOCA-OFED is **not required** for DGX Spark systems. The standard Ubuntu packages provide all necessary functionality.
## Step 8. Verify DGX Spark Interfaces

Verify network interfaces:

```bash
## Show network interfaces
ip link show | grep -E "enp|ib"
```

You should see ConnectX-7 ports like `enp1s0f0np0`, `enp1s0f1np1`, etc.

Verify RDMA interfaces:

```bash
## Show RDMA device to interface mapping
ibdev2netdev
```

Example output:
```
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
```

Check PCIe topology:

```bash
## Show GPU and NIC topology
nvidia-smi topo -m
```

This shows how GPUs and NICs are interconnected via PCIe.

---
## Step 9. Connect the QSFP Cable

**Cable type:** QSFP56 or QSFP28 cable (they are cross-compatible at 100 Gbps)

**Connection procedure:**
1. Identify ports: DGX Spark has 2 physical QSFP ports with 4 logical interfaces
2. Connect QSFP cable between any available ports
3. Cable compatibility: QSFP56 ↔ QSFP28 works (100 Gbps negotiated)
4. Link detection: Should be automatic within 10-20 seconds

Verify physical link detection on DGX Spark:

```bash
## Check link status
ibdev2netdev
```

Expected output (after cable connection):
```
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
```
> [!NOTE]
> If none of the interfaces show as "(Up)", check the QSFP cable connection, reboot both systems, and try again.

Verify on Workstation:

```bash
## Check link status
ibdev2netdev
ip link show | grep -E "enp|mlx"
```
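Optionally, you can also confirm the negotiated link speed on either node once the link is up. A minimal sketch; the interface name is an example, so substitute the one showing "(Up)" on your system:

```bash
## Confirm the negotiated link speed (expect 100000Mb/s for this setup)
ethtool enp1s0f0np0 | grep -i speed
```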
---
## Step 10. Configure Network Interfaces

**Network Configuration:**
- **RDMA Network:** 192.168.200.0/24
- **DGX Spark:** 192.168.200.1
- **Workstation:** 192.168.200.2
- **MTU:** 9000 (jumbo frames for optimal RDMA performance)

> [!NOTE]
> The management IP addresses shown in examples (192.168.1.x) are placeholders. Replace these with the actual network IP addresses you see when running `ip addr show`.
**Option 1: Temporary Configuration (Testing)**

> [!NOTE]
> These commands are temporary and will be lost on reboot!

On DGX Spark:
```bash
## Configure RDMA interface (use interface showing "Up" from ibdev2netdev)
sudo ip addr add 192.168.200.1/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
sudo ip link set enp1s0f0np0 mtu 9000
```

On Workstation:
```bash
## Configure RDMA interface
sudo ip addr add 192.168.200.2/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
sudo ip link set enp1s0f0np0 mtu 9000
```
**Option 2: Permanent Configuration**

First, identify your active internet interface on both systems:

```bash
## Find your internet interface
ip addr show | grep -A 2 "inet.*scope global"
ip link show | grep "state UP"
```

On DGX Spark:
```bash
## Create netplan configuration (REPLACE interface names with YOUR actual interfaces!)
sudo tee /etc/netplan/99-rdma.yaml > /dev/null <<EOF
network:
  version: 2
  renderer: networkd
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.200.1/24
      mtu: 9000
      dhcp4: false
    enP7s7:  # Replace with YOUR actual internet interface!
      dhcp4: true
  wifis:
    wlP9s9:  # WiFi - optional backup
      dhcp4: true
      access-points:
        "<your-wifi-ssid>":
          password: "<your-wifi-password>"
EOF

## Set permissions and apply
sudo chmod 600 /etc/netplan/99-rdma.yaml
sudo netplan apply
```
On Workstation:
```bash
## Create netplan configuration (REPLACE interface names with YOUR actual interfaces!)
sudo tee /etc/netplan/99-rdma.yaml > /dev/null <<EOF
network:
  version: 2
  renderer: networkd
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.200.2/24
      mtu: 9000
      dhcp4: false
    eno2np1:  # Replace with YOUR actual internet interface!
      dhcp4: true
EOF

## Set permissions and apply
sudo chmod 600 /etc/netplan/99-rdma.yaml
sudo netplan apply
```

> [!IMPORTANT]
> Before applying netplan, identify your active internet interface to avoid losing connectivity. Interface names may change after applying netplan (e.g., `mlx5_0` to `rocep1s0f0`). Always verify current device names with `ibdev2netdev`.
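If you need to undo this configuration (see the rollback note under Time & risk), a minimal sketch is to remove the netplan file and re-apply; for the temporary Option 1, flushing the address is enough. The file and interface names below are the examples used in this step, so adjust them to match what you actually created.

```bash
## Roll back the permanent configuration
sudo rm /etc/netplan/99-rdma.yaml
sudo netplan apply

## Roll back the temporary configuration (Option 1)
sudo ip addr flush dev enp1s0f0np0
```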
## Step 11. Verify Network Connectivity

Test basic connectivity:

```bash
## From DGX Spark
ping -c 4 192.168.200.2

## From Workstation
ping -c 4 192.168.200.1
```

Expected output:
```
PING 192.168.200.2 (192.168.200.2) 56(84) bytes of data.
64 bytes from 192.168.200.2: icmp_seq=1 time=0.xxx ms
...
4 packets transmitted, 4 received, 0% packet loss
```
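To confirm that jumbo frames actually pass end to end (not just that the MTU is set locally), you can send a ping that is too large to fragment. A minimal sketch assuming MTU 9000 on both interfaces: 8972 bytes of ICMP payload plus 28 bytes of IP/ICMP headers fills a 9000-byte frame.

```bash
## Verify the 9000-byte path from the DGX Spark side (do-not-fragment)
ping -c 4 -M do -s 8972 192.168.200.2
```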
---
## Step 12. Test RDMA Bandwidth

Identify correct device names:

```bash
## Check available RDMA devices
ibv_devinfo
ls /sys/class/infiniband/
```

**Device name mapping:**
- **DGX Spark:** Use `rocep1s0f0` or `roceP2p1s0f0`
- **Workstation:** Use `mlx5_0` or `mlx5_1` (or `rocep1s0f0` after persistent config)

Run bandwidth test:

On DGX Spark (server) - Start first:
```bash
## Start RDMA bandwidth test server
ib_send_bw -d rocep1s0f0
```

On Workstation (client) - Connect to server:
```bash
## Connect to server and run bandwidth test
ib_send_bw -d rocep1s0f0 192.168.200.1
```

Example successful output:
```
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : rocep1s0f0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 Link type       : Ethernet
---------------------------------------------------------------------------------------
 #bytes   #iterations   BW peak[MB/sec]   BW average[MB/sec]   MsgRate[Mpps]
 65536    1000          11664.71          11664.25             0.186628
---------------------------------------------------------------------------------------
```

**Performance Analysis:**
- 11,664 MB/sec = ~93.3 Gbps
- Achieves >93% of 100 Gbps line rate - Excellent!
- Link type: Ethernet confirms RoCE v2 is working

**Performance expectations:**
- **>90 Gbps:** Excellent - Ready for distributed AI workloads
- **80-90 Gbps:** Good - Sufficient for most multi-node training
- **<80 Gbps:** Check MTU (should be 9000), cable quality, or PCIe slot
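If you also want a latency number, the perftest suite installed earlier includes `ib_send_lat`, which follows the same server/client pattern as the bandwidth test. A minimal sketch using the same device names and server IP as above; on a healthy direct link you should see microsecond-level latencies, consistent with the overview.

```bash
## On DGX Spark (server) - start first
ib_send_lat -d rocep1s0f0

## On Workstation (client) - connect to the server
ib_send_lat -d rocep1s0f0 192.168.200.1
```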
---
## Step 13. Configure Environment Variables for NCCL

Add to both systems (persistent across reboots):

```bash
## Add RDMA configuration to bashrc
echo '# RDMA Network Configuration' >> ~/.bashrc
echo 'export UCX_NET_DEVICES=enp1s0f0np0' >> ~/.bashrc
echo 'export NCCL_SOCKET_IFNAME=enp1s0f0np0' >> ~/.bashrc
echo 'export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0' >> ~/.bashrc

## Apply to current session
source ~/.bashrc
```

Verification:
```bash
## Check environment variables
echo $UCX_NET_DEVICES
echo $NCCL_SOCKET_IFNAME
## Both should show: enp1s0f0np0
```
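When you first bring NCCL up across the two nodes, it can help to enable NCCL's logging and point it at the RDMA device explicitly. These are standard NCCL environment variables, but the device name shown is just the example from this setup; confirm yours with `ibdev2netdev`.

```bash
## Optional: verbose NCCL logging and explicit HCA selection for first runs
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=rocep1s0f0
```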
---
## Step 14. (Optional) Configure GPUDirect RDMA

**When needed:**
- High-frequency GPU-to-GPU transfers
- Zero-copy GPU memory access
- Maximum performance training workloads

**Configuration:**
```bash
## Install nvidia-peermem module
sudo apt install nvidia-peer-memory-dkms
sudo modprobe nvidia-peermem
```
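To confirm that the peer-memory module is actually available after loading it, a minimal check is to look for it in the loaded-module list (the exact module name can vary slightly between driver and OFED versions):

```bash
## Verify the peer-memory module is loaded
lsmod | grep -i peermem
```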
---
## Step 15. Final Validation

At this point, you should have achieved:

- [ ] Physical link detected - `ibdev2netdev` shows "(Up)" status
- [ ] IP connectivity working - `ping 192.168.200.x` succeeds
- [ ] MTU set to 9000 - Jumbo frames enabled
- [ ] RDMA bandwidth >90 Gbps validated
- [ ] RoCE v2 confirmed - Link type: Ethernet
- [ ] Environment variables set for NCCL

Your RDMA setup is **fully operational** and ready for distributed AI workloads!
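If you prefer to run through the checklist from a single terminal, the sketch below strings together the commands already used in earlier steps. The interface name and peer IP are the examples from this guide; substitute your own values.

```bash
## Consolidated validation (run on either node; adjust interface and peer IP)
IFACE=enp1s0f0np0
PEER=192.168.200.2

ibdev2netdev | grep "(Up)"                       # physical link detected
ip -br addr show dev "$IFACE"                    # assigned IP address
ip link show "$IFACE" | grep "mtu 9000"          # jumbo frames enabled
ping -c 2 "$PEER"                                # IP connectivity
echo "$NCCL_SOCKET_IFNAME" "$UCX_NET_DEVICES"    # NCCL environment variables
```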
---
## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| `ibdev2netdev` shows no devices | mlx5 drivers not loaded | `sudo modprobe mlx5_core mlx5_ib` |
| Interface shows "(Down)" after cable | Link not negotiated | Check cable, try different port, reboot |
| Ping fails between nodes | IP not configured or wrong interface | Verify `ip addr show`, check interface names |
| RDMA bandwidth <80 Gbps | MTU not set to 9000 | `sudo ip link set <interface> mtu 9000` |
| "mlx5_0 not found" error | Device name changed after netplan | Run `ibdev2netdev` to find current name |
| Permission denied on `/dev/infiniband` | Missing RDMA permissions | Run with `sudo` or add user to `rdma` group |
| GPG key errors on DGX Spark | Expired NVIDIA repository key | See Step 6 for fix |
| Lost internet after netplan apply | Wrong interface in netplan config | Identify correct interface with `ip link show` first |
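When chasing any of the symptoms above, it is usually enough to capture the state of the RDMA stack in one pass. A minimal collection sketch using only the tools installed earlier in this playbook:

```bash
## Collect basic RDMA diagnostics into one file for review
{
  lsmod | grep mlx5      # kernel modules
  ibdev2netdev           # device-to-interface mapping and link state
  ibv_devinfo            # RDMA device details
  ip -br addr            # IP assignments
} > rdma-diagnostics.txt 2>&1
```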
---
## Next Steps

Continue to [**Distributed Inference Guide**](DISTRIBUTED-INFERENCE.md) to:
- Set up SSH and hostname configuration
- Configure NCCL for multi-node communication
- Deploy RDMA-enabled containers with a Ray cluster
- Run distributed inference with vLLM
- Benchmark performance across configurations
---
## Credits

This playbook was contributed by **Csaba Kecskemeti** | [DevQuasar](https://devquasar.com/).

For a detailed walkthrough and additional context, see the original article:
[Distributed Inference Cluster: DGX Spark + RTX 6000 Pro](https://devquasar.com/ai/edge-ai/distributed-inference-cluster-dgx-spark-rtx-6000-pro/)