# Heterogeneous Distributed Inference over RDMA

> Set up high-speed RDMA networking between DGX Spark (ConnectX-7) and a Linux Workstation (ConnectX-5) for distributed AI inference

## Table of Contents

- [Overview](#overview)
- [Instructions](#instructions)
- [Troubleshooting](#troubleshooting)
- [Next Steps](#next-steps)
- [Credits](#credits)

---

## Overview

## Basic idea

This playbook guides you through setting up a heterogeneous distributed computing environment using RDMA (Remote Direct Memory Access) over Converged Ethernet (RoCE v2). You will connect a DGX Spark system to a Linux workstation equipped with a Mellanox ConnectX network adapter, enabling high-speed GPU-to-GPU communication for distributed AI workloads.

With RDMA enabled, data flows directly between GPU memories:

```
GPU memory → PCIe → NIC (mlx5) → wire → NIC → PCIe → GPU memory
```

**Key properties:**

- **No CPU copies:** Data bypasses system memory
- **No kernel networking stack:** Direct hardware-to-hardware communication
- **Ultra-low latency:** Microsecond-level communication
- **High throughput:** 93+ Gbps validated over a 100 Gbps link

## What you'll accomplish

- Enable low-latency, zero-copy GPU↔GPU communication between heterogeneous systems
- Configure RoCE v2 networking over a 100 Gbps direct QSFP connection
- Validate RDMA performance (93+ Gbps achievable)
- Prepare both systems for multi-node inference and training with NCCL

## What to know before starting

- Basic understanding of Linux networking and the command line
- Familiarity with network interface configuration (netplan)
- Understanding of PCIe and GPU computing concepts
- Basic knowledge of RDMA/InfiniBand terminology is helpful but not required

## Prerequisites

**Node A: DGX Spark**

- GPU: 128 GB unified memory (Grace Blackwell GB10)
- NIC: ConnectX-7 (QSFP56/QSFP112)
- OS: NVIDIA DGX OS (Ubuntu-based, ARM64)

**Node B: Linux Workstation**

- GPU: NVIDIA GPU with sufficient VRAM (e.g., RTX 6000 Pro, RTX 5090)
- NIC: ConnectX-5 or newer (e.g., MCX516A-CDAT for 100 GbE dual-port)
- OS: Ubuntu 20.04 / 22.04 / 24.04
- PCIe: Gen4 x16 slot recommended

**Physical Requirements:**

- One QSFP cable (QSFP56 ↔ QSFP28 compatible, 100 Gbps negotiated)
- Direct connection or dedicated switch

> [!NOTE]
> **About the hardware used in this tutorial:** We used a ConnectX-5 (MCX516A-CDAT, 100 GbE dual-port) on the workstation because that's what we had available. This limits the link speed to 100 Gbps. If you use a ConnectX-7 NIC on the workstation side (matching the DGX Spark), you can achieve up to 200 Gbps. The setup process is the same or very similar, just with higher bandwidth.

> [!NOTE]
> Interface names (e.g., `enp1s0f0np0`, `rocep1s0f0`) are system-specific and will differ on your hardware. Use these commands to identify your interfaces:
>
> ```bash
> ## Find RDMA device to network interface mapping
> ibdev2netdev
>
> ## List all network interfaces
> ip link show
>
> ## Show detailed RDMA device info
> ibv_devinfo
> ```

## Ancillary files

All required files for this playbook can be found [here on GitHub](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/heterogeneous-distributed-inference-rdma/)

- [**test_nccl.py**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/heterogeneous-distributed-inference-rdma/assets/test_nccl.py) - NCCL communication test script

## Time & risk

- **Duration:** 2-3 hours including validation and testing
- **Risk level:** Medium - involves network reconfiguration
- **Rollback:** Network changes can be reversed by removing netplan configs or IP assignments
- **Last Updated:** 01/23/2026

---

## Instructions

## Step 1. Understand the Architecture

Your distributed inference system uses **two separate communication planes**:

| Component | Purpose | Protocol | Latency |
|-----------|---------|----------|---------|
| **Control Plane (Ray)** | Orchestration, scheduling, actor management | TCP/IP (gRPC) | Milliseconds |
| **Data Plane (NCCL)** | High-speed GPU tensor transfers | RoCE v2 (RDMA) | Microseconds |

Both planes use the same 100 Gbps ConnectX network in this configuration.

**RoCE vs InfiniBand:**

| Mode | What it is | Notes |
|------|------------|-------|
| **RoCE v2 (Ethernet)** | RDMA over Ethernet | Recommended for this setup |
| **InfiniBand** | Native IB fabric | Requires IB switches |

> [!NOTE]
> If your ConnectX-5 is Ethernet-only (not VPI), RoCE v2 is the correct and only supported mode.

**Core software components (required on both nodes):**

| Component | Purpose | Notes |
|-----------|---------|-------|
| `mlx5_core` | Main NIC driver | Kernel module |
| `mlx5_ib` | RDMA support | Kernel module |
| `rdma-core` | Userspace RDMA stack | Package: rdma-core |
| `infiniband-diags` | Diagnostics (`ibstat`) | Package: infiniband-diags |
| `mstflint` | Firmware inspection | Package: mstflint |
| `NCCL` | Multi-GPU collectives | Built into PyTorch/frameworks |
| `GPUDirect RDMA` | GPU↔NIC zero-copy | Requires nvidia-peermem |

---

## Step 2. Set Up the Workstation (ConnectX-5)

**Hardware & BIOS checklist:**

1. Install the ConnectX card in a PCIe Gen3/4 x16 slot (CPU-direct, not via chipset)
2. **Cooling requirements:** ConnectX-5/7 100 GbE cards are primarily designed for server environments with active cooling. In a workstation, ensure adequate case airflow directed at the card, and consider adding a PCIe slot fan for sustained high-bandwidth workloads.
3. **BIOS settings:**

   ```
   Above 4G Decoding: Enabled
   ASPM (Power Management): Disabled
   PCIe Speed: Auto / Gen4
   SR-IOV: Enabled (optional, for virtualization)
   ```

Verify PCIe detection:

```bash
## Check if ConnectX card is detected
lspci -nn | grep -i mellanox
```

Expected output:

```
03:00.0 Ethernet controller [0200]: Mellanox MT27800 [ConnectX-5] [15b3:1017]
03:00.1 Ethernet controller [0200]: Mellanox MT27800 [ConnectX-5] [15b3:1017]
```

## Step 3. Install Drivers on Workstation

Check if mlx5 drivers are already installed:

```bash
## Check for existing Mellanox drivers
lsmod | grep mlx5
```

**Option 1: Ubuntu Inbox Drivers (Recommended)**

```bash
## Update package list
sudo apt update

## Install kernel modules
sudo apt install linux-modules-extra-$(uname -r)

## Load drivers
sudo modprobe mlx5_core mlx5_ib
```

**Option 2: NVIDIA MLNX_OFED (if inbox drivers are insufficient)**

```bash
## Download from: https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/
wget https://content.mellanox.com/ofed/MLNX_OFED-24.01-0.3.3.1/MLNX_OFED_LINUX-24.01-0.3.3.1-ubuntu24.04-x86_64.tgz

## Extract and install
tar -xzf MLNX_OFED_LINUX-*.tgz
cd MLNX_OFED_LINUX-*
sudo ./mlnxofedinstall --upstream-libs --dpdk
sudo /etc/init.d/openibd restart
```

## Step 4. Install Required Packages on Workstation

```bash
## Update package list
sudo apt update

## Install RDMA and networking packages
sudo apt install -y \
    rdma-core \
    ibverbs-utils \
    rdmacm-utils \
    libibmad5 \
    infiniband-diags \
    perftest \
    mstflint \
    ethtool \
    ibutils
```

## Step 5. Verify Workstation RDMA Stack

Verify kernel drivers are loaded:

```bash
## Check loaded drivers
lsmod | grep mlx5
```

You must see `mlx5_core` and `mlx5_ib`.
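The check can also be scripted. A minimal sketch (the helper name `check_mlx5_modules` is ours, not a standard tool) that scans `lsmod`-style output for the two required modules:

```bash
## Report which required mlx5 modules are missing from lsmod-style input.
## Usage: check_mlx5_modules "$(lsmod)"
check_mlx5_modules() {
  local input="$1" missing=""
  for mod in mlx5_core mlx5_ib; do
    # Match the module name as a whole word
    printf '%s\n' "$input" | grep -qw "$mod" || missing="$missing $mod"
  done
  if [ -z "$missing" ]; then
    echo "OK: mlx5_core and mlx5_ib loaded"
  else
    echo "MISSING:$missing"
  fi
}

check_mlx5_modules "$(lsmod 2>/dev/null)"
```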
If missing, load them:

```bash
## Load drivers manually
sudo modprobe mlx5_core mlx5_ib

## Make permanent
echo 'mlx5_core' | sudo tee -a /etc/modules
echo 'mlx5_ib' | sudo tee -a /etc/modules
```

Validate the RDMA stack:

```bash
## Show RDMA device info
ibv_devinfo
```

Expected output:

```
hca_id: mlx5_0
    transport:        InfiniBand (0)
    fw_ver:           16.35.2000
    node_guid:        xxxx:xxxx:xxxx:xxxx
    vendor_id:        0x02c9
    vendor_part_id:   4119
    phys_port_cnt:    1
```

```bash
## Show adapter status
ibstat
```

Validate PCIe bandwidth (replace `03:00.0` with your actual bus address):

```bash
## Check PCIe link speed and width
sudo lspci -s 03:00.0 -vv | grep -E "LnkCap|LnkSta"
```

Target output:

```
LnkCap: Port #0, Speed 16GT/s, Width x16
LnkSta: Speed 16GT/s (ok), Width x16 (ok)
```

---

## Step 6. Set Up DGX Spark (ConnectX-7)

**Fix repository signature issues (if needed):**

If you encounter GPG key errors:

```bash
## Remove problematic repository
sudo rm -f /etc/apt/sources.list.d/*ffmpeg* 2>/dev/null || true

## Download and install updated GPG key
curl -fsSL https://workbench.download.nvidia.com/stable/linux/gpgkey | \
  gpg --dearmor | sudo tee /usr/share/keyrings/ai-workbench-desktop-key.gpg > /dev/null

## Update package list
sudo apt update
```

## Step 7. Install Required Packages on DGX Spark

```bash
## Update package list
sudo apt update

## Install RDMA packages
sudo apt install -y \
    infiniband-diags \
    rdma-core \
    ibverbs-utils \
    mstflint \
    perftest \
    ethtool
```

> [!NOTE]
> DOCA-OFED is **not required** for DGX Spark systems. The standard Ubuntu packages provide all necessary functionality.

## Step 8. Verify DGX Spark Interfaces

Verify network interfaces:

```bash
## Show network interfaces
ip link show | grep -E "enp|ib"
```

You should see ConnectX-7 ports like `enp1s0f0np0`, `enp1s0f1np1`, etc.
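If `ibdev2netdev` happens to be unavailable, the same RDMA-device-to-netdev mapping can be read directly from sysfs. A minimal sketch (the helper name `map_rdma_netdevs` is ours; it relies on the standard `/sys/class/infiniband/<dev>/device/net/` layout used by `rdma-core`):

```bash
## Print "rdma_device ==> network_interface" pairs found under a sysfs-like tree.
## With no argument it walks the live /sys/class/infiniband tree.
map_rdma_netdevs() {
  local base="${1:-/sys/class/infiniband}"
  local dev net
  for dev in "$base"/*; do
    [ -d "$dev" ] || continue
    # device/net/ holds the netdev(s) backing this RDMA device
    for net in "$dev"/device/net/*; do
      [ -d "$net" ] && echo "$(basename "$dev") ==> $(basename "$net")"
    done
  done
  return 0
}

map_rdma_netdevs
```

Unlike `ibdev2netdev`, this does not report link state; use `ip link show` for that.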
Verify RDMA interfaces:

```bash
## Show RDMA device to interface mapping
ibdev2netdev
```

Example output:

```
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
```

Check PCIe topology:

```bash
## Show GPU and NIC topology
nvidia-smi topo -m
```

This shows how GPUs and NICs are interconnected via PCIe.

---

## Step 9. Connect the QSFP Cable

**Cable type:** QSFP56 or QSFP28 cable (they are cross-compatible at 100 Gbps)

**Connection procedure:**

1. Identify ports: the DGX Spark has 2 physical QSFP ports with 4 logical interfaces
2. Connect the QSFP cable between any available ports
3. Cable compatibility: QSFP56 ↔ QSFP28 works (100 Gbps negotiated)
4. Link detection: should be automatic within 10-20 seconds

Verify physical link detection on the DGX Spark:

```bash
## Check link status
ibdev2netdev
```

Expected output (after cable connection):

```
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
```

> [!NOTE]
> If none of the interfaces show as "Up", check the QSFP cable connection, reboot both systems, and try again.

Verify on the workstation:

```bash
## Check link status
ibdev2netdev
ip link show | grep -E "enp|mlx"
```

---

## Step 10. Configure Network Interfaces

**Network configuration:**

- **RDMA network:** 192.168.200.0/24
- **DGX Spark:** 192.168.200.1
- **Workstation:** 192.168.200.2
- **MTU:** 9000 (jumbo frames for optimal RDMA performance)

> [!NOTE]
> The management IP addresses shown in examples (192.168.1.x) are placeholders. Replace them with the actual addresses you see when running `ip addr show`.

**Option 1: Temporary Configuration (Testing)**

> [!NOTE]
> These commands are temporary and will be lost on reboot!
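Because the temporary commands must be re-entered after every reboot, it can help to generate them from one helper. A dry-run sketch (the function name `rdma_up_cmds` is ours) that only prints the commands; pipe its output to `sudo sh` to apply:

```bash
## Emit the ip commands that configure an RDMA interface; nothing is executed.
## Usage: rdma_up_cmds <interface> <cidr-address>
rdma_up_cmds() {
  local iface="$1" addr="$2"
  echo "ip addr replace $addr dev $iface"
  echo "ip link set $iface up"
  echo "ip link set $iface mtu 9000"
}

## DGX Spark side (use 192.168.200.2/24 on the workstation)
rdma_up_cmds enp1s0f0np0 192.168.200.1/24
```

`ip addr replace` is used instead of `add` so re-running is harmless when the address is already assigned.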
On DGX Spark:

```bash
## Configure RDMA interface (use the interface showing "Up" in ibdev2netdev)
sudo ip addr add 192.168.200.1/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
sudo ip link set enp1s0f0np0 mtu 9000
```

On Workstation:

```bash
## Configure RDMA interface
sudo ip addr add 192.168.200.2/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
sudo ip link set enp1s0f0np0 mtu 9000
```

**Option 2: Permanent Configuration**

First, identify your active internet interface on both systems:

```bash
## Find your internet interface
ip addr show | grep -A 2 "inet.*scope global"
ip link show | grep "state UP"
```

On DGX Spark:

```bash
## Create netplan configuration (REPLACE interface names with YOUR actual interfaces!)
## Minimal example covering only the RDMA interface; merge with any settings
## (e.g., Wi-Fi) your system's existing netplan files already define.
sudo tee /etc/netplan/99-rdma.yaml > /dev/null <<'EOF'
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.200.1/24
      mtu: 9000
EOF

## Set permissions and apply
sudo chmod 600 /etc/netplan/99-rdma.yaml
sudo netplan apply
```

On Workstation:

```bash
## Create netplan configuration (REPLACE interface names with YOUR actual interfaces!)
sudo tee /etc/netplan/99-rdma.yaml > /dev/null <<'EOF'
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.200.2/24
      mtu: 9000
EOF

## Set permissions and apply
sudo chmod 600 /etc/netplan/99-rdma.yaml
sudo netplan apply
```

> [!IMPORTANT]
> Before applying netplan, identify your active internet interface to avoid losing connectivity. Interface names may change after applying netplan (e.g., `mlx5_0` to `rocep1s0f0`). Always verify current device names with `ibdev2netdev`.

## Step 11. Verify Network Connectivity

Test basic connectivity:

```bash
## From DGX Spark
ping -c 4 192.168.200.2

## From Workstation
ping -c 4 192.168.200.1
```

Expected output:

```
PING 192.168.200.2 (192.168.200.2) 56(84) bytes of data.
64 bytes from 192.168.200.2: icmp_seq=1 time=0.xxx ms
...
4 packets transmitted, 4 received, 0% packet loss
```

---

## Step 12. Test RDMA Bandwidth

Identify the correct device names:

```bash
## Check available RDMA devices
ibv_devinfo
ls /sys/class/infiniband/
```

**Device name mapping:**

- **DGX Spark:** Use `rocep1s0f0` or `roceP2p1s0f0`
- **Workstation:** Use `mlx5_0` or `mlx5_1` (or `rocep1s0f0` after persistent config)

Run the bandwidth test:

On DGX Spark (server) - start first:

```bash
## Start RDMA bandwidth test server
ib_send_bw -d rocep1s0f0
```

On Workstation (client) - connect to the server:

```bash
## Connect to the server and run the bandwidth test
ib_send_bw -d rocep1s0f0 192.168.200.1
```

Example successful output:

```
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : rocep1s0f0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 Link type       : Ethernet
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000           11664.71           11664.25             0.186628
---------------------------------------------------------------------------------------
```

**Performance analysis:**

- 11,664 MB/sec ≈ 93.3 Gbps
- Achieves >93% of the 100 Gbps line rate - excellent!
- "Link type: Ethernet" confirms RoCE v2 is working

**Performance expectations:**

- **>90 Gbps:** Excellent - ready for distributed AI workloads
- **80-90 Gbps:** Good - sufficient for most multi-node training
- **<80 Gbps:** Check MTU (should be 9000), cable quality, or PCIe slot

---

## Step 13. Configure Environment Variables for NCCL

Add to both systems (persistent across reboots):

```bash
## Add RDMA configuration to bashrc
echo '# RDMA Network Configuration' >> ~/.bashrc
echo 'export UCX_NET_DEVICES=enp1s0f0np0' >> ~/.bashrc
echo 'export NCCL_SOCKET_IFNAME=enp1s0f0np0' >> ~/.bashrc
echo 'export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0' >> ~/.bashrc

## Apply to current session
source ~/.bashrc
```

Verification:

```bash
## Check environment variables
echo $UCX_NET_DEVICES
echo $NCCL_SOCKET_IFNAME

## Both should show: enp1s0f0np0
```

---

## Step 14. (Optional) Configure GPUDirect RDMA

**When needed:**

- High-frequency GPU-to-GPU transfers
- Zero-copy GPU memory access
- Maximum-performance training workloads

**Configuration:**

```bash
## Install the nvidia-peermem module
sudo apt install nvidia-peer-memory-dkms
sudo modprobe nvidia-peermem
```

---

## Step 15. Final Validation

At this point, you should have achieved:

- [ ] Physical link detected - `ibdev2netdev` shows "(Up)" status
- [ ] IP connectivity working - `ping 192.168.200.x` succeeds
- [ ] MTU set to 9000 - jumbo frames enabled
- [ ] RDMA bandwidth >90 Gbps validated
- [ ] RoCE v2 confirmed - Link type: Ethernet
- [ ] Environment variables set for NCCL

Your RDMA setup is **fully operational** and ready for distributed AI workloads!
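The checklist above can also be run as a quick self-test on either node. A sketch only (the helper `check_mtu` is ours; interface and peer default to the values used in this guide and can be overridden via environment variables):

```bash
## Quick self-test for the final validation checklist; adjust IFACE/PEER as needed
IFACE="${IFACE:-enp1s0f0np0}"
PEER="${PEER:-192.168.200.2}"

## check_mtu <iface> <expected>: succeeds if the interface's MTU matches
check_mtu() {
  [ "$(cat /sys/class/net/"$1"/mtu 2>/dev/null)" = "$2" ]
}

check_mtu "$IFACE" 9000 && echo "MTU 9000 on $IFACE: OK" || echo "MTU 9000 on $IFACE: FAIL"
ping -c 1 -W 2 "$PEER" >/dev/null 2>&1 && echo "Peer $PEER reachable: OK" || echo "Peer $PEER reachable: FAIL"
[ -n "${NCCL_SOCKET_IFNAME:-}" ] && echo "NCCL_SOCKET_IFNAME set: OK" || echo "NCCL_SOCKET_IFNAME set: FAIL"
```

The RDMA bandwidth check itself still needs the two-sided `ib_send_bw` run from Step 12.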
---

## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| `ibdev2netdev` shows no devices | mlx5 drivers not loaded | `sudo modprobe mlx5_core mlx5_ib` |
| Interface shows "(Down)" after cabling | Link not negotiated | Check cable, try a different port, reboot |
| Ping fails between nodes | IP not configured or wrong interface | Verify `ip addr show`, check interface names |
| RDMA bandwidth <80 Gbps | MTU not set to 9000 | `sudo ip link set <iface> mtu 9000` |
| "mlx5_0 not found" error | Device name changed after netplan | Run `ibdev2netdev` to find the current name |
| Permission denied on `/dev/infiniband` | Missing RDMA permissions | Run with `sudo` or add the user to the `rdma` group |
| GPG key errors on DGX Spark | Expired NVIDIA repository key | See Step 6 for the fix |
| Lost internet after `netplan apply` | Wrong interface in netplan config | Identify the correct interface with `ip link show` first |

---

## Next Steps

Continue to the [**Distributed Inference Guide**](DISTRIBUTED-INFERENCE.md) to:

- Set up SSH and hostname configuration
- Configure NCCL for multi-node communication
- Deploy RDMA-enabled containers with a Ray cluster
- Run distributed inference with vLLM
- Benchmark performance across configurations

---

## Credits

This playbook was contributed by **Csaba Kecskemeti** | [DevQuasar](https://devquasar.com/).

For a detailed walkthrough and additional context, see the original article: [Distributed Inference Cluster: DGX Spark + RTX 6000 Pro](https://devquasar.com/ai/edge-ai/distributed-inference-cluster-dgx-spark-rtx-6000-pro/)