From a8bb88895675f840856dfd6f572f23115475135b Mon Sep 17 00:00:00 2001 From: GitLab CI Date: Fri, 20 Mar 2026 03:06:12 +0000 Subject: [PATCH] chore: Regenerate all playbooks --- README.md | 1 + nvidia/connect-three-sparks/README.md | 344 ++++++++++++++++++ .../assets/cx7-netplan-example.yaml | 36 ++ 3 files changed, 381 insertions(+) create mode 100644 nvidia/connect-three-sparks/README.md create mode 100644 nvidia/connect-three-sparks/assets/cx7-netplan-example.yaml diff --git a/README.md b/README.md index 7082826..9e34c72 100644 --- a/README.md +++ b/README.md @@ -22,6 +22,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting ### NVIDIA - [Comfy UI](nvidia/comfy-ui/) +- [Connect Three DGX Spark in a Ring Topology](nvidia/connect-three-sparks/) - [Set Up Local Network Access](nvidia/connect-to-your-spark/) - [Connect Two Sparks](nvidia/connect-two-sparks/) - [CUDA-X Data Science](nvidia/cuda-x-data-science/) diff --git a/nvidia/connect-three-sparks/README.md b/nvidia/connect-three-sparks/README.md new file mode 100644 index 0000000..f6923a5 --- /dev/null +++ b/nvidia/connect-three-sparks/README.md @@ -0,0 +1,344 @@ +# Connect Three DGX Spark in a Ring Topology + +> Connect and set up three DGX Spark devices in a ring topology + +## Table of Contents + +- [Overview](#overview) +- [Run on Three Sparks](#run-on-three-sparks) + - [Option 1: Automatically configure SSH](#option-1-automatically-configure-ssh) + - [Option 2: Manually discover and configure SSH](#option-2-manually-discover-and-configure-ssh) +- [Troubleshooting](#troubleshooting) + +--- + +## Overview + +## Basic idea + +Configure three DGX Spark systems in a ring topology for high-speed inter-node communication +using 200GbE direct QSFP connections. This setup enables distributed workloads across three +DGX Spark nodes by establishing network connectivity and configuring SSH authentication. 
+ +## What you'll accomplish + +You will physically connect three DGX Spark devices with QSFP cables, configure network +interfaces for cluster communication, and establish passwordless SSH between nodes to create +a functional distributed computing environment. + +## What to know before starting + +- Basic understanding of distributed computing concepts +- Working with network interface configuration and netplan +- Experience with SSH key management + +## Prerequisites + +- Three DGX Spark systems +- Three QSFP cables for direct 200GbE connection between the devices in a ring topology. Use [recommended cable](https://marketplace.nvidia.com/en-us/enterprise/personal-ai-supercomputers/qsfp-cable-0-4m-for-dgx-spark/) or similar. +- SSH access available to all systems +- Root or sudo access on all systems: `sudo whoami` +- The same username on all systems +- Update all systems to the latest OS and Firmware. Refer to the DGX Spark documentation https://docs.nvidia.com/dgx/dgx-spark/os-and-component-update.html + +## Ancillary files + +This playbook's files can be found [here on GitHub](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/connect-three-sparks/) + +- [**discover-sparks.sh**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/connect-two-sparks/assets/discover-sparks) script for automatic node discovery and SSH key distribution +- [**Cluster setup scripts**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup) for automatic network configuration, validation and running NCCL sanity test + +## Time & risk + +- **Duration:** 1 hour including validation + +- **Risk level:** Medium - involves network reconfiguration + +- **Rollback:** Network changes can be reversed by removing netplan configs or IP assignments + +- **Last Updated:** 3/19/2026 + * First publication + +## Run on Three Sparks + +## Step 1. 
Ensure Same Username on all Systems + +On all systems, check the username and make sure it is the same: + +```bash +## Check current username +whoami +``` + +If usernames don't match, create a new user (e.g., nvidia) on all systems and log in with the new user: + +```bash +## Create nvidia user and add to sudo group +sudo useradd -m nvidia +sudo usermod -aG sudo nvidia + +## Set password for nvidia user +sudo passwd nvidia + +## Switch to nvidia user +su - nvidia +``` + +## Step 2. Physical Hardware Connection + +Connect the QSFP cables between the three DGX Spark systems in a ring topology. +Here, Port0 is the CX7 port next to the Ethernet port and Port1 is the CX7 port further away from it. +1. Node1 (Port0) to Node2 (Port1) +2. Node2 (Port0) to Node3 (Port1) +3. Node3 (Port0) to Node1 (Port1) + +> [!NOTE] +> Double-check that the connections are correct; otherwise, the network configuration might fail. + +This establishes the 200GbE direct connections required for high-speed inter-node communication. +Once the three nodes are connected, you will see output like the example below on all nodes. In this example, the interfaces showing as 'Up' are **enp1s0f0np0** / **enP2p1s0f0np0** and **enp1s0f1np1** / **enP2p1s0f1np1** (each physical port has two logical interfaces). + +Example output: +```bash +## Check QSFP interface availability on all nodes +nvidia@dgx-spark-1:~$ ibdev2netdev +rocep1s0f0 port 1 ==> enp1s0f0np0 (Up) +rocep1s0f1 port 1 ==> enp1s0f1np1 (Up) +roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up) +roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up) +``` + +> [!NOTE] +> If any interface is not showing as 'Up', check the QSFP cable connections, reboot the systems, and try again. + +## Step 3. Network Interface Configuration + +Choose one option to set up the network interfaces. The options are mutually exclusive. Option 1 is recommended because it avoids the complexity of manual network setup. + +> [!NOTE] +> Each CX7 port provides full 200GbE bandwidth. 
+> In a three node ring topology all four interfaces on each node must be assigned an IP address to form a symmetric cluster. + +**Option 1: Automatic IP Assignment with script** + +We have created a script [here on GitHub](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup) which automates the following: +1. Interface network configuration for all DGX Sparks +2. Set up passwordless authentication between the DGX Sparks +3. Verify multi-node communication +4. Run NCCL Bandwidth tests + +> [!NOTE] +> If you use the script steps below, you can skip rest of the setup instructions in this playbook. + +Use the steps below to run the script: + +```bash +## Clone the repository +git clone https://github.com/NVIDIA/dgx-spark-playbooks + +## Enter the script directory +cd dgx-spark-playbooks/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup + +## Check the README.md for steps to run the script and configure the cluster networking +``` + +**Option 2: Manual IP Assignment with the netplan configuration file** + +On node 1: +```bash +## Create the netplan configuration file +sudo tee /etc/netplan/40-cx7.yaml > /dev/null < /dev/null < /dev/null < remote nodes)... +You may be prompted for your password for each node. + +SSH setup complete! All nodes can now SSH to each other without passwords. +``` + +> [!NOTE] +> If you encounter any errors, please follow Option 2 below to manually configure SSH and debug the issue. + +### Option 2: Manually discover and configure SSH + +You will need to find the IP addresses for the CX-7 interfaces that are up. On all nodes, run the following command to find the IP addresses and take note of them for the next step. +```bash + ip addr show enp1s0f0np0 + ip addr show enp1s0f1np1 +``` + +Example output: +``` +## In this example, we are using interface enp1s0f1np1. 
+nvidia@dgx-spark-1:~$ ip addr show enp1s0f1np1 + 4: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000 + link/ether 3c:6d:66:cc:b3:b7 brd ff:ff:ff:ff:ff:ff + inet **192.168.1.1**/24 brd 192.168.1.255 scope link noprefixroute enp1s0f1np1 + valid_lft forever preferred_lft forever + inet6 fe80::3e6d:66ff:fecc:b3b7/64 scope link + valid_lft forever preferred_lft forever +``` + +In this example, the IP address for Node 1 is **192.168.1.1**. Repeat the process for the other nodes. + +On all nodes, run the following commands to enable passwordless SSH: +```bash +## Copy your SSH public key to all nodes. Replace the username and IP address placeholders with the values you found in the previous step. +ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 1> +ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 2> +ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 3> +``` + +## Step 5. Verify Multi-Node Communication + +Test basic multi-node functionality: + +```bash +## Confirm passwordless SSH by printing each node's hostname +ssh <IP for Node 1> hostname +ssh <IP for Node 2> hostname +ssh <IP for Node 3> hostname +``` + +## Step 6. Run NCCL tests + +Your cluster is now set up to run distributed workloads across three nodes. Try running the NCCL bandwidth test. + +Use the steps below to run the script that runs the NCCL test on the cluster: + +```bash +## Clone the repository +git clone https://github.com/NVIDIA/dgx-spark-playbooks + +## Enter the script directory +cd dgx-spark-playbooks/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup + +## Check the README.md in the script directory for steps to run the NCCL tests with the "--run-nccl-test" option +``` + +## Step 7. Cleanup and Rollback + +> [!WARNING] +> These steps will reset network configuration. 
+ +```bash +## Rollback network configuration +sudo rm /etc/netplan/40-cx7.yaml +sudo netplan apply +``` + +## Troubleshooting + +| Symptom | Cause | Fix | +|---------|-------|-----| +| "Network unreachable" errors | Network interfaces not configured | Verify the netplan config and run `sudo netplan apply` | +| SSH authentication failures | SSH keys not properly distributed | Re-run `./discover-sparks` and enter passwords | +| Nodes not visible in cluster | Network connectivity issue | Verify the QSFP cable connections and check the IP configuration | +| "APT update" errors (e.g., `E: The list of sources could not be read.`) | APT source errors, conflicting sources or signing keys | Consult the APT and Ubuntu documentation to resolve the source or signing-key conflicts | diff --git a/nvidia/connect-three-sparks/assets/cx7-netplan-example.yaml b/nvidia/connect-three-sparks/assets/cx7-netplan-example.yaml new file mode 100644 index 0000000..8a60bf4 --- /dev/null +++ b/nvidia/connect-three-sparks/assets/cx7-netplan-example.yaml @@ -0,0 +1,36 @@ +# +# SPDX-FileCopyrightText: Copyright (c) 1993-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +network: + version: 2 + ethernets: + enp1s0f0np0: + dhcp4: false + addresses: + - 192.168.0.1/24 + enP2p1s0f0np0: + dhcp4: false + addresses: + - 192.168.0.2/24 + enp1s0f1np1: + dhcp4: false + addresses: + - 192.168.1.1/24 + enP2p1s0f1np1: + dhcp4: false + addresses: + - 192.168.1.2/24