Connect Three DGX Spark in a Ring Topology
Connect and set up three DGX Spark devices in a ring topology
Table of Contents
Overview
Basic idea
Configure three DGX Spark systems in a ring topology for high-speed inter-node communication using 200GbE direct QSFP connections. This setup enables distributed workloads across three DGX Spark nodes by establishing network connectivity and configuring SSH authentication.
What you'll accomplish
You will physically connect three DGX Spark devices with QSFP cables, configure network interfaces for cluster communication, and establish passwordless SSH between nodes to create a functional distributed computing environment.
What to know before starting
- Basic understanding of distributed computing concepts
- Working with network interface configuration and netplan
- Experience with SSH key management
Prerequisites
- Three DGX Spark systems
- Three QSFP cables for direct 200GbE connections between the devices in a ring topology. Use the recommended cable or similar.
- SSH access available to all systems
- Root or sudo access on all systems: sudo whoami
- The same username on all systems
- Update all systems to the latest OS and firmware. Refer to the DGX Spark documentation: https://docs.nvidia.com/dgx/dgx-spark/os-and-component-update.html
Ancillary files
This playbook's files can be found here on GitHub
- discover-sparks.sh script for automatic node discovery and SSH key distribution
- Cluster setup scripts for automatic network configuration, validation and running NCCL sanity test
Time & risk
- Duration: 1 hour including validation
- Risk level: Medium - involves network reconfiguration
- Rollback: Network changes can be reversed by removing netplan configs or IP assignments
- Last Updated: 3/19/2026 (first publication)
Run on Three Sparks
Step 1. Ensure Same Username on all Systems
On all systems check the username and make sure it's the same:
## Check current username
whoami
If usernames don't match, create a new user (e.g., nvidia) on all systems and log in with the new user:
## Create nvidia user and add to sudo group
sudo useradd -m nvidia
sudo usermod -aG sudo nvidia
## Set password for nvidia user
sudo passwd nvidia
## Switch to nvidia user
su - nvidia
Step 2. Physical Hardware Connection
Connect the QSFP cables between the three DGX Spark systems in a ring topology. Here, Port0 is the CX7 port next to the Ethernet port and Port1 is the CX7 port further away from it.
- Node1 (Port0) to Node2 (Port1)
- Node2 (Port0) to Node3 (Port1)
- Node3 (Port0) to Node1 (Port1)
Note
Double-check that the connections are correct; otherwise, the network configuration might fail.
This establishes the 200GbE direct connections required for high-speed inter-node communication. Once all three nodes are connected, you will see output like the example below on every node. In this example, the interfaces showing as 'Up' are enp1s0f0np0 / enP2p1s0f0np0 and enp1s0f1np1 / enP2p1s0f1np1 (each physical port has two logical interfaces).
Example output:
## Check QSFP interface availability on all nodes
nvidia@dgx-spark-1:~$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
Note
If any of the interfaces are not showing as 'Up', check the QSFP cable connections, reboot the systems, and try again.
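Rather than eyeballing the output, you can count the 'Up' links directly. This is a small sketch (count_up_links is a hypothetical helper, not part of the playbook's scripts) that assumes the ibdev2netdev output format shown above:

```shell
# count_up_links: counts CX7 interfaces reported as Up in ibdev2netdev
# output (hypothetical helper; reads the command's output on stdin).
count_up_links() {
  grep -c '(Up)'
}

# On each node, expect 4 for a correctly cabled three-node ring
# (two logical interfaces per physical port):
#   ibdev2netdev | count_up_links
```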
Step 3. Network Interface Configuration
Choose one of the following options to set up the network interfaces; the options are mutually exclusive. Option 1 is recommended because it avoids the complexity of manual network setup.
Note
Each CX7 port provides full 200GbE bandwidth. In a three-node ring topology, all four interfaces on each node must be assigned an IP address to form a symmetric cluster.
Option 1: Automatic IP Assignment with script
We have created a script here on GitHub which automates the following:
- Interface network configuration for all DGX Sparks
- Set up passwordless authentication between the DGX Sparks
- Verify multi-node communication
- Run NCCL Bandwidth tests
Note
If you use the script steps below, you can skip the rest of the setup instructions in this playbook.
Use the steps below to run the script:
## Clone the repository
git clone https://github.com/NVIDIA/dgx-spark-playbooks
## Enter the script directory
cd dgx-spark-playbooks/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup
## Check the README.md for steps to run the script and configure the cluster networking
Option 2: Manual IP Assignment with the netplan configuration file
On node 1:
## Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
version: 2
ethernets:
enp1s0f0np0:
dhcp4: false
addresses:
- 192.168.0.1/24
enP2p1s0f0np0:
dhcp4: false
addresses:
- 192.168.1.1/24
enp1s0f1np1:
dhcp4: false
addresses:
- 192.168.2.1/24
enP2p1s0f1np1:
dhcp4: false
addresses:
- 192.168.3.1/24
EOF
## Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml
## Apply the configuration
sudo netplan apply
On node 2:
## Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
version: 2
ethernets:
enp1s0f0np0:
dhcp4: false
addresses:
- 192.168.4.1/24
enP2p1s0f0np0:
dhcp4: false
addresses:
- 192.168.5.1/24
enp1s0f1np1:
dhcp4: false
addresses:
- 192.168.0.2/24
enP2p1s0f1np1:
dhcp4: false
addresses:
- 192.168.1.2/24
EOF
## Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml
## Apply the configuration
sudo netplan apply
On node 3:
## Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
version: 2
ethernets:
enp1s0f0np0:
dhcp4: false
addresses:
- 192.168.2.2/24
enP2p1s0f0np0:
dhcp4: false
addresses:
- 192.168.3.2/24
enp1s0f1np1:
dhcp4: false
addresses:
- 192.168.4.2/24
enP2p1s0f1np1:
dhcp4: false
addresses:
- 192.168.5.2/24
EOF
## Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml
## Apply the configuration
sudo netplan apply
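After netplan apply has run on all three nodes, a quick ping sweep confirms the ring is wired as intended. This is a sketch (check_peers is a hypothetical helper, not part of the playbook's scripts); the peer IPs are the ones assigned in the configs above, as seen from node 1:

```shell
# check_peers: pings each peer once and reports reachability.
check_peers() {
  for peer in "$@"; do
    if ping -c 1 -W 2 "$peer" >/dev/null 2>&1; then
      echo "reachable: $peer"
    else
      echo "UNREACHABLE: $peer - recheck cabling and netplan config"
    fi
  done
}

# From node 1: 192.168.0.2 / 192.168.1.2 are node 2's directly connected
# interfaces, and 192.168.2.2 / 192.168.3.2 are node 3's.
# check_peers 192.168.0.2 192.168.1.2 192.168.2.2 192.168.3.2
```

Run the equivalent check from nodes 2 and 3 with their own peers' addresses.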
Step 4. Set up passwordless SSH authentication
Option 1: Automatically configure SSH
Run the DGX Spark discover-sparks.sh script from one of the nodes to automatically discover and configure SSH:
curl -O https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/connect-two-sparks/assets/discover-sparks
bash ./discover-sparks
Expect output similar to the example below, with different IPs and node names. You may see more than one IP per node because four interfaces (enp1s0f0np0, enP2p1s0f0np0, enp1s0f1np1, and enP2p1s0f1np1) have IP addresses assigned; this is expected and does not cause any issues. The first time you run the script, you'll be prompted for your password for each node.
Found: 192.168.0.1 (dgx-spark-1.local)
Found: 192.168.0.2 (dgx-spark-2.local)
Found: 192.168.3.2 (dgx-spark-3.local)
Setting up bidirectional SSH access (local <-> remote nodes)...
You may be prompted for your password for each node.
SSH setup complete! All nodes can now SSH to each other without passwords.
Note
If you encounter any errors, please follow Option 2 below to manually configure SSH and debug the issue.
Option 2: Manually discover and configure SSH
You will need to find the IP addresses of the CX7 interfaces that are up. On all nodes, run the following commands and take note of the IP addresses for the next step.
ip addr show enp1s0f0np0
ip addr show enp1s0f1np1
Example output:
## In this example, we are using interface enp1s0f1np1.
nvidia@dgx-spark-1:~$ ip addr show enp1s0f1np1
4: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 3c:6d:66:cc:b3:b7 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.1/24 brd 192.168.1.255 scope link noprefixroute enp1s0f1np1
valid_lft forever preferred_lft forever
inet6 fe80::3e6d:66ff:fecc:b3b7/64 scope link
valid_lft forever preferred_lft forever
In this example, the IP address for Node 1 is 192.168.1.1. Repeat the process for other nodes.
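If you prefer not to read the address out of the full output, iproute2's one-line format makes it easy to extract just the IPv4 address. A convenience sketch, not part of the playbook:

```shell
# Print only the IPv4 address of a CX7 interface. With `ip -4 -o`, each
# interface is printed on one line and field 4 is the address in CIDR form.
ip -4 -o addr show enp1s0f1np1 | awk '{print $4}' | cut -d/ -f1
```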
On all nodes, run the following commands to enable passwordless SSH:
## Copy your SSH public key to all nodes. Please replace the IP addresses with the ones you found in the previous step.
ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 1>
ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 2>
ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 3>
Step 5. Verify Multi-Node Communication
Test basic multi-node functionality:
## Test hostname resolution across nodes
ssh <IP for Node 1> hostname
ssh <IP for Node 2> hostname
ssh <IP for Node 3> hostname
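A stricter check uses BatchMode, which makes ssh fail immediately instead of falling back to a password prompt. This is a sketch; check_ssh is a hypothetical helper, and the example IPs are placeholders for the ones you noted earlier:

```shell
# check_ssh: verifies non-interactive SSH to each node. BatchMode=yes
# makes ssh fail fast if key-based login is not working.
check_ssh() {
  for node in "$@"; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" hostname >/dev/null 2>&1; then
      echo "OK: $node"
    else
      echo "FAILED: $node - re-run ssh-copy-id for this node"
    fi
  done
}

# Example (replace with your nodes' IPs):
# check_ssh 192.168.0.1 192.168.0.2 192.168.3.2
```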
Step 6. Run NCCL tests
Now your cluster is set up to run distributed workloads across three nodes. Try running the NCCL bandwidth test.
Use the steps below to run the script which will run the NCCL test on the cluster:
## Clone the repository
git clone https://github.com/NVIDIA/dgx-spark-playbooks
## Enter the script directory
cd dgx-spark-playbooks/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup
## Check the README.md in the script directory for steps to run the NCCL tests with "--run-nccl-test" option
Step 7. Cleanup and Rollback
Warning
These steps will reset network configuration.
## Rollback network configuration
sudo rm /etc/netplan/40-cx7.yaml
sudo netplan apply
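To confirm the rollback took effect, check that none of the cluster subnets remain assigned. This sketch assumes the 192.168.0.x-192.168.5.x ranges above were used only for this cluster; leftover_cluster_addrs is a hypothetical helper, not part of the playbook's scripts:

```shell
# leftover_cluster_addrs: scans `ip -br -4 addr show` style output on
# stdin for any remaining 192.168.0.x-192.168.5.x address.
leftover_cluster_addrs() {
  grep -E '192\.168\.[0-5]\.'
}

# After the rollback this should print nothing:
#   ip -br -4 addr show | leftover_cluster_addrs
```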
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| "Network unreachable" errors | Network interfaces not configured | Verify netplan config and sudo netplan apply |
| SSH authentication failures | SSH keys not properly distributed | Re-run ./discover-sparks and enter passwords |
| Nodes not visible in cluster | Network connectivity issue | Verify QSFP cable connection, check IP configuration |
| "APT update" errors (e.g., E: The list of sources could not be read.) | Conflicting APT sources or signing keys | Check the APT and Ubuntu documentation to resolve the source or key conflicts |