# Connect two Sparks
> Connect two Spark devices and set them up for inference and fine-tuning
## Table of Contents
- [Overview](#overview)
- [Run on two Sparks](#run-on-two-sparks)
  - [Option 1: Automatic IP Assignment (Recommended)](#option-1-automatic-ip-assignment-recommended)
  - [Option 2: Manual IP Assignment (Advanced)](#option-2-manual-ip-assignment-advanced)
  - [Option 1: Automatically configure SSH](#option-1-automatically-configure-ssh)
  - [Option 2: Manually discover and configure SSH](#option-2-manually-discover-and-configure-ssh)
- [Troubleshooting](#troubleshooting)
---
## Overview
## Basic idea
Configure two DGX Spark systems for high-speed inter-node communication using 200GbE direct
QSFP connections. This setup enables distributed workloads across multiple DGX Spark nodes
by establishing network connectivity and configuring SSH authentication.
## What you'll accomplish
You will physically connect two DGX Spark devices with a QSFP cable, configure network
interfaces for cluster communication, and establish passwordless SSH between nodes to create
a functional distributed computing environment.
## What to know before starting
- Basic understanding of distributed computing concepts
- Working with network interface configuration and netplan
- Experience with SSH key management
## Prerequisites
- Two DGX Spark systems
- One QSFP cable for direct 200GbE connection between two devices
- SSH access available to both systems
- Root or sudo access on both systems: `sudo whoami`
- The same username on both systems
## Ancillary files
All required files for this playbook can be found [here on GitLab](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/)
- [**discover-sparks.sh**](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/discover-sparks): script for automatic node discovery and SSH key distribution
## Time & risk
**Duration:** 1 hour including validation

**Risk level:** Medium - involves network reconfiguration
**Rollback:** Network changes can be reversed by removing netplan configs or IP assignments
## Run on two Sparks
## Step 1. Ensure Same Username on Both Systems
On both systems check the username and make sure it's the same:
```bash
## Check current username
whoami
```
If usernames don't match, create a new user (e.g., nvidia) on both systems and log in as the new user:
```bash
## Create nvidia user and add to sudo group
sudo useradd -m nvidia
sudo usermod -aG sudo nvidia
## Set password for nvidia user
sudo passwd nvidia
## Switch to nvidia user
su - nvidia
```
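As a quick sanity check, you can confirm the new user is in the sudo group. A minimal sketch (`in_group` is a hypothetical helper, and `nvidia` is the example username from this step):

```shell
## Sketch: check whether a user belongs to a group.
## in_group is a hypothetical helper; "nvidia" is the example user from above.
in_group() {
  id -nG "$1" 2>/dev/null | grep -qw "$2"
}

## Example: confirm the nvidia user can use sudo
in_group nvidia sudo && echo "nvidia can use sudo"
```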
## Step 2. Physical Hardware Connection
Connect the QSFP cable between both DGX Spark systems using any QSFP interface
on each device. This establishes the 200GbE direct connection required for high-speed
inter-node communication. Once the two nodes are connected, you will see output like the example below; in this example, the interface showing as 'Up' is **enp1s0f1np1** / **enP2p1s0f1np1** (each physical port has two names).

Example output:
```bash
## Check QSFP interface availability on both nodes
nvidia@dgx-spark-1:~$ ibdev2netdev
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
```
Note: If none of the interfaces show as 'Up', check the QSFP cable connection, reboot both systems, and try again.

Note: The interface showing as 'Up' depends on which port you used to connect the two nodes. Each physical port has two names; for example, enp1s0f1np1 and enP2p1s0f1np1 refer to the same physical port. Disregard enP2p1s0f0np0 and enP2p1s0f1np1, and use enp1s0f0np0 and enp1s0f1np1 only.
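If you prefer to identify the active port programmatically, a hedged one-liner (field positions are assumed from the example `ibdev2netdev` output above) is:

```shell
## Print only interfaces that are Up, skipping the duplicate enP2p-prefixed names.
## Assumes the interface name is the 5th whitespace-separated field of
## ibdev2netdev output, as in the example above.
ibdev2netdev | awk '/\(Up\)/ {print $5}' | grep -v '^enP2p'
```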
## Step 3. Network Interface Configuration
Choose one option to set up the network interfaces. Options 1 and 2 are mutually exclusive.
### Option 1: Automatic IP Assignment (Recommended)
Configure network interfaces using netplan on both DGX Spark nodes for automatic
link-local addressing:
```bash
## Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      link-local: [ ipv4 ]
    enp1s0f1np1:
      link-local: [ ipv4 ]
EOF

## Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml

## Apply the configuration
sudo netplan apply
```
Note: Using this option, the IPs assigned to the interfaces will change if you reboot the system.
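Because the link-local address can change across reboots, you may want to print the current one before starting a workload. A minimal sketch (interface name is the example from this guide; substitute yours):

```shell
## Print the current link-local (169.254.x.x) address of the interface, so you
## can record it after each reboot. Sketch only; assumes link-local addressing
## from Option 1 above.
ip -4 addr show enp1s0f1np1 | grep -oE '169\.254\.[0-9]+\.[0-9]+' | head -n1
```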
### Option 2: Manual IP Assignment (Advanced)

First, identify which network ports are available and up:
```bash
## Check network port status
ibdev2netdev
```
Example output:
```
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
```
Use an interface that shows as "(Up)" in your output. In this example, we'll use **enp1s0f1np1**. Disregard interfaces starting with the prefix `enP2p<...>` and only use interfaces starting with `enp1<...>` instead.

On Node 1:
```bash
## Assign static IP and bring up interface.
sudo ip addr add 192.168.100.10/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```
Repeat the same process on Node 2, but use IP **192.168.100.11/24**. Confirm the correct interface name with the `ibdev2netdev` command.
```bash
## Assign static IP and bring up interface.
sudo ip addr add 192.168.100.11/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```
You can verify the IP assignment by running the following command on each node:
```bash
## Replace enp1s0f1np1 with the interface showing as "(Up)" in your output, either enp1s0f0np0 or enp1s0f1np1
ip addr show enp1s0f1np1
```
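If you only need the bare IPv4 address (for example, to paste into an SSH command later), a hedged one-liner is (sketch only; assumes a single IPv4 address is assigned to the interface):

```shell
## Print just the IPv4 address of the interface. Substitute your interface name.
## Assumes one IPv4 address, as configured in the steps above.
ip -4 addr show enp1s0f1np1 | awk '/inet /{print $2}' | cut -d/ -f1
```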
## Step 4. Set up passwordless SSH authentication
### Option 1: Automatically configure SSH
Run the DGX Spark [**discover-sparks.sh**](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/discover-sparks) script from one of the nodes to automatically discover and configure SSH:
```bash
bash ./discover-sparks
```
Expected output is similar to the following, with different IPs and node names. The first time you run the script, you'll be prompted for your password for each node.
```
Found: 169.254.35.62 (dgx-spark-1.local)
Found: 169.254.35.63 (dgx-spark-2.local)

Setting up bidirectional SSH access (local <-> remote nodes)...
You may be prompted for your password for each node.

SSH setup complete! Both local and remote nodes can now SSH to each other without passwords.
```
Note: If you encounter any errors, follow Option 2 below to manually configure SSH and debug the issue.

### Option 2: Manually discover and configure SSH

You will need to find the IP addresses of the CX-7 interfaces that are up. On both nodes, run the following command and note the IP addresses for the next step.
```bash
ip addr show enp1s0f0np0
ip addr show enp1s0f1np1
```
Example output:
```
## In this example, we are using interface enp1s0f1np1.
nvidia@dgx-spark-1:~$ ip addr show enp1s0f1np1
4: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 3c:6d:66:cc:b3:b7 brd ff:ff:ff:ff:ff:ff
    inet 169.254.35.62/16 brd 169.254.255.255 scope link noprefixroute enp1s0f1np1
       valid_lft forever preferred_lft forever
    inet6 fe80::3e6d:66ff:fecc:b3b7/64 scope link
       valid_lft forever preferred_lft forever
```

In this example, the IP address for Node 1 is **169.254.35.62**. Repeat the process for Node 2.
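`ssh-copy-id` assumes a key pair already exists on the node. If one does not, generate it first. A minimal sketch (empty passphrase shown for simplicity; adjust to your security policy):

```shell
## Generate an RSA key pair only if one does not already exist.
## Empty passphrase is an assumption for unattended use; adjust as needed.
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
```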

On both nodes, run the following commands to enable passwordless SSH:
```bash
## Copy your SSH public key to both nodes. Replace the IP addresses with the ones you found in the previous step.
ssh-copy-id -i ~/.ssh/id_rsa.pub nvidia@<IP for Node 1>
ssh-copy-id -i ~/.ssh/id_rsa.pub nvidia@<IP for Node 2>
```
## Step 5. Verify Multi-Node Communication

Test basic multi-node functionality:
```bash
## Test hostname resolution across nodes
ssh <IP for Node 1> hostname
ssh <IP for Node 2> hostname
```
## Step 6. Cleanup and Rollback

> **Warning**: These steps will reset network configuration.
```bash
## Rollback network configuration (if using Option 1)
sudo rm /etc/netplan/40-cx7.yaml
sudo netplan apply
## Rollback network configuration (if using Option 2)
## Run on Node 1; adjust the interface name to the one you used in Step 3.
sudo ip addr del 192.168.100.10/24 dev enp1s0f1np1
## Run on Node 2; adjust the interface name to the one you used in Step 3.
sudo ip addr del 192.168.100.11/24 dev enp1s0f1np1
```
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| "Network unreachable" errors | Network interfaces not configured | Verify netplan config and `sudo netplan apply` |
| SSH authentication failures | SSH keys not properly distributed | Re-run `./discover-sparks` and enter passwords |
| Node 2 not visible in cluster | Network connectivity issue | Verify QSFP cable connection, check IP configuration |