Connect Three DGX Spark in a Ring Topology
Connect and set up three DGX Spark devices in a ring topology
Table of Contents
Overview
Basic idea
Configure three DGX Spark systems in a ring topology for high-speed inter-node communication using 200GbE direct QSFP connections. This setup enables distributed workloads across three DGX Spark nodes by establishing network connectivity and configuring SSH authentication.
What you'll accomplish
You will physically connect three DGX Spark devices with QSFP cables, configure network interfaces for cluster communication, and establish passwordless SSH between nodes to create a functional distributed computing environment.
What to know before starting
- Basic understanding of distributed computing concepts
- Working with network interface configuration and netplan
- Experience with SSH key management
Prerequisites
- Three DGX Spark systems
- Three QSFP cables for direct 200GbE connections between the devices in a ring topology. Use the recommended cable or similar.
- SSH access available to all systems
- Root or sudo access on all systems: sudo whoami
- The same username on all systems
- Update all systems to the latest OS and firmware. Refer to the DGX Spark documentation: https://docs.nvidia.com/dgx/dgx-spark/os-and-component-update.html
Ancillary files
This playbook's files can be found here on GitHub
- discover-sparks.sh script for automatic node discovery and SSH key distribution
- Cluster setup scripts for automatic network configuration, validation and running NCCL sanity test
Time & risk
- Duration: 1 hour including validation
- Risk level: Medium - involves network reconfiguration
- Rollback: Network changes can be reversed by removing netplan configs or IP assignments
- Last Updated: 3/19/2026 (first publication)
Run on Three Sparks
Step 1. Ensure Same Username on all Systems
On all systems check the username and make sure it's the same:
## Check current username
whoami
If usernames don't match, create a new user (e.g., nvidia) on all systems and log in with the new user:
## Create nvidia user and add to sudo group
sudo useradd -m nvidia
sudo usermod -aG sudo nvidia
## Set password for nvidia user
sudo passwd nvidia
## Switch to nvidia user
su - nvidia
Step 2. Physical Hardware Connection
Connect the QSFP cables between the three DGX Spark systems in a ring topology. Here, Port0 is the CX7 port next to the Ethernet port and Port1 is the CX7 port further away from it.
- Node1 (Port0) to Node2 (Port1)
- Node2 (Port0) to Node3 (Port1)
- Node3 (Port0) to Node1 (Port1)
Note
Double-check that the connections are correct; otherwise, the network configuration might fail.
This establishes the 200GbE direct connections required for high-speed inter-node communication. Once all three nodes are connected, you will see output like the example below on every node. In this example, the interfaces showing as 'Up' are enp1s0f0np0 / enP2p1s0f0np0 and enp1s0f1np1 / enP2p1s0f1np1 (each physical port has two logical interfaces).
Example output:
## Check QSFP interface availability on all nodes
nvidia@dgx-spark-1:~$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
Note
If any of the interfaces are not showing as 'Up', check the QSFP cable connections, reboot the systems, and try again.
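Rather than eyeballing the output, you can count the 'Up' links directly. This is a small sketch (count_up_links is a hypothetical helper, not part of the playbook's scripts) that assumes the ibdev2netdev output format shown above:

```shell
# count_up_links: counts CX7 interfaces reported as Up in ibdev2netdev
# output (hypothetical helper; reads the command's output on stdin).
count_up_links() {
  grep -c '(Up)'
}

# On each node, expect 4 for a correctly cabled three-node ring
# (two logical interfaces per physical port):
#   ibdev2netdev | count_up_links
```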
Step 3. Network Interface Configuration
Choose one of the following options to set up the network interfaces; the options are mutually exclusive. Option 1 is recommended because it avoids the complexity of manual network setup.
Note
Each CX7 port provides full 200GbE bandwidth. In a three-node ring topology, all four interfaces on each node must be assigned an IP address to form a symmetric cluster.
Option 1: Automatic IP Assignment with script
We have created a script here on GitHub which automates the following:
- Interface network configuration for all DGX Sparks
- Set up passwordless authentication between the DGX Sparks
- Verify multi-node communication
- Run NCCL Bandwidth tests
Note
If you use the script steps below, you can skip the rest of the setup instructions in this playbook.
Use the steps below to run the script:
## Clone the repository
git clone https://github.com/NVIDIA/dgx-spark-playbooks
## Enter the script directory
cd dgx-spark-playbooks/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup
## Check the README.md for steps to run the script and configure the cluster networking
Option 2: Manual IP Assignment with the netplan configuration file
On node 1:
## Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
version: 2
ethernets:
enp1s0f0np0:
dhcp4: false
addresses:
- 192.168.0.1/24
enP2p1s0f0np0:
dhcp4: false
addresses:
- 192.168.1.1/24
enp1s0f1np1:
dhcp4: false
addresses:
- 192.168.2.1/24
enP2p1s0f1np1:
dhcp4: false
addresses:
- 192.168.3.1/24
EOF
## Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml
## Apply the configuration
sudo netplan apply
On node 2:
## Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
version: 2
ethernets:
enp1s0f0np0:
dhcp4: false
addresses:
- 192.168.4.1/24
enP2p1s0f0np0:
dhcp4: false
addresses:
- 192.168.5.1/24
enp1s0f1np1:
dhcp4: false
addresses:
- 192.168.0.2/24
enP2p1s0f1np1:
dhcp4: false
addresses:
- 192.168.1.2/24
EOF
## Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml
## Apply the configuration
sudo netplan apply
On node 3:
## Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
version: 2
ethernets:
enp1s0f0np0:
dhcp4: false
addresses:
- 192.168.2.2/24
enP2p1s0f0np0:
dhcp4: false
addresses:
- 192.168.3.2/24
enp1s0f1np1:
dhcp4: false
addresses:
- 192.168.4.2/24
enP2p1s0f1np1:
dhcp4: false
addresses:
- 192.168.5.2/24
EOF
## Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml
## Apply the configuration
sudo netplan apply
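After netplan apply has run on all three nodes, a quick ping sweep confirms the ring is wired as intended. This is a sketch (check_peers is a hypothetical helper, not part of the playbook's scripts); the peer IPs are the ones assigned in the configs above, as seen from node 1:

```shell
# check_peers: pings each peer once and reports reachability.
check_peers() {
  for peer in "$@"; do
    if ping -c 1 -W 2 "$peer" >/dev/null 2>&1; then
      echo "reachable: $peer"
    else
      echo "UNREACHABLE: $peer - recheck cabling and netplan config"
    fi
  done
}

# From node 1: 192.168.0.2 / 192.168.1.2 are node 2's directly connected
# interfaces, and 192.168.2.2 / 192.168.3.2 are node 3's.
# check_peers 192.168.0.2 192.168.1.2 192.168.2.2 192.168.3.2
```

Run the equivalent check from nodes 2 and 3 with their own peers' addresses.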
Step 4. Set up passwordless SSH authentication
Option 1: Automatically configure SSH
Run the DGX Spark discover-sparks.sh script from one of the nodes to automatically discover and configure SSH:
curl -O https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/connect-two-sparks/assets/discover-sparks
bash ./discover-sparks
Expect output similar to the example below, with different IPs and node names. You may see more than one IP per node because four interfaces (enp1s0f0np0, enP2p1s0f0np0, enp1s0f1np1, and enP2p1s0f1np1) have IP addresses assigned; this is expected and does not cause any issues. The first time you run the script, you'll be prompted for your password for each node.
Found: 192.168.0.1 (dgx-spark-1.local)
Found: 192.168.0.2 (dgx-spark-2.local)
Found: 192.168.3.2 (dgx-spark-3.local)
Setting up bidirectional SSH access (local <-> remote nodes)...
You may be prompted for your password for each node.
SSH setup complete! All nodes can now SSH to each other without passwords.
Note
If you encounter any errors, please follow Option 2 below to manually configure SSH and debug the issue.
Option 2: Manually discover and configure SSH
You will need to find the IP addresses of the CX7 interfaces that are up. On all nodes, run the following commands and take note of the IP addresses for the next step.
ip addr show enp1s0f0np0
ip addr show enp1s0f1np1
Example output:
## In this example, we are using interface enp1s0f1np1.
nvidia@dgx-spark-1:~$ ip addr show enp1s0f1np1
4: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 3c:6d:66:cc:b3:b7 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.1/24 brd 192.168.1.255 scope link noprefixroute enp1s0f1np1
valid_lft forever preferred_lft forever
inet6 fe80::3e6d:66ff:fecc:b3b7/64 scope link
valid_lft forever preferred_lft forever
In this example, the IP address for Node 1 is 192.168.1.1. Repeat the process for other nodes.
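If you prefer not to read the address out of the full output, iproute2's one-line format makes it easy to extract just the IPv4 address. A convenience sketch, not part of the playbook:

```shell
# Print only the IPv4 address of a CX7 interface. With `ip -4 -o`, each
# interface is printed on one line and field 4 is the address in CIDR form.
ip -4 -o addr show enp1s0f1np1 | awk '{print $4}' | cut -d/ -f1
```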
On all nodes, run the following commands to enable passwordless SSH:
## Copy your SSH public key to all nodes. Please replace the IP addresses with the ones you found in the previous step.
ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 1>
ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 2>
ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 3>
Step 5. Verify Multi-Node Communication
Test basic multi-node functionality:
## Test hostname resolution across nodes
ssh <IP for Node 1> hostname
ssh <IP for Node 2> hostname
ssh <IP for Node 3> hostname
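A stricter check uses BatchMode, which makes ssh fail immediately instead of falling back to a password prompt. This is a sketch; check_ssh is a hypothetical helper, and the example IPs are placeholders for the ones you noted earlier:

```shell
# check_ssh: verifies non-interactive SSH to each node. BatchMode=yes
# makes ssh fail fast if key-based login is not working.
check_ssh() {
  for node in "$@"; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" hostname >/dev/null 2>&1; then
      echo "OK: $node"
    else
      echo "FAILED: $node - re-run ssh-copy-id for this node"
    fi
  done
}

# Example (replace with your nodes' IPs):
# check_ssh 192.168.0.1 192.168.0.2 192.168.3.2
```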
Step 6. Run NCCL tests
Now your cluster is set up to run distributed workloads across three nodes. Try running the NCCL bandwidth test.
Use the steps below to run the script which will run the NCCL test on the cluster:
## Clone the repository
git clone https://github.com/NVIDIA/dgx-spark-playbooks
## Enter the script directory
cd dgx-spark-playbooks/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup
## Check the README.md in the script directory for steps to run the NCCL tests with "--run-nccl-test" option
Step 7. Cleanup and Rollback
Warning
These steps will reset network configuration.
## Rollback network configuration
sudo rm /etc/netplan/40-cx7.yaml
sudo netplan apply
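To confirm the rollback took effect, check that none of the cluster subnets remain assigned. This sketch assumes the 192.168.0.x-192.168.5.x ranges above were used only for this cluster; leftover_cluster_addrs is a hypothetical helper, not part of the playbook's scripts:

```shell
# leftover_cluster_addrs: scans `ip -br -4 addr show` style output on
# stdin for any remaining 192.168.0.x-192.168.5.x address.
leftover_cluster_addrs() {
  grep -E '192\.168\.[0-5]\.'
}

# After the rollback this should print nothing:
#   ip -br -4 addr show | leftover_cluster_addrs
```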
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| "Network unreachable" errors | Network interfaces not configured | Verify netplan config and sudo netplan apply |
| SSH authentication failures | SSH keys not properly distributed | Re-run ./discover-sparks and enter passwords |
| Nodes not visible in cluster | Network connectivity issue | Verify QSFP cable connection, check IP configuration |
| "APT update" errors (e.g., E: The list of sources could not be read.) | Conflicting APT sources or signing keys | Check the APT and Ubuntu documentation to resolve the source or key conflicts |