diff --git a/README.md b/README.md index 735abde..7082826 100644 --- a/README.md +++ b/README.md @@ -34,7 +34,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting - [LM Studio on DGX Spark](nvidia/lm-studio/) - [Build and Deploy a Multi-Agent Chatbot](nvidia/multi-agent-chatbot/) - [Multi-modal Inference](nvidia/multi-modal-inference/) -- [Connect Multiple Sparks through a Switch](nvidia/multi-sparks-through-switch/) +- [Connect Multiple DGX Spark Systems through a Switch](nvidia/multi-sparks-through-switch/) - [NCCL for Two Sparks](nvidia/nccl/) - [Fine-tune with NeMo](nvidia/nemo-fine-tune/) - [NemoClaw with Nemotron-3-Super on DGX Spark](nvidia/nemoclaw/) diff --git a/nvidia/multi-sparks-through-switch/README.md b/nvidia/multi-sparks-through-switch/README.md index fd6a995..2ce2917 100644 --- a/nvidia/multi-sparks-through-switch/README.md +++ b/nvidia/multi-sparks-through-switch/README.md @@ -1,7 +1,6 @@ -# Connect Multiple Sparks through a Switch - -> Connect multiple Spark devices in a cluster and set them up for distributed inference and fine-tuning +# Connect Multiple DGX Spark Systems through a Switch +> Set up a cluster of DGX Spark devices connected through a switch ## Table of Contents @@ -20,7 +19,7 @@ ## Basic idea -Configure four DGX Spark systems for high-speed inter-node communication using 200Gbps QSFP connections through a QSFP switch. This setup enables distributed workloads across multiple DGX Spark nodes by establishing network connectivity and configuring SSH authentication. +Configure multiple DGX Spark systems for high-speed inter-node communication using 200Gbps QSFP connections through a QSFP switch. This setup enables distributed workloads across multiple DGX Spark nodes by establishing network connectivity and configuring SSH authentication. 
## What you will accomplish @@ -31,12 +30,16 @@ You will physically connect four DGX Spark devices with QSFP cables and a QSFP s - Basic understanding of distributed computing concepts - Working with network interface configuration and netplan - Experience with SSH key management +- Basic understanding of, and experience with, configuring the managed QSFP network switch that you plan to use. Refer to the switch's instruction manual to learn how to: + - Connect to the switch to manage ports and features + - Enable/disable QSFP ports and create a software bridge on the switch + - Configure the link speed manually on a port and disable auto-negotiation if needed ## Prerequisites -- Four DGX Spark systems (these instructions will also work with two or three nodes cluster with a switch) +- Multiple DGX Spark systems (these instructions will work for any number of DGX Spark devices connected through a switch) - QSFP switch with at least 4 QSFP56-DD ports (at least 200Gbps each) -- QSFP cables for 200Gbps connection from the switch to the devices +- QSFP cables for 200Gbps connection from the switch to the devices. Use the [recommended cable](https://marketplace.nvidia.com/en-us/enterprise/personal-ai-supercomputers/qsfp-cable-0-4m-for-dgx-spark/) or similar. 
- One cable per spark - If the switch has 400Gbps ports then you can also use breakout cables to split them into two 200Gbps ports - SSH access available to all systems @@ -49,23 +52,24 @@ You will physically connect four DGX Spark devices with QSFP cables and a QSFP s All required files for this playbook can be found [here on GitHub](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/multi-sparks-through-switch/) - [**discover-sparks.sh**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/connect-two-sparks/assets/discover-sparks) script for automatic node discovery and SSH key distribution +- [**Cluster setup script**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup) for automatic network configuration, validation, and an NCCL sanity test ## Time & risk -- **Duration:** 1 hour including validation +- **Duration:** 2 hours including validation - **Risk level:** Medium - involves network reconfiguration - **Rollback:** Network changes can be reversed by removing netplan configs or IP assignments -- **Last Updated:** 3/12/2026 +- **Last Updated:** 3/19/2026 * First publication ## Run on Four Sparks ## Step 1. Ensure Same Username on all four Systems -On all four systems check the username and make sure it's the same: +On all four systems, check that the usernames are the same: ```bash ## Check current username @@ -88,11 +92,11 @@ su - nvidia ## Step 2. Switch Management -Most QSFP switches offer some form of management interface, either through CLI or UI. Refer to the documentation and connect to the management interface. Make sure that the ports on the switch are enabled. For connecting four sparks, you will need to ensure that the switch is configured to provide 200Gbps connection to each DGX Spark. +Most QSFP switches offer some form of management interface, either through CLI or UI. Refer to the documentation and connect to the management interface. 
Make sure that the ports on the switch are enabled. For connecting four Sparks, you will need to ensure that the switch is configured to provide a 200Gbps connection to each DGX Spark. If not done already, refer to the [Overview](https://build.nvidia.com/spark/multi-sparks-through-switch/overview) for the prior knowledge and prerequisites required for this playbook. ## Step 3. Physical Hardware Connection -Connect the QSFP cables between DGX Spark systems and the switch(QSFP56-DD/QSFP56 ports) using one CX7 port on each system. It is recommended to use the same CX7 port on all Spark systems for easier network configuration and avoiding NCCL test failures. In this playbook the second port (the one further from the ethernet port) is used. This should establish the 200Gbps connection required for high-speed inter-node communication. You will see an output like the one below on all four sparks. In this example the interfaces showing as 'Up' are **enp1s0f1np1** and **enP2p1s0f1np1** (each physical port has two logical interfaces). +Connect the QSFP cables between the DGX Spark systems and the switch (QSFP56-DD/QSFP56 ports) using one CX7 port on each Spark system. It is recommended to use the same CX7 port on all Spark systems for easier network configuration and to avoid NCCL test failures. In this playbook, the second port (the one further from the Ethernet port) is used. This should establish the 200Gbps connection required for high-speed inter-node communication. You will see an output like the one below on all four Sparks. In this example, the interfaces showing as 'Up' are **enp1s0f1np1** and **enP2p1s0f1np1** (each physical port has two logical interfaces). Example output: ```bash @@ -110,14 +114,14 @@ roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up) ### Step 3.1. Verify negotiated Link speed -The link speed might not default to 200Gbps with auto-negotiation. To confirm, run the command below on all sparks and check that the speed is shown as 200000Mb/s.
If it shows lesser than that value, then the link speed needs to be set to 200Gbps manually in the switch port configuration. Refer to the switch's manual/documentation to disable auto-negotiation and set the link speed manually to 200Gbps (eg. 200G-baseCR4) +The link speed might not default to 200Gbps with auto-negotiation. To confirm, run the command below on all Sparks and check that the speed is shown as 200000Mb/s. If it shows less than that, the link speed needs to be set to 200Gbps manually in the switch port configuration and auto-negotiation should be disabled. Refer to the switch's manual/documentation to disable auto-negotiation and set the link speed manually to 200Gbps (e.g. 200GBASE-CR4). Example output: ```bash -nvidia@dxg-spark-1:~$ ethtool enp1s0f1np1 | grep Speed +nvidia@dgx-spark-1:~$ sudo ethtool enp1s0f1np1 | grep Speed Speed: 100000Mb/s -nvidia@dxg-spark-1:~$ ethtool enP2p1s0f1np1 | grep Speed +nvidia@dgx-spark-1:~$ sudo ethtool enP2p1s0f1np1 | grep Speed Speed: 100000Mb/s ``` @@ -125,10 +129,10 @@ After setting the correct speed on the switch ports. Verify the link speed on al Example output: ```bash -nvidia@dxg-spark-1:~$ ethtool enp1s0f1np1 | grep Speed +nvidia@dgx-spark-1:~$ sudo ethtool enp1s0f1np1 | grep Speed Speed: 200000Mb/s -nvidia@dxg-spark-1:~$ ethtool enP2p1s0f1np1 | grep Speed +nvidia@dgx-spark-1:~$ sudo ethtool enP2p1s0f1np1 | grep Speed Speed: 200000Mb/s ``` @@ -138,9 +142,9 @@ nvidia@dxg-spark-1:~$ ethtool enP2p1s0f1np1 | grep Speed > Full bandwidth can be achieved with just one QSFP cable. For a clustered setup, all DGX sparks: -1. Should be able to talk to each other using TCP/IP over CX7. -2. Should be accessible for management (eg. SSH and run commands) -3. Should be able to access internet (eg. to download models/utilities) +1. Should be accessible for management (e.g. SSH in and run commands) +2. Should be able to access the internet (e.g. to download models/utilities) +3. 
Should be able to talk to each other using TCP/IP over CX7. The steps below help configure that. It is recommended to use the Ethernet/WiFi network for management and internet traffic and keep it separate from the CX7 network to avoid CX7 bandwidth from being used for non-workload traffic. @@ -153,7 +157,7 @@ Once you are done creating/adding ports to the bridge, you should be ready to co ### 4.1 Script for Cluster networking configuration We have created a script [here on GitHub](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup) which automates the following: -1. Interface network configuration for all DGX Sparks +1. Interface network IP configuration for all DGX Sparks 2. Set up password-less authentication between the DGX Sparks 3. Verify multi-node communication 4. Run NCCL Bandwidth tests @@ -170,7 +174,7 @@ git clone https://github.com/NVIDIA/dgx-spark-playbooks ## Enter the script directory cd dgx-spark-playbooks/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup -## Check the README.md for steps to run the script and configure the cluster networking +## Check the README.md in the script directory for steps to run the script and configure the cluster networking with the "--run-setup" argument ``` ### 4.2 Manual Cluster networking configuration @@ -344,7 +348,7 @@ curl -O https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/ bash ./discover-sparks ``` -Expected output similar to the below, with different IPs and node names. The first time you run the script, you'll be prompted for your password for each node. +Expected output similar to the below, with different IPs and node names. You may see up to two IPs for each node because two interfaces (e.g. **enp1s0f1np1** and **enP2p1s0f1np1**) have IP addresses assigned. This is expected and does not cause any issues. The first time you run the script, you'll be prompted for your password for each node. 
``` Found: 169.254.35.62 (dgx-spark-1.local) Found: 169.254.35.63 (dgx-spark-2.local) @@ -404,9 +408,13 @@ ssh hostname ## Step 7. Running Tests and Workloads -Now your cluster is set up to run distributed workloads across four nodes. For example, you can run the [NCCL playbook](https://build.nvidia.com/spark/nccl/stacked-sparks). Wherever the playbook asks to run a command on two nodes, just run it on all four nodes and modify the mpirun command which you run on the head node to use four nodes instead of two. +Now your cluster is set up to run distributed workloads across four nodes. Try running the [NCCL playbook](https://build.nvidia.com/spark/nccl/stacked-sparks). -Example mpirun command for four nodes: +> [!NOTE] +> Wherever the playbook asks to run a command on **two nodes**, just run it on **all four nodes**. +> Make sure to adapt the *mpirun* NCCL command that you run on the **head node** to use **four nodes**. + +Example mpirun command for NCCL: ```bash ## Set network interface environment variables (use your Up interface from the previous step) export UCX_NET_DEVICES=enp1s0f1np1 @@ -444,3 +452,5 @@ sudo netplan apply | "Network unreachable" errors | Network interfaces not configured | Verify netplan config and `sudo netplan apply` | | SSH authentication failures | SSH keys not properly distributed | Re-run `./discover-sparks` and enter passwords | | Nodes not visible in cluster | Network connectivity issue | Verify QSFP cable connection, check IP configuration | +| "APT update" errors (e.g. E: The list of sources could not be read.) | APT sources errors, conflicting sources or signing keys | Check the APT and Ubuntu documentation to fix the APT sources or key conflicts | +| NCCL test failures (e.g. 
libnccl.so.2: cannot open shared object file) | NCCL configuration not done on all nodes | Make sure to follow the NCCL playbook to configure **all** nodes before running the NCCL test | diff --git a/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup/README.md b/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup/README.md index 50416dc..9f144eb 100644 --- a/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup/README.md +++ b/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup/README.md @@ -9,7 +9,7 @@ Clone the dgx-spark-playbooks repo from GitHub ### Step 2. Switch to the multi spark cluster setup scripts directory ```bash -cd dgx-spark-playbooks/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup-1.0.0 +cd dgx-spark-playbooks/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup ``` ### Step 3. Create or edit a JSON config file with your cluster information @@ -38,13 +38,27 @@ cd dgx-spark-playbooks/nvidia/multi-sparks-through-switch/assets/spark_cluster_s ### Step 4. Run the cluster setup script with your json config file -```bash -bash spark_cluster_setup.sh config/spark_config_b2b.json +The script can be run with the options shown below: + +```bash +# To run validation, cluster setup and NCCL bandwidth test (all steps) + +bash spark_cluster_setup.sh -c config/spark_config_b2b.json --run-setup + +# To only run pre-setup validation steps + +bash spark_cluster_setup.sh -c config/spark_config_b2b.json --pre-validate-only + +# To run NCCL test and skip cluster setup (use this after cluster is already set up) + +bash spark_cluster_setup.sh -c config/spark_config_b2b.json --run-nccl-test -# This will do the following -# 1. Create a python virtual env and install required packages -# 2. Validate the environment and cluster config -# 3. Detect the topology and configure the IP addresses -# 4. Configure password-less ssh between the cluster nodes -# 5. Run NCCL BW test ``` + +> [!NOTE] +> The full cluster setup (first command above) will do the following: +> 1. 
Create a python virtual env and install required packages +> 2. Validate the environment and cluster config +> 3. Detect the topology and configure the IP addresses +> 4. Configure password-less ssh between the cluster nodes +> 5. Run NCCL BW test \ No newline at end of file diff --git a/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup/spark_cluster_setup.py b/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup/spark_cluster_setup.py index e5cb0e5..1d7d094 100644 --- a/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup/spark_cluster_setup.py +++ b/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup/spark_cluster_setup.py @@ -652,33 +652,49 @@ def configure_ssh_keys_on_nodes(nodes_info) -> bool: return True -def handle_cluster_setup(config) -> tuple[bool, bool]: - """Handles the cluster network setup.""" +def pre_validate_cluster(config) -> tuple[bool, bool, list[str]]: + """Pre-validates the cluster.""" try: nodes_info = config.get("nodes_info", None) if not nodes_info: print("ERROR: Nodes information not found.") - return False, False + return False, False, [] print(f"Checking UP CX7 interfaces...") up_interfaces = check_and_get_up_cx7_interfaces(nodes_info) if not up_interfaces: print("ERROR: Failed to check UP CX7 interfaces. 
Check the QSFP cable connection and try again.") - return False, False + return False, False, [] print(f"Checking CX7 interface link speed...") if not check_interface_link_speed(nodes_info, up_interfaces): - return False, False + return False, False, [] + + ring_topology = (len(nodes_info) == 3 and len(up_interfaces) == 4) + + except Exception as e: + print(f"ERROR: An error occurred when pre-validating the cluster:\n{e}") + return False, False, [] + + return True, ring_topology, up_interfaces + +def handle_cluster_setup(config, up_interfaces) -> bool: + """Handles the cluster network setup.""" + try: + nodes_info = config.get("nodes_info", None) + if not nodes_info: + print("ERROR: Nodes information not found.") + return False print(f"Copying network setup scripts on nodes...") # Copy the detect_and_configure_cluster_networking.py script to the nodes and run it in threads if not copy_network_setup_script_to_nodes(nodes_info): - return False, False + return False print(f"Running network setup scripts on nodes...") if not run_network_setup_scripts_on_nodes(nodes_info): print("ERROR: Failed to run network setup scripts on nodes. Check the QSFP cable connections and the nodes config in the json file and try again.") - return False, False + return False # Verify that the IP addresses are assigned to the interfaces max_retries = 5 @@ -695,20 +711,18 @@ def handle_cluster_setup(config) -> tuple[bool, bool]: if retries == 0: print("ERROR: Failed to verify IP addresses on nodes. Check the QSFP cable connections and the nodes config in the json file and try again.") - return False, False + return False # Configure ssh keys across nodes if not configure_ssh_keys_on_nodes(nodes_info): print("ERROR: Failed to configure ssh keys on nodes. 
Please check the configuration and try again.") - return False, False - - ring_topology = (len(nodes_info) == 3 and len(up_interfaces) == 4) + return False except Exception as e: print(f"ERROR: An error occurred when handling cluster setup:\n{e}") - return False, False + return False - return True, ring_topology + return True def validate_config(config): """Validates the configuration.""" @@ -810,14 +824,35 @@ def validate_environment(): return True +class _HelpHintParser(argparse.ArgumentParser): + """ArgumentParser that appends a --help hint to every error.""" + + def error(self, message): + self.exit(2, f"{self.prog}: error: {message}\nRun with --help for usage.\n") + + def main(): """Main function to setup the Spark cluster.""" - parser = argparse.ArgumentParser(description="Setup the Spark cluster.") + parser = _HelpHintParser( + description="Setup the Spark cluster.", + epilog="One of --pre-validate-only, --run-setup, or --run-nccl-test is required.", + ) parser.add_argument("-c", "--config", type=str, required=True, help="Path to the configuration file.") + parser.add_argument("-v", "--pre-validate-only", action="store_true", help="Only run pre-setup validations.") + parser.add_argument("-s", "--run-setup", action="store_true", help="Run the cluster setup and run NCCL bandwidth test.") + parser.add_argument("-n", "--run-nccl-test", action="store_true", help="Run the NCCL bandwidth test.") args = parser.parse_args() - config = args.config - with open(config, "r") as f: + + if not (args.pre_validate_only or args.run_setup or args.run_nccl_test): + parser.error("One of -v/--pre-validate-only, -s/--run-setup, or -n/--run-nccl-test is required.") + + config_path = args.config + if not os.path.exists(config_path): + print(f"ERROR: Configuration file not found: {config_path}") + return + + with open(config_path, "r") as f: config = json.load(f) if not config: @@ -835,17 +870,29 @@ def main(): if not validate_config(config): return - print("Setting up Spark 
cluster...") - ret, ring_topology = handle_cluster_setup(config) + print(f"Pre-validating cluster setup...") + ret, ring_topology, up_interfaces = pre_validate_cluster(config) if not ret: return - print("Spark cluster setup completed successfully.") - - print("Running NCCL test...") - if not run_nccl_test(config.get("nodes_info", []), ring_topology): + if args.pre_validate_only: + print("Pre-setup validations completed successfully.") return - print("NCCL test completed.") + + if args.run_setup: + print("Setting up Spark cluster...") + if not handle_cluster_setup(config, up_interfaces): + return + + print("Spark cluster setup completed successfully.") + + if args.run_nccl_test or args.run_setup: + print("Running NCCL test...") + if ring_topology: + print("Detected ring topology...") + if not run_nccl_test(config.get("nodes_info", []), ring_topology): + return + print("NCCL test completed.") except Exception as e: print(f"ERROR: An error occurred when running Spark cluster setup:\n{e}") diff --git a/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup/spark_cluster_setup.sh b/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup/spark_cluster_setup.sh index 875354c..6aa9c4e 100644 --- a/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup/spark_cluster_setup.sh +++ b/nvidia/multi-sparks-through-switch/assets/spark_cluster_setup/spark_cluster_setup.sh @@ -12,8 +12,8 @@ if [[ "$EUID" -eq 0 ]]; then exit 1 fi -if [[ $# -ne 1 ]]; then - echo "Usage: $0 " +if [[ $# -lt 1 ]]; then + echo "Usage: bash $0 --help to see the available options" exit 1 fi @@ -26,7 +26,7 @@ source .venv/bin/activate echo "---- Installing required packages ----" pip install -r requirements.txt -echo "---- Configuring the cluster in $1 ----" -SPARK_CLUSTER_SETUP_WRAPPER=1 python3 ./spark_cluster_setup.py -c "$1" +echo "---- Configuring the cluster (args: $*) ----" +SPARK_CLUSTER_SETUP_WRAPPER=1 python3 ./spark_cluster_setup.py "$@" deactivate