# NCCL for Two Sparks
> Install and test NCCL on two Sparks
## Table of Contents
- [Overview](#overview)
- [Run on two Sparks](#run-on-two-sparks)
- [Troubleshooting](#troubleshooting)
---
## Overview
NCCL (NVIDIA Collective Communication Library) enables high-performance GPU-to-GPU communication
across multiple nodes. This walkthrough sets up NCCL for multi-node distributed training on
DGX Spark systems with Blackwell architecture. You'll configure networking, build NCCL from
source with Blackwell support, and validate communication between nodes.
## What you'll accomplish
You'll have a working multi-node NCCL environment that enables high-bandwidth GPU communication
across DGX Spark systems for distributed training workloads, with validated network performance
and proper GPU topology detection.
## What to know before starting
- Working with Linux network configuration and netplan
- Basic understanding of MPI (Message Passing Interface) concepts
- SSH key management and passwordless authentication setup
## Prerequisites
- Two DGX Spark systems
- Completed the Connect two Sparks playbook
- NVIDIA driver installed: `nvidia-smi`
- CUDA toolkit available: `nvcc --version`
- Root/sudo privileges: `sudo whoami`
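The prerequisite checks above can be scripted into a single pass per node. A minimal sketch (the `check` helper is hypothetical, not part of any NVIDIA tooling):

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether each required command is on PATH.
check() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "OK: $1"
  else
    echo "MISSING: $1"
  fi
}

# Run on both nodes before starting.
check nvidia-smi   # NVIDIA driver
check nvcc         # CUDA toolkit
check ssh          # passwordless SSH between nodes
check sudo         # root/sudo privileges
```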
## Time & risk
- **Duration**: 30 minutes for setup and validation
- **Risk level**: Medium - involves network configuration changes
- **Rollback**: Delete the NCCL and NCCL tests repositories from each DGX Spark
## Run on two Sparks
## Step 1. Configure network connectivity
Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.
This includes:
- Physical QSFP cable connection
- Network interface configuration (automatic or manual IP assignment)
- Passwordless SSH setup
- Network connectivity verification
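Before moving on, it's worth confirming that the peer node actually answers over the link. A hedged sketch (the `check_peer` helper and the `169.254.35.63` address are placeholders; substitute the other node's IP from the Connect two Sparks playbook):

```shell
# Placeholder address -- replace with the other node's IP.
PEER=${PEER:-169.254.35.63}

check_peer() {
  # One ping with a 2-second timeout is enough to confirm reachability.
  if ping -c 1 -W 2 "$1" >/dev/null 2>&1; then
    echo "reachable: $1"
  else
    echo "unreachable: $1"
  fi
}

check_peer "$PEER"
```

If the peer is reachable, also verify that `ssh <peer> hostname` returns without a password prompt, since `mpirun` depends on passwordless SSH later in this playbook.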
## Step 2. Build NCCL with Blackwell support
Execute these commands on both nodes to build NCCL from source with Blackwell
architecture support:
```bash
## Install dependencies and build NCCL
sudo apt-get update && sudo apt-get install -y libopenmpi-dev
git clone -b v2.28.3-1 https://github.com/NVIDIA/nccl.git ~/nccl/
cd ~/nccl/
make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"
## Set environment variables
export CUDA_HOME="/usr/local/cuda"
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
```
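The exports above last only for the current shell. Because `mpirun` launches the remote rank over non-interactive SSH, it can help to persist them in `~/.bashrc` on both nodes. A sketch, assuming the default CUDA install location and the build paths used above:

```shell
# Append the NCCL/CUDA/MPI paths to ~/.bashrc so new shells pick them up.
# Paths assume the default CUDA install and the build steps above.
cat >> ~/.bashrc <<'EOF'
export CUDA_HOME="/usr/local/cuda"
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64:$MPI_HOME/lib:$LD_LIBRARY_PATH"
EOF
```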
## Step 3. Build NCCL test suite
Compile the NCCL test suite to validate communication performance:
```bash
## Clone and build NCCL tests
git clone https://github.com/NVIDIA/nccl-tests.git ~/nccl-tests/
cd ~/nccl-tests/
make MPI=1
```
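A quick sanity check that the MPI-enabled test binaries actually built (the `check_built` helper is hypothetical):

```shell
# Hypothetical helper: report whether a test binary exists and is executable.
check_built() {
  if [ -x "$1" ]; then
    echo "built: $(basename "$1")"
  else
    echo "missing: $(basename "$1")"
  fi
}

check_built "$HOME/nccl-tests/build/all_gather_perf"
check_built "$HOME/nccl-tests/build/all_reduce_perf"
```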
## Step 4. Find the active network interface and IP addresses
The multi-node NCCL performance test must use the active network interface. First, identify which network ports are available and up:
```bash
## Check network port status
ibdev2netdev
```
Example output:
```
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
```
Use an interface that shows as "(Up)" in your output. In this example, we'll use **enp1s0f1np1**. Disregard interfaces starting with the prefix `enP2p<...>` and consider only interfaces starting with `enp1<...>`.
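The filtering rule above can be expressed as a small `awk` helper: keep only rows whose status is `(Up)` and whose interface name does not start with `enP2p`. A sketch (`pick_up_ifaces` is a hypothetical name, not a system tool):

```shell
# Hypothetical helper: print the usable interface names from ibdev2netdev.
pick_up_ifaces() {
  # Field 5 is the interface name; the last field is the (Up)/(Down) status.
  awk '$NF == "(Up)" && $5 !~ /^enP2p/ { print $5 }'
}

ibdev2netdev 2>/dev/null | pick_up_ifaces
```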
You will need to find the IP addresses for the CX-7 interfaces that are up. On both nodes, run the following command to find the IP addresses and take note of them for the next step.
```bash
ip addr show enp1s0f0np0
ip addr show enp1s0f1np1
```
Example output:
```
## In this example, we are using interface enp1s0f1np1.
nvidia@dgx-spark-1:~$ ip addr show enp1s0f1np1
4: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 3c:6d:66:cc:b3:b7 brd ff:ff:ff:ff:ff:ff
    inet 169.254.35.62/16 brd 169.254.255.255 scope link noprefixroute enp1s0f1np1
valid_lft forever preferred_lft forever
inet6 fe80::3e6d:66ff:fecc:b3b7/64 scope link
valid_lft forever preferred_lft forever
```
In this example, the IP address for Node 1 is **169.254.35.62**. Repeat the process for Node 2.
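If you prefer not to read the address out of the full `ip addr show` output by eye, a small helper (the name `extract_ipv4` is hypothetical) can pull out the first IPv4 address:

```shell
# Hypothetical helper: print the first IPv4 address from `ip addr show` output.
extract_ipv4() {
  awk '$1 == "inet" { sub(/\/.*/, "", $2); print $2; exit }'
}

# Replace the interface name with your Up interface from the previous step.
ip addr show enp1s0f1np1 2>/dev/null | extract_ipv4
```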
## Step 5. Run NCCL communication test
> [!NOTE]
> Full bandwidth can be achieved with just one QSFP cable.
> When two QSFP cables are connected, all four interfaces must be assigned IP addresses to obtain full bandwidth.
Execute the following commands on both nodes to run the NCCL communication test. Replace the IP addresses and interface names with the ones you found in the previous step.
```bash
## Set network interface environment variables (use your Up interface from the previous step)
export UCX_NET_DEVICES=enp1s0f1np1
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export OMPI_MCA_btl_tcp_if_include=enp1s0f1np1
## Run the all_gather performance test across both nodes (replace the IP addresses with the ones you found in the previous step)
mpirun -np 2 -H <IP for Node 1>:1,<IP for Node 2>:1 \
--mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
$HOME/nccl-tests/build/all_gather_perf
```
You can also test your NCCL setup with a larger buffer size to exercise more of the 200 Gbps link bandwidth.
```bash
## Set network interface environment variables (use your active interface)
export UCX_NET_DEVICES=enp1s0f1np1
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export OMPI_MCA_btl_tcp_if_include=enp1s0f1np1
## Run the all_gather performance test across both nodes
mpirun -np 2 -H <IP for Node 1>:1,<IP for Node 2>:1 \
--mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
$HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2
```
Note: Each IP address in the `mpirun` command is followed by `:1`, which limits MPI to one process slot per node. For example, `mpirun -np 2 -H 169.254.35.62:1,169.254.35.63:1`
## Step 6. Cleanup and rollback
```bash
## Remove the NCCL and NCCL tests repositories (if needed)
rm -rf ~/nccl/
rm -rf ~/nccl-tests/
```
## Step 7. Next steps
Your NCCL environment is ready for multi-node distributed training workloads on DGX Spark.
Now you can try running a larger distributed workload such as TRT-LLM or vLLM inference.
## Troubleshooting
## Common issues for running on two Sparks
| Issue | Cause | Solution |
|-------|-------|----------|
| mpirun hangs or times out | SSH connectivity issues | 1. Test basic SSH connectivity: `ssh <remote_ip>` should work without password prompts<br>2. Try a simple mpirun test: `mpirun -np 2 -H <IP for Node 1>:1,<IP for Node 2>:1 hostname`<br>3. Verify SSH keys are set up correctly on all nodes |
| Network interface not found | Wrong interface name or down status | Check interface status with `ibdev2netdev` and verify IP configuration |
| NCCL build fails | Missing dependencies such as OpenMPI or incorrect CUDA version | Verify CUDA installation and required libraries are present |