diff --git a/nvidia/nccl/README.md b/nvidia/nccl/README.md
index 511aa65..4dea57c 100644
--- a/nvidia/nccl/README.md
+++ b/nvidia/nccl/README.md
@@ -16,7 +16,7 @@ NCCL (NVIDIA Collective Communication Library) enables high-performance GPU-to-GPU
 communication across multiple nodes. This walkthrough sets up NCCL for multi-node
 distributed training on DGX Spark systems with Blackwell architecture. You'll configure
 networking, build NCCL from
-source with Blackwell support, and validate communication performance between nodes.
+source with Blackwell support, and validate communication between nodes.
 
 ## What you'll accomplish
 
@@ -27,34 +27,24 @@ and proper GPU topology detection.
 
 ## What to know before starting
 
 - Working with Linux network configuration and netplan
-- Docker container management and multi-container deployments
 - Basic understanding of MPI (Message Passing Interface) concepts
 - SSH key management and passwordless authentication setup
-- NVIDIA GPU architecture fundamentals and CUDA toolkit usage
 
 ## Prerequisites
 
-- Two DGX Spark systems with Blackwell GPUs: `nvidia-smi --query-gpu=gpu_name --format=csv`
-- ConnectX-7 InfiniBand network cards installed: `ibdev2netdev`
-- Docker installed on both nodes: `docker --version`
+- Two DGX Spark systems
+- Completed the Connect two Sparks playbook
+- NVIDIA driver installed: `nvidia-smi`
 - CUDA toolkit available: `nvcc --version`
-- SSH access between nodes: `ssh echo "success"`
 - Root/sudo privileges: `sudo whoami`
 
-## Ancillary files
-
-- `cx7-netplan.yaml` - Network configuration template for ConnectX-7 interfaces
-- `discover-sparks` - Script to discover DGX Spark nodes and configure SSH keys
-- `trtllm-mn-entrypoint.sh` - Container entrypoint script for multi-node setup
-
 ## Time & risk
 
-**Duration**: 45-60 minutes for setup and validation
+**Duration**: 30 minutes for setup and validation
 
-**Risk level**: Medium - involves network configuration changes and container networking
+**Risk level**: Medium - involves network configuration changes
 
-**Rollback**: Network changes can be reverted using `sudo netplan apply` with original configs,
-containers can be stopped with `docker stop`
+**Rollback**: The NCCL & NCCL Tests repositories can be deleted from DGX Spark
 
 ## Run on two Sparks
 
@@ -68,119 +58,124 @@ This includes:
 - Passwordless SSH setup
 - Network connectivity verification
 
-## Step 2. Launch TensorRT-LLM containers on both nodes
+## Step 2. Build NCCL with Blackwell support
 
-Start containers with appropriate network and GPU configuration for NCCL communication:
-
-```bash
-## On both nodes, launch the container
-docker run --name trtllm --rm -d \
-  --gpus all --network host --ipc=host \
-  --ulimit memlock=-1 --ulimit stack=67108864 \
-  -e UCX_NET_DEVICES=enp1s0f0np0,enp1s0f1np1 \
-  -e NCCL_SOCKET_IFNAME=enp1s0f0np0,enp1s0f1np1 \
-  -e OMPI_MCA_btl_tcp_if_include=enp1s0f0np0,enp1s0f1np1 \
-  -e OMPI_ALLOW_RUN_AS_ROOT=1 \
-  -e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
-  -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
-  -v ./trtllm-mn-entrypoint.sh:/opt/trtllm-mn-entrypoint.sh \
-  -v ~/.ssh:/tmp/.ssh:ro \
-  --entrypoint /opt/trtllm-mn-entrypoint.sh \
-  nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3
-```
-
-## Step 3. Build NCCL with Blackwell support
-
-Execute these commands inside both containers to build NCCL from source with Blackwell
+Execute these commands on both nodes to build NCCL from source with Blackwell
 architecture support:
 
 ```bash
 ## Install dependencies and build NCCL
 sudo apt-get update && sudo apt-get install -y libopenmpi-dev
-git clone -b v2.28.3-1 https://github.com/NVIDIA/nccl.git /opt/nccl/
-cd /opt/nccl/
+git clone -b v2.28.3-1 https://github.com/NVIDIA/nccl.git ~/nccl/
+cd ~/nccl/
 make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"
 
 ## Set environment variables
 export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
-export NCCL_HOME="/opt/nccl/build/"
+export NCCL_HOME="$HOME/nccl/build/"
 export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
 ```
 
-## Step 4. Build NCCL test suite
+## Step 3. Build NCCL test suite
 
 Compile the NCCL test suite to validate communication performance:
 
 ```bash
 ## Clone and build NCCL tests
-git clone https://github.com/NVIDIA/nccl-tests.git /opt/nccl-tests/
-cd /opt/nccl-tests/
+git clone https://github.com/NVIDIA/nccl-tests.git ~/nccl-tests/
+cd ~/nccl-tests/
 make MPI=1
 ```
 
+## Step 4. Find the active network interface and IP addresses
+
+Before running the performance test, identify which network ports are available and up:
+
+```bash
+## Check network port status
+ibdev2netdev
+```
+
+Example output:
+```
+roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
+roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
+rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
+rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
+```
+
+Use an interface that shows as "(Up)" in your output. In this example, we'll use
+**enp1s0f1np1**. You can disregard interfaces starting with the prefix `enP2p<...>` and
+only consider interfaces starting with `enp1<...>` instead.
+
+You will need to find the IP addresses for the CX-7 interfaces that are up.
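If you want to script the interface selection, a minimal sketch is below. The `pick_up_interface` helper is hypothetical; it assumes `ibdev2netdev` output shaped like the example above, and the case-sensitive `/^enp/` match skips the `enP2p<...>` interfaces as described.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read ibdev2netdev-style lines on stdin and print the
# first interface reported as "(Up)". The /^enp/ match is case-sensitive, so
# enP2p... interfaces are skipped, per the guidance above.
pick_up_interface() {
  awk '$6 == "(Up)" && $5 ~ /^enp/ { print $5; exit }'
}

# On a node, you could then run (assumes ibdev2netdev is installed):
#   IFACE=$(ibdev2netdev | pick_up_interface)
#   ip -4 addr show "$IFACE"
```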
+On both nodes, run the following command to find the IP addresses and take note of them
+for the next step.
+
+```bash
+ip addr show enp1s0f0np0
+ip addr show enp1s0f1np1
+```
+
+Example output:
+```
+## In this example, we are using interface enp1s0f1np1.
+nvidia@dgx-spark-1:~$ ip addr show enp1s0f1np1
+4: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
+    link/ether 3c:6d:66:cc:b3:b7 brd ff:ff:ff:ff:ff:ff
+    inet **169.254.35.62**/16 brd 169.254.255.255 scope link noprefixroute enp1s0f1np1
+       valid_lft forever preferred_lft forever
+    inet6 fe80::3e6d:66ff:fecc:b3b7/64 scope link
+       valid_lft forever preferred_lft forever
+```
+
+In this example, the IP address for Node 1 is **169.254.35.62**. Repeat the process for
+Node 2.
+
 ## Step 5. Run NCCL communication test
 
-Execute multi-node NCCL performance test using the active network interface:
+Execute the following commands on both nodes to run the NCCL communication test. Replace
+the IP addresses and interface names with the ones you found in the previous step.
 
 ```bash
-## Set network interface environment variables (use your active interface from Step 3)
-export UCX_NET_DEVICES=enp1s0f0np0
-export NCCL_SOCKET_IFNAME=enp1s0f0np0
-export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0
+## Set network interface environment variables (use your Up interface from the previous step)
+export UCX_NET_DEVICES=enp1s0f1np1
+export NCCL_SOCKET_IFNAME=enp1s0f1np1
+export OMPI_MCA_btl_tcp_if_include=enp1s0f1np1
+
+## Run the all_gather performance test across both nodes (replace <NODE1_IP> and
+## <NODE2_IP> with the addresses you found in the previous step)
+mpirun -np 2 -H <NODE1_IP>:1,<NODE2_IP>:1 \
+  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
+  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
+  $HOME/nccl-tests/build/all_gather_perf
 ```
+
+You can also test your NCCL setup with a larger buffer size to use more of your 200Gbps
+bandwidth.
+
+```bash
+## Set network interface environment variables (use your active interface)
+export UCX_NET_DEVICES=enp1s0f1np1
+export NCCL_SOCKET_IFNAME=enp1s0f1np1
+export OMPI_MCA_btl_tcp_if_include=enp1s0f1np1
 
 ## Run the all_gather performance test across both nodes
-mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
-  -x NCCL_DEBUG=VERSION -x NCCL_DEBUG_SUBSYS=TUNING \
+mpirun -np 2 -H <NODE1_IP>:1,<NODE2_IP>:1 \
+  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
   -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
-  -x NCCL_MERGE_LEVEL=SYS -x NCCL_PROTO="SIMPLE" \
-  /opt/nccl-tests/build/all_gather_perf -b 32G -e 32G -f 2
+  $HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2
 ```
 
-## Step 6. Validate NCCL installation
-
-Verify successful NCCL compilation and multi-node communication:
-
-```bash
-## Check NCCL library build
-ls -la /opt/nccl/build/lib/
-
-## Verify NCCL test binaries
-ls -la /opt/nccl-tests/build/
-
-## Check MPI configuration
-mpirun --version
-```
-
-Expected output should show NCCL libraries in `/opt/nccl/build/lib/` and test binaries in
-`/opt/nccl-tests/build/`.
+Note: Each IP address in the `mpirun` command is followed by `:1`, the number of slots to
+use on that host. For example, `mpirun -np 2 -H 169.254.35.62:1,169.254.35.63:1`.
 
-## Step 7. Cleanup and rollback
+## Step 6. Cleanup and rollback
 
-**Warning**: These steps will stop containers and reset network configuration.
-
 ```bash
-## Stop containers on both nodes
-docker stop trtllm
-
-## Remove containers (optional)
-docker rm trtllm
-
-## Rollback network configuration (if needed)
-sudo rm /etc/netplan/40-cx7.yaml
-sudo netplan apply
+## Remove the NCCL and NCCL Tests repositories (if needed)
+rm -rf ~/nccl/
+rm -rf ~/nccl-tests/
 ```
 
-## Step 8. Next steps
+## Step 7. Next steps
 
+Your NCCL environment is ready for multi-node distributed training workloads on DGX Spark.
+Now you can try running a larger distributed workload such as TRT-LLM or vLLM inference.
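Before moving on to a larger workload, a quick sanity check that the build artifacts from Steps 2 and 3 are in place can be scripted. A minimal sketch; the `check_nccl_build` helper is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical sanity check: confirm the NCCL library built in Step 2 and the
# test binary built in Step 3 exist under the given root (normally $HOME).
check_nccl_build() {
  local root=$1
  [ -e "$root/nccl/build/lib/libnccl.so" ] || { echo "missing libnccl.so"; return 1; }
  [ -x "$root/nccl-tests/build/all_gather_perf" ] || { echo "missing all_gather_perf"; return 1; }
  echo "NCCL build looks complete"
}

# Example (run on each node after Steps 2 and 3):
#   check_nccl_build "$HOME"
```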
-Test your NCCL setup with a simple distributed training example:
-
-```bash
-## Example: Run a simple NCCL bandwidth test
-/opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2
-
-## Example: Verify GPU topology detection
-nvidia-smi topo -m
-```
-
-Your NCCL environment is ready for multi-node distributed training workloads on DGX Spark
-systems with Blackwell GPUs.
+
+## Troubleshooting
+
+| Issue | Cause | Solution |
+|-------|-------|----------|
+| mpirun hangs or times out | SSH connectivity issues | 1. Test basic SSH connectivity: `ssh <NODE_IP>` should work without password prompts <br> 2. Try a simple mpirun test: `mpirun -np 2 -H <NODE1_IP>:1,<NODE2_IP>:1 hostname` <br> 3. Verify SSH keys are set up correctly for all nodes |
+| Network interface not found | Wrong interface name or down status | Check interface status with `ibdev2netdev` and verify IP configuration |
+| NCCL build fails | Missing dependencies such as OpenMPI or incorrect CUDA version | Verify CUDA installation and required libraries are present |
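For the first row above, a preflight check can save a hung `mpirun`. A minimal sketch; the `check_ssh_nodes` helper is hypothetical, and the IP addresses are placeholders for your own nodes:

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify passwordless SSH to each node before launching
# mpirun. BatchMode=yes makes ssh fail instead of prompting for a password.
check_ssh_nodes() {
  local rc=0 node
  for node in "$@"; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true 2>/dev/null; then
      echo "ok: $node"
    else
      echo "FAIL: $node"
      rc=1
    fi
  done
  return $rc
}

# Example (substitute the CX-7 IP addresses you found in Step 4):
#   check_ssh_nodes 169.254.35.62 169.254.35.63
```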