chore: Regenerate all playbooks

GitLab CI 2025-10-11 23:30:03 +00:00
parent 0a7238d651
commit d060c4abfe


NCCL (NVIDIA Collective Communication Library) enables high-performance GPU-to-GPU communication
across multiple nodes. This walkthrough sets up NCCL for multi-node distributed training on
DGX Spark systems with Blackwell architecture. You'll configure networking, build NCCL from
source with Blackwell support, and validate communication between nodes.
## What you'll accomplish
## What to know before starting
- Working with Linux network configuration and netplan
- Docker container management and multi-container deployments
- Basic understanding of MPI (Message Passing Interface) concepts
- SSH key management and passwordless authentication setup
- NVIDIA GPU architecture fundamentals and CUDA toolkit usage
## Prerequisites
- Two DGX Spark systems
- Completed the Connect two Sparks playbook
- NVIDIA driver installed: `nvidia-smi`
- CUDA toolkit available: `nvcc --version`
- SSH access between nodes: `ssh <OTHER_NODE_IP> echo "success"`
- Root/sudo privileges: `sudo whoami`
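Before starting, a quick hypothetical helper (not one of the playbook's ancillary files) can confirm the prerequisite tools above are on the PATH of both nodes; adjust the command list to your environment:

```bash
## Sketch: report which prerequisite commands are available.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "OK: $1"
  else
    echo "MISSING: $1"
  fi
}

for cmd in nvidia-smi nvcc ssh sudo; do
  check_cmd "$cmd"
done
```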
## Ancillary files
- `cx7-netplan.yaml` - Network configuration template for ConnectX-7 interfaces
- `discover-sparks` - Script to discover DGX Spark nodes and configure SSH keys
- `trtllm-mn-entrypoint.sh` - Container entrypoint script for multi-node setup
## Time & risk
**Duration**: 30 minutes for setup and validation
**Risk level**: Medium - involves network configuration changes
**Rollback**: The NCCL & NCCL Tests repositories can be deleted from DGX Spark
## Run on two Sparks
This includes:
- Passwordless SSH setup
- Network connectivity verification
## Step 2. Build NCCL with Blackwell support
Execute these commands on both nodes to build NCCL from source with Blackwell
architecture support:
```bash
## Install dependencies and build NCCL
sudo apt-get update && sudo apt-get install -y libopenmpi-dev
git clone -b v2.28.3-1 https://github.com/NVIDIA/nccl.git ~/nccl/
cd ~/nccl/
make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"
## Set environment variables
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
```
## Step 3. Build NCCL test suite
Compile the NCCL test suite to validate communication performance:
```bash
## Clone and build NCCL tests
git clone https://github.com/NVIDIA/nccl-tests.git ~/nccl-tests/
cd ~/nccl-tests/
make MPI=1
```
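A quick optional check (the helper name is ours, and the path assumes the `~/nccl-tests` checkout above) that the MPI-enabled perf binaries actually built:

```bash
## Sketch: confirm the perf binaries exist and are executable in a build dir.
check_bins() {
  for bin in all_gather_perf all_reduce_perf; do
    if [ -x "$1/$bin" ]; then
      echo "built: $bin"
    else
      echo "missing: $bin"
    fi
  done
}

check_bins "$HOME/nccl-tests/build"
```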
## Step 4. Find the active network interface and IP addresses
First, identify which network ports are available and up:
```bash
## Check network port status
ibdev2netdev
```
Example output:
```
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
```
Use an interface that shows as "(Up)" in your output. In this example, we'll use **enp1s0f1np1**. Disregard interfaces starting with the prefix `enP2p<...>` and only consider interfaces starting with `enp1<...>`.
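If you prefer to select the interface programmatically, here is one possible sketch (the function name is ours): filter `ibdev2netdev`-style output for the first `enp1`-prefixed interface that is Up. The here-doc replays the example output above; on a real node, pipe `ibdev2netdev` into the function instead.

```bash
## Sketch: pick the first enp1* interface reported as (Up).
pick_up_iface() {
  awk '/\(Up\)/ && $5 ~ /^enp1/ { print $5; exit }'
}

## Replaying the example output; on a node use: ibdev2netdev | pick_up_iface
pick_up_iface <<'EOF'
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
EOF
## prints: enp1s0f1np1
```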
You will need to find the IP addresses for the CX-7 interfaces that are up. On both nodes, run the following command to find the IP addresses and take note of them for the next step.
```bash
ip addr show enp1s0f0np0
ip addr show enp1s0f1np1
```
Example output:
```
## In this example, we are using interface enp1s0f1np1.
nvidia@dgx-spark-1:~$ ip addr show enp1s0f1np1
4: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 3c:6d:66:cc:b3:b7 brd ff:ff:ff:ff:ff:ff
inet **169.254.35.62**/16 brd 169.254.255.255 scope link noprefixroute enp1s0f1np1
valid_lft forever preferred_lft forever
inet6 fe80::3e6d:66ff:fecc:b3b7/64 scope link
valid_lft forever preferred_lft forever
```
In this example, the IP address for Node 1 is **169.254.35.62**. Repeat the process for Node 2.
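To capture the address into a variable instead of copying it by eye, a small sketch (the helper name is ours) can parse the one-line `ip -4 -o addr show` format:

```bash
## Sketch: strip the /prefix from `ip -4 -o addr show <iface>` output.
extract_ipv4() {
  awk '$3 == "inet" { split($4, a, "/"); print a[1]; exit }'
}

## On each node, substitute your Up interface:
## NODE_IP="$(ip -4 -o addr show enp1s0f1np1 | extract_ipv4)"
```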
## Step 5. Run NCCL communication test
Execute the following commands on both nodes to run the NCCL communication test. Replace the IP addresses and interface names with the ones you found in the previous step.
```bash
## Set network interface environment variables (use your Up interface from the previous step)
export UCX_NET_DEVICES=enp1s0f1np1
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export OMPI_MCA_btl_tcp_if_include=enp1s0f1np1
## Run the all_gather performance test across both nodes (replace the IP addresses with the ones you found in the previous step)
mpirun -np 2 -H <IP for Node 1>:1,<IP for Node 2>:1 \
--mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
$HOME/nccl-tests/build/all_gather_perf
```
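When reading the results, note that nccl-tests reports both an algorithm bandwidth (algbw) and a bus bandwidth (busbw); for all_gather, busbw = algbw × (n-1)/n, where n is the number of ranks. The numbers below are illustrative, not measured:

```bash
## Illustrative only: how busbw relates to algbw for all_gather with n ranks.
awk 'BEGIN { n = 2; algbw = 20.0; printf "busbw = %.1f GB/s\n", algbw * (n - 1) / n }'
## prints: busbw = 10.0 GB/s
```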
You can also test your NCCL setup with a larger buffer size to use more of your 200Gbps bandwidth.
```bash
## Set network interface environment variables (use your active interface)
export UCX_NET_DEVICES=enp1s0f1np1
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export OMPI_MCA_btl_tcp_if_include=enp1s0f1np1
## Run the all_gather performance test across both nodes
mpirun -np 2 -H <IP for Node 1>:1,<IP for Node 2>:1 \
--mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
-x NCCL_MERGE_LEVEL=SYS -x NCCL_PROTO="SIMPLE" \
$HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2
```
## Step 6. Validate NCCL installation
Verify successful NCCL compilation and multi-node communication:
```bash
## Check NCCL library build
ls -la ~/nccl/build/lib/
## Verify NCCL test binaries
ls -la ~/nccl-tests/build/
## Check MPI configuration
mpirun --version
```
Expected output should show NCCL libraries in `~/nccl/build/lib/` and test binaries in
`~/nccl-tests/build/`.
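As an extra check, the version NCCL was built as can be read from the generated header; the helper name here is ours, and the path assumes the `~/nccl` checkout from Step 2:

```bash
## Sketch: read a #define (e.g. NCCL_MAJOR) out of a header file.
nccl_define() {
  awk -v key="$1" '$1 == "#define" && $2 == key { print $3; exit }' "$2"
}

## On a node:
## nccl_define NCCL_MAJOR ~/nccl/build/include/nccl.h
```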
Note: Each IP address in the `mpirun` command is followed by `:1`, which allots one slot per node. For example, `mpirun -np 2 -H 169.254.35.62:1,169.254.35.63:1`
## Step 7. Cleanup and rollback
**Warning**: These steps delete the NCCL and NCCL Tests repositories from both nodes.
```bash
rm -rf ~/nccl/
rm -rf ~/nccl-tests/
```
## Step 8. Next steps
Your NCCL environment is ready for multi-node distributed training workloads on DGX Spark.
Now you can try running a larger distributed workload such as TRT-LLM or vLLM inference.
## Troubleshooting
| Issue | Cause | Solution |
|-------|-------|----------|
| mpirun hangs or times out | SSH connectivity issues | 1. Test basic SSH connectivity: `ssh <remote_ip>` should work without password prompts<br>2. Try a simple mpirun test: `mpirun -np 2 -H <IP for Node 1>:1,<IP for Node 2>:1 hostname`<br>3. Verify SSH keys are set up correctly on all nodes |
| Network interface not found | Wrong interface name or down status | Check interface status with `ibdev2netdev` and verify IP configuration |
| NCCL build fails | Missing dependencies such as OpenMPI or incorrect CUDA version | Verify CUDA installation and required libraries are present |