Mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git, synced 2026-04-26.

Commit d060c4abfe (parent 0a7238d651): chore: Regenerate all playbooks

NCCL (NVIDIA Collective Communication Library) enables high-performance GPU-to-GPU communication
across multiple nodes. This walkthrough sets up NCCL for multi-node distributed training on
DGX Spark systems with Blackwell architecture. You'll configure networking, build NCCL from
source with Blackwell support, and validate communication between nodes.

## What you'll accomplish

@ -27,34 +27,24 @@ and proper GPU topology detection.
|
|||||||
## What to know before starting

- Working with Linux network configuration and netplan
- Basic understanding of MPI (Message Passing Interface) concepts
- SSH key management and passwordless authentication setup

## Prerequisites

- Two DGX Spark systems
- Completed the Connect two Sparks playbook
- NVIDIA driver installed: `nvidia-smi`
- CUDA toolkit available: `nvcc --version`
- Root/sudo privileges: `sudo whoami`

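As a quick sanity pass before starting, you can check that the needed tools are on the PATH of each node. This is a sketch, not part of the playbook; `git` and `make` are included because the build steps below rely on them even though they are not listed above.

```bash
# Report whether a prerequisite tool is available on this node.
have_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "OK: $1"
    else
        echo "MISSING: $1"
    fi
}

for tool in nvidia-smi nvcc git make; do
    have_tool "$tool"
done
```

The check only reports; it does not install anything.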
## Time & risk

**Duration**: 30 minutes for setup and validation

**Risk level**: Medium - involves network configuration changes

**Rollback**: The NCCL and NCCL Tests repositories can be deleted from both DGX Spark systems.

## Run on two Sparks

This includes:

- Passwordless SSH setup
- Network connectivity verification

## Step 2. Build NCCL with Blackwell support

Execute these commands on both nodes to build NCCL from source with Blackwell architecture support:

```bash
## Install dependencies and build NCCL
sudo apt-get update && sudo apt-get install -y libopenmpi-dev
git clone -b v2.28.3-1 https://github.com/NVIDIA/nccl.git ~/nccl/
cd ~/nccl/
make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"

## Set environment variables
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
```

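The exports above only last for the current shell. One way to keep them available in later interactive sessions on both nodes is to write them to a small env file and source it from `~/.bashrc`. This is a sketch; the file name `~/.nccl-env.sh` is arbitrary, and the paths assume the build locations used above.

```bash
# Write the NCCL build environment to a file and load it from ~/.bashrc,
# so later interactive sessions don't need the exports repeated by hand.
ENV_FILE="$HOME/.nccl-env.sh"
cat > "$ENV_FILE" <<'EOF'
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
EOF

# Add the source line only once, then load the file into this shell.
touch "$HOME/.bashrc"
grep -qF ".nccl-env.sh" "$HOME/.bashrc" || echo ". $ENV_FILE" >> "$HOME/.bashrc"
. "$ENV_FILE"
```

Note the `mpirun` command later still forwards `LD_LIBRARY_PATH` explicitly with `-x`, since non-interactive SSH sessions may not read `~/.bashrc`.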
## Step 3. Build NCCL test suite

Compile the NCCL test suite to validate communication performance:

```bash
## Clone and build NCCL tests
git clone https://github.com/NVIDIA/nccl-tests.git ~/nccl-tests/
cd ~/nccl-tests/
make MPI=1
```

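A quick check that the binaries used in the next steps actually came out of the build (a sketch; `all_gather_perf` is the one this walkthrough runs, and `all_reduce_perf` is another binary nccl-tests produces):

```bash
# Report whether an nccl-tests binary was built successfully.
check_built() {
    if [ -x "$HOME/nccl-tests/build/$1" ]; then
        echo "built: $1"
    else
        echo "missing: $1"
    fi
}

for bin in all_gather_perf all_reduce_perf; do
    check_built "$bin"
done
```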
## Step 4. Find the active network interface and IP addresses

Before running the multi-node NCCL performance test, identify which network ports are available and up:

```bash
## Check network port status
ibdev2netdev
```

Example output:

```
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
```

Use an interface that shows as "(Up)" in your output. In this example, we'll use **enp1s0f1np1**. Disregard interfaces starting with the prefix `enP2p<...>` and only consider interfaces starting with `enp1<...>`.

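If you prefer to script the selection, a small `awk` filter over the `ibdev2netdev` output can pick the first `enp1...` interface that is Up. This is a sketch, demonstrated against the example output above; on a live node you would pipe `ibdev2netdev` into it directly.

```bash
# Pick the first interface that is Up, skipping the enP2p* aliases.
pick_up_interface() {
    awk '$4 == "==>" && $6 == "(Up)" && $5 ~ /^enp1/ { print $5; exit }'
}

# Demonstrated on the example ibdev2netdev output:
IFACE="$(pick_up_interface <<'EOF'
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
EOF
)"
echo "$IFACE"   # prints enp1s0f1np1

# On a live node: IFACE="$(ibdev2netdev | pick_up_interface)"
```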
You will need the IP addresses of the CX-7 interfaces that are up. On both nodes, run the following commands to find the IP addresses and take note of them for the next step.

```bash
ip addr show enp1s0f0np0
ip addr show enp1s0f1np1
```

Example output:

```
## In this example, we are using interface enp1s0f1np1.
nvidia@dgx-spark-1:~$ ip addr show enp1s0f1np1
4: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 3c:6d:66:cc:b3:b7 brd ff:ff:ff:ff:ff:ff
    inet 169.254.35.62/16 brd 169.254.255.255 scope link noprefixroute enp1s0f1np1
       valid_lft forever preferred_lft forever
    inet6 fe80::3e6d:66ff:fecc:b3b7/64 scope link
       valid_lft forever preferred_lft forever
```

In this example, the IP address for Node 1 is **169.254.35.62**. Repeat the process for Node 2.

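The address can also be pulled out programmatically. Below is a sketch that extracts the first IPv4 address from `ip addr show` output, demonstrated on the example above; it assumes a single IPv4 address on the interface.

```bash
# Print the first IPv4 address found in `ip addr show <iface>` output.
extract_ipv4() {
    awk '/inet / { split($2, a, "/"); print a[1]; exit }'
}

# Demonstrated on the example output:
NODE_IP="$(extract_ipv4 <<'EOF'
4: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 3c:6d:66:cc:b3:b7 brd ff:ff:ff:ff:ff:ff
    inet 169.254.35.62/16 brd 169.254.255.255 scope link noprefixroute enp1s0f1np1
EOF
)"
echo "$NODE_IP"   # prints 169.254.35.62

# On a live node: NODE_IP="$(ip addr show enp1s0f1np1 | extract_ipv4)"
```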
## Step 5. Run NCCL communication test

Execute the following commands on both nodes to run the NCCL communication test. Replace the IP addresses and interface names with the ones you found in the previous step.

```bash
## Set network interface environment variables (use your Up interface from the previous step)
export UCX_NET_DEVICES=enp1s0f1np1
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export OMPI_MCA_btl_tcp_if_include=enp1s0f1np1

## Run the all_gather performance test across both nodes (replace the IP addresses with the ones you found in the previous step)
mpirun -np 2 -H <IP for Node 1>:1,<IP for Node 2>:1 \
    --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
    -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
    $HOME/nccl-tests/build/all_gather_perf
```

You can also test your NCCL setup with a larger buffer size to use more of your 200 Gbps bandwidth.

```bash
## Set network interface environment variables (use your active interface)
export UCX_NET_DEVICES=enp1s0f1np1
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export OMPI_MCA_btl_tcp_if_include=enp1s0f1np1

## Run the all_gather performance test across both nodes
mpirun -np 2 -H <IP for Node 1>:1,<IP for Node 2>:1 \
    --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
    -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
    $HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2
```

Note: The IP addresses in the `mpirun` command are followed by `:1`. For example, `mpirun -np 2 -H 169.254.35.62:1,169.254.35.63:1`.

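To compare runs (for example, the default run against the 16G run), you can pull the summary figure out of the test output. This is a sketch that assumes the `# Avg bus bandwidth : <value>` summary line that nccl-tests prints; the exact format may vary by version.

```bash
# Extract the average bus bandwidth (GB/s) from nccl-tests output.
avg_busbw() {
    awk -F: '/Avg bus bandwidth/ { gsub(/ /, "", $2); print $2; exit }'
}

# On a live run, pipe the mpirun output in, e.g.:
#   mpirun ... all_gather_perf | tee run.log ; avg_busbw < run.log
# Demonstrated here on a sample summary line (the value is illustrative):
BW="$(echo '# Avg bus bandwidth    : 21.4251' | avg_busbw)"
echo "$BW"   # prints 21.4251
```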
Verify successful NCCL compilation and multi-node communication:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
## Check NCCL library build
|
|
||||||
ls -la /opt/nccl/build/lib/
|
|
||||||
|
|
||||||
## Verify NCCL test binaries
|
|
||||||
ls -la /opt/nccl-tests/build/
|
|
||||||
|
|
||||||
## Check MPI configuration
|
|
||||||
mpirun --version
|
|
||||||
```
|
|
||||||
|
|
||||||
Expected output should show NCCL libraries in `/opt/nccl/build/lib/` and test binaries in
|
|
||||||
`/opt/nccl-tests/build/`.
|
|
||||||
|
|
||||||
## Step 6. Cleanup and rollback

```bash
## Remove the NCCL and NCCL Tests repositories (if needed)
rm -rf ~/nccl/
rm -rf ~/nccl-tests/
```

## Step 7. Next steps

Your NCCL environment is ready for multi-node distributed training workloads on DGX Spark.
Now you can try running a larger distributed workload such as TRT-LLM or vLLM inference.

## Troubleshooting

| Issue | Cause | Solution |
|-------|-------|----------|
| mpirun hangs or times out | SSH connectivity issues | 1. Test basic SSH connectivity: `ssh <remote_ip>` should work without password prompts<br>2. Try a simple mpirun test: `mpirun -np 2 -H <IP for Node 1>:1,<IP for Node 2>:1 hostname`<br>3. Verify SSH keys are set up correctly for all nodes |
| Network interface not found | Wrong interface name or down status | Check interface status with `ibdev2netdev` and verify IP configuration |
| NCCL build fails | Missing dependencies such as OpenMPI or incorrect CUDA version | Verify CUDA installation and required libraries are present |