chore: Regenerate all playbooks

GitLab CI 2025-11-25 03:08:49 +00:00
parent 699df25ee3
commit 80ab44c98d
3 changed files with 83 additions and 6 deletions


@ -101,7 +101,12 @@ rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
Choose one option to set up the network interfaces. Options 1 and 2 are mutually exclusive.
**Option 1: Automatic IP Assignment (Recommended)**
> [!NOTE]
> Full bandwidth can be achieved with just one QSFP cable.
> When two QSFP cables are connected, all four interfaces must be assigned IP addresses to obtain full bandwidth.
> Option 1 below can only be used when 1 QSFP cable is connected.
**Option 1: Automatic IP Assignment (Can only be used when 1 QSFP cable is connected)**
Configure network interfaces using netplan on both DGX Spark nodes for automatic
link-local addressing:
@ -125,11 +130,78 @@ sudo chmod 600 /etc/netplan/40-cx7.yaml
sudo netplan apply
```
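After `netplan apply`, it can help to confirm that an interface actually received a link-local address (the `169.254.0.0/16` range). The following is a minimal sketch, not part of the playbook: `is_link_local`, `iface_ipv4`, and `report_link_local` are hypothetical helper names, and the interface name in the usage line is taken from the examples in this guide.

```bash
# Hypothetical helpers for illustration only.

# Succeed if the given IPv4 address is in the link-local range 169.254.0.0/16.
is_link_local() {
  case "$1" in
    169.254.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the first IPv4 address (without prefix length) on an interface, if any.
iface_ipv4() {
  ip -4 -o addr show dev "$1" | awk '{print $4}' | cut -d/ -f1 | head -n 1
}

# Report whether the interface currently holds a link-local address.
report_link_local() {
  addr="$(iface_ipv4 "$1")"
  if is_link_local "$addr"; then
    echo "$1: link-local address $addr"
  else
    echo "$1: no link-local address (found: ${addr:-none})"
  fi
}
```

For example, `report_link_local enp1s0f1np1` could be run on each node after applying the configuration.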
**Option 2: Manual IP Assignment with the netplan configuration file**
On node 1:
```bash
## Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.100.10/24
      dhcp4: no
    enp1s0f1np1:
      addresses:
        - 192.168.200.12/24
      dhcp4: no
    enP2p1s0f0np0:
      addresses:
        - 192.168.100.14/24
      dhcp4: no
    enP2p1s0f1np1:
      addresses:
        - 192.168.200.16/24
      dhcp4: no
EOF
## Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml
## Apply the configuration
sudo netplan apply
```
On node 2:
```bash
## Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.100.11/24
      dhcp4: no
    enp1s0f1np1:
      addresses:
        - 192.168.200.13/24
      dhcp4: no
    enP2p1s0f0np0:
      addresses:
        - 192.168.100.15/24
      dhcp4: no
    enP2p1s0f1np1:
      addresses:
        - 192.168.200.17/24
      dhcp4: no
EOF
## Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml
## Apply the configuration
sudo netplan apply
```
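Once both nodes have applied their netplan files, each node-1 interface should be able to reach its node-2 counterpart. In the example addresses above, the node-2 address is the node-1 address with the last octet incremented by one. The sketch below relies on that pattern and on the two nodes being cabled port-to-port. `peer_of` and `check_links` are hypothetical helper names, not part of the playbook.

```bash
# Hypothetical helpers for illustration only.

# Map a node-1 example address to its node-2 counterpart (last octet + 1).
peer_of() {
  prefix="${1%.*}"
  last="${1##*.}"
  echo "${prefix}.$((last + 1))"
}

# From node 1, ping each node-2 address through the matching local interface.
# Interface names and addresses are copied from the node-1 netplan file above.
check_links() {
  while read -r iface ip; do
    peer="$(peer_of "$ip")"
    if ping -c 1 -W 1 -I "$iface" "$peer" > /dev/null 2>&1 < /dev/null; then
      echo "$iface -> $peer: ok"
    else
      echo "$iface -> $peer: FAILED"
    fi
  done <<EOF
enp1s0f0np0 192.168.100.10
enp1s0f1np1 192.168.200.12
enP2p1s0f0np0 192.168.100.14
enP2p1s0f1np1 192.168.200.16
EOF
}
```

Running `check_links` on node 1 prints one line per interface. A `FAILED` line suggests checking the cable, the link state, or the address on the corresponding node-2 interface.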
**Option 3: Manual IP Assignment with the command line**
> [!NOTE]
> IP addresses assigned with this option do not persist across reboots.
**Option 2: Manual IP Assignment (Advanced)**
First, identify which network ports are available and up:
```bash
@ -167,7 +239,7 @@ You can verify the IP assignment on both nodes by running the following command
ip addr show enp1s0f1np1
```
## Step 3. Set up passwordless SSH authentication
## Step 4. Set up passwordless SSH authentication
#### Option 1: Automatically configure SSH
@ -220,7 +292,7 @@ ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 1>
ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 2>
```
## Step 4. Verify Multi-Node Communication
## Step 5. Verify Multi-Node Communication
Test basic multi-node functionality:


@ -76,7 +76,7 @@ the TensorRT development environment with all required dependencies pre-installe
docker run --gpus all --ipc=host --ulimit memlock=-1 \
--ulimit stack=67108864 -it --rm --ipc=host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/pytorch:25.09-py3
nvcr.io/nvidia/pytorch:25.10-py3
```
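Because the container tag changes between releases (here from `25.09-py3` to `25.10-py3`), one option is to keep the tag in a single variable. This is a minimal sketch under that assumption: `PYTORCH_TAG`, `pytorch_image`, and `run_pytorch_container` are hypothetical names, not part of the playbook.

```bash
# Hypothetical names for illustration only.
# Default to the tag used in this guide unless PYTORCH_TAG is already set.
PYTORCH_TAG="${PYTORCH_TAG:-25.10-py3}"

# Print the full image reference for a given tag (or the default tag).
pytorch_image() {
  echo "nvcr.io/nvidia/pytorch:${1:-$PYTORCH_TAG}"
}

# Same options as the command above (the repeated --ipc=host is given once);
# only the image reference is parameterized.
run_pytorch_container() {
  docker run --gpus all --ipc=host --ulimit memlock=-1 \
    --ulimit stack=67108864 -it --rm \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    "$(pytorch_image "$@")"
}
```

Calling `run_pytorch_container` with no argument uses the default tag, and `run_pytorch_container 25.09-py3` would select the earlier image.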
## Step 2. Clone and set up TensorRT repository
@ -101,6 +101,7 @@ apt install -y libgl1 libglu1-mesa libglib2.0-0t64 libxrender1 libxext6 libx11-6
pip install nvidia-modelopt[torch,onnx]
sed -i '/^nvidia-modelopt\[.*\]=.*/d' requirements.txt
pip3 install -r requirements.txt
pip install onnxconverter_common
```
## Step 4. Run Flux.1 Dev model inference


@ -128,6 +128,10 @@ In this example, the IP address for Node 1 is **169.254.35.62**. Repeat the proc
## Step 5. Run NCCL communication test
> [!NOTE]
> Full bandwidth can be achieved with just one QSFP cable.
> When two QSFP cables are connected, all four interfaces must be assigned IP addresses to obtain full bandwidth.
Execute the following commands on both nodes to run the NCCL communication test. Replace the IP addresses and interface names with the ones you found in the previous step.
```bash