mirror of
https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-23 10:33:51 +00:00
chore: Regenerate all playbooks
parent
699df25ee3
commit
80ab44c98d
@ -101,7 +101,12 @@ rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)

Choose one option to setup the network interfaces. Option 1 and 2 are mutually exclusive.

**Option 1: Automatic IP Assignment (Recommended)**
> [!NOTE]
> Full bandwidth can be achieved with just one QSFP cable.
> When two QSFP cables are connected, all four interfaces must be assigned IP addresses to obtain full bandwidth.
> Option 1 below can only be used when 1 QSFP cable is connected.

**Option 1: Automatic IP Assignment (Can only be used when 1 QSFP cable is connected)**

Configure network interfaces using netplan on both DGX Spark nodes for automatic
link-local addressing:
@ -125,11 +130,78 @@ sudo chmod 600 /etc/netplan/40-cx7.yaml
sudo netplan apply
```

**Option 2: Manual IP Assignment with the netplan configure file**

On node 1:
```bash
## Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.100.10/24
      dhcp4: no
    enp1s0f1np1:
      addresses:
        - 192.168.200.12/24
      dhcp4: no
    enP2p1s0f0np0:
      addresses:
        - 192.168.100.14/24
      dhcp4: no
    enP2p1s0f1np1:
      addresses:
        - 192.168.200.16/24
      dhcp4: no
EOF

## Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml

## Apply the configuration
sudo netplan apply
```

On node 2:
```bash
## Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.100.11/24
      dhcp4: no
    enp1s0f1np1:
      addresses:
        - 192.168.200.13/24
      dhcp4: no
    enP2p1s0f0np0:
      addresses:
        - 192.168.100.15/24
      dhcp4: no
    enP2p1s0f1np1:
      addresses:
        - 192.168.200.17/24
      dhcp4: no
EOF

## Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml

## Apply the configuration
sudo netplan apply
```
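
The example addresses above appear to place each interface on node 1 in the same /24 network as the same-named interface on node 2 (for instance 192.168.100.10 and 192.168.100.11 on `enp1s0f0np0`). As a minimal sketch of checking such a pairing before applying a configuration, the helpers `prefix24` and `same_subnet24` below are hypothetical (they are not part of the playbook) and assume a /24 prefix length throughout:

```bash
#!/bin/sh
# Hypothetical helper: print the first three octets of an IPv4 address,
# ignoring an optional CIDR suffix such as "/24".
prefix24() {
  addr="${1%%/*}"      # strip the "/24" suffix if present
  echo "${addr%.*}"    # drop the last octet
}

# Hypothetical helper: succeed when both addresses share the same /24 network.
same_subnet24() {
  [ "$(prefix24 "$1")" = "$(prefix24 "$2")" ]
}

# Node 1 / node 2 address pairs taken from the example configuration.
same_subnet24 192.168.100.10/24 192.168.100.11/24 && echo "enp1s0f0np0: same /24"
same_subnet24 192.168.200.12/24 192.168.200.13/24 && echo "enp1s0f1np1: same /24"
```

If you choose different addresses, the same comparison can be run on your own pairs; a pair that does not share a network prints nothing.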

**Option 3: Manual IP Assignment with command line**

> [!NOTE]
> Using this option, the IPs assigned to the interfaces will change if you reboot the system.

**Option 2: Manual IP Assignment (Advanced)**

First, identify which network ports are available and up:

```bash
@ -167,7 +239,7 @@ You can verify the IP assignment on both nodes by running the following command
ip addr show enp1s0f1np1
```
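
The same check can be repeated for every interface that is expected to carry an address. The following is a minimal sketch, assuming the four interface names used in the netplan example of this playbook and the `-brief` output format of `ip` from iproute2; `has_ipv4` is a hypothetical helper, not part of the playbook:

```bash
#!/bin/sh
# Hypothetical helper: succeed if a line of `ip -4 -brief addr show` output
# contains an IPv4 address in CIDR form, e.g. "enp1s0f1np1 UP 192.168.200.12/24".
has_ipv4() {
  printf '%s\n' "$1" | grep -Eq '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/[0-9]+'
}

# Interface names are assumed from the netplan example; adjust them to your system.
for ifc in enp1s0f0np0 enp1s0f1np1 enP2p1s0f0np0 enP2p1s0f1np1; do
  line="$(ip -4 -brief addr show dev "$ifc" 2>/dev/null || true)"
  if has_ipv4 "$line"; then
    echo "$ifc: $line"
  else
    echo "$ifc: no IPv4 address assigned"
  fi
done
```

Run it on both nodes; an interface reported without an address is one to revisit before continuing.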

## Step 3. Set up passwordless SSH authentication
## Step 4. Set up passwordless SSH authentication

#### Option 1: Automatically configure SSH

@ -220,7 +292,7 @@ ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 1>
ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 2>
```

## Step 4. Verify Multi-Node Communication
## Step 5. Verify Multi-Node Communication

Test basic multi-node functionality:

@ -76,7 +76,7 @@ the TensorRT development environment with all required dependencies pre-installe
docker run --gpus all --ipc=host --ulimit memlock=-1 \
--ulimit stack=67108864 -it --rm --ipc=host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/pytorch:25.09-py3
nvcr.io/nvidia/pytorch:25.10-py3
```

## Step 2. Clone and set up TensorRT repository
@ -101,6 +101,7 @@ apt install -y libgl1 libglu1-mesa libglib2.0-0t64 libxrender1 libxext6 libx11-6
pip install nvidia-modelopt[torch,onnx]
sed -i '/^nvidia-modelopt\[.*\]=.*/d' requirements.txt
pip3 install -r requirements.txt
pip install onnxconverter_common
```

## Step 4. Run Flux.1 Dev model inference

@ -128,6 +128,10 @@ In this example, the IP address for Node 1 is **169.254.35.62**. Repeat the proc

## Step 5. Run NCCL communication test

> [!NOTE]
> Full bandwidth can be achieved with just one QSFP cable.
> When two QSFP cables are connected, all four interfaces must be assigned IP addresses to obtain full bandwidth.

Execute the following commands on both nodes to run the NCCL communication test. Replace the IP addresses and interface names with the ones you found in the previous step.

```bash