Mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git (synced 2026-04-26 20:03:52 +00:00)
chore: Regenerate all playbooks
parent 699df25ee3
commit 80ab44c98d
@@ -101,7 +101,12 @@ rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
 
 Choose one option to setup the network interfaces. Option 1 and 2 are mutually exclusive.
 
-**Option 1: Automatic IP Assignment (Recommended)**
+> [!NOTE]
+> Full bandwidth can be achieved with just one QSFP cable.
+> When two QSFP cables are connected, all four interfaces must be assigned IP addresses to obtain full bandwidth.
+> Option 1 below can only be used when 1 QSFP cable is connected.
+
+**Option 1: Automatic IP Assignment (Can only be used when 1 QSFP cable is connected)**
 
 Configure network interfaces using netplan on both DGX Spark nodes for automatic
 link-local addressing:
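The note added in this hunk says that with two QSFP cables all four interfaces need an IP address. A minimal sketch of how that could be checked is below. The helper name `missing_ipv4` is hypothetical and not part of the playbook; the four interface names are the ones used in the Option 2 netplan files in this commit, and the input is assumed to be in the `ip -brief -4 addr show` column layout (name, state, address).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the playbook): reads `ip -brief -4 addr show`
# output on stdin and prints each CX7 interface that has no IPv4 address.
missing_ipv4() {
  local input iface
  input=$(cat)
  for iface in enp1s0f0np0 enp1s0f1np1 enP2p1s0f0np0 enP2p1s0f1np1; do
    # An interface counts as configured when its row carries an address/prefix
    # in the third column.
    if ! awk -v i="$iface" '$1 == i && $3 ~ /\// { found = 1 } END { exit !found }' <<< "$input"; then
      echo "$iface"
    fi
  done
}

# Intended usage on a node (left as a comment so the sketch has no side effects):
#   ip -brief -4 addr show | missing_ipv4
```

An empty result would suggest that all four interfaces carry an address; any name printed is an interface still waiting for one.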
@@ -125,11 +130,78 @@ sudo chmod 600 /etc/netplan/40-cx7.yaml
 sudo netplan apply
 ```
 
+**Option 2: Manual IP Assignment with the netplan configure file**
+
+On node 1:
+```bash
+## Create the netplan configuration file
+sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
+network:
+  version: 2
+  ethernets:
+    enp1s0f0np0:
+      addresses:
+        - 192.168.100.10/24
+      dhcp4: no
+    enp1s0f1np1:
+      addresses:
+        - 192.168.200.12/24
+      dhcp4: no
+    enP2p1s0f0np0:
+      addresses:
+        - 192.168.100.14/24
+      dhcp4: no
+    enP2p1s0f1np1:
+      addresses:
+        - 192.168.200.16/24
+      dhcp4: no
+EOF
+
+## Set appropriate permissions
+sudo chmod 600 /etc/netplan/40-cx7.yaml
+
+## Apply the configuration
+sudo netplan apply
+```
+
+On node 2:
+```bash
+## Create the netplan configuration file
+sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
+network:
+  version: 2
+  ethernets:
+    enp1s0f0np0:
+      addresses:
+        - 192.168.100.11/24
+      dhcp4: no
+    enp1s0f1np1:
+      addresses:
+        - 192.168.200.13/24
+      dhcp4: no
+    enP2p1s0f0np0:
+      addresses:
+        - 192.168.100.15/24
+      dhcp4: no
+    enP2p1s0f1np1:
+      addresses:
+        - 192.168.200.17/24
+      dhcp4: no
+EOF
+
+## Set appropriate permissions
+sudo chmod 600 /etc/netplan/40-cx7.yaml
+
+## Apply the configuration
+sudo netplan apply
+```
+
+
+**Option 3: Manual IP Assignment with command line**
+
 > [!NOTE]
 > Using this option, the IPs assigned to the interfaces will change if you reboot the system.
 
-**Option 2: Manual IP Assignment (Advanced)**
-
 First, identify which network ports are available and up:
 
 ```bash
@@ -167,7 +239,7 @@ You can verify the IP assignment on both nodes by running the following command
 ip addr show enp1s0f1np1
 ```
 
-## Step 3. Set up passwordless SSH authentication
+## Step 4. Set up passwordless SSH authentication
 
 #### Option 1: Automatically configure SSH
 
@@ -220,7 +292,7 @@ ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 1>
 ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@<IP for Node 2>
 ```
 
-## Step 4. Verify Multi-Node Communication
+## Step 5. Verify Multi-Node Communication
 
 Test basic multi-node functionality:
 
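The Option 2 netplan files added in this commit give each interface a different address on node 1 and node 2 within the same /24. The sketch below shows one way to sanity-check such an address plan before applying it. The names `network_of`, `plan`, and `check_plan` are hypothetical helpers, not part of the playbook; the addresses are copied from the two files in the diff above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the network address of an IPv4 CIDR,
# e.g. 192.168.100.10/24 gives 192.168.100.0.
network_of() {
  local cidr=$1 ip prefix a b c d addr mask net
  ip=${cidr%/*}
  prefix=${cidr#*/}
  IFS=. read -r a b c d <<< "$ip"
  addr=$(( (a << 24) | (b << 16) | (c << 8) | d ))
  mask=$(( prefix == 0 ? 0 : (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  net=$(( addr & mask ))
  printf '%d.%d.%d.%d\n' \
    $(( net >> 24 & 255 )) $(( net >> 16 & 255 )) $(( net >> 8 & 255 )) $(( net & 255 ))
}

# Address plan from the diff: interface, node 1 address, node 2 address.
plan=(
  "enp1s0f0np0 192.168.100.10/24 192.168.100.11/24"
  "enp1s0f1np1 192.168.200.12/24 192.168.200.13/24"
  "enP2p1s0f0np0 192.168.100.14/24 192.168.100.15/24"
  "enP2p1s0f1np1 192.168.200.16/24 192.168.200.17/24"
)

# Each pair should use distinct addresses that fall in the same network.
check_plan() {
  local row iface n1 n2 status=0
  for row in "${plan[@]}"; do
    read -r iface n1 n2 <<< "$row"
    if [[ "$n1" != "$n2" && "$(network_of "$n1")" == "$(network_of "$n2")" ]]; then
      echo "ok $iface"
    else
      echo "MISMATCH $iface"
      status=1
    fi
  done
  return "$status"
}

check_plan
```

The check is purely arithmetic and touches no network state, so it can run on any machine before the files are written to `/etc/netplan`.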

@@ -76,7 +76,7 @@ the TensorRT development environment with all required dependencies pre-installe
 docker run --gpus all --ipc=host --ulimit memlock=-1 \
 --ulimit stack=67108864 -it --rm --ipc=host \
 -v $HOME/.cache/huggingface:/root/.cache/huggingface \
-nvcr.io/nvidia/pytorch:25.09-py3
+nvcr.io/nvidia/pytorch:25.10-py3
 ```
 
 ## Step 2. Clone and set up TensorRT repository
@@ -101,6 +101,7 @@ apt install -y libgl1 libglu1-mesa libglib2.0-0t64 libxrender1 libxext6 libx11-6
 pip install nvidia-modelopt[torch,onnx]
 sed -i '/^nvidia-modelopt\[.*\]=.*/d' requirements.txt
 pip3 install -r requirements.txt
+pip install onnxconverter_common
 ```
 
 ## Step 4. Run Flux.1 Dev model inference

@@ -128,6 +128,10 @@ In this example, the IP address for Node 1 is **169.254.35.62**. Repeat the proc
 
 ## Step 5. Run NCCL communication test
 
+> [!NOTE]
+> Full bandwidth can be achieved with just one QSFP cable.
+> When two QSFP cables are connected, all four interfaces must be assigned IP addresses to obtain full bandwidth.
+
 Execute the following commands on both nodes to run the NCCL communication test. Replace the IP addresses and interface names with the ones you found in the previous step.
 
 ```bash