diff --git a/nvidia/connect-two-sparks/README.md b/nvidia/connect-two-sparks/README.md
index bf1d0bc..ecff809 100644
--- a/nvidia/connect-two-sparks/README.md
+++ b/nvidia/connect-two-sparks/README.md
@@ -101,7 +101,12 @@ rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
 
 Choose one option to setup the network interfaces. Option 1 and 2 are mutually exclusive.
 
-**Option 1: Automatic IP Assignment (Recommended)**
+> [!NOTE]
+> Full bandwidth can be achieved with just one QSFP cable.
+> When two QSFP cables are connected, all four interfaces must be assigned IP addresses to obtain full bandwidth.
+> Option 1 below can only be used when a single QSFP cable is connected.
+
+**Option 1: Automatic IP Assignment (can only be used with a single QSFP cable)**
 
 Configure network interfaces using netplan on both DGX Spark nodes for automatic link-local addressing:
 
@@ -125,11 +130,78 @@ sudo chmod 600 /etc/netplan/40-cx7.yaml
 sudo netplan apply
 ```
 
+**Option 2: Manual IP Assignment with a netplan configuration file**
+
+On node 1:
+```bash
+## Create the netplan configuration file
+sudo tee /etc/netplan/40-cx7.yaml > /dev/null < /dev/null <
 > [!NOTE]
 > Using this option, the IPs assigned to the interfaces will change if you reboot the system.
 
-**Option 2: Manual IP Assignment (Advanced)**
-
 First, identify which network ports are available and up:
 
 ```bash
@@ -167,7 +239,7 @@ You can verify the IP assignment on both nodes by running the following command
 ```bash
 ip addr show enp1s0f1np1
 ```
 
-## Step 3. Set up passwordless SSH authentication
+## Step 4. Set up passwordless SSH authentication
 
 #### Option 1: Automatically configure SSH
@@ -220,7 +292,7 @@ ssh-copy-id -i ~/.ssh/id_rsa.pub @
 ssh-copy-id -i ~/.ssh/id_rsa.pub @
 ```
 
-## Step 4. Verify Multi-Node Communication
+## Step 5. Verify Multi-Node Communication
 
 Test basic multi-node functionality:
 
diff --git a/nvidia/multi-modal-inference/README.md b/nvidia/multi-modal-inference/README.md
index 0c08ccb..0942553 100644
--- a/nvidia/multi-modal-inference/README.md
+++ b/nvidia/multi-modal-inference/README.md
@@ -76,7 +76,7 @@ the TensorRT development environment with all required dependencies pre-installe
 docker run --gpus all --ipc=host --ulimit memlock=-1 \
 --ulimit stack=67108864 -it --rm --ipc=host \
 -v $HOME/.cache/huggingface:/root/.cache/huggingface \
-nvcr.io/nvidia/pytorch:25.09-py3
+nvcr.io/nvidia/pytorch:25.10-py3
 ```
 
 ## Step 2. Clone and set up TensorRT repository
@@ -101,6 +101,7 @@ apt install -y libgl1 libglu1-mesa libglib2.0-0t64 libxrender1 libxext6 libx11-6
 pip install nvidia-modelopt[torch,onnx]
 sed -i '/^nvidia-modelopt\[.*\]=.*/d' requirements.txt
 pip3 install -r requirements.txt
+pip install onnxconverter_common
 ```
 
 ## Step 4. Run Flux.1 Dev model inference
diff --git a/nvidia/nccl/README.md b/nvidia/nccl/README.md
index 9554130..6cbacb9 100644
--- a/nvidia/nccl/README.md
+++ b/nvidia/nccl/README.md
@@ -128,6 +128,10 @@ In this example, the IP address for Node 1 is **169.254.35.62**. Repeat the proc
 
 ## Step 5. Run NCCL communication test
 
+> [!NOTE]
+> Full bandwidth can be achieved with just one QSFP cable.
+> When two QSFP cables are connected, all four interfaces must be assigned IP addresses to obtain full bandwidth.
+
 Execute the following commands on both nodes to run the NCCL communication test. Replace the IP addresses and interface names with the ones you found in the previous step.
 
 ```bash
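
The heredoc body of the Option 2 `sudo tee /etc/netplan/40-cx7.yaml` command is not visible in this patch text. For orientation only, a manual-IP netplan file for a point-to-point link generally takes a shape like the sketch below; the interface name matches the one used in the README's verification step (`ip addr show enp1s0f1np1`), but the address is a placeholder assumption, not the value from the original file:

```yaml
# Hypothetical /etc/netplan/40-cx7.yaml sketch for manual IP assignment.
# The interface name follows the README's verification step; the static
# address is an illustrative placeholder (use a different host address,
# e.g. .2, on node 2).
network:
  version: 2
  ethernets:
    enp1s0f1np1:
      addresses:
        - 192.168.100.1/24
```

As with Option 1, the file would be placed under `/etc/netplan/`, restricted with `chmod 600`, and activated with `sudo netplan apply`.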