mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-26 03:43:52 +00:00

playbook rev5

This commit is contained in:
parent 557fba70c8
commit 59bedc4afe
@@ -62,6 +62,13 @@ This guide walks you through deploying distributed inference across your heterog
 - NVIDIA Container Toolkit installed
 - Hugging Face account for model access (some models require authentication)
 
+> [!NOTE]
+> **Why we use the `nvcr.io/nvidia/vllm` container:** This tutorial uses the official NVIDIA vLLM container image (`nvcr.io/nvidia/vllm:25.09-py3`) on both nodes. This is important because:
+> - **Version consistency:** Ray cluster is very sensitive to Python and Ray version mismatches between nodes. The container guarantees identical versions on both DGX Spark (ARM64) and Workstation (AMD64).
+> - **Pre-installed dependencies:** NCCL, RDMA libraries, and all required packages are already configured.
+> - **Multi-architecture support:** The same image tag works on both ARM64 (DGX Spark) and AMD64 (Workstation) architectures.
+> - **vLLM ready:** No additional installation needed - just pull and run.
+
 ## Time & risk
 
 - **Duration:** 1-2 hours including testing
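The version-consistency point in the note above can be spot-checked before forming the cluster. A minimal sketch: compare the Ray version reported inside the container on each node (the version strings below are hypothetical placeholders, not output from the tutorial):

```shell
# Illustration of the version-consistency requirement: the two values
# would come from running `ray --version` inside each node's container.
head_ver="2.40.0"     # hypothetical output on the DGX Spark (head) node
worker_ver="2.40.0"   # hypothetical output on the Workstation (worker) node

if [ "$head_ver" = "$worker_ver" ]; then
  echo "Ray versions match: $head_ver"
else
  echo "MISMATCH: head=$head_ver worker=$worker_ver" >&2
fi
```

Because both nodes run the same image tag, these should always match; a mismatch usually means one node pulled a different tag.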
@@ -310,15 +317,18 @@ On Workstation container (worker node):
 ray start \
 --address=192.168.200.1:6379 \
 --node-ip-address=192.168.200.2 \
---num-gpus=2
+--num-gpus=1
 ```
 
+> [!NOTE]
+> Adjust `--num-gpus` based on your workstation configuration. In our case, we had 2 GPUs (RTX 6000 Pro + RTX 5090) but only used 1 for this tutorial.
+
 Verify cluster formation:
 ```bash
 ray status
 ```
 
-Expected output (should show 3 total GPUs):
+Expected output (should show 2+ total GPUs depending on your setup):
 ```
 ======== Autoscaler status: 2026-01-10 19:46:26.274139 ========
 Node status
@@ -330,7 +340,7 @@ Resources
 ---------------------------------------------------------------
 Total Usage:
 0.0/68.0 CPU
-0.0/3.0 GPU
+0.0/2.0 GPU
 ```
 
 ---
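The GPU total in the `ray status` output above can also be checked mechanically. A small sketch, assuming the `used/total GPU` line format shown in the expected output (the line here is pasted as sample data rather than read from a live cluster):

```shell
# Parse the GPU total from a `ray status` resource line to confirm
# both nodes registered their GPUs (format assumed from the sample output).
status_line="0.0/2.0 GPU"
total_gpus="${status_line#*/}"    # strip the "used/" prefix -> "2.0 GPU"
total_gpus="${total_gpus%% *}"    # strip the " GPU" suffix  -> "2.0"
echo "cluster GPU total: $total_gpus"
```

On a live cluster you would feed the matching line of `ray status` into the same parameter expansions instead of the hard-coded sample.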
@@ -480,23 +490,9 @@ vllm bench serve --host 192.168.200.1 --port 8000 --random-input-len 256 --rando
 | **DGX Spark (Single)** | 213.12s | 105.10 tok/s | 132.00 tok/s |
 | **Distributed RDMA** | 191.09s | 205.83 tok/s | 259.41 tok/s |
 
-### Key Insights
+### What This Demonstrates
 
-**RTX 6000 Pro: Clear Single-Node Winner**
-- 5.8x faster than DGX Spark for latency-critical workloads
-- 6.5x higher output token throughput
-- Best for: Interactive inference, real-time applications
-
-**Distributed RDMA: Aggregated Capacity**
-- 259.41 tok/s total throughput - faster than DGX alone
-- Combined 224GB GPU memory (128GB DGX + 96GB RTX)
-- Enables models too large for any single GPU
-- TTFT: 139.94ms mean vs single DGX 213,120ms
-
-**DGX Spark: Memory Advantage**
-- 128GB unified memory enables larger models
-- Slower inference but handles 100B+ models
-- Best for: Extremely large models, memory-constrained scenarios
+The key achievement of this tutorial is successfully running distributed inference across heterogeneous hardware (DGX Spark ARM64 + Linux Workstation AMD64) over RDMA. The distributed setup aggregates GPU memory from both systems, enabling models that wouldn't fit on either device alone.
 
 ### FP8 30B Model Results
@@ -128,7 +128,6 @@ Both planes use the same 100 Gbps ConnectX network in this configuration.
 | `infiniband-diags` | Diagnostics (`ibstat`) | Package: infiniband-diags |
 | `mstflint` | Firmware inspection | Package: mstflint |
 | `NCCL` | Multi-GPU collectives | Built into PyTorch/frameworks |
-| `GPUDirect RDMA` | GPU↔NIC zero-copy | Requires nvidia-peermem |
 
 ---
@ -546,14 +545,9 @@ Example successful output:
|
|||||||
|
|
||||||
**Performance Analysis:**
|
**Performance Analysis:**
|
||||||
- 11,664 MB/sec = ~93.3 Gbps
|
- 11,664 MB/sec = ~93.3 Gbps
|
||||||
- Achieves >93% of 100 Gbps line rate - Excellent!
|
- Achieves >93% of 100 Gbps line rate
|
||||||
- Link type: Ethernet confirms RoCE v2 is working
|
- Link type: Ethernet confirms RoCE v2 is working
|
||||||
|
|
||||||
**Performance expectations:**
|
|
||||||
- **>90 Gbps:** Excellent - Ready for distributed AI workloads
|
|
||||||
- **80-90 Gbps:** Good - Sufficient for most multi-node training
|
|
||||||
- **<80 Gbps:** Check MTU (should be 9000), cable quality, or PCIe slot
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Step 13. Configure Environment Variables for NCCL
|
## Step 13. Configure Environment Variables for NCCL
|
||||||
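The MB/sec-to-Gbps conversion in the performance analysis above is easy to verify. A quick sketch, assuming decimal megabytes (1 MB = 10^6 bytes) as implied by the quoted ~93.3 Gbps figure:

```shell
# Sanity-check the quoted conversion: 11,664 MB/sec -> Gbps.
# Assumes decimal MB (10^6 bytes); bytes -> bits is *8, mega -> giga is /1000.
reported_mb_per_sec=11664
gbps=$(awk -v mb="$reported_mb_per_sec" 'BEGIN { printf "%.1f", mb * 8 / 1000 }')
echo "$gbps Gbps out of a 100 Gbps link"
```

This reproduces the ~93.3 Gbps (>93% of line rate) figure quoted in the analysis.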
@@ -581,23 +575,7 @@ echo $NCCL_SOCKET_IFNAME
 
 ---
 
-## Step 14. (Optional) Configure GPUDirect RDMA
+## Step 14. Final Validation
 
-**When needed:**
-- High-frequency GPU-to-GPU transfers
-- Zero-copy GPU memory access
-- Maximum performance training workloads
-
-**Configuration:**
-```bash
-## Install nvidia-peermem module
-sudo apt install nvidia-peer-memory-dkms
-sudo modprobe nvidia-peermem
-```
-
----
-
-## Step 15. Final Validation
-
 At this point, you should have achieved:
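Before the final validation, it is worth confirming that the NCCL variable from Step 13 survived into the current shell. A minimal fail-fast sketch (the default interface name below is a placeholder, not from the tutorial; substitute your RDMA interface):

```shell
# Fail fast if the NCCL interface variable from Step 13 is unset.
# "enp1s0f0" is only a placeholder default for illustration.
export NCCL_SOCKET_IFNAME="${NCCL_SOCKET_IFNAME:-enp1s0f0}"
: "${NCCL_SOCKET_IFNAME:?must point at the RDMA interface}"
echo "NCCL will bind to: $NCCL_SOCKET_IFNAME"
```

Running this on both nodes before launching Ray/vLLM catches the common case where the variable was set in one terminal but not in the session that starts the containers.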