mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-26 03:43:52 +00:00

playbook rev5

This commit is contained in:
parent 557fba70c8
commit 59bedc4afe
@@ -62,6 +62,13 @@ This guide walks you through deploying distributed inference across your heterog
 - NVIDIA Container Toolkit installed
 - Hugging Face account for model access (some models require authentication)
 
+> [!NOTE]
+> **Why we use the `nvcr.io/nvidia/vllm` container:** This tutorial uses the official NVIDIA vLLM container image (`nvcr.io/nvidia/vllm:25.09-py3`) on both nodes. This is important because:
+> - **Version consistency:** Ray cluster is very sensitive to Python and Ray version mismatches between nodes. The container guarantees identical versions on both DGX Spark (ARM64) and Workstation (AMD64).
+> - **Pre-installed dependencies:** NCCL, RDMA libraries, and all required packages are already configured.
+> - **Multi-architecture support:** The same image tag works on both ARM64 (DGX Spark) and AMD64 (Workstation) architectures.
+> - **vLLM ready:** No additional installation needed - just pull and run.
+
 ## Time & risk
 
 - **Duration:** 1-2 hours including testing
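The version-consistency point in the note above can be spot-checked before forming the cluster. A minimal sketch: compare the Ray version reported inside the container on each node (the version strings below are hypothetical placeholders, not output from the tutorial):

```shell
# Illustration of the version-consistency requirement: the two values
# would come from running `ray --version` inside each node's container.
head_ver="2.40.0"     # hypothetical output on the DGX Spark (head) node
worker_ver="2.40.0"   # hypothetical output on the Workstation (worker) node

if [ "$head_ver" = "$worker_ver" ]; then
  echo "Ray versions match: $head_ver"
else
  echo "MISMATCH: head=$head_ver worker=$worker_ver" >&2
fi
```

Because both nodes run the same image tag, these should always match; a mismatch usually means one node pulled a different tag.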
@@ -310,15 +317,18 @@ On Workstation container (worker node):
 ray start \
 --address=192.168.200.1:6379 \
 --node-ip-address=192.168.200.2 \
---num-gpus=2
+--num-gpus=1
 ```
 
+> [!NOTE]
+> Adjust `--num-gpus` based on your workstation configuration. In our case, we had 2 GPUs (RTX 6000 Pro + RTX 5090) but only used 1 for this tutorial.
+
 Verify cluster formation:
 ```bash
 ray status
 ```
 
-Expected output (should show 3 total GPUs):
+Expected output (should show 2+ total GPUs depending on your setup):
 ```
 ======== Autoscaler status: 2026-01-10 19:46:26.274139 ========
 Node status
@@ -330,7 +340,7 @@ Resources
 ---------------------------------------------------------------
 Total Usage:
 0.0/68.0 CPU
-0.0/3.0 GPU
+0.0/2.0 GPU
 ```
 
 ---
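The GPU total in the `ray status` output above can also be checked mechanically. A small sketch, assuming the `used/total GPU` line format shown in the expected output (the line here is pasted as sample data rather than read from a live cluster):

```shell
# Parse the GPU total from a `ray status` resource line to confirm
# both nodes registered their GPUs (format assumed from the sample output).
status_line="0.0/2.0 GPU"
total_gpus="${status_line#*/}"    # strip the "used/" prefix -> "2.0 GPU"
total_gpus="${total_gpus%% *}"    # strip the " GPU" suffix  -> "2.0"
echo "cluster GPU total: $total_gpus"
```

On a live cluster you would feed the matching line of `ray status` into the same parameter expansions instead of the hard-coded sample.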
@@ -480,23 +490,9 @@ vllm bench serve --host 192.168.200.1 --port 8000 --random-input-len 256 --rando
 | **DGX Spark (Single)** | 213.12s | 105.10 tok/s | 132.00 tok/s |
 | **Distributed RDMA** | 191.09s | 205.83 tok/s | 259.41 tok/s |
 
-### Key Insights
+### What This Demonstrates
 
-**RTX 6000 Pro: Clear Single-Node Winner**
-- 5.8x faster than DGX Spark for latency-critical workloads
-- 6.5x higher output token throughput
-- Best for: Interactive inference, real-time applications
-
-**Distributed RDMA: Aggregated Capacity**
-- 259.41 tok/s total throughput - faster than DGX alone
-- Combined 224GB GPU memory (128GB DGX + 96GB RTX)
-- Enables models too large for any single GPU
-- TTFT: 139.94ms mean vs single DGX 213,120ms
-
-**DGX Spark: Memory Advantage**
-- 128GB unified memory enables larger models
-- Slower inference but handles 100B+ models
-- Best for: Extremely large models, memory-constrained scenarios
+The key achievement of this tutorial is successfully running distributed inference across heterogeneous hardware (DGX Spark ARM64 + Linux Workstation AMD64) over RDMA. The distributed setup aggregates GPU memory from both systems, enabling models that wouldn't fit on either device alone.
 
 ### FP8 30B Model Results
@@ -128,7 +128,6 @@ Both planes use the same 100 Gbps ConnectX network in this configuration.
 | `infiniband-diags` | Diagnostics (`ibstat`) | Package: infiniband-diags |
 | `mstflint` | Firmware inspection | Package: mstflint |
 | `NCCL` | Multi-GPU collectives | Built into PyTorch/frameworks |
-| `GPUDirect RDMA` | GPU↔NIC zero-copy | Requires nvidia-peermem |
 
 ---
@ -546,14 +545,9 @@ Example successful output:
|
|||||||
|
|
||||||
**Performance Analysis:**
|
**Performance Analysis:**
|
||||||
- 11,664 MB/sec = ~93.3 Gbps
|
- 11,664 MB/sec = ~93.3 Gbps
|
||||||
- Achieves >93% of 100 Gbps line rate - Excellent!
|
- Achieves >93% of 100 Gbps line rate
|
||||||
- Link type: Ethernet confirms RoCE v2 is working
|
- Link type: Ethernet confirms RoCE v2 is working
|
||||||
|
|
||||||
**Performance expectations:**
|
|
||||||
- **>90 Gbps:** Excellent - Ready for distributed AI workloads
|
|
||||||
- **80-90 Gbps:** Good - Sufficient for most multi-node training
|
|
||||||
- **<80 Gbps:** Check MTU (should be 9000), cable quality, or PCIe slot
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Step 13. Configure Environment Variables for NCCL
|
## Step 13. Configure Environment Variables for NCCL
|
||||||
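The MB/sec-to-Gbps conversion in the performance analysis above is easy to verify. A quick sketch, assuming decimal megabytes (1 MB = 10^6 bytes) as implied by the quoted ~93.3 Gbps figure:

```shell
# Sanity-check the quoted conversion: 11,664 MB/sec -> Gbps.
# Assumes decimal MB (10^6 bytes); bytes -> bits is *8, mega -> giga is /1000.
reported_mb_per_sec=11664
gbps=$(awk -v mb="$reported_mb_per_sec" 'BEGIN { printf "%.1f", mb * 8 / 1000 }')
echo "$gbps Gbps out of a 100 Gbps link"
```

This reproduces the ~93.3 Gbps (>93% of line rate) figure quoted in the analysis.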
@@ -581,23 +575,7 @@ echo $NCCL_SOCKET_IFNAME
 
 ---
 
-## Step 14. (Optional) Configure GPUDirect RDMA
+## Step 14. Final Validation
 
-**When needed:**
-- High-frequency GPU-to-GPU transfers
-- Zero-copy GPU memory access
-- Maximum performance training workloads
-
-**Configuration:**
-```bash
-## Install nvidia-peermem module
-sudo apt install nvidia-peer-memory-dkms
-sudo modprobe nvidia-peermem
-```
-
----
-
-## Step 15. Final Validation
-
 At this point, you should have achieved:
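Before the final validation, it is worth confirming that the NCCL variable from Step 13 survived into the current shell. A minimal fail-fast sketch (the default interface name below is a placeholder, not from the tutorial; substitute your RDMA interface):

```shell
# Fail fast if the NCCL interface variable from Step 13 is unset.
# "enp1s0f0" is only a placeholder default for illustration.
export NCCL_SOCKET_IFNAME="${NCCL_SOCKET_IFNAME:-enp1s0f0}"
: "${NCCL_SOCKET_IFNAME:?must point at the RDMA interface}"
echo "NCCL will bind to: $NCCL_SOCKET_IFNAME"
```

Running this on both nodes before launching Ray/vLLM catches the common case where the variable was set in one terminal but not in the session that starts the containers.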