diff --git a/README.md b/README.md index 2cec705..fa932a0 100644 --- a/README.md +++ b/README.md @@ -28,6 +28,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting - [CUDA-X Data Science](nvidia/cuda-x-data-science/) - [DGX Dashboard](nvidia/dgx-dashboard/) - [FLUX.1 Dreambooth LoRA Fine-tuning](nvidia/flux-finetuning/) +- [Heterogeneous Distributed Inference over RDMA](nvidia/heterogeneous-distributed-inference-rdma/) - [Install and Use Isaac Sim and Isaac Lab](nvidia/isaac/) - [Optimized JAX](nvidia/jax/) - [Live VLM WebUI](nvidia/live-vlm-webui/) diff --git a/nvidia/heterogeneous-distributed-inference-rdma/DISTRIBUTED-INFERENCE.md b/nvidia/heterogeneous-distributed-inference-rdma/DISTRIBUTED-INFERENCE.md new file mode 100644 index 0000000..7d87a18 --- /dev/null +++ b/nvidia/heterogeneous-distributed-inference-rdma/DISTRIBUTED-INFERENCE.md @@ -0,0 +1,532 @@ +# Distributed Inference Guide + +> Deploy and run distributed AI inference across DGX Spark and Linux Workstation using vLLM and Ray + +## Table of Contents + +- [Overview](#overview) +- [Instructions](#instructions) +- [Performance Benchmarks](#performance-benchmarks) +- [Troubleshooting](#troubleshooting) +- [Credits](#credits) + +--- + +## Overview + +## Basic idea + +This guide walks you through deploying distributed inference across your heterogeneous RDMA cluster. Using Ray for orchestration and vLLM for inference, you can run large language models that exceed the memory capacity of any single GPU by distributing them across your DGX Spark and Linux workstation. 
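As a mental model of the tensor parallelism used throughout this guide (a simplified pure-Python sketch, not vLLM's actual implementation), each worker holds a column slice of every weight matrix, computes a partial output locally, and the partials are combined over the interconnect:

```python
# Illustrative sketch of tensor-parallel sharding (NOT vLLM's implementation):
# each "device" holds a column slice of the weight matrix, computes a partial
# output locally, and the partials are concatenated (an all-gather) to recover
# the full result.

def shard_columns(matrix, world_size):
    """Split a 2D matrix (list of rows) column-wise into world_size shards."""
    cols = len(matrix[0])
    assert cols % world_size == 0, "columns must divide evenly across devices"
    step = cols // world_size
    return [[row[r * step:(r + 1) * step] for row in matrix]
            for r in range(world_size)]

def matmul(x, w):
    """Plain matrix multiply: x (m x k) @ w (k x n), pure Python for clarity."""
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))]
            for i in range(len(x))]

w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]            # 2x4 weight matrix
x = [[1, 1]]                  # 1x2 activation

shards = shard_columns(w, world_size=2)          # one slice per "GPU"
partials = [matmul(x, s) for s in shards]        # computed locally on each node
combined = [sum((p[0] for p in partials), [])]   # all-gather across the link

assert combined == matmul(x, w)
print(combined)  # [[6, 8, 10, 12]]
```

In the real system the partial results never leave GPU memory on either node; NCCL moves them directly over the RDMA link, which is exactly the traffic this guide configures and monitors.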
+ +**Architecture:** +``` +┌─────────────────────────────────┐ ┌───────────────────────────────────┐ +│ DGX SPARK │ │ WORKSTATION │ +│ (Grace Blackwell GB10) │ │ (RTX 6000 Pro / RTX 5090) │ +│ │ │ │ +│ ┌───────────────────────────┐ │ │ ┌───────────────────────────┐ │ +│ │ vLLM Head Node │ │ │ │ vLLM Worker │ │ +│ │ (API Server, Rank 0) │ │ │ │ (Tensor Parallel) │ │ +│ └───────────────────────────┘ │ │ └───────────────────────────┘ │ +│ │ │ │ │ │ +│ ┌───────────────────────────┐ │ │ ┌───────────────────────────┐ │ +│ │ Ray Head (6379) │◄─┼────┼──│ Ray Worker │ │ +│ └───────────────────────────┘ │ │ └───────────────────────────┘ │ +│ │ │ │ │ │ +│ ┌───────────────────────────┐ │ │ ┌───────────────────────────┐ │ +│ │ NCCL over RDMA │◄─┼════┼──│ NCCL over RDMA │ │ +│ │ 192.168.200.1 │ │ │ │ 192.168.200.2 │ │ +│ └───────────────────────────┘ │ │ └───────────────────────────┘ │ +└─────────────────────────────────┘ └───────────────────────────────────┘ +``` + +## What you'll accomplish + +- Configure SSH and hostname resolution between nodes +- Test NCCL communication over RDMA +- Deploy RDMA-enabled Docker containers +- Establish a Ray cluster across both systems +- Run distributed inference with vLLM +- Benchmark performance across different configurations + +## What to know before starting + +- Familiarity with Docker and container networking +- Understanding of distributed computing concepts (Ray, tensor parallelism) +- Basic knowledge of LLM inference serving + +## Prerequisites + +- Completed [RDMA Network Setup](README.md) with validated 90+ Gbps bandwidth +- Docker installed on both systems: `docker --version` +- NVIDIA Container Toolkit installed +- Hugging Face account for model access (some models require authentication) + +> [!NOTE] +> **Why we use the `nvcr.io/nvidia/vllm` container:** This tutorial uses the official NVIDIA vLLM container image (`nvcr.io/nvidia/vllm:25.09-py3`) on both nodes. 
This is important because:
> - **Version consistency:** Ray clusters are very sensitive to Python and Ray version mismatches between nodes. The container guarantees identical versions on both DGX Spark (ARM64) and Workstation (AMD64).
> - **Pre-installed dependencies:** NCCL, RDMA libraries, and all required packages are already configured.
> - **Multi-architecture support:** The same image tag works on both ARM64 (DGX Spark) and AMD64 (Workstation) architectures.
> - **vLLM ready:** No additional installation needed - just pull and run.

## Time & risk

- **Duration:** 1-2 hours including testing

- **Risk level:** Low - uses containers, non-destructive

- **Rollback:** Stop containers to revert

- **Last Updated:** 01/23/2026

---

## Instructions

## Step 1. Configure Hostnames

Add hostname aliases on both systems:

```bash
## Add hostname resolution on both DGX Spark and Workstation
sudo tee -a /etc/hosts > /dev/null <<'EOF'
192.168.200.1 dgx-spark
192.168.200.2 workstation
EOF
```

---

## Step 2. Configure Passwordless SSH

On DGX Spark:
```bash
## Check if SSH key exists
ls ~/.ssh/id_*.pub

## If no key exists, generate one:
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519

## Copy key to Workstation (replace <user> with your username)
ssh-copy-id <user>@workstation
```

On Workstation:
```bash
## Check if SSH key exists
ls ~/.ssh/id_*.pub

## If no key exists, generate one:
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519

## Copy key to DGX Spark
ssh-copy-id <user>@dgx-spark
```

Verify passwordless SSH:

```bash
## From DGX Spark
ssh <user>@workstation hostname
## Expected output: workstation

## From Workstation
ssh <user>@dgx-spark hostname
## Expected output: dgx-spark
```

---

## Step 3. 
Test NCCL Communication + +Create the NCCL test script on both systems: + +```bash +## Create test script +cat > test_nccl.py << 'EOF' +import os +import torch +import torch.distributed as dist +import argparse + +def test_nccl_communication(): + parser = argparse.ArgumentParser() + parser.add_argument('--rank', type=int, required=True) + parser.add_argument('--world_size', type=int, default=2) + parser.add_argument('--master_addr', type=str, default='192.168.200.1') + parser.add_argument('--master_port', type=str, default='29500') + args = parser.parse_args() + + os.environ['RANK'] = str(args.rank) + os.environ['WORLD_SIZE'] = str(args.world_size) + os.environ['MASTER_ADDR'] = args.master_addr + os.environ['MASTER_PORT'] = args.master_port + os.environ['NCCL_SOCKET_IFNAME'] = 'enp1s0f0np0' + + print(f"Initializing process group - Rank: {args.rank}, World Size: {args.world_size}") + print(f"Master: {args.master_addr}:{args.master_port}") + + dist.init_process_group(backend='nccl', rank=args.rank, world_size=args.world_size) + print(f"Process group initialized - Rank: {dist.get_rank()}/{dist.get_world_size()}") + + device = torch.device('cuda:0') + tensor = torch.ones(10, device=device) * (args.rank + 1) + print(f"Rank {args.rank} - Before allreduce: {tensor}") + + dist.all_reduce(tensor, op=dist.ReduceOp.SUM) + print(f"Rank {args.rank} - After allreduce: {tensor}") + print(f"Expected result: {torch.ones(10) * (1 + 2)}") + + dist.destroy_process_group() + print(f"Rank {args.rank} - Test completed successfully!") + +if __name__ == "__main__": + test_nccl_communication() +EOF +``` + +Run NCCL test in Docker containers: + +On DGX Spark (start first): +```bash +docker run -it --runtime=nvidia --gpus all --network host --ipc=host \ + --privileged --ulimit memlock=-1 --ulimit stack=67108864 \ + -v /dev/infiniband:/dev/infiniband -v /sys:/sys:ro \ + -e NCCL_IB_DISABLE=0 -e NCCL_IB_HCA=rocep1s0f0:1 -e NCCL_IB_GID_INDEX=3 \ + -e NCCL_SOCKET_IFNAME=enp1s0f0np0 -v 
$(pwd):/workspace \ + nvcr.io/nvidia/vllm:25.09-py3 python /workspace/test_nccl.py --rank 0 +``` + +On Workstation (connect to DGX): +```bash +docker run -it --runtime=nvidia --gpus all --network host --ipc=host \ + --privileged --ulimit memlock=-1 --ulimit stack=67108864 \ + -v /dev/infiniband:/dev/infiniband -v /sys:/sys:ro \ + -e NCCL_IB_DISABLE=0 -e NCCL_IB_HCA=rocep1s0f0:1 -e NCCL_IB_GID_INDEX=3 \ + -e NCCL_SOCKET_IFNAME=enp1s0f0np0 -v $(pwd):/workspace \ + nvcr.io/nvidia/vllm:25.09-py3 python /workspace/test_nccl.py --rank 1 +``` + +**Success indicators:** +- Output shows: `NCCL INFO Using network IBext_v10` +- All-reduce operation completes successfully +- Final tensors show expected sum values (3.0 for each element) + +--- + +## Step 4. Start RDMA-Enabled Containers + +On DGX Spark: +```bash +docker run -it --runtime=nvidia --gpus all --network host --ipc=host --shm-size=10g \ + --privileged \ + --ulimit memlock=-1 \ + --ulimit stack=67108864 \ + -v /dev/infiniband:/dev/infiniband \ + -v /sys:/sys:ro \ + -e CUDA_DEVICE_ORDER=PCI_BUS_ID \ + -e GLOO_SOCKET_IFNAME=enp1s0f0np0 \ + -e NCCL_IB_DISABLE=0 \ + -e NCCL_IB_HCA=rocep1s0f0:1 \ + -e NCCL_IB_GID_INDEX=3 \ + -e NCCL_SOCKET_IFNAME=enp1s0f0np0 \ + -e RAY_USE_MULTIPLE_IPS=0 \ + -e RAY_NODE_IP_ADDRESS=192.168.200.1 \ + -e RAY_OVERRIDE_NODE_IP=192.168.200.1 \ + -e VLLM_HOST_IP=192.168.200.1 \ + nvcr.io/nvidia/vllm:25.09-py3 bash +``` + +On Workstation: +```bash +docker run -it --runtime=nvidia --gpus all --network host --ipc=host --shm-size=10g \ + --privileged \ + --ulimit memlock=-1 \ + --ulimit stack=67108864 \ + -v /dev/infiniband:/dev/infiniband \ + -v /sys:/sys:ro \ + -e CUDA_DEVICE_ORDER=PCI_BUS_ID \ + -e GLOO_SOCKET_IFNAME=enp1s0f0np0 \ + -e NCCL_IB_DISABLE=0 \ + -e NCCL_IB_HCA=rocep1s0f0:1 \ + -e NCCL_IB_GID_INDEX=3 \ + -e NCCL_SOCKET_IFNAME=enp1s0f0np0 \ + -e RAY_USE_MULTIPLE_IPS=0 \ + -e RAY_NODE_IP_ADDRESS=192.168.200.2 \ + -e RAY_OVERRIDE_NODE_IP=192.168.200.2 \ + nvcr.io/nvidia/vllm:25.09-py3 bash 
+``` + +**Key parameters explained:** +- `--runtime=nvidia`: Required for GPU access +- `--network host`: Uses host networking (required for RDMA) +- `--privileged`: Needed for InfiniBand device access +- `--ulimit memlock=-1`: Unlimited memory locking for RDMA +- `-v /dev/infiniband:/dev/infiniband`: Mounts RDMA devices +- `NCCL_IB_HCA=rocep1s0f0:1`: Tells NCCL to use specific RDMA device +- `RAY_USE_MULTIPLE_IPS=0`: Prevents Ray IP detection issues + +--- + +## Step 5. Establish Ray Cluster + +On DGX Spark container (head node): +```bash +ray start --head \ + --node-ip-address=192.168.200.1 \ + --port=6379 \ + --dashboard-host=192.168.200.1 \ + --dashboard-port=8265 \ + --num-gpus=1 +``` + +Verify head node: +```bash +ray status +``` + +Expected output: +``` +======== Autoscaler status: 2026-01-10 19:43:05.517578 ======== +Node status +--------------------------------------------------------------- +Active: + 1 node_xxxxx +Resources +--------------------------------------------------------------- +Total Usage: + 0.0/20.0 CPU + 0.0/1.0 GPU +``` + +On Workstation container (worker node): +```bash +ray start \ + --address=192.168.200.1:6379 \ + --node-ip-address=192.168.200.2 \ + --num-gpus=1 +``` + +> [!NOTE] +> Adjust `--num-gpus` based on your workstation configuration. In our case, we had 2 GPUs (RTX 6000 Pro + RTX 5090) but only used 1 for this tutorial. + +Verify cluster formation: +```bash +ray status +``` + +Expected output (should show 2+ total GPUs depending on your setup): +``` +======== Autoscaler status: 2026-01-10 19:46:26.274139 ======== +Node status +--------------------------------------------------------------- +Active: + 1 node_xxxxx (head) + 1 node_xxxxx (worker) +Resources +--------------------------------------------------------------- +Total Usage: + 0.0/68.0 CPU + 0.0/2.0 GPU +``` + +--- + +## Step 6. 
Run Validation Test (4B Model) + +Start small model for validation on DGX Spark container: + +```bash +python -m vllm.entrypoints.openai.api_server \ + --model Qwen/Qwen3-4B-Instruct-2507 \ + --tensor-parallel-size 2 \ + --distributed-executor-backend ray \ + --gpu-memory-utilization 0.8 \ + --host 192.168.200.1 \ + --port 8000 +``` + +Test from another terminal: +```bash +curl -X POST "http://192.168.200.1:8000/v1/completions" \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen/Qwen3-4B-Instruct-2507", + "prompt": "Test distributed inference:", + "max_tokens": 500 + }' +``` + +--- + +## Step 7. Run FP8 Quantized Model (30B) + +FP8 quantization provides excellent memory efficiency with good performance: + +```bash +## Stop previous model (Ctrl+C), then start FP8 30B model +python -m vllm.entrypoints.openai.api_server \ + --model Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 \ + --tensor-parallel-size 2 \ + --distributed-executor-backend ray \ + --gpu-memory-utilization 0.8 \ + --host 192.168.200.1 \ + --port 8000 +``` + +**Benefits of FP8:** +- Memory efficiency: Reduced footprint compared to BF16 +- Performance: 341+ tok/s demonstrated +- Hardware compatibility: Fully supported on Blackwell GB10 + +--- + +## Step 8. Run Large Model (72B) + +This step demonstrates the real power of distributed inference: running a model that **exceeds the memory capacity of any single GPU**. + +| Component | Available VRAM | Sufficient for 72B? | +|-----------|---------------|---------------------| +| DGX Spark | 128 GB | No (~136GB needed) | +| RTX 6000 Pro | 96 GB | No (~136GB needed) | +| **Combined Cluster** | **224 GB** | **Yes** | + +The Qwen2.5-72B-Instruct model requires ~136GB in BF16 precision - impossible to run on either GPU alone. This is where our RDMA cluster shines, aggregating memory across both systems. 
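A back-of-the-envelope version of that table (weights only; KV cache, activations, and framework overhead come on top, and the exact figure depends on the model's true parameter count):

```python
# Rough weight-memory estimate. Nominal "B" parameter counts are used;
# real totals (e.g., Qwen2.5-72B's ~72.7B params) run slightly higher.

def weights_gb(params_billion, bytes_per_param):
    """Approximate weight footprint in GB (1 GB = 10**9 bytes)."""
    return params_billion * bytes_per_param

# BF16 (2 bytes/param): ~144 GB of weights alone, beyond either the
# 128 GB DGX Spark or the 96 GB RTX 6000 Pro on its own.
print(weights_gb(72, 2))  # 144
# FP8 (1 byte/param) halves the footprint, which is why the 30B FP8
# model in Step 7 fits so comfortably.
print(weights_gb(72, 1))  # 72
```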
+ +Memory-optimized configuration for 136GB model: + +```bash +python -m vllm.entrypoints.openai.api_server \ + --model Qwen/Qwen2.5-72B-Instruct \ + --tensor-parallel-size 2 \ + --distributed-executor-backend ray \ + --gpu-memory-utilization 0.85 \ + --host 192.168.200.1 \ + --port 8000 \ + --max-model-len 2048 \ + --max-num-seqs 8 \ + --disable-sliding-window \ + --enforce-eager +``` + +**Memory optimization parameters:** +- `--gpu-memory-utilization 0.85`: Uses 85% of GPU memory +- `--max-model-len 2048`: Limits context length to save memory +- `--max-num-seqs 8`: Reduces concurrent sequences +- `--disable-sliding-window`: Disables memory-intensive sliding window attention +- `--enforce-eager`: Uses eager execution (saves memory) + +Test 72B model: +```bash +curl -X POST "http://192.168.200.1:8000/v1/chat/completions" \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen/Qwen2.5-72B-Instruct", + "messages": [ + {"role": "user", "content": "Explain the benefits of RDMA for AI workloads in one paragraph."} + ], + "max_tokens": 500 + }' +``` + +--- + +## Step 9. Monitor RDMA Traffic + +Monitor RDMA activity during inference: + +```bash +## Run on both systems (separate terminals) +watch -n 0.5 " +echo '=== RDMA Counters ==='; +echo -n 'TX: '; cat /sys/class/infiniband/rocep1s0f0/ports/1/counters/port_xmit_data; +echo -n 'RX: '; cat /sys/class/infiniband/rocep1s0f0/ports/1/counters/port_rcv_data; +echo 'Timestamp: '; date; +" +``` + +During inference, you'll see counters increasing as tensors are communicated between GPUs. 
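To turn the raw counters into a bandwidth figure, sample them twice and divide the delta by the interval. A sketch, assuming the standard InfiniBand convention that `port_xmit_data`/`port_rcv_data` count 4-byte units:

```python
def rdma_gbps(counter_start, counter_end, interval_s, unit_bytes=4):
    """Convert two samples of port_xmit_data/port_rcv_data into Gbps.

    IB port data counters report 4-byte units (octets divided by 4),
    so the delta is scaled by unit_bytes before converting to bits.
    """
    delta_bytes = (counter_end - counter_start) * unit_bytes
    return delta_bytes * 8 / interval_s / 1e9

# Example: a delta of 1_250_000_000 counter units over 0.5 s
# -> 5e9 bytes -> 4e10 bits -> 80 Gbps
print(round(rdma_gbps(0, 1_250_000_000, 0.5), 1))  # 80.0
```

In practice you would read the two samples from `/sys/class/infiniband/rocep1s0f0/ports/1/counters/` a fixed interval apart, as the `watch` loop above does.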
+ +--- + +## Performance Benchmarks + +### Benchmark Commands + +**Single-node testing:** +```bash +## On RTX 6000 Pro or DGX Spark +vllm bench latency --model Qwen/Qwen3-30B-A3B-Thinking-2507 --input-len 512 --output-len 2000 --num-iters 10 +vllm bench throughput --model Qwen/Qwen3-30B-A3B-Thinking-2507 --input-len 512 --output-len 2000 --num-prompts 20 +``` + +**Distributed testing:** +```bash +## 30B Model +vllm bench serve --host 192.168.200.1 --port 8000 --random-input-len 512 --random-output-len 2000 --num-prompts 20 --request-rate 2 --model Qwen/Qwen3-30B-A3B-Thinking-2507 + +## 72B Model +vllm bench serve --host 192.168.200.1 --port 8000 --random-input-len 256 --random-output-len 1500 --num-prompts 20 --request-rate 2 --model Qwen/Qwen2.5-72B-Instruct +``` + +### Performance Results Summary + +| Configuration | Avg Latency | Output Throughput | Total Throughput | +|---------------|-------------|-------------------|------------------| +| **RTX 6000 Pro (Single)** | 36.87s | 679.88 tok/s | 853.90 tok/s | +| **DGX Spark (Single)** | 213.12s | 105.10 tok/s | 132.00 tok/s | +| **Distributed RDMA** | 191.09s | 205.83 tok/s | 259.41 tok/s | + +### What This Demonstrates + +The key achievement of this tutorial is successfully running distributed inference across heterogeneous hardware (DGX Spark ARM64 + Linux Workstation AMD64) over RDMA. The distributed setup aggregates GPU memory from both systems, enabling models that wouldn't fit on either device alone. 
+ +### FP8 30B Model Results + +``` +============ Serving Benchmark Result ============ +Successful requests: 20 +Benchmark duration (s): 115.36 +Output token throughput (tok/s): 341.15 +Total Token throughput (tok/s): 429.89 +Mean TTFT (ms): 171.00 +Mean TPOT (ms): 53.08 +================================================== +``` + +--- + +## Troubleshooting + +| Symptom | Cause | Fix | +|---------|-------|-----| +| Ray worker can't connect to head | Firewall blocking port 6379 | `sudo ufw allow 6379/tcp` | +| NCCL timeout during model load | RDMA not working | Verify `ib_send_bw` test passes | +| "Placement group" errors | Ray cluster not formed | Check `ray status` on both nodes | +| OOM during 72B model load | Insufficient memory optimization | Add `--max-model-len 2048 --enforce-eager` | +| SSH connection refused | SSH server not running | `sudo systemctl start ssh` | +| Container can't access RDMA | Missing device mount | Ensure `-v /dev/infiniband:/dev/infiniband` | +| Wrong IP in Ray cluster | Multiple network interfaces | Set `RAY_USE_MULTIPLE_IPS=0` | +| Slow inference performance | NCCL using wrong interface | Verify `NCCL_SOCKET_IFNAME=enp1s0f0np0` | + +--- + +## Credits + +This playbook was contributed by **Csaba Kecskemeti** | [DevQuasar](https://devquasar.com/). 
+ +For a detailed walkthrough and additional context, see the original article: +[Distributed Inference Cluster: DGX Spark + RTX 6000 Pro](https://devquasar.com/ai/edge-ai/distributed-inference-cluster-dgx-spark-rtx-6000-pro/) diff --git a/nvidia/heterogeneous-distributed-inference-rdma/README.md b/nvidia/heterogeneous-distributed-inference-rdma/README.md new file mode 100644 index 0000000..083176b --- /dev/null +++ b/nvidia/heterogeneous-distributed-inference-rdma/README.md @@ -0,0 +1,624 @@ +# Heterogeneous Distributed Inference over RDMA + +> Set up high-speed RDMA networking between DGX Spark (ConnectX-7) and a Linux Workstation (ConnectX-5) for distributed AI inference + +## Table of Contents + +- [Overview](#overview) +- [Instructions](#instructions) +- [Troubleshooting](#troubleshooting) +- [Next Steps](#next-steps) +- [Credits](#credits) + +--- + +## Overview + +## Basic idea + +This playbook guides you through setting up a heterogeneous distributed computing environment using RDMA (Remote Direct Memory Access) over Converged Ethernet (RoCE v2). You will connect a DGX Spark system with a Linux workstation equipped with a Mellanox ConnectX network adapter, enabling high-speed GPU-to-GPU communication for distributed AI workloads. 
+ +With RDMA enabled, data flows directly between GPU memories: + +``` +GPU memory → PCIe → NIC (mlx5) → wire → NIC → PCIe → GPU memory +``` + +**Key properties:** +- **No CPU copies:** Data bypasses system memory +- **No kernel networking stack:** Direct hardware-to-hardware communication +- **Ultra-low latency:** Microsecond-level communication +- **High throughput:** 93+ Gbps validated over 100 Gbps link + +## What you'll accomplish + +- Enable low-latency, zero-copy GPU↔GPU communication between heterogeneous systems +- Configure RoCE v2 networking over 100 Gbps direct QSFP connection +- Validate RDMA performance (93+ Gbps achievable) +- Prepare both systems for multi-node inference and training with NCCL + +## What to know before starting + +- Basic understanding of Linux networking and command line +- Familiarity with network interface configuration (netplan) +- Understanding of PCIe and GPU computing concepts +- Basic knowledge of RDMA/InfiniBand terminology is helpful but not required + +## Prerequisites + +**Node A: DGX Spark** +- GPU: 128 GB unified memory (Grace Blackwell GB10) +- NIC: ConnectX-7 (QSFP56/QSFP112) +- OS: NVIDIA DGX OS (Ubuntu-based, ARM64) + +**Node B: Linux Workstation** +- GPU: NVIDIA GPU with sufficient VRAM (e.g., RTX 6000 Pro, RTX 5090) +- NIC: ConnectX-5 or newer (e.g., MCX516A-CDAT for 100 GbE dual-port) +- OS: Ubuntu 20.04 / 22.04 / 24.04 +- PCIe: Gen4 x16 slot recommended + +**Physical Requirements:** +- One QSFP cable (QSFP56 ↔ QSFP28 compatible, 100 Gbps negotiated) +- Direct connection or dedicated switch + +> [!NOTE] +> **About the hardware used in this tutorial:** We used a ConnectX-5 (MCX516A-CDAT, 100GbE dual-port) on the workstation because that's what we had available. This limits the link speed to 100 Gbps. If you use a ConnectX-7 NIC on the workstation side (matching the DGX Spark), you can achieve up to 200 Gbps. The setup process is the same or very similar - just with higher bandwidth. 
+ +> [!NOTE] +> Interface names (e.g., `enp1s0f0np0`, `rocep1s0f0`) are system-specific and will differ on your hardware. Use these commands to identify your interfaces: +> ```bash +> ## Find RDMA device to network interface mapping +> ibdev2netdev +> +> ## List all network interfaces +> ip link show +> +> ## Show detailed RDMA device info +> ibv_devinfo +> ``` + +## Ancillary files + +All required files for this playbook can be found [here on GitHub](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/heterogeneous-distributed-inference-rdma/) + +- [**test_nccl.py**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/heterogeneous-distributed-inference-rdma/assets/test_nccl.py) - NCCL communication test script + +## Time & risk + +- **Duration:** 2-3 hours including validation and testing + +- **Risk level:** Medium - involves network reconfiguration + +- **Rollback:** Network changes can be reversed by removing netplan configs or IP assignments + +- **Last Updated:** 01/23/2026 + +--- + +## Instructions + +## Step 1. Understand the Architecture + +Your distributed inference system uses **two separate communication planes**: + +| Component | Purpose | Protocol | Latency | +|-----------|---------|----------|---------| +| **Control Plane (Ray)** | Orchestration, scheduling, actor management | TCP/IP (gRPC) | Milliseconds | +| **Data Plane (NCCL)** | High-speed GPU tensor transfers | RoCE v2 (RDMA) | Microseconds | + +Both planes use the same 100 Gbps ConnectX network in this configuration. + +**RoCE vs InfiniBand:** + +| Mode | What it is | Notes | +|------|------------|-------| +| **RoCE v2 (Ethernet)** | RDMA over Ethernet | Recommended for this setup | +| **InfiniBand** | Native IB fabric | Requires IB switches | + +> [!NOTE] +> If your ConnectX-5 is Ethernet-only (not VPI), RoCE v2 is the correct and only supported mode. 
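For intuition about the data plane, here is a rough line-rate estimate of wire time for a tensor payload. This is idealized arithmetic (real NCCL collectives add protocol and synchronization overhead, and the payload size is illustrative):

```python
def wire_time_ms(payload_mb, link_gbps):
    """Idealized time to move payload_mb megabytes over a link_gbps link.

    payload_mb MB = payload_mb * 8 Mbit; dividing by link_gbps (Gbit/s,
    i.e., 1000 Mbit/s) yields milliseconds.
    """
    return payload_mb * 8 / link_gbps

# A hypothetical 100 MB tensor exchange per step:
print(wire_time_ms(100, 100))  # 8.0 ms at 100 Gbps (ConnectX-5 link)
print(wire_time_ms(100, 200))  # 4.0 ms at 200 Gbps (ConnectX-7 to ConnectX-7)
```

Because this cost is paid on every tensor-parallel collective, the data plane's bandwidth directly bounds distributed inference throughput, while the control plane's millisecond-scale latency matters far less.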
+ +**Core software components (required on both nodes):** + +| Component | Purpose | Notes | +|-----------|---------|--------| +| `mlx5_core` | Main NIC driver | Kernel module | +| `mlx5_ib` | RDMA support | Kernel module | +| `rdma-core` | Userspace RDMA stack | Package: rdma-core | +| `infiniband-diags` | Diagnostics (`ibstat`) | Package: infiniband-diags | +| `mstflint` | Firmware inspection | Package: mstflint | +| `NCCL` | Multi-GPU collectives | Built into PyTorch/frameworks | + +--- + +## Step 2. Set Up the Workstation (ConnectX-5) + +**Hardware & BIOS checklist:** + +1. Install the ConnectX card in a PCIe Gen3/4 x16 slot (CPU-direct, not via chipset) + +2. **Cooling Requirements:** ConnectX-5/7 100GbE cards are primarily designed for server environments with active cooling. In a workstation, ensure adequate case airflow directed at the card, and consider adding a PCIe slot fan for sustained high-bandwidth workloads. + +3. **BIOS settings:** + ``` + Above 4G Decoding: Enabled + ASPM (Power Management): Disabled + PCIe Speed: Auto / Gen4 + SR-IOV: Enabled (optional, for virtualization) + ``` + +Verify PCIe detection: + +```bash +## Check if ConnectX card is detected +lspci -nn | grep -i mellanox +``` + +Expected output: +``` +03:00.0 Ethernet controller [0200]: Mellanox MT27800 [ConnectX-5] [15b3:1017] +03:00.1 Ethernet controller [0200]: Mellanox MT27800 [ConnectX-5] [15b3:1017] +``` + +## Step 3. 
Install Drivers on Workstation + +Check if mlx5 drivers are already installed: + +```bash +## Check for existing Mellanox drivers +lsmod | grep mlx5 +``` + +**Option 1: Ubuntu Inbox Drivers (Recommended)** + +```bash +## Update package list +sudo apt update + +## Install kernel modules +sudo apt install linux-modules-extra-$(uname -r) + +## Load drivers +sudo modprobe mlx5_core mlx5_ib +``` + +**Option 2: NVIDIA MLNX_OFED (If inbox drivers insufficient)** + +```bash +## Download from: https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/ +wget https://content.mellanox.com/ofed/MLNX_OFED-24.01-0.3.3.1/MLNX_OFED_LINUX-24.01-0.3.3.1-ubuntu24.04-x86_64.tgz + +## Extract and install +tar -xzf MLNX_OFED_LINUX-*.tgz +cd MLNX_OFED_LINUX-* +sudo ./mlnxofedinstall --upstream-libs --dpdk +sudo /etc/init.d/openibd restart +``` + +## Step 4. Install Required Packages on Workstation + +```bash +## Update package list +sudo apt update + +## Install RDMA and networking packages +sudo apt install -y \ + rdma-core \ + ibverbs-utils \ + rdmacm-utils \ + libibmad5 \ + infiniband-diags \ + perftest \ + mstflint \ + ethtool \ + ibutils +``` + +## Step 5. Verify Workstation RDMA Stack + +Verify kernel drivers are loaded: + +```bash +## Check loaded drivers +lsmod | grep mlx5 +``` + +You must see `mlx5_core` and `mlx5_ib`. 
If missing, load them: + +```bash +## Load drivers manually +sudo modprobe mlx5_core mlx5_ib + +## Make permanent +echo 'mlx5_core' | sudo tee -a /etc/modules +echo 'mlx5_ib' | sudo tee -a /etc/modules +``` + +Validate RDMA stack: + +```bash +## Show RDMA device info +ibv_devinfo +``` + +Expected output: +``` +hca_id: mlx5_0 + transport: InfiniBand (0) + fw_ver: 16.35.2000 + node_guid: xxxx:xxxx:xxxx:xxxx + vendor_id: 0x02c9 + vendor_part_id: 4119 + phys_port_cnt: 1 +``` + +```bash +## Show adapter status +ibstat +``` + +Validate PCIe bandwidth (replace `03:00.0` with your actual bus address): + +```bash +## Check PCIe link speed and width +sudo lspci -s 03:00.0 -vv | grep -E "LnkCap|LnkSta" +``` + +Target output: +``` +LnkCap: Port #0, Speed 16GT/s, Width x16 +LnkSta: Speed 16GT/s (ok), Width x16 (ok) +``` + +--- + +## Step 6. Set Up DGX Spark (ConnectX-7) + +**Fix repository signature issues (if needed):** + +If you encounter GPG key errors: + +```bash +## Remove problematic repository +sudo rm -f /etc/apt/sources.list.d/*ffmpeg* 2>/dev/null || true + +## Download and install updated GPG key +curl -fsSL https://workbench.download.nvidia.com/stable/linux/gpgkey | \ +gpg --dearmor | sudo tee /usr/share/keyrings/ai-workbench-desktop-key.gpg > /dev/null + +## Update package list +sudo apt update +``` + +## Step 7. Install Required Packages on DGX Spark + +```bash +## Update package list +sudo apt update + +## Install RDMA packages +sudo apt install -y \ + infiniband-diags \ + rdma-core \ + ibverbs-utils \ + mstflint \ + perftest \ + ethtool +``` + +> [!NOTE] +> DOCA-OFED is **not required** for DGX Spark systems. The standard Ubuntu packages provide all necessary functionality. + +## Step 8. Verify DGX Spark Interfaces + +Verify network interfaces: + +```bash +## Show network interfaces +ip link show | grep -E "enp|ib" +``` + +You should see ConnectX-7 ports like `enp1s0f0np0`, `enp1s0f1np1`, etc. 
+ +Verify RDMA interfaces: + +```bash +## Show RDMA device to interface mapping +ibdev2netdev +``` + +Example output: +``` +rocep1s0f0 port 1 ==> enp1s0f0np0 (Down) +rocep1s0f1 port 1 ==> enp1s0f1np1 (Down) +roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down) +roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down) +``` + +Check PCIe topology: + +```bash +## Show GPU and NIC topology +nvidia-smi topo -m +``` + +This shows how GPUs and NICs are interconnected via PCIe. + +--- + +## Step 9. Connect the QSFP Cable + +**Cable type:** QSFP56 or QSFP28 cable (they are cross-compatible at 100 Gbps) + +**Connection procedure:** +1. Identify ports: DGX Spark has 2 physical QSFP ports with 4 logical interfaces +2. Connect QSFP cable between any available ports +3. Cable compatibility: QSFP56 ↔ QSFP28 works (100 Gbps negotiated) +4. Link detection: Should be automatic within 10-20 seconds + +Verify physical link detection on DGX Spark: + +```bash +## Check link status +ibdev2netdev +``` + +Expected output (after cable connection): +``` +rocep1s0f0 port 1 ==> enp1s0f0np0 (Up) +rocep1s0f1 port 1 ==> enp1s0f1np1 (Down) +roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up) +roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down) +``` + +> [!NOTE] +> If none of the interfaces are showing as 'Up', please check the QSFP cable connection, reboot the systems and try again. + +Verify on Workstation: + +```bash +## Check link status +ibdev2netdev +ip link show | grep -E "enp|mlx" +``` + +--- + +## Step 10. Configure Network Interfaces + +**Network Configuration:** +- **RDMA Network:** 192.168.200.0/24 +- **DGX Spark:** 192.168.200.1 +- **Workstation:** 192.168.200.2 +- **MTU:** 9000 (jumbo frames for optimal RDMA performance) + +> [!NOTE] +> The management IP addresses shown in examples (192.168.1.x) are placeholders. Replace these with your actual network IP addresses that you see when running `ip addr show`. 
**Option 1: Temporary Configuration (Testing)**

> [!NOTE]
> These commands are temporary and will be lost on reboot!

On DGX Spark:
```bash
## Configure RDMA interface (use interface showing "Up" from ibdev2netdev)
sudo ip addr add 192.168.200.1/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
sudo ip link set enp1s0f0np0 mtu 9000
```

On Workstation:
```bash
## Configure RDMA interface
sudo ip addr add 192.168.200.2/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
sudo ip link set enp1s0f0np0 mtu 9000
```

**Option 2: Permanent Configuration**

First, identify your active internet interface on both systems:

```bash
## Find your internet interface
ip addr show | grep -A 2 "inet.*scope global"
ip link show | grep "state UP"
```

On DGX Spark:
```bash
## Create netplan configuration (REPLACE interface names with YOUR actual interfaces!)
## Keep your existing internet uplink (Wi-Fi shown here) so you don't lose connectivity
sudo tee /etc/netplan/99-rdma.yaml > /dev/null <<EOF
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.200.1/24
      mtu: 9000
  wifis:
    <wifi-interface>:
      dhcp4: true
      access-points:
        "<your-ssid>":
          password: "<your-wifi-password>"
EOF

## Set permissions and apply
sudo chmod 600 /etc/netplan/99-rdma.yaml
sudo netplan apply
```

On Workstation:
```bash
## Create netplan configuration (REPLACE interface names with YOUR actual interfaces!)
sudo tee /etc/netplan/99-rdma.yaml > /dev/null <<EOF
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.200.2/24
      mtu: 9000
EOF

## Set permissions and apply
sudo chmod 600 /etc/netplan/99-rdma.yaml
sudo netplan apply
```

> [!IMPORTANT]
> Before applying netplan, identify your active internet interface to avoid losing connectivity. Interface names may change after applying netplan (e.g., `mlx5_0` to `rocep1s0f0`). Always verify current device names with `ibdev2netdev`.

## Step 11. Verify Network Connectivity

Test basic connectivity:

```bash
## From DGX Spark
ping -c 4 192.168.200.2

## From Workstation
ping -c 4 192.168.200.1
```

Expected output:
```
PING 192.168.200.2 (192.168.200.2) 56(84) bytes of data.
64 bytes from 192.168.200.2: icmp_seq=1 time=0.xxx ms
...
4 packets transmitted, 4 received, 0% packet loss
```

---

## Step 12. 
Test RDMA Bandwidth + +Identify correct device names: + +```bash +## Check available RDMA devices +ibv_devinfo +ls /sys/class/infiniband/ +``` + +**Device name mapping:** +- **DGX Spark:** Use `rocep1s0f0` or `roceP2p1s0f0` +- **Workstation:** Use `mlx5_0` or `mlx5_1` (or `rocep1s0f0` after persistent config) + +Run bandwidth test: + +On DGX Spark (server) - Start first: +```bash +## Start RDMA bandwidth test server +ib_send_bw -d rocep1s0f0 +``` + +On Workstation (client) - Connect to server: +```bash +## Connect to server and run bandwidth test +ib_send_bw -d rocep1s0f0 192.168.200.1 +``` + +Example successful output: +``` +--------------------------------------------------------------------------------------- + Send BW Test + Dual-port : OFF Device : rocep1s0f0 + Number of qps : 1 Transport type : IB + Connection type : RC Using SRQ : OFF + Link type : Ethernet +--------------------------------------------------------------------------------------- + #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps] + 65536 1000 11664.71 11664.25 0.186628 +--------------------------------------------------------------------------------------- +``` + +**Performance Analysis:** +- 11,664 MB/sec = ~93.3 Gbps +- Achieves >93% of 100 Gbps line rate +- Link type: Ethernet confirms RoCE v2 is working + +--- + +## Step 13. Configure Environment Variables for NCCL + +Add to both systems (persistent across reboots): + +```bash +## Add RDMA configuration to bashrc +echo '# RDMA Network Configuration' >> ~/.bashrc +echo 'export UCX_NET_DEVICES=enp1s0f0np0' >> ~/.bashrc +echo 'export NCCL_SOCKET_IFNAME=enp1s0f0np0' >> ~/.bashrc +echo 'export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0' >> ~/.bashrc + +## Apply to current session +source ~/.bashrc +``` + +Verification: +```bash +## Check environment variables +echo $UCX_NET_DEVICES +echo $NCCL_SOCKET_IFNAME +## Both should show: enp1s0f0np0 +``` + +--- + +## Step 14. 
Final Validation

At this point, you should have achieved:

- [ ] Physical link detected - `ibdev2netdev` shows "(Up)" status
- [ ] IP connectivity working - `ping 192.168.200.x` succeeds
- [ ] MTU set to 9000 - Jumbo frames enabled
- [ ] RDMA bandwidth >90 Gbps validated
- [ ] RoCE v2 confirmed - Link type: Ethernet
- [ ] Environment variables set for NCCL

Your RDMA setup is **fully operational** and ready for distributed AI workloads!

---

## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| `ibdev2netdev` shows no devices | mlx5 drivers not loaded | `sudo modprobe mlx5_core mlx5_ib` |
| Interface shows "(Down)" after connecting cable | Link not negotiated | Check cable, try different port, reboot |
| Ping fails between nodes | IP not configured or wrong interface | Verify `ip addr show`, check interface names |
| RDMA bandwidth <80 Gbps | MTU not set to 9000 | `sudo ip link set enp1s0f0np0 mtu 9000` |
| "mlx5_0 not found" error | Device name changed after netplan | Run `ibdev2netdev` to find current name |
| Permission denied on `/dev/infiniband` | Missing RDMA permissions | Run with `sudo` or add user to `rdma` group |
| GPG key errors on DGX Spark | Expired NVIDIA repository key | See Step 6 for fix |
| Lost internet after netplan apply | Wrong interface in netplan config | Identify correct interface with `ip link show` first |

---

## Next Steps

Continue to [**Distributed Inference Guide**](DISTRIBUTED-INFERENCE.md) to:
- Set up SSH and hostname configuration
- Configure NCCL for multi-node communication
- Deploy RDMA-enabled containers with Ray cluster
- Run distributed inference with vLLM
- Benchmark performance across configurations

---

## Credits

This playbook was contributed by **Csaba Kecskemeti** | [DevQuasar](https://devquasar.com/). 
+ +For a detailed walkthrough and additional context, see the original article: +[Distributed Inference Cluster: DGX Spark + RTX 6000 Pro](https://devquasar.com/ai/edge-ai/distributed-inference-cluster-dgx-spark-rtx-6000-pro/) diff --git a/nvidia/heterogeneous-distributed-inference-rdma/assets/test_nccl.py b/nvidia/heterogeneous-distributed-inference-rdma/assets/test_nccl.py new file mode 100644 index 0000000..469188a --- /dev/null +++ b/nvidia/heterogeneous-distributed-inference-rdma/assets/test_nccl.py @@ -0,0 +1,96 @@ +#!/usr/bin/env python3 +""" +NCCL Communication Test Script + +Tests NCCL (NVIDIA Collective Communications Library) communication over RDMA +between two nodes in a distributed setup. + +Usage: + On Node 0 (head): python test_nccl.py --rank 0 + On Node 1 (worker): python test_nccl.py --rank 1 + +Requirements: + - PyTorch with CUDA support + - NCCL backend available + - RDMA network configured between nodes +""" + +import os +import torch +import torch.distributed as dist +import argparse + + +def test_nccl_communication(): + parser = argparse.ArgumentParser(description='Test NCCL communication over RDMA') + parser.add_argument('--rank', type=int, required=True, + help='Rank of this process (0 for head, 1 for worker)') + parser.add_argument('--world_size', type=int, default=2, + help='Total number of processes') + parser.add_argument('--master_addr', type=str, default='192.168.200.1', + help='IP address of the head node') + parser.add_argument('--master_port', type=str, default='29500', + help='Port for distributed communication') + parser.add_argument('--interface', type=str, default='enp1s0f0np0', + help='Network interface for NCCL socket') + args = parser.parse_args() + + # Set environment variables for distributed communication + os.environ['RANK'] = str(args.rank) + os.environ['WORLD_SIZE'] = str(args.world_size) + os.environ['MASTER_ADDR'] = args.master_addr + os.environ['MASTER_PORT'] = args.master_port + os.environ['NCCL_SOCKET_IFNAME'] = 
args.interface

    print("=" * 60)
    print("NCCL Communication Test")
    print("=" * 60)
    print(f"Rank: {args.rank}")
    print(f"World Size: {args.world_size}")
    print(f"Master: {args.master_addr}:{args.master_port}")
    print(f"Interface: {args.interface}")
    print("=" * 60)

    print(f"\n[Rank {args.rank}] Initializing process group...")

    # Initialize the process group with NCCL backend
    dist.init_process_group(
        backend='nccl',
        rank=args.rank,
        world_size=args.world_size
    )

    print(f"[Rank {args.rank}] Process group initialized successfully!")
    print(f"[Rank {args.rank}] Distributed rank: {dist.get_rank()}/{dist.get_world_size()}")

    # Create a tensor on GPU
    device = torch.device('cuda:0')
    tensor = torch.ones(10, device=device) * (args.rank + 1)

    print(f"\n[Rank {args.rank}] Before all_reduce: {tensor.tolist()}")

    # Perform all-reduce operation (sum across all ranks)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

    print(f"[Rank {args.rank}] After all_reduce: {tensor.tolist()}")

    # Calculate expected result: each rank r contributes (r + 1)
    expected = sum(range(1, args.world_size + 1))
    expected_tensor = torch.ones(10) * expected
    print(f"[Rank {args.rank}] Expected result: {expected_tensor.tolist()}")

    # Verify result
    if torch.allclose(tensor.cpu(), expected_tensor):
        print(f"\n[Rank {args.rank}] ✓ All-reduce test PASSED!")
    else:
        print(f"\n[Rank {args.rank}] ✗ All-reduce test FAILED!")

    # Cleanup
    dist.destroy_process_group()

    print(f"[Rank {args.rank}] Test completed.")
    print("=" * 60)


if __name__ == "__main__":
    test_nccl_communication()
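The PASS/FAIL check in `test_nccl.py` rests on a simple identity: rank `r` (0-indexed) contributes a tensor filled with `r + 1`, so a SUM all-reduce leaves `1 + 2 + ... + world_size` in every element. A standalone sketch of that arithmetic (plain Python, no GPUs or `torch` required; the helper name is illustrative and not part of the script above):

```python
def expected_allreduce_value(world_size: int) -> int:
    """Value every tensor element should hold after the SUM all-reduce."""
    # Rank r contributes (r + 1), so the reduction sums 1 + 2 + ... + world_size.
    return world_size * (world_size + 1) // 2


# For the two-node DGX Spark + workstation cluster in this guide:
print(expected_allreduce_value(2))  # 1 + 2 = 3, matching the [3.0, ...] tensor in the test output
```

Scaling the cluster to more workers changes only `--world_size`; the same identity predicts the expected tensor (e.g., 10 for four nodes).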