# Distributed Inference Guide

> Deploy and run distributed AI inference across a DGX Spark and a Linux workstation using vLLM and Ray

## Table of Contents

- [Overview](#overview)
- [Instructions](#instructions)
- [Performance Benchmarks](#performance-benchmarks)
- [Troubleshooting](#troubleshooting)
- [Credits](#credits)

---

## Overview

## Basic idea

This guide walks you through deploying distributed inference across your heterogeneous RDMA cluster. Using Ray for orchestration and vLLM for inference, you can run large language models that exceed the memory capacity of any single GPU by distributing them across your DGX Spark and Linux workstation.

**Architecture:**

```
┌─────────────────────────────────┐      ┌───────────────────────────────────┐
│            DGX SPARK            │      │            WORKSTATION            │
│     (Grace Blackwell GB10)      │      │     (RTX 6000 Pro / RTX 5090)     │
│                                 │      │                                   │
│  ┌───────────────────────────┐  │      │   ┌───────────────────────────┐   │
│  │      vLLM Head Node       │  │      │   │        vLLM Worker        │   │
│  │   (API Server, Rank 0)    │  │      │   │     (Tensor Parallel)     │   │
│  └───────────────────────────┘  │      │   └───────────────────────────┘   │
│                                 │      │                                   │
│  ┌───────────────────────────┐  │      │   ┌───────────────────────────┐   │
│  │      Ray Head (6379)      │◄─┼──────┼───│        Ray Worker         │   │
│  └───────────────────────────┘  │      │   └───────────────────────────┘   │
│                                 │      │                                   │
│  ┌───────────────────────────┐  │      │   ┌───────────────────────────┐   │
│  │      NCCL over RDMA       │◄─┼══════┼───│      NCCL over RDMA       │   │
│  │       192.168.200.1       │  │      │   │       192.168.200.2       │   │
│  └───────────────────────────┘  │      │   └───────────────────────────┘   │
└─────────────────────────────────┘      └───────────────────────────────────┘
```


## What you'll accomplish

- Configure SSH and hostname resolution between nodes
- Test NCCL communication over RDMA
- Deploy RDMA-enabled Docker containers
- Establish a Ray cluster across both systems
- Run distributed inference with vLLM
- Benchmark performance across different configurations

## What to know before starting

- Familiarity with Docker and container networking
- Understanding of distributed computing concepts (Ray, tensor parallelism)
- Basic knowledge of LLM inference serving

## Prerequisites

- Completed [RDMA Network Setup](README.md) with validated 90+ Gbps bandwidth
- Docker installed on both systems: `docker --version`
- NVIDIA Container Toolkit installed
- Hugging Face account for model access (some models require authentication)

## Time & risk

- **Duration:** 1-2 hours including testing
- **Risk level:** Low - uses containers, non-destructive
- **Rollback:** Stop containers to revert
- **Last Updated:** 01/23/2026

---

## Instructions

## Step 1. Configure Hostnames

Add hostname aliases on both systems:

```bash
# Add hostname resolution on both DGX Spark and Workstation
sudo tee -a /etc/hosts > /dev/null <<EOF
192.168.200.1 dgx-spark
192.168.200.2 workstation
EOF
```
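
To confirm the aliases resolve before moving on, here is a minimal check (a sketch; `getent hosts` from a shell works just as well):

```python
import socket

# Both aliases should resolve to the RDMA-link addresses added above
for host, expected in [("dgx-spark", "192.168.200.1"), ("workstation", "192.168.200.2")]:
    resolved = socket.gethostbyname(host)
    print(f"{host} -> {resolved} ({'OK' if resolved == expected else 'MISMATCH'})")
```
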

## Step 2. Set Up SSH Access

Install the SSH server if needed (it is often missing on workstations):

```bash
# Check SSH server status
sudo systemctl status ssh

# If not installed:
sudo apt update
sudo apt install openssh-server
sudo systemctl start ssh
sudo systemctl enable ssh
```

Configure passwordless SSH between nodes:

On DGX Spark:

```bash
# Check if an SSH key exists
ls ~/.ssh/id_*.pub

# If no key exists, generate one:
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519

# Copy the key to the workstation
ssh-copy-id <your-username>@workstation
```

On Workstation:

```bash
# Check if an SSH key exists
ls ~/.ssh/id_*.pub

# If no key exists, generate one:
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519

# Copy the key to DGX Spark
ssh-copy-id <your-username>@dgx-spark
```

Verify passwordless SSH:

```bash
# From DGX Spark
ssh <your-username>@workstation hostname
# Expected output: workstation

# From Workstation
ssh <your-username>@dgx-spark hostname
# Expected output: dgx-spark
```

---

## Step 3. Test NCCL Communication

Create the NCCL test script on both systems:

```bash
# Create the test script
cat > test_nccl.py << 'EOF'
import os
import torch
import torch.distributed as dist
import argparse

def test_nccl_communication():
    parser = argparse.ArgumentParser()
    parser.add_argument('--rank', type=int, required=True)
    parser.add_argument('--world_size', type=int, default=2)
    parser.add_argument('--master_addr', type=str, default='192.168.200.1')
    parser.add_argument('--master_port', type=str, default='29500')
    args = parser.parse_args()

    os.environ['RANK'] = str(args.rank)
    os.environ['WORLD_SIZE'] = str(args.world_size)
    os.environ['MASTER_ADDR'] = args.master_addr
    os.environ['MASTER_PORT'] = args.master_port
    os.environ['NCCL_SOCKET_IFNAME'] = 'enp1s0f0np0'

    print(f"Initializing process group - Rank: {args.rank}, World Size: {args.world_size}")
    print(f"Master: {args.master_addr}:{args.master_port}")

    dist.init_process_group(backend='nccl', rank=args.rank, world_size=args.world_size)
    print(f"Process group initialized - Rank: {dist.get_rank()}/{dist.get_world_size()}")

    # Each rank contributes a tensor of (rank + 1)s; the all-reduce sums them across ranks
    device = torch.device('cuda:0')
    tensor = torch.ones(10, device=device) * (args.rank + 1)
    print(f"Rank {args.rank} - Before allreduce: {tensor}")

    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f"Rank {args.rank} - After allreduce: {tensor}")
    print(f"Expected result: {torch.ones(10) * (1 + 2)}")

    dist.destroy_process_group()
    print(f"Rank {args.rank} - Test completed successfully!")

if __name__ == "__main__":
    test_nccl_communication()
EOF
```

Run the NCCL test in Docker containers.

On DGX Spark (start first):

```bash
docker run -it --runtime=nvidia --gpus all --network host --ipc=host \
  --privileged --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /dev/infiniband:/dev/infiniband -v /sys:/sys:ro \
  -e NCCL_IB_DISABLE=0 -e NCCL_IB_HCA=rocep1s0f0:1 -e NCCL_IB_GID_INDEX=3 \
  -e NCCL_SOCKET_IFNAME=enp1s0f0np0 -v $(pwd):/workspace \
  nvcr.io/nvidia/vllm:25.09-py3 python /workspace/test_nccl.py --rank 0
```

On Workstation (connect to the DGX Spark):

```bash
docker run -it --runtime=nvidia --gpus all --network host --ipc=host \
  --privileged --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /dev/infiniband:/dev/infiniband -v /sys:/sys:ro \
  -e NCCL_IB_DISABLE=0 -e NCCL_IB_HCA=rocep1s0f0:1 -e NCCL_IB_GID_INDEX=3 \
  -e NCCL_SOCKET_IFNAME=enp1s0f0np0 -v $(pwd):/workspace \
  nvcr.io/nvidia/vllm:25.09-py3 python /workspace/test_nccl.py --rank 1
```

**Success indicators:**

- Output shows `NCCL INFO Using network IBext_v10` (add `-e NCCL_DEBUG=INFO` to the `docker run` command to surface NCCL's transport logs)
- All-reduce operation completes successfully
- Final tensors show the expected sum (3.0 for each element)

---

## Step 4. Start RDMA-Enabled Containers

On DGX Spark:

```bash
docker run -it --runtime=nvidia --gpus all --network host --ipc=host --shm-size=10g \
  --privileged \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v /dev/infiniband:/dev/infiniband \
  -v /sys:/sys:ro \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -e GLOO_SOCKET_IFNAME=enp1s0f0np0 \
  -e NCCL_IB_DISABLE=0 \
  -e NCCL_IB_HCA=rocep1s0f0:1 \
  -e NCCL_IB_GID_INDEX=3 \
  -e NCCL_SOCKET_IFNAME=enp1s0f0np0 \
  -e RAY_USE_MULTIPLE_IPS=0 \
  -e RAY_NODE_IP_ADDRESS=192.168.200.1 \
  -e RAY_OVERRIDE_NODE_IP=192.168.200.1 \
  -e VLLM_HOST_IP=192.168.200.1 \
  nvcr.io/nvidia/vllm:25.09-py3 bash
```

On Workstation:

```bash
docker run -it --runtime=nvidia --gpus all --network host --ipc=host --shm-size=10g \
  --privileged \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v /dev/infiniband:/dev/infiniband \
  -v /sys:/sys:ro \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -e GLOO_SOCKET_IFNAME=enp1s0f0np0 \
  -e NCCL_IB_DISABLE=0 \
  -e NCCL_IB_HCA=rocep1s0f0:1 \
  -e NCCL_IB_GID_INDEX=3 \
  -e NCCL_SOCKET_IFNAME=enp1s0f0np0 \
  -e RAY_USE_MULTIPLE_IPS=0 \
  -e RAY_NODE_IP_ADDRESS=192.168.200.2 \
  -e RAY_OVERRIDE_NODE_IP=192.168.200.2 \
  nvcr.io/nvidia/vllm:25.09-py3 bash
```

**Key parameters explained:**

- `--runtime=nvidia`: Required for GPU access
- `--network host`: Uses host networking (required for RDMA)
- `--privileged`: Needed for InfiniBand device access
- `--ulimit memlock=-1`: Unlimited memory locking for RDMA
- `-v /dev/infiniband:/dev/infiniband`: Mounts RDMA devices
- `NCCL_IB_HCA=rocep1s0f0:1`: Tells NCCL to use the specific RDMA device
- `RAY_USE_MULTIPLE_IPS=0`: Prevents Ray IP detection issues

A quick in-container sanity check follows below.

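
Before wiring up Ray, it is worth confirming the container actually sees the RDMA devices. A minimal sketch (the sysfs paths are the standard kernel locations; the `rocep1s0f0` name and port number follow this guide's setup):

```python
import os

# The device files and sysfs entries mounted into the container should be visible
print("/dev/infiniband:", os.listdir("/dev/infiniband"))   # expect uverbs*, rdma_cm, ...
print("HCAs:", os.listdir("/sys/class/infiniband"))        # expect rocep1s0f0

# Port state 4 means ACTIVE
with open("/sys/class/infiniband/rocep1s0f0/ports/1/state") as f:
    print("Port state:", f.read().strip())                 # expect "4: ACTIVE"
```
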
---

## Step 5. Establish Ray Cluster

On the DGX Spark container (head node):

```bash
ray start --head \
  --node-ip-address=192.168.200.1 \
  --port=6379 \
  --dashboard-host=192.168.200.1 \
  --dashboard-port=8265 \
  --num-gpus=1
```

Verify the head node:

```bash
ray status
```

Expected output:

```
======== Autoscaler status: 2026-01-10 19:43:05.517578 ========
Node status
---------------------------------------------------------------
Active:
 1 node_xxxxx
Resources
---------------------------------------------------------------
Total Usage:
 0.0/20.0 CPU
 0.0/1.0 GPU
```

On the Workstation container (worker node):

```bash
ray start \
  --address=192.168.200.1:6379 \
  --node-ip-address=192.168.200.2 \
  --num-gpus=2
```

Verify cluster formation:

```bash
ray status
```

Expected output (should show 3 GPUs in total):

```
======== Autoscaler status: 2026-01-10 19:46:26.274139 ========
Node status
---------------------------------------------------------------
Active:
 1 node_xxxxx (head)
 1 node_xxxxx (worker)
Resources
---------------------------------------------------------------
Total Usage:
 0.0/68.0 CPU
 0.0/3.0 GPU
```
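
The same verification can be scripted against Ray's Python API from either container; a minimal sketch (`address="auto"` attaches to the running cluster rather than starting a new one):

```python
import ray

# Attach to the existing cluster instead of starting a local instance
ray.init(address="auto")

print("Nodes:", len(ray.nodes()))                   # expect 2
print("GPUs:", ray.cluster_resources().get("GPU"))  # expect 3.0
```
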

---

## Step 6. Run Validation Test (4B Model)

Start a small model for validation in the DGX Spark container:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.8 \
  --host 192.168.200.1 \
  --port 8000
```

Test from another terminal:

```bash
curl -X POST "http://192.168.200.1:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B-Instruct-2507",
    "prompt": "Test distributed inference:",
    "max_tokens": 500
  }'
```
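
Because vLLM exposes an OpenAI-compatible endpoint, the official `openai` Python client works against it unchanged. A minimal sketch (assumes `pip install openai`; the API key is a placeholder, since vLLM only validates it when started with `--api-key`):

```python
from openai import OpenAI

# Point the client at the vLLM server; the key is a placeholder (see note above)
client = OpenAI(base_url="http://192.168.200.1:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="Qwen/Qwen3-4B-Instruct-2507",
    prompt="Test distributed inference:",
    max_tokens=100,
)
print(response.choices[0].text)
```
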

---

## Step 7. Run FP8 Quantized Model (30B)

FP8 quantization provides excellent memory efficiency with good performance:

```bash
# Stop the previous model (Ctrl+C), then start the FP8 30B model
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.8 \
  --host 192.168.200.1 \
  --port 8000
```

**Benefits of FP8:**

- Memory efficiency: roughly half the weight footprint of BF16 (see the sizing sketch below)
- Performance: 341+ tok/s demonstrated (see the benchmark results below)
- Hardware compatibility: fully supported on Blackwell GB10

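
A rough sizing sketch of where the savings come from (the ~30.5B parameter count is an assumption based on the model name; weights only, ignoring KV cache and activations):

```python
# Weight memory scales linearly with bytes per parameter
params = 30.5e9  # assumed parameter count for the 30B model
for fmt, nbytes in [("BF16", 2), ("FP8", 1)]:
    print(f"{fmt}: ~{params * nbytes / 2**30:.0f} GiB of weights")
# BF16: ~57 GiB, FP8: ~28 GiB
```
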
---

## Step 8. Run Large Model (72B)

This step demonstrates the real power of distributed inference: running a model that **exceeds the memory capacity of any single GPU**.

| Component | Available VRAM | Sufficient for 72B? |
|-----------|----------------|---------------------|
| DGX Spark | 128 GB | No (~136 GB needed) |
| RTX 6000 Pro | 96 GB | No (~136 GB needed) |
| **Combined Cluster** | **224 GB** | **Yes** |

The Qwen2.5-72B-Instruct model requires ~136 GB in BF16 precision - impossible to run on either GPU alone. This is where the RDMA cluster shines, aggregating memory across both systems.
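
The ~136 GB figure is easy to sanity-check from first principles (a sketch; Qwen2.5-72B has roughly 72.7B total parameters, and BF16 stores two bytes per parameter):

```python
params = 72.7e9                   # approximate total parameter count
weights_gib = params * 2 / 2**30  # BF16: 2 bytes per parameter
print(f"Weights alone: ~{weights_gib:.0f} GiB")  # ~135 GiB, before KV cache and activations
```
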

Memory-optimized configuration for the ~136 GB model:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.85 \
  --host 192.168.200.1 \
  --port 8000 \
  --max-model-len 2048 \
  --max-num-seqs 8 \
  --disable-sliding-window \
  --enforce-eager
```

**Memory optimization parameters:**

- `--gpu-memory-utilization 0.85`: Uses 85% of GPU memory
- `--max-model-len 2048`: Limits context length to save memory
- `--max-num-seqs 8`: Reduces concurrent sequences
- `--disable-sliding-window`: Disables memory-intensive sliding-window attention
- `--enforce-eager`: Uses eager execution instead of CUDA graphs (saves memory)

Test the 72B model:

```bash
curl -X POST "http://192.168.200.1:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-72B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain the benefits of RDMA for AI workloads in one paragraph."}
    ],
    "max_tokens": 500
  }'
```

---

## Step 9. Monitor RDMA Traffic

Monitor RDMA activity during inference:

```bash
# Run on both systems (separate terminals)
watch -n 0.5 "
echo '=== RDMA Counters ===';
echo -n 'TX: '; cat /sys/class/infiniband/rocep1s0f0/ports/1/counters/port_xmit_data;
echo -n 'RX: '; cat /sys/class/infiniband/rocep1s0f0/ports/1/counters/port_rcv_data;
echo -n 'Timestamp: '; date;
"
```

During inference you will see the counters climb as tensors move between the GPUs; a simple way to turn the raw counters into a bandwidth figure follows below.

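
The counters are cumulative, so computing a rate takes two samples. A minimal sketch (it assumes the conventional InfiniBand accounting in which `port_xmit_data` counts 4-byte words; verify the units for your NIC):

```python
import time

COUNTER = "/sys/class/infiniband/rocep1s0f0/ports/1/counters/port_xmit_data"

def read_counter() -> int:
    with open(COUNTER) as f:
        return int(f.read())

# Sample one second apart; x4 for the assumed 4-byte units, x8 for bits
before = read_counter()
time.sleep(1)
delta = read_counter() - before
print(f"TX: {delta * 4 * 8 / 1e9:.2f} Gbps")
```
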
---

## Performance Benchmarks

### Benchmark Commands

**Single-node testing:**

```bash
# On RTX 6000 Pro or DGX Spark
vllm bench latency --model Qwen/Qwen3-30B-A3B-Thinking-2507 --input-len 512 --output-len 2000 --num-iters 10
vllm bench throughput --model Qwen/Qwen3-30B-A3B-Thinking-2507 --input-len 512 --output-len 2000 --num-prompts 20
```

**Distributed testing:**

```bash
# 30B Model
vllm bench serve --host 192.168.200.1 --port 8000 --random-input-len 512 --random-output-len 2000 --num-prompts 20 --request-rate 2 --model Qwen/Qwen3-30B-A3B-Thinking-2507

# 72B Model
vllm bench serve --host 192.168.200.1 --port 8000 --random-input-len 256 --random-output-len 1500 --num-prompts 20 --request-rate 2 --model Qwen/Qwen2.5-72B-Instruct
```
### Performance Results Summary

| Configuration | Avg Latency | Output Throughput | Total Throughput |
|---------------|-------------|-------------------|------------------|
| **RTX 6000 Pro (Single)** | 36.87 s | 679.88 tok/s | 853.90 tok/s |
| **DGX Spark (Single)** | 213.12 s | 105.10 tok/s | 132.00 tok/s |
| **Distributed RDMA** | 191.09 s | 205.83 tok/s | 259.41 tok/s |

### Key Insights

**RTX 6000 Pro: Clear Single-Node Winner**

- 5.8x faster than DGX Spark for latency-critical workloads
- 6.5x higher output token throughput
- Best for: interactive inference, real-time applications

**Distributed RDMA: Aggregated Capacity**

- 259.41 tok/s total throughput - roughly double the DGX Spark alone
- Combined 224 GB of GPU memory (128 GB DGX + 96 GB RTX)
- Enables models too large for any single GPU
- Mean TTFT: 139.94 ms, versus 213,120 ms on the single DGX Spark

**DGX Spark: Memory Advantage**

- 128 GB of unified memory enables larger models
- Slower inference, but handles 100B+ models
- Best for: extremely large models, memory-constrained scenarios

### FP8 30B Model Results

```
============ Serving Benchmark Result ============
Successful requests:                     20
Benchmark duration (s):                  115.36
Output token throughput (tok/s):         341.15
Total Token throughput (tok/s):          429.89
Mean TTFT (ms):                          171.00
Mean TPOT (ms):                          53.08
==================================================
```

---

## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| Ray worker can't connect to head | Firewall blocking port 6379 | `sudo ufw allow 6379/tcp` |
| NCCL timeout during model load | RDMA not working | Verify the `ib_send_bw` test passes |
| "Placement group" errors | Ray cluster not formed | Check `ray status` on both nodes |
| OOM during 72B model load | Insufficient memory optimization | Add `--max-model-len 2048 --enforce-eager` |
| SSH connection refused | SSH server not running | `sudo systemctl start ssh` |
| Container can't access RDMA | Missing device mount | Ensure `-v /dev/infiniband:/dev/infiniband` |
| Wrong IP in Ray cluster | Multiple network interfaces | Set `RAY_USE_MULTIPLE_IPS=0` |
| Slow inference performance | NCCL using wrong interface | Verify `NCCL_SOCKET_IFNAME=enp1s0f0np0` |

---

## Credits

This playbook was contributed by **Csaba Kecskemeti** | [DevQuasar](https://devquasar.com/).

For a detailed walkthrough and additional context, see the original article:
[Distributed Inference Cluster: DGX Spark + RTX 6000 Pro](https://devquasar.com/ai/edge-ai/distributed-inference-cluster-dgx-spark-rtx-6000-pro/)