dgx-spark-playbooks/nvidia/heterogeneous-distributed-inference-rdma/DISTRIBUTED-INFERENCE.md

Distributed Inference Guide

Deploy and run distributed AI inference across a DGX Spark and a Linux workstation using vLLM and Ray

Overview

Basic idea

This guide walks you through deploying distributed inference across your heterogeneous RDMA cluster. Using Ray for orchestration and vLLM for inference, you can run large language models that exceed the memory capacity of any single GPU by distributing them across your DGX Spark and Linux workstation.

Architecture:

┌─────────────────────────────────┐    ┌───────────────────────────────────┐
│         DGX SPARK               │    │        WORKSTATION                │
│   (Grace Blackwell GB10)        │    │  (RTX 6000 Pro / RTX 5090)        │
│                                 │    │                                   │
│  ┌───────────────────────────┐  │    │  ┌───────────────────────────┐    │
│  │      vLLM Head Node       │  │    │  │      vLLM Worker          │    │
│  │   (API Server, Rank 0)    │  │    │  │   (Tensor Parallel)       │    │
│  └───────────────────────────┘  │    │  └───────────────────────────┘    │
│              │                  │    │              │                    │
│  ┌───────────────────────────┐  │    │  ┌───────────────────────────┐    │
│  │   Ray Head (6379)         │◄─┼────┼──│   Ray Worker              │    │
│  └───────────────────────────┘  │    │  └───────────────────────────┘    │
│              │                  │    │              │                    │
│  ┌───────────────────────────┐  │    │  ┌───────────────────────────┐    │
│  │   NCCL over RDMA          │◄─┼════┼──│   NCCL over RDMA          │    │
│  │   192.168.200.1           │  │    │  │   192.168.200.2           │    │
│  └───────────────────────────┘  │    │  └───────────────────────────┘    │
└─────────────────────────────────┘    └───────────────────────────────────┘

What you'll accomplish

  • Configure SSH and hostname resolution between nodes
  • Test NCCL communication over RDMA
  • Deploy RDMA-enabled Docker containers
  • Establish a Ray cluster across both systems
  • Run distributed inference with vLLM
  • Benchmark performance across different configurations

What to know before starting

  • Familiarity with Docker and container networking
  • Understanding of distributed computing concepts (Ray, tensor parallelism)
  • Basic knowledge of LLM inference serving

Prerequisites

  • Completed RDMA Network Setup with validated 90+ Gbps bandwidth
  • Docker installed on both systems: docker --version
  • NVIDIA Container Toolkit installed
  • Hugging Face account for model access (some models require authentication)

Time & risk

  • Duration: 1-2 hours including testing

  • Risk level: Low - uses containers, non-destructive

  • Rollback: Stop containers to revert

  • Last Updated: 01/23/2026


Instructions

Step 1. Configure Hostnames

Add hostname aliases on both systems:

## Add hostname resolution on both DGX Spark and Workstation
sudo tee -a /etc/hosts > /dev/null <<EOF
192.168.200.1 dgx-spark
192.168.200.2 workstation
EOF
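Once the entries are in place, the mapping can be sanity-checked without touching DNS. The sketch below is a hypothetical helper (not part of the playbook) that parses hosts-format text; point it at the real /etc/hosts to verify both aliases:

```python
# Minimal parser for /etc/hosts-style text (illustrative sketch).
def parse_hosts(text):
    """Return {hostname: ip} from hosts-format text, ignoring comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#")[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name] = ip
    return mapping

entries = """
192.168.200.1 dgx-spark
192.168.200.2 workstation
"""
hosts = parse_hosts(entries)
print(hosts["dgx-spark"])    # 192.168.200.1
print(hosts["workstation"])  # 192.168.200.2
```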

Step 2. Set Up SSH Access

Install the SSH server if it's missing (desktop installs often ship without it):

## Check SSH server status
sudo systemctl status ssh

## If not installed:
sudo apt update
sudo apt install openssh-server
sudo systemctl start ssh
sudo systemctl enable ssh

Configure passwordless SSH between nodes:

On DGX Spark:

## Check if SSH key exists
ls ~/.ssh/id_*.pub

## If no key exists, generate one:
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519

## Copy key to workstation
ssh-copy-id <your-username>@workstation

On Workstation:

## Check if SSH key exists
ls ~/.ssh/id_*.pub

## If no key exists, generate one:
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519

## Copy key to DGX Spark
ssh-copy-id <your-username>@dgx-spark

Verify passwordless SSH:

## From DGX Spark
ssh <your-username>@workstation hostname
## Expected output: workstation

## From Workstation
ssh <your-username>@dgx-spark hostname
## Expected output: dgx-spark

Step 3. Test NCCL Communication

Create the NCCL test script on both systems:

## Create test script
cat > test_nccl.py << 'EOF'
import os
import torch
import torch.distributed as dist
import argparse

def test_nccl_communication():
    parser = argparse.ArgumentParser()
    parser.add_argument('--rank', type=int, required=True)
    parser.add_argument('--world_size', type=int, default=2)
    parser.add_argument('--master_addr', type=str, default='192.168.200.1')
    parser.add_argument('--master_port', type=str, default='29500')
    args = parser.parse_args()

    os.environ['RANK'] = str(args.rank)
    os.environ['WORLD_SIZE'] = str(args.world_size)
    os.environ['MASTER_ADDR'] = args.master_addr
    os.environ['MASTER_PORT'] = args.master_port
    os.environ['NCCL_SOCKET_IFNAME'] = 'enp1s0f0np0'

    print(f"Initializing process group - Rank: {args.rank}, World Size: {args.world_size}")
    print(f"Master: {args.master_addr}:{args.master_port}")

    dist.init_process_group(backend='nccl', rank=args.rank, world_size=args.world_size)
    print(f"Process group initialized - Rank: {dist.get_rank()}/{dist.get_world_size()}")

    device = torch.device('cuda:0')
    tensor = torch.ones(10, device=device) * (args.rank + 1)
    print(f"Rank {args.rank} - Before allreduce: {tensor}")

    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f"Rank {args.rank} - After allreduce: {tensor}")
    expected = torch.ones(10) * sum(range(1, args.world_size + 1))
    print(f"Expected result: {expected}")

    dist.destroy_process_group()
    print(f"Rank {args.rank} - Test completed successfully!")

if __name__ == "__main__":
    test_nccl_communication()
EOF
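The expected result generalizes: rank r contributes a tensor filled with (r + 1), so an all-reduce sum across W ranks yields W(W + 1)/2 per element, which is 3.0 for this two-node cluster. A pure-Python sketch of the reduction, no GPUs or NCCL required:

```python
def simulated_all_reduce_sum(world_size, length=10):
    """Each rank r holds a vector filled with (r + 1); all-reduce sums them elementwise."""
    per_rank = [[rank + 1] * length for rank in range(world_size)]
    return [sum(vals) for vals in zip(*per_rank)]

print(simulated_all_reduce_sum(2))  # two nodes: every element is 1 + 2 = 3
```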

Run NCCL test in Docker containers:

On DGX Spark (start first):

docker run -it --runtime=nvidia --gpus all --network host --ipc=host \
  --privileged --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /dev/infiniband:/dev/infiniband -v /sys:/sys:ro \
  -e NCCL_IB_DISABLE=0 -e NCCL_IB_HCA=rocep1s0f0:1 -e NCCL_IB_GID_INDEX=3 \
  -e NCCL_SOCKET_IFNAME=enp1s0f0np0 -v $(pwd):/workspace \
  nvcr.io/nvidia/vllm:25.09-py3 python /workspace/test_nccl.py --rank 0

On Workstation (connect to DGX):

docker run -it --runtime=nvidia --gpus all --network host --ipc=host \
  --privileged --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /dev/infiniband:/dev/infiniband -v /sys:/sys:ro \
  -e NCCL_IB_DISABLE=0 -e NCCL_IB_HCA=rocep1s0f0:1 -e NCCL_IB_GID_INDEX=3 \
  -e NCCL_SOCKET_IFNAME=enp1s0f0np0 -v $(pwd):/workspace \
  nvcr.io/nvidia/vllm:25.09-py3 python /workspace/test_nccl.py --rank 1

Success indicators:

  • Output shows: NCCL INFO Using network IBext_v10
  • All-reduce operation completes successfully
  • Final tensors show expected sum values (3.0 for each element)

Step 4. Start RDMA-Enabled Containers

On DGX Spark:

docker run -it --runtime=nvidia --gpus all --network host --ipc=host --shm-size=10g \
  --privileged \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v /dev/infiniband:/dev/infiniband \
  -v /sys:/sys:ro \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -e GLOO_SOCKET_IFNAME=enp1s0f0np0 \
  -e NCCL_IB_DISABLE=0 \
  -e NCCL_IB_HCA=rocep1s0f0:1 \
  -e NCCL_IB_GID_INDEX=3 \
  -e NCCL_SOCKET_IFNAME=enp1s0f0np0 \
  -e RAY_USE_MULTIPLE_IPS=0 \
  -e RAY_NODE_IP_ADDRESS=192.168.200.1 \
  -e RAY_OVERRIDE_NODE_IP=192.168.200.1 \
  -e VLLM_HOST_IP=192.168.200.1 \
  nvcr.io/nvidia/vllm:25.09-py3 bash

On Workstation:

docker run -it --runtime=nvidia --gpus all --network host --ipc=host --shm-size=10g \
  --privileged \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v /dev/infiniband:/dev/infiniband \
  -v /sys:/sys:ro \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -e GLOO_SOCKET_IFNAME=enp1s0f0np0 \
  -e NCCL_IB_DISABLE=0 \
  -e NCCL_IB_HCA=rocep1s0f0:1 \
  -e NCCL_IB_GID_INDEX=3 \
  -e NCCL_SOCKET_IFNAME=enp1s0f0np0 \
  -e RAY_USE_MULTIPLE_IPS=0 \
  -e RAY_NODE_IP_ADDRESS=192.168.200.2 \
  -e RAY_OVERRIDE_NODE_IP=192.168.200.2 \
  -e VLLM_HOST_IP=192.168.200.2 \
  nvcr.io/nvidia/vllm:25.09-py3 bash

Key parameters explained:

  • --runtime=nvidia: Required for GPU access
  • --network host: Uses host networking (required for RDMA)
  • --privileged: Needed for InfiniBand device access
  • --ulimit memlock=-1: Unlimited memory locking for RDMA
  • -v /dev/infiniband:/dev/infiniband: Mounts RDMA devices
  • NCCL_IB_HCA=rocep1s0f0:1: Tells NCCL to use specific RDMA device
  • RAY_USE_MULTIPLE_IPS=0: Prevents Ray IP detection issues
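The two docker run commands above differ only in the node IP. If you regenerate them often, the shared -e flags can be built from one table — a hypothetical helper for illustration, not part of the playbook's tooling:

```python
def nccl_ray_env_flags(node_ip, iface="enp1s0f0np0", hca="rocep1s0f0:1"):
    """Build the docker -e flags shared by both nodes; only node_ip varies."""
    env = {
        "CUDA_DEVICE_ORDER": "PCI_BUS_ID",
        "GLOO_SOCKET_IFNAME": iface,
        "NCCL_IB_DISABLE": "0",
        "NCCL_IB_HCA": hca,
        "NCCL_IB_GID_INDEX": "3",
        "NCCL_SOCKET_IFNAME": iface,
        "RAY_USE_MULTIPLE_IPS": "0",
        "RAY_NODE_IP_ADDRESS": node_ip,
        "RAY_OVERRIDE_NODE_IP": node_ip,
    }
    return [f"-e {k}={v}" for k, v in env.items()]

flags = nccl_ray_env_flags("192.168.200.2")  # workstation; use .1 for the DGX Spark
print(" \\\n  ".join(flags))
```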

Step 5. Establish Ray Cluster

On DGX Spark container (head node):

ray start --head \
  --node-ip-address=192.168.200.1 \
  --port=6379 \
  --dashboard-host=192.168.200.1 \
  --dashboard-port=8265 \
  --num-gpus=1

Verify head node:

ray status

Expected output:

======== Autoscaler status: 2026-01-10 19:43:05.517578 ========
Node status
---------------------------------------------------------------
Active:
 1 node_xxxxx
Resources
---------------------------------------------------------------
Total Usage:
 0.0/20.0 CPU
 0.0/1.0 GPU

On Workstation container (worker node):

ray start \
  --address=192.168.200.1:6379 \
  --node-ip-address=192.168.200.2 \
  --num-gpus=2

Verify cluster formation:

ray status

Expected output (should show 3 total GPUs):

======== Autoscaler status: 2026-01-10 19:46:26.274139 ========
Node status
---------------------------------------------------------------
Active:
 1 node_xxxxx (head)
 1 node_xxxxx (worker)
Resources
---------------------------------------------------------------
Total Usage:
 0.0/68.0 CPU
 0.0/3.0 GPU
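To script the same check, the GPU total can be pulled out of the ray status text — a sketch that assumes the "N.N/M.M GPU" line format shown above:

```python
import re

def total_gpus(ray_status_output):
    """Extract the total GPU count from `ray status` text, or None if absent."""
    match = re.search(r"[\d.]+/([\d.]+) GPU", ray_status_output)
    return float(match.group(1)) if match else None

sample = """
Total Usage:
 0.0/68.0 CPU
 0.0/3.0 GPU
"""
print(total_gpus(sample))  # 3.0 — both nodes joined: 1 (DGX Spark) + 2 (workstation)
```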

Step 6. Run Validation Test (4B Model)

Start small model for validation on DGX Spark container:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.8 \
  --host 192.168.200.1 \
  --port 8000

Test from another terminal:

curl -X POST "http://192.168.200.1:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B-Instruct-2507",
    "prompt": "Test distributed inference:",
    "max_tokens": 500
  }'
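The same request can be issued from Python with only the standard library. This sketch just constructs the JSON body for the OpenAI-compatible /v1/completions endpoint; to actually send it, wrap it in urllib.request.urlopen against your endpoint:

```python
import json

def completion_payload(prompt, model="Qwen/Qwen3-4B-Instruct-2507", max_tokens=500):
    """JSON body for the OpenAI-compatible /v1/completions endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})

body = completion_payload("Test distributed inference:")
print(body)
# To send (no network call made here):
# urllib.request.urlopen(urllib.request.Request(
#     "http://192.168.200.1:8000/v1/completions", data=body.encode(),
#     headers={"Content-Type": "application/json"}))
```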

Step 7. Run FP8 Quantized Model (30B)

FP8 quantization provides excellent memory efficiency with good performance:

## Stop previous model (Ctrl+C), then start FP8 30B model
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.8 \
  --host 192.168.200.1 \
  --port 8000

Benefits of FP8:

  • Memory efficiency: Reduced footprint compared to BF16
  • Performance: 341+ tok/s demonstrated
  • Hardware compatibility: Fully supported on Blackwell GB10
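The memory saving is easy to quantify: weights drop from 2 bytes per parameter (BF16) to 1 (FP8). Rough figures for a 30B-parameter model, counting weights only (KV cache and activations come on top):

```python
def weight_gb(params_billion, bytes_per_param):
    """Approximate weight memory in GB for a model with params_billion parameters."""
    return params_billion * 1e9 * bytes_per_param / 1e9

bf16 = weight_gb(30, 2)  # 60.0 GB
fp8 = weight_gb(30, 1)   # 30.0 GB
print(f"BF16: {bf16:.0f} GB, FP8: {fp8:.0f} GB, saved: {bf16 - fp8:.0f} GB")
```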

Step 8. Run Production Model (72B)

Memory-optimized configuration for this roughly 136 GB model:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.85 \
  --host 192.168.200.1 \
  --port 8000 \
  --max-model-len 2048 \
  --max-num-seqs 8 \
  --disable-sliding-window \
  --enforce-eager

Memory optimization parameters:

  • --gpu-memory-utilization 0.85: Uses 85% of GPU memory
  • --max-model-len 2048: Limits context length to save memory
  • --max-num-seqs 8: Reduces concurrent sequences
  • --disable-sliding-window: Disables sliding-window attention, capping the context to the sliding-window size
  • --enforce-eager: Skips CUDA graph capture, trading some speed for a smaller memory footprint
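The two --max-* flags bound the KV cache directly. A back-of-the-envelope estimate, assuming Qwen2.5-72B's published configuration (80 layers, 8 KV heads under GQA, head dim 128) and an FP16 cache — check the model card before relying on these numbers:

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, num_seqs, dtype_bytes=2):
    """Approximate KV-cache size in GiB: K and V (factor 2) per layer per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes
    return per_token_bytes * seq_len * num_seqs / 2**30

# With --max-model-len 2048 and --max-num-seqs 8:
print(f"{kv_cache_gb(80, 8, 128, 2048, 8):.1f} GiB")  # 5.0 GiB
```

Raising max-model-len or max-num-seqs grows this term linearly, which is why both are capped here.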

Test 72B model:

curl -X POST "http://192.168.200.1:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-72B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain the benefits of RDMA for AI workloads in one paragraph."}
    ],
    "max_tokens": 500
  }'

Step 9. Monitor RDMA Traffic

Monitor RDMA activity during inference:

## Run on both systems (separate terminals)
watch -n 0.5 "
echo '=== RDMA Counters ===';
echo -n 'TX: '; cat /sys/class/infiniband/rocep1s0f0/ports/1/counters/port_xmit_data;
echo -n 'RX: '; cat /sys/class/infiniband/rocep1s0f0/ports/1/counters/port_rcv_data;
echo -n 'Timestamp: '; date;
"

During inference, you'll see counters increasing as tensors are communicated between GPUs.
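Since port_xmit_data and port_rcv_data report 4-byte words rather than bytes, link throughput can be computed from two counter samples. A sketch with hypothetical sample values taken 0.5 s apart:

```python
def gbps_from_counters(words_before, words_after, interval_s):
    """Convert a delta of 4-byte-word RDMA port counters into Gbit/s."""
    byte_delta = (words_after - words_before) * 4
    return byte_delta * 8 / interval_s / 1e9

# Hypothetical samples 0.5 s apart:
print(f"{gbps_from_counters(1_000_000_000, 2_500_000_000, 0.5):.1f} Gbps")  # 96.0 Gbps
```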


Performance Benchmarks

Benchmark Commands

Single-node testing:

## On RTX 6000 Pro or DGX Spark
vllm bench latency --model Qwen/Qwen3-30B-A3B-Thinking-2507 --input-len 512 --output-len 2000 --num-iters 10
vllm bench throughput --model Qwen/Qwen3-30B-A3B-Thinking-2507 --input-len 512 --output-len 2000 --num-prompts 20

Distributed testing:

## 30B Model
vllm bench serve --host 192.168.200.1 --port 8000 --random-input-len 512 --random-output-len 2000 --num-prompts 20 --request-rate 2 --model Qwen/Qwen3-30B-A3B-Thinking-2507

## 72B Model
vllm bench serve --host 192.168.200.1 --port 8000 --random-input-len 256 --random-output-len 1500 --num-prompts 20 --request-rate 2 --model Qwen/Qwen2.5-72B-Instruct

Performance Results Summary

Configuration           Avg Latency   Output Throughput   Total Throughput
RTX 6000 Pro (Single)   36.87s        679.88 tok/s        853.90 tok/s
DGX Spark (Single)      213.12s       105.10 tok/s        132.00 tok/s
Distributed RDMA        191.09s       205.83 tok/s        259.41 tok/s
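The multipliers cited in the Key Insights follow directly from these numbers:

```python
# Figures from the benchmark table above (latency in s, throughput in tok/s).
results = {
    "rtx6000": {"latency": 36.87, "output_tps": 679.88},
    "dgx":     {"latency": 213.12, "output_tps": 105.10},
    "dist":    {"latency": 191.09, "output_tps": 205.83},
}

latency_speedup = results["dgx"]["latency"] / results["rtx6000"]["latency"]
tps_speedup = results["rtx6000"]["output_tps"] / results["dgx"]["output_tps"]
print(f"RTX 6000 Pro vs DGX Spark: {latency_speedup:.1f}x lower latency, "
      f"{tps_speedup:.1f}x output throughput")
```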

Key Insights

RTX 6000 Pro: Clear Single-Node Winner

  • 5.8x faster than DGX Spark for latency-critical workloads
  • 6.5x higher output token throughput
  • Best for: Interactive inference, real-time applications

Distributed RDMA: Aggregated Capacity

  • 259.41 tok/s total throughput, nearly double the DGX Spark alone (132.00 tok/s)
  • Combined 224GB GPU memory (128GB DGX + 96GB RTX)
  • Enables models too large for any single GPU
  • Mean TTFT of 139.94 ms, versus a 213.12 s average end-to-end latency on the single DGX Spark

DGX Spark: Memory Advantage

  • 128GB unified memory enables larger models
  • Slower inference but handles 100B+ models
  • Best for: Extremely large models, memory-constrained scenarios

FP8 30B Model Results

============ Serving Benchmark Result ============
Successful requests:                     20
Benchmark duration (s):                  115.36
Output token throughput (tok/s):         341.15
Total Token throughput (tok/s):          429.89
Mean TTFT (ms):                          171.00
Mean TPOT (ms):                          53.08
==================================================
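These figures are internally consistent, which is a useful sanity check whenever you read benchmark output: throughput times duration gives the total output tokens, which divided by the request count recovers the per-request output length:

```python
# Values copied from the FP8 benchmark result above.
duration_s = 115.36
output_tps = 341.15
requests = 20

total_output_tokens = output_tps * duration_s  # ~39,355 tokens total
per_request = total_output_tokens / requests   # ~1,968 tokens per request
print(f"~{total_output_tokens:,.0f} output tokens, ~{per_request:,.0f} per request")
```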

Troubleshooting

Symptom                            Cause                              Fix
Ray worker can't connect to head   Firewall blocking port 6379        sudo ufw allow 6379/tcp
NCCL timeout during model load     RDMA not working                   Verify ib_send_bw test passes
"Placement group" errors           Ray cluster not formed             Check ray status on both nodes
OOM during 72B model load          Insufficient memory optimization   Add --max-model-len 2048 --enforce-eager
SSH connection refused             SSH server not running             sudo systemctl start ssh
Container can't access RDMA        Missing device mount               Ensure -v /dev/infiniband:/dev/infiniband
Wrong IP in Ray cluster            Multiple network interfaces        Set RAY_USE_MULTIPLE_IPS=0
Slow inference performance         NCCL using wrong interface         Verify NCCL_SOCKET_IFNAME=enp1s0f0np0

Credits

This playbook was contributed by Csaba Kecskemeti | DevQuasar.

For a detailed walkthrough and additional context, see the original article: Distributed Inference Cluster: DGX Spark + RTX 6000 Pro
