Distributed Inference Guide
Deploy and run distributed AI inference across DGX Spark and Linux Workstation using vLLM and Ray
Overview
Basic idea
This guide walks you through deploying distributed inference across your heterogeneous RDMA cluster. Using Ray for orchestration and vLLM for inference, you can run large language models that exceed the memory capacity of any single GPU by distributing them across your DGX Spark and Linux workstation.
Architecture:
┌─────────────────────────────────┐ ┌───────────────────────────────────┐
│ DGX SPARK │ │ WORKSTATION │
│ (Grace Blackwell GB10) │ │ (RTX 6000 Pro / RTX 5090) │
│ │ │ │
│ ┌───────────────────────────┐ │ │ ┌───────────────────────────┐ │
│ │ vLLM Head Node │ │ │ │ vLLM Worker │ │
│ │ (API Server, Rank 0) │ │ │ │ (Tensor Parallel) │ │
│ └───────────────────────────┘ │ │ └───────────────────────────┘ │
│ │ │ │ │ │
│ ┌───────────────────────────┐ │ │ ┌───────────────────────────┐ │
│ │ Ray Head (6379) │◄─┼────┼──│ Ray Worker │ │
│ └───────────────────────────┘ │ │ └───────────────────────────┘ │
│ │ │ │ │ │
│ ┌───────────────────────────┐ │ │ ┌───────────────────────────┐ │
│ │ NCCL over RDMA │◄─┼════┼──│ NCCL over RDMA │ │
│ │ 192.168.200.1 │ │ │ │ 192.168.200.2 │ │
│ └───────────────────────────┘ │ │ └───────────────────────────┘ │
└─────────────────────────────────┘ └───────────────────────────────────┘
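Tensor parallelism, used throughout this guide, splits each weight matrix across GPUs so every node holds only a shard of the model. A toy NumPy sketch of the idea (illustrative only, not vLLM's actual implementation):

```python
import numpy as np

# Toy tensor parallelism: split a linear layer's weight column-wise
# across two "GPUs", compute each shard independently, then gather.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # batch of activations
W = rng.standard_normal((8, 6))        # full weight matrix

W0, W1 = np.hsplit(W, 2)               # each "GPU" holds half the columns
y0 = x @ W0                            # partial result on GPU 0
y1 = x @ W1                            # partial result on GPU 1
y = np.concatenate([y0, y1], axis=1)   # gather the shards over the network

assert np.allclose(y, x @ W)           # matches the unsharded computation
```

In a real deployment the gather (or, for row-split layers, an all-reduce) is what NCCL carries over the RDMA link, which is why link bandwidth matters so much here.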
What you'll accomplish
- Configure SSH and hostname resolution between nodes
- Test NCCL communication over RDMA
- Deploy RDMA-enabled Docker containers
- Establish a Ray cluster across both systems
- Run distributed inference with vLLM
- Benchmark performance across different configurations
What to know before starting
- Familiarity with Docker and container networking
- Understanding of distributed computing concepts (Ray, tensor parallelism)
- Basic knowledge of LLM inference serving
Prerequisites
- Completed RDMA Network Setup with validated 90+ Gbps bandwidth
- Docker installed on both systems (verify with docker --version)
- NVIDIA Container Toolkit installed
- Hugging Face account for model access (some models require authentication)
Time & risk
- Duration: 1-2 hours including testing
- Risk level: Low - uses containers, non-destructive
- Rollback: Stop containers to revert
- Last Updated: 01/23/2026
Instructions
Step 1. Configure Hostnames
Add hostname aliases on both systems:
## Add hostname resolution on both DGX Spark and Workstation
sudo tee -a /etc/hosts > /dev/null <<EOF
192.168.200.1 dgx-spark
192.168.200.2 workstation
EOF
Step 2. Set Up SSH Access
Install SSH server if needed (common on workstations):
## Check SSH server status
sudo systemctl status ssh
## If not installed:
sudo apt update
sudo apt install openssh-server
sudo systemctl start ssh
sudo systemctl enable ssh
Configure passwordless SSH between nodes:
On DGX Spark:
## Check if SSH key exists
ls ~/.ssh/id_*.pub
## If no key exists, generate one:
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
## Copy key to workstation
ssh-copy-id <your-username>@workstation
On Workstation:
## Check if SSH key exists
ls ~/.ssh/id_*.pub
## If no key exists, generate one:
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
## Copy key to DGX Spark
ssh-copy-id <your-username>@dgx-spark
Verify passwordless SSH:
## From DGX Spark
ssh <your-username>@workstation hostname
## Expected output: workstation
## From Workstation
ssh <your-username>@dgx-spark hostname
## Expected output: dgx-spark
Step 3. Test NCCL Communication
Create the NCCL test script on both systems:
## Create test script
cat > test_nccl.py << 'EOF'
import os
import torch
import torch.distributed as dist
import argparse
def test_nccl_communication():
    parser = argparse.ArgumentParser()
    parser.add_argument('--rank', type=int, required=True)
    parser.add_argument('--world_size', type=int, default=2)
    parser.add_argument('--master_addr', type=str, default='192.168.200.1')
    parser.add_argument('--master_port', type=str, default='29500')
    args = parser.parse_args()

    os.environ['RANK'] = str(args.rank)
    os.environ['WORLD_SIZE'] = str(args.world_size)
    os.environ['MASTER_ADDR'] = args.master_addr
    os.environ['MASTER_PORT'] = args.master_port
    os.environ['NCCL_SOCKET_IFNAME'] = 'enp1s0f0np0'

    print(f"Initializing process group - Rank: {args.rank}, World Size: {args.world_size}")
    print(f"Master: {args.master_addr}:{args.master_port}")
    dist.init_process_group(backend='nccl', rank=args.rank, world_size=args.world_size)
    print(f"Process group initialized - Rank: {dist.get_rank()}/{dist.get_world_size()}")

    device = torch.device('cuda:0')
    tensor = torch.ones(10, device=device) * (args.rank + 1)
    print(f"Rank {args.rank} - Before allreduce: {tensor}")
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f"Rank {args.rank} - After allreduce: {tensor}")
    print(f"Expected result: {torch.ones(10) * (1 + 2)}")

    dist.destroy_process_group()
    print(f"Rank {args.rank} - Test completed successfully!")

if __name__ == "__main__":
    test_nccl_communication()
EOF
Run NCCL test in Docker containers:
On DGX Spark (start first):
docker run -it --runtime=nvidia --gpus all --network host --ipc=host \
--privileged --ulimit memlock=-1 --ulimit stack=67108864 \
-v /dev/infiniband:/dev/infiniband -v /sys:/sys:ro \
-e NCCL_IB_DISABLE=0 -e NCCL_IB_HCA=rocep1s0f0:1 -e NCCL_IB_GID_INDEX=3 \
-e NCCL_SOCKET_IFNAME=enp1s0f0np0 -v $(pwd):/workspace \
nvcr.io/nvidia/vllm:25.09-py3 python /workspace/test_nccl.py --rank 0
On Workstation (connect to DGX):
docker run -it --runtime=nvidia --gpus all --network host --ipc=host \
--privileged --ulimit memlock=-1 --ulimit stack=67108864 \
-v /dev/infiniband:/dev/infiniband -v /sys:/sys:ro \
-e NCCL_IB_DISABLE=0 -e NCCL_IB_HCA=rocep1s0f0:1 -e NCCL_IB_GID_INDEX=3 \
-e NCCL_SOCKET_IFNAME=enp1s0f0np0 -v $(pwd):/workspace \
nvcr.io/nvidia/vllm:25.09-py3 python /workspace/test_nccl.py --rank 1
Success indicators:
- Output shows: NCCL INFO Using network IBext_v10
- All-reduce operation completes successfully
- Final tensors show the expected sum (3.0 for each element)
Step 4. Start RDMA-Enabled Containers
On DGX Spark:
docker run -it --runtime=nvidia --gpus all --network host --ipc=host --shm-size=10g \
--privileged \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v /dev/infiniband:/dev/infiniband \
-v /sys:/sys:ro \
-e CUDA_DEVICE_ORDER=PCI_BUS_ID \
-e GLOO_SOCKET_IFNAME=enp1s0f0np0 \
-e NCCL_IB_DISABLE=0 \
-e NCCL_IB_HCA=rocep1s0f0:1 \
-e NCCL_IB_GID_INDEX=3 \
-e NCCL_SOCKET_IFNAME=enp1s0f0np0 \
-e RAY_USE_MULTIPLE_IPS=0 \
-e RAY_NODE_IP_ADDRESS=192.168.200.1 \
-e RAY_OVERRIDE_NODE_IP=192.168.200.1 \
-e VLLM_HOST_IP=192.168.200.1 \
nvcr.io/nvidia/vllm:25.09-py3 bash
On Workstation:
docker run -it --runtime=nvidia --gpus all --network host --ipc=host --shm-size=10g \
--privileged \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v /dev/infiniband:/dev/infiniband \
-v /sys:/sys:ro \
-e CUDA_DEVICE_ORDER=PCI_BUS_ID \
-e GLOO_SOCKET_IFNAME=enp1s0f0np0 \
-e NCCL_IB_DISABLE=0 \
-e NCCL_IB_HCA=rocep1s0f0:1 \
-e NCCL_IB_GID_INDEX=3 \
-e NCCL_SOCKET_IFNAME=enp1s0f0np0 \
-e RAY_USE_MULTIPLE_IPS=0 \
-e RAY_NODE_IP_ADDRESS=192.168.200.2 \
-e RAY_OVERRIDE_NODE_IP=192.168.200.2 \
nvcr.io/nvidia/vllm:25.09-py3 bash
Key parameters explained:
- --runtime=nvidia: Required for GPU access
- --network host: Uses host networking (required for RDMA)
- --privileged: Needed for InfiniBand device access
- --ulimit memlock=-1: Unlimited memory locking for RDMA
- -v /dev/infiniband:/dev/infiniband: Mounts RDMA devices
- NCCL_IB_HCA=rocep1s0f0:1: Tells NCCL to use the specific RDMA device
- RAY_USE_MULTIPLE_IPS=0: Prevents Ray IP detection issues
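To confirm the mounts worked, check that the RDMA device is visible from inside the container. A small sketch (the sysfs path is the standard Linux location; the device name rocep1s0f0 comes from this guide's setup):

```python
import os

def list_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return RDMA device names visible under sysfs, or [] if none."""
    if not os.path.isdir(sysfs_root):
        return []
    return sorted(os.listdir(sysfs_root))

if __name__ == "__main__":
    devices = list_rdma_devices()
    # Inside a correctly launched container this should include rocep1s0f0.
    print("RDMA devices:", devices or "none - check the /dev/infiniband and /sys mounts")
```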
Step 5. Establish Ray Cluster
On DGX Spark container (head node):
ray start --head \
--node-ip-address=192.168.200.1 \
--port=6379 \
--dashboard-host=192.168.200.1 \
--dashboard-port=8265 \
--num-gpus=1
Verify head node:
ray status
Expected output:
======== Autoscaler status: 2026-01-10 19:43:05.517578 ========
Node status
---------------------------------------------------------------
Active:
1 node_xxxxx
Resources
---------------------------------------------------------------
Total Usage:
0.0/20.0 CPU
0.0/1.0 GPU
On Workstation container (worker node):
ray start \
--address=192.168.200.1:6379 \
--node-ip-address=192.168.200.2 \
--num-gpus=2
Verify cluster formation:
ray status
Expected output (should show 3 total GPUs):
======== Autoscaler status: 2026-01-10 19:46:26.274139 ========
Node status
---------------------------------------------------------------
Active:
1 node_xxxxx (head)
1 node_xxxxx (worker)
Resources
---------------------------------------------------------------
Total Usage:
0.0/68.0 CPU
0.0/3.0 GPU
Step 6. Run Validation Test (4B Model)
Start small model for validation on DGX Spark container:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-4B-Instruct-2507 \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--gpu-memory-utilization 0.8 \
--host 192.168.200.1 \
--port 8000
Test from another terminal:
curl -X POST "http://192.168.200.1:8000/v1/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-4B-Instruct-2507",
"prompt": "Test distributed inference:",
"max_tokens": 500
}'
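The same request can be scripted against the OpenAI-compatible endpoint. A standard-library-only sketch, assuming the Step 6 server is reachable at 192.168.200.1:8000 (the function names here are illustrative, not part of vLLM):

```python
import json
import urllib.request

def build_payload(prompt, model="Qwen/Qwen3-4B-Instruct-2507", max_tokens=500):
    """Request body matching the curl example above."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def completion_request(prompt, host="192.168.200.1", port=8000, **kwargs):
    """POST to the vLLM OpenAI-compatible completions endpoint."""
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/completions",
        data=json.dumps(build_payload(prompt, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (with the Step 6 server running):
#   print(completion_request("Test distributed inference:")["choices"][0]["text"])
```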
Step 7. Run FP8 Quantized Model (30B)
FP8 quantization provides excellent memory efficiency with good performance:
## Stop previous model (Ctrl+C), then start FP8 30B model
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--gpu-memory-utilization 0.8 \
--host 192.168.200.1 \
--port 8000
Benefits of FP8:
- Memory efficiency: Reduced footprint compared to BF16
- Performance: 341+ tok/s demonstrated
- Hardware compatibility: Fully supported on Blackwell GB10
Step 8. Run Production Model (72B)
Memory-optimized configuration for 136GB model:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-72B-Instruct \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--gpu-memory-utilization 0.85 \
--host 192.168.200.1 \
--port 8000 \
--max-model-len 2048 \
--max-num-seqs 8 \
--disable-sliding-window \
--enforce-eager
Memory optimization parameters:
- --gpu-memory-utilization 0.85: Uses 85% of GPU memory
- --max-model-len 2048: Limits context length to save memory
- --max-num-seqs 8: Reduces concurrent sequences
- --disable-sliding-window: Disables memory-intensive sliding window attention
- --enforce-eager: Uses eager execution (saves memory)
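Why these flags matter becomes clear with a little arithmetic: per-token KV cache is 2 (K and V) × layers × KV heads × head dimension × bytes per element, so capping context length and concurrency directly caps KV memory. A sketch with illustrative values for a Qwen2.5-72B-class model (80 layers, 8 grouped-query KV heads of dimension 128, FP16 - assumed here, not read from the model config):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, dtype_bytes,
                   seq_len, num_seqs):
    """Total KV cache: 2 tensors (K and V) per layer per cached token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * seq_len * num_seqs

# Illustrative config: 80 layers, 8 KV heads, head_dim 128, FP16 (2 bytes).
per_token = kv_cache_bytes(80, 8, 128, 2, seq_len=1, num_seqs=1)
total = kv_cache_bytes(80, 8, 128, 2, seq_len=2048, num_seqs=8)
print(f"Per token: {per_token / 1024:.0f} KiB")                       # 320 KiB
print(f"At --max-model-len 2048, --max-num-seqs 8: {total / 2**30:.1f} GiB")  # 5.0 GiB
```

With each cached token costing roughly 320 KiB, an uncapped context and sequence count would quickly consume the headroom left after the model weights are loaded.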
Test 72B model:
curl -X POST "http://192.168.200.1:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-72B-Instruct",
"messages": [
{"role": "user", "content": "Explain the benefits of RDMA for AI workloads in one paragraph."}
],
"max_tokens": 500
}'
Step 9. Monitor RDMA Traffic
Monitor RDMA activity during inference:
## Run on both systems (separate terminals)
watch -n 0.5 "
echo '=== RDMA Counters ===';
echo -n 'TX: '; cat /sys/class/infiniband/rocep1s0f0/ports/1/counters/port_xmit_data;
echo -n 'RX: '; cat /sys/class/infiniband/rocep1s0f0/ports/1/counters/port_rcv_data;
echo 'Timestamp: '; date;
"
During inference, you'll see counters increasing as tensors are communicated between GPUs.
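These sysfs counters count 4-octet (32-bit) words rather than bytes, per the InfiniBand counter conventions, so link throughput can be computed from two samples taken an interval apart. A small conversion sketch:

```python
def link_gbps(counter_start, counter_end, interval_s):
    """Convert a port_xmit_data/port_rcv_data delta to Gbit/s.

    These counters advance in 4-octet (32-bit) words, so multiply the
    delta by 4 to get bytes, then by 8 to get bits.
    """
    return (counter_end - counter_start) * 4 * 8 / interval_s / 1e9

# Example: a delta of 1.5e9 words over 0.5 s is 96 Gbit/s,
# in line with the 90+ Gbps validated during RDMA setup.
print(f"{link_gbps(0, 1_500_000_000, 0.5):.1f} Gbit/s")
```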
Performance Benchmarks
Benchmark Commands
Single-node testing:
## On RTX 6000 Pro or DGX Spark
vllm bench latency --model Qwen/Qwen3-30B-A3B-Thinking-2507 --input-len 512 --output-len 2000 --num-iters 10
vllm bench throughput --model Qwen/Qwen3-30B-A3B-Thinking-2507 --input-len 512 --output-len 2000 --num-prompts 20
Distributed testing:
## 30B Model
vllm bench serve --host 192.168.200.1 --port 8000 --random-input-len 512 --random-output-len 2000 --num-prompts 20 --request-rate 2 --model Qwen/Qwen3-30B-A3B-Thinking-2507
## 72B Model
vllm bench serve --host 192.168.200.1 --port 8000 --random-input-len 256 --random-output-len 1500 --num-prompts 20 --request-rate 2 --model Qwen/Qwen2.5-72B-Instruct
Performance Results Summary
| Configuration | Avg Latency | Output Throughput | Total Throughput |
|---|---|---|---|
| RTX 6000 Pro (Single) | 36.87s | 679.88 tok/s | 853.90 tok/s |
| DGX Spark (Single) | 213.12s | 105.10 tok/s | 132.00 tok/s |
| Distributed RDMA | 191.09s | 205.83 tok/s | 259.41 tok/s |
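As a quick sanity check, the speedup figures cited in the insights that follow fall directly out of this table:

```python
# Ratios derived from the benchmark table above.
rtx_latency, dgx_latency = 36.87, 213.12   # average latency, seconds
rtx_tput, dgx_tput = 679.88, 105.10        # output throughput, tok/s

print(f"Latency speedup:    {dgx_latency / rtx_latency:.1f}x")   # 5.8x
print(f"Throughput speedup: {rtx_tput / dgx_tput:.1f}x")         # 6.5x
```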
Key Insights
RTX 6000 Pro: Clear Single-Node Winner
- 5.8x faster than DGX Spark for latency-critical workloads
- 6.5x higher output token throughput
- Best for: Interactive inference, real-time applications
Distributed RDMA: Aggregated Capacity
- 259.41 tok/s total throughput - nearly 2x the single DGX Spark (132.00 tok/s)
- Combined 224GB GPU memory (128GB DGX + 96GB RTX)
- Enables models too large for any single GPU
- Mean TTFT: 139.94 ms vs 213,120 ms on the single DGX Spark
DGX Spark: Memory Advantage
- 128GB unified memory enables larger models
- Slower inference but handles 100B+ models
- Best for: Extremely large models, memory-constrained scenarios
FP8 30B Model Results
============ Serving Benchmark Result ============
Successful requests: 20
Benchmark duration (s): 115.36
Output token throughput (tok/s): 341.15
Total Token throughput (tok/s): 429.89
Mean TTFT (ms): 171.00
Mean TPOT (ms): 53.08
==================================================
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Ray worker can't connect to head | Firewall blocking port 6379 | sudo ufw allow 6379/tcp |
| NCCL timeout during model load | RDMA not working | Verify ib_send_bw test passes |
| "Placement group" errors | Ray cluster not formed | Check ray status on both nodes |
| OOM during 72B model load | Insufficient memory optimization | Add --max-model-len 2048 --enforce-eager |
| SSH connection refused | SSH server not running | sudo systemctl start ssh |
| Container can't access RDMA | Missing device mount | Ensure -v /dev/infiniband:/dev/infiniband |
| Wrong IP in Ray cluster | Multiple network interfaces | Set RAY_USE_MULTIPLE_IPS=0 |
| Slow inference performance | NCCL using wrong interface | Verify NCCL_SOCKET_IFNAME=enp1s0f0np0 |
Credits
This playbook was contributed by Csaba Kecskemeti | DevQuasar.
For a detailed walkthrough and additional context, see the original article: Distributed Inference Cluster: DGX Spark + RTX 6000 Pro
