DGX Spark User Performance Guide
This repository contains benchmarking information, setup instructions, and example runs for evaluating AI workloads on NVIDIA DGX Spark.
It covers a wide range of frameworks and workloads including large language models (LLMs), diffusion models, fine-tuning, and more using tools such as TensorRT-LLM, vLLM, SGLang, Llama.cpp, and others.
Prerequisites
Before running any benchmarks, ensure the following prerequisites are met for your selected workload:
- Access to a DGX Spark (or 2x DGX Spark)
- Docker and NVIDIA Container Toolkit
- A valid Hugging Face Token
This guide includes benchmarking instructions for:
Single Spark
- TensorRT-LLM (TRT-LLM)
- vLLM
- SGLang
- Llama.cpp
- Image generation (Flux and SDXL)
- Fine-tuning
Dual Spark
Single Spark
TensorRT-LLM (TRT-LLM)
What this measures
- Offline: raw model throughput/latency under synthetic load, with no HTTP stack or scheduler overhead.
- Online: End-to-end serving performance through trtllm-serve (HTTP + scheduler + KV cache).
For more details, see the official trtllm-bench and trtllm-serve documentation.
Prerequisites (applies to Offline & Online)
1) Docker permissions
sudo usermod -aG docker $USER
newgrp docker
2) Set environment variables
# -------------------------------
# Environment Setup
# -------------------------------
export HF_TOKEN="<your_huggingface_token>" # optional if model is public
export MODEL_HANDLE="openai/gpt-oss-20b"
export ISL=128
export OSL=128
export MAX_TOKENS=$((ISL + OSL))
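MAX_TOKENS here is simply the per-request sequence budget: the input length plus the output length. A quick sanity check of the arithmetic with the example values above:

```shell
# Derive the per-request token budget from input/output lengths
ISL=128
OSL=128
MAX_TOKENS=$((ISL + OSL))
echo "MAX_TOKENS=$MAX_TOKENS"   # prints MAX_TOKENS=256
```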
Offline Benchmark
This runs trtllm-bench with a deterministic synthetic dataset matching your ISL/OSL.
# -------------------------------
# TensorRT-LLM Offline Benchmark
# -------------------------------
docker run \
--rm -it \
--gpus all \
--ipc host \
--network host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-e ISL="$ISL" \
-e OSL="$OSL" \
-e MAX_TOKENS="$MAX_TOKENS" \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
bash -lc '
set -e
# 1) Download model (cached in /root/.cache/huggingface)
hf download "$MODEL_HANDLE"
# 2) Prepare synthetic dataset (fixed ISL/OSL)
python benchmarks/cpp/prepare_dataset.py \
--tokenizer "$MODEL_HANDLE" \
--stdout token-norm-dist \
--num-requests 1 \
--input-mean "$ISL" --input-stdev 0 \
--output-mean "$OSL" --output-stdev 0 \
> /tmp/dataset.txt
# 3) Optional tuning config
cat > /tmp/extra-llm-api-config.yml <<EOF
kv_cache_config:
  dtype: "auto"
cuda_graph_config:
  enable_padding: true
EOF
# 4) Run offline benchmark
trtllm-bench -m "$MODEL_HANDLE" throughput \
--dataset /tmp/dataset.txt \
--backend pytorch \
--tp 1 \
--max_num_tokens "$MAX_TOKENS" \
--concurrency 1 \
--max_batch_size 1 \
--kv_cache_free_gpu_mem_fraction 0.95 \
--extra_llm_api_options /tmp/extra-llm-api-config.yml
'
Online Benchmark
Terminal 1 - run the TRT-LLM server
# -------------------------------
# Launch TensorRT-LLM OpenAI server
# -------------------------------
docker run \
--rm -it \
--gpus=all \
--ipc=host \
--network host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
bash -lc '
trtllm-serve serve "$MODEL_HANDLE" \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_num_tokens '"$MAX_TOKENS"' \
--max_batch_size 1 \
--kv_cache_free_gpu_memory_fraction 0.9 \
--tp_size 1 \
--ep_size 1 \
--trust_remote_code
'
Terminal 2 - run the client (vLLM's built-in bench)
export MODEL_HANDLE="openai/gpt-oss-20b"
export ISL=128
export OSL=128
export MAX_TOKENS=$((ISL + OSL))
# -------------------------------
# Launch Benchmark Client
# -------------------------------
docker run \
--rm -it \
--gpus all \
--ipc host \
--network host \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-e ISL="$ISL" \
-e OSL="$OSL" \
nvcr.io/nvidia/vllm:25.12-py3 \
bash -lc '
vllm bench serve \
--base-url http://127.0.0.1:8000 \
--endpoint /v1/completions \
--model "$MODEL_HANDLE" \
--dataset-name random \
--num-prompts 1 \
--random-input-len "$ISL" \
--random-output-len "$OSL" \
--percentile-metrics ttft,tpot,itl,e2el \
--max-concurrency 1 \
--request-rate inf
'
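With the server from Terminal 1 running, a one-off request is a useful smoke test before starting the benchmark client. A minimal sketch against the OpenAI-compatible completions endpoint (the prompt and max_tokens values here are arbitrary):

```shell
# Smoke-test the trtllm-serve OpenAI-compatible endpoint
# (assumes the Terminal 1 server is listening on 127.0.0.1:8000)
curl -s http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"$MODEL_HANDLE"'",
        "prompt": "What is the capital of France?",
        "max_tokens": 16
      }'
```

A JSON body containing a choices array indicates the server is ready for the benchmark client.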
vLLM
What this measures
- Offline: raw model throughput/latency under synthetic load, with no HTTP stack or scheduler overhead.
- Online: End-to-end serving performance through vLLM (HTTP + scheduler + KV cache).
For more details, see the official vLLM benchmarking documentation.
Prerequisites (applies to Offline & Online)
1) Docker permissions
sudo usermod -aG docker $USER
newgrp docker
2) Set environment variables
# -------------------------------
# Environment Setup
# -------------------------------
export HF_TOKEN="<your_huggingface_token>" # optional if model is public
export MODEL_HANDLE="openai/gpt-oss-20b"
export ISL=128
export OSL=128
export MAX_TOKENS=$((ISL + OSL))
Offline Benchmark
This runs vllm bench throughput directly inside the vLLM container with a synthetic random dataset.
# -------------------------------
# vLLM Offline Benchmark
# -------------------------------
docker run \
--rm -it \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-e ISL="$ISL" \
-e OSL="$OSL" \
-e MAX_TOKENS="$MAX_TOKENS" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
nvcr.io/nvidia/vllm:25.12-py3 \
bash -lc '
pip install -q datasets && \
vllm bench throughput \
--model "$MODEL_HANDLE" \
--dataset-name random \
--num-prompts 1 \
--input-len $ISL \
--output-len $OSL \
--max-model-len $MAX_TOKENS \
--gpu-memory-utilization 0.8
'
Online Benchmark
Terminal 1 - run the vLLM server
# -------------------------------
# Launch vLLM OpenAI server
# -------------------------------
docker run \
--rm -it \
--gpus all \
--ipc host \
--network host \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-e MAX_TOKENS="$MAX_TOKENS" \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
nvcr.io/nvidia/vllm:25.12-py3 \
bash -lc '
vllm serve "$MODEL_HANDLE" \
--host 0.0.0.0 \
--port 8000 \
--dtype auto \
--max-model-len $MAX_TOKENS \
--gpu-memory-utilization 0.9 \
--trust-remote-code
'
Terminal 2 - run the client benchmark
export MODEL_HANDLE="openai/gpt-oss-20b"
export ISL=128
export OSL=128
export MAX_TOKENS=$((ISL + OSL))
# -------------------------------
# Launch Benchmark Client
# -------------------------------
docker run \
--rm -it \
--gpus all \
--ipc host \
--network host \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-e ISL="$ISL" \
-e OSL="$OSL" \
nvcr.io/nvidia/vllm:25.12-py3 \
bash -lc '
vllm bench serve \
--base-url http://127.0.0.1:8000 \
--endpoint /v1/completions \
--model "$MODEL_HANDLE" \
--dataset-name random \
--num-prompts 1 \
--random-input-len $ISL \
--random-output-len $OSL \
--percentile-metrics ttft,tpot,itl,e2el \
--max-concurrency 1 \
--request-rate inf
'
SGLang
What this measures
- Offline: raw model throughput/latency under synthetic load, with no HTTP stack or scheduler overhead.
- Online: End-to-end serving performance through SGLang (HTTP + scheduler + KV cache).
For more details, see the official SGLang benchmarking documentation.
Prerequisites (applies to Offline & Online)
1) Docker permissions
sudo usermod -aG docker $USER
newgrp docker
2) Set environment variables
# -------------------------------
# Environment Setup
# -------------------------------
export HF_TOKEN="<your_huggingface_token>" # optional if model is public
export MODEL_HANDLE="openai/gpt-oss-20b"
export ISL=128
export OSL=128
export MAX_TOKENS=$((ISL + OSL))
Offline Benchmark
This runs the official SGLang offline throughput benchmark to measure raw model execution performance without launching a server.
# -------------------------------
# SGLang Offline Benchmark
# -------------------------------
docker run \
--rm -it \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
nvcr.io/nvidia/sglang:25.12-py3 \
bash -lc '
python3 -m sglang.bench_offline_throughput \
--model-path "$MODEL_HANDLE" \
--dataset-name random \
--num-prompts 1
'
Online Benchmark
Terminal 1 - run the SGLang server
# -------------------------------
# Launch SGLang HTTP Server
# -------------------------------
docker run \
--rm -it \
--gpus all \
--ipc host \
--network host \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
nvcr.io/nvidia/sglang:25.12-py3 \
bash -lc '
python3 -m sglang.launch_server \
--model-path "$MODEL_HANDLE" \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code \
--tp 1 \
--attention-backend triton \
--mem-fraction-static 0.75
'
Terminal 2 - run the client benchmark
# -------------------------------
# SGLang Online Benchmark Client
# -------------------------------
docker run \
--rm -it \
--network host \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
nvcr.io/nvidia/sglang:25.12-py3 \
bash -lc '
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 \
--port 30000 \
--model "$MODEL_HANDLE" \
--dataset-name random \
--num-prompts 1 \
--random-input-len '"$ISL"' \
--random-output-len '"$OSL"'
'
Llama.cpp
What this measures
- Offline: raw model throughput/latency under synthetic load, with no HTTP stack or scheduler overhead.
- Online: End-to-end serving performance through llama-server (HTTP + scheduler + KV cache).
For more details, see the GitHub discussion.
Prerequisites (applies to Offline & Online)
Note:
DGX Spark uses a long-term support (LTS) base software stack, so the host OS, driver, and CUDA toolkit are updated together on a fixed release cadence. To access the latest CUDA features and performance improvements, run the NVIDIA NGC containers (PyTorch, vLLM, TensorRT-LLM, etc.), which are validated for DGX Spark and include newer CUDA toolkits without modifying the host system. If required, CUDA can also be installed directly via Debian packages; however, this approach is not recommended for most users and falls outside the supported DGX OS stack.
1) Launch the latest PyTorch container from NGC.
export HF_TOKEN="<your_huggingface_token>" # optional if model is public
docker run --rm -it \
--gpus all \
--ipc=host \
-p 8080:8080 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME":/home/nvidia \
-w /home/nvidia \
nvcr.io/nvidia/pytorch:25.12-py3
2) Clone and build the latest Llama.cpp
# -------------------------------
# Clone and build Llama.cpp
# -------------------------------
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
sudo apt-get update
sudo apt-get install -y libcurl4-openssl-dev cmake g++ make
# Build with CUDA support for NVIDIA GPUs (adjust arch as needed)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121a" -DGGML_CUDA_CUB_3DOT2=on
cmake --build build --config Release -j
3) Download model weights
# -------------------------------
# Download GGUF model
# -------------------------------
cd models
# Example: GPT-OSS-20B
curl -L -o gpt-oss-20b-mxfp4.gguf https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-mxfp4.gguf
4) Set environment variables
# -------------------------------
# Environment Setup
# -------------------------------
export MODEL_HANDLE="gpt-oss-20b-mxfp4.gguf"
export ISL=128
export OSL=128
export MAX_TOKENS=$((ISL + OSL))
Offline Benchmark
# -------------------------------
# Llama.cpp Offline Benchmark
# -------------------------------
./build/bin/llama-bench \
-m models/$MODEL_HANDLE \
-t $(nproc) \
-p $ISL \
-n $OSL \
-ngl 99 \
-dio 1 \
-fa 1
Online Benchmark
Terminal 1 - run the server
# -------------------------------
# Launch Llama.cpp Server
# -------------------------------
./build/bin/llama-server \
--model models/$MODEL_HANDLE \
--ctx-size $MAX_TOKENS \
--n-predict $OSL \
--threads $(nproc) \
--host 0.0.0.0 \
--port 8080 \
-fa 1 \
--backend-sampling
Terminal 2 - run the client
# -------------------------------
# Launch Benchmark Client
# -------------------------------
curl -s -H "Content-Type: application/json" \
-d "{
\"prompt\": \"What is the capital of France?\",
\"temperature\": 0.5,
\"stream\": false
}" \
http://127.0.0.1:8080/completion | jq .
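The server response also carries its own timing data, which can stand in for a quick throughput reading. The snippet below filters it out; the timings field names are an assumption based on llama-server's default /completion response schema, so adjust them if your build reports different keys.

```shell
# Send one request and pull the generation speed out of the
# response's "timings" object (field names assumed; see note above)
curl -s -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?", "stream": false}' \
  http://127.0.0.1:8080/completion |
python3 -c 'import json, sys
t = json.load(sys.stdin).get("timings", {})
print("predicted tokens/s:", t.get("predicted_per_second"))'
```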
Image Generation
This benchmark evaluates diffusion model performance using TensorRT-based pipelines for:
- Flux.1 Schnell
- SDXL 1.0
You will measure image generation latency and throughput for text-to-image workloads on DGX Spark.
Prerequisites
1) Docker permissions
sudo usermod -aG docker $USER
newgrp docker
2) Set environment variables
# -------------------------------
# Environment Setup
# -------------------------------
export HF_TOKEN="<your_huggingface_token>"
3) Launch the latest PyTorch container
docker run --rm -it \
--gpus all \
--ipc=host \
-e HF_TOKEN="$HF_TOKEN" \
-v $HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub \
nvcr.io/nvidia/pytorch:25.12-py3
4) Inside the container
git clone https://github.com/NVIDIA/TensorRT.git -b main --single-branch
export TRT_OSSPATH=$(pwd)/TensorRT
cd $TRT_OSSPATH/demo/Diffusion
pip install nvidia-modelopt[onnx,hf]
sed -i '/^nvidia-modelopt\[.*\]=.*/d' requirements.txt
pip install -r requirements.txt
apt-get update && \
apt-get install -y libgl1 libglib2.0-0 libsm6 libxrender1 libxext6
Flux.1 Schnell:
# -------------------------------
# Flux.1 Schnell txt2img Benchmark
# -------------------------------
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token="$HF_TOKEN" \
--version="flux.1-schnell" \
--fp4 \
--download-onnx-models \
--batch-size 1 \
--width 1024 \
--height 1024 \
--denoising-steps 4
SDXL 1.0:
# -------------------------------
# SDXL 1.0 txt2img Benchmark
# -------------------------------
python3 demo_txt2img_xl.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token="$HF_TOKEN" \
--version xl-1.0 \
--download-onnx-models \
--batch-size 2 \
--width 1024 \
--height 1024 \
--denoising-steps 50
Fine-tuning
What this measures
This benchmark evaluates training performance (step time, throughput, memory usage) for different fine-tuning strategies on DGX Spark:
- LoRA fine-tuning for parameter-efficient adaptation of Llama 3 8B
- qLoRA fine-tuning for memory-efficient adaptation of Llama 3 70B
- Full fine-tuning of a smaller Llama 3 3B model
Prerequisites
1) Docker permissions
sudo usermod -aG docker $USER
newgrp docker
2) Set environment variables
# -------------------------------
# Environment Setup
# -------------------------------
export HF_TOKEN="<your_huggingface_token>"
3) Launch the latest PyTorch container
docker run --rm -it \
--gpus all \
-e HF_TOKEN="$HF_TOKEN" \
--ipc=host \
-v $HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub \
nvcr.io/nvidia/pytorch:25.12-py3
4) Inside the container
# Install dependencies
pip install transformers peft datasets "trl==0.26.2" "bitsandbytes==0.49.1"
# Clone DGX Spark playbooks
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets
# Force bitsandbytes to use CUDA 13.0 binary (CUDA 13.1 not yet supported)
export BNB_CUDA_VERSION=130
LoRA Fine-tuning
# -------------------------------
# Llama 3 8B LoRA Fine-tuning
# -------------------------------
python Llama3_8B_LoRA_finetuning.py --use_torch_compile
qLoRA Fine-tuning
# -------------------------------
# Llama 3 70B qLoRA Fine-tuning
# -------------------------------
python Llama3_70B_qLoRA_finetuning.py
Full Fine-tuning
# -------------------------------
# Llama 3 3B Full Fine-tuning
# -------------------------------
python Llama3_3B_full_finetuning.py --use_torch_compile
Dual Spark
Measure BW Between Dual Sparks
DGX Spark systems support high-bandwidth, low-latency interconnects over QSFP ports. Bandwidth between two Sparks can be validated at two different layers:
GPU collective bandwidth using NCCL
- Follow the instructions here: https://build.nvidia.com/spark/nccl/stacked-sparks
What this measures
- This test measures effective GPU collective communication bandwidth using NCCL.
Raw RDMA fabric bandwidth (RoCE)
What this measures
- This test measures raw point-to-point RDMA bandwidth over the QSFP-connected CX-7 NICs using RoCE.
Prerequisites
1) Install perftest tools (on both Sparks)
sudo apt install perftest
2) Ensure one QSFP cable connects Spark-1 ↔ Spark-2 directly
Setup
Step 1 – Identify devices and logical ports
Run on both Spark systems:
ibdev2netdev
Example output (from Spark-1):
nvidia@spark-5c2d:~$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
You will use the Up interfaces for IP assignment.
Example output (from Spark-2):
nvidia@spark-bd26:~$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
You will use the Up interfaces for IP assignment.
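The Up entries can also be pulled out mechanically. The sample below filters the example output shown above; on a live system, pipe ibdev2netdev straight into the same awk filter:

```shell
# Print only RDMA devices whose port is Up (sample lines mirror the
# example output above; live: ibdev2netdev | awk '/\(Up\)$/ {print $1}')
printf '%s\n' \
  'rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)' \
  'rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)' \
  'roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)' |
awk '/\(Up\)$/ {print $1}'   # prints rocep1s0f0 and roceP2p1s0f0
```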
Step 2 – Assign Manual IPs
Assign unique subnets to each active port.
Note: Repeat this step after reboot if NetworkManager clears them.
Spark-1 (HOST)
# Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  renderer: NetworkManager
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.200.12/24
      dhcp4: no
    enP2p1s0f0np0:
      addresses:
        - 192.168.201.12/24
      dhcp4: no
EOF
sudo chmod 600 /etc/netplan/40-cx7.yaml
sudo netplan apply
Note: Interfaces may differ; use the ones marked Up.
Spark-2 (CLIENT)
# Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  renderer: NetworkManager
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.200.13/24
      dhcp4: no
    enP2p1s0f0np0:
      addresses:
        - 192.168.201.13/24
      dhcp4: no
EOF
sudo chmod 600 /etc/netplan/40-cx7.yaml
sudo netplan apply
Note: Interfaces may differ; use the ones marked Up.
Step 3 – Run Bandwidth Test
Open two terminals on each Spark (4 total).
Note: Make sure ports 12000 and 12001 are open and not in use.
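One way to verify the ports are free before starting (ss is part of iproute2; empty output means nothing is bound to either port):

```shell
# Show any TCP listeners already bound to the benchmark ports
ss -tln | awk '$4 ~ /:(12000|12001)$/'
```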
Spark-1 (HOST) - Terminal 1
ib_write_bw -d rocep1s0f0 -i 1 -p 12000 -F --report_gbits --run_infinitely
Note: Replace device names with your actual Up interfaces. (replace rocep1s0f0 with your <HOST_NIC1_INTERFACE>)
Spark-1 (HOST) - Terminal 2
ib_write_bw -d roceP2p1s0f0 -i 1 -p 12001 -F --report_gbits --run_infinitely
Note: Replace device names with your actual Up interfaces. (replace roceP2p1s0f0 with your <HOST_NIC2_INTERFACE>)
Spark-2 (CLIENT) - Terminal 1
ib_write_bw -d rocep1s0f0 -i 1 -p 12000 -F --report_gbits 192.168.200.12 --run_infinitely
Note: Replace device names with your actual Up interfaces. (replace rocep1s0f0 with your <CLIENT_NIC1_INTERFACE>)
Spark-2 (CLIENT) - Terminal 2
ib_write_bw -d roceP2p1s0f0 -i 1 -p 12001 -F --report_gbits 192.168.201.12 --run_infinitely
Note: Replace device names with your actual Up interfaces. (replace roceP2p1s0f0 with your <CLIENT_NIC2_INTERFACE>)
Step 4 – Monitor Bandwidth
Example client output:
Client-1 Output
nvidia@spark-bd26:~$ ib_write_bw -d rocep1s0f0 -i 1 -p 12000 -F --report_gbits 192.168.200.12 --run_infinitely
WARNING: BW peak won't be measured in this run.
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : rocep1s0f0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0129 PSN 0x57279d RKey 0x184300 VAddr 0x00ec99bedad000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:200:13
remote address: LID 0000 QPN 0x0129 PSN 0x531b7 RKey 0x184300 VAddr 0x00ffeec955d000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:200:12
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 882805 0.00 92.57 0.176554
65536 882802 0.00 92.57 0.176554
65536 882791 0.00 92.57 0.176554
65536 882791 0.00 92.56 0.176552
65536 882821 0.00 92.57 0.176555
Client-2 Output
nvidia@spark-bd26:~$ ib_write_bw -d roceP2p1s0f0 -i 1 -p 12001 -F --report_gbits 192.168.201.12 --run_infinitely
WARNING: BW peak won't be measured in this run.
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : roceP2p1s0f0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x01a9 PSN 0x5e41f9 RKey 0x1a03ed VAddr 0x00f374277dd000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:201:13
remote address: LID 0000 QPN 0x01a9 PSN 0x8ab8e7 RKey 0x1a0300 VAddr 0x00f285f5f1d000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:201:12
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 927940 0.00 97.28 0.185548
65536 927790 0.00 97.28 0.185549
65536 927766 0.00 97.28 0.185550
65536 927754 0.00 97.28 0.185545
65536 927804 0.00 97.29 0.185557
65536 927807 0.00 97.28 0.185554
Total throughput = 92.57 + 97.28 = 189.85 Gbps
Measure RDMA Latency Between Dual Sparks
What this measures
This test measures one-way RDMA latency between two DGX Spark systems over the same QSFP RoCE links used for bandwidth testing.
Prerequisites
Before running the latency tests, complete Step 1 (Identify devices and logical ports) and Step 2 (Assign Manual IPs) from the Measure BW Between Dual Sparks section above.
Step 1.1 – Run RDMA Write Latency Test
This measures RDMA write latency on a single QSFP link.
Open two terminals (one per Spark).
Spark-1 (HOST)
ib_write_lat -d rocep1s0f0 -i 1 -p 13000 -F
Replace rocep1s0f0 with your actual Up RDMA device from ibdev2netdev.
Spark-2 (CLIENT)
ib_write_lat -d rocep1s0f0 -i 1 -p 13000 -F 192.168.200.12
Step 1.2 – Run RDMA Read Latency Test
This measures RDMA read latency on the second QSFP link.
Open two terminals (one per Spark).
Spark-1 (HOST)
ib_read_lat -d roceP2p1s0f0 -i 1 -p 13001 -F
Spark-2 (CLIENT)
ib_read_lat -d roceP2p1s0f0 -i 1 -p 13001 -F 192.168.201.12
Note: RDMA latency is a per-link metric and should be measured on a single QSFP link at a time. Latency values are not aggregated across multiple links.