chore: Regenerate all playbooks

2026-06-18 04:22:21 +00:00 · 2026-01-27 16:23:49 +00:00 · 2026-01-27 16:23:49 +00:00 · a77367d758
commit a77367d758
parent 7acab77601
2 changed files with 967 additions and 1 deletions
--- a/nvidia/connect-two-sparks/assets/performance_benchmarking_guide.md
+++ b/nvidia/connect-two-sparks/assets/performance_benchmarking_guide.md
@ -0,0 +1,964 @@
+# DGX Spark User Performance Guide
+
+This repository contains benchmarking information, setup instructions, and example runs for evaluating AI workloads on NVIDIA DGX Spark.
+
+It covers a wide range of frameworks and workloads including large language models (LLMs), diffusion models, fine-tuning, and more using tools such as TensorRT-LLM, vLLM, SGLang, Llama.cpp, and others.
+
+Before running any benchmarks, ensure the following prerequisites are met for your selected workload:
+
+## Prerequisites
+
+- Access to a DGX Spark (two DGX Sparks for the Dual Spark tests)
+- Docker and NVIDIA Container Toolkit
+- A valid Hugging Face token
+
+## This guide includes benchmarking instructions for:
+
+### Single Spark
+- **[TensorRT-LLM (TRT-LLM)](#tensorrt-llm-trt-llm)**
+  - [Offline](#offline-benchmark)
+  - [Online](#online-benchmark)
+- **[vLLM](#vllm)**
+  - [Offline](#offline-benchmark-1)
+  - [Online](#online-benchmark-1)
+- **[SGLang](#sglang)**
+  - [Offline](#offline-benchmark-2)
+  - [Online](#online-benchmark-2)
+- **[Llama.cpp](#llamacpp)**
+  - [Offline](#offline-benchmark-3)
+  - [Online](#online-benchmark-3)
+- **[Image generation](#image-generation)** (Flux and SDXL)
+- **[Fine-tuning](#fine-tuning)**
+
+### Dual Spark
+- [Measure bandwidth for Dual Spark setup](#measure-bw-between-dual-sparks)
+- [Measure RDMA latency between Dual Sparks](#measure-rdma-latency-between-dual-sparks)
+
+---
+
+# Single Spark
+
+## TensorRT-LLM (TRT-LLM)
+
+### What this measures
+
+- **Offline**: Raw model throughput/latency under synthetic load without HTTP stack, no scheduler overhead.
+- **Online**: End-to-end serving performance through trtllm-serve (HTTP + scheduler + KV cache).
+
+For more details, visit [trtllm-bench](https://github.com/NVIDIA/TensorRT-LLM/tree/main/benchmarks/cpp) and [trtllm-serve](https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/serve) official documentation.
+
+### Prerequisites (applies to Offline & Online)
+
+#### 1) Docker permissions
+```bash
+sudo usermod -aG docker $USER
+newgrp docker
+```
+
+#### 2) Set environment variables
+```bash
+# -------------------------------
+# Environment Setup
+# -------------------------------
+export HF_TOKEN="<your_huggingface_token>"       # optional if model is public
+export MODEL_HANDLE="openai/gpt-oss-20b"
+export ISL=128
+export OSL=128
+export MAX_TOKENS=$((ISL + OSL))
+```
+
+### Offline Benchmark
+
+This runs trtllm-bench with a deterministic synthetic dataset matching your ISL/OSL.
+
+```bash
+# -------------------------------
+# TensorRT-LLM Offline Benchmark
+# -------------------------------
+docker run \
+  --rm -it \
+  --gpus all \
+  --ipc host \
+  --network host \
+  --ulimit memlock=-1 \
+  --ulimit stack=67108864 \
+  -e HF_TOKEN="$HF_TOKEN" \
+  -e MODEL_HANDLE="$MODEL_HANDLE" \
+  -e ISL="$ISL" \
+  -e OSL="$OSL" \
+  -e MAX_TOKENS="$MAX_TOKENS" \
+  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
+  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
+  bash -lc '
+set -e
+
+# 1) Download model (cached in /root/.cache/huggingface)
+hf download "$MODEL_HANDLE"
+
+# 2) Prepare synthetic dataset (fixed ISL/OSL)
+python benchmarks/cpp/prepare_dataset.py \
+  --tokenizer "$MODEL_HANDLE" \
+  --stdout token-norm-dist \
+  --num-requests 1 \
+  --input-mean "$ISL" --input-stdev 0 \
+  --output-mean "$OSL" --output-stdev 0 \
+  > /tmp/dataset.txt
+
+# 3) Optional tuning config
+cat > /tmp/extra-llm-api-config.yml <<EOF
+kv_cache_config:
+  dtype: "auto"
+cuda_graph_config:
+  enable_padding: true
+EOF
+
+# 4) Run offline benchmark
+trtllm-bench -m "$MODEL_HANDLE" throughput \
+  --dataset /tmp/dataset.txt \
+  --backend pytorch \
+  --tp 1 \
+  --max_num_tokens "$MAX_TOKENS" \
+  --concurrency 1 \
+  --max_batch_size 1 \
+  --kv_cache_free_gpu_mem_fraction 0.95 \
+  --extra_llm_api_options /tmp/extra-llm-api-config.yml
+'
+```
+
+### Online Benchmark
+
+#### Terminal 1 - run the TRT-LLM server
+```bash
+# -------------------------------
+# Launch TensorRT-LLM OpenAI server
+# -------------------------------
+docker run \
+  --rm -it \
+  --gpus=all \
+  --ipc=host \
+  --network host \
+  --ulimit memlock=-1 \
+  --ulimit stack=67108864 \
+  -e HF_TOKEN="$HF_TOKEN" \
+  -e MODEL_HANDLE="$MODEL_HANDLE" \
+  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
+  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
+  bash -lc '
+trtllm-serve serve "$MODEL_HANDLE" \
+  --host 0.0.0.0 \
+  --port 8000 \
+  --backend pytorch \
+  --max_num_tokens '"$MAX_TOKENS"' \
+  --max_batch_size 1 \
+  --kv_cache_free_gpu_memory_fraction 0.9 \
+  --tp_size 1 \
+  --ep_size 1 \
+  --trust_remote_code
+'
+```
+
+#### Terminal 2 - run the client (vLLM's built-in bench)
+```bash
+export MODEL_HANDLE="openai/gpt-oss-20b"
+export ISL=128
+export OSL=128
+export MAX_TOKENS=$((ISL + OSL))
+
+# -------------------------------
+# Launch Benchmark Client
+# -------------------------------
+docker run \
+  --rm -it \
+  --gpus all \
+  --ipc host \
+  --network host \
+  -e HF_TOKEN="$HF_TOKEN" \
+  -e MODEL_HANDLE="$MODEL_HANDLE" \
+  -e ISL="$ISL" \
+  -e OSL="$OSL" \
+  nvcr.io/nvidia/vllm:25.12-py3 \
+  bash -lc '
+vllm bench serve \
+  --base-url http://127.0.0.1:8000 \
+  --endpoint /v1/completions \
+  --model "$MODEL_HANDLE" \
+  --dataset-name random \
+  --num-prompts 1 \
+  --random-input-len "$ISL" \
+  --random-output-len "$OSL" \
+  --percentile-metrics ttft,tpot,itl,e2el \
+  --max-concurrency 1 \
+  --request-rate inf
+'
+```
+
+---
+
+## vLLM
+
+### What this measures
+
+- **Offline**: Raw model throughput/latency under synthetic load without HTTP stack, no scheduler overhead.
+- **Online**: End-to-end serving performance through vLLM (HTTP + scheduler + KV cache).
+
+For more details, visit the [official vLLM benchmarking documentation](https://docs.vllm.ai/en/latest/getting_started/benchmarking.html).
+
+### Prerequisites (applies to Offline & Online)
+
+#### 1) Docker permissions
+
+```bash
+sudo usermod -aG docker $USER
+newgrp docker
+```
+
+#### 2) Set environment variables
+```bash
+# -------------------------------
+# Environment Setup
+# -------------------------------
+export HF_TOKEN="<your_huggingface_token>"       # optional if model is public
+export MODEL_HANDLE="openai/gpt-oss-20b"
+export ISL=128
+export OSL=128
+export MAX_TOKENS=$((ISL + OSL))
+```
+
+### Offline Benchmark
+
+This runs vllm bench throughput directly inside the vLLM container with a synthetic random dataset.
+
+```bash
+# -------------------------------
+# vLLM Offline Benchmark
+# -------------------------------
+docker run \
+  --rm -it \
+  --gpus all \
+  --ipc host \
+  --ulimit memlock=-1 \
+  --ulimit stack=67108864 \
+  -e MODEL_HANDLE="$MODEL_HANDLE" \
+  -e ISL="$ISL" \
+  -e OSL="$OSL" \
+  -e MAX_TOKENS="$MAX_TOKENS" \
+  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
+  nvcr.io/nvidia/vllm:25.12-py3 \
+  bash -lc '
+pip install -q datasets && \
+vllm bench throughput \
+  --model "$MODEL_HANDLE" \
+  --dataset-name random \
+  --num-prompts 1 \
+  --input-len $ISL \
+  --output-len $OSL \
+  --max-model-len $MAX_TOKENS \
+  --gpu-memory-utilization 0.8
+'
+```
+
+### Online Benchmark
+
+#### Terminal 1 - run the vLLM server
+```bash
+# -------------------------------
+# Launch vLLM OpenAI server
+# -------------------------------
+docker run \
+  --rm -it \
+  --gpus all \
+  --ipc host \
+  --network host \
+  -e HF_TOKEN="$HF_TOKEN" \
+  -e MODEL_HANDLE="$MODEL_HANDLE" \
+  -e MAX_TOKENS="$MAX_TOKENS" \
+  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
+  nvcr.io/nvidia/vllm:25.12-py3 \
+  bash -lc '
+vllm serve "$MODEL_HANDLE" \
+  --host 0.0.0.0 \
+  --port 8000 \
+  --dtype auto \
+  --max-model-len $MAX_TOKENS \
+  --gpu-memory-utilization 0.9 \
+  --trust-remote-code
+'
+```
+
+#### Terminal 2 - run the client benchmark
+```bash
+export MODEL_HANDLE="openai/gpt-oss-20b"
+export ISL=128
+export OSL=128
+export MAX_TOKENS=$((ISL + OSL))
+
+# -------------------------------
+# Launch Benchmark Client
+# -------------------------------
+docker run \
+  --rm -it \
+  --gpus all \
+  --ipc host \
+  --network host \
+  -e HF_TOKEN="$HF_TOKEN" \
+  -e MODEL_HANDLE="$MODEL_HANDLE" \
+  -e ISL="$ISL" \
+  -e OSL="$OSL" \
+  nvcr.io/nvidia/vllm:25.12-py3 \
+  bash -lc '
+vllm bench serve \
+  --base-url http://127.0.0.1:8000 \
+  --endpoint /v1/completions \
+  --model "$MODEL_HANDLE" \
+  --dataset-name random \
+  --num-prompts 1 \
+  --random-input-len $ISL \
+  --random-output-len $OSL \
+  --percentile-metrics ttft,tpot,itl,e2el \
+  --max-concurrency 1 \
+  --request-rate inf
+'
+```
+
+---
+
+## SGLang
+
+### What this measures
+
+- **Offline**: Raw model throughput/latency under synthetic load without HTTP stack, no scheduler overhead.
+- **Online**: End-to-end serving performance through SGLang (HTTP + scheduler + KV cache).
+
+For more details, visit the [official SGLang benchmarking documentation](https://sgl-project.github.io/references/benchmark.html).
+
+### Prerequisites (applies to Offline & Online)
+
+#### 1) Docker permissions
+
+```bash
+sudo usermod -aG docker $USER
+newgrp docker
+```
+
+#### 2) Set environment variables
+```bash
+# -------------------------------
+# Environment Setup
+# -------------------------------
+export HF_TOKEN="<your_huggingface_token>"       # optional if model is public
+export MODEL_HANDLE="openai/gpt-oss-20b"
+export ISL=128
+export OSL=128
+export MAX_TOKENS=$((ISL + OSL))
+```
+
+### Offline Benchmark
+
+This runs the official SGLang offline throughput benchmark to measure raw model execution performance without launching a server.
+
+```bash
+# -------------------------------
+# SGLang Offline Benchmark
+# -------------------------------
+docker run \
+  --rm -it \
+  --gpus all \
+  --ipc host \
+  --ulimit memlock=-1 \
+  --ulimit stack=67108864 \
+  -e HF_TOKEN="$HF_TOKEN" \
+  -e MODEL_HANDLE="$MODEL_HANDLE" \
+  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
+  nvcr.io/nvidia/sglang:25.12-py3 \
+  bash -lc '
+python3 -m sglang.bench_offline_throughput \
+  --model-path "$MODEL_HANDLE" \
+  --dataset-name random \
+  --num-prompts 1
+'
+```
+
+### Online Benchmark
+
+#### Terminal 1 - run the SGLang server
+```bash
+# -------------------------------
+# Launch SGLang HTTP Server
+# -------------------------------
+docker run \
+  --rm -it \
+  --gpus all \
+  --ipc host \
+  --network host \
+  -e HF_TOKEN="$HF_TOKEN" \
+  -e MODEL_HANDLE="$MODEL_HANDLE" \
+  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
+  nvcr.io/nvidia/sglang:25.12-py3 \
+  bash -lc '
+python3 -m sglang.launch_server \
+  --model-path "$MODEL_HANDLE" \
+  --host 0.0.0.0 \
+  --port 30000 \
+  --trust-remote-code \
+  --tp 1 \
+  --attention-backend triton \
+  --mem-fraction-static 0.75
+'
+```
+
+#### Terminal 2 - run the client benchmark
+```bash
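+# ISL and OSL are expanded by the host shell below, so set them in this terminal too
+export MODEL_HANDLE="openai/gpt-oss-20b"
+export ISL=128
+export OSL=128
+
+# -------------------------------
+# SGLang Online Benchmark Client
+# -------------------------------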
+docker run \
+  --rm -it \
+  --network host \
+  -e HF_TOKEN="$HF_TOKEN" \
+  -e MODEL_HANDLE="$MODEL_HANDLE" \
+  nvcr.io/nvidia/sglang:25.12-py3 \
+  bash -lc '
+python3 -m sglang.bench_serving \
+  --backend sglang \
+  --host 127.0.0.1 \
+  --port 30000 \
+  --model "$MODEL_HANDLE" \
+  --dataset-name random \
+  --num-prompts 1 \
+  --random-input-len '"$ISL"' \
+  --random-output-len '"$OSL"'
+'
+```
+
+---
+
+## Llama.cpp
+
+### What this measures
+
+- **Offline**: Raw model throughput/latency under synthetic load without HTTP stack, no scheduler overhead.
+- **Online**: End-to-end serving performance through llama-server (HTTP + scheduler + KV cache).
+
+For more details, visit the [llama.cpp GitHub Discussions](https://github.com/ggml-org/llama.cpp/discussions).
+
+### Prerequisites (applies to Offline & Online)
+
+**Note:**  
+DGX Spark uses a long-term supported (LTS) base software stack, so the host OS, driver, and CUDA toolkit are updated together on a fixed release cadence. To access the latest CUDA features and performance improvements, users should run NVIDIA NGC containers (PyTorch, vLLM, TensorRT-LLM, etc.), which are validated for DGX Spark and include newer CUDA toolkits without modifying the host system. If required, users may also install CUDA directly via Debian packages; however, this approach is not recommended for most users and falls outside the supported DGX OS stack.
+
+#### 1) Launch the latest PyTorch container from NGC
+```bash
+export HF_TOKEN="<your_huggingface_token>"       # optional if model is public
+docker run --rm -it \
+  --gpus all \
+  --ipc=host \
+  -p 8080:8080 \
+  -e HF_TOKEN="$HF_TOKEN" \
+  -v "$HOME":/home/nvidia \
+  -w /home/nvidia \
+  nvcr.io/nvidia/pytorch:25.12-py3
+```
+
+#### 2) Clone and build the latest Llama.cpp
+```bash
+# -------------------------------
+# Clone and build Llama.cpp
+# -------------------------------
+git clone https://github.com/ggml-org/llama.cpp
+cd llama.cpp
+
+# Inside the NGC container you are root, so sudo is not needed
+apt-get update
+apt-get install -y libcurl4-openssl-dev cmake g++ make
+
+# Build with CUDA support for NVIDIA GPUs (adjust arch as needed)
+cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121a" -DGGML_CUDA_CUB_3DOT2=on
+
+cmake --build build --config Release -j
+```
+
+#### 3) Download model weights
+```bash
+# -------------------------------
+# Download GGUF model
+# -------------------------------
+# Example: GPT-OSS-20B
+# Saved into llama.cpp/models so the later commands can run from the repo root
+curl -L -o models/gpt-oss-20b-mxfp4.gguf https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-mxfp4.gguf
+```
+
+#### 4) Set environment variables
+```bash
+# -------------------------------
+# Environment Setup
+# -------------------------------
+
+export MODEL_HANDLE="gpt-oss-20b-mxfp4.gguf"
+export ISL=128
+export OSL=128
+export MAX_TOKENS=$((ISL + OSL))
+```
+
+### Offline Benchmark
+```bash
+# -------------------------------
+# Llama.cpp Offline Benchmark
+# -------------------------------
+./build/bin/llama-bench \
+  -m models/$MODEL_HANDLE \
+  -t $(nproc) \
+  -p $ISL \
+  -n $OSL \
+  -ngl 99 \
+  -dio 1 \
+  -fa 1
+```
+
+### Online Benchmark
+
+#### Terminal 1 - run the server
+```bash
+# -------------------------------
+# Launch Llama.cpp Server
+# -------------------------------
+./build/bin/llama-server \
+  --model models/$MODEL_HANDLE \
+  --ctx-size $MAX_TOKENS \
+  --n-predict $OSL \
+  --threads $(nproc) \
+  --host 0.0.0.0 \
+  --port 8080 \
+  -fa 1 \
+  --backend-sampling
+```
+
+#### Terminal 2 - run the client
+```bash
+# -------------------------------
+# Launch Benchmark Client
+# -------------------------------
+curl -s -H "Content-Type: application/json" \
+  -d "{
+    \"prompt\": \"What is the capital of France?\",
+    \"temperature\": 0.5,
+    \"stream\": false
+  }" \
+  http://127.0.0.1:8080/completion | jq .
+```
+
+---
+
+## Image Generation
+
+This benchmark evaluates diffusion model performance using TensorRT-based pipelines for:
+- Flux.1 Schnell
+- SDXL 1.0
+
+You will measure image generation latency and throughput for text-to-image workloads on DGX Spark.
+
+### Prerequisites
+
+#### 1) Docker permissions
+```bash
+sudo usermod -aG docker $USER
+newgrp docker
+```
+
+#### 2) Set environment variables
+```bash
+# -------------------------------
+# Environment Setup
+# -------------------------------
+export HF_TOKEN="<your_huggingface_token>"
+```
+
+#### 3) Launch latest PyTorch container
+```bash
+docker run --rm -it \
+  --gpus all \
+  --ipc=host \
+  -e HF_TOKEN="$HF_TOKEN" \
+  -v $HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub \
+  nvcr.io/nvidia/pytorch:25.12-py3
+```
+
+#### 4) Inside the container
+```bash
+git clone https://github.com/NVIDIA/TensorRT.git -b main --single-branch
+# Point TRT_OSSPATH at the directory that was just cloned
+export TRT_OSSPATH="$PWD/TensorRT"
+cd "$TRT_OSSPATH/demo/Diffusion"
+pip install "nvidia-modelopt[onnx,hf]"
+sed -i '/^nvidia-modelopt\[.*\]=.*/d' requirements.txt
+pip install -r requirements.txt
+
+apt-get update && \
+apt-get install -y libgl1 libglib2.0-0 libsm6 libxrender1 libxext6
+```
+
+### Flux.1 Schnell:
+```bash
+# -------------------------------
+# Flux.1 Schnell txt2img Benchmark
+# -------------------------------
+python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
+  --hf-token="$HF_TOKEN" \
+  --version="flux.1-schnell" \
+  --fp4 \
+  --download-onnx-models \
+  --batch-size 1 \
+  --width 1024 \
+  --height 1024 \
+  --denoising-steps 4
+```
+
+### SDXL 1.0:
+```bash
+# -------------------------------
+# SDXL 1.0 txt2img Benchmark
+# -------------------------------
+python3 demo_txt2img_xl.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
+  --hf-token="$HF_TOKEN" \
+  --version xl-1.0 \
+  --download-onnx-models \
+  --batch-size 2 \
+  --width 1024 \
+  --height 1024 \
+  --denoising-steps 50
+```
+
+---
+
+## Fine-tuning
+
+### What this measures
+
+This benchmark evaluates training performance (step time, throughput, memory usage) for different fine-tuning strategies on DGX Spark:
+- LoRA fine-tuning for parameter-efficient adaptation of Llama 3 8B
+- qLoRA fine-tuning for memory-efficient fine-tuning of Llama 3 70B
+- Full fine-tuning of a smaller Llama 3 3B model
+
+### Prerequisites
+
+#### 1) Docker permissions
+```bash
+sudo usermod -aG docker $USER
+newgrp docker
+```
+
+#### 2) Set environment variables
+```bash
+# -------------------------------
+# Environment Setup
+# -------------------------------
+export HF_TOKEN="<your_huggingface_token>"
+```
+
+#### 3) Launch latest PyTorch container
+```bash
+docker run --rm -it \
+  --gpus all \
+  -e HF_TOKEN="$HF_TOKEN" \
+  --ipc=host \
+  -v $HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub \
+  nvcr.io/nvidia/pytorch:25.12-py3
+```
+
+#### 4) Inside the container
+```bash
+# Install dependencies
+pip install transformers peft datasets "trl==0.26.2" "bitsandbytes==0.49.1"
+
+# Clone DGX Spark playbooks
+git clone https://github.com/NVIDIA/dgx-spark-playbooks
+cd dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets
+
+# Force bitsandbytes to use CUDA 13.0 binary (CUDA 13.1 not yet supported)
+export BNB_CUDA_VERSION=130
+```
+
+### LoRA Fine-tuning
+```bash
+# -------------------------------
+# Llama 3 8B LoRA Fine-tuning
+# -------------------------------
+python Llama3_8B_LoRA_finetuning.py --use_torch_compile
+```
+
+### qLoRA Fine-tuning
+```bash
+# -------------------------------
+# Llama 3 70B qLoRA Fine-tuning
+# -------------------------------
+python Llama3_70B_qLoRA_finetuning.py
+```
+
+### Full Fine-tuning
+```bash
+# -------------------------------
+# Llama 3 3B Full Fine-tuning
+# -------------------------------
+python Llama3_3B_full_finetuning.py --use_torch_compile
+```
+
+---
+
+# Dual Spark
+
+## Measure BW Between Dual Sparks
+
+DGX Spark systems support high-bandwidth, low-latency interconnects over QSFP ports.
+Bandwidth between two Sparks can be validated at two different layers:
+- [GPU collective bandwidth using NCCL](#gpu-collective-bandwidth-using-nccl)
+- [Raw RDMA fabric bandwidth (RoCE)](#raw-rdma-fabric-bandwidth-roce)
+
+### GPU collective bandwidth using NCCL
+- Follow the instructions at https://build.nvidia.com/spark/nccl/stacked-sparks
+
+#### What this measures
+- This test measures effective GPU collective communication bandwidth using NCCL.
+
+### Raw RDMA fabric bandwidth (RoCE)
+
+#### What this measures
+- This test measures raw point-to-point RDMA bandwidth over the QSFP-connected CX-7 NICs using RoCE.
+
+### Prerequisites
+
+#### 1) Install perftest tools (on both Sparks)
+```bash
+sudo apt install perftest
+```
+
+#### 2) Ensure one QSFP cable connects Spark-1 ↔ Spark-2 directly
+
+### Setup
+
+#### Step 1 – Identify devices and logical ports
+
+Run on both Spark systems:
+```bash
+ibdev2netdev
+```
+
+**Example output (from Spark-1):**
+```
+nvidia@spark-5c2d:~$ ibdev2netdev
+rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
+rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
+roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
+roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
+```
+You will use the **Up** interfaces for IP assignment.
+
+**Example output (from Spark-2):**
+```
+nvidia@spark-bd26:~$ ibdev2netdev
+rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
+rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
+roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
+roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
+```
+You will use the **Up** interfaces for IP assignment.
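+
+To pick out the active ports without reading the table by eye, the output can be filtered with a short helper. This is a minimal sketch: `list_up_ports` is a hypothetical name, and it assumes the `<device> port <n> ==> <netdev> (Up)` line format shown in the example output above.
+
+```bash
+# Print "<rdma_device> <netdev>" for every port reported as Up.
+# Assumed input format: "<device> port <n> ==> <netdev> (Up)".
+list_up_ports() {
+  awk '$NF == "(Up)" { print $1, $5 }'
+}
+
+# Usage on a Spark:
+#   ibdev2netdev | list_up_ports
+```
+
+The first column is the device name to pass to `-d` in the RDMA tests, and the second is the interface name to use in the netplan file.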
+
+#### Step 2 – Assign Manual IPs
+
+Assign a unique subnet to each active port.
+
+**Note:** Repeat this step after a reboot if NetworkManager clears the addresses.
+
+**Spark-1 (HOST)**
+```bash
+# Create the netplan configuration file
+sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
+network:
+  version: 2
+  renderer: NetworkManager
+  ethernets:
+    enp1s0f0np0:
+      addresses:
+        - 192.168.200.12/24
+      dhcp4: no
+    enP2p1s0f0np0:
+      addresses:
+        - 192.168.201.12/24
+      dhcp4: no
+EOF
+
+sudo chmod 600 /etc/netplan/40-cx7.yaml
+sudo netplan apply
+```
+**Note:** Interfaces may differ; use the ones marked **Up**.
+
+**Spark-2 (CLIENT)**
+```bash
+# Create the netplan configuration file
+sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
+network:
+  version: 2
+  renderer: NetworkManager
+  ethernets:
+    enp1s0f0np0:
+      addresses:
+        - 192.168.200.13/24
+      dhcp4: no
+    enP2p1s0f0np0:
+      addresses:
+        - 192.168.201.13/24
+      dhcp4: no
+EOF
+
+sudo chmod 600 /etc/netplan/40-cx7.yaml
+sudo netplan apply
+```
+**Note:** Interfaces may differ; use the ones marked **Up**.
+
+#### Step 3 – Run Bandwidth Test
+
+Open two terminals on each Spark (4 total).
+
+**Note:** Make sure ports 12000 and 12001 are open and not in use.
+
+**Spark-1 (HOST) - Terminal 1**
+```bash
+ib_write_bw -d rocep1s0f0 -i 1 -p 12000 -F --report_gbits --run_infinitely
+```
+**Note:** Replace `rocep1s0f0` with your actual **Up** device (`<HOST_NIC1_INTERFACE>`).
+
+**Spark-1 (HOST) - Terminal 2**
+```bash
+ib_write_bw -d roceP2p1s0f0 -i 1 -p 12001 -F --report_gbits --run_infinitely
+```
+**Note:** Replace `roceP2p1s0f0` with your actual **Up** device (`<HOST_NIC2_INTERFACE>`).
+
+**Spark-2 (CLIENT) - Terminal 1**
+```bash
+ib_write_bw -d rocep1s0f0 -i 1 -p 12000 -F --report_gbits 192.168.200.12 --run_infinitely
+```
+**Note:** Replace `rocep1s0f0` with your actual **Up** device (`<CLIENT_NIC1_INTERFACE>`).
+
+**Spark-2 (CLIENT) - Terminal 2**
+```bash
+ib_write_bw -d roceP2p1s0f0 -i 1 -p 12001 -F --report_gbits 192.168.201.12 --run_infinitely
+```
+**Note:** Replace `roceP2p1s0f0` with your actual **Up** device (`<CLIENT_NIC2_INTERFACE>`).
+
+#### Step 4 – Monitor Bandwidth
+
+**Example client output:**
+
+**Client-1 Output**
+```
+nvidia@spark-bd26:~$ ib_write_bw -d rocep1s0f0 -i 1 -p 12000 -F --report_gbits 192.168.200.12 --run_infinitely
+  WARNING: BW peak won't be measured in this run.
+---------------------------------------------------------------------------------------
+                    RDMA_Write BW Test
+ Dual-port       : OFF            Device         : rocep1s0f0
+ Number of qps   : 1              Transport type : IB
+ Connection type : RC             Using SRQ      : OFF
+ PCIe relax order: ON
+ ibv_wr* API     : ON
+ TX depth        : 128
+ CQ Moderation   : 1
+ Mtu             : 1024[B]
+ Link type       : Ethernet
+ GID index       : 3
+ Max inline data : 0[B]
+ rdma_cm QPs     : OFF
+ Data ex. method : Ethernet
+---------------------------------------------------------------------------------------
+ local address:  LID 0000 QPN 0x0129 PSN 0x57279d RKey 0x184300 VAddr 0x00ec99bedad000
+ GID:            00:00:00:00:00:00:00:00:00:00:255:255:192:168:200:13
+ remote address: LID 0000 QPN 0x0129 PSN 0x531b7  RKey 0x184300 VAddr 0x00ffeec955d000
+ GID:            00:00:00:00:00:00:00:00:00:00:255:255:192:168:200:12
+---------------------------------------------------------------------------------------
+ #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]    MsgRate[Mpps]
+ 65536      882805           0.00               92.57                0.176554
+ 65536      882802           0.00               92.57                0.176554
+ 65536      882791           0.00               92.57                0.176554
+ 65536      882791           0.00               92.56                0.176552
+ 65536      882821           0.00               92.57                0.176555
+```
+
+**Client-2 Output**
+```
+nvidia@spark-bd26:~$ ib_write_bw -d roceP2p1s0f0 -i 1 -p 12001 -F --report_gbits 192.168.201.12 --run_infinitely
+ WARNING: BW peak won't be measured in this run.
+---------------------------------------------------------------------------------------
+                    RDMA_Write BW Test
+ Dual-port       : OFF            Device         : roceP2p1s0f0
+ Number of qps   : 1              Transport type : IB
+ Connection type : RC             Using SRQ      : OFF
+ PCIe relax order: ON
+ ibv_wr* API     : ON
+ TX depth        : 128
+ CQ Moderation   : 1
+ Mtu             : 1024[B]
+ Link type       : Ethernet
+ GID index       : 3
+ Max inline data : 0[B]
+ rdma_cm QPs     : OFF
+ Data ex. method : Ethernet
+---------------------------------------------------------------------------------------
+ local address:  LID 0000 QPN 0x01a9 PSN 0x5e41f9 RKey 0x1a03ed VAddr 0x00f374277dd000
+ GID:            00:00:00:00:00:00:00:00:00:00:255:255:192:168:201:13
+ remote address: LID 0000 QPN 0x01a9 PSN 0x8ab8e7  RKey 0x1a0300 VAddr 0x00f285f5f1d000
+ GID:            00:00:00:00:00:00:00:00:00:00:255:255:192:168:201:12
+---------------------------------------------------------------------------------------
+ #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]    MsgRate[Mpps]
+ 65536      927940           0.00               97.28                0.185548
+ 65536      927790           0.00               97.28                0.185549
+ 65536      927766           0.00               97.28                0.185550
+ 65536      927754           0.00               97.28                0.185545
+ 65536      927804           0.00               97.29                0.185557
+ 65536      927807           0.00               97.28                0.185554
+```
+
+**Total throughput = 92.57 + 97.28 = 189.85 Gbps**
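+
+The same sum can be computed from captured output instead of by hand. The sketch below assumes each client's output was saved to a file (for example by appending `| tee client1.log` to the command); the helper names and log file names are hypothetical, and the parsing relies on the five-column data rows shown above.
+
+```bash
+# Mean of the "BW average[Gb/sec]" column (4th field) over ib_write_bw data rows.
+# Data rows are recognised by a numeric first field (message size) and 5 fields.
+avg_bw() {
+  awk '$1 ~ /^[0-9]+$/ && NF == 5 { sum += $4; n++ }
+       END { if (n) printf "%.2f\n", sum / n }'
+}
+
+# Add the two per-link averages to get the aggregate figure.
+total_bw() {
+  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a + b }'
+}
+
+# Usage (client1.log and client2.log are hypothetical capture files):
+#   total_bw "$(avg_bw < client1.log)" "$(avg_bw < client2.log)"
+```
+
+Averaging over many rows smooths out the small row-to-row variation visible in the sample output before the two links are added together.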
+
+## Measure RDMA Latency Between Dual Sparks
+
+### What this measures
+
+This test measures one-way RDMA latency between two DGX Spark systems over the same QSFP RoCE links used for bandwidth testing.
+
+### Prerequisites
+
+Before running the latency tests, complete Step 1 (Identify devices and logical ports) and
+Step 2 (Assign Manual IPs) from the [Measure BW Between Dual Sparks](#measure-bw-between-dual-sparks) section above.
+
+### Step 1.1 – Run RDMA Write Latency Test
+
+This measures RDMA write latency on a single QSFP link.
+
+Open two terminals (one per Spark).
+
+**Spark-1 (HOST)**
+```bash
+ib_write_lat -d rocep1s0f0 -i 1 -p 13000 -F
+```
+
+Replace `rocep1s0f0` with your actual **Up** RDMA device from `ibdev2netdev`.
+
+**Spark-2 (CLIENT)**
+```bash
+ib_write_lat -d rocep1s0f0 -i 1 -p 13000 -F 192.168.200.12
+```
+
+### Step 1.2 – Run RDMA Read Latency Test
+
+This measures RDMA read latency on the second QSFP link.
+
+Open two terminals (one per Spark).
+
+**Spark-1 (HOST)**
+```bash
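+ib_read_lat -d roceP2p1s0f0 -i 1 -p 13001 -F
+```
+
+Replace `roceP2p1s0f0` with the **Up** RDMA device on your second link, the one whose interface holds the `192.168.201.x` address.
+
+**Spark-2 (CLIENT)**
+```bash
+ib_read_lat -d roceP2p1s0f0 -i 1 -p 13001 -F 192.168.201.12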
+```
+
+**Note:** RDMA latency is a per-link metric and should be measured on a single QSFP link at a time. Latency values are not aggregated across multiple links.
--- a/nvidia/sglang/README.md
+++ b/nvidia/sglang/README.md
@ -83,7 +83,9 @@ Note: for NVFP4 models, add the `--quantization modelopt_fp4` flag.
 ## Step 1. Verify system prerequisites

 Check that your NVIDIA Spark device meets all requirements before proceeding. This step runs on
-your host system and ensures Docker, GPU drivers, and container toolkit are properly configured.
+your host system and ensures Docker, GPU drivers, and container toolkit are properly configured. 
+
+> Note: If you experience timeouts or "connection refused" errors while pulling the container image, you may need to use a VPN or a proxy, as some registries may be restricted by your local network or ISP.

 ```bash
 ## Verify Docker installation