diff --git a/nvidia/connect-two-sparks/assets/performance_benchmarking_guide.md b/nvidia/connect-two-sparks/assets/performance_benchmarking_guide.md new file mode 100644 index 0000000..b061fcb --- /dev/null +++ b/nvidia/connect-two-sparks/assets/performance_benchmarking_guide.md @@ -0,0 +1,964 @@ +# DGX Spark User Performance Guide + +This repository contains benchmarking information, setup instructions, and example runs for evaluating AI workloads on NVIDIA DGX Spark. + +It covers a wide range of frameworks and workloads including large language models (LLMs), diffusion models, fine-tuning, and more using tools such as TensorRT-LLM, vLLM, SGLang, Llama.cpp, and others. + +Before running any benchmarks, ensure the following prerequisites are met for your selected workload: + +## Prerequisites + +- Access to a DGX Spark (or 2x DGX Spark) +- Docker and NVIDIA Container Toolkit +- A valid Hugging Face Token + +## This guide includes benchmarking instructions for: + +### Single Spark +- **[TensorRT-LLM (TRT-LLM)](#tensorrt-llm-trt-llm)** + - [Offline](#offline-benchmark) + - [Online](#online-benchmark) +- **[vLLM](#vllm)** + - [Offline](#offline-benchmark-1) + - [Online](#online-benchmark-1) +- **[SGLang](#sglang)** + - [Offline](#offline-benchmark-2) + - [Online](#online-benchmark-2) +- **[Llama.cpp](#llamacpp)** + - [Offline](#offline-benchmark-3) + - [Online](#online-benchmark-3) +- **[Image generation](#image-generation)** (Flux and SDXL) +- **[Fine-tuning](#fine-tuning)** + +### Dual Spark +- [Measure bandwidth for Dual Spark setup](#measure-bw-between-dual-sparks) +- [Measure RDMA latency between Dual Sparks](#measure-rdma-latency-between-dual-sparks) + +--- + +# Single Spark + +## TensorRT-LLM (TRT-LLM) + +### What this measures + +- **Offline**: Raw model throughput/latency under synthetic load without HTTP stack, no scheduler overhead. +- **Online**: End-to-end serving performance through trtllm-serve (HTTP + scheduler + KV cache). 
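The throughput figures these benchmarks report reduce to simple arithmetic over request count, sequence lengths, and wall-clock time. As a rough illustration (the request count and wall time below are made-up numbers, not TRT-LLM output):

```shell
# Hypothetical sketch of the tokens-per-second arithmetic behind the
# reported metrics; NUM_REQUESTS and WALL_TIME_S are example values.
NUM_REQUESTS=256
ISL=128   # input sequence length
OSL=128   # output sequence length
WALL_TIME_S=32

# Output token throughput: generated tokens per second across all requests
echo "output tok/s: $(( NUM_REQUESTS * OSL / WALL_TIME_S ))"

# Total token throughput: input + output tokens processed per second
echo "total tok/s:  $(( NUM_REQUESTS * (ISL + OSL) / WALL_TIME_S ))"
```

Offline runs report roughly this raw number; online runs report the same metrics after HTTP and scheduler overhead, so expect them to be somewhat lower at the same ISL/OSL.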
+ +For more details, visit [trtllm-bench](https://github.com/NVIDIA/TensorRT-LLM/tree/main/benchmarks/cpp) and [trtllm-serve](https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/serve) official documentation. + +### Prerequisites (applies to Offline & Online) + +#### 1) Docker permissions +```bash +sudo usermod -aG docker $USER +newgrp docker +``` + +#### 2) Set environment variables +```bash +# ------------------------------- +# Environment Setup +# ------------------------------- +export HF_TOKEN="" # optional if model is public +export MODEL_HANDLE="openai/gpt-oss-20b" +export ISL=128 +export OSL=128 +export MAX_TOKENS=$((ISL + OSL)) +``` + +### Offline Benchmark + +This runs trtllm-bench with a deterministic synthetic dataset matching your ISL/OSL. + +```bash +# ------------------------------- +# TensorRT-LLM Offline Benchmark +# ------------------------------- +docker run \ + --rm -it \ + --gpus all \ + --ipc host \ + --network host \ + --ulimit memlock=-1 \ + --ulimit stack=67108864 \ + -e HF_TOKEN="$HF_TOKEN" \ + -e MODEL_HANDLE="$MODEL_HANDLE" \ + -e ISL="$ISL" \ + -e OSL="$OSL" \ + -e MAX_TOKENS="$MAX_TOKENS" \ + -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \ + nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \ + bash -lc ' +set -e + +# 1) Download model (cached in /root/.cache/huggingface) +hf download "$MODEL_HANDLE" + +# 2) Prepare synthetic dataset (fixed ISL/OSL) +python benchmarks/cpp/prepare_dataset.py \ + --tokenizer "$MODEL_HANDLE" \ + --stdout token-norm-dist \ + --num-requests 1 \ + --input-mean "$ISL" --input-stdev 0 \ + --output-mean "$OSL" --output-stdev 0 \ + > /tmp/dataset.txt + +# 3) Optional tuning config +cat > /tmp/extra-llm-api-config.yml < enp1s0f0np0 (Up) +rocep1s0f1 port 1 ==> enp1s0f1np1 (Down) +roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up) +roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down) +``` +You will use the **Up** interfaces for IP assignment. 
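With four ports per system, the Up devices can also be filtered programmatically. A small sketch using the sample output above (in practice, pipe `ibdev2netdev` itself into `awk` instead of the heredoc):

```shell
# Print only RDMA devices whose netdev link is Up, as "device netdev".
# The heredoc reproduces the sample output above; on a real system run:
#   ibdev2netdev | awk '/\(Up\)/ {print $1, $5}'
awk '/\(Up\)/ {print $1, $5}' <<'EOF'
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
EOF
```

The first column is the RDMA device name used by the perftest tools, the second is the netdev that receives an IP in the next step.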
+ +**Example output (from Spark-2):** +``` +nvidia@spark-bd26:~$ ibdev2netdev +rocep1s0f0 port 1 ==> enp1s0f0np0 (Up) +rocep1s0f1 port 1 ==> enp1s0f1np1 (Down) +roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up) +roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down) +``` +You will use the **Up** interfaces for IP assignment. + +#### Step 2 - Assign Manual IPs + +Assign unique subnets to each active port. + +**Note:** Repeat this step after reboot if NetworkManager clears them. + +**Spark-1 (HOST)** +```bash +# Create the netplan configuration file +sudo tee /etc/netplan/40-cx7.yaml > /dev/null < /dev/null <`) + +**Spark-1 (HOST) - Terminal 2** +```bash +ib_write_bw -d roceP2p1s0f0 -i 1 -p 12001 -F --report_gbits --run_infinitely +``` +**Note:** Replace device names with your actual **Up** interfaces. (replace `roceP2p1s0f0` with your ``) + +**Spark-2 (CLIENT) - Terminal 1** +```bash +ib_write_bw -d rocep1s0f0 -i 1 -p 12000 -F --report_gbits 192.168.200.12 --run_infinitely +``` +**Note:** Replace device names with your actual **Up** interfaces. (replace `rocep1s0f0` with your ``) + +**Spark-2 (CLIENT) - Terminal 2** +```bash +ib_write_bw -d roceP2p1s0f0 -i 1 -p 12001 -F --report_gbits 192.168.201.12 --run_infinitely +``` +**Note:** Replace device names with your actual **Up** interfaces. (replace `roceP2p1s0f0` with your ``) + +#### STEP 4 – Monitor Bandwidth + +**Example client output:** + +**Client-1 Output** +``` +nvidia@spark-bd26:~$ ib_write_bw -d rocep1s0f0 -i 1 -p 12000 -F --report_gbits 192.168.200.12 --run_infinitely + WARNING: BW peak won't be measured in this run. 
+--------------------------------------------------------------------------------------- + RDMA_Write BW Test + Dual-port : OFF Device : rocep1s0f0 + Number of qps : 1 Transport type : IB + Connection type : RC Using SRQ : OFF + PCIe relax order: ON + ibv_wr* API : ON + TX depth : 128 + CQ Moderation : 1 + Mtu : 1024[B] + Link type : Ethernet + GID index : 3 + Max inline data : 0[B] + rdma_cm QPs : OFF + Data ex. method : Ethernet +--------------------------------------------------------------------------------------- + local address: LID 0000 QPN 0x0129 PSN 0x57279d RKey 0x184300 VAddr 0x00ec99bedad000 + GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:200:13 + remote address: LID 0000 QPN 0x0129 PSN 0x531b7 RKey 0x184300 VAddr 0x00ffeec955d000 + GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:200:12 +--------------------------------------------------------------------------------------- + #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] + 65536 882805 0.00 92.57 0.176554 + 65536 882802 0.00 92.57 0.176554 + 65536 882791 0.00 92.57 0.176554 + 65536 882791 0.00 92.56 0.176552 + 65536 882821 0.00 92.57 0.176555 +``` + +**Client-2 Output** +``` +nvidia@spark-bd26:~$ ib_write_bw -d roceP2p1s0f0 -i 1 -p 12001 -F --report_gbits 192.168.201.12 --run_infinitely + WARNING: BW peak won't be measured in this run. +--------------------------------------------------------------------------------------- + RDMA_Write BW Test + Dual-port : OFF Device : roceP2p1s0f0 + Number of qps : 1 Transport type : IB + Connection type : RC Using SRQ : OFF + PCIe relax order: ON + ibv_wr* API : ON + TX depth : 128 + CQ Moderation : 1 + Mtu : 1024[B] + Link type : Ethernet + GID index : 3 + Max inline data : 0[B] + rdma_cm QPs : OFF + Data ex. 
method : Ethernet +--------------------------------------------------------------------------------------- + local address: LID 0000 QPN 0x01a9 PSN 0x5e41f9 RKey 0x1a03ed VAddr 0x00f374277dd000 + GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:201:13 + remote address: LID 0000 QPN 0x01a9 PSN 0x8ab8e7 RKey 0x1a0300 VAddr 0x00f285f5f1d000 + GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:201:12 +--------------------------------------------------------------------------------------- + #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] + 65536 927940 0.00 97.28 0.185548 + 65536 927790 0.00 97.28 0.185549 + 65536 927766 0.00 97.28 0.185550 + 65536 927754 0.00 97.28 0.185545 + 65536 927804 0.00 97.29 0.185557 + 65536 927807 0.00 97.28 0.185554 +``` + +**Total throughput = 92.57 + 97.28 = 189.85 Gbps** + +## Measure RDMA Latency Between Dual Sparks + +### What this measures + +This test measures one-way RDMA latency between two DGX Spark systems over the same QSFP RoCE links used for bandwidth testing. + +### Prerequisites + +Before running the latency tests, complete Step 1 (Identify devices and logical ports) and +Step 2 (Assign Manual IPs) from the [Measure BW Between Dual Sparks](#measure-bw-between-dual-sparks) section above. + +### Step 1.1 – Run RDMA Write Latency Test + +This measures RDMA write latency on a single QSFP link. + +Open two terminals (one per Spark). + +**Spark-1 (HOST)** +```bash +ib_write_lat -d rocep1s0f0 -i 1 -p 13000 -F +``` + +Replace `rocep1s0f0` with your actual **Up** RDMA device from `ibdev2netdev`. + +**Spark-2 (CLIENT)** +```bash +ib_write_lat -d rocep1s0f0 -i 1 -p 13000 -F 192.168.200.12 +``` + +### Step 1.2 – Run RDMA Read Latency Test + +This measures RDMA read latency on the second QSFP link. + +Open two terminals (one per Spark). 
+ +**Spark-1 (HOST)** +```bash +ib_read_lat -d roceP2p1s0f0 -i 1 -p 13001 -F +``` + +**Spark-2 (CLIENT)** +```bash +ib_read_lat -d roceP2p1s0f0 -i 1 -p 13001 -F 192.168.201.12 +``` + +Replace `roceP2p1s0f0` with your actual **Up** RDMA device for the second link from `ibdev2netdev`. + +Note: RDMA latency is a per-link metric and should be measured on a single QSFP link at a time. Latency values are not aggregated across multiple links. diff --git a/nvidia/sglang/README.md b/nvidia/sglang/README.md index de0bf87..55b7e8e 100644 --- a/nvidia/sglang/README.md +++ b/nvidia/sglang/README.md @@ -83,7 +83,9 @@ Note: for NVFP4 models, add the `--quantization modelopt_fp4` flag. ## Step 1. Verify system prerequisites Check that your NVIDIA Spark device meets all requirements before proceeding. This step runs on -your host system and ensures Docker, GPU drivers, and container toolkit are properly configured. +your host system and ensures Docker, GPU drivers, and container toolkit are properly configured. + +> Note: If you experience timeouts or "connection refused" errors while pulling the container image, you may need to use a VPN or a proxy, as some registries may be restricted by your local network or ISP. ```bash ## Verify Docker installation