# DGX Spark User Performance Guide

This repository contains benchmarking information, setup instructions, and example runs for evaluating AI workloads on NVIDIA DGX Spark. It covers a wide range of frameworks and workloads including large language models (LLMs), diffusion models, fine-tuning, and more, using tools such as TensorRT-LLM, vLLM, SGLang, Llama.cpp, and others.

## Prerequisites

Before running any benchmarks, ensure the following prerequisites are met for your selected workload:

- Access to a DGX Spark (or 2x DGX Spark)
- Docker and NVIDIA Container Toolkit
- A valid Hugging Face token

## This guide includes benchmarking instructions for:

### Single Spark

- **[TensorRT-LLM (TRT-LLM)](#tensorrt-llm-trt-llm)**
  - [Offline](#offline-benchmark)
  - [Online](#online-benchmark)
- **[vLLM](#vllm)**
  - [Offline](#offline-benchmark-1)
  - [Online](#online-benchmark-1)
- **[SGLang](#sglang)**
  - [Offline](#offline-benchmark-2)
  - [Online](#online-benchmark-2)
- **[Llama.cpp](#llamacpp)**
  - [Offline](#offline-benchmark-3)
  - [Online](#online-benchmark-3)
- **[Image generation](#image-generation)** (Flux and SDXL)
- **[Fine-tuning](#fine-tuning)**

### Dual Spark

- [Measure bandwidth for Dual Spark setup](#measure-bw-between-dual-sparks)
- [Measure RDMA latency between Dual Sparks](#measure-rdma-latency-between-dual-sparks)

---

# Single Spark

## TensorRT-LLM (TRT-LLM)

### What this measures

- **Offline**: Raw model throughput/latency under synthetic load, without the HTTP stack or scheduler overhead.
- **Online**: End-to-end serving performance through `trtllm-serve` (HTTP + scheduler + KV cache).

For more details, see the official [trtllm-bench](https://github.com/NVIDIA/TensorRT-LLM/tree/main/benchmarks/cpp) and [trtllm-serve](https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/serve) documentation.
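Before launching any containers, the prerequisites above can be sanity-checked from a terminal. The snippet below is a hypothetical helper, not part of the original guide: it only verifies that the `docker` and `hf` CLIs are on `PATH` and that `HF_TOKEN` is exported (the framework sections below assume both).

```bash
# Hypothetical pre-flight check (not part of the original guide): verifies the
# CLI tools and token the benchmark commands rely on, before launching anything.
missing=0
for cmd in docker hf; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "found: $cmd"
  else
    echo "missing: $cmd"
    missing=$((missing + 1))
  fi
done

# A Hugging Face token is only required for gated models.
if [ -z "${HF_TOKEN:-}" ]; then
  echo "note: HF_TOKEN is not set (fine for public models)"
fi

echo "prerequisite check complete: $missing of 2 tools missing"
```

If anything is reported missing, install it before continuing with the sections below.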
### Prerequisites (applies to Offline & Online)

#### 1) Docker permissions

```bash
sudo usermod -aG docker $USER
newgrp docker
```

#### 2) Set environment variables

```bash
# -------------------------------
# Environment Setup
# -------------------------------
export HF_TOKEN=""                        # optional if model is public
export MODEL_HANDLE="openai/gpt-oss-20b"
export ISL=128
export OSL=128
export MAX_TOKENS=$((ISL + OSL))
```

### Offline Benchmark

This runs trtllm-bench with a deterministic synthetic dataset matching your ISL/OSL.

```bash
# -------------------------------
# TensorRT-LLM Offline Benchmark
# -------------------------------
docker run \
  --rm -it \
  --gpus all \
  --ipc host \
  --network host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e HF_TOKEN="$HF_TOKEN" \
  -e MODEL_HANDLE="$MODEL_HANDLE" \
  -e ISL="$ISL" \
  -e OSL="$OSL" \
  -e MAX_TOKENS="$MAX_TOKENS" \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
  bash -lc '
    set -e

    # 1) Download model (cached in /root/.cache/huggingface)
    hf download "$MODEL_HANDLE"

    # 2) Prepare synthetic dataset (fixed ISL/OSL)
    python benchmarks/cpp/prepare_dataset.py \
      --tokenizer "$MODEL_HANDLE" \
      --stdout token-norm-dist \
      --num-requests 1 \
      --input-mean "$ISL" --input-stdev 0 \
      --output-mean "$OSL" --output-stdev 0 \
      > /tmp/dataset.txt

    # 3) Optional tuning config (example values; tune for your workload)
    cat > /tmp/extra-llm-api-config.yml <<EOF
cuda_graph_config:
  enable_padding: true
EOF

    # 4) Run the offline throughput benchmark
    # (flag names follow recent trtllm-bench releases; adjust for your version)
    trtllm-bench \
      --model "$MODEL_HANDLE" \
      throughput \
      --dataset /tmp/dataset.txt \
      --max_seq_len "$MAX_TOKENS" \
      --extra_llm_api_options /tmp/extra-llm-api-config.yml
  '
```

---

# Dual Spark

## Measure BW Between Dual Sparks

#### Step 1 – Identify Devices and Logical Ports

Run `ibdev2netdev` on both Sparks to map each RDMA device to its network interface and link state.

**Example output (from Spark-1):**

```
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
```

**Example output (from Spark-2):**

```
nvidia@spark-bd26:~$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
```

You will use the **Up** interfaces for IP assignment.

#### Step 2 – Assign Manual IPs

Assign a unique subnet to each active port.
**Note:** Repeat this step after a reboot if NetworkManager clears the addresses.

**Spark-1 (HOST)**

```bash
# Create the netplan configuration file
# (example addresses; adjust interface names and prefixes to your setup)
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses: [192.168.200.12/24]
    enP2p1s0f0np0:
      addresses: [192.168.201.12/24]
EOF

# Apply the configuration
sudo netplan apply
```

**Spark-2 (CLIENT)**

```bash
# Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses: [192.168.200.13/24]
    enP2p1s0f0np0:
      addresses: [192.168.201.13/24]
EOF

# Apply the configuration
sudo netplan apply
```

#### Step 3 – Run Bandwidth Test

Run `ib_write_bw` on both QSFP links at once: the HOST side listens, the CLIENT side connects. Open two terminals on each Spark.

**Spark-1 (HOST) - Terminal 1**

```bash
ib_write_bw -d rocep1s0f0 -i 1 -p 12000 -F --report_gbits --run_infinitely
```

**Spark-1 (HOST) - Terminal 2**

```bash
ib_write_bw -d roceP2p1s0f0 -i 1 -p 12001 -F --report_gbits --run_infinitely
```

**Spark-2 (CLIENT) - Terminal 1**

```bash
ib_write_bw -d rocep1s0f0 -i 1 -p 12000 -F --report_gbits 192.168.200.12 --run_infinitely
```

**Spark-2 (CLIENT) - Terminal 2**

```bash
ib_write_bw -d roceP2p1s0f0 -i 1 -p 12001 -F --report_gbits 192.168.201.12 --run_infinitely
```

**Note:** Replace the device names with your actual **Up** RDMA devices from `ibdev2netdev`.

#### Step 4 – Monitor Bandwidth

**Example client output:**

**Client-1 Output**

```
nvidia@spark-bd26:~$ ib_write_bw -d rocep1s0f0 -i 1 -p 12000 -F --report_gbits 192.168.200.12 --run_infinitely
WARNING: BW peak won't be measured in this run.
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : rocep1s0f0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0129 PSN 0x57279d RKey 0x184300 VAddr 0x00ec99bedad000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:200:13
 remote address: LID 0000 QPN 0x0129 PSN 0x531b7 RKey 0x184300 VAddr 0x00ffeec955d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:200:12
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]    MsgRate[Mpps]
 65536      882805         0.00               92.57                 0.176554
 65536      882802         0.00               92.57                 0.176554
 65536      882791         0.00               92.57                 0.176554
 65536      882791         0.00               92.56                 0.176552
 65536      882821         0.00               92.57                 0.176555
```

**Client-2 Output**

```
nvidia@spark-bd26:~$ ib_write_bw -d roceP2p1s0f0 -i 1 -p 12001 -F --report_gbits 192.168.201.12 --run_infinitely
WARNING: BW peak won't be measured in this run.
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : roceP2p1s0f0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x01a9 PSN 0x5e41f9 RKey 0x1a03ed VAddr 0x00f374277dd000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:201:13
 remote address: LID 0000 QPN 0x01a9 PSN 0x8ab8e7 RKey 0x1a0300 VAddr 0x00f285f5f1d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:201:12
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]    MsgRate[Mpps]
 65536      927940         0.00               97.28                 0.185548
 65536      927790         0.00               97.28                 0.185549
 65536      927766         0.00               97.28                 0.185550
 65536      927754         0.00               97.28                 0.185545
 65536      927804         0.00               97.29                 0.185557
 65536      927807         0.00               97.28                 0.185554
```

**Total throughput = 92.57 + 97.28 = 189.85 Gbps**

## Measure RDMA Latency Between Dual Sparks

### What this measures

This test measures one-way RDMA latency between two DGX Spark systems over the same QSFP RoCE links used for bandwidth testing.

### Prerequisites

Before running the latency tests, complete Step 1 (Identify devices and logical ports) and Step 2 (Assign Manual IPs) from the [Measure BW Between Dual Sparks](#measure-bw-between-dual-sparks) section above.

### Step 1.1 – Run RDMA Write Latency Test

This measures RDMA write latency on a single QSFP link. Open two terminals (one per Spark).

**Spark-1 (HOST)**

```bash
ib_write_lat -d rocep1s0f0 -i 1 -p 13000 -F
```

Replace `rocep1s0f0` with your actual **Up** RDMA device from `ibdev2netdev`.

**Spark-2 (CLIENT)**

```bash
ib_write_lat -d rocep1s0f0 -i 1 -p 13000 -F 192.168.200.12
```

### Step 1.2 – Run RDMA Read Latency Test

This measures RDMA read latency on the second QSFP link. Open two terminals (one per Spark).

**Spark-1 (HOST)**

```bash
ib_read_lat -d roceP2p1s0f0 -i 1 -p 13001 -F
```

**Spark-2 (CLIENT)**

```bash
ib_read_lat -d roceP2p1s0f0 -i 1 -p 13001 -F 192.168.201.12
```

**Note:** RDMA latency is a per-link metric and should be measured on a single QSFP link at a time.
Latency values are not aggregated across multiple links.
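As a quick sanity check, the aggregate bandwidth figure quoted in the monitoring step can be recomputed from the per-link numbers. The sketch below uses the BW averages and message rate from the sample `ib_write_bw` client outputs in this guide (65536-byte messages); it performs no measurement itself.

```bash
# Cross-check of the sample bandwidth results (values copied from the
# ib_write_bw client outputs above; not measured here).
link1=92.57   # Client-1 BW average [Gb/sec]
link2=97.28   # Client-2 BW average [Gb/sec]

total=$(awk -v a="$link1" -v b="$link2" 'BEGIN { printf "%.2f", a + b }')
echo "Total throughput: ${total} Gbps"   # matches the 189.85 Gbps figure above

# BW average should also agree with MsgRate x message size:
# 0.176554 Mpps * 65536 B * 8 bit ~= 92.57 Gb/sec for link 1.
bw=$(awk 'BEGIN { printf "%.2f", 0.176554 * 65536 * 8 / 1000 }')
echo "Link 1 recomputed from MsgRate: ${bw} Gbps"
```

The same arithmetic applies to your own runs: sum the per-link BW averages for aggregate throughput, and use the MsgRate column to confirm a link is saturating at the expected message size.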