# DGX Spark User Performance Guide

This repository contains benchmarking information, setup instructions, and example runs for evaluating AI workloads on NVIDIA DGX Spark.

It covers a wide range of frameworks and workloads, including large language models (LLMs), diffusion models, and fine-tuning, using tools such as TensorRT-LLM, vLLM, SGLang, Llama.cpp, and others.

Before running any benchmarks, ensure the following prerequisites are met for your selected workload:

## Prerequisites

- Access to a DGX Spark (or 2x DGX Spark)
- Docker and NVIDIA Container Toolkit
- A valid Hugging Face Token
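
A quick way to confirm that Docker can see the GPU through the NVIDIA Container Toolkit before pulling the larger benchmark images (a minimal sketch; any CUDA-capable image you already have locally works equally well):

```bash
# Should print the DGX Spark GPU via the NVIDIA Container Toolkit
docker run --rm --gpus all nvcr.io/nvidia/pytorch:25.12-py3 nvidia-smi
```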

## This guide includes benchmarking instructions for:

### Single Spark
- **[TensorRT-LLM (TRT-LLM)](#tensorrt-llm-trt-llm)**
  - [Offline](#offline-benchmark)
  - [Online](#online-benchmark)
- **[vLLM](#vllm)**
  - [Offline](#offline-benchmark-1)
  - [Online](#online-benchmark-1)
- **[SGLang](#sglang)**
  - [Offline](#offline-benchmark-2)
  - [Online](#online-benchmark-2)
- **[Llama.cpp](#llamacpp)**
  - [Offline](#offline-benchmark-3)
  - [Online](#online-benchmark-3)
- **[Image generation](#image-generation)** (Flux and SDXL)
- **[Fine-tuning](#fine-tuning)**

### Dual Spark
- [Measure bandwidth for Dual Spark setup](#measure-bw-between-dual-sparks)
- [Measure RDMA latency between Dual Sparks](#measure-rdma-latency-between-dual-sparks)

---

# Single Spark

## TensorRT-LLM (TRT-LLM)

### What this measures

- **Offline**: Raw model throughput/latency under synthetic load, with no HTTP stack or scheduler overhead.
- **Online**: End-to-end serving performance through `trtllm-serve` (HTTP + scheduler + KV cache).

For more details, see the official [trtllm-bench](https://github.com/NVIDIA/TensorRT-LLM/tree/main/benchmarks/cpp) and [trtllm-serve](https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/serve) documentation.

### Prerequisites (applies to Offline & Online)

#### 1) Docker permissions
```bash
sudo usermod -aG docker $USER
newgrp docker
```

#### 2) Set environment variables
```bash
# -------------------------------
# Environment Setup
# -------------------------------
export HF_TOKEN="<your_huggingface_token>" # optional if model is public
export MODEL_HANDLE="openai/gpt-oss-20b"
export ISL=128
export OSL=128
export MAX_TOKENS=$((ISL + OSL))
```
### Offline Benchmark

This runs `trtllm-bench` with a deterministic synthetic dataset matching your ISL/OSL.

```bash
# -------------------------------
# TensorRT-LLM Offline Benchmark
# -------------------------------
docker run \
--rm -it \
--gpus all \
--ipc host \
--network host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-e ISL="$ISL" \
-e OSL="$OSL" \
-e MAX_TOKENS="$MAX_TOKENS" \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
bash -lc '
set -e

# 1) Download model (cached in /root/.cache/huggingface)
hf download "$MODEL_HANDLE"

# 2) Prepare synthetic dataset (fixed ISL/OSL)
python benchmarks/cpp/prepare_dataset.py \
--tokenizer "$MODEL_HANDLE" \
--stdout token-norm-dist \
--num-requests 1 \
--input-mean "$ISL" --input-stdev 0 \
--output-mean "$OSL" --output-stdev 0 \
> /tmp/dataset.txt

# 3) Optional tuning config
cat > /tmp/extra-llm-api-config.yml <<EOF
kv_cache_config:
  dtype: "auto"
cuda_graph_config:
  enable_padding: true
EOF

# 4) Run offline benchmark
trtllm-bench -m "$MODEL_HANDLE" throughput \
--dataset /tmp/dataset.txt \
--backend pytorch \
--tp 1 \
--max_num_tokens "$MAX_TOKENS" \
--concurrency 1 \
--max_batch_size 1 \
--kv_cache_free_gpu_mem_fraction 0.95 \
--extra_llm_api_options /tmp/extra-llm-api-config.yml
'
```

### Online Benchmark

#### Terminal 1 - run the TRT-LLM server
```bash
# -------------------------------
# Launch TensorRT-LLM OpenAI server
# -------------------------------
docker run \
--rm -it \
--gpus=all \
--ipc=host \
--network host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6 \
bash -lc '
trtllm-serve serve "$MODEL_HANDLE" \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_num_tokens '"$MAX_TOKENS"' \
--max_batch_size 1 \
--kv_cache_free_gpu_memory_fraction 0.9 \
--tp_size 1 \
--ep_size 1 \
--trust_remote_code
'
```
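
Before launching the client, you can optionally wait for the server to finish loading the model. A minimal sketch, assuming your trtllm-serve build exposes the `/health` endpoint on port 8000:

```bash
# Poll until trtllm-serve reports healthy
until curl -sf http://127.0.0.1:8000/health > /dev/null; do
  echo "waiting for trtllm-serve..."
  sleep 5
done
echo "Server is ready."
```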

#### Terminal 2 - run the client (vLLM's built-in bench)
```bash
export MODEL_HANDLE="openai/gpt-oss-20b"
export ISL=128
export OSL=128
export MAX_TOKENS=$((ISL + OSL))

# -------------------------------
# Launch Benchmark Client
# -------------------------------
docker run \
--rm -it \
--gpus all \
--ipc host \
--network host \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-e ISL="$ISL" \
-e OSL="$OSL" \
nvcr.io/nvidia/vllm:25.12-py3 \
bash -lc '
vllm bench serve \
--base-url http://127.0.0.1:8000 \
--endpoint /v1/completions \
--model "$MODEL_HANDLE" \
--dataset-name random \
--num-prompts 1 \
--random-input-len "$ISL" \
--random-output-len "$OSL" \
--percentile-metrics ttft,tpot,itl,e2el \
--max-concurrency 1 \
--request-rate inf
'
```

---

## vLLM

### What this measures

- **Offline**: Raw model throughput/latency under synthetic load, with no HTTP stack or scheduler overhead.
- **Online**: End-to-end serving performance through vLLM (HTTP + scheduler + KV cache).

For more details, see the [official vLLM benchmarking documentation](https://docs.vllm.ai/en/latest/getting_started/benchmarking.html).

### Prerequisites (applies to Offline & Online)

#### 1) Docker permissions

```bash
sudo usermod -aG docker $USER
newgrp docker
```

#### 2) Set environment variables
```bash
# -------------------------------
# Environment Setup
# -------------------------------
export HF_TOKEN="<your_huggingface_token>" # optional if model is public
export MODEL_HANDLE="openai/gpt-oss-20b"
export ISL=128
export OSL=128
export MAX_TOKENS=$((ISL + OSL))
```

### Offline Benchmark

This runs `vllm bench throughput` directly inside the vLLM container with a synthetic random dataset.

```bash
# -------------------------------
# vLLM Offline Benchmark
# -------------------------------
docker run \
--rm -it \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-e ISL="$ISL" \
-e OSL="$OSL" \
-e MAX_TOKENS="$MAX_TOKENS" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
nvcr.io/nvidia/vllm:25.12-py3 \
bash -lc '
pip install -q datasets && \
vllm bench throughput \
--model "$MODEL_HANDLE" \
--dataset-name random \
--num-prompts 1 \
--input-len $ISL \
--output-len $OSL \
--max-model-len $MAX_TOKENS \
--gpu-memory-utilization 0.8
'
```

### Online Benchmark

#### Terminal 1 - run the vLLM server
```bash
# -------------------------------
# Launch vLLM OpenAI server
# -------------------------------
docker run \
--rm -it \
--gpus all \
--ipc host \
--network host \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-e MAX_TOKENS="$MAX_TOKENS" \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
nvcr.io/nvidia/vllm:25.12-py3 \
bash -lc '
vllm serve "$MODEL_HANDLE" \
--host 0.0.0.0 \
--port 8000 \
--dtype auto \
--max-model-len $MAX_TOKENS \
--gpu-memory-utilization 0.9 \
--trust-remote-code
'
```

#### Terminal 2 - run the client benchmark
```bash
export MODEL_HANDLE="openai/gpt-oss-20b"
export ISL=128
export OSL=128
export MAX_TOKENS=$((ISL + OSL))

# -------------------------------
# Launch Benchmark Client
# -------------------------------
docker run \
--rm -it \
--gpus all \
--ipc host \
--network host \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-e ISL="$ISL" \
-e OSL="$OSL" \
nvcr.io/nvidia/vllm:25.12-py3 \
bash -lc '
vllm bench serve \
--base-url http://127.0.0.1:8000 \
--endpoint /v1/completions \
--model "$MODEL_HANDLE" \
--dataset-name random \
--num-prompts 1 \
--random-input-len $ISL \
--random-output-len $OSL \
--percentile-metrics ttft,tpot,itl,e2el \
--max-concurrency 1 \
--request-rate inf
'
```
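
The command above issues a single request at concurrency 1, which reflects latency rather than sustained throughput. A sketch for sweeping higher concurrencies with the same client (the values below are illustrative; larger settings increase load on the server):

```bash
# Re-run the client at increasing concurrency levels (illustrative values)
for CONC in 1 2 4 8; do
  docker run --rm --network host \
    -e HF_TOKEN="$HF_TOKEN" \
    -e MODEL_HANDLE="$MODEL_HANDLE" \
    -e ISL="$ISL" -e OSL="$OSL" -e CONC="$CONC" \
    nvcr.io/nvidia/vllm:25.12-py3 \
    bash -lc '
vllm bench serve \
--base-url http://127.0.0.1:8000 \
--endpoint /v1/completions \
--model "$MODEL_HANDLE" \
--dataset-name random \
--num-prompts $((CONC * 8)) \
--random-input-len $ISL \
--random-output-len $OSL \
--percentile-metrics ttft,tpot,itl,e2el \
--max-concurrency $CONC \
--request-rate inf
'
done
```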

---

## SGLang

### What this measures

- **Offline**: Raw model throughput/latency under synthetic load, with no HTTP stack or scheduler overhead.
- **Online**: End-to-end serving performance through SGLang (HTTP + scheduler + KV cache).

For more details, see the [official SGLang benchmarking documentation](https://sgl-project.github.io/references/benchmark.html).

### Prerequisites (applies to Offline & Online)

#### 1) Docker permissions

```bash
sudo usermod -aG docker $USER
newgrp docker
```

#### 2) Set environment variables
```bash
# -------------------------------
# Environment Setup
# -------------------------------
export HF_TOKEN="<your_huggingface_token>" # optional if model is public
export MODEL_HANDLE="openai/gpt-oss-20b"
export ISL=128
export OSL=128
export MAX_TOKENS=$((ISL + OSL))
```

### Offline Benchmark

This runs the official SGLang offline throughput benchmark to measure raw model execution performance without launching a server.

```bash
# -------------------------------
# SGLang Offline Benchmark
# -------------------------------
docker run \
--rm -it \
--gpus all \
--ipc host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
nvcr.io/nvidia/sglang:25.12-py3 \
bash -lc '
python3 -m sglang.bench_offline_throughput \
--model-path "$MODEL_HANDLE" \
--dataset-name random \
--num-prompts 1
'
```

### Online Benchmark

#### Terminal 1 - run the SGLang server
```bash
# -------------------------------
# Launch SGLang HTTP Server
# -------------------------------
docker run \
--rm -it \
--gpus all \
--ipc host \
--network host \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
nvcr.io/nvidia/sglang:25.12-py3 \
bash -lc '
python3 -m sglang.launch_server \
--model-path "$MODEL_HANDLE" \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code \
--tp 1 \
--attention-backend triton \
--mem-fraction-static 0.75
'
```
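
Before starting the client, you can optionally check that the server has finished loading. A minimal sketch, assuming your SGLang release exposes the `/health` and `/get_model_info` endpoints:

```bash
# Poll until the SGLang server reports healthy, then print the loaded model info
until curl -sf http://127.0.0.1:30000/health > /dev/null; do
  echo "waiting for sglang server..."
  sleep 5
done
curl -s http://127.0.0.1:30000/get_model_info
```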

#### Terminal 2 - run the client benchmark
```bash
# -------------------------------
# SGLang Online Benchmark Client
# -------------------------------
docker run \
--rm -it \
--network host \
-e HF_TOKEN="$HF_TOKEN" \
-e MODEL_HANDLE="$MODEL_HANDLE" \
nvcr.io/nvidia/sglang:25.12-py3 \
bash -lc '
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 \
--port 30000 \
--model "$MODEL_HANDLE" \
--dataset-name random \
--num-prompts 1 \
--random-input-len '"$ISL"' \
--random-output-len '"$OSL"'
'
```

---

## Llama.cpp

### What this measures

- **Offline**: Raw model throughput/latency under synthetic load, with no HTTP stack or scheduler overhead.
- **Online**: End-to-end serving performance through `llama-server` (HTTP + scheduler + KV cache).

For more details, see the [llama.cpp GitHub Discussions](https://github.com/ggml-org/llama.cpp/discussions).

### Prerequisites (applies to Offline & Online)

**Note:**
DGX Spark uses a long-term supported (LTS) base software stack, so the host OS, driver, and CUDA toolkit are updated together on a fixed release cadence. To access the latest CUDA features and performance improvements, run the NVIDIA NGC containers (PyTorch, vLLM, TensorRT-LLM, etc.), which are validated for DGX Spark and include newer CUDA toolkits without modifying the host system. If required, you can also install CUDA directly via Debian packages; however, this approach is not recommended for most users and falls outside the supported DGX OS stack.
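
To see which CUDA toolkit a given container ships versus what the host driver reports, a quick check (a minimal sketch; run the first command inside an NGC container and the second on the host):

```bash
# Inside an NGC container: CUDA toolkit version bundled with the image
nvcc --version | grep release

# On the DGX Spark host: driver version and the CUDA version it supports
nvidia-smi
```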

#### 1) Launch the latest PyTorch container from NGC
```bash
export HF_TOKEN="<your_huggingface_token>" # optional if model is public
docker run --rm -it \
--gpus all \
--ipc=host \
-p 8080:8080 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME":/home/nvidia \
-w /home/nvidia \
nvcr.io/nvidia/pytorch:25.12-py3
```

#### 2) Clone and build the latest Llama.cpp
```bash
# -------------------------------
# Clone and build Llama.cpp
# -------------------------------
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Install build dependencies (run as root inside the container)
apt-get update
apt-get install -y libcurl4-openssl-dev cmake g++ make

# Build with CUDA support for NVIDIA GPUs (adjust arch as needed)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121a" -DGGML_CUDA_CUB_3DOT2=on

cmake --build build --config Release -j
```
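
After the build completes, confirm that the binaries used in the steps below were produced:

```bash
# Both binaries should exist under build/bin
ls -lh build/bin/llama-bench build/bin/llama-server
```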

#### 3) Download model weights
```bash
# -------------------------------
# Download GGUF model
# -------------------------------
cd models

# Example: GPT-OSS-20B
curl -L -o gpt-oss-20b-mxfp4.gguf https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-mxfp4.gguf

# Return to the llama.cpp root; the benchmark commands below are run from there
cd ..
```

#### 4) Set environment variables
```bash
# -------------------------------
# Environment Setup
# -------------------------------

export MODEL_HANDLE="gpt-oss-20b-mxfp4.gguf"
export ISL=128
export OSL=128
export MAX_TOKENS=$((ISL + OSL))
```

### Offline Benchmark
```bash
# -------------------------------
# Llama.cpp Offline Benchmark
# -------------------------------
./build/bin/llama-bench \
-m models/$MODEL_HANDLE \
-t $(nproc) \
-p $ISL \
-n $OSL \
-ngl 99 \
-dio 1 \
-fa 1
```

### Online Benchmark

#### Terminal 1 - run the server
```bash
# -------------------------------
# Launch Llama.cpp Server
# -------------------------------
./build/bin/llama-server \
--model models/$MODEL_HANDLE \
--ctx-size $MAX_TOKENS \
--n-predict $OSL \
--threads $(nproc) \
--host 0.0.0.0 \
--port 8080 \
-fa 1 \
--backend-sampling
```

#### Terminal 2 - run the client
```bash
# -------------------------------
# Launch Benchmark Client
# -------------------------------
curl -s -H "Content-Type: application/json" \
-d "{
\"prompt\": \"What is the capital of France?\",
\"temperature\": 0.5,
\"stream\": false
}" \
http://127.0.0.1:8080/completion | jq .
```
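
The response JSON from llama-server typically includes a `timings` object with prompt-processing and token-generation rates. A minimal sketch for pulling out just those numbers (assumes the `timings` field is present in your build and `jq` is installed):

```bash
curl -s -H "Content-Type: application/json" \
-d '{"prompt": "What is the capital of France?", "n_predict": 64, "stream": false}' \
http://127.0.0.1:8080/completion | jq '.timings'
```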

---

## Image Generation

This benchmark evaluates diffusion model performance using TensorRT-based pipelines for:
- Flux.1 Schnell
- SDXL 1.0

You will measure image generation latency and throughput for text-to-image workloads on DGX Spark.

### Prerequisites

#### 1) Docker permissions
```bash
sudo usermod -aG docker $USER
newgrp docker
```

#### 2) Set environment variables
```bash
# -------------------------------
# Environment Setup
# -------------------------------
export HF_TOKEN="<your_huggingface_token>"
```

#### 3) Launch latest PyTorch container
```bash
docker run --rm -it \
--gpus all \
--ipc=host \
-e HF_TOKEN="$HF_TOKEN" \
-v $HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub \
nvcr.io/nvidia/pytorch:25.12-py3
```

#### 4) Inside the container
```bash
git clone https://github.com/NVIDIA/TensorRT.git -b main --single-branch
export TRT_OSSPATH=$PWD/TensorRT
cd $TRT_OSSPATH/demo/Diffusion
pip install "nvidia-modelopt[onnx,hf]"
# Drop the pinned nvidia-modelopt entry so the version installed above is kept
sed -i '/^nvidia-modelopt\[.*\]=.*/d' requirements.txt
pip install -r requirements.txt

apt-get update && \
apt-get install -y libgl1 libglib2.0-0 libsm6 libxrender1 libxext6
```

### Flux.1 Schnell:
```bash
# -------------------------------
# Flux.1 Schnell txt2img Benchmark
# -------------------------------
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token="$HF_TOKEN" \
--version="flux.1-schnell" \
--fp4 \
--download-onnx-models \
--batch-size 1 \
--width 1024 \
--height 1024 \
--denoising-steps 4
```
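
The first invocation typically includes ONNX download and TensorRT engine build time. To gauge steady-state image-generation latency, re-run the same command a second time so the cached engines are reused (a sketch wrapping the run with `time`; assumes the demo caches engines between runs):

```bash
# Second run reuses previously built engines, so most of the wall-clock time is image generation
time python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token="$HF_TOKEN" \
--version="flux.1-schnell" \
--fp4 \
--download-onnx-models \
--batch-size 1 \
--width 1024 \
--height 1024 \
--denoising-steps 4
```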

### SDXL 1.0:
```bash
# -------------------------------
# SDXL 1.0 txt2img Benchmark
# -------------------------------
python3 demo_txt2img_xl.py "a beautiful photograph of Mt. Fuji during cherry blossom" \
--hf-token="$HF_TOKEN" \
--version xl-1.0 \
--download-onnx-models \
--batch-size 2 \
--width 1024 \
--height 1024 \
--denoising-steps 50
```

---

## Fine-tuning

### What this measures

This benchmark evaluates training performance (step time, throughput, memory usage) for different fine-tuning strategies on DGX Spark:
- LoRA fine-tuning for parameter-efficient adaptation of Llama 3 8B
- qLoRA fine-tuning for memory-efficient adaptation of Llama 3 70B
- Full fine-tuning of a smaller Llama 3 3B model

### Prerequisites

#### 1) Docker permissions
```bash
sudo usermod -aG docker $USER
newgrp docker
```

#### 2) Set environment variables
```bash
# -------------------------------
# Environment Setup
# -------------------------------
export HF_TOKEN="<your_huggingface_token>"
```

#### 3) Launch latest PyTorch container
```bash
docker run --rm -it \
--gpus all \
-e HF_TOKEN="$HF_TOKEN" \
--ipc=host \
-v $HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub \
nvcr.io/nvidia/pytorch:25.12-py3
```

#### 4) Inside the container
```bash
# Install dependencies
pip install transformers peft datasets "trl==0.26.2" "bitsandbytes==0.49.1"

# Clone DGX Spark playbooks
git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets

# Force bitsandbytes to use CUDA 13.0 binary (CUDA 13.1 not yet supported)
export BNB_CUDA_VERSION=130
```

### LoRA Fine-tuning
```bash
# -------------------------------
# Llama 3 8B LoRA Fine-tuning
# -------------------------------
python Llama3_8B_LoRA_finetuning.py --use_torch_compile
```

### qLoRA Fine-tuning
```bash
# -------------------------------
# Llama 3 70B qLoRA Fine-tuning
# -------------------------------
python Llama3_70B_qLoRA_finetuning.py
```

### Full Fine-tuning
```bash
# -------------------------------
# Llama 3 3B Full Fine-tuning
# -------------------------------
python Llama3_3B_full_finetuning.py --use_torch_compile
```
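
To compare the memory footprints of the three strategies, you can watch GPU memory from a separate terminal on the host while a run is in progress (a minimal sketch using `nvidia-smi`; on DGX Spark's unified memory the reported numbers may differ from discrete GPUs):

```bash
# Run on the DGX Spark host in another terminal during fine-tuning
watch -n 2 "nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv,noheader"
```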

---

# Dual Spark

## Measure BW Between Dual Sparks

DGX Spark systems support high-bandwidth, low-latency interconnects over QSFP ports.
Bandwidth between two Sparks can be validated at two different layers:
- [GPU collective bandwidth using NCCL](#gpu-collective-bandwidth-using-nccl)
- [Raw RDMA fabric bandwidth (RoCE)](#raw-rdma-fabric-bandwidth-roce)

### GPU collective bandwidth using NCCL
- Follow the instructions here: https://build.nvidia.com/spark/nccl/stacked-sparks

#### What this measures
- This test measures effective GPU collective communication bandwidth using NCCL.

### Raw RDMA fabric bandwidth (RoCE)

#### What this measures
- This test measures raw point-to-point RDMA bandwidth over the QSFP-connected CX-7 NICs using RoCE.

### Prerequisites

#### 1) Install perftest tools (on both Sparks)
```bash
sudo apt install perftest
```

#### 2) Ensure one QSFP cable connects Spark-1 ↔ Spark-2 directly

### Setup

#### Step 1 – Identify devices and logical ports

Run on both Spark systems:
```bash
ibdev2netdev
```

**Example output (from Spark-1):**
```
nvidia@spark-5c2d:~$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
```
You will use the **Up** interfaces for IP assignment.

**Example output (from Spark-2):**
```
nvidia@spark-bd26:~$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
```
You will use the **Up** interfaces for IP assignment.
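
To list only the interfaces you will assign addresses to, you can filter for links reporting Up:

```bash
ibdev2netdev | grep "(Up)"
```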

#### Step 2 – Assign Manual IPs

Assign unique subnets to each active port.

**Note:** Repeat this step after a reboot if NetworkManager clears these addresses.

**Spark-1 (HOST)**
```bash
# Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  renderer: NetworkManager
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.200.12/24
      dhcp4: no
    enP2p1s0f0np0:
      addresses:
        - 192.168.201.12/24
      dhcp4: no
EOF

sudo chmod 600 /etc/netplan/40-cx7.yaml
sudo netplan apply
```
**Note:** Interfaces may differ; use the ones marked **Up**.

**Spark-2 (CLIENT)**
```bash
# Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  renderer: NetworkManager
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.200.13/24
      dhcp4: no
    enP2p1s0f0np0:
      addresses:
        - 192.168.201.13/24
      dhcp4: no
EOF

sudo chmod 600 /etc/netplan/40-cx7.yaml
sudo netplan apply
```
**Note:** Interfaces may differ; use the ones marked **Up**.
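
Before running the bandwidth tests, it is worth confirming basic IP reachability over both links (a sketch from Spark-2, assuming the addresses above):

```bash
# From Spark-2: check both point-to-point links toward Spark-1
ping -c 3 192.168.200.12
ping -c 3 192.168.201.12
```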

#### Step 3 – Run Bandwidth Test

Open two terminals on each Spark (4 total).

**Note:** Make sure ports 12000 and 12001 are open and not in use.

**Spark-1 (HOST) - Terminal 1**
```bash
ib_write_bw -d rocep1s0f0 -i 1 -p 12000 -F --report_gbits --run_infinitely
```
**Note:** Replace device names with your actual **Up** interfaces (replace `rocep1s0f0` with your `<HOST_NIC1_INTERFACE>`).

**Spark-1 (HOST) - Terminal 2**
```bash
ib_write_bw -d roceP2p1s0f0 -i 1 -p 12001 -F --report_gbits --run_infinitely
```
**Note:** Replace device names with your actual **Up** interfaces (replace `roceP2p1s0f0` with your `<HOST_NIC2_INTERFACE>`).

**Spark-2 (CLIENT) - Terminal 1**
```bash
ib_write_bw -d rocep1s0f0 -i 1 -p 12000 -F --report_gbits 192.168.200.12 --run_infinitely
```
**Note:** Replace device names with your actual **Up** interfaces (replace `rocep1s0f0` with your `<CLIENT_NIC1_INTERFACE>`).

**Spark-2 (CLIENT) - Terminal 2**
```bash
ib_write_bw -d roceP2p1s0f0 -i 1 -p 12001 -F --report_gbits 192.168.201.12 --run_infinitely
```
**Note:** Replace device names with your actual **Up** interfaces (replace `roceP2p1s0f0` with your `<CLIENT_NIC2_INTERFACE>`).

#### Step 4 – Monitor Bandwidth

**Example client output:**

**Client-1 Output**
```
nvidia@spark-bd26:~$ ib_write_bw -d rocep1s0f0 -i 1 -p 12000 -F --report_gbits 192.168.200.12 --run_infinitely
WARNING: BW peak won't be measured in this run.
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : rocep1s0f0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0129 PSN 0x57279d RKey 0x184300 VAddr 0x00ec99bedad000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:200:13
remote address: LID 0000 QPN 0x0129 PSN 0x531b7 RKey 0x184300 VAddr 0x00ffeec955d000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:200:12
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 882805 0.00 92.57 0.176554
65536 882802 0.00 92.57 0.176554
65536 882791 0.00 92.57 0.176554
65536 882791 0.00 92.56 0.176552
65536 882821 0.00 92.57 0.176555
```

**Client-2 Output**
```
nvidia@spark-bd26:~$ ib_write_bw -d roceP2p1s0f0 -i 1 -p 12001 -F --report_gbits 192.168.201.12 --run_infinitely
WARNING: BW peak won't be measured in this run.
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : roceP2p1s0f0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x01a9 PSN 0x5e41f9 RKey 0x1a03ed VAddr 0x00f374277dd000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:201:13
remote address: LID 0000 QPN 0x01a9 PSN 0x8ab8e7 RKey 0x1a0300 VAddr 0x00f285f5f1d000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:201:12
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 927940 0.00 97.28 0.185548
65536 927790 0.00 97.28 0.185549
65536 927766 0.00 97.28 0.185550
65536 927754 0.00 97.28 0.185545
65536 927804 0.00 97.29 0.185557
65536 927807 0.00 97.28 0.185554
```

**Total throughput = 92.57 + 97.28 = 189.85 Gbps**
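
To sanity-check these numbers against the negotiated link speed of each port, you can query the NICs directly (a sketch using the **Up** interfaces from Step 1; interface names may differ on your system):

```bash
# Run on either Spark
sudo ethtool enp1s0f0np0 | grep -i speed
sudo ethtool enP2p1s0f0np0 | grep -i speed
```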

## Measure RDMA Latency Between Dual Sparks

### What this measures

This test measures one-way RDMA latency between two DGX Spark systems over the same QSFP RoCE links used for bandwidth testing.

### Prerequisites

Before running the latency tests, complete Step 1 (Identify devices and logical ports) and
Step 2 (Assign Manual IPs) from the [Measure BW Between Dual Sparks](#measure-bw-between-dual-sparks) section above.

### Step 1.1 – Run RDMA Write Latency Test

This measures RDMA write latency on a single QSFP link.

Open two terminals (one per Spark).

**Spark-1 (HOST)**
```bash
ib_write_lat -d rocep1s0f0 -i 1 -p 13000 -F
```

Replace `rocep1s0f0` with your actual **Up** RDMA device from `ibdev2netdev`.

**Spark-2 (CLIENT)**
```bash
ib_write_lat -d rocep1s0f0 -i 1 -p 13000 -F 192.168.200.12
```

### Step 1.2 – Run RDMA Read Latency Test

This measures RDMA read latency on the second QSFP link.

Open two terminals (one per Spark).

**Spark-1 (HOST)**
```bash
ib_read_lat -d rocep1s0f0 -i 1 -p 13001 -F
```

**Spark-2 (CLIENT)**
```bash
ib_read_lat -d rocep1s0f0 -i 1 -p 13001 -F 192.168.201.12
```

**Note:** RDMA latency is a per-link metric and should be measured on a single QSFP link at a time. Latency values are not aggregated across multiple links.