chore: Regenerate all playbooks

This commit is contained in:
GitLab CI 2025-10-10 20:39:52 +00:00
parent 819ce6334c
commit e1bed13f13
19 changed files with 242 additions and 535 deletions

View File

@ -23,7 +23,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
- [Comfy UI](nvidia/comfy-ui/)
- [Set Up Local Network Access](nvidia/connect-to-your-spark/)
- [CUDA-X Data Science](nvidia/cuda-x-data-science/)
- [CUDA-X](nvidia/cuda-x-data-science/)
- [DGX Dashboard](nvidia/dgx-dashboard/)
- [FLUX.1 Dreambooth LoRA Fine-tuning](nvidia/flux-finetuning/)
- [Optimized JAX](nvidia/jax/)

View File

@ -1,6 +1,6 @@
# CUDA-X Data Science
# CUDA-X
> Install and use NVIDIA cuML and NVIDIA cuDF to accelerate UMAP, HDBSCAN, pandas and more with zero code changes.
> Accelerated data science with NVIDIA RAPIDS
## Table of Contents
@ -12,25 +12,18 @@
## Overview
## Basic Idea
This playbook includes two example notebooks that demonstrate the acceleration of key machine learning algorithms and core pandas operations using CUDA-X Data Science libraries:
CUDA-X Data Science (formerly RAPIDS) is an open-source library collection that accelerates the data science and data processing ecosystem. Accelerate popular Python tools like scikit-learn and pandas with zero code changes on DGX Spark to maximize performance at your desk. This playbook orients you with example workflows that demonstrate the acceleration of key machine learning algorithms like UMAP and HDBSCAN and core pandas operations, without changing your code.
- **NVIDIA cuDF:** Accelerates operations for data preparation and core data processing of 8GB of strings data, with no code changes.
- **NVIDIA cuML:** Accelerates popular, compute-intensive machine learning algorithms in scikit-learn (LinearSVC), UMAP, and HDBSCAN, with no code changes.
In this playbook, we will demonstrate the acceleration of key machine learning algorithms like UMAP and HDBSCAN and core pandas operations, without changing your code.
CUDA-X Data Science (formerly RAPIDS) is an open-source library collection that accelerates the data science and data processing ecosystem. These libraries accelerate popular Python tools like scikit-learn and pandas with zero code changes. On DGX Spark, these libraries maximize performance at your desk with your existing code.
## What you'll accomplish
You will accelerate popular machine learning algorithms and data analytics operations on the GPU. You will understand how to accelerate popular Python tools, and see the value of running data science workflows on your DGX Spark.
## What to know before starting
- Familiarity with pandas, scikit learn, machine learning algorithms, such as support vector machine, clustering, and dimensionality reduction algorithms
## Prerequisites
- Familiarity with pandas, scikit-learn, and machine learning algorithms such as support vector machines, clustering, and dimensionality reduction.
- Install conda
- Generate a Kaggle API key
## Time & risk
- Duration:
- 20-30 minutes setup time.
- 2-3 minutes to run each notebook.
**Duration:** 20-30 minutes setup time and 2-3 minutes to run each notebook.
## Instructions
@ -40,34 +33,32 @@ You will accelerate popular machine learning algorithms and data analytics opera
- Install conda using [these instructions](https://docs.anaconda.com/miniconda/install/)
- Create Kaggle API key using [these instructions](https://www.kaggle.com/discussions/general/74235) and place the **kaggle.json** file in the same folder as the notebook
## Step 2. Installing Data Science libraries
- Use the following command to install the CUDA-X libraries (this will create a new conda environment)
## Step 2. Installing CUDA-X libraries
- Use the following command to install the CUDA-X libraries (this will create a new conda environment)
```bash
conda create -n rapids-test -c rapidsai-nightly -c conda-forge -c nvidia \
rapids=25.10 python=3.12 'cuda-version=13.0' \
jupyterlab hdbscan umap-learn
```
## Step 3. Activate the conda environment
- Activate the conda environment
- Activate the conda environment
```bash
conda activate rapids-test
```
## Step 4. Cloning the playbook repository
- Clone the repository and go to the assets folder placed in the cuda-x-data-science folder
## Step 4. Cloning the notebooks
- Clone the repository and go to the cuda-x-data-science/assets folder
```bash
git clone https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets
ssh://git@******:12051/spark-playbooks/dgx-spark-playbook-assets.git
```
- Place the **kaggle.json** created in Step 1 in the assets folder
- Place the **kaggle.json** created in Step 1 in the assets folder
## Step 5. Run the notebooks
There are two notebooks in the GitHub repository.
One runs an example of a large strings data processing workflow with pandas code on GPU.
- Run the cudf_pandas_demo.ipynb notebook
- Both the notebooks are self explanatory
- To experience the acceleration achieved using cudf.pandas, run the cudf_pandas_demo.ipynb notebook
```bash
jupyter notebook cudf_pandas_demo.ipynb
```
The other goes over an example of machine learning algorithms including UMAP and HDBSCAN.
- Run the cuml_sklearn_demo.ipynb notebook
- To experience the acceleration achieved using cuML, run the cuml_sklearn_demo.ipynb notebook
```bash
jupyter notebook cuml_sklearn_demo.ipynb
```
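Outside of Jupyter, the same zero-code-change pattern applies to plain scripts; a minimal sketch (the script names are placeholders) using the accelerator modules installed above:
```bash
## Run an existing pandas script on the GPU via cudf.pandas, with no code changes
python -m cudf.pandas your_pandas_script.py
## Run an existing scikit-learn/UMAP/HDBSCAN script on the GPU via cuml.accel
python -m cuml.accel your_sklearn_script.py
```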

View File

@ -171,10 +171,6 @@ Unlike the base model, we can see that the fine-tuned model can generate multipl
## Troubleshooting
| Symptom | Cause | Fix |
|---------|--------|-----|
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
> **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

View File

@ -22,7 +22,7 @@ COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
RUN mkdir /app
WORKDIR /app
RUN uv init && uv venv && uv pip install marimo && uv pip install "jax[cuda13]==0.7.2" && uv pip install "numpy==2.3.3" && uv pip install "plotly==6.3.0" && uv pip install "opencv-python-headless==4.12.0.88" && uv pip install "tqdm==4.67.1"
RUN uv init --python 3.12 && uv venv && uv pip install "marimo==0.16.5" && uv pip install "jax[cuda13]==0.7.2" && uv pip install "numpy==2.3.3" && uv pip install "plotly==6.3.0" && uv pip install "opencv-python-headless==4.12.0.88" && uv pip install "tqdm==4.67.1"
COPY *.py *.mp4 /app

View File

@ -202,7 +202,6 @@ docker container prune -f
| Symptom | Cause | Fix |
|---------|--------|-----|
| CUDA out of memory during training | Batch size too large for GPU VRAM | Reduce `per_device_train_batch_size` or increase `gradient_accumulation_steps` |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
| Model download fails or is slow | Network connectivity or Hugging Face Hub issues | Check internet connection, try using `HF_HUB_OFFLINE=1` for cached models |
| Training loss not decreasing | Learning rate too high/low or insufficient data | Adjust `learning_rate` parameter or check dataset quality |

View File

@ -142,7 +142,7 @@ docker volume rm "$(basename "$PWD")_postgres_data"
| Symptom | Cause | Fix |
|---------|--------|-----|
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your HuggingFace token; and request access to the gated model on your web browser |
> **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within

View File

@ -9,7 +9,7 @@
"lint": "next lint"
},
"dependencies": {
"next": "15.1.7",
"next": "15.2.4",
"react": "^19.0.0",
"react-dom": "^19.0.0",
"react-markdown": "^10.1.0",
@ -22,7 +22,7 @@
"@types/react": "^19",
"@types/react-dom": "^19",
"eslint": "^9",
"eslint-config-next": "15.1.7",
"eslint-config-next": "15.2.4",
"postcss": "^8",
"tailwindcss": "^3.4.1",
"typescript": "^5"

View File

@ -213,7 +213,6 @@ environment.
|---------|-------|-----|
| "CUDA out of memory" error | Insufficient VRAM for model | Use FP8/FP4 quantization or smaller model |
| "Invalid HF token" error | Missing or expired HuggingFace token | Set valid token: `export HF_TOKEN=<YOUR_TOKEN>` |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
| Model download timeouts | Network issues or rate limiting | Retry command or pre-download models |
> **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.

View File

@ -58,98 +58,17 @@ containers can be stopped with `docker stop`
## Run on two Sparks
## Step 1. Setup networking between nodes
## Step 1. Configure network connectivity
Configure network interfaces for high-performance inter-node communication. Choose one option
based on your network requirements.
Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.
**Option 1: Suggested - Netplan configuration**
This includes:
- Physical QSFP cable connection
- Network interface configuration (automatic or manual IP assignment)
- Passwordless SSH setup
- Network connectivity verification
Configure network interfaces using netplan on both DGX Spark nodes for automatic link-local
addressing:
```bash
## On both nodes, create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
version: 2
ethernets:
enp1s0f0np0:
link-local: [ ipv4 ]
enp1s0f1np1:
link-local: [ ipv4 ]
EOF
## On both nodes, set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml
## On both nodes, apply the netplan configuration
sudo netplan apply
```
**Option 2: Manual IP assignment (advanced)**
Configure dedicated cluster networking with static IP addresses:
```bash
## On Node 1
sudo ip addr add 192.168.100.10/24 dev enP2p1s0f1np1
sudo ip link set enP2p1s0f1np1 up
## On Node 2
sudo ip addr add 192.168.100.11/24 dev enP2p1s0f1np1
sudo ip link set enP2p1s0f1np1 up
## Verify connectivity from Node 1
ping -c 3 192.168.100.11
## Verify connectivity from Node 2
ping -c 3 192.168.100.10
```
## Step 2. Run the DGX Spark discovery script
Automatically identify interconnected DGX Spark systems and configure SSH passwordless
authentication for multi-node operations:
```bash
## On either node, run the discovery script
./discover-sparks
```
Expected output:
```
Found: 192.168.100.10 (spark-1b3b.local)
Found: 192.168.100.11 (spark-1d84.local)
Copying your SSH public key to all discovered nodes using ssh-copy-id.
You may be prompted for your password on each node.
Copying SSH key to 192.168.100.10 ...
Copying SSH key to 192.168.100.11 ...
nvidia@192.168.100.11's password:
SSH key copy process complete. These two sparks can now talk to each other.
```
## Step 3. Identify active network interfaces
Check which ConnectX-7 network interfaces are active and available for NCCL communication:
```bash
ibdev2netdev
```
Expected output (showing "Up" for active interfaces):
```
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
```
Note the active interface names (marked "Up") for use in container configuration.
## Step 4. Launch TensorRT-LLM containers on both nodes
## Step 2. Launch TensorRT-LLM containers on both nodes
Start containers with appropriate network and GPU configuration for NCCL communication:
@ -170,7 +89,7 @@ docker run --name trtllm --rm -d \
nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3
```
## Step 5. Build NCCL with Blackwell support
## Step 3. Build NCCL with Blackwell support
Execute these commands inside both containers to build NCCL from source with Blackwell
architecture support:
@ -188,7 +107,7 @@ export NCCL_HOME="/opt/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
```
## Step 6. Build NCCL test suite
## Step 4. Build NCCL test suite
Compile the NCCL test suite to validate communication performance:
@ -199,7 +118,7 @@ cd /opt/nccl-tests/
make MPI=1
```
## Step 7. Run NCCL communication test
## Step 5. Run NCCL communication test
Execute multi-node NCCL performance test using the active network interface:
@ -217,7 +136,7 @@ mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
/opt/nccl-tests/build/all_gather_perf -b 32G -e 32G -f 2
```
## Step 8. Validate NCCL installation
## Step 6. Validate NCCL installation
Verify successful NCCL compilation and multi-node communication:
@ -235,7 +154,7 @@ mpirun --version
Expected output should show NCCL libraries in `/opt/nccl/build/lib/` and test binaries in
`/opt/nccl-tests/build/`.
## Step 10. Cleanup and rollback
## Step 7. Cleanup and rollback
**Warning**: These steps will stop containers and reset network configuration.
@ -251,7 +170,7 @@ sudo rm /etc/netplan/40-cx7.yaml
sudo netplan apply
```
## Step 11. Next steps
## Step 8. Next steps
Test your NCCL setup with a simple distributed training example:
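One possible smoke test, reusing the example IPs from the manual-assignment option above (the full playbook may show a different example):
```bash
## Confirm MPI can reach both nodes (should print two hostnames)
mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 hostname
## Run an NCCL collective across both nodes as a quick bandwidth check
mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  /opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2
```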

View File

@ -319,7 +319,6 @@ Explore the [NeMo AutoModel GitHub repository](https://github.com/NVIDIA-NeMo/Au
| GPU not detected in training | CUDA driver/runtime mismatch | Verify driver compatibility: `nvidia-smi` and reinstall CUDA if needed |
| Out of memory during training | Model too large for available GPU memory | Reduce batch size, enable gradient checkpointing, or use model parallelism |
| ARM64 package compatibility issues | Package not available for ARM architecture | Use source installation or build from source with ARM64 flags |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
> **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within

View File

@ -256,7 +256,6 @@ The quantized model is now ready for deployment. Common next steps include:
| Model files not found in output directory | Volume mount failed or wrong path | Verify `$(pwd)/output_models` resolves correctly |
| Git clone fails inside container | Network connectivity issues | Check internet connection and retry |
| Quantization process hangs | Container resource limits | Increase Docker memory limits or use `--ulimit` flags |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
> **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within

View File

@ -117,10 +117,6 @@ python Llama3_3B_full_finetuning.py
## Troubleshooting
| Symptom | Cause | Fix |
|---------|--------|-----|
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
> **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:

View File

@ -163,7 +163,7 @@ docker stop <container_id>
| "CUDA out of memory" error | Insufficient GPU memory | Reduce `kv_cache_free_gpu_memory_fraction` to 0.9 or use a device with more VRAM |
| Container fails to start | Docker GPU support issues | Verify `nvidia-docker` is installed and `--gpus=all` flag is supported |
| Model download fails | Network or authentication issues | Check HuggingFace authentication and network connectivity |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your HuggingFace token; and request access to the gated model on your web browser |
| Server doesn't respond | Port conflicts or firewall | Check if port 8000 is available and not blocked |
> **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.

View File

@ -6,6 +6,10 @@
- [Overview](#overview)
- [Run on two Sparks](#run-on-two-sparks)
- [Option 1: Automatic IP Assignment (Recommended)](#option-1-automatic-ip-assignment-recommended)
- [Option 2: Manual IP Assignment (Advanced)](#option-2-manual-ip-assignment-advanced)
- [Option 1: Automatically configure SSH](#option-1-automatically-configure-ssh)
- [Option 2: Manually discover and configure SSH](#option-2-manually-discover-and-configure-ssh)
- [Troubleshooting](#troubleshooting)
---
@ -15,76 +19,98 @@
## Basic idea
Configure two DGX Spark systems for high-speed inter-node communication using 200GbE direct
QSFP connections and NCCL multi-node communication. This setup enables distributed training
and inference workloads across multiple Blackwell GPUs by establishing network connectivity,
configuring SSH authentication, and validating communication with NCCL performance tests.
QSFP connections. This setup enables distributed workloads across multiple DGX Spark nodes
by establishing network connectivity and configuring SSH authentication.
## What you'll accomplish
You will physically connect two DGX Spark devices with a QSFP cable, configure network
interfaces for cluster communication, establish passwordless SSH between nodes, and validate
the setup with NCCL multi-node tests to create a functional distributed computing environment.
interfaces for cluster communication, and establish passwordless SSH between nodes to create
a functional distributed computing environment.
## What to know before starting
- Working with network interface configuration and netplan
- Using Docker containers with GPU and network access
- Basic understanding of distributed computing concepts
- Working with network interface configuration and netplan
- Experience with SSH key management
- Familiarity with NVIDIA GPU architectures and CUDA environments
## Prerequisites
- Two DGX Spark systems with NVIDIA Blackwell GPUs available
- QSFP cable for direct 200GbE connection between devices
- Docker installed on both systems: `docker --version`
- CUDA toolkit installed: `nvcc --version` (should show 12.9 or higher)
- SSH access available on both systems: `ssh-keygen -t rsa` (if keys don't exist)
- Git available for source code compilation: `git --version`
- Two DGX Spark systems
- One QSFP cable for direct 200GbE connection between two devices
- SSH access available to both systems
- Root or sudo access on both systems: `sudo whoami`
- The same username on both systems
## Ancillary files
All required files for this playbook can be found [here](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/)
- `discover-sparks` script for automatic node discovery and SSH key distribution
- `trtllm-mn-entrypoint.sh` container entrypoint script for multi-node setup
- Network interface mapping tools (`ibdev2netdev`, `ip link show`)
- [**discover-sparks.sh**](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/discover-sparks) script for automatic node discovery and SSH key distribution
## Time & risk
**Duration:** 2-3 hours including validation tests
**Duration:** 1 hour including validation
**Risk level:** Medium - involves network reconfiguration and container setup
**Risk level:** Medium - involves network reconfiguration
**Rollback:** Network changes can be reversed by removing netplan configs or IP assignments
## Run on two Sparks
## Step 1. Physical Hardware Connection
## Step 1. Ensure Same Username on Both Systems
Connect the QSFP cable between both DGX Spark systems using the rightmost QSFP interface
on each device. This establishes the 200GbE direct connection required for high-speed
inter-node communication.
On both systems check the username and make sure it's the same:
```bash
## Check QSFP interface availability on both nodes
ip link show | grep enP2p1s0f1np1
## Check current username
whoami
```
Expected output shows the interface exists but may be down initially.
If usernames don't match, create a new user (e.g., nvidia) on both systems and log in with the new user:
## Step 2. Network Interface Configuration
```bash
## Create nvidia user and add to sudo group
sudo useradd -m nvidia
sudo usermod -aG sudo nvidia
## Set password for nvidia user
sudo passwd nvidia
## Switch to nvidia user
su - nvidia
```
## Step 2. Physical Hardware Connection
Connect the QSFP cable between both DGX Spark systems using any QSFP interface
on each device. This establishes the 200GbE direct connection required for high-speed
inter-node communication. Upon connection between the two nodes, you will see output like the example below; in this example the interface showing as 'Up' is **enp1s0f1np1** / **enP2p1s0f1np1** (each physical port has two names).
Example output:
```bash
## Check QSFP interface availability on both nodes
nvidia@dgx-spark-1:~$ ibdev2netdev
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
```
Note: If none of the interfaces are showing as 'Up', please check the QSFP cable connection, reboot the systems and try again.
Note: The interface showing as 'Up' depends on which port you are using to connect the two nodes. Each physical port has two names, for example, enp1s0f1np1 and enP2p1s0f1np1 refer to the same physical port. Please disregard enP2p1s0f0np0 and enP2p1s0f1np1, and use enp1s0f0np0 and enp1s0f1np1 only.
## Step 3. Network Interface Configuration
Choose one option to set up the network interfaces. Options 1 and 2 are mutually exclusive.
### Option 1: Automatic IP Assignment (Recommended)
Configure network interfaces using netplan on both DGX Spark nodes for automatic
link-local addressing:
```bash
## On both nodes, create the netplan configuration file
## Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
version: 2
@ -95,217 +121,128 @@ network:
link-local: [ ipv4 ]
EOF
## On both nodes, set appropriate permissions
## Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml
## On both nodes, apply the netplan configuration
## Apply the configuration
sudo netplan apply
```
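Option 1 assigns link-local (169.254.0.0/16) addresses automatically. To confirm the assignment, a quick check (interface name taken from the Step 2 example) is:
```bash
## Expect an inet address in the 169.254.0.0/16 range on the cabled interface
ip -4 addr show enp1s0f1np1
```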
**Option 2: Manual IP Assignment (Advanced)**
Note: Using this option, the IPs assigned to the interfaces will change if you reboot the system.
Configure dedicated cluster networking with static IP addresses:
```bash
## On Node 1
sudo ip addr add 192.168.100.10/24 dev enP2p1s0f1np1
sudo ip link set enP2p1s0f1np1 up
## On Node 2
sudo ip addr add 192.168.100.11/24 dev enP2p1s0f1np1
sudo ip link set enP2p1s0f1np1 up
## Verify connectivity from Node 1
ping -c 3 192.168.100.11
## Verify connectivity from Node 2
ping -c 3 192.168.100.10
```
## Step 3. SSH Key Distribution
Automatically identify interconnected DGX Spark systems and configure SSH passwordless
authentication for multi-node operations. This step runs on either node.
```bash
## On either node, run the discovery script
./discover-sparks
```
Expected output:
```
Found: 192.168.100.10 (spark-1b3b.local)
Found: 192.168.100.11 (spark-1d84.local)
Copying your SSH public key to all discovered nodes using ssh-copy-id.
You may be prompted for your password on each node.
Copying SSH key to 192.168.100.10 ...
Copying SSH key to 192.168.100.11 ...
nvidia@192.168.100.11's password:
SSH key copy process complete. These two sparks can now talk to each other.
```
## Step 4. Network Interface Validation
Check which ConnectX-7 network interfaces are active and available for communication:
### Option 2: Manual IP Assignment (Advanced)
First, identify which network ports are available and up:
```bash
## Check network port status
ibdev2netdev
```
Expected output (showing "Up" for active interfaces):
Example output:
```
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
```
Note the active interface names (marked "Up") for use in container configuration.
Use an interface that shows as "(Up)" in your output. In this example, we'll use **enp1s0f1np1**. You can disregard interfaces starting with the prefix `enP2p<...>` and only use interfaces starting with `enp1<...>` instead.
## Step 5. Launch Containers with Network Configuration
On Node 1:
```bash
## Assign static IP and bring up interface.
sudo ip addr add 192.168.100.10/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```
Start containers with appropriate network and GPU configuration for NCCL communication.
This step runs on both nodes.
Repeat the same process for Node 2, but using IP **192.168.100.11/24**. Make sure to use the correct interface name, as reported by the `ibdev2netdev` command.
```bash
## Assign static IP and bring up interface.
sudo ip addr add 192.168.100.11/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```
You can verify the IP assignment on both nodes by running the following command on each node:
```bash
## Replace enp1s0f1np1 with the interface showing as "(Up)" in your output, either enp1s0f0np0 or enp1s0f1np1
ip addr show enp1s0f1np1
```
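Once both nodes have addresses, a simple reachability check (IPs from this example) confirms the direct link works:
```bash
## From Node 1
ping -c 3 192.168.100.11
## From Node 2
ping -c 3 192.168.100.10
```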
## Step 4. Set up passwordless SSH authentication
### Option 1: Automatically configure SSH
Run the DGX Spark [**discover-sparks.sh**](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/discover-sparks) script from one of the nodes to automatically discover and configure SSH:
```bash
## On both nodes, launch the container
docker run --name trtllm --rm -d \
--gpus all --network host --ipc=host \
--ulimit memlock=-1 --ulimit stack=67108864 \
-e UCX_NET_DEVICES=enp1s0f0np0,enp1s0f1np1 \
-e NCCL_SOCKET_IFNAME=enp1s0f0np0,enp1s0f1np1 \
-e OMPI_MCA_btl_tcp_if_include=enp1s0f0np0,enp1s0f1np1 \
-e OMPI_ALLOW_RUN_AS_ROOT=1 \
-e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
-v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
-v ./trtllm-mn-entrypoint.sh:/opt/trtllm-mn-entrypoint.sh \
-v ~/.ssh:/tmp/.ssh:ro \
--entrypoint /opt/trtllm-mn-entrypoint.sh \
nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3
bash ./discover-sparks
```
## Step 6. Build NCCL with Blackwell Support
Expected output will be similar to the following, with different IPs and node names. The first time you run the script, you'll be prompted for your password for each node.
```
Found: 169.254.35.62 (dgx-spark-1.local)
Found: 169.254.35.63 (dgx-spark-2.local)
Execute these commands inside both containers to build NCCL from source with Blackwell
architecture support. Access the container with `docker exec -it trtllm bash`.
Setting up bidirectional SSH access (local <-> remote nodes)...
You may be prompted for your password for each node.
```bash
## Install dependencies and build NCCL
sudo apt-get update && sudo apt-get install -y libopenmpi-dev
git clone -b v2.28.3-1 https://github.com/NVIDIA/nccl.git /opt/nccl/
cd /opt/nccl/
make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"
## Set environment variables
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="/opt/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
SSH setup complete! Both local and remote nodes can now SSH to each other without passwords.
```
## Step 7. Build NCCL Test Suite
Note: If you encounter any errors, please follow Option 2 below to manually configure SSH and debug the issue.
Compile the NCCL test suite to validate communication performance. This runs inside
both containers.
### Option 2: Manually discover and configure SSH
You will need to find the IP addresses for the CX-7 interfaces that are up. On both nodes, run the following commands to find the IP addresses and take note of them for the next step.
```bash
## Clone and build NCCL tests
git clone https://github.com/NVIDIA/nccl-tests.git /opt/nccl-tests/
cd /opt/nccl-tests/
make MPI=1
ip addr show enp1s0f0np0
ip addr show enp1s0f1np1
```
## Step 8. Run NCCL Communication Test
Execute multi-node NCCL performance test using the active network interface. This runs
from one of the containers.
```bash
## Set network interface environment variables (use your active interface from Step 4)
export UCX_NET_DEVICES=enp1s0f0np0
export NCCL_SOCKET_IFNAME=enp1s0f0np0
export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0
## Run the all_gather performance test across both nodes
mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
-x NCCL_DEBUG=VERSION -x NCCL_DEBUG_SUBSYS=TUNING \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
-x NCCL_MERGE_LEVEL=SYS -x NCCL_PROTO="SIMPLE" \
/opt/nccl-tests/build/all_gather_perf -b 32G -e 32G -f 2
Example output:
```
## In this example, we are using interface enp1s0f1np1.
nvidia@dgx-spark-1:~$ ip addr show enp1s0f1np1
4: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 3c:6d:66:cc:b3:b7 brd ff:ff:ff:ff:ff:ff
inet **169.254.35.62**/16 brd 169.254.255.255 scope link noprefixroute enp1s0f1np1
valid_lft forever preferred_lft forever
inet6 fe80::3e6d:66ff:fecc:b3b7/64 scope link
valid_lft forever preferred_lft forever
```
## Step 9. Validate NCCL Installation
Verify successful NCCL compilation and multi-node communication by checking built
components.
In this example, the IP address for Node 1 is **169.254.35.62**. Repeat the process for Node 2.
On both nodes, run the following commands to enable passwordless SSH:
```bash
## Check NCCL library build
ls -la /opt/nccl/build/lib/
## Verify NCCL test binaries
ls -la /opt/nccl-tests/build/
## Check MPI configuration
mpirun --version
## Copy your SSH public key to both nodes. Please replace the IP addresses with the ones you found in the previous step.
ssh-copy-id -i ~/.ssh/id_rsa.pub nvidia@<IP for Node 1>
ssh-copy-id -i ~/.ssh/id_rsa.pub nvidia@<IP for Node 2>
```
Expected output should show NCCL libraries in `/opt/nccl/build/lib/` and test binaries
in `/opt/nccl-tests/build/`.
## Step 5. Verify Multi-Node Communication
## Step 10. Performance Validation
Review the all_gather test output for communication performance metrics from Step 8.
Expected metrics from the test output:
- Bandwidth measurements between nodes
- Latency for different message sizes
- GPU-to-GPU communication confirmation
- No error messages or communication failures
## Step 11. Additional NCCL Tests
Run additional performance validation tests to verify the complete setup.
Test basic multi-node functionality:
```bash
## Example: Run a simple NCCL bandwidth test
/opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2
## Example: Verify GPU topology detection
nvidia-smi topo -m
## Test hostname resolution across nodes
ssh <IP for Node 1> hostname
ssh <IP for Node 2> hostname
```
## Step 13. Cleanup and Rollback
## Step 6. Cleanup and Rollback
> **Warning**: These steps will stop containers and reset network configuration.
> **Warning**: These steps will reset network configuration.
```bash
## Stop containers on both nodes
docker stop trtllm
docker rm trtllm
## Rollback network configuration (if using Option 1)
sudo rm /etc/netplan/40-cx7.yaml
sudo netplan apply
## Rollback network configuration (if using Option 2)
sudo ip addr del 192.168.100.10/24 dev enP2p1s0f1np1 # Node 1
sudo ip addr del 192.168.100.11/24 dev enP2p1s0f1np1 # Node 2
sudo ip link set enP2p1s0f1np1 down
```
## Step 14. Next Steps
Your NCCL environment is ready for multi-node distributed training workloads on DGX Spark
systems with Blackwell GPUs.
```bash
## Test basic multi-node functionality
mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 hostname
## Verify GPU visibility across nodes
mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 nvidia-smi -L
sudo ip addr del 192.168.100.10/24 dev enp1s0f0np0 # Adjust the interface name to the one you used in step 3.
sudo ip addr del 192.168.100.11/24 dev enp1s0f0np0 # Adjust the interface name to the one you used in step 3.
```
## Troubleshooting
@ -314,7 +251,4 @@ mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 nvidia-smi -L
|---------|-------|-----|
| "Network unreachable" errors | Network interfaces not configured | Verify netplan config and `sudo netplan apply` |
| SSH authentication failures | SSH keys not properly distributed | Re-run `./discover-sparks` and enter passwords |
| NCCL build failures with Blackwell | Wrong compute capability specified | Verify `NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"` |
| MPI communication timeouts | Wrong network interfaces specified | Check `ibdev2netdev` and update interface names |
| Container networking issues | Host network mode problems | Ensure `--network host --ipc=host` in docker run |
| Node 2 not visible in cluster | Network connectivity issue | Verify QSFP cable connection, check IP configuration |

View File

@ -16,21 +16,20 @@
- [Step 8. Serve LLM with OpenAI-compatible API](#step-8-serve-llm-with-openai-compatible-api)
- [Step 10. Cleanup and rollback](#step-10-cleanup-and-rollback)
- [Run on two Sparks](#run-on-two-sparks)
- [Step 1. User prerequisites](#step-1-user-prerequisites)
- [Step 1. Configure network connectivity](#step-1-configure-network-connectivity)
- [Step 2. Configure Docker permissions](#step-2-configure-docker-permissions)
- [Step 3. Configure network connectivity](#step-3-configure-network-connectivity)
- [Step 4. Install NVIDIA Container Toolkit & setup Docker environment](#step-4-install-nvidia-container-toolkit-setup-docker-environment)
- [Step 5. Enable resource advertising](#step-5-enable-resource-advertising)
- [Step 6. Initialize Docker Swarm](#step-6-initialize-docker-swarm)
- [Step 7. Join worker nodes and deploy](#step-7-join-worker-nodes-and-deploy)
- [Step 8. Create hosts file](#step-8-create-hosts-file)
- [Step 9. Find your Docker container ID](#step-9-find-your-docker-container-id)
- [Step 10. Generate configuration file](#step-10-generate-configuration-file)
- [Step 11. Download model](#step-11-download-model)
- [Step 12. Serve the model](#step-12-serve-the-model)
- [Step 13. Validate API server](#step-13-validate-api-server)
- [Step 15. Cleanup and rollback](#step-15-cleanup-and-rollback)
- [Step 16. Next steps](#step-16-next-steps)
- [Step 3. Install NVIDIA Container Toolkit & setup Docker environment](#step-3-install-nvidia-container-toolkit-setup-docker-environment)
- [Step 4. Enable resource advertising](#step-4-enable-resource-advertising)
- [Step 5. Initialize Docker Swarm](#step-5-initialize-docker-swarm)
- [Step 6. Join worker nodes and deploy](#step-6-join-worker-nodes-and-deploy)
- [Step 7. Create hosts file](#step-7-create-hosts-file)
- [Step 8. Find your Docker container ID](#step-8-find-your-docker-container-id)
- [Step 9. Generate configuration file](#step-9-generate-configuration-file)
- [Step 10. Download model](#step-10-download-model)
- [Step 11. Serve the model](#step-11-serve-the-model)
- [Step 12. Validate API server](#step-12-validate-api-server)
- [Step 14. Cleanup and rollback](#step-14-cleanup-and-rollback)
- [Step 15. Next steps](#step-15-next-steps)
- [Troubleshooting](#troubleshooting)
---
@ -408,13 +407,15 @@ docker rmi nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
## Run on two Sparks
### Step 1. User prerequisites
Ensure all your DGX Spark nodes are set up and accessible with the same username. If your DGX Spark nodes are set up with different usernames, you will need to create a shared username for all the nodes.
You can create a common user `nvidia` by running the following command:
### Step 1. Configure network connectivity
```bash
sudo usermod -aG docker nvidia
```
Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.
This includes:
- Physical QSFP cable connection
- Network interface configuration (automatic or manual IP assignment)
- Passwordless SSH setup
- Network connectivity verification
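Before continuing, it's worth a quick sanity check that the link and passwordless SSH are in place; a minimal sketch, assuming the example addresses from that playbook:
```bash
## From Node 1: the direct link responds and SSH requires no password prompt
ping -c 3 192.168.100.11
ssh 192.168.100.11 hostname
```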
### Step 2. Configure Docker permissions
@ -434,94 +435,11 @@ sudo usermod -aG docker nvidia
Note: Replace `nvidia` with the username of the user you want to allow Docker access to.
Note: After running usermod, you must log out and log back in to start a new session with updated group permissions.
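After logging back in, you can confirm Docker runs without `sudo`; for example:
```bash
## Should run without sudo or permission errors
docker run --rm hello-world
```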
### Step 3. Configure network connectivity
You have two options for configuring network connectivity between your DGX Spark nodes:
#### Option 1: Automatic IP assignment (recommended)
Follow these steps on both DGX Spark nodes to configure network interfaces using netplan:
```bash
## Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
version: 2
ethernets:
enp1s0f0np0:
link-local: [ ipv4 ]
enp1s0f1np1:
link-local: [ ipv4 ]
EOF
## Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml
## Apply the configuration
sudo netplan apply
```
#### Option 2: Manual IP assignment (advanced)
First, identify which network ports are available and up:
```bash
## Check network port status
ibdev2netdev
```
Example output:
```
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
```
Use an interface that shows as "(Up)" in your output. In this example, we'll use **enp1s0f0np0**.
On Node 1:
```bash
## Assign static IP and bring up interface
sudo ip addr add 192.168.100.10/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
```
On Node 2:
```bash
## Assign static IP and bring up interface
sudo ip addr add 192.168.100.11/24 dev enp1s0f0np0
sudo ip link set enp1s0f0np0 up
```
#### Set up passwordless SSH authentication
Run the DGX Spark [**discover-sparks.sh**](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/discover-sparks.sh) script on both nodes to automatically configure SSH:
```bash
bash ./discover-sparks.sh
```
Expected output similar to the below, with different IPs and node names. The first time you run the script, you'll be prompted for your password for each node.
```
Found: 192.168.100.10 (spark-1b3b.local)
Found: 192.168.100.11 (spark-1d84.local)
Copying your SSH public key to all discovered nodes using ssh-copy-id.
You may be prompted for your password on each node.
Copying SSH key to 192.168.100.10 ...
Copying SSH key to 192.168.100.11 ...
nvidia@192.168.100.11's password:
SSH key copy process complete. These two sparks can now talk to each other.
```
### Step 4. Install NVIDIA Container Toolkit & setup Docker environment
### Step 3. Install NVIDIA Container Toolkit & setup Docker environment
Ensure the NVIDIA drivers and the NVIDIA Container Toolkit are installed on each node (both manager and workers) that will provide GPU resources. This package enables Docker containers to access the host's GPU hardware. Ensure you complete the [installation steps](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), including the [Docker configuration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker) for NVIDIA Container Toolkit.
### Step 5. Enable resource advertising
### Step 4. Enable resource advertising
First, find your GPU UUID by running:
```bash
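## One way to list each GPU with its UUID:
nvidia-smi -L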
@ -561,7 +479,7 @@ Finally, restart the Docker daemon to apply all changes:
sudo systemctl restart docker
```
### Step 6. Initialize Docker Swarm
### Step 5. Initialize Docker Swarm
On whichever node you want to use as primary, run the following swarm initialization command
```bash
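## For example (advertise-addr is this node's cluster IP from the network setup):
docker swarm init --advertise-addr 192.168.100.10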
@ -579,7 +497,7 @@ To add a worker to this swarm, run the following command:
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
```
### Step 7. Join worker nodes and deploy
### Step 6. Join worker nodes and deploy
Now we can proceed with setting up other nodes of your cluster.
@ -609,7 +527,7 @@ oe9k5o6w41le trtllm-multinode_trtllm.1 nvcr.io/nvidia/tensorrt-llm/relea
phszqzk97p83 trtllm-multinode_trtllm.2 nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3 spark-1b3b Running Running 2 minutes ago
```
### Step 8. Create hosts file
### Step 7. Create hosts file
You can check the available nodes using `docker node ls`
```
@ -625,14 +543,14 @@ docker node ls --format '{{.ID}}' | xargs -n1 docker node inspect --format '{{ .
docker cp ~/openmpi-hostfile $(docker ps -q -f name=trtllm-multinode):/etc/openmpi-hostfile
```
### Step 9. Find your Docker container ID
### Step 8. Find your Docker container ID
You can use `docker ps` to find your Docker container ID. Alternatively, you can save the container ID in a variable:
```bash
export TRTLLM_MN_CONTAINER=$(docker ps -q -f name=trtllm-multinode)
```
### Step 10. Generate configuration file
### Step 9. Generate configuration file
```bash
docker exec $TRTLLM_MN_CONTAINER bash -c 'cat <<EOF > /tmp/extra-llm-api-config.yml
@ -645,7 +563,7 @@ cuda_graph_config:
EOF'
```
### Step 11. Download model
### Step 10. Download model
```bash
## Need to specify huggingface token for model download.
@ -657,7 +575,7 @@ docker exec \
-it $TRTLLM_MN_CONTAINER bash -c 'mpirun -x HF_TOKEN bash -c "huggingface-cli download $MODEL"'
```
### Step 12. Serve the model
### Step 11. Serve the model
```bash
docker exec \
@ -677,7 +595,7 @@ This will start the TensorRT-LLM server on port 8000. You can then make inferenc
**Expected output:** Server startup logs and ready message.
### Step 13. Validate API server
### Step 12. Validate API server
Verify successful deployment by checking container status and testing the API endpoint.
@ -703,7 +621,7 @@ curl -X POST http://localhost:8000/v1/chat/completions \
**Expected output:** JSON response with generated text completion.
### Step 15. Cleanup and rollback
### Step 14. Cleanup and rollback
Stop and remove containers by using the following command on the leader node:
@ -719,7 +637,7 @@ Remove downloaded models to free disk space:
rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3*
```
### Step 16. Next steps
### Step 15. Next steps
Compare performance metrics between speculative decoding and baseline reports to quantify speed improvements. Use the multi-node setup as a foundation for deploying other large models requiring tensor parallelism, or scale to additional nodes for higher throughput workloads.
@ -729,7 +647,6 @@ Compare performance metrics between speculative decoding and baseline reports to
| Symptom | Cause | Fix |
|---------|-------|-----|
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
| OOM during weight loading (e.g., [Nemotron Super 49B](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5)) | Parallel weight-loading memory pressure | `export TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=1` |
| "CUDA out of memory" | GPU VRAM insufficient for model | Reduce `free_gpu_memory_fraction: 0.9` or batch size or use smaller model |
| "Model not found" error | HF_TOKEN invalid or model inaccessible | Verify token and model permissions |
@ -742,12 +659,11 @@ Compare performance metrics between speculative decoding and baseline reports to
|---------|-------|-----|
| MPI hostname test returns single hostname | Network connectivity issues | Verify both nodes are on reachable IP addresses |
| "Permission denied" on HuggingFace download | Invalid or missing HF_TOKEN | Set valid token: `export HF_TOKEN=<TOKEN>` |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
| "CUDA out of memory" errors | Insufficient GPU memory | Reduce `--max_batch_size` or `--max_num_tokens` |
| Container exits immediately | Missing entrypoint script | Ensure `trtllm-mn-entrypoint.sh` download succeeded and has executable permissions; also ensure the container is not already running on your node. If port 2233 is already in use, the entrypoint script will not start. |
> **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

View File

@ -5,7 +5,7 @@ WORKDIR /app
# Install Flask and other required packages
RUN pip install --no-cache-dir \
flask==2.0.1 \
gunicorn==20.1.0 \
gunicorn==23.0.0 \
tqdm
# Create model directory

View File

@ -1,6 +1,6 @@
sentence-transformers==2.3.1
transformers==4.36.2
transformers==4.46.3
torch==2.1.2
flask==2.3.3
gunicorn==21.2.0
gunicorn==23.0.0
numpy==1.26.2

View File

@ -52,7 +52,7 @@
"langchain": "^0.3.19",
"lucide-react": "^0.454.0",
"neo4j-driver": "^5.28.1",
"next": "15.1.0",
"next": "15.2.4",
"next-themes": "^0.4.4",
"openai": "^4.91.0",
"react": "^19",

View File

@ -7,7 +7,7 @@
- [Overview](#overview)
- [Instructions](#instructions)
- [Run on two Sparks](#run-on-two-sparks)
- [Step 14. (Optional) Launch 405B inference server](#step-14-optional-launch-405b-inference-server)
- [Step 11. (Optional) Launch 405B inference server](#step-11-optional-launch-405b-inference-server)
- [Troubleshooting](#troubleshooting)
---
@ -16,22 +16,22 @@
## Basic idea
vLLM is an inference engine designed to run large language models efficiently. The key idea is **maximizing throughput and minimizing memory waste** when serving LLMs.
- It uses a memory-efficient attention algorithm called **PagedAttention** to handle long sequences without running out of GPU memory.
- New requests can be added to a batch already in process through **continuous batching** to keep GPUs fully utilized.
- It has an **OpenAI-compatible API** so applications built for the OpenAI API can switch to a vLLM backend with little or no modification.
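Because the API is OpenAI-compatible, an existing OpenAI-style request can simply point at a vLLM endpoint; a minimal sketch (port and model name taken from the serving example later in this playbook):
```bash
## An OpenAI-style completions request served by vLLM instead of the OpenAI API
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "prompt": "Write a haiku about GPUs",
    "max_tokens": 64
  }'
```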
## What you'll accomplish
You'll set up vLLM high-throughput LLM serving on DGX Spark with Blackwell architecture,
either using a pre-built Docker container or building from source with custom LLVM/Triton
support for ARM64.
## What to know before starting
- Experience building and configuring containers with Docker
- Familiarity with CUDA toolkit installation and version management
- Understanding of Python virtual environments and package management
- Knowledge of building software from source using CMake and Ninja
- Experience with Git version control and patch management
@ -39,7 +39,7 @@ support for ARM64.
## Prerequisites
- DGX Spark device with ARM64 processor and Blackwell GPU architecture
- CUDA 12.9 or CUDA 13.0 toolkit installed: `nvcc --version` shows CUDA toolkit version.
- CUDA 13.0 toolkit installed: `nvcc --version` shows CUDA toolkit version.
- Docker installed and configured: `docker --version` succeeds
- NVIDIA Container Toolkit installed
- Python 3.12 available: `python3.12 --version` succeeds
@ -55,7 +55,7 @@ support for ARM64.
## Instructions
## Step 1. Pull vLLM container image
Find the latest container build from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.09-py3
```
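## For example, pull the current NGC build referenced later in this playbook:
docker pull nvcr.io/nvidia/vllm:25.09-py3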
@ -119,58 +119,23 @@ sudo /usr/local/cuda-12.9/bin/cuda-uninstaller
## Step 5. Next steps
- **Production deployment:** Configure vLLM with your specific model requirements
- **Performance tuning:** Adjust batch sizes and memory settings for your workload
- **Monitoring:** Set up logging and metrics collection for production use
- **Model management:** Explore additional model formats and quantization options
## Run on two Sparks
## Step 1. Verify hardware connectivity
## Step 1. Configure network connectivity
Connect the QSFP cable between both DGX Spark systems using the rightmost QSFP interface on each device. This step establishes the 200GbE direct connection required for high-speed inter-node communication.
Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.
```bash
## Check QSFP interface availability on both nodes
ip link show | grep enP2p1s0f1np1
```
This includes:
- Physical QSFP cable connection
- Network interface configuration (automatic or manual IP assignment)
- Passwordless SSH setup
- Network connectivity verification
Expected output shows the interface exists but may be down initially.
## Step 2. Configure cluster network on Node 1
Set up the static IP address for the cluster network interface on the first DGX Spark system. This creates a dedicated network segment for distributed inference communication.
```bash
## Configure static IP on Node 1
sudo ip addr add 192.168.100.10/24 dev enP2p1s0f1np1
sudo ip link set enP2p1s0f1np1 up
```
## Step 3. Configure cluster network on Node 2
Configure the second node with a corresponding static IP in the same network segment.
```bash
## Configure static IP on Node 2
sudo ip addr add 192.168.100.11/24 dev enP2p1s0f1np1
sudo ip link set enP2p1s0f1np1 up
```
## Step 4. Verify network connectivity
Test the direct connection between both nodes to ensure the cluster network is functional.
```bash
## From Node 1, test connectivity to Node 2
ping -c 3 192.168.100.11
## From Node 2, test connectivity to Node 1
ping -c 3 192.168.100.10
```
Expected output shows successful ping responses with low latency.
## Step 5. Download cluster deployment script
## Step 2. Download cluster deployment script
Obtain the vLLM cluster deployment script on both nodes. This script orchestrates the Ray cluster setup required for distributed inference.
@ -180,7 +145,7 @@ wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/example
chmod +x run_cluster.sh
```
## Step 6. Pull the NVIDIA vLLM Image from NGC
## Step 3. Pull the NVIDIA vLLM Image from NGC
First, you will need to configure Docker to pull from NGC.
If this is your first time using Docker, run:
@ -190,21 +155,16 @@ sudo usermod -aG docker $USER
newgrp docker
```
After this, you should be able to run docker commands without using `sudo`.
Next, create an NGC API Key [here](https://ngc.nvidia.com/setup/api-key) so that you can pull containers from NGC.
Once you have the API key, you can configure Docker to pull from NGC and pull down the vLLM image:
```bash
docker login nvcr.io
## Username will be `$oauthtoken` and the password is your NGC API Key
docker pull nvcr.io/nvidia/vllm:25.09-py3
export VLLM_IMAGE=nvcr.io/nvidia/vllm:25.09-py3
```
## Step 7. Start Ray head node
## Step 4. Start Ray head node
Launch the Ray cluster head node on Node 1. This node coordinates the distributed inference and serves the API endpoint.
@ -223,7 +183,7 @@ bash run_cluster.sh $VLLM_IMAGE 192.168.100.10 --head ~/.cache/huggingface \
```
## Step 8. Start Ray worker node
## Step 5. Start Ray worker node
Connect Node 2 to the Ray cluster as a worker node. This provides additional GPU resources for tensor parallelism.
@ -241,7 +201,7 @@ bash run_cluster.sh $VLLM_IMAGE 192.168.100.10 --worker ~/.cache/huggingface \
-e MASTER_ADDR=192.168.100.10
```
## Step 9. Verify cluster status
## Step 6. Verify cluster status
Confirm both nodes are recognized and available in the Ray cluster.
@ -252,7 +212,7 @@ docker exec node ray status
Expected output shows 2 nodes with available GPU resources.
## Step 10. Download Llama 3.3 70B model
## Step 7. Download Llama 3.3 70B model
Authenticate with Hugging Face and download the recommended production-ready model.
@ -262,7 +222,7 @@ huggingface-cli login
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct
```
## Step 11. Launch inference server for Llama 3.3 70B
## Step 8. Launch inference server for Llama 3.3 70B
Start the vLLM inference server with tensor parallelism across both nodes.
@ -273,7 +233,7 @@ vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2 --max_model_len 2048
```
## Step 12. Test 70B model inference
## Step 9. Test 70B model inference
Verify the deployment with a sample inference request.
@ -291,7 +251,7 @@ curl http://localhost:8000/v1/completions \
Expected output includes a generated haiku response.
## Step 13. (Optional) Deploy Llama 3.1 405B model
## Step 10. (Optional) Deploy Llama 3.1 405B model
> **Warning:** 405B model has insufficient memory headroom for production use.
@ -302,7 +262,7 @@ Download the quantized 405B model for testing purposes only.
huggingface-cli download hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4
```
### Step 14. (Optional) Launch 405B inference server
### Step 11. (Optional) Launch 405B inference server
Start the server with memory-constrained parameters for the large model.
@ -314,7 +274,7 @@ vllm serve hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
--max-num-seqs 1 --max_num_batched_tokens 256
```
## Step 15. (Optional) Test 405B model inference
## Step 12. (Optional) Test 405B model inference
Verify the 405B deployment with constrained parameters.
@ -329,7 +289,7 @@ curl http://localhost:8000/v1/completions \
}'
```
## Step 16. Validate deployment
## Step 13. Validate deployment
Perform comprehensive validation of the distributed inference system.
@ -345,7 +305,7 @@ nvidia-smi
docker exec node nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```
## Step 18. Cleanup and rollback
## Step 14. Cleanup and rollback
Remove temporary configurations and containers when testing is complete.
@ -362,7 +322,7 @@ sudo ip addr del 192.168.100.11/24 dev enP2p1s0f1np1 # Node 2
sudo ip link set enP2p1s0f1np1 down
```
## Step 19. Next steps
## Step 15. Next steps
Access the Ray dashboard for cluster monitoring and explore additional features:
@ -372,7 +332,7 @@ http://192.168.100.10:8265
## Consider implementing for production:
## - Health checks and automatic restarts
## - Log rotation for long-running services
## - Persistent model caching across restarts
## - Alternative quantization methods (FP8, INT4)
```
@ -382,14 +342,13 @@ http://192.168.100.10:8265
| Symptom | Cause | Fix |
|---------|--------|-----|
| Node 2 not visible in Ray cluster | Network connectivity issue | Verify QSFP cable connection, check IP configuration |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
| Model download fails | Authentication or network issue | Re-run `huggingface-cli login`, check internet access |
| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your HuggingFace token; and request access to the gated model on your web browser |
| CUDA out of memory with 405B | Insufficient GPU memory | Use 70B model or reduce max_model_len parameter |
| Container startup fails | Missing ARM64 image | Rebuild vLLM image following ARM64 instructions |
> **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'