Mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git (synced 2026-04-24 10:53:52 +00:00)

Commit e1bed13f13 (parent 819ce6334c): chore: Regenerate all playbooks
@@ -23,7 +23,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
 - [Comfy UI](nvidia/comfy-ui/)
 - [Set Up Local Network Access](nvidia/connect-to-your-spark/)
-- [CUDA-X Data Science](nvidia/cuda-x-data-science/)
+- [CUDA-X](nvidia/cuda-x-data-science/)
 - [DGX Dashboard](nvidia/dgx-dashboard/)
 - [FLUX.1 Dreambooth LoRA Fine-tuning](nvidia/flux-finetuning/)
 - [Optimized JAX](nvidia/jax/)
@@ -1,6 +1,6 @@
-# CUDA-X Data Science
+# CUDA-X

-> Install and use NVIDIA cuML and NVIDIA cuDF to accelerate UMAP, HDBSCAN, pandas and more with zero code changes.
+> Accelerated data science with NVIDIA RAPIDS

 ## Table of Contents

@@ -12,25 +12,18 @@
 ## Overview

 ## Basic Idea

-This playbook includes two example notebooks that demonstrate the acceleration of key machine learning algorithms and core pandas operations using CUDA-X Data Science libraries:
-
-- **NVIDIA cuDF:** Accelerates operations for data preparation and core data processing of 8GB of strings data, with no code changes.
-- **NVIDIA cuML:** Accelerates popular, compute intensive machine learning algorithms in sci-kit learn (LinearSVC), UMAP, and HDBSCAN, with no code changes.
-
-CUDA-X Data Science (formally RAPIDS) is an open-source library collection that accelerates the data science and data processing ecosystem. These libraries accelerate popular Python tools like scikit-learn and pandas with zero code changes. On DGX Spark, these libraries maximize performance at your desk with your existing code.
+CUDA-X Data Science (formerly RAPIDS) is an open-source library collection that accelerates the data science and data processing ecosystem. Accelerate popular Python tools like scikit-learn and pandas with zero code changes on DGX Spark to maximize performance at your desk. This playbook orients you with example workflows, demonstrating the acceleration of key machine learning algorithms like UMAP and HDBSCAN and core pandas operations, without changing your code.

-## What you'll accomplish
+## What to know before starting

-You will accelerate popular machine learning algorithms and data analytics operations GPU. You will understand how to accelerate popular Python tools, and the value of running data science workflows on your DGX Spark.
+- Familiarity with pandas, scikit-learn, and machine learning algorithms such as support vector machines, clustering, and dimensionality reduction

 ## Prerequisites

-- Familiarity with pandas, scikit-learn, machine learning algorithms, such as support vector machine, clustering, and dimensionality reduction algorithms.
 - Install conda
 - Generate a Kaggle API key

 ## Time & risk

-- Duration:
-  - 20-30 minutes setup time.
-  - 2-3 minutes to run each notebook.
+**Duration:** 20-30 minutes setup time and 2-3 minutes to run each notebook.

 ## Instructions
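For context on the "zero code changes" claim in this hunk: cuDF ships a `cudf.pandas` accelerator that is loaded before any pandas code runs. A minimal sketch of that drop-in pattern, assuming cuDF may not be installed in the current environment (it then falls back to plain CPU pandas behavior):

```python
# Sketch of the zero-code-change accelerator pattern the notebooks rely on.
# cudf.pandas.install() patches pandas so existing code runs on the GPU;
# when cuDF is absent (a machine without RAPIDS), nothing changes.
try:
    import cudf.pandas
    cudf.pandas.install()
    backend = "cudf.pandas (GPU)"
except ImportError:
    backend = "stock pandas (CPU)"

print("Active backend:", backend)
```

Existing pandas scripts need no edits either way; only the environment decides which backend executes.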
@@ -40,34 +33,32 @@ You will accelerate popular machine learning algorithms and data analytics operations GPU.
 - Install conda using [these instructions](https://docs.anaconda.com/miniconda/install/)
 - Create Kaggle API key using [these instructions](https://www.kaggle.com/discussions/general/74235) and place the **kaggle.json** file in the same folder as the notebook

-## Step 2. Installing Data Science libraries
+## Step 2. Installing CUDA-X libraries

 - Use the following command to install the CUDA-X libraries (this will create a new conda environment)

 ```bash
 conda create -n rapids-test -c rapidsai-nightly -c conda-forge -c nvidia \
     rapids=25.10 python=3.12 'cuda-version=13.0' \
     jupyterlab hdbscan umap-learn
 ```

 ## Step 3. Activate the conda environment

 - Activate the conda environment

 ```bash
 conda activate rapids-test
 ```

-## Step 4. Cloning the playbook repository
+## Step 4. Cloning the notebooks

-- Clone the github repository and go the assets folder place in cuda-x-data-science folder
+- Clone the GitHub repository and go to the cuda-x-data-science/assets folder

 ```bash
-git clone https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets
+git clone ssh://git@******:12051/spark-playbooks/dgx-spark-playbook-assets.git
 ```

 - Place the **kaggle.json** created in Step 1 in the assets folder

 ## Step 5. Run the notebooks

-There are two notebooks in the GitHub repository.
+- Both notebooks are self-explanatory.

-One runs an example of a large strings data processing workflow with pandas code on GPU.
-
-- Run the cudf_pandas_demo.ipynb notebook
+- To experience the acceleration achieved using cudf.pandas, run the cudf_pandas_demo.ipynb notebook:

 ```bash
 jupyter notebook cudf_pandas_demo.ipynb
 ```

-The other goes over an example of machine learning algorithms including UMAP and HDBSCAN.
-
-- Run the cuml_sklearn_demo.ipynb notebook
+- To experience the acceleration achieved using cuML, run the cuml_sklearn_demo.ipynb notebook:

 ```bash
 jupyter notebook cuml_sklearn_demo.ipynb
 ```
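The Kaggle API key referenced in Steps 1 and 4 is a small JSON credential file. A quick sanity check for it before launching the notebooks (the inline `sample` string is a made-up placeholder, not a real credential):

```python
import json

# kaggle.json holds Kaggle API credentials as {"username": ..., "key": ...}.
sample = '{"username": "your-kaggle-user", "key": "0123456789abcdef"}'

creds = json.loads(sample)
missing = [field for field in ("username", "key") if field not in creds]
if missing:
    raise ValueError(f"kaggle.json is missing fields: {missing}")
print("kaggle.json looks usable for user:", creds["username"])
```

On a real setup you would read the file from the assets folder instead of the inline string, e.g. `json.load(open("kaggle.json"))`.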
@@ -171,10 +171,6 @@ Unlike the base model, we can see that the fine-tuned model can generate multiple
 ## Troubleshooting

-| Symptom | Cause | Fix |
-|---------|--------|-----|
-| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |

 > **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
 > With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
 > the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
@@ -22,7 +22,7 @@ COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
 RUN mkdir /app
 WORKDIR /app

-RUN uv init && uv venv && uv pip install marimo && uv pip install "jax[cuda13]==0.7.2" && uv pip install "numpy==2.3.3" && uv pip install "plotly==6.3.0" && uv pip install "opencv-python-headless==4.12.0.88" && uv pip install "tqdm==4.67.1"
+RUN uv init --python 3.12 && uv venv && uv pip install "marimo==0.16.5" && uv pip install "jax[cuda13]==0.7.2" && uv pip install "numpy==2.3.3" && uv pip install "plotly==6.3.0" && uv pip install "opencv-python-headless==4.12.0.88" && uv pip install "tqdm==4.67.1"

 COPY *.py *.mp4 /app
@@ -202,7 +202,6 @@ docker container prune -f
 | Symptom | Cause | Fix |
 |---------|--------|-----|
 | CUDA out of memory during training | Batch size too large for GPU VRAM | Reduce `per_device_train_batch_size` or increase `gradient_accumulation_steps` |
-| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
 | Model download fails or is slow | Network connectivity or Hugging Face Hub issues | Check internet connection, try using `HF_HUB_OFFLINE=1` for cached models |
 | Training loss not decreasing | Learning rate too high/low or insufficient data | Adjust `learning_rate` parameter or check dataset quality |
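The `HF_HUB_OFFLINE=1` fix in the table above works because `huggingface_hub` reads that environment variable and then resolves models from the local cache instead of the network. A sketch of setting it from Python, which must happen before the hub library (or transformers) is imported since the flag is typically read at import time:

```python
import os

# "1" forces huggingface_hub into cache-only (offline) resolution.
# Set it before importing huggingface_hub / transformers in your script.
os.environ["HF_HUB_OFFLINE"] = "1"
print("HF_HUB_OFFLINE =", os.environ["HF_HUB_OFFLINE"])
```

The shell equivalent, as in the table, is `HF_HUB_OFFLINE=1 python train.py`.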
@@ -142,7 +142,7 @@ docker volume rm "$(basename "$PWD")_postgres_data"

 | Symptom | Cause | Fix |
 |---------|--------|-----|
-| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
+| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your HuggingFace token; and request access to the gated model on your web browser |

 > **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
 > With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
@@ -9,7 +9,7 @@
     "lint": "next lint"
   },
   "dependencies": {
-    "next": "15.1.7",
+    "next": "15.2.4",
     "react": "^19.0.0",
     "react-dom": "^19.0.0",
     "react-markdown": "^10.1.0",
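As an aside on this dependency bump: `15.1.7` → `15.2.4` is a within-major upgrade, which you can only verify by comparing dotted version strings numerically, component by component (lexical string comparison happens to work here but fails in general, e.g. it orders `"15.10.0"` before `"15.9.0"`):

```python
# Compare dotted version strings numerically, component by component.
def parse(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))

old, new = parse("15.1.7"), parse("15.2.4")
print(new > old)         # True: it is an upgrade
print(new[0] == old[0])  # True: same major version (15)
```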
@@ -22,7 +22,7 @@
     "@types/react": "^19",
     "@types/react-dom": "^19",
     "eslint": "^9",
-    "eslint-config-next": "15.1.7",
+    "eslint-config-next": "15.2.4",
     "postcss": "^8",
     "tailwindcss": "^3.4.1",
     "typescript": "^5"
@@ -213,7 +213,6 @@ environment.
 |---------|-------|-----|
 | "CUDA out of memory" error | Insufficient VRAM for model | Use FP8/FP4 quantization or smaller model |
 | "Invalid HF token" error | Missing or expired HuggingFace token | Set valid token: `export HF_TOKEN=<YOUR_TOKEN>` |
-| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
 | Model download timeouts | Network issues or rate limiting | Retry command or pre-download models |

 > **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
@@ -58,98 +58,17 @@ containers can be stopped with `docker stop`
 ## Run on two Sparks

-## Step 1. Setup networking between nodes
+## Step 1. Configure network connectivity

-Configure network interfaces for high-performance inter-node communication. Choose one option
-based on your network requirements.
+Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.

-**Option 1: Suggested - Netplan configuration**
+This includes:
+
+- Physical QSFP cable connection
+- Network interface configuration (automatic or manual IP assignment)
+- Passwordless SSH setup
+- Network connectivity verification

-Configure network interfaces using netplan on both DGX Spark nodes for automatic link-local
-addressing:
+## Step 2. Launch TensorRT-LLM containers on both nodes

-```bash
-## On both nodes, create the netplan configuration file
-sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
-network:
-  version: 2
-  ethernets:
-    enp1s0f0np0:
-      link-local: [ ipv4 ]
-    enp1s0f1np1:
-      link-local: [ ipv4 ]
-EOF
-
-## On both nodes, set appropriate permissions
-sudo chmod 600 /etc/netplan/40-cx7.yaml
-
-## On both nodes, apply the netplan configuration
-sudo netplan apply
-```
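For reference on the netplan option removed above: `link-local: [ ipv4 ]` assigns addresses from the IPv4 link-local range `169.254.0.0/16` (RFC 3927), while the manual option uses static `192.168.100.x` addresses. A quick stdlib check for which mechanism produced an address reported by `ip addr` (the first address below is an illustrative example, not from the playbook):

```python
import ipaddress

# IPv4 link-local auto-assignment hands out addresses in 169.254.0.0/16.
LINK_LOCAL = ipaddress.ip_network("169.254.0.0/16")

def is_link_local(addr: str) -> bool:
    return ipaddress.ip_address(addr) in LINK_LOCAL

print(is_link_local("169.254.17.42"))   # True: auto-assigned style address
print(is_link_local("192.168.100.10"))  # False: statically assigned
```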
-
-**Option 2: Manual IP assignment (advanced)**
-
-Configure dedicated cluster networking with static IP addresses:
-
-```bash
-## On Node 1
-sudo ip addr add 192.168.100.10/24 dev enP2p1s0f1np1
-sudo ip link set enP2p1s0f1np1 up
-
-## On Node 2
-sudo ip addr add 192.168.100.11/24 dev enP2p1s0f1np1
-sudo ip link set enP2p1s0f1np1 up
-
-## Verify connectivity from Node 1
-ping -c 3 192.168.100.11
-
-## Verify connectivity from Node 2
-ping -c 3 192.168.100.10
-```
-
-## Step 2. Run the DGX Spark discovery script
-
-Automatically identify interconnected DGX Spark systems and configure SSH passwordless
-authentication for multi-node operations:
-
-```bash
-## On either node, run the discovery script
-./discover-sparks
-```
-
-Expected output:
-```
-Found: 192.168.100.10 (spark-1b3b.local)
-Found: 192.168.100.11 (spark-1d84.local)
-
-Copying your SSH public key to all discovered nodes using ssh-copy-id.
-You may be prompted for your password on each node.
-Copying SSH key to 192.168.100.10 ...
-Copying SSH key to 192.168.100.11 ...
-nvidia@192.168.100.11's password:
-
-SSH key copy process complete. These two sparks can now talk to each other.
-```
-
-## Step 3. Identify active network interfaces
-
-Check which ConnectX-7 network interfaces are active and available for NCCL communication:
-
-```bash
-ibdev2netdev
-```
-
-Expected output (showing "Up" for active interfaces):
-```
-rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
-rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
-roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
-roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
-```
-
-Note the active interface names (marked "Up") for use in container configuration.
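The container launch still needs the interface names that `ibdev2netdev` marks as `(Up)`. A small parser for that output format (the `sample` text is copied from the expected output in the removed step; on a live system you would feed it the actual command output):

```python
# Extract netdev names for ports that ibdev2netdev reports as "(Up)".
sample = """\
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
"""

# Each line looks like: <ibdev> port <n> ==> <netdev> (Up|Down)
active = [line.split()[4] for line in sample.splitlines() if line.endswith("(Up)")]
print(active)  # ['enp1s0f0np0', 'enP2p1s0f0np0']
```

These are the interface names you would pass to NCCL (e.g. via `NCCL_SOCKET_IFNAME`) in the container configuration.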
-## Step 4. Launch TensorRT-LLM containers on both nodes
-
 Start containers with appropriate network and GPU configuration for NCCL communication:
@@ -170,7 +89,7 @@ docker run --name trtllm --rm -d \
 nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3
 ```

-## Step 5. Build NCCL with Blackwell support
+## Step 3. Build NCCL with Blackwell support

 Execute these commands inside both containers to build NCCL from source with Blackwell
 architecture support:
@@ -188,7 +107,7 @@ export NCCL_HOME="/opt/nccl/build/"
 export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
 ```

-## Step 6. Build NCCL test suite
+## Step 4. Build NCCL test suite

 Compile the NCCL test suite to validate communication performance:
@@ -199,7 +118,7 @@ cd /opt/nccl-tests/
 make MPI=1
 ```

-## Step 7. Run NCCL communication test
+## Step 5. Run NCCL communication test

 Execute multi-node NCCL performance test using the active network interface:
@@ -217,7 +136,7 @@ mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
 /opt/nccl-tests/build/all_gather_perf -b 32G -e 32G -f 2
 ```
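When reading the `all_gather_perf` results, nccl-tests reports both algorithm bandwidth (algbw) and bus bandwidth (busbw); for all_gather, busbw = algbw × (n−1)/n per the nccl-tests performance notes. A back-of-envelope conversion for the two-rank 32G run above (the elapsed time below is illustrative, not a measured result):

```python
# Convert one nccl-tests all_gather data point into algbw/busbw (GB/s).
n_ranks = 2                # -np 2: one rank per Spark
size_bytes = 32 * 2**30    # -b 32G -e 32G
elapsed_s = 0.35           # hypothetical per-iteration time, for illustration

algbw = size_bytes / elapsed_s / 1e9     # algorithm bandwidth, GB/s
busbw = algbw * (n_ranks - 1) / n_ranks  # all_gather correction factor
print(f"algbw = {algbw:.1f} GB/s, busbw = {busbw:.1f} GB/s")
```

With two ranks the correction factor is 1/2, so busbw should come out at roughly half of algbw.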
-## Step 8. Validate NCCL installation
+## Step 6. Validate NCCL installation

 Verify successful NCCL compilation and multi-node communication:
@@ -235,7 +154,7 @@ mpirun --version
 Expected output should show NCCL libraries in `/opt/nccl/build/lib/` and test binaries in
 `/opt/nccl-tests/build/`.

-## Step 10. Cleanup and rollback
+## Step 7. Cleanup and rollback

 **Warning**: These steps will stop containers and reset network configuration.
@@ -251,7 +170,7 @@ sudo rm /etc/netplan/40-cx7.yaml
 sudo netplan apply
 ```

-## Step 11. Next steps
+## Step 8. Next steps

 Test your NCCL setup with a simple distributed training example:
@@ -319,7 +319,6 @@ Explore the [NeMo AutoModel GitHub repository](https://github.com/NVIDIA-NeMo/Au
 | GPU not detected in training | CUDA driver/runtime mismatch | Verify driver compatibility: `nvidia-smi` and reinstall CUDA if needed |
 | Out of memory during training | Model too large for available GPU memory | Reduce batch size, enable gradient checkpointing, or use model parallelism |
 | ARM64 package compatibility issues | Package not available for ARM architecture | Use source installation or build from source with ARM64 flags |
-| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |

 > **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
 > With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
@@ -256,7 +256,6 @@ The quantized model is now ready for deployment. Common next steps include:
 | Model files not found in output directory | Volume mount failed or wrong path | Verify `$(pwd)/output_models` resolves correctly |
 | Git clone fails inside container | Network connectivity issues | Check internet connection and retry |
 | Quantization process hangs | Container resource limits | Increase Docker memory limits or use `--ulimit` flags |
-| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |

 > **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
 > With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
@@ -117,10 +117,6 @@ python Llama3_3B_full_finetuning.py
 ## Troubleshooting

-| Symptom | Cause | Fix |
-|---------|--------|-----|
-| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |

 > **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
 > With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
 > the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
@@ -163,7 +163,7 @@ docker stop <container_id>
 | "CUDA out of memory" error | Insufficient GPU memory | Reduce `kv_cache_free_gpu_memory_fraction` to 0.9 or use a device with more VRAM |
 | Container fails to start | Docker GPU support issues | Verify `nvidia-docker` is installed and `--gpus=all` flag is supported |
 | Model download fails | Network or authentication issues | Check HuggingFace authentication and network connectivity |
-| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
+| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your HuggingFace token; and request access to the gated model on your web browser |
 | Server doesn't respond | Port conflicts or firewall | Check if port 8000 is available and not blocked |

 > **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
@@ -6,6 +6,10 @@
 - [Overview](#overview)
 - [Run on two Sparks](#run-on-two-sparks)
+  - [Option 1: Automatic IP Assignment (Recommended)](#option-1-automatic-ip-assignment-recommended)
+  - [Option 2: Manual IP Assignment (Advanced)](#option-2-manual-ip-assignment-advanced)
+  - [Option 1: Automatically configure SSH](#option-1-automatically-configure-ssh)
+  - [Option 2: Manually discover and configure SSH](#option-2-manually-discover-and-configure-ssh)
 - [Troubleshooting](#troubleshooting)

 ---
@@ -15,76 +19,98 @@
 ## Basic idea

 Configure two DGX Spark systems for high-speed inter-node communication using 200GbE direct
-QSFP connections and NCCL multi-node communication. This setup enables distributed training
-and inference workloads across multiple Blackwell GPUs by establishing network connectivity,
-configuring SSH authentication, and validating communication with NCCL performance tests.
+QSFP connections. This setup enables distributed workloads across multiple DGX Spark nodes
+by establishing network connectivity and configuring SSH authentication.

 ## What you'll accomplish

 You will physically connect two DGX Spark devices with a QSFP cable, configure network
-interfaces for cluster communication, establish passwordless SSH between nodes, and validate
-the setup with NCCL multi-node tests to create a functional distributed computing environment.
+interfaces for cluster communication, and establish passwordless SSH between nodes to create
+a functional distributed computing environment.

 ## What to know before starting

-- Working with network interface configuration and netplan
-- Using Docker containers with GPU and network access
 - Basic understanding of distributed computing concepts
+- Working with network interface configuration and netplan
 - Experience with SSH key management
-- Familiarity with NVIDIA GPU architectures and CUDA environments
 ## Prerequisites

-- Two DGX Spark systems with NVIDIA Blackwell GPUs available
-- QSFP cable for direct 200GbE connection between devices
-- Docker installed on both systems: `docker --version`
-- CUDA toolkit installed: `nvcc --version` (should show 12.9 or higher)
-- SSH access available on both systems: `ssh-keygen -t rsa` (if keys don't exist)
-- Git available for source code compilation: `git --version`
+- Two DGX Spark systems
+- One QSFP cable for direct 200GbE connection between two devices
+- SSH access available to both systems
 - Root or sudo access on both systems: `sudo whoami`
+- The same username on both systems

 ## Ancillary files

 All required files for this playbook can be found [here on GitHub](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/)

-- `discover-sparks` script for automatic node discovery and SSH key distribution
-- `trtllm-mn-entrypoint.sh` container entrypoint script for multi-node setup
-- Network interface mapping tools (`ibdev2netdev`, `ip link show`)
+- [**discover-sparks.sh**](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/discover-sparks) script for automatic node discovery and SSH key distribution

 ## Time & risk

-**Duration:** 2-3 hours including validation tests
+**Duration:** 1 hour including validation

-**Risk level:** Medium - involves network reconfiguration and container setup
+**Risk level:** Medium - involves network reconfiguration

 **Rollback:** Network changes can be reversed by removing netplan configs or IP assignments
## Run on two Sparks
|
## Run on two Sparks
|
||||||
|
|
||||||
## Step 1. Physical Hardware Connection
|
## Step 1. Ensure Same Username on Both Systems
|
||||||
|
|
||||||
Connect the QSFP cable between both DGX Spark systems using the rightmost QSFP interface
|
On both systems check the username and make sure it's the same:
|
||||||
on each device. This establishes the 200GbE direct connection required for high-speed
|
|
||||||
inter-node communication.
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
## Check QSFP interface availability on both nodes
|
## Check current username
|
||||||
ip link show | grep enP2p1s0f1np1
|
whoami
|
||||||
```
|
```
|
||||||
|
|
||||||
Expected output shows the interface exists but may be down initially.
|
If usernames don't match, create a new user (e.g., nvidia) on both systems and login in with the new user:
|
||||||
|
|
||||||
## Step 2. Network Interface Configuration
|
```bash
|
||||||
|
## Create nvidia user and add to sudo group
|
||||||
|
sudo useradd -m nvidia
|
||||||
|
sudo usermod -aG sudo nvidia
|
||||||
|
|
||||||
Choose one option based on your network requirements.
|
## Set password for nvidia user
|
||||||
|
sudo passwd nvidia
|
||||||
|
|
||||||
**Option 1: Automatic IP Assignment (Recommended)**
|
## Switch to nvidia user
|
||||||
|
su - nvidia
|
||||||
|
```
|
||||||
|
|
||||||
|
## Step 2. Physical Hardware Connection
|
||||||
|
|
||||||
|
Connect the QSFP cable between both DGX Spark systems using any QSFP interface
|
||||||
|
on each device. This establishes the 200GbE direct connection required for high-speed
|
||||||
|
inter-node communication. Upon connection between the two nodes, you will see the an output like the one below: in this example the interface showing as 'Up' is **enp1s0f1np1** / **enP2p1s0f1np1** (each physical port has two names).
|
||||||
|
|
||||||
|
Example output:
|
||||||
|
```bash
|
||||||
|
## Check QSFP interface availability on both nodes
|
||||||
|
nvidia@dxg-spark-1:~$ ibdev2netdev
|
||||||
|
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
|
||||||
|
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
|
||||||
|
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
|
||||||
|
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
|
||||||
|
```
|
||||||
|
|
||||||
|
Note: If none of the interfaces are showing as 'Up', please check the QSFP cable connection, reboot the systems and try again.
|
||||||
|
Note: The interface showing as 'Up' depends on which port you are using to connect the two nodes. Each physical port has two names, for example, enp1s0f1np1 and enP2p1s0f1np1 refer to the same physical port. Please disregard enP2p1s0f0np0 and enP2p1s0f1np1, and use enp1s0f0np0 and enp1s0f1np1 only.
|
||||||
|
|
||||||
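For scripted setups, the interface-selection rule in the notes above can be sketched in Python. This helper is illustrative, not part of the playbook:

```python
# Illustrative helper: pick a usable "Up" port from `ibdev2netdev` output,
# skipping the duplicate enP2p* names as the note above recommends.
def pick_up_interface(output):
    for line in output.splitlines():
        # Lines look like: "rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)"
        if "==>" not in line:
            continue
        iface, _, state = line.partition("==>")[2].strip().partition(" ")
        if state.strip() == "(Up)" and not iface.startswith("enP2p"):
            return iface
    return None

sample = """\
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
"""
print(pick_up_interface(sample))  # enp1s0f1np1
```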
## Step 3. Network Interface Configuration

Choose one option to set up the network interfaces. Options 1 and 2 are mutually exclusive.

### Option 1: Automatic IP Assignment (Recommended)

Configure network interfaces using netplan on both DGX Spark nodes for automatic link-local addressing:

```bash
## Create the netplan configuration file
sudo tee /etc/netplan/40-cx7.yaml > /dev/null <<EOF
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      link-local: [ ipv4 ]
    enp1s0f1np1:
      link-local: [ ipv4 ]
EOF

## Set appropriate permissions
sudo chmod 600 /etc/netplan/40-cx7.yaml

## Apply the configuration
sudo netplan apply
```

Note: With this option, the IPs assigned to the interfaces will change if you reboot the system.
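A quick way to see why reboots change the addresses: link-local assignment draws from 169.254.0.0/16, negotiated each boot, which you can check with Python's standard `ipaddress` module (the sample address is the one that appears later in this playbook):

```python
import ipaddress

# Link-local IPv4 addresses live in 169.254.0.0/16 and are negotiated per
# boot, which is why Option 1 addresses can change after a reboot.
addr = ipaddress.ip_address("169.254.35.62")
print(addr.is_link_local)  # True
```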
### Option 2: Manual IP Assignment (Advanced)

First, identify which network ports are available and up:

```bash
## Check network port status
ibdev2netdev
```

Example output:
```
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
```

Use an interface that shows as "(Up)" in your output. In this example, we'll use **enp1s0f1np1**. You can disregard interfaces starting with the prefix `enP2p<...>` and only use interfaces starting with `enp1<...>` instead.

On Node 1:
```bash
## Assign static IP and bring up interface.
sudo ip addr add 192.168.100.10/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```

Repeat the same process on Node 2, but using IP **192.168.100.11/24**. Be sure to use the correct interface name from the `ibdev2netdev` output.
```bash
## Assign static IP and bring up interface.
sudo ip addr add 192.168.100.11/24 dev enp1s0f1np1
sudo ip link set enp1s0f1np1 up
```

You can verify the IP assignment by running the following command on each node:
```bash
## Replace enp1s0f1np1 with the interface showing as "(Up)" in your output, either enp1s0f0np0 or enp1s0f1np1
ip addr show enp1s0f1np1
```
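Before expecting the nodes to reach each other, you can sanity-check that the two static addresses really share the /24 subnet. A small sketch with Python's stdlib `ipaddress` (the addresses are the examples from this option):

```python
import ipaddress

# Node 1 and Node 2 must sit in the same subnet for a direct ping to work.
net = ipaddress.ip_interface("192.168.100.10/24").network
node2 = ipaddress.ip_address("192.168.100.11")
print(net)           # 192.168.100.0/24
print(node2 in net)  # True
```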
## Step 4. Set Up Passwordless SSH Authentication

### Option 1: Automatically configure SSH

Run the DGX Spark [**discover-sparks.sh**](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/discover-sparks) script from one of the nodes to automatically discover and configure SSH:

```bash
bash ./discover-sparks
```

Expected output will be similar to the below, with different IPs and node names. The first time you run the script, you'll be prompted for your password for each node.
```
Found: 169.254.35.62 (dgx-spark-1.local)
Found: 169.254.35.63 (dgx-spark-2.local)

Setting up bidirectional SSH access (local <-> remote nodes)...
You may be prompted for your password for each node.

SSH setup complete! Both local and remote nodes can now SSH to each other without passwords.
```

Note: If you encounter any errors, follow Option 2 below to manually configure SSH and debug the issue.

### Option 2: Manually discover and configure SSH

You will need to find the IP addresses of the CX-7 interfaces that are up. On both nodes, run the following commands and take note of the addresses for the next step.
```bash
ip addr show enp1s0f0np0
ip addr show enp1s0f1np1
```

Example output:
```
## In this example, we are using interface enp1s0f1np1.
nvidia@dgx-spark-1:~$ ip addr show enp1s0f1np1
4: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 3c:6d:66:cc:b3:b7 brd ff:ff:ff:ff:ff:ff
    inet 169.254.35.62/16 brd 169.254.255.255 scope link noprefixroute enp1s0f1np1
       valid_lft forever preferred_lft forever
    inet6 fe80::3e6d:66ff:fecc:b3b7/64 scope link
       valid_lft forever preferred_lft forever
```

In this example, the IP address for Node 1 is **169.254.35.62**. Repeat the process for Node 2.
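When scripting this step, the IPv4 address can be pulled out of the `ip addr show` output with a short, hypothetical parser (the sample text is the output shown above):

```python
import re

sample = """\
4: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 3c:6d:66:cc:b3:b7 brd ff:ff:ff:ff:ff:ff
    inet 169.254.35.62/16 brd 169.254.255.255 scope link noprefixroute enp1s0f1np1
"""

def inet_addr(output):
    # Match the first "inet A.B.C.D/prefix" entry.
    m = re.search(r"\binet (\d+\.\d+\.\d+\.\d+)/\d+", output)
    return m.group(1) if m else None

print(inet_addr(sample))  # 169.254.35.62
```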
On both nodes, run the following commands to enable passwordless SSH:
```bash
## Copy your SSH public key to both nodes. Replace the IP addresses with the ones you found in the previous step.
ssh-copy-id -i ~/.ssh/id_rsa.pub nvidia@<IP for Node 1>
ssh-copy-id -i ~/.ssh/id_rsa.pub nvidia@<IP for Node 2>
```

## Step 5. Verify Multi-Node Communication

Test basic multi-node functionality:

```bash
## Test hostname resolution across nodes
ssh <IP for Node 1> hostname
ssh <IP for Node 2> hostname
```

## Step 6. Cleanup and Rollback

> **Warning**: These steps will reset network configuration.

```bash
## Rollback network configuration (if using Option 1)
sudo rm /etc/netplan/40-cx7.yaml
sudo netplan apply

## Rollback network configuration (if using Option 2)
sudo ip addr del 192.168.100.10/24 dev enp1s0f1np1 # On Node 1; adjust the interface name to the one you used in Step 3.
sudo ip addr del 192.168.100.11/24 dev enp1s0f1np1 # On Node 2; adjust the interface name to the one you used in Step 3.
```
## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| "Network unreachable" errors | Network interfaces not configured | Verify netplan config and `sudo netplan apply` |
| SSH authentication failures | SSH keys not properly distributed | Re-run `./discover-sparks` and enter passwords |
| Node 2 not visible in cluster | Network connectivity issue | Verify QSFP cable connection, check IP configuration |

@@ -16,21 +16,20 @@
- [Step 8. Serve LLM with OpenAI-compatible API](#step-8-serve-llm-with-openai-compatible-api)
- [Step 10. Cleanup and rollback](#step-10-cleanup-and-rollback)
- [Run on two Sparks](#run-on-two-sparks)
- [Step 1. Configure network connectivity](#step-1-configure-network-connectivity)
- [Step 2. Configure Docker permissions](#step-2-configure-docker-permissions)
- [Step 3. Install NVIDIA Container Toolkit & setup Docker environment](#step-3-install-nvidia-container-toolkit-setup-docker-environment)
- [Step 4. Enable resource advertising](#step-4-enable-resource-advertising)
- [Step 5. Initialize Docker Swarm](#step-5-initialize-docker-swarm)
- [Step 6. Join worker nodes and deploy](#step-6-join-worker-nodes-and-deploy)
- [Step 7. Create hosts file](#step-7-create-hosts-file)
- [Step 8. Find your Docker container ID](#step-8-find-your-docker-container-id)
- [Step 9. Generate configuration file](#step-9-generate-configuration-file)
- [Step 10. Download model](#step-10-download-model)
- [Step 11. Serve the model](#step-11-serve-the-model)
- [Step 12. Validate API server](#step-12-validate-api-server)
- [Step 14. Cleanup and rollback](#step-14-cleanup-and-rollback)
- [Step 15. Next steps](#step-15-next-steps)
- [Troubleshooting](#troubleshooting)

---

@@ -408,13 +407,15 @@ docker rmi nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
## Run on two Sparks

### Step 1. Configure network connectivity

Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.

This includes:
- Physical QSFP cable connection
- Network interface configuration (automatic or manual IP assignment)
- Passwordless SSH setup
- Network connectivity verification

### Step 2. Configure Docker permissions

@@ -434,94 +435,11 @@ sudo usermod -aG docker nvidia
Note: Replace `nvidia` with the username of the user you want to allow Docker access to.
Note: After running usermod, you must log out and log back in to start a new session with updated group permissions.

### Step 3. Install NVIDIA Container Toolkit & setup Docker environment

Ensure the NVIDIA drivers and the NVIDIA Container Toolkit are installed on each node (both manager and workers) that will provide GPU resources. This package enables Docker containers to access the host's GPU hardware. Ensure you complete the [installation steps](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), including the [Docker configuration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker) for NVIDIA Container Toolkit.

### Step 4. Enable resource advertising

First, find your GPU UUID by running:
```bash
@@ -561,7 +479,7 @@ Finally, restart the Docker daemon to apply all changes:
sudo systemctl restart docker
```
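For reference, Swarm GPU advertising is commonly configured by adding a `node-generic-resources` entry to `/etc/docker/daemon.json` before the restart above. The exact keys below are an assumption based on NVIDIA's published Swarm guidance, not taken from this playbook, and the UUID is a placeholder:

```python
import json

def daemon_json_with_gpu(gpu_uuid, existing=None):
    # Assumed daemon.json layout for Swarm GPU advertising (verify against
    # NVIDIA's documentation); gpu_uuid comes from nvidia-smi on that node.
    cfg = dict(existing or {})
    cfg.setdefault("runtimes", {})["nvidia"] = {"path": "nvidia-container-runtime"}
    cfg["node-generic-resources"] = [f"NVIDIA-GPU={gpu_uuid}"]
    return json.dumps(cfg, indent=2)

print(daemon_json_with_gpu("GPU-<uuid-from-nvidia-smi>"))
```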
### Step 5. Initialize Docker Swarm

On whichever node you want to use as primary, run the following swarm initialization command
```bash
@@ -579,7 +497,7 @@ To add a worker to this swarm, run the following command:
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
```

### Step 6. Join worker nodes and deploy

Now we can proceed with setting up other nodes of your cluster.

```
@@ -609,7 +527,7 @@ oe9k5o6w41le trtllm-multinode_trtllm.1 nvcr.io/nvidia/tensorrt-llm/relea
phszqzk97p83 trtllm-multinode_trtllm.2 nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3 spark-1b3b Running Running 2 minutes ago
```

### Step 7. Create hosts file

You can check the available nodes using `docker node ls`
```
@@ -625,14 +543,14 @@ docker node ls --format '{{.ID}}' | xargs -n1 docker node inspect --format '{{ .
docker cp ~/openmpi-hostfile $(docker ps -q -f name=trtllm-multinode):/etc/openmpi-hostfile
```
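The hostfile copied into the container holds one `hostname slots=N` line per node. A hypothetical generator (the hostnames are example node names seen in the service output above):

```python
def openmpi_hostfile(hostnames, slots=1):
    # One line per node, e.g. "spark-1b3b slots=1"
    return "".join(f"{host} slots={slots}\n" for host in hostnames)

print(openmpi_hostfile(["spark-1b3b", "spark-1d84"]), end="")
```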
### Step 8. Find your Docker container ID

You can use `docker ps` to find your Docker container ID. Alternatively, you can save the container ID in a variable:
```bash
export TRTLLM_MN_CONTAINER=$(docker ps -q -f name=trtllm-multinode)
```

### Step 9. Generate configuration file

```bash
docker exec $TRTLLM_MN_CONTAINER bash -c 'cat <<EOF > /tmp/extra-llm-api-config.yml
@@ -645,7 +563,7 @@ cuda_graph_config:
EOF'
```

### Step 10. Download model

```bash
## Need to specify huggingface token for model download.
@@ -657,7 +575,7 @@ docker exec \
-it $TRTLLM_MN_CONTAINER bash -c 'mpirun -x HF_TOKEN bash -c "huggingface-cli download $MODEL"'
```

### Step 11. Serve the model

```bash
docker exec \
```

@@ -677,7 +595,7 @@ This will start the TensorRT-LLM server on port 8000. You can then make inferenc

**Expected output:** Server startup logs and ready message.

### Step 12. Validate API server

Verify successful deployment by checking container status and testing the API endpoint.

@@ -703,7 +621,7 @@ curl -X POST http://localhost:8000/v1/chat/completions \

**Expected output:** JSON response with generated text completion.
### Step 15. Cleanup and rollback
|
### Step 14. Cleanup and rollback
|
||||||
|
|
||||||
Stop and remove containers by using the following command on the leader node:
|
Stop and remove containers by using the following command on the leader node:
|
||||||
|
|
||||||
@ -719,7 +637,7 @@ Remove downloaded models to free disk space:
|
|||||||
rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3*
|
rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3*
|
||||||
```
|
```
|
||||||
|
|
||||||
### Step 16. Next steps
|
### Step 15. Next steps
|
||||||
|
|
||||||
Compare performance metrics between speculative decoding and baseline reports to quantify speed improvements. Use the multi-node setup as a foundation for deploying other large models requiring tensor parallelism, or scale to additional nodes for higher throughput workloads.
|
Compare performance metrics between speculative decoding and baseline reports to quantify speed improvements. Use the multi-node setup as a foundation for deploying other large models requiring tensor parallelism, or scale to additional nodes for higher throughput workloads.
|
||||||
|
|
||||||
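The validation step above probes the server with a one-shot curl, but the server takes a while to load weights before it answers. A retry loop makes the check scriptable; this is a minimal sketch — `probe_url` is a hypothetical helper, and `/v1/models` is assumed from the OpenAI-compatible API surface rather than stated in the playbook:

```shell
# probe_url: poll a URL until it answers with HTTP success or attempts run out.
probe_url() {
  local url=$1 tries=${2:-30}
  local i
  for ((i = 1; i <= tries; i++)); do
    if curl -sf "$url" > /dev/null; then
      echo "up after $i attempt(s)"
      return 0
    fi
    sleep 2
  done
  echo "gave up after $tries attempts" >&2
  return 1
}

# Example: wait for the server started in the serve step.
# probe_url http://localhost:8000/v1/models 60
```

The same helper works unchanged for the vLLM deployment later in this commit, since both expose an OpenAI-style HTTP endpoint on port 8000.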
@@ -729,7 +647,6 @@ Compare performance metrics between speculative decoding and baseline reports to

 | Symptom | Cause | Fix |
 |---------|-------|-----|
-| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
 | OOM during weight loading (e.g., [Nemotron Super 49B](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5)) | Parallel weight-loading memory pressure | `export TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=1` |
 | "CUDA out of memory" | GPU VRAM insufficient for model | Reduce `free_gpu_memory_fraction: 0.9` or batch size or use smaller model |
 | "Model not found" error | HF_TOKEN invalid or model inaccessible | Verify token and model permissions |
@@ -742,7 +659,6 @@ Compare performance metrics between speculative decoding and baseline reports to
 |---------|-------|-----|
 | MPI hostname test returns single hostname | Network connectivity issues | Verify both nodes are on reachable IP addresses |
 | "Permission denied" on HuggingFace download | Invalid or missing HF_TOKEN | Set valid token: `export HF_TOKEN=<TOKEN>` |
-| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
 | "CUDA out of memory" errors | Insufficient GPU memory | Reduce `--max_batch_size` or `--max_num_tokens` |
 | Container exits immediately | Missing entrypoint script | Ensure `trtllm-mn-entrypoint.sh` download succeeded and has executable permissions, also ensure you are not running the container already on your node. If port 2233 is already utilized, the entrypoint script will not start. |
@@ -5,7 +5,7 @@ WORKDIR /app
 # Install Flask and other required packages
 RUN pip install --no-cache-dir \
 flask==2.0.1 \
-gunicorn==20.1.0 \
+gunicorn==23.0.0 \
 tqdm

 # Create model directory
@@ -1,6 +1,6 @@
 sentence-transformers==2.3.1
-transformers==4.36.2
+transformers==4.46.3
 torch==2.1.2
 flask==2.3.3
-gunicorn==21.2.0
+gunicorn==23.0.0
 numpy==1.26.2
@@ -52,7 +52,7 @@
 "langchain": "^0.3.19",
 "lucide-react": "^0.454.0",
 "neo4j-driver": "^5.28.1",
-"next": "15.1.0",
+"next": "15.2.4",
 "next-themes": "^0.4.4",
 "openai": "^4.91.0",
 "react": "^19",
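The hunks above bump pinned dependency versions (gunicorn, transformers, next). After rebuilding an image against the regenerated pins, it is worth confirming the environment actually picked them up; a small sketch — `check_pin` is a hypothetical helper built on standard `pip show` output:

```shell
# check_pin: compare the installed version of a Python package against a pin.
check_pin() {
  local pkg=$1 want=$2
  local have
  have=$(pip show "$pkg" 2>/dev/null | awk '/^Version:/{print $2}')
  if [ "$have" = "$want" ]; then
    echo "$pkg ok ($have)"
  else
    echo "$pkg mismatch: have ${have:-none}, want $want"
  fi
}

# Example, mirroring the pins bumped above:
# check_pin gunicorn 23.0.0
# check_pin transformers 4.46.3
```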
@@ -7,7 +7,7 @@
 - [Overview](#overview)
 - [Instructions](#instructions)
 - [Run on two Sparks](#run-on-two-sparks)
-- [Step 14. (Optional) Launch 405B inference server](#step-14-optional-launch-405b-inference-server)
+- [Step 11. (Optional) Launch 405B inference server](#step-11-optional-launch-405b-inference-server)
 - [Troubleshooting](#troubleshooting)

 ---
@@ -39,7 +39,7 @@ support for ARM64.
 ## Prerequisites

 - DGX Spark device with ARM64 processor and Blackwell GPU architecture
-- CUDA 12.9 or CUDA 13.0 toolkit installed: `nvcc --version` shows CUDA toolkit version.
+- CUDA 13.0 toolkit installed: `nvcc --version` shows CUDA toolkit version.
 - Docker installed and configured: `docker --version` succeeds
 - NVIDIA Container Toolkit installed
 - Python 3.12 available: `python3.12 --version` succeeds
@@ -125,52 +125,17 @@ sudo /usr/local/cuda-12.9/bin/cuda-uninstaller

 ## Run on two Sparks

-## Step 1. Verify hardware connectivity
-
-Connect the QSFP cable between both DGX Spark systems using the rightmost QSFP interface on each device. This step establishes the 200GbE direct connection required for high-speed inter-node communication.
-
-```bash
-## Check QSFP interface availability on both nodes
-ip link show | grep enP2p1s0f1np1
-```
-
-Expected output shows the interface exists but may be down initially.
-
-## Step 2. Configure cluster network on Node 1
-
-Set up the static IP address for the cluster network interface on the first DGX Spark system. This creates a dedicated network segment for distributed inference communication.
-
-```bash
-## Configure static IP on Node 1
-sudo ip addr add 192.168.100.10/24 dev enP2p1s0f1np1
-sudo ip link set enP2p1s0f1np1 up
-```
-
-## Step 3. Configure cluster network on Node 2
-
-Configure the second node with a corresponding static IP in the same network segment.
-
-```bash
-## Configure static IP on Node 2
-sudo ip addr add 192.168.100.11/24 dev enP2p1s0f1np1
-sudo ip link set enP2p1s0f1np1 up
-```
-
-## Step 4. Verify network connectivity
-
-Test the direct connection between both nodes to ensure the cluster network is functional.
-
-```bash
-## From Node 1, test connectivity to Node 2
-ping -c 3 192.168.100.11
-
-## From Node 2, test connectivity to Node 1
-ping -c 3 192.168.100.10
-```
-
-Expected output shows successful ping responses with low latency.
-
-## Step 5. Download cluster deployment script
+## Step 1. Configure network connectivity
+
+Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.
+
+This includes:
+- Physical QSFP cable connection
+- Network interface configuration (automatic or manual IP assignment)
+- Passwordless SSH setup
+- Network connectivity verification
+
+## Step 2. Download cluster deployment script

 Obtain the vLLM cluster deployment script on both nodes. This script orchestrates the Ray cluster setup required for distributed inference.

@@ -180,7 +145,7 @@ wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/example
 chmod +x run_cluster.sh
 ```

-## Step 6. Pull the NVIDIA vLLM Image from NGC
+## Step 3. Pull the NVIDIA vLLM Image from NGC

 First, you will need to configure docker to pull from NGC
 If this is your first time using docker run:
@@ -192,19 +157,14 @@ newgrp docker

 After this, you should be able to run docker commands without using `sudo`.

-Next, create an NGC API Key [here](https://ngc.nvidia.com/setup/api-key) so that you can pull containers from NGC.
-
-Once you have the API key, you can configure docker to pull from NGC and pull down the VLLM image:
-
 ```bash
-docker login nvcr.io
-## Username will be `$oauthtoken` and the password is your NGC API Key
 docker pull nvcr.io/nvidia/vllm:25.09-py3
 export VLLM_IMAGE=nvcr.io/nvidia/vllm:25.09-py3
 ```

-## Step 7. Start Ray head node
+## Step 4. Start Ray head node

 Launch the Ray cluster head node on Node 1. This node coordinates the distributed inference and serves the API endpoint.

@@ -223,7 +183,7 @@ bash run_cluster.sh $VLLM_IMAGE 192.168.100.10 --head ~/.cache/huggingface \
 ```

-## Step 8. Start Ray worker node
+## Step 5. Start Ray worker node

 Connect Node 2 to the Ray cluster as a worker node. This provides additional GPU resources for tensor parallelism.

@@ -241,7 +201,7 @@ bash run_cluster.sh $VLLM_IMAGE 192.168.100.10 --worker ~/.cache/huggingface \
 -e MASTER_ADDR=192.168.100.10
 ```

-## Step 9. Verify cluster status
+## Step 6. Verify cluster status

 Confirm both nodes are recognized and available in the Ray cluster.

@@ -252,7 +212,7 @@ docker exec node ray status

 Expected output shows 2 nodes with available GPU resources.
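The cluster-status step above checks `docker exec node ray status` by eye. The node count can also be checked mechanically; a sketch under the assumption that each live node appears as a `node_<id>` entry in the `Active:` section of the output (the exact format varies across Ray versions, so verify against yours):

```shell
# count_active_nodes: count node entries in `ray status`-style output on stdin.
count_active_nodes() {
  grep -c 'node_'
}

# Example:
# docker exec node ray status | count_active_nodes   # a 2-node cluster should report 2
```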
-## Step 10. Download Llama 3.3 70B model
+## Step 7. Download Llama 3.3 70B model

 Authenticate with Hugging Face and download the recommended production-ready model.

@@ -262,7 +222,7 @@ huggingface-cli login
 huggingface-cli download meta-llama/Llama-3.3-70B-Instruct
 ```

-## Step 11. Launch inference server for Llama 3.3 70B
+## Step 8. Launch inference server for Llama 3.3 70B

 Start the vLLM inference server with tensor parallelism across both nodes.

@@ -273,7 +233,7 @@ vllm serve meta-llama/Llama-3.3-70B-Instruct \
 --tensor-parallel-size 2 --max_model_len 2048
 ```

-## Step 12. Test 70B model inference
+## Step 9. Test 70B model inference

 Verify the deployment with a sample inference request.

@@ -291,7 +251,7 @@ curl http://localhost:8000/v1/completions \

 Expected output includes a generated haiku response.
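The inference test above returns a JSON body. For scripting or smoke tests it helps to pull out just the generated text; the field names below follow the OpenAI completions schema that vLLM emulates:

```shell
# extract_completion: read an OpenAI-style /v1/completions JSON response on
# stdin and print the first choice's generated text.
extract_completion() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["text"])'
}

# Example (model name and prompt from the test step above):
# curl -s http://localhost:8000/v1/completions \
#   -H "Content-Type: application/json" \
#   -d '{"model": "meta-llama/Llama-3.3-70B-Instruct", "prompt": "Write a haiku", "max_tokens": 64}' \
#   | extract_completion
```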
-## Step 13. (Optional) Deploy Llama 3.1 405B model
+## Step 10. (Optional) Deploy Llama 3.1 405B model

 > **Warning:** 405B model has insufficient memory headroom for production use.

@@ -302,7 +262,7 @@ Download the quantized 405B model for testing purposes only.
 huggingface-cli download hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4
 ```

-### Step 14. (Optional) Launch 405B inference server
+### Step 11. (Optional) Launch 405B inference server

 Start the server with memory-constrained parameters for the large model.

@@ -314,7 +274,7 @@ vllm serve hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 \
 --max-num-seqs 1 --max_num_batched_tokens 256
 ```

-## Step 15. (Optional) Test 405B model inference
+## Step 12. (Optional) Test 405B model inference

 Verify the 405B deployment with constrained parameters.

@@ -329,7 +289,7 @@ curl http://localhost:8000/v1/completions \
 }'
 ```

-## Step 16. Validate deployment
+## Step 13. Validate deployment

 Perform comprehensive validation of the distributed inference system.

@@ -345,7 +305,7 @@ nvidia-smi
 docker exec node nvidia-smi --query-gpu=memory.used,memory.total --format=csv
 ```

-## Step 18. Cleanup and rollback
+## Step 14. Cleanup and rollback

 Remove temporary configurations and containers when testing is complete.

@@ -362,7 +322,7 @@ sudo ip addr del 192.168.100.11/24 dev enP2p1s0f1np1 # Node 2
 sudo ip link set enP2p1s0f1np1 down
 ```
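The cleanup step above lists its commands one by one. They can be wrapped into a single per-node function; a sketch — `teardown_node` is hypothetical, the container name, interface, and IPs come from the playbook, and a `DRY_RUN` switch lets you preview before executing:

```shell
# teardown_node: remove the Ray container and tear down the cluster IP on one node.
# Set DRY_RUN=1 to print the commands instead of executing them.
teardown_node() {
  local ip=$1 dev=${2:-enP2p1s0f1np1}
  local run="eval"
  if [ "${DRY_RUN:-0}" = "1" ]; then run="echo"; fi
  $run "docker rm -f node"
  $run "sudo ip addr del $ip/24 dev $dev"
  $run "sudo ip link set $dev down"
}

# Example:
# DRY_RUN=1 teardown_node 192.168.100.10   # preview on Node 1
# teardown_node 192.168.100.11             # execute on Node 2
```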
-## Step 19. Next steps
+## Step 15. Next steps

 Access the Ray dashboard for cluster monitoring and explore additional features:

@@ -382,7 +342,6 @@ http://192.168.100.10:8265
 | Symptom | Cause | Fix |
 |---------|--------|-----|
 | Node 2 not visible in Ray cluster | Network connectivity issue | Verify QSFP cable connection, check IP configuration |
-| Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
 | Model download fails | Authentication or network issue | Re-run `huggingface-cli login`, check internet access |
 | Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your HuggingFace token; and request access to the gated model on your web browser |
 | CUDA out of memory with 405B | Insufficient GPU memory | Use 70B model or reduce max_model_len parameter |