chore: Regenerate all playbooks

GitLab CI 2025-10-12 17:01:59 +00:00
parent 983c5e8f68
commit e8a3c50028
3 changed files with 28 additions and 27 deletions

File 1 of 3

@@ -6,12 +6,13 @@
- [Overview](#overview)
- [Run on two Sparks](#run-on-two-sparks)
- [Troubleshooting](#troubleshooting)
---
## Overview
-## Basic idea
+## Basic Idea
NCCL (NVIDIA Collective Communication Library) enables high-performance GPU-to-GPU communication
across multiple nodes. This walkthrough sets up NCCL for multi-node distributed training on
@@ -40,11 +41,9 @@ and proper GPU topology detection.
## Time & risk
-**Duration**: 30 minutes for setup and validation
-**Risk level**: Medium - involves network configuration changes
-**Rollback**: The NCCL & NCCL Tests repositories can be deleted from DGX Spark
+* **Duration**: 30 minutes for setup and validation
+* **Risk level**: Medium - involves network configuration changes
+* **Rollback**: The NCCL & NCCL Tests repositories can be deleted from DGX Spark
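Rollback is thus just deleting the cloned repositories. A minimal sketch, assuming both were cloned under `$HOME` (adjust the paths to wherever you actually cloned them):

```bash
# Hypothetical clone locations -- adjust to match your setup
rm -rf "$HOME/nccl" "$HOME/nccl-tests"
```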
## Run on two Sparks
@@ -174,6 +173,8 @@ Now you can try running a larger distributed workload such as TRT-LLM or vLLM in
## Troubleshooting
+## Common issues for running on two Sparks
| Issue | Cause | Solution |
|-------|-------|----------|
| mpirun hangs or times out | SSH connectivity issues | 1. Test basic SSH connectivity: `ssh <remote_ip>` should work without password prompts<br>2. Try a simple mpirun test: `mpirun -np 2 -H <IP for Node 1>:1,<IP for Node 2>:1 hostname`<br>3. Verify SSH keys are set up correctly for all nodes |
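The fastest way to separate SSH problems from NCCL problems is the two-step check from the table. A sketch, assuming node IPs 192.168.100.10 and 192.168.100.11 (substitute your own):

```bash
# 1. Passwordless SSH must work from each node to the other
ssh 192.168.100.11 hostname

# 2. If SSH is clean, a trivial mpirun across both nodes should print both hostnames
mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 hostname
```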

File 2 of 3

@@ -310,7 +310,7 @@ docker run \
```
-> **Note:** If you hit a host OOM during downloads or first run, free the OS page cache on the host (outside the container) and retry:
+> Note: If you hit a host OOM during downloads or first run, free the OS page cache on the host (outside the container) and retry:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
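To confirm the page cache was actually released, compare the buff/cache column before and after. This check is not part of the playbook, just a common sanity step:

```bash
free -h                                            # note the buff/cache column
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -h                                            # buff/cache should now be much smaller
```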
@@ -411,7 +411,7 @@ docker rmi nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
### Step 1. Configure network connectivity
-Follow the network setup instructions from the [Connect two Sparks](https://build.nvidia.com/spark/stack-sparks/stacked-sparks) playbook to establish connectivity between your DGX Spark nodes.
+Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.
This includes:
- Physical QSFP cable connection
@@ -447,13 +447,13 @@ First, find your GPU UUID by running:
nvidia-smi -a | grep UUID
```
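If you want just the UUID, handy for scripting the next step, `nvidia-smi`'s query flags return it without the surrounding report. A sketch, not from the playbook:

```bash
# Prints only the UUID, e.g. GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1
GPU_UUID=$(nvidia-smi --query-gpu=uuid --format=csv,noheader)
echo "$GPU_UUID"
```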
-Next, modify the Docker daemon configuration to advertise the GPU to Swarm. Edit **/etc/docker/daemon.json**:
+Next, modify the Docker daemon configuration to advertise the GPU to Swarm. Edit /etc/docker/daemon.json:
```bash
sudo nano /etc/docker/daemon.json
```
-Add or modify the file to include the nvidia runtime and GPU UUID (replace **GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1** with your actual GPU UUID):
+Add or modify the file to include the nvidia runtime and GPU UUID (replace GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1 with your actual GPU ID):
```json
{
@@ -470,7 +470,7 @@ Add or modify the file to include the nvidia runtime and GPU UUID (replace **GPU
}
```
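For reference, a daemon.json of this shape can be written in one step. This is a sketch based on Docker's documented `node-generic-resources` mechanism for Swarm, with the placeholder UUID from above; replace it with your own, and note the runtime path assumes a default nvidia-container-runtime install:

```bash
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia",
  "node-generic-resources": [
    "NVIDIA-GPU=GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1"
  ]
}
EOF
```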
-Modify the NVIDIA Container Runtime to advertise the GPUs to the Swarm by uncommenting the swarm-resource line in the **config.toml** file. You can do this either with your preferred text editor (e.g., vim, nano...) or with the following command:
+Modify the NVIDIA Container Runtime to advertise the GPUs to the Swarm by uncommenting the swarm-resource line in the config.toml file. You can do this either with your preferred text editor (e.g., vim, nano...) or with the following command:
```bash
sudo sed -i 's/^#\s*\(swarm-resource\s*=\s*".*"\)/\1/' /etc/nvidia-container-runtime/config.toml
```
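After the sed edit, confirm the line is actually uncommented, then restart Docker so Swarm picks up the advertised resource. A sketch; the restart command assumes systemd:

```bash
# Should print the swarm-resource line with no leading '#'
grep swarm-resource /etc/nvidia-container-runtime/config.toml
sudo systemctl restart docker
```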
@@ -519,7 +519,7 @@ On your primary node, deploy the TRT-LLM multi-node stack by downloading the [**
```bash
docker stack deploy -c $HOME/docker-compose.yml trtllm-multinode
```
-**Note:** Ensure you download both files into the same directory from which you are running the command.
+Note: Ensure you download both files into the same directory from which you are running the command.
You can verify the status of your worker nodes using the following command:
```bash
@@ -534,7 +534,7 @@ oe9k5o6w41le trtllm-multinode_trtllm.1 nvcr.io/nvidia/tensorrt-llm/relea
phszqzk97p83 trtllm-multinode_trtllm.2 nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3 spark-1b3b Running Running 2 minutes ago
```
-**Note:** If your "Current state" is not "Running", see the troubleshooting section for more information.
+Note: If your "Current state" is not "Running", see the troubleshooting section for more information.
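If a task sits in a non-Running state, the task list and service logs usually show why. A quick check, assuming the stack name `trtllm-multinode` from the deploy command above:

```bash
docker stack ps trtllm-multinode --no-trunc    # full per-task error messages
docker service logs trtllm-multinode_trtllm    # container logs for the service
```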
### Step 7. Create hosts file
@@ -603,7 +603,7 @@ docker exec \
This will start the TensorRT-LLM server on port 8355. You can then make inference requests to `http://localhost:8355` using the OpenAI-compatible API format.
-**Note:** You might see a warning such as `UCX WARN network device 'enp1s0f0np0' is not available, please use one or more of`. You can ignore this warning if your inference is successful, as it's related to only one of your two CX-7 ports being used, and the other being left unused.
+Note: You might see a warning such as `UCX WARN network device 'enp1s0f0np0' is not available, please use one or more of`. You can ignore this warning if your inference is successful, as it's related to only one of your two CX-7 ports being used, and the other being left unused.
**Expected output:** Server startup logs and ready message.
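Once the server reports ready, a minimal OpenAI-style request confirms end-to-end inference. A sketch against port 8355; the model name is a placeholder for whichever model you deployed:

```bash
curl http://localhost:8355/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<deployed-model-name>",
        "messages": [{"role": "user", "content": "Hello, are you up?"}],
        "max_tokens": 64
      }'
```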
@@ -659,7 +659,7 @@ After setting up TensorRT-LLM inference server in either single-node or multi-no
Run the following command on the DGX Spark node where you have the TensorRT-LLM inference server running.
For a multi-node setup, this would be the primary node.
-**Note:** If you used a different port for your OpenAI-compatible API server, adjust the `OPENAI_API_BASE_URL="http://localhost:8355/v1"` to match the IP and port of your TensorRT-LLM inference server.
+Note: If you used a different port for your OpenAI-compatible API server, adjust the `OPENAI_API_BASE_URL="http://localhost:8355/v1"` to match the IP and port of your TensorRT-LLM inference server.
```bash
docker run \

File 3 of 3

@@ -91,16 +91,7 @@ curl http://localhost:8000/v1/chat/completions \
The expected response should contain `"content": "204"` or a similar mathematical result.
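For reference, an OpenAI-compatible request of that shape might look like the following sketch; the model name and prompt are placeholders (any arithmetic prompt whose answer is 204, such as 17 * 12, reproduces the expected check):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<deployed-model-name>",
        "messages": [{"role": "user", "content": "What is 17 * 12? Answer with the number only."}],
        "max_tokens": 16
      }'
```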
-## Step 3. Troubleshooting
-
-| Symptom | Cause | Fix |
-|---------|--------|-----|
-| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer |
-| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token |
-| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |
-
-## Step 4. Cleanup and rollback
+## Step 3. Cleanup and rollback
For the container approach (non-destructive):
@@ -116,7 +107,7 @@ To remove CUDA 12.9:
sudo /usr/local/cuda-12.9/bin/cuda-uninstaller
```
-## Step 5. Next steps
+## Step 4. Next steps
- **Production deployment:** Configure vLLM with your specific model requirements
- **Performance tuning:** Adjust batch sizes and memory settings for your workload
@@ -127,7 +118,7 @@ sudo /usr/local/cuda-12.9/bin/cuda-uninstaller
## Step 1. Configure network connectivity
-Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.
+Follow the network setup instructions from the [Connect two Sparks](https://build.nvidia.com/spark/stack-sparks/stacked-sparks) playbook to establish connectivity between your DGX Spark nodes.
This includes:
- Physical QSFP cable connection
@@ -339,6 +330,15 @@ http://192.168.100.10:8265
## Troubleshooting
+## Common issues for running on a single Spark
+| Symptom | Cause | Fix |
+|---------|--------|-----|
+| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer |
+| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token |
+| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |
+## Common issues for running on two Sparks
+| Symptom | Cause | Fix |
+|---------|--------|-----|
+| Node 2 not visible in Ray cluster | Network connectivity issue | Verify QSFP cable connection, check IP configuration |
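When node 2 is missing from the cluster, check the link layer first and then Ray's own view of the cluster. A sketch, assuming the 192.168.100.x addressing used above (node 2's IP is a placeholder):

```bash
ip addr show                 # confirm the QSFP interface is up with the expected IP
ping -c 3 192.168.100.11     # can the primary node reach node 2?
ray status                   # Ray's view of connected nodes (run on the head node)
```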