From 983c5e8f68627578b38f27fff09f9a09dabb73dd Mon Sep 17 00:00:00 2001
From: GitLab CI
Date: Sun, 12 Oct 2025 16:57:39 +0000
Subject: [PATCH] chore: Regenerate all playbooks

---
 nvidia/trt-llm/README.md | 18 +++++++++---------
 nvidia/vllm/README.md    | 24 ++++++++++++------------
 2 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/nvidia/trt-llm/README.md b/nvidia/trt-llm/README.md
index 6306950..d61ad23 100644
--- a/nvidia/trt-llm/README.md
+++ b/nvidia/trt-llm/README.md
@@ -310,7 +310,7 @@ docker run \
 ```
 
-> Note: If you hit a host OOM during downloads or first run, free the OS page cache on the host (outside the container) and retry:
+> **Note:** If you hit a host OOM during downloads or first run, free the OS page cache on the host (outside the container) and retry:
 ```bash
 sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
 ```
 
@@ -411,7 +411,7 @@ docker rmi nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
 
 ### Step 1. Configure network connectivity
 
-Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.
+Follow the network setup instructions from the [Connect two Sparks](https://build.nvidia.com/spark/stack-sparks/stacked-sparks) playbook to establish connectivity between your DGX Spark nodes.
 
 This includes:
 - Physical QSFP cable connection
@@ -447,13 +447,13 @@ First, find your GPU UUID by running:
 nvidia-smi -a | grep UUID
 ```
 
-Next, modify the Docker daemon configuration to advertise the GPU to Swarm. Edit /etc/docker/daemon.json:
+Next, modify the Docker daemon configuration to advertise the GPU to Swarm. Edit **/etc/docker/daemon.json**:
 
 ```bash
 sudo nano /etc/docker/daemon.json
 ```
 
-Add or modify the file to include the nvidia runtime and GPU UUID (replace GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1 with your actual GPU ID):
+Add or modify the file to include the nvidia runtime and GPU UUID (replace **GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1** with your actual GPU UUID):
 
 ```json
 {
@@ -470,7 +470,7 @@ Add or modify the file to include the nvidia runtime and GPU UUID (replace GPU-4
 }
 ```
 
-Modify the NVIDIA Container Runtime to advertise the GPUs to the Swarm by uncommenting the swarm-resource line in the config.toml file. You can do this either with your preferred text editor (e.g., vim, nano...) or with the following command:
+Modify the NVIDIA Container Runtime to advertise the GPUs to the Swarm by uncommenting the swarm-resource line in the **config.toml** file. You can do this either with your preferred text editor (e.g., vim, nano...) or with the following command:
 ```bash
 sudo sed -i 's/^#\s*\(swarm-resource\s*=\s*".*"\)/\1/' /etc/nvidia-container-runtime/config.toml
 ```
@@ -519,7 +519,7 @@ On your primary node, deploy the TRT-LLM multi-node stack by downloading the [**
 ```bash
 docker stack deploy -c $HOME/docker-compose.yml trtllm-multinode
 ```
-Note: Ensure you download both files into the same directory from which you are running the command.
+**Note:** Ensure you download both files into the same directory from which you are running the command.
 
 You can verify the status of your worker nodes using the following
 ```bash
@@ -534,7 +534,7 @@ oe9k5o6w41le trtllm-multinode_trtllm.1 nvcr.io/nvidia/tensorrt-llm/relea
 phszqzk97p83 trtllm-multinode_trtllm.2 nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3 spark-1b3b Running Running 2 minutes ago
 ```
 
-Note: If your "Current state" is not "Running", see troubleshooting section for more information.
+**Note:** If your "Current state" is not "Running", see troubleshooting section for more information.
 
 ### Step 7. Create hosts file
 
@@ -603,7 +603,7 @@ docker exec \
 
 This will start the TensorRT-LLM server on port 8355. You can then make inference requests to `http://localhost:8355` using the OpenAI-compatible API format.
 
-Note: You might see a warning such as `UCX WARN network device 'enp1s0f0np0' is not available, please use one or more of`. You can ignore this warning if your inference is successful, as it's related to only one of your two CX-7 ports being used, and the other being left unused.
+**Note:** You might see a warning such as `UCX WARN network device 'enp1s0f0np0' is not available, please use one or more of`. You can ignore this warning if your inference is successful, as it's related to only one of your two CX-7 ports being used, and the other being left unused.
 
 **Expected output:** Server startup logs and ready message.
 
@@ -659,7 +659,7 @@ After setting up TensorRT-LLM inference server in either single-node or multi-no
 
 Run the following command on the DGX Spark node where you have the TensorRT-LLM inference server running. For multi-node setup, this would be the primary node.
 
-Note: If you used a different port for your OpenAI-compatible API server, adjust the `OPENAI_API_BASE_URL="http://localhost:8355/v1"` to match the IP and port of your TensorRT-LLM inference server.
+**Note:** If you used a different port for your OpenAI-compatible API server, adjust the `OPENAI_API_BASE_URL="http://localhost:8355/v1"` to match the IP and port of your TensorRT-LLM inference server.
 
 ```bash
 docker run \
diff --git a/nvidia/vllm/README.md b/nvidia/vllm/README.md
index 77788da..842c71b 100644
--- a/nvidia/vllm/README.md
+++ b/nvidia/vllm/README.md
@@ -91,7 +91,16 @@ curl http://localhost:8000/v1/chat/completions \
 
 Expected response should contain `"content": "204"` or similar mathematical calculation.
 
-## Step 3. Cleanup and rollback
+## Step 3. Troubleshooting
+
+| Symptom | Cause | Fix |
+|---------|--------|-----|
+| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer |
+| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token |
+| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |
+
+
+## Step 4. Cleanup and rollback
 
 For container approach (non-destructive):
@@ -107,7 +116,7 @@ To remove CUDA 12.9:
 sudo /usr/local/cuda-12.9/bin/cuda-uninstaller
 ```
 
-## Step 4. Next steps
+## Step 5. Next steps
 
 - **Production deployment:** Configure vLLM with your specific model requirements
 - **Performance tuning:** Adjust batch sizes and memory settings for your workload
@@ -118,7 +127,7 @@ sudo /usr/local/cuda-12.9/bin/cuda-uninstaller
 
 ## Step 1. Configure network connectivity
 
-Follow the network setup instructions from the [Connect two Sparks](https://build.nvidia.com/spark/stack-sparks/stacked-sparks) playbook to establish connectivity between your DGX Spark nodes.
+Follow the network setup instructions from the Connect two Sparks playbook to establish connectivity between your DGX Spark nodes.
 
 This includes:
 - Physical QSFP cable connection
@@ -330,15 +339,6 @@ http://192.168.100.10:8265
 
 ## Troubleshooting
 
-## Common issues for running on a single Spark
-
-| Symptom | Cause | Fix |
-|---------|--------|-----|
-| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer |
-| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token |
-| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |
-
-## Common Issues for running on two Starks
 | Symptom | Cause | Fix |
 |---------|--------|-----|
 | Node 2 not visible in Ray cluster | Network connectivity issue | Verify QSFP cable connection, check IP configuration |