chore: Regenerate all playbooks

This commit is contained in:
GitLab CI 2025-12-09 19:08:18 +00:00
parent 7ad00de39e
commit 6ef03c813f
2 changed files with 96 additions and 123 deletions

View File

@ -61,6 +61,8 @@ You'll launch a NIM container on your DGX Spark device to expose a GPU-accelerat
* GPU memory requirements vary by model size
* Container startup time depends on model loading
* **Rollback:** Stop and remove containers with `docker stop <CONTAINER_NAME> && docker rm <CONTAINER_NAME>`. Remove cached models from `~/.cache/nim` if disk space recovery is needed.
* **Last Updated:** 12/09/2025
* Update docker container version to cuda:13.0.1-devel-ubuntu24.04
## Instructions
@ -71,7 +73,7 @@ Check that your system meets the basic requirements for running GPU-enabled cont
```bash
nvidia-smi
docker --version
docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu20.04 nvidia-smi
docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 nvidia-smi
```
### Step 2. Configure NGC authentication
@ -91,6 +93,9 @@ Choose a specific LLM NIM from NGC and set up local caching for model assets.
export CONTAINER_NAME="nim-llm-demo"
export IMG_NAME="nvcr.io/nim/meta/llama-3.1-8b-instruct-dgx-spark:latest"
export LOCAL_NIM_CACHE=~/.cache/nim
export LOCAL_NIM_WORKSPACE=~/.local/share/nim/workspace
mkdir -p "$LOCAL_NIM_WORKSPACE"
chmod -R a+w "$LOCAL_NIM_WORKSPACE"
mkdir -p "$LOCAL_NIM_CACHE"
chmod -R a+w "$LOCAL_NIM_CACHE"
```
@ -101,12 +106,11 @@ Start the containerized LLM service with GPU acceleration and proper resource al
```bash
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-v "$LOCAL_NIM_WORKSPACE:/opt/nim/workspace" \
-p 8000:8000 \
$IMG_NAME
```

View File

@ -18,18 +18,17 @@
- [Run on two Sparks](#run-on-two-sparks)
- [Step 1. Configure network connectivity](#step-1-configure-network-connectivity)
- [Step 2. Configure Docker permissions](#step-2-configure-docker-permissions)
- [Step 3. Install NVIDIA Container Toolkit & setup Docker environment](#step-3-install-nvidia-container-toolkit-setup-docker-environment)
- [Step 4. Enable resource advertising](#step-4-enable-resource-advertising)
- [Step 5. Initialize Docker Swarm](#step-5-initialize-docker-swarm)
- [Step 6. Join worker nodes and deploy](#step-6-join-worker-nodes-and-deploy)
- [Step 7. Create hosts file](#step-7-create-hosts-file)
- [Step 8. Find your Docker container ID](#step-8-find-your-docker-container-id)
- [Step 9. Generate configuration file](#step-9-generate-configuration-file)
- [Step 10. Download model](#step-10-download-model)
- [Step 11. Serve the model](#step-11-serve-the-model)
- [Step 12. Validate API server](#step-12-validate-api-server)
- [Step 14. Cleanup and rollback](#step-14-cleanup-and-rollback)
- [Step 15. Next steps](#step-15-next-steps)
- [Step 3. Create OpenMPI hostfile](#step-3-create-openmpi-hostfile)
- [Step 4. Start containers on both nodes](#step-4-start-containers-on-both-nodes)
- [Step 5. Verify containers are running](#step-5-verify-containers-are-running)
- [Step 6. Copy hostfile to primary container](#step-6-copy-hostfile-to-primary-container)
- [Step 7. Save container reference](#step-7-save-container-reference)
- [Step 8. Generate configuration file](#step-8-generate-configuration-file)
- [Step 9. Download model](#step-9-download-model)
- [Step 10. Serve the model](#step-10-serve-the-model)
- [Step 11. Validate API server](#step-11-validate-api-server)
- [Step 12. Cleanup and rollback](#step-12-cleanup-and-rollback)
- [Step 13. Next steps](#step-13-next-steps)
- [Open WebUI for TensorRT-LLM](#open-webui-for-tensorrt-llm)
- [Step 1. Set up the prerequisites to use Open WebUI with TRT-LLM](#step-1-set-up-the-prerequisites-to-use-open-webui-with-trt-llm)
- [Step 2. Launch Open WebUI container](#step-2-launch-open-webui-container)
@ -438,132 +437,98 @@ If you see a permission denied error (something like permission denied while try
sudo usermod -aG docker $USER
newgrp docker
```
### Step 3. Install NVIDIA Container Toolkit & setup Docker environment
Ensure the NVIDIA drivers and the NVIDIA Container Toolkit are installed on each node (both manager and workers) that will provide GPU resources. This package enables Docker containers to access the host's GPU hardware. Ensure you complete the [installation steps](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), including the [Docker configuration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker) for NVIDIA Container Toolkit.
Repeat this step on both nodes.
### Step 4. Enable resource advertising
### Step 3. Create OpenMPI hostfile
First, find your GPU UUID by running:
```bash
nvidia-smi -a | grep UUID
```
Next, modify the Docker daemon configuration to advertise the GPU to Swarm. Edit **/etc/docker/daemon.json**:
Create a hostfile with the IP addresses of both nodes for MPI operations. On each node, get the IP address of your network interface:
```bash
sudo nano /etc/docker/daemon.json
ip a show enp1s0f0np0
```
Add or modify the file to include the nvidia runtime and GPU UUID (replace **GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1** with your actual GPU UUID):
```json
{
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
},
"default-runtime": "nvidia",
"node-generic-resources": [
"NVIDIA_GPU=GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1"
]
}
```
Modify the NVIDIA Container Runtime to advertise the GPUs to the Swarm by uncommenting the swarm-resource line in the **config.toml** file. You can do this either with your preferred text editor (e.g., vim, nano...) or with the following command:
```bash
sudo sed -i 's/^#\s*\(swarm-resource\s*=\s*".*"\)/\1/' /etc/nvidia-container-runtime/config.toml
```
Finally, restart the Docker daemon to apply all changes:
```bash
sudo systemctl restart docker
```
Repeat these steps on all nodes.
### Step 5. Initialize Docker Swarm
On whichever node you want to use as primary, run the following swarm initialization command
```bash
docker swarm init --advertise-addr $(ip -o -4 addr show enp1s0f0np0 | awk '{print $4}' | cut -d/ -f1) $(ip -o -4 addr show enp1s0f1np1 | awk '{print $4}' | cut -d/ -f1)
```
The typical output of the above would be similar to the following:
```
Swarm initialized: current node (node-id) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join --token <worker-token> <advertise-addr>:<port>
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
```
### Step 6. Join worker nodes and deploy
Now we can proceed with setting up the worker nodes of your cluster. Repeat these steps on all worker nodes.
Run the command suggested by the docker swarm init on each worker node to join the Docker swarm
```bash
docker swarm join --token <worker-token> <advertise-addr>:<port>
```
On both nodes, download the [**trtllm-mn-entrypoint.sh**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/trt-llm/assets/trtllm-mn-entrypoint.sh) script into your home directory and run the following command to make it executable:
Or if you're using the second interface:
```bash
chmod +x $HOME/trtllm-mn-entrypoint.sh
ip a show enp1s0f1np1
```
On your primary node, deploy the TRT-LLM multi-node stack by downloading the [**docker-compose.yml**](https://github.com/NVIDIA/dgx-spark-playbooks/blob/main/nvidia/trt-llm/assets/docker-compose.yml) file into your home directory and running the following command:
Look for the `inet` line to find the IP address (e.g., `192.168.1.10/24`).
On your primary node, create the hostfile `~/openmpi-hostfile` with the collected IPs:
```bash
docker stack deploy -c $HOME/docker-compose.yml trtllm-multinode
cat > ~/openmpi-hostfile <<EOF
192.168.1.10
192.168.1.11
EOF
```
> [!NOTE]
> Ensure you download both files into the same directory from which you are running the command.
You can verify the status of your worker nodes using the following
Replace the IP addresses with your actual node IPs.
### Step 4. Start containers on both nodes
On **each node** (primary and worker), run the following command to start the TRT-LLM container:
```bash
docker stack ps trtllm-multinode
```
If everything is healthy, you should see a similar output to the following:
```
nvidia@spark-1b3b:~$ docker stack ps trtllm-multinode
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
oe9k5o6w41le trtllm-multinode_trtllm.1 nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3 spark-1d84 Running Running 2 minutes ago
phszqzk97p83 trtllm-multinode_trtllm.2 nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3 spark-1b3b Running Running 2 minutes ago
docker run -d --rm \
--name trtllm-multinode \
--gpus '"device=all"' \
--network host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--device /dev/infiniband:/dev/infiniband \
-e UCX_NET_DEVICES="enp1s0f0np0,enp1s0f1np1" \
-e NCCL_SOCKET_IFNAME="enp1s0f0np0,enp1s0f1np1" \
-e OMPI_MCA_btl_tcp_if_include="enp1s0f0np0,enp1s0f1np1" \
-e OMPI_MCA_orte_default_hostfile="/etc/openmpi-hostfile" \
-e OMPI_MCA_rmaps_ppr_n_pernode="1" \
-e OMPI_ALLOW_RUN_AS_ROOT="1" \
-e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM="1" \
-v ~/.cache/huggingface/:/root/.cache/huggingface/ \
-v ~/.ssh:/tmp/.ssh:ro \
nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3 \
sh -c "curl https://raw.githubusercontent.com/NVIDIA/dgx-spark-playbooks/refs/heads/main/nvidia/trt-llm/assets/trtllm-mn-entrypoint.sh | sh"
```
> [!NOTE]
> If your "Current state" is not "Running", see troubleshooting section for more information.
> Make sure to run this command on **both** the primary and worker nodes.
### Step 7. Create hosts file
### Step 5. Verify containers are running
You can check the available nodes using `docker node ls`
```
nvidia@spark-1b3b:~$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
hza2b7yisatqiezo33zx4in4i * spark-1b3b Ready Active Leader 28.3.3
m1k22g3ktgnx36qz4jg5fzhr4 spark-1d84 Ready Active 28.3.3
```
On each node, verify the container is running:
Generate a file containing all Docker Swarm node addresses for MPI operations, and then copy it over to your container:
```bash
docker node ls --format '{{.ID}}' | xargs -n1 docker node inspect --format '{{ .Status.Addr }}' > ~/openmpi-hostfile
docker cp ~/openmpi-hostfile $(docker ps -q -f name=trtllm-multinode):/etc/openmpi-hostfile
docker ps
```
### Step 8. Find your Docker container ID
You should see output similar to:
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
abc123def456 nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3 "sh -c 'curl https:…" 10 seconds ago Up 8 seconds trtllm-multinode
```
### Step 6. Copy hostfile to primary container
On your primary node, copy the OpenMPI hostfile into the container:
You can use `docker ps` to find your Docker container ID. Alternatively, you can save the container ID in a variable:
```bash
export TRTLLM_MN_CONTAINER=$(docker ps -q -f name=trtllm-multinode)
docker cp ~/openmpi-hostfile trtllm-multinode:/etc/openmpi-hostfile
```
### Step 9. Generate configuration file
### Step 7. Save container reference
On your primary node, save the container name in a variable for convenience:
```bash
export TRTLLM_MN_CONTAINER=trtllm-multinode
```
### Step 8. Generate configuration file
On your primary node, generate the configuration file inside the container:
```bash
docker exec $TRTLLM_MN_CONTAINER bash -c 'cat <<EOF > /tmp/extra-llm-api-config.yml
@ -576,9 +541,10 @@ cuda_graph_config:
EOF'
```
### Step 10. Download model
### Step 9. Download model
We can download a model using the following command. You can replace `nvidia/Qwen3-235B-A22B-FP4` with the model of your choice.
```bash
## Need to specify huggingface token for model download.
export HF_TOKEN=<your-huggingface-token>
@ -589,7 +555,9 @@ docker exec \
-it $TRTLLM_MN_CONTAINER bash -c 'mpirun -x HF_TOKEN bash -c "huggingface-cli download $MODEL"'
```
### Step 11. Serve the model
### Step 10. Serve the model
On your primary node, start the TensorRT-LLM server:
```bash
docker exec \
@ -612,8 +580,9 @@ This will start the TensorRT-LLM server on port 8355. You can then make inferenc
**Expected output:** Server startup logs and ready message.
### Step 12. Validate API server
Once the server is running, you can test it with a CURL request. Please ensure the CURL request is run on the primary node where you previously ran Step 11.
### Step 11. Validate API server
Once the server is running, you can test it with a CURL request. Run this on the primary node:
```bash
curl -s http://localhost:8355/v1/chat/completions \
@ -627,24 +596,24 @@ curl -s http://localhost:8355/v1/chat/completions \
**Expected output:** JSON response with generated text completion.
### Step 14. Cleanup and rollback
### Step 12. Cleanup and rollback
Stop and remove containers by using the following command on the leader node:
Stop and remove containers on **each node**. SSH to each node and run:
```bash
docker stack rm trtllm-multinode
docker stop trtllm-multinode
```
> [!WARNING]
> This removes all inference data and performance reports. Copy `/opt/*perf-report.json` files before cleanup if needed.
> This removes all inference data and performance reports. Copy any necessary files before cleanup if needed.
Remove downloaded models to free disk space:
Remove downloaded models to free disk space on each node:
```bash
rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3*
```
### Step 15. Next steps
### Step 13. Next steps
You can now deploy other models on your DGX Spark cluster.