Mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git (synced 2026-04-26 03:43:52 +00:00)

chore: Regenerate all playbooks

parent d060c4abfe
commit 302c15b6cf
@@ -48,7 +48,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
 - [Text to Knowledge Graph](nvidia/txt2kg/)
 - [Unsloth on DGX Spark](nvidia/unsloth/)
 - [Vibe Coding in VS Code](nvidia/vibe-coding/)
-- [Install and use vLLM](nvidia/vllm/)
+- [vLLM for Inference](nvidia/vllm/)
 - [Vision-Language Model Fine-tuning](nvidia/vlm-finetuning/)
 - [Install VS Code](nvidia/vscode/)
 - [Video Search and Summarization](nvidia/vss/)
@@ -30,6 +30,11 @@
 - [Step 12. Validate API server](#step-12-validate-api-server)
 - [Step 14. Cleanup and rollback](#step-14-cleanup-and-rollback)
 - [Step 15. Next steps](#step-15-next-steps)
+- [Open WebUI for TensorRT-LLM](#open-webui-for-tensorrt-llm)
+- [Prerequisites](#prerequisites)
+- [Step 1. Launch Open WebUI container](#step-1-launch-open-webui-container)
+- [Step 2. Access the interface](#step-2-access-the-interface)
+- [Step 3. Cleanup and rollback](#step-3-cleanup-and-rollback)
 - [Troubleshooting](#troubleshooting)

 ---
@@ -129,12 +134,9 @@ If you see a permission denied error (something like `permission denied while tr

 ```bash
 sudo usermod -aG docker $USER
+newgrp docker
 ```

-> **Warning**: After running usermod, you must log out and log back in to start a new
-> session with updated group permissions.
-
-
 ### Step 2. Verify environment prerequisites

 Confirm your Spark device has the required GPU access and network connectivity for downloading
@@ -430,10 +432,9 @@ docker ps
 If you see a permission denied error (something like `permission denied while trying to connect to the Docker daemon socket`), add your user to the docker group:

 ```bash
-sudo usermod -aG docker nvidia
+sudo usermod -aG docker $USER
+newgrp docker
 ```
-Note: Replace `nvidia` with the username of the user you want to allow Docker access to.
-Note: After running usermod, you must log out and log back in to start a new session with updated group permissions.

 ### Step 3. Install NVIDIA Container Toolkit & setup Docker environment

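As a quick sanity check after the group change shown in the two hunks above (a sketch assuming a standard Docker install; `newgrp` only affects the current shell, so logging out and back in remains the reliable fix):

```bash
# Confirm the current user is now in the docker group
id -nG "$USER" | grep -qw docker && echo "in docker group"

# The daemon socket should now be reachable without sudo
docker ps
```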
@@ -464,7 +465,7 @@ Add or modify the file to include the nvidia runtime and GPU UUID (replace GPU-4
     },
     "default-runtime": "nvidia",
     "node-generic-resources": [
-        "gpu=GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1"
+        "NVIDIA_GPU=GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1"
     ]
 }
 ```
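The hunk above expects you to substitute your own GPU UUID. One common way to look it up (assuming `nvidia-smi` is available on the host):

```bash
# Lists each GPU with its UUID, e.g. "GPU 0: ... (UUID: GPU-45cbf7b3-...)"
nvidia-smi -L
```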
@@ -479,6 +480,8 @@ Finally, restart the Docker daemon to apply all changes:
 sudo systemctl restart docker
 ```

+Repeat these steps on all nodes.
+
 ### Step 5. Initialize Docker Swarm

 On whichever node you want to use as primary, run the following swarm initialization command
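After the restart above, a quick way to confirm the daemon picked up the new default runtime (a sketch; the Go-template field name assumes a reasonably current Docker CLI):

```bash
# Should print "nvidia" if /etc/docker/daemon.json was applied
docker info --format '{{.DefaultRuntime}}'
```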
@@ -499,17 +502,21 @@ To add a manager to this swarm, run 'docker swarm join-token manager' and follow

 ### Step 6. Join worker nodes and deploy

-Now we can proceed with setting up other nodes of your cluster.
+Now we can proceed with setting up the worker nodes of your cluster. Repeat these steps on all worker nodes.

 Run the command suggested by the docker swarm init on each worker node to join the Docker swarm
 ```bash
 docker swarm join --token <worker-token> <advertise-addr>:<port>
 ```

-On your primary node, deploy the TRT-LLM multi-node stack by downloading the [**docker-compose.yml**](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/docker-compose.yml) and [**trtllm-mn-entrypoint.sh**](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/trtllm-mn-entrypoint.sh) files into your home directory and running the following command:
+On both nodes, download the [**trtllm-mn-entrypoint.sh**](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/trtllm-mn-entrypoint.sh) script into your home directory and run the following command to make it executable:

 ```bash
 chmod +x $HOME/trtllm-mn-entrypoint.sh
+```
+
+On your primary node, deploy the TRT-LLM multi-node stack by downloading the [**docker-compose.yml**](https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets/-/blob/main/${MODEL}/assets/docker-compose.yml) file into your home directory and running the following command:
+```bash
 docker stack deploy -c $HOME/docker-compose.yml trtllm-multinode
 ```
 Note: Ensure you download both files into the same directory from which you are running the command.
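If you no longer have the join command printed by `docker swarm init`, it can be reprinted on the primary node (standard Docker Swarm behavior, not part of the diff):

```bash
# Prints the full "docker swarm join --token ..." command for worker nodes
docker swarm join-token worker
```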
@@ -527,6 +534,8 @@ oe9k5o6w41le trtllm-multinode_trtllm.1 nvcr.io/nvidia/tensorrt-llm/relea
 phszqzk97p83 trtllm-multinode_trtllm.2 nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3 spark-1b3b Running Running 2 minutes ago
 ```

+Note: If your "Current state" is not "Running", see troubleshooting section for more information.
+
 ### Step 7. Create hosts file

 You can check the available nodes using `docker node ls`
@@ -565,6 +574,7 @@ EOF'

 ### Step 10. Download model

+We can download a model using the following command. You can replace `nvidia/Qwen3-235B-A22B-FP4` with the model of your choice.
 ```bash
 ## Need to specify huggingface token for model download.
 export HF_TOKEN=<your-huggingface-token>
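Once the download in the hunk above finishes, one quick way to confirm the weights landed in the Hugging Face cache that the later cleanup step removes (a sketch; assumes the default host-side cache path `~/.cache/huggingface/hub`):

```bash
# Model snapshots are stored as models--<org>--<name> directories
ls "$HOME/.cache/huggingface/hub/" | grep -i qwen3
du -sh "$HOME/.cache/huggingface/hub/models--nvidia--"* 2>/dev/null
```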
@@ -588,34 +598,25 @@ docker exec \
   --max_num_tokens 32768 \
   --max_batch_size 4 \
   --extra_llm_api_options /tmp/extra-llm-api-config.yml \
-  --port 8000'
+  --port 8355'
 ```

-This will start the TensorRT-LLM server on port 8000. You can then make inference requests to `http://localhost:8000` using the OpenAI-compatible API format.
+This will start the TensorRT-LLM server on port 8355. You can then make inference requests to `http://localhost:8355` using the OpenAI-compatible API format.

+Note: You might see a warning such as `UCX WARN network device 'enp1s0f0np0' is not available, please use one or more of`. You can ignore this warning if your inference is successful, as it's related to only one of your two CX-7 ports being used, and the other being left unused.
+
 **Expected output:** Server startup logs and ready message.

 ### Step 12. Validate API server

-Verify successful deployment by checking container status and testing the API endpoint.
-
-```bash
-docker stack ps trtllm-multinode
-```
-
-**Expected output:** Two running containers in the stack across different nodes.
-
 Once the server is running, you can test it with a CURL request. Please ensure the CURL request is run on the primary node where you previously ran Step 11.

 ```bash
-curl -X POST http://localhost:8000/v1/chat/completions \
+curl -s http://localhost:8355/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "nvidia/Qwen3-235B-A22B-FP4",
-    "prompt": "What is artificial intelligence?",
-    "max_tokens": 100,
-    "temperature": 0.7,
-    "stream": false
+    "messages": [{"role": "user", "content": "Paris is great because"}],
+    "max_tokens": 64
 }'
 ```

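If you just want the generated text rather than the full JSON body, the same request from the hunk above can be piped through `jq` (a sketch; assumes `jq` is installed and the server returns the standard OpenAI chat-completions shape):

```bash
curl -s http://localhost:8355/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Qwen3-235B-A22B-FP4",
    "messages": [{"role": "user", "content": "Paris is great because"}],
    "max_tokens": 64
  }' | jq -r '.choices[0].message.content'
```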
@@ -639,7 +640,73 @@ rm -rf $HOME/.cache/huggingface/hub/models--nvidia--Qwen3*

 ### Step 15. Next steps

-Compare performance metrics between speculative decoding and baseline reports to quantify speed improvements. Use the multi-node setup as a foundation for deploying other large models requiring tensor parallelism, or scale to additional nodes for higher throughput workloads.
+You can now deploy other models on your DGX Spark cluster.

+## Open WebUI for TensorRT-LLM
+
+After setting up the TensorRT-LLM inference server in either single-node or multi-node configuration, you can deploy Open WebUI to interact with your models through a user-friendly interface.
+
+### Prerequisites
+
+- TensorRT-LLM inference server running and accessible at http://localhost:8355
+- Docker installed and configured (see earlier steps)
+- Port 8080 available on your DGX Spark
+
+### Step 1. Launch Open WebUI container
+
+Run the following command on the DGX Spark node where you have the TensorRT-LLM inference server running.
+For multi-node setup, this would be the primary node.
+
+Note: If you used a different port for your OpenAI-compatible API server, adjust `OPENAI_API_BASE_URL="http://localhost:8355/v1"` to match the IP and port of your TensorRT-LLM inference server.
+
+```bash
+docker run \
+  -d \
+  -e OPENAI_API_BASE_URL="http://localhost:8355/v1" \
+  -v open-webui:/app/backend/data \
+  --network host \
+  --add-host=host.docker.internal:host-gateway \
+  --name open-webui \
+  --restart always \
+  ghcr.io/open-webui/open-webui:main
+```
+
+This command:
+- Connects to your OpenAI-compatible API server for TensorRT-LLM at http://localhost:8355
+- Provides access to the Open WebUI interface at http://localhost:8080
+- Persists chat data in a Docker volume
+- Enables automatic container restart
+- Uses the latest Open WebUI image
+
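Not part of the diff, but a quick way to confirm the container above came up cleanly before opening the browser (standard Docker commands):

```bash
# STATUS should show "Up ..."; the logs should end with the web server listening
docker ps --filter name=open-webui
docker logs --tail 50 open-webui
```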
+### Step 2. Access the interface
+
+Open your web browser and navigate to:
+
+```
+http://localhost:8080
+```
+
+You should see the Open WebUI interface at http://localhost:8080 where you can:
+- Chat with your deployed models
+- Adjust model parameters
+- View chat history
+- Manage model configurations
+
+You can select your model(s) from the dropdown menu on the top left corner. That's all you need to do to start using Open WebUI with your deployed models.
+
+**Note:** If accessing from a remote machine, replace localhost with your DGX Spark's IP address.
+
+### Step 3. Cleanup and rollback
+
+**Warning:** This removes all chat data and may require re-uploading for future runs.
+
+Remove the container by using the following command:
+```bash
+docker stop open-webui
+docker rm open-webui
+docker volume rm open-webui
+docker rmi ghcr.io/open-webui/open-webui:main
+```
+
 ## Troubleshooting

@@ -663,6 +730,10 @@ Compare performance metrics between speculative decoding and baseline reports to
 | Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
 | "CUDA out of memory" errors | Insufficient GPU memory | Reduce `--max_batch_size` or `--max_num_tokens` |
 | Container exits immediately | Missing entrypoint script | Ensure `trtllm-mn-entrypoint.sh` download succeeded and has executable permissions, also ensure you are not running the container already on your node. If port 2233 is already utilized, the entrypoint script will not start. |
+| Error response from daemon: error while validating Root CA Certificate | System clock out of sync or expired certificates | Update system time to sync with NTP server `sudo timedatectl set-ntp true` |
+| "invalid mount config for type 'bind'" | Missing or non-executable entrypoint script | Run `docker inspect <container_id>` to see full error message. Verify `trtllm-mn-entrypoint.sh` exists on both nodes in your home directory (`ls -la $HOME/trtllm-mn-entrypoint.sh`) and has executable permissions (`chmod +x $HOME/trtllm-mn-entrypoint.sh`) |
+| "task: non-zero exit (255)" | Container exit with error code 255 | Check container logs with `docker ps -a --filter "name=trtllm-multinode_trtllm"` to get container ID, then `docker logs <container_id>` to see detailed error messages |
+| Docker state stuck in "Pending" with "no suitable node (insufficien...)" | Docker daemon not properly configured for GPU access | Verify steps 2-4 were completed successfully and check that `/etc/docker/daemon.json` contains correct GPU configuration |

 > **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
 > With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
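The log-checking workflow from the new troubleshooting rows above, as a single runnable sketch (assumes the stack name `trtllm-multinode` used earlier in this playbook):

```bash
# Find the most recent task container for the stack, then dump its last log lines
CID=$(docker ps -a --filter "name=trtllm-multinode_trtllm" --format '{{.ID}}' | head -n1)
docker logs "$CID" 2>&1 | tail -n 100
```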
@@ -1,4 +1,4 @@
-# Install and use vLLM
+# vLLM for Inference

 > Use a container or build vLLM from source for Spark
