Mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git
Synced 2026-04-24 10:53:52 +00:00

chore: Regenerate all playbooks

Parent: 6ef03c813f
Commit: 49bdd1d7d1
@@ -46,7 +46,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
 - [Text to Knowledge Graph](nvidia/txt2kg/)
 - [Unsloth on DGX Spark](nvidia/unsloth/)
 - [Vibe Coding in VS Code](nvidia/vibe-coding/)
-- [Install and Use vLLM for Inference](nvidia/vllm/)
+- [vLLM for Inference](nvidia/vllm/)
 - [VS Code](nvidia/vscode/)
 - [Build a Video Search and Summarization (VSS) Agent](nvidia/vss/)
 
@@ -93,13 +93,13 @@ Verify the virtual environment is active by checking the command prompt shows `(
 ## Step 3. Install PyTorch with CUDA support
 
-Install PyTorch with CUDA 12.9 support.
+Install PyTorch with CUDA 13.0 support.
 
 ```bash
 pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu130
 ```
 
-This installation targets CUDA 12.9 compatibility with Blackwell architecture GPUs.
+This installation targets CUDA 13.0 compatibility with Blackwell architecture GPUs.
 
 ## Step 4. Clone ComfyUI repository
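The `cu130` suffix in the `pip3 install` index URL above encodes the CUDA version (13.0). A minimal sketch of the mapping, assuming PyTorch's usual `cu<major><minor>` wheel-tag convention:

```shell
# Derive the PyTorch wheel index URL from a CUDA version string.
# Assumes the standard cu<major><minor> tag convention (e.g. 13.0 -> cu130).
CUDA_VERSION=13.0
TAG="cu$(echo "$CUDA_VERSION" | tr -d '.')"
echo "https://download.pytorch.org/whl/${TAG}"   # → https://download.pytorch.org/whl/cu130
```

This is why the commit's text change from "CUDA 12.9" to "CUDA 13.0" matters: the unchanged `cu130` URL only matches the 13.0 wording.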
@@ -67,6 +67,8 @@ applications, and manage your DGX Spark remotely from your laptop.
 - **Time estimate:** 5-10 minutes
 - **Risk level:** Low - SSH setup involves credential configuration but no system-level changes to the DGX Spark device
 - **Rollback:** SSH key removal can be done by editing `~/.ssh/authorized_keys` on your DGX Spark.
+- **Last Updated:** 10/28/2025
+* Minor copyedits
 
 ## Connect with NVIDIA Sync
 
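The rollback described above amounts to deleting one line from `~/.ssh/authorized_keys`. A minimal sketch of that edit, run against a scratch file so nothing real is touched; `user@laptop` is a hypothetical key comment, so match on the comment of the key you actually want to revoke:

```shell
# Revoke one SSH key by removing its line from an authorized_keys file.
# Demonstrated on a temp copy; on the DGX Spark you would edit the real file.
AUTH=$(mktemp)
printf 'ssh-ed25519 AAAAC3... user@laptop\nssh-ed25519 AAAAC3... user@desktop\n' > "$AUTH"
grep -v 'user@laptop' "$AUTH" > "$AUTH.new" && mv "$AUTH.new" "$AUTH"
cat "$AUTH"   # only the desktop key remains
```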
@@ -52,6 +52,9 @@ All required files for this playbook can be found [here on GitHub](https://githu
 
 - **Rollback:** Network changes can be reversed by removing netplan configs or IP assignments
 
+- **Last Updated:** 11/24/2025
+* Minor copyedits
+
 ## Run on Two Sparks
 
 ## Step 1. Ensure Same Username on Both Systems
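The netplan rollback noted above boils down to deleting the config file the playbook added and re-applying. A sketch, using a scratch directory as a stand-in for `/etc/netplan`; the file name `99-cluster-link.yaml` is hypothetical:

```shell
# Roll back a netplan change by removing the added config file.
# Scratch directory used here; on a real system this is /etc/netplan
# followed by `sudo netplan apply`.
NETPLAN_DIR=$(mktemp -d)
touch "$NETPLAN_DIR/99-cluster-link.yaml"   # the config the playbook added
rm "$NETPLAN_DIR/99-cluster-link.yaml"      # rollback; then: sudo netplan apply
ls -A "$NETPLAN_DIR" | wc -l                # 0: directory is empty again
```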
@@ -34,6 +34,8 @@ You will accelerate popular machine learning algorithms and data analytics opera
 * Data download slowness or failure due to network issues
 * Kaggle API generation failure requiring retries
 * **Rollback:** No permanent system changes made during normal usage.
+* **Last Updated:** 11/07/2025
+* Minor copyedits
 
 ## Instructions
 
@@ -47,6 +47,8 @@ The setup includes:
 * Docker permission issues may require user group changes and session restart
 * The recipe would require hyperparameter tuning and a high-quality dataset for the best results
 * **Rollback**: Stop and remove Docker containers, delete downloaded models if needed.
+* **Last Updated:** 11/07/2025
+* Minor copyedits
 
 ## Instructions
 
@@ -65,6 +65,8 @@ All required assets can be found [here on GitHub](https://github.com/NVIDIA/dgx-
 * Package dependency conflicts in Python environment
 * Performance validation may require architecture-specific optimizations
 * **Rollback:** Container environments provide isolation; remove containers and restart to reset state.
+* **Last Updated:** 11/07/2025
+* Minor copyedits
 
 ## Instructions
 
@@ -67,6 +67,8 @@ model adaptation for specialized domains while leveraging hardware-specific opti
 * **Duration:** 30-60 minutes for initial setup, 1-7 hours for training depending on model size and dataset.
 * **Risks:** Model downloads require significant bandwidth and storage. Training may consume substantial GPU memory and require parameter tuning for hardware constraints.
 * **Rollback:** Remove Docker containers and cloned repositories. Training checkpoints are saved locally and can be deleted to reclaim storage space.
+* **Last Updated:** 10/12/2025
+* First publication
 
 ## Instructions
 
@@ -65,6 +65,9 @@ All necessary files can be found in the TensorRT repository [here on GitHub](htt
 - Remove downloaded models from HuggingFace cache
 - Then exit the container environment
 
+* **Last Updated:** 10/12/2025
+* First publication
+
 ## Instructions
 
 ## Step 1. Launch the TensorRT container environment
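"Remove downloaded models from HuggingFace cache" above usually means deleting the model's directory under `~/.cache/huggingface/hub`. A sketch against a scratch directory, with a hypothetical model name, so nothing real is deleted:

```shell
# Free disk space by deleting one cached model from the HuggingFace hub cache.
# Scratch directory used here; the real default cache is ~/.cache/huggingface/hub.
HF_HUB=$(mktemp -d)
mkdir -p "$HF_HUB/models--org--example-model/snapshots"   # hypothetical cached model
rm -rf "$HF_HUB/models--org--example-model"
ls -A "$HF_HUB" | wc -l   # 0: cache entry removed
```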
@@ -41,9 +41,11 @@ and proper GPU topology detection.
 
 ## Time & risk
 
-- **Duration**: 30 minutes for setup and validation
-- **Risk level**: Medium - involves network configuration changes
-- **Rollback**: The NCCL & NCCL Tests repositories can be deleted from DGX Spark
+* **Duration**: 30 minutes for setup and validation
+* **Risk level**: Medium - involves network configuration changes
+* **Rollback**: The NCCL & NCCL Tests repositories can be deleted from DGX Spark
+* **Last Updated:** 10/12/2025
+* First publication
 
 ## Run on two Sparks
 
@@ -47,6 +47,8 @@ All necessary files for the playbook can be found [here on GitHub](https://githu
 * **Duration:** 45-90 minutes for complete setup and initial model fine-tuning
 * **Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting, distributed training setup complexity increases with multi-node configurations
 * **Rollback:** Virtual environments can be completely removed; no system-level changes are made to the host system beyond package installations.
+* **Last Updated:** 10/22/2025
+* Minor copyedits
 
 ## Instructions
 
@@ -44,13 +44,16 @@ the powerful GPU capabilities of your Spark device without complex network confi
 
 ## Time & risk
 
-**Duration**: 10-15 minutes for initial setup, 2-3 minutes for model download (varies by model size)
+* **Duration**: 10-15 minutes for initial setup, 2-3 minutes for model download (varies by model size)
 
-**Risk level**: Low - No system-level changes, easily reversible by stopping the custom app
+* **Risk level**: Low - No system-level changes, easily reversible by stopping the custom app
 
-**Rollback**: Stop the custom app in NVIDIA Sync and uninstall Ollama with standard package
+* **Rollback**: Stop the custom app in NVIDIA Sync and uninstall Ollama with standard package
 removal if needed
 
+* **Last Updated:** 10/12/2025
+* First publication
+
 ## Instructions
 
 ## Step 1. Verify Ollama installation status
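Step 1 of this playbook verifies whether Ollama is already installed. A minimal sketch of that check, using only generic shell (`command -v`):

```shell
# Check whether the ollama CLI is already on PATH before installing it.
if command -v ollama >/dev/null 2>&1; then
  echo "ollama installed at $(command -v ollama)"
else
  echo "ollama not installed"
fi
```

Either branch tells you which path the playbook's setup needs to take next.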
@@ -38,6 +38,8 @@ You will have a fully functional Open WebUI installation running on your DGX Spa
 * **Risks**:
 * Docker permission issues may require user group changes and session restart
 * Large model downloads may take significant time depending on network speed
+* **Last Updated:** 10/28/2025
+* Minor copyedits
 
 ## Set up Open WebUI on Remote Spark with NVIDIA Sync
 
@@ -51,6 +51,8 @@ All files required for fine-tuning are included in the folder in [the GitHub rep
 
 * **Time estimate:** 30-45 mins for setup and running fine-tuning. Fine-tuning run time varies depending on model size
 * **Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting.
+* **Last Updated:** 11/07/2025
+* Fix broken commands to access files from GitHub
 
 ## Instructions
 
@@ -58,7 +58,7 @@ architectures.
 * **Estimated time:** 30-45 minutes (including AI Workbench installation if needed)
 * **Risk level:** Low - Uses pre-built containers and established APIs
 * **Rollback:** Simply delete the cloned project from AI Workbench to remove all components. No system changes are made outside the AI Workbench environment.
-* **Last Updated:** 11/21/2025
+* **Last Updated:** 10/28/2025
 * Minor copyedits
 
 ## Instructions
@@ -55,6 +55,8 @@ These examples demonstrate how to accelerate large language model inference whil
 * **Duration:** 10-20 minutes for setup, additional time for model downloads (varies by network speed)
 * **Risks:** GPU memory exhaustion with large models, container registry access issues, network timeouts during downloads
 * **Rollback:** Stop Docker containers and optionally clean up downloaded model cache.
+* **Last Updated:** 10/12/2025
+* First publication
 
 ## Instructions
 
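The GPU memory risk flagged above can be sanity-checked with a back-of-envelope estimate before downloading: weight memory is roughly parameter count times bytes per parameter (the sizes below are illustrative, and activations and KV cache add more on top):

```shell
# Rough lower bound on GPU memory for model weights alone.
# 8B parameters at FP16/BF16 (2 bytes each) ≈ 16 GB; quantized formats shrink this.
PARAMS_B=8          # model size in billions of parameters (illustrative)
BYTES_PER_PARAM=2   # FP16/BF16 weights
echo "approx weight memory: $((PARAMS_B * BYTES_PER_PARAM)) GB"
```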
@@ -73,7 +73,7 @@ all traffic automatically encrypted and NAT traversal handled transparently.
 * Network connectivity issues during initial setup
 * Authentication provider service dependencies
 * **Rollback**: Tailscale can be completely removed with `sudo apt remove tailscale` and all network routing automatically reverts to default settings.
-* **Last Updated:** 11/21/2025
+* **Last Updated:** 11/07/2025
 * Minor copyedits
 
 ## Instructions
@@ -1,6 +1,6 @@
 # TRT LLM for Inference
 
-> Install and configure TRT LLM to run on a single Spark or on two Sparks
+> Install and use TensorRT-LLM on DGX Spark
 
 ## Table of Contents
 
@@ -117,6 +117,8 @@ Reminder: not all model architectures are supported for NVFP4 quantization.
 * **Duration**: 45-60 minutes for setup and API server deployment
 * **Risk level**: Medium - container pulls and model downloads may fail due to network issues
 * **Rollback**: Stop inference servers and remove downloaded models to free resources.
+* **Last Updated:** 10/18/2025
+* Fix broken links
 
 ## Single Spark
 
@@ -55,6 +55,9 @@ The Python test script can be found [here on GitHub](https://github.com/NVIDIA/d
 * CUDA toolkit configuration issues may prevent kernel compilation
 * Memory constraints on smaller models require batch size adjustments
 * **Rollback**: Uninstall packages with `pip uninstall unsloth torch torchvision`.
+* **Last Updated:** 11/07/2025
+* Add required python dependencies
+* Fix broken commands to access files on GitHub
 
 ## Instructions
 
@@ -43,6 +43,8 @@ You'll have a fully configured DGX Spark system capable of:
 * **Duration:** About 30 minutes
 * **Risks:** Data download slowness or failure due to network issues
 * **Rollback:** No permanent system changes made during normal usage.
+* **Last Updated:** 10/21/2025
+* First publication
 
 ## Instructions
 
@@ -1,6 +1,6 @@
-# Install and Use vLLM for Inference
+# vLLM for Inference
 
-> Use a container or build vLLM from source for Spark
+> Install and use vLLM on DGX Spark
 
 ## Table of Contents
 
@@ -52,6 +52,8 @@ support for ARM64.
 * **Duration:** 30 minutes for Docker approach
 * **Risks:** Container registry access requires internal credentials
 * **Rollback:** Container approach is non-destructive.
+* **Last Updated:** 10/18/2025
+* Minor copyedits
 
 ## Instructions
 
@@ -52,6 +52,9 @@ You will deploy NVIDIA's VSS AI Blueprint on NVIDIA Spark hardware with Blackwel
 * Network configuration conflicts if shared network already exists
 * Remote API endpoints may have rate limits or connectivity issues (hybrid deployment)
 * **Rollback:** Stop all containers with `docker compose down`, remove shared network with `docker network rm vss-shared-network`, and clean up temporary media directories.
+* **Last Updated:** 10/18/2025
+* Update required OS and Driver versions
+* Add instructions to fully local VSS deployment
 
 ## Instructions
 