Mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git, synced 2026-04-21 17:43:52 +00:00

chore: Regenerate all playbooks

This commit is contained in:
parent 6ef03c813f
commit 49bdd1d7d1
@@ -46,7 +46,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
 - [Text to Knowledge Graph](nvidia/txt2kg/)
 - [Unsloth on DGX Spark](nvidia/unsloth/)
 - [Vibe Coding in VS Code](nvidia/vibe-coding/)
-- [Install and Use vLLM for Inference](nvidia/vllm/)
+- [vLLM for Inference](nvidia/vllm/)
 - [VS Code](nvidia/vscode/)
 - [Build a Video Search and Summarization (VSS) Agent](nvidia/vss/)

@@ -93,13 +93,13 @@ Verify the virtual environment is active by checking the command prompt shows `(

 ## Step 3. Install PyTorch with CUDA support

-Install PyTorch with CUDA 12.9 support.
+Install PyTorch with CUDA 13.0 support.

 ```bash
 pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu130
 ```

-This installation targets CUDA 12.9 compatibility with Blackwell architecture GPUs.
+This installation targets CUDA 13.0 compatibility with Blackwell architecture GPUs.

 ## Step 4. Clone ComfyUI repository

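The updated install step above can be sanity-checked by importing torch and querying the CUDA runtime. This is a sketch, assuming `python3` is on the PATH; it degrades gracefully if the wheel is not installed (on a DGX Spark with the cu130 wheel it should report `cuda: True`):

```shell
# Sanity-check the PyTorch install from Step 3 (tolerant of a missing wheel).
python3 - <<'EOF'
try:
    import torch
    # With the cu130 wheel on Blackwell hardware, this should print "cuda: True".
    print("torch", torch.__version__, "| cuda:", torch.cuda.is_available())
except ImportError:
    print("torch not installed")
EOF
```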
@@ -67,6 +67,8 @@ applications, and manage your DGX Spark remotely from your laptop.
 - **Time estimate:** 5-10 minutes
 - **Risk level:** Low - SSH setup involves credential configuration but no system-level changes to the DGX Spark device
 - **Rollback:** SSH key removal can be done by editing `~/.ssh/authorized_keys` on your DGX Spark.
+- **Last Updated:** 10/28/2025
+* Minor copyedits

 ## Connect with NVIDIA Sync
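The rollback described in the hunk above can be rehearsed safely before touching the real file. The sketch below runs against a throwaway copy (the key comments `laptop` and `workstation` are made up); on the Spark itself you would apply the same `grep -v` to `~/.ssh/authorized_keys`:

```shell
# Rehearse SSH-key removal on a demo file; swap AUTH for ~/.ssh/authorized_keys on the Spark.
AUTH=/tmp/authorized_keys.demo
printf '%s\n' \
  'ssh-ed25519 AAAAC3demo1 laptop' \
  'ssh-ed25519 AAAAC3demo2 workstation' > "$AUTH"

# Keep every line except the key whose trailing comment is "laptop".
grep -v ' laptop$' "$AUTH" > "$AUTH.tmp" && mv "$AUTH.tmp" "$AUTH"
cat "$AUTH"
```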
@@ -52,6 +52,9 @@ All required files for this playbook can be found [here on GitHub](https://githu
 - **Rollback:** Network changes can be reversed by removing netplan configs or IP assignments

+- **Last Updated:** 11/24/2025
+* Minor copyedits
+
 ## Run on Two Sparks

 ## Step 1. Ensure Same Username on Both Systems
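The netplan rollback mentioned above can be sketched as follows. The config filename is hypothetical, and `NETPLAN_DIR` is parameterized so the steps can be rehearsed in a scratch directory; on the Spark itself use `NETPLAN_DIR=/etc/netplan` with sudo and finish with `sudo netplan apply`:

```shell
# Rollback sketch: delete the playbook's netplan config, then re-apply networking.
NETPLAN_DIR=${NETPLAN_DIR:-/tmp/netplan-demo}
mkdir -p "$NETPLAN_DIR"
touch "$NETPLAN_DIR/99-two-sparks.yaml"   # stand-in for the added config (name assumed)
rm -f "$NETPLAN_DIR/99-two-sparks.yaml"   # the actual rollback step
ls -A "$NETPLAN_DIR"
```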
@@ -34,6 +34,8 @@ You will accelerate popular machine learning algorithms and data analytics opera
 * Data download slowness or failure due to network issues
 * Kaggle API generation failure requiring retries
 * **Rollback:** No permanent system changes made during normal usage.
+* **Last Updated:** 11/07/2025
+* Minor copyedits

 ## Instructions
@@ -47,6 +47,8 @@ The setup includes:
 * Docker permission issues may require user group changes and session restart
 * The recipe would require hyperparameter tuning and a high-quality dataset for the best results
 * **Rollback**: Stop and remove Docker containers, delete downloaded models if needed.
+* **Last Updated:** 11/07/2025
+* Minor copyedits

 ## Instructions
@@ -65,6 +65,8 @@ All required assets can be found [here on GitHub](https://github.com/NVIDIA/dgx-
 * Package dependency conflicts in Python environment
 * Performance validation may require architecture-specific optimizations
 * **Rollback:** Container environments provide isolation; remove containers and restart to reset state.
+* **Last Updated:** 11/07/2025
+* Minor copyedits

 ## Instructions
@@ -67,6 +67,8 @@ model adaptation for specialized domains while leveraging hardware-specific opti
 * **Duration:** 30-60 minutes for initial setup, 1-7 hours for training depending on model size and dataset.
 * **Risks:** Model downloads require significant bandwidth and storage. Training may consume substantial GPU memory and require parameter tuning for hardware constraints.
 * **Rollback:** Remove Docker containers and cloned repositories. Training checkpoints are saved locally and can be deleted to reclaim storage space.
+* **Last Updated:** 10/12/2025
+* First publication

 ## Instructions
@@ -65,6 +65,9 @@ All necessary files can be found in the TensorRT repository [here on GitHub](htt
 - Remove downloaded models from HuggingFace cache
 - Then exit the container environment

+* **Last Updated:** 10/12/2025
+* First publication
+
 ## Instructions

 ## Step 1. Launch the TensorRT container environment
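The cache-cleanup rollback listed above can be sketched as a one-liner. This assumes the default cache location (`HF_HOME` unset, so models live under `~/.cache/huggingface/hub`); exit the container shell manually afterwards:

```shell
# Rollback sketch: clear the Hugging Face model cache (default location assumed).
rm -rf "$HOME/.cache/huggingface/hub"
echo "cache removed"
```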
@@ -41,9 +41,11 @@ and proper GPU topology detection.

 ## Time & risk

-- **Duration**: 30 minutes for setup and validation
-- **Risk level**: Medium - involves network configuration changes
-- **Rollback**: The NCCL & NCCL Tests repositories can be deleted from DGX Spark
+* **Duration**: 30 minutes for setup and validation
+* **Risk level**: Medium - involves network configuration changes
+* **Rollback**: The NCCL & NCCL Tests repositories can be deleted from DGX Spark
+* **Last Updated:** 10/12/2025
+* First publication

 ## Run on two Sparks
@@ -47,6 +47,8 @@ All necessary files for the playbook can be found [here on GitHub](https://githu
 * **Duration:** 45-90 minutes for complete setup and initial model fine-tuning
 * **Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting, distributed training setup complexity increases with multi-node configurations
 * **Rollback:** Virtual environments can be completely removed; no system-level changes are made to the host system beyond package installations.
+* **Last Updated:** 10/22/2025
+* Minor copyedits

 ## Instructions
@@ -44,13 +44,16 @@ the powerful GPU capabilities of your Spark device without complex network confi

 ## Time & risk

-**Duration**: 10-15 minutes for initial setup, 2-3 minutes for model download (varies by model size)
+* **Duration**: 10-15 minutes for initial setup, 2-3 minutes for model download (varies by model size)

-**Risk level**: Low - No system-level changes, easily reversible by stopping the custom app
+* **Risk level**: Low - No system-level changes, easily reversible by stopping the custom app

-**Rollback**: Stop the custom app in NVIDIA Sync and uninstall Ollama with standard package
+* **Rollback**: Stop the custom app in NVIDIA Sync and uninstall Ollama with standard package
 removal if needed

+* **Last Updated:** 10/12/2025
+* First publication

 ## Instructions

 ## Step 1. Verify Ollama installation status
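The verification step named above can be sketched as a quick status check: look for the `ollama` binary, then probe the local API (11434 is Ollama's default port; a connection failure only means the server is not running):

```shell
# Status-check sketch: is ollama installed, and does the local API answer?
if command -v ollama >/dev/null 2>&1; then
  ollama --version
else
  echo "ollama not installed"
fi
curl -s http://localhost:11434/api/version || echo "ollama API not reachable"
```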
@@ -38,6 +38,8 @@ You will have a fully functional Open WebUI installation running on your DGX Spa
 * **Risks**:
 * Docker permission issues may require user group changes and session restart
 * Large model downloads may take significant time depending on network speed
+* **Last Updated:** 10/28/2025
+* Minor copyedits

 ## Set up Open WebUI on Remote Spark with NVIDIA Sync
@@ -51,6 +51,8 @@ All files required for fine-tuning are included in the folder in [the GitHub rep

 * **Time estimate:** 30-45 mins for setup and running fine-tuning. Fine-tuning run time varies depending on model size
 * **Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting.
+* **Last Updated:** 11/07/2025
+* Fix broken commands to access files from GitHub

 ## Instructions
@@ -58,7 +58,7 @@ architectures.
 * **Estimated time:** 30-45 minutes (including AI Workbench installation if needed)
 * **Risk level:** Low - Uses pre-built containers and established APIs
 * **Rollback:** Simply delete the cloned project from AI Workbench to remove all components. No system changes are made outside the AI Workbench environment.
-* **Last Updated:** 11/21/2025
+* **Last Updated:** 10/28/2025
 * Minor copyedits

 ## Instructions
@@ -55,6 +55,8 @@ These examples demonstrate how to accelerate large language model inference whil
 * **Duration:** 10-20 minutes for setup, additional time for model downloads (varies by network speed)
 * **Risks:** GPU memory exhaustion with large models, container registry access issues, network timeouts during downloads
 * **Rollback:** Stop Docker containers and optionally clean up downloaded model cache.
+* **Last Updated:** 10/12/2025
+* First publication

 ## Instructions
@@ -73,7 +73,7 @@ all traffic automatically encrypted and NAT traversal handled transparently.
 * Network connectivity issues during initial setup
 * Authentication provider service dependencies
 * **Rollback**: Tailscale can be completely removed with `sudo apt remove tailscale` and all network routing automatically reverts to default settings.
-* **Last Updated:** 11/21/2025
+* **Last Updated:** 11/07/2025
 * Minor copyedits

 ## Instructions
@@ -1,6 +1,6 @@
 # TRT LLM for Inference

-> Install and configure TRT LLM to run on a single Spark or on two Sparks
+> Install and use TensorRT-LLM on DGX Spark

 ## Table of Contents
@@ -117,6 +117,8 @@ Reminder: not all model architectures are supported for NVFP4 quantization.
 * **Duration**: 45-60 minutes for setup and API server deployment
 * **Risk level**: Medium - container pulls and model downloads may fail due to network issues
 * **Rollback**: Stop inference servers and remove downloaded models to free resources.
+* **Last Updated:** 10/18/2025
+* Fix broken links

 ## Single Spark
@@ -55,6 +55,9 @@ The Python test script can be found [here on GitHub](https://github.com/NVIDIA/d
 * CUDA toolkit configuration issues may prevent kernel compilation
 * Memory constraints on smaller models require batch size adjustments
 * **Rollback**: Uninstall packages with `pip uninstall unsloth torch torchvision`.
+* **Last Updated:** 11/07/2025
+* Add required python dependencies
+* Fix broken commands to access files on GitHub

 ## Instructions
@@ -43,6 +43,8 @@ You'll have a fully configured DGX Spark system capable of:
 * **Duration:** About 30 minutes
 * **Risks:** Data download slowness or failure due to network issues
 * **Rollback:** No permanent system changes made during normal usage.
+* **Last Updated:** 10/21/2025
+* First publication

 ## Instructions
@@ -1,6 +1,6 @@
-# Install and Use vLLM for Inference
+# vLLM for Inference

-> Use a container or build vLLM from source for Spark
+> Install and use vLLM on DGX Spark

 ## Table of Contents
@@ -52,6 +52,8 @@ support for ARM64.
 * **Duration:** 30 minutes for Docker approach
 * **Risks:** Container registry access requires internal credentials
 * **Rollback:** Container approach is non-destructive.
+* **Last Updated:** 10/18/2025
+* Minor copyedits

 ## Instructions
@@ -52,6 +52,9 @@ You will deploy NVIDIA's VSS AI Blueprint on NVIDIA Spark hardware with Blackwel
 * Network configuration conflicts if shared network already exists
 * Remote API endpoints may have rate limits or connectivity issues (hybrid deployment)
 * **Rollback:** Stop all containers with `docker compose down`, remove shared network with `docker network rm vss-shared-network`, and clean up temporary media directories.
+* **Last Updated:** 10/18/2025
+* Update required OS and Driver versions
+* Add instructions for fully local VSS deployment

 ## Instructions
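The VSS rollback commands named above can be combined into one idempotent sketch. The compose project and network name come from the text; the media path is hypothetical — substitute wherever you staged videos:

```shell
# Rollback sketch; guards make it safe to re-run even if pieces are already gone.
docker compose down 2>/dev/null || true
docker network rm vss-shared-network 2>/dev/null || true
rm -rf /tmp/vss-media   # hypothetical temporary media directory
echo "vss cleanup done"
```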