chore: Regenerate all playbooks

This commit is contained in:
GitLab CI 2025-12-11 20:20:28 +00:00
parent 6ef03c813f
commit 49bdd1d7d1
22 changed files with 55 additions and 14 deletions

View File

@ -46,7 +46,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
- [Text to Knowledge Graph](nvidia/txt2kg/)
- [Unsloth on DGX Spark](nvidia/unsloth/)
- [Vibe Coding in VS Code](nvidia/vibe-coding/)
-- [Install and Use vLLM for Inference](nvidia/vllm/)
+- [vLLM for Inference](nvidia/vllm/)
- [VS Code](nvidia/vscode/)
- [Build a Video Search and Summarization (VSS) Agent](nvidia/vss/)

View File

@ -93,13 +93,13 @@ Verify the virtual environment is active by checking the command prompt shows `(
## Step 3. Install PyTorch with CUDA support
-Install PyTorch with CUDA 12.9 support.
+Install PyTorch with CUDA 13.0 support.
```bash
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu130
```
-This installation targets CUDA 12.9 compatibility with Blackwell architecture GPUs.
+This installation targets CUDA 13.0 compatibility with Blackwell architecture GPUs.
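Once the wheel is installed, a quick sanity check confirms the CUDA-enabled build is visible. A sketch (run it inside the same virtual environment; it degrades gracefully if `torch` is missing):

```shell
# Print the installed torch version and whether CUDA is usable.
python3 - <<'EOF'
try:
    import torch
    print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
except ImportError:
    print("torch is not installed in this environment")
EOF
```

On a correctly set up DGX Spark the second field should report `CUDA available: True`.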
## Step 4. Clone ComfyUI repository

View File

@ -67,6 +67,8 @@ applications, and manage your DGX Spark remotely from your laptop.
- **Time estimate:** 5-10 minutes
- **Risk level:** Low - SSH setup involves credential configuration but no system-level changes to the DGX Spark device
- **Rollback:** SSH key removal can be done by editing `~/.ssh/authorized_keys` on your DGX Spark.
+- **Last Updated:** 10/28/2025
+* Minor copyedits
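The rollback noted above boils down to filtering one line out of `authorized_keys`. A minimal sketch, using a temporary demo file and a made-up key comment (`old-laptop`) rather than the live `~/.ssh/authorized_keys`:

```shell
# Demo: remove one key from an authorized_keys-style file by its comment field.
KEYS_FILE=$(mktemp)
printf '%s\n' \
  "ssh-ed25519 AAAAC3... old-laptop" \
  "ssh-ed25519 AAAAC3... new-laptop" > "$KEYS_FILE"

# grep -v drops the matching line; write to a temp file, then move it back.
grep -v 'old-laptop' "$KEYS_FILE" > "$KEYS_FILE.tmp" && mv "$KEYS_FILE.tmp" "$KEYS_FILE"

cat "$KEYS_FILE"   # only the new-laptop entry remains
rm -f "$KEYS_FILE"
```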
## Connect with NVIDIA Sync

View File

@ -52,6 +52,9 @@ All required files for this playbook can be found [here on GitHub](https://githu
- **Rollback:** Network changes can be reversed by removing netplan configs or IP assignments
+- **Last Updated:** 11/24/2025
+* Minor copyedits
## Run on Two Sparks
## Step 1. Ensure Same Username on Both Systems

View File

@ -34,6 +34,8 @@ You will accelerate popular machine learning algorithms and data analytics opera
* Data download slowness or failure due to network issues
* Kaggle API generation failure requiring retries
* **Rollback:** No permanent system changes made during normal usage.
+* **Last Updated:** 11/07/2025
+* Minor copyedits
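For the flaky-network risks listed above (slow downloads, Kaggle API calls needing retries), a small retry wrapper is often enough. A sketch, not part of the playbook itself:

```shell
# retry N CMD... : run CMD up to N times, pausing briefly between attempts.
retry() {
  local attempts="$1" n=0
  shift
  until "$@"; do
    n=$((n + 1))
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $attempts attempts" >&2
      return 1
    fi
    sleep 2
  done
}

# Example: wrap a hypothetical download command.
retry 3 true && echo "download succeeded"   # prints "download succeeded"
```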
## Instructions

View File

@ -47,6 +47,8 @@ The setup includes:
* Docker permission issues may require user group changes and session restart
* The recipe would require hyperparameter tuning and a high-quality dataset for the best results
* **Rollback**: Stop and remove Docker containers, delete downloaded models if needed.
+* **Last Updated:** 11/07/2025
+* Minor copyedits
## Instructions

View File

@ -65,6 +65,8 @@ All required assets can be found [here on GitHub](https://github.com/NVIDIA/dgx-
* Package dependency conflicts in Python environment
* Performance validation may require architecture-specific optimizations
* **Rollback:** Container environments provide isolation; remove containers and restart to reset state.
+* **Last Updated:** 11/07/2025
+* Minor copyedits
## Instructions

View File

@ -67,6 +67,8 @@ model adaptation for specialized domains while leveraging hardware-specific opti
* **Duration:** 30-60 minutes for initial setup, 1-7 hours for training depending on model size and dataset.
* **Risks:** Model downloads require significant bandwidth and storage. Training may consume substantial GPU memory and require parameter tuning for hardware constraints.
* **Rollback:** Remove Docker containers and cloned repositories. Training checkpoints are saved locally and can be deleted to reclaim storage space.
+* **Last Updated:** 10/12/2025
+* First publication
## Instructions

View File

@ -65,6 +65,9 @@ All necessary files can be found in the TensorRT repository [here on GitHub](htt
- Remove downloaded models from HuggingFace cache
- Then exit the container environment
+* **Last Updated:** 10/12/2025
+* First publication
## Instructions
## Step 1. Launch the TensorRT container environment

View File

@ -41,9 +41,11 @@ and proper GPU topology detection.
## Time & risk
-- **Duration**: 30 minutes for setup and validation
-- **Risk level**: Medium - involves network configuration changes
-- **Rollback**: The NCCL & NCCL Tests repositories can be deleted from DGX Spark
+* **Duration**: 30 minutes for setup and validation
+* **Risk level**: Medium - involves network configuration changes
+* **Rollback**: The NCCL & NCCL Tests repositories can be deleted from DGX Spark
+* **Last Updated:** 10/12/2025
+* First publication
## Run on two Sparks

View File

@ -47,6 +47,8 @@ All necessary files for the playbook can be found [here on GitHub](https://githu
* **Duration:** 45-90 minutes for complete setup and initial model fine-tuning
* **Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting, distributed training setup complexity increases with multi-node configurations
* **Rollback:** Virtual environments can be completely removed; no system-level changes are made to the host system beyond package installations.
+* **Last Updated:** 10/22/2025
+* Minor copyedits
## Instructions

View File

@ -44,13 +44,16 @@ the powerful GPU capabilities of your Spark device without complex network confi
## Time & risk
-**Duration**: 10-15 minutes for initial setup, 2-3 minutes for model download (varies by model size)
+* **Duration**: 10-15 minutes for initial setup, 2-3 minutes for model download (varies by model size)
-**Risk level**: Low - No system-level changes, easily reversible by stopping the custom app
+* **Risk level**: Low - No system-level changes, easily reversible by stopping the custom app
-**Rollback**: Stop the custom app in NVIDIA Sync and uninstall Ollama with standard package
+* **Rollback**: Stop the custom app in NVIDIA Sync and uninstall Ollama with standard package
removal if needed
+* **Last Updated:** 10/12/2025
+* First publication
## Instructions
## Step 1. Verify Ollama installation status
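Step 1's verification can be as simple as the following sketch: it prints the version if the `ollama` CLI is on the PATH, or a note otherwise.

```shell
# Check whether the ollama CLI is installed and report its version.
if command -v ollama >/dev/null 2>&1; then
  ollama --version
else
  echo "ollama not installed"
fi
```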

View File

@ -38,6 +38,8 @@ You will have a fully functional Open WebUI installation running on your DGX Spa
* **Risks**:
* Docker permission issues may require user group changes and session restart
* Large model downloads may take significant time depending on network speed
+* **Last Updated:** 10/28/2025
+* Minor copyedits
## Set up Open WebUI on Remote Spark with NVIDIA Sync

View File

@ -51,6 +51,8 @@ All files required for fine-tuning are included in the folder in [the GitHub repo
* **Time estimate:** 30-45 mins for setup and running fine-tuning. Fine-tuning run time varies depending on model size
* **Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting.
+* **Last Updated:** 11/07/2025
+* Fix broken commands to access files from GitHub
## Instructions

View File

@ -58,7 +58,7 @@ architectures.
* **Estimated time:** 30-45 minutes (including AI Workbench installation if needed)
* **Risk level:** Low - Uses pre-built containers and established APIs
* **Rollback:** Simply delete the cloned project from AI Workbench to remove all components. No system changes are made outside the AI Workbench environment.
-* **Last Updated:** 11/21/2025
+* **Last Updated:** 10/28/2025
* Minor copyedits
## Instructions

View File

@ -55,6 +55,8 @@ These examples demonstrate how to accelerate large language model inference whil
* **Duration:** 10-20 minutes for setup, additional time for model downloads (varies by network speed)
* **Risks:** GPU memory exhaustion with large models, container registry access issues, network timeouts during downloads
* **Rollback:** Stop Docker containers and optionally clean up downloaded model cache.
+* **Last Updated:** 10/12/2025
+* First publication
## Instructions

View File

@ -73,7 +73,7 @@ all traffic automatically encrypted and NAT traversal handled transparently.
* Network connectivity issues during initial setup
* Authentication provider service dependencies
* **Rollback**: Tailscale can be completely removed with `sudo apt remove tailscale` and all network routing automatically reverts to default settings.
-* **Last Updated:** 11/21/2025
+* **Last Updated:** 11/07/2025
* Minor copyedits
## Instructions

View File

@ -1,6 +1,6 @@
# TRT LLM for Inference
-> Install and configure TRT LLM to run on a single Spark or on two Sparks
+> Install and use TensorRT-LLM on DGX Spark
## Table of Contents
@ -117,6 +117,8 @@ Reminder: not all model architectures are supported for NVFP4 quantization.
* **Duration**: 45-60 minutes for setup and API server deployment
* **Risk level**: Medium - container pulls and model downloads may fail due to network issues
* **Rollback**: Stop inference servers and remove downloaded models to free resources.
+* **Last Updated:** 10/18/2025
+* Fix broken links
## Single Spark

View File

@ -55,6 +55,9 @@ The Python test script can be found [here on GitHub](https://github.com/NVIDIA/d
* CUDA toolkit configuration issues may prevent kernel compilation
* Memory constraints on smaller models require batch size adjustments
* **Rollback**: Uninstall packages with `pip uninstall unsloth torch torchvision`.
+* **Last Updated:** 11/07/2025
+* Add required python dependencies
+* Fix broken commands to access files on GitHub
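The rollback above is a plain pip uninstall. A sketch of the same step, guarded so it keeps going if a package was never installed:

```shell
# Remove the packages the playbook installed; -y skips confirmation prompts.
# `|| true` prevents an abort when a package is already absent.
python3 -m pip uninstall -y unsloth torch torchvision || true
```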
## Instructions

View File

@ -43,6 +43,8 @@ You'll have a fully configured DGX Spark system capable of:
* **Duration:** About 30 minutes
* **Risks:** Data download slowness or failure due to network issues
* **Rollback:** No permanent system changes made during normal usage.
+* **Last Updated:** 10/21/2025
+* First publication
## Instructions

View File

@ -1,6 +1,6 @@
-# Install and Use vLLM for Inference
+# vLLM for Inference
-> Use a container or build vLLM from source for Spark
+> Install and use vLLM on DGX Spark
## Table of Contents
@ -52,6 +52,8 @@ support for ARM64.
* **Duration:** 30 minutes for Docker approach
* **Risks:** Container registry access requires internal credentials
* **Rollback:** Container approach is non-destructive.
+* **Last Updated:** 10/18/2025
+* Minor copyedits
## Instructions

View File

@ -52,6 +52,9 @@ You will deploy NVIDIA's VSS AI Blueprint on NVIDIA Spark hardware with Blackwel
* Network configuration conflicts if shared network already exists
* Remote API endpoints may have rate limits or connectivity issues (hybrid deployment)
* **Rollback:** Stop all containers with `docker compose down`, remove shared network with `docker network rm vss-shared-network`, and clean up temporary media directories.
+* **Last Updated:** 10/18/2025
+* Update required OS and Driver versions
+* Add instructions for fully local VSS deployment
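The three rollback steps above can be scripted together. This sketch is guarded so it is a no-op on a machine without Docker; the network name comes from the rollback bullet, while the media directory path is illustrative:

```shell
# Tear down the VSS deployment: containers, shared network, temp media.
if command -v docker >/dev/null 2>&1; then
  docker compose down 2>/dev/null || true
  docker network rm vss-shared-network 2>/dev/null || true
else
  echo "docker not found; nothing to roll back"
fi
rm -rf /tmp/vss-media   # illustrative temporary media directory path
```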
## Instructions