chore: Regenerate all playbooks

This commit is contained in:
GitLab CI 2025-10-09 22:43:59 +00:00
parent bf842ce358
commit b24d088d46
4 changed files with 60 additions and 50 deletions

@@ -6,6 +6,7 @@
- [Overview](#overview)
- [Instructions](#instructions)
- [Troubleshooting](#troubleshooting)
---
@@ -63,10 +64,6 @@ All required assets can be found [in the ComfyUI repository on GitHub](https://g
* Model downloads are large (~2GB) and may fail due to network issues
* Port 8188 must be accessible for web interface functionality
* **Rollback:** Virtual environment can be deleted to remove all installed packages. Downloaded models can be removed manually from the checkpoints directory.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
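To see how much memory the page cache is actually holding, you can read it from `/proc/meminfo` — a minimal sketch, run before and after the flush:

```shell
# Report current page-cache usage; the "Cached:" field in /proc/meminfo
# is reported in kB.
cached_kb=$(awk '/^Cached:/ {print $2}' /proc/meminfo)
echo "Page cache currently holds ${cached_kb} kB"
```

Comparing the two readings confirms whether the `drop_caches` write freed memory.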
## Instructions
@@ -157,17 +154,7 @@ Expected output should show HTTP 200 response indicating the web server is operational.
Open a web browser and navigate to `http://<SPARK_IP>:8188` where `<SPARK_IP>` is your device's IP address.
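If the page does not load, you can first confirm the server answers from the command line — a sketch, where `SPARK_IP` is your device's address (defaulting to `localhost` here when unset):

```shell
# Probe the ComfyUI web server and print only the HTTP status code.
# "200" means the interface is up; curl prints "000" if it cannot connect.
status=$(curl -s -o /dev/null -w '%{http_code}' "http://${SPARK_IP:-localhost}:8188/")
echo "HTTP status: ${status}"
```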
## Step 9. Optional - Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| PyTorch CUDA not available | Incorrect CUDA version or missing drivers | Verify `nvcc --version` matches cu129, reinstall PyTorch |
| Model download fails | Network connectivity or storage space | Check internet connection, verify 20GB+ available space |
| Web interface inaccessible | Firewall blocking port 8188 | Configure firewall to allow port 8188, check IP address |
| Out of GPU memory errors | Insufficient VRAM for model | Use smaller models or enable CPU fallback mode |
## Step 10. Optional - Cleanup and rollback
## Step 9. Optional - Cleanup and rollback
If you need to remove the installation completely, follow these steps:
@@ -181,7 +168,7 @@ rm -rf ComfyUI/
To rollback during installation, press `Ctrl+C` to stop the server and remove the virtual environment.
## Step 11. Optional - Next steps
## Step 10. Optional - Next steps
Test the installation with a basic image generation workflow:
@@ -191,3 +178,19 @@ Test the installation with a basic image generation workflow:
4. Monitor GPU usage with `nvidia-smi` in a separate terminal
The image generation should complete within 30-60 seconds depending on your hardware configuration.
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| PyTorch CUDA not available | Incorrect CUDA version or missing drivers | Verify `nvcc --version` matches cu129, reinstall PyTorch |
| Model download fails | Network connectivity or storage space | Check internet connection, verify 20GB+ available space |
| Web interface inaccessible | Firewall blocking port 8188 | Configure firewall to allow port 8188, check IP address |
| Out of GPU memory errors after manually flushing buffer cache | Insufficient VRAM for model | Use smaller models or enable CPU fallback mode |
> **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```

@@ -7,6 +7,7 @@
- [Overview](#overview)
- [Connect with NVIDIA Sync](#connect-with-nvidia-sync)
- [Connect with Manual SSH](#connect-with-manual-ssh)
- [Troubleshooting](#troubleshooting)
---
@@ -70,7 +71,7 @@ applications, and manage your DGX Spark remotely from your laptop.
**Risk level:** Low - SSH setup involves credential configuration but no system-level changes
to the DGX Spark device
**Rollback:** SSH key removal can be done by editing `~/.ssh/authorized_keys` on the DGX Spark
**Rollback:** SSH key removal can be done by editing `~/.ssh/authorized_keys` on the DGX Spark.
## Connect with NVIDIA Sync
@@ -206,16 +207,7 @@ Exit the SSH session
exit
```
## Step 6. Troubleshooting
| Symptom | Cause | Fix |
|---------|--------|-----|
| Device name doesn't resolve | mDNS blocked on network | Use IP address instead of hostname.local |
| Connection refused/timeout | DGX Spark not booted or SSH not ready | Wait for device boot completion; SSH available after updates finish |
| Authentication failed | SSH key setup incomplete | Re-run device setup in NVIDIA Sync; check credentials |
## Step 7. Next steps
## Step 6. Next steps
Test your setup by launching a development tool:
- Click the NVIDIA Sync system tray icon.
@@ -333,17 +325,26 @@ ssh -L 11000:localhost:11000 <YOUR_USERNAME>@<SPARK_HOSTNAME>.local
After establishing the tunnel, access the forwarded web app in your browser: [http://localhost:11000](http://localhost:11000)
## Step 6. Next steps
## Step 6. Troubleshooting
With SSH access configured, you can:
- Open persistent terminal sessions: `ssh <YOUR_USERNAME>@<SPARK_HOSTNAME>.local`.
- Forward web application ports: `ssh -L <local_port>:localhost:<remote_port> <YOUR_USERNAME>@<SPARK_HOSTNAME>.local`.
## Troubleshooting
## Possible issues connecting via NVIDIA Sync
| Symptom | Cause | Fix |
|---------|--------|-----|
| Device name doesn't resolve | mDNS blocked on network | Use IP address instead of hostname.local |
| Connection refused/timeout | DGX Spark not booted or SSH not ready | Wait for device boot completion; SSH available after updates finish |
| Authentication failed | SSH key setup incomplete | Re-run device setup in NVIDIA Sync; check credentials |
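For the mDNS row above, you can check name resolution from your laptop before falling back to the IP address — a sketch, with `spark` as a placeholder hostname:

```shell
# Try to resolve the device's mDNS name; getent consults the same
# resolver stack as ssh, so a failure here predicts an ssh failure too.
if getent hosts spark.local >/dev/null; then
  mdns_state="resolves"
else
  mdns_state="unresolved; use the IP address instead"
fi
echo "spark.local: ${mdns_state}"
```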
## Possible issues connecting via manual SSH
| Symptom | Cause | Fix |
|---------|--------|-----|
| `ssh: Could not resolve hostname` | mDNS not working | Use IP address instead of .local hostname |
| `Connection refused` | Device not booted or SSH disabled | Wait for full boot; SSH available after system updates complete |
| `Port forwarding fails` | Service not running or port conflict | Verify remote service is active; try different local port |
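For the port-forwarding row, a quick way to spot a local port conflict before opening the tunnel — a sketch using port 11000 from the example above (`ss` comes from iproute2, standard on Ubuntu-based systems):

```shell
# Check whether local port 11000 is already bound by another process.
if ss -tln 2>/dev/null | grep -q ':11000 '; then
  port_state="in use; choose a different local port for -L"
else
  port_state="free"
fi
echo "local port 11000: ${port_state}"
```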
## Step 7. Next steps
With SSH access configured, you can:
- Open persistent terminal sessions: `ssh <YOUR_USERNAME>@<SPARK_HOSTNAME>.local`.
- Forward web application ports: `ssh -L <local_port>:localhost:<remote_port> <YOUR_USERNAME>@<SPARK_HOSTNAME>.local`.

@@ -6,6 +6,7 @@
- [Overview](#overview)
- [Instructions](#instructions)
- [Troubleshooting](#troubleshooting)
---
@@ -203,17 +204,7 @@ From the Settings page, under the "Updates" tab:
> **Warning**: System updates will upgrade packages, firmware if available, and trigger a reboot. Save your work before proceeding.
## Step 7. Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| User can't run updates | User not in sudo group | Add user to sudo group: `sudo usermod -aG sudo <USERNAME>` |
| JupyterLab won't start | Issue with current virtual environment | Change the working directory in the JupyterLab panel and start a new instance |
| SSH tunnel connection refused | Incorrect IP or port | Verify Spark device IP and ensure SSH service is running |
| GPU not visible in monitoring | Driver issues | Check GPU status with `nvidia-smi` |
## Step 8. Cleanup and rollback
## Step 7. Cleanup and rollback
To clean up resources and return system to original state:
@@ -224,9 +215,18 @@ To clean up resources and return system to original state:
No permanent changes are made to the system during normal dashboard usage.
## Step 9. Next steps
## Step 8. Next steps
Now that you have DGX Dashboard configured, you can:
- Create additional JupyterLab environments for different projects
- Use the dashboard to manage system maintenance and updates
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| User can't run updates | User not in sudo group | Add user to sudo group: `sudo usermod -aG sudo <USERNAME>` |
| JupyterLab won't start | Issue with current virtual environment | Change the working directory in the JupyterLab panel and start a new instance |
| SSH tunnel connection refused | Incorrect IP or port | Verify Spark device IP and ensure SSH service is running |
| GPU not visible in monitoring | Driver issues | Check GPU status with `nvidia-smi` |
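For the first row, you can confirm group membership before retrying updates — a minimal sketch:

```shell
# List the current user's groups; "sudo" must appear among them for the
# dashboard's update actions to work.
user_groups=$(id -nG)
case " ${user_groups} " in
  *" sudo "*) echo "user is in the sudo group" ;;
  *)          echo "user is NOT in the sudo group" ;;
esac
```

Note that after `sudo usermod -aG sudo <USERNAME>`, the user must log out and back in for the new group to take effect.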

@@ -6,6 +6,7 @@
- [Overview](#overview)
- [Instructions](#instructions)
- [Troubleshooting](#troubleshooting)
---
@@ -46,10 +47,6 @@ The setup includes:
* Docker permission issues may require user group changes and session restart
* The recipe would require hyperparameter tuning and a high-quality dataset for the best results
**Rollback**: Stop and remove Docker containers, delete downloaded models if needed.
* DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
## Instructions
@@ -171,3 +168,12 @@ Find the workflow section on the left-side panel of ComfyUI (or press `w`). Upon
Provide your prompt in the `CLIP Text Encode (Prompt)` block. Now let's incorporate our custom concepts into the prompt for the fine-tuned model. For example, we will use `tjtoy toy holding sparkgpu gpu in a datacenter`. You can expect the generation to take ~3 minutes, since creating high-resolution 1024px images is compute-intensive.
Unlike the base model, we can see that the fine-tuned model can generate multiple concepts in a single image. Additionally, ComfyUI exposes several fields to tune and change the look and feel of the generated images.
## Troubleshooting
> **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
> With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
> the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```