chore: Regenerate all playbooks

GitLab CI 2025-10-12 20:53:42 +00:00
parent 8f5d38151e
commit c5e890f836
22 changed files with 83 additions and 48 deletions


@@ -158,7 +158,8 @@ Open a web browser and navigate to `http://<SPARK_IP>:8188` where `<SPARK_IP>` i
 If you need to remove the installation completely, follow these steps:
-> **Warning:** This will delete all installed packages and downloaded models.
+> [!WARNING]
+> This will delete all installed packages and downloaded models.
 ```bash
 deactivate


@@ -66,12 +66,9 @@ applications, and manage your DGX Spark remotely from your laptop.
 ## Time & risk
-**Time estimate:** 5-10 minutes
-**Risk level:** Low - SSH setup involves credential configuration but no system-level changes
-to the DGX Spark device
-**Rollback:** SSH key removal can be done by editing `~/.ssh/authorized_keys` on the DGX Spark.
+- **Time estimate:** 5-10 minutes
+- **Risk level:** Low - SSH setup involves credential configuration but no system-level changes to the DGX Spark device
+- **Rollback:** SSH key removal can be done by editing `~/.ssh/authorized_keys` on the DGX Spark.
 ## Connect with NVIDIA Sync
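The rollback step above (editing `~/.ssh/authorized_keys`) can be scripted rather than hand-edited. A minimal sketch, assuming the key NVIDIA Sync installed carries a recognizable trailing comment — `sync-key` here is a hypothetical label, so check the actual comment on your key first:

```shell
# Back up authorized_keys, then drop the line whose trailing comment is "sync-key".
# "sync-key" is a hypothetical label; inspect your authorized_keys before running this.
sed -i.bak '/ sync-key$/d' ~/.ssh/authorized_keys
```

The `.bak` copy lets you undo the edit if the wrong line is matched.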
@@ -146,9 +143,9 @@ Finally, connect your DGX Spark by filling out the form:
 - **Username**: Your DGX Spark user account name
 - **Password**: Your DGX Spark user account password
-**Note:** Your password is used only during this initial setup to configure SSH key-based
-authentication. It is not stored or transmitted after setup completion. NVIDIA Sync will SSH into your device and
-configure its locally provisioned SSH key pair.
+> [!NOTE]
+> Your password is used only during this initial setup to configure SSH key-based authentication. It is not stored or transmitted after setup completion. NVIDIA Sync will SSH into your device and
+> configure its locally provisioned SSH key pair.
 Click "Add" and NVIDIA Sync will automatically:


@@ -198,7 +198,8 @@ From the Settings page, under the "Updates" tab:
 2. Click "Update Now" to initiate the update process
 3. Wait for the update to complete and your device to reboot
-> **Warning**: System updates will upgrade packages, firmware if available, and trigger a reboot. Save your work before proceeding.
+> [!WARNING]
+> System updates will upgrade packages, firmware if available, and trigger a reboot. Save your work before proceeding.
 ## Step 7. Cleanup and rollback
@@ -207,7 +208,8 @@ To clean up resources and return system to original state:
 1. Stop any running JupyterLab instances via dashboard
 2. Delete the JupyterLab working directory
-> **Warning**: If you ran system updates, the only rollback is to restore from a system backup or recovery media.
+> [!WARNING]
+> If you ran system updates, the only rollback is to restore from a system backup or recovery media.
 No permanent changes are made to the system during normal dashboard usage.


@@ -111,7 +111,8 @@ After playing around with the base model, you have 2 possible next steps.
 * If you already have fine-tuned LoRAs placed inside `models/loras/`, please skip to `Step 7. Fine-tuned model inference` section.
 * If you wish to train a LoRA for your custom concepts, first make sure that the ComfyUI inference container is brought down before proceeding to train. You can bring it down by interrupting the terminal with `Ctrl+C` keystroke.
-> **Note**: To clear out any extra occupied memory from your system, execute the following command outside the container after interrupting the ComfyUI server.
+> [!NOTE]
+> To clear out any extra occupied memory from your system, execute the following command outside the container after interrupting the ComfyUI server.
 ```bash
 sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
 ```
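The `drop_caches` command above frees reclaimable page cache. A quick before/after check — the `mem_gb` helper is illustrative, not part of the playbook:

```shell
# Report MemAvailable from /proc/meminfo in GB (illustrative helper).
mem_gb() {
  awk '/^MemAvailable/ {printf "%.1f", $2 / 1048576}' /proc/meminfo
}
echo "Available before flush: $(mem_gb) GB"
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
echo "Available after flush:  $(mem_gb) GB"
```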


@@ -99,7 +99,8 @@ git clone https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-
 ## Step 3. Build the Docker image
-> **Warning:** This command will download a base image and build a container locally to support this environment.
+> [!WARNING]
+> This command will download a base image and build a container locally to support this environment.
 ```bash
 cd jax/assets


@@ -183,7 +183,8 @@ llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
 ## Step 11. Cleanup and rollback
-> **Warning:** This will delete all training progress and checkpoints.
+> [!WARNING]
+> This will delete all training progress and checkpoints.
 To remove all generated files and free up storage space:


@@ -266,7 +266,8 @@ You can now upload a chest X-ray image and ask questions directly in the chat in
 To stop and remove the containers and network, run the following commands. This will not
 delete your downloaded model weights.
-> **Warning:** This will stop all running containers and remove the network.
+> [!WARNING]
+> This will stop all running containers and remove the network.
 ```bash
 ## Stop containers


@@ -37,7 +37,8 @@ The setup includes:
 - No other processes running on the DGX Spark GPU
 - Enough disk space for model downloads
-> **Note**: This demo uses ~120 out of the 128GB of DGX Spark's memory by default.
+> [!NOTE]
+> This demo uses ~120 out of the 128GB of DGX Spark's memory by default.
 > Please ensure that no other workloads are running on your Spark using `nvidia-smi`, or switch to a smaller supervisor model like gpt-oss-20B.
@@ -104,7 +105,8 @@ watch 'docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"'
 Open your browser and go to: http://localhost:3000
-> **Note**: If you are running this on a remote GPU via an SSH connection, in a new terminal window, you need to run the following command to be able to access the UI at localhost:3000 and for the UI to be able to communicate to the backend at localhost:8000.
+> [!NOTE]
+> If you are running this on a remote GPU via an SSH connection, in a new terminal window, you need to run the following command to be able to access the UI at localhost:3000 and for the UI to be able to communicate to the backend at localhost:8000.
 >```ssh -L 3000:localhost:3000 -L 8000:localhost:8000 username@IP-address```
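With that tunnel established, a quick local probe confirms both forwards are answering — a sketch that only reports status and never fails either way:

```shell
# Probe the two forwarded ports; prints one status line per port.
for port in 3000 8000; do
  if curl -sf -o /dev/null "http://localhost:$port"; then
    echo "port $port reachable"
  else
    echo "port $port not reachable yet"
  fi
done
```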


@@ -128,7 +128,8 @@ python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry b
 Test the faster Flux.1 Schnell variant with different precision formats.
-> **Warning**: FP16 Flux.1 Schnell requires >48GB VRAM for native export
+> [!WARNING]
+> FP16 Flux.1 Schnell requires >48GB VRAM for native export
 **Substep A. FP16 precision (high VRAM requirement)**
@@ -190,7 +191,8 @@ python3 -c "import tensorrt as trt; print(f'TensorRT version: {trt.__version__}'
 Remove downloaded models and exit container environment to free disk space.
-> **Warning**: This will delete all cached models and generated images
+> [!WARNING]
+> This will delete all cached models and generated images
 ```bash
 ## Exit container


@@ -280,7 +280,8 @@ print('✅ Setup complete')
 Remove the installation and restore the original environment if needed. These commands safely remove all installed components.
-> **Warning:** This will delete all virtual environments and downloaded models. Ensure you have backed up any important training checkpoints.
+> [!WARNING]
+> This will delete all virtual environments and downloaded models. Ensure you have backed up any important training checkpoints.
 ```bash
 ## Remove virtual environment


@@ -152,7 +152,8 @@ Expected output should be a JSON response containing a completion field with gen
 Remove the running container and optionally clean up cached model files.
-> **Warning:** Removing cached models will require re-downloading on next run.
+> [!WARNING]
+> Removing cached models will require re-downloading on next run.
 ```bash
 docker stop $CONTAINER_NAME


@@ -226,7 +226,8 @@ curl -X POST http://localhost:8000/v1/chat/completions \
 To clean up the environment and remove generated files:
-> **Warning:** This will permanently delete all quantized model files and cached data.
+> [!WARNING]
+> This will permanently delete all quantized model files and cached data.
 ```bash
 ## Remove output directory and all quantized models


@@ -370,8 +370,8 @@ deactivate
 rm -rf openfold_env/
 ```
-> **Warning:** The following will delete downloaded databases (>3TB). Only run if you need to
-> free disk space and are willing to re-download.
+> [!WARNING]
+> The following will delete downloaded databases (>3TB). Only run if you need to free disk space and are willing to re-download.
 ```bash
 ## Remove all databases (requires re-download)


@@ -119,7 +119,8 @@ python Llama3_3B_full_finetuning.py
 |---------|--------|-----|
 | Cannot access gated repo for URL | Certain HuggingFace models have restricted access | Regenerate your [HuggingFace token](https://huggingface.co/docs/hub/en/security-tokens); and request access to the [gated model](https://huggingface.co/docs/hub/en/models-gated#customize-requested-information) on your web browser |
-> **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
+> [!NOTE]
+> DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU.
 > With many applications still updating to take advantage of UMA, you may encounter memory issues even when within
 > the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
 ```bash


@@ -151,7 +151,8 @@ Upload a custom dataset, adjust the Router prompt, and submit custom queries to
 This step explains how to remove the project if needed and what changes were made to your system.
-> **Warning:** This will permanently delete the project and all associated data.
+> [!WARNING]
+> This will permanently delete the project and all associated data.
 To remove the project completely:


@@ -203,7 +203,8 @@ Common issues and their resolutions:
 Stop and remove containers to clean up resources. This step returns your system to its
 original state.
-> **Warning:** This will stop all SGLang containers and remove temporary data.
+> [!WARNING]
+> This will stop all SGLang containers and remove temporary data.
 ```bash
 ## Stop all SGLang containers


@@ -108,7 +108,8 @@ The following models are supported with TensorRT-LLM on Spark. All listed models
 | **Llama-4-Scout-17B-16E-Instruct** | NVFP4 | ✅ | `nvidia/Llama-4-Scout-17B-16E-Instruct-FP4` |
 | **Qwen3-235B-A22B (two Sparks only)** | NVFP4 | ✅ | `nvidia/Qwen3-235B-A22B-FP4` |
-**Note:** You can use the NVFP4 Quantization documentation to generate your own NVFP4-quantized checkpoints for your favorite models. This enables you to take advantage of the performance and memory benefits of NVFP4 quantization even for models not already published by NVIDIA.
+> [!NOTE]
+> You can use the NVFP4 Quantization documentation to generate your own NVFP4-quantized checkpoints for your favorite models. This enables you to take advantage of the performance and memory benefits of NVFP4 quantization even for models not already published by NVIDIA.
 Reminder: not all model architectures are supported for NVFP4 quantization.
@@ -396,7 +397,8 @@ curl -s http://localhost:8355/v1/chat/completions \
 Remove downloaded models and containers to free up space when testing is complete.
-> **Warning:** This will delete all cached models and may require re-downloading for future runs.
+> [!WARNING]
+> This will delete all cached models and may require re-downloading for future runs.
 ```bash
 ## Remove Hugging Face cache
@@ -519,7 +521,8 @@ On your primary node, deploy the TRT-LLM multi-node stack by downloading the [**
 ```bash
 docker stack deploy -c $HOME/docker-compose.yml trtllm-multinode
 ```
-**Note:** Ensure you download both files into the same directory from which you are running the command.
+> [!NOTE]
+> Ensure you download both files into the same directory from which you are running the command.
 You can verify the status of your worker nodes using the following
 ```bash
@@ -534,7 +537,8 @@ oe9k5o6w41le trtllm-multinode_trtllm.1 nvcr.io/nvidia/tensorrt-llm/relea
 phszqzk97p83 trtllm-multinode_trtllm.2 nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc3 spark-1b3b Running Running 2 minutes ago
 ```
-**Note:** If your "Current state" is not "Running", see troubleshooting section for more information.
+> [!NOTE]
+> If your "Current state" is not "Running", see troubleshooting section for more information.
 ### Step 7. Create hosts file
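The "Current state" check above can be scripted. A sketch assuming the `trtllm-multinode` stack name used in this playbook; `docker stack ps --format` with the `.CurrentState` field is standard Docker CLI, the rest is plain shell:

```shell
# Count stack tasks whose current state is not "Running"; 0 means all tasks are healthy.
not_running=$(docker stack ps trtllm-multinode --format '{{.CurrentState}}' | grep -cv '^Running')
echo "tasks not yet running: $not_running"
```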
@@ -603,7 +607,8 @@ docker exec \
 This will start the TensorRT-LLM server on port 8355. You can then make inference requests to `http://localhost:8355` using the OpenAI-compatible API format.
-**Note:** You might see a warning such as `UCX WARN network device 'enp1s0f0np0' is not available, please use one or more of`. You can ignore this warning if your inference is successful, as it's related to only one of your two CX-7 ports being used, and the other being left unused.
+> [!NOTE]
+> You might see a warning such as `UCX WARN network device 'enp1s0f0np0' is not available, please use one or more of`. You can ignore this warning if your inference is successful, as it's related to only one of your two CX-7 ports being used, and the other being left unused.
 **Expected output:** Server startup logs and ready message.
@@ -630,7 +635,8 @@ Stop and remove containers by using the following command on the leader node:
 docker stack rm trtllm-multinode
 ```
-> **Warning:** This removes all inference data and performance reports. Copy `/opt/*perf-report.json` files before cleanup if needed.
+> [!WARNING]
+> This removes all inference data and performance reports. Copy `/opt/*perf-report.json` files before cleanup if needed.
 Remove downloaded models to free disk space:
@@ -659,7 +665,8 @@ After setting up TensorRT-LLM inference server in either single-node or multi-no
 Run the following command on the DGX Spark node where you have the TensorRT-LLM inference server running.
 For multi-node setup, this would be the primary node.
-**Note:** If you used a different port for your OpenAI-compatible API server, adjust the `OPENAI_API_BASE_URL="http://localhost:8355/v1"` to match the IP and port of your TensorRT-LLM inference server.
+> [!NOTE]
+> If you used a different port for your OpenAI-compatible API server, adjust the `OPENAI_API_BASE_URL="http://localhost:8355/v1"` to match the IP and port of your TensorRT-LLM inference server.
 ```bash
 docker run \
@@ -696,10 +703,13 @@ You should see the Open WebUI interface at http://localhost:8080 where you can:
 You can select your model(s) from the dropdown menu on the top left corner. That's all you need to do to start using Open WebUI with your deployed models.
-**Note:** If accessing from a remote machine, replace localhost with your DGX Spark's IP address.
+> [!NOTE]
+> If accessing from a remote machine, replace localhost with your DGX Spark's IP address.
 ### Step 3. Cleanup and rollback
-**Warning:** This removes all chat data and may require re-uploading for future runs.
+> [!WARNING]
+> This removes all chat data and may require re-uploading for future runs.
 Remove the container by using the following command:
 ```bash
 docker stop open-webui


@@ -89,7 +89,8 @@ docker exec ollama-compose ollama pull <model-name>
 Browse available models at [https://ollama.com/search](https://ollama.com/search)
-> **Note**: The unified memory architecture enables running larger models like 70B parameters, which produce significantly more accurate knowledge triples.
+> [!NOTE]
+> The unified memory architecture enables running larger models like 70B parameters, which produce significantly more accurate knowledge triples.
 ## Step 4. Access the web interface


@@ -244,7 +244,8 @@ Expected output includes a generated haiku response.
 ## Step 10. (Optional) Deploy Llama 3.1 405B model
-> **Warning:** 405B model has insufficient memory headroom for production use.
+> [!WARNING]
+> 405B model has insufficient memory headroom for production use.
 Download the quantized 405B model for testing purposes only.
@@ -300,7 +301,8 @@ docker exec node nvidia-smi --query-gpu=memory.used,memory.total --format=csv
 Remove temporary configurations and containers when testing is complete.
-> **Warning:** This will stop all inference services and remove cluster configuration.
+> [!WARNING]
+> This will stop all inference services and remove cluster configuration.
 ```bash
 ## Stop containers on both nodes


@@ -104,7 +104,8 @@ sh launch.sh
 ## Enter the mounted directory within the container
 cd /vlm_finetuning
 ```
-**Note**: The same Docker container and launch commands work for both image and video VLM recipes. The container features all necessary dependencies, including FFmpeg, Decord, and optimized libraries for both workflows.
+> [!NOTE]
+> The same Docker container and launch commands work for both image and video VLM recipes. The container features all necessary dependencies, including FFmpeg, Decord, and optimized libraries for both workflows.
 ## Step 5. [Option A] For image VLM fine-tuning (Wildfire Detection)
@@ -129,7 +130,8 @@ cd ui_image/data
 For this fine-tuning playbook, we will use the [Wildfire Prediction Dataset](https://www.kaggle.com/datasets/abdelghaniaaba/wildfire-prediction-dataset) from Kaggle. Visit the Kaggle dataset page [here](https://www.kaggle.com/datasets/abdelghaniaaba/wildfire-prediction-dataset) to click the download button. Select the `cURL` option in the `Download Via` dropdown and copy the curl command.
-> **Note**: You will need to be logged into Kaggle and may need to accept the dataset terms before the download link works.
+> [!NOTE]
+> You will need to be logged into Kaggle and may need to accept the dataset terms before the download link works.
 Run the following commands in your container:
@@ -235,7 +237,8 @@ dataset/
 #### 6.2. Model download
-> **Note**: These instructions assume you are already inside the Docker container. For container setup, refer to the section above to `Build the Docker container`.
+> [!NOTE]
+> These instructions assume you are already inside the Docker container. For container setup, refer to the section above to `Build the Docker container`.
 ```bash
 hf download OpenGVLab/InternVL3-8B
@@ -262,7 +265,8 @@ Scroll down, enter your prompt in the chat box and hit `Generate`. Your prompt w
 If you are proceeding to train a fine-tuned model, ensure that the Streamlit demo UI is brought down before proceeding to train. You can bring it down by interrupting the terminal with a `Ctrl+C` keystroke.
-> **Note**: To clear out any extra occupied memory from your system, execute the following command outside the container after interrupting the ComfyUI server.
+> [!NOTE]
+> To clear out any extra occupied memory from your system, execute the following command outside the container after interrupting the Streamlit server.
 ```bash
 sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
 ```
@@ -294,7 +298,8 @@ You can monitor and evaluate the training progress and metrics, as they will be
 After training, ensure that you shut down the Jupyter kernel in the notebook and kill the Jupyter server in the terminal with a `Ctrl+C` keystroke.
-> **Note**: To clear out any extra occupied memory from your system, execute the following command outside the container after interrupting the ComfyUI server.
+> [!NOTE]
+> To clear out any extra occupied memory from your system, execute the following command outside the container after interrupting the Jupyter server.
 ```bash
 sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
 ```


@@ -152,7 +152,8 @@ Within VS Code:
 ## Step 8. Uninstalling VS Code
-> **Warning:** Uninstalling VS Code will remove all user settings and extensions.
+> [!WARNING]
+> Uninstalling VS Code will remove all user settings and extensions.
 To remove VS Code if needed:
 ```bash


@@ -128,7 +128,8 @@ Create a Docker network that will be shared between VSS services and CV pipeline
 docker network create vss-shared-network
 ```
-> **Warning:** If the network already exists, you may see an error. Remove it first with `docker network rm vss-shared-network` if needed.
+> [!WARNING]
+> If the network already exists, you may see an error. Remove it first with `docker network rm vss-shared-network` if needed.
 ## Step 6. Authenticate with NVIDIA Container Registry
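The "already exists" error mentioned in the warning above can also be avoided by making the creation idempotent; a small sketch using the same `docker network` subcommands:

```shell
# Create vss-shared-network only if it does not already exist.
if docker network inspect vss-shared-network >/dev/null 2>&1; then
  echo "vss-shared-network already exists, reusing it"
else
  docker network create vss-shared-network
fi
```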
@@ -369,7 +370,8 @@ Follow the steps [here](https://docs.nvidia.com/vss/latest/content/ui_app.html)
 To completely remove the VSS deployment and free up system resources:
-> **Warning:** This will destroy all processed video data and analysis results.
+> [!WARNING]
+> This will destroy all processed video data and analysis results.
 ```bash
 ## For Event Reviewer deployment