mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git
synced 2026-04-22 01:53:53 +00:00

chore: Regenerate all playbooks

This commit is contained in:
parent 67e50f83fd
commit 91ba405901
@@ -22,16 +22,16 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
 ### NVIDIA

 - [Comfy UI](nvidia/comfy-ui/)
-- [Connect to Your Spark](nvidia/connect-to-your-spark/)
+- [Connect to Your Spark from Another Computer](nvidia/connect-to-your-spark/)
 - [DGX Dashboard](nvidia/dgx-dashboard/)
 - [FLUX.1 Dreambooth LoRA Fine-tuning](nvidia/flux-finetuning/)
 - [Optimized JAX](nvidia/jax/)
-- [Llama Factory](nvidia/llama-factory/)
+- [LLaMA Factory](nvidia/llama-factory/)
 - [MONAI-Reasoning-CXR-3B Model](nvidia/monai-reasoning/)
 - [Build and Deploy a Multi-Agent Chatbot](nvidia/multi-agent-chatbot/)
 - [Multi-modal Inference](nvidia/multi-modal-inference/)
 - [NCCL for Two Sparks](nvidia/nccl/)
-- [Fine tune with Nemo](nvidia/nemo-fine-tune/)
+- [Fine-tune with NeMo](nvidia/nemo-fine-tune/)
 - [Use a NIM on Spark](nvidia/nim-llm/)
 - [Quantize to NVFP4](nvidia/nvfp4-quantization/)
 - [Ollama](nvidia/ollama/)
@@ -43,7 +43,7 @@ Each playbook includes prerequisites, step-by-step instructions, troubleshooting
 - [SGLang Inference Server](nvidia/sglang/)
 - [Speculative Decoding](nvidia/speculative-decoding/)
 - [Stack two Sparks](nvidia/stack-sparks/)
-- [Setup Tailscale on your Spark](nvidia/tailscale/)
+- [Set up Tailscale on your Spark](nvidia/tailscale/)
 - [TRT LLM for Inference](nvidia/trt-llm/)
 - [Text to Knowledge Graph](nvidia/txt2kg/)
 - [Unsloth on DGX Spark](nvidia/unsloth/)
@@ -1,4 +1,4 @@
-# Connect to Your Spark
+# Connect to Your Spark from Another Computer

 > Use NVIDIA Sync or manual SSH to connect to your Spark

@@ -40,7 +40,7 @@ The setup includes:
 ## Time & risk

 **Duration**:
-- 15 minutes for initial setup model download time
+- 30-45 minutes for initial setup model download time
 - 1-2 hours for dreambooth LoRA training

 **Risks**:
@@ -85,9 +85,10 @@ If you do not have a `HF_TOKEN` already, follow the instructions [here](https://

 ```bash
 export HF_TOKEN=<YOUR_HF_TOKEN>
-cd flux-finetuning/assets
+cd dgx-spark-playbooks/nvidia/flux-finetuning/assets
 sh download.sh
 ```
+The download script can take about 30-45 minutes to complete based on your internet speed.

 If you already have fine-tuned LoRAs, place them inside `models/loras`. If you do not have one yet, proceed to the `Step 6. Training` section for more details.
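Before running the download script from the hunk above, it can help to confirm the token is actually set. A minimal sketch, assuming only that current Hugging Face tokens usually carry an `hf_` prefix (older tokens may not):

```python
import os

def check_hf_token(env: dict) -> str:
    """Return a short status string for the HF_TOKEN in the given environment mapping."""
    token = env.get("HF_TOKEN", "")
    if not token:
        return "missing"
    if token == "<YOUR_HF_TOKEN>":
        return "placeholder"  # the literal from the docs was never replaced
    # Most current Hugging Face tokens start with "hf_"; older ones may not.
    return "ok" if token.startswith("hf_") else "unusual-format"

if __name__ == "__main__":
    print(check_hf_token(os.environ))
```

This only checks the shape of the variable; it does not contact the Hugging Face API.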
@@ -120,27 +121,29 @@ sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

 ## Step 5. Dataset preparation

-Let's prepare our dataset to perform Dreambooth LoRA fine-tuning on the FLUX.1-dev 12B model. However, if you wish to continue with the provided dataset of Toy Jensen and DGX Spark, feel free to skip to the Training section below. This dataset is a collection of public assets accessible via Google Images.
+Let's prepare our dataset to perform Dreambooth LoRA fine-tuning on the FLUX.1-dev 12B model.

-You will need to prepare a dataset of all the concepts you would like to generate and about 5-10 images for each concept. For this example, we would like to generate images with 2 concepts.
+For this playbook, we have already prepared a dataset of 2 concepts - Toy Jensen and DGX Spark. This dataset is a collection of public assets accessible via Google Images. If you wish to generate images with these concepts, you do not need to modify the `data.toml` file.

 **TJToy Concept**
 - **Trigger phrase**: `tjtoy toy`
-- **Training images**: 6 high-quality images of custom toy figures
+- **Training images**: 6 high-quality images of Toy Jensen figures available in the public domain
 - **Use case**: Generate images featuring the specific toy character in various scenes

 **SparkGPU Concept**
 - **Trigger phrase**: `sparkgpu gpu`
-- **Training images**: 7 images of custom GPU hardware
+- **Training images**: 7 images of DGX Spark GPU available in the public domain
 - **Use case**: Generate images featuring the specific GPU design in different contexts

+If you wish to generate images with custom concepts, you would need to prepare a dataset of all the concepts you would like to generate and about 5-10 images for each concept.
+
 Create a folder for each concept with its corresponding name and place it inside the `flux_data` directory. In our case, we have used `sparkgpu` and `tjtoy` as our concepts, and placed a few images inside each of them.

-Now, let's modify the `flux_data/data.toml` file to reflect the concepts chosen. Ensure that you update/create entries for each of your concept by modifying the `image_dir` and `class_tokens` fields under `[[datasets.subsets]]`. For better performance in fine-tuning, it is good practice to append a class token to your concept name (like `toy` or `gpu`).
+Now, let's modify the `flux_data/data.toml` file to reflect the concepts chosen. Ensure that you update/create entries for each of your concepts by modifying the `image_dir` and `class_tokens` fields under `[[datasets.subsets]]`. For better performance in fine-tuning, it is good practice to append a class token to your concept name (like `toy` or `gpu`).

 ## Step 6. Training

-Launch training by executing the follow command. The training script is set up to use a default configuration that can generate reasonable images for your dataset, in about ~90 mins of training. This train command will automatically store checkpoints in the `models/loras/` directory.
+Launch training by executing the following command. The training script uses a default configuration that produces images that capture your DreamBooth concepts effectively after about 90 minutes of training. This train command will automatically store checkpoints in the `models/loras/` directory.

 ```bash
 ## Build the inference docker image
@@ -1,6 +1,6 @@
-# Llama Factory
+# LLaMA Factory

-> Install and fine-tune models with LLama Factory
+> Install and fine-tune models with LLaMA Factory

 ## Table of Contents

@@ -12,6 +12,13 @@

 ## Overview

+## Basic idea
+
+LLaMA Factory is an open-source framework that simplifies the process of training and fine-tuning large language models. It offers a unified interface for a variety of cutting-edge methods such as SFT, RLHF, and QLoRA techniques. It also supports a wide range of LLM architectures such as LLaMA, Mistral and Qwen. This playbook demonstrates how to fine-tune large language models using LLaMA Factory CLI on your NVIDIA Spark device.
+
 ## What you'll accomplish

 You'll set up LLaMA Factory on NVIDIA Spark with Blackwell architecture to fine-tune large
@@ -107,7 +114,9 @@ pip install -e ".[metrics]"

 ## Step 5. Verify Pytorch CUDA support.

-PyTorch is pre-installed with CUDA support. Verify installation:
+PyTorch is pre-installed with CUDA support.
+
+To verify installation:

 ```bash
 python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
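The verification one-liner in this hunk can be expanded into a slightly more defensive check. A sketch that degrades gracefully when `torch` is not importable, so it also runs on a machine without the playbook's environment:

```python
import importlib.util

def module_available(name: str) -> bool:
    """True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None

def report() -> str:
    """Describe the PyTorch/CUDA status without crashing when torch is absent."""
    if not module_available("torch"):
        return "torch: not installed"
    import torch  # safe: find_spec confirmed the module exists
    return f"PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}"

if __name__ == "__main__":
    print(report())
```

On a correctly configured Spark this prints the PyTorch version with `CUDA: True`; elsewhere it reports what is missing instead of raising `ModuleNotFoundError`.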
@@ -123,7 +132,8 @@ cat examples/train_lora/llama3_lora_sft.yaml

 ## Step 7. Launch fine-tuning training

-> **Note:** Login to your hugging face hub to download the model if the model is gated
+> **Note:** Login to your hugging face hub to download the model if the model is gated.

 Execute the training process using the pre-configured LoRA setup.

 ```bash
@@ -83,6 +83,10 @@ and remove the Docker containers

 ## Instructions

+> **Note:** DGX Spark uses a Unified Memory Architecture (UMA), which enables dynamic memory sharing between the GPU and CPU. With many applications still updating to take advantage of UMA, you may encounter memory issues even when within the memory capacity of DGX Spark. If that happens, manually flush the buffer cache with:
+```bash
+sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
+```
 ## Step 1. Create the Project Directory

 First, create a dedicated directory to store your model weights and configuration files. This
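Before flushing caches as the UMA note in this hunk suggests, it can help to see how much reclaimable memory the kernel already reports. A sketch that parses the `MemAvailable` field of `/proc/meminfo`, shown here against a sample string so it also runs off-device (the sample numbers are illustrative):

```python
def mem_available_kb(meminfo_text: str) -> int:
    """Parse the MemAvailable value (in kB) out of /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:   12345678 kB"
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")

SAMPLE = """MemTotal:       131072000 kB
MemFree:         8388608 kB
MemAvailable:   67108864 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the sample when not on Linux
    print(f"MemAvailable: {mem_available_kb(text) // 1024} MiB")
```

A low `MemAvailable` despite little application memory in use is the situation where the `drop_caches` flush above tends to help.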
@@ -1,4 +1,4 @@
-# Fine tune with Nemo
+# Fine-tune with NeMo

 > Use NVIDIA NeMo to fine-tune models locally

@@ -6,18 +6,12 @@

 - [Overview](#overview)
 - [Instructions](#instructions)
-- [If system installation fails](#if-system-installation-fails)
-- [Install from wheel package (recommended)](#install-from-wheel-package-recommended)
-- [Full Fine-tuning example:](#full-fine-tuning-example)
-- [LoRA fine-tuning example:](#lora-fine-tuning-example)
-- [QLoRA fine-tuning example:](#qlora-fine-tuning-example)
-- [Step 9. Configure distributed training (optional)](#step-9-configure-distributed-training-optional)

 ---

 ## Overview

-## Basic Idea
+## Basic idea

 This playbook guides you through setting up and using NVIDIA NeMo AutoModel for fine-tuning large language models and vision-language models on NVIDIA Spark devices. NeMo AutoModel provides GPU-accelerated, end-to-end training for Hugging Face models with native PyTorch support, enabling instant fine-tuning without conversion delays. The framework supports distributed training across single GPU to multi-node clusters, with optimized kernels and memory-efficient recipes specifically designed for ARM64 architecture and Blackwell GPU systems.

@@ -36,20 +30,11 @@ You'll establish a complete fine-tuning environment for large language models (1
 ## Prerequisites

 - NVIDIA Spark device with Blackwell architecture GPU access
-- CUDA toolkit 12.0+ installed and configured
-```bash
-nvcc --version
-```
-- Python 3.10+ environment available
-```bash
-python3 --version
-```
+- CUDA toolkit 12.0+ installed and configured: `nvcc --version`
+- Python 3.10+ environment available: `python3 --version`
 - Minimum 32GB system RAM for efficient model loading and training
 - Active internet connection for downloading models and packages
-- Git installed for repository cloning
-```bash
-git --version
-```
+- Git installed for repository cloning: `git --version`
 - SSH access to your NVIDIA Spark device configured

 ## Ancillary files
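The inline checks in the rewritten prerequisite list can be automated. A sketch using only the standard library to report which of the required tools are on `PATH` (this detects presence only, not the versions the playbook requires):

```python
import shutil

REQUIRED_TOOLS = ["nvcc", "python3", "git"]  # taken from the prerequisites above

def check_tools(tools: list[str]) -> dict[str, bool]:
    """Map each tool name to whether an executable with that name is on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}

if __name__ == "__main__":
    for tool, found in check_tools(REQUIRED_TOOLS).items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
```

Anything reported `MISSING` needs installing before the NeMo AutoModel setup steps will succeed.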
@@ -58,11 +43,11 @@ All necessary files for the playbook can be found [here on GitHub](https://githu

 ## Time & risk

-**Time estimate:** 45-90 minutes for complete setup and initial model fine-tuning
+**Duration:** 45-90 minutes for complete setup and initial model fine-tuning

 **Risks:** Model downloads can be large (several GB), ARM64 package compatibility issues may require troubleshooting, distributed training setup complexity increases with multi-node configurations

-**Rollback:** Virtual environments can be completely removed; no system-level changes are made to the host system beyond package installations
+**Rollback:** Virtual environments can be completely removed; no system-level changes are made to the host system beyond package installations.

 ## Instructions

@@ -113,7 +98,7 @@ pip3 install uv
 uv --version
 ```

-### If system installation fails
+**If system installation fails:**

 ```bash
 ## Install for current user only
@@ -139,7 +124,7 @@ cd Automodel

 Set up the virtual environment and install NeMo AutoModel. Choose between wheel package installation for stability or source installation for latest features.

-### Install from wheel package (recommended)
+**Install from wheel package (recommended):**

 ```bash
 ## Initialize virtual environment
@@ -187,7 +172,7 @@ uv run --frozen --no-sync python -c "import nemo_automodel; print('✅ NeMo Auto
 ls -la examples/
 ```

-## Step 6. Explore available examples
+## Step 8. Explore available examples

 Review the pre-configured training recipes available for different model types and training scenarios. These recipes provide optimized configurations for ARM64 and Blackwell architecture.

@@ -199,18 +184,21 @@ ls examples/llm_finetune/
 cat examples/llm_finetune/finetune.py | head -20
 ```

-## Step 7. Run sample fine-tuning
+## Step 9. Run sample fine-tuning
+The following commands show how to perform full fine-tuning (SFT), parameter-efficient fine-tuning (PEFT) with LoRA and QLoRA.

-First, you need to export your HF_TOKEN so that gated models can be downloaded.
+First, export your HF_TOKEN so that gated models can be downloaded.

 ```bash
 ## Run basic LLM fine-tuning example
 export HF_TOKEN=<your_huggingface_token>
 ```
+> **Note:** Please replace `<your_huggingface_token>` with your Hugging Face access token to access gated models (e.g., Llama).

-### Full Fine-tuning example:
-Once inside the `Automodel` directory you git cloned from github, run:
+**Full Fine-tuning example:**
+
+Once inside the `Automodel` directory you cloned from github, run:

 ```bash
 uv run --frozen --no-sync \
 examples/llm_finetune/finetune.py \
@@ -224,7 +212,8 @@ These overrides ensure the Qwen3-8B SFT run behaves as expected:
 - `--loss_fn._target_`: uses the TransformerEngine-parallel cross-entropy loss variant compatible with tensor-parallel training for large LLMs.
 - `--step_scheduler.local_batch_size`: sets the per-GPU micro-batch size to 1 to fit in memory; overall effective batch size is still driven by gradient accumulation and data/tensor parallel settings from the recipe.

-### LoRA fine-tuning example:
+**LoRA fine-tuning example:**
+
 Execute a basic fine-tuning example to validate the complete setup. This demonstrates parameter-efficient fine-tuning using a small model suitable for testing.

 ```bash
@@ -234,8 +223,10 @@ examples/llm_finetune/finetune.py \
 -c examples/llm_finetune/llama3_2/llama3_2_1b_squad_peft.yaml \
 --model.pretrained_model_name_or_path meta-llama/Llama-3.1-8B
 ```
-### QLoRA fine-tuning example:
+**QLoRA fine-tuning example:**
+
 We can use QLoRA to fine-tune large models in a memory-efficient manner.

 ```bash
 uv run --frozen --no-sync \
 examples/llm_finetune/finetune.py \
@@ -250,7 +241,7 @@ These overrides ensure the 70B QLoRA run behaves as expected:
 - `--loss_fn._target_`: uses the TransformerEngine-parallel cross-entropy loss variant compatible with tensor-parallel training for large LLMs.
 - `--step_scheduler.local_batch_size`: sets the per-GPU micro-batch size to 1 to fit 70B in memory; overall effective batch size is still driven by gradient accumulation and data/tensor parallel settings from the recipe.

-## Step 8. Validate training output
+## Step 10. Validate training output

 Check that fine-tuning completed successfully and inspect the generated model artifacts. This confirms the training pipeline works correctly on your Spark device.

@@ -268,26 +259,8 @@ print('GPU available:', torch.cuda.is_available())
 print('GPU count:', torch.cuda.device_count())
 "
 ```
-<!--
-### Step 9. Configure distributed training (optional)
-
-Set up multi-GPU training configuration for larger models. This step is optional but recommended for models requiring more computational resources.
-
-```bash
-## Check available GPUs
-nvidia-smi -L
-
-## Configure distributed training environment
-export CUDA_VISIBLE_DEVICES=0,1
-
-## Run distributed training example
-uv run torchrun --nproc_per_node=2 \
-recipes/llm_finetune/finetune.py \
---model_id meta-llama/Llama-2-7b-hf \
---distributed
-``` -->

-## Step 9. Validate complete setup
+## Step 11. Validate complete setup

 Perform final validation to ensure all components are working correctly. This comprehensive check confirms the environment is ready for production fine-tuning workflows.

@@ -303,7 +276,7 @@ print('✅ Setup complete')
 "
 ```

-## Step 10. Troubleshooting
+## Step 12. Troubleshooting

 Common issues and solutions for NeMo AutoModel setup on NVIDIA Spark devices.

@@ -315,7 +288,7 @@ Common issues and solutions for NeMo AutoModel setup on NVIDIA Spark devices.
 | Out of memory during training | Model too large for available GPU memory | Reduce batch size, enable gradient checkpointing, or use model parallelism |
 | ARM64 package compatibility issues | Package not available for ARM architecture | Use source installation or build from source with ARM64 flags |

-## Step 11. Cleanup and rollback
+## Step 13. Cleanup and rollback

 Remove the installation and restore the original environment if needed. These commands safely remove all installed components.

@@ -336,7 +309,7 @@ pip3 uninstall uv
 rm -rf ~/.cache/pip
 ```

-## Step 12. Next steps
+## Step 14. Next steps

 Begin using NeMo AutoModel for your specific fine-tuning tasks. Start with provided recipes and customize based on your model requirements and dataset.

@@ -5,7 +5,7 @@
 ## Table of Contents

 - [Overview](#overview)
-- [Basic Idea](#basic-idea)
+- [Basic idea](#basic-idea)
 - [What you'll accomplish](#what-youll-accomplish)
 - [What to know before starting](#what-to-know-before-starting)
 - [Prerequisites](#prerequisites)
@@ -17,7 +17,7 @@

 ## Overview

-### Basic Idea
+### Basic idea

 NVIDIA Inference Microservices (NIMs) provide optimized containers for deploying large language
 models with simplified APIs. This playbook demonstrates how to run LLM NIMs on DGX Spark devices,
@@ -44,11 +44,11 @@ completions.
 ```bash
 nvidia-smi
 ```
-- Docker with NVIDIA Container Toolkit configured, instructions here: https://******.nvidia.com/dgx-docs/review/621/dgx-spark/latest/nvidia-container-runtime-for-docker.html
+- Docker with NVIDIA Container Toolkit configured, instructions [here](https://******.nvidia.com/dgx-docs/review/621/dgx-spark/latest/nvidia-container-runtime-for-docker.html)
 ```bash
 docker run -it --gpus=all nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 nvidia-smi
 ```
-- NGC account with API key from https://ngc.nvidia.com/setup/api-key
+- NGC account with API key from [here](https://ngc.nvidia.com/setup/api-key)
 ```bash
 echo $NGC_API_KEY | grep -E '^[a-zA-Z0-9]{86}=='
 ```

@@ -1,6 +1,6 @@
 # Quantize to NVFP4

-> Quantize a model to NVFP4 to run on Spark
+> Quantize a model to NVFP4 to run on Spark using TensorRT Model Optimizer

 ## Table of Contents

@@ -29,6 +29,8 @@
 You'll quantize the DeepSeek-R1-Distill-Llama-8B model using NVIDIA's TensorRT Model Optimizer
 inside a TensorRT-LLM container, producing an NVFP4 quantized model for deployment on NVIDIA DGX Spark.

+The examples use NVIDIA FP4 quantized models which help reduce model size by approximately 2x by reducing the precision of model layers.
+This quantization approach aims to preserve accuracy while providing significant throughput improvements. However, it's important to note that quantization can potentially impact model accuracy - we recommend running evaluations to verify if the quantized model maintains acceptable performance for your use case.

 ## What to know before starting

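The size reduction mentioned in the added lines above can be sanity-checked with back-of-the-envelope arithmetic. A sketch assuming 4-bit weights plus one 8-bit scale per 16-element block, which is the general shape of NVFP4-style block formats (exact metadata overhead varies by implementation); note this weights-only estimate is an upper bound on savings, since activations, KV cache, and any layers left unquantized do not shrink:

```python
def weight_bytes(n_params: int, bits_per_weight: float,
                 scale_bits: int = 0, block: int = 1) -> float:
    """Approximate storage for n_params weights, plus one scale per block of weights."""
    scale_overhead = (n_params / block) * scale_bits if scale_bits else 0.0
    return (n_params * bits_per_weight + scale_overhead) / 8.0

N = 8_000_000_000  # ~8B parameters, as in DeepSeek-R1-Distill-Llama-8B
fp16 = weight_bytes(N, 16)
nvfp4 = weight_bytes(N, 4, scale_bits=8, block=16)  # 4-bit weights + FP8 scale per 16 weights

print(f"FP16:  {fp16 / 1e9:.1f} GB")
print(f"NVFP4: {nvfp4 / 1e9:.1f} GB ({fp16 / nvfp4:.2f}x smaller, weights only)")
```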
@@ -162,12 +164,16 @@ You should see model weight files, configuration files, and tokenizer files in t

 ## Step 7. Test model loading

-Verify the quantized model can be loaded properly using a simple Python test.
+First, set the path to your quantized model:

 ```bash
+## Set path to quantized model directory
+export MODEL_PATH="./output_models/saved_models_DeepSeek-R1-Distill-Llama-8B_nvfp4_hf/"
+```
+
+Now verify the quantized model can be loaded properly using a simple test:
+
+```bash
 docker run \
 -e HF_TOKEN=$HF_TOKEN \
 -v $HOME/.cache/huggingface/:/root/.cache/huggingface/ \
@@ -183,7 +189,41 @@ docker run \
 '
 ```

-## Step 8. Troubleshooting
+## Step 8. Serve the model with OpenAI-compatible API
+
+Start the TensorRT-LLM OpenAI-compatible API server with the quantized model.
+First, set the path to your quantized model:
+
+```bash
+## Set path to quantized model directory
+export MODEL_PATH="./output_models/saved_models_DeepSeek-R1-Distill-Llama-8B_nvfp4_hf/"
+
+docker run \
+-e HF_TOKEN=$HF_TOKEN \
+-v "$MODEL_PATH:/workspace/model" \
+--rm -it --ulimit memlock=-1 --ulimit stack=67108864 \
+--gpus=all --ipc=host --network host \
+nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
+trtllm-serve /workspace/model \
+--backend pytorch \
+--max_batch_size 4 \
+--port 8000
+```
+
+Run the following to test the server with a client CURL request:
+
+```bash
+curl -X POST http://localhost:8000/v1/completions \
+-H "Content-Type: application/json" \
+-d '{
+"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
+"prompt": "What is artificial intelligence?",
+"max_tokens": 100,
+"temperature": 0.7,
+"stream": false
+}'
+```
+
+## Step 9. Troubleshooting

 | Symptom | Cause | Fix |
 |---------|--------|-----|
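For scripted testing, the request the curl example in this hunk sends can be assembled with the standard library. A sketch — the endpoint and field names mirror the OpenAI-style completions API the server exposes, and nothing is sent over the network here unless you call `urlopen` yourself:

```python
import json
import urllib.request

def build_completion_request(url: str, model: str, prompt: str,
                             max_tokens: int = 100,
                             temperature: float = 0.7) -> urllib.request.Request:
    """Assemble (but do not send) an OpenAI-style /v1/completions POST request."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": False,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_completion_request(
    "http://localhost:8000/v1/completions",
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "What is artificial intelligence?",
)
print(req.full_url, req.get_method())
# Once trtllm-serve is up: urllib.request.urlopen(req) returns the JSON response.
```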
@@ -193,7 +233,7 @@ docker run \
 | Git clone fails inside container | Network connectivity issues | Check internet connection and retry |
 | Quantization process hangs | Container resource limits | Increase Docker memory limits or use `--ulimit` flags |

-## Step 9. Cleanup and rollback
+## Step 10. Cleanup and rollback

 To clean up the environment and remove generated files:

@@ -210,7 +250,7 @@ rm -rf ~/.cache/huggingface
 docker rmi nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
 ```

-## Step 10. Next steps
+## Step 11. Next steps

 The quantized model is now ready for deployment. Common next steps include:
 - Benchmarking inference performance compared to the original model.

@@ -9,9 +9,9 @@
 - [Step 1. Configure Docker permissions](#step-1-configure-docker-permissions)
 - [Step 2. Run draft-target speculative decoding](#step-2-run-draft-target-speculative-decoding)
 - [Step 3. Test the draft-target setup](#step-3-test-the-draft-target-setup)
-- [Troubleshooting](#troubleshooting)
-- [Cleanup](#cleanup)
-- [Next Steps](#next-steps)
+- [Step 4. Troubleshooting](#step-4-troubleshooting)
+- [Step 5. Cleanup](#step-5-cleanup)
+- [Step 6. Next Steps](#step-6-next-steps)

 ---

@@ -25,7 +25,6 @@ This way, the big model doesn't need to predict every token step-by-step, reduci
 ## What you'll accomplish
-
 You'll explore speculative decoding using TensorRT-LLM on NVIDIA Spark using the traditional Draft-Target approach.

 These examples demonstrate how to accelerate large language model inference while maintaining output quality.

 ## What to know before starting
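The draft-target idea summarized in this hunk's context line can be illustrated with a toy sketch: a deterministic stand-in where both "models" are simple functions, showing the accept-until-first-mismatch rule that lets the target model verify several draft tokens in one step. (Real systems compare token probability distributions rather than exact strings; everything here is a labeled toy.)

```python
def draft_model(context: list[str], k: int) -> list[str]:
    """Cheap draft: guesses the next k tokens (here, a canned continuation)."""
    guesses = ["the", "quick", "brown", "cat"]  # deliberately wrong at position 3
    return guesses[:k]

def target_model(context: list[str]) -> str:
    """Expensive target: the ground-truth next token (here, a canned sequence)."""
    truth = ["the", "quick", "brown", "fox", "jumps"]
    return truth[len(context)]

def speculative_step(context: list[str], k: int = 4) -> list[str]:
    """Accept draft tokens until the first disagreement, then take the target's token."""
    accepted: list[str] = []
    for tok in draft_model(context, k):
        if tok == target_model(context + accepted):
            accepted.append(tok)  # draft verified; the target's work is amortized
        else:
            break
    # The target always contributes at least one token per step.
    accepted.append(target_model(context + accepted))
    return accepted

print(speculative_step([]))  # → ['the', 'quick', 'brown', 'fox']
```

One step here yields four tokens while the draft was only right three times, which is exactly why higher draft acceptance rates translate into higher throughput.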
@@ -132,13 +131,14 @@ curl -X POST http://localhost:8000/v1/completions \
 }'
 ```

-#### Key features of draft-target:
+**Key features of draft-target:**
+
 - **Efficient resource usage**: 8B draft model accelerates 70B target model
 - **Flexible configuration**: Adjustable draft token length for optimization
 - **Memory efficient**: Uses FP4 quantized models for reduced memory footprint
 - **Compatible models**: Uses Llama family models with consistent tokenization

-### Troubleshooting
+### Step 4. Troubleshooting

 Common issues and solutions:

@@ -149,7 +149,7 @@ Common issues and solutions:
 | Model download fails | Network or authentication issues | Check HuggingFace authentication and network connectivity |
 | Server doesn't respond | Port conflicts or firewall | Check if port 8000 is available and not blocked |

-### Cleanup
+### Step 5. Cleanup

 Stop the Docker container when finished:

@@ -162,7 +162,7 @@ docker stop <container_id>
 ## rm -rf $HOME/.cache/huggingface/hub/models--*gpt-oss*
 ```

-### Next Steps
+### Step 6. Next Steps

 - Experiment with different `max_draft_len` values (1, 2, 3, 4, 8)
 - Monitor token acceptance rates and throughput improvements

@@ -1,4 +1,4 @@
-# Setup Tailscale on your Spark
+# Set up Tailscale on your Spark

 > Use Tailscale to connect to your Spark on your home network no matter where you are

@@ -25,7 +25,7 @@

 ## Overview

-## Basic Idea
+## Basic idea

 Tailscale creates an encrypted peer-to-peer mesh network that allows secure access
 to your NVIDIA Spark device from anywhere without complex firewall configurations
@@ -51,8 +51,9 @@ all traffic automatically encrypted and NAT traversal handled transparently.

 ## Prerequisites

-- NVIDIA Spark device running Ubuntu (ARM64/AArch64)
+- NVIDIA Spark device running DGX OS (ARM64/AArch64)
 - Client device (Mac, Windows, or Linux) for remote access
+- Client device and DGX Spark not on the same network when testing connectivity
 - Internet connectivity on both devices
 - Valid email account for Tailscale authentication (Google, GitHub, Microsoft)
 - SSH server availability check: `systemctl status ssh`
@@ -61,7 +62,7 @@ all traffic automatically encrypted and NAT traversal handled transparently.

 ## Time & risk

-**Time estimate**: 15-30 minutes for initial setup, 5 minutes per additional device
+**Duration**: 15-30 minutes for initial setup, 5 minutes per additional device

 **Risks**:
 - Potential SSH service configuration conflicts
@@ -98,10 +99,10 @@ the Spark device.

 ```bash
 ## Check if SSH is running
-systemctl status ssh
+systemctl status ssh --no-pager
 ```

-#### If SSH is not installed or running
+**If SSH is not installed or running:**

 ```bash
 ## Install OpenSSH server
@@ -109,10 +110,10 @@ sudo apt update
 sudo apt install -y openssh-server

 ## Enable and start SSH service
-sudo systemctl enable ssh --now
+sudo systemctl enable ssh --now --no-pager

 ## Verify SSH is running
-systemctl status ssh
+systemctl status ssh --no-pager
 ```

 ### Step 3. Install Tailscale on NVIDIA Spark
@@ -153,7 +154,7 @@ with authentication.
 tailscale version

 ## Check Tailscale service status
-sudo systemctl status tailscaled
+sudo systemctl status tailscaled --no-pager
 ```

 ### Step 5. Connect Spark device to Tailscale network
@@ -172,31 +173,44 @@ sudo tailscale up

 ### Step 6. Install Tailscale on client devices

 Install Tailscale on the devices you'll use to connect to your Spark remotely.

 Choose the appropriate method for your client operating system.

-#### On macOS
-
-```bash
-## Option 1: Install from Mac App Store
-## Search for "Tailscale" and click Get → Install
-
-## Option 2: Download from website
-## Visit https://tailscale.com/download and download .pkg installer
-```
-
-#### On Windows
-
-```bash
-## Download installer from https://tailscale.com/download
-## Run the .msi file and follow installation prompts
-## Launch Tailscale from Start Menu or system tray
-```
-
-#### On Linux
-
-```bash
-## Use same installation steps as Spark device (Steps 3-4)
-## Adjust repository URLs for your specific distribution if needed
-```
+**On macOS:**
+- Option 1: Install from Mac App Store by searching for "Tailscale" and then clicking Get → Install
+- Option 2: Download the .pkg installer from the [Tailscale website](https://tailscale.com/download)
+
+**On Windows:**
+- Download installer from the [Tailscale website](https://tailscale.com/download)
+- Run the .msi file and follow installation prompts
+- Launch Tailscale from Start Menu or system tray
+
+**On Linux:**
+
+Use the same instructions as were done for installing on your DGX Spark.
+
+```bash
+## Update package list
+sudo apt update
+
+## Install required tools for adding external repositories
+sudo apt install -y curl gnupg
+
+## Add Tailscale signing key
+curl -fsSL https://pkgs.tailscale.com/stable/ubuntu/noble.noarmor.gpg | \
+sudo tee /usr/share/keyrings/tailscale-archive-keyring.gpg > /dev/null
+
+## Add Tailscale repository
+curl -fsSL https://pkgs.tailscale.com/stable/ubuntu/noble.tailscale-keyring.list | \
+sudo tee /etc/apt/sources.list.d/tailscale.list
+
+## Update package list with new repository
+sudo apt update
+
+## Install Tailscale
+sudo apt install -y tailscale
+```

 ### Step 7. Connect client devices to tailnet
@@ -204,12 +218,13 @@ Choose the appropriate method for your client operating system.
 Log in to Tailscale on each client device using the same identity provider
 account you used for the Spark device.

-#### On macOS/Windows (GUI)
+**On macOS/Windows (GUI):**
 - Launch Tailscale app
 - Click "Log in" button
 - Sign in with same account used on Spark

-#### On Linux (CLI)
+**On Linux (CLI):**
+
 ```bash
 ## Start Tailscale on client
 sudo tailscale up
@@ -237,7 +252,7 @@ tailscale ping <SPARK_HOSTNAME>
 Set up SSH key authentication for secure access to your Spark device. This
 step runs on your client device and Spark device.

-#### Generate SSH key on client (if not already done)
+**Generate SSH key on client (if not already done):**

 ```bash
 ## Generate new SSH key pair
@ -247,7 +262,7 @@ ssh-keygen -t ed25519 -f ~/.ssh/tailscale_spark
|
||||
cat ~/.ssh/tailscale_spark.pub
|
||||
```
|
||||
|
||||
#### Add public key to Spark device
|
||||
**Add public key to Spark device:**
|
||||
|
||||
```bash
|
||||
## On Spark device, add client's public key
|
||||
Verify that Tailscale is working correctly and your SSH connection is stable.

```bash
# From the client device, check connection status
tailscale status

# Create a test file on the client device
echo "test file for the spark" > test.txt

# Test file transfer over SSH
scp -i ~/.ssh/tailscale_spark test.txt <USERNAME>@<SPARK_HOSTNAME>:~/

# Test remote command execution
ssh -i ~/.ssh/tailscale_spark <USERNAME>@<SPARK_HOSTNAME> 'nvidia-smi'
```

Expected output:
- Tailscale status displaying both devices as "active"
- Successful file transfers
- Remote command execution working
@ -301,7 +319,7 @@ Common issues and their solutions:
|
||||
| Symptom | Cause | Fix |
|
||||
|---------|-------|-----|
|
||||
| `tailscale up` auth fails | Network issues | Check internet, try `curl -I login.tailscale.com` |
|
||||
| SSH connection refused | SSH not running | Run `sudo systemctl start ssh` on Spark |
|
||||
| SSH connection refused | SSH not running | Run `sudo systemctl start ssh --no-pager` on Spark |
|
||||
| SSH auth failure | Wrong SSH keys | Check public key in `~/.ssh/authorized_keys` |
|
||||
| Cannot ping hostname | DNS issues | Use IP from `tailscale status` instead |
|
||||
| Devices missing | Different accounts | Use same identity provider for all devices |
|
||||
Your Tailscale setup is complete. You can now:

- Access your Spark device from any network with: `ssh <USERNAME>@<SPARK_HOSTNAME>`
- Transfer files securely: `scp file.txt <USERNAME>@<SPARK_HOSTNAME>:~/`
- Open the DGX Dashboard and start JupyterLab, then connect with:
  `ssh -L 8888:localhost:1102 <USERNAME>@<SPARK_HOSTNAME>`
```
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
```

Use an interface that shows as "(Up)" in your output. In this example, we'll use **enp1s0f0np0**.

On Node 1:
## Overview

vLLM is an inference engine designed to run large language models efficiently. The key idea is **maximizing throughput and minimizing memory waste** when serving LLMs.

- It uses a memory-efficient attention algorithm called **PagedAttention** to handle long sequences without running out of GPU memory.
- New requests can be added to a batch already in progress through **continuous batching** to keep GPUs fully utilized.
- It has an **OpenAI-compatible API**, so applications built for the OpenAI API can switch to a vLLM backend with little or no modification.
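The bullets above can be made concrete with a little arithmetic. The sketch below illustrates only the PagedAttention idea (it is not vLLM's implementation, and the 16-token block size is an assumption): paged allocation pays per block actually used, while contiguous allocation must reserve for the maximum sequence length.

```python
# Illustrative sketch of PagedAttention-style KV-cache accounting.
# Assumption: fixed 16-token blocks; not vLLM's actual code.

def kv_blocks_needed(seq_len: int, block_size: int = 16) -> int:
    """Number of fixed-size KV-cache blocks a sequence occupies."""
    return -(-seq_len // block_size)  # ceiling division

max_len = 4096      # maximum context a contiguous allocator must reserve for
block_size = 16
seq_lens = [7, 130, 2048]  # three in-flight sequences of different lengths

# Paged: each sequence only occupies the blocks it actually fills
paged_blocks = sum(kv_blocks_needed(n, block_size) for n in seq_lens)

# Contiguous: every sequence reserves blocks for the full max length
contiguous_blocks = len(seq_lens) * (max_len // block_size)

print(paged_blocks)       # 1 + 9 + 128 = 138 blocks
print(contiguous_blocks)  # 3 * 256 = 768 blocks
```

Continuous batching compounds this saving: blocks are freed as sequences finish, so new requests can be admitted without waiting for the whole batch to drain.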
## What you'll accomplish

You'll set up vLLM high-throughput LLM serving on DGX Spark with Blackwell architecture,
support for ARM64.

## Time & risk

**Duration:** 30 minutes for Docker approach

**Risks:** Container registry access requires internal credentials
Expected response should contain `"content": "204"` or similar mathematical calculation.

| CUDA version mismatch errors | Wrong CUDA toolkit version | Reinstall CUDA 12.9 using exact installer |
| Container registry authentication fails | Invalid or expired GitLab token | Generate new auth token |
| SM_121a architecture not recognized | Missing LLVM patches | Verify SM_121a patches applied to LLVM source |
| Reduce MAX_JOBS to 1-2, add swap space |
| Environment variables not set |

## Step 4. Cleanup and rollback
```bash
sudo usermod -aG docker $USER
newgrp docker
```

After this, you should be able to run docker commands without using `sudo`.

Next, create an NGC API Key [here](https://ngc.nvidia.com/setup/api-key) so that you can pull containers from NGC.

Once you have the API key, you can configure docker to pull from NGC and pull down the VLLM image:

```bash
docker login nvcr.io
```
In a terminal, clone the repository and navigate to the VLM fine-tuning directory.

```bash
git clone https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets dgx-spark-playbooks
```
## Step 3. Build the Docker container

Build the Docker image. This will set up the environment for both image and video VLM fine-tuning.

Export your Hugging Face token as the environment variable `HF_TOKEN` before building. You may encounter warnings when building the image; this is expected and can be ignored.

```bash
# Enter the correct directory for building the image
cd dgx-spark-playbooks/nvidia/vlm-finetuning/assets

# Build the VLM fine-tuning container
docker build --build-arg HF_TOKEN=$HF_TOKEN -t vlm_demo .
```
If you already have a fine-tuned checkpoint, place it in the `saved_model/` folder. Note that your checkpoint number can be different. For a comparative analysis against the base model, skip directly to the `Finetuned Model Inference` section.

#### 5.2. Download the wildfire dataset

The project uses a **Wildfire Detection Dataset** with satellite imagery for training the model to identify wildfire-affected regions. The dataset includes:
- Satellite and aerial imagery from wildfire-affected areas
- Binary classification: wildfire vs no wildfire

```bash
mkdir -p ui_image/data
cd ui_image/data
```

For this fine-tuning playbook, we will use the [Wildfire Prediction Dataset](https://www.kaggle.com/datasets/abdelghaniaaba/wildfire-prediction-dataset) from Kaggle. On the dataset page, click the download button, select the `cURL` option in the `Download Via` dropdown, and copy the curl command.

> **Note**: You will need to be logged into Kaggle and may need to accept the dataset terms before the download link works.

Run the following commands in your container:

```bash
# Paste and run the curl command from Kaggle here, then unzip the dataset

unzip -qq wildfire-prediction-dataset.zip
rm wildfire-prediction-dataset.zip
cd ..
```
#### 5.3. Base model inference

Access the streamlit demo at http://localhost:8501/.

When you access the streamlit demo for the first time, the backend triggers vLLM servers to spin up for the base model. You will see a spinner on the demo site as vLLM is being brought up for optimized inference. This step can take up to 15 minutes.

Since we are currently focused on inference with the base model, scroll down to the `Image Inference` section of the UI. Here, you should see a sample pre-loaded satellite image of a potentially wildfire-affected region.

Enter your prompt in the chat box and hit `Generate`. Your prompt will first be sent to the base model, and you should see the generated response in the left chat box. If you did not provide a fine-tuned model, you will not see any generations in the right chat box. You can use the following prompt to quickly test inference:

`Identify if this region has been affected by a wildfire`

As you can see, the base model is incapable of providing the right response for this domain-specific task. Let's try to improve the model's accuracy by performing GRPO fine-tuning.
#### 5.4. GRPO fine-tuning

We will perform GRPO fine-tuning to add reasoning capabilities to our base model and improve the model's understanding of the underlying domain. Assuming you have already spun up the streamlit demo, scroll to the `GRPO Training` section.

Configure the fine-tuning method and LoRA parameters based on the following options:

- `Finetuning Method`: Choose from Full Finetuning or LoRA
- `LoRA Parameters`: Adjustable rank (8-64) and alpha (8-64)
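To get a feel for what rank and alpha control, here is a back-of-the-envelope sketch. The 3584x3584 layer size is a hypothetical stand-in for one projection in a 7B-class VLM; this is not the playbook's training code.

```python
# Illustrative LoRA accounting: a frozen d_out x d_in weight gets a low-rank
# update B @ A, with A of shape (r, d_in) and B of shape (d_out, r),
# scaled by alpha / r.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA-adapted linear layer."""
    return rank * d_in + d_out * rank

def lora_scale(alpha: int, rank: int) -> float:
    """Scaling applied to the low-rank update: W + (alpha / r) * B @ A."""
    return alpha / rank

full = 3584 * 3584  # hypothetical full-weight parameter count
for r in (8, 64):   # the rank range offered by the UI
    added = lora_params(3584, 3584, r)
    print(r, added, f"{added / full:.2%} of the full layer")
```

Higher ranks train more parameters (and take longer) but can capture more task-specific structure; alpha rescales how strongly the adapter perturbs the frozen weights.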
You can additionally choose which layers of the VLM you want to fine-tune. For the best performance, ensure that all options are toggled on. Note that this will increase the model training time as well.

In this section, we can select model parameters relevant to our training run:

- `Steps`: 1-1000
- `Batch Size`: 1, 2, 4, 8, or 16
- `Learning Rate`: 1e-6 to 1e-2
- `Optimizer`: AdamW or Adafactor
For a GRPO setup, we also have flexibility in choosing the reward assigned to the model based on certain criteria:

- `Format Reward`: 2.0 (reward for proper reasoning format)
- `Correctness Reward`: 5.0 (reward for correct answers)
- `Number of Generations`: 4 (for preference optimization)
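The reward settings above can be read as follows. This is a hedged sketch of the GRPO idea only: real GRPO implementations typically also normalize advantages by the group's standard deviation, and the function names here are illustrative.

```python
# Illustrative GRPO reward/advantage accounting (not the playbook's code).
# Each prompt gets several generations; each generation earns a format
# reward plus a correctness reward, and its advantage is its reward
# relative to the group mean.

FORMAT_REWARD = 2.0       # proper reasoning format
CORRECTNESS_REWARD = 5.0  # correct final answer

def reward(has_reasoning_format: bool, is_correct: bool) -> float:
    r = FORMAT_REWARD if has_reasoning_format else 0.0
    r += CORRECTNESS_REWARD if is_correct else 0.0
    return r

def group_advantages(rewards):
    """Reward of each generation minus the group mean (GRPO's baseline)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# 4 generations for one prompt: two well-formatted, one of them also correct
rs = [reward(True, True), reward(True, False),
      reward(False, False), reward(False, False)]
print(rs)                    # [7.0, 2.0, 0.0, 0.0]
print(group_advantages(rs))  # mean 2.25 -> [4.75, -0.25, -2.25, -2.25]
```

Generations that beat their group's average get positive advantages and are reinforced; the rest are discouraged, which is why multiple generations per prompt are needed.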
After configuring all the parameters, hit `Start Finetuning` to begin the training process. You will need to wait about 15 minutes for the model to load and start recording metadata on the UI. As training progresses, information such as the loss, step, and GRPO rewards will be recorded in a live table.

The default loaded configuration should give you reasonable accuracy, taking 100 steps of training over a period of up to 2 hours. We achieved our best accuracy with around 1000 steps of training, taking close to 16 hours.

After training has reached the desired number of steps, the script automatically merges the LoRA weights into the base model; this merge can take about 5 minutes.

If you wish to stop training early, hit the `Stop Finetuning` button. Use this button only to interrupt training: it does not guarantee that checkpoints will be properly stored or merged with the LoRA adapter layers.

Once you stop training, the UI will automatically bring up the vLLM servers for the base model and the newly fine-tuned model.
#### 5.5. Fine-tuned model inference

Now we are ready to perform a comparative analysis between the base model and the fine-tuned model.

If you haven't spun up the streamlit demo already, execute the following command. If you have just stopped training and are still within the live UI, skip this step.

```bash
streamlit run Image_VLM.py
```

Regardless of whether you just spun up the demo or just stopped training, please wait about 15 minutes for the vLLM servers to be brought up.

Scroll down to the `Image Inference` section and enter your prompt in the provided chat box. Upon clicking `Generate`, your prompt will be sent first to the base model and then to the fine-tuned model. You can use the following prompt to quickly test inference:

`Identify if this region has been affected by a wildfire`
## Step 6. [Option B] For video VLM fine-tuning (Driver Behaviour Analysis)

Within the same container, navigate to the `ui_video` directory.

```bash
cd /vlm_finetuning/ui_video
```
#### 6.1. Prepare your video dataset

Structure your dataset as follows. Ensure that `metadata.jsonl` contains rows of structured JSON data about each video.

```
dataset/
```
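As a concrete illustration of the JSONL layout, the snippet below writes and re-reads one metadata row. The field names (`video`, `label`, `description`) are hypothetical - use whatever schema your training code expects.

```python
# Hypothetical metadata.jsonl row for the driver-behaviour task
# (field names are illustrative, not the playbook's required schema).
import json

rows = [
    {"video": "videos/clip_0001.mp4",
     "label": "unsafe_lane_change",
     "description": "Driver changes lanes without signaling."},
]

# JSONL: one complete JSON object per line
with open("metadata.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read it back and verify every line parses as structured JSON
with open("metadata.jsonl") as f:
    parsed = [json.loads(line) for line in f]
print(parsed[0]["label"])
```

A quick round-trip like this catches malformed rows (trailing commas, unescaped quotes) before a long training run fails on them.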
#### 6.2. Model download

> **Note**: These instructions assume you are already inside the Docker container. For container setup, refer to the `Build the Docker container` section above.

```bash
hf download OpenGVLab/InternVL3-8B
```
Before going ahead and fine-tuning our video VLM for this task, let's see how the base InternVL3-8B does.

```bash
# cd into /vlm_finetuning/ui_video if you haven't already
streamlit run Video_VLM.py
```
When you access the streamlit demo for the first time, the backend triggers Hugging Face model loading for the base model.

First, let's select a video from our dashcam gallery. Upon clicking the green file-open icon near a video, you should see the video render and play automatically.

Scroll down, enter your prompt in the chat box, and hit `Generate`. Your prompt will first be sent to the base model, and you should see the generated response in the left chat box. If you did not provide a fine-tuned model, you will not see any generations in the right chat box. You can use the following prompt to quickly test inference:

`Analyze the dashcam footage for unsafe driver behavior`

If you are proceeding to train a fine-tuned model, ensure that the streamlit demo UI is brought down before training. You can bring it down by interrupting the terminal with a `Ctrl+C` keystroke.

> **Note**: To clear out any extra occupied memory from your system, execute the following command outside the container after interrupting the Streamlit server.
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```

```bash
# Enter the correct directory
cd train

# Start Jupyter
jupyter notebook video_vlm.ipynb
```
Access Jupyter at `http://localhost:8888`. Ensure that you set the path to your dataset:

```python
dataset_path = "/path/to/your/dataset"
```

Here are some of the key training parameters that are configurable. Note that for reasonable quality, you will need to train your video VLM for at least 24 hours given the complexity of processing spatio-temporal video sequences.

- **Model**: InternVL3-8B
- **Video Frames**: 12 to 16 frames per video
- **Sampling Mode**: Uniform temporal sampling
- **LoRA Configuration**: Efficient parameter updates for large-scale fine-tuning
- **Hyperparameters**: Exhaustive suite of hyperparameters to tune for video VLM fine-tuning
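Uniform temporal sampling can be sketched as follows. This is an assumed behavior for illustration (not the notebook's exact code): n frame indices are spread evenly across the clip, one from the middle of each equal-length segment.

```python
# Illustrative uniform temporal sampling: pick n evenly spaced frame
# indices so the sampled frames cover the whole clip.

def uniform_frame_indices(total_frames: int, n: int) -> list[int]:
    """Evenly spaced frame indices, one per equal-length segment."""
    if total_frames <= n:
        return list(range(total_frames))  # short clip: take every frame
    step = total_frames / n
    return [int(step * i + step / 2) for i in range(n)]

# A 300-frame dashcam clip sampled at 12 frames
print(uniform_frame_indices(300, 12))
# [12, 37, 62, 87, 112, 137, 162, 187, 212, 237, 262, 287]
```

Sampling uniformly (rather than taking the first n frames) matters for driver-behaviour analysis because the salient event can occur anywhere in the clip.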
You can monitor and evaluate the training progress and metrics, as they will be continuously shown in the notebook.

After training, ensure that you shut down the Jupyter kernel in the notebook and stop the Jupyter server in the terminal with a `Ctrl+C` keystroke.

> **Note**: To clear out any extra occupied memory from your system, execute the following command outside the container after stopping the Jupyter server.
Now we are ready to perform a comparative analysis between the base model and the fine-tuned model.

If you haven't spun up the streamlit demo already, execute the following command. If you have just stopped training and are still within the live UI, skip to the next step.

```bash
# cd back to /vlm_finetuning/ui_video if you haven't already
streamlit run Video_VLM.py
```

Access the streamlit demo at http://localhost:8501/.

If you trained your model sufficiently, you should see that the fine-tuned model is able to identify the salient events from the video and generate a structured output.

Since the model's output adheres to the schema we trained, we can directly export the model's prediction into a database for video analytics.
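As a sketch of that export step (the table schema and field names are hypothetical, and a real pipeline would write to a persistent database file rather than memory):

```python
# Illustrative export of a structured VLM prediction into SQLite.
# Field names are hypothetical; use the schema your model was trained on.
import json
import sqlite3

# A structured prediction as the fine-tuned model might emit it
prediction = json.dumps({"video": "clip_0001.mp4",
                         "event": "unsafe_lane_change",
                         "timestamp_s": 12.4})

conn = sqlite3.connect(":memory:")  # use a file path for a persistent DB
conn.execute("CREATE TABLE events (video TEXT, event TEXT, timestamp_s REAL)")

# Because the output adheres to a schema, it maps directly onto columns
row = json.loads(prediction)
conn.execute("INSERT INTO events VALUES (?, ?, ?)",
             (row["video"], row["event"], row["timestamp_s"]))

count, = conn.execute("SELECT COUNT(*) FROM events").fetchone()
print(count)  # 1
```

This is the payoff of schema-constrained fine-tuning: predictions become queryable rows instead of free-form text that needs further parsing.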
Feel free to play around with additional videos available in the gallery.
You will deploy NVIDIA's VSS AI Blueprint on NVIDIA Spark hardware with Blackwell GPU.

- Working with NVIDIA Docker containers and container registries
- Setting up Docker Compose environments with shared networks
- Managing environment variables and authentication tokens
- Working with NVIDIA DeepStream and computer vision pipelines
- Basic understanding of video processing and analysis workflows

## Prerequisites

- NVIDIA Spark device with ARM64 architecture and Blackwell GPU
- FastOS 1.81.38 or compatible ARM64 system
- Driver version 580.82.09 or higher installed: `nvidia-smi | grep "Driver Version"`
- CUDA version 13.0 installed: `nvcc --version`
- Docker installed and running: `docker --version && docker compose version`
- Access to NVIDIA Container Registry with [NGC API Key](https://org.ngc.nvidia.com/setup/api-keys)
- [Optional] NVIDIA API Key for remote model endpoints (hybrid deployment only)
- Sufficient storage space for video processing (>10GB recommended in `/tmp/`)
Check that your system meets the hardware and software prerequisites.

```bash
# Verify driver version
nvidia-smi | grep "Driver Version"
# Expected output: Driver Version: 580.82.09 or higher

# Verify CUDA version
nvcc --version
```
```bash
sudo usermod -aG docker $USER
newgrp docker
```

> **Warning**: After running usermod, you must log out and log back in to start a new
> session with updated group permissions; in rare cases, you may need to restart your
> Spark for the changes to take effect.

Additionally, configure Docker so that it can use the NVIDIA Container Runtime.
```bash
docker network create vss-shared-network
```

Log in to NVIDIA's container registry using your [NGC API Key](https://org.ngc.nvidia.com/setup/api-keys).

> **Note:** If you don't have an NVIDIA account already, you'll have to create one and register for the [developer program](https://developer.nvidia.com/nvidia-developer-program).

```bash
# Log in to NVIDIA Container Registry
docker login nvcr.io
```
Launch the complete VSS Event Reviewer stack including Alert Bridge, VLM Pipeline, and supporting services:

```bash
IS_SBSA=1 IS_AARCH64=1 ALERT_REVIEW_MEDIA_BASE_DIR=/tmp/alert-media-dir docker compose up
```

> **Note:** This step will take several minutes as containers are pulled and services initialize. The VSS backend requires additional startup time. Proceed to the next step in a new terminal in the meantime.

**8.5 Navigate to CV Event Detector directory**
Allow time for all containers to fully initialize before accessing the user interfaces.

```bash
# Monitor container status
docker ps

# Verify all containers show "Up" status and the VSS backend logs
# (vss-engine-sbsa:2.4.0) show the ready state
# "Uvicorn running on http://0.0.0.0:7860".
# In total, there should be 8 containers:
#   nvcr.io/nvidia/blueprint/nv-cv-event-detector-ui:2.4.0
#   nvcr.io/nvidia/blueprint/nv-cv-event-detector-sbsa:2.4.0
#   nginx:alpine
#   nvcr.io/nvidia/blueprint/vss-alert-inspector-ui:2.4.0
#   nvcr.io/nvidia/blueprint/alert-bridge:0.19.0-multiarch
#   nvcr.io/nvidia/blueprint/vss-engine-sbsa:2.4.0
#   nvcr.io/nvidia/blueprint/vst-storage:2.1.0-25.07.1
#   redis/redis-stack-server:7.2.0-v9
```

**8.9 Validate Event Reviewer deployment**
Access the web interfaces to confirm successful deployment and functionality.

```bash
# Test CV UI accessibility (default: localhost)
curl -I http://localhost:7862
# Expected: HTTP 200 response

# Test Alert Inspector UI accessibility (default: localhost)
curl -I http://localhost:7860
# Expected: HTTP 200 response

# If you are running your Spark in Remote or Accessory mode, replace
# 'localhost' with the IP address or hostname of your Spark device.
# To find your Spark's IP address, run the following command on the Spark system:
hostname -I
# Or to get the hostname:
hostname
# Then use the IP/hostname in place of 'localhost', for example:
#   curl -I http://<SPARK_IP_OR_HOSTNAME>:7862
```

Open these URLs in your browser:
- `http://localhost:7862` - CV UI to launch and monitor CV pipeline
- `http://localhost:7860` - Alert Inspector UI to view clips and review VLM results

> **Note:** You may now proceed to step 10.

## Step 9. Option B
```bash
cat config.yaml | grep -A 10 "model"

docker compose up
```

> **Note:** This step will take several minutes as containers are pulled and services initialize. The VSS backend requires additional startup time.

**9.7 Validate Standard VSS deployment**

Access the VSS UI to confirm successful deployment.

```bash
# Test VSS UI accessibility
# If running locally on your Spark device, use localhost:
curl -I http://localhost:9100
# Expected: HTTP 200 response

# If your Spark is running in Remote/Accessory mode, replace 'localhost'
# with the IP address or hostname of your Spark device.
# To find your Spark's IP address, run the following command on the Spark terminal:
hostname -I
# Or to get the hostname:
hostname
# Then test accessibility (replace <SPARK_IP_OR_HOSTNAME> with the actual value):
curl -I http://<SPARK_IP_OR_HOSTNAME>:9100
```

Open `http://localhost:9100` in your browser to access the VSS interface.
## Step 10. Test video processing workflow

Run a basic test to verify the video analysis pipeline is functioning, based on your deployment. The UI comes with a few example videos pre-populated for uploading and testing.

**For Event Reviewer deployment**

Follow the steps [here](https://docs.nvidia.com/vss/latest/content/vss_event_reviewer.html#vss-alert-inspector-ui) to access and use the Event Reviewer workflow.
- Access CV UI at `http://localhost:7862` to upload and process videos
- Monitor results in Alert Inspector UI at `http://localhost:7860`

**For Standard VSS deployment**

Follow the steps [here](https://docs.nvidia.com/vss/latest/content/ui_app.html) to navigate the VSS UI - File Summarization, Q&A, and Alerts.
- Access VSS interface at `http://localhost:9100`
- Upload videos and test summarization features

## Step 11. Troubleshooting