sarman/dgx-spark-playbooks


mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git synced 2026-04-22 18:13:52 +00:00

GitLab CI e139105ed3 chore: Regenerate all playbooks

2025-10-05 00:47:50 +00:00


Vision-Language Model Fine-tuning

Fine-tune Vision-Language Models for image and video understanding tasks using Qwen2.5-VL and InternVL3

Overview
Instructions
- Video VLM Testing

Overview

Basic Idea

This playbook demonstrates how to fine-tune Vision-Language Models (VLMs) for both image and video understanding tasks on DGX Spark. With 128 GB of unified memory shared between CPU and GPU, DGX Spark is well suited to training memory-intensive multimodal models that understand and reason about visual content.

The playbook covers two distinct VLM fine-tuning approaches:

Image VLM Fine-tuning: Using Qwen2.5-VL-7B for wildfire detection from satellite imagery with GRPO (Group Relative Policy Optimization)
Video VLM Fine-tuning: Using InternVL3 8B for dangerous driving detection and structured metadata generation from driving videos

Both approaches combine LoRA fine-tuning, preference optimization, and structured reasoning to improve performance on their specialized tasks.
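To make the GRPO idea concrete: the method samples several answers for the same prompt, scores each one, and rates every answer relative to the others in its group instead of relying on a separate value model. The sketch below shows only that normalization step, under the assumption that rewards are plain floats from some reward function (hypothetical here) and that the population standard deviation is used; real implementations may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division defined when every sample got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored by a hypothetical reward function.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Answers that score above the group mean receive a positive advantage and are reinforced; answers below it are discouraged. A group where every answer scores the same contributes no learning signal.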

What you'll accomplish

You will have fine-tuned VLMs that analyze both images and videos for specialized use cases, accessible through interactive web UIs. The setup includes:

Image VLM: Qwen2.5-VL fine-tuned for wildfire detection with reasoning capability
Video VLM: InternVL3 fine-tuned for dangerous driving analysis and structured metadata generation
Interactive Streamlit interfaces for both training and inference
Side-by-side model comparison (base vs fine-tuned) in the Web UIs
Docker containerization for reproducible environments

Prerequisites

DGX Spark device is set up and accessible
No other processes running on the DGX Spark GPU
Enough disk space for model downloads and datasets
NVIDIA Docker installed and configured
Weights & Biases account for training monitoring (optional but recommended)
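The disk-space prerequisite can be checked before starting any downloads. The following is a minimal sketch using only the Python standard library; the `REQUIRED_GIB` threshold is an assumed placeholder, not a figure taken from this playbook, so adjust it to the models and datasets you plan to download.

```python
import shutil


def has_free_space(path, required_gib):
    """Return True if the filesystem holding `path` has enough free space."""
    free_gib = shutil.disk_usage(path).free / 2**30
    return free_gib >= required_gib


# Assumed placeholder threshold; the playbook does not state a number.
REQUIRED_GIB = 100
print(has_free_space(".", REQUIRED_GIB))
```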

Time & risk

Duration:

15-20 minutes for initial setup and model downloads
30-60 minutes for image VLM training (depending on dataset size)
1-2 hours for video VLM training (depending on video dataset size)

Risks:

Docker permission issues may require user group changes and session restart
Large model downloads and datasets may require significant disk space and time
Training requires sustained GPU usage and memory
Dataset preparation may require manual steps (Kaggle downloads, video processing)

Rollback: Stop and remove Docker containers, delete downloaded models and datasets if needed

Instructions

Step 1. Configure Docker permissions

To easily manage containers without sudo, you must be in the docker group. If you choose to skip this step, you will need to run Docker commands with sudo.

Open a new terminal and test Docker access. In the terminal, run:

docker ps

If you see a permission denied error (something like permission denied while trying to connect to the Docker daemon socket), add your user to the docker group:

sudo usermod -aG docker $USER

Warning: After running usermod, you must log out and log back in to start a new session with updated group permissions.
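The reason for the new session is that group membership is fixed when a login session starts: usermod updates the group database, but processes that are already running keep their old group list. As an illustration, the sketch below (Unix only, standard library) checks whether the current process actually carries the docker group. The helper name `session_has_group` is made up for this example.

```python
import grp
import os


def session_has_group(group_name):
    """Return True if the current process already carries the named group."""
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        return False  # the group does not exist on this machine
    # os.getgroups() lists supplementary groups of the running process.
    return gid in os.getgroups() or gid == os.getegid()


print(session_has_group("docker"))
```

If this prints False right after running usermod, the session predates the change and a fresh login is still needed.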

Step 2. Clone the repository

In a terminal, clone the repository into dgx-spark-playbooks, then change into the VLM fine-tuning directory inside the checkout (the directory containing the Dockerfile used in Step 3).

git clone https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets.git dgx-spark-playbooks

Step 3. Build the Docker container

Build the Docker image. This will set up the environment for both image and video VLM fine-tuning:

docker build -t vlm-finetuning .

Step 4. Run the Docker container

## Run with GPU support and mount current directory
docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace \
    -p 8501:8501 \
    -p 8888:8888 \
    -p 6080:6080 \
    vlm-finetuning
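The command above publishes three host ports (8501, 8888, and 6080), and docker run fails if any of them is already taken on the host. A quick pre-flight check could look like the sketch below; it is an illustrative helper, not part of the playbook, and it only tests whether a local bind succeeds.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


# Ports published by the docker run command in this step.
for port in (8501, 8888, 6080):
    print(port, "free" if port_is_free(port) else "in use")
```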

Step 5. [Option A] For image VLM fine-tuning (Wildfire Detection)

5.1. Set up Weights & Biases

Configure your wandb credentials for training monitoring:

export WANDB_ENTITY=<your_wandb_username>
export WANDB_PROJECT="vlm_finetuning"
export WANDB_API_KEY=<your_wandb_api_key>
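Before launching a long training run, it can help to confirm that the three variables are visible to the Python process. This is a small sketch, assuming the training code reads the variables from the environment; the helper name `missing_vars` is made up for this example.

```python
import os

REQUIRED = ("WANDB_ENTITY", "WANDB_PROJECT", "WANDB_API_KEY")


def missing_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not environ.get(name)]


missing = missing_vars(REQUIRED)
print("missing:", missing if missing else "none")
```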

5.2. Download the wildfire dataset from Kaggle and place it in the `data` directory

The wildfire dataset can be found here: https://www.kaggle.com/datasets/abdelghaniaaba/wildfire-prediction-dataset
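After unpacking the download, a quick count of images per folder helps confirm that the files landed where expected. The sketch below assumes the images sit in class-named subfolders somewhere under `data` and use common image extensions; both are assumptions about the archive layout, so adapt the function if your copy is organized differently.

```python
from collections import Counter
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed extensions


def count_images_by_folder(root):
    """Count image files under `root`, grouped by their parent folder name."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            counts[path.parent.name] += 1
    return counts


print(count_images_by_folder("data"))
```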

5.3. Launch the Image VLM UI

cd ui_image
streamlit run Image_VLM_Finetuning.py

Access the interface at http://localhost:8501

5.4. Configure and start training

Configure training parameters through the web interface
Choose fine-tuning method (LoRA, QLoRA, or Full-Finetuning)
Set hyperparameters (epochs, batch size, learning rate)
Click "▶️ Start Finetuning" to begin GRPO training
Monitor progress via embedded wandb charts
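To give a feel for the choice between LoRA and full fine-tuning: LoRA freezes a weight matrix of shape d_out x d_in and trains two small matrices of rank r in its place, so the trainable parameter count per layer drops from d_in * d_out to r * (d_in + d_out). QLoRA keeps the same adapters but stores the frozen base weights in quantized form to save memory. The arithmetic below is a sketch; the layer size and rank are illustrative assumptions, not values used by this playbook.

```python
def full_params(d_in, d_out):
    """Trainable weights when the whole matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable weights in the two low-rank adapter matrices."""
    return rank * (d_in + d_out)


# Illustrative numbers only: a square 4096 x 4096 layer with rank-16 adapters.
d_in = d_out = 4096
rank = 16
ratio = lora_params(d_in, d_out, rank) / full_params(d_in, d_out)
print(f"LoRA trains {ratio:.2%} of this layer's weights")
```

For these example numbers the adapters hold under 1% of the layer's weights, which is why LoRA and QLoRA need far less optimizer memory than full fine-tuning.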

5.5. Test the fine-tuned model

After training completes:

Bring down the UI with Ctrl+C
Edit src/image_vlm_config.yaml and update finetuned_model_id to point to your model in saved_model/
Restart the interface to test your fine-tuned model
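If several training runs have written output, finding the most recent one by hand is error-prone. The sketch below picks the most recently modified subdirectory of saved_model/. It assumes each run is saved as its own subdirectory, which this playbook does not spell out, so treat the result as a candidate value for finetuned_model_id and verify it before editing the config.

```python
from pathlib import Path


def newest_subdir(root):
    """Return the most recently modified subdirectory of `root`, or None."""
    root = Path(root)
    if not root.is_dir():
        return None
    subdirs = [p for p in root.iterdir() if p.is_dir()]
    if not subdirs:
        return None
    return max(subdirs, key=lambda p: p.stat().st_mtime)


print(newest_subdir("saved_model"))
```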

Step 6. [Option B] For video VLM fine-tuning (Driver Behaviour Analysis)

6.1. Prepare your video dataset

Structure your dataset as follows:

dataset/
├── videos/
│   ├── video1.mp4
│   ├── video2.mp4
│   └── ...
└── metadata.jsonl
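A mismatch between metadata.jsonl and the videos/ folder is easy to miss until training fails, so a quick consistency check can save time. The sketch below assumes each line of metadata.jsonl is a JSON object that names its video file under a `video` key; that key name is hypothetical, since the schema is not given here, so pass the field your notebook actually uses via `video_key`.

```python
import json
from pathlib import Path


def check_dataset(root, video_key="video"):
    """Return a list of problems found in root/metadata.jsonl."""
    root = Path(root)
    problems = []
    with open(root / "metadata.jsonl", encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            name = record.get(video_key) if isinstance(record, dict) else None
            if not isinstance(name, str) or not name:
                problems.append(f"line {lineno}: missing '{video_key}' field")
            elif not (root / "videos" / name).is_file():
                problems.append(f"line {lineno}: {name} not found in videos/")
    return problems
```

Calling `check_dataset("dataset")` returns an empty list when every record is valid JSON and points at an existing file.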

6.2. Start Jupyter Lab

jupyter lab --ip=0.0.0.0 --port=8888 --allow-root

Access Jupyter at http://localhost:8888

6.3. Run the training notebook

cd ui_video/train
## Open and run internvl3_dangerous_driving.ipynb
## Update dataset path in the notebook to point to your data

6.4. Run inference

Video VLM Testing

Use the inference notebook to test the model on dashcam footage
Generate structured JSON metadata for dangerous driving events
Analyze traffic violations and safety risks
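When a model is asked for structured JSON metadata, its reply may still wrap the JSON in extra text. The sketch below shows one way to pull the first valid JSON object out of such a reply using only the standard library. The example reply and its field names are invented for illustration and do not reflect the playbook's actual output schema.

```python
import json


def extract_first_json_object(text):
    """Return the first JSON object embedded in `text`, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None


# Invented reply with hypothetical field names.
reply = 'Analysis complete. {"event": "tailgating", "risk_level": 3} End.'
metadata = extract_first_json_object(reply)
print(metadata)
```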

Step 7. Cleanup

Exit the container and optionally remove the Docker image:

## Exit container
exit

## Stop the container if it is still running, then remove the Docker image (optional)
docker stop <container_id>
docker rmi vlm-finetuning

Step 8. Next steps

Train on your own custom datasets for specialized use cases
Combine multiple VLM models for comprehensive multimodal analysis
Explore other VLM architectures and training techniques
Deploy fine-tuned models in production environments
