mirror of https://github.com/NVIDIA/dgx-spark-playbooks.git synced 2026-04-22 18:13:52 +00:00

VLM Fine-tuning Recipes

This repository contains fine-tuning recipes for Vision-Language Models (VLMs), covering both image and video understanding tasks.

🎯 Available Recipes

📸 Image VLM Fine-tuning (`ui_image/`)

Model: Qwen2.5-VL-7B-Instruct
Task: Wildfire detection from satellite imagery
Training Method: GRPO (Group Relative Policy Optimization) and LoRA (Low-Rank Adaptation)

🎥 Video VLM Fine-tuning (`ui_video/`)

Model: InternVL3-8B
Task: Dangerous driving detection and structured metadata generation
Training Method: Supervised Fine-tuning on Multimodal Instructions

🚀 Quick Start

1. Build the Docker Container

# Enter the directory that contains the Dockerfile
cd vlm-finetuning/assets

# Build the VLM fine-tuning container
docker build --build-arg HF_TOKEN="$HF_TOKEN" -t vlm_demo .

2. Launch the Container

# Run the container with GPU support
sh launch.sh

# Enter the mounted directory within the container
cd /vlm_finetuning
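
The build command above passes `HF_TOKEN` as a build argument, so an unset variable would reach the build as an empty string. A minimal pre-flight sketch, assuming a POSIX shell; the function name `require_hf_token` is hypothetical and not part of this repository:

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): fail early when
# HF_TOKEN is unset or empty, before starting the image build.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; export it before building" >&2
        return 1
    fi
    return 0
}

# Usage: build only when the token check passes.
# require_hf_token && docker build --build-arg HF_TOKEN="$HF_TOKEN" -t vlm_demo .
```

Chaining the check with `&&` keeps the build from starting at all when the token is missing.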

Note: The same Docker container and launch commands work for both the image and video VLM recipes. The container bundles the dependencies for both workflows, including FFmpeg, Decord, and optimized libraries.
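
One way to confirm that the expected command-line tools are available once inside the container is a small check like the sketch below. It assumes a POSIX shell; the helper name `report_missing` is hypothetical, and the tool names in the usage line are assumptions about what is on `PATH` in the image. Decord is a Python library, so verifying it would need a Python import and is not covered here.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): print the name of each
# command that is NOT found on PATH, and return non-zero if any is missing.
report_missing() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage inside the container (tool names are assumptions):
# report_missing ffmpeg nvidia-smi python3
```

The function prints nothing and returns 0 when every listed command is found.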

📚 Detailed Instructions

Each recipe includes comprehensive documentation:

Image VLM README: Complete guide for wildfire detection fine-tuning with Qwen2.5-VL, including dataset setup, GRPO training configuration, and interactive inference
Video VLM README: Full walkthrough for dangerous driving detection with InternVL3, covering video data preparation, training notebooks, and structured output generation